feat: New functions and operations for working with arrays #6384
Conversation
array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
    fun.volatility(),
),
BuiltinScalarFunction::ArrayAppend => Signature::any(2, fun.volatility()),
Are there ways to use List and ARRAY_DATATYPES?
🤔 Given that the element type of the list is part of its DataType, you probably can't use the existing Signatures. Perhaps you could add a new Signature::any_list or something that would only check that the datatype matched DataType::List 🤔
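To make the idea concrete, here is a rough, self-contained sketch of what an `any_list`-style check could look like. The types below are simplified mocks, not DataFusion's or arrow-rs's real `DataType`/`Signature`, and `matches_any_list` is a hypothetical name: the point is only that a list-aware signature needs to match the outer `List` constructor while ignoring the element type nested inside it.

```rust
// Simplified mock types (NOT the real DataFusion/arrow-rs API).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    List(Box<DataType>), // the element type is embedded in the list's DataType
}

/// Hypothetical `Signature::any_list`-style check: accept any argument whose
/// outer type is List, regardless of the element type it carries.
fn matches_any_list(args: &[DataType]) -> bool {
    args.iter().all(|dt| matches!(dt, DataType::List(_)))
}

fn main() {
    let int_list = DataType::List(Box::new(DataType::Int32));
    let str_list = DataType::List(Box::new(DataType::Utf8));
    // Both pass, even though their element types differ:
    assert!(matches_any_list(&[int_list, str_list]));
    // A bare scalar does not:
    assert!(!matches_any_list(&[DataType::Int32]));
}
```

This is why a fixed list of concrete `DataType`s (as in `SUPPORTED_ARRAY_TYPES`) cannot enumerate every list type: there is one `List(T)` per element type `T`.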
}

/// Array_append SQL function
pub fn array_append(args: &[ColumnarValue]) -> Result<ColumnarValue> {
Should each function accept &[ColumnarValue] or ArrayRef? Is there a difference between these approaches?
The difference is that if you take ColumnarValue we could specialize the kernels to do something faster with scalar (single) values rather than expanding them out to arrays (aka making copies). For the initial implementation I think converting them all to arrays is the best approach, as it is simplest.
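The trade-off described above can be sketched with mock types (this is not DataFusion's real `ColumnarValue`, just an illustration of the concept): the "simplest" strategy expands a scalar into a full array of copies, whereas keeping the `Scalar` variant around would let a kernel handle it without that copy.

```rust
// Minimal mock (NOT the real DataFusion ColumnarValue) illustrating the
// scalar-vs-array trade-off discussed above.
#[derive(Debug, Clone)]
enum ColumnarValue {
    Array(Vec<i64>),
    Scalar(i64),
}

impl ColumnarValue {
    /// The "simplest" strategy: expand scalars into an array of `num_rows`
    /// copies, so every kernel only has to handle the array case.
    fn into_array(self, num_rows: usize) -> Vec<i64> {
        match self {
            ColumnarValue::Array(a) => a,
            // This is the copy a scalar-specialized kernel could avoid:
            ColumnarValue::Scalar(s) => vec![s; num_rows],
        }
    }
}

fn main() {
    let arr = ColumnarValue::Array(vec![1, 2, 3]).into_array(3);
    let expanded = ColumnarValue::Scalar(7).into_array(3);
    assert_eq!(arr, vec![1, 2, 3]);
    assert_eq!(expanded, vec![7, 7, 7]);
}
```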
};

let element = match &args[1] {
    ColumnarValue::Scalar(scalar) => scalar.to_array().clone(),
Does ColumnarValue::Array also make sense in this situation?
-let res = match args[0].data_type() {
+let data_type = args[0].data_type();
+let res = match data_type {
 DataType::List(..) => {
I don't know how to implement FixedSizeList in all the functions, so I preferred to use List. I think it does not affect anything.
As FixedSizeList and List are different data types, if people have data that came from a Parquet file or something else that is a FixedSizeList, these functions likely won't work. However, perhaps eventually we can add coercion rules to coerce (automatically cast) FixedSizeList to List.
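The proposed coercion rule amounts to rewriting the type before argument checking. A rough sketch with mock types (not arrow-rs's real `DataType`; `coerce_fixed_size_list` is a hypothetical helper name): drop the fixed size and keep the element type.

```rust
// Simplified mock types (NOT arrow-rs's real DataType).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    UInt32,
    List(Box<DataType>),
    FixedSizeList(Box<DataType>, usize), // element type plus fixed length
}

/// Hypothetical coercion rule: FixedSizeList(T, n) -> List(T);
/// every other type is left unchanged.
fn coerce_fixed_size_list(dt: DataType) -> DataType {
    match dt {
        DataType::FixedSizeList(elem, _size) => DataType::List(elem),
        other => other,
    }
}

fn main() {
    let fixed = DataType::FixedSizeList(Box::new(DataType::UInt32), 4);
    assert_eq!(
        coerce_fixed_size_list(fixed),
        DataType::List(Box::new(DataType::UInt32))
    );
}
```

In the real system the coercion would also need a physical cast of the data, not just the type rewrite, which is why it is deferred to a follow-up.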
@@ -2785,73 +2807,6 @@ mod tests {
     Ok(())
 }

-fn generic_test_array(
It does not work (with FixedSizeList replaced by List). What could this be related to?
Error:
left: `List(Field { name: "item", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })`,
right: `List(Field { name: "item", data_type: UInt64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })`
I am not sure
@alamb I wonder if you have time to review this PR.
Thank you @izveigor -- I have put this on my review list, but I likely won't have a chance to review until tomorrow.
I didn't make it to this today, but I plan to review it tomorrow.
This PR looks really nice @izveigor -- thank you so much!
I haven't had a chance to review all the function implementations yet, but the overall structure looks great to me. I am hoping to get @tustvold or someone else who is more of an expert in the arrow-rs structures to offer an opinion on the structure of the kernels.
I'll try and complete my review soon.
## Array expressions Tests
#############

# array scalar function #1
These are great @izveigor -- thank you so much. The only thing I recommend is adding some additional tests that have null in the lists.
let data_type = args[0].data_type();
let res = match data_type {
    DataType::List(..) => {
        let arrays =
@tustvold can you offer some suggestions on using the arrow-rs API to build list arrays? Is this the best way to use that API?
It would perhaps be nicer to use a combination of https://docs.rs/arrow-array/latest/arrow_array/array/struct.GenericListArray.html#method.try_new and https://docs.rs/arrow-select/latest/arrow_select/concat/index.html -- MutableArrayData is not the nicest API to use.
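The reason concat-then-try_new fits this problem is the way variable-size list arrays are laid out: a flat child array of values plus an offsets buffer, where row i spans offsets[i]..offsets[i+1]. The sketch below mocks that representation with plain Vecs (it is not the arrow-rs API) to show why concatenating two list arrays reduces to appending the flat values and re-basing the second array's offsets.

```rust
// Mock of the variable-size list layout (NOT the arrow-rs ListArray).
struct ListArray {
    values: Vec<i64>,    // all child values, flattened
    offsets: Vec<usize>, // row i is values[offsets[i]..offsets[i + 1]]
}

/// Concatenate two list arrays: append the flat values, then shift the
/// second array's offsets by the length of the first array's values.
fn concat_lists(a: &ListArray, b: &ListArray) -> ListArray {
    let mut values = a.values.clone();
    values.extend_from_slice(&b.values);

    let base = *a.offsets.last().unwrap_or(&0);
    let mut offsets = a.offsets.clone();
    // Skip b's leading 0 so the shared boundary is not duplicated.
    offsets.extend(b.offsets.iter().skip(1).map(|o| o + base));

    ListArray { values, offsets }
}

fn main() {
    let a = ListArray { values: vec![1, 2, 3], offsets: vec![0, 2, 3] }; // [[1, 2], [3]]
    let b = ListArray { values: vec![4], offsets: vec![0, 1] };          // [[4]]
    let c = concat_lists(&a, &b); // [[1, 2], [3], [4]]
    assert_eq!(c.values, vec![1, 2, 3, 4]);
    assert_eq!(c.offsets, vec![0, 2, 3, 4]);
}
```

In arrow-rs terms, `concat` would produce the combined child values and `GenericListArray::try_new` would attach the recomputed offsets, with the added benefit that try_new validates the offsets for you.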
)));
}

let arr = match &args[0] {
I think you can use https://docs.rs/datafusion/latest/datafusion/physical_plan/enum.ColumnarValue.html#method.into_array here
let data_type = arrays[0].data_type();
match data_type {
    DataType::List(..) => {
        let list_arrays =
I think you could just call to_data() -- I'm not sure this needs to downcast to ListArray.
downcast_vec!(arrays, ListArray).collect::<Result<Vec<&ListArray>>>()?;
let len: usize = list_arrays.iter().map(|a| a.values().len()).sum();
let capacity = Capacities::Array(
    list_arrays.iter().map(|a| a.get_buffer_memory_size()).sum(),
Suggested change:
-list_arrays.iter().map(|a| a.get_buffer_memory_size()).sum(),
+list_arrays.iter().map(|a| a.len()).sum(),

The buffer memory size is a fairly significant overestimate.
}

/// Array_concat/Array_cat SQL function
pub fn array_concat(args: &[ColumnarValue]) -> Result<ColumnarValue> {
It would perhaps be nicer to use a combination of https://docs.rs/arrow-array/latest/arrow_array/array/struct.GenericListArray.html#method.try_new and https://docs.rs/arrow-select/latest/arrow_select/concat/index.html
Hello, @alamb! I analyzed all the comments and came to the conclusion that it would be better to implement all the other changes in subsequent PRs, provided the current changes do not contain critical errors. (Because it will be easier to analyze the changes and implement them.)
What do you think, @alamb?
I think this would be OK -- especially as you have a history of continued contribution. However, there are a few instances where engaged contributors committed at the start of promising features (such as the analysis framework from @isidentical) and then were not able to finish the work for whatever reason. While this is fine, I think it would be better for DataFusion to avoid it.

Thus I would like to suggest an alternate approach, which is to break this PR down into several smaller ones (perhaps one for each new function?). That way we can give each function the attention it deserves during review (and maybe even parallelize the work). We have a much better track record of reviewing and merging smaller PRs quickly than single large PRs. So when the functionality can be split up, I think that is the best plan. What do you think @izveigor?
In my opinion, it would be better to merge this PR. I have some arguments:
I agree this PR is complete (with tests) and is not missing anything major
Yes, I agree breaking the PR down will require more effort on the author's (your) part. However, I do think that if you have the time, the effort would improve the overall quality of the DataFusion codebase. Finding bandwidth to maintain the code is the primary thing I think we struggle with as a community.
I think we can merge this PR as long as the work you have planned is tracked by some tickets (so that if you don't have a chance to get to them, at least we will have some institutional knowledge). Is that acceptable?
I think this option will suit me.
This PR has a bug, related to #6596.
Which issue does this PR close?
Closes #6119
Closes #6075.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
Yes