feat: New functions and operations for working with arrays #6384
Conversation
array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
    fun.volatility(),
),
BuiltinScalarFunction::ArrayAppend => Signature::any(2, fun.volatility()),
Are there ways to use List and ARRAY_DATATYPES?
🤔 Given that the element type of the list is part of its DataType, you probably can't use the existing Signatures. Perhaps you could add a new Signature::any_list or something that would only check that the datatype matched DataType::List 🤔
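To make the idea concrete, here is a rough, self-contained sketch of what an `any_list`-style check could look like. The types below are simplified mocks, not DataFusion's or arrow-rs's real `DataType`/`Signature`, and `matches_any_list` is a hypothetical name: the point is only that a list-aware signature needs to match the outer `List` constructor while ignoring the element type nested inside it.

```rust
// Simplified mock types (NOT the real DataFusion/arrow-rs API).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    List(Box<DataType>), // the element type is embedded in the list's DataType
}

/// Hypothetical `Signature::any_list`-style check: accept any argument whose
/// outer type is List, regardless of the element type it carries.
fn matches_any_list(args: &[DataType]) -> bool {
    args.iter().all(|dt| matches!(dt, DataType::List(_)))
}

fn main() {
    let int_list = DataType::List(Box::new(DataType::Int32));
    let str_list = DataType::List(Box::new(DataType::Utf8));
    // Both pass, even though their element types differ:
    assert!(matches_any_list(&[int_list, str_list]));
    // A bare scalar does not:
    assert!(!matches_any_list(&[DataType::Int32]));
}
```

This is why a fixed list of concrete `DataType`s (as in `SUPPORTED_ARRAY_TYPES`) cannot enumerate every list type: there is one `List(T)` per element type `T`.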
}

/// Array_append SQL function
pub fn array_append(args: &[ColumnarValue]) -> Result<ColumnarValue> {
Should each function accept &[ColumnarValue] or ArrayRef? Is there a difference between these approaches?
The difference is that if you take ColumnarValue we could specialize the kernels to do something faster with scalar (single) values rather than expanding them out to arrays (aka making copies). For the initial implementation I think converting them all to arrays is the best approach, as it is simplest.
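The trade-off described above can be sketched with mock types (this is not DataFusion's real `ColumnarValue`, just an illustration of the concept): the "simplest" strategy expands a scalar into a full array of copies, whereas keeping the `Scalar` variant around would let a kernel handle it without that copy.

```rust
// Minimal mock (NOT the real DataFusion ColumnarValue) illustrating the
// scalar-vs-array trade-off discussed above.
#[derive(Debug, Clone)]
enum ColumnarValue {
    Array(Vec<i64>),
    Scalar(i64),
}

impl ColumnarValue {
    /// The "simplest" strategy: expand scalars into an array of `num_rows`
    /// copies, so every kernel only has to handle the array case.
    fn into_array(self, num_rows: usize) -> Vec<i64> {
        match self {
            ColumnarValue::Array(a) => a,
            // This is the copy a scalar-specialized kernel could avoid:
            ColumnarValue::Scalar(s) => vec![s; num_rows],
        }
    }
}

fn main() {
    let arr = ColumnarValue::Array(vec![1, 2, 3]).into_array(3);
    let expanded = ColumnarValue::Scalar(7).into_array(3);
    assert_eq!(arr, vec![1, 2, 3]);
    assert_eq!(expanded, vec![7, 7, 7]);
}
```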
};

let element = match &args[1] {
    ColumnarValue::Scalar(scalar) => scalar.to_array().clone(),
Does ColumnarValue::Array also make sense in this situation?
-let res = match args[0].data_type() {
+let data_type = args[0].data_type();
+let res = match data_type {
 DataType::List(..) => {
I don't know how to implement FixedSizeList in all the functions, so I preferred to use List. I think it does not affect anything.
As FixedSizeList and List are different data types, if people have data that came from a Parquet file or something else that is a FixedSizeList, these functions likely won't work. However, perhaps eventually we can add coercion rules to coerce (automatically cast) FixedSizeList to List.
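The proposed coercion rule amounts to rewriting the type before argument checking. A rough sketch with mock types (not arrow-rs's real `DataType`; `coerce_fixed_size_list` is a hypothetical helper name): drop the fixed size and keep the element type.

```rust
// Simplified mock types (NOT arrow-rs's real DataType).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    UInt32,
    List(Box<DataType>),
    FixedSizeList(Box<DataType>, usize), // element type plus fixed length
}

/// Hypothetical coercion rule: FixedSizeList(T, n) -> List(T);
/// every other type is left unchanged.
fn coerce_fixed_size_list(dt: DataType) -> DataType {
    match dt {
        DataType::FixedSizeList(elem, _size) => DataType::List(elem),
        other => other,
    }
}

fn main() {
    let fixed = DataType::FixedSizeList(Box::new(DataType::UInt32), 4);
    assert_eq!(
        coerce_fixed_size_list(fixed),
        DataType::List(Box::new(DataType::UInt32))
    );
}
```

In the real system the coercion would also need a physical cast of the data, not just the type rewrite, which is why it is deferred to a follow-up.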
@@ -2785,73 +2807,6 @@ mod tests {
     Ok(())
 }

-fn generic_test_array(
It does not work (with FixedSizeList replaced by List). What could this be related to?
Error:
left: `List(Field { name: "item", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })`,
right: `List(Field { name: "item", data_type: UInt64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })`
I am not sure
@alamb I wonder if you have time to review this PR.
Thank you @izveigor -- I have put this on my review list, but I likely won't have a chance to review until tomorrow.
I didn't make it to this today, but I plan to review it tomorrow.
This PR looks really nice @izveigor -- thank you so much!
I haven't had a chance to review all the function implementations yet, but the overall structure looks great to me. I am hoping to get @tustvold or someone else who is more of an expert in the arrow-rs structures to offer an opinion on the structure of the kernels.
I'll try and complete my review soon.
## Array expressions Tests
#############

# array scalar function #1
These are great @izveigor -- thank you so much. The only thing I recommend is adding some additional tests that have null in the lists.
let data_type = args[0].data_type();
let res = match data_type {
    DataType::List(..) => {
        let arrays =
@tustvold can you offer some suggestions on using the arrow-rs API to build list arrays? Is this the best way to use that API?
It would perhaps be nicer to use a combination of https://docs.rs/arrow-array/latest/arrow_array/array/struct.GenericListArray.html#method.try_new and https://docs.rs/arrow-select/latest/arrow_select/concat/index.html -- MutableArrayData is not the nicest API to use.
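The reason concat-then-try_new fits this problem is the way variable-size list arrays are laid out: a flat child array of values plus an offsets buffer, where row i spans offsets[i]..offsets[i+1]. The sketch below mocks that representation with plain Vecs (it is not the arrow-rs API) to show why concatenating two list arrays reduces to appending the flat values and re-basing the second array's offsets.

```rust
// Mock of the variable-size list layout (NOT the arrow-rs ListArray).
struct ListArray {
    values: Vec<i64>,    // all child values, flattened
    offsets: Vec<usize>, // row i is values[offsets[i]..offsets[i + 1]]
}

/// Concatenate two list arrays: append the flat values, then shift the
/// second array's offsets by the length of the first array's values.
fn concat_lists(a: &ListArray, b: &ListArray) -> ListArray {
    let mut values = a.values.clone();
    values.extend_from_slice(&b.values);

    let base = *a.offsets.last().unwrap_or(&0);
    let mut offsets = a.offsets.clone();
    // Skip b's leading 0 so the shared boundary is not duplicated.
    offsets.extend(b.offsets.iter().skip(1).map(|o| o + base));

    ListArray { values, offsets }
}

fn main() {
    let a = ListArray { values: vec![1, 2, 3], offsets: vec![0, 2, 3] }; // [[1, 2], [3]]
    let b = ListArray { values: vec![4], offsets: vec![0, 1] };          // [[4]]
    let c = concat_lists(&a, &b); // [[1, 2], [3], [4]]
    assert_eq!(c.values, vec![1, 2, 3, 4]);
    assert_eq!(c.offsets, vec![0, 2, 3, 4]);
}
```

In arrow-rs terms, `concat` would produce the combined child values and `GenericListArray::try_new` would attach the recomputed offsets, with the added benefit that try_new validates the offsets for you.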
)));
}

let arr = match &args[0] {
I think you can use https://docs.rs/datafusion/latest/datafusion/physical_plan/enum.ColumnarValue.html#method.into_array here
let data_type = arrays[0].data_type();
match data_type {
    DataType::List(..) => {
        let list_arrays =
I think you could just call to_data() -- I'm not sure this needs to downcast to ListArray.
downcast_vec!(arrays, ListArray).collect::<Result<Vec<&ListArray>>>()?;
let len: usize = list_arrays.iter().map(|a| a.values().len()).sum();
let capacity = Capacities::Array(
    list_arrays.iter().map(|a| a.get_buffer_memory_size()).sum(),
Suggested change:
-list_arrays.iter().map(|a| a.get_buffer_memory_size()).sum(),
+list_arrays.iter().map(|a| a.len()).sum(),

The buffer memory size is a fairly significant overestimate.
}

/// Array_concat/Array_cat SQL function
pub fn array_concat(args: &[ColumnarValue]) -> Result<ColumnarValue> {
It would perhaps be nicer to use a combination of https://docs.rs/arrow-array/latest/arrow_array/array/struct.GenericListArray.html#method.try_new and https://docs.rs/arrow-select/latest/arrow_select/concat/index.html
Hello, @alamb! I analyzed all the comments and came to the conclusion that it would be better to implement all the other changes in subsequent PRs, provided the current changes do not contain critical errors. (Because it will be easier to analyze the changes and implement them.)
What do you think, @alamb?
I think this would be OK -- especially as you have a history of continued contribution. However, there are a few instances where engaged contributors committed at the start of promising features (such as the analysis framework from @isidentical) and then were not able to finish the work for whatever reason. While this is fine, I think it would be better for DataFusion to avoid it.

Thus I would like to suggest an alternate approach, which is to break this PR down into several smaller ones (perhaps one for each new function?). That way we can give each function the attention it deserves during review (and maybe even parallelize the work). We have a much better track record of reviewing and merging smaller PRs quickly than single large PRs. So when the functionality can be split up, I think that is the best plan. What do you think @izveigor?
In my opinion, it would be better to merge this PR. I have some arguments:
I agree this PR is complete (with tests) and is not missing anything major
Yes, I agree breaking the PR down will require more effort on the author's (your) part. However, I do think that if you have the time, the effort would improve the overall quality of the DataFusion codebase. Finding bandwidth to maintain the code is the primary thing I think we struggle with as a community.
I think we can merge this PR as long as the work you have planned is tracked by some tickets (so that if you don't have a chance to get to them, at least we will have some institutional knowledge). Is that acceptable?
I think this option will suit me.
This PR has a bug, related to #6596.
Which issue does this PR close?
Closes #6119
Closes #6075.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
Yes