-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lots of code duplication in array_functions (array_replace_all
, array_replace_n
, array_replace
, etc)
#7988
Comments
@jayzhan211 here is another one for you to consider |
Since this issue is not closed maybe we can extend this to all others array_function, lazy to open issue for each one :)
sorted by the size of cargo bloat |
@jayzhan211 Thanks for your kind advice! It seems that the above functions have been all implemented. |
Thank you both 🙏 |
array_replace_all
, array_replace_n
, array_replace
)array_replace_all
, array_replace_n
, array_replace
, etc)
Yeah, all implemented but need to change to general type approach without downcast on each data type. See #8050 #8054 and #8071 |
I think there are still several examples of macros that could/should be functions:
In general, I think we should be aiming for the removal of macros in the array functions as the macros are unecessary, harder to maintain, and generate more code than is necessary |
Work on |
I can help with |
and |
working on |
TODO:
|
Its looking really nice @jayzhan211 -- thank you |
@jayzhan211 hi, I can help with array_array |
Great! |
hi @jayzhan211 I'm going to work on |
It is great to see how we are chipping away at this project. Very nice 👍 |
Work on define_array_slice |
`datafusion_common::scalar::ScalarValue::iter_to_array` is probably not the target function to cleanup ? Since it need to check the data types in Scalar and deal with them specifically. I think macros that generate specialized for for each scalar type could potentially be replaced This then makes a massive about of duplication: Specifically, rather than having specialized code for each possible list element type, perhaps we could have one code that accumulated the values (perhaps recursively calling iter_to_array) and one function that computes the offsets |
@jayzhan211 and @Weijun-H and @Veeupup your efforts appear to be paying off! The size of the datafusion-cli binary has shrunk by 2MB (from 63MB to 61MB) since 33.0.0 🎉 Size on main after merging https://github.com/apache/arrow-datafusion/pull/8414/files andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion/datafusion-cli$ ls -l `which datafusion-cli`
-rwxr-xr-x@ 1 andrewlamb staff 63M Dec 5 14:07 /Users/andrewlamb/.cargo/bin/datafusion-cli* Size of binary from andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion/datafusion-cli$ ls -l target/release/datafusion-cli
-rwxr-xr-x@ 1 andrewlamb staff 61M Dec 6 16:48 target/release/datafusion-cli* |
I think we have done all the tasks in this topic |
Describe the bug
There is a significant amount of code generated for array functions.
This both bloats binaries built with DataFusion as well as makes compile times slow.
To Reproduce
cd datafusion/datafusion-cli cargo bloat
Expected behavior
I would like the
array_replace_all
,array_replace_n
,array_replace
functions to be implemented in terms of arrow kernels (such aseq
, andtake
) and manipulations of offset buffers rather than directly creating new lists.For example, the large macro expansion here:
https://github.com/apache/arrow-datafusion/blob/bb1d7f9343532d5fa8df871ff42000fbe836d7d7/datafusion/physical-expr/src/array_expressions.rs#L1431-L1437
I believe generates a bunch of specialized code for each different list element data type 😢
Additional context
No response
The text was updated successfully, but these errors were encountered: