-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Improve pruning logic by replacing arrow::compute::cast with cast_column and add tests for struct statistics and error cases
#17637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
kosiew
wants to merge
34
commits into
apache:main
Choose a base branch
from
kosiew:prune-16579
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+604
−39
Open
Changes from all commits
Commits
Show all changes
34 commits
Select commit
Hold shift + click to select a range
9d40c16
fix: replace direct cast with cast_column utility for better type han…
kosiew 4377845
fix: enhance type casting for Utf8 data in statistics to ensure safe …
kosiew 5a73f7d
test: add unit test for building statistics with struct columns
kosiew 0fbaeae
test: add unit tests for handling struct statistics errors in build_s…
kosiew e6ea9e1
fix: update casting options to use safe semantics for binary statistics
kosiew fe25700
fix: use safe casting for binary statistics to prevent invalid UTF-8 …
kosiew 7cac892
feat: add support for nested struct column statistics in pruning logic
kosiew af03ada
fix: use safe casting options for binary statistics to handle invalid…
kosiew 835e362
fix: adjust indentation for clarity in statistics record batch castin…
kosiew 06cbcc8
fix: enhance UTF-8 validation for binary statistics in pruning logic
kosiew 7102a14
refactor: simplify struct column statistics handling and improve requ…
kosiew 4c3e992
fix: implement safe casting for invalid UTF-8 in statistics handling
kosiew ae87b25
fix: ensure invalid UTF-8 is converted to null in statistics record b…
kosiew 904e448
fix: enhance casting logic for statistics handling of binary and larg…
kosiew edb83b3
fix: implement safe casting for invalid UTF-8 bytes in statistics rec…
kosiew d1d1c0b
fix: replace assertions with assert_contains for better error message…
kosiew d9c0bef
fix: enhance statistics handling for binary and UTF-8 types, adding s…
kosiew 0d365d5
fix: replace assert with assert_contains for improved error message c…
kosiew d9def1f
fix: remove unused variable in build_statistics_record_batch function
kosiew f74b7b2
fix: remove unused imports in pruning_predicate.rs for cleaner code
kosiew f087733
fix: add TransformedResult to datafusion_common tree_node imports for…
kosiew 8a36e51
fix: remove TransformedResult import for cleaner code in pruning_pred…
kosiew f465e87
fix: add TransformedResult import in pruning_predicate.rs for improve…
kosiew d85c342
feat: add sanitize_binary_array_for_utf8 function to preprocess Binar…
kosiew 27f8004
fix: update condition to sanitize binary arrays for UTF-8 compatibility
kosiew 5be80d5
test: add unit test for building statistics from BinaryViewArray to U…
kosiew f3afa05
fix: enhance sanitize_binary_array_for_utf8 to check for invalid UTF-…
kosiew d0c6227
feat: enhance struct casting to support safe UTF-8 conversion from Bi…
kosiew 8e27fde
Merge branch 'main' into prune-16579
kosiew 7c81a51
refactor: reorganize imports in pruning_predicate.rs for clarity
kosiew bf499f5
refactor: simplify array initialization in tests for clarity
kosiew 1751bcf
refactor: improve documentation for sanitize_binary_array_for_utf8 to…
kosiew 5cb940b
refactor: enhance test documentation for statistics casting and handl…
kosiew e1b53d2
refactor: add comments to clarify manual UTF-8 validation in sanitize…
kosiew File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems redundant with calling https://docs.rs/arrow/latest/arrow/compute/fn.cast_with_options.html with
CastOptions::safe = true🤔There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not handle BinaryView
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that is a great find and I am sorry I did not understand that context.
This sounds like a bug in upstream arrow -- I think we should file a ticket in arrow-rs and then leave a reference to that ticket in this implementation (so we can remove the datafusion code when the corresponding code in arrow-rs is released_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will create the issue.