fix(parquet): exclude single-leaf struct roots from predicate cache#9983
Open
imhy wants to merge 1 commit into
Open
fix(parquet): exclude single-leaf struct roots from predicate cache#9983imhy wants to merge 1 commit into
imhy wants to merge 1 commit into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Root cause
ProjectionMask::without_nested_types(parquet/src/arrow/mod.rs:427) decides which leaves the predicate cache may cover. The check before this fix was:PR #8866 added
!root.is_list()to exclude lists, but a struct root with a single leaf still satisfies the condition and gets cached.Fix (1 line)
parquet/src/arrow/mod.rs:455:Are these changes tested?
Tests added on the branch
1. Reproducer integration test
File:
parquet/tests/arrow_reader/predicate_cache.rsName:
test_async_predicate_on_single_leaf_nullable_structBuilds an in-memory Parquet file with
OPTIONAL group b { REQUIRED BYTE_ARRAY aa (UTF8); }, writes two rows (parent NULL, parent non-NULL), then runs the sameIS NULLrow filter through the async reader twice: once with the default cache, once withwith_max_predicate_cache_size(0). It asserts thataddressNULL row matches);Pre-fix: panic at
struct_array.rs:142.Post-fix: passes (1 row in both cases).
2. Unit test
File:
parquet/src/arrow/mod.rs(test module)Name:
test_projection_mask_without_nested_single_leaf_structDirectly checks
ProjectionMask::without_nested_typesagainst a schema withOPTIONAL group address { REQUIRED BYTE_ARRAY street; } REQUIRED INT32 id, for three input masks (single nested leaf, mixed, all leaves). All three expected outputs reflect that the struct's leaf is now considered nested.Pre-fix: would return
Some([street_leaf])for the single-leaf-only mask.Post-fix: returns
Nonefor the single-leaf-only mask; returnsSome([id])for mixed.Verification matrix
test_projection_mask_without_nested_single_leaf_struct(new unit)test_async_predicate_on_single_leaf_nullable_struct(new integration)predicate_cache::test_default_readpredicate_cache::test_async_cache_with_filterspredicate_cache::test_sync_cache_with_filterspredicate_cache::test_cache_disabled_with_filterspredicate_cache::test_cache_projection_excludes_nested_columnstest_projection_mask_without_nested_*(5 existing)Are there any user-facing changes?