fix(parquet): exclude single-leaf struct roots from predicate cache #9983

Open
imhy wants to merge 1 commit into apache:main from imhy:fix-predicate-cache-single-leaf-struct

imhy commented May 16, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Root cause

ProjectionMask::without_nested_types (parquet/src/arrow/mod.rs:427) decides which leaves the predicate cache may cover. The check before this fix was:

if root_leaf_counts[root_idx] == 1 && !root.is_list() {
    included_leaves.push(leaf_idx);
}

PR #8866 added !root.is_list() to exclude lists, but a struct root with a single leaf still satisfies the condition and gets cached.

Fix (1 line)

parquet/src/arrow/mod.rs:455:

-                if root_leaf_counts[root_idx] == 1 && !root.is_list() {
+                if root_leaf_counts[root_idx] == 1 && root.is_primitive() {
                     included_leaves.push(leaf_idx);
                 }
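
To make the difference between the two conditions concrete, here is a minimal, self-contained sketch. It is a toy model only: `RootKind`, `cacheable_before`, and `cacheable_after` are hypothetical names invented for illustration and are not part of the parquet crate. It assumes each root field can be classified as a primitive, a struct, or a list, and compares the old and new conditions for a root with exactly one leaf.

```rust
// Toy model only: `RootKind` and both functions are hypothetical stand-ins,
// not types or functions from the parquet crate.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RootKind {
    Primitive,
    Struct,
    List,
}

/// Pre-fix condition: a single-leaf root that is not a list.
fn cacheable_before(leaf_count: usize, kind: RootKind) -> bool {
    leaf_count == 1 && kind != RootKind::List
}

/// Post-fix condition: a single-leaf root that is itself a primitive column.
fn cacheable_after(leaf_count: usize, kind: RootKind) -> bool {
    leaf_count == 1 && kind == RootKind::Primitive
}

fn main() {
    // Every root here is assumed to have exactly one leaf.
    for kind in [RootKind::Primitive, RootKind::Struct, RootKind::List] {
        println!(
            "{:?}: before={} after={}",
            kind,
            cacheable_before(1, kind),
            cacheable_after(1, kind)
        );
    }
}
```

In this model only the `Struct` case changes: it is accepted by the old condition and rejected by the new one, while primitives stay cacheable and lists stay excluded.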

Are these changes tested?

Tests added on the branch

1. Reproducer integration test

File: parquet/tests/arrow_reader/predicate_cache.rs
Name: test_async_predicate_on_single_leaf_nullable_struct

Builds an in-memory Parquet file with OPTIONAL group b { REQUIRED BYTE_ARRAY aa (UTF8); }, writes two rows (parent NULL, parent non-NULL), then runs the same IS NULL row filter through the async reader twice: once with the default cache, once with with_max_predicate_cache_size(0). It asserts that

  • the uncached control yields exactly 1 row (the row whose parent struct is NULL matches);
  • the cached run yields the same row count as the uncached one.

Pre-fix: panic at struct_array.rs:142.
Post-fix: passes (1 row in both cases).

2. Unit test

File: parquet/src/arrow/mod.rs (test module)
Name: test_projection_mask_without_nested_single_leaf_struct

Directly checks ProjectionMask::without_nested_types against a schema with OPTIONAL group address { REQUIRED BYTE_ARRAY street; } REQUIRED INT32 id, for three input masks (single nested leaf, mixed, all leaves). All three expected outputs reflect that the struct's leaf is now considered nested.

Pre-fix: would return Some([street_leaf]) for the single-leaf-only mask.
Post-fix: returns None for the single-leaf-only mask; returns Some([id]) for mixed.
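
The expected post-fix outputs can be illustrated with a small stand-alone sketch. This is not the crate's implementation: `Root`, `without_nested`, and the slice-based schema and mask representations are hypothetical simplifications. It assumes a mask is a list of selected leaf indices and that the function returns `None` when no selected leaf survives, as described above.

```rust
// Toy model only: `Root` and `without_nested` are hypothetical stand-ins
// for illustration, not the parquet crate's API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Root {
    Primitive,
    Group,
}

/// `leaf_roots[i]` is the index of the root field that owns leaf `i`.
/// `mask` lists the selected leaf indices. Returns the selected leaves whose
/// root is a primitive column, or `None` when nothing remains.
fn without_nested(roots: &[Root], leaf_roots: &[usize], mask: &[usize]) -> Option<Vec<usize>> {
    let kept: Vec<usize> = mask
        .iter()
        .copied()
        .filter(|&leaf| roots[leaf_roots[leaf]] == Root::Primitive)
        .collect();
    if kept.is_empty() {
        None
    } else {
        Some(kept)
    }
}

fn main() {
    // OPTIONAL group address { REQUIRED BYTE_ARRAY street; } REQUIRED INT32 id
    let roots = [Root::Group, Root::Primitive];
    let leaf_roots = [0, 1]; // leaf 0 = address.street, leaf 1 = id

    // Single nested leaf: nothing is left to cache.
    println!("{:?}", without_nested(&roots, &leaf_roots, &[0])); // None
    // Mixed mask: only the primitive `id` leaf remains.
    println!("{:?}", without_nested(&roots, &leaf_roots, &[0, 1])); // Some([1])
}
```

In this two-leaf toy schema the mixed mask and the all-leaves mask select the same leaves, so both reduce to the `id` leaf alone.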

Verification matrix

| Test | Pre-fix | Post-fix |
|---|---|---|
| test_projection_mask_without_nested_single_leaf_struct (new unit) | would FAIL | PASS |
| test_async_predicate_on_single_leaf_nullable_struct (new integration) | PANIC | PASS |
| predicate_cache::test_default_read | PASS | PASS |
| predicate_cache::test_async_cache_with_filters | PASS | PASS |
| predicate_cache::test_sync_cache_with_filters | PASS | PASS |
| predicate_cache::test_cache_disabled_with_filters | PASS | PASS |
| predicate_cache::test_cache_projection_excludes_nested_columns | PASS | PASS |
| test_projection_mask_without_nested_* (5 existing) | PASS | PASS |

Are there any user-facing changes?

github-actions bot added the "parquet" label (Changes to the parquet crate) on May 16, 2026
Development

Successfully merging this pull request may close these issues.

parquet predicate-cache: panic / silent row drop on single-leaf nullable struct