refactor(parquet-datasource): split bloom_filter out of row_group_filter.rs#22348
Open
adriangb wants to merge 4 commits into
Open
refactor(parquet-datasource): split bloom_filter out of row_group_filter.rs#22348adriangb wants to merge 4 commits into
adriangb wants to merge 4 commits into
Conversation
…ter.rs Pure code motion, no behavior change and no public API change. Extracts `BloomFilterStatistics` — the loaded Split Block Bloom Filter (SBBF) data plus its `PruningStatistics` adapter — from the ~1,900 LOC `row_group_filter.rs` into a new `bloom_filter.rs`. This separates "data we loaded from the file" (`BloomFilterStatistics`) from "the access-plan filter that consumes it" (`RowGroupAccessPlanFilter`), leaving `row_group_filter.rs` focused on the latter. `BloomFilterStatistics` is crate-internal; `row_group_filter` re-exports it (`pub(crate) use`) so the existing `crate::row_group_filter::BloomFilterStatistics` path keeps resolving for in-crate callers and this PR touches no other file. Split out of apache#22156. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Author
|
@xudong963 wonder if you could review this refactor? |
xudong963
approved these changes
May 19, 2026
| /// Value: | ||
| /// * [`Sbbf`] (Bloom filter), | ||
| /// * Parquet physical [`Type`] needed to evaluate literals against the filter | ||
| pub(crate) column_sbbf: HashMap<String, (Sbbf, Type)>, |
Member
There was a problem hiding this comment.
Do we need the pub(crate)? moving bloom-filter-specific tests into bloom_filter.rs should avoid this
Contributor
Author
There was a problem hiding this comment.
Good call — dropped it. Done across three follow-up commits:
13379592a3buildsBloomFilterStatisticsvia itswith_capacity/insertconstructors in the test (the patternopener.rsalready uses) instead of a struct literal, socolumn_sbbfstays a private field.5089cf49e6extractsExpectedPruning— the one test helper shared with the row-group tests — into a new#[cfg(test)] mod test_util.f232e73811moves the bloom-filter tests (test_row_group_bloom_*,BloomFilterTest, and the helper) intobloom_filter.rs, next to the code they exercise.
The only visibility addition is a #[cfg(test)] pub(crate) fn access_plan() accessor on RowGroupAccessPlanFilter, so the shared ExpectedPruning assertion can run from a sibling module without exposing the field itself.
…ctors in tests The row-group bloom-filter test built `BloomFilterStatistics` with a struct literal, which required widening the `column_sbbf` field to `pub(crate)` when the struct moved to its own module. Use the existing `with_capacity` + `insert` constructors instead -- the same pattern the production code in `opener.rs` already uses -- so the field stays private. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…est_util module `ExpectedPruning` is the only test helper shared between the row-group statistics tests and the bloom-filter tests. Move it into a new `#[cfg(test)] mod test_util` so the bloom-filter tests can be relocated into `bloom_filter.rs` without it. Its `assert` method previously reached into the private `access_plan` field of `RowGroupAccessPlanFilter`; add a test-only `pub(crate)` `access_plan()` accessor so the helper works from a sibling module without widening the field's visibility. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er.rs Relocate the bloom-filter pruning tests -- the `test_row_group_bloom_*` cases, the `BloomFilterTest` builder, and the `test_row_group_bloom_filter_pruning_predicate` helper -- from `row_group_filter.rs` into `bloom_filter.rs`, next to the `BloomFilterStatistics` code they exercise. The shared `ExpectedPruning` helper is consumed from `crate::test_util`. Pure test code motion; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Author
|
@xudong963 I moved the tests but that made it a bit bigger of a refactor. Please let me know if it still looks good to you and I'll go ahead and merge. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Relates to the discussion in #22024 about the Parquet datasource crate becoming hard to navigate. Split out of #22156, which bundled several code-motion moves into one PR — this is one of three smaller, independently-reviewable PRs that replace it.
Rationale for this change
row_group_filter.rshad grown to ~1,900 LOC. It mixes "data we loaded from the file" with "the access-plan filter that consumes it." This PR is code motion only: no behavior change and no public API change.What changes are included in this PR?
Extracts
BloomFilterStatistics— the loaded Split Block Bloom Filter (SBBF) data plus itsPruningStatisticsadapter — fromrow_group_filter.rsinto a newbloom_filter.rs, and moves the bloom-filter tests alongside it. This separatesBloomFilterStatistics(data loaded from the file) fromRowGroupAccessPlanFilter(the access-plan filter that consumes it), leavingrow_group_filter.rsfocused on the latter.Each commit builds green on its own:
bloom_filterout ofrow_group_filter.rs— moveBloomFilterStatisticsand itsPruningStatisticsadapter into a newbloom_filter.rs.BloomFilterStatisticsvia its constructors in tests — the test built it with a struct literal; use the existingwith_capacity/insertconstructors (the patternopener.rsalready uses) so thecolumn_sbbffield stays private.ExpectedPruninginto a sharedtest_utilmodule — the one test helper shared between the row-group and bloom-filter tests, so the bloom-filter tests can move out. Adds a#[cfg(test)] pub(crate) fn access_plan()accessor onRowGroupAccessPlanFilterso the helper can assert from a sibling module without widening field visibility.bloom_filter.rs— relocatetest_row_group_bloom_*, theBloomFilterTestbuilder, and its helper next to the code they exercise.BloomFilterStatisticsis crate-internal;row_group_filterre-exports it (pub(crate) use) so the existingcrate::row_group_filter::BloomFilterStatisticspath keeps resolving for in-crate callers. Aside fromrow_group_filter.rsandbloom_filter.rs, this adds a newsrc/test_util.rsand a one-line module declaration inmod.rs.Are these changes tested?
Yes, covered by existing tests.
cargo test -p datafusion-datasource-parquet --all-features(122 passing) andcargo clippy -p datafusion-datasource-parquet --all-targets --all-features -- -D warningsboth pass.Are there any user-facing changes?
No.
BloomFilterStatisticsis crate-internal; this only reorganizes files inside the crate.🤖 Generated with Claude Code