refactor(parquet-datasource): split bloom_filter out of row_group_filter.rs #22348

Open
adriangb wants to merge 4 commits into apache:main from adriangb:refactor/split-parquet-row-group-filter


@adriangb adriangb commented May 18, 2026

Which issue does this PR close?

Relates to the discussion in #22024 about the Parquet datasource crate becoming hard to navigate. Split out of #22156, which bundled several code-motion moves into one PR; this is one of three smaller, independently reviewable PRs that replace it.

Rationale for this change

row_group_filter.rs had grown to ~1,900 LOC. It mixes "data we loaded from the file" with "the access-plan filter that consumes it." This PR is code motion only: no behavior change and no public API change.

What changes are included in this PR?

Extracts BloomFilterStatistics — the loaded Split Block Bloom Filter (SBBF) data plus its PruningStatistics adapter — from row_group_filter.rs into a new bloom_filter.rs, and moves the bloom-filter tests alongside it. This separates BloomFilterStatistics (data loaded from the file) from RowGroupAccessPlanFilter (the access-plan filter that consumes it), leaving row_group_filter.rs focused on the latter.

Each commit builds green on its own:

  1. Split bloom_filter out of row_group_filter.rs — move BloomFilterStatistics and its PruningStatistics adapter into a new bloom_filter.rs.
  2. Build BloomFilterStatistics via its constructors in tests — the test built it with a struct literal; use the existing with_capacity/insert constructors (the pattern opener.rs already uses) so the column_sbbf field stays private.
  3. Extract ExpectedPruning into a shared test_util module — the one test helper shared between the row-group and bloom-filter tests, so the bloom-filter tests can move out. Adds a #[cfg(test)] pub(crate) fn access_plan() accessor on RowGroupAccessPlanFilter so the helper can assert from a sibling module without widening field visibility.
  4. Move the bloom-filter tests into bloom_filter.rs — relocate test_row_group_bloom_*, the BloomFilterTest builder, and its helper next to the code they exercise.
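
Step 2 above relies on a general Rust property: a struct literal needs every field to be visible at the point of construction, while a constructor does not. The following is a minimal sketch of that idea, not the crate's code. The `u64` payload stands in for the real `(Sbbf, Type)` pair, the `with_capacity` and `insert` signatures are assumptions, and `contains_column` is a hypothetical helper added only so the example can observe its own state.

```rust
mod bloom_filter {
    use std::collections::HashMap;

    /// Simplified stand-in for the crate's `BloomFilterStatistics`.
    /// The `u64` value is a placeholder for the real `(Sbbf, Type)` pair.
    pub(crate) struct BloomFilterStatistics {
        // Private: code outside this module cannot name the field,
        // so it cannot build the struct with a literal.
        column_sbbf: HashMap<String, u64>,
    }

    impl BloomFilterStatistics {
        /// Assumed signature: an empty set sized for `capacity` columns.
        pub(crate) fn with_capacity(capacity: usize) -> Self {
            Self {
                column_sbbf: HashMap::with_capacity(capacity),
            }
        }

        /// Assumed signature: registers the filter loaded for `column`.
        pub(crate) fn insert(&mut self, column: impl Into<String>, filter: u64) {
            self.column_sbbf.insert(column.into(), filter);
        }

        /// Hypothetical helper so the example can inspect the contents.
        pub(crate) fn contains_column(&self, column: &str) -> bool {
            self.column_sbbf.contains_key(column)
        }
    }
}

fn main() {
    use bloom_filter::BloomFilterStatistics;

    // Would not compile here, because `column_sbbf` is private:
    // let stats = BloomFilterStatistics { column_sbbf: Default::default() };

    let mut stats = BloomFilterStatistics::with_capacity(1);
    stats.insert("String", 0xFEED);
    assert!(stats.contains_column("String"));
    assert!(!stats.contains_column("missing"));
}
```

Because tests in another module go through the same constructors as production code, the field's visibility never has to widen to accommodate them.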

BloomFilterStatistics is crate-internal; row_group_filter re-exports it (pub(crate) use) so the existing crate::row_group_filter::BloomFilterStatistics path keeps resolving for in-crate callers. Aside from row_group_filter.rs and bloom_filter.rs, this adds a new src/test_util.rs and a one-line module declaration in mod.rs.
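
The re-export described here can be sketched in a single file, assuming nothing about the real module contents beyond the names in this description. The unit struct and the `old_path_caller` function are hypothetical placeholders; only the `pub(crate) use` line mirrors what the PR says it does.

```rust
mod bloom_filter {
    /// New home of the type (placeholder body for illustration).
    #[derive(Debug, Default)]
    pub(crate) struct BloomFilterStatistics;
}

mod row_group_filter {
    // Crate-visible re-export: the type now lives in `bloom_filter`,
    // but the old path through this module keeps resolving.
    pub(crate) use crate::bloom_filter::BloomFilterStatistics;
}

/// Hypothetical in-crate caller that was written before the move
/// and still names the type through the old module path.
fn old_path_caller() -> crate::row_group_filter::BloomFilterStatistics {
    crate::row_group_filter::BloomFilterStatistics::default()
}

fn main() {
    // Both paths name the same type, so this assignment type-checks.
    let stats: bloom_filter::BloomFilterStatistics = old_path_caller();
    println!("{stats:?}");
}
```

Since the re-export is `pub(crate)`, it changes nothing outside the crate, which is consistent with the claim that there is no public API change.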

Are these changes tested?

Yes, covered by existing tests. `cargo test -p datafusion-datasource-parquet --all-features` (122 passing) and `cargo clippy -p datafusion-datasource-parquet --all-targets --all-features -- -D warnings` both pass.

Are there any user-facing changes?

No. BloomFilterStatistics is crate-internal; this only reorganizes files inside the crate.

🤖 Generated with Claude Code

…ter.rs

Pure code motion, no behavior change and no public API change.
Extracts `BloomFilterStatistics` — the loaded Split Block Bloom Filter
(SBBF) data plus its `PruningStatistics` adapter — from the ~1,900 LOC
`row_group_filter.rs` into a new `bloom_filter.rs`.

This separates "data we loaded from the file" (`BloomFilterStatistics`)
from "the access-plan filter that consumes it"
(`RowGroupAccessPlanFilter`), leaving `row_group_filter.rs` focused on
the latter.

`BloomFilterStatistics` is crate-internal; `row_group_filter`
re-exports it (`pub(crate) use`) so the existing
`crate::row_group_filter::BloomFilterStatistics` path keeps resolving
for in-crate callers and this PR touches no other file.

Split out of apache#22156.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb

@xudong963 wonder if you could review this refactor?

/// Value:
/// * [`Sbbf`] (Bloom filter),
/// * Parquet physical [`Type`] needed to evaluate literals against the filter
pub(crate) column_sbbf: HashMap<String, (Sbbf, Type)>,
@xudong963 (Member) commented:

Do we need the pub(crate)? Moving the bloom-filter-specific tests into bloom_filter.rs should avoid this.

@adriangb (Contributor, Author) replied:

Good call — dropped it. Done across three follow-up commits:

  • 13379592a3 builds BloomFilterStatistics via its with_capacity/insert constructors in the test (the pattern opener.rs already uses) instead of a struct literal, so column_sbbf stays a private field.
  • 5089cf49e6 extracts ExpectedPruning — the one test helper shared with the row-group tests — into a new #[cfg(test)] mod test_util.
  • f232e73811 moves the bloom-filter tests (test_row_group_bloom_*, BloomFilterTest, and the helper) into bloom_filter.rs, next to the code they exercise.

The only visibility addition is a #[cfg(test)] pub(crate) fn access_plan() accessor on RowGroupAccessPlanFilter, so the shared ExpectedPruning assertion can run from a sibling module without exposing the field itself.
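
As a rough illustration of the test-only accessor pattern, here is a sketch under stated assumptions: the `Vec<bool>` plan, the `new`, `skip`, and `remaining` methods, and the tuple-struct shape of `ExpectedPruning` are all simplified stand-ins, not the crate's definitions. Only the `#[cfg(test)] pub(crate) fn access_plan()` idea comes from the discussion above.

```rust
mod row_group_filter {
    /// Simplified stand-in: one flag per row group, `true` means "still scan it".
    pub(crate) struct RowGroupAccessPlanFilter {
        access_plan: Vec<bool>, // stays private
    }

    impl RowGroupAccessPlanFilter {
        pub(crate) fn new(num_row_groups: usize) -> Self {
            Self {
                access_plan: vec![true; num_row_groups],
            }
        }

        /// Mark one row group as pruned.
        pub(crate) fn skip(&mut self, idx: usize) {
            self.access_plan[idx] = false;
        }

        /// Number of row groups still scheduled to be scanned.
        pub(crate) fn remaining(&self) -> usize {
            self.access_plan.iter().filter(|keep| **keep).count()
        }

        /// Test-only read access; compiled out of non-test builds.
        #[cfg(test)]
        pub(crate) fn access_plan(&self) -> &[bool] {
            &self.access_plan
        }
    }
}

#[cfg(test)]
mod test_util {
    use crate::row_group_filter::RowGroupAccessPlanFilter;

    /// Shared assertion helper living in a sibling module (shape assumed).
    pub(crate) struct ExpectedPruning(pub(crate) Vec<bool>);

    impl ExpectedPruning {
        pub(crate) fn assert(&self, filter: &RowGroupAccessPlanFilter) {
            // Reads the plan through the accessor, never the field itself.
            assert_eq!(filter.access_plan(), self.0.as_slice());
        }
    }
}

#[cfg(test)]
mod tests {
    use crate::row_group_filter::RowGroupAccessPlanFilter;
    use crate::test_util::ExpectedPruning;

    #[test]
    fn prunes_second_row_group() {
        let mut filter = RowGroupAccessPlanFilter::new(3);
        filter.skip(1);
        ExpectedPruning(vec![true, false, true]).assert(&filter);
    }
}

fn main() {
    let mut filter = row_group_filter::RowGroupAccessPlanFilter::new(3);
    filter.skip(1);
    println!("row groups left to scan: {}", filter.remaining());
}
```

In a normal build the accessor does not exist at all, so production code cannot come to depend on it; it is only available under `cargo test`.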

adriangb and others added 3 commits May 18, 2026 21:08
…ctors in tests

The row-group bloom-filter test built `BloomFilterStatistics` with a
struct literal, which required widening the `column_sbbf` field to
`pub(crate)` when the struct moved to its own module. Use the existing
`with_capacity` + `insert` constructors instead -- the same pattern the
production code in `opener.rs` already uses -- so the field stays private.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…est_util module

`ExpectedPruning` is the only test helper shared between the row-group
statistics tests and the bloom-filter tests. Move it into a new
`#[cfg(test)] mod test_util` so the bloom-filter tests can be relocated
into `bloom_filter.rs` without it.

Its `assert` method previously reached into the private `access_plan`
field of `RowGroupAccessPlanFilter`; add a test-only `pub(crate)`
`access_plan()` accessor so the helper works from a sibling module
without widening the field's visibility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er.rs

Relocate the bloom-filter pruning tests -- the `test_row_group_bloom_*`
cases, the `BloomFilterTest` builder, and the
`test_row_group_bloom_filter_pruning_predicate` helper -- from
`row_group_filter.rs` into `bloom_filter.rs`, next to the
`BloomFilterStatistics` code they exercise. The shared `ExpectedPruning`
helper is consumed from `crate::test_util`.

Pure test code motion; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb

@xudong963 I moved the tests but that made it a bit bigger of a refactor. Please let me know if it still looks good to you and I'll go ahead and merge.


@xudong963 xudong963 left a comment

Good to me

@adriangb adriangb enabled auto-merge May 19, 2026 11:29
@adriangb adriangb added this pull request to the merge queue May 19, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks May 19, 2026
Labels: datasource (Changes to the datasource crate)

2 participants