Skip to content

refactor(parquet-datasource): mechanical cleanup of opener / file_format / row_group_filter#22156

Draft
adriangb wants to merge 1 commit into
apache:mainfrom
adriangb:refactor/parquet-opener-modules-and-optimizers
Draft

refactor(parquet-datasource): mechanical cleanup of opener / file_format / row_group_filter#22156
adriangb wants to merge 1 commit into
apache:mainfrom
adriangb:refactor/parquet-opener-modules-and-optimizers

Conversation

@adriangb
Copy link
Copy Markdown
Contributor

@adriangb adriangb commented May 13, 2026

Which issue does this PR close?

Relates to the discussion in #22024 about the Parquet datasource crate becoming hard to navigate as more features land. Does not close that issue — this is a small, mechanical step that makes future hook/extension work easier.

Rationale for this change

Several files in the parquet datasource crate have grown to bundle multiple responsibilities into one place — making it hard to read and review changes in isolation. This PR is pure code motion: no public API changes (existing import paths preserved via re-exports), no behavior changes.

The hook/extension trait that addresses #22024's broader concern ("the opener is becoming a mini planner") is being prepared as a separate PR so reviewers can evaluate each change on its own merits.

What changes are included in this PR?

Five separate moves grouped into one PR because they're all the same flavor of work:

1. opener.rs (~2,700 LOC) → opener/ module

  • New opener/early_stop.rsEarlyStoppingStream (~100 LOC), the dynamic-filter early-termination wrapper used at the end of build_stream.
  • New opener/encryption.rsEncryptionContext + the ParquetMorselizer::get_encryption_context helpers. Isolates the #[cfg(feature = \"parquet_encryption\")] gating that previously bled through the main file.
  • New opener/push_decoder_stream.rsPushDecoderStreamState, the inner stream state machine that drives a ParquetPushDecoder. Distinct from the outer (open-file) state machine they previously shared a file with.

2. file_format.rs (~2,000 LOC) split

  • New sink.rsParquetSink + the parallel-write machinery (column_serializer_task, spawn_column_parallel_row_group_writer, output_single_parquet_file_parallelized, concatenate_parallel_row_groups, etc.). The historical file_format::ParquetSink path is preserved via pub use.
  • New schema_coercion.rs — Arrow-schema coercion utilities (apply_file_schema_type_coercions, coerce_int96_to_resolution, coerce_file_schema_to_view_type, coerce_file_schema_to_string_type, transform_schema_to_view, transform_binary_to_string, field_with_new_type) and their tests. Re-exported at the crate root for backward compat.

3. row_group_filter.rs (~1,900 LOC)

  • New bloom_filter.rsBloomFilterStatistics (the loaded SBBF data + its PruningStatistics adapter). Separates "data we loaded from the file" from "the access-plan filter that consumes it."

Net effect

File Before After Δ
opener.rs (now opener/mod.rs) 2,717 2,433 -10%
file_format.rs 2,038 633 -69%
row_group_filter.rs 1,929 1,756 -9%

Are these changes tested?

Yes:

  • All 99 existing datafusion-datasource-parquet unit tests pass (the two coerce_int96_to_resolution_* tests moved with the function to schema_coercion.rs).
  • cargo fmt --all, ./dev/rust_lint.sh, cargo clippy -p datafusion-datasource-parquet --all-targets --all-features -- -D warnings all pass.
  • Downstream datafusion core builds clean.
  • datafusion-proto (which imports ParquetSink via the historical file_format::ParquetSink path) builds clean thanks to the pub use re-export.

Are there any user-facing changes?

No. Public API is unchanged — every previously-public item is still reachable at the same crate-root path. The only difference is the file organization inside the crate.

@github-actions github-actions Bot added the datasource Changes to the datasource crate label May 13, 2026
@alamb
Copy link
Copy Markdown
Contributor

alamb commented May 13, 2026

We could potentially split this into module moves too

@adriangb adriangb force-pushed the refactor/parquet-opener-modules-and-optimizers branch from b044e05 to 3efc367 Compare May 14, 2026 00:29
@adriangb adriangb changed the title refactor(parquet): split opener.rs into module + add ParquetAccessPlanOptimizer trait refactor(parquet): split opener.rs into a module May 14, 2026
@adriangb adriangb force-pushed the refactor/parquet-opener-modules-and-optimizers branch from 3efc367 to 17679e3 Compare May 14, 2026 02:51
@adriangb adriangb changed the title refactor(parquet): split opener.rs into a module refactor(parquet-datasource): mechanical cleanup of opener / file_format / row_group_filter May 14, 2026
@github-actions github-actions Bot added the auto detected api change Auto detected API change label May 14, 2026
…ile_format.rs, row_group_filter.rs

A series of pure code-motion splits to make the parquet datasource crate
easier to navigate and modify. No public API changes (existing import
paths preserved via re-exports), no behavior changes.

Five separate moves:

1. `opener.rs` (~2,700 LOC) → `opener/` module:
   - `opener/early_stop.rs` — `EarlyStoppingStream`, the dynamic-filter
     early-termination wrapper used at the end of `build_stream`.
   - `opener/encryption.rs` — `EncryptionContext` + the
     `ParquetMorselizer::get_encryption_context` helpers. Isolates the
     `#[cfg(feature = "parquet_encryption")]` gating.
   - `opener/push_decoder_stream.rs` — `PushDecoderStreamState`, the
     inner stream state machine that drives a `ParquetPushDecoder`.

2. `file_format.rs` (~2,000 LOC) split:
   - `sink.rs` — `ParquetSink` + the parallel-write machinery
     (`column_serializer_task`, `spawn_column_parallel_row_group_writer`,
     `output_single_parquet_file_parallelized`, etc.). The historical
     `file_format::ParquetSink` path is preserved via `pub use`.
   - `schema_coercion.rs` — Arrow-schema coercion utilities
     (`apply_file_schema_type_coercions`, `coerce_int96_to_resolution`,
     `coerce_file_schema_to_view_type`, `coerce_file_schema_to_string_type`,
     `transform_schema_to_view`, `transform_binary_to_string`,
     `field_with_new_type`) and their tests. Re-exported at the crate
     root for backward compat.

3. `row_group_filter.rs` (~1,900 LOC):
   - `bloom_filter.rs` — `BloomFilterStatistics` (the loaded SBBF data
     + its `PruningStatistics` adapter). Separates "data we loaded
     from the file" from "the access-plan filter that consumes it".

After this PR:
- `opener.rs`: 2,717 → 2,433 LOC (-10%; further drop possible by
  extracting the test module, but tests stay near the code under
  test for now).
- `file_format.rs`: 2,038 → 633 LOC (-69%).
- `row_group_filter.rs`: 1,929 → 1,756 LOC (-9%; bloom code moved
  out, the file is now focused entirely on `RowGroupAccessPlanFilter`).
@adriangb adriangb force-pushed the refactor/parquet-opener-modules-and-optimizers branch from 17679e3 to 2706027 Compare May 14, 2026 03:32
@github-actions github-actions Bot removed the auto detected api change Auto detected API change label May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants