
refactor(parquet-datasource): extract DecoderProjection from build_stream #22398

Merged: adriangb merged 3 commits into apache:main from adriangb:parquet-decoder-projection on May 21, 2026

@adriangb adriangb commented May 20, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

`RowGroupsPrunedParquetOpen::build_stream` inlines the `build_projection_read_plan` + `reassign_expr_columns` + `make_projector` + `replace_schema` quartet right next to the decoder / stream wiring, which makes the opener's main orchestration body harder to follow and mixes two concerns: building the per-file projection vs. wiring it through the push-decoder stream.

This PR isolates that block behind a small `DecoderProjection` type whose public surface is just "give me the projection mask" and "project this decoded batch onto the output schema."

What changes are included in this PR?

  • New `decoder_projection` module with a `DecoderProjection` type:
    • `DecoderProjection::try_new(projection, physical_file_schema, parquet_schema, output_schema)` constructs the per-file projection in one call.
    • `projection_mask()` returns the mask installed on every decoder run.
    • `map(&batch)` applies the projector and, when needed, rebuilds the batch with `output_schema` to recover metadata / nullability that the file schema does not carry.
    • Fields are private.
  • `PushDecoderStreamState` collapses three fields (`projector`, `output_schema`, `replace_schema`) into a single `decoder_projection: DecoderProjection`. `project_batch` becomes a one-line delegate to `DecoderProjection::map`.
  • `replace_schema` is now derived from the projector's output schema (rather than the read plan's projected schema) so it stays correct under future widening of the decoder mask.
  • `DecoderBuilderConfig` carries the projection mask directly (`projection_mask: &ProjectionMask`) instead of the full `ParquetReadPlan`, since the read plan's `projected_schema` is no longer needed in this layer.
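
For readers who want a feel for the shape described above, here is a minimal, std-only sketch. It is not the code from this PR. `Field`, `Schema`, `Batch`, and `ProjectionMask` are simplified stand-ins for the arrow / parquet types, the projection is assumed to be a plain list of file column indices, and `try_new` takes three arguments instead of the four listed above. Only the names `DecoderProjection`, `try_new`, `projection_mask`, `map`, `output_schema`, and `replace_schema` come from the description.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

// Stand-ins (assumptions) for arrow's Field / Schema / RecordBatch
// and parquet's ProjectionMask; the real types are richer.
#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
    metadata: BTreeMap<String, String>,
}

#[derive(Clone, Debug, PartialEq)]
struct Schema {
    fields: Vec<Field>,
}

#[derive(Clone, Debug, PartialEq)]
struct Batch {
    schema: Arc<Schema>,
    columns: Vec<Vec<i64>>,
}

#[derive(Clone, Debug, PartialEq)]
struct ProjectionMask {
    leaves: Vec<usize>,
}

// Hypothetical helper for building a field without metadata.
fn field(name: &str, nullable: bool) -> Field {
    Field { name: name.to_string(), nullable, metadata: BTreeMap::new() }
}

struct DecoderProjection {
    mask: ProjectionMask,
    // Position of each output column inside the decoded (masked) batch;
    // a stand-in for the real projector.
    column_indices: Vec<usize>,
    projected_schema: Arc<Schema>,
    output_schema: Arc<Schema>,
    replace_schema: bool,
}

impl DecoderProjection {
    fn try_new(
        projection: &[usize],
        file_schema: &Schema,
        output_schema: Arc<Schema>,
    ) -> Result<Self, String> {
        if projection.len() != output_schema.fields.len() {
            return Err("projection and output schema differ in width".to_string());
        }
        // The mask is the sorted, de-duplicated set of file columns to decode.
        let mut leaves = projection.to_vec();
        leaves.sort_unstable();
        leaves.dedup();

        let mut column_indices = Vec::with_capacity(projection.len());
        let mut fields = Vec::with_capacity(projection.len());
        for &idx in projection {
            let f = file_schema
                .fields
                .get(idx)
                .ok_or_else(|| format!("column {idx} out of range"))?;
            // The decoder is assumed to emit masked columns in file order.
            let pos = leaves.binary_search(&idx).expect("idx is in leaves");
            column_indices.push(pos);
            fields.push(f.clone());
        }
        let projected_schema = Arc::new(Schema { fields });
        // Derived from the projector's own output schema, as described above.
        let replace_schema = *projected_schema != *output_schema;
        Ok(Self {
            mask: ProjectionMask { leaves },
            column_indices,
            projected_schema,
            output_schema,
            replace_schema,
        })
    }

    fn projection_mask(&self) -> &ProjectionMask {
        &self.mask
    }

    fn map(&self, batch: &Batch) -> Result<Batch, String> {
        let columns = self
            .column_indices
            .iter()
            .map(|&i| {
                batch
                    .columns
                    .get(i)
                    .cloned()
                    .ok_or_else(|| format!("decoded batch has no column {i}"))
            })
            .collect::<Result<Vec<_>, _>>()?;
        // Swap in the output schema only when it differs from what the
        // projector produced (e.g. nullability or metadata).
        let schema = if self.replace_schema {
            Arc::clone(&self.output_schema)
        } else {
            Arc::clone(&self.projected_schema)
        };
        Ok(Batch { schema, columns })
    }
}
```

In this sketch a caller would build the projection once per file, hand `projection_mask()` to the decoder, and call `map` on each decoded batch. Neither the mask internals nor the schema-replacement flag leak to the call site.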

No behaviour change.

Are these changes tested?

Covered by existing tests:

  • `cargo test -p datafusion-datasource-parquet` — 123 pass.
  • `cargo test -p datafusion --test parquet_integration` — 202 pass.
  • `cargo clippy -p datafusion-datasource-parquet --all-targets --all-features -- -D warnings` — clean.

Are there any user-facing changes?

No. All affected types are `pub(crate)`.

🤖 Generated with Claude Code

…ream

`RowGroupsPrunedParquetOpen::build_stream` used to inline the
`build_projection_read_plan` + `reassign_expr_columns` + `make_projector`
+ `replace_schema` quartet right next to the decoder / stream wiring,
which made the opener's main orchestration body harder to follow.

Move that block into a new `decoder_projection` module exposing a
single `DecoderProjection::build(projection, physical_file_schema,
parquet_schema, output_schema)` entry point. The struct keeps its
fields private and exposes:

* `projection_mask()` for the decoder builder, and
* `map(&batch)` which applies the projector and, when needed, rebuilds
  the batch with `output_schema` to recover metadata / nullability the
  file schema does not carry.

This collapses three fields on `PushDecoderStreamState` (`projector`,
`output_schema`, `replace_schema`) into a single `decoder_projection:
DecoderProjection`, and lets `project_batch` delegate to
`DecoderProjection::map`. `replace_schema` is derived from the
projector's output schema (rather than the read plan's projected
schema) so it stays correct under future widening of the decoder mask.

`DecoderBuilderConfig` now carries the projection mask directly
(`projection_mask: &ProjectionMask`) instead of the full
`ParquetReadPlan`, since the read plan's `projected_schema` is no
longer needed in this layer.
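
As a rough illustration of the narrowing described in this paragraph (a sketch with stand-in types, not the crate's actual definitions), the config can borrow just the mask rather than the whole plan. The field layouts and the `new` / `selected_leaves` helpers are assumptions made for the example.

```rust
// Stand-in for parquet's ProjectionMask (assumption: a list of leaf indices).
#[derive(Debug, PartialEq)]
struct ProjectionMask {
    leaves: Vec<usize>,
}

// Stand-in for the read plan; only the two parts named above are modelled.
struct ParquetReadPlan {
    projection_mask: ProjectionMask,
    projected_schema: Vec<String>,
}

// The config borrows only the mask, so this layer cannot start
// depending on the plan's projected schema again.
struct DecoderBuilderConfig<'a> {
    projection_mask: &'a ProjectionMask,
}

impl<'a> DecoderBuilderConfig<'a> {
    // Hypothetical constructor, for illustration only.
    fn new(projection_mask: &'a ProjectionMask) -> Self {
        Self { projection_mask }
    }

    // Hypothetical accessor showing what the decoder builder would read.
    fn selected_leaves(&self) -> &'a [usize] {
        &self.projection_mask.leaves
    }
}
```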

No behaviour change. All existing parquet tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the datasource Changes to the datasource crate label May 20, 2026
@adriangb adriangb requested a review from xudong963 May 20, 2026 19:44
@adriangb

@xudong963 another PR to factor complexity out of the opener 🙏🏻

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread datafusion/datasource-parquet/src/decoder_projection.rs Outdated
Co-authored-by: xudong.w <wxd963996380@gmail.com>
@adriangb adriangb enabled auto-merge May 21, 2026 12:43
@adriangb adriangb added this pull request to the merge queue May 21, 2026
Merged via the queue into apache:main with commit ad6a507 May 21, 2026
35 checks passed
@adriangb adriangb deleted the parquet-decoder-projection branch May 21, 2026 13:22

2 participants