feat(table): support column projection in ReadBuilder #153

Open
QuakeWang wants to merge 1 commit into apache:main from QuakeWang:feat/column-projection

@QuakeWang

Purpose

Linked issue: close #146

Add column projection support to ReadBuilder, allowing users to specify which columns to read from Parquet data files. This reduces unnecessary I/O for wide tables, especially on remote storage (S3/OSS), by leveraging the existing ProjectionMask in the Parquet reader layer.

Brief change log

  • crates/paimon/src/table/read_builder.rs

    • Add ReadBuilder::with_projection(&[&str]) — accepts column names, stores them for deferred resolution.
    • Add resolve_projected_fields — resolves names to DataFields in caller-specified order; rejects unknown columns (ColumnNotExist) and duplicates (ConfigInvalid).
    • Change TableRead::read_type from &[DataField] to Vec<DataField> to support projected subsets.
  • crates/paimon/src/arrow/reader.rs

    • Always apply ProjectionMask to clip Parquet schema to read_type columns (previously only applied conditionally).
    • Add column reorder step after Parquet read to guarantee RecordBatch schema matches read_type order (Parquet ProjectionMask returns columns in file-schema order, not projection order).
    • Use fail-fast error propagation instead of silent fallback when projected columns are missing from batch schema.
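
The reorder and fail-fast steps in reader.rs can be illustrated with a minimal sketch. It does not use the Arrow or Parquet crates: `Batch` and `reorder_columns` are hypothetical stand-ins for a RecordBatch and the reorder step, assumed here only to show why a batch that arrives in file-schema order has to be rearranged into projection order, and why a missing column should be an error rather than a silent fallback.

```rust
/// Hypothetical stand-in for an Arrow RecordBatch: named columns of i64 values.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub columns: Vec<(String, Vec<i64>)>,
}

/// Rebuild `batch` so that its columns follow `projection` order.
///
/// The input is assumed to be in file-schema order, which is what a
/// Parquet projection mask yields. A projected column that is absent
/// from the batch is reported as an error (fail-fast), never skipped.
pub fn reorder_columns(batch: &Batch, projection: &[&str]) -> Result<Batch, String> {
    let mut columns = Vec::with_capacity(projection.len());
    for name in projection {
        let column = batch
            .columns
            .iter()
            .find(|(n, _)| n.as_str() == *name)
            .ok_or_else(|| format!("projected column `{name}` missing from batch schema"))?;
        columns.push(column.clone());
    }
    Ok(Batch { columns })
}
```

With a batch holding `id`, `name`, `age` in file order and a projection of `["age", "id"]`, the result holds `age` first and `id` second; a projection containing a column not present in the batch returns `Err`.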

Tests

  • cargo test -p paimon-integration-tests test_read_with_column_projection
  • cargo test -p paimon-integration-tests test_read_projection_empty
  • cargo test -p paimon-integration-tests test_read_projection_unknown_column
  • cargo test -p paimon-integration-tests test_read_projection_all_invalid
  • cargo test -p paimon-integration-tests test_read_projection_duplicate_column

API and Format

  • New public API: ReadBuilder::with_projection(&[&str]) -> Self
  • TableRead::read_type() still returns &[DataField]; the slice is now borrowed from an owned Vec<DataField> instead of from the table schema. No breaking change for callers.
  • No storage format changes.
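
A rough sketch of how the new API could fit together, for reviewers who want the shape without opening the diff. Everything below is a simplified mock, not the code in this PR: the `DataField` fields, the `ProjectionError` enum, `ReadBuilder::new`, and `new_read` are assumptions made for illustration. Only the names `with_projection`, `resolve_projected_fields`, `read_type`, `ColumnNotExist`, and `ConfigInvalid` come from the description above. How an empty projection is handled is not stated here; the sketch simply resolves it to no fields.

```rust
use std::collections::HashSet;

/// Simplified field descriptor (assumed shape, for illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct DataField {
    pub id: i32,
    pub name: String,
}

/// Assumed error enum mirroring the two error kinds named in the change log.
#[derive(Debug, PartialEq)]
pub enum ProjectionError {
    ColumnNotExist(String),
    ConfigInvalid(String),
}

/// Resolve column names to fields in caller-specified order,
/// rejecting duplicates and unknown columns.
pub fn resolve_projected_fields(
    schema: &[DataField],
    names: &[String],
) -> Result<Vec<DataField>, ProjectionError> {
    let mut seen = HashSet::new();
    let mut fields = Vec::with_capacity(names.len());
    for name in names {
        if !seen.insert(name.as_str()) {
            return Err(ProjectionError::ConfigInvalid(format!(
                "duplicate projection column: {name}"
            )));
        }
        let field = schema
            .iter()
            .find(|f| f.name == *name)
            .ok_or_else(|| ProjectionError::ColumnNotExist(name.clone()))?;
        fields.push(field.clone());
    }
    Ok(fields)
}

/// Mock builder: stores projected names and resolves them later.
pub struct ReadBuilder {
    schema: Vec<DataField>,
    projection: Option<Vec<String>>,
}

impl ReadBuilder {
    pub fn new(schema: Vec<DataField>) -> Self {
        Self { schema, projection: None }
    }

    /// Store the column names; resolution is deferred to `new_read`.
    pub fn with_projection(mut self, columns: &[&str]) -> Self {
        self.projection = Some(columns.iter().map(|c| c.to_string()).collect());
        self
    }

    /// Hypothetical constructor for the read side; resolves names here.
    pub fn new_read(&self) -> Result<TableRead, ProjectionError> {
        let read_type = match &self.projection {
            Some(names) => resolve_projected_fields(&self.schema, names)?,
            None => self.schema.clone(),
        };
        Ok(TableRead { read_type })
    }
}

/// Mock reader that owns its projected field list.
#[derive(Debug)]
pub struct TableRead {
    read_type: Vec<DataField>,
}

impl TableRead {
    /// Same `&[DataField]` return type, now borrowed from an owned Vec.
    pub fn read_type(&self) -> &[DataField] {
        &self.read_type
    }
}
```

Because `read_type()` keeps returning a slice, existing callers that only iterate over the fields would compile unchanged in this model; only the owner of the data moves from the table schema to the reader.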

Documentation

@QuakeWang

@JingsongLi hello, PTAL. The CI also needs a re-run to be triggered : )

@JingsongLi JingsongLi closed this Mar 25, 2026
@JingsongLi JingsongLi reopened this Mar 25, 2026