[python] Add nested-field projection support (append-only paths) by TheR1sing3un · Pull Request #7796 · apache/paimon

TheR1sing3un · 2026-05-09T09:25:28Z

Purpose

ReadBuilder.with_projection only accepts top-level column names today; dotted forms like 'struct.subfield' are silently dropped. Reading just one leaf out of a struct ends up materialising the whole parent and discarding the rest client-side.

This PR ports nested projection to the append-only read paths.

Commits

Projection utility + ReadBuilder API: pypaimon/utils/projection.py (top-level / nested / empty subclasses). with_projection(['a.b.c']) resolves dotted names; with_nested_projection(int[][]) is the low-level entry point. Flat output names are joined with _ and get a _$N suffix on collisions; leaf field IDs are preserved for schema-evolution remap.
Parquet/ORC pushdown: FormatPyArrowReader switches to dataset.scanner(columns={flat: ds.field(*path)}) when any path has length > 1; missing leaves return NULL. SplitRead threads the paths through with small bypasses on _construct_partition_mapping, _get_fields_and_predicate, _get_final_read_data_fields, and create_index_mapping. Avro / Lance / Vortex / Blob raise NotImplementedError if a nested path reaches them (Avro gains a Python-side fallback in the next commit). PK tables that need a merge read also raise; raw-convertible PK splits work.
Avro fallback: fastavro has no native nested pruning, so the reader walks each record dict by path. Top-level-only stays on the record.get(name) fast path.
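The flat-name rule from the first commit (underscore-joined path, ``_$N`` suffix on collisions) can be sketched as below. This is an illustration only, not the code in pypaimon/utils/projection.py: the function name ``flat_names`` is hypothetical, and the numbering scheme (counter starting at 1, incremented until the name is unique) is an assumption, since the PR text does not spell it out.

```python
from typing import List, Sequence


def flat_names(paths: Sequence[Sequence[str]]) -> List[str]:
    """Join each name path with '_' and de-duplicate with a '_$N' suffix.

    Hypothetical helper: the real numbering scheme may differ.
    """
    seen = set()
    result = []
    for path in paths:
        base = "_".join(path)
        name = base
        n = 0
        # Assumption: the counter starts at 1 and grows until the name is unique.
        while name in seen:
            n += 1
            name = f"{base}_${n}"
        seen.add(name)
        result.append(name)
    return result


# 'a.b' flattens to 'a_b', which collides with a top-level column named 'a_b'.
names = flat_names([["a", "b"], ["a_b"], ["c"]])
print(names)
```

The second path keeps its meaning but receives a suffixed name, so both columns can coexist in the flat output schema.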

Tests

test_projection_utility.py (21) + test_read_builder_nested_projection.py (10) cover the API and the projection utility.
test_nested_projection_e2e.py (8) covers Parquet dotted projection, mixed nested + top-level, low-level integer paths, partitioned tables, Avro fallback, top-level fast-path regression, PK + merge NotImplementedError.
Lint (flake8 --config=dev/cfg.ini) clean. Existing reader_*_test regression clean.

Out of scope (follow-up)

PK tables that go through MergeFileSplitRead — needs an OuterProjectionRecordReader so the merge function still sees full parent structs. Tracked separately; raises NotImplementedError until then.
ARRAY<ROW> / MAP nested paths.
Sub-field schema evolution by field ID (nested currently walks by name → NULL on rename).

API / format

Backward compatible: top-level-only with_projection callers see no change. New: with_nested_projection. No file format change.

Generative AI disclosure

Drafted with AI assistance; every behavioural guarantee is exercised by a test in one of the three new test files.

Introduces a small ``Projection`` abstraction with three concrete shapes (empty / top-level / nested) that maps an integer-index or integer-path projection onto the table's flat ``DataField`` list. Nested paths flatten into top-level fields whose names are the underscore-joined original path (``a_b`` for ``a.b``, with a ``_$N`` suffix on collisions) and whose IDs are inherited from the leaf so schema-evolution remapping by field ID continues to work.

ReadBuilder grows two API surfaces:

- ``with_projection(['struct.subfield'])``: names with a dot are resolved into integer paths against the table schema. Top-level-only callers are unchanged. Unknown names continue to be silently skipped.
- ``with_nested_projection([[1, 0], [1, 1]])``: low-level integer-path entry point.

``read_type()`` materialises the projected fields via ``Projection`` when nested paths are set, otherwise keeps the existing top-level name-based path.

This commit only adds the utility, the new API, and unit tests. The file readers and SplitRead still see only top-level fields; file-level nested pushdown and PK-side outer projection follow in subsequent commits.
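To make the dotted-name resolution concrete, here is a minimal sketch of turning names such as ``mv.latest_version`` into integer paths. It does not use pypaimon's ``DataField`` or ``Projection`` classes: the ``Field`` dataclass and the functions ``resolve_path`` and ``resolve_projection`` are hypothetical stand-ins, and the sample schema is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Field:
    """Hypothetical stand-in for a schema field; not pypaimon's DataField."""
    name: str
    children: List["Field"] = field(default_factory=list)


def resolve_path(fields: Sequence[Field], dotted: str) -> Optional[List[int]]:
    """Resolve 'a.b.c' into a list of positional indexes, or None if unknown."""
    path = []
    current = fields
    for segment in dotted.split("."):
        index = next((i for i, f in enumerate(current) if f.name == segment), None)
        if index is None:
            return None
        path.append(index)
        current = current[index].children
    return path


def resolve_projection(fields: Sequence[Field], names: Sequence[str]) -> List[List[int]]:
    """Resolve every name and silently skip the unknown ones."""
    resolved = (resolve_path(fields, name) for name in names)
    return [p for p in resolved if p is not None]


schema = [
    Field("id"),
    Field("mv", [Field("latest_version"), Field("note")]),
]
paths = resolve_projection(schema, ["mv.latest_version", "id", "mv.missing"])
print(paths)
```

The resulting list has the same shape as the argument of ``with_nested_projection``: one integer path per projected leaf, in request order, with the unknown ``mv.missing`` dropped.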

…end-only

Wires the ``Projection`` infrastructure from the previous commit through ``ReadBuilder.new_read`` → ``TableRead`` → ``SplitRead`` → ``FormatPyArrowReader`` so a request like ``with_projection(['mv.latest_version'])`` actually prunes the nested column at the file-read stage instead of materialising the parent struct and discarding the unwanted children.

When at least one path has length > 1 the format reader switches from a list-form ``columns=...`` scanner to a dict-form one with ``ds.field(*path)`` expressions, mirroring the engine's own nested column-pruning surface. Sub-field schema evolution (a leaf renamed or removed) is detected up front via ``_path_exists_in_arrow_schema`` and the missing column is served as NULL, the same shape as the existing top-level missing-field handling.

SplitRead grows three small helpers and one bypass:

* ``_nested_path_by_name`` indexes the user-facing flat names back onto their original-schema paths so ``file_reader_supplier`` can align ``nested_name_paths`` with the reader's ``ordered_read_fields``.
* ``_get_fields_and_predicate`` checks the path's top-level name against the file schema instead of the flat name, otherwise nested fields would be filtered out before reaching the format reader.
* ``_get_final_read_data_fields`` returns the user-facing flat names directly in nested mode; the trimmed-fields machinery cannot find matches by leaf field ID.
* ``create_index_mapping`` returns identity in nested mode because the format reader already emits batches whose columns match ``read_fields`` exactly.

Avro / Lance / Vortex / Blob raise ``NotImplementedError`` when a nested path reaches them. Avro's Python-side fallback ships in the next commit; the others have no nested-pruning support to mirror. Primary-key tables that go through ``MergeFileSplitRead`` likewise raise: an outer-projection layer that lets the merge function see complete parent structs is the next phase.

Tests: ``test_nested_projection_e2e.py`` covers the dotted-name path, the low-level integer-path API, mixed projection ordering, the top-level fast path staying intact, and the PK-merge-path NotImplementedError guard.

fastavro has no native nested column pruning, so the reader walks each record dict step-by-step using the same ``nested_name_paths`` the previous commit threaded through ``SplitRead``. Top-level-only projection keeps the existing ``record.get(name)`` fast path. The walk helper returns ``None`` when any path segment is missing or hits a non-dict value, which lets sub-field schema evolution surface as a NULL column instead of an exception — same shape as the Parquet/ORC missing-leaf handling.
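A minimal sketch of such a walk, assuming records arrive as plain Python dicts (as fastavro yields them). The names ``walk`` and ``project_record`` are hypothetical and not taken from the PR's code; the sample records are invented.

```python
from typing import Any, Dict, Mapping, Sequence


def walk(record: Any, path: Sequence[str]) -> Any:
    """Follow a name path through nested dicts.

    Returns None when a segment is missing or a non-dict value is hit
    before the path is exhausted.
    """
    current = record
    for segment in path:
        if not isinstance(current, dict) or segment not in current:
            return None
        current = current[segment]
    return current


def project_record(record: Dict[str, Any],
                   flat_to_path: Mapping[str, Sequence[str]]) -> Dict[str, Any]:
    """Build one output row keyed by flat column name."""
    return {flat: walk(record, path) for flat, path in flat_to_path.items()}


records = [
    {"id": 1, "mv": {"latest_version": 7, "note": "a"}},
    {"id": 2, "mv": None},   # parent struct is null
    {"id": 3},               # parent struct is absent from the record
]
paths = {"id": ["id"], "mv_latest_version": ["mv", "latest_version"]}
rows = [project_record(r, paths) for r in records]
print(rows)
```

Both the null parent and the absent parent produce ``None`` for the nested column rather than raising, which matches the NULL-column behaviour described above.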

JingsongLi

+1

TheR1sing3un added 3 commits May 9, 2026 18:23

TheR1sing3un force-pushed the py-nested-projection branch from e7e5835 to 6826254 Compare May 9, 2026 10:24

JingsongLi approved these changes May 9, 2026

JingsongLi merged commit 6fda857 into apache:master May 9, 2026
6 checks passed

TheR1sing3un mentioned this pull request May 9, 2026

[python] Add nested-field projection on primary-key merge-read path #7801

Merged

JingsongLi merged 3 commits into apache:master from TheR1sing3un:py-nested-projection
