[python] Add nested-field projection support (append-only paths) #7796
Merged
JingsongLi merged 3 commits into apache:master on May 9, 2026
Introduces a small ``Projection`` abstraction with three concrete shapes (empty / top-level / nested) that maps an integer-index or integer-path projection onto the table's flat ``DataField`` list. Nested paths flatten into top-level fields whose names are the underscore-joined original path (``a_b`` for ``a.b``, with a ``_$N`` suffix on collisions) and whose IDs are inherited from the leaf so schema-evolution remapping by field ID continues to work.

ReadBuilder grows two API surfaces:

- ``with_projection(['struct.subfield'])``: names with a dot are resolved into integer paths against the table schema. Top-level-only callers are unchanged. Unknown names continue to be silently skipped.
- ``with_nested_projection([[1, 0], [1, 1]])``: low-level integer-path entry point.

``read_type()`` materialises the projected fields via ``Projection`` when nested paths are set, otherwise keeps the existing top-level name-based path.

This commit only adds the utility, the new API, and unit tests. The file readers and SplitRead still see only top-level fields; file-level nested pushdown and PK-side outer projection follow in subsequent commits.
…end-only

Wires the ``Projection`` infrastructure from the previous commit through ``ReadBuilder.new_read`` → ``TableRead`` → ``SplitRead`` → ``FormatPyArrowReader`` so a request like ``with_projection(['mv.latest_version'])`` actually prunes the nested column at the file-read stage instead of materialising the parent struct and discarding the unwanted children.

When at least one path has length > 1 the format reader switches from a list-form ``columns=...`` scanner to a dict-form one with ``ds.field(*path)`` expressions, mirroring the engine's own nested column-pruning surface. Sub-field schema evolution (a leaf renamed or removed) is detected up front via ``_path_exists_in_arrow_schema`` and the missing column is served as NULL, the same shape as the existing top-level missing-field handling.

SplitRead grows three small helpers and one bypass:

* ``_nested_path_by_name`` indexes the user-facing flat names back onto their original-schema paths so ``file_reader_supplier`` can align ``nested_name_paths`` with the reader's ``ordered_read_fields``.
* ``_get_fields_and_predicate`` checks the path's top-level name against the file schema instead of the flat name, otherwise nested fields would be filtered out before reaching the format reader.
* ``_get_final_read_data_fields`` returns the user-facing flat names directly in nested mode; the trimmed-fields machinery cannot find matches by leaf field ID.
* ``create_index_mapping`` returns identity in nested mode because the format reader already emits batches whose columns match ``read_fields`` exactly.

Avro / Lance / Vortex / Blob raise ``NotImplementedError`` when a nested path reaches them: Avro's Python-side fallback ships in the next commit; the others have no nested-pruning support to mirror. Primary-key tables that go through ``MergeFileSplitRead`` likewise raise: an outer-projection layer that lets the merge function see complete parent structs is the next phase.
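The up-front missing-leaf check can be illustrated without Arrow. The sketch below models a file schema as nested dicts and splits the requested columns into those that can be read and those that would be served as NULL. ``path_exists`` and ``plan_columns`` are hypothetical names standing in for the role ``_path_exists_in_arrow_schema`` plays in the commit; the dict-based schema is an assumption made to keep the example self-contained.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# A file schema modelled as nested dicts: a struct maps child names to
# sub-schemas, a leaf maps to None. This stands in for an Arrow schema.
Schema = Dict[str, Optional[dict]]


def path_exists(schema: Schema, path: Sequence[str]) -> bool:
    """Return True when every segment of `path` resolves in `schema`."""
    node: Optional[dict] = schema
    for segment in path:
        if not isinstance(node, dict) or segment not in node:
            return False
        node = node[segment]
    return True


def plan_columns(
    read_fields: Sequence[Tuple[str, Sequence[str]]], file_schema: Schema
) -> Tuple[Dict[str, Tuple[str, ...]], List[str]]:
    """Split requested (flat_name, path) pairs into readable and missing."""
    present: Dict[str, Tuple[str, ...]] = {}
    missing: List[str] = []
    for flat_name, path in read_fields:
        if path_exists(file_schema, path):
            present[flat_name] = tuple(path)
        else:
            missing.append(flat_name)  # to be filled with NULLs afterwards
    return present, missing
```

In a reader built along these lines, each entry of ``present`` would become one ``flat_name: ds.field(*path)`` item of the dict-form scanner, and each name in ``missing`` would be appended as an all-NULL column after the scan.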
Tests: ``test_nested_projection_e2e.py`` covers the dotted-name path, the low-level integer-path API, mixed projection ordering, the top-level fast path staying intact, and the PK-merge-path ``NotImplementedError`` guard.
fastavro has no native nested column pruning, so the reader walks each record dict step-by-step using the same ``nested_name_paths`` the previous commit threaded through ``SplitRead``. Top-level-only projection keeps the existing ``record.get(name)`` fast path. The walk helper returns ``None`` when any path segment is missing or hits a non-dict value, which lets sub-field schema evolution surface as a NULL column instead of an exception — same shape as the Parquet/ORC missing-leaf handling.
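A walk helper of the kind described here might look like the following. It is a sketch under the stated behaviour (return ``None`` on a missing segment or a non-dict value), not the reader's actual code; ``walk`` and ``project_record`` are hypothetical names.

```python
from typing import Any, Dict, Mapping, Sequence


def walk(record: Any, path: Sequence[str]) -> Any:
    """Follow `path` through nested dicts; None if any step cannot be taken."""
    value = record
    for segment in path:
        if not isinstance(value, dict) or segment not in value:
            return None  # missing leaf or non-dict parent surfaces as NULL
        value = value[segment]
    return value


def project_record(
    record: Dict[str, Any], name_paths: Mapping[str, Sequence[str]]
) -> Dict[str, Any]:
    """Build one output row keyed by flat name from a decoded Avro record."""
    return {flat: walk(record, path) for flat, path in name_paths.items()}
```

Returning ``None`` rather than raising keeps a renamed or removed sub-field from failing the whole read, at the cost of not distinguishing "absent" from "present but null".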
e7e5835 to 6826254
Purpose
`ReadBuilder.with_projection` only accepts top-level column names today; dotted forms like `'struct.subfield'` are silently dropped. Reading just one leaf out of a struct ends up materialising the whole parent and discarding the rest client-side.

This PR ports nested projection for append-only read paths.
Commits
- `Projection` utility + `ReadBuilder` API: `pypaimon/utils/projection.py` (top-level / nested / empty subclasses). `with_projection(['a.b.c'])` resolves dotted names; `with_nested_projection(int[][])` is the low-level entry point. Flat output names are `_`-joined with a `_$N` suffix on collisions; leaf field IDs are preserved for schema-evolution remap.
- Parquet/ORC pushdown: `FormatPyArrowReader` switches to `dataset.scanner(columns={flat: ds.field(*path)})` when any path has length > 1; missing leaves return NULL. `SplitRead` threads the paths through with small bypasses on `_construct_partition_mapping`, `_get_fields_and_predicate`, `_get_final_read_data_fields`, and `create_index_mapping`. Avro / Lance / Vortex / Blob raise `NotImplementedError` if a nested path reaches them. PK tables that need a merge read also raise; raw-convertible PK splits work.
- Avro fallback: fastavro has no native nested pruning; the reader walks each record dict by path. Top-level-only stays on the `record.get(name)` fast path.

Tests
- `test_projection_utility.py` (21) + `test_read_builder_nested_projection.py` (10) cover the API and the projection utility.
- `test_nested_projection_e2e.py` (8) covers Parquet dotted projection, mixed nested + top-level, low-level integer paths, partitioned tables, Avro fallback, top-level fast-path regression, PK + merge `NotImplementedError`.
- `flake8 --config=dev/cfg.ini` clean. Existing `reader_*_test` regression clean.

Out of scope (follow-up)
- `MergeFileSplitRead`: needs an `OuterProjectionRecordReader` so the merge function still sees full parent structs. Tracked separately; raises `NotImplementedError` until then.
- `ARRAY<ROW>` / `MAP` nested paths.

API / format
Backward compatible: top-level-only `with_projection` callers see no change. New: `with_nested_projection`. No file format change.

Generative AI disclosure
Drafted with AI assistance; every behavioural guarantee is exercised by a test in one of the three new test files.