[python] Add nested-field projection support (append-only paths)#7796

Merged
JingsongLi merged 3 commits into apache:master from TheR1sing3un:py-nested-projection
May 9, 2026
Member

@TheR1sing3un TheR1sing3un commented May 9, 2026

Purpose

ReadBuilder.with_projection only accepts top-level column names today; dotted forms like 'struct.subfield' are silently dropped. Reading just one leaf out of a struct ends up materialising the whole parent and discarding the rest client-side.

This PR ports nested projection on append-only read paths.

Commits

  1. Projection utility + ReadBuilder API: pypaimon/utils/projection.py (top-level / nested / empty subclasses). with_projection(['a.b.c']) resolves dotted names; with_nested_projection(int[][]) is the low-level entry point. Flat output names are _-joined with _$N suffix on collisions; leaf field IDs are preserved for schema-evolution remap.

  2. Parquet/ORC pushdown: FormatPyArrowReader switches to dataset.scanner(columns={flat: ds.field(*path)}) when any path has length > 1; missing leaves return NULL. SplitRead threads the paths through with small bypasses on _construct_partition_mapping, _get_fields_and_predicate, _get_final_read_data_fields, and create_index_mapping. Avro / Lance / Vortex / Blob raise NotImplementedError if a nested path reaches them. PK tables that need a merge read also raise; raw-convertible PK splits work.

  3. Avro fallback — fastavro has no native nested pruning; the reader walks each record dict by path. Top-level-only stays on the record.get(name) fast path.

Tests

  • test_projection_utility.py (21) + test_read_builder_nested_projection.py (10) cover the API and the projection utility.
  • test_nested_projection_e2e.py (8) covers Parquet dotted projection, mixed nested + top-level, low-level integer paths, partitioned tables, Avro fallback, top-level fast-path regression, PK + merge NotImplementedError.
  • Lint (flake8 --config=dev/cfg.ini) clean. Existing reader_*_test regression clean.

Out of scope (follow-up)

  • PK tables that go through MergeFileSplitRead — needs an OuterProjectionRecordReader so the merge function still sees full parent structs. Tracked separately; raises NotImplementedError until then.
  • ARRAY<ROW> / MAP nested paths.
  • Sub-field schema evolution by field ID (nested currently walks by name → NULL on rename).

API / format

Backward compatible: top-level-only with_projection callers see no change. New: with_nested_projection. No file format change.

Generative AI disclosure

Drafted with AI assistance; every behavioural guarantee is exercised by a test in one of the three new test files.

Introduces a small ``Projection`` abstraction with three concrete shapes
(empty / top-level / nested) that maps an integer-index or
integer-path projection onto the table's flat ``DataField`` list.
Nested paths flatten into top-level fields whose names are the
underscore-joined original path (``a_b`` for ``a.b``, with a ``_$N``
suffix on collisions) and whose IDs are inherited from the leaf so
schema-evolution remapping by field ID continues to work.
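
The flat-name rule above can be sketched in a few lines. This is only an illustration, not the code in ``pypaimon/utils/projection.py``: the function name ``flat_names`` is hypothetical, and starting the ``_$N`` counter at 1 is an assumption, since the commit message does not say where numbering begins.

```python
from typing import List, Sequence


def flat_names(paths: Sequence[Sequence[str]]) -> List[str]:
    """Join each name path with '_' and add a '_$N' suffix on collisions.

    Hypothetical helper; the counter starting at 1 is an assumption.
    """
    used = set()
    result = []
    for path in paths:
        base = "_".join(path)
        name = base
        n = 1
        # Keep bumping the suffix until the flat name is unique.
        while name in used:
            name = f"{base}_${n}"
            n += 1
        used.add(name)
        result.append(name)
    return result


# 'a.b' flattens to 'a_b'; a top-level column already called 'a_b' collides.
names = flat_names([["a", "b"], ["a_b"], ["c"]])
```

Here ``names`` would be ``['a_b', 'a_b_$1', 'c']`` under the stated assumption about the counter.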

ReadBuilder grows two API surfaces:

- ``with_projection(['struct.subfield'])`` — names with a dot are
  resolved into integer paths against the table schema. Top-level-only
  callers are unchanged. Unknown names continue to be silently skipped.
- ``with_nested_projection([[1, 0], [1, 1]])`` — low-level integer-path
  entry point.
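
As a rough sketch of how dotted names might map to integer paths, the snippet below resolves each name segment by position against a toy schema. The ``Field`` dataclass and both function names are hypothetical stand-ins, not pypaimon types; only the behaviour (dotted name to index path, unknown names silently skipped) comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    # Toy stand-in for a schema field; not pypaimon's DataField.
    name: str
    children: List["Field"] = field(default_factory=list)


def resolve_path(fields: List[Field], dotted: str) -> Optional[List[int]]:
    """Turn 'a.b' into a list of positional indexes, or None if unknown."""
    path = []
    current = fields
    for part in dotted.split("."):
        index = next((i for i, f in enumerate(current) if f.name == part), None)
        if index is None:
            return None
        path.append(index)
        current = current[index].children
    return path


def resolve_projection(fields: List[Field], names: List[str]) -> List[List[int]]:
    """Resolve every name, silently skipping the ones that do not exist."""
    resolved = (resolve_path(fields, name) for name in names)
    return [p for p in resolved if p is not None]


schema = [Field("id"), Field("mv", [Field("latest_version"), Field("updated_at")])]
paths = resolve_projection(schema, ["mv.latest_version", "id", "mv.missing"])
```

With this toy schema, ``paths`` comes out as ``[[1, 0], [0]]``; the unknown ``mv.missing`` is dropped rather than raising.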

``read_type()`` materialises the projected fields via ``Projection``
when nested paths are set, otherwise keeps the existing top-level
name-based path.

This commit only adds the utility, the new API, and unit tests. The
file readers and SplitRead still see only top-level fields — file-
level nested pushdown and PK-side outer projection follow in
subsequent commits.
…end-only

Wires the ``Projection`` infrastructure from the previous commit
through ``ReadBuilder.new_read`` → ``TableRead`` → ``SplitRead``
→ ``FormatPyArrowReader`` so a request like
``with_projection(['mv.latest_version'])`` actually prunes the
nested column at the file-read stage instead of materialising the
parent struct and discarding the unwanted children.

When at least one path has length > 1 the format reader switches
from a list-form ``columns=...`` scanner to a dict-form one with
``ds.field(*path)`` expressions, mirroring the engine's own nested
column-pruning surface. Sub-field schema evolution (a leaf renamed
or removed) is detected up front via ``_path_exists_in_arrow_schema``
and the missing column is served as NULL — same shape as the existing
top-level missing-field handling.

SplitRead grows three small helpers and one bypass:

* ``_nested_path_by_name`` indexes the user-facing flat names back
  onto their original-schema paths so ``file_reader_supplier`` can
  align ``nested_name_paths`` with the reader's ``ordered_read_fields``.
* ``_get_fields_and_predicate`` checks the path's top-level name
  against the file schema instead of the flat name, otherwise nested
  fields would be filtered out before reaching the format reader.
* ``_get_final_read_data_fields`` returns the user-facing flat names
  directly in nested mode; the trimmed-fields machinery cannot find
  matches by leaf field ID.
* ``create_index_mapping`` returns identity in nested mode because
  the format reader already emits batches whose columns match
  ``read_fields`` exactly.

Avro / Lance / Vortex / Blob raise ``NotImplementedError`` when a
nested path reaches them — Avro's Python-side fallback ships in the
next commit; the others have no nested-pruning support to mirror.

Primary-key tables that go through ``MergeFileSplitRead`` likewise
raise: an outer-projection layer that lets the merge function see
complete parent structs is the next phase.

Tests: ``test_nested_projection_e2e.py`` covers the dotted-name path,
the low-level integer-path API, mixed projection ordering, the
top-level fast-path regression, and the PK-merge-path
NotImplementedError guard.

fastavro has no native nested column pruning, so the reader walks
each record dict step-by-step using the same ``nested_name_paths``
the previous commit threaded through ``SplitRead``. Top-level-only
projection keeps the existing ``record.get(name)`` fast path.

The walk helper returns ``None`` when any path segment is missing or
hits a non-dict value, which lets sub-field schema evolution surface
as a NULL column instead of an exception — same shape as the
Parquet/ORC missing-leaf handling.
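
A walk helper with the behaviour described above might look like the following. The name ``walk_record`` is hypothetical and the real reader code may differ; the sketch only shows the "missing segment or non-dict value yields ``None``" rule.

```python
from typing import Any, Sequence


def walk_record(record: Any, path: Sequence[str]) -> Any:
    """Follow a name path through nested dicts, returning None on any miss.

    Hypothetical helper illustrating the rule, not the reader's actual code.
    """
    current = record
    for name in path:
        # A non-dict value or an absent key both surface as NULL.
        if not isinstance(current, dict) or name not in current:
            return None
        current = current[name]
    return current


record = {"id": 1, "mv": {"latest_version": 3}}
found = walk_record(record, ["mv", "latest_version"])
renamed = walk_record(record, ["mv", "old_name"])
not_a_struct = walk_record(record, ["id", "latest_version"])
```

``found`` is ``3``, while ``renamed`` and ``not_a_struct`` are both ``None``, so a renamed or removed leaf reads as a NULL column instead of raising.
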
@TheR1sing3un TheR1sing3un force-pushed the py-nested-projection branch from e7e5835 to 6826254 Compare May 9, 2026 10:24
Contributor

@JingsongLi JingsongLi left a comment

+1

@JingsongLi JingsongLi merged commit 6fda857 into apache:master May 9, 2026
6 checks passed