[python] Fix predicate index stale after with_projection narrows read_type by TheR1sing3un · Pull Request #7741 · apache/paimon

TheR1sing3un · 2026-04-29T13:08:21Z

Purpose

Predicate leaves are built against the original table schema via PredicateBuilder, so each leaf's index field encodes a position in that schema's column order. SplitRead.__init__ then takes the predicate as-is and forwards it to the row-level FilterRecordReader.

When the caller chains .with_projection(...) onto the read builder, read_type becomes narrower or reordered relative to the original schema. The OffsetRow handed to FilterRecordReader uses read_type indices, so the stale index carried on the predicate either

raises "Position N is out of bounds for row arity M" when N is past the end of the projected row, or
silently selects the wrong column when N is still in range but no longer points at the predicate field.

This shows up most easily on PK tables (the row-level filter path) when a non-PK column is filtered while the projection narrows the read.

Fix

Add rewrite_predicate_indices in push_down_utils.py and call it from SplitRead.__init__, against read_type rather than self.read_fields. The distinction matters:

MergeFileSplitRead augments read_fields with _KEY_*/_SEQ/_KIND prefixes for the underlying KV scan.
KeyValueUnwrapRecordReader returns kv.value whose arity equals len(read_type) and whose coordinate space is read_type.

FilterRecordReader evaluates against the unwrapped value, so read_type is the correct coordinate space for both MergeFileSplitRead and RawFileSplitRead. The membership check (whether to apply the filter at all) also moves onto read_type for the same reason — a predicate field that is in read_fields only via a _KEY_* augmentation is not visible to FilterRecordReader.

Helper contract: every predicate leaf field must be present in read_fields, otherwise the helper raises with a clear message; the caller in SplitRead already filters out the off-projection case before invoking the rewrite.

Linked issue

N/A — surfaced when reading a PK table with with_projection(['id', 'value']).with_filter(equal('value', ...)). The pre-fix path raised IndexError from OffsetRow.get_field.

Tests

New pypaimon/tests/projection_predicate_index_test.py:

End-to-end (kernel ReadBuilder, no extra integrations):

test_pk_filter_on_non_pk_with_projection_keeping_filter_column — was the IndexError trigger.
test_pk_filter_on_non_pk_with_projection_reordering_columns — projection reorders columns; remap must follow.
test_pk_filter_no_projection_still_works — sanity: no-projection path matches pre-fix output.
test_pk_filter_column_outside_projection_drops_filter — when the filter column is not projected the filter cannot apply at row level; we drop it, we do not misapply with a stale index.
test_append_only_filter_with_projection_unchanged — append-only file-level pushdown uses field names rather than indices; guard against any regression introduced by the fix.

Direct unit tests for the helper:

test_remaps_leaf_index_to_position_in_read_fields — single leaf, narrowed/reordered projection.
test_remaps_inside_and_or — nested AND of leaves and a nested OR; every leaf rebound.
test_returns_none_for_none_input.
test_raises_when_leaf_field_missing — leaf field absent from read_fields raises a clear ValueError.

Local: pytest pypaimon/tests/projection_predicate_index_test.py → 9 passed; flake8 --config=dev/cfg.ini clean. Surrounding test suites (predicates_test, reader_primary_key_test, reader_append_only_test, data_evolution_test) show no new failures vs origin/master; the only failures in those files are the pre-existing lance / vortex environment issues.

API and format

No public API change. No file format change. The behavioural change is restricted to scans that combine with_filter() and with_projection() and previously either crashed or returned wrong rows.

Documentation

The new helper carries a docstring explaining when and why the index remap is needed; the call site in SplitRead.__init__ keeps a long-form comment documenting the read_type vs read_fields distinction so future maintainers don't accidentally regress.

Generative AI disclosure

Drafted with assistance from an AI coding tool; root cause and fix verified by the author against the row-level FilterRecordReader path on PK tables, and exercised by the regression tests above.

…_type Predicate leaves are built against the original table schema via PredicateBuilder, so each leaf's `index` field encodes a position in that schema's column order. SplitRead.__init__ then takes that predicate as-is and forwards it to the row-level FilterRecordReader. When the caller has chained with_projection(...) onto the read builder, read_type is narrower or reordered relative to the original schema. The OffsetRow handed to FilterRecordReader uses read_type indices, so the stale `index` either: * raises "Position N is out of bounds for row arity M" (when N is past the end of the projected row), or * silently selects the wrong column (when N is in range but no longer points at the predicate field). Fix: add rewrite_predicate_indices in push_down_utils.py, called from SplitRead.__init__ against `read_type` (not `self.read_fields`). The distinction matters: MergeFileSplitRead augments `read_fields` with _KEY_*/_SEQ/_KIND prefixes for the underlying KV scan, but KeyValueUnwrapRecordReader returns kv.value whose arity equals len(read_type) and whose coordinate space is read_type — that is the space FilterRecordReader actually evaluates against. The membership check (whether to apply the filter at all) is also moved onto read_type for the same reason: a predicate field that is in `read_fields` only via a _KEY_* augmentation is not visible to FilterRecordReader. Tests: * projection_predicate_index_test.py exercises both halves of the bug: PK + filter on non-PK + projection that keeps the filter column (was the IndexError); PK + projection that reorders columns (would silently pick the wrong column); PK + filter column outside the projection (filter must drop, not misapply); append-only sanity via the file-level pushdown route which uses field names rather than indices. * RewritePredicateIndicesUnitTest exercises the helper directly: leaf remap, nested AND/OR remap, None passthrough, and the "field not in read_fields" error path.

JingsongLi

+1

JingsongLi approved these changes Apr 29, 2026

View reviewed changes

JingsongLi merged commit 12c32b1 into apache:master Apr 29, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[python] Fix predicate index stale after with_projection narrows read_type#7741

[python] Fix predicate index stale after with_projection narrows read_type#7741
JingsongLi merged 1 commit into
apache:masterfrom
TheR1sing3un:py-fix-predicate-index-after-projection

TheR1sing3un commented Apr 29, 2026

Uh oh!

JingsongLi left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

TheR1sing3un commented Apr 29, 2026

Purpose

Fix

Linked issue

Tests

API and format

Documentation

Generative AI disclosure

Uh oh!

JingsongLi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants