Skip to content

[python] Fix predicate index stale after with_projection narrows read_type#7741

Merged
JingsongLi merged 1 commit into
apache:masterfrom
TheR1sing3un:py-fix-predicate-index-after-projection
Apr 29, 2026
Merged

[python] Fix predicate index stale after with_projection narrows read_type#7741
JingsongLi merged 1 commit into
apache:masterfrom
TheR1sing3un:py-fix-predicate-index-after-projection

Conversation

@TheR1sing3un
Copy link
Copy Markdown
Member

Purpose

Predicate leaves are built against the original table schema via PredicateBuilder, so each leaf's index field encodes a position in that schema's column order. SplitRead.__init__ then takes the predicate as-is and forwards it to the row-level FilterRecordReader.

When the caller chains .with_projection(...) onto the read builder, read_type becomes narrower or reordered relative to the original schema. The OffsetRow handed to FilterRecordReader uses read_type indices, so the stale index carried on the predicate either

  • raises "Position N is out of bounds for row arity M" when N is past the end of the projected row, or
  • silently selects the wrong column when N is still in range but no longer points at the predicate field.

This shows up most easily on PK tables (the row-level filter path) when a non-PK column is filtered while the projection narrows the read.

Fix

Add rewrite_predicate_indices in push_down_utils.py and call it from SplitRead.__init__, against read_type rather than self.read_fields. The distinction matters:

  • MergeFileSplitRead augments read_fields with _KEY_*/_SEQ/_KIND prefixes for the underlying KV scan.
  • KeyValueUnwrapRecordReader returns kv.value whose arity equals len(read_type) and whose coordinate space is read_type.

FilterRecordReader evaluates against the unwrapped value, so read_type is the correct coordinate space for both MergeFileSplitRead and RawFileSplitRead. The membership check (whether to apply the filter at all) also moves onto read_type for the same reason — a predicate field that is in read_fields only via a _KEY_* augmentation is not visible to FilterRecordReader.

Helper contract: every predicate leaf field must be present in read_fields, otherwise the helper raises with a clear message; the caller in SplitRead already filters out the off-projection case before invoking the rewrite.

Linked issue

N/A — surfaced when reading a PK table with with_projection(['id', 'value']).with_filter(equal('value', ...)). The pre-fix path raised IndexError from OffsetRow.get_field.

Tests

New pypaimon/tests/projection_predicate_index_test.py:

End-to-end (kernel ReadBuilder, no extra integrations):

  • test_pk_filter_on_non_pk_with_projection_keeping_filter_column — was the IndexError trigger.
  • test_pk_filter_on_non_pk_with_projection_reordering_columns — projection reorders columns; remap must follow.
  • test_pk_filter_no_projection_still_works — sanity: no-projection path matches pre-fix output.
  • test_pk_filter_column_outside_projection_drops_filter — when the filter column is not projected the filter cannot apply at row level; we drop it, we do not misapply with a stale index.
  • test_append_only_filter_with_projection_unchanged — append-only file-level pushdown uses field names rather than indices; guard against any regression introduced by the fix.

Direct unit tests for the helper:

  • test_remaps_leaf_index_to_position_in_read_fields — single leaf, narrowed/reordered projection.
  • test_remaps_inside_and_or — nested AND of leaves and a nested OR; every leaf rebound.
  • test_returns_none_for_none_input.
  • test_raises_when_leaf_field_missing — leaf field absent from read_fields raises a clear ValueError.

Local: pytest pypaimon/tests/projection_predicate_index_test.py → 9 passed; flake8 --config=dev/cfg.ini clean. Surrounding test suites (predicates_test, reader_primary_key_test, reader_append_only_test, data_evolution_test) show no new failures vs origin/master; the only failures in those files are the pre-existing lance / vortex environment issues.

API and format

No public API change. No file format change. The behavioural change is restricted to scans that combine with_filter() and with_projection() and previously either crashed or returned wrong rows.

Documentation

The new helper carries a docstring explaining when and why the index remap is needed; the call site in SplitRead.__init__ keeps a long-form comment documenting the read_type vs read_fields distinction so future maintainers don't accidentally regress.

Generative AI disclosure

Drafted with assistance from an AI coding tool; root cause and fix verified by the author against the row-level FilterRecordReader path on PK tables, and exercised by the regression tests above.

…_type

Predicate leaves are built against the original table schema via
PredicateBuilder, so each leaf's `index` field encodes a position in
that schema's column order. SplitRead.__init__ then takes that
predicate as-is and forwards it to the row-level FilterRecordReader.

When the caller has chained with_projection(...) onto the read builder,
read_type is narrower or reordered relative to the original schema. The
OffsetRow handed to FilterRecordReader uses read_type indices, so the
stale `index` either:

  * raises "Position N is out of bounds for row arity M" (when N is
    past the end of the projected row), or
  * silently selects the wrong column (when N is in range but no
    longer points at the predicate field).

Fix: add rewrite_predicate_indices in push_down_utils.py, called from
SplitRead.__init__ against `read_type` (not `self.read_fields`). The
distinction matters: MergeFileSplitRead augments `read_fields` with
_KEY_*/_SEQ/_KIND prefixes for the underlying KV scan, but
KeyValueUnwrapRecordReader returns kv.value whose arity equals
len(read_type) and whose coordinate space is read_type — that is the
space FilterRecordReader actually evaluates against.

The membership check (whether to apply the filter at all) is also
moved onto read_type for the same reason: a predicate field that is
in `read_fields` only via a _KEY_* augmentation is not visible to
FilterRecordReader.

Tests:

  * projection_predicate_index_test.py exercises both halves of the bug:
    PK + filter on non-PK + projection that keeps the filter column
    (was the IndexError); PK + projection that reorders columns
    (would silently pick the wrong column); PK + filter column outside
    the projection (filter must drop, not misapply); append-only sanity
    via the file-level pushdown route which uses field names rather
    than indices.
  * RewritePredicateIndicesUnitTest exercises the helper directly:
    leaf remap, nested AND/OR remap, None passthrough, and the
    "field not in read_fields" error path.
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@JingsongLi JingsongLi merged commit 12c32b1 into apache:master Apr 29, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants