[python] Fix predicate index stale after with_projection narrows read_type#7741
Merged
JingsongLi merged 1 commit intoApr 29, 2026
Conversation
…_type
Predicate leaves are built against the original table schema via
PredicateBuilder, so each leaf's `index` field encodes a position in
that schema's column order. SplitRead.__init__ then takes that
predicate as-is and forwards it to the row-level FilterRecordReader.
When the caller has chained with_projection(...) onto the read builder,
read_type is narrower or reordered relative to the original schema. The
OffsetRow handed to FilterRecordReader uses read_type indices, so the
stale `index` either:
* raises "Position N is out of bounds for row arity M" (when N is
past the end of the projected row), or
* silently selects the wrong column (when N is in range but no
longer points at the predicate field).
Fix: add rewrite_predicate_indices in push_down_utils.py, called from
SplitRead.__init__ against `read_type` (not `self.read_fields`). The
distinction matters: MergeFileSplitRead augments `read_fields` with
_KEY_*/_SEQ/_KIND prefixes for the underlying KV scan, but
KeyValueUnwrapRecordReader returns kv.value whose arity equals
len(read_type) and whose coordinate space is read_type — that is the
space FilterRecordReader actually evaluates against.
The membership check (whether to apply the filter at all) is also
moved onto read_type for the same reason: a predicate field that is
in `read_fields` only via a _KEY_* augmentation is not visible to
FilterRecordReader.
Tests:
* projection_predicate_index_test.py exercises both halves of the bug:
PK + filter on non-PK + projection that keeps the filter column
(was the IndexError); PK + projection that reorders columns
(would silently pick the wrong column); PK + filter column outside
the projection (filter must drop, not misapply); append-only sanity
via the file-level pushdown route which uses field names rather
than indices.
* RewritePredicateIndicesUnitTest exercises the helper directly:
leaf remap, nested AND/OR remap, None passthrough, and the
"field not in read_fields" error path.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
Predicate leaves are built against the original table schema via
PredicateBuilder, so each leaf'sindexfield encodes a position in that schema's column order.SplitRead.__init__then takes the predicate as-is and forwards it to the row-levelFilterRecordReader.When the caller chains
.with_projection(...)onto the read builder,read_typebecomes narrower or reordered relative to the original schema. TheOffsetRowhanded toFilterRecordReaderusesread_typeindices, so the staleindexcarried on the predicate either"Position N is out of bounds for row arity M"whenNis past the end of the projected row, orNis still in range but no longer points at the predicate field.This shows up most easily on PK tables (the row-level filter path) when a non-PK column is filtered while the projection narrows the read.
Fix
Add
rewrite_predicate_indicesinpush_down_utils.pyand call it fromSplitRead.__init__, againstread_typerather thanself.read_fields. The distinction matters:MergeFileSplitReadaugmentsread_fieldswith_KEY_*/_SEQ/_KINDprefixes for the underlying KV scan.KeyValueUnwrapRecordReaderreturnskv.valuewhose arity equalslen(read_type)and whose coordinate space isread_type.FilterRecordReaderevaluates against the unwrapped value, soread_typeis the correct coordinate space for bothMergeFileSplitReadandRawFileSplitRead. The membership check (whether to apply the filter at all) also moves ontoread_typefor the same reason — a predicate field that is inread_fieldsonly via a_KEY_*augmentation is not visible toFilterRecordReader.Helper contract: every predicate leaf field must be present in
read_fields, otherwise the helper raises with a clear message; the caller inSplitReadalready filters out the off-projection case before invoking the rewrite.Linked issue
N/A — surfaced when reading a PK table with
with_projection(['id', 'value']).with_filter(equal('value', ...)). The pre-fix path raisedIndexErrorfromOffsetRow.get_field.Tests
New
pypaimon/tests/projection_predicate_index_test.py:End-to-end (kernel
ReadBuilder, no extra integrations):test_pk_filter_on_non_pk_with_projection_keeping_filter_column— was theIndexErrortrigger.test_pk_filter_on_non_pk_with_projection_reordering_columns— projection reorders columns; remap must follow.test_pk_filter_no_projection_still_works— sanity: no-projection path matches pre-fix output.test_pk_filter_column_outside_projection_drops_filter— when the filter column is not projected the filter cannot apply at row level; we drop it, we do not misapply with a stale index.test_append_only_filter_with_projection_unchanged— append-only file-level pushdown uses field names rather than indices; guard against any regression introduced by the fix.Direct unit tests for the helper:
test_remaps_leaf_index_to_position_in_read_fields— single leaf, narrowed/reordered projection.test_remaps_inside_and_or— nested AND of leaves and a nested OR; every leaf rebound.test_returns_none_for_none_input.test_raises_when_leaf_field_missing— leaf field absent fromread_fieldsraises a clearValueError.Local:
pytest pypaimon/tests/projection_predicate_index_test.py→ 9 passed;flake8 --config=dev/cfg.iniclean. Surrounding test suites (predicates_test,reader_primary_key_test,reader_append_only_test,data_evolution_test) show no new failures vsorigin/master; the only failures in those files are the pre-existing lance / vortex environment issues.API and format
No public API change. No file format change. The behavioural change is restricted to scans that combine
with_filter()andwith_projection()and previously either crashed or returned wrong rows.Documentation
The new helper carries a docstring explaining when and why the index remap is needed; the call site in
SplitRead.__init__keeps a long-form comment documenting theread_typevsread_fieldsdistinction so future maintainers don't accidentally regress.Generative AI disclosure
Drafted with assistance from an AI coding tool; root cause and fix verified by the author against the row-level
FilterRecordReaderpath on PK tables, and exercised by the regression tests above.