Skip to content

[python] Push limit down to the reader layer for append table#8102

Merged
JingsongLi merged 6 commits into
apache:masterfrom
discivigour:p/limitPushRead
Jun 3, 2026
Merged

[python] Push limit down to the reader layer for append table#8102
JingsongLi merged 6 commits into
apache:masterfrom
discivigour:p/limitPushRead

Conversation

@discivigour
Copy link
Copy Markdown
Contributor

Purpose

Push limit down to the reader layer for append table.

Tests

  • LimitPushdownTest.test_append_only_split_read_limit_truncates_within_split()

@JingsongLi
Copy link
Copy Markdown
Contributor

I found a regression in the new LimitedRecordBatchReader wrapper.

RecordBatchReader.read_batch() relies on the reader's file_io, blob_field_indices, and vector_field_indices attributes when it creates OffsetRows. Other batch-reader wrappers propagate those attributes from the inner reader, but LimitedRecordBatchReader currently doesn't. Since RawFileSplitRead now wraps append-only readers with this class whenever with_limit() is set, iterator reads lose the BLOB/VECTOR metadata.

A minimal repro is an append-only table with a BLOB column:

rb = table.new_read_builder().with_limit(1)
splits = rb.new_scan().plan().splits()
row = next(rb.new_read().to_iterator(splits))
row.get_blob(blob_pos)

On this PR that raises TypeError: Field at position ... is not a BLOB field, while the same table works without with_limit(). Please propagate the inner reader metadata in LimitedRecordBatchReader (and add a regression test for with_limit() + to_iterator() + get_blob(); get_vector() is affected by the same missing metadata).

@discivigour
Copy link
Copy Markdown
Contributor Author

I found a regression in the new LimitedRecordBatchReader wrapper.

RecordBatchReader.read_batch() relies on the reader's file_io, blob_field_indices, and vector_field_indices attributes when it creates OffsetRows. Other batch-reader wrappers propagate those attributes from the inner reader, but LimitedRecordBatchReader currently doesn't. Since RawFileSplitRead now wraps append-only readers with this class whenever with_limit() is set, iterator reads lose the BLOB/VECTOR metadata.

A minimal repro is an append-only table with a BLOB column:

rb = table.new_read_builder().with_limit(1)
splits = rb.new_scan().plan().splits()
row = next(rb.new_read().to_iterator(splits))
row.get_blob(blob_pos)

On this PR that raises TypeError: Field at position ... is not a BLOB field, while the same table works without with_limit(). Please propagate the inner reader metadata in LimitedRecordBatchReader (and add a regression test for with_limit() + to_iterator() + get_blob(); get_vector() is affected by the same missing metadata).

Good point. Fixed.

@JingsongLi
Copy link
Copy Markdown
Contributor

Rechecked the latest commit. The metadata propagation in LimitedRecordBatchReader addresses my previous concern: with_limit() + to_iterator() + row.get_blob() now works in my local repro, and the focused limit/blob tests pass locally.

Thanks for the quick fix.

@JingsongLi
Copy link
Copy Markdown
Contributor

+1

@JingsongLi JingsongLi merged commit c864884 into apache:master Jun 3, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants