Skip to content

[python] Add StreamReadBuilder and AsyncStreamingTableScan#7343

Open
tub wants to merge 2 commits intoapache:masterfrom
tub:python-streaming-2-core
Open

[python] Add StreamReadBuilder and AsyncStreamingTableScan#7343
tub wants to merge 2 commits intoapache:masterfrom
tub:python-streaming-2-core

Conversation

@tub
Copy link
Contributor

@tub tub commented Mar 4, 2026

Summary

PR 2 of 3 for pure-Python streaming reads. This PR adds the core streaming API:

  • StreamReadBuilder for configuring streaming reads with predicates, projections, consumer IDs, sharding, and poll intervals
  • AsyncStreamingTableScan with async generator and sync wrapper for continuous snapshot polling
  • Table interface additions: new_stream_read_builder() on Table, FileStoreTable, FormatTable (stub), IcebergTable (stub)
  • setup.py: add oss/lance extras
  • Documentation: streaming reads section and supported features list

11 files changed, +2287 / -2 lines (incremental)

PR Stack

  1. Streaming infrastructure (scanners, consumers, caching, sharding)
  2. 👉 this PR — Core streaming (StreamReadBuilder, AsyncStreamingTableScan, table integration)
  3. Optional - CLI (paimon tail command)

Incremental diff (just this PR's changes): python-streaming-1-infra...tub:paimon:python-streaming-2-core

Merge workflow: Merge PR 1, rebase PR 2 onto updated master (PR 1 commits drop out), merge PR 2, repeat for PR 3.

Test plan

  • python -m pytest pypaimon/tests — 594 passed (9 pre-existing lance failures)
  • python -c "from pypaimon import CatalogFactory" — no import errors
  • Unit tests for StreamReadBuilder validation, sharding, and AsyncStreamingTableScan filtering
  • Integration tests for stream_read_builder shard passthrough and FileScanner filtering

🤖 Generated with Claude Code

tub added 2 commits March 4, 2026 15:09
…sharding

Add foundational infrastructure for pure-Python streaming reads:

- Follow-up scanners (delta, changelog, incremental diff) for
  continuous snapshot polling
- Consumer manager for persisting read progress
- LRU caching for snapshots, manifests, and manifest lists
- Batch existence checks for efficient file IO
- Bucket-based sharding for parallel consumption
- Row kind support in table reads
- Streaming-related core options
- Backtick support for identifier parsing

Includes unit tests for all new components.
…ation

Add core streaming read API:

- StreamReadBuilder for configuring streaming reads with predicates,
  projections, consumer IDs, sharding, and poll intervals
- AsyncStreamingTableScan with async generator and sync wrapper
  for continuous snapshot polling
- Table interface additions: new_stream_read_builder() on Table,
  FileStoreTable, FormatTable (stub), IcebergTable (stub)
- Minor fix in split_read.py
- Add oss/lance extras to setup.py
- Streaming reads documentation and supported features list

Includes streaming table scan and sharding unit tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant