[python] Add streaming infrastructure: scanners, consumers, caching #7342

Open
tub wants to merge 1 commit into apache:master from tub:python-streaming-1-infra
tub (Contributor) commented Mar 4, 2026

Summary

PR 1 of 3 for pure-Python streaming reads. This PR adds foundational infrastructure:

  • Follow-up scanners (delta, changelog, incremental diff) for continuous snapshot polling
  • Consumer manager for persisting read progress to the table path
  • LRU caching for snapshots, manifests, and manifest lists
  • Batch existence checks for efficient file IO
  • Bucket-based sharding params in FileScanner for parallel consumption
  • Row kind support in table reads
  • Streaming-related core options
  • Backtick support for identifier parsing
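For readers unfamiliar with the caching item above, here is a minimal sketch of a bounded LRU cache of the kind that could sit in front of snapshot or manifest reads. The class name `LRUCache`, the `get_or_load` method, and the loader callback are hypothetical and are not taken from this PR; the actual implementation may differ.

```python
from collections import OrderedDict
from typing import Callable, Generic, Hashable, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class LRUCache(Generic[K, V]):
    """Hypothetical bounded cache; not the class added by this PR."""

    def __init__(self, max_size: int) -> None:
        if max_size <= 0:
            raise ValueError("max_size must be positive")
        self._max_size = max_size
        # OrderedDict keeps insertion order; the oldest entry sits at the front.
        self._entries: "OrderedDict[K, V]" = OrderedDict()

    def get_or_load(self, key: K, loader: Callable[[K], V]) -> V:
        """Return the cached value, or call loader(key) and cache the result."""
        if key in self._entries:
            # Mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        value = loader(key)
        self._entries[key] = value
        if len(self._entries) > self._max_size:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return value

    def __len__(self) -> int:
        return len(self._entries)
```

Snapshots and manifests are immutable once written, which is why a cache keyed by snapshot id or file path needs no invalidation beyond size-based eviction in this sketch.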

25 files changed, +2701 / -31 lines

PR Stack

  1. 👉 this PR — Streaming infrastructure (scanners, consumers, caching, sharding)
  2. Core streaming (StreamReadBuilder, AsyncStreamingTableScan, table integration)
  3. Optional - CLI (paimon tail command)

Merge workflow: Merge PR 1, rebase PR 2 onto updated master (PR 1 commits drop out), merge PR 2, repeat for PR 3.

Test plan

  • python -m pytest pypaimon/tests
  • python -c "from pypaimon import CatalogFactory" — no import errors
  • Unit tests for all new scanners, consumer manager, manifest caching, identifier parsing
  • Integration tests for FileScanner shard filtering
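As an illustration of the property a shard-filtering test could check, the sketch below assigns buckets to shards by modulo and verifies that every file is read by exactly one shard. The PR text does not state how buckets map to shards, so the modulo rule is an assumption, and `FileEntry` and `filter_by_shard` are hypothetical names rather than pypaimon APIs.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for a manifest entry; only the bucket matters here."""
    file_name: str
    bucket: int


def filter_by_shard(
    entries: Iterable[FileEntry], shard_index: int, shard_count: int
) -> List[FileEntry]:
    """Keep entries whose bucket belongs to this shard (assumed modulo rule)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index must be in [0, shard_count)")
    return [e for e in entries if e.bucket % shard_count == shard_index]


if __name__ == "__main__":
    files = [FileEntry("data-%d.parquet" % b, b) for b in range(6)]
    shards = [filter_by_shard(files, i, 3) for i in range(3)]
    # Every file should be assigned to exactly one of the three shards.
    assigned = sorted(e.bucket for shard in shards for e in shard)
    assert assigned == [0, 1, 2, 3, 4, 5]
```

Because all files of one bucket go to the same shard under this rule, per-bucket ordering is preserved while consumers run in parallel.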

🤖 Generated with Claude Code

…sharding

Add foundational infrastructure for pure-Python streaming reads:

- Follow-up scanners (delta, changelog, incremental diff) for
  continuous snapshot polling
- Consumer manager for persisting read progress
- LRU caching for snapshots, manifests, and manifest lists
- Batch existence checks for efficient file IO
- Bucket-based sharding for parallel consumption
- Row kind support in table reads
- Streaming-related core options
- Backtick support for identifier parsing

Includes unit tests for all new components.
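To make the backtick item concrete, here is a minimal sketch of splitting a dotted identifier while treating backtick-quoted parts as literal, so that a database or table name may itself contain a dot. The function name `split_identifier` is hypothetical, and the exact quoting rules in the PR (for example, escaped backticks) are not described in this conversation, so this sketch does not attempt them.

```python
from typing import List


def split_identifier(identifier: str) -> List[str]:
    """Split 'db.table' on dots, ignoring dots inside backtick-quoted parts.

    Hypothetical helper, not the parser from this PR. Escaped backticks
    inside a quoted part are not supported.
    """
    parts: List[str] = []
    current: List[str] = []
    in_backticks = False
    for ch in identifier:
        if ch == "`":
            # Toggle quoting; the backtick itself is not part of the name.
            in_backticks = not in_backticks
        elif ch == "." and not in_backticks:
            # An unquoted dot ends the current component.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if in_backticks:
        raise ValueError("unterminated backtick in identifier: %r" % identifier)
    parts.append("".join(current))
    if any(not part for part in parts):
        raise ValueError("empty component in identifier: %r" % identifier)
    return parts
```

For example, `split_identifier("`my.db`.`my.table`")` yields two components, `my.db` and `my.table`, where a plain split on `.` would yield four.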
tub force-pushed the python-streaming-1-infra branch from 314f2fe to a9c6bc7 on March 4, 2026 15:10
tub marked this pull request as ready for review on March 4, 2026 15:52
JingsongLi (Contributor) commented

Thanks @tub for the contribution! This feature looks amazing!

Can you split this PR too? (I know the feature has already been split into 3 PRs, but this PR is still too large...)

For example, consumer and consumer_manager could be a separate PR.

tub (Contributor, Author) commented Mar 5, 2026 via email
