[python] Add SnapshotManager batch lookahead with get_snapshots_batch and find_next_scannable #7418

Merged
JingsongLi merged 3 commits into apache:master from tub:python-streaming-2b-snapshot-lookahead
Mar 13, 2026
@tub (Contributor) commented Mar 12, 2026

Summary

  • Adds SnapshotManager.get_snapshots_batch(snapshot_ids, max_workers=4): batch-checks existence of a list of snapshot files via file_io.exists_batch(), then fetches the existing ones in parallel using ThreadPoolExecutor, returning {id: Snapshot|None}
  • Adds SnapshotManager.find_next_scannable(start_id, should_scan, lookahead_size=10, max_workers=4): looks ahead lookahead_size snapshot IDs, fetches them in one batch, and returns (snapshot, next_id, skipped_count) for the first snapshot passing should_scan

These two methods are the performance foundation for AsyncStreamingTableScan (coming in 2c):

  • exists_batch replaces N serial file existence checks with a single round trip — important on object stores where per-call latency is significant
  • Lookahead allows the scan loop to skip non-scannable commits (e.g. COMPACT, OVERWRITE) without a separate round trip per snapshot
  • skipped_count in the return value enables the prefetch path in the streaming scan to submit the next lookahead fetch to a background thread while the consumer processes the current plan
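To make the lookahead behaviour concrete, here is a minimal sketch of how a scan loop could skip non-scannable commits with one batch fetch. It is written under assumptions rather than taken from the PR: `fetch_batch` is a hypothetical stand-in for `get_snapshots_batch`, snapshots are plain dicts, and the exact meaning of `next_id` (here: the ID after the match, or the first missing ID) is a guess at the semantics.

```python
from typing import Callable, Dict, List, Optional, Tuple

Snapshot = dict  # stand-in type for this sketch


def find_next_scannable(
    start_id: int,
    should_scan: Callable[[Snapshot], bool],
    fetch_batch: Callable[[List[int]], Dict[int, Optional[Snapshot]]],  # hypothetical
    lookahead_size: int = 10,
) -> Tuple[Optional[Snapshot], int, int]:
    """Return (snapshot, next_id, skipped_count) for the first scannable snapshot."""
    ids = list(range(start_id, start_id + lookahead_size))
    snapshots = fetch_batch(ids)  # one batch fetch for the whole window
    skipped = 0
    for sid in ids:
        snapshot = snapshots.get(sid)
        if snapshot is None:
            # Assumed behaviour: stop at the first missing ID and retry from it later.
            return None, sid, skipped
        if should_scan(snapshot):
            return snapshot, sid + 1, skipped
        skipped += 1  # e.g. a COMPACT or OVERWRITE commit
    # The whole window existed but nothing was scannable.
    return None, start_id + lookahead_size, skipped


# Hypothetical commit history used only for this demo.
history = {
    1: {"id": 1, "kind": "APPEND"},
    2: {"id": 2, "kind": "COMPACT"},
    3: {"id": 3, "kind": "OVERWRITE"},
    4: {"id": 4, "kind": "APPEND"},
}


def fetch(ids: List[int]) -> Dict[int, Optional[Snapshot]]:
    return {sid: history.get(sid) for sid in ids}


def is_append(snapshot: Snapshot) -> bool:
    return snapshot["kind"] == "APPEND"


found, next_id, skipped_count = find_next_scannable(2, is_append, fetch)
caught_up = find_next_scannable(next_id, is_append, fetch)
```

Starting from ID 2, the sketch skips the COMPACT and OVERWRITE commits and lands on snapshot 4 after a single batch fetch; the follow-up call from `next_id` finds nothing because snapshot 5 does not exist yet.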

Stack context

This is part of a stack of PRs adding streaming read support to paimon-python. Each PR is independently reviewable with a narrow scope:

| PR | Branch | Content |
| --- | --- | --- |
| #7417 | python-streaming-2a-changelog-producer | ChangelogProducer enum + config option |
| This PR | python-streaming-2b-snapshot-lookahead | SnapshotManager.get_snapshots_batch() + find_next_scannable() |
| Next | python-streaming-2c-scan-and-builder | AsyncStreamingTableScan, StreamReadBuilder, Table.new_stream_read_builder() |
| Next | python-streaming-2d-consumer | Consumer ID integration into scan/builder |
| Next | python-streaming-2e-acceptance-docs | IncrementalDiffScanner acceptance tests + streaming docs |

Tracking issue: #7152

Test plan

  • cd paimon-python && python -m pytest pypaimon/tests/snapshot_manager_test.py -v
  • cd paimon-python && flake8 pypaimon/snapshot/snapshot_manager.py pypaimon/tests/snapshot_manager_test.py

🤖 Generated with Claude Code

tub and others added 2 commits March 12, 2026 15:32
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… and find_next_scannable

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clarifying docstring

@JingsongLi (Contributor) left a comment

+1

@JingsongLi JingsongLi merged commit d91ff26 into apache:master Mar 13, 2026
5 checks passed