[python] Add paimon tail CLI for streaming table reads #7344

Draft
tub wants to merge 3 commits into apache:master from tub:python-streaming-3-cli

Conversation

@tub commented Mar 4, 2026

Summary

This PR is an optional addition to the Python streaming feature; I found it useful for debugging data issues locally without having to run Flink or Spark.

This is PR 3 of 3 for pure-Python streaming reads; it adds the paimon tail CLI command:

  • paimon tail: stream data from Paimon tables, similar to kafka-console-consumer
    • Multiple output formats: jsonl, json, csv, table
    • Filtering and column projection
    • Consumer IDs for checkpointing
    • Flexible start position: earliest, latest, snapshot:ID, time:-1h
    • --to end position for bounded reads
    • --follow for continuous streaming
  • CLI utilities for time parsing and output formatting
  • console_scripts entry point in setup.py
  • Documentation: CLI section and supported features list
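The start-position grammar listed above (earliest, latest, snapshot:ID, time:-1h) can be sketched in plain Python. This is an illustrative parser only; the function name, return shape, and supported time units are assumptions, not the PR's actual implementation:

```python
import re
import time

# Hypothetical sketch of start-position parsing; NOT the PR's code.
def parse_start_position(spec: str):
    """Parse specs like 'earliest', 'latest', 'snapshot:42', 'time:-1h'."""
    if spec in ("earliest", "latest"):
        return (spec, None)
    if spec.startswith("snapshot:"):
        return ("snapshot", int(spec.split(":", 1)[1]))
    if spec.startswith("time:"):
        value = spec.split(":", 1)[1]
        # Relative offsets like '-1h'; assumed units: s/m/h/d.
        m = re.fullmatch(r"(-?\d+)([smhd])", value)
        if m:
            seconds = int(m.group(1)) * {"s": 1, "m": 60, "h": 3600, "d": 86400}[m.group(2)]
            return ("time", int((time.time() + seconds) * 1000))  # epoch millis
        return ("time", int(value))  # assume an absolute epoch-millis value
    raise ValueError(f"unrecognized start position: {spec}")
```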

7 files changed, +1396 lines (incremental)

PR Stack

  1. Streaming infrastructure (scanners, consumers, caching, sharding)
  2. Core streaming (StreamReadBuilder, AsyncStreamingTableScan, table integration)
  3. 👉 this PR — CLI (paimon tail command)

Incremental diff (just this PR's changes): python-streaming-2-core...tub:paimon:python-streaming-3-cli

Merge workflow: Merge PR 1, rebase PR 2 onto updated master (PR 1 commits drop out), merge PR 2, repeat for PR 3.

Test plan

  • python -m pytest pypaimon/tests — 630 passed (9 pre-existing lance failures)
  • python -c "from pypaimon import CatalogFactory" — no import errors
  • Unit tests for CLI utilities (time parsing, output formatting)

🤖 Generated with Claude Code

tub added 3 commits March 4, 2026 15:09
…sharding

Add foundational infrastructure for pure-Python streaming reads:

- Follow-up scanners (delta, changelog, incremental diff) for
  continuous snapshot polling
- Consumer manager for persisting read progress
- LRU caching for snapshots, manifests, and manifest lists
- Batch existence checks for efficient file IO
- Bucket-based sharding for parallel consumption
- Row kind support in table reads
- Streaming-related core options
- Backtick support for identifier parsing

Includes unit tests for all new components.
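The LRU caching the commit above describes for snapshots, manifests, and manifest lists can be illustrated with a minimal stdlib sketch; the class and method names here are assumptions, not the PR's actual code:

```python
from collections import OrderedDict

# Illustrative bounded LRU cache; pypaimon's real cache may differ.
class LruCache:
    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```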
…ation

Add core streaming read API:

- StreamReadBuilder for configuring streaming reads with predicates,
  projections, consumer IDs, sharding, and poll intervals
- AsyncStreamingTableScan with async generator and sync wrapper
  for continuous snapshot polling
- Table interface additions: new_stream_read_builder() on Table,
  FileStoreTable, FormatTable (stub), IcebergTable (stub)
- Minor fix in split_read.py
- Add oss/lance extras to setup.py
- Streaming reads documentation and supported features list

Includes streaming table scan and sharding unit tests.

Add command-line interface for streaming Paimon table data:

- paimon tail command: stream data from tables similar to
  kafka-console-consumer, with support for multiple output formats
  (jsonl, json, csv, table), filtering, column projection, consumer
  IDs, flexible start position (earliest, latest, snapshot:ID,
  time:-1h), and --to end position
- CLI utilities for time parsing and output formatting
- console_scripts entry point in setup.py
- CLI documentation and supported features list

Includes CLI utility tests.
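The jsonl and csv output modes described above can be sketched with the standard library; the helper names below are illustrative, not the PR's API:

```python
import csv
import io
import json

# Hypothetical sketches of two of the CLI's output formats.
def format_jsonl(rows):
    """One JSON object per line, as kafka-console-consumer-style output."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)

def format_csv(rows, columns):
    """CSV with a header row for the projected columns."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```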
