[python] Add statistics infrastructure and ANALYZE commit path by JunRuiLee · Pull Request #7916 · apache/paimon

JunRuiLee · 2026-05-20T09:37:36Z

Purpose

Internal infrastructure for ANALYZE TABLE:

- ColStats / Statistics / StatsFileHandler: Java-compatible stats JSON format
- FileStoreCommit.commit_statistics(): creates an ANALYZE snapshot, preserves watermark/next_row_id, and lets subsequent commits inherit the stats
- StatisticsCollector: reads merged data via the read path and computes distinctCount/nullCount/min/max/avgLen/maxLen with correct type gating and Java-compatible serialization (DATE→epoch days, TIME→millis-of-day, TIMESTAMP→formatted string, string/binary→no min/max)

Tests

ColStats/Statistics serialization roundtrip + Java key format
Empty colStats always emitted (avoids Java NPE)
StatsFileHandler write/read via snapshot reference

Add ColStats (per-column stats), Statistics (table-level stats), and StatsFileHandler for reading/writing JSON files under <table>/statistics/. Format matches Java org.apache.paimon.stats for cross-engine compatibility. colStats is always emitted in JSON (even empty) to avoid Java NPE.
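A minimal sketch of the serialization rule described above, not the PR's actual code. The class and method shapes are hypothetical, and the camelCase JSON keys (`snapshotId`, `colStats`, `colId`, and so on) are an assumption modeled on the field names mentioned in this description. The point it illustrates is that `colStats` is written even when it is empty.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumed mapping from Python attribute names to JSON keys (hypothetical).
_COL_KEYS = {
    "col_id": "colId",
    "distinct_count": "distinctCount",
    "min": "min",
    "max": "max",
    "null_count": "nullCount",
    "avg_len": "avgLen",
    "max_len": "maxLen",
}


@dataclass
class ColStats:
    col_id: int
    distinct_count: Optional[int] = None
    min: Optional[str] = None
    max: Optional[str] = None
    null_count: Optional[int] = None
    avg_len: Optional[int] = None
    max_len: Optional[int] = None

    def to_dict(self) -> dict:
        # Drop unset fields so absent stats stay absent in the JSON.
        pairs = ((key, getattr(self, attr)) for attr, key in _COL_KEYS.items())
        return {k: v for k, v in pairs if v is not None}

    @classmethod
    def from_dict(cls, data: dict) -> "ColStats":
        return cls(**{attr: data.get(key) for attr, key in _COL_KEYS.items()})


@dataclass
class Statistics:
    snapshot_id: int
    schema_id: int
    merged_record_count: Optional[int] = None
    merged_record_size: Optional[int] = None
    col_stats: Dict[str, ColStats] = field(default_factory=dict)

    def to_json(self) -> str:
        payload = {"snapshotId": self.snapshot_id, "schemaId": self.schema_id}
        if self.merged_record_count is not None:
            payload["mergedRecordCount"] = self.merged_record_count
        if self.merged_record_size is not None:
            payload["mergedRecordSize"] = self.merged_record_size
        # Always emit colStats, even when empty, so a reader never finds it missing.
        payload["colStats"] = {n: c.to_dict() for n, c in self.col_stats.items()}
        return json.dumps(payload)

    @classmethod
    def from_json(cls, text: str) -> "Statistics":
        data = json.loads(text)
        cols = {n: ColStats.from_dict(c) for n, c in data.get("colStats", {}).items()}
        return cls(data["snapshotId"], data["schemaId"],
                   data.get("mergedRecordCount"), data.get("mergedRecordSize"), cols)
```

Unset per-column fields are omitted, while the table-level `colStats` object is unconditional; that asymmetry is the design choice the description calls out.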

Add FileStoreCommit.commit_statistics() that creates a new snapshot with commit_kind='ANALYZE' and the statistics file name. Preserves watermark and next_row_id from previous snapshot. Normal APPEND/OVERWRITE commits inherit the statistics field when schema is unchanged. Add BatchTableCommit.update_statistics() as the public API entry point.
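The snapshot rules in this paragraph can be sketched roughly as follows. The `Snapshot` dataclass and both helper functions are hypothetical stand-ins, not PyPaimon's real classes; only the behavior (ANALYZE keeps watermark and next_row_id, later commits inherit the statistics file name while the schema is unchanged) comes from the description.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical, trimmed-down snapshot record for illustration only.
    id: int
    schema_id: int
    commit_kind: str
    watermark: Optional[int] = None
    next_row_id: Optional[int] = None
    statistics: Optional[str] = None  # statistics file name, if any


def analyze_snapshot(prev: Snapshot, stats_file_name: str) -> Snapshot:
    # ANALYZE changes no data: copy everything (including watermark and
    # next_row_id) and only bump the id, the kind, and the stats reference.
    return replace(prev, id=prev.id + 1, commit_kind="ANALYZE",
                   statistics=stats_file_name)


def data_snapshot(prev: Snapshot, commit_kind: str, schema_id: int) -> Snapshot:
    # A normal commit keeps the statistics reference only if the schema
    # is unchanged; a schema change invalidates the old column stats.
    inherited = prev.statistics if schema_id == prev.schema_id else None
    return Snapshot(id=prev.id + 1, schema_id=schema_id, commit_kind=commit_kind,
                    watermark=prev.watermark, next_row_id=prev.next_row_id,
                    statistics=inherited)
```

For example, an APPEND after an ANALYZE would still point at the same statistics file, whereas an APPEND that arrives with a new schema id would carry no statistics reference.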

Add StatisticsCollector, which reads fully merged data via the Python read path (including PK deduplication) and computes real column statistics:
- distinctCount and nullCount for all analyzable types
- min/max for numeric/date/time only (string/binary/fixed_size_binary skipped, matching Java/Spark hasMinMax semantics)
- avgLen/maxLen for string/binary/fixed_size_binary types
- Serialization: DATE as epoch days, TIME as millis-of-day, TIMESTAMP as formatted string (Java TimestampSerializer compat)
- Struct/list/map/nested types are skipped (matches Spark type gate)

Empty tables return zero-count statistics without raising.
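As a rough illustration of the type gating and serialization described here, the sketch below computes stats for one column from a plain Python list. It is not the PR's collector: the function names and the `kind` labels are hypothetical, it works on in-memory values rather than the Paimon read path, and TIMESTAMP is left out because the exact formatted-string layout is not stated in this description.

```python
import datetime as dt
from typing import Any, Dict, List

EPOCH = dt.date(1970, 1, 1)
MIN_MAX_KINDS = {"numeric", "date", "time"}  # kinds that get min/max
LENGTH_KINDS = {"string", "binary"}          # kinds that get avgLen/maxLen


def serialize(value: Any, kind: str) -> str:
    """Render a min/max value as a string (hypothetical helper)."""
    if kind == "date":
        # DATE is stored as days since 1970-01-01.
        return str((value - EPOCH).days)
    if kind == "time":
        # TIME is stored as milliseconds since midnight.
        seconds = (value.hour * 60 + value.minute) * 60 + value.second
        return str(seconds * 1000 + value.microsecond // 1000)
    return str(value)


def collect_col_stats(values: List[Any], kind: str) -> Dict[str, Any]:
    non_null = [v for v in values if v is not None]
    stats: Dict[str, Any] = {
        "distinctCount": len(set(non_null)),
        "nullCount": len(values) - len(non_null),
    }
    if non_null and kind in MIN_MAX_KINDS:
        stats["min"] = serialize(min(non_null), kind)
        stats["max"] = serialize(max(non_null), kind)
    if non_null and kind in LENGTH_KINDS:
        # Measure strings by their UTF-8 byte length, bytes by len().
        lengths = [len(v.encode("utf-8")) if isinstance(v, str) else len(v)
                   for v in non_null]
        stats["avgLen"] = sum(lengths) // len(lengths)
        stats["maxLen"] = max(lengths)
    return stats
```

An empty input yields zero counts and no min/max or length fields, mirroring the "empty tables return zero-count statistics without raising" behavior. Measuring string length in UTF-8 bytes and using integer division for the average are assumptions of this sketch.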

JingsongLi · 2026-05-20T11:24:39Z

I want to ask if this is used in the production environment? I haven't seen it yet.

JunRuiLee · 2026-05-20T11:54:49Z

> I want to ask if this is used in the production environment? I haven't seen it yet.

Yes, there are scenarios where users access table/statistics information from Python, and that was the original motivation for this PR.

After reconsidering it, I think this requirement may be mostly covered by system tables, especially for reading existing metadata or statistics. The ANALYZE commit support may be a larger scope than necessary for PyPaimon at this stage.

If the community does not think this is useful enough or worth the maintenance cost, I am fine with closing this PR.

JingsongLi · 2026-05-20T13:35:22Z

@JunRuiLee Let's close this PR.

JunRuiLee added 3 commits May 20, 2026 17:00

JunRuiLee force-pushed the pypaimon-statistics-infra branch from 96eeac4 to bfa64f6 Compare May 20, 2026 10:34

JingsongLi closed this May 21, 2026

JunRuiLee wants to merge 3 commits into apache:master from JunRuiLee:pypaimon-statistics-infra
