[python] Add statistics infrastructure and ANALYZE commit path #7916

Closed
JunRuiLee wants to merge 3 commits into apache:master from JunRuiLee:pypaimon-statistics-infra

@JunRuiLee
Contributor

Purpose

Internal infrastructure for ANALYZE TABLE:

  • ColStats / Statistics / StatsFileHandler — Java-compatible stats JSON format
  • FileStoreCommit.commit_statistics() — creates ANALYZE snapshot, preserves watermark/next_row_id, inherits stats on subsequent commits
  • StatisticsCollector — reads merged data via read path, computes distinctCount/nullCount/min/max/avgLen/maxLen with correct type gating and Java-compatible serialization (DATE→epoch days, TIME→millis, TIMESTAMP→formatted string,
    string/binary→no min/max)
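
The value serialization rules listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `serialize_stat_value` is a hypothetical helper, and the timestamp string layout is an assumption (an ISO-like format stands in for whatever the Java serializer actually emits).

```python
from datetime import date, datetime, time

EPOCH = date(1970, 1, 1)


def serialize_stat_value(value):
    """Convert a Python min/max value into a JSON-friendly form (hypothetical helper)."""
    if value is None:
        return None
    # datetime must be checked before date, because datetime subclasses date.
    if isinstance(value, datetime):
        # Assumption: this ISO-like layout is a placeholder for the real format.
        return value.isoformat(sep=" ", timespec="milliseconds")
    if isinstance(value, date):
        # DATE -> days since 1970-01-01
        return (value - EPOCH).days
    if isinstance(value, time):
        # TIME -> milliseconds since midnight
        seconds = (value.hour * 60 + value.minute) * 60 + value.second
        return seconds * 1000 + value.microsecond // 1000
    if isinstance(value, (str, bytes, bytearray)):
        # string/binary -> no min/max is recorded
        return None
    return value
```

In this sketch a string or binary input simply yields `None`, so callers would drop the min/max keys for those columns.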

Tests

  • ColStats/Statistics serialization roundtrip + Java key format
  • Empty colStats always emitted (avoids Java NPE)
  • StatsFileHandler write/read via snapshot reference
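
A roundtrip test of the kind listed above might look like the following sketch. The class shapes, snake_case attribute names, and the choice to omit unset fields are assumptions made for illustration; only the camelCase key names and the always-present `colStats` key come from the PR description.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Python attribute name -> camelCase JSON key (key names taken from the PR text).
_KEYS = {
    "distinct_count": "distinctCount",
    "null_count": "nullCount",
    "min": "min",
    "max": "max",
    "avg_len": "avgLen",
    "max_len": "maxLen",
}


@dataclass
class ColStats:
    distinct_count: Optional[int] = None
    null_count: Optional[int] = None
    min: Any = None
    max: Any = None
    avg_len: Optional[int] = None
    max_len: Optional[int] = None

    def to_dict(self) -> Dict[str, Any]:
        # Assumption: unset fields are left out of the JSON object.
        pairs = ((key, getattr(self, attr)) for attr, key in _KEYS.items())
        return {key: value for key, value in pairs if value is not None}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ColStats":
        return cls(**{attr: data.get(key) for attr, key in _KEYS.items()})


@dataclass
class Statistics:
    col_stats: Dict[str, ColStats] = field(default_factory=dict)

    def to_json(self) -> str:
        # "colStats" is written even when empty, so a reader never sees it missing.
        body = {name: cs.to_dict() for name, cs in self.col_stats.items()}
        return json.dumps({"colStats": body})

    @classmethod
    def from_json(cls, text: str) -> "Statistics":
        raw = json.loads(text)
        return cls({n: ColStats.from_dict(d) for n, d in raw.get("colStats", {}).items()})
```

A test can then assert that `Statistics.from_json(s.to_json()) == s` and that an empty `Statistics()` still serializes a `colStats` object.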

JunRuiLee added 3 commits May 20, 2026 17:00
Add ColStats (per-column stats), Statistics (table-level stats), and
StatsFileHandler for reading/writing JSON files under <table>/statistics/.
Format matches Java org.apache.paimon.stats for cross-engine compatibility.
colStats is always emitted in JSON (even empty) to avoid Java NPE.

Add FileStoreCommit.commit_statistics() that creates a new snapshot with
commit_kind='ANALYZE' and the statistics file name. Preserves watermark
and next_row_id from previous snapshot. Normal APPEND/OVERWRITE commits
inherit the statistics field when schema is unchanged.
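
The snapshot rules in this commit message could be modeled as below. This is a simplified sketch under stated assumptions: the `Snapshot` dataclass and both helper functions are hypothetical, not PyPaimon's actual classes, and the data-commit helper carries watermark and next_row_id over unchanged only to keep the example short.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical, trimmed-down snapshot record."""
    id: int
    schema_id: int
    commit_kind: str
    watermark: Optional[int] = None
    next_row_id: Optional[int] = None
    statistics: Optional[str] = None  # name of the statistics file, if any


def analyze_snapshot(prev: Snapshot, stats_file: str) -> Snapshot:
    # ANALYZE only attaches a statistics file; watermark and next_row_id
    # are copied from the previous snapshot by replace().
    return replace(prev, id=prev.id + 1, commit_kind="ANALYZE", statistics=stats_file)


def data_snapshot(prev: Snapshot, commit_kind: str, schema_id: int) -> Snapshot:
    # APPEND/OVERWRITE keeps the statistics reference only if the schema is unchanged.
    inherited = prev.statistics if schema_id == prev.schema_id else None
    return replace(
        prev,
        id=prev.id + 1,
        schema_id=schema_id,
        commit_kind=commit_kind,
        statistics=inherited,
    )
```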

Add BatchTableCommit.update_statistics() as the public API entry point.
Reads fully merged data via the Python read path (including PK
deduplication) and computes real column statistics:
- distinctCount, nullCount for all analyzable types
- min/max for numeric/date/time only (string/binary/fixed_size_binary
  skipped, matching Java/Spark hasMinMax semantics)
- avgLen/maxLen for string/binary/fixed_size_binary types
- Serialization: DATE as epoch days, TIME as millis-of-day,
  TIMESTAMP as formatted string (Java TimestampSerializer compat)
- Struct/list/map/nested types are skipped (matches Spark type gate)

Empty tables return zero-count statistics without raising.
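
The type gating described in this commit message can be illustrated with a small pure-Python sketch. The type labels, the function name, the integer rounding of avgLen, and the use of `len()` for length are all assumptions for illustration; the real collector works on the merged read path rather than on Python lists.

```python
from typing import Any, Dict, List, Optional

# Hypothetical, partial type labels; the real type system is richer.
MIN_MAX_TYPES = {"int", "double", "date", "time"}
LENGTH_TYPES = {"string", "binary", "fixed_size_binary"}


def collect_column_stats(values: List[Any], type_name: str) -> Optional[Dict[str, Any]]:
    """Compute per-column statistics, or return None for skipped (nested) types."""
    if type_name not in MIN_MAX_TYPES | LENGTH_TYPES:
        return None  # struct/list/map and other nested types are skipped

    non_null = [v for v in values if v is not None]
    stats: Dict[str, Any] = {
        "distinctCount": len(set(non_null)),
        "nullCount": len(values) - len(non_null),
    }
    if type_name in MIN_MAX_TYPES and non_null:
        stats["min"] = min(non_null)
        stats["max"] = max(non_null)
    if type_name in LENGTH_TYPES and non_null:
        lengths = [len(v) for v in non_null]
        # Assumption: avgLen is rounded down to an integer.
        stats["avgLen"] = sum(lengths) // len(lengths)
        stats["maxLen"] = max(lengths)
    return stats
```

With an empty input list the function returns zero counts and no min/max or length keys, which mirrors the empty-table behavior described above.
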
JunRuiLee force-pushed the pypaimon-statistics-infra branch from 96eeac4 to bfa64f6 on May 20, 2026 10:34
@JingsongLi
Contributor

I want to ask if this is used in the production environment? I haven't seen it yet.

@JunRuiLee
Contributor Author

> I want to ask if this is used in the production environment? I haven't seen it yet.

Yes, there are scenarios where users access table/statistics information from Python, and that was the original motivation for this PR.

After reconsidering it, I think this requirement may be mostly covered by system tables, especially for reading existing metadata or statistics. The ANALYZE commit support may be a larger scope than necessary for PyPaimon at this stage.

If the community does not think this is useful enough or worth the maintenance cost, I am fine with closing this PR.

@JingsongLi
Contributor

@JunRuiLee Let's close this PR.

JingsongLi closed this on May 21, 2026