[python] Add statistics infrastructure and ANALYZE commit path #7916
Closed
JunRuiLee wants to merge 3 commits into
Add ColStats (per-column stats), Statistics (table-level stats), and StatsFileHandler for reading/writing JSON files under <table>/statistics/. Format matches Java org.apache.paimon.stats for cross-engine compatibility. colStats is always emitted in JSON (even empty) to avoid Java NPE.
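A minimal sketch of the "colStats is always emitted" rule described above. The class names `ColStatsSketch` and `StatisticsSketch` are hypothetical, and the top-level keys `snapshotId`, `schemaId`, and `mergedRecordCount` are assumptions made for illustration; only `colStats` and the per-column field names come from this PR's description.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class ColStatsSketch:
    """Hypothetical per-column stats holder (not the PR's actual class)."""
    distinctCount: Optional[int] = None
    nullCount: Optional[int] = None
    min: Optional[str] = None
    max: Optional[str] = None
    avgLen: Optional[int] = None
    maxLen: Optional[int] = None

    def to_dict(self) -> dict:
        # Omit fields that were never computed (e.g. min/max for strings).
        return {k: v for k, v in asdict(self).items() if v is not None}


@dataclass
class StatisticsSketch:
    """Hypothetical table-level stats holder; top-level keys are assumed."""
    snapshotId: int
    schemaId: int
    mergedRecordCount: Optional[int] = None
    colStats: Dict[str, ColStatsSketch] = field(default_factory=dict)

    def to_json(self) -> str:
        payload = {"snapshotId": self.snapshotId, "schemaId": self.schemaId}
        if self.mergedRecordCount is not None:
            payload["mergedRecordCount"] = self.mergedRecordCount
        # Always write colStats, even when empty, so a reader that
        # dereferences the map without a null check does not fail.
        payload["colStats"] = {n: c.to_dict() for n, c in self.colStats.items()}
        return json.dumps(payload)
```

Writing the empty map costs a few bytes per file and removes the need for a null check on the reading side.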
Add FileStoreCommit.commit_statistics() that creates a new snapshot with commit_kind='ANALYZE' and the statistics file name. Preserves watermark and next_row_id from previous snapshot. Normal APPEND/OVERWRITE commits inherit the statistics field when schema is unchanged. Add BatchTableCommit.update_statistics() as the public API entry point.
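The snapshot rules in this commit message can be sketched with plain dictionaries. This is an illustration under stated assumptions, not the PR's code: the helper names `analyze_snapshot` and `inherited_statistics` and the dictionary keys `id`, `schema_id`, and `statistics` are hypothetical; `commit_kind`, `watermark`, and `next_row_id` are the names used above.

```python
from typing import Optional


def analyze_snapshot(prev: dict, stats_file: str) -> dict:
    """Build the snapshot that follows `prev` for an ANALYZE commit."""
    return {
        "id": prev["id"] + 1,
        "schema_id": prev["schema_id"],
        "commit_kind": "ANALYZE",
        "statistics": stats_file,
        # An ANALYZE commit writes no data, so these carry over unchanged.
        "watermark": prev.get("watermark"),
        "next_row_id": prev.get("next_row_id"),
    }


def inherited_statistics(prev: dict, new_schema_id: int) -> Optional[str]:
    """Statistics file name a normal APPEND/OVERWRITE commit should keep."""
    if prev.get("schema_id") == new_schema_id:
        return prev.get("statistics")
    # Schema changed: the old column statistics may no longer apply.
    return None
```

For example, calling `inherited_statistics(snapshot, snapshot["schema_id"])` on an ANALYZE snapshot returns its statistics file name, while passing a different schema id returns `None`.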
Reads fully merged data via the Python read path (including PK deduplication) and computes real column statistics:
- distinctCount, nullCount for all analyzable types
- min/max for numeric/date/time only (string/binary/fixed_size_binary skipped, matching Java/Spark hasMinMax semantics)
- avgLen/maxLen for string/binary/fixed_size_binary types
- Serialization: DATE as epoch days, TIME as millis-of-day, TIMESTAMP as formatted string (Java TimestampSerializer compat)
- Struct/list/map/nested types are skipped (matches Spark type gate)

Empty tables return zero-count statistics without raising.
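The type gating and serialization rules above can be sketched over plain Python values. This is not the PR's collector: `column_stats` and `serialize` are hypothetical names, lengths are measured with `len()` (characters for `str`, bytes for `bytes`), and `avgLen` uses integer division, both of which are assumptions. TIMESTAMP is left out because the exact string format is not given here.

```python
from datetime import date, datetime, time

EPOCH = date(1970, 1, 1)


def serialize(value):
    """Encode a min/max value: DATE as epoch days, TIME as millis-of-day."""
    if isinstance(value, datetime):
        # datetime subclasses date, so it must be rejected first.
        raise NotImplementedError("TIMESTAMP format is not specified in this sketch")
    if isinstance(value, date):
        return (value - EPOCH).days
    if isinstance(value, time):
        seconds = (value.hour * 60 + value.minute) * 60 + value.second
        return seconds * 1000 + value.microsecond // 1000
    return value


def column_stats(values):
    """Compute stats for one column given as a list that may contain None."""
    non_null = [v for v in values if v is not None]
    stats = {
        "distinctCount": len(set(non_null)),
        "nullCount": len(values) - len(non_null),
    }
    if not non_null:
        # Empty or all-null column: counts only, no exception.
        return stats
    if isinstance(non_null[0], (str, bytes)):
        # String/binary columns get length stats and no min/max.
        lengths = [len(v) for v in non_null]
        stats["avgLen"] = sum(lengths) // len(lengths)
        stats["maxLen"] = max(lengths)
    else:
        stats["min"] = serialize(min(non_null))
        stats["max"] = serialize(max(non_null))
    return stats
```

Checking `datetime` before `date` matters because every `datetime` is also a `date`; without that order a timestamp would silently be written as epoch days.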
96eeac4 to bfa64f6
Contributor
Is this used in a production environment? I haven't seen such a use case yet.
Contributor
Author
Yes, there are scenarios where users access table/statistics information from Python, and that was the original motivation for this PR. After reconsidering it, I think this requirement may be mostly covered by system tables, especially for reading existing metadata or statistics. The ANALYZE commit support may be a larger scope than necessary for PyPaimon at this stage. If the community does not think this is useful enough or worth the maintenance cost, I am fine with closing this PR.
Contributor
@JunRuiLee Let's close this PR.
Purpose
Internal infrastructure for ANALYZE TABLE:
- `ColStats` / `Statistics` / `StatsFileHandler`: Java-compatible stats JSON format
- `FileStoreCommit.commit_statistics()`: creates ANALYZE snapshot, preserves watermark/next_row_id, inherits stats on subsequent commits
- `StatisticsCollector`: reads merged data via read path, computes distinctCount/nullCount/min/max/avgLen/maxLen with correct type gating and Java-compatible serialization (DATE→epoch days, TIME→millis, TIMESTAMP→formatted string, string/binary→no min/max)
Tests