Spec: Add partition stats spec #7105
@@ -671,6 +671,7 @@ Table metadata consists of the following fields:
| _optional_ | _required_ | **`default-sort-order-id`** | Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. |
| | _optional_ | **`refs`** | A map of snapshot references. The map keys are the unique snapshot reference names in the table, and the map values are snapshot reference objects. There is always a `main` branch reference pointing to the `current-snapshot-id` even if the `refs` map is null. |
| _optional_ | _optional_ | **`statistics`** | A list (optional) of [table statistics](#table-statistics). |
| _optional_ | _optional_ | **`partition-statistics`** | A list (optional) of [partition statistics](#partition-statistics). |

For serialization details, see Appendix C.

@@ -702,6 +703,47 @@ Blob metadata is a struct with the following fields:
| _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |

#### Partition statistics

Partition statistics files are based on the [partition statistics file spec](#partition-statistics-file).
Partition statistics are not required for reading or planning, and readers may ignore them.
Each table snapshot may be associated with at most one partition statistics file.

A writer can optionally write the partition statistics file during each write operation, and it must be registered in the table metadata file to be considered a valid statistics file for the reader.

The `partition-statistics` field of table metadata is an optional list of structs with the following fields:

| v1 | v2 | Field name | Type | Description |
|----|----|------------|------|-------------|
| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics File](#partition-statistics-file). |
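
For illustration, a `partition-statistics` entry in table metadata could look like the following (the snapshot ID and file path are hypothetical, and the file name layout is not mandated by the spec):

```json
"partition-statistics": [
  {
    "snapshot-id": 3055729675574597004,
    "statistics-file-path": "s3://bucket/warehouse/db/table/metadata/partition-stats-3055729675574597004.parquet"
  }
]
```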

#### Partition Statistics File

Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
These rows must be sorted (in ascending order, with NULLS FIRST) by the `partition` field to optimize filtering rows while scanning.

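
The required ordering can be sketched as follows — a hypothetical Python snippet (the row layout and key names are illustrative, not part of the spec) that sorts partition tuples ascending with null values first:

```python
# Illustrative sketch (not part of the spec): sort partition-stats rows
# ascending by the partition tuple, ordering null (None) values first.
def partition_sort_key(row):
    # For each partition field value, (False, None) sorts before (True, value),
    # so nulls always come first within a field.
    return tuple((value is not None, value) for value in row["partition"])

rows = [
    {"partition": (2023, None)},
    {"partition": (None, 5)},
    {"partition": (2022, 1)},
]
rows.sort(key=partition_sort_key)
# rows are now ordered: (None, 5), (2022, 1), (2023, None)
```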
The schema of the partition statistics file is as follows:

| v1 | v2 | Field id, name | Type | Description |
|----|----|----------------|------|-------------|
| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |
| _optional_ | _optional_ | **`11 last_updated_at`** | `long` | Timestamp in milliseconds from the unix epoch when the partition was last updated |
| _optional_ | _optional_ | **`12 last_updated_snapshot_id`** | `long` | ID of snapshot that last updated this partition |

Note that the partition data tuple's schema is based on the partition spec output, using the partition field ids for the struct field ids.
The unified partition type is a struct containing all fields that have ever been part of any spec in the table.
In other words, the struct fields represent a union of all known partition fields.

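As a sketch of how a unified partition type could be derived — hypothetical Python, assuming each spec is represented as a list of `(field_id, name, type)` tuples (none of these helper names come from the spec text):

```python
# Illustrative sketch (not part of the spec): the unified partition type is
# the union of all partition fields across every spec, identified by field id.
def unified_partition_type(specs):
    # specs: iterable of partition specs, each a list of
    # (field_id, name, type) tuples, ordered from oldest to newest.
    fields = {}
    for spec in specs:
        for field_id, name, field_type in spec:
            # A field id that already appeared keeps its first definition.
            fields.setdefault(field_id, (field_id, name, field_type))
    return list(fields.values())

spec_0 = [(1000, "ts_day", "date")]
spec_1 = [(1000, "ts_day", "date"), (1001, "category", "string")]

print(unified_partition_type([spec_0, spec_1]))
# → [(1000, 'ts_day', 'date'), (1001, 'category', 'string')]
```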
#### Commit Conflict Resolution and Retry

When two commits happen at the same time and are based on the same version, only one commit will succeed. In most cases, the failed commit can be applied to the new current version of table metadata and retried. Updates verify the conditions under which they can be applied to a new version and retry if those conditions are met.