Spec: Add partition stats spec #7105

Merged · 9 commits · Nov 1, 2023
54 changes: 53 additions & 1 deletion format/spec.md
@@ -671,10 +671,11 @@ Table metadata consists of the following fields:
| _optional_ | _required_ | **`default-sort-order-id`**| Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. |
| | _optional_ | **`refs`** | A map of snapshot references. The map keys are the unique snapshot reference names in the table, and the map values are snapshot reference objects. There is always a `main` branch reference pointing to the `current-snapshot-id` even if the `refs` map is null. |
| _optional_ | _optional_ | **`statistics`** | A list (optional) of [table statistics](#table-statistics). |
| _optional_ | _optional_ | **`partition-statistics`** | A list (optional) of [partition statistics](#partition-statistics). |

For serialization details, see Appendix C.

#### Table Statistics

Table statistics files are valid [Puffin files](../puffin-spec). Statistics are informational. A reader can choose to
ignore statistics information. Statistics support is not required to read the table correctly. A table can contain
@@ -702,6 +703,57 @@ Blob metadata is a struct with the following fields:
| _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |


#### Partition Statistics

Partition statistics files are based on the [partition statistics file spec](#partition-statistics-file).
Partition statistics are not required for reading or planning and readers may ignore them.
Each table snapshot may be associated with at most one partition statistics file.
A writer can optionally write the partition statistics file during each write operation, or it can be computed on demand.
A partition statistics file must be registered in the table metadata file to be considered a valid statistics file by readers.

The `partition-statistics` field of table metadata is an optional list of structs with the following fields:

| v1 | v2 | Field name | Type | Description |
|----|----|------------|------|-------------|
| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
| _required_ | _required_ | **`statistics-path`** | `string` | Path of the partition statistics file. See [Partition statistics file](#partition-statistics-file). |
| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
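
For illustration only (serialization details are covered in Appendix C), a table metadata file carrying a single partition statistics file might include an entry like the following; the snapshot id, path, and size are placeholder values:

```json
{
  "partition-statistics": [
    {
      "snapshot-id": 3055729675574597004,
      "statistics-path": "s3://bucket/warehouse/db/table/metadata/partition-stats-3055729675574597004.parquet",
      "file-size-in-bytes": 43
    }
  ]
}
```

Each entry ties one partition statistics file to one snapshot, matching the at-most-one-file-per-snapshot rule above.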

#### Partition Statistics File

Statistics information for each unique partition tuple is stored as a row in any of the table's data file formats (for example, Parquet or ORC).
These rows must be sorted by the `partition` field (in ascending order with NULLS FIRST) to optimize filtering rows while scanning.

The schema of the partition statistics file is as follows:

| v1 | v2 | Field id, name | Type | Description |
|----|----|----------------|------|-------------|
| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

**Contributor:** I assume a single file would cover all specs?

**Member Author:** Yes, the partition tuple is a unified type, so it is the coerced result across all specs. The spec id is just in case we want to know the latest spec that modified this partition.

Do you feel it is redundant and we can remove it?

**Member:** Personally, I think it's good this way: clearer.

**Contributor:** I think the spec ID is required to reconstruct the actual partition tuple, if needed. The main question is whether it is easier to work with a unified tuple or a spec-specific tuple. If most use cases need a spec-specific tuple and would require a projection, nothing prevents us from having a file per spec and annotating each partition stats file with a spec ID instead of persisting it for each record.

Can we think through our initial plans for writing and reading these files? It doesn't have to be very elaborate at this point.

**Member Author:** I already have a POC PR (#8488) that reads and writes these files.

I think a unified tuple will be good for updates. If we keep a spec-specific tuple, the stats of the same partition (after spec evolution) will be distributed across multiple buckets, which is hard for the reader. The existing `partitions` metadata table also uses the same unified partition tuple.

**Contributor:** I'll take a look at the PoC PR as soon as 1.4 is out (within the next few days).

**Contributor:** I took a look at #8488 and I am not sure how I feel about generating these files during commits (in fact, during each commit attempt). I'd personally start by adding API and core logic to be able to add these files on demand, and implement an action to actually produce these files (either incrementally or from scratch). Updating these stats for larger tables in each commit attempt will cause issues. In the action, we can do this more efficiently. We can also call this action immediately after writes, but at least it will not be part of the commit. It would also drastically reduce the amount of changes in #8488; we won't have to touch every operation type.

**Member Author:** We can discuss the community's interest in synchronous writes; some users might be interested.

Agreed that we should first go with an async implementation to make things easier.

**Member Author:** Also, Trino currently writes Puffin stats both synchronously and asynchronously, and Dremio is also interested in sync stats.

| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |

**Contributor:** We are a bit inconsistent throughout the code with naming `data_files_count` vs `data_file_count`. In quite a few cases, we use plural words (like the Action API and the spec for manifest lists).

Have we discussed the preferred naming to follow in the future?

**Member Author:** Yes, it applies to `data_record_count` and other fields as well. Agreed that we need to standardise this.

**Contributor:** Not a native speaker, so I searched around. It seems "file count" and "record count" are the right way to go.

"File count" is the correct phrase because it follows the standard English rules for compound nouns: when a compound noun is made of two nouns, like "file" and "count", the first noun acts as an adjective describing the second. So "file count" means the count of files, which is why "file count" is used rather than "files count".

**Member Author:** Thanks @flyrain, I found that too initially. But after digging a bit more, the internet says both are valid.

So, I decided we can go with either one; we just have to standardise it. Maybe we should check what Spark, Hive, and other products follow as a standard.

| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |

**Contributor:** Am I right that this would only be possible to compute by reading the data and applying deletes? If so, are we planning to make this optional and not populate it by default?

**Member Author:** You are right. That is why the field is optional in the schema.

The implementation will not populate this by default (it can be controlled by a property or by the way of writing; for example, an async write can compute it, but the incremental sync writes cannot).

**Contributor:** This makes sense to me.

| _optional_ | _optional_ | **`11 last_updated_at`** | `long` | Timestamp in milliseconds from the unix epoch when the partition was last updated |
| _optional_ | _optional_ | **`12 last_updated_snapshot_id`** | `long` | ID of snapshot that last updated this partition |

Note that the partition data tuple's schema is based on the partition spec output, using partition field ids for the struct field ids.
The unified partition type is a struct containing all fields that have ever been part of any spec in the table, sorted by field id in ascending order.
In other words, the struct fields represent a union of all known partition fields, sorted in ascending order by field id.

**Contributor:** Question: Is this new line intentional? It seems like it all belongs to the same paragraph.

For example:
1) spec#0 has two fields {field#1, field#2}, and the table then evolves into spec#1, which has three fields {field#1, field#2, field#3}. The unified partition type looks like `Struct<field#1, field#2, field#3>`.
2) spec#0 has two fields {field#1, field#2}, and the table then evolves into spec#1, which has just one field {field#2}. The unified partition type looks like `Struct<field#1, field#2>`.
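
To make case 2 concrete, here is a sketch (not spec text) of two rows in a partition statistics file, shown as JSON purely for readability; the real file is written in one of the table's data file formats, and all values are invented. A partition created under spec#1, which no longer includes field#1, would presumably carry a null for that field in the coerced unified tuple, and is listed first to respect the ascending NULLS FIRST ordering:

```json
[
  {
    "partition": {"field#1": null, "field#2": "2023-11-01"},
    "spec_id": 1,
    "data_record_count": 1200,
    "data_file_count": 1,
    "total_data_file_size_in_bytes": 262144
  },
  {
    "partition": {"field#1": 10, "field#2": "2023-10-01"},
    "spec_id": 0,
    "data_record_count": 5000,
    "data_file_count": 4,
    "total_data_file_size_in_bytes": 1048576
  }
]
```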

#### Commit Conflict Resolution and Retry

When two commits happen at the same time and are based on the same version, only one commit will succeed. In most cases, the failed commit can be applied to the new current version of table metadata and retried. Updates verify the conditions under which they can be applied to a new version and retry if those conditions are met.