
Spec: Add partition stats spec #7105

Merged
merged 9 commits on Nov 1, 2023
42 changes: 42 additions & 0 deletions format/spec.md
@@ -671,6 +671,7 @@ Table metadata consists of the following fields:
| _optional_ | _required_ | **`default-sort-order-id`**| Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. |
| | _optional_ | **`refs`** | A map of snapshot references. The map keys are the unique snapshot reference names in the table, and the map values are snapshot reference objects. There is always a `main` branch reference pointing to the `current-snapshot-id` even if the `refs` map is null. |
| _optional_ | _optional_ | **`statistics`** | A list (optional) of [table statistics](#table-statistics). |
| _optional_ | _optional_ | **`partition-statistics`** | A list (optional) of [partition statistics](#partition-statistics). |

For serialization details, see Appendix C.

@@ -702,6 +703,47 @@ Blob metadata is a struct with the following fields:
| _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |


#### Partition statistics

Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file).
Contributor:
Question: Is there a particular reason to use capital letters for Partition Statistics? It seems inconsistent with other places.

Member Author:

True, changed to keep capital letters only for headers.

Partition statistics are not required for reading or planning and readers may ignore them.
Each table snapshot may be associated with at most one partition statistic file.
Contributor:

Minor: one partition statistic file -> one partition statistics file to be consistent with other places?

A writer can optionally write the partition statistics file during each write operation, and
Contributor:

Shall we add a note that it can also be computed on demand rather than in each write?

Member Author:

Added with some rewording.

it must be registered in the table metadata file to be considered as a valid statistics file for the reader.

`partition-statistics` field of table metadata is an optional list of struct with the following fields:
Contributor:

Minor: an optional list of struct -> an optional list of structs?


| v1 | v2 | Field name | Type | Description |
|----|----|------------|------|-------------|
| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
Contributor:

It looks like we call it statistics-path in table stats.
Is there a particular reason to call it statistics-file-path here?

Member Author:

Since it is a single file, I thought adding the file keyword gives more clarity. A general statistics-path may give the impression that it is a folder or multiple files.

Contributor:

I see your point but isn't it also a path to a single file in case of table stats? I would align the naming to be consistent if it indicates a path to a single file in both cases.

Member Author:

OK, updated as suggested.

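As a rough illustration, one entry in this list might be serialized in table metadata JSON along the following lines; the snapshot ID and file path are hypothetical, and the field names simply follow the table above:

```json
{
  "partition-statistics": [
    {
      "snapshot-id": 3055729675574597004,
      "statistics-file-path": "s3://bucket/warehouse/db/table/metadata/partition-stats-3055729675574597004.parquet"
    }
  ]
}
```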

#### Partition Statistics file
Contributor:

Question: Shall file be File since it is a header?

Member Author:

Yes, updated.

Also updated the header: Table statistics -> Table Statistics.


Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
Contributor:

I am not sure it is a good idea for the spec to make this assumption. I can see this being configurable even for tables that store Avro data to use Parquet or ORC for partition stats. Why can't we just default this in the implementation?

Member Author:

Russell commented that we should explicitly mention the format type.

I have removed the word "default" and reworded a bit. The implementation can decide whether to use the table's default format or the one specified in a table property.

These rows must be sorted (in ascending order with NULLS FIRST) by `partition` field to optimize filtering rows while scanning.

The schema of the partition statistics file is as follows:

| v1 | v2 | Field id, name | Type | Description |
|----|----|----------------|------|-------------|
| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
Contributor:

I assume a single file would cover all specs?

Member Author:

Yes, the partition tuple is a unified type. Hence, it is a coerced result from all the specs.
This spec id is just in case we want to know the latest spec that modified this partition.

Do you feel it is redundant and we should remove it?

Member:

Personally, I think it's good this way: clearer.

Contributor:

I think the spec ID is required to reconstruct the actual partition tuple, if needed. The main question is whether it is easier to work with a unified tuple or a spec-specific tuple. If most use cases need a spec-specific tuple and would require a projection, nothing prevents us from having a file per spec and annotating each partition stats file with a spec ID instead of persisting it for each record.

Can we think through our initial plans for writing and reading these files? Doesn't have to be very elaborate at this point.

Member Author:

I already have a POC PR that reads and writes these files:
#8488

I think the unified tuple will be good for updates. If we keep a spec-specific tuple, the stats of the same partition (after spec evolution) will be distributed across multiple buckets, which is hard for the reader. The existing partitions metadata table also uses the same unified partition tuple.

Contributor:

I'll take a look at the PoC PR as soon as 1.4 is out (within next few days).

Contributor:

I took a look at #8488 and I am not sure how I feel about generating these files during commits (in fact, during each commit attempt). I'd personally start by adding API and core logic to be able to add these files on demand and implement an action to actually produce these files (either incrementally or from scratch). Updating these stats for larger tables in each commit attempt will cause issues. In the action, we can do this more efficiently. We can also call this action immediately after writes, but at least it will not be part of the commit. It would also drastically reduce the amount of changes in #8488; we won't have to touch every operation type.

Member Author:

We can discuss the community's interest in synchronous writes.
Some users might be interested.

Agree that we should first go with the async implementation to make things easier.

Member Author:

Also, Trino is currently writing Puffin stats both synchronously and asynchronously. Dremio is also interested in sync stats.

| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
Contributor:

We are a bit inconsistent throughout the code with naming: data_files_count vs data_file_count. In quite a few cases, we use plural words (like the Action API and the spec for manifest lists).

Have we discussed the preferred naming to follow in the future?


Member Author:

Yes. It applies to data_record_count and other fields as well.
Agree that we need to standardise this.

Contributor:

Not a native speaker, so I searched around. It seems "file count" and "record count" are the right way to go.

The reason "file count" is the correct phrase is that it follows the standard rules of English grammar for compound nouns. When you have a compound noun made up of two nouns, like "file" and "count," the first noun (in this case, "file") acts as an adjective describing the second noun (in this case, "count").

So, "file count" means the count of files; in other words, it specifies what kind of count you are referring to – a count of files. This is a common construction in English, where the first noun helps specify or describe the second noun, and it's the reason "file count" is used rather than "files count."

Member Author:

Thanks @flyrain, I too found that initially. But after digging a bit more, the internet says both are valid.

So, I decided we can go with either one, but we just have to standardise it. Maybe we need to check what Spark, Hive, and other products follow as a standard.

| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |
Contributor:

Am I right that this would only be possible to compute by reading data and applying deletes? If so, are we planning to make this optional and not populate it by default?

Member Author:

You are right. That is why the field is kept optional in the schema.

The implementation will not populate this by default (it can be controlled by a property or by the way of writing; for example, an async write can compute it, but incremental sync writes will not).

Contributor:

This makes sense to me.

| _optional_ | _optional_ | **`11 last_updated_at`** | `long` | Timestamp in milliseconds from the unix epoch when the partition was last updated |
| _optional_ | _optional_ | **`12 last_updated_snapshot_id`** | `long` | ID of snapshot that last updated this partition |

Note that the partition data tuple's schema is based on the partition spec output, using partition field ids for the struct field ids.
The unified partition type is a struct containing all fields that have ever been a part of any spec in the table.
Member:

This is not specified enough imo. What does this look like, does order matter, etc ...

Member:

@RussellSpitzer you mean at the documentation level (in spec.md), right?

Member Author:

Elaborated the description and added examples.

During implementation I will use the existing Partitioning#partitionType() for this.
But the spec should not talk about code, hence I excluded that info.

Member:

@ajantha-bhat thanks for the update !

In other words, the struct fields represent a union of all known partition fields.
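As a hypothetical illustration (the field names, ids, and types below are made up), consider a table whose first partition spec uses `category` (partition field id 1000) and whose second spec adds `day(ts)` (partition field id 1001):

```
spec#0: [1000: category]
spec#1: [1000: category, 1001: ts_day]

unified partition type: struct<1000: category (optional string),
                               1001: ts_day   (optional int)>
```

In this sketch, a partition written only under spec#0 would leave `ts_day` null in its partition tuple, and the `spec_id` column records the spec that last modified that partition.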

#### Commit Conflict Resolution and Retry

When two commits happen at the same time and are based on the same version, only one commit will succeed. In most cases, the failed commit can be applied to the new current version of table metadata and retried. Updates verify the conditions under which they can be applied to a new version and retry if those conditions are met.