
Spec: Add partition stats spec #7105

Merged: 9 commits into apache:main, Nov 1, 2023
Conversation

@ajantha-bhat (Member) commented Mar 14, 2023:

Adding the spec changes for partition stats.

Based on the discussed design:
https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing

Fixes: #8451, #8452

@ajantha-bhat force-pushed the pstats_spec branch 2 times, most recently from 08b37e9 to e6c2610 on March 14, 2023.
@jackye1995 changed the title from "Docs: Add partition stats spec" to "Spec: Add partition stats spec" on Mar 14, 2023.
@aokolnychyi (Contributor):

I have yet to run that test to check whether one file is sufficient. I also worry about the cost of incremental updates if we have to replace the entire file.

@ajantha-bhat (Member Author):

> I have yet to run that test to check whether one file is sufficient. I also worry about the cost of incremental updates if we have to replace the entire file.

Yeah. Looking forward to the test results.

We may end up supporting both copy-on-write (a single output file) and merge-on-read (multiple output files), as it boils down to whether the user wants faster reads or faster writes. Initially, we may support just a single output file. But yeah, let us discuss more and conclude based on the test results.

@jackye1995 (Contributor):

> We may end up supporting both copy-on-write (a single output file) and merge-on-read (multiple output files), as it boils down to whether the user wants faster reads or faster writes.

This makes me wonder whether we should really use file storage for this. Technically, it sounds like we should use a key-value store mapping (partition, spec-id) -> (statistics values) instead of file-based storage.

Maybe we should build an interface for it and make file storage one specific implementation, instead of saying it has to be backed by a file or multiple files, because it is bound to be inefficient in either reads or writes that way. Just a thought.
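
A minimal sketch of that interface idea (all names here are hypothetical, not existing Iceberg APIs; only `StructLike`, Iceberg's generic tuple type, is real):

```java
import org.apache.iceberg.StructLike;

// Hypothetical sketch of the interface idea above; none of these names are real Iceberg
// APIs. The key is the (partition, spec-id) pair and the value is the stats tuple, so a
// single-file layout becomes just one implementation strategy rather than part of the spec.
interface PartitionStatsStore {
  // Placeholder value type; in practice this would carry all the record/file counts.
  record PartitionStats(long dataRecordCount, int dataFileCount) {}

  PartitionStats get(StructLike partition, int specId);

  void put(StructLike partition, int specId, PartitionStats stats);

  // Full scan for planners that want stats for every partition.
  Iterable<PartitionStats> scan();
}
```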

@ajantha-bhat (Member Author):

> This makes me wonder whether we should really use file storage for this. Technically, it sounds like we should use a key-value store mapping (partition, spec-id) -> (statistics values) instead of file-based storage.
> Maybe we should build an interface for it and make file storage one specific implementation, instead of saying it has to be backed by a file or multiple files, because it is bound to be inefficient in either reads or writes that way. Just a thought.

During the design phase, I did evaluate storing it in a KV store (like RocksDB or an HBase index). It is also captured in the design document.
But we concluded that we don't want to bring external dependencies into Iceberg core functionality, and we also cannot reuse the current cleanup logic in that case. Hence, a file was chosen.

@jackye1995 (Contributor):

Sorry, I totally overlooked that. I would also be -1 on using a specific external dependency like RocksDB or HBase; that is probably why I just quickly skipped those options...

But I feel the semantics required for partition stats just do not fit a file storage system; as you said, it ends up having to choose between CoW and MoR, which seems like too much complexity just to manage some additional stats.

I think we can start from a file-storage (FileIO) based solution, but the spec should be at a high enough level that it could be backed by more efficient solutions.

I guess the same argument applies to things like the manifest list: today, rolling up the manifest list is a bottleneck for write operations, and some kind of design backed by a key-value store could remove that bottleneck. Maybe we should think about that and try to solve these cases together? Just like we have FileIO that works really well with object storage semantics, we could have something like VersionedListStore that works well with any mutable but versioned list.
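
A similarly hypothetical sketch of what a VersionedListStore contract could look like (nothing here exists in Iceberg today):

```java
import java.util.List;

// Hypothetical sketch of the VersionedListStore idea; nothing here is an existing Iceberg
// API. The point is that a manifest list (or a partition stats listing) is logically a
// mutable-but-versioned list: every mutation yields a new readable version, so a KV- or
// log-backed implementation could append instead of rewriting the whole list each commit.
interface VersionedListStore<T> {
  long currentVersion();

  // Read the full list as of a given version.
  Iterable<T> read(long version);

  // Append entries on top of a base version; returns the new version.
  long append(long baseVersion, List<T> additions);

  // Rewrite the whole list (the copy-on-write path); returns the new version.
  long replace(long baseVersion, List<T> contents);
}
```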

@ajantha-bhat (Member Author):

> I guess the same argument applies to things like the manifest list: today, rolling up the manifest list is a bottleneck for write operations, and some kind of design backed by a key-value store could remove that bottleneck. Maybe we should think about that and try to solve these cases together? Just like we have FileIO that works really well with object storage semantics, we could have something like VersionedListStore that works well with any mutable but versioned list.

Yes, I totally agree that storing metadata in files can be a bottleneck (this applies to stats as well). I think catalogs are now mature enough to store metadata in a DB (REST, maybe Nessie in the future). But migration from one catalog to another needs special handling in those cases.

We should separately discuss this and handle it together for all the metadata.

@jackye1995 (Contributor):

@rdblue @danielcweeks any thoughts about the discussion above?

@ajantha-bhat (Member Author):

> I have yet to run that test to check whether one file is sufficient. I also worry about the cost of incremental updates if we have to replace the entire file.

Hi @aokolnychyi, any update on this?

@ajantha-bhat (Member Author) commented May 9, 2023:

Hi @aokolnychyi and @rdblue,

I just created a single Parquet file with a million sorted unique timestamps (partition values) and the same schema as the expected partition stats. The file size turned out to be 12.6 MB; an unsorted, random file was around 30 MB.

https://gist.github.com/ajantha-bhat/3401f2a29ddfa6b2b42f9168461ce98b

I agree that rewriting the whole file is expensive once the data grows. But creating multiple small files for incremental updates leaves the data unsorted across files, which may slow the lookup of partition values during a query.
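
Back-of-envelope extrapolation from those measurements (my arithmetic, using only the numbers above):

```latex
\frac{12.6\,\mathrm{MB}}{10^{6}\ \text{partitions}} \approx 13\ \mathrm{B/partition\ (sorted)}
\qquad
\frac{30\,\mathrm{MB}}{10^{6}\ \text{partitions}} \approx 30\ \mathrm{B/partition\ (unsorted)}
```

So even a table with 10^7 partitions would land at roughly 130 to 300 MB for a single stats file.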

Also, just summarising the plan we had for partition stats:
Step 1: define the partition stats schema and requirements, then add them to the Iceberg spec.

Step 2: define how to track partition stats files and add it to the Iceberg spec. (relies on step 1)

Step 3: build a util to create a partition stats file for a table on a single node, like how the partitions table currently works (relies on step 1)

Step 4: implement stats file tracking (relies on step 2)

Step 5: build incremental update for a stats file based on incremental scan (relies on step 3)

Step 6: implement incremental update using commit stats (SnapshotSummary) and add a flag to turn it off (relies on steps 3 and 4 and possibly 5)

Step 7: implement parallel stats computation using metadata tables and a stored procedure

@rdblue (Contributor) commented May 22, 2023:

@ajantha-bhat, I think it's safe to move forward with the next steps and assume we'll go with one file.

@ajantha-bhat (Member Author):

> @ajantha-bhat, I think it's safe to move forward with the next steps and assume we'll go with one file.

Thanks for the reply. I think the PR is ready then.
I will do the other PRs as per the plan above.

format/spec.md:
| 5 | position_delete_record_count | LongType | true | Count of records in position delete files |
| 6 | position_delete_file_count | IntegerType | true | Count of position delete files |
| 7 | equality_delete_record_count | LongType | true | Count of records in equality delete files |
| 8 | equality_delete_file_count | IntegerType | true | Count of equality delete files |
Contributor:

Why include equality/position delete counts rather than requiring an accurate record count? I don't have an opinion yet either way, but we could require accurate counts here.

@ajantha-bhat (Member Author):

I think computing the actual count could be a very expensive operation, especially when equality deletes are involved.
I have followed the same schema that I implemented for the Partitions metadata table.

Contributor:

> I think computing the actual count could be a very expensive operation.

I think that's the point. Maybe we should have more accurate data since this can be computed async.

@ajantha-bhat (Member Author):

If we apply the deletes, maybe the equality_delete and position_delete fields can be empty? If we don't apply the deletes (in some future case where we compute stats synchronously during the write), they can hold the actual delete stats. So the current schema can still hold up.

The Partitions metadata table is also an async call w.r.t. the write, but we still don't apply the actual delete files.
@szehon-ho: do you have any suggestions?

Reviewer:

I think it would be reasonable to make this an actual record count for the partition even with the presence of row-level deletes. Writers that produce positional deletes would be able to produce this count directly during writes. Equality delete writers like Flink wouldn't be able to produce the actual count, in which case they shouldn't write any partition stats for the snapshot. Either an async job as suggested, or regularly scheduled compaction jobs, could produce the partition stats in that case. If partition stats are not present, readers can either choose to use partition stats from earlier snapshots - e.g. from the last compaction - or just fall back to estimates based on other sources (full table rowcount, etc.)

Contributor:

> Writers that produce positional deletes would be able to produce this count directly during writes.

This may not be true. For example, for a data file df1, we could have two positional delete files (pdf1, pdf2) applying to it. pdf1 and pdf2 may delete the same row in df1, and the writer may not know whether a row has been deleted previously. Not 100% sure though; this needs input from @szehon-ho and @aokolnychyi.

Member:

We would need to read all of the valid positional delete files to know the number of unique deletes for a data file. We wouldn't need to scan data files though, so that's nice.
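
To illustrate the deduplication problem: a toy, in-memory sketch (plain Java, not Iceberg's actual delete-file reader) that counts unique deleted positions per data file, assuming position deletes are (data file path, row position) pairs as in the format spec:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class UniqueDeleteCounter {
  // One position delete record: which data file it targets and the row position in it.
  record PositionDelete(String dataFilePath, long pos) {}

  // pdf1 and pdf2 may both delete row 3 of df1; keeping a set of positions per data file
  // deduplicates that, so the count reflects rows actually removed, not records written.
  static Map<String, Long> uniqueDeletesPerDataFile(List<List<PositionDelete>> deleteFiles) {
    Map<String, Set<Long>> positions = new HashMap<>();
    for (List<PositionDelete> deleteFile : deleteFiles) {
      for (PositionDelete delete : deleteFile) {
        positions.computeIfAbsent(delete.dataFilePath(), k -> new HashSet<>()).add(delete.pos());
      }
    }
    Map<String, Long> counts = new HashMap<>();
    positions.forEach((file, posSet) -> counts.put(file, (long) posSet.size()));
    return counts;
  }
}
```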

Contributor:

An additional scan over all positional delete files of the changed partitions would be needed. This may still be expensive for sync operations when a write touches a lot of partitions.

Admittedly, the precise actual record count is useful. I still think the separate metrics (record count, file count) for data files and delete files are valuable. Users could use these metrics to decide whether a compaction is needed, or to debug why a certain scan is slow. For example, if we see that there are several delete files in a partition, we may compact it for better scan performance. With that, I propose to keep them instead of removing them.

Reviewer:

@ajantha-bhat I think the file count should remain just the data file count. Mixing the two reduces the value of the count, in my opinion. Knowing delete file counts isn't that useful in my experience; there are not a lot of choices the planner can make regarding how delete files are handled, even if there are a lot of them.

@szehon-ho (Collaborator) commented Sep 7, 2023:

> The Partitions metadata table is also an async call w.r.t. the write, but we still don't apply the actual delete files.
> @szehon-ho: do you have any suggestions?

I am still not very convinced that the Partitions metadata table should have that column, as it would really reduce the performance of that table. It's a metadata table, so it is synchronous in the sense that it is calculated when the user queries it.

I slightly prefer having more stats rather than fewer, i.e., total_record_count, data_file_record_count, eq_delete_record_count, pos_delete_record_count. Internally we can definitely make use of the delete record counts in planning, for example to decide whether to cache deletes. Users can also use them as @flyrain mentioned. It is true that we can already access all file record counts via manifest entries, so maybe an argument can be made against having them here. But as they are cheap to get, I don't see a strong reason to hide them from the user in partition stats.

The total_record_count seems the most valuable but is expensive to compute. I also agree with @flyrain that the total count may not be available in the writer; for row-level operation writers, for example, I am not sure how the writer would know how many total data records are in a partition after the write.

@ajantha-bhat (Member Author):

@rdblue: I have addressed the comments. Please take a look at it again. Thanks.

format/spec.md:
| v1 | v2 | Field name | Type | Description |
|----|----|------------|------|-------------|
| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
@ajantha-bhat (Member Author):

I have also added this field today. It can help avoid one IO call just to get the file size needed to reach the Parquet footer. Puffin also has this.
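
To see why the stored size helps: a minimal sketch (not Iceberg code, and using a local file purely for illustration) of locating a Parquet footer when the file length is already known, so no extra length/HEAD call is needed:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

class FooterLocator {
  // A Parquet file ends with: 4-byte little-endian footer length, then the magic "PAR1".
  // If the file size is already known from table metadata, a reader can seek straight to
  // these trailing bytes without first issuing a length/HEAD call against the storage.
  static long footerStart(RandomAccessFile file, long knownFileSize) throws IOException {
    byte[] tail = new byte[8];
    file.seek(knownFileSize - 8);
    file.readFully(tail);
    long footerLen = (tail[0] & 0xFFL)
        | (tail[1] & 0xFFL) << 8
        | (tail[2] & 0xFFL) << 16
        | (tail[3] & 0xFFL) << 24;
    return knownFileSize - 8 - footerLen;
  }
}
```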

- Add file-size-in-bytes field to partition-statistics struct
@flyrain (Contributor) commented Sep 19, 2023:

Hi @aokolnychyi, @rdblue, can you take a look at the spec change? If it looks good, we'll move ahead with the merge.

format/spec.md:
| v1 | v2 | Field name | Type | Description |
|----|----|------------|------|-------------|
| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
Contributor:

It looks like we call it statistics-path in table stats.
Is there a particular reason to call it statistics-file-path here?

@ajantha-bhat (Member Author):

Since it is a single file, I thought adding the file keyword gives more clarity. A general statistics-path may give the impression that it is a folder or multiple files.

Contributor:

I see your point but isn't it also a path to a single file in case of table stats? I would align the naming to be consistent if it indicates a path to a single file in both cases.

@ajantha-bhat (Member Author):

OK, updated as suggested.

format/spec.md:
| v1 | v2 | Field id, name | Type | Description |
|----|----|----------------|------|-------------|
| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
Contributor:

We are a bit inconsistent throughout the code with naming: data_files_count vs data_file_count. In quite a few cases, we use the plural form (like the Action API and the spec for manifest lists).

Have we discussed the preferred naming to follow in the future?


@ajantha-bhat (Member Author):

Yes, this applies to data_record_count and the other fields too.
Agree that we need to standardise this.

Contributor:

I am not a native speaker, so I searched around. It seems file count and record count are the right way to go.

The reason "file count" is the correct phrase is because it follows the standard rules of English grammar for compound nouns. When you have a compound noun made up of two nouns, like "file" and "count," the first noun (in this case, "file") acts as an adjective describing the second noun (in this case, "count").

So, "file count" means the count of files, or in other words, it specifies what kind of count you are referring to – a count of files. This is a common construction in English, where the first noun helps specify or describe the second noun, and it's the reason "file count" is used rather than "files count."

@ajantha-bhat (Member Author):

Thanks @flyrain, I found that too initially. But after digging a bit more, the internet says both are valid.

So, I decided we can go with either one; we just have to standardise it. Maybe we need to check what Spark, Hive, and other products follow as the standard.

format/spec.md:
| v1 | v2 | Field id, name | Type | Description |
|----|----|----------------|------|-------------|
| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
Contributor:

I assume a single file would cover all specs?

@ajantha-bhat (Member Author):

Yes, the partition tuple is a unified type; it is a coerced result from all the specs.
The spec id is there just in case we want to know the latest spec that has modified this partition.

Do you feel it is redundant and we can remove it?

Member:

Personally, I think it's good this way: clearer.

Contributor:

I think the spec ID is required to reconstruct the actual partition tuple, if needed. The main question is whether it is easier to work with a unified tuple or a spec-specific tuple. If most use cases need a spec-specific tuple and would require a projection, nothing prevents us from having a file per spec and annotating each partition stats file with a spec ID instead of persisting it for each record.

Can we think through our initial plans for writing and reading these files? Doesn't have to be very elaborate at this point.

@ajantha-bhat (Member Author):

I already have a POC PR that reads and writes these files: #8488

I think the unified tuple will be good for updates. If we keep a spec-specific tuple, the stats of the same partition (after spec evolution) will be spread across multiple buckets, which is hard for the reader. The existing Partitions metadata table also uses the same unified partition tuple.
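
For reference, Iceberg core already computes such a unified type for the partitions metadata table; to my knowledge it is exposed via `Partitioning.partitionType`, though it is worth verifying the exact method on your Iceberg version:

```java
import org.apache.iceberg.Partitioning;
import org.apache.iceberg.Table;
import org.apache.iceberg.types.Types;

class UnifiedPartitionTypeExample {
  // The unified partition type is a struct of every field that has ever been part of
  // any spec of the table, keyed by partition field id, so stats rows from all specs
  // share one schema; fields a given spec lacks are simply null in that row's tuple.
  static Types.StructType unifiedType(Table table) {
    return Partitioning.partitionType(table);
  }
}
```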

Contributor:

I'll take a look at the PoC PR as soon as 1.4 is out (within the next few days).

Contributor:

I took a look at #8488 and I am not sure how I feel about generating these files during commits (in fact, during each commit attempt). I'd personally start by adding the API and core logic to add these files on demand, and implement an action to actually produce them (either incrementally or from scratch). Updating these stats for larger tables in each commit attempt will cause issues. In an action, we can do this more efficiently. We can also call the action immediately after writes, but at least it will not be part of the commit. It would also drastically reduce the amount of changes in #8488; we won't have to touch every operation type.

@ajantha-bhat (Member Author):

We can discuss the community's interest in synchronous writes; some users might be interested.

Agreed that we should first go with the async implementation to make things easier.

@ajantha-bhat (Member Author):

Also, Trino currently writes Puffin both synchronously and asynchronously. Dremio is also interested in sync stats.

@ajantha-bhat (Member Author):

@rdblue: Can you also please take a look?

@aokolnychyi (Contributor):

Getting to this today!

@aokolnychyi (Contributor) left a comment:

I made a detailed pass over this PR and changes in #8488. Below are my thoughts.

Single file vs multiple files

One of the concerns was whether a single file would be enough. Based on the estimates performed by @ajantha-bhat, we would need 12 to 30 MB to cover 1 million partitions. In our larger tables that use SPJ, the number of partitions can easily reach 5 to 10 million, as each partition mostly holds a single data file. Even in that case, I guess it will still fit.

This file may have multiple row groups so we can read it concurrently. We won't be able to produce it in a distributed fashion, however; that doesn't seem to be a big concern. One potential issue is the local sort: if we need to store stats for 5+ million partitions, we may need to spill because of the sort. We can probably do some smart optimizations, such as sorting in a distributed fashion and then stitching the results together.

All in all, I am OK with a single file.

Unified partition type vs file per spec

Another question was whether to store all specs in a single file or have a file per spec. Each option has its own tradeoffs; it is hard to predict which would be more beneficial in the future.

Let's just stick to one file and a unified partition type, as in the original proposal.

Synchronous vs asynchronous computation

I think the spec should state that these files can be generated on demand after commits. Previously, it read as if it were either during commits or not at all.

Naming consistency

I have a question whether statistics-file-path should become statistics-path to match table stats files, and whether we should use plural or singular words with counts. It looks like the correct version is record_count / XXX_file_count, but our existing manifest lists use the plural versions records_count / XXX_files_count.

What does everyone think about these points?

Thank you for working on this, @ajantha-bhat!

format/spec.md:
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
| _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |


#### Partition statistics

Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file).
Contributor:

Question: Is there a particular reason to use capital letters for Partition Statistics? It seems inconsistent with other places.

@ajantha-bhat (Member Author):

True; changed to keep capitals only for headers.

format/spec.md:
Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file).
Partition statistics are not required for reading or planning and readers may ignore them.
Each table snapshot may be associated with at most one partition statistic file.
A writer can optionally write the partition statistics file during each write operation, and
Contributor:

Shall we add a note that it can also be computed on demand rather than in each write?

@ajantha-bhat (Member Author):

Added, with some rewording.

format/spec.md:
| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |

#### Partition Statistics file
Contributor:

Question: Shall file be File since it is a header?

@ajantha-bhat (Member Author):

Yes, updated.

Also updated the header Table statistics -> Table Statistics.

format/spec.md:

#### Partition Statistics file

Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
Contributor:

I am not sure it is a good idea for the spec to make this assumption. I can see this being configurable even for tables that store Avro data, so they use Parquet or ORC for partition stats. Why can't we just default this in the implementation?

@ajantha-bhat (Member Author):

Russell gave a comment asking to explicitly mention the format type.

I have removed the word "default" and reworded a bit. The implementation can decide whether to use the table's default format or the one specified in a table property.

format/spec.md:
| v1 | v2 | Field id, name | Type | Description |
|----|----|----------------|------|-------------|
| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |
Contributor:

Am I right that this would only be possible to compute by reading the data and applying the deletes? If so, are we planning to make this optional and not populate it by default?

@ajantha-bhat (Member Author):

You are right; that is why the field is kept optional in the schema.

The implementation will not populate this by default (it can be controlled by a property or by the write mode; for example, an async write can compute it, but incremental sync writes cannot).

Contributor:

This makes sense to me.

The github-actions bot added the "Specification" label (issues that may introduce spec changes) on Oct 10, 2023.
@ajantha-bhat (Member Author):

@aokolnychyi: Thanks for the detailed review and for going through the POC PRs.
I have addressed all the comments. Please have a look again.

@aokolnychyi (Contributor):

I added this PR to our community sync. I am not sure I will be there this week but I'll sync with Russell and Yufei afterwards.

@aokolnychyi (Contributor):

Sorry, we didn't get to discussing this during the sync. Shall we do a separate sync to talk about this?

@jbonofre (Member):

@aokolnychyi sure, no problem having a specific meeting about that.

@ajantha-bhat (Member Author):

> Sorry, we didn't get to discussing this during the sync. Shall we do a separate sync to talk about this?

This is not the first time this has happened. For the past few community syncs, partition stats has always been on the list of topics, but we failed to cover it due to time restrictions; it also looks like we don't follow the order in which topics were added.

Anyway, I have started discussing this on the mailing list now. If people still feel a sync is required, I am happy to arrange one.

@jbonofre (Member):

I took a new, complete look at this PR. @ajantha-bhat isolated the spec change in this PR, which is a good approach (focusing only on the spec and not on the impact on the implementation).
As this PR:

  • focuses on the table/partition spec
  • only adds optional table/partition spec properties

I think it's reasonable to merge it (or at least to have a new review).

@rdblue @aokolnychyi do you mind taking a new look? IMHO it's good, and we can merge it. The impl/engine changes will be in other PRs (eventually iterating on the spec change if needed).

@aokolnychyi (Contributor) left a comment:

I think we can get the spec addition in as is. I left a few minor questions/comments about the wording. @ajantha-bhat, could you check so that we can get it in?

I think the next step would be to create an action to generate these files on demand. Afterwards, there will be a lot more clarity on how to do this synchronously. Doing the action first would not require many changes to the core library, so we will get something working faster. I'll help with reviewing.

format/spec.md:
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
| _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |


#### Partition Statistics

Partition statistics files are based on [Partition statistics file spec](#partition-statistics-file).
Contributor:

Are we using a capital letter for Partition statistics file spec on purpose?

format/spec.md:

Partition statistics files are based on [Partition statistics file spec](#partition-statistics-file).
Partition statistics are not required for reading or planning and readers may ignore them.
Each table snapshot may be associated with at most one partition statistic file.
Contributor:

Minor: one partition statistic file -> one partition statistics file to be consistent with other places?

format/spec.md:
A writer can optionally write the partition statistics file during each write operation, or it can also be computed on demand.
Partition statistics file must be registered in the table metadata file to be considered as a valid statistics file for the reader.

`partition-statistics` field of table metadata is an optional list of struct with the following fields:
Contributor:

Minor: an optional list of struct -> an optional list of structs?

format/spec.md:
and sorted by the field ids in ascending order.
In other words, the struct fields represent a union of all known partition fields sorted in ascending order by the field ids.

For Example,
Contributor:

Minor: For Example -> For example?

format/spec.md:
Note that partition data tuple's schema is based on the partition spec output using partition field ids for the struct field ids.
The unified partition type is a struct containing all fields that have ever been a part of any spec in the table
and sorted by the field ids in ascending order.
In other words, the struct fields represent a union of all known partition fields sorted in ascending order by the field ids.
Contributor:

Question: Is this new line intentional? Seems like it all belongs to the same paragraph.

format/spec.md:
For Example,
1) spec#0 has two fields {field#1, field#2}
and then the table has evolved into spec#1 which has three fields {field#1, field#2, field#3}.
The unified partition type looks like Struct<field#1, field#2, field#3>
Contributor:

Minor: Missing . at the end?

format/spec.md:

2) spec#0 has two fields {field#1, field#2}
and then the table has evolved into spec#1 which has just one field {field#2}.
The unified partition type looks like Struct<field#1, field#2>
Contributor:

Minor: Missing . at the end?

@ajantha-bhat (Member Author):

@aokolnychyi: I have addressed the nits and pushed. PR is ready.

> I think the next step would be to create an action to generate these files on demand. Afterwards, there will be a lot more clarity on how to do this synchronously. Doing the action first would not require many changes to the core library, so we will get something working faster. I'll help with reviewing.

Totally agree. Synchronous writing can be the last step.
I will first break down the steps/PRs needed for on-demand generation in a doc or in the dev Slack channel. Let us continue there. Thanks.

@aokolnychyi merged commit 52e69fb into apache:main on Nov 1, 2023. 2 checks passed.
The "Partition stats collection" project automation moved this from In progress to Done on Nov 1, 2023.
@aokolnychyi (Contributor) commented Nov 1, 2023:

Thanks to everyone involved! Thanks for pushing this, @ajantha-bhat!

> I will first break down the steps/PRs needed for on-demand generation in a doc or in the dev Slack channel.

Whatever works for you. If the implementation is simple enough, you may just go ahead and create a PR. If you feel there are a few items to be discussed first, please share the design doc on the dev list.

@jbonofre (Member) commented Nov 1, 2023:

@aokolnychyi @ajantha-bhat thanks guys! Good one!

As the implementation doesn't seem too complex, I suggest creating the PR including the doc. We can start the discussion directly in the PR if needed.

Thanks again!

@ajantha-bhat (Member Author):

I am focusing only on the async write first (Spark action and procedure), as discussed last time.

I see that it has to be done in 4 PRs:

a) Add a util to read and write partition stats
#9170

b) Track the partition stats file from TableMetadata
#8502

c) A Spark action to compute and write the partition stats and register it in the table metadata

d) A call procedure wrapper for the Spark action.

The first two PRs are ready for review; I am working on the last two this week.

Please review and help with merging.
cc: @aokolnychyi, @RussellSpitzer, @flyrain
