Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Api: Track partition statistics via TableMetadata #8502

Merged
merged 3 commits into from Dec 18, 2023

Conversation

ajantha-bhat
Copy link
Member

@ajantha-bhat ajantha-bhat commented Sep 5, 2023

  • Introduce PartitionStatisticsFile as per the Spec.
  • Tracking PartitionStatisticsFile in a same way as how StatisticsFile is already tracked.

Fixes #8457

* <p>Statistics are informational. A reader can choose to ignore statistics information. Statistics
* support is not required to read the table correctly.
*/
public interface PartitionStatisticsFile extends Serializable {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aokolnychyi
Copy link
Contributor

Ack, will review this week.

@ajantha-bhat
Copy link
Member Author

Ack, will review this week.

Thanks.
Also, please include #9170 in the review

@aokolnychyi
Copy link
Contributor

I started looking on Friday but got distracted. I will try to finish by end of Wed.

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did one pass. Overall, this seems solid. I'll need to do another pass with fresh eyes.

Are saving the rest of changes in engines to keep these files during cleanup for future PRs? Shall we test the core expiry of snapshots in this PR? It would be unfortunate to add partition stats files and let the expiry process remove them.

@@ -873,6 +873,10 @@ acceptedBreaks:
new: "method void org.apache.iceberg.encryption.Ciphers::<init>()"
justification: "Static utility class - should not have public constructor"
"1.4.0":
org.apache.iceberg:iceberg-api:
- code: "java.method.addedToInterface"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we did a similar breaking change when adding table stats. I wonder whether that was the correct decision, however. Why not return an empty list by default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added as default

import java.io.Serializable;

/**
* Represents a partition statistics file in the table default format, that can be used to read
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether in the table default format will always be accurate. What about just this?

Represents a partition statistics file that can be used to read table data more efficiently.

long snapshotId();

/**
* Returns fully qualified path to the file, suitable for constructing a Hadoop Path. Never null.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't encourage using Hadoop Path given that we have our own FileIO and InputFile.
What about dropping suitable for constructing a Hadoop Path?

* @return this for method chaining
*/
UpdatePartitionStatistics setPartitionStatistics(
long snapshotId, PartitionStatisticsFile partitionStatisticsFile);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, PartitionStatisticsFile already returns snapshotId().
Why also pass it explicitly and validate they are equal in the implementation?
Seems to match what we do for UpdateStatistics, though.

@@ -262,7 +262,8 @@ public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpira
deleteFiles(manifestsToDelete, "manifest");
deleteFiles(manifestListsToDelete, "manifest list");

if (!beforeExpiration.statisticsFiles().isEmpty()) {
if (!beforeExpiration.statisticsFiles().isEmpty()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a helper method like hasStatsFiles or something to simplify the condition and stay on 1 line?

@@ -82,7 +82,8 @@ public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpira

deleteFiles(manifestListsToDelete, "manifest list");

if (!beforeExpiration.statisticsFiles().isEmpty()) {
if (!beforeExpiration.statisticsFiles().isEmpty()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A helper method here too?

*/
@Deprecated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the purpose of this method and how it is used, what about simply extending it to return all statistics files? Then we won't need changes in consumers and won't need another method. We can clarify that this returns all stats files in Javadoc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the changes in this file as the new APIs added was not called from anywhere (this is needed for spark action).

I will raise a follow up PR for Spark actions to consider partition stats file for remove orphan files and expire snapshots action.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since, it was needed only for spark action, moved these changes into a separate PR with the tests for expire snapshots and remove orphan files.

#9284


public class SetPartitionStatistics implements UpdatePartitionStatistics {
private final TableOperations ops;
private final Map<Long, Optional<PartitionStatisticsFile>> partitionStatisticsToSet =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how I feel about using a map and giving Optional a special meaning. Would having a map of stats to set and a separate set of snapshot IDs to remove easier to read/understand? What do you think, @ajantha-bhat?

private TableMetadata internalApply(TableMetadata base) {
  TableMetadata.Builder builder = TableMetadata.buildFrom(base);
  toSet.forEach(builder::setPartitionStatistics);
  toRemove.forEach(builder::removePartitionStatistics);
  return builder.build();
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -948,6 +957,9 @@ private Builder(TableMetadata base) {
this.refs = Maps.newHashMap(base.refs);
this.statisticsFiles =
base.statisticsFiles.stream().collect(Collectors.groupingBy(StatisticsFile::snapshotId));
this.partitionStatisticsFiles =
base.partitionStatisticsFiles.stream()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether a helper method would make it easier to read.

],
"snapshot-log": [],
"metadata-log": []
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing empty line?

@ajantha-bhat
Copy link
Member Author

@aokolnychyi: Thanks for the review.

  • I see that most of the questions or comments related to why puffin stats followed that style. We can fix in this PR for partition stats and later back port to puffin too.
  • I didn't add expire snapshots and remove orphan files test code (but have test case of RemoveSnapshots API) as I was thinking to keep the scope of PR to non-Spark. I will have a PR that depends on this PR which tests these function. So, this PR can be merged.

Got a little busy week. I will finish addressing comments and the follow up Spark PR for expire snapshots and remove orphan files by Monday.

Meanwhile you can also review the independent PR (Util for partition stats reading and writing) : #9170

Tracking `PartitionStatisticsFile` in a same way as how `StatisticsFile` is already tracked.
@ajantha-bhat
Copy link
Member Author

@aokolnychyi: Fixed all the comments and also opened a new Spark module PR (which is dependent on this) to ensure partition stats are considered for GC (expire snapshots and remove orphan files). I will rebase that PR once this is merged.

#9284

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks close. I did another pass, I'll need to check the tests with fresh eyes.

* <p>Statistics are informational. A reader can choose to ignore statistics information. Statistics
* support is not required to read the table correctly.
*/
public interface PartitionStatisticsFile extends Serializable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure it is a good idea to make PartitionStatisticsFile serializable? I would probably not do that unless there is a good reason right now. None of our existing files are serializable by contract (they may be in practice but not by API).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. We can add it back if required during an end to end implementation.

@@ -946,15 +955,25 @@ private Builder(TableMetadata base) {
this.previousFileLocation = base.metadataFileLocation;
this.previousFiles = base.previousFiles;
this.refs = Maps.newHashMap(base.refs);
this.statisticsFiles =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajantha-bhat, just to make sure I understand. We try to replace the partition stats for each snapshot but it is not required by the spec so it is technically possible to have multiple files for one snapshot?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently one stats file per snapshot. But I think in future it may track multiple files too.

I followed the same pattern as existing puffin files. I want to keep the interfaces consistent.

}

public Builder removePartitionStatistics(long snapshotId) {
Preconditions.checkNotNull(snapshotId, "snapshotId is null");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can it be null if it is primitive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I copy pasted from existing removeStatistics which has this problem. I overlooked or assumed things are correct. I will be careful next time.

this.snapshotsById = Maps.newHashMap(base.snapshotsById);
this.schemasById = Maps.newHashMap(base.schemasById);
this.specsById = Maps.newHashMap(base.specsById);
this.sortOrdersById = Maps.newHashMap(base.sortOrdersById);
}

private static Map<Long, List<StatisticsFile>> statsFileBySnapshotID(TableMetadata base) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Is there a pattern in this class to have static methods at the end? If so, can we put these methods together with other static methods? If not, it is OK to keep them here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved

.collect(Collectors.groupingBy(StatisticsFile::snapshotId));
}

private static Map<Long, List<PartitionStatisticsFile>> partitionStatsFileBySnapshotID(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is common for this class to call such methods indexSmth and pass actual elements. If so, what about something like below? We can use shorter variables to stay on line but that's totally optional. Up to you, method names are examples.

private static Map<Long, List<StatisticsFile>> indexStatistics(List<StatisticsFile> files) {
  return files.stream().collect(Collectors.groupingBy(StatisticsFile::snapshotId));
}

private static Map<Long, List<PartitionStatisticsFile>> indexPartitionStatistics(
    List<PartitionStatisticsFile> files) {
  return files.stream().collect(Collectors.groupingBy(PartitionStatisticsFile::snapshotId));
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -1288,6 +1307,24 @@ public Builder suppressHistoricalSnapshots() {
return this;
}

public Builder setPartitionStatistics(PartitionStatisticsFile partitionStatisticsFile) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: I wonder whether using just file or partitionStats to shorten the lines would make sense.
Completely up to you.


@Override
public UpdatePartitionStatistics setPartitionStatistics(
PartitionStatisticsFile partitionStatisticsFile) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel the context in this class is pretty clear and we can call it stats, partitionStats or file to shorten the lines. Then we can also call other variables partitionStatsToSet or statsToSet.


public class SetPartitionStatistics implements UpdatePartitionStatistics {
private final TableOperations ops;
private final Set<PartitionStatisticsFile> partitionStatisticsToSet = Sets.newHashSet();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about using Map<Long, PartitionStatisticsFile> to make sure the files are overridden if multiple files are passed for the same snapshot? We can then simply call statsToSet.values().forEach(...) below?

@@ -561,4 +576,21 @@ private static List<StatisticsFile> statisticsFilesFromJson(JsonNode statisticsF

return statisticsFilesBuilder.build();
}

private static List<PartitionStatisticsFile> partitionStatisticsFilesFromJson(
JsonNode partitionStatisticsFilesList) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind using longer variables but I wonder whether a shorter version and staying on one line in some places would be more readable in this method.

@aokolnychyi
Copy link
Contributor

@ajantha-bhat, I'll take a look at other PRs once this is in. I feel this one is almost ready to go.

@ajantha-bhat
Copy link
Member Author

@aokolnychyi: I have handled all the new comments. Thanks again for the review.

@ajantha-bhat
Copy link
Member Author

Re-triggering build due to flaky test in Flink

https://github.com/apache/iceberg/actions/runs/7222400549/job/19679307743?pr=8502

@ajantha-bhat ajantha-bhat reopened this Dec 15, 2023
Partition stats collection automation moved this from In progress to Done Dec 15, 2023
Partition stats collection automation moved this from Done to In progress Dec 15, 2023
import org.apache.iceberg.relocated.com.google.common.collect.Maps;
import org.apache.iceberg.relocated.com.google.common.collect.Sets;

public class SetPartitionStatistics implements UpdatePartitionStatistics {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have to be public?

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I had only one comment about the access modifier for the base implementation class.
This is a big PR so I will not hold it because of that. However, can we please follow up, if needed?

Great work, @ajantha-bhat!

@aokolnychyi aokolnychyi merged commit 6e21bbf into apache:main Dec 18, 2023
78 of 84 checks passed
Partition stats collection automation moved this from In progress to Done Dec 18, 2023
@ajantha-bhat
Copy link
Member Author

@aokolnychyi: Thanks for the detailed review and merge.

The next small PR to review will be #9284 (Spark 3.5: Ensure that partition stats files are considered for GC procedures)

I had only one comment about the access modifier for the base implementation class.

I will check this in a followup. Again this follows the style from puffin files.
I will also handle the applicable comments from this PR to stats files code (puffin files) in a follow up PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Implement stats file tracking via TableMetadata.
2 participants