Add table metadata changes for statistics information in table metadata #5450

Merged
merged 1 commit into from Aug 24, 2022

findepi commented Aug 5, 2022

Extracted from #5021

#5021 (comment)

findepi force-pushed the findepi/stats-table-metadata branch 2 times, most recently from 30f6308 to 6b6ca9e (August 8, 2022 10:00)

findepi commented Aug 8, 2022

Thanks, @rdblue, for the thorough review. Comments applied; please take another look if you want.

findepi force-pushed the findepi/stats-table-metadata branch 3 times, most recently from 76d0743 to c00b965 (August 11, 2022 13:02)

findepi commented Aug 11, 2022

Per #5450 (comment), I have rebased this off of #5021.

@rdblue, please take another look when you want.

}
}

class RemoveStatistics implements MetadataUpdate {
Contributor

When is this going to be used? Should we have a single ReplaceStatistics operation that is idempotent instead?

Collaborator Author

When working on Trino stats support in Hive and Delta, we found that the ability to remove stats is useful. We need a way to remove statistics from a table, for example if we detect that the application that wrote the stats did so incorrectly, or when the stats are correct but a query engine draws wrong conclusions from them and the resulting plans are bad in practice.

@rdblue how would you want to model this?

Contributor

That sounds good to me. Is SetStatistics intended to be idempotent? That would make sense so that we don't need both remove and set if we want to replace.

Collaborator Author

Yes, SetStatistics should be idempotent and should also allow replacing an existing stats file, so we don't need a remove when we want to set.
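To make the semantics agreed in this thread concrete, here is a minimal, self-contained sketch. It is not the Iceberg implementation: the class `StatsUpdateSketch`, its method names, and the use of a plain path string in place of a statistics file object are all assumptions made for illustration. It shows a set operation that is idempotent and replaces any existing entry for the same snapshot ID, alongside a remove operation that is a no-op when nothing is recorded.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; class and method names are illustrative, not Iceberg API.
class StatsUpdateSketch {
  // One statistics file path per snapshot ID (a plain string stands in for a file object).
  private final Map<Long, String> pathBySnapshotId = new LinkedHashMap<>();

  // Idempotent set: repeating the same call leaves the state unchanged,
  // and a different path for the same snapshot ID replaces the earlier one.
  StatsUpdateSketch setStatistics(long snapshotId, String path) {
    if (path == null) {
      throw new IllegalArgumentException("path must not be null");
    }
    pathBySnapshotId.put(snapshotId, path);
    return this;
  }

  // Idempotent remove: removing statistics that are not present is a no-op.
  StatsUpdateSketch removeStatistics(long snapshotId) {
    pathBySnapshotId.remove(snapshotId);
    return this;
  }

  // Read-only snapshot of the current state.
  Map<Long, String> statistics() {
    return Collections.unmodifiableMap(new LinkedHashMap<>(pathBySnapshotId));
  }
}
```

With these semantics, a replace is a single set call, while remove stays available for the case described above where stats should be dropped without a replacement.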

findepi force-pushed the findepi/stats-table-metadata branch from c00b965 to c5ad62f (August 23, 2022 12:33)

findepi commented Aug 23, 2022

Thanks @rdblue for your review. Updated the PR thanks to your suggestions.
Open questions: #5450 (comment) and optionally #5450 (comment)

@@ -817,6 +824,7 @@ public static class Builder {
private long currentSnapshotId;
private List<Snapshot> snapshots;
private final Map<String, SnapshotRef> refs;
private final Map<Long, List<StatisticsFile>> statisticsFiles;
Contributor

Why use a multimap? Shouldn't setStatistics replace the previous file? I think that's the behavior that is currently implemented, so I think this should just be a regular map to ensure there aren't somehow multiple files for a snapshot ID.

Collaborator Author

> Why use a multimap?

This follows #5021 (comment).

Please clarify where I can use a Map and where I should not. The lifecycle of these metadata objects is still not crystal clear to me.

Contributor

Okay, so the idea is to simply preserve cases where there are multiple stats files? I guess I'm okay with that. I'd also be fine with discarding one of them.
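One reading of this exchange, sketched under assumptions: the multimap lets the builder carry over whatever an existing metadata file already lists, including several files for one snapshot ID, while a set operation still leaves exactly one file for that snapshot. The class `StatsMultimapSketch` and its methods `load`, `set`, and `files` are hypothetical names, and path strings stand in for statistics file objects; this is not the code merged in the PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names are illustrative, not Iceberg API.
class StatsMultimapSketch {
  // Multimap-style state: snapshot ID -> all statistics file paths seen for it.
  private final Map<Long, List<String>> filesBySnapshotId = new LinkedHashMap<>();

  // Loading existing metadata: preserve every listed file,
  // even if several share the same snapshot ID.
  void load(long snapshotId, String path) {
    filesBySnapshotId.computeIfAbsent(snapshotId, id -> new ArrayList<>()).add(path);
  }

  // Setting statistics: replace whatever was recorded for this snapshot ID
  // with a single file.
  void set(long snapshotId, String path) {
    List<String> single = new ArrayList<>();
    single.add(path);
    filesBySnapshotId.put(snapshotId, single);
  }

  // Defensive copy of the files recorded for a snapshot ID (empty if none).
  List<String> files(long snapshotId) {
    return new ArrayList<>(
        filesBySnapshotId.getOrDefault(snapshotId, Collections.emptyList()));
  }
}
```

The alternative mentioned above, discarding one of the duplicate files, would correspond to a regular map in which `load` behaves like `set`.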

findepi force-pushed the findepi/stats-table-metadata branch from c5ad62f to ad6c355 (August 24, 2022 20:37)

rdblue commented Aug 24, 2022

Looks good. I'll merge when tests are passing.


findepi commented Aug 24, 2022

> Looks good. I'll merge when tests are passing.

Thanks for all the review comments!

rdblue merged commit 007ef6e into apache:master on Aug 24, 2022

rdblue commented Aug 24, 2022

Thanks, @findepi!
