-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ManifestFile Stats in snapshot summary. #10246
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -156,6 +156,8 @@ public List<ManifestFile> apply(TableMetadata base, Snapshot snapshot) { | |
manifests.addAll(snapshot.allManifests(ops.io())); | ||
} | ||
|
||
manifests.forEach(summaryBuilder::addedManifestStats); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Stupid question: do we have a rough estimate on this change in terms of manifest append performance ? I guess it's not important, but just wondering. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is also my main question. My train of thought: You will need to read the manifest-list in any situation. The number of manifest can vary widely:
With the knowledge from the summary, you could spin up executors before reading the manifest-list, but this can be difficult since you would also need to know the sizes of the manifest to do some effective planning. The downside is that we add extra information to the metadata-JSON, which can also grow in size when there are many commits. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks @Fokko for feedback.
With manifest stats, It can help with better planning in terms of manifest file scans. Without these stats in metadata json either someone will have to assume or do one more IO with manifestList.
I agree it adds some extra bytes to metadataJson. But I think here we have options to optimize with expire snapshots. But to avoid an extra IO for planning I dont find any other way. WDYT ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It is only going to add cost with metadata json size. Adding it to snapshot summary is inline operation. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @nk1506 I think the main question (at least from me) is essentially: Is information on just the total number of data or delete manifests actually helpful for being able to help with whatever planning estimation you're trying to do? As @Fokko said you'll probably want to know manifest sizes (in bytes) of the files involved in planning since there can be variance depending on the strategy used for the append. I think with just determining based on number of files (and not including sizes, which means you have to read the manifest anyways) there would be a lot of overprovisioning and underprovisioning (if you're trying to a distributed planning). But definitely open to hearing more, and especially if you are able to share any data points on how this metric helped ; that would give us some more confidence that this is useful and can generalize. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
+1 on this, how would knowing these 2 new stats help with planning?
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the feedback. Regarding the usages of manifest counts for planning here is my feedback:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I had a discussion with @nk1506 to understand it better. Dremio (or query engines that uses CBO) need to estimate the cost of the query plan. Query has planning and execution phase. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ping @Fokko, @amogh-jahagirdar, @dramaticlly: Can we conclude (move forward) on this based on my above comment? Having these stats will leads to better estimate based on parallelism. Agree that Having size based cost estimation will be more accurate. But count based estimation is still better than no stats. Is other engines that use CBO like Trino is intersted in these stats? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. cc @electrum |
||
|
||
return manifests; | ||
} | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -58,6 +58,8 @@ public class SnapshotSummary { | |
public static final String SOURCE_SNAPSHOT_ID_PROP = "source-snapshot-id"; | ||
public static final String REPLACE_PARTITIONS_PROP = "replace-partitions"; | ||
public static final String EXTRA_METADATA_PREFIX = "snapshot-property."; | ||
public static final String TOTAL_DATA_MANIFEST_FILES = "total-data-manifest-files"; | ||
public static final String TOTAL_DELETE_MANIFEST_FILES = "total-delete-manifest-files"; | ||
|
||
public static final MapJoiner MAP_JOINER = Joiner.on(",").withKeyValueSeparator("="); | ||
|
||
|
@@ -144,6 +146,10 @@ public void addedManifest(ManifestFile manifest) { | |
metrics.addedManifest(manifest); | ||
} | ||
|
||
public void addedManifestStats(ManifestFile manifest) { | ||
metrics.addedManifestStats(manifest); | ||
} | ||
|
||
public void set(String property, String value) { | ||
properties.put(property, value); | ||
} | ||
|
@@ -229,6 +235,8 @@ private static class UpdateMetrics { | |
private long removedPosDeletes = 0L; | ||
private long addedEqDeletes = 0L; | ||
private long removedEqDeletes = 0L; | ||
private long totalDataManifestFiles = 0L; | ||
private long totalDeleteManifestFiles = 0L; | ||
private boolean trustSizeAndDeleteCounts = true; | ||
|
||
void clear() { | ||
|
@@ -248,6 +256,8 @@ void clear() { | |
this.removedPosDeletes = 0L; | ||
this.addedEqDeletes = 0L; | ||
this.removedEqDeletes = 0L; | ||
this.totalDataManifestFiles = 0L; | ||
this.totalDeleteManifestFiles = 0L; | ||
this.trustSizeAndDeleteCounts = true; | ||
} | ||
|
||
|
@@ -263,6 +273,12 @@ void addTo(ImmutableMap.Builder<String, String> builder) { | |
setIf(removedDeleteFiles > 0, builder, REMOVED_DELETE_FILES_PROP, removedDeleteFiles); | ||
setIf(addedRecords > 0, builder, ADDED_RECORDS_PROP, addedRecords); | ||
setIf(deletedRecords > 0, builder, DELETED_RECORDS_PROP, deletedRecords); | ||
setIf(totalDataManifestFiles > 0, builder, TOTAL_DATA_MANIFEST_FILES, totalDataManifestFiles); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If there are zero counts these stats will be not added to summary. |
||
setIf( | ||
totalDeleteManifestFiles > 0, | ||
builder, | ||
TOTAL_DELETE_MANIFEST_FILES, | ||
totalDeleteManifestFiles); | ||
|
||
if (trustSizeAndDeleteCounts) { | ||
setIf(addedSize > 0, builder, ADDED_FILE_SIZE_PROP, addedSize); | ||
|
@@ -336,6 +352,16 @@ void addedManifest(ManifestFile manifest) { | |
} | ||
} | ||
|
||
void addedManifestStats(ManifestFile manifest) { | ||
switch (manifest.content()) { | ||
case DATA: | ||
this.totalDataManifestFiles++; | ||
break; | ||
case DELETES: | ||
this.totalDeleteManifestFiles++; | ||
} | ||
} | ||
|
||
void merge(UpdateMetrics other) { | ||
this.addedFiles += other.addedFiles; | ||
this.removedFiles += other.removedFiles; | ||
|
@@ -353,6 +379,8 @@ void merge(UpdateMetrics other) { | |
this.removedPosDeletes += other.removedPosDeletes; | ||
this.addedEqDeletes += other.addedEqDeletes; | ||
this.removedEqDeletes += other.removedEqDeletes; | ||
this.totalDataManifestFiles += other.totalDataManifestFiles; | ||
this.totalDeleteManifestFiles += other.totalDeleteManifestFiles; | ||
this.trustSizeAndDeleteCounts = trustSizeAndDeleteCounts && other.trustSizeAndDeleteCounts; | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add manifestStats with any newManifest.