Core: Avoid generating a large ManifestFile when committing #6335
Conversation
Hi @szehon-ho @rdblue @pvary @stevenzwu @chenjunjiedada, please help review this when you are free. Thanks a lot.
Thanks @ConeyLiu! just had some questions/comments on this, but this behavior is something I observed as well.
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java (resolved)
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java (resolved)
It is mentioned in the docs that
@nastra
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java (resolved)
This seems like a reasonable thing to add to me. I'm actually more concerned about the use case that caused this. @ConeyLiu, what was the case where you were creating huge manifests? How many data files were committed?
(force-pushed from 3307397 to 6b21694)
@rdblue thanks for the review. It happens for large tables (several PBs of data) with many columns (thousands of columns, which is very common for log data or feature data). The size of a DataFile's metrics grows with the number of columns, so the manifest file can be really large when the user appends or overwrites a large amount of data (e.g., syncing hourly data).
(force-pushed from 6b21694 to 85c05cc)
I see two sets of changes here, one around "caching" and one around rolling over manifest files. Can we separate out the changes?
@@ -49,8 +51,9 @@ class FastAppend extends SnapshotProducer<AppendFiles> implements AppendFiles {
    private final List<DataFile> newFiles = Lists.newArrayList();
    private final List<ManifestFile> appendManifests = Lists.newArrayList();
    private final List<ManifestFile> rewrittenAppendManifests = Lists.newArrayList();
    private ManifestFile newManifest = null;
    private List<ManifestFile> cachedNewManifests = null;
@RussellSpitzer Do you mean the caching part here? This just renames the variable because it is plural now.
(force-pushed from 85c05cc to 7ac0847)
@rdblue @RussellSpitzer @nastra @amogh-jahagirdar Would you mind taking another look at this when you are free? Thanks a lot.
@rdblue @RussellSpitzer @szehon-ho can we have another look at this review, and merge if possible? Thanks. Large manifests have huge performance penalties for petabyte-scale tables.
(force-pushed from d1e1f1d to 617647e)
Yeah, sorry, let me try to review this tomorrow.
(force-pushed from 9074d09 to e27ba9b)
import org.apache.iceberg.relocated.com.google.common.collect.Lists;

/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest files. */
abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {
@szehon-ho please take a look to see if this is the right direction. This implements FileAppender instead of FileWriter because ManifestWriter implements FileAppender.
Thanks, this definitely looks cleaner.
I'm just not sure what to do with the existing hierarchy (ManifestWriter), as there are many duplications.
Did we think about modifying ManifestWriter to take an extra variable 'targetFileSizeBytes', and trigger tryRollover only in that case?
The problem may be the method manifestFile() in the writer? It should have been manifestFiles(). Maybe we can deprecate the existing method?
CC @aokolnychyi.
Actually, on second read, this may be OK, as it's composition and not duplicating that much code. Let's see if Anton has any comments.
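The composition approach discussed above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual Iceberg classes: the names (`RollingWriter`, `FileWriterStub`) and the byte-counting are stand-ins for `RollingManifestWriter` delegating to per-file `ManifestWriter` instances and rolling over once the current file passes a target size.

```java
import java.util.ArrayList;
import java.util.List;

public class RollingWriterSketch {

  // Stand-in for a single-file manifest writer that tracks its length.
  static class FileWriterStub {
    long length = 0;

    void add(long entrySizeBytes) {
      length += entrySizeBytes;
    }
  }

  // Rolling writer by composition: delegates to the current file writer and
  // starts a new one once the target size is reached.
  static class RollingWriter {
    private final long targetFileSizeBytes;
    private final List<Long> closedFileLengths = new ArrayList<>();
    private FileWriterStub current = null; // lazily created on first entry

    RollingWriter(long targetFileSizeBytes) {
      this.targetFileSizeBytes = targetFileSizeBytes;
    }

    void add(long entrySizeBytes) {
      if (current == null) {
        current = new FileWriterStub();
      }
      current.add(entrySizeBytes);
      if (current.length >= targetFileSizeBytes) {
        roll();
      }
    }

    private void roll() {
      closedFileLengths.add(current.length);
      current = null; // the next add() starts a new file
    }

    // Unlike a single-file writer's manifestFile(), close() yields a list,
    // mirroring the manifestFiles() naming question raised above.
    List<Long> close() {
      if (current != null && current.length > 0) {
        roll();
      }
      return closedFileLengths;
    }
  }

  public static void main(String[] args) {
    RollingWriter writer = new RollingWriter(10);
    for (int i = 0; i < 7; i++) {
      writer.add(3); // 7 entries of 3 "bytes" each, 21 bytes total
    }
    // Rolls after the 4th entry (length 12 >= 10), leaving two files.
    System.out.println(writer.close()); // [12, 9]
  }
}
```

The delegate writers stay untouched; only the wrapper knows about the target size, which is why this avoids modifying the existing `ManifestWriter` hierarchy.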
// Operation succeeds and calls cleanUncommitted
// txn.newFastAppend().appendFile(...).commit();
// Some other operations ...
// Commit fails and needs to clean up newManifests
Nit: should this comment be after the next one? (The comments seem to refer to the line above.)
I added a blank line between the comments to separate them.
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java (resolved)
I will take a look with fresh eyes by the end of tomorrow.
core/src/main/java/org/apache/iceberg/ManifestOutputFileFactory.java (resolved)
private void tryRollingToNewFile() {
  currentFileRows++;

  if (currentWriter.length() > targetFileSizeInBytes) {
Will it be expensive to call for every file? In rolling file writers, we check only every 1000 records. I understand that value may be too big but I still don't think it is a good idea to check the length for every entry.
How about checking every 16 entries? Simply calculated as 1000 / (512 / 8).
Could you check the average cost of an entry in a manifest in some of your real tables? My assumption was something around 250 but let's check the average size of an entry.
I have checked some tables. The average size has a lot to do with the schema of the table; more columns take up more storage space because of the column metrics. The biggest one I noticed (with a thousand columns) needs about 5 KB per entry, while small tables (about 10 columns) need several hundred bytes. So 250 should be an acceptable number.
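The trade-off in this thread is that a length lookup per entry is too expensive, but checking too rarely lets a file overshoot the target. A sketch of the counter-based check, with illustrative numbers (the 250-entry interval, 8 MB target, and 250-byte entries are assumptions drawn from the discussion above, not the values the PR necessarily uses):

```java
public class LengthCheckSketch {
  // Entries between length checks; worst-case overshoot is roughly
  // CHECK_INTERVAL * averageEntryBytes past the target.
  static final int CHECK_INTERVAL = 250;

  long targetFileSizeInBytes = 8L * 1024 * 1024; // 8 MB target
  long currentFileRows = 0;
  long currentLength = 0; // stand-in for currentWriter.length()
  int rolls = 0;

  void add(long entryBytes) {
    currentLength += entryBytes;
    currentFileRows++;
    // Only pay for a length lookup every CHECK_INTERVAL entries,
    // instead of on every single entry.
    if (currentFileRows % CHECK_INTERVAL == 0
        && currentLength >= targetFileSizeInBytes) {
      roll();
    }
  }

  void roll() {
    rolls++;
    currentLength = 0;
    currentFileRows = 0;
  }

  public static void main(String[] args) {
    LengthCheckSketch writer = new LengthCheckSketch();
    // 100k entries at ~250 bytes each is ~25 MB of manifest data.
    for (int i = 0; i < 100_000; i++) {
      writer.add(250);
    }
    System.out.println(writer.rolls); // 2
  }
}
```

With ~250-byte entries the worst-case overshoot per file is about 62 KB against an 8 MB target; with ~5 KB entries (the thousand-column case) the same interval can overshoot by over 1 MB, which is why the interval needs to account for the largest realistic entry size.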
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java (resolved)
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java (resolved)
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java (resolved)
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java (resolved)
I feel like this would be super valuable for CTAS kinds of use cases where a large chunk of data is being added at the same time. I had a few suggestions on how to simplify the rolling writer. Nice work, @ConeyLiu!
(force-pushed from 966a7f8 to d83ad26)
private boolean closed = false;

public RollingManifestWriter(
    FileIO fileIO,
Used to delete the empty file.
Do you think it is still needed? We switched to a lazy writer that is initialized only when there is a new entry to write. The same also applies to cases when we roll to a new file: it can only happen if there is a pending entry to write. If I am not missing anything, we can drop FileIO here and simplify closeCurrentWriter.
Sorry, my bad; the empty-file checking is not necessary.
This change looks good to me. I had one question. @ConeyLiu, could you check if I missed something?
deleteFile(cachedNewDataManifest.path());
this.cachedNewDataManifest = null;
if (cachedNewDataManifests != null) {
  List<ManifestFile> committedNewManifests = Lists.newArrayList();
Optional: committedNewManifests -> committedNewDataManifests?
Updated
@aokolnychyi thanks for the detailed review. The empty-file checking is not necessary. Reverted the related code changes.
Thanks, @ConeyLiu! Thanks everyone for reviewing!
Thanks @aokolnychyi for merging this. Thanks @rdblue @RussellSpitzer @szehon-ho @amogh-jahagirdar for reviewing.
In our production environment, we noticed the manifest files have widely varying sizes, ranging from several KB to larger than 100 MB. It seems MANIFEST_TARGET_SIZE_BYTES has not been applied during the commit phase. In this PR, we avoid generating a manifest file larger than MANIFEST_TARGET_SIZE_BYTES for newly added content files, and generate multiple manifest files when the target size is exceeded.