Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Avoid generating a large ManifestFile when committing #6335

Merged
merged 2 commits into from Jul 28, 2023

Conversation

ConeyLiu
Copy link
Contributor

@ConeyLiu ConeyLiu commented Dec 1, 2022

In our production env, we noticed the manifest files have a large random size, ranging from several KB to larger than 100 MB. It seems the MANIFEST_TARGET_SIZE_BYTES has not worked during the commit phase.

In this PR, we avoid generating a manifest file larger than MANIFEST_TARGET_SIZE_BYTES for newly added content files and will generate multiple manifest files when the size is covered.

@github-actions github-actions bot added the core label Dec 1, 2022
@ConeyLiu
Copy link
Contributor Author

ConeyLiu commented Dec 1, 2022

Hi @szehon-ho @rdblue @pvary @stevenzwu @chenjunjiedada pls help to review this when you are free. Thanks a lot.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ConeyLiu! just had some questions/comments on this, but this behavior is something I observed as well.

@nastra
Copy link
Contributor

nastra commented Dec 5, 2022

It is mentioned in the docs that MANIFEST_TARGET_SIZE_BYTES relates to Target size when merging manifest files, meaning that this setting only takes effect when merging of manifest files happens (e.g. when using newAppend()). Merging of manifest files would not happen when using newFastAppend() for example. This might explain why it seemed that this setting wouldn't take any effect in your env.

@ConeyLiu
Copy link
Contributor Author

ConeyLiu commented Dec 6, 2022

@nastra newAppend will create a large manifest as well. It only merges small manifests into a large one, and does not split large one into target file size.

@rdblue
Copy link
Contributor

rdblue commented Dec 18, 2022

This seems like a reasonable thing to add to me. I'm actually more concerned about the use case that caused this.

@ConeyLiu, what was the case where you were creating huge manifests? How many data files were committed?

@ConeyLiu
Copy link
Contributor Author

@rdblue thanks for the reviewing. It happens for the larger table (has several PBs data) and with many columns(thousand columns which is very common for log data or feature data). And the size of the DataFile metrics' is increased with the number of columns. So the manifest file could be really large when the user adds into/overwrites a larger number of data (eg, sync hourly data).

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two sets of changes here, one around "caching" and one around rolling over manifest files. Can we separate out the changes?

@@ -49,8 +51,9 @@ class FastAppend extends SnapshotProducer<AppendFiles> implements AppendFiles {
private final List<DataFile> newFiles = Lists.newArrayList();
private final List<ManifestFile> appendManifests = Lists.newArrayList();
private final List<ManifestFile> rewrittenAppendManifests = Lists.newArrayList();
private ManifestFile newManifest = null;
private List<ManifestFile> cachedNewManifests = null;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer Do you mean the caching part here? Here just rename the variable because it is plural now.

@ConeyLiu
Copy link
Contributor Author

@rdblue @RussellSpitzer @nastra @amogh-jahagirdar Would you mind taking another look at this when you are free? Thanks a lot.

@zinking
Copy link
Contributor

zinking commented Jun 1, 2023

@rdblue @RussellSpitzer @szehon-ho can we have another look at this review and merge if possible ? thanks. large manifests have huge performance penalties for PetaByte tables.

@szehon-ho
Copy link
Collaborator

Yea sorry let me try to review this tomorrow.

import org.apache.iceberg.relocated.com.google.common.collect.Lists;

/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest file. */
abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szehon-ho pls take a look if this is the right direction. Here implements FileAppender instead of FileWriter that's because the ManifestWriter implements FileAppender.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this definitely looks cleaner.

Im just not sure what to do with existing hierarchy (ManifestWriter), as there are many duplications.

Did we think about modifying ManifestWriter, to take extra variable 'targetFileSizeBytes', then trigger tryRollover only in this case.

The problem may be the method manifestFile() in writer? It should have been manifestFiles(). Maybe we can deprecate existing method?

CC @aokolnychyi .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually on second read, this may be ok, as its a composition, and not duplicating that much code. Lets see if Anton has any comments

core/src/main/java/org/apache/iceberg/FastAppend.java Outdated Show resolved Hide resolved
// // Operation succeeds and calls cleanUncommitted'
// txn.newFastAppend().appendFile(...).commit();
// // Some other operations ...
// // Commit fails and needs to clean up newManifests
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should this comment be after next one? (seems comments refer to the line above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add the blank line between the comments to separate them.

core/src/main/java/org/apache/iceberg/FastAppend.java Outdated Show resolved Hide resolved
core/src/main/java/org/apache/iceberg/ManifestFiles.java Outdated Show resolved Hide resolved
core/src/main/java/org/apache/iceberg/ManifestFiles.java Outdated Show resolved Hide resolved
import org.apache.iceberg.relocated.com.google.common.collect.Lists;

/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest file. */
abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this definitely looks cleaner.

Im just not sure what to do with existing hierarchy (ManifestWriter), as there are many duplications.

Did we think about modifying ManifestWriter, to take extra variable 'targetFileSizeBytes', then trigger tryRollover only in this case.

The problem may be the method manifestFile() in writer? It should have been manifestFiles(). Maybe we can deprecate existing method?

CC @aokolnychyi .

@aokolnychyi
Copy link
Contributor

I will take a look with fresh eyes by the end of tomorrow.

private void tryRollingToNewFile() {
currentFileRows++;

if (currentWriter.length() > targetFileSizeInBytes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be expensive to call for every file? In rolling file writers, we check only every 1000 records. I understand that value may be too big but I still don't think it is a good idea to check the length for every entry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about checking every 16? Just simply calculated as: 1000 / (512 / 8).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check the average cost of an entry in a manifest in some of your real tables? My assumption was something around 250 but let's check the average size of an entry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have checked some tables. The average size has a lot to do with the schema of the table, the more columns take up more storage space. That's because of the column metrics. I noticed the biggest one (with thousand columns) needs about 5KB for one entry. The small tables (about 10 columns) need about several hundred bytes. So 250 should be an acceptable number.

@aokolnychyi
Copy link
Contributor

I feel like this would be super valuable for CTAS kind of use cases where a large chunk of data is being added at the same time. I had a few suggestions on how to simplify the rolling writer.

Nice work, @ConeyLiu!

private boolean closed = false;

public RollingManifestWriter(
FileIO fileIO,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used to delete the empty file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is still needed? We switched to a lazy writer that is initialized only when there is a new entry to write. The same also applies to cases when we roll to a new file. It can only happen if there is a pending entry to write. If I am not missing anything, we can drop FileIO here and simplify closeCurrentWriter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry my bad, the empty file checking is not necessary.

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks good to me. I had one question. @ConeyLiu, could you check if I missed something?

deleteFile(cachedNewDataManifest.path());
this.cachedNewDataManifest = null;
if (cachedNewDataManifests != null) {
List<ManifestFile> committedNewManifests = Lists.newArrayList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: committedNewManifests -> committedNewDataManifests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

private boolean closed = false;

public RollingManifestWriter(
FileIO fileIO,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is still needed? We switched to a lazy writer that is initialized only when there is a new entry to write. The same also applies to cases when we roll to a new file. It can only happen if there is a pending entry to write. If I am not missing anything, we can drop FileIO here and simplify closeCurrentWriter.

@ConeyLiu
Copy link
Contributor Author

@aokolnychyi thanks for the details reviewing. The empty file checking is not necessary. Reverted related code changes.

@aokolnychyi aokolnychyi merged commit a7a09d4 into apache:master Jul 28, 2023
41 checks passed
@aokolnychyi
Copy link
Contributor

Thanks, @ConeyLiu! Thanks everyone for reviewing!

@ConeyLiu
Copy link
Contributor Author

Thanks @aokolnychyi for mering this. Thanks @rdblue @RussellSpitzer @szehon-ho @amogh-jahagirdar for reviewing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants