Core: Optimize manifest evaluation for super wide tables #9147
Resolves issue #9118
@rdblue I would really appreciate it if you could take a look.
api/src/main/java/org/apache/iceberg/expressions/NamedReference.java
Thanks for raising this @irshadcc, this looks good to me. I've left two small comments, could you take a peek at those? Thanks for fixing this! 🙌
I've added the Javadoc and removed the empty line.
@Fokko Can we merge this PR?
c10ff54 to 05badc4
Kind ping @Fokko
Hey @irshadcc, sorry for the long wait here, and thanks for pinging me. Let's get this in 🚀
During the snapshot commit in MergingSnapshotProducer, Iceberg tries to merge manifest files to improve planning performance. For operations such as overwrite, MergingSnapshotProducer finds the manifest files that match deleteExpression and rewrites them, filtering out the deleted manifest entries.
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java#L330
While finding the manifests that match deleteExpression, we found that a new Schema is created every time the expression is bound. This has proven very expensive when there is a large number of manifest files (~25,000) for a super wide table (~35,000 columns).
https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/NamedReference.java#L41
To optimise manifest evaluation, we can add an asSchema() method to the StructType class so that a new Schema is not created every time the filter is bound.
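The idea can be illustrated with a minimal, self-contained sketch. It does not use Iceberg's classes: `ToySchema`, `ToyStructType`, `bindUncached`, and `bindCached` are hypothetical stand-ins invented for this example, and the assumption is only that building a schema indexes every column, so its cost grows with table width. The sketch compares building a schema on every bind with building it once and reusing it through `asSchema()`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AsSchemaSketch {

  // Hypothetical stand-in for a schema: construction indexes every column by
  // name, which is the costly part for a very wide table.
  public static final class ToySchema {
    public static int constructions = 0; // how many schemas have been built

    private final Map<String, Integer> idsByName = new HashMap<>();

    public ToySchema(List<String> columns) {
      constructions++;
      for (int i = 0; i < columns.size(); i++) {
        idsByName.put(columns.get(i), i + 1); // field IDs start at 1 here
      }
    }

    public Integer findId(String name) {
      return idsByName.get(name);
    }
  }

  // Hypothetical stand-in for a struct type that can hand out a cached schema.
  public static final class ToyStructType {
    private final List<String> columns;
    private volatile ToySchema schema; // built lazily, then reused

    public ToyStructType(List<String> columns) {
      this.columns = columns;
    }

    public List<String> columns() {
      return columns;
    }

    // Builds the schema on first use only (double-checked locking).
    public ToySchema asSchema() {
      ToySchema result = schema;
      if (result == null) {
        synchronized (this) {
          result = schema;
          if (result == null) {
            result = new ToySchema(columns);
            schema = result;
          }
        }
      }
      return result;
    }
  }

  // Binding that builds a fresh schema for every call.
  public static Integer bindUncached(ToyStructType struct, String name) {
    return new ToySchema(struct.columns()).findId(name);
  }

  // Binding that reuses the schema cached on the struct type.
  public static Integer bindCached(ToyStructType struct, String name) {
    return struct.asSchema().findId(name);
  }

  public static void main(String[] args) {
    List<String> columns = new ArrayList<>();
    for (int i = 0; i < 1_000; i++) {
      columns.add("col_" + i);
    }
    ToyStructType struct = new ToyStructType(columns);
    int manifests = 100; // one bind per manifest in this toy model

    ToySchema.constructions = 0;
    for (int i = 0; i < manifests; i++) {
      bindUncached(struct, "col_42");
    }
    System.out.println("schemas built without caching: " + ToySchema.constructions);

    ToySchema.constructions = 0;
    for (int i = 0; i < manifests; i++) {
      bindCached(struct, "col_42");
    }
    System.out.println("schemas built with asSchema(): " + ToySchema.constructions);
  }
}
```

In this toy model the uncached path does work proportional to manifests × columns, while the cached path pays the per-column cost once. The column and manifest counts are deliberately much smaller than the figures reported above so the sketch runs quickly.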