New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-25615: Fix Hive on tez will generate at least one MapContainer per 0 length file #4738
Conversation
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information The version of Java (11.0.8) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17. |
@@ -35,6 +35,9 @@ public class ColumnarSplitSizeEstimator implements SplitSizeEstimator { | |||
@Override | |||
public long getEstimatedSize(InputSplit inputSplit) throws IOException { | |||
long colProjSize = inputSplit.getLength(); | |||
if (colProjSize == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original code had two paths to overwrite the value provided from inputSplit.getLength
and now we skip those paths. Also, what if the columnar projection size or the inner split has 0 bytes? And the original root cause of the issue is at the end of this method:
if (colProjSize <= 0) {
/* columnar splits of unknown size - estimate worst-case */
return Integer.MAX_VALUE;
}
What about changing colProjSize <= 0
to colProjSize < 0
so that we can keep the original logic and fix the Integer.MAX_VALUE issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,thanks for review the code
The reason for judging the split length is 0 at the beginning is to keep the original logic.
I think,the original logic is that if the columnar projection size or the inner split has 0 bytes, it does not mean that the length of this split is 0. Returning Integer.MAX_VALUE is a safer method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, I didn't know about that behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that 0 things was explicitly added as part of https://issues.apache.org/jira/browse/HIVE-13821 for ACID
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
What changes were proposed in this pull request?
If the length of the InputSplit is found to be 0, return 0 immediately.
Why are the changes needed?
When tez read a table with many 0 length files, ColumnarSplitSizeEstimator will return Integer.MAX_VALUE bytes length for every 0 length file.Then,TezSplitGrouper will treat those files as big files,and generate at least one MapContainer per 0 file to handle it.This is incorrect and even wasteful.
Does this PR introduce any user-facing change?
NO
Is the change a dependency upgrade?
NO
How was this patch tested?
org.apache.hadoop.hive.ql.io.orc.TestInputOutputFormat#testSplitSizeEstimator