Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-25615: Fix Hive on tez will generate at least one MapContainer per 0 length file #4738

Closed
wants to merge 1 commit into from

Conversation

bwzheng2010
Copy link

What changes were proposed in this pull request?

If the length of the InputSplit is found to be 0, return 0 immediately.

Why are the changes needed?

When tez read a table with many 0 length files, ColumnarSplitSizeEstimator will return Integer.MAX_VALUE bytes length for every 0 length file.Then,TezSplitGrouper will treat those files as big files,and generate at least one MapContainer per 0 file to handle it.This is incorrect and even wasteful.

Does this PR introduce any user-facing change?

NO

Is the change a dependency upgrade?

NO

How was this patch tested?

org.apache.hadoop.hive.ql.io.orc.TestInputOutputFormat#testSplitSizeEstimator

@sonarcloud
Copy link

sonarcloud bot commented Sep 22, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot E 1 Security Hotspot
Code Smell A 4 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

warning The version of Java (11.0.8) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.
Read more here

@@ -35,6 +35,9 @@ public class ColumnarSplitSizeEstimator implements SplitSizeEstimator {
@Override
public long getEstimatedSize(InputSplit inputSplit) throws IOException {
long colProjSize = inputSplit.getLength();
if (colProjSize == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code had two paths to overwrite the value provided from inputSplit.getLength and now we skip those paths. Also, what if the columnar projection size or the inner split has 0 bytes? And the original root cause of the issue is at the end of this method:

    if (colProjSize <= 0) {
      /* columnar splits of unknown size - estimate worst-case */
      return Integer.MAX_VALUE;
    }

What about changing colProjSize <= 0 to colProjSize < 0 so that we can keep the original logic and fix the Integer.MAX_VALUE issue?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,thanks for review the code

The reason for judging the split length is 0 at the beginning is to keep the original logic.

I think,the original logic is that if the columnar projection size or the inner split has 0 bytes, it does not mean that the length of this split is 0. Returning Integer.MAX_VALUE is a safer method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I didn't know about that behaviour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that 0 things was explicitly added as part of https://issues.apache.org/jira/browse/HIVE-13821 for ACID

Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

@github-actions github-actions bot added the stale label Dec 25, 2023
@github-actions github-actions bot closed this Jan 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants