HIVE-25615: Fix Hive on tez will generate at least one MapContainer per 0 length file #4738

bwzheng2010 · 2023-09-22T11:18:05Z

What changes were proposed in this pull request?

If the length of the InputSplit is found to be 0, return 0 immediately.

Why are the changes needed?

When Tez reads a table with many 0-length files, ColumnarSplitSizeEstimator returns Integer.MAX_VALUE bytes as the estimated length of every 0-length file. TezSplitGrouper then treats those files as big files and generates at least one MapContainer per 0-length file to handle them. This is incorrect and wasteful.
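To illustrate the behaviour described above, here is a minimal, self-contained sketch. It is not the Hive source: the class SplitSizeSketch and its method names are hypothetical, and a nullable Long stands in for "this split is columnar and reports a projection size". It only models the estimate before and after the guard proposed in this PR.

```java
final class SplitSizeSketch {
    /**
     * Simplified model of the estimate before this patch.
     * columnarProjectionSize == null means "not a columnar split" (an assumption of this sketch).
     */
    static long estimateBefore(long splitLength, Long columnarProjectionSize) {
        long colProjSize = splitLength;
        if (columnarProjectionSize != null) {
            // A columnar split overrides the raw length with its projection size.
            colProjSize = columnarProjectionSize;
        }
        if (colProjSize <= 0) {
            // Unknown size is treated as the worst case, which also catches empty files.
            return Integer.MAX_VALUE;
        }
        return colProjSize;
    }

    /** Same model with the guard from this PR: an empty split is reported as 0 bytes. */
    static long estimateAfter(long splitLength, Long columnarProjectionSize) {
        if (splitLength == 0) {
            return 0;
        }
        return estimateBefore(splitLength, columnarProjectionSize);
    }

    public static void main(String[] args) {
        System.out.println(estimateBefore(0, null)); // 2147483647
        System.out.println(estimateAfter(0, null));  // 0
    }
}
```

In this model, a thousand empty files add up to roughly two terabytes of estimated input before the guard and to zero bytes after it.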

Does this PR introduce any user-facing change?

NO

Is the change a dependency upgrade?

NO

How was this patch tested?

org.apache.hadoop.hive.ql.io.orc.TestInputOutputFormat#testSplitSizeEstimator


sonarcloud · 2023-09-22T12:46:11Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
1 Security Hotspot
4 Code Smells

No Coverage information
No Duplication information

The version of Java (11.0.8) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.

InvisibleProgrammer · 2023-09-25T06:43:08Z

ql/src/java/org/apache/hadoop/hive/ql/exec/tez/ColumnarSplitSizeEstimator.java

@@ -35,6 +35,9 @@ public class ColumnarSplitSizeEstimator implements SplitSizeEstimator {
  @Override
  public long getEstimatedSize(InputSplit inputSplit) throws IOException {
    long colProjSize = inputSplit.getLength();
+    if (colProjSize == 0) {


The original code had two paths to overwrite the value provided from inputSplit.getLength and now we skip those paths. Also, what if the columnar projection size or the inner split has 0 bytes? And the original root cause of the issue is at the end of this method:

  if (colProjSize <= 0) {
    /* columnar splits of unknown size - estimate worst-case */
    return Integer.MAX_VALUE;
  }

What about changing colProjSize <= 0 to colProjSize < 0 so that we can keep the original logic and fix the Integer.MAX_VALUE issue?

Hi, thanks for reviewing the code.

The reason for checking whether the split length is 0 at the very beginning is to keep the original logic intact.

I think the original logic is that if the columnar projection size or the inner split has 0 bytes, it does not mean that the length of this split is 0. Returning Integer.MAX_VALUE in that case is the safer choice.

Thank you, I didn't know about that behaviour.

I think the 0 check was explicitly added as part of https://issues.apache.org/jira/browse/HIVE-13821 for ACID.
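The two options discussed in this review thread can be compared with a small sketch. It is a simplified model rather than the Hive code: the class ZeroCheckOptions and its methods are hypothetical, and a nullable Long again stands in for a columnar split's projection size. Under those assumptions, the options agree for empty files and differ only when a non-empty split reports a projection size of 0.

```java
final class ZeroCheckOptions {
    // projectionSize == null means "not a columnar split" (an assumption of this sketch).

    /** Option A (this pull request): return 0 early for an empty split, keep `<= 0` at the end. */
    static long earlyReturn(long splitLength, Long projectionSize) {
        if (splitLength == 0) {
            return 0;
        }
        long colProjSize = (projectionSize != null) ? projectionSize : splitLength;
        return (colProjSize <= 0) ? Integer.MAX_VALUE : colProjSize;
    }

    /** Option B (review suggestion): no early return, final check relaxed to `< 0`. */
    static long relaxedCheck(long splitLength, Long projectionSize) {
        long colProjSize = (projectionSize != null) ? projectionSize : splitLength;
        return (colProjSize < 0) ? Integer.MAX_VALUE : colProjSize;
    }

    public static void main(String[] args) {
        // Empty file: both options report 0 bytes.
        System.out.println(earlyReturn(0, null));   // 0
        System.out.println(relaxedCheck(0, null));  // 0
        // Non-empty split with a projection size of 0: the options diverge.
        System.out.println(earlyReturn(1024, 0L));  // 2147483647
        System.out.println(relaxedCheck(1024, 0L)); // 0
    }
}
```

Option A leaves the worst-case estimate in place for the case raised in the reply, where a projection size of 0 does not imply an empty split.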

github-actions · 2023-12-25T00:20:22Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

HIVE-25615: Fix Hive on tez will generate at least one MapContainer p…

66e6574

…er 0 length file

github-actions bot requested a review from abstractdog September 22, 2023 11:18

asf-ci-hive added the tests pending label Sep 22, 2023

asf-ci-hive added tests passed and removed tests pending labels Sep 22, 2023

InvisibleProgrammer reviewed Sep 25, 2023


github-actions bot added the stale label Dec 25, 2023

github-actions bot closed this Jan 1, 2024


bwzheng2010 commented Sep 22, 2023

sonarcloud bot commented Sep 22, 2023

InvisibleProgrammer Sep 25, 2023

bwzheng2010 Sep 25, 2023

InvisibleProgrammer Sep 26, 2023

ayushtkn Oct 25, 2023

github-actions bot commented Dec 25, 2023
