
HIVE-24669: Improve FileSystem usage in Hive::loadPartitionInternal #1893

Merged
Merged 8 commits on Jan 26, 2021

Conversation

pvargacl (Contributor)

What changes were proposed in this pull request?

Improve FileSystem usage in Hive::loadPartitionInternal for better performance on S3.

Why are the changes needed?

Performance improvement

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Current unit tests + local performance measurements.

return true;
}
path = path.getParent();
} while (!path.equals(basePath));
@deniskuzZ (Member), Jan 20, 2021:

Is it possible that path.equals(basePath) is never true, or that path becomes null?

pvargacl (Contributor, Author):

It will fail with a NullPointerException if it is called with an unrelated path; I will add the null check.
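
A minimal sketch of the null guard being discussed, assuming the loop walks from a child path up towards basePath; the class, method name, and the per-level check are placeholders, not the actual patch:

import org.apache.hadoop.fs.Path;

class AncestorWalkSketch {
  // Placeholder for the check performed on each level in the real loop.
  static boolean isTargetLevel(Path p) {
    return false;
  }

  static boolean walkUpTo(Path path, Path basePath) {
    do {
      if (isTargetLevel(path)) {
        return true;
      }
      path = path.getParent();
      // Path.getParent() returns null once the root is reached, so guard
      // against it when the method is called with a path unrelated to basePath.
    } while (path != null && !path.equals(basePath));
    return false;
  }
}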

@deniskuzZ (Member):

ok

@steveloughran (Contributor) left a comment:

This is significantly better on S3, as it will go from an O(dirs) number of LIST calls, at 500ms+ per directory, to one LIST per few hundred files; 200 on a versioned bucket, I think.

s3a and soon abfs will do async prefetch of results. If you can overlap any computation with it, then for large/deep directory trees even that list cost might be swallowed. But as that would be a more significant change, it's something to consider in future rather than do in this PR.

Do log the iterator's toString() at debug though; with the HADOOP-16830 patch, the s3a incremental iterators all print out the performance details of their network IO.
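
Hedged illustration of that suggestion (not the PR's code): consume a recursive listing and then log the iterator's toString() at debug, which on s3a with HADOOP-16830 includes the IO statistics of the underlying LIST calls. The class and method names here are made up for the example.

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ListingDebugExample {
  private static final Logger LOG = LoggerFactory.getLogger(ListingDebugExample.class);

  static void listAndLog(FileSystem fs, Path dir) throws IOException {
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      // ... process each file status ...
    }
    if (LOG.isDebugEnabled()) {
      // On s3a this prints the iterator's IO statistics after the listing is consumed.
      LOG.debug("Listing iterator: {}", it);
    }
  }
}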

@@ -117,6 +117,10 @@ public static long createTestFileId(
}
return result;
}
public static List<Path> listPath(final FileSystem fs, final Path path, final PathFilter filter,
@steveloughran (Contributor):

  1. If you could process the results with some computation per record, you would get the full benefit of the async page fetch offered by s3a and (soon) abfs; at 600ms a list for 200 records on S3, that's potentially a 3ms/record saving (see the sketch after this list).
  2. If listLocatedFileStatus() logged the toString() value of the iterator at debug, the S3A FS iterator would print out its IOStats, including the number of S3 list requests and min/mean/max durations.
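
A sketch of the per-record processing idea in point 1, under the assumption that the caller can supply the per-record work as a callback; the helper name and Consumer-based shape are illustrative, not part of the patch.

import java.io.IOException;
import java.util.function.Consumer;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

final class PerRecordListing {
  // Process each status as it arrives from the RemoteIterator instead of
  // materializing a List first, so computation overlaps with the store's
  // async page fetch.
  static void forEachFile(FileSystem fs, Path dir, Consumer<LocatedFileStatus> work)
      throws IOException {
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
    while (it.hasNext()) {
      work.accept(it.next());  // next page may be prefetched while this record is processed
    }
  }
}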

pvargacl (Contributor, Author):

Thanks @steveloughran for taking a look at this. I will try to take advantage of the async prefetch feature in later PRs; it seems promising, but it would need a bigger code change. I will check out the IOStats downstream, I have seen it is already available there.

@@ -117,6 +117,10 @@ public static long createTestFileId(
}
return result;
}
public static List<Path> listPath(final FileSystem fs, final Path path, final PathFilter filter,
final boolean recursive) throws IOException {
return listLocatedFileStatus(fs, path, filter, recursive).stream().map(FileStatus::getPath).collect(Collectors.toList());
Member:

formatting: put map/collect on new lines
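
One possible shape of the suggested formatting, with map and collect each on their own line (a fragment of the same utility method shown in the hunk above, not the final commit; listLocatedFileStatus is the helper from that class):

public static List<Path> listPath(final FileSystem fs, final Path path, final PathFilter filter,
    final boolean recursive) throws IOException {
  return listLocatedFileStatus(fs, path, filter, recursive).stream()
      .map(FileStatus::getPath)
      .collect(Collectors.toList());
}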

@deniskuzZ (Member) left a comment:

LGTM
