[HUDI-1054][Performance] Several performance fixes during finalizing writes #1768
Conversation
Force-pushed 12debb4 to 6fb6941
@bvaradar fyi
hudi-common/pom.xml
Outdated
<!-- Spark -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_${scala.binary.version}</artifactId>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_${scala.binary.version}</artifactId>
</dependency>
hudi-common is a base module, and it feels a little weird that it relies on spark-core/spark-sql. Could we remove them and move the getAllDataFilesForMarkers method to the hudi-client module, wdyt? cc @bvaradar @vinothchandar
yes.. we cannot depend on spark in hudi-common
Okay, so is the suggestion to move this method over to HoodieTable in the hudi-client module, considering that is the only place this method is used?
yes.. that ll do..
@umehrot2 Thanks for raising this.. Seems like an important difference w.r.t. HDFS.
Waiting on some more feedback.. left some minor comments
String pathStr = status.getPath().toString();
if (pathStr.endsWith(HoodieTableMetaClient.MARKER_EXTN)) {
  dataFiles.add(FSUtils.translateMarkerToDataPath(basePath, pathStr, instantTs, baseFileExtension));

public static Set<String> getAllDataFilesForMarkers(JavaSparkContext jsc, FileSystem fs, String basePath,
i think this is the reason for needing spark in hudi-common.. we can refactor the code into hudi-client.. In fact, #1755 has already modularized this more..
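For context, the marker-to-data-path translation in the snippet above is essentially string surgery: strip the temp-directory prefix for the instant and swap the marker extension for the base file extension. Here is a minimal sketch; the directory layout and the literal `.marker` suffix are assumptions for illustration, not the exact FSUtils implementation:

```java
public class MarkerPaths {
  // Hypothetical layout, for illustration only: a marker file at
  //   <basePath>/.hoodie/.temp/<instantTs>/<partition>/<file>.marker
  // maps back to the data file
  //   <basePath>/<partition>/<file><baseFileExtension>
  static String translateMarkerToDataPath(String basePath, String markerPath,
                                          String instantTs, String baseFileExtension) {
    String tempRoot = basePath + "/.hoodie/.temp/" + instantTs + "/";
    // Path relative to the temp root, e.g. "2020/01/f1_w1.marker"
    String relative = markerPath.substring(tempRoot.length());
    // Drop the marker suffix and re-attach the data file extension
    String withoutMarkerExtn = relative.substring(0, relative.lastIndexOf(".marker"));
    return basePath + "/" + withoutMarkerExtn + baseFileExtension;
  }
}
```

The actual code resolves paths through Hadoop's `Path`/`FileSystem` abstractions rather than raw strings, but the mapping it computes has this shape.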
- LOG.info("Removing marker directory=" + markerDir);
+ LOG.info("Removing marker directory = " + markerDir);

FileStatus[] fileStatuses = fs.listStatus(markerDir);
@umehrot2 so it seems like, for object stores, this is different.. and it makes complete sense to do parallel cleaning of individual files.
cc @n3nash should we have a flag to protect this for HDFS.. i.e. if the recursive delete works better there (IIUC), you might want to trade off fewer RPCs..? We can override defaults at the spark datasource level, and set these based on StorageSchemes as well.
@vinothchandar I don't think there is any difference between EmrFS (which uses S3) and HDFS w.r.t. the RPC calls made here. EmrFS just implements the HDFS interface, but the internals like RPC calls to the namenode etc. remain the same.
}, false);
}

parallelism = subDirectories.size() < parallelism ? subDirectories.size() : parallelism;
Math.min(subDirectories.size(), parallelism)?
Will address in the next revision, once I have more feedback as needed from @n3nash
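The suggested simplification, shown in isolation (variable names assumed from the diff above), just clamps the requested parallelism to the available units of work:

```java
public class ParallelismClamp {
  // Equivalent, more readable form of the ternary in the diff:
  // never ask Spark for more tasks than there are sub-directories.
  static int effectiveParallelism(int subDirCount, int configured) {
    return Math.min(subDirCount, configured);
  }
}
```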
}

parallelism = subDirectories.size() < parallelism ? subDirectories.size() : parallelism;
dataFiles.addAll(jsc.parallelize(subDirectories, parallelism).flatMap(directory -> {
similar question here.. cc @n3nash ..
marked this as a blocker for 0.6.0.. @n3nash can you please chime in with any side effects you see for HDFS?
@umehrot2 just landed the changes I mentioned. Can we rework this PR and try again? We can make things parallel, i.e. working for S3, for now, and then adjust for HDFS later on, so we should be able to close the loop faster. I do want to get this into 0.6.0, so please also let me know if you are unable to take a stab at this.
Working on it @vinothchandar. There has been quite a bit of refactoring it seems, which is making the rebasing tricky, as now these functions are being called from places which do not even have
Force-pushed 39108c1 to be8e6f4
@vinothchandar finally got the unit and integration tests to pass. This is ready for review.
LGTM overall.
One high level question though: this parallelizes till the first level only, right? So we are assuming this helps the common cases like date based tables with multiple years of data? I mean - if you only have a few years of data (say <10) and yyyy is the top level partitioning field, would this parallelization still help?
jsc.parallelize(markerDirSubPaths, parallelism).foreach(subPathStr -> {
  Path subPath = new Path(subPathStr);
  FileSystem fileSystem = subPath.getFileSystem(conf.get());
  fileSystem.delete(subPath, true);
note to self: this will still work when subPath is a file, i.e. non-partitioned tables
@vinothchandar you are right about this. It will parallelize only on the top level partition folders. I think this will still help with parallelization, and it would work best where there is only one level of partitioning. But I agree there is scope to further improve this by instead getting the leaf level partition directories, to help with the multi level partitioning scenario. Is it okay if I open a JIRA for this and pursue it separately?
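A possible shape for the leaf-level improvement described above: enumerate the deepest partition directories so each one becomes a unit of work for `jsc.parallelize(...)`, instead of only the first level. This sketch uses plain `java.nio` for illustration; the real change would go through Hadoop's `FileSystem` API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LeafDirs {
  // Collect leaf directories (those containing no sub-directories) under a
  // root, e.g. yyyy/mm/dd partitions, so deletes/listings can be
  // parallelized at the finest partition granularity.
  static List<Path> leafDirectories(Path root) throws IOException {
    try (Stream<Path> walk = Files.walk(root)) {
      return walk.filter(Files::isDirectory)
          .filter(dir -> {
            try (Stream<Path> children = Files.list(dir)) {
              return children.noneMatch(Files::isDirectory);
            } catch (IOException e) {
              throw new UncheckedIOException(e);
            }
          })
          .collect(Collectors.toList());
    }
  }
}
```

For a `yyyy/mm` layout with a handful of years, this yields years-times-months work units rather than just the few top-level year folders, which is the scenario the question above is concerned with.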
Sounds good. Let's lump this into the JIRA we have for marker file improvements more holistically?
Added a comment about this on https://issues.apache.org/jira/browse/HUDI-1138. Let me try to fix some conflicts that it is showing with the current master.
Force-pushed be8e6f4 to 8ad8a26
Force-pushed 8ad8a26 to 4cecc15
Hi guys, this PR helps a lot with doing archival work on S3! I also just found another performance issue and raised #3920 trying to fix it.
What is the purpose of the pull request
This PR makes several performance improvements, described in https://issues.apache.org/jira/browse/HUDI-1054, that are especially useful for S3 but will be beneficial for hudi performance in general too.
My sample test data set is 1 TB, with 8000 partitions and approximately 190000 files, for which finalizing writes used to take 35-40 minutes. With these changes I am able to bring it down to less than 5 minutes.
Brief change log
Verify this pull request
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.