[HUDI-1764] Add Hudi-CLI support for clustering #2773
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2773 +/- ##
=============================================
+ Coverage 52.30% 69.61% +17.31%
+ Complexity 3689 373 -3316
=============================================
Files 483 54 -429
Lines 23099 1998 -21101
Branches 2460 236 -2224
=============================================
- Hits 12082 1391 -10691
+ Misses 9949 475 -9474
+ Partials 1068 132 -936
@@ -156,15 +155,15 @@ private int doCluster(JavaSparkContext jsc) throws Exception {
}

@TestOnly
public Option<String> doSchedule() throws Exception {
Returning the schedule instant time would be clearer.
The schedule instant time is already in HoodieClusteringJob.Config. If doSchedule() succeeds and returns 0, we should be able to get the clustering instant time from the config.
I am trying to use the same pattern as doSchedule() of HoodieCompactor. Correct me if I misunderstand it.
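For illustration, the pattern being discussed could be sketched roughly as follows. This is a hypothetical simplification: the real HoodieCompactor/HoodieClusteringJob classes carry far more configuration, and the Config class here is a stand-in for HoodieClusteringJob.Config.

```java
// Hypothetical sketch of the doSchedule() pattern discussed above: the
// instant time lives in the job's config, doSchedule() returns a status
// code, and the caller reads the scheduled instant back from the config.
public class ClusteringJobSketch {
    // Stand-in for HoodieClusteringJob.Config
    public static class Config {
        public String clusteringInstantTime;
    }

    private final Config cfg;

    public ClusteringJobSketch(Config cfg) {
        this.cfg = cfg;
    }

    // Returns 0 on success; the scheduled instant remains in cfg.
    public int doSchedule() {
        if (cfg.clusteringInstantTime == null) {
            return -1; // no instant time to schedule against
        }
        // ... build and persist the clustering plan for cfg.clusteringInstantTime ...
        return 0;
    }
}
```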
if (args.length > 6) {
  configs.addAll(Arrays.asList(args).subList(6, args.length));
}
returnCode = cluster(jsc, args[1], args[2], args[3], 1, args[4], 0, true, propsFilePath, configs);
Having the schedule step generate clusteringInstant would be more reasonable, because a user-set instant time may conflict with Hudi's own instants. For more information, see the comments on #2379.
Yes. The clustering instant here (args[4]) is generated by HoodieActiveTimeline.createNewInstantTime(); at Line 57 of ClusteringCommand.java above. Users cannot set the instant time directly for clustering.
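For reference, HoodieActiveTimeline.createNewInstantTime() produces a timestamp-based instant. A rough stand-alone approximation looks like this; the exact yyyyMMddHHmmss format is an assumption about this Hudi version, not taken from this PR.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class InstantTimeSketch {
    // Approximates HoodieActiveTimeline.createNewInstantTime(): a 14-digit
    // yyyyMMddHHmmss timestamp (the exact format is an assumption here).
    public static String createNewInstantTime() {
        return new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
    }
}
```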
@@ -1013,26 +1014,22 @@ public void testHoodieAsyncClusteringJob() throws Exception {
HoodieDeltaStreamer ds = new HoodieDeltaStreamer(cfg, jsc);
deltaStreamerTestRunner(ds, cfg, (r) -> {
TestHelpers.assertAtLeastNCommits(2, tableBasePath, dfs);
String scheduleClusteringInstantTime = HoodieActiveTimeline.createNewInstantTime();
This changes the HoodieClusteringJob usage mode: users of HoodieClusteringJob now need to call HoodieActiveTimeline.createNewInstantTime() first. Can we stay compatible with the old usage mode?
Sure, I will make it compatible with the old usage mode. The behavior will be:
- if the user provides an instant time, we will use it to schedule clustering and return it to the user.
- if the user doesn't provide an instant time, we will generate one and return it to the user.
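That fallback behavior could look roughly like this. It is only a sketch: generateInstantTime() stands in for HoodieActiveTimeline.createNewInstantTime(), and the timestamp format is an assumption.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class ScheduleCompat {
    // Sketch of the backward-compatible behavior described above: use the
    // user-provided instant time if present, otherwise generate a new one.
    public static String resolveInstantTime(String userProvided) {
        if (userProvided != null && !userProvided.isEmpty()) {
            return userProvided;
        }
        return generateInstantTime();
    }

    // Stand-in for HoodieActiveTimeline.createNewInstantTime().
    static String generateInstantTime() {
        return new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
    }
}
```

Either way, the resolved instant time is returned to the user so the subsequent run step can reference it.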
Yes. We also have a doc for async compaction usage:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance
unspecifiedDefaultValue = "") final String propsFilePath,
@CliOption(key = "hoodieConfigs", help = "Any configuration that can be set in the properties file can be passed here in the form of an array",
unspecifiedDefaultValue = "") final String[] configs) throws Exception {
HoodieTableMetaClient client = HoodieCLI.getTableMetaClient();
Why don't we need initFS here, just like the compaction command does?
HoodieTableMetaClient client = checkAndGetMetaClient();
boolean initialized = HoodieCLI.initConf();
HoodieCLI.initFS(initialized);
Good catch! Thanks.
if (exitCode != 0) {
  return "Failed to schedule clustering for " + clusteringInstantTime;
}
return "Attempted to schedule clustering for " + clusteringInstantTime;
"Succeed to schedule clustering for " + clusteringInstantTime
Updated.
if (exitCode != 0) {
  return "Failed to run clustering for " + clusteringInstantTime;
}
return "Clustering successfully completed for " + clusteringInstantTime;
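Both snippets in this review follow the same exit-code-to-message shape; a stand-alone sketch of that pattern (the class name is hypothetical, and the success wording follows the review suggestion) is:

```java
public class ClusteringResultMessage {
    // Maps the Spark job's exit code to a user-facing CLI message,
    // mirroring the pattern in the snippets above.
    public static String runMessage(int exitCode, String clusteringInstantTime) {
        if (exitCode != 0) {
            return "Failed to run clustering for " + clusteringInstantTime;
        }
        return "Succeed to run clustering for " + clusteringInstantTime;
    }
}
```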
"Succeed to run clustering for " + clusteringInstantTime
Updated.
@jintaoguan added some minor comments
@lw309637554 I updated the PR according to your comments above. Please take a look and let me know if it looks good to you. Thanks.
LGTM, @satishkotha @n3nash can you also review?
@jintaoguan LGTM. Can you raise a PR to update the documentation on the CLI page and add example command-line screenshots? The documentation is in the 'asf-site' branch; see "content/docs/deployment.html" in that branch.
Sure. Will do that.
@jintaoguan hello, can you open a new issue and raise a PR to update the documentation on the CLI page? Then I can merge this PR.
@lw309637554 I have opened a new issue (https://issues.apache.org/jira/browse/HUDI-1813) for updating the documentation of the CLI. Thank you.
@satishkotha @lw309637554 is this now good to go?
It is good.
What is the purpose of the pull request
Currently, Hudi-CLI doesn't have the capability to schedule or run clustering, so we would like to add it to the Hudi-CLI tool.
Brief change log
- ClusteringCommands: to support scheduling and running clustering from the Hudi-CLI tool.
- hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java: to support clustering.
Verify this pull request
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
Necessary doc changes done or have another open PR