Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-1764] Add Hudi-CLI support for clustering #2773

Merged
merged 10 commits into from Apr 20, 2021
Merged

[HUDI-1764] Add Hudi-CLI support for clustering #2773

merged 10 commits into from Apr 20, 2021

Conversation

jintaoguan
Copy link
Contributor

@jintaoguan jintaoguan commented Apr 6, 2021

Tips

What is the purpose of the pull request

Currently, Hudi-CLI doesn't have the capability to schedule or run clustering.
So I we would like to add it to Hudi-CLI tool.

Brief change log

  • Added ClusteringCommands to support schedule and run clustering from Hudi-CLI tool.
  • Updated hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java to support clustering.

Verify this pull request

  • This pull request is already covered by existing tests, such as TestHoodieDeltaStreamer.
  • Manually verified scheduling and running clustering from Hudi-CLI on Yarn cluster.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

  • Necessary doc changes done or have another open PR

@jintaoguan jintaoguan changed the title Cli clustering support [HUDI-1764] Cli clustering support Apr 6, 2021
@codecov-io
Copy link

codecov-io commented Apr 6, 2021

Codecov Report

Merging #2773 (582e348) into master (920537c) will increase coverage by 17.31%.
The diff coverage is 0.00%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master    #2773       +/-   ##
=============================================
+ Coverage     52.30%   69.61%   +17.31%     
+ Complexity     3689      373     -3316     
=============================================
  Files           483       54      -429     
  Lines         23099     1998    -21101     
  Branches       2460      236     -2224     
=============================================
- Hits          12082     1391    -10691     
+ Misses         9949      475     -9474     
+ Partials       1068      132      -936     
Flag Coverage Δ Complexity Δ
hudicli ? ?
hudiclient ? ?
hudicommon ? ?
hudiflink ? ?
hudihadoopmr ? ?
hudisparkdatasource ? ?
hudisync ? ?
huditimelineservice ? ?
hudiutilities 69.61% <0.00%> (-0.13%) 0.00 <0.00> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
...org/apache/hudi/utilities/HoodieClusteringJob.java 62.50% <0.00%> (-2.72%) 9.00 <0.00> (ø)
...apache/hudi/utilities/deltastreamer/DeltaSync.java 71.08% <0.00%> (-0.35%) 55.00% <0.00%> (-1.00%)
.../common/bloom/HoodieDynamicBoundedBloomFilter.java
...java/org/apache/hudi/common/fs/StorageSchemes.java
...e/hudi/common/table/timeline/dto/FileGroupDTO.java
...he/hudi/hadoop/SafeParquetRecordReaderWrapper.java
...n/java/org/apache/hudi/common/HoodieCleanStat.java
.../hudi/common/config/SerializableConfiguration.java
...e/hudi/exception/HoodieDeltaStreamerException.java
...org/apache/hudi/common/model/HoodieFileFormat.java
... and 421 more

@jintaoguan jintaoguan changed the title [HUDI-1764] Cli clustering support [HUDI-1764] Hudi-CLI support for clustering Apr 6, 2021
@@ -156,15 +155,15 @@ private int doCluster(JavaSparkContext jsc) throws Exception {
}

@TestOnly
public Option<String> doSchedule() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return the schedule instant time will be more clear.

Copy link
Contributor Author

@jintaoguan jintaoguan Apr 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schedule instant time is already in HoodieClusteringJob.Config. If doSchedule() succeeds and returns 0, we should be able to get the clustering instant time from the config.
I am trying to use the same patern as doSchedule() of HoodieCompactor. Correct me if I misunderstand it.

if (args.length > 6) {
configs.addAll(Arrays.asList(args).subList(6, args.length));
}
returnCode = cluster(jsc, args[1], args[2], args[3], 1, args[4], 0, true, propsFilePath, configs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clusteringInstant use schedule generate will be more resonable. Because user set instant time may conflict with hudi. More information can see comments of #2379

Copy link
Contributor Author

@jintaoguan jintaoguan Apr 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The Clustering Instant here (args[4]) is generated by HoodieActiveTimeline.createNewInstantTime(); at Line 57 of ClusteringCommand.java above. Users cannot set instant time directly for clustering.

@jintaoguan jintaoguan changed the title [HUDI-1764] Hudi-CLI support for clustering [HUDI-1764] Add Hudi-CLI support for clustering Apr 7, 2021
@@ -1013,26 +1014,22 @@ public void testHoodieAsyncClusteringJob() throws Exception {
HoodieDeltaStreamer ds = new HoodieDeltaStreamer(cfg, jsc);
deltaStreamerTestRunner(ds, cfg, (r) -> {
TestHelpers.assertAtLeastNCommits(2, tableBasePath, dfs);
String scheduleClusteringInstantTime = HoodieActiveTimeline.createNewInstantTime();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this have change HoodieClusteringJob usage mode, Now if user use HoodieClusteringJob need first HoodieActiveTimeline.createNewInstantTime(); Can we compatibility old usage mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I will make it compatible with the old usage mode. The behavior will be

  1. if the user provides an instant time, we will use it to schedule clustering and return it to the user.
  2. if the user doesn't provide an instant time, we will generate one and return it to the user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unspecifiedDefaultValue = "") final String propsFilePath,
@CliOption(key = "hoodieConfigs", help = "Any configuration that can be set in the properties file can be passed here in the form of an array",
unspecifiedDefaultValue = "") final String[] configs) throws Exception {
HoodieTableMetaClient client = HoodieCLI.getTableMetaClient();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we do not need initfs just like compaction command?
HoodieTableMetaClient client = checkAndGetMetaClient();
boolean initialized = HoodieCLI.initConf();
HoodieCLI.initFS(initialized);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Thanks.

if (exitCode != 0) {
return "Failed to schedule clustering for " + clusteringInstantTime;
}
return "Attempted to schedule clustering for " + clusteringInstantTime;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Succeed to schedule clustering for " + clusteringInstantTime

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

if (exitCode != 0) {
return "Failed to run clustering for " + clusteringInstantTime;
}
return "Clustering successfully completed for " + clusteringInstantTime;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Succeed to run clustering for " + clusteringInstantTime

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@lw309637554
Copy link
Contributor

@jintaoguan add some minor comments

@jintaoguan
Copy link
Contributor Author

@lw309637554 I updated the PR according to your comments above. Please take a look and let me know if it looks good to you. Thanks.

@lw309637554
Copy link
Contributor

LGTM , @satishkotha @n3nash can you also review?

@satishkotha
Copy link
Member

satishkotha commented Apr 9, 2021

@jintaoguan LGTM. Can you raise PR to update documentation on CLI page and add example command line screenshots? the documentation is in 'asf-site' branch. See "content/docs/deployment.html" in the branch.

@jintaoguan
Copy link
Contributor Author

Sure. Will do that.

@lw309637554
Copy link
Contributor

@jintaoguan hello ,can you open a new issue and raise PR to update documentation on CLI page. I can merge this pr

@vinothchandar vinothchandar added this to Ready For Review in PR Tracker Board Apr 15, 2021
@jintaoguan
Copy link
Contributor Author

jintaoguan commented Apr 19, 2021

@lw309637554 I have opened a new issue (https://issues.apache.org/jira/browse/HUDI-1813) for updating the documentation of CLI. Thank you.

@vinothchandar
Copy link
Member

@satishkotha @lw309637554 is this now good to go?

@vinothchandar vinothchandar moved this from Opened PRs to Review in progress in PR Tracker Board Apr 19, 2021
@lw309637554
Copy link
Contributor

@satishkotha @lw309637554 is this now good to go?

it is good.

@satishkotha satishkotha merged commit 3253079 into apache:master Apr 20, 2021
PR Tracker Board automation moved this from Review in progress to Done Apr 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

None yet

5 participants