[HUDI-1764] Add Hudi-CLI support for clustering #2773
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2773 +/- ##
=============================================
+ Coverage 52.30% 69.61% +17.31%
+ Complexity 3689 373 -3316
=============================================
Files 483 54 -429
Lines 23099 1998 -21101
Branches 2460 236 -2224
=============================================
- Hits 12082 1391 -10691
+ Misses 9949 475 -9474
+ Partials 1068 132 -936
@@ -156,15 +155,15 @@ private int doCluster(JavaSparkContext jsc) throws Exception {
}

@TestOnly
public Option<String> doSchedule() throws Exception {
Returning the schedule instant time would be clearer.
The schedule instant time is already in HoodieClusteringJob.Config. If doSchedule() succeeds and returns 0, we should be able to get the clustering instant time from the config.
I am trying to use the same pattern as doSchedule() of HoodieCompactor. Correct me if I misunderstand it.
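For illustration, the pattern being discussed could be sketched roughly as follows. This is a hypothetical simplification: the real HoodieCompactor/HoodieClusteringJob classes carry far more configuration, and the Config class here is a stand-in for HoodieClusteringJob.Config.

```java
// Hypothetical sketch of the doSchedule() pattern discussed above: the
// instant time lives in the job's config, doSchedule() returns a status
// code, and the caller reads the scheduled instant back from the config.
public class ClusteringJobSketch {
    // Stand-in for HoodieClusteringJob.Config
    public static class Config {
        public String clusteringInstantTime;
    }

    private final Config cfg;

    public ClusteringJobSketch(Config cfg) {
        this.cfg = cfg;
    }

    // Returns 0 on success; the scheduled instant remains in cfg.
    public int doSchedule() {
        if (cfg.clusteringInstantTime == null) {
            return -1; // no instant time to schedule against
        }
        // ... build and persist the clustering plan for cfg.clusteringInstantTime ...
        return 0;
    }
}
```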
if (args.length > 6) {
  configs.addAll(Arrays.asList(args).subList(6, args.length));
}
returnCode = cluster(jsc, args[1], args[2], args[3], 1, args[4], 0, true, propsFilePath, configs);
Having the schedule step generate clusteringInstant would be more reasonable, because a user-set instant time may conflict with Hudi's own instants. For more information, see the comments on #2379.
Yes. The clustering instant here (args[4]) is generated by HoodieActiveTimeline.createNewInstantTime(); at Line 57 of ClusteringCommand.java above. Users cannot set the instant time directly for clustering.
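For reference, HoodieActiveTimeline.createNewInstantTime() produces a timestamp-based instant. A rough stand-alone approximation looks like this; the exact yyyyMMddHHmmss format is an assumption about this Hudi version, not taken from this PR.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class InstantTimeSketch {
    // Approximates HoodieActiveTimeline.createNewInstantTime(): a 14-digit
    // yyyyMMddHHmmss timestamp (the exact format is an assumption here).
    public static String createNewInstantTime() {
        return new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
    }
}
```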
@@ -1013,26 +1014,22 @@ public void testHoodieAsyncClusteringJob() throws Exception {
HoodieDeltaStreamer ds = new HoodieDeltaStreamer(cfg, jsc);
deltaStreamerTestRunner(ds, cfg, (r) -> {
TestHelpers.assertAtLeastNCommits(2, tableBasePath, dfs);
String scheduleClusteringInstantTime = HoodieActiveTimeline.createNewInstantTime();
This changes the HoodieClusteringJob usage mode: users of HoodieClusteringJob now need to call HoodieActiveTimeline.createNewInstantTime() first. Can we stay compatible with the old usage mode?
Sure, I will make it compatible with the old usage mode. The behavior will be:
- if the user provides an instant time, we will use it to schedule clustering and return it to the user.
- if the user doesn't provide an instant time, we will generate one and return it to the user.
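That fallback behavior could look roughly like this. It is only a sketch: generateInstantTime() stands in for HoodieActiveTimeline.createNewInstantTime(), and the timestamp format is an assumption.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class ScheduleCompat {
    // Sketch of the backward-compatible behavior described above: use the
    // user-provided instant time if present, otherwise generate a new one.
    public static String resolveInstantTime(String userProvided) {
        if (userProvided != null && !userProvided.isEmpty()) {
            return userProvided;
        }
        return generateInstantTime();
    }

    // Stand-in for HoodieActiveTimeline.createNewInstantTime().
    static String generateInstantTime() {
        return new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
    }
}
```

Either way, the resolved instant time is returned to the user so the subsequent run step can reference it.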
Yes. We also have a doc for async compaction usage:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance
unspecifiedDefaultValue = "") final String propsFilePath,
@CliOption(key = "hoodieConfigs", help = "Any configuration that can be set in the properties file can be passed here in the form of an array",
unspecifiedDefaultValue = "") final String[] configs) throws Exception {
HoodieTableMetaClient client = HoodieCLI.getTableMetaClient();
Why don't we need initFS here, just like the compaction command does?
HoodieTableMetaClient client = checkAndGetMetaClient();
boolean initialized = HoodieCLI.initConf();
HoodieCLI.initFS(initialized);
Good catch! Thanks.
if (exitCode != 0) {
  return "Failed to schedule clustering for " + clusteringInstantTime;
}
return "Attempted to schedule clustering for " + clusteringInstantTime;
"Succeed to schedule clustering for " + clusteringInstantTime
Updated.
if (exitCode != 0) {
  return "Failed to run clustering for " + clusteringInstantTime;
}
return "Clustering successfully completed for " + clusteringInstantTime;
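Both snippets in this review follow the same exit-code-to-message shape; a stand-alone sketch of that pattern (the class name is hypothetical, and the success wording follows the review suggestion) is:

```java
public class ClusteringResultMessage {
    // Maps the Spark job's exit code to a user-facing CLI message,
    // mirroring the pattern in the snippets above.
    public static String runMessage(int exitCode, String clusteringInstantTime) {
        if (exitCode != 0) {
            return "Failed to run clustering for " + clusteringInstantTime;
        }
        return "Succeed to run clustering for " + clusteringInstantTime;
    }
}
```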
"Succeed to run clustering for " + clusteringInstantTime
Updated.
@jintaoguan added some minor comments
@lw309637554 I updated the PR according to your comments above. Please take a look and let me know if it looks good to you. Thanks.
LGTM, @satishkotha @n3nash can you also review?
@jintaoguan LGTM. Can you raise a PR to update the documentation on the CLI page and add example command-line screenshots? The documentation is in the 'asf-site' branch; see "content/docs/deployment.html" in that branch.
Sure. Will do that.
@jintaoguan hello, can you open a new issue and raise a PR to update the documentation on the CLI page? Then I can merge this PR.
@lw309637554 I have opened a new issue (https://issues.apache.org/jira/browse/HUDI-1813) for updating the documentation of the CLI. Thank you.
@satishkotha @lw309637554 is this now good to go?
It is good.
What is the purpose of the pull request
Currently, Hudi-CLI doesn't have the capability to schedule or run clustering, so we would like to add it to the Hudi-CLI tool.
Brief change log
- ClusteringCommands: to support scheduling and running clustering from the Hudi-CLI tool.
- hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java: to support clustering.
Verify this pull request
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
Necessary doc changes done or have another open PR