[HUDI-3445] Support Clustering Command Based on Call Procedure Command for Spark SQL #4901
Conversation
@XuQianJin-Stars Do you have some time to review this PR? Thanks. Maybe we can add this to HUDI-3161.
@huberylee Thank you very much for contributing. Please fix the CI build.
Well, let me review this PR.
Hi @huberylee, please fix the CI build.
@xiarixiaoyao @XuQianJin-Stars Sorry for the late reply, I will fix it right now.
@hudi-bot run azure
public List<String> getMatchedPartitions(HoodieWriteConfig config, List<String> partitionPaths) {
  String partitionSelected = config.getClusteringPartitionSelected();
  if (!StringUtils.isNullOrEmpty(partitionSelected)) {
    return Arrays.asList(partitionSelected.split(","));
Is the delimiter a comma by default?
Yes, the filtered partitions are separated by commas when pruning partitions; see the end of org.apache.spark.sql.hudi.command.procedures.RunClusteringProcedure#prunePartition for more detail.
Should this be changed to a constant, or set via a writer option?
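For illustration, here is a minimal sketch of the round trip discussed above, assuming the pruned partition paths are joined with a comma before being handed to the write config and split again in getMatchedPartitions. The PARTITION_DELIMITER constant and the sample values are hypothetical, not identifiers from this PR.

val PARTITION_DELIMITER = "," // hypothetical constant, as suggested in the comment above

// Pruned partition paths joined into a single config value (illustrative values only).
val prunedPartitions = Seq("year=2022/month=01", "year=2022/month=02")
val partitionSelected = prunedPartitions.mkString(PARTITION_DELIMITER)

// Later, the clustering strategy splits on the same delimiter, mirroring
// partitionSelected.split(",") in getMatchedPartitions above.
val matchedPartitions = partitionSelected.split(PARTITION_DELIMITER).toSeq
assert(matchedPartitions == prunedPartitions)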
case BinaryType => value.getBytes(Charset.forName("utf-8"))
case BooleanType => value.toBoolean
case DoubleType => value.toDouble
case _: DecimalType => Decimal(BigDecimal(value))
Doesn't the precision of the decimal need to be processed?
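For reference, a hedged sketch of how the DecimalType branch could carry the declared precision and scale, addressing the question above; this is only an illustration under the assumption that the target DecimalType is in scope, not the PR's actual implementation.

import org.apache.spark.sql.types.{DataType, Decimal, DecimalType}

// Sketch only: convert a string to Spark's Decimal using the declared precision and scale
// of the target DecimalType instead of whatever BigDecimal infers from the literal.
def toDecimalValue(value: String, dataType: DataType): Any = dataType match {
  case dt: DecimalType => Decimal(BigDecimal(value), dt.precision, dt.scale)
  case _               => value
}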
 * [ORDER BY (col_name1 [, ...] ) ]
 */
private val PARAMETERS = Array[ProcedureParameter](
  ProcedureParameter.optional(0, "table", DataTypes.StringType, None),
table is also optional, not required?
Yes, in both RunClusteringProcedure and ShowClusteringProcedure, table and path are optional, so it's fine to call those procedures with either a table or a path. What should be noted is that one of them must be set in each specific call, and this is guaranteed by org.apache.spark.sql.hudi.command.procedures.BaseProcedure#getBasePath.
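A rough sketch of the "one of table or path must be set" rule described above; it is illustrative only, and resolveBasePathFromCatalog is a hypothetical stand-in for the catalog lookup that BaseProcedure actually performs.

// Hypothetical helper: resolves a table name to its base path (placeholder logic).
def resolveBasePathFromCatalog(tableName: String): String =
  s"/warehouse/$tableName"

// Illustrative validation: at least one of table/path must identify the target table.
def getBasePath(tableName: Option[String], tablePath: Option[String]): String =
  tableName.map(resolveBasePathFromCatalog)
    .orElse(tablePath)
    .getOrElse(throw new IllegalArgumentException(
      "Either the table parameter or the path parameter must be provided"))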
class ShowClusteringProcedure extends BaseProcedure with ProcedureBuilder with SparkAdapterSupport with Logging {
  private val PARAMETERS = Array[ProcedureParameter](
    ProcedureParameter.optional(0, "table", DataTypes.StringType, None),
Ditto for table.
This PR looks good overall. Thanks @huberylee
@huberylee Can you please add a description for this PR?
 * [[StructField]] object for every field of the provided [[StructType]], recursively.
 *
 * For example, following struct
 * <pre>
Can you please make sure that the formatting of the comment is preserved? It has become practically unreadable.
Sorry, I'll reformat this comment in #4945
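As a side note on the quoted doc comment, here is a standalone sketch of the behavior it describes, i.e. visiting every field of a StructType recursively; it is an illustration only, not the method the comment documents.

import org.apache.spark.sql.types._

// Collect the qualified name of every field in a StructType, recursing into nested structs.
def collectFieldNames(struct: StructType, prefix: String = ""): Seq[String] =
  struct.fields.toSeq.flatMap { field =>
    val name = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case nested: StructType => name +: collectFieldNames(nested, name)
      case _                  => Seq(name)
    }
  }

// For struct<a: int, b: struct<c: string>> this yields Seq("a", "b", "b.c").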
…d for Spark SQL (apache#4901) * [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL * [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL * [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL Co-authored-by: shibei <huberylee.li@alibaba-inc.com>
Tips
What is the purpose of the pull request
Supporting a clustering command for Spark SQL.
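As a hedged usage sketch (for example from a spark-shell session), the command could be invoked as below, assuming the procedures register under the SQL names run_clustering and show_clustering; the parameter names follow the table/path parameters discussed in the review thread, and the values are illustrative.

// Cluster a table identified by name, or by its base path (assumed SQL procedure names).
spark.sql("CALL run_clustering(table => 'hudi_tbl')")
spark.sql("CALL run_clustering(path => '/tmp/hudi/hudi_tbl')")
// Show clustering activity for the table (assumed SQL procedure name).
spark.sql("CALL show_clustering(table => 'hudi_tbl')")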
Brief change log
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.