
[HUDI-3445] Support Clustering Command Based on Call Procedure Command for Spark SQL #4901

Merged: 3 commits into apache:master on Mar 4, 2022

Conversation

huberylee
Contributor

What is the purpose of the pull request

Support the clustering command for Spark SQL.

Brief change log

  • Add RunClusteringProcedure and ShowClusteringProcedure
  • Add test case in TestCallProcedure

Verify this pull request

(Please pick one of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@huberylee huberylee changed the title [WIP][HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL [WIP][HUDI-3445] Supporting Clustering Command Based on Call Procedure Command for Spark SQL Feb 24, 2022
@huberylee huberylee changed the title [WIP][HUDI-3445] Supporting Clustering Command Based on Call Procedure Command for Spark SQL [HUDI-3445] Supporting Clustering Command Based on Call Procedure Command for Spark SQL Feb 25, 2022
@xiarixiaoyao
Contributor

@XuQianJin-Stars do you have some time to review this PR? Thanks. Maybe we can add it to HUDI-3161.

@xiarixiaoyao
Contributor

@huberylee Thank you very much for contributing. Please fix the CI build.

@XuQianJin-Stars
Contributor

"@XuQianJin-Stars do you have some time to review this PR? Thanks. Maybe we can add it to HUDI-3161."

Well, let me review this PR.

@XuQianJin-Stars
Contributor

Hi @huberylee, please fix the CI build.

@huberylee
Contributor Author

huberylee commented Feb 28, 2022

@xiarixiaoyao @XuQianJin-Stars Sorry for the late reply, I will fix it right now.

@huberylee
Contributor Author

@hudi-bot run azure

@huberylee huberylee changed the title [HUDI-3445] Supporting Clustering Command Based on Call Procedure Command for Spark SQL [HUDI-3445] Support Clustering Command Based on Call Procedure Command for Spark SQL Mar 1, 2022
public List<String> getMatchedPartitions(HoodieWriteConfig config, List<String> partitionPaths) {
  String partitionSelected = config.getClusteringPartitionSelected();
  if (!StringUtils.isNullOrEmpty(partitionSelected)) {
    return Arrays.asList(partitionSelected.split(","));
Contributor

The delimiter is a comma by default?

Contributor Author

Yes, the filtered partitions are separated by commas when pruning partitions; see the end of org.apache.spark.sql.hudi.command.procedures.RunClusteringProcedure#prunePartition for more detail.

Contributor

"Yes, the filtered partitions are separated by commas when pruning partitions; see the end of org.apache.spark.sql.hudi.command.procedures.RunClusteringProcedure#prunePartition for more detail."

Should this be changed to a constant, or set as a writer option?
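The review thread above concerns the comma delimiter used to split the selected partitions. A minimal Java sketch of the idea, with the named constant the reviewer suggests; the class and method here are illustrative, not the PR's actual code:

```java
import java.util.Arrays;
import java.util.List;

public class PartitionSelection {
  // Named constant for the delimiter, as suggested in the review,
  // instead of a bare "," literal scattered through the code.
  static final String PARTITION_DELIMITER = ",";

  // If the user explicitly selected partitions (a comma-separated list),
  // use those; otherwise fall back to all discovered partition paths.
  static List<String> getMatchedPartitions(String partitionSelected, List<String> partitionPaths) {
    if (partitionSelected != null && !partitionSelected.isEmpty()) {
      return Arrays.asList(partitionSelected.split(PARTITION_DELIMITER));
    }
    return partitionPaths;
  }
}
```

Keeping the delimiter in one constant (or a writer option) means the pruning side and the parsing side cannot silently drift apart.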

case BinaryType => value.getBytes(Charset.forName("utf-8"))
case BooleanType => value.toBoolean
case DoubleType => value.toDouble
case _: DecimalType => Decimal(BigDecimal(value))
Contributor

Does the precision of the decimal not need to be handled?
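To illustrate the precision concern: parsing a decimal string as-is keeps whatever scale the input happened to have, rather than the target DecimalType's declared precision and scale. A hedged sketch of one way to enforce them, using java.math.BigDecimal instead of Spark's Decimal; the method name is hypothetical:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DecimalParse {
  // Parse a string for a decimal(precision, scale) target: force the
  // declared scale (rounding half-up) and reject values whose total
  // number of digits exceeds the declared precision.
  static BigDecimal toDecimal(String value, int precision, int scale) {
    BigDecimal d = new BigDecimal(value).setScale(scale, RoundingMode.HALF_UP);
    if (d.precision() > precision) {
      throw new IllegalArgumentException("value exceeds precision " + precision);
    }
    return d;
  }
}
```

Without the setScale step, "1.005" bound to a decimal(10, 2) column would keep scale 3 instead of being rounded to two places.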

* [ORDER BY (col_name1 [, ...] ) ]
*/
private val PARAMETERS = Array[ProcedureParameter](
ProcedureParameter.optional(0, "table", DataTypes.StringType, None),
Contributor

table is also optional, not required?

Contributor Author

Yes, in both RunClusteringProcedure and ShowClusteringProcedure, table and path are optional, and it is fine to call those procedures by either table or path. Note that one of them must be set in any specific call, and this is guaranteed by org.apache.spark.sql.hudi.command.procedures.BaseProcedure#getBasePath.
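For context, the user-facing syntax this PR enables would look roughly like the following; a sketch assuming the procedure names `run_clustering` and `show_clustering` and the `table`/`path` parameter names from the snippets above, with a placeholder table name:

```sql
-- cluster by table name
call run_clustering(table => 'hudi_table');

-- or by base path (one of table/path must be given, per BaseProcedure#getBasePath)
call run_clustering(path => '/tmp/hudi/hudi_table');

-- show clustering instants for a table
call show_clustering(table => 'hudi_table');
```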


class ShowClusteringProcedure extends BaseProcedure with ProcedureBuilder with SparkAdapterSupport with Logging {
private val PARAMETERS = Array[ProcedureParameter](
ProcedureParameter.optional(0, "table", DataTypes.StringType, None),
Contributor

Ditto for table.

@XuQianJin-Stars
Contributor

This PR looks good overall. Thanks @huberylee

@hudi-bot

hudi-bot commented Mar 3, 2022

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-runs the last Azure build

@xiarixiaoyao xiarixiaoyao merged commit 62f534d into apache:master Mar 4, 2022
@alexeykudinkin
Contributor

@huberylee can you please add a description for this PR?

* [[StructField]] object for every field of the provided [[StructType]], recursively.
*
* For example, following struct
* <pre>
Contributor

Can you please make sure that the formatting of the comment is preserved? It becomes practically unreadable.

Contributor Author

Sorry, I'll reformat this comment in #4945

vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
…d for Spark SQL (apache#4901)

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

Co-authored-by: shibei <huberylee.li@alibaba-inc.com>
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022
…d for Spark SQL (apache#4901)

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

Co-authored-by: shibei <huberylee.li@alibaba-inc.com>
5 participants