[HUDI-3445] Support Clustering Command Based on Call Procedure Command for Spark SQL #4901
Conversation
@XuQianJin-Stars Do you have some time to review this PR? Thanks. Maybe we can add this to HUDI-3161.
@huberylee Thank you very much for contributing. Please fix the CI build.
Well, let me review this PR.
Hi @huberylee, please fix the CI build.
@xiarixiaoyao @XuQianJin-Stars Sorry for the late reply, I will fix it right now.
@hudi-bot run azure
public List<String> getMatchedPartitions(HoodieWriteConfig config, List<String> partitionPaths) {
  String partitionSelected = config.getClusteringPartitionSelected();
  if (!StringUtils.isNullOrEmpty(partitionSelected)) {
    return Arrays.asList(partitionSelected.split(","));
Is the delimiter a comma by default?
Yes, the filtered partitions are separated by commas when pruning partitions; see the end of org.apache.spark.sql.hudi.command.procedures.RunClusteringProcedure#prunePartition for more detail.
Should this be changed to a constant, or set via a writer option?
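For illustration, here is a minimal sketch of the round trip discussed above, assuming the pruned partition paths are joined with a comma before being handed to the write config and split again in getMatchedPartitions. The PARTITION_DELIMITER constant and the sample values are hypothetical, not identifiers from this PR.

val PARTITION_DELIMITER = "," // hypothetical constant, as suggested in the comment above

// Pruned partition paths joined into a single config value (illustrative values only).
val prunedPartitions = Seq("year=2022/month=01", "year=2022/month=02")
val partitionSelected = prunedPartitions.mkString(PARTITION_DELIMITER)

// Later, the clustering strategy splits on the same delimiter, mirroring
// partitionSelected.split(",") in getMatchedPartitions above.
val matchedPartitions = partitionSelected.split(PARTITION_DELIMITER).toSeq
assert(matchedPartitions == prunedPartitions)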
case BinaryType => value.getBytes(Charset.forName("utf-8"))
case BooleanType => value.toBoolean
case DoubleType => value.toDouble
case _: DecimalType => Decimal(BigDecimal(value))
Doesn't the precision of the decimal need to be processed?
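For reference, a hedged sketch of how the DecimalType branch could carry the declared precision and scale, addressing the question above; this is only an illustration under the assumption that the target DecimalType is in scope, not the PR's actual implementation.

import org.apache.spark.sql.types.{DataType, Decimal, DecimalType}

// Sketch only: convert a string to Spark's Decimal using the declared precision and scale
// of the target DecimalType instead of whatever BigDecimal infers from the literal.
def toDecimalValue(value: String, dataType: DataType): Any = dataType match {
  case dt: DecimalType => Decimal(BigDecimal(value), dt.precision, dt.scale)
  case _               => value
}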
 * [ORDER BY (col_name1 [, ...] ) ]
 */
private val PARAMETERS = Array[ProcedureParameter](
  ProcedureParameter.optional(0, "table", DataTypes.StringType, None),
table is also optional, not required?
Yes, in both RunClusteringProcedure and ShowClusteringProcedure, table and path are optional, so it's fine to call those procedures with either a table or a path. What should be noted is that one of them must be set in each specific call, and this is guaranteed by org.apache.spark.sql.hudi.command.procedures.BaseProcedure#getBasePath.
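A rough sketch of the "one of table or path must be set" rule described above; it is illustrative only, and resolveBasePathFromCatalog is a hypothetical stand-in for the catalog lookup that BaseProcedure actually performs.

// Hypothetical helper: resolves a table name to its base path (placeholder logic).
def resolveBasePathFromCatalog(tableName: String): String =
  s"/warehouse/$tableName"

// Illustrative validation: at least one of table/path must identify the target table.
def getBasePath(tableName: Option[String], tablePath: Option[String]): String =
  tableName.map(resolveBasePathFromCatalog)
    .orElse(tablePath)
    .getOrElse(throw new IllegalArgumentException(
      "Either the table parameter or the path parameter must be provided"))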
class ShowClusteringProcedure extends BaseProcedure with ProcedureBuilder with SparkAdapterSupport with Logging {
  private val PARAMETERS = Array[ProcedureParameter](
    ProcedureParameter.optional(0, "table", DataTypes.StringType, None),
Ditto for table.
This PR looks good overall. Thanks @huberylee
@huberylee Can you please add a description for this PR?
 * [[StructField]] object for every field of the provided [[StructType]], recursively.
 *
 * For example, following struct
 * <pre>
Can you please make sure that the formatting of the comment is preserved? It has become practically unreadable.
Sorry, I'll reformat this comment in #4945
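As a side note on the quoted doc comment, here is a standalone sketch of the behavior it describes, i.e. visiting every field of a StructType recursively; it is an illustration only, not the method the comment documents.

import org.apache.spark.sql.types._

// Collect the qualified name of every field in a StructType, recursing into nested structs.
def collectFieldNames(struct: StructType, prefix: String = ""): Seq[String] =
  struct.fields.toSeq.flatMap { field =>
    val name = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case nested: StructType => name +: collectFieldNames(nested, name)
      case _                  => Seq(name)
    }
  }

// For struct<a: int, b: struct<c: string>> this yields Seq("a", "b", "b.c").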
…d for Spark SQL (apache#4901) * [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL * [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL * [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL Co-authored-by: shibei <huberylee.li@alibaba-inc.com>
Tips
What is the purpose of the pull request
Supporting a clustering command for Spark SQL.
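As a hedged usage sketch (for example from a spark-shell session), the command could be invoked as below, assuming the procedures register under the SQL names run_clustering and show_clustering; the parameter names follow the table/path parameters discussed in the review thread, and the values are illustrative.

// Cluster a table identified by name, or by its base path (assumed SQL procedure names).
spark.sql("CALL run_clustering(table => 'hudi_tbl')")
spark.sql("CALL run_clustering(path => '/tmp/hudi/hudi_tbl')")
// Show clustering activity for the table (assumed SQL procedure name).
spark.sql("CALL show_clustering(table => 'hudi_tbl')")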
Brief change log
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.