
[HUDI-2207] Support independent flink hudi clustering function #3599

Merged: 1 commit merged into apache:master on May 24, 2022

Conversation

yuzhaojing (Contributor):

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@yuzhaojing yuzhaojing force-pushed the HUDI-2207 branch 4 times, most recently from c795bbe to 9443126 Compare September 7, 2021 02:07
@vinothchandar vinothchandar added this to Ready for Review in PR Tracker Board Sep 7, 2021
@yuzhaojing yuzhaojing force-pushed the HUDI-2207 branch 5 times, most recently from 2828a67 to 1bb7f93 Compare October 21, 2021 03:38
public static final ConfigOption<Integer> CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES = ConfigOptions
.key("clustering.plan.strategy.target.file.max.bytes")
.intType()
.defaultValue(1024) // default 1 GB
Contributor:

I guess the unit is MB?

Contributor (author):

Already fixed it.
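
The exchange above is about the unit of clustering.plan.strategy.target.file.max.bytes: the key says bytes, but the default of 1024 reads like megabytes. A minimal sketch of one way to keep the two consistent, assuming the option is documented in megabytes and converted to bytes once when the plan is built (the class and method names here are illustrative, not Hudi's):

```java
// Illustrative sketch only: treat the configured value as megabytes and
// convert it to bytes in one place, so a default of 1024 really means 1 GB.
public class ClusteringSizeConfig {
    // Default expressed in MB; 1024 MB == 1 GB.
    static final long DEFAULT_TARGET_FILE_MAX_MB = 1024L;

    static long targetFileMaxBytes(long megabytes) {
        return megabytes * 1024L * 1024L;
    }
}
```

Keeping the conversion in a single helper avoids the mismatch the reviewer spotted, where a byte-denominated key silently carries a megabyte-denominated default.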

@danny0405 danny0405 added the flink Issues related to flink label Oct 21, 2021
*/
public class FlinkRecentDaysClusteringPlanStrategy<T extends HoodieRecordPayload<T>>
extends PartitionAwareClusteringPlanStrategy<T, List<HoodieRecord<T>>, List<HoodieKey>, List<WriteStatus>> {
private static final Logger LOG = LogManager.getLogger(FlinkRecentDaysClusteringPlanStrategy.class);
Contributor:

Should we extend from FlinkSizeBasedClusteringPlanStrategy instead?

if (writeStats.stream().mapToLong(s -> s.getTotalWriteErrors()).sum() > 0) {
throw new HoodieClusteringException("Clustering failed to write to files:"
+ writeStats.stream().filter(s -> s.getTotalWriteErrors() > 0L).map(s -> s.getFileId()).collect(Collectors.joining(",")));
}
Contributor:

writeTableMetadata is missing.


private static final Logger LOG = LogManager.getLogger(FlinkClusteringPlanActionExecutor.class);

public FlinkClusteringPlanActionExecutor(HoodieEngineContext context,
Contributor:

We can use the HoodieData to merge the code with SparkClusteringPlanActionExecutor, this can be done with separate following PR.

Contributor:

+1

.key("clustering.schedule.enabled")
.booleanType()
.defaultValue(false) // default false for pipeline
.withDescription("Async clustering, default false for pipeline");
Contributor:

Suggest the description: "Schedule the clustering plan, default false".

.key("clustering.tasks")
.intType()
.defaultValue(10)
.withDescription("Parallelism of tasks that do actual clustering, default is 10");
Contributor:

Change the default value to be the same as compaction.tasks, which is 4.

* The clustering task identifier.
*/
private int taskID;

Contributor:

The event should include a fileId to deduplicate on task failover/retry; take CompactionCommitEvent as a reference. Because a HoodieClusteringGroup (and thus the event) has multiple input file ids, we can use the first file group id to distinguish.
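
The deduplication the reviewer suggests can be sketched in plain Java: key each clustering commit event by its instant plus the first file group id of its input slices, so retried subtasks after a failover do not commit the same group twice. The names here (ClusteringEvent, dedupKey) are illustrative, not Hudi's API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusteringEventDedup {
    static final class ClusteringEvent {
        final String instant;
        final List<String> inputFileIds;

        ClusteringEvent(String instant, List<String> inputFileIds) {
            this.instant = instant;
            this.inputFileIds = inputFileIds;
        }

        String dedupKey() {
            // A HoodieClusteringGroup has multiple input file ids; the first
            // one is enough to distinguish groups within one instant.
            return instant + "_" + inputFileIds.get(0);
        }
    }

    // Keep the first event seen for each (instant, first file id) key.
    static List<ClusteringEvent> dedup(List<ClusteringEvent> events) {
        Map<String, ClusteringEvent> byKey = new LinkedHashMap<>();
        for (ClusteringEvent e : events) {
            byKey.putIfAbsent(e.dedupKey(), e);
        }
        return new ArrayList<>(byKey.values());
    }
}
```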

this.schema = new Schema.Parser().parse(writeConfig.getSchema());
this.readerSchema = HoodieAvroUtils.addMetadataFields(this.schema);
this.requiredPos = getRequiredPositions();

Contributor:

What is the requiredPos used for?

for (ClusteringOperation clusteringOp : clusteringOps) {
try {
Schema readerSchema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(writeConfig.getSchema()));
HoodieFileReader<? extends IndexedRecord> baseFileReader = HoodieFileReaderFactory.getFileReader(table.getHadoopConf(), new Path(clusteringOp.getDataFilePath()));
Contributor:

Use the readerSchema?

private void doClustering(String instantTime, HoodieClusteringGroup clusteringGroup, Collector<ClusteringCommitEvent> collector) throws IOException {
List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream().map(ClusteringOperation::create).collect(Collectors.toList());
boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);

Contributor:

A HoodieClusteringGroup has a number of output file groups, while the current code writes only one file group (or more if the parquet size hits the threshold); can we find a way to set the parallelism of the bulk_insert writer to match?
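
The sizing idea in that comment can be sketched as follows: derive the bulk_insert writer parallelism from the number of output file groups in the clustering plan, capped by the configured clustering.tasks value. The method name is hypothetical, not Hudi's API.

```java
// Illustrative sketch: one writer task per planned output file group,
// never exceeding the configured task count, never dropping below one.
public class ClusteringParallelism {
    static int writerParallelism(int outputFileGroups, int configuredTasks) {
        if (outputFileGroups <= 0) {
            return 1; // keep at least one writer task
        }
        return Math.min(outputFileGroups, configuredTasks);
    }
}
```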

conf.setInteger(FlinkOptions.CLUSTERING_PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATEST, config.skipFromLatestPartitions);
if (config.sortColumns != null) {
conf.setString(FlinkOptions.CLUSTERING_SORT_COLUMNS, config.sortColumns);
}
Contributor:

What is the CLUSTERING_SORT_COLUMNS used for?

@vinothchandar (Member) left a comment:

cc @yihua wondering if we can reuse a lot more code here?

@vinothchandar vinothchandar moved this from Ready for Review to Under Discussion PRs in PR Tracker Board Dec 15, 2021
@yihua (Contributor) commented Dec 16, 2021:

cc @yihua wondering if we can reuse a lot more code here?

Yes, the core clustering action should be extracted out, independent of engines, using HoodieData abstraction. Right now Spark and Java have their own classes. I filed a ticket here: https://issues.apache.org/jira/browse/HUDI-3042
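
The engine-agnostic abstraction yihua describes can be sketched minimally: a HoodieData-like interface with a list-backed implementation, so the same clustering logic can run on Flink's List-based pipeline or an RDD-backed one. This is a simplified illustration, not Hudi's real HoodieData API.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Minimal engine-agnostic data handle: engines supply their own backing
// implementation (List for Flink/Java, RDD for Spark) behind one interface.
interface EngineData<T> {
    <R> EngineData<R> map(Function<T, R> fn);

    List<T> collectAsList();
}

// List-backed implementation, as a Flink/Java engine might use.
final class ListData<T> implements EngineData<T> {
    private final List<T> data;

    ListData(List<T> data) {
        this.data = data;
    }

    public <R> EngineData<R> map(Function<T, R> fn) {
        return new ListData<>(data.stream().map(fn).collect(Collectors.toList()));
    }

    public List<T> collectAsList() {
        return data;
    }
}
```

With such an interface, the clustering action executor could be written once against EngineData instead of being duplicated per engine, which is the point of the linked HUDI-3042 ticket.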

@yuzhaojing yuzhaojing force-pushed the HUDI-2207 branch 3 times, most recently from f86df2e to d3eb4e3 Compare March 1, 2022 14:30
@yuzhaojing (Contributor, author):

@hudi-bot run azure

@yuzhaojing yuzhaojing closed this Mar 3, 2022
PR Tracker Board automation moved this from Under Discussion PRs to Done Mar 3, 2022
@yuzhaojing yuzhaojing reopened this Mar 3, 2022
PR Tracker Board automation moved this from Done to Under Discussion PRs Mar 3, 2022
@danny0405 (Contributor):

Hello, there seem to be conflicts and a compile failure; can you fix that, @yuzhaojing?

@yuzhaojing (author):

Sure, I will fix this.

@yuzhaojing yuzhaojing force-pushed the HUDI-2207 branch 2 times, most recently from e1689c4 to cf4355d Compare May 21, 2022 15:06
@yuzhaojing: (this comment was marked as resolved)

@yuzhaojing yuzhaojing closed this May 21, 2022
PR Tracker Board automation moved this from Nearing Landing to Done May 21, 2022
@yuzhaojing yuzhaojing reopened this May 21, 2022
PR Tracker Board automation moved this from Done to Under Discussion PRs May 21, 2022
@yuzhaojing yuzhaojing force-pushed the HUDI-2207 branch 2 times, most recently from eb878d4 to 9f98d28 Compare May 22, 2022 03:35
@yuzhaojing (author):

@hudi-bot run azure

@yuzhaojing yuzhaojing closed this May 23, 2022
PR Tracker Board automation moved this from Under Discussion PRs to Done May 23, 2022
@yuzhaojing yuzhaojing reopened this May 23, 2022
PR Tracker Board automation moved this from Done to Under Discussion PRs May 23, 2022
@yuzhaojing (author):

@hudi-bot run azure

<artifactId>flink-avro</artifactId>
<version>${flink.version}</version>
<scope>compile</scope>
</dependency>
Contributor:

Why this change?

Contributor (author):

HoodieClusteringGroup is an Avro model used in ClusteringPlanEvent.

Contributor:

But you should not depend on flink-avro, I guess; there are already many model classes in the hudi-flink code paths.

Contributor (author):

Fixed; use ClusteringGroupInfo instead of HoodieClusteringGroup.
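
The fix described above can be sketched as carrying a plain serializable POJO in the Flink event instead of the Avro-generated HoodieClusteringGroup, so the module needs no flink-avro dependency. The fields here are illustrative; only the class name ClusteringGroupInfo comes from the thread.

```java
import java.io.Serializable;
import java.util.List;

// Plain POJO stand-in for the Avro HoodieClusteringGroup: serializable by
// Flink's default mechanisms without any Avro type information.
public class ClusteringGroupInfo implements Serializable {
    private final List<String> inputFileIds;
    private final int numOutputFileGroups;

    public ClusteringGroupInfo(List<String> inputFileIds, int numOutputFileGroups) {
        this.inputFileIds = inputFileIds;
        this.numOutputFileGroups = numOutputFileGroups;
    }

    public List<String> getInputFileIds() {
        return inputFileIds;
    }

    public int getNumOutputFileGroups() {
        return numOutputFileGroups;
    }
}
```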

<groupId>org.apache.flink</groupId>
<artifactId>flink-avro</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
Contributor:

Why this change?

Contributor (author):

Ditto.

@danny0405 (Contributor):

We may need to resolve the conflict.

@yuzhaojing (author):

We may need to resolve the conflict.

fixed.

@@ -399,6 +404,59 @@ public HoodieWriteMetadata<List<WriteStatus>> cluster(final String clusteringIns
throw new HoodieNotSupportedException("Clustering is not supported yet");
}

private void updateTableMetadata(HoodieTable<T, List<HoodieRecord<T>>, List<HoodieKey>, List<WriteStatus>> table,
HoodieCommitMetadata commitMetadata,
HoodieInstant hoodieInstant) {
Contributor:

updateTableMetadata seems unused now.

@danny0405 (Contributor) left a comment:

+1, thanks for the contribution @yuzhaojing; we may need to fix the clustering plan scheduling issue in a following PR.

PR Tracker Board automation moved this from Under Discussion PRs to Nearing Landing May 24, 2022
@yuzhaojing (author):

+1, thanks for the contribution @yuzhaojing; we may need to fix the clustering plan scheduling issue in a following PR.

Sure, I will fix the issue in a following PR to support clustering plan scheduling in the coordinator.

@yuzhaojing (author):

Thanks for the suggestion, Danny!

@hudi-bot:

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yuzhaojing yuzhaojing merged commit 18635b5 into apache:master May 24, 2022
PR Tracker Board automation moved this from Nearing Landing to Done May 24, 2022
Labels: flink (Issues related to flink), table-service

Successfully merging this pull request may close these issues: none yet.

8 participants