
[HUDI-5289] Avoiding repeated trigger of clustering dag#8275

Merged
nsivabalan merged 2 commits into apache:master from nsivabalan:test_clustering_dup_files
Mar 24, 2023

Conversation

@nsivabalan
Contributor

Change Logs

Looks like the clustering dag is triggered twice even in the happy path. This patch fixes the issue. It follows the same approach as the compactor: before triggering the dag for the first time, we persist the RDD. Then, once the CommitMetadata is returned from the table to the client, the client clones the CommitMetadata, so any further access will not trigger the dag again.
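The persist-then-clone idea above can be sketched outside Spark as plain memoization with defensive copies: the expensive computation runs at most once, and every caller receives a copy, so downstream code can neither re-trigger the computation nor mutate the cached state. This is an illustrative stand-in only; class and method names here are hypothetical, not Hudi's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the persist-then-clone pattern (not Hudi code):
// the "dag" runs at most once, and callers get defensive copies.
class CachedCommitMetadata {
    private List<String> cached;       // the "persisted" result
    private int computations = 0;      // how many times the "dag" ran

    // Stand-in for the expensive clustering dag.
    private List<String> computeWriteStatuses() {
        computations++;
        List<String> result = new ArrayList<>();
        result.add("file-group-1");
        result.add("file-group-2");
        return result;
    }

    // First call computes and caches; later calls return a clone, so
    // mutation by a caller cannot corrupt the cache or force a recompute.
    public List<String> getWriteStatuses() {
        if (cached == null) {
            cached = computeWriteStatuses();
        }
        return new ArrayList<>(cached);
    }

    public int getComputationCount() {
        return computations;
    }
}
```

Repeated calls to `getWriteStatuses()` leave the computation count at one, which is exactly the property the fix wants from the clustering path.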

Impact

Clustering will be robust and will not produce spurious data files (which would eventually be cleaned up anyway).

Risk level (write none, low medium or high below)

low.

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

/**
 * We leverage a Spark event listener to validate it.
 */
@Test
def testValidateClusteringForRepeatedDag(): Unit = {
Contributor Author

Note to reviewers: this test fails without the fix in this patch; with the fix, it succeeds.

@nsivabalan force-pushed the test_clustering_dup_files branch from 00f671c to 8b4506e on March 23, 2023 at 20:33
@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@codope codope added priority:critical Production degraded; pipelines stalled engine:spark Spark integration area:table-service Table services labels Mar 24, 2023
Contributor

@KnightChess left a comment

I have a doubt: why not use writeStats to judge whether it is empty? Caching writeStatuses can also cause part of the partitions to be recomputed if some error causes an executor to be removed.

@nsivabalan
Contributor Author

hey @KnightChess:
not sure what your suggestion is.
We already check isEmpty in SparkRDDWriteClient.

  private void validateClusteringCommit(HoodieWriteMetadata<JavaRDD<WriteStatus>> clusteringMetadata, String clusteringCommitTime, HoodieTable table) {
    if (clusteringMetadata.getWriteStatuses().isEmpty()) {
      HoodieClusteringPlan clusteringPlan = ClusteringUtils.getClusteringPlan(
              table.getMetaClient(), HoodieTimeline.getReplaceCommitRequestedInstant(clusteringCommitTime))
          .map(Pair::getRight).orElseThrow(() -> new HoodieClusteringException(
              "Unable to read clustering plan for instant: " + clusteringCommitTime));
      throw new HoodieClusteringException("Clustering plan produced 0 WriteStatus for " + clusteringCommitTime
          + " #groups: " + clusteringPlan.getInputGroups().size() + " expected at least "
          + clusteringPlan.getInputGroups().stream().mapToInt(HoodieClusteringGroup::getNumOutputFileGroups).sum()
          + " write statuses");
    }
  }

@nsivabalan nsivabalan merged commit 41026ef into apache:master Mar 24, 2023
@KnightChess
Contributor

@nsivabalan using writeStats will not trigger the clustering dag either; I think there is no gap in the result if we use it.
[screenshot]
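The distinction @KnightChess is pointing at can be sketched generically: an RDD-backed collection like writeStatuses behaves like a lazy computation, so even asking whether it is empty forces it to run (again, if unpersisted), while writeStats is an already-materialized, driver-side list whose emptiness check is free. This is an illustrative contrast only, not Hudi's or Spark's API; names and values are hypothetical.

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative contrast (not Hudi code): a lazy "RDD-like" view versus an
// already-materialized stats list.
class LazyVsMaterialized {
    static int lazyRuns = 0;

    // Lazy view: every evaluation re-triggers the underlying computation.
    static Supplier<List<String>> writeStatuses = () -> {
        lazyRuns++;
        return List.of("status-1", "status-2");
    };

    // Materialized, driver-side stats: checking emptiness costs nothing extra.
    static List<String> writeStats = List.of("stat-1", "stat-2");

    static boolean statusesEmpty() { return writeStatuses.get().isEmpty(); }
    static boolean statsEmpty()    { return writeStats.isEmpty(); }
}
```

Each call to `statusesEmpty()` bumps the run counter, while `statsEmpty()` never does, which is why an emptiness check against materialized stats cannot re-trigger the dag.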

nsivabalan added a commit to nsivabalan/hudi that referenced this pull request Mar 25, 2023
- Avoiding repeated trigger of clustering dag
nsivabalan added a commit to nsivabalan/hudi that referenced this pull request Mar 31, 2023
- Avoiding repeated trigger of clustering dag
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
- Avoiding repeated trigger of clustering dag
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 20, 2023
KnightChess pushed a commit to KnightChess/hudi that referenced this pull request Jan 2, 2024

Labels

area:table-service Table services engine:spark Spark integration priority:critical Production degraded; pipelines stalled


5 participants