
[HUDI-5289] Avoiding repeated trigger of clustering dag#8275

Merged
nsivabalan merged 2 commits into apache:master from nsivabalan:test_clustering_dup_files
Mar 24, 2023

Conversation

@nsivabalan
Contributor

Change Logs

Looks like the clustering dag is triggered twice even in the happy path. This patch fixes the issue. It follows the same approach as the compactor: before triggering the dag for the first time, we persist the RDD. Then, once the CommitMetadata is returned from the table to the client, the client clones the CommitMetadata, so any further access will not trigger the dag again.
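The persist-then-clone idea above can be sketched outside Spark as plain memoization with defensive copies: the expensive computation runs at most once, and every caller receives a copy, so downstream code can neither re-trigger the computation nor mutate the cached state. This is an illustrative stand-in only; class and method names here are hypothetical, not Hudi's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the persist-then-clone pattern (not Hudi code):
// the "dag" runs at most once, and callers get defensive copies.
class CachedCommitMetadata {
    private List<String> cached;       // the "persisted" result
    private int computations = 0;      // how many times the "dag" ran

    // Stand-in for the expensive clustering dag.
    private List<String> computeWriteStatuses() {
        computations++;
        List<String> result = new ArrayList<>();
        result.add("file-group-1");
        result.add("file-group-2");
        return result;
    }

    // First call computes and caches; later calls return a clone, so
    // mutation by a caller cannot corrupt the cache or force a recompute.
    public List<String> getWriteStatuses() {
        if (cached == null) {
            cached = computeWriteStatuses();
        }
        return new ArrayList<>(cached);
    }

    public int getComputationCount() {
        return computations;
    }
}
```

Repeated calls to `getWriteStatuses()` leave the computation count at one, which is exactly the property the fix wants from the clustering path.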

Impact

Clustering will be robust and will not produce spurious data files (which would eventually be cleaned up anyway).

Risk level (write none, low medium or high below)

low.

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

/**
 * We leverage a Spark event listener to validate it.
 */
@Test
def testValidateClusteringForRepeatedDag(): Unit = {
Contributor Author

Note to reviewers: this test fails without the fix in this patch; with the fix, it succeeds.

@nsivabalan force-pushed the test_clustering_dup_files branch from 00f671c to 8b4506e on March 23, 2023 at 20:33
@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@codope codope added priority:critical Production degraded; pipelines stalled engine:spark Spark integration area:table-service Table services labels Mar 24, 2023
Contributor

@KnightChess left a comment

I have a doubt: why not use writeStats to judge whether it is empty? Caching writeStatuses can also cause part of the partitions to be recomputed if some error causes an executor to be removed.

@nsivabalan
Contributor Author

hey @KnightChess:
not sure what your suggestion is.
We already check isEmpty in SparkRDDWriteClient.

  private void validateClusteringCommit(HoodieWriteMetadata<JavaRDD<WriteStatus>> clusteringMetadata, String clusteringCommitTime, HoodieTable table) {
    if (clusteringMetadata.getWriteStatuses().isEmpty()) {
      HoodieClusteringPlan clusteringPlan = ClusteringUtils.getClusteringPlan(
              table.getMetaClient(), HoodieTimeline.getReplaceCommitRequestedInstant(clusteringCommitTime))
          .map(Pair::getRight).orElseThrow(() -> new HoodieClusteringException(
              "Unable to read clustering plan for instant: " + clusteringCommitTime));
      throw new HoodieClusteringException("Clustering plan produced 0 WriteStatus for " + clusteringCommitTime
          + " #groups: " + clusteringPlan.getInputGroups().size() + " expected at least "
          + clusteringPlan.getInputGroups().stream().mapToInt(HoodieClusteringGroup::getNumOutputFileGroups).sum()
          + " write statuses");
    }
  }

@nsivabalan nsivabalan merged commit 41026ef into apache:master Mar 24, 2023
@KnightChess
Contributor

@nsivabalan using writeStats will not trigger the clustering dag either; I think there is no gap in the result if we use it.
[screenshot]
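The distinction @KnightChess is pointing at can be sketched generically: an RDD-backed collection like writeStatuses behaves like a lazy computation, so even asking whether it is empty forces it to run (again, if unpersisted), while writeStats is an already-materialized, driver-side list whose emptiness check is free. This is an illustrative contrast only, not Hudi's or Spark's API; names and values are hypothetical.

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative contrast (not Hudi code): a lazy "RDD-like" view versus an
// already-materialized stats list.
class LazyVsMaterialized {
    static int lazyRuns = 0;

    // Lazy view: every evaluation re-triggers the underlying computation.
    static Supplier<List<String>> writeStatuses = () -> {
        lazyRuns++;
        return List.of("status-1", "status-2");
    };

    // Materialized, driver-side stats: checking emptiness costs nothing extra.
    static List<String> writeStats = List.of("stat-1", "stat-2");

    static boolean statusesEmpty() { return writeStatuses.get().isEmpty(); }
    static boolean statsEmpty()    { return writeStats.isEmpty(); }
}
```

Each call to `statusesEmpty()` bumps the run counter, while `statsEmpty()` never does, which is why an emptiness check against materialized stats cannot re-trigger the dag.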

nsivabalan added a commit to nsivabalan/hudi that referenced this pull request Mar 25, 2023
- Avoiding repeated trigger of clustering dag
nsivabalan added a commit to nsivabalan/hudi that referenced this pull request Mar 31, 2023
- Avoiding repeated trigger of clustering dag
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
- Avoiding repeated trigger of clustering dag
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 20, 2023
KnightChess pushed a commit to KnightChess/hudi that referenced this pull request Jan 2, 2024

Labels

area:table-service Table services engine:spark Spark integration priority:critical Production degraded; pipelines stalled


5 participants