
[HUDI-5289] Fix writeStatus RDD recalculated in cluster #7373

Closed
wants to merge 2 commits

Conversation

Zouxxyy
Contributor

@Zouxxyy Zouxxyy commented Dec 3, 2022

Change Logs

Fixes https://issues.apache.org/jira/projects/HUDI/issues/HUDI-5289

There is an earlier patch for a similar problem, but even with that patch applied, the problem still exists.
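
For context, a minimal plain-Spark sketch of the underlying behavior (illustrative names only, not Hudi code): an RDD that is consumed by more than one action re-runs its whole upstream DAG each time unless it is persisted before the first action.

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class RecomputeDemo {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[2]", "recompute-demo");

    // Stand-in for the expensive clustering write that produces the writeStatus RDD.
    JavaRDD<String> writeStatusRDD = jsc.parallelize(Arrays.asList(1, 2, 3))
        .map(i -> {
          // Printed once per record per evaluation of the DAG.
          System.out.println("expensive write for record " + i);
          return "status-" + i;
        });

    // Without this persist, the two actions below each re-run the map,
    // i.e. the "expensive write" happens twice per record.
    writeStatusRDD.persist(StorageLevel.MEMORY_AND_DISK());

    writeStatusRDD.count();    // first action: evaluates the DAG and caches the partitions
    writeStatusRDD.collect();  // second action: served from the persisted partitions

    jsc.close();
  }
}
```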

Impact

Could improve the clustering performance.

Risk level (write none, low medium or high below)

low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

Contributor

@nsivabalan nsivabalan left a comment


When I put up the fix, I did verify that clustering was not triggered multiple times. Can you confirm that you are seeing clustering being triggered twice?
If you keep inspecting the data directory, you will find the additional data files only during the validation.
Another way to test this: marker directory reconciliation will delete some additional files if the DAG was triggered twice. If it wasn't, you should not see any marker file deletions.
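
A third, generic way to check for a doubled DAG (just a hedged sketch with a plain Spark listener, not something from this PR): register a listener while clustering runs and count the jobs; if the write DAG is evaluated twice, the same stages show up twice.

```java
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerJobStart;
import org.apache.spark.sql.SparkSession;

public class JobCounter {
  public static final AtomicInteger JOBS = new AtomicInteger();

  // Install before triggering clustering; compare the job count (and the stage names
  // in the Spark UI) between runs to see whether the write DAG ran twice.
  public static void install(SparkSession spark) {
    spark.sparkContext().addSparkListener(new SparkListener() {
      @Override
      public void onJobStart(SparkListenerJobStart jobStart) {
        System.out.println("job " + jobStart.jobId() + " started, total=" + JOBS.incrementAndGet());
      }
    });
  }
}
```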

@nsivabalan nsivabalan added priority:blocker release-0.12.2 Patches targetted for 0.12.2 labels Dec 5, 2022
@Zouxxyy
Contributor Author

Zouxxyy commented Dec 6, 2022

@nsivabalan Yes, I can confirm, and @boneanxs found it too, so I think we should persist it.

@hudi-bot

hudi-bot commented Dec 6, 2022

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@boneanxs
Contributor

boneanxs commented Dec 6, 2022

> @nsivabalan Yes, I can confirm, and @boneanxs found it too, so I think we should persist it.

I found this issue when I first implemented clustering as row, but didn't investigate it much at that time.

But I'm wondering whether we can do collectAsList and then parallelize that list to build HoodieData[WriteStatus] in performClusteringWithRecordsRDD, just like performClusteringWithRecordsAsRow already does (it dereferences the RDD to a list), to fix this permanently?
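
A rough sketch of that idea (the helper below is hypothetical; HoodieJavaRDD.of and the exact package names are assumptions based on the surrounding code, not an actual patch): collect the write statuses once on the driver, then re-parallelize them so later actions never walk back up the clustering DAG.

```java
import java.util.List;

import org.apache.hudi.client.WriteStatus;
import org.apache.hudi.common.data.HoodieData;
import org.apache.hudi.data.HoodieJavaRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

class DereferenceWriteStatuses {
  // Hypothetical helper: materialize the write statuses on the driver, then rebuild a
  // HoodieData from the collected list so downstream consumers (commit, metadata updates)
  // do not re-trigger the clustering write DAG.
  static HoodieData<WriteStatus> dereference(JavaSparkContext jsc,
                                             JavaRDD<WriteStatus> writeStatusRDD) {
    List<WriteStatus> collected = writeStatusRDD.collect();
    return HoodieJavaRDD.of(jsc.parallelize(collected));
  }
}
```

The trade-off is a driver-side collect of all WriteStatus objects, which is the same pattern the row-writer path already relies on.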

@@ -122,6 +123,8 @@ public HoodieWriteMetadata<HoodieData<WriteStatus>> performClustering(final Hood
         .stream();
     JavaRDD<WriteStatus>[] writeStatuses = convertStreamToArray(writeStatusesStream.map(HoodieJavaRDD::getJavaRDD));
     JavaRDD<WriteStatus> writeStatusRDD = engineContext.union(writeStatuses);
+    // Persist writeStatus, since it may be reused
+    writeStatusRDD.persist(StorageLevel.MEMORY_AND_DISK());
Contributor


We can't rely on caching, since the cache could be invalidated, and we have had experiences where the actual files written did not match what's in writeStatus. Hence, if something fails, we let the DAG get retriggered here. We added guard rails in BaseCommitActionExecutor by means of cloning.
CC @alexeykudinkin

@nsivabalan
Contributor

Are you saying that even in a successful path the DAG is getting triggered twice, or does it happen only in the exception path?

@Zouxxyy
Contributor Author

Zouxxyy commented Dec 12, 2022

@nsivabalan
Yes, even in a successful path the DAG is getting triggered twice. But you must turn off hoodie.datasource.write.row.writer.enable, because when it is turned on, recalculation is avoided thanks to the collect in HoodieDatasetBulkInsertHelper. Since it seems we do not recommend using persist at present, I will close these two PRs.
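
For anyone reproducing this, a hedged sketch of the relevant switch (table name and paths are illustrative, and the full clustering setup is omitted): disabling the row writer forces the RDD-based clustering path, which is where the duplicate DAG evaluation shows up.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class ReproWriterConfig {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-5289-repro")
        .master("local[2]")
        .getOrCreate();

    Dataset<Row> df = spark.read().parquet("/tmp/source_data"); // illustrative input

    df.write()
        .format("hudi")
        .option("hoodie.table.name", "repro_table")
        // Turn off the row writer so clustering goes through performClusteringWithRecordsRDD.
        .option("hoodie.datasource.write.row.writer.enable", "false")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/repro_table"); // illustrative base path
  }
}
```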

@Zouxxyy Zouxxyy closed this Dec 12, 2022
Labels
priority:blocker release-0.12.2 Patches targetted for 0.12.2