[HUDI-3397] Guard repeated rdd triggers by nsivabalan · Pull Request #6878 · apache/hudi

nsivabalan · 2022-10-06T06:55:22Z

Change Logs

We have had issues where stages that write data to disk were de-referenced multiple times, which should not happen. We are adding guard rails to avoid such missteps. In this patch, we take control of the DAG: we trigger de-referencing of the actual write ourselves and then create the RDD again from the result, so that downstream callers can never trigger the previous write stage by mistake.
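To illustrate the idea outside of Spark, here is a minimal sketch. `LazyData`, `writeStage`, `guarded`, and `WRITE_RUNS` are hypothetical names invented for this example and are not Hudi or Spark APIs; `LazyData` only imitates an uncached RDD by re-running its computation on every `collect()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class GuardedWriteSketch {

  /** Hypothetical stand-in for a lazily evaluated dataset such as an uncached RDD. */
  static final class LazyData<T> {
    private final Supplier<List<T>> computation;

    private LazyData(Supplier<List<T>> computation) {
      this.computation = computation;
    }

    /** Wraps a computation that re-runs on every collect(). */
    static <T> LazyData<T> lazy(Supplier<List<T>> computation) {
      return new LazyData<>(computation);
    }

    /** Wraps an already computed list; collect() only copies it. */
    static <T> LazyData<T> of(List<T> values) {
      List<T> snapshot = new ArrayList<>(values);
      return new LazyData<>(() -> snapshot);
    }

    List<T> collect() {
      return new ArrayList<>(computation.get());
    }
  }

  /** Counts how many times the simulated write stage actually ran. */
  static final AtomicInteger WRITE_RUNS = new AtomicInteger();

  /** Simulated write stage: the counter increment stands in for writing files. */
  static LazyData<String> writeStage() {
    Supplier<List<String>> stage = () -> {
      WRITE_RUNS.incrementAndGet();
      return List.of("status-1", "status-2");
    };
    return LazyData.lazy(stage);
  }

  /** The guard: trigger the write once, then hand out data detached from it. */
  static LazyData<String> guarded(LazyData<String> writeStatuses) {
    return LazyData.of(writeStatuses.collect());
  }

  public static void main(String[] args) {
    LazyData<String> unguarded = writeStage();
    unguarded.collect();
    unguarded.collect(); // second dereference re-runs the write
    System.out.println("unguarded write runs: " + WRITE_RUNS.get());

    WRITE_RUNS.set(0);
    LazyData<String> safe = guarded(writeStage());
    safe.collect();
    safe.collect(); // served from the materialized copy
    System.out.println("guarded write runs: " + WRITE_RUNS.get());
  }
}
```

In this sketch the unguarded handle runs the simulated write twice, while the guarded handle runs it exactly once no matter how often callers dereference it. The cost is that the full result is held in memory at the point of materialization.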

Impact

This changes our DAG: the point at which the write actually happens will be different after this patch.

Risk level: high

We are changing the DAG because we now de-reference explicitly. For example, with auto commit disabled, the actual execution currently kicks in and the write is triggered only after writeclient.insert() returns and the caller de-references the WriteStatus. After this change, that may no longer be the case: the write is triggered just before the index update, irrespective of whether auto commit is enabled or disabled.
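The change in ordering can be sketched as follows. This is a simplified model, not the actual Hudi code path: `insert`, `run`, and the event strings are hypothetical, and the lazy branch assumes the index update is chained lazily behind the write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class WriteTimingSketch {

  /**
   * Hypothetical insert(): returns a handle to the write statuses and
   * records events so the order of execution can be inspected.
   */
  static Supplier<List<String>> insert(boolean materializeBeforeIndexUpdate, List<String> events) {
    Supplier<List<String>> write = () -> {
      events.add("write executed");
      return List.of("status-1");
    };
    if (materializeBeforeIndexUpdate) {
      // After the patch (as modeled here): the write runs before insert() returns.
      List<String> computed = write.get();
      events.add("index updated");
      return () -> computed;
    }
    // Before the patch (as modeled here): nothing runs until the caller dereferences.
    return () -> {
      List<String> statuses = write.get();
      events.add("index updated");
      return statuses;
    };
  }

  /** Plays the caller's side: call insert(), then dereference the result once. */
  static List<String> run(boolean materializeBeforeIndexUpdate) {
    List<String> events = new ArrayList<>();
    Supplier<List<String>> statuses = insert(materializeBeforeIndexUpdate, events);
    events.add("insert() returned");
    statuses.get();
    events.add("caller read statuses");
    return events;
  }

  public static void main(String[] args) {
    System.out.println("lazy:  " + run(false));
    System.out.println("eager: " + run(true));
  }
}
```

In the lazy run the write is recorded after `insert() returned`; in the eager run it is recorded before. Callers that relied on controlling when the write starts would observe this difference.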

Documentation Update

Not applicable.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

hudi-bot · 2022-10-13T02:47:46Z

CI report:

19d74d6 UNKNOWN
a4a7199 Azure: FAILURE


xushiyan · 2022-10-23T19:18:37Z

hudi-common/src/main/java/org/apache/hudi/common/data/HoodieListData.java


+  @Override
+  public int getNumPartitions() {
+    return 1;


Unneeded change; to be reverted.

xushiyan · 2022-10-23T19:30:22Z

...-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java

    HoodieWriteMetadata<HoodieData<WriteStatus>> result = new HoodieWriteMetadata<>();
-    updateIndexAndCommitIfNeeded(writeStatuses, result);
+    // dereference rdd so that no double de-referencing can happen by mistake.
+    int numPartitions = Math.max(1, writeStatuses.getNumPartitions());


I don't think we need to guard it with a minimum of 1. The getNumPartitions() API should guarantee a meaningful return value.

@nsivabalan we might just need to guard against an empty RDD, but otherwise, since we're working with an RDD here, I think we can assume that it won't return an invalid value.

guanziyue · 2022-12-02T06:09:28Z

...-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java

-    updateIndexAndCommitIfNeeded(writeStatuses, result);
+    // dereference rdd so that no double de-referencing can happen by mistake.
+    int numPartitions = Math.max(1, writeStatuses.getNumPartitions());
+    HoodieData<WriteStatus> computedWriteStatus = HoodieJavaRDD.of(writeStatuses.collectAsList(), (HoodieSparkEngineContext) context, numPartitions);


If the write status tracks successful records, won't collecting them on the driver put pressure on driver memory?

+1. This may not work for large inserts with millions of records. Success record tracking will be enabled when certain indexes (e.g. the record index) are enabled in the MDT.

bvaradar · 2024-03-31T06:24:15Z

@nsivabalan: Is this PR still relevant, or can it be closed?

nsivabalan force-pushed the guardMultipleRddTriggers branch 2 times, most recently from ffb92d0 to 536bae7 Compare October 12, 2022 22:04

adding write metadata holder

911947c

nsivabalan force-pushed the guardMultipleRddTriggers branch 3 times, most recently from 19d74d6 to 43193e5 Compare October 12, 2022 23:02

Adding guards to avoid double de-referencing to RDDs

a4a7199

nsivabalan force-pushed the guardMultipleRddTriggers branch from 43193e5 to a4a7199 Compare October 12, 2022 23:11

nsivabalan marked this pull request as ready for review October 12, 2022 23:12

nsivabalan assigned alexeykudinkin Oct 20, 2022

nsivabalan added the priority:critical Production degraded; pipelines stalled label Oct 20, 2022

xushiyan reviewed Oct 23, 2022


nsivabalan added the status:in-progress Work in progress label Nov 2, 2022

guanziyue reviewed Dec 2, 2022


vinothchandar added the release-0.14.0 label Apr 25, 2023

nsivabalan removed the release-0.14.0 label Jul 3, 2023

github-actions bot added the size:S PR with lines of changes in (10, 100] label Feb 26, 2024

nsivabalan closed this Sep 11, 2024

hudi-bot mentioned this pull request Dec 9, 2025

Make sure Spark RDDs triggering actual FS activity are only dereferenced once #15017

Open

nsivabalan wants to merge 2 commits into apache:master from nsivabalan:guardMultipleRddTriggers

nsivabalan commented Oct 6, 2022 (edited)

hudi-bot commented Oct 13, 2022

xushiyan Oct 23, 2022

xushiyan Oct 23, 2022

alexeykudinkin Oct 24, 2022

guanziyue Dec 2, 2022 (edited)

prashantwason Mar 7, 2023

bvaradar commented Mar 31, 2024

8 participants
