Skip to content

[HUDI-3397] Guard repeated rdd triggers#6878

Closed
nsivabalan wants to merge 2 commits intoapache:masterfrom
nsivabalan:guardMultipleRddTriggers
Closed

[HUDI-3397] Guard repeated rdd triggers#6878
nsivabalan wants to merge 2 commits intoapache:masterfrom
nsivabalan:guardMultipleRddTriggers

Conversation

@nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented Oct 6, 2022

Change Logs

We have had issues where stages involving writing data to disk were de-referenced multiple times which should not be happening. We are putting in guard rails so avoid such mis-steps.In this patch, we take control of the dag and trigger de-referencing of actual write and create Rdd again so that any downstream callers will never be able to trigger previous write stage by mistake.

Impact

We are changing our dag in the sense that, when exactly the write happens will change after this patch.

Risk level: high

We are changing the dag since we are explicitly de-referencing. For eg, incase of auto commit disable, after writeclient.insert() returns to the caller and if caller tries to dereference WriteStatus, the actual execution kicks in and the write gets triggered. But after this change, it may not be the case. Write will get triggered just before index update irrespective of whether auto commit is enabled or disabled.

Documentation Update

Not applicable.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan force-pushed the guardMultipleRddTriggers branch 2 times, most recently from ffb92d0 to 536bae7 Compare October 12, 2022 22:04
@nsivabalan nsivabalan force-pushed the guardMultipleRddTriggers branch 3 times, most recently from 19d74d6 to 43193e5 Compare October 12, 2022 23:02
@nsivabalan nsivabalan force-pushed the guardMultipleRddTriggers branch from 43193e5 to a4a7199 Compare October 12, 2022 23:11
@nsivabalan nsivabalan marked this pull request as ready for review October 12, 2022 23:12
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Oct 20, 2022

@Override
public int getNumPartitions() {
return 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to revert unneeded change

HoodieWriteMetadata<HoodieData<WriteStatus>> result = new HoodieWriteMetadata<>();
updateIndexAndCommitIfNeeded(writeStatuses, result);
// dereference rdd so that no double de-referencing can happen by mistake.
int numPartitions = Math.max(1, writeStatuses.getNumPartitions());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think we need to guard it by min 1. the API getNumPartitions() should guarantee meaningful return value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsivabalan we just might need to guard against an empty RDD, but otherwise since we're working w/ an RDD in here i think we can assume that it shouldn't be returning an invalid value

@nsivabalan nsivabalan added the status:in-progress Work in progress label Nov 2, 2022
updateIndexAndCommitIfNeeded(writeStatuses, result);
// dereference rdd so that no double de-referencing can happen by mistake.
int numPartitions = Math.max(1, writeStatuses.getNumPartitions());
HoodieData<WriteStatus> computedWriteStatus = HoodieJavaRDD.of(writeStatuses.collectAsList(), (HoodieSparkEngineContext) context, numPartitions);
Copy link
Contributor

@guanziyue guanziyue Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If write status tracking success record, will collecting them in driver bring pressure on driver memory?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 This may not work in case of large inserts with millions of inserts. Success record tracking will be enabled when certain indexes (e.g. record index) is enabled in the MDT.

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Feb 26, 2024
@bvaradar
Copy link
Contributor

@nsivabalan : Is this PR still relevant or can be closed ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:critical Production degraded; pipelines stalled size:S PR with lines of changes in (10, 100] status:in-progress Work in progress

Projects

Status: Awaiting Triage
Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

8 participants