Skip to content

[HUDI-5593] Fixing deadlocks due to async cleaner interplay w/ main thread#7739

Merged
codope merged 1 commit intoapache:masterfrom
nsivabalan:asyncCleanLockFix
Jan 24, 2023
Merged

[HUDI-5593] Fixing deadlocks due to async cleaner interplay w/ main thread#7739
codope merged 1 commit intoapache:masterfrom
nsivabalan:asyncCleanLockFix

Conversation

@nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented Jan 23, 2023

Change Logs

In post commit in WriteClient, main thread awaits for async cleaner if async clean is enabled. This code block is already within a lock. So, if multi-writer locks are enabled, and if main thread has already acquired the lock, async cleaner may not be able to acquire the lock and it will keep timing out. This was the main reason for our CI tests to keep timing out frequently. Especially we also auto enable lock provider configs for deltastreamer in continuous mode w/ async table services, some of these tests are impacted.

Impact

No deadlocks due to locks when async cleaner is invoked.

Risk level (write none, low medium or high below)

medium.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan added the priority:blocker Production down; release blocker label Jan 23, 2023
}

/**
* Schedules a new cleaning instant.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NTR (not to reviewer): these methods are not used. removing dead code.

* @param skipLocking if this is triggered by another parent transaction, locking can be skipped.
*/
@Nullable
@Deprecated
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NTR: have deprecated the methods which was taking in "skipLocking" as an argument as we don't need it anymore.

protected void rollbackFailedWrites(Map<String, Option<HoodiePendingRollbackInfo>> instantsToRollback) {
rollbackFailedWrites(instantsToRollback, false);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NTR:
rollbackFailedWrites in L679, is being invoked from upgrade code path which is within lock. so, could not deprecate this method to avoid "skipLocking"

…read is acquired the lock and awaiting for async cleaner to finish
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

try {
// Delete the marker directory for the instant.
WriteMarkersFactory.get(config.getMarkersType(), table, instantTime)
.quietDeleteMarkerDir(context, config.getMarkersDeleteParallelism());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should this be done inside the lock if commit has already happened?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to think thru this. lets land this patch for now to unblock CI. by tmrw will get back to this.

@nsivabalan
Copy link
Contributor Author

CI is green
Screen Shot 2023-01-24 at 12 26 50 AM

@codope codope merged commit 146f39d into apache:master Jan 24, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Jan 31, 2023
…hile main thread is acquired the lock and awaiting for async cleaner to finish (apache#7739)
nsivabalan added a commit to nsivabalan/hudi that referenced this pull request Mar 22, 2023
…hile main thread is acquired the lock and awaiting for async cleaner to finish (apache#7739)
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…hile main thread is acquired the lock and awaiting for async cleaner to finish (apache#7739)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:blocker Production down; release blocker

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

4 participants