fix: Fix dangling lock from storage based lock #18439
linliu-code wants to merge 1 commit into apache:master from
Conversation
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Nice work — the core bug fix (missing return statements for 429 in both S3 and GCS clients) is clearly correct and the new THROTTLED enum value is a clean way to surface this. The single-retry approach in expireLock is pragmatic. Just a couple of minor points in the inline comments around log accuracy and synchronized sleeping.
result = this.storageLockClient.tryUpsertLockFile(expiredLockData, Option.of(this.getLock()));
if (result.getLeft() == LockUpsertResult.THROTTLED && !fromShutdownHook) {
  logger.warn("Owner {}: Lock expiration write was throttled, retrying after 1 second.", ownerId);
  try {
🤖 This method is synchronized, so the 1-second sleep will block any concurrent tryLock(), renewLock(), or getLock() call on the same provider instance. In practice, is the heartbeat always stopped before we reach this point? If so, it's probably fine — just want to confirm there's no window where the heartbeat thread could be waiting on this monitor while we sleep.
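To make the concern concrete, here is a minimal sketch (class and method names are hypothetical, not the actual provider code) of how a sleep inside one synchronized method holds the instance monitor and blocks every other synchronized method on the same object:

```java
// Hypothetical illustration only; not the Hudi provider code.
class MonitorBlockingSketch {
  // Holds the instance monitor for the whole back-off period.
  synchronized void expireLockWithRetry() throws InterruptedException {
    Thread.sleep(1000); // 1-second retry back-off inside the monitor
  }

  // A concurrent call here parks until expireLockWithRetry() returns,
  // because both methods synchronize on the same instance.
  synchronized boolean tryLock() {
    return true;
  }
}
```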
Force-pushed from eefd11f to 9a7cfae
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! The PR fixes the dangling-lock bug by adding a THROTTLED enum value, restoring the missing return statements in the S3/GCS/Azure clients, and restructuring unlock() so the 1-second retry sleeps outside the provider monitor. The round 1 concerns (log-message accuracy and synchronized sleeping) appear to be addressed, and no correctness issues were found. A few style/readability suggestions, including one minor naming nit, are in the inline comments; the overall change is clean and the threading design in the refactored unlock() is well-commented. Please take a look; this should be ready for a Hudi committer or PMC member to take it from here.
cc @yihua
Force-pushed from 9a7cfae to b05ad36
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! The fix correctly addresses the dangling-lock scenario by adding a THROTTLED result code, returning it from each cloud client's 429 path, and restructuring unlock() to retry once outside the provider monitor with a reference-equality bail-out against concurrent tryLock(). No critical correctness issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of comment phrasing nits in the concurrent locking logic; code structure is otherwise clean.
cc @yihua
logger.debug("Owner {}: Gracefully shutting down heartbeat.", ownerId);
believesNoLongerHoldsLock &= heartbeatManager.stopHeartbeat(true);
// If throttled, retry once after sleeping outside the monitor to avoid blocking other threads.
// Note: when unlock() is called via close() -> shutdown(), the caller's synchronized methods
🤖 nit: "the sleep will still block in that path" is a bit ambiguous — it reads as if the sleep is itself blocking, but the actual concern is that the outer synchronized close()/shutdown() caller still holds the monitor throughout the sleep. Could you rephrase to something like "the outer synchronized caller still holds the monitor through the sleep, so other threads remain blocked in that scenario"?
When StorageBasedLockProvider attempts to expire/release a lock and the
underlying storage write returns HTTP 429 (rate-limited), the S3, GCS,
and Azure storage lock clients were logging the warning but missing the
return statement, causing them to fall through to the UNKNOWN_ERROR
path. The provider had no way to distinguish a throttle from a truly
unknown state, so it would report the lock release as failed without
retrying — potentially leaving the lock in a dangling (non-expired)
state and blocking other writers.
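As a rough illustration of the fall-through described above (a simplified sketch; the handler shape and helper name are illustrative, not the actual client code):

```java
// Simplified sketch of the 429 handling; illustrative only.
private LockUpsertResult classifyUpsertFailure(int statusCode) {
  if (statusCode == 429) {
    logger.warn("Rate-limited while writing the lock file; treating as throttled.");
    // The bug: without this return, execution continued into the
    // generic branch below and reported UNKNOWN_ERROR instead.
    return LockUpsertResult.THROTTLED;
  }
  logger.error("Unexpected status {} while writing the lock file.", statusCode);
  return LockUpsertResult.UNKNOWN_ERROR;
}
```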
Changes:
- Add THROTTLED(3) to the LockUpsertResult enum to distinguish rate-limit
responses from other transient errors.
- Fix the missing return in S3StorageLockClient, GCSStorageLockClient, and
AzureStorageLockClient so HTTP 429 responses now correctly return THROTTLED
instead of falling through to UNKNOWN_ERROR.
- Introduce ExpireLockResult { SUCCESS, THROTTLED, FAILED } to make
tryExpireCurrentLock a single-attempt primitive; callers own the retry
policy.
- Restructure StorageBasedLockProvider.unlock() so the 1-second backoff
retry sleeps OUTSIDE the provider monitor (releases the lock, sleeps,
re-acquires and re-checks believesLockMightBeHeld). Avoids blocking
other threads during the retry. The shutdown-hook path still holds
the monitor by design.
- Refine tryLock() failure metrics: switch over LockUpsertResult so
ACQUIRED_BY_OTHERS, THROTTLED/UNKNOWN_ERROR, and other states are
accounted in their own metric instead of all sharing the precondition
failure counter.
- Handle THROTTLED in renewLock(): treat as transient, log a warning
and return true so the heartbeat retries on its next cycle.
- Add unit tests: throttled-then-success unlock, throttled-then-still-
throttled unlock, renewLock under THROTTLED, and Azure 429 → THROTTLED.
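A rough sketch of the restructured retry shape (simplified and illustrative only; the exception and the exact control flow do not necessarily match the implementation):

```java
// Illustrative sketch of the retry-outside-the-monitor shape; not the actual code.
public void unlock() {
  ExpireLockResult result;
  synchronized (this) {
    result = tryExpireCurrentLock(); // single-attempt primitive
  }
  if (result == ExpireLockResult.THROTTLED) {
    try {
      Thread.sleep(1000); // back off without holding the provider monitor
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    synchronized (this) {
      // Re-check: the shutdown hook or another caller may have expired
      // the lock while we slept.
      result = believesLockMightBeHeld()
          ? tryExpireCurrentLock()
          : ExpireLockResult.SUCCESS;
    }
  }
  if (result != ExpireLockResult.SUCCESS) {
    throw new RuntimeException("Unable to release the storage-based lock."); // placeholder exception type
  }
}
```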
Force-pushed from b05ad36 to cfdb474
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the careful follow-up here! The fix correctly adds the missing return for HTTP 429 paths and restructures unlock() so the retry sleep happens outside the provider monitor, with the lock-identity check guarding against a concurrent tryLock() replacing the lock during the sleep window. I traced the relevant race scenarios (concurrent unlock succeeding, natural expiration plus new acquirer, heartbeat-already-stopped invariant) and the logic holds up. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. One minor simplification suggestion below.
cc @yihua
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #18439 +/- ##
============================================
- Coverage 68.08% 68.08% -0.01%
- Complexity 28940 28952 +12
============================================
Files 2519 2519
Lines 140646 140704 +58
Branches 17427 17432 +5
============================================
+ Hits 95757 95794 +37
- Misses 37030 37047 +17
- Partials 7859 7863 +4
Describe the issue this Pull Request addresses
#18438
When `StorageBasedLockProvider` attempts to expire/release a lock and the underlying storage write returns HTTP 429 (rate-limited), `S3StorageLockClient`, `GCSStorageLockClient`, and `AzureStorageLockClient` were logging the warning but missing the `return` statement, causing them to fall through to the `UNKNOWN_ERROR` path. The provider had no way to distinguish a throttle from a truly unknown state, so it would report the lock release as failed without retrying — potentially leaving the lock in a dangling (non-expired) state and blocking other writers.

Summary and Changelog
The bug fix.
- Add `THROTTLED(3)` to `LockUpsertResult` to distinguish rate-limit responses from other transient errors.
- Fix the missing `return` in `S3StorageLockClient`, `GCSStorageLockClient`, and `AzureStorageLockClient` so HTTP 429 returns `THROTTLED` instead of falling through to `UNKNOWN_ERROR`.

Lock-release retry, restructured.
- Introduce `enum ExpireLockResult { SUCCESS, THROTTLED, FAILED }`. `tryExpireCurrentLock` is now a `@VisibleForTesting` single-attempt primitive that returns the enum; callers own the retry policy.
- Restructure `StorageBasedLockProvider.unlock()` so the 1-second backoff retry sleeps outside the provider monitor (releases it, sleeps, re-acquires, re-checks `believesLockMightBeHeld()` in case the shutdown hook expired the lock during the wait). This avoids blocking other threads on the provider monitor for a full second during a transient throttle. The shutdown-hook path still holds the monitor by design.

Other lock paths that observed `THROTTLED`.
- `renewLock()`: treat `THROTTLED` as transient — log a warning and return `true` so the heartbeat retries on its next cycle.
- `tryLock()` failure metrics: replace the single `updateLockAcquirePreconditionFailureMetric` call with a `switch` over `LockUpsertResult` so `ACQUIRED_BY_OTHERS`, `THROTTLED`/`UNKNOWN_ERROR`, and other states each map to their correct metric instead of all sharing the precondition-failure counter.
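A rough sketch of the metric classification (only `updateLockAcquirePreconditionFailureMetric` is the existing counter named above; the metrics object and the other helper names are hypothetical):

```java
// Illustrative sketch; the metrics object and the non-precondition helper
// names are hypothetical, not the actual Hudi metrics API.
private void recordTryLockFailure(LockUpsertResult result) {
  switch (result) {
    case ACQUIRED_BY_OTHERS:
      metrics.updateLockAcquirePreconditionFailureMetric(); // lock held by another owner
      break;
    case THROTTLED:
    case UNKNOWN_ERROR:
      metrics.updateLockAcquireTransientFailureMetric();    // hypothetical helper
      break;
    default:
      metrics.updateLockAcquireGenericFailureMetric();      // hypothetical helper
      break;
  }
}
```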
Tests.
- `testRenewLockThrottledReturnsTrue`, Azure 429 → `THROTTLED` assertion.
- `testUnlockSucceedsAfterThrottledRetry`, `testUnlockThrowsExceptionWhenStillThrottledAfterRetry` for the new monitor-release behavior.
- `TestStorageBasedLockProvider` (45), `TestS3StorageLockClient` (18), `TestGCSStorageLockClient` (16), `TestAzureStorageLockClient` (27) — 106 tests, 0 failures.

Impact
Reduces the likelihood of dangling locks in S3-, GCS-, and Azure-backed
`StorageBasedLockProvider` deployments under high write-rate conditions (e.g. GCS's 1-write/sec per object limit). The 1s retry no longer blocks the provider monitor, so concurrent operations on the same provider instance can proceed during the wait. No behavior change for non-429 error codes.

Risk Level
Medium. Touches the lock-release fast path and changes its concurrency semantics (sleep moved outside the monitor). Verified by the unit-test set above.
Documentation Update
none
Contributor's checklist