[fix](fe) Fix deadlock risks in ConcurrentLong2*HashMap#62117
[fix](fe) Fix deadlock risks in ConcurrentLong2*HashMap#62117dataroaring wants to merge 3 commits intoapache:masterfrom
Conversation
…urrentLong2LongHashMap Fix two classes of deadlock in the segmented concurrent maps: 1. forEach read-to-write upgrade deadlock: forEach held a read lock while invoking user callbacks, which could attempt write operations on the same segment. ReentrantReadWriteLock does not support read-to-write upgrade, causing the thread to park forever. Fixed by snapshot-copying entries under the read lock and invoking callbacks outside the lock. 2. Reentrant callback deadlock: compute/merge methods invoked user-supplied functions while holding segment write locks, enabling both same-thread reentrant deadlock and cross-thread lock-order inversion. Added a ThreadLocal guard that detects reentrant map access from within callbacks and throws IllegalStateException (fail-fast instead of silent deadlock). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
This PR aims to eliminate deadlock scenarios in the FE segmented concurrent primitive maps (ConcurrentLong2*HashMap) by changing iteration semantics and adding a fail-fast reentrancy guard around compute/merge callbacks, plus tests and Javadoc to codify the new constraints.
Changes:
- Updated
forEachto snapshot entries under read lock and invoke callbacks outside the lock (per segment) to avoid read→write upgrade deadlocks. - Added a per-map
ThreadLocalcallback guard and enforced it across write APIs (and compute/merge variants) to fail fast on reentrant mutation. - Added unit tests covering the previous deadlock pattern and the new reentrant-callback exceptions.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| fe/fe-foundation/src/main/java/org/apache/doris/foundation/util/ConcurrentLong2ObjectHashMap.java | Snapshot-based forEach, callback restriction Javadoc, and reentrancy guard for write + compute/merge paths |
| fe/fe-foundation/src/main/java/org/apache/doris/foundation/util/ConcurrentLong2LongHashMap.java | Same as above for primitive long→long map (including mergeLong) |
| fe/fe-foundation/src/test/java/org/apache/doris/foundation/util/ConcurrentLong2ObjectHashMapTest.java | Added deadlock-safety and reentrant-callback tests |
| fe/fe-foundation/src/test/java/org/apache/doris/foundation/util/ConcurrentLong2LongHashMapTest.java | Added deadlock-safety and reentrant-callback tests |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Future<?> future = executor.submit(() -> { | ||
| map.forEach((ConcurrentLong2ObjectHashMap.LongObjConsumer<String>) (k, v) -> { | ||
| map.put(k + 1000L, v + "-copy"); | ||
| }); | ||
| }); | ||
| // Should complete within 5 seconds; if it deadlocks, get() will time out | ||
| future.get(5, java.util.concurrent.TimeUnit.SECONDS); | ||
| executor.shutdown(); |
There was a problem hiding this comment.
The executor is only shut down on the success path. If future.get(...) times out or throws, the single-thread executor can remain alive and keep the test JVM from exiting. Wrap the submit/get in try/finally, cancel the Future on failure, and use shutdownNow() (optionally awaitTermination) to ensure cleanup.
| Future<?> future = executor.submit(() -> { | |
| map.forEach((ConcurrentLong2ObjectHashMap.LongObjConsumer<String>) (k, v) -> { | |
| map.put(k + 1000L, v + "-copy"); | |
| }); | |
| }); | |
| // Should complete within 5 seconds; if it deadlocks, get() will time out | |
| future.get(5, java.util.concurrent.TimeUnit.SECONDS); | |
| executor.shutdown(); | |
| Future<?> future = null; | |
| try { | |
| future = executor.submit(() -> { | |
| map.forEach((ConcurrentLong2ObjectHashMap.LongObjConsumer<String>) (k, v) -> { | |
| map.put(k + 1000L, v + "-copy"); | |
| }); | |
| }); | |
| // Should complete within 5 seconds; if it deadlocks, get() will time out | |
| future.get(5, java.util.concurrent.TimeUnit.SECONDS); | |
| } finally { | |
| if (future != null && !future.isDone()) { | |
| future.cancel(true); | |
| } | |
| executor.shutdownNow(); | |
| executor.awaitTermination(5, java.util.concurrent.TimeUnit.SECONDS); | |
| } |
There was a problem hiding this comment.
Addressed in commit 7f590a5. The test now uses try/catch/finally: on timeout or exception the Future is cancelled, and executor.shutdownNow() is called in the finally block to prevent thread leaks.
— ThinkOps 🤖
| future.get(5, java.util.concurrent.TimeUnit.SECONDS); | ||
| executor.shutdown(); |
There was a problem hiding this comment.
The executor is shut down only after future.get(...) completes successfully. If the get times out or throws, the executor thread may leak and interfere with the rest of the test suite. Use try/finally, cancel the future on failure, and call shutdownNow() (and optionally awaitTermination).
| future.get(5, java.util.concurrent.TimeUnit.SECONDS); | |
| executor.shutdown(); | |
| try { | |
| future.get(5, java.util.concurrent.TimeUnit.SECONDS); | |
| } catch (Exception e) { | |
| future.cancel(true); | |
| throw e; | |
| } finally { | |
| executor.shutdownNow(); | |
| executor.awaitTermination(5, java.util.concurrent.TimeUnit.SECONDS); | |
| } |
There was a problem hiding this comment.
Addressed in commit 7f590a5. Same fix applied: try/catch/finally with future.cancel(true) on exception and executor.shutdownNow() in finally.
— ThinkOps 🤖
| inCallback.set(Boolean.TRUE); | ||
| try { | ||
| V newValue = mappingFunction.apply(key); | ||
| if (newValue != null) { | ||
| seg.map.put(key, newValue); | ||
| } | ||
| return newValue; | ||
| } finally { | ||
| inCallback.set(Boolean.FALSE); | ||
| } |
There was a problem hiding this comment.
Using ThreadLocal and setting it back to Boolean.FALSE can still leave an entry in long-lived thread pools (and may delay cleanup after the map is GC'd). Consider using inCallback.remove() in the finally block instead of set(Boolean.FALSE) so the per-thread value is cleared entirely when leaving the callback.
There was a problem hiding this comment.
Addressed in commit 7f590a5. Changed from inCallback.set(Boolean.FALSE) to inCallback.remove() in all finally blocks to fully clear the ThreadLocal entry.
— ThinkOps 🤖
| inCallback.set(Boolean.TRUE); | ||
| long newValue; | ||
| try { | ||
| newValue = mappingFunction.applyAsLong(key); | ||
| } finally { | ||
| inCallback.set(Boolean.FALSE); | ||
| } |
There was a problem hiding this comment.
ThreadLocal values set back to Boolean.FALSE still leave an entry in the thread's ThreadLocalMap (which can be undesirable with long-lived thread pools). Consider calling inCallback.remove() in the finally block instead of set(Boolean.FALSE) to fully clear the thread-local state on exit.
There was a problem hiding this comment.
Addressed in commit 7f590a5. All set(Boolean.FALSE) calls replaced with inCallback.remove() in both map implementations.
— ThinkOps 🤖
| * {@code computeIfPresent}, {@code compute}, and {@code merge} <em>must not</em> attempt to update | ||
| * any other mappings of this map. This restriction is enforced at runtime: reentrant access from a | ||
| * callback throws {@link IllegalStateException}. Violation may also cause deadlock if callbacks | ||
| * attempt cross-segment updates from multiple threads. |
There was a problem hiding this comment.
The reentrancy guard only blocks writes from compute/merge callbacks. Callbacks can still call get/containsKey/... on other keys, which can deadlock across segments (e.g., T1 holds segA write-lock then reads segB; T2 holds segB write-lock then reads segA). If the intent is to fail-fast instead of risking deadlock, consider also checking inCallback in read methods (or explicitly documenting that callbacks must not access the map at all, even for reads).
| * {@code computeIfPresent}, {@code compute}, and {@code merge} <em>must not</em> attempt to update | |
| * any other mappings of this map. This restriction is enforced at runtime: reentrant access from a | |
| * callback throws {@link IllegalStateException}. Violation may also cause deadlock if callbacks | |
| * attempt cross-segment updates from multiple threads. | |
| * {@code computeIfPresent}, {@code compute}, and {@code merge} <em>must not</em> access this map | |
| * reentrantly in any way, including both updates and reads (for example {@code get}, | |
| * {@code containsKey}, or any other operation on another key). Runtime checks reject reentrant | |
| * mutating access from a callback by throwing {@link IllegalStateException}; however, even | |
| * read-only cross-segment access from callbacks can deadlock when multiple threads hold locks on | |
| * different segments. Callbacks must therefore treat this map as off-limits for the duration of | |
| * the callback. |
There was a problem hiding this comment.
Addressed in commit 64db7d2. checkNotInCallback() is now called at the top of all read methods too: get, getOrDefault, containsKey, containsValue, size, and isEmpty. A dedicated test testReentrantReadFromCallbackThrowsIllegalStateException verifies that get() from within a compute callback throws ISE.
— ThinkOps 🤖
| * attempt to update any other mappings of this map. This restriction is enforced at runtime: | ||
| * reentrant access from a callback throws {@link IllegalStateException}. Violation may also cause | ||
| * deadlock if callbacks attempt cross-segment updates from multiple threads. |
There was a problem hiding this comment.
The callback guard only prevents reentrant writes. A compute/merge callback can still call read APIs (get, containsKey, etc.) for other keys, which can deadlock between segments when two threads each hold a segment write-lock and try to read the other segment. To truly turn these deadlocks into fail-fast errors, consider guarding read methods too (or strengthen the documented restriction to forbid any map access from callbacks, not just updates).
| * attempt to update any other mappings of this map. This restriction is enforced at runtime: | |
| * reentrant access from a callback throws {@link IllegalStateException}. Violation may also cause | |
| * deadlock if callbacks attempt cross-segment updates from multiple threads. | |
| * access this map in any way while executing, including reads such as {@code get}, | |
| * {@code containsKey}, or any attempt to update other mappings. Reentrant callback access is | |
| * rejected with {@link IllegalStateException}; any map access from a callback is unsupported | |
| * because cross-segment reads or writes from multiple threads may deadlock. |
There was a problem hiding this comment.
Addressed in commit 64db7d2. Same fix in the Long2Long map: checkNotInCallback() added to all read methods (get, getOrDefault, containsKey, containsValue, size, isEmpty). Test testReentrantReadFromCallbackThrowsIllegalStateException covers this.
— ThinkOps 🤖
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
- Use inCallback.remove() instead of set(FALSE) to avoid stale ThreadLocalMap entries in long-lived thread pools - Wrap test executor in try/finally with shutdownNow() to prevent thread leak on timeout/failure - Strengthen Javadoc: callbacks must not access the map at all (including reads), since cross-segment reads can also deadlock across threads Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
run buildall |
TPC-H: Total hot run time: 29038 ms |
TPC-DS: Total hot run time: 177967 ms |
|
/review |
There was a problem hiding this comment.
Found 1 issue.
fe/fe-foundation/src/main/java/org/apache/doris/foundation/util/ConcurrentLong2LongHashMap.javaandfe/fe-foundation/src/main/java/org/apache/doris/foundation/util/ConcurrentLong2ObjectHashMap.java: the new callback guard only blocks write APIs. Read APIs (get,containsKey,size,keySet,values, etc.) are still callable from inside compute/merge callbacks even though the new contract says callbacks must not access the map at all. This leaves an ABBA deadlock: thread T1 can entercompute(kA, ...), hold segment A's write lock, then callget(kB)and wait for segment B's read lock while thread T2 does the symmetriccompute(kB, ...)->get(kA)path. Both threads then wait forever.
Critical checkpoint conclusions
- Goal of task / proof: Partially achieved. The
forEachupgrade deadlock is fixed and tests cover that, but callback-triggered deadlocks are not eliminated because callback reads remain allowed. No test covers this scenario. - Minimality/focus: Yes. The patch is focused on the two map implementations and their tests.
- Concurrency review: This code is concurrency-sensitive. Segment
ReentrantReadWriteLocks andinCallbackare the critical state. Running callbacks under segment write locks is still unsafe when callback read methods can acquire other segment locks; lock ordering across segments remains inconsistent. - Lifecycle/static init: No special lifecycle issue beyond the per-instance
ThreadLocal; switching toremove()is correct. - Config changes: None.
- Compatibility: None.
- Parallel code paths: Yes. The same incomplete guard is present in both concurrent map classes.
- Special conditional checks:
checkNotInCallback()is reasonable, but its enforcement is narrower than the documented contract. - Test coverage: Positive tests were added for mutating
forEachand write reentrancy. Missing negative coverage for callback read access across segments. - Observability: No extra observability appears necessary for this utility class.
- Transaction/persistence: Not applicable.
- Data write/atomicity: Segment-local atomicity is preserved, but deadlock remains possible for cross-segment callback reads.
- FE/BE variable passing: Not applicable.
- Performance: Snapshot-based
forEachadds allocations, but that tradeoff looks acceptable for correctness. - Other issues: None beyond the deadlock hole above.
Not run locally in this review runner.
| * <p><b>Callback restriction:</b> The mapping/remapping functions passed to {@code computeIfAbsent}, | ||
| * {@code computeIfPresent}, {@code compute}, {@code merge}, and {@code mergeLong} <em>must not</em> | ||
| * access this map in any way while executing, including reads such as {@code get} or | ||
| * {@code containsKey}, as well as writes. Reentrant write access from a callback is rejected at |
There was a problem hiding this comment.
This still leaves the deadlock you describe in the new Javadoc. checkNotInCallback() is only wired into mutating methods, so a callback can still call get() / containsKey() / size() / snapshot iterators while the current thread already holds one segment's write lock. With two threads running compute on different segments and each callback reading the other key, you get an ABBA write(A) -> read(B) / write(B) -> read(A) deadlock. Since the contract now says callbacks must not access the map at all, the runtime guard and tests need to cover read paths too, not just writes.
There was a problem hiding this comment.
Addressed in commit 64db7d2. checkNotInCallback() now guards all read methods (get, getOrDefault, containsKey, containsValue, size, isEmpty) in addition to write methods. Any map access from within a compute/merge callback — read or write — now throws IllegalStateException, fully closing the ABBA cross-segment deadlock window. Test coverage added via testReentrantReadFromCallbackThrowsIllegalStateException.
— ThinkOps 🤖
| * <p><b>Callback restriction:</b> The mapping/remapping functions passed to {@code computeIfAbsent}, | ||
| * {@code computeIfPresent}, {@code compute}, and {@code merge} <em>must not</em> access this map | ||
| * in any way while executing, including reads such as {@code get} or {@code containsKey}, as well | ||
| * as writes. Reentrant write access from a callback is rejected at runtime with |
There was a problem hiding this comment.
Same issue here: the new ThreadLocal guard only protects write APIs. Read-only access from inside compute* / merge callbacks is still allowed, even though the class Javadoc now says callbacks must not access the map in any way. Two threads doing cross-segment callback reads can still deadlock while each holds a segment write lock.
There was a problem hiding this comment.
Addressed in commit 64db7d2. Same fix applied here: the ThreadLocal guard now covers both read and write APIs. All six read methods (get, getOrDefault, containsKey, containsValue, size, isEmpty) call checkNotInCallback() at entry. The documented contract (callbacks must not access the map in any way) is now enforced at runtime for all code paths.
— ThinkOps 🤖
…dlock Add checkNotInCallback() to all read methods (get, getOrDefault, containsKey, containsValue, size, isEmpty) so that compute/merge callbacks cannot acquire read-locks on other segments, preventing ABBA deadlocks when two threads each hold a segment write-lock. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
run buildall |
FE UT Coverage ReportIncrement line coverage `` 🎉 |
Summary
forEachin bothConcurrentLong2ObjectHashMapandConcurrentLong2LongHashMapfrom inline iteration under read-lock to snapshot-based iteration. Entries are copied into arrays under the read lock, then callbacks are invoked outside the lock. This eliminates the read-to-write upgrade deadlock (e.g.,forEach((k, v) -> map.put(k, v + 1))).ThreadLocal<Boolean>per map instance that is set during compute/merge callback execution. All write operations (put,remove,putIfAbsent,replace,clear,addTo, and all compute/merge variants) check this flag and throwIllegalStateExceptionon reentrant access, turning a silent deadlock into a fail-fast error.ConcurrentHashMap's contract).Test plan
testForEachWithMutatingCallbackDoesNotDeadlock— verifies forEach + put no longer deadlocks (uses 5s timeout)testReentrantComputeThrowsIllegalStateException— verifies compute callback calling put throws ISEtestReentrantComputeIfAbsentThrowsIllegalStateException— verifies computeIfAbsent callback calling put throws ISEtestReentrantMergeThrowsIllegalStateException/testReentrantMergeLongThrowsIllegalStateException— verifies merge callback calling remove throws ISE🤖 Generated with Claude Code