[fix](fe) Fix ReadListener leak on rejected worker task#62679
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found one blocking correctness issue.
ReadListenernow sends an ERR packet directly from the read-notification thread whenchannel.getWorker().execute()is rejected. That bypasses the normalMysqlConnectProcessor.processOnce()flow, which resets the MySQL packet sequence and consumes the client command before replying.MysqlProto.sendResponsePacket()therefore uses staleMysqlChannel.sequenceIdstate from the previous command, or an unknowable sequence for a multi-packet request that was never read, so the client can still see protocol errors instead of a clean overload failure. The added unit test mocksMysqlProto.sendResponsePacket(), so it does not cover this.
Critical checkpoints:
- Goal of the task: Partially met. The PR adds cleanup on rejected task submission, but the client-facing overload path is still protocol-incorrect, so the goal is not fully achieved.
- Change scope: Small and focused.
- Concurrency: Yes.
ReadListener.handleEvent()runs on the XNIO I/O thread and hands work to the worker pool; the new rejection path stays on the I/O thread and bypasses normal command-processing state transitions. - Lifecycle/static init: No special static-init issue. Session cleanup still closes the channel and relies on the existing close listener for unregister.
- Config changes: None.
- Compatibility/storage format/API compatibility: None.
- Parallel paths:
AcceptListenerhas a similar rejection branch, but it is not an equivalent template because this read path is already inside the post-handshake MySQL command protocol. - Special conditions: The new
RejectedExecutionExceptionbranch needs protocol-aware handling or an alternative that does not emit a malformed MySQL reply. - Tests: Added unit coverage only checks mocked method calls. No test covers real packet sequencing or client-visible behavior for rejected read tasks.
- Test result files: Not applicable.
- Observability: Existing warn logs are sufficient.
- Transaction/persistence/data writes/FE-BE variable passing: Not involved.
- Performance: The rejection branch also performs synchronous response/cleanup work on the I/O thread in an overload scenario.
Reviewer validation: code inspection only.
FE Regression Coverage ReportIncrement line coverage |
678dfd3 to
fda93d1
Compare
|
/review |
|
run buildall |
There was a problem hiding this comment.
Findings
-
ReadListenerrejected-task teardown is not lifecycle-safe for transactional sessions. The new catch branch callscleanup()before installingConnectContextthread-local. On a registered MySQL session, connection close reachesConnectPoolMgr.unregisterConnection()->ctx.closeTxn(). On observer FE,TransactionEntry.abortTransaction()usesConnectContext.get()to buildMasterOpExecutor/MasterTxnExecutor, so this path can fail to abort the txn and leave server-side transaction state behind. -
The same branch performs full session cleanup on the XNIO I/O thread.
ConnectContext.cleanup()may delete temp tables or forward to master, and close-triggered txn abort can do synchronous metadata / RPC work. Under overload, that can block the event loop instead of just failing the single session quickly.
Critical Checkpoints
- Goal of the task: Partially accomplished. The PR catches
RejectedExecutionException, but the recovery path is not safe for already-initialized sessions. The added test only proves a mockedcleanup()call, not end-to-end session teardown. - Small / focused change: Yes. The change is minimal and isolated to
ReadListenerplus a targeted unit test. - Concurrency: Not safe yet.
ReadListenerstarts on the XNIO I/O thread and normally hands work to a worker thread. The new rejection path changes teardown thread / affinity and can run heavyweight cleanup on the I/O thread. - Lifecycle management: Incorrect in the new branch.
ConnectContextthread-local is not installed before transaction / connection teardown can run. - Config changes: None.
- Compatibility changes: None.
- Parallel code paths:
AcceptListeneris not an exact precedent because its rejection handling happens before a session can own transactions or temp tables. - Special condition rationale: The new
RejectedExecutionExceptioncatch is justified, but it still needs lifecycle-safe teardown for mid-session failures. - Test coverage: Insufficient. No test covers a real registered connection with transaction or temp-table cleanup after worker rejection.
- Test results: No autogenerated result files changed; the new unit test itself does not exercise the risky lifecycle path.
- Observability: Existing warning logs are adequate.
- Transaction / persistence: Not safe for observer transactional sessions because rollback can fail on unregister.
- Data write / modification safety: At risk for in-flight transactional sessions due to failed abort on cleanup.
- FE-BE variable passing: Not applicable.
- Performance: Risky due to heavyweight cleanup on the I/O thread.
- Other issues: None beyond the lifecycle and I/O-thread teardown problems above.
Requesting changes.
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
fda93d1 to
9d78072
Compare
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
|
run p0 |
|
run external |
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
|
/review Since the cleaning here is to forcibly close the channel, there is no problem even if no packets are consumed? The client should be able to see that the connection has been killed, but I'm not sure if they will receive a specific error message, such as' high load '? |
There was a problem hiding this comment.
One blocking issue remains on the current head, and I found one additional test gap.
-
ReadListenerstill performs fullctx.cleanup()on the XNIO I/O thread when worker submission is rejected. That issue is already raised inline and is still blocking on the latest head:cleanup()may delete temp tables or synchronously abort transaction work through the close listener, so overload on the worker pool can still spill heavyweight teardown back onto the event loop. -
ConnectionExceedTestdoes not actually verify the lifecycle ordering it is trying to protect. EachMockito.inOrder(context)creates a fresh verifier, so the test passes as long as each call happens at some point; it will not fail ifsetThreadLocalInfo(),setKilled(), andcleanup()are reordered.
Critical checkpoints:
- Goal of the task: Partially accomplished. The PR no longer sends a protocol reply from the rejected-submission path and now restores
ConnectContextthread-local before cleanup, but the overload teardown path is still not safe on the I/O thread. - Change scope: Small and focused.
- Concurrency: Yes.
ReadListener.handleEvent()starts on the XNIO I/O thread and normally hands work to the worker pool. The rejected-submission branch still executes heavyweight teardown on the I/O thread, and the added test does not really lock down the intended lifecycle order. - Lifecycle/static init: No static-init issue. Session thread-local lifecycle is improved versus the prior revision, but the teardown thread affinity is still problematic.
- Config changes: None.
- Compatibility/incompatible changes: None.
- Parallel paths:
AcceptListeneris not an equivalent template because its overload handling happens before a session owns temp tables or transactions. - Special conditions: Catching
RejectedExecutionExceptionis appropriate, but the failure path should avoid heavy cleanup on the event loop. - Test coverage: Insufficient. The new unit test is mock-only and the current
InOrderusage does not actually assert ordering. - Test result files: Not applicable.
- Observability: Existing warn logs are sufficient.
- Transaction/persistence/data writes: No new persistence format or FE-BE protocol change. The remaining production risk is teardown behavior under overload rather than compatibility.
- FE-BE variable passing: Not applicable.
- Performance: Not safe yet because overload can still push cleanup or RPC-style work onto the event loop.
- Other issues: None beyond the above.
User focus: closing the channel without consuming the request is protocol-safe, and the client should observe the connection drop rather than a malformed MySQL packet. No additional issue there beyond the tradeoff that this path cannot reliably return a specific high load ERR packet without first consuming the request and resetting protocol sequence state.
Reviewer validation: code inspection only.
|
PR approved by at least one committer and no changes requested. |
9d78072 to
7c279c6
Compare
|
/review |
|
run buildall |
There was a problem hiding this comment.
I reviewed the current head 7c279c6d against the Doris FE review checklist.
The previous protocol-sequence problem and the unit-test ordering bug are fixed in this revision. I did not add duplicate inline comments, but one blocking concern from the existing inline review still applies on the current head: the RejectedExecutionException path in ReadListener still calls full ctx.cleanup() on the XNIO I/O thread, which can perform temp-table cleanup, forwarded RPC, and transaction-abort work under overload.
Critical checkpoints:
- Goal / correctness: Partially met. The PR now closes rejected sessions instead of leaving them suspended, but the overload cleanup path is still not safe on the I/O thread.
- Scope: Small and focused.
- Concurrency: Not safe yet. The rejection path runs on the I/O thread and performs heavyweight teardown there.
- Lifecycle: Thread-local setup/removal is now more consistent with the normal async path.
- Configuration / compatibility / FE-BE protocol: No new config or compatibility concerns in this patch.
- Parallel paths:
AcceptListenerremains the closest analogous path; this PR only updatesReadListener. - Tests: The new unit test improves coverage for the rejection branch and now verifies ordering correctly, but it is still mock-based and does not exercise real registered-connection teardown behavior.
- Observability: Existing logging is adequate for this small change.
User focus:
- No additional user-provided focus points.
Because the remaining blocker is already captured in an existing inline thread, I did not post a duplicate inline comment.
FE Regression Coverage ReportIncrement line coverage |
ReadListener does not handle RejectedExecutionException, which may cause the Client to hang when concurrency is extremely high. AcceptListener has already performed similar handling.