[fix](fe) Fix ReadListener leak on rejected worker task#62679

Merged
morrySnow merged 1 commit into apache:master from HonestManXin:fix_connection_leak
May 6, 2026

@HonestManXin
Contributor

ReadListener does not handle RejectedExecutionException, which can leave the client hanging when concurrency is extremely high. AcceptListener already handles the equivalent rejection case.
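To illustrate the failure mode described above, here is a minimal sketch of catching RejectedExecutionException at the point where work is handed to a pool. It is not the Doris code: FakeSession, RejectedSubmitSketch, and submitOrClose are hypothetical names, and a plain java.util.concurrent.ThreadPoolExecutor stands in for the worker pool.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a client session; not a Doris class.
final class FakeSession {
    volatile boolean closed = false;

    void close() {
        closed = true;
    }
}

final class RejectedSubmitSketch {
    // Hands the task to the pool. If the pool rejects it, close the session
    // so the client is not left waiting on a request nobody will process.
    static boolean submitOrClose(ThreadPoolExecutor pool, FakeSession session, Runnable task) {
        try {
            pool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            session.close();
            return false;
        }
    }

    // Saturates a 1-thread pool with a 1-slot queue, then submits a third task.
    static String demo() {
        CountDownLatch gate = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));
        Runnable blocker = () -> {
            try {
                gate.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        FakeSession first = new FakeSession();
        FakeSession second = new FakeSession();
        FakeSession third = new FakeSession();
        try {
            boolean a = submitOrClose(pool, first, blocker);  // runs on the only worker
            boolean b = submitOrClose(pool, second, blocker); // waits in the queue
            boolean c = submitOrClose(pool, third, blocker);  // rejected: pool and queue are full
            return a + "," + b + "," + c + ",thirdClosed=" + third.closed;
        } finally {
            gate.countDown(); // release the blocked worker
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Without the catch block, the third submission would throw out of the caller and the session would stay open with no task ever scheduled to serve it.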

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@HonestManXin
Contributor Author

run buildall

@HonestManXin
Contributor Author

/review

Contributor

@github-actions github-actions Bot left a comment

I found one blocking correctness issue.

  1. ReadListener now sends an ERR packet directly from the read-notification thread when channel.getWorker().execute() is rejected. That bypasses the normal MysqlConnectProcessor.processOnce() flow, which resets the MySQL packet sequence and consumes the client command before replying. MysqlProto.sendResponsePacket() therefore uses stale MysqlChannel.sequenceId state from the previous command, or an unknowable sequence for a multi-packet request that was never read, so the client can still see protocol errors instead of a clean overload failure. The added unit test mocks MysqlProto.sendResponsePacket(), so it does not cover this.
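The sequencing concern can be sketched in isolation. In the MySQL wire protocol each packet carries a 4-byte header (3-byte little-endian payload length plus a 1-byte sequence id), and the sequence id restarts at 0 when a new command begins. The class below is a hypothetical model, not MysqlChannel: SequenceState and the two replySeq helpers are illustrative names only.

```java
final class MysqlSequenceSketch {
    // MySQL packet header: 3-byte little-endian payload length, then a 1-byte sequence id.
    static byte[] header(int payloadLength, int sequenceId) {
        return new byte[] {
            (byte) (payloadLength & 0xFF),
            (byte) ((payloadLength >> 8) & 0xFF),
            (byte) ((payloadLength >> 16) & 0xFF),
            (byte) (sequenceId & 0xFF)
        };
    }

    // Hypothetical per-connection counter, loosely modelled on the
    // sequence state the review refers to.
    static final class SequenceState {
        private int next = 0;

        void reset() {
            next = 0;
        }

        int take() {
            int id = next;
            next = (next + 1) & 0xFF; // the id is one byte and wraps around
            return id;
        }
    }

    // Normal path: reset, account for the client's command packet, then reply.
    static int replySeqAfterReset(SequenceState state) {
        state.reset();
        state.take();        // the client's command packet uses sequence id 0
        return state.take(); // the first server reply uses sequence id 1
    }

    // Rejection path without a reset: the reply continues from whatever
    // the previous command left behind.
    static int replySeqWithoutReset(SequenceState state) {
        return state.take();
    }

    public static void main(String[] args) {
        SequenceState state = new SequenceState();
        replySeqAfterReset(state); // previous command: reply packet 1
        state.take();              // previous command: reply packet 2
        state.take();              // previous command: reply packet 3
        int stale = replySeqWithoutReset(state);
        System.out.println("stale reply seq = " + stale + ", client expects 1");
    }
}
```

In this model the reply sent without a reset carries sequence id 4 while the client, having just sent a new command with id 0, expects 1. That mismatch is the kind of protocol error the review describes.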

Critical checkpoints:

  • Goal of the task: Partially met. The PR adds cleanup on rejected task submission, but the client-facing overload path is still protocol-incorrect, so the goal is not fully achieved.
  • Change scope: Small and focused.
  • Concurrency: Yes. ReadListener.handleEvent() runs on the XNIO I/O thread and hands work to the worker pool; the new rejection path stays on the I/O thread and bypasses normal command-processing state transitions.
  • Lifecycle/static init: No special static-init issue. Session cleanup still closes the channel and relies on the existing close listener for unregister.
  • Config changes: None.
  • Compatibility/storage format/API compatibility: None.
  • Parallel paths: AcceptListener has a similar rejection branch, but it is not an equivalent template because this read path is already inside the post-handshake MySQL command protocol.
  • Special conditions: The new RejectedExecutionException branch needs protocol-aware handling or an alternative that does not emit a malformed MySQL reply.
  • Tests: Added unit coverage only checks mocked method calls. No test covers real packet sequencing or client-visible behavior for rejected read tasks.
  • Test result files: Not applicable.
  • Observability: Existing warn logs are sufficient.
  • Transaction/persistence/data writes/FE-BE variable passing: Not involved.
  • Performance: The rejection branch also performs synchronous response/cleanup work on the I/O thread in an overload scenario.

Reviewer validation: code inspection only.

Comment thread fe/fe-core/src/main/java/org/apache/doris/mysql/ReadListener.java Outdated
@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 68.42% (13/19) 🎉
Increment coverage report
Complete coverage report

@HonestManXin
Contributor Author

/review

@HonestManXin
Contributor Author

run buildall

Contributor

@github-actions github-actions Bot left a comment

Findings

  1. ReadListener rejected-task teardown is not lifecycle-safe for transactional sessions. The new catch branch calls cleanup() before installing ConnectContext thread-local. On a registered MySQL session, connection close reaches ConnectPoolMgr.unregisterConnection() -> ctx.closeTxn(). On observer FE, TransactionEntry.abortTransaction() uses ConnectContext.get() to build MasterOpExecutor / MasterTxnExecutor, so this path can fail to abort the txn and leave server-side transaction state behind.

  2. The same branch performs full session cleanup on the XNIO I/O thread. ConnectContext.cleanup() may delete temp tables or forward to master, and close-triggered txn abort can do synchronous metadata / RPC work. Under overload, that can block the event loop instead of just failing the single session quickly.
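The first finding hinges on teardown code that looks up the session through a thread-local rather than receiving it as an argument. A minimal sketch of that dependency, assuming a simplified context class (SessionContextSketch and its methods are hypothetical and are not the Doris ConnectContext API):

```java
// Hypothetical miniature of a session context bound to a thread-local.
final class SessionContextSketch {
    private static final ThreadLocal<SessionContextSketch> CURRENT = new ThreadLocal<>();

    boolean txnAborted = false;

    static SessionContextSketch get() {
        return CURRENT.get();
    }

    void install() {
        CURRENT.set(this);
    }

    static void remove() {
        CURRENT.remove();
    }

    // Stand-in for teardown code that finds the context via the thread-local.
    static boolean abortCurrentTxn() {
        SessionContextSketch ctx = get();
        if (ctx == null) {
            return false; // nothing bound to this thread, so the abort cannot proceed
        }
        ctx.txnAborted = true;
        return true;
    }

    // Teardown that binds the context first and always unbinds it afterwards.
    static boolean teardown(SessionContextSketch ctx) {
        ctx.install();
        try {
            return abortCurrentTxn();
        } finally {
            remove(); // avoid leaking the binding onto a shared thread
        }
    }

    public static void main(String[] args) {
        SessionContextSketch ctx = new SessionContextSketch();
        System.out.println("without install: " + abortCurrentTxn());
        System.out.println("with install:    " + teardown(ctx));
        System.out.println("still bound:     " + (get() != null));
    }
}
```

The try/finally matters on pooled or I/O threads, where a binding left behind would be visible to the next session handled by the same thread.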

Critical Checkpoints

  • Goal of the task: Partially accomplished. The PR catches RejectedExecutionException, but the recovery path is not safe for already-initialized sessions. The added test only proves a mocked cleanup() call, not end-to-end session teardown.
  • Small / focused change: Yes. The change is minimal and isolated to ReadListener plus a targeted unit test.
  • Concurrency: Not safe yet. ReadListener starts on the XNIO I/O thread and normally hands work to a worker thread. The new rejection path changes teardown thread / affinity and can run heavyweight cleanup on the I/O thread.
  • Lifecycle management: Incorrect in the new branch. ConnectContext thread-local is not installed before transaction / connection teardown can run.
  • Config changes: None.
  • Compatibility changes: None.
  • Parallel code paths: AcceptListener is not an exact precedent because its rejection handling happens before a session can own transactions or temp tables.
  • Special condition rationale: The new RejectedExecutionException catch is justified, but it still needs lifecycle-safe teardown for mid-session failures.
  • Test coverage: Insufficient. No test covers a real registered connection with transaction or temp-table cleanup after worker rejection.
  • Test results: No autogenerated result files changed; the new unit test itself does not exercise the risky lifecycle path.
  • Observability: Existing warning logs are adequate.
  • Transaction / persistence: Not safe for observer transactional sessions because rollback can fail on unregister.
  • Data write / modification safety: At risk for in-flight transactional sessions due to failed abort on cleanup.
  • FE-BE variable passing: Not applicable.
  • Performance: Risky due to heavyweight cleanup on the I/O thread.
  • Other issues: None beyond the lifecycle and I/O-thread teardown problems above.

Requesting changes.

Comment thread fe/fe-core/src/main/java/org/apache/doris/mysql/ReadListener.java Outdated
Comment thread fe/fe-core/src/main/java/org/apache/doris/mysql/ReadListener.java Outdated
@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 23.53% (4/17) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 76.47% (13/17) 🎉
Increment coverage report
Complete coverage report

@HonestManXin
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 18.75% (3/16) 🎉
Increment coverage report
Complete coverage report

@HonestManXin
Contributor Author

run p0

@HonestManXin
Contributor Author

run external

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 56.52% (13/23) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 3.23% (13/402) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 3.13% (13/415) 🎉
Increment coverage report
Complete coverage report

@924060929
Contributor

/review Since the cleanup here forcibly closes the channel, is it safe even though no packets are consumed? The client should see that the connection has been killed, but I'm not sure whether it will receive a specific error message, such as 'high load'.

Contributor

@github-actions github-actions Bot left a comment

One blocking issue remains on the current head, and I found one additional test gap.

  1. ReadListener still performs full ctx.cleanup() on the XNIO I/O thread when worker submission is rejected. That issue is already raised inline and is still blocking on the latest head: cleanup() may delete temp tables or synchronously abort transaction work through the close listener, so overload on the worker pool can still spill heavyweight teardown back onto the event loop.

  2. ConnectionExceedTest does not actually verify the lifecycle ordering it is trying to protect. Each Mockito.inOrder(context) creates a fresh verifier, so the test passes as long as each call happens at some point; it will not fail if setThreadLocalInfo(), setKilled(), and cleanup() are reordered.
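The difference between "each call happened" and "the calls happened in this order" can be shown without a mocking library. The sketch below swaps in a hand-rolled recorder instead of Mockito; RecordingContext and OrderCheckSketch are hypothetical names. With Mockito itself, the usual way to get the strict check is to create one InOrder instance and reuse it for every verify call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical hand-rolled recorder standing in for a mocked context.
final class RecordingContext {
    final List<String> calls = new ArrayList<>();

    void setThreadLocalInfo() {
        calls.add("setThreadLocalInfo");
    }

    void setKilled() {
        calls.add("setKilled");
    }

    void cleanup() {
        calls.add("cleanup");
    }
}

final class OrderCheckSketch {
    // Weak check: every expected call happened at some point.
    // This is roughly what independent per-call verifiers end up asserting.
    static boolean eachCalled(List<String> calls, String... expected) {
        return calls.containsAll(Arrays.asList(expected));
    }

    // Strict check: the expected calls appear as an ordered subsequence
    // of the recorded calls, tracked by one shared cursor.
    static boolean calledInOrder(List<String> calls, String... expected) {
        int matched = 0;
        for (String call : calls) {
            if (matched < expected.length && call.equals(expected[matched])) {
                matched++;
            }
        }
        return matched == expected.length;
    }

    public static void main(String[] args) {
        RecordingContext ctx = new RecordingContext();
        // Deliberately reordered relative to the intended lifecycle.
        ctx.cleanup();
        ctx.setKilled();
        ctx.setThreadLocalInfo();

        String[] intended = {"setThreadLocalInfo", "setKilled", "cleanup"};
        System.out.println("eachCalled:    " + eachCalled(ctx.calls, intended));
        System.out.println("calledInOrder: " + calledInOrder(ctx.calls, intended));
    }
}
```

The weak check accepts the reordered sequence while the strict check rejects it, which is the gap the review points out in the test.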

Critical checkpoints:

  • Goal of the task: Partially accomplished. The PR no longer sends a protocol reply from the rejected-submission path and now restores ConnectContext thread-local before cleanup, but the overload teardown path is still not safe on the I/O thread.
  • Change scope: Small and focused.
  • Concurrency: Yes. ReadListener.handleEvent() starts on the XNIO I/O thread and normally hands work to the worker pool. The rejected-submission branch still executes heavyweight teardown on the I/O thread, and the added test does not really lock down the intended lifecycle order.
  • Lifecycle/static init: No static-init issue. Session thread-local lifecycle is improved versus the prior revision, but the teardown thread affinity is still problematic.
  • Config changes: None.
  • Compatibility/incompatible changes: None.
  • Parallel paths: AcceptListener is not an equivalent template because its overload handling happens before a session owns temp tables or transactions.
  • Special conditions: Catching RejectedExecutionException is appropriate, but the failure path should avoid heavy cleanup on the event loop.
  • Test coverage: Insufficient. The new unit test is mock-only and the current InOrder usage does not actually assert ordering.
  • Test result files: Not applicable.
  • Observability: Existing warn logs are sufficient.
  • Transaction/persistence/data writes: No new persistence format or FE-BE protocol change. The remaining production risk is teardown behavior under overload rather than compatibility.
  • FE-BE variable passing: Not applicable.
  • Performance: Not safe yet because overload can still push cleanup or RPC-style work onto the event loop.
  • Other issues: None beyond the above.

User focus: closing the channel without consuming the request is protocol-safe, and the client should observe the connection drop rather than a malformed MySQL packet. No additional issue there beyond the tradeoff that this path cannot reliably return a specific high load ERR packet without first consuming the request and resetting protocol sequence state.

Reviewer validation: code inspection only.

morrySnow previously approved these changes Apr 23, 2026
@github-actions bot added the approved label (Indicates a PR has been approved by one committer.) Apr 23, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@HonestManXin
Contributor Author

/review

@github-actions bot removed the approved label (Indicates a PR has been approved by one committer.) Apr 24, 2026
@HonestManXin
Contributor Author

run buildall

Contributor

@github-actions github-actions Bot left a comment

I reviewed the current head 7c279c6d against the Doris FE review checklist.

The previous protocol-sequence problem and the unit-test ordering bug are fixed in this revision. I did not add duplicate inline comments, but one blocking concern from the existing inline review still applies on the current head: the RejectedExecutionException path in ReadListener still calls full ctx.cleanup() on the XNIO I/O thread, which can perform temp-table cleanup, forwarded RPC, and transaction-abort work under overload.
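One possible way to keep heavyweight teardown off an event-loop thread is to hand it to a separate fallback executor. This is only a sketch of that general idea under stated assumptions: the conversation does not say whether or how the merged revision addressed the concern, and TeardownOffloadSketch, offloadCleanup, and the single-thread cleanup pool are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class TeardownOffloadSketch {
    // Runs the cleanup on the given pool and reports which thread executed it.
    // The caller (for example an I/O thread) returns as soon as the task is queued.
    static CompletableFuture<Thread> offloadCleanup(ExecutorService cleanupPool, Runnable cleanup) {
        return CompletableFuture.supplyAsync(() -> {
            cleanup.run();
            return Thread.currentThread();
        }, cleanupPool);
    }

    // Returns true when the cleanup ran on a thread other than the caller's.
    static boolean demo() {
        // Hypothetical fallback pool; newSingleThreadExecutor uses an unbounded queue,
        // so submissions are not rejected while the pool is running.
        ExecutorService cleanupPool = Executors.newSingleThreadExecutor();
        try {
            Thread caller = Thread.currentThread();
            // join() is used here only so the demo can inspect the result;
            // a real event loop would not block on the future.
            Thread ranOn = offloadCleanup(cleanupPool, () -> { }).join();
            return ranOn != caller;
        } finally {
            cleanupPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("cleanup ran off the caller thread: " + demo());
    }
}
```

The tradeoff is that an unbounded fallback queue can itself grow under sustained overload, so a design along these lines would still need a bound or a cheap last-resort path such as closing the channel only.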

Critical checkpoints:

  • Goal / correctness: Partially met. The PR now closes rejected sessions instead of leaving them suspended, but the overload cleanup path is still not safe on the I/O thread.
  • Scope: Small and focused.
  • Concurrency: Not safe yet. The rejection path runs on the I/O thread and performs heavyweight teardown there.
  • Lifecycle: Thread-local setup/removal is now more consistent with the normal async path.
  • Configuration / compatibility / FE-BE protocol: No new config or compatibility concerns in this patch.
  • Parallel paths: AcceptListener remains the closest analogous path; this PR only updates ReadListener.
  • Tests: The new unit test improves coverage for the rejection branch and now verifies ordering correctly, but it is still mock-based and does not exercise real registered-connection teardown behavior.
  • Observability: Existing logging is adequate for this small change.

User focus:

  • No additional user-provided focus points.

Because the remaining blocker is already captured in an existing inline thread, I did not post a duplicate inline comment.

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 4.91% (13/265) 🎉
Increment coverage report
Complete coverage report

@morrySnow merged commit 8c891ca into apache:master on May 6, 2026
33 checks passed

4 participants