Skip to content

killAllExpiredTransactions: use a shared_ptr to manage coro context#212

Merged
xiexiaoy merged 1 commit intoeloqdata:mainfrom
xiexiaoy:coro_crash
Sep 26, 2025
Merged

killAllExpiredTransactions: use a shared_ptr to manage coro context#212
xiexiaoy merged 1 commit intoeloqdata:mainfrom
xiexiaoy:coro_crash

Conversation

@xiexiaoy
Copy link
Collaborator

@xiexiaoy xiexiaoy commented Sep 25, 2025

  1. killAllExpiredTransactions: use a shared_ptr to manage coro context.
  2. Use boost::context::protected_fixedsize_stack to replace the self-implemented stack.

Summary by CodeRabbit

  • Performance
    • Improved coroutine stack allocation for lower overhead and better responsiveness under load.
  • Refactor
    • Consolidated allocator-backed coroutine setup and simplified coroutine lifecycle and capture handling.
  • Stability
    • Added safer lifetime checks for background/state-machine tasks and removed legacy stack preallocation/overflow checks while preserving error handling and synchronization.
  • Logging
    • Standardized log wording and punctuation for clearer diagnostics.

@coderabbitai
Copy link

coderabbitai bot commented Sep 25, 2025

Walkthrough

Replaces raw stack backing and canary-based overflow checks with boost::context::protected_fixedsize_stack; introduces shared CoroCtx in kill_sessions_local for allocator-enabled continuations and hooks; updates callcc/continuation captures, logging, and lifecycle checks while preserving yield/resume/error-handling semantics.

Changes

Cohort / File(s) Summary
KillSessions coroutine context refactor
src/mongo/db/kill_sessions_local.cpp
Adds CoroCtx holding a protected_fixedsize_stack allocator, source continuation, and yieldFunc/resumeFunc/longResumeFunc hooks; switches callcc to std::allocator_arg + CoroCtx allocator and uses std::shared_ptr<CoroCtx> across continuation path; updates continuation body, captures, and resume/yield logging; preserves DBException handling and sync/finalization.
ServiceStateMachine allocator switch & overflow check removal
src/mongo/transport/service_state_machine.h, src/mongo/transport/service_state_machine.cpp
Replaces boost::context::stack_context / raw _coroStack and canary-based overflow logic with boost::context::protected_fixedsize_stack _salloc{kCoroStackSize}; removes NoopAllocator, canary constants, _coroStackContext(), and _abortIfStackOverflow(); updates callcc usage to pass _salloc and changes coroutine lambdas to use a weak_from_this() guard; removes explicit stack-overflow checks in yield/resume flows.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant Kill as kill_sessions_local
  participant Coro as CoroCtx (salloc + source)
  participant Cont as Continuation

  Caller->>Kill: start kill sessions
  Note right of Kill: create shared CoroCtx with protected_fixedsize_stack
  Kill->>Coro: init yield/resume hooks
  Kill->>Cont: callcc(std::allocator_arg, Coro.salloc, ...)
  Note over Cont,Coro: continuation stores Coro.source and uses Coro.yieldFunc to yield
  Cont-->>Kill: yield (via Coro.yieldFunc)
  Kill->>Cont: resume (via Coro.resumeFunc / longResumeFunc)
  Cont-->>Kill: complete or throw DBException
  Kill-->>Caller: finalize
Loading
sequenceDiagram
  autonumber
  participant Net as Network
  participant SSM as ServiceStateMachine
  participant Alloc as _salloc (protected_fixedsize_stack)
  participant Src as _source (continuation)

  Net->>SSM: new connection/event
  Note right of SSM: callcc(std::allocator_arg, Alloc, ...) with weak_from_this() guard
  SSM->>Src: enter processing loop
  Src-->>SSM: yield (no explicit stack-overflow checks)
  SSM->>Src: resume on I/O/state change
  alt normal completion
    Src-->>SSM: done
  else error
    Src-->>SSM: error surfaced
  end
  SSM-->>Net: transition/close
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I nibbled at canaries, then found a new track,
swapped raw stacks for allocs and hopped right back.
Shared CoroCtx snug in a burrow so deep,
I yield, then resume—no more nightmares in sleep. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title clearly states that killAllExpiredTransactions now uses a shared_ptr for its coroutine context, which is one of the real and significant changes in this PR, even though other shifts like adopting protected_fixedsize_stack and updating the service state machine are not mentioned.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@xiexiaoy xiexiaoy changed the title coroutine stack: use boost::context::protected_fixedsize_stack killAllExpiredTransactions: move source and stack out of the callback of scanSessions Sep 25, 2025
@xiexiaoy xiexiaoy marked this pull request as ready for review September 25, 2025 09:51
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
src/mongo/db/kill_sessions_local.cpp (3)

104-108: Consider removing or adjusting log level for coroutine resume logging

The log message "abortArbitraryTransactionIfExpired call resume." might be too verbose for production at the default log level. Consider using a higher verbosity level (e.g., LOG(3) or LOG(4)) or removing it if it was added for debugging purposes.

 std::function<void()> resumeTask = [&source, &client] {
-    log() << "abortArbitraryTransactionIfExpired call resume.";
+    LOG(3) << "abortArbitraryTransactionIfExpired call resume.";
     Client::setCurrent(std::move(client));
     source = source.resume();
 };

122-126: Consider removing or adjusting log level for coroutine yield logging

Similar to the resume logging, this yield log message appears to be debug information that might be too verbose for production.

 yieldFunc = [&sink, &client]() {
-    log() << "abortArbitraryTransactionIfExpired call yield.";
+    LOG(3) << "abortArbitraryTransactionIfExpired call yield.";
     client = Client::releaseCurrent();
     sink = sink.resume();
 };

117-158: Verify protected stack size is sufficient

The protected_fixedsize_stack is allocated with 3200KB (3.125MB). Ensure this size is adequate for the worst-case stack usage during transaction abort operations, considering any potential deep call stacks or large local variables.

The stack size allocation of 3200KB appears reasonable for typical transaction abort operations. However, you should monitor stack usage in production to ensure this is sufficient, especially under heavy load or with complex transaction scenarios. Consider adding metrics or diagnostics to track actual stack usage if issues arise.

src/mongo/transport/service_state_machine.cpp (1)

554-557: Consider adjusting log level for coroutine yield logging

The yield log message at LOG(3) level seems appropriate for debugging but might still be verbose in some scenarios.

Consider using LOG(4) or LOG(5) for even less verbose output in production:

 ssm->_coroYield = [ssm = ssm.get(), &sink]() {
-    MONGO_LOG(3) << "call yield";
+    MONGO_LOG(4) << "call yield";
     ssm->_dbClient = Client::releaseCurrent();
     sink = sink.resume();
 };
src/mongo/transport/service_state_machine.h (1)

267-267: Extract coroutine stack size into a shared constant
Hardcoding 3200 * 1024 in service_state_machine.h (line 267) and kill_sessions_local.cpp (line 83) leads to duplication. Define

constexpr size_t kCoroutineStackSize = 3200 * 1024;

in a shared header and replace both instances.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd60381 and 3b4746d.

📒 Files selected for processing (3)
  • src/mongo/db/kill_sessions_local.cpp (4 hunks)
  • src/mongo/transport/service_state_machine.cpp (1 hunks)
  • src/mongo/transport/service_state_machine.h (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/mongo/transport/service_state_machine.cpp (1)
src/mongo/db/client.cpp (2)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
src/mongo/db/kill_sessions_local.cpp (1)
src/mongo/db/client.cpp (4)
  • setCurrent (169-172)
  • setCurrent (169-169)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
🔇 Additional comments (3)
src/mongo/transport/service_state_machine.cpp (1)

548-551: Good: Consistent use of protected stack allocator

The migration to boost::context::protected_fixedsize_stack with the allocator-augmented callcc provides automatic stack overflow protection, eliminating the need for manual overflow checks that were removed throughout the file.

src/mongo/transport/service_state_machine.h (1)

267-269: Good: Clean migration to protected stack allocation

The replacement of manual stack management with boost::context::protected_fixedsize_stack is a significant improvement. This provides automatic stack overflow protection through guard pages, eliminating the need for manual canary checking and making the coroutine implementation more robust and maintainable.

src/mongo/db/kill_sessions_local.cpp (1)

83-84: Drop thread-safety concern—scanSessions is single-threaded
scanSessions acquires _mutex and invokes the callback serially, so salloc and source captured by reference are never accessed concurrently.

@xiexiaoy xiexiaoy force-pushed the coro_crash branch 2 times, most recently from c027a9b to ad6e003 Compare September 25, 2025 13:28
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/mongo/transport/service_state_machine.h (1)

267-270: Make stack allocator const and centralize stack size.

  • _salloc can be const; it’s not mutated after construction.
  • kCoroStackSize duplicates the literal used elsewhere (kill_sessions_local.cpp). Prefer one shared definition to avoid drift.

Apply:

-    static constexpr size_t kCoroStackSize = 3200 * 1024;
+    static constexpr size_t kCoroStackSize = 3200 * 1024;  // TODO: centralize this constant
     boost::context::continuation _source;
-    boost::context::protected_fixedsize_stack _salloc{kCoroStackSize};
+    const boost::context::protected_fixedsize_stack _salloc{kCoroStackSize};

Optionally, expose kCoroStackSize in a common header (e.g., mongo/util/coro_stack.h) and include it here and in kill_sessions_local.cpp.

src/mongo/db/kill_sessions_local.cpp (1)

102-104: Avoid hardcoding stack size; reuse a single constant.

3200*1024 is duplicated from ServiceStateMachine. Prefer a shared constant to prevent divergence.

Apply:

-            boost::context::protected_fixedsize_stack salloc(3200 * 1024);
+            // Consider centralizing this constant (e.g., mongo/util/coro_stack.h)
+            boost::context::protected_fixedsize_stack salloc(3200 * 1024);

If practical, include a shared header and use the same kCoroStackSize.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b4746d and ad6e003.

📒 Files selected for processing (4)
  • src/mongo/db/kill_sessions_local.cpp (3 hunks)
  • src/mongo/transport/service_state_machine.cpp (1 hunks)
  • src/mongo/transport/service_state_machine.h (2 hunks)
  • src/mongo/util/exit.cpp (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/mongo/util/exit.cpp
🧰 Additional context used
🧬 Code graph analysis (2)
src/mongo/db/kill_sessions_local.cpp (2)
src/mongo/transport/service_executor_coroutine.cpp (2)
  • resumeTask (49-54)
  • resumeTask (49-49)
src/mongo/db/client.cpp (4)
  • setCurrent (169-172)
  • setCurrent (169-169)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
src/mongo/transport/service_state_machine.cpp (1)
src/mongo/db/client.cpp (2)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
🔇 Additional comments (5)
src/mongo/transport/service_state_machine.h (1)

34-34: Allocator include looks correct.

Using boost::context::protected_fixedsize_stack is appropriate for guarded stacks.

src/mongo/transport/service_state_machine.cpp (1)

549-561: Allocator-aware callcc usage LGTM.

Passing std::allocator_arg with _salloc and capturing sink by reference is standard in this pattern. No issues spotted here.

src/mongo/db/kill_sessions_local.cpp (3)

47-48: Allocator header include is appropriate.

Required for protected_fixedsize_stack; looks good.


161-164: Continuation drain loop LGTM.

Draining while (source) ensures completion. Good addition.


102-110: Confirm intent: source/salloc still live inside scanSessions callback.

The PR description mentions moving source and stack out of the scanSessions callback. They’re still defined within the callback (before scheduling task). If the goal was to move them fully outside scanSessions, please clarify or adjust.

@xiexiaoy xiexiaoy force-pushed the coro_crash branch 4 times, most recently from 187b978 to 59af554 Compare September 25, 2025 14:19
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (6)
src/mongo/transport/service_state_machine.cpp (1)

548-561: Capture ThreadGuard by move, not reference.

Capturing &guard and then moving it is brittle. Capture by move to avoid dangling-reference risks in future edits/refactors.

-                        _source = boost::context::callcc(
-                            std::allocator_arg,
-                            _salloc,
-                            [ssm = shared_from_this(),
-                             &guard](boost::context::continuation&& sink) {
+                        _source = boost::context::callcc(
+                            std::allocator_arg,
+                            _salloc,
+                            [ssm = shared_from_this(),
+                             guard = std::move(guard)](boost::context::continuation&& sink) {
                                 ssm->_coroYield = [ssm = ssm.get(), &sink]() {
                                     MONGO_LOG(3) << "call yield";
                                     ssm->_dbClient = Client::releaseCurrent();
                                     sink = sink.resume();
                                 };
                                 ssm->_processMessage(std::move(guard));
 
                                 return std::move(sink);
                             });
src/mongo/transport/service_state_machine.h (2)

34-34: Remove unused/over-specific include.

continuation.hpp suffices; continuation_fcontext.hpp is an internal/impl header and unnecessary here.

-#include <boost/context/continuation_fcontext.hpp>

267-270: Avoid duplicating stack-size literals across modules.

This constant (3200*1024) also appears in kill_sessions_local.cpp. Consider centralizing (e.g., a shared constexpr in a common header) to keep the stack size consistent.

src/mongo/db/kill_sessions_local.cpp (3)

46-48: Prefer public continuation header.

Use <boost/context/continuation.hpp> instead of the fcontext-specific header.

-#include <boost/context/continuation_fcontext.hpp>
-#include <boost/context/protected_fixedsize_stack.hpp>
+#include <boost/context/continuation.hpp>
+#include <boost/context/protected_fixedsize_stack.hpp>

102-104: Allocator lifetime: capture by value to reduce lifetime coupling.

task currently captures salloc by reference. Capturing by value avoids accidental lifetime issues if the waiting behavior changes later.

-            boost::context::protected_fixedsize_stack salloc(3200 * 1024);
+            boost::context::protected_fixedsize_stack salloc(3200 * 1024);
             boost::context::continuation source;

And update the capture below (see next comment’s diff).


162-166: Task attribution: use a more specific executor task name if available.

kSSMProcessMessage may skew metrics. If the executor exposes a task name for session cleanup/kill or maintenance, prefer that.

Can you check if transport::ServiceExecutorTaskName has a suitable constant (e.g., for session maintenance/cleanup) to replace kSSMProcessMessage?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ad6e003 and 187b978.

📒 Files selected for processing (4)
  • src/mongo/db/kill_sessions_local.cpp (3 hunks)
  • src/mongo/transport/service_state_machine.cpp (1 hunks)
  • src/mongo/transport/service_state_machine.h (2 hunks)
  • src/mongo/util/exit.cpp (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/mongo/transport/service_state_machine.cpp (1)
src/mongo/db/client.cpp (2)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
src/mongo/db/kill_sessions_local.cpp (1)
src/mongo/db/client.cpp (4)
  • setCurrent (169-172)
  • setCurrent (169-169)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
🔇 Additional comments (2)
src/mongo/util/exit.cpp (1)

120-123: LGTM: logging split is fine.

No behavior change; the message formatting remains clear.

src/mongo/db/kill_sessions_local.cpp (1)

115-160: Critical: OperationContext crosses threads; create it on the executor thread.

opCtx originates on the caller’s thread and is used inside the scheduled task on another thread. Construct a fresh OperationContext after Client::setCurrent and use that instead. Also capture salloc by value.

-            auto task =
-                [&salloc, &finished, &mux, &cv, &source, &yieldFunc, opCtx, session, &client] {
+            auto task =
+                [salloc, &finished, &mux, &cv, &source, &yieldFunc, session, &client] {
                     Client::setCurrent(std::move(client));
                     source = boost::context::callcc(
                         std::allocator_arg,
                         salloc,
-                        [&yieldFunc, opCtx, session, &client](boost::context::continuation&& sink) {
+                        [&yieldFunc, session, &client](boost::context::continuation&& sink) {
                             yieldFunc = [&sink, &client]() {
                                 log() << "abortArbitraryTransactionIfExpired call yield.";
                                 client = Client::releaseCurrent();
                                 sink = sink.resume();
                             };
 
-                            try {
-                                session->abortArbitraryTransactionIfExpired(opCtx);
+                            try {
+                                auto localOpCtx = Client::getCurrent()->makeOperationContext();
+                                session->abortArbitraryTransactionIfExpired(localOpCtx.get());
                             } catch (const DBException& ex) {
                                 const Status& status = ex.toStatus();
                                 std::string errmsg = str::stream()
                                     << "May have failed to abort expired "
                                        "transaction with session id (lsid) '"
                                     << session->getSessionId() << "'."
                                     << " Caused by: " << status;

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/mongo/transport/service_state_machine.h (1)

267-270: Optional: make stack size portable/configurable

  • Consider using boost::context::stack_traits::default_size() (or max_size()) to derive a portable default, or gate kCoroStackSize behind a server parameter for tuning under heavy connection counts.
  • If coroutine mode can be disabled, lazily construct the allocator only when enabled to reduce per-connection footprint.

No behavior change required now.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 187b978 and 59af554.

📒 Files selected for processing (4)
  • src/mongo/db/kill_sessions_local.cpp (4 hunks)
  • src/mongo/transport/service_state_machine.cpp (1 hunks)
  • src/mongo/transport/service_state_machine.h (2 hunks)
  • src/mongo/util/exit.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/mongo/db/kill_sessions_local.cpp
🧰 Additional context used
🧬 Code graph analysis (1)
src/mongo/transport/service_state_machine.cpp (1)
src/mongo/db/client.cpp (2)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
🔇 Additional comments (3)
src/mongo/util/exit.cpp (1)

120-123: LGTM: log message reformat only

No functional change; the split insertions are fine.

src/mongo/transport/service_state_machine.h (1)

34-35: Include boost::context::protected_fixedsize_stack: OK

Header addition is appropriate for the allocator-backed coroutine stack.

src/mongo/transport/service_state_machine.cpp (1)

548-562: Allocator-augmented callcc: verify continuation lifetime

  • _coroYield captures sink by reference—ensure it’s only ever invoked while the coroutine frame (and its sink) is active (i.e. never after the coroutine returns or before callcc sets up sink).
  • _source.resume() is already gated on CoroStatus::OnGoing—keep tests covering double-resume and migration paths.
  • Optional: capture sink by value in a small holder on the coroutine stack and expose a bound yield() method.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/mongo/transport/service_state_machine.h (1)

34-35: Remove unused fcontext include

continuation.hpp is sufficient; continuation_fcontext.hpp isn’t needed here.

-#include <boost/context/continuation_fcontext.hpp>
 #include <boost/context/protected_fixedsize_stack.hpp>
src/mongo/db/kill_sessions_local.cpp (1)

161-169: Race/UAF guard: ensure task waits or lifetime is anchored

With the above changes, source/salloc lifetime is anchored; if you choose not to move them out, alternatively keep the task alive until finished to avoid UAF (less preferred).

Alternative minimal change inside task after callcc:

-                // (task returns immediately here)
+                // Keep locals alive until completion if you don't hoist them.
+                {
+                    std::unique_lock lk(mux);
+                    cv.wait(lk, [&finished]() { return finished; });
+                }

Prefer the earlier hoist + single post-callcc resume.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 59af554 and 87b4ffc.

📒 Files selected for processing (4)
  • src/mongo/db/kill_sessions_local.cpp (2 hunks)
  • src/mongo/transport/service_state_machine.cpp (1 hunks)
  • src/mongo/transport/service_state_machine.h (2 hunks)
  • src/mongo/util/exit.cpp (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/mongo/transport/service_state_machine.cpp (1)
src/mongo/db/client.cpp (2)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
src/mongo/db/kill_sessions_local.cpp (2)
src/mongo/transport/service_executor_coroutine.cpp (2)
  • resumeTask (49-54)
  • resumeTask (49-49)
src/mongo/db/client.cpp (2)
  • setCurrent (169-172)
  • setCurrent (169-169)
🔇 Additional comments (4)
src/mongo/util/exit.cpp (1)

120-123: LGTM: logging split is benign

Splitting the stream insertion has no behavioral impact; formatting remains intact.

src/mongo/transport/service_state_machine.cpp (1)

548-552: Allocator-arg callcc adoption looks correct

Using protected_fixedsize_stack via allocator_arg is the right move and matches the header change.

src/mongo/db/kill_sessions_local.cpp (1)

120-131: Bug: OperationContext crosses threads

opCtx originates on the caller’s thread but is used inside the scheduled task. Create a fresh OperationContext on the executor thread.

Apply within the callcc body:

-                        try {
-                            session->abortArbitraryTransactionIfExpired(opCtx);
+                        try {
+                            auto localOpCtx = Client::getCurrent()->makeOperationContext();
+                            session->abortArbitraryTransactionIfExpired(localOpCtx.get());

And remove opCtx from all lambda capture lists.

src/mongo/transport/service_state_machine.h (1)

267-270: Fix destruction order: destroy continuation before allocator

Currently _salloc is declared after _source, so it will be destroyed first, risking UB if the continuation references allocator-managed resources. Declare the allocator before the continuation so the continuation is destroyed first.

-    static constexpr size_t kCoroStackSize = 3200 * 1024;
-    boost::context::continuation _source;
-    boost::context::protected_fixedsize_stack _salloc{kCoroStackSize};
+    static constexpr size_t kCoroStackSize = 3200 * 1024;
+    boost::context::protected_fixedsize_stack _salloc{kCoroStackSize};
+    boost::context::continuation _source;

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/mongo/transport/service_state_machine.cpp (1)

548-561: Allocator-aware callcc is correct; capture guard by move, not reference

Avoid capturing ThreadGuard by reference across coroutine boundaries. Capture by move to eliminate lifetime pitfalls.

Apply:

-                        _source = boost::context::callcc(
-                            std::allocator_arg,
-                            _salloc,
-                            [ssm = shared_from_this(),
-                             &guard](boost::context::continuation&& sink) {
+                        _source = boost::context::callcc(
+                            std::allocator_arg,
+                            _salloc,
+                            [ssm = shared_from_this(),
+                             guard = std::move(guard)](boost::context::continuation&& sink) mutable {
                                 ssm->_coroYield = [ssm = ssm.get(), &sink]() {
                                     MONGO_LOG(3) << "call yield";
                                     ssm->_dbClient = Client::releaseCurrent();
                                     sink = sink.resume();
                                 };
-                                ssm->_processMessage(std::move(guard));
+                                ssm->_processMessage(std::move(guard));
 
                                 return std::move(sink);
                             });
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87b4ffc and 32c4fd8.

📒 Files selected for processing (4)
  • src/mongo/db/kill_sessions_local.cpp (2 hunks)
  • src/mongo/transport/service_state_machine.cpp (1 hunks)
  • src/mongo/transport/service_state_machine.h (2 hunks)
  • src/mongo/util/exit.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/mongo/util/exit.cpp
🧰 Additional context used
🧬 Code graph analysis (2)
src/mongo/db/kill_sessions_local.cpp (2)
src/mongo/transport/service_executor_coroutine.cpp (2)
  • resumeTask (49-54)
  • resumeTask (49-49)
src/mongo/db/client.cpp (4)
  • setCurrent (169-172)
  • setCurrent (169-169)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
src/mongo/transport/service_state_machine.cpp (1)
src/mongo/db/client.cpp (2)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
🔇 Additional comments (5)
src/mongo/transport/service_state_machine.h (2)

34-34: Switch to protected_fixedsize_stack — LGTM

Appropriate header for allocator-backed coroutine stacks.


267-269: Fix for destruction order — LGTM (continuation destroyed before allocator)

Declaring _salloc before _source ensures the continuation is destroyed first. This resolves the earlier concern.

src/mongo/db/kill_sessions_local.cpp (3)

47-48: Header include — LGTM

Using boost::context::protected_fixedsize_stack is appropriate here.


102-104: Good: allocator and continuation lifetime moved out of the scheduled task

This prevents UAF of continuation/stack across resume calls.


115-160: Critical: opCtx crosses threads; and missing resume after callcc leads to hang

  • opCtx is created on the caller thread but used inside the scheduled task; create it on the executor thread after Client::setCurrent.
  • After callcc returns (on initial yield), you must resume once to progress the coroutine, per PR objective; otherwise finished may never be set and the wait deadlocks.

Apply:

-            auto task =
-                [&salloc, &finished, &mux, &cv, source, &yieldFunc, opCtx, session, &client] {
+            auto task =
+                [&salloc, &finished, &mux, &cv, source, &yieldFunc, session, &client] {
                     Client::setCurrent(std::move(client));
                     *source = boost::context::callcc(
                         std::allocator_arg,
                         salloc,
-                        [&finished, &mux, &cv, &yieldFunc, opCtx, session, &client](
+                        [&finished, &mux, &cv, &yieldFunc, session, &client](
                             boost::context::continuation&& sink) {
                             yieldFunc = [&sink, &client]() {
                                 log() << "abortArbitraryTransactionIfExpired call yield.";
                                 client = Client::releaseCurrent();
                                 sink = sink.resume();
                             };
 
-                            std::unique_lock lk(mux);
                             try {
-                                session->abortArbitraryTransactionIfExpired(opCtx);
+                                auto localOpCtx = Client::getCurrent()->makeOperationContext();
+                                session->abortArbitraryTransactionIfExpired(localOpCtx.get());
                             } catch (const DBException& ex) {
                                 const Status& status = ex.toStatus();
                                 std::string errmsg = str::stream()
                                     << "May have failed to abort expired "
                                        "transaction with session id (lsid) '"
                                     << session->getSessionId() << "'."
                                     << " Caused by: " << status;
 
                                 // LockTimeout errors are expected if we are unable to acquire
                                 // an IS lock to clean up transaction cursors. The transaction
                                 // abort (and lock resource release) should have succeeded
                                 // despite failing to clean up cursors. The cursors will
                                 // eventually be cleaned up by the cursor manager. We'll log
                                 // such errors at a higher log level for diagnostic purposes in
                                 // case something gets stuck.
                                 if (ErrorCodes::isShutdownError(status.code()) ||
                                     status == ErrorCodes::LockTimeout) {
                                     LOG(1) << errmsg;
                                 } else {
                                     warning() << errmsg;
                                 }
                             }
 
+                            std::unique_lock lk(mux);
                             finished = true;
                             client = Client::releaseCurrent();
                             cv.notify_one();
                             return std::move(sink);
                         });
+                    // Ensure coroutine progresses past the initial yield.
+                    if (!finished) {
+                        Client::setCurrent(std::move(client));
+                        *source = source->resume();
+                    }
                 };

Optional follow-up (recommended): narrow the mutex scope to only the finished/notify section as shown above to avoid holding mux during the abort.

@xiexiaoy xiexiaoy changed the title killAllExpiredTransactions: move source and stack out of the callback of scanSessions killAllExpiredTransactions: use a shared_ptr to manage source Sep 25, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/mongo/db/kill_sessions_local.cpp (1)

102-104: Stack allocator and continuation lifetime now correctly owned by the outer scope

This fixes prior UAF risks with local stack/continuation inside the scheduled task.

Consider introducing a small constant for the stack size to avoid magic numbers:

-            boost::context::protected_fixedsize_stack salloc(3200 * 1024);
+            constexpr size_t kCoroStackSize = 3200 * 1024;
+            boost::context::protected_fixedsize_stack salloc(kCoroStackSize);
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32c4fd8 and fffb8e8.

📒 Files selected for processing (4)
  • src/mongo/db/kill_sessions_local.cpp (2 hunks)
  • src/mongo/transport/service_state_machine.cpp (1 hunks)
  • src/mongo/transport/service_state_machine.h (2 hunks)
  • src/mongo/util/exit.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/mongo/transport/service_state_machine.cpp
🧰 Additional context used
🧬 Code graph analysis (1)
src/mongo/db/kill_sessions_local.cpp (1)
src/mongo/db/client.cpp (4)
  • setCurrent (169-172)
  • setCurrent (169-169)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
🔇 Additional comments (9)
src/mongo/util/exit.cpp (1)

120-123: LGTM: purely cosmetic stream split

No behavioral change; safe and clearer.

src/mongo/transport/service_state_machine.h (2)

34-34: LGTM: adopt protected_fixedsize_stack include

Matches allocator-backed coroutine approach across the PR.


267-269: Good: safe destruction order (_source before allocator)

Declaring _salloc before _source ensures the continuation is destroyed first. This addresses the prior concern.

src/mongo/db/kill_sessions_local.cpp (6)

47-48: LGTM: switch to protected_fixedsize_stack include

Consistent with the allocator-backed stack usage.


123-128: LGTM: yield functor swaps Client and yields

Matches the coroutine protocol.


129-139: Guarded completion path looks good (aside from opCtx bug above)

Lock/notify on finished is correct. With the opCtx fix, this block is fine.


155-159: LGTM: completion + wakeup

Sets finished, releases Client, and notifies waiter; correct finalization.


115-121: Bug: OperationContext crosses threads; create it on the executor thread

opCtx originates on the caller’s thread. Using it inside the scheduled task is unsafe. Create a fresh OperationContext after Client::setCurrent on the destination thread and remove opCtx from captures.

Apply:

-            auto task =
-                [&salloc, &finished, &mux, &cv, source, &yieldFunc, opCtx, session, &client] {
+            auto task =
+                [&salloc, &finished, &mux, &cv, source, &yieldFunc, session, &client] {
                 Client::setCurrent(std::move(client));
-                *source = boost::context::callcc(
+                *source = boost::context::callcc(
                     std::allocator_arg,
                     salloc,
-                    [&finished, &mux, &cv, source, &yieldFunc, opCtx, session, &client](
+                    [&finished, &mux, &cv, source, &yieldFunc, session, &client](
                         boost::context::continuation&& sink) {

And below in the body:

-                            try {
-                                session->abortArbitraryTransactionIfExpired(opCtx);
+                            try {
+                                auto localOpCtx = Client::getCurrent()->makeOperationContext();
+                                session->abortArbitraryTransactionIfExpired(localOpCtx.get());

105-109: Avoid raw‐pointer capture in resumeTask (UAF risk). Change to capture the shared_ptr itself:

- std::function<void()> resumeTask = [source = source.get(), &client] {
+ std::function<void()> resumeTask = [source, &client] {
     log() << "abortArbitraryTransactionIfExpired call resume.";
     Client::setCurrent(std::move(client));
-    *source = source->resume();
+    (*source) = source->resume();
};

No other resume functors capture raw pointers to the continuation.

@xiexiaoy xiexiaoy changed the title killAllExpiredTransactions: use a shared_ptr to manage source killAllExpiredTransactions: use a shared_ptr to manage coro context Sep 26, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/mongo/db/kill_sessions_local.cpp (1)

122-166: Critical: OperationContext crosses threads; create it on the executor thread

opCtx originates on the caller’s thread and is used inside the scheduled task. Create a new OperationContext after Client::setCurrent on the executor thread and pass that to abortArbitraryTransactionIfExpired. Also remove opCtx from lambda captures.

-            auto task = [&finished, &mux, &cv, coroCtx, opCtx, session, &client] {
+            auto task = [&finished, &mux, &cv, coroCtx, session, &client] {
                 Client::setCurrent(std::move(client));
                 coroCtx->source = boost::context::callcc(
                     std::allocator_arg,
                     coroCtx->salloc,
-                    [&finished, &mux, &cv, coroCtx, opCtx, session, &client](
+                    [&finished, &mux, &cv, coroCtx, session, &client](
                         boost::context::continuation&& sink) {
                         coroCtx->yieldFunc = [&sink, &client]() {
                             log() << "abortArbitraryTransactionIfExpired call yield.";
                             client = Client::releaseCurrent();
                             sink = sink.resume();
                         };

                         std::unique_lock lk(mux);
                         try {
-                            session->abortArbitraryTransactionIfExpired(opCtx);
+                            auto localOpCtx = Client::getCurrent()->makeOperationContext();
+                            session->abortArbitraryTransactionIfExpired(localOpCtx.get());
                         } catch (const DBException& ex) {
                             const Status& status = ex.toStatus();
                             std::string errmsg = str::stream()
                                 << "May have failed to abort expired "
                                    "transaction with session id (lsid) '"
                                 << session->getSessionId() << "'."
                                 << " Caused by: " << status;
🧹 Nitpick comments (1)
src/mongo/db/kill_sessions_local.cpp (1)

76-83: Avoid magic number for stack size in CoroCtx

Define a constant within CoroCtx (or centralize it) instead of repeating 3200*1024.

 struct CoroCtx {
-    boost::context::protected_fixedsize_stack salloc{3200 * 1024};
+    static constexpr size_t kCoroStackSize = 3200 * 1024;
+    boost::context::protected_fixedsize_stack salloc{kCoroStackSize};
     boost::context::continuation source;
     std::function<void()> yieldFunc;
     std::function<void()> resumeFunc;
     std::function<void()> longResumeFunc;
 };
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1bae001 and 4c34046.

📒 Files selected for processing (3)
  • src/mongo/db/kill_sessions_local.cpp (3 hunks)
  • src/mongo/transport/service_state_machine.cpp (1 hunks)
  • src/mongo/transport/service_state_machine.h (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/mongo/transport/service_state_machine.cpp
🧰 Additional context used
🧬 Code graph analysis (1)
src/mongo/db/kill_sessions_local.cpp (3)
src/mongo/db/service_context.cpp (4)
  • client (231-239)
  • client (231-231)
  • getGlobalServiceContext (68-71)
  • getGlobalServiceContext (68-68)
src/mongo/db/client.cpp (4)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
  • setCurrent (169-172)
  • setCurrent (169-169)
src/mongo/transport/service_executor_coroutine.cpp (2)
  • resumeTask (49-54)
  • resumeTask (49-49)
🔇 Additional comments (6)
src/mongo/transport/service_state_machine.h (2)

34-34: LGTM: correct Boost stack header

Including protected_fixedsize_stack aligns with the new allocator-based coro usage.


267-269: Member order is correct (continuation destroyed before allocator)

_salloc is declared before _source, so _source is destroyed first. This fixes the prior destruction-order hazard.

src/mongo/db/kill_sessions_local.cpp (4)

47-48: LGTM: add protected_fixedsize_stack include

Matches allocator-based coroutine changes.


97-97: LGTM: shared ownership for coroutine context

Using shared_ptr for CoroCtx addresses dangling-lifetime risks between scheduler/resumer.


103-106: LGTM: wiring CoroutineFunctors to CoroCtx

Hooking functors to CoroCtx members is correct.


117-121: LGTM: resume functors bound via executor

Assuming the above lifetime fix, this is consistent with the coroutine model.

Please confirm coroutineResumeFunctor/coroutineLongResumeFunctor don’t outlive the enclosing scanSessions callback. If they can, the lifetime fix above is required to avoid UAF.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c34046 and bff5f1d.

📒 Files selected for processing (3)
  • src/mongo/db/kill_sessions_local.cpp (3 hunks)
  • src/mongo/transport/service_state_machine.cpp (1 hunks)
  • src/mongo/transport/service_state_machine.h (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/mongo/db/kill_sessions_local.cpp (3)
src/mongo/db/service_context.cpp (4)
  • client (231-239)
  • client (231-231)
  • getGlobalServiceContext (68-71)
  • getGlobalServiceContext (68-68)
src/mongo/db/client.cpp (4)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
  • setCurrent (169-172)
  • setCurrent (169-169)
src/mongo/transport/service_executor_coroutine.cpp (2)
  • resumeTask (49-54)
  • resumeTask (49-49)
src/mongo/transport/service_state_machine.cpp (1)
src/mongo/db/client.cpp (2)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
🔇 Additional comments (4)
src/mongo/db/kill_sessions_local.cpp (2)

123-139: Don’t reuse the caller’s OperationContext on the executor thread

The scheduled task runs on a different thread/group, yet it still captures and uses opCtx from the caller. OperationContext instances are thread-affine; reusing one across threads violates invariants and can trip debug assertions. Create a fresh OperationContext after calling Client::setCurrent on the destination thread and use that in abortArbitraryTransactionIfExpired.

-            auto task = [&finished, &mux, &cv, coroCtx, opCtx, session, &client] {
+            auto task = [&finished, &mux, &cv, coroCtx, session, &client] {
                 Client::setCurrent(std::move(client));
                 coroCtx->source = boost::context::callcc(
                     std::allocator_arg,
                     coroCtx->salloc,
-                    [&finished, &mux, &cv, coroCtx, opCtx, session, &client](
+                    [&finished, &mux, &cv, coroCtx, session, &client](
                         boost::context::continuation&& sink) {
                         coroCtx->yieldFunc = [&sink, &client]() {
                             log() << "abortArbitraryTransactionIfExpired call yield.";
                             client = Client::releaseCurrent();
                             sink = sink.resume();
                         };

                         std::unique_lock lk(mux);
                         try {
-                            session->abortArbitraryTransactionIfExpired(opCtx);
+                            auto localOpCtx = Client::getCurrent()->makeOperationContext();
+                            session->abortArbitraryTransactionIfExpired(localOpCtx.get());
                         } catch (const DBException& ex) {

112-116: Keep the coroutine context alive during async resumes

resumeTask captures coroCtx->source by reference without retaining coroCtx. Once the scanSessions callback unwinds, the shared_ptr can drop to zero before resumeTask runs, leaving the reference dangling and making the next resume a UAF. Capture the shared_ptr itself so it stays alive for the resume path.

-            std::function<void()> resumeTask = [&source = coroCtx->source, &client] {
+            std::function<void()> resumeTask = [coroCtx, &client] {
                 log() << "abortArbitraryTransactionIfExpired call resume.";
                 Client::setCurrent(std::move(client));
-                source = source.resume();
+                coroCtx->source = coroCtx->source.resume();
             };
src/mongo/transport/service_state_machine.h (2)

35-35: Include for the guarded stack allocator looks correct

Adding <boost/context/protected_fixedsize_stack.hpp> is required for the new allocator.


267-269: Allocator declared ahead of the continuation keeps destruction order safe

Declaring _salloc before _source ensures the continuation is torn down before the allocator, which matches Boost’s expectations.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/mongo/transport/service_state_machine.h (1)

267-269: De-duplicate stack size constant across modules (optional).

Coro stack size (3200*1024) is duplicated here and in CoroCtx. Consider a shared constexpr (e.g., a common header) to prevent drift.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bff5f1d and 610dc41.

📒 Files selected for processing (3)
  • src/mongo/db/kill_sessions_local.cpp (3 hunks)
  • src/mongo/transport/service_state_machine.cpp (1 hunks)
  • src/mongo/transport/service_state_machine.h (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/mongo/transport/service_state_machine.cpp (1)
src/mongo/db/client.cpp (2)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
src/mongo/db/kill_sessions_local.cpp (2)
src/mongo/db/service_context.cpp (4)
  • client (231-239)
  • client (231-231)
  • getGlobalServiceContext (68-71)
  • getGlobalServiceContext (68-68)
src/mongo/db/client.cpp (4)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
  • setCurrent (169-172)
  • setCurrent (169-169)
🔇 Additional comments (8)
src/mongo/db/kill_sessions_local.cpp (5)

47-48: LGTM: switch to protected_fixedsize_stack.

Using boost::context::protected_fixedsize_stack is a solid replacement for the custom stack.


76-83: CoroCtx layout is sound.

Allocator precedes continuation, ensuring continuation is destroyed before the allocator.


123-167: Critical: OperationContext used across threads; create it on the executor thread.

opCtx originates on the caller’s thread but is used inside the scheduled task on a different thread. Construct a fresh OperationContext after Client::setCurrent() on the executor thread and remove opCtx from captures.

Apply:

-            auto task = [&finished, &mux, &cv, coroCtx, opCtx, session, &client] {
+            auto task = [&finished, &mux, &cv, coroCtx, session, &client] {
                 Client::setCurrent(std::move(client));
-                coroCtx->source = boost::context::callcc(
+                coroCtx->source = boost::context::callcc(
                     std::allocator_arg,
                     coroCtx->salloc,
-                    [&finished, &mux, &cv, coroCtx, opCtx, session, &client](
+                    [&finished, &mux, &cv, coroCtx, session, &client](
                         boost::context::continuation&& sink) {
                         coroCtx->yieldFunc = [&sink, &client]() {
                             log() << "abortArbitraryTransactionIfExpired call yield.";
                             client = Client::releaseCurrent();
                             sink = sink.resume();
                         };

                         std::unique_lock lk(mux);
                         try {
-                            session->abortArbitraryTransactionIfExpired(opCtx);
+                            auto localOpCtx = Client::getCurrent()->makeOperationContext();
+                            session->abortArbitraryTransactionIfExpired(localOpCtx.get());
                         } catch (const DBException& ex) {
                             const Status& status = ex.toStatus();
                             std::string errmsg = str::stream()
                                 << "May have failed to abort expired "
                                    "transaction with session id (lsid) '"
                                 << session->getSessionId() << "'."
                                 << " Caused by: " << status;

103-108: LGTM: wiring functors via CoroCtx.

Passing addresses of the CoroCtx std::functions into CoroutineFunctors is consistent with how the client expects them.


112-121: Avoid dangling reference: capture CoroCtx in resumeTask.

resumeTask captures coroCtx->source by reference. If resume runs after the scheduling scope ends, it risks UAF. Capture coroCtx (shared_ptr) by value and update via the pointer.

Apply:

-            std::function<void()> resumeTask = [&source = coroCtx->source, &client] {
+            std::function<void()> resumeTask = [coroCtx, &client] {
                 log() << "abortArbitraryTransactionIfExpired call resume.";
                 Client::setCurrent(std::move(client));
-                source = source.resume();
+                coroCtx->source = coroCtx->source.resume();
             };
src/mongo/transport/service_state_machine.h (2)

34-34: LGTM: include protected_fixedsize_stack.

Header dependency updated appropriately.


267-269: LGTM: allocator before continuation.

Member order ensures continuation is destroyed before the allocator. Matches best practice and prior guidance.

src/mongo/transport/service_state_machine.cpp (1)

548-566: Break potential UAF in _coroYield by using weak_ptr.

Great fix switching to weak_ptr for callcc capture and using _salloc. However, _coroYield captures a raw SSM pointer (ssm.get()), which can dangle if invoked during teardown. Capture the weak_ptr and lock on use.

Apply:

-                            [wssm, &guard](boost::context::continuation&& sink) {
+                            [wssm, &guard](boost::context::continuation&& sink) {
                                 auto ssm = wssm.lock();
                                 if (!ssm) {
                                     return std::move(sink);
                                 }
-                                ssm->_coroYield = [ssm = ssm.get(), &sink]() {
-                                    MONGO_LOG(3) << "call yield";
-                                    ssm->_dbClient = Client::releaseCurrent();
-                                    sink = sink.resume();
-                                };
+                                ssm->_coroYield = [wssm, &sink]() mutable {
+                                    if (auto locked = wssm.lock()) {
+                                        MONGO_LOG(3) << "call yield";
+                                        locked->_dbClient = Client::releaseCurrent();
+                                    }
+                                    sink = sink.resume();
+                                };
                                 ssm->_processMessage(std::move(guard));

                                 return std::move(sink);
                             });

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
src/mongo/db/kill_sessions_local.cpp (2)

46-48: Prefer portable Boost.Context include.

Use <boost/context/continuation.hpp> instead of the fcontext-specific header for portability and clearer API surface. Keep protected_fixedsize_stack include.

-#include <boost/context/continuation_fcontext.hpp>
+#include <boost/context/continuation.hpp>
 #include <boost/context/protected_fixedsize_stack.hpp>

76-83: Optional: move UniqueClient into CoroCtx to avoid cross-thread ref to local.

Storing ServiceContext::UniqueClient in CoroCtx (and using it in yield/resume) avoids capturing a stack-local reference (&client) across async boundaries.

Example shape:

  • Add ServiceContext::UniqueClient client; to CoroCtx.
  • Move-assign client into coroCtx->client after releaseCurrent().
  • Use coroCtx->client within yield/resume lambdas.
src/mongo/transport/service_state_machine.h (2)

34-34: Use stack_traits to validate configured stack size

Good switch to protected_fixedsize_stack. Add stack_traits and static_asserts to catch misconfigurations at compile-time.

 #include <boost/context/continuation.hpp>
 #include <boost/context/continuation_fcontext.hpp>
+#include <boost/context/stack_traits.hpp>
 #include <boost/context/protected_fixedsize_stack.hpp>

And near the kCoroStackSize definition:

     static constexpr size_t kCoroStackSize = 3200 * 1024;
+    static_assert(kCoroStackSize >= boost::context::stack_traits::minimum_size(),
+                  "kCoroStackSize below boost minimum");
+    static_assert(kCoroStackSize <= boost::context::stack_traits::maximum_size(),
+                  "kCoroStackSize above boost maximum");

32-35: Consider removing unused continuation_fcontext include

If nothing in this header needs continuation_fcontext.hpp, drop it to reduce compile surface area.

-#include <boost/context/continuation_fcontext.hpp>
src/mongo/transport/service_state_machine.cpp (1)

733-742: Proactively release coroutine resources during cleanup

To free the stack pages and drop any pending closures immediately on EndSession, clear the continuation and coroutine functors here.

 void ServiceStateMachine::_cleanupSession(ThreadGuard guard) {
     MONGO_LOG(1) << "ServiceStateMachine::_cleanupSession";
     _state.store(State::Ended);

     _inMessage.reset();

     // By ignoring the return value of Client::releaseCurrent() we destroy the session.
     // _dbClient is now nullptr and _dbClientPtr is invalid and should never be accessed.
     Client::releaseCurrent();
+    // Break any remaining continuation/functor references and release the coroutine stack
+    _source = {};
+    _coroYield = {};
+    _coroResume = {};
+    _coroLongResume = {};
+    _coroMigrateThreadGroup = {};
+    _resumeTask = {};
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 610dc41 and 3d2a701.

📒 Files selected for processing (3)
  • src/mongo/db/kill_sessions_local.cpp (3 hunks)
  • src/mongo/transport/service_state_machine.cpp (1 hunks)
  • src/mongo/transport/service_state_machine.h (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/mongo/db/kill_sessions_local.cpp (2)
src/mongo/db/service_context.cpp (4)
  • client (231-239)
  • client (231-231)
  • getGlobalServiceContext (68-71)
  • getGlobalServiceContext (68-68)
src/mongo/db/client.cpp (4)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
  • setCurrent (169-172)
  • setCurrent (169-169)
src/mongo/transport/service_state_machine.cpp (1)
src/mongo/db/client.cpp (2)
  • releaseCurrent (164-167)
  • releaseCurrent (164-164)
🔇 Additional comments (7)
src/mongo/db/kill_sessions_local.cpp (5)

76-83: Good: centralize coroutine state and stack via CoroCtx.

Shared state + protected stack is a solid improvement.


98-98: Good: shared_ptr for CoroCtx lifecycle.

This prevents dangling functor pointers.


103-108: Good: bind functors to stable addresses in CoroCtx.

Avoids pointers to temporaries.


123-139: Critical: OperationContext crosses threads; create it on the executor thread.

opCtx originates on the caller thread; using it inside the scheduled task is unsafe. Construct a fresh OperationContext after Client::setCurrent on the executor thread and remove opCtx from captures.

-            auto task = [&finished, &mux, &cv, coroCtx, opCtx, session, &client] {
+            auto task = [&finished, &mux, &cv, coroCtx, session, &client] {
                 Client::setCurrent(std::move(client));
                 coroCtx->source = boost::context::callcc(
                     std::allocator_arg,
                     coroCtx->salloc,
-                    [&finished, &mux, &cv, coroCtx, opCtx, session, &client](
+                    [&finished, &mux, &cv, coroCtx, session, &client](
                         boost::context::continuation&& sink) {
                         coroCtx->yieldFunc = [&sink, &client]() {
                             log() << "abortArbitraryTransactionIfExpired call yield.";
                             client = Client::releaseCurrent();
                             sink = sink.resume();
                         };
 
                         std::unique_lock lk(mux);
                         try {
-                            session->abortArbitraryTransactionIfExpired(opCtx);
+                            auto localOpCtx = Client::getCurrent()->makeOperationContext();
+                            session->abortArbitraryTransactionIfExpired(localOpCtx.get());
                         } catch (const DBException& ex) {

Also applies to: 128-139


112-121: Avoid UAF and retain cycles: capture weak_ptr to CoroCtx in resumeTask.

Capturing a reference to coroCtx->source risks UAF if resume is invoked after scope ends; capturing a shared_ptr can create a retain cycle via coroCtx->resumeFunc. Use weak_ptr.

-            std::function<void()> resumeTask = [&source = coroCtx->source, &client] {
-                log() << "abortArbitraryTransactionIfExpired call resume.";
-                Client::setCurrent(std::move(client));
-                source = source.resume();
-            };
+            auto weakCoroCtx = std::weak_ptr<CoroCtx>(coroCtx);
+            std::function<void()> resumeTask = [weakCoroCtx, &client] {
+                if (auto ctx = weakCoroCtx.lock()) {
+                    log() << "abortArbitraryTransactionIfExpired call resume.";
+                    Client::setCurrent(std::move(client));
+                    ctx->source = ctx->source.resume();
+                }
+            };
src/mongo/transport/service_state_machine.h (1)

267-269: Allocator/continuation destruction order is now correct

_salloc declared before _source ensures continuation is destroyed before the allocator. This fixes the previous lifetime risk.

src/mongo/transport/service_state_machine.cpp (1)

548-553: Confirm toolchain supports C++17
weak_from_this() requires C++17 or newer. Verify CMAKE_CXX_STANDARD (or equivalent) is set to ≥17 in your build configuration.

@xiexiaoy xiexiaoy requested review from liunyl and lzxddz September 26, 2025 04:26
@xiexiaoy xiexiaoy merged commit f208149 into eloqdata:main Sep 26, 2025
2 checks passed
@xiexiaoy xiexiaoy deleted the coro_crash branch September 26, 2025 05:20
zhangh43 pushed a commit to zhangh43/eloqdoc that referenced this pull request Sep 28, 2025
…loqdata#212)

1. killAllExpiredTransactions: use a shared_ptr to manage coro context.
2. Use boost::context::protected_fixedsize_stack to replace the self-implemented stack.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: *** stack smashing detected ***: terminated [Bug]: Invalid coroutine stack

2 participants