Skip to content

Refactor: route DebuggerController operations through a per-controller worker queue #1088

@xusheng6

Description

@xusheng6

Motivation

The 17 public operations on DebuggerController (Launch, Attach, Connect, Go, GoReverse, Step{Into,Over,Return}{,Reverse}, RunTo{,Reverse}, Restart, Detach, Quit, Pause) each spawn a fresh std::thread and detach it. Every site captures this raw (or via DbgRef after #1086 lands), then calls back into the controller. This has caused a recurring family of crashes — the controller can be destroyed while one of these detached threads is still running, leading to use-after-free on its members. Several recent Sentry reports trace to exactly this pattern:

  • BINARYNINJA-3X — RspConnector::SendRaw via CorelliumAdapter::BreakInto (26 events)
  • BINARYNINJA-7M — DbgEngTTDAdapter::ResetEngineLoop
  • BINARYNINJA-60 / BINARYNINJA-5Z — DbgEngAdapter::ReadMemory
  • BINARYNINJA-1A — DbgEngAdapter::Attach worker thread
  • Crash below DebuggerController::DetachAndWait #1083DetachAndWait crash

The per-call DbgRef capture (#1086) is a stopgap that keeps the controller alive for the duration of the detached call. The structural fix is to stop spawning per-op threads in the first place.

Approach

Replace the per-operation detach pattern with:

  1. One persistent worker thread per controller, owned by the controller and joined in the destructor (after draining).
  2. A FIFO work queue that every public operation submits to. The queue serializes ops naturally — no concurrent adapter calls, no mutex dance to prevent them.
  3. Submit() / SubmitAndWait() templates that handle queueing and re-entrant calls (when the worker invokes another operation on itself, run inline instead of deadlocking).
  4. Pause() and PauseAndWait() bypass the queue and call m_adapter->BreakInto() directly on the caller's thread — the worker is blocked inside ExecuteAdapterAndWait for the in-flight op and needs an out-of-band signal to unblock.
  5. Timeout parameter on every XxxAndWait, default std::chrono::milliseconds::max() so existing callers don't change behavior. Closes the long-standing "sync ops can hang forever" complaint without an API break.

The public API surface is unchanged — every existing Launch() / LaunchAndWait() call site keeps working. Internally:

  • The old XxxAndWait() body (lock + Internal + unlock + NotifyStopped) is renamed XxxAndWaitOnWorker() (private) and invoked by the worker.
  • New public XxxAndWait(timeout) is a one-liner around SubmitAndWait.

Implementation

See #1087 (open).

Out of scope (follow-ups)

  • Adapter-internal threads (DbgEngAdapter::Attach's spawned thread, EngineLoop, LLDB EventListener, RSP send/receive). These have their own lifetime concerns — adapter destructors need to act as join barriers. Separate from controller-side scheduling.
  • m_targetControlMutex cleanup. Redundant once the queue serializes work, but left in place to minimize behavioral risk.
  • BinaryDataNotification / settings notification unregister in Destroy (BINARYNINJA-75, -2H — the RebaseToAddress family).
  • m_userRequestedBreak should become std::atomic<bool> now that Pause writes it from a non-worker thread.
  • Simplification of the adapter-vs-controller stop event plumbing (separate issue).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions