
Fix crash in debugger transport shutdown path during process exit #126372

Merged
jkotas merged 14 commits into main from copilot/fix-default-principal-policy-test
Apr 2, 2026

Conversation

Contributor

Copilot AI commented Mar 31, 2026

DbgTransportSession::AbortConnection() accessed m_channel->CloseConnection() without guards. The m_channel pointer is a moving target — the transport worker thread can set it to NULL and replace it as it creates/recreates connections — leading to NULL dereferences or concurrent access races during process shutdown.

Description

Deleted DbgTransportSession::AbortConnection() entirely. Instead, a plain static callback pointer g_pfnAbortTransportCallback is defined in unix/twowaypipe.cpp and set inside TwoWayPipe::CreateServer() the first time the server pipe names are initialized. Debugger::CleanupTransportSocket() calls this callback directly if set, with no access to g_pDbgTransport or m_channel.

Concretely:

  • unix/twowaypipe.cpp defines g_pfnAbortTransportCallback (plain void (*)(void), initialized to nullptr) and maintains static pipe name buffers (s_serverInPipeName, s_serverOutPipeName) populated in TwoWayPipe::CreateServer(). A file-static AbortPipeServerImpl() calls unlink() on these buffers (signal-safe; no ASSERT/TRACE). TwoWayPipe::CreateServer() sets g_pfnAbortTransportCallback to point at AbortPipeServerImpl only on the first call, using VolatileStore to ensure the write is properly fenced. The guard check uses VolatileLoad to read the current value.
  • Defining the symbol in twowaypipe.cpp (part of the debug-pal OBJECT library) avoids a macOS linker error that occurred when the symbol was previously defined in dbgtransportsession.cpp under #ifndef RIGHT_SIDE_COMPILE: mscordbi builds with RIGHT_SIDE_COMPILE defined, so the symbol was never emitted in that translation unit, causing an undefined symbol error at link time on macOS's strict linker.
  • g_pfnAbortTransportCallback is declared with extern in dbgtransportsession.h (under #ifndef RIGHT_SIDE_COMPILE / #ifdef HOST_UNIX) so that debugger.cpp can access it without requiring twowaypipe.h to be included. volatile.h is included in twowaypipe.cpp (available via ../../inc in the HOST_UNIX debug-pal/CMakeLists.txt include path) to provide VolatileStore/VolatileLoad.
  • Debugger::CleanupTransportSocket() reads the callback once into a local via VolatileLoad, checks for null, and calls it — avoiding redundant global accesses. The callback is NULL until the first TwoWayPipe::CreateServer() call, so a no-op is correct if the transport was never initialized.
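
The publish-then-call pattern described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: it assumes `std::atomic` with release/acquire ordering as a stand-in for `VolatileStore`/`VolatileLoad`, and `CreateServerSketch`, `CleanupTransportSocketSketch`, `DemoAbortCallback` and the `/tmp/abort-sketch-*` file names are hypothetical. Ordinary files stand in for the FIFOs.

```cpp
#include <atomic>
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

using AbortCallback = void (*)();

// Static name buffers, written once and only read afterwards.
static char s_serverInPipeName[256];
static char s_serverOutPipeName[256];

// Assumption: std::atomic plays the role of the plain pointer accessed
// through VolatileStore/VolatileLoad in the description.
static std::atomic<AbortCallback> g_pfnAbortTransportCallback{nullptr};

// Restricted to async-signal-safe work: unlink() on pre-computed names.
static void AbortPipeServerImpl()
{
    unlink(s_serverInPipeName);
    unlink(s_serverOutPipeName);
}

// Hypothetical stand-in for the relevant part of TwoWayPipe::CreateServer().
void CreateServerSketch(const char* inName, const char* outName)
{
    if (g_pfnAbortTransportCallback.load(std::memory_order_acquire) == nullptr)
    {
        // Fill the buffers first, then publish the pointer with release
        // ordering so a reader that sees the pointer also sees the names.
        std::snprintf(s_serverInPipeName, sizeof(s_serverInPipeName), "%s", inName);
        std::snprintf(s_serverOutPipeName, sizeof(s_serverOutPipeName), "%s", outName);
        g_pfnAbortTransportCallback.store(&AbortPipeServerImpl, std::memory_order_release);
    }
}

// Hypothetical stand-in for Debugger::CleanupTransportSocket():
// load once into a local, null-check, call.
bool CleanupTransportSocketSketch()
{
    AbortCallback callback = g_pfnAbortTransportCallback.load(std::memory_order_acquire);
    if (callback == nullptr)
        return false;   // transport never initialized: nothing to do
    callback();
    return true;
}

// Exercises the sketch once, using two throwaway files in place of the FIFOs.
bool DemoAbortCallback()
{
    if (CleanupTransportSocketSketch())   // nothing published yet: must be a no-op
        return false;

    char inName[64], outName[64];
    std::snprintf(inName, sizeof(inName), "/tmp/abort-sketch-%d-in", (int)getpid());
    std::snprintf(outName, sizeof(outName), "/tmp/abort-sketch-%d-out", (int)getpid());
    int fdIn = open(inName, O_CREAT | O_WRONLY, 0600);
    int fdOut = open(outName, O_CREAT | O_WRONLY, 0600);
    assert(fdIn >= 0 && fdOut >= 0);
    close(fdIn);
    close(fdOut);

    CreateServerSketch(inName, outName);
    bool called = CleanupTransportSocketSketch();
    return called && access(inName, F_OK) != 0 && access(outName, F_OK) != 0;
}
```

`DemoAbortCallback()` is meant to be called once per process, since the callback stays published after the first `CreateServerSketch()` call.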

This keeps the pipe-name management concern entirely within the TwoWayPipe layer and eliminates the shutdown path's dependency on DbgTransportSession. It is fine if the channel continues running for a bit after the pipes are unlinked — it simply means an external debugger can no longer connect to the process.

Customer Impact

Intermittent SIGSEGV (exit code 139 instead of expected 42) in any process that exits without a debugger attached, manifesting as spurious test failures in CI under jitminopts/stress configurations.

Regression

No — this is a latent defect in the debugger transport shutdown path, exposed more frequently under jitminopts due to non-deterministic thread scheduling during shutdown.

Testing

The crash is non-deterministic and stress-dependent. The fix removes the unsafe m_channel access entirely — the shutdown path no longer touches DbgTransportSession at all. unlink() is called on pre-computed names via a signal-safe static callback; memory ordering is enforced with explicit VolatileStore/VolatileLoad at the two sites where the barrier is needed. No new tests added; the existing DefaultPrincipalPolicyTest should stop producing exit-code-139 failures.

Risk

Low. The change is limited to the non-Windows left-side shutdown path. CleanupTransportSocket() now loads the callback via VolatileLoad into a local and calls it if non-null — the same unlink() operation CloseConnection() ultimately performed. No functional behavior change when a debugger is connected. The macOS build break fix (moving the definition to twowaypipe.cpp) is purely a symbol ownership correction with no behavioral change.

Fixes #126339

Copilot AI changed the title from "[WIP] Fix test failure in DefaultPrincipalPolicyTest" to "Fix NULL dereference in DbgTransportSession::AbortConnection() during process shutdown" Mar 31, 2026
Copilot AI requested a review from mangod9 March 31, 2026 19:32
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
See info in area-owners.md if you want to be subscribed.

@mangod9 mangod9 marked this pull request as ready for review March 31, 2026 20:49
@mangod9 mangod9 requested review from Copilot and hoyosjs March 31, 2026 20:49
Contributor

Copilot AI left a comment

Pull request overview

Fixes an intermittent shutdown crash in CoreCLR’s debugger transport session by defensively handling a null transport channel during process exit, aligning AbortConnection() with existing shutdown patterns in the same component.

Changes:

  • Add a null guard in DbgTransportSession::AbortConnection() before calling m_channel->CloseConnection().

@jkotas
Member

jkotas commented Mar 31, 2026

Two shutdown paths race during Environment.Exit():
Path A (normal): EEShutDown → StopDebugger clears g_pDbgTransport = NULL before PAL exit
Path B (PAL exit): ExitProcess → AbortConnection() — safe only if Path A completed first

I do not see evidence of the two shutdown paths racing in the crash dump.

Also see my comment on the issue: #126339 (comment)

The null check added in this PR is going to reduce chance of shutdown crash in this specific spot, but I doubt it is actually fixing the problem.

@jkotas jkotas self-requested a review March 31, 2026 21:56
@jkotas
Member

jkotas commented Mar 31, 2026

Can the actual problem be missing synchronization between TransportWorker() and shutdown?

@mangod9
Member

mangod9 commented Apr 1, 2026

A broader fix could be adding synchronization in dbgtransportsession.cpp, where it's recreating channels and setting m_channel to NULL before recreating/reassigning. But this smaller fix might be a reasonable compromise for now.

More info here: #126339 (comment).

@mangod9 mangod9 enabled auto-merge (squash) April 1, 2026 02:52
@noahfalk
Member

noahfalk commented Apr 1, 2026

The delayed thread visibility in copilot's 2nd explanation does sound plausible if the process was very short-lived or OS thread-scheduling delayed the transport worker to start close to the same time the main thread was shutting down. I see some odd behavior where the channel is initially created on the main thread inside of DbgTransportSession::Init() but then it starts the worker thread and the worker thread immediately disconnects that channel, NULLs out the field, and creates a new one. That would open a brief window where the field transitions non-NULL -> NULL -> non-NULL without any debugger interaction.

It looks like there are multiple issues lurking here:

  1. Disconnecting the initial channel might ignore a valid debugger connection
  2. The potential for the NULL deref in AbortConnection
  3. The potential that AbortConnection runs CloseConnection() concurrently with the worker when it's not clear the method supports concurrency
  4. AbortConnection doesn't change the worker's state to SS_Closed so the worker will race process exit to re-create the connection.

I'm fine with the NULL check but agreed with @jkotas, I think it's a partial mitigation at best.

@jkotas
Member

jkotas commented Apr 1, 2026

AbortConnection ends up just unlinking the pipe names:

    unlink(m_inPipeName);
    unlink(m_outPipeName);

Can we store the names of pipes in a static variable and unlink them directly without going through the channel indirection that's a moving target? It is fine if the channel continues running for a bit even after the pipes are unlinked. It just means that external debugger cannot connect to the process anymore.
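
To illustrate why it is fine for the channel to keep running after the names are unlinked, here is a small sketch, assuming a POSIX system and a hypothetical FIFO path under `/tmp`. It shows that already-open descriptors keep working once the name is gone; only new `open()` calls by name fail. `FifoSurvivesUnlink` is an illustrative helper, not a runtime function.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Returns true if data still flows through a FIFO after its name is unlinked.
bool FifoSurvivesUnlink()
{
    char name[64];
    std::snprintf(name, sizeof(name), "/tmp/fifo-sketch-%d", (int)getpid());
    unlink(name);   // clear any leftover from a previous run
    if (mkfifo(name, 0600) != 0)
        return false;

    // Open the read end non-blocking first so that opening the write end
    // does not block waiting for a reader.
    int readFd = open(name, O_RDONLY | O_NONBLOCK);
    int writeFd = open(name, O_WRONLY);
    assert(readFd >= 0 && writeFd >= 0);

    // Remove the name: a new client can no longer find the FIFO...
    bool unlinked = (unlink(name) == 0);
    bool nameGone = (access(name, F_OK) != 0);

    // ...but the descriptors that were already open keep working.
    const char msg[] = "ping";
    char buf[sizeof(msg)] = {};
    bool wrote = write(writeFd, msg, sizeof(msg)) == (ssize_t)sizeof(msg);
    bool readBack = read(readFd, buf, sizeof(buf)) == (ssize_t)sizeof(msg)
                    && std::memcmp(buf, msg, sizeof(msg)) == 0;

    close(writeFd);
    close(readFd);
    return unlinked && nameGone && wrote && readBack;
}
```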

@jkotas jkotas disabled auto-merge April 1, 2026 04:42
@jkotas
Member

jkotas commented Apr 1, 2026

Can we store the names of pipes in a static variable and unlink them directly without going through the channel indirection that's a moving target? It is fine if the channel continues running for a bit even after the pipes are unlinked. It just means that external debugger cannot connect to the process anymore.

@copilot

Comment thread src/coreclr/debug/debug-pal/unix/twowaypipe.cpp Outdated
Copilot AI requested a review from jkotas April 2, 2026 03:33
Member

@jkotas jkotas left a comment

LGTM. @mangod9 @noahfalk PTAL

Member

@mangod9 mangod9 left a comment

@noahfalk are there any specific diagnostics tests which need to be run for this change?

Comment thread src/coreclr/debug/debug-pal/unix/twowaypipe.cpp Outdated
…ck instead of Volatile<> wrapper

Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/ffa47985-f6f6-4347-b813-02bb99cd0076

Co-authored-by: jkotas <6668460+jkotas@users.noreply.github.com>
Copilot AI requested a review from jkotas April 2, 2026 04:02
Comment thread src/coreclr/debug/debug-pal/unix/twowaypipe.cpp Outdated
@github-actions
Contributor

github-actions Bot commented Apr 2, 2026

Note

This review was generated by Copilot.

🤖 Copilot Code Review — PR #126372

Holistic Assessment

Motivation: The problem is real. DbgTransportSession::AbortConnection() accessed m_channel->CloseConnection() without synchronization, and the transport worker thread NULLs and replaces m_channel in a loop during reconnection. A NULL-deref crash during process exit/signal handling (exit code 139 instead of expected 42) is a genuine latent defect exposed under jitminopts stress. The linked issue #126339 and analysis from @noahfalk confirming the non-NULL → NULL → non-NULL transition window support the diagnosis.

Approach: The fix is the right one. @jkotas directly suggested the design: store pipe names in a static variable and unlink them directly without going through the channel indirection. The result is a clean separation — the shutdown path no longer touches DbgTransportSession at all. The release-acquire publish pattern via VolatileStore/VolatileLoad is correctly applied. The PR went through several rounds of refinement (visible in 13 commits), each addressing specific feedback from @jkotas.

Summary: ✅ LGTM. The code is correct, the memory ordering is sound, the signal-safety constraints are satisfied, and there is no behavioral regression from the old CloseConnection() path. All concerns investigated below are verified as safe. Two models (Claude Opus 4.6, Claude Sonnet 4.6) independently arrived at the same conclusion.


Detailed Findings

✅ Thread Safety — Acquire-release publish pattern is correct

The memcpy → VolatileStore → (other thread) VolatileLoad → call pattern is textbook acquire-release:

  • Producer (twowaypipe.cpp:51-53): Two memcpy calls fill the static buffers, then VolatileStore publishes the callback pointer. On ARM64, VolatileStore uses __atomic_store(..., __ATOMIC_RELEASE), ensuring the memcpy data is visible before the pointer. On x86 (TSO), the compiler barrier is sufficient.
  • Consumer (debugger.cpp:1701): VolatileLoad reads the callback into a local, providing acquire semantics. After observing non-NULL, the caller is guaranteed to see the completed s_serverInPipeName/s_serverOutPipeName buffers.

The plain read in the guard (if (g_pfnAbortTransportCallback == nullptr) at line 49) is safe because CreateServer() is called exclusively from the single TransportWorker thread — within a single thread, all operations are sequentially ordered.

✅ Signal Safety — unlink() is async-signal-safe

unlink() is on the POSIX.1-2017 list of async-signal-safe functions (System Interfaces, section 2.4.3, Signal Actions). The AbortPipeServerImpl() function correctly avoids ASSERT/TRACE macros as noted in its comment. This function is called through PAL_SetShutdownCallback → PROCNotifyProcessShutdown → EESocketCleanupHelper → CleanupTransportSocket, which can be invoked from signal handlers.
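
As a rough illustration of cleanup that stays within signal-safety limits, the sketch below installs a handler whose only work is `unlink()` on a path computed ahead of time. It is a simplified stand-in, not the PAL shutdown callback chain: `OnSignal`, `CleanupFromSignalHandler`, the choice of `SIGUSR1` and the `/tmp/signal-sketch-*` path are all hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

// Filled in before the handler is installed; the handler only reads it.
static char s_pathToRemove[64];

// Handler restricted to async-signal-safe work: a single unlink() call,
// with no allocation, logging, or assertions.
static void OnSignal(int)
{
    unlink(s_pathToRemove);
}

// Creates a throwaway file, raises a signal, and reports whether the
// handler removed the file.
bool CleanupFromSignalHandler()
{
    std::snprintf(s_pathToRemove, sizeof(s_pathToRemove),
                  "/tmp/signal-sketch-%d", (int)getpid());
    int fd = open(s_pathToRemove, O_CREAT | O_WRONLY, 0600);
    assert(fd >= 0);
    close(fd);

    struct sigaction action = {};
    action.sa_handler = OnSignal;
    sigemptyset(&action.sa_mask);
    if (sigaction(SIGUSR1, &action, nullptr) != 0)
        return false;

    // raise() does not return until the handler has run.
    raise(SIGUSR1);
    return access(s_pathToRemove, F_OK) != 0;
}
```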

✅ No Behavioral Regression — Old path only called unlink() too

Verified the full call chain: old AbortConnection() → m_channel->CloseConnection() → m_pipe.Disconnect() → unlink(m_inPipeName); unlink(m_outPipeName). The old Disconnect() did not close file descriptors (m_inboundPipe/m_outboundPipe) — it only unlinked the pipe names from the filesystem. The new code does exactly the same operation on identical path strings stored in static buffers. Open FDs are reclaimed by the kernel on process exit regardless.

✅ Concurrent unlink() Calls Are Benign

Both AbortPipeServerImpl() and TwoWayPipe::Disconnect() may call unlink() on the same path strings concurrently. This is safe because:

  1. unlink() is atomic with respect to the filesystem namespace
  2. A second unlink() on an already-removed path returns -1/ENOENT — both callers ignore the return value
  3. The two functions operate on independent memory (s_server*PipeName vs m_*PipeName) containing the same string content
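
The second point can be checked with a short sketch. It runs the two `unlink()` calls sequentially rather than concurrently, which is the outcome the losing caller of a race would observe. `ErrnoOfSecondUnlink` and the `/tmp/unlink-twice-*` path are hypothetical names for illustration.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Returns the errno reported by a second unlink() of the same path,
// or 0 if the second call unexpectedly succeeded.
int ErrnoOfSecondUnlink()
{
    char path[64];
    std::snprintf(path, sizeof(path), "/tmp/unlink-twice-%d", (int)getpid());
    int fd = open(path, O_CREAT | O_WRONLY, 0600);
    assert(fd >= 0);
    close(fd);

    // First caller removes the name.
    int first = unlink(path);
    assert(first == 0);
    (void)first;

    // Second caller finds nothing to remove and gets -1 with errno set.
    errno = 0;
    int second = unlink(path);
    return second == -1 ? errno : 0;
}
```

A caller that ignores the return value, as both code paths here do, is unaffected by which of the two calls wins.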

✅ "First Call Only" Guard Is Correct

The pipe names are derived from the process PID via PAL_GetTransportPipeName(), so they are identical across reconnections within the same process. Storing them only on the first CreateServer() call is correct — subsequent calls produce the same names.

Generated by Code Review for issue #126372

@jkotas
Member

jkotas commented Apr 2, 2026

  Undefined symbols for architecture x86_64:
    "_g_pfnAbortTransportCallback", referenced from:
        TwoWayPipe::CreateServer(ProcessDescriptor const&) in twowaypipe.cpp.

@copilot Build break on macOS

Contributor Author

Copilot AI commented Apr 2, 2026

  Undefined symbols for architecture x86_64:
    "_g_pfnAbortTransportCallback", referenced from:

...

Fixed in commit 0e95dab. The root cause: g_pfnAbortTransportCallback was defined in dbgtransportsession.cpp under #ifndef RIGHT_SIDE_COMPILE, but mscordbi builds with RIGHT_SIDE_COMPILE defined so the symbol was never emitted — causing the macOS linker to reject the undefined reference from twowaypipe.cpp.

Moved the definition into twowaypipe.cpp itself (the natural owner, since it's the sole writer). Removed the forward extern declaration from twowaypipe.cpp. The extern declaration in dbgtransportsession.h remains so debugger.cpp can still read the callback.

Copilot AI requested a review from jkotas April 2, 2026 08:00
Member

@noahfalk noahfalk left a comment

This looks more complete than the previous NULL check - LGTM.

From what I can see remaining issues to be addressed in a future PR are:

  • transport worker thread destroys initial connection, potentially disconnecting the debugger depending on timing
  • transport worker might re-create the FIFOs after the cleanup call got rid of them. The race requires similar timing to how we speculated the m_channel field was being set to NULL.

@jkotas
Copy link
Copy Markdown
Member

jkotas commented Apr 2, 2026

transport worker thread destroys initial connection, potentially disconnecting the debugger depending on timing

Yes, that logic does not make sense to me.

transport worker might re-create the FIFOs after the cleanup call got rid of them

I think this is "by design" for abnormal terminations that this codepath is for. The cleanup is best effort in that case and it needs to be done in a signal safe way that makes it impossible to do any synchronization reliably.

@jkotas jkotas merged commit 7c2620d into main Apr 2, 2026
110 of 112 checks passed
@jkotas jkotas deleted the copilot/fix-default-principal-policy-test branch April 2, 2026 21:11
radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Apr 9, 2026
…tnet#126372)

`DbgTransportSession::AbortConnection()` accessed
`m_channel->CloseConnection()` without guards. The `m_channel` pointer
is a moving target — the transport worker thread can set it to NULL and
replace it as it creates/recreates connections — leading to NULL
dereferences or concurrent access races during process shutdown.

# Description

Deleted `DbgTransportSession::AbortConnection()` entirely. Instead, a
plain static callback pointer `g_pfnAbortTransportCallback` is defined
in `unix/twowaypipe.cpp` and set inside `TwoWayPipe::CreateServer()` the
first time the server pipe names are initialized.
`Debugger::CleanupTransportSocket()` calls this callback directly if
set, with no access to `g_pDbgTransport` or `m_channel`.

Fixes dotnet#126339

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: mangod9 <61718172+mangod9@users.noreply.github.com>
Co-authored-by: jkotas <6668460+jkotas@users.noreply.github.com>
Co-authored-by: Jan Kotas <jkotas@microsoft.com>

Development

Successfully merging this pull request may close these issues.

Test failure: System.Threading.Threads.Tests.ThreadTests.DefaultPrincipalPolicyTest

5 participants