Fix crash in debugger transport shutdown path during process exit#126372
Fix crash in debugger transport shutdown path during process exit#126372
Conversation
…tion() Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/756d23b7-a596-4faf-970b-e9ad20e69bb0 Co-authored-by: mangod9 <61718172+mangod9@users.noreply.github.com>
|
Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag |
There was a problem hiding this comment.
Pull request overview
Fixes an intermittent shutdown crash in CoreCLR’s debugger transport session by defensively handling a null transport channel during process exit, aligning AbortConnection() with existing shutdown patterns in the same component.
Changes:
- Add a null guard in
DbgTransportSession::AbortConnection()before callingm_channel->CloseConnection().
I do not see evidence of the two shutdown paths racing in the crash dump. Also see my comment on the issue: #126339 (comment) The null check added in this PR is going to reduce chance of shutdown crash in this specific spot, but I doubt it is actually fixing the problem. |
|
Can the actual problem be missing synchronization between |
|
A broader fix could be adding synchronization in dbgtransportsession.cpp, where its recreating channels and setting m_channel to NULL before recreating/reassigning. But this smaller fix might be reasonable compromise for now More info here: #126339 (comment). |
|
The delayed thread visibility in copilot's 2nd explanation does sound plausible if the process was very short-lived or OS thread-scheduling delayed the transport worker to start close to the same time the main thread was shutting down. I see some odd behavior where the channel is initially created on the main thread inside of DbgTransportSession::Init() but then it starts the worker thread and the worker thread immediately disconnects that channel, NULLs out the field, and creates a new one. That would open a brief window where the field transitions non-NULL -> NULL -> non-NULL without any debugger interaction. It looks like there are multiple issues lurking here:
I'm fine with the NULL check but agreed with @jkotas, I think its a partial mitigation at best. |
|
Can we store the names of pipes in a static variable and unlink them directly without going through the channel indirection that's a moving target? It is fine if the channel continues running for a bit even after the pipes are unlinked. It just means that external debugger cannot connect to the process anymore. |
|
… issue Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/eb0c5374-d72d-4e81-8f24-49355449d9e9 Co-authored-by: jkotas <6668460+jkotas@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
…ck instead of Volatile<> wrapper Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/ffa47985-f6f6-4347-b813-02bb99cd0076 Co-authored-by: jkotas <6668460+jkotas@users.noreply.github.com>
|
Note This review was generated by Copilot. 🤖 Copilot Code Review — PR #126372Holistic AssessmentMotivation: The problem is real. Approach: The fix is the right one. Summary: ✅ LGTM. The code is correct, the memory ordering is sound, the signal-safety constraints are satisfied, and there is no behavioral regression from the old Detailed Findings✅ Thread Safety — Acquire-release publish pattern is correctThe
The plain read in the guard ( ✅ Signal Safety —
|
@copilot Build break on macOS |
… twowaypipe.cpp Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/cc7aac95-c129-4f68-9d5d-43fc66d15948 Co-authored-by: jkotas <6668460+jkotas@users.noreply.github.com>
... Fixed in commit 0e95dab. The root cause: Moved the definition into |
noahfalk
left a comment
There was a problem hiding this comment.
This looks more complete than the previous NULL check - LGTM.
From what I can see remaining issues to be addressed in a future PR are:
- transport worker thread destroys initial connection, potentially disconnecting the debugger depending on timing
- transport worker might re-create the FIFOs after the cleanup call got rid of them. The race requires similar timing to how we speculated the m_channel field was being set to NULL.
Yes, that logic does not make sense to me.
I think this is "by design" for abnormal terminations that this codepath is for. The cleanup is best effort in that case and it needs to be done in a signal safe way that makes it impossible to do any synchronization reliably. |
…tnet#126372) `DbgTransportSession::AbortConnection()` accessed `m_channel->CloseConnection()` without guards. The `m_channel` pointer is a moving target — the transport worker thread can set it to NULL and replace it as it creates/recreates connections — leading to NULL dereferences or concurrent access races during process shutdown. # Description Deleted `DbgTransportSession::AbortConnection()` entirely. Instead, a plain static callback pointer `g_pfnAbortTransportCallback` is defined in `unix/twowaypipe.cpp` and set inside `TwoWayPipe::CreateServer()` the first time the server pipe names are initialized. `Debugger::CleanupTransportSocket()` calls this callback directly if set, with no access to `g_pDbgTransport` or `m_channel`. Fixes dotnet#126339 --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: mangod9 <61718172+mangod9@users.noreply.github.com> Co-authored-by: jkotas <6668460+jkotas@users.noreply.github.com> Co-authored-by: Jan Kotas <jkotas@microsoft.com>
DbgTransportSession::AbortConnection()accessedm_channel->CloseConnection()without guards. Them_channelpointer is a moving target — the transport worker thread can set it to NULL and replace it as it creates/recreates connections — leading to NULL dereferences or concurrent access races during process shutdown.Description
Deleted
DbgTransportSession::AbortConnection()entirely. Instead, a plain static callback pointerg_pfnAbortTransportCallbackis defined inunix/twowaypipe.cppand set insideTwoWayPipe::CreateServer()the first time the server pipe names are initialized.Debugger::CleanupTransportSocket()calls this callback directly if set, with no access tog_pDbgTransportorm_channel.Concretely:
unix/twowaypipe.cppdefinesg_pfnAbortTransportCallback(plainvoid (*)(void), initialized tonullptr) and maintains static pipe name buffers (s_serverInPipeName,s_serverOutPipeName) populated inTwoWayPipe::CreateServer(). A file-staticAbortPipeServerImpl()callsunlink()on these buffers (signal-safe; no ASSERT/TRACE).TwoWayPipe::CreateServer()setsg_pfnAbortTransportCallbackto point atAbortPipeServerImplonly on the first call, usingVolatileStoreto ensure the write is properly fenced. The guard check usesVolatileLoadto read the current value.twowaypipe.cpp(part of thedebug-palOBJECT library) avoids a macOS linker error that occurred when the symbol was previously defined indbgtransportsession.cppunder#ifndef RIGHT_SIDE_COMPILE:mscordbibuilds withRIGHT_SIDE_COMPILEdefined, so the symbol was never emitted in that translation unit, causing an undefined symbol error at link time on macOS's strict linker.g_pfnAbortTransportCallbackis declared withexternindbgtransportsession.h(under#ifndef RIGHT_SIDE_COMPILE / #ifdef HOST_UNIX) so thatdebugger.cppcan access it without requiringtwowaypipe.hto be included.volatile.his included intwowaypipe.cpp(available via../../incin the HOST_UNIXdebug-pal/CMakeLists.txtinclude path) to provideVolatileStore/VolatileLoad.Debugger::CleanupTransportSocket()reads the callback once into a local viaVolatileLoad, checks for null, and calls it — avoiding redundant global accesses. The callback is NULL until the firstTwoWayPipe::CreateServer()call, so a no-op is correct if the transport was never initialized.This keeps the pipe-name management concern entirely within the
TwoWayPipelayer and eliminates the shutdown path's dependency onDbgTransportSession. It is fine if the channel continues running for a bit after the pipes are unlinked — it simply means an external debugger can no longer connect to the process.Customer Impact
Intermittent SIGSEGV (exit code 139 instead of expected 42) in any process that exits without a debugger attached, manifesting as spurious test failures in CI under jitminopts/stress configurations.
Regression
No — this is a latent defect in the debugger transport shutdown path, exposed more frequently under jitminopts due to non-deterministic thread scheduling during shutdown.
Testing
The crash is non-deterministic and stress-dependent. The fix removes the unsafe
m_channelaccess entirely — the shutdown path no longer touchesDbgTransportSessionat all.unlink()is called on pre-computed names via a signal-safe static callback; memory ordering is enforced with explicitVolatileStore/VolatileLoadat the two sites where the barrier is needed. No new tests added; the existingDefaultPrincipalPolicyTestshould stop producing exit-code-139 failures.Risk
Low. The change is limited to the non-Windows left-side shutdown path.
CleanupTransportSocket()now loads the callback viaVolatileLoadinto a local and calls it if non-null — the sameunlink()operationCloseConnection()ultimately performed. No functional behavior change when a debugger is connected. The macOS build break fix (moving the definition totwowaypipe.cpp) is purely a symbol ownership correction with no behavioral change.Package authoring no longer needed in .NET 9
IMPORTANT: Starting with .NET 9, you no longer need to edit a NuGet package's csproj to enable building and bump the version.
Keep in mind that we still need package authoring in .NET 8 and older versions.
Fixes #126339