Description
Environment
- .NET Runtime Version: 9.0.7
- Operating System: Linux Debian 12
- Architecture: x64
- GC Mode: Server GC
- Environment Variable: DOTNET_GCName=libclrgc.so
Problem Description
We are running a .NET 9 server application in our production environment. The server application intermittently experiences a total freeze or deadlock, becoming completely unresponsive.
Observed Symptoms
When the freeze occurs, we observe the following:
- No Response: All HTTP API endpoints stop responding.
- No Logs: The application stops writing any new logs.
- Delayed Timers: We have System.Threading.Timer instances that are observed to trigger significantly later than their scheduled time, indicating the entire process is stalled.
- Correlation with GC: The freeze often correlates with a high-activity GC period (e.g., GC pause times reported as > 2 seconds just before the incident).
- Process State: During the freeze, htop or top shows the process's CPU usage drops to 0.
- Thread Pool: The thread pool queue length drops to near 0, even as new requests should be queuing.
- Diagnostics Fail: Attempting to use dotnet-stack report on the hung process also hangs indefinitely and provides no output.
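Since dotnet-stack hangs, one way to cross-check the "CPU drops to 0" observation without attaching a debugger is to read per-thread state from procfs. The following is a minimal sketch, assuming a Linux /proc layout where each thread exposes Name: and State: lines in /proc/<pid>/task/<tid>/status. The function name thread_states and the proc_root parameter are hypothetical helpers for illustration, not part of any .NET tooling.

```python
import os
from collections import Counter


def thread_states(pid, proc_root="/proc"):
    """Count threads of a process by (thread name, scheduler state).

    Reads /proc/<pid>/task/<tid>/status for every thread. proc_root is a
    hypothetical parameter so the function can be pointed at a copy of
    the directory tree instead of the live /proc.
    """
    counts = Counter()
    task_dir = os.path.join(proc_root, str(pid), "task")
    for tid in os.listdir(task_dir):
        name, state = "?", "?"
        try:
            with open(os.path.join(task_dir, tid, "status")) as f:
                for line in f:
                    if line.startswith("Name:"):
                        name = line.split(":", 1)[1].strip()
                    elif line.startswith("State:"):
                        state = line.split(":", 1)[1].strip()
        except OSError:
            # The thread may have exited between listdir() and open().
            continue
        counts[(name, state)] += 1
    return counts
```

Calling thread_states(pid) on the hung process and printing the result with most_common() would show whether the threads are sleeping or in uninterruptible wait, which may help narrow down where the stall sits.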
GDB Stack Traces
We successfully attached gdb to the hung process. We found that the vast majority of .NET TP Worker threads are idle, but a few critical threads seem to be involved in a deadlock.
1. Suspicious Stacks (Potential Deadlock)
Two threads, including the crucial .NET SynchManag thread, are stuck in __GI_cfsetispeed. This appears to be the core of the deadlock.
Thread 2 (Thread 0x7fb6b2fb56c0 (LWP 7) ".NET SynchManag"):
#0 0x00007fb6b38c121f in __GI_cfsetispeed (termios_p=0x7fb6b2fb4d78, speed=1) at ../sysdeps/unix/sysv/linux/speed.c:96
#1 0x00007fb6b2fb4db0 in ?? ()
#2 0x00007fb6b2fb4d78 in ?? ()
#3 0x0000000000000001 in ?? ()
#4 0xffffffff00000000 in ?? ()
#5 0x000055a8504b0f58 in ?? ()
#6 0x00007fb6b3628fa0 in ?? () from /usr/share/dotnet/shared/Microsoft.NETCore.App/9.0.7/libcoreclr.so
#7 0x00007fb6b3628603 in ?? () from /usr/share/dotnet/shared/Microsoft.NETCore.App/9.0.7/libcoreclr.so
#8 0x00007fb6b363214e in ?? () from /usr/share/dotnet/shared/Microsoft.NETCore.App/9.0.7/libcoreclr.so
#9 0x00007fb6b384e1f5 in __pthread_create_2_1 (newthread=<optimized out>, attr=<optimized out>, start_routine=<optimized out>, arg=<optimized out>) at ./nptl/pthread_create.c:846
#10 0x0000000000000000 in ?? ()
Thread 181 (Thread 0x7e83927fc6c0 (LWP 255) "rdk:broker10265"):
#0 0x00007fb6b38c121f in __GI_cfsetispeed (termios_p=0x7e83740049b8, speed=2) at ../sysdeps/unix/sysv/linux/speed.c:96
#1 0x0000000000000000 in ?? ()
2. Idle Worker Threads
Most other threads, especially .NET TP Worker threads, are idle and appear to be waiting for work (stuck in __pthread_attr_extension).
Thread 194 (Thread 0x7e8651ffb6c0 (LWP 311668) ".NET TP Worker"):
#0 0x00007fb6b384af16 in __pthread_attr_extension (attr=0x0) at ./nptl/pthread_attr_extension.c:30
#1 0x0000000000000189 in ?? ()
#2 0x000055a8504e83b8 in ?? ()
#3 0x0000000000000000 in ?? ()
Thread 193 (Thread 0x7e84fdffb6c0 (LWP 311660) ".NET TP Worker"):
#0 0x00007fb6b384af16 in __pthread_attr_extension (attr=0x0) at ./nptl/pthread_attr_extension.c:30
#1 0x000055a800000189 in ?? ()
#2 0x000055a8504e83b8 in ?? ()
#3 0x0000000000000000 in ?? ()
Thread 192 (Thread 0x7e851effd6c0 (LWP 311650) ".NET TP Worker"):
#0 0x00007fb6b384af16 in __pthread_attr_extension (attr=0x0) at ./nptl/pthread_attr_extension.c:30
#1 0x0000000000000189 in ?? ()
#2 0x000055a8504e83b8 in ?? ()
#3 0x0000000000000000 in ?? ()
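With close to 200 threads in the dump, grouping them by thread name and top frame makes the outliers easier to spot. The following is a rough sketch, assuming the dump was produced by gdb's "thread apply all bt" and saved as text. The function name group_by_top_frame is hypothetical, and the symbol names are simply whatever gdb resolved for frame #0.

```python
import re
from collections import Counter

# Header line, e.g.: Thread 2 (Thread 0x7fb6b2fb56c0 (LWP 7) ".NET SynchManag"):
THREAD_RE = re.compile(r'^Thread \d+ \(Thread 0x[0-9a-f]+ \(LWP \d+\) "([^"]*)"\):')
# Innermost frame, e.g.: #0 0x00007fb6b38c121f in __GI_cfsetispeed (...)
FRAME0_RE = re.compile(r'^#0\s+0x[0-9a-f]+ in (\S+)')


def group_by_top_frame(gdb_output):
    """Count threads per (thread name, symbol of frame #0)."""
    counts = Counter()
    current_name = None
    for line in gdb_output.splitlines():
        line = line.strip()
        header = THREAD_RE.match(line)
        if header:
            current_name = header.group(1)
            continue
        frame = FRAME0_RE.match(line)
        if frame and current_name is not None:
            counts[(current_name, frame.group(1))] += 1
            current_name = None  # only count frame #0 once per thread
    return counts


SAMPLE = """\
Thread 2 (Thread 0x7fb6b2fb56c0 (LWP 7) ".NET SynchManag"):
#0 0x00007fb6b38c121f in __GI_cfsetispeed (termios_p=0x7fb6b2fb4d78, speed=1) at ../sysdeps/unix/sysv/linux/speed.c:96
Thread 194 (Thread 0x7e8651ffb6c0 (LWP 311668) ".NET TP Worker"):
#0 0x00007fb6b384af16 in __pthread_attr_extension (attr=0x0) at ./nptl/pthread_attr_extension.c:30
Thread 193 (Thread 0x7e84fdffb6c0 (LWP 311660) ".NET TP Worker"):
#0 0x00007fb6b384af16 in __pthread_attr_extension (attr=0x0) at ./nptl/pthread_attr_extension.c:30
"""

if __name__ == "__main__":
    for (name, symbol), n in group_by_top_frame(SAMPLE).most_common():
        print(f"{n:4d}  {name:20s} {symbol}")
```

Frames that gdb prints without an address (for example when full debug info is available) are not matched by this pattern, so the regular expression would need adjusting for such dumps.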
Summary
The presence of __GI_cfsetispeed in the stack for the .NET SynchManag thread is highly suspicious and suggests a potential native-level deadlock, possibly within glibc or how the runtime interacts with it. This stall prevents all managed code (including timers and the thread pool) from making progress.
Given that diagnostic tools like dotnet-stack also fail, this points to a very low-level lock or stall.
Reproduction Steps
Hard to reproduce; it occurs intermittently in our production environment.
Expected behavior
Should not hang forever
Actual behavior
Hangs forever
Regression?
No response
Known Workarounds
No response
Configuration
- .NET Runtime Version: 9.0.7
- Operating System: Linux (inferred from stack trace)
- Architecture: x64 (inferred from stack trace)
- GC Mode: Server GC
- Environment Variable: DOTNET_GCName=libclrgc.so
Other information
No response