Production server freeze/deadlock on .NET 9.0.7, CPU 0, gdb points to __GI_cfsetispeed in multiple threads #121461

@vrecluse

Description

Environment

  • .NET Runtime Version: 9.0.7
  • Operating System: Linux Debian 12
  • Architecture: x64
  • GC Mode: Server GC
  • Environment Variable: DOTNET_GCName=libclrgc.so

Problem Description

We are running a .NET 9 server application in our production environment. The server application intermittently experiences a total freeze or deadlock, becoming completely unresponsive.

Observed Symptoms

When the freeze occurs, we observe the following:

  1. No Response: All HTTP API endpoints stop responding.
  2. No Logs: The application stops writing any new logs.
  3. Delayed Timers: We have System.Threading.Timer instances that are observed to trigger significantly later than their scheduled time, indicating the entire process is stalled.
  4. Correlation with GC: The freeze often correlates with a high-activity GC period (e.g., GC pause times reported as > 2 seconds just before the incident).
  5. Process State: During the freeze, htop or top shows the process's CPU usage drops to 0.
  6. Thread Pool: The thread pool queue length drops to near 0, even as new requests should be queuing.
  7. Diagnostics Fail: Attempting to use dotnet-stack report on the hung process also hangs indefinitely and provides no output.

GDB Stack Traces

We successfully attached gdb to the hung process. We found that the vast majority of .NET TP Worker threads are idle, but a few critical threads seem to be involved in a deadlock.

1. Suspicious Stacks (Potential Deadlock)

Two threads, including the crucial .NET SynchManag thread, are stuck in __GI_cfsetispeed. This appears to be the core of the deadlock.

Thread 2 (Thread 0x7fb6b2fb56c0 (LWP 7) ".NET SynchManag"):
#0  0x00007fb6b38c121f in __GI_cfsetispeed (termios_p=0x7fb6b2fb4d78, speed=1) at ../sysdeps/unix/sysv/linux/speed.c:96
#1  0x00007fb6b2fb4db0 in ?? ()
#2  0x00007fb6b2fb4d78 in ?? ()
#3  0x0000000000000001 in ?? ()
#4  0xffffffff00000000 in ?? ()
#5  0x000055a8504b0f58 in ?? ()
#6  0x00007fb6b3628fa0 in ?? () from /usr/share/dotnet/shared/Microsoft.NETCore.App/9.0.7/libcoreclr.so
#7  0x00007fb6b3628603 in ?? () from /usr/share/dotnet/shared/Microsoft.NETCore.App/9.0.7/libcoreclr.so
#8  0x00007fb6b363214e in ?? () from /usr/share/dotnet/shared/Microsoft.NETCore.App/9.0.7/libcoreclr.so
#9  0x00007fb6b384e1f5 in __pthread_create_2_1 (newthread=<optimized out>, attr=<optimized out>, start_routine=<optimized out>, arg=<optimized out>) at ./nptl/pthread_create.c:846
#10 0x0000000000000000 in ?? ()

Thread 181 (Thread 0x7e83927fc6c0 (LWP 255) "rdk:broker10265"):
#0  0x00007fb6b38c121f in __GI_cfsetispeed (termios_p=0x7e83740049b8, speed=2) at ../sysdeps/unix/sysv/linux/speed.c:96
#1  0x0000000000000000 in ?? ()

2. Idle Worker Threads

Most other threads, especially .NET TP Worker threads, are idle and appear to be waiting for work (stuck in __pthread_attr_extension).

Thread 194 (Thread 0x7e8651ffb6c0 (LWP 311668) ".NET TP Worker"):
#0  0x00007fb6b384af16 in __pthread_attr_extension (attr=0x0) at ./nptl/pthread_attr_extension.c:30
#1  0x0000000000000189 in ?? ()
#2  0x000055a8504e83b8 in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 193 (Thread 0x7e84fdffb6c0 (LWP 311660) ".NET TP Worker"):
#0  0x00007fb6b384af16 in __pthread_attr_extension (attr=0x0) at ./nptl/pthread_attr_extension.c:30
#1  0x000055a800000189 in ?? ()
#2  0x000055a8504e83b8 in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 192 (Thread 0x7e851effd6c0 (LWP 311650) ".NET TP Worker"):
#0  0x00007fb6b384af16 in __pthread_attr_extension (attr=0x0) at ./nptl/pthread_attr_extension.c:30
#1  0x0000000000000189 in ?? ()
#2  0x000055a8504e83b8 in ?? ()
#3  0x0000000000000000 in ?? ()
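
To make the split between "suspicious" and "idle" threads easier to check across all threads in a dump, the gdb output can be grouped by innermost frame. The following is a minimal sketch, not part of the original investigation: it assumes the backtraces were saved as plain text in the format shown above, and the function name `group_by_top_frame` and the `SAMPLE` excerpt are illustrative only. The symbol names are whatever gdb printed; without full debug symbols they may only be the nearest exported symbol rather than the function actually executing.

```python
import re
from collections import defaultdict

# Matches gdb thread headers such as:
#   Thread 2 (Thread 0x7fb6b2fb56c0 (LWP 7) ".NET SynchManag"):
THREAD_RE = re.compile(
    r'^Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP (\d+)\) "([^"]*)"\):'
)

# Matches the innermost frame (#0) and captures the symbol gdb printed, e.g.:
#   #0  0x00007fb6b38c121f in __GI_cfsetispeed (termios_p=...) at ...
FRAME0_RE = re.compile(r'^#0\s+(?:0x[0-9a-f]+ in )?(\S+) \(')


def group_by_top_frame(backtrace_text):
    """Return {symbol: [(gdb_thread_id, thread_name), ...]} keyed by frame #0."""
    groups = defaultdict(list)
    current = None  # thread header seen most recently, awaiting its frame #0
    for raw_line in backtrace_text.splitlines():
        line = raw_line.strip()
        header = THREAD_RE.match(line)
        if header:
            current = (int(header.group(1)), header.group(3))
            continue
        frame = FRAME0_RE.match(line)
        if frame and current is not None:
            groups[frame.group(1)].append(current)
            current = None
    return dict(groups)


# Hypothetical excerpt: thread headers and frame #0 lines copied from this report.
SAMPLE = """\
Thread 2 (Thread 0x7fb6b2fb56c0 (LWP 7) ".NET SynchManag"):
#0  0x00007fb6b38c121f in __GI_cfsetispeed (termios_p=0x7fb6b2fb4d78, speed=1) at ../sysdeps/unix/sysv/linux/speed.c:96

Thread 181 (Thread 0x7e83927fc6c0 (LWP 255) "rdk:broker10265"):
#0  0x00007fb6b38c121f in __GI_cfsetispeed (termios_p=0x7e83740049b8, speed=2) at ../sysdeps/unix/sysv/linux/speed.c:96

Thread 194 (Thread 0x7e8651ffb6c0 (LWP 311668) ".NET TP Worker"):
#0  0x00007fb6b384af16 in __pthread_attr_extension (attr=0x0) at ./nptl/pthread_attr_extension.c:30
"""

if __name__ == "__main__":
    for symbol, threads in group_by_top_frame(SAMPLE).items():
        names = ", ".join(f"{tid}:{name}" for tid, name in threads)
        print(f"{symbol}: {len(threads)} thread(s) [{names}]")
```

Run against a complete `thread apply all bt` capture, this gives a per-symbol thread count, which makes it quick to see whether any threads other than the two listed above share the `__GI_cfsetispeed` frame.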

Summary

The presence of __GI_cfsetispeed in the stack for the .NET SynchManag thread is highly suspicious and suggests a potential native-level deadlock, possibly within glibc or how the runtime interacts with it. This stall prevents all managed code (including timers and the thread pool) from making progress.

Given that diagnostic tools like dotnet-stack also fail, this points to a very low-level lock or stall.


Reproduction Steps

Hard to reproduce; it occurs intermittently in our production environment.

Expected behavior

Should not hang forever

Actual behavior

Hangs forever

Regression?

No response

Known Workarounds

No response

Configuration

  • .NET Runtime Version: 9.0.7
  • Operating System: Linux Debian 12
  • Architecture: x64
  • GC Mode: Server GC
  • Environment Variable: DOTNET_GCName=libclrgc.so

Other information

No response
