Description
While stress-testing storage in Garnet, I stumbled upon what looks like a case of Windows disk IO completion events occasionally getting lost. This happens in .NET 10 (and .NET 9), but apparently not in .NET 8. It happens frequently on a few of our servers, but very rarely on others.
The disk IO is specifically an asynchronous unbuffered overlapped IO (sector aligned).
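For readers unfamiliar with the alignment requirement: unbuffered file IO on Windows expects the file offset and the transfer length to be multiples of the sector size. The Python sketch below only illustrates that arithmetic; the function name `align_request` and the 4096-byte default sector size are assumptions for illustration and are not taken from the repro.

```python
def align_request(offset: int, length: int, sector_size: int = 4096) -> tuple[int, int]:
    """Widen (offset, length) to the smallest enclosing sector-aligned range.

    sector_size is an assumed value; a real program would query the device.
    """
    if sector_size <= 0 or sector_size & (sector_size - 1):
        raise ValueError("sector_size must be a positive power of two")
    mask = sector_size - 1
    aligned_offset = offset & ~mask               # round the start down
    aligned_end = (offset + length + mask) & ~mask  # round the end up
    return aligned_offset, aligned_end - aligned_offset
```

A caller would issue the read for the widened range and then slice the requested bytes out of the result.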
I was luckily able to build a fairly minimal self-contained repro outside the Garnet codebase. Running the repro requires starting a (simple) client and server process. This is because the thread that issues the IO on the server side seems to need to be a SocketAsyncEventArgs completion handler to trigger the bug.
Reproduction Steps
Find the code here: https://github.com/microsoft/garnet/tree/badrishc/dotnet-issue-121608/playground/ReadFileRepro
Steps
Window 1:
cd ReadFileRepro\ReadFileServer
dotnet run -c Release -f net10.0
Window 2:
cd ReadFileRepro\ReadFileClient
dotnet run -c Release -f net10.0
Four client-server connections are established by default in the repro. In response to each client request, the server issues a disk IO, then waits on a semaphore. The disk IO completion is configured to release the semaphore. You will see that after a while the server stops printing the count of processed requests, because all threads are waiting on the semaphore forever. I verified that the IO is issued but the callback (which is supposed to release the semaphore) never gets scheduled.
The repro uses two different device implementations: Native Win32 calls based on ReadFile with the handle bound to the thread pool, and the .NET RandomAccess API. The bug manifests in both cases.
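The issue-then-wait pattern described above can be modelled in a few lines. This is a language-neutral sketch in Python, not the repro's code: it does not touch the Windows IO path and cannot reproduce the bug. The names `handle_request` and `simulated_read` are hypothetical, and a thread pool task stands in for the disk read. The one deliberate difference from the repro is that the wait has a timeout, so a completion that never arrives shows up as a `False` result rather than a permanent hang.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def handle_request(pool, do_io, timeout_s):
    """Issue one asynchronous operation and wait for its completion callback.

    Returns True if the callback released the semaphore in time, and False
    if the wait timed out (the completion was not observed).
    """
    done = threading.Semaphore(0)
    future = pool.submit(do_io)  # stands in for issuing the disk read
    # The completion callback is the only thing that releases the semaphore.
    future.add_done_callback(lambda _f: done.release())
    return done.acquire(timeout=timeout_s)


def simulated_read():
    # Hypothetical stand-in for the disk read; returns one 4 KiB block.
    return b"\0" * 4096


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        processed = 0
        for _ in range(1000):
            if not handle_request(pool, simulated_read, timeout_s=5.0):
                print(f"completion not observed after {processed} requests")
                break
            processed += 1
        else:
            print(f"processed {processed} requests")
```

In a real server, a bounded wait like this is mainly useful as a diagnostic: it turns the stall into a log line that records how many requests succeeded before the first missing completion.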
Expected behavior
The server should run forever, printing a running count of IOs processed.
Actual behavior
The server stops printing output because the threads get stuck waiting on a semaphore that the IO callback is supposed to release.
Regression?
Worked fine in .NET 8, seems to occur from .NET 9 onwards, including the latest version of .NET 10.
Known Workarounds
- If I set DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1, the bug does not repro. So this seems like an event loss when jumping from the IO thread to the thread pool.
- If I create my own dedicated threads that invoke GetQueuedCompletionStatus and circumvent the thread pool, the bug does not repro.
Configuration
Processor: AMD EPYC 7413 24-Core Processor 2.65 GHz (2 processors)
System type: 64-bit operating system, x64-based processor
Edition: Windows Server 2022 Standard
Version: 21H2
OS build: 20348.4405
dotnet version: 10.0.100
Other information
No response