Disk IO completions getting lost #121608

@badrishc

Description

While stress-testing storage in Garnet, I stumbled upon what looks like a case of Windows disk IO completion events occasionally getting lost. This happens in .NET 10 (and .NET 9), but apparently not in .NET 8. It happens frequently on a few of our servers, but very rarely on others.

The disk IO is specifically an asynchronous unbuffered overlapped IO (sector aligned).

I was luckily able to build a fairly minimal self-contained repro outside the Garnet codebase. Running the repro requires starting a (simple) client and server process. This is because the thread that issues the IO on the server side seems to need to be a SocketAsyncEventArgs completion handler to trigger the bug.

Reproduction Steps

Find the code here: https://github.com/microsoft/garnet/tree/badrishc/dotnet-issue-121608/playground/ReadFileRepro

Steps

Window 1:

cd ReadFileRepro\ReadFileServer
dotnet run -c Release -f net10.0

Window 2:

cd ReadFileRepro\ReadFileClient
dotnet run -c Release -f net10.0

Four client-server connections are established by default in the repro. In response to each client request, the server issues a disk IO, then waits on a semaphore. The disk IO completion is configured to release the semaphore. You will see that after a while the server stops printing the count of processed requests because all threads are waiting on the semaphore forever. I verified that the IO is issued but the callback (which is supposed to release the semaphore) never gets scheduled.
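For readers who do not want to open the repro first, the issue-then-wait pattern described above can be sketched roughly as follows. This is not the repro code: `LostCompletionProbe` and `ReadAndWait` are hypothetical names, the file is a small temp file, and the read is an ordinary buffered async read through `RandomAccess.ReadAsync` rather than the unbuffered, sector-aligned IO that the actual repro issues. It also does not issue the read from a SocketAsyncEventArgs completion handler, so it is not expected to trigger the bug on its own. The one deliberate difference from the repro is that the wait has a timeout, so a missing completion shows up as a message instead of a hang.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Win32.SafeHandles;

static class LostCompletionProbe
{
    // Issues one asynchronous read and blocks until the completion callback
    // releases the semaphore. Returns false if the callback did not run
    // within the timeout (hypothetical helper, not part of the repro).
    public static bool ReadAndWait(SafeFileHandle handle, long offset, int length, TimeSpan timeout)
    {
        // Not disposed on purpose: a late callback may still call Release().
        var done = new SemaphoreSlim(0, 1);
        byte[] buffer = new byte[length];

        // Start the read and attach the callback that releases the semaphore.
        // A failed read also releases it; the error itself is ignored here.
        Task<int> pending = RandomAccess.ReadAsync(handle, buffer.AsMemory(), offset).AsTask();
        pending.ContinueWith(_ => { done.Release(); }, TaskContinuationOptions.ExecuteSynchronously);

        // The repro waits forever; the timeout turns a lost completion into a result.
        return done.Wait(timeout);
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[4096]);
        try
        {
            using SafeFileHandle handle = File.OpenHandle(
                path, FileMode.Open, FileAccess.Read, FileShare.Read, FileOptions.Asynchronous);

            for (int i = 1; i <= 1000; i++)
            {
                if (!ReadAndWait(handle, 0, 4096, TimeSpan.FromSeconds(5)))
                {
                    Console.WriteLine($"Read {i}: completion callback did not run within 5 s");
                    return;
                }
            }
            Console.WriteLine("All completions observed");
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The 5 second timeout is an arbitrary choice for illustration. In the actual repro the equivalent wait never returns once a completion is lost.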

The repro uses two different device implementations: Native Win32 calls based on ReadFile with the handle bound to the thread pool, and the .NET RandomAccess API. The bug manifests in both cases.

Expected behavior

The server should run forever, printing a running count of IOs processed.

Actual behavior

The server stops printing output because the threads get stuck waiting on a semaphore that the IO callback is supposed to release.

Regression?

Worked fine in .NET 8, seems to occur from .NET 9 onwards, including the latest version of .NET 10.

Known Workarounds

  • If I set DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1 then the bug does not repro. So this seems like an event loss when jumping from IO thread to the thread pool.
  • If I create my own dedicated threads that invoke GetQueuedCompletionStatus and circumvent the thread pool, then the bug does not repro.

Configuration

Processor: AMD EPYC 7413 24-Core Processor 2.65 GHz (2 processors)
System type: 64-bit operating system, x64-based processor
Edition: Windows Server 2022 Standard
Version: 21H2
OS build: 20348.4405

dotnet version: 10.0.100

