Skip to content

Consider detecting stalled HTTP/2 connections in SocketsHttpHandler #82236

@MihaZupan

Description

@MihaZupan

In .NET 7, we added a mitigation in the connection pool to tear down stalled connection attempts to prevent a single stalled operation from resulting in many request failures (#71785).

When using HTTP/1.1, if a connection stops responding, the request using said connection will eventually time out and tear down the connection.
But with HTTP/2, the connection is shared, and request cancellations/timeouts do not affect its lifetime. As a result, a single stalled connection can continue to live in the connection pool and result in many request failures.

I recently saw a memory dump with an Http2Connection that had 0 currently active streams, but a write channel with thousands of entries. All the requests were failing inside of Http2Connection.SendHeadersAsync as the write channel was not getting drained.
There was also a pending 32 KB socket write for that connection, but it had apparently not completed/failed for a long time.
Because we store a timestamp every time we read a frame, I can also see that we haven't received anything from the server for 80 hours at the time the dump was taken, despite a PooledConnectionIdleTimeout of 1 minute.
Note that these were regular socket read/writes in an otherwise responsive process.

I believe we should consider employing some sort of mitigation to get rid of connections where either the writes aren't draining, or the server hasn't sent us anything for a long time.

The obvious way to achieve that would be to make use of client-initiated HTTP/2 pings that we exposed in .NET 5 (#31198), which are currently OFF by default. Judging by the comments on the original API proposal issue, some servers are very aggressive about closing connections when clients use pings, so just turning them on by default may prove too risky.

On the other hand, we could have logic that did "If FlushOutgoingBytesAsync hasn't completed in X amount of time, take the connection out of the pool" + "If header writes are going through, but we haven't received anything from the server in Y amount of time, take the connection out of the pool as it's effectively idle/unresponsive" or something along those lines.

Any thoughts?

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions