Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rework SocketsHttpHandler request queue handling for HTTP/1.1 and HTTP/2 #53851

Merged
merged 9 commits into from
Jul 15, 2021

Conversation

geoffkizer
Copy link
Contributor

@geoffkizer geoffkizer commented Jun 8, 2021

Fixes #44818
Fixes #47484

This is a significant change to connection pooling and request queuing in SocketsHttpHandler. The primary goal is to address #44818 -- that is, allow a request to be satisfied by any connection that becomes available while the request is waiting for a connection. The secondary goals are:

(1) Provide consistency in connection pooling across HTTP versions. Where possible, code is shared and reused. In other cases, we provide parallel but distinct implementations for different HTTP versions.
(2) Separate pool policy from connection state, where possible, to allow in the future the ability to use connections directly without being associated with a pool. This is a work in progress; the HTTP2 logic is closer to enabling this goal than the HTTP/1.1 logic.

Details:

Previously, a new connection attempt (HTTP1.1 or HTTP2) was always associated with a specific request.

In the simplest case of HTTP/1.1, no connection limit, and no connection currently available, this would always result in a new connection attempt for the request, and the request would always wait for this connection to be established, even if other connections became available sooner.

For HTTP/1.1 with a connection limit, when the limit was hit, we would queue requests to the HTTP/1.1 request queue instead of initiating a new connection. When an existing connection became available, it would try to satisfy one of the queued requests, if any. If an existing connection became unusable, then it would allow a new connection to be created -- but again, this connection would be associated with a specific request (the one that happened to be at the top of the HTTP.1/1 request queue at the time).

The HTTP/2 model was very different. By default, it would only allow a single HTTP/2 connection at a time. During connection establishment, requests would be queued by the _http2ConnectionCreateLock. After the HTTP/2 connection was established, requests were subject to queuing by the _concurrentStreams CreditManger on the Http2Connection, which would queue requests when necessary to enforce the HTTP2 MAX_CONCURRENT_STREAMS setting from the server.

When the EnableMultipleHttp2Connections setting was set to true, HTTP/2 behavior would change so that we would no longer queue on the connection to enforce MAX_CONCURRENT_STREAMS. Instead, we would create a new HTTP2 connection to satisfy requests. However, like HTTP/1.1, new requests would always wait for the new connection to be established instead of using existing connections if/when they became available for use.

This PR implements a uniform request queuing model for HTTP/1.1 and HTTP/2 that is used for all requests when no connections are immediately usable. When connections (new or existing) become usable, they dequeue requests from the queue and process them.

Simple connection injection policy is used to determine if/when to create a connections. Specifically:

For HTTP/1.1:
We will inject a new connection whenever we are:
(a) below the connection limit (if any), and
(b) have more requests in the queue than pending connections.

For HTTP/2:
We only inject a single HTTP/2 connection at a time.
Before use, an HTTP/2 connection is checked to see if it has available streams. If not, we will not attempt to use it; instead we will register to be notified when streams become available. There is no longer any connection-level queuing of requests.
If EnableMultipleHttp2Connections is true, we will inject a new connection any time we do not have a pending connection, and we have queued requests.

Additionally, this PR contains some cleanup/clarification to related issues such as HTTP2 connection shutdown, connection scavenging, managing of idle time and lifetime, etc.

@dotnet/ncl Please take a look.
@stephentoub You implemented most of the HTTP/1.1 connection pooling logic originally
@scalablecory
@alnikola You are most familiar with the HTTP/2 connection management
@ManickaP You are most familiar with #47484

@ghost
Copy link

ghost commented Jun 8, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #44818

This is a relatively large change to connection pooling and request queuing in SocketsHttpHandler. The immediate goal is to address #44818. Beyond that, this reworking is also intended to:
(1) Provide consistency in connection pooling across HTTP versions. Where possible, code is shared and reused. In other cases, we provide parallel but distinct implementations for different HTTP versions.
(2) Separate pool policy from connection state, where possible, to allow in the future the ability to use connections directly without being associated with a pool. This is a work in progress; the HTTP2 logic is closer to enabling this goal than the HTTP/1.1 logic.

This PR does not change HTTP3 pooling logic; this will be addressed in a subsequent PR.

Note: There are a couple TODOs in the code currently, and I've also disabled a bunch of telemetry tests that are failing due to the queuing changes.

Author: geoffkizer
Assignees: -
Labels:

area-System.Net.Http

Milestone: -

@geoffkizer geoffkizer requested review from a team and scalablecory June 8, 2021 05:21
@geoffkizer geoffkizer marked this pull request as draft June 8, 2021 05:21
Copy link
Contributor

@scalablecory scalablecory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only about half way through the PR but I like the direction so far.

Comment on lines 234 to 237
if (_shutdownWaiter is null)
{
_shutdownWaiter = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (_shutdownWaiter is null)
{
_shutdownWaiter = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
}
_shutdownWaiter ??= new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);

Debug.Assert(_httpStreams.Count == 0);
ThrowShutdownException();
throw; // unreachable
// The server must have sent a downward adjustment to SETTINGS_MAX_CONCURRENT_STREAMS, so our previous stream reservation is no longer valid.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also avoid ACKing the settings frame until we're at/under the limit including reservations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could. Doing that seems nontrivial, so for now at least I'm just keeping the existing behavior.

I agree we should consider this in the future though.

private bool _inUse;
private bool _detachedFromPool;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this different from _inUse?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_detachedFromPool is used in one very specific case, which is when NT auth causes us to create a new connection (and close the old one) in order to proceed with auth.

//
// Once we have found a usable connection, we use it to process the request.
// For HTTP/1.1, a connection can handle only a single request at a time, thus it is immediately removed from the list of available connections.
// For HTTP2/3, a connection is only removed from the available list when it has no more available streams.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the server dynamically changes how many streams are allowed on the connection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See Http2Connection.SendHeadersAsync. We will throw a retryable exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note this can't happen in HTTP3.

_associatedHttp11ConnectionCount++;
_pendingHttp11ConnectionCount++;

Task.Run(() => AddHttp11Connection(request));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to queue a work item to call this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, because we are holding the lock here. We could move the AddHttp11Connection outside the lock with a bit of work if we wanted.

Second, it seems good in general to queue any work associated with creating the new connection vs starting to do that work on the calling thread. This routine is called in a couple different places, and if we were to call AddHttp11Connection directly here, we might be delaying some other work from completing.

Thoughts?

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. Skimmed through the draft. Generally direction looks reasonable. Thanks.

@geoffkizer geoffkizer marked this pull request as ready for review June 23, 2021 06:26
@karelz karelz requested review from ManickaP and alnikola June 24, 2021 15:47
@geoffkizer
Copy link
Contributor Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

// Fail the next request, if any.
HttpRequestException hre = new HttpRequestException(SR.net_http_http2_connection_not_established);
ExceptionDispatchInfo.SetCurrentStackTrace(hre);
_http2RequestQueue.TryFailNextRequest(hre);
Copy link
Contributor

@alnikola alnikola Jun 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if fully understand how this approach handles multiple HTTP/2 connections. _http2RequestQueue is shared by all connections, so it looks like the next request dequeuing can happen on a different connection that the one passed to here, therefore that other connection will be filed by this exception.
Am I missing something? Could you please clarify the logic here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the question.

TryFailNextRequest will dequeue the request and fail it. Is that what you're asking about? I can change the name to be clearer. It's meant to parallel TryDequeueNextRequest below, except that the request will be failed with the specific exception.

{
if (NetEventSource.Log.IsEnabled()) connection.Trace("");

Task.Run(async () =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems Task.Run() pattern is frequently used by this new code, but it's not clear to me how potential exceptions are handled in all such invocations. In example, I don't see a wrapping try {} catch {} block here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here should never throw. I could add a try/catch and assert this.

Copy link
Member

@ManickaP ManickaP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went through all except for HttpConnectionPool.cs, there's too much churn and I'm already too tired to understand it ATM. I'll finish on Monday.

@@ -315,7 +315,7 @@ public async Task GetAsyncWithRedirect_SetCookieContainer_CorrectCookiesSent()
using (HttpClient client = CreateHttpClient(handler))
{
client.DefaultRequestHeaders.ConnectionClose = true; // to avoid issues with connection pooling
await client.GetAsync(url1);
await client.GetAsync(url1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT:

Suggested change
await client.GetAsync(url1);
await client.GetAsync(url1);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what happened here :)

{
if (!_concurrentStreams.TryRequestCreditNoWait(1))
if (_nextStream == MaxStreamId)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this part roughly corresponds to the original ThrowShutdownException, so my question is: should we check _abortException here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If _abortException is set, then _shutdown is true. We check _shutdown just below, and if it's true, we throw a retryable exception.


if (NetEventSource.Log.IsEnabled()) Trace($"{nameof(lastValidStream)}={lastValidStream}, {nameof(_lastStreamId)}={_lastStreamId}");
Shutdown();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here to L1711 seems pretty much like the end of ProcessGoAwayFrame and what originally was in AbortStreams.
Is there a reason why not have it in a private method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I basically just merged all these functions together because it didn't seem like there was a good reason to have them as separate functions.


private bool HasSyncObjLock => Monitor.IsEntered(_availableHttp11Connections);

// Overview of connection management (mostly HTTP version independent):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the long comment 👍

@@ -367,10 +378,54 @@ public byte[] Http2AltSvcOriginUri
}
}

public bool EnableMultipleHttp2Connections => _poolManager.Settings.EnableMultipleHttp2Connections;
private bool EnableMultipleHttp2Connections => _poolManager.Settings.EnableMultipleHttp2Connections;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we even keep it? It's used at only one place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I will remove this.


Debug.Assert(_maxConnections != int.MaxValue, "_waiters queue is allocated but no connection limit is set??");
CheckForHttp2ConnectionInjection();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a cyclic call chain CheckForHttp2ConnectionInjection -> AddHttp2ConnectionAsync -> HandleHttp2ConnectionFailure -> CheckForHttp2ConnectionInjection which concerns me because it looks vulnerable to subtle bugs we can lead to going into an infinite loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AddHttp2ConnectionAsync is never called directly -- rather, we enqueue a Task to run it. So it's not literally going to end up in an infinite cyclic call chain.

We should be safe from runaway behavior because:

(1) We only ever inject a new connection if there are requests queued -- and more specifically, if the # of requests queued > # of pending connections currently. (And other policy is true as well.)
(2) Any connection attempt will always result in one request being dequeued and processed (unless the queue is empty). In particular, if a connection attempt fails, this will result in a request getting failed.

So at the very least, the number of connection attempts will be bounded by the number of requests submitted. And as connections are reused for multiple requests, this will shrink the queue and further limit the number of connections created.

I do agree that there is a risk of subtle bugs here that could result in very problematic behavior, but I'm not sure what to do here beyond what I've already done -- try to keep the model as simple as possible, and assert where appropriate that unexpected things are not happening. Suggestions welcome.


_associatedHttp2ConnectionCount--;

CheckForHttp2ConnectionInjection();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another cyclic call chain that look suspicious to me CheckForHttp2ConnectionInjection -> AddHttp2ConnectionAsync -> ReturnHttp2Connection -> DisableHttp2Connection -> CheckForHttp2ConnectionInjection.

Copy link
Member

@ManickaP ManickaP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments.

/// <summary>The maximum number of HTTP/1.1 connections allowed to be associated with the pool.</summary>
private readonly int _maxHttp11Connections;
/// <summary>The number of HTTP/1.1 connections associated with the pool, including in use, available, and pending.</summary>
private int _associatedHttp11ConnectionCount;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does not correspond to _availableHttp11Connections.Count? The same question for H/2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_availableHttp11Connections only holds the connections that are currently usable for new requests. Connections which are in use by a current request aren't in the list.

The total _associatedHttp11ConnectionCount includes available connections but also connections that are in use and therefore not available, as well as pending connections, i.e. those that are in the process of being established.

It's mostly useful so we can check it against the max connection limit.

_associatedHttp11ConnectionCount++;
_pendingHttp11ConnectionCount++;

Task.Run(() => AddHttp11ConnectionAsync(request));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this needs to run completely separately? Wouldn't just var _ = AddHttp11ConnectionAsync(request) be enough? Is this about the lock we have here? It would be nice to at least some comment explaining why it needs to run this way, so that someone doesn't remove it accidentally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's partly about the lock, but that could be handled differently -- we could just make the call outside the lock.

The main issue is that we don't want to do a lot of work on the calling thread here. If we invoke AddHttp11ConnectionAsync here directly (or in the callers), we are potentially doing a lot of work in this call, including calling the user's ConnectCallback etc. We don't want the calling thread here to be tied up doing that. So instead we queue a work item to the thread pool to do whatever work is necessary to create the new connection, while the current thread can continue on doing whatever it was about to do -- whether that is to wait for (any) available connection or to finish up tearing down an existing connection, etc.

// HTTP/2 connection pool

/// <summary>List of available HTTP/2 connections stored in the pool.</summary>
private List<Http2Connection>? _availableHttp2Connections;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing duplicate fields and methods for HTTP 1.1 and HTTP 2 (_http?RequestQueue, _associatedHttp?ConnectionCount, _availableHttp?Connections, AddHttp?ConnectionAsync, CheckForHttp?ConnectionInjection, GetHttp?ConnectionAsync, ...) Wouldn't it be worth it to extract it into a generic class and thus reduce the amount of code in here?

This is just a clean-up suggestion, it's not a prereq to get this PR merged. Perhaps you already have some ideas in your mind...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about trying to unify these, but in the end I didn't. Or rather, I decided to share RequestQueue<T> here but not the rest of the fields.

There are several important differences here between HTTP/1.1 and HTTP2/3:

(1) HTTP/1.1 has a (optional) hard connection limit, which we don't have for HTTP2 or HTTP3 (but perhaps we should).
(2) For HTTP/1.1, we allow multiple pending connections at a time, whereas for HTTP2/3 we only allow one pending connection at a time.
(3) We know that HTTP/1.1 connections can only handle a single request at a time, so we always remove them from the available list when we are about to use them. For HTTP2/3, we only remove them from the available list if they can't handle any more requests at the moment.

Etc...

All that said, I think there is more potential unification that could happen here; I will look at this and see what makes sense to do.

{
// We have a connection that we can attempt to use.
// Validate it below outside the lock, to avoid doing expensive operations while holding the lock.
connection = _availableHttp2Connections![availableConnectionCount - 1];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used take H/2 connection from the start of the array, so that we would always try to get as many requests processed by the connections from the beginning. Thus we could inject a connection to handle additional load and then get quickly rid of it one the load calms down. I don't know if this will still apply since we always pick the newest connection. Maybe I'm missing some other piece that makes this all work together...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'm also concerned by this. Currently, GetHttp2Connection goes from the oldest to the newest connection validating their expiration. It's done to handle traffic bursts, but don't keep many connection opened longer than necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new behavior here is different, but I don't think it's any worse than the existing behavior, and in some cases it's better.

The new behavior here is that we will use HTTP2 connections in the order that they most recently became newly available for use. So, at any given point in time, we are trying to use the most recently active connection, and only falling back to other connections if/when that connection is full and not usable.

In general, I believe this will optimize for removing unnecessary connections when a burst subsides.

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I skimmed it, but overall LGTM.

@geoffkizer
Copy link
Contributor Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@karelz karelz assigned geoffkizer and unassigned ManickaP and alnikola Jul 13, 2021
@geoffkizer
Copy link
Contributor Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@geoffkizer
Copy link
Contributor Author

I ran HttpStress for 30 minutes each of HTTP/1.1 and HTTP2, no failures.

@karelz
Copy link
Member

karelz commented Jul 14, 2021

That is great! Does it mean that once there is clean CI, we can merge?

@geoffkizer
Copy link
Contributor Author

That is great! Does it mean that once there is clean CI, we can merge?

Yeah. I saw a failure in the HTTP2 tests and wanted to re-run them... then I realized it was this: #41078, which has already been disabled again.

@geoffkizer geoffkizer merged commit d3a8a19 into dotnet:main Jul 15, 2021
alnikola added a commit that referenced this pull request Jul 15, 2021
…g logic

It enables all disabled multiple HTTP/2 connection tests to check if it now work on the new connection scheduling logic introduced by #53851.
@karelz karelz added this to the 6.0.0 milestone Jul 15, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Aug 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
6 participants