Improve Kestrel connection metrics with error reasons #55565

JamesNK · 2024-05-07T03:31:19Z

This PR improves the kestrel.connection.duration metric by adding information about why the connection ended.

Tags added or changed:

- http.connection.protocol_code - Standards-based, publicly communicated code for why the connection ended. Often sent to the client. The value is either the HTTP/2 or HTTP/3 error code that the connection ended with. Doesn't apply to HTTP/1.1.
- kestrel.connection.end_reason - A more detailed, internal reason for why the connection ended. Always present.
- error.type - Only present if the end reason is an error, e.g. connection killed because of a timeout, excessive stream resets, transport ended while there are active HTTP requests, invalid frame, etc. The value could be the same as the end reason, or an exception type name if an unhandled exception is thrown from the connection middleware pipeline. The first error recorded wins.

We need to decide what the new tag is. Is it the end reason or the error reason? For example, a connection can be ended by the client sending a GOAWAY frame. Do we want to track that? It's not an error to close a connection like that, so in this case we're tracking all end reasons. On the other hand, we might want to only track unexpected and unhealthy reasons that end the connection. The benefit of only tracking errors is that it is very easy to get just the unhealthy connections by filtering metrics to ones that have this tag. However, I think we can achieve this by also putting error reasons into the error.type tag.

TL;DR: kestrel.connection.end_reason is always present and provides the reason; error.type is present only if the end reason is considered an error, and then has the same value as kestrel.connection.end_reason.
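As a rough illustration of the tagging scheme described above, here is a minimal sketch using System.Diagnostics.Metrics. It is not the PR's implementation: the `EndReason` enum, the `ConnectionMetricsSketch` class, the meter and instrument names, and the choice of which reasons count as errors are all assumptions made for the example. Tag values use the enum name as-is, which may differ from the format Kestrel emits.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical subset of end reasons; the ConnectionEndReason enum in the PR has many more values.
public enum EndReason { ClientGoAway, ApplicationShutdown, ServerTimeout, IOError }

public static class ConnectionMetricsSketch
{
    // Placeholder meter and instrument names, not the ones Kestrel uses.
    private static readonly Meter s_meter = new("Sketch.Kestrel");
    private static readonly Histogram<double> s_duration =
        s_meter.CreateHistogram<double>("sketch.connection.duration", unit: "s");

    // Assumed classification for illustration only; which reasons are errors is the open question in this thread.
    public static bool IsError(EndReason reason) =>
        reason is EndReason.ServerTimeout or EndReason.IOError;

    // The end reason tag is always added; error.type is added only for error reasons and repeats the same value.
    public static TagList BuildTags(EndReason reason)
    {
        var name = reason.ToString();
        var tags = new TagList { { "kestrel.connection.end_reason", name } };
        if (IsError(reason))
        {
            tags.Add("error.type", name);
        }
        return tags;
    }

    public static void RecordConnectionEnd(double seconds, EndReason reason)
    {
        var tags = BuildTags(reason);
        s_duration.Record(seconds, tags);
    }
}
```

With this shape, filtering on the presence of error.type selects only the unhealthy connections, while kestrel.connection.end_reason still covers every connection.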

Questions for folks:

Is having information about all connection end reasons useful? If it isn't then we could remove kestrel.connection.end_reason and just have error.type.
What end reasons are an error? For example, are the following errors or not...
- Transport closing when there are no active HTTP requests
- Transport closing when there are active HTTP requests
- Connection closing because the application is shutting down
- Client sends GOAWAY with error code of NO_ERROR
- Client sends GOAWAY with error code of INTERNAL_ERROR

Tasks:

cc @noahfalk @tarekgh @lmolkova @samsp-msft

JamesNK · 2024-05-07T03:41:07Z

src/Servers/Kestrel/test/InMemory.FunctionalTests/Http2/Http2ConnectionTests.cs

@@ -1727,7 +1727,7 @@ public async Task AbortConnectionAfterTooManyEnhanceYourCalms()
        await WaitForConnectionErrorAsync<Http2ConnectionErrorException>(
            ignoreNonGoAwayFrames: true,
            expectedLastStreamId: int.MaxValue,
-            expectedErrorCode: Http2ErrorCode.INTERNAL_ERROR,
+            expectedErrorCode: Http2ErrorCode.ENHANCE_YOUR_CALM,


@amcasey I added an HttpConnection.Abort overload that has an Http2ErrorCode argument. When there are excessive resets, I made the passed-in value ENHANCE_YOUR_CALM instead of INTERNAL_ERROR. That seems a more appropriate error.

Did you choose INTERNAL_ERROR for a reason here, or is it returned because that is what abort already sent?

This was a hotfix, so I suspect I was making the smallest change possible, but I no longer recall.

src/Servers/Kestrel/Core/src/Internal/Http2/Http2Connection.cs

noahfalk · 2024-05-09T08:49:19Z

Is having information about all connection end reasons useful?

What was the motivation to add the data initially? I can hypothesize a reason someone might find it useful (diagnosing why connections didn't live as long as they were expected to live), but I don't know how often someone needs to diagnose that issue in practice, or what alternatives they would have available if this metric attribute didn't exist.

What end reasons are an error? For example, are the following errors or not..

Any attempt at classification gets very subjective but a few suggestions at least:

- Ideally errors represent situations that do not occur frequently and are undesirable. For example, "Connection closing because the application is shutting down" might be infrequent but it doesn't sound undesirable.
- Ideally metric error classifications would be consistent with other observable behavior that implies an error classification. For example, if in-proc code throws an Exception or some error code is transmitted in the wire protocol, then it would be consistent for the metrics to also classify that condition as an error. If "Client sends GOAWAY with error code of NO_ERROR", then it would be consistent for metrics also not to treat that as an error.
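The consistency rule suggested above could be sketched roughly as follows. This is only one possible reading: the `Http2Code` enum and `EndClassifierSketch` class are hypothetical names, and the rule covers just the two signals mentioned (an unhandled exception, or a non-NO_ERROR code on the wire). The numeric values are the HTTP/2 error codes from RFC 9113.

```csharp
using System;

// Small subset of HTTP/2 error codes (RFC 9113, section 7).
public enum Http2Code { NoError = 0x0, ProtocolError = 0x1, InternalError = 0x2, EnhanceYourCalm = 0xb }

public static class EndClassifierSketch
{
    // Treat the connection end as an error only when another observable signal already says so:
    // an unhandled exception, or a wire error code other than NO_ERROR.
    public static bool IsError(Http2Code? wireCode, Exception? unhandled) =>
        unhandled is not null || (wireCode is { } code && code != Http2Code.NoError);
}
```

Under this rule a client GOAWAY with NO_ERROR is not an error, a GOAWAY with INTERNAL_ERROR is, and cases with no wire code and no exception (such as application shutdown) default to non-error.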

src/Servers/Kestrel/Core/src/Internal/Http/Http1OutputProducer.cs

lmolkova · 2024-05-09T16:44:13Z

I think it's valuable to know the end reason even if it's not an error and even if we don't have a specific scenario in mind.
I also don't see any drawbacks from including it - it would not increase cardinality much.

The change looks great from semconv perspective! LMK if you need any help adding new attributes to otel.

samsp-msft · 2024-05-09T21:12:28Z

This seems like data that should be going into logs, not metrics. Metric dimensions should be kept small, as adding additional properties and values adds incremental storage. We should reduce this data down to one property, whether the connection closed cleanly or due to an error, and provide logs to give more detail as to why. Metrics are supposed to be about aggregated statistics, not detailed analysis of what happened.

Anyone who is going to be doing analysis can then use the metric to determine if the trend is changing, and then go to a log query to understand why.

Once exemplars are better supported in .NET/OTel they will provide the correlation between metrics and traces/logs.

JamesNK · 2024-05-10T00:52:16Z

@samsp-msft

There is prior art for showing more error info in error.type. That tag is already on HTTP requests for server and client in spans and metrics. Reducing error info down to a boolean is optimizing too much and goes against conventions.

http.connection.protocol_code (exact name TBD) is the HTTP connection version of request's HTTP status code. I don't think it is controversial.

kestrel.connection.end_reason is more debatable. My feeling from talking to people using Kestrel over the years is connections are a black box and people want more insight here. If there are performance/technical reasons why an extra tag is problematic, then I can reach out to ops folks to get feedback on whether end reasons for "non-error" situations would be useful.

Note that while the ConnectionEndReason enum has a lot of potential values (about 35), many of them are for protocol violations which will almost never happen, or are specific to one particular HTTP protocol (e.g. closing a critical stream is specific to HTTP/3). There are about half a dozen that would occur in typical usage:

ConnectionReset
UnexpectedError
AbortedByApplication
ApplicationShutdown
ServerTimeout
IOError
ClientGoAway
TransportCompleted

That is similar to how there are many HTTP status codes, but most of them you'll never see.

src/Servers/Kestrel/Core/src/Internal/HttpConnection.cs

src/Servers/Kestrel/Core/src/Internal/Http/Http1Connection.cs

amcasey · 2024-05-13T20:00:42Z

src/Servers/Kestrel/Core/src/Internal/Http/Http1MessageBody.cs

@@ -72,7 +72,7 @@ protected override Task OnConsumeAsync()
            _context.ReportApplicationError(connectionAbortedException);

            // Have to abort the connection because we can't finish draining the request
-            _context.StopProcessingNextRequest();
+            _context.StopProcessingNextRequest(ConnectionEndReason.AbortedByApplication);


It feels like we might want to distinguish between "the application requested an abort" and "the application failed and we had to abort".

I went with AbortedByApp because the log message says the app aborted the connection. I agree that the reason isn't clear.

What do you think about the logging and end reason here @halter73 @Tratcher @mgravell ?

I think the assumption here is that the only reason TryReadInternal would throw an InvalidOperationException is because middleware exited after calling HttpContext.Request.BodyReader.ReadAsync without a corresponding call to AdvanceTo, and rather than try to recover we bail out since we assume it's an unlikely scenario anyway.

I should have created a custom message explaining this rather than reuse CoreStrings.ConnectionAbortedByApplication for the exception passed to ReportApplicationError.

Ideally, we'd use "application_abort" only for HttpContext.Abort() on HTTP/1.1 connections, and then use "application_left_bodyreader_in_invalid_state" for this, and "application_canceled_response_flush" for the usage of AbortedByApplication in TimingPipeFlusher.

If we reused the same ConnectionEndReason for HttpContext.Abort() and this issue with leaving the body reader in an invalid state, I won't be too upset since this scenario is pretty niche. Cancelling a response body flush on an HTTP/1.1 connection probably deserves its own ConnectionEndReason though.

src/Servers/Kestrel/Core/src/Internal/Http2/Http2Connection.cs

src/Servers/Kestrel/Core/src/Internal/Http2/Http2FrameWriter.cs

amcasey · 2024-05-13T20:13:33Z

src/Servers/Kestrel/Core/src/Internal/Http3/Http3Connection.cs

@@ -530,8 +544,14 @@ private void UpdateStreamTimeouts(long timestamp)
                    await outboundControlStreamTask;
                }

+                // Use graceful close reason if it has been set.
+                if (reason == ConnectionEndReason.Unknown && _gracefulCloseReason != ConnectionEndReason.Unknown)


Nit: I don't think && _gracefulCloseReason != ConnectionEndReason.Unknown actually accomplishes anything.

What do you mean? Is it that the _gracefulCloseReason != ConnectionEndReason.Unknown condition could be removed, because an unknown value would just be set to unknown again?

It's been a while, but I probably meant that reason == ConnectionEndReason.Unknown implies _gracefulCloseReason != ConnectionEndReason.Unknown at this point in the code. I may well have been mistaken.
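One reading of the nit, that dropping the second check changes nothing, can be checked with a small sketch. The diff only shows the condition, so the body here (assigning the graceful close reason to `reason`) is an assumption, and the `Reason` enum and `GracefulReasonSketch` class are hypothetical stand-ins.

```csharp
using System;

// Hypothetical stand-in for ConnectionEndReason with only a few values.
public enum Reason { Unknown, ClientGoAway, ServerTimeout }

public static class GracefulReasonSketch
{
    // Condition as written in the diff; the assignment in the body is assumed.
    public static Reason WithExtraCheck(Reason reason, Reason graceful)
    {
        if (reason == Reason.Unknown && graceful != Reason.Unknown)
        {
            reason = graceful;
        }
        return reason;
    }

    // Same logic without the second check: an Unknown reason is overwritten with Unknown, which is a no-op.
    public static Reason WithoutExtraCheck(Reason reason, Reason graceful)
    {
        if (reason == Reason.Unknown)
        {
            reason = graceful;
        }
        return reason;
    }
}
```

Both methods return the same value for every combination of inputs, as long as the body does nothing beyond the assignment. If the real body also logs or has other side effects, the extra check would still matter.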

amcasey · 2024-05-13T20:27:44Z

Is having information about all connection end reasons useful? If it isn't then we could remove kestrel.connection.end_reason and just have error.type.

With the notable disclaimer that I haven't actually monitored the health of a real service...

I think I'd want to have it for all connections. Sure, the majority will be graceful shutdowns (hopefully), but it's (I assume) easy enough to filter out a few uninteresting values and which values are uninteresting seems (as mentioned above) very subjective.

Another reason to include things that don't seem like errors is for making attack signatures. Something that's "normal" for a single connection may be problematic when it happens on many connections simultaneously. Or it may not be problematic, but still be an early signal that a problem will soon occur.

Personally, I think I'd have just end_reason and drop error.type altogether (with the caveat that we might want a special way to represent unhandled exceptions).

If we go the other way, we have all sorts of strange cases to consider like whether cutting off a slow-reading client is an error. Arguably, that's the system functioning as expected and desired, but it's still interesting.

amcasey · 2024-05-13T20:30:56Z

This seems like data that should be going into logs not metrics

@samsp-msft I'm not sure I'm following this line of reasoning. Given that logs are structured and queryable, why would anything not be logs-only? When would you use metrics instead?

src/Servers/Kestrel/Core/src/Internal/Http/Http1MessageBody.cs

src/Servers/Kestrel/Core/src/Internal/Infrastructure/PipeWriterHelpers/TimingPipeFlusher.cs

src/Servers/Kestrel/Core/src/Internal/Http/Http1Connection.cs

src/Servers/Kestrel/Core/src/Internal/Http/HttpProtocol.cs

src/Servers/Kestrel/Core/test/HttpResponseHeadersTests.cs

src/Servers/Kestrel/samples/Http2SampleApp/Program.cs

src/Shared/ServerInfrastructure/Http2/ConnectionEndReason.cs

src/Shared/Metrics/MetricsExtensions.cs

src/Servers/Kestrel/Core/src/Internal/Infrastructure/KestrelMetrics.cs

src/Servers/Kestrel/test/InMemory.FunctionalTests/ResponseTests.cs

amcasey · 2024-07-17T21:31:17Z

src/Servers/Kestrel/test/InMemory.FunctionalTests/ResponseTests.cs

@@ -2395,7 +2415,9 @@ public async Task ConnectionClosedAfter101Response()
    [Fact]
    public async Task ThrowingResultsIn500Response()
    {
-        var testContext = new TestServiceContext(LoggerFactory);
+        var testMeterFactory = new TestMeterFactory();
+        using var connectionDuration = new MetricCollector<double>(testMeterFactory, "Microsoft.AspNetCore.Server.Kestrel", "kestrel.connection.duration");


Consider extracting a helper for new MetricCollector<double>(testMeterFactory, "Microsoft.AspNetCore.Server.Kestrel", "kestrel.connection.duration").

src/Servers/Kestrel/Core/src/Internal/Http2/Http2Connection.cs

amcasey

This PR is big enough that I could keep giving feedback indefinitely, but I have no fundamental concerns and would be okay with merging it.

src/Servers/Kestrel/Core/src/Internal/Http/Http1Connection.cs

src/Servers/Kestrel/Core/src/Internal/Http2/Http2Connection.cs

src/Servers/Kestrel/Core/src/Internal/Http3/Http3ControlStream.cs

src/Servers/Kestrel/Core/src/Internal/Http3/Http3Stream.cs

amcasey · 2024-07-18T17:45:15Z

src/Shared/ServerInfrastructure/Http2/ConnectionEndReason.cs

+    FlowControlWindowExceeded,
+    KeepAliveTimeout,
+    InsufficientTlsVersion,
+    InvalidHandshake,


This seems similar to UnexpectedFrame

What do you mean?

InvalidHandshake is only used for HTTP/2 and occurs when the client doesn't start the connection with the HTTP/2 preface.

I think that comment was supposed to be about FrameAfterStreamClose. My bad.

Hm, I wonder if I was commenting on an older commit. I can't find a refresh button in the VS Code UI that seems to cover moving forward to the latest revision of the PR.

src/Shared/ServerInfrastructure/Http2/ConnectionEndReason.cs

src/Servers/Kestrel/Core/src/Internal/Http/Http1Connection.cs

src/Servers/Kestrel/Core/src/Internal/Infrastructure/KestrelTrace.General.cs

src/Servers/Kestrel/Core/src/Internal/Http/Http1MessageBody.cs

src/Servers/Kestrel/Core/src/Internal/Http/HttpProtocol.cs

halter73 · 2024-07-19T21:14:19Z

src/Servers/Kestrel/Core/src/Internal/Http3/Http3Connection.cs

@@ -468,20 +473,24 @@ private void UpdateStreamTimeouts(long timestamp)
        {
            Log.RequestProcessingError(_context.ConnectionId, ex);
            error = ex;
+            reason = ConnectionEndReason.IOError;


I think it'd be fine to say it's some sort of transport level error.

halter73 · 2024-07-19T21:16:03Z

src/Servers/Kestrel/Core/src/Internal/Http3/Http3Connection.cs

    }

    public void OnInputOrOutputCompleted()
    {
        TryStopAcceptingStreams();

        // Abort the connection using the error code the client used. For a graceful close, this should be H3_NO_ERROR.
-        Abort(new ConnectionAbortedException(CoreStrings.ConnectionAbortedByClient), (Http3ErrorCode)_errorCodeFeature.Error);
+        Abort(new ConnectionAbortedException(CoreStrings.ConnectionAbortedByClient), (Http3ErrorCode)_errorCodeFeature.Error, ConnectionEndReason.TransportCompleted);


Like I mentioned in my other comment, I think indicating that this was client initiated would be good. If ClientFin is too TCP specific, we could go with ClientClosedTransport or something.

src/Servers/Kestrel/Core/src/Internal/Infrastructure/KestrelMetrics.cs

JamesNK · 2024-07-20T03:43:41Z

/benchmark plaintext aspnet-citrine-win kestrel

pr-benchmarks · 2024-07-20T04:19:02Z

Benchmark started for plaintext on aspnet-citrine-win with kestrel. Logs: link

pr-benchmarks · 2024-07-20T04:19:34Z

An error occurred, please check the logs

JamesNK added the area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions label May 7, 2024

JamesNK requested a review from amcasey May 7, 2024 03:31

JamesNK requested review from halter73, BrennanConroy and mgravell as code owners May 7, 2024 03:31

JamesNK commented May 7, 2024

gfoidl reviewed May 7, 2024

src/Servers/Kestrel/Core/src/Internal/Http2/Http2Connection.cs

JamesNK force-pushed the jamesnk/connection-protocol-code branch from 7c05f97 to 8a35061 Compare May 9, 2024 08:36

noahfalk reviewed May 9, 2024

src/Servers/Kestrel/Core/src/Internal/Http/Http1OutputProducer.cs (outdated)

amcasey reviewed May 13, 2024

dotnet-policy-service bot added the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label May 21, 2024

JamesNK force-pushed the jamesnk/connection-protocol-code branch from eb1b413 to 6ed61b7 Compare June 10, 2024 03:44

JamesNK removed the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Jun 10, 2024

halter73 reviewed Jun 12, 2024

src/Servers/Kestrel/Core/src/Internal/Http/Http1MessageBody.cs (outdated)

halter73 reviewed Jun 12, 2024

src/Servers/Kestrel/Core/src/Internal/Infrastructure/PipeWriterHelpers/TimingPipeFlusher.cs (outdated)

JamesNK commented Jun 18, 2024

src/Servers/Kestrel/Core/src/Internal/Http/Http1Connection.cs (outdated)

dotnet-policy-service bot added the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Jun 27, 2024

JamesNK force-pushed the jamesnk/connection-protocol-code branch from 9b036f4 to 4c8f48d Compare June 28, 2024 14:04

JamesNK removed the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Jun 28, 2024

JamesNK force-pushed the jamesnk/connection-protocol-code branch 2 times, most recently from 4b4839f to c510461 Compare July 4, 2024 06:33

JamesNK commented Jul 4, 2024

src/Servers/Kestrel/Core/src/Internal/Http/HttpProtocol.cs

build-analysis bot mentioned this pull request Jul 4, 2024

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

dotnet-policy-service bot added the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Jul 11, 2024

JamesNK added 14 commits July 17, 2024 11:25

Tests, graceful shutdown vs abort shutdown, add reason for invalid bo…

193e4a1

…dywriter state

WIP

d7825d6

HTTP/1.1 end reason work

733476a

Comment

8e1fd7f

Add reasons for some HTTP/1.1 request body errors

6bb6196

More tests

8fb7354

Add exceeded max concurrent connections reason

077cf06

Flaky test

33a98af

WIP

47f5ebe

WIP

fac7e1c

Update

b90be50

Fix build

cc3a26f

Clean up

34f175b

PR feedback

7b517ea

JamesNK force-pushed the jamesnk/connection-protocol-code branch from d528f01 to 7b517ea Compare July 17, 2024 03:35

build-analysis bot mentioned this pull request Jul 17, 2024

Roslyn analyzer throws error AD0001 NullReferenceException dotnet/dnceng#3305

amcasey reviewed Jul 17, 2024

PR feedback

38f887c

amcasey reviewed Jul 18, 2024

JamesNK added 4 commits July 19, 2024 09:11

PR feedback

d066571

Add header limits exceeded

550376d

Fix tests

e51f962

Fix tests

8238159

WontonSam approved these changes Jul 19, 2024

halter73 approved these changes Jul 19, 2024

amcasey approved these changes Jul 19, 2024

PR feedback

43f692e
