Consider adding a telemetry event for failed HTTP connection attempts #110351

Open · MihaZupan opened this issue Dec 3, 2024 · 6 comments

MihaZupan commented Dec 3, 2024

Because of how requests are decoupled from connections in SocketsHttpHandler, the user may not always see the errors that occurred during connection establishment (DNS + TCP + TLS + HTTP/2 handshake).
The errors may be swallowed if the initiating request was already served by a different connection, or if that request timed out.
There are therefore situations where you may have poor visibility into what's happening.

One way of gaining this information is to inject a custom ConnectCallback along the lines of:

handler.ConnectCallback = async (context, ct) =>
{
    var stopwatch = Stopwatch.StartNew();
    var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
    try
    {
        await socket.ConnectAsync(context.DnsEndPoint, ct);
        return new NetworkStream(socket, ownsSocket: true);
    }
    catch (Exception ex)
    {
        socket.Dispose();
        // Covers DNS resolution and TCP connect failures, with timing information.
        Console.WriteLine($"Failed to connect to {context.DnsEndPoint} after {stopwatch.Elapsed.TotalMilliseconds:N2} ms: {ex}");
        throw;
    }
};

This gives you visibility into failures when trying to connect, but not into TLS failures/timeouts.
It's possible to also perform the TLS handshake in the callback (see the sketch below), but we don't always make it trivial (how do you pick the host, how do you handle ALPN, potential custom SSL settings, ...).
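
For illustration, a minimal sketch of what the TLS variant could look like. This is hypothetical, not the handler's actual logic: it hardcodes the SNI host and the ALPN list, ignores any custom SslOptions on the handler, and for https requests SocketsHttpHandler would still perform its own handshake on top of the returned stream.

// Requires: using System.Collections.Generic; using System.Net.Security; using System.Net.Sockets;
handler.ConnectCallback = async (context, ct) =>
{
    var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
    try
    {
        await socket.ConnectAsync(context.DnsEndPoint, ct);

        var sslStream = new SslStream(new NetworkStream(socket, ownsSocket: true));
        await sslStream.AuthenticateAsClientAsync(new SslClientAuthenticationOptions
        {
            TargetHost = context.DnsEndPoint.Host, // assumption: SNI == the DNS host
            ApplicationProtocols = new List<SslApplicationProtocol> { SslApplicationProtocol.Http11 }
        }, ct);
        return sslStream;
    }
    catch (Exception ex)
    {
        // A single catch now covers DNS, TCP, and TLS handshake failures.
        socket.Dispose();
        Console.WriteLine($"Connecting to {context.DnsEndPoint} failed: {ex}");
        throw;
    }
};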


It seems rather simple for us to add an EventSource event to System.Net.Http that just logs all such failures, corresponding to

private void HandleHttp11ConnectionFailure(HttpConnectionWaiter<HttpConnection>? requestWaiter, Exception e)
{
    if (NetEventSource.Log.IsEnabled()) Trace($"HTTP/1.1 connection failed: {e}");

private void HandleHttp2ConnectionFailure(HttpConnectionWaiter<Http2Connection?> requestWaiter, Exception e)
{
    if (NetEventSource.Log.IsEnabled()) Trace($"HTTP2 connection failed: {e}");

private void HandleHttp3ConnectionFailure(HttpConnectionWaiter<Http3Connection?> requestWaiter, Exception? e)
{
    Debug.Assert(IsHttp3Supported());
    if (NetEventSource.Log.IsEnabled()) Trace($"HTTP3 connection failed: {e}");

Maybe something like

[Event(42, Level = EventLevel.Error)]
public void ConnectionFailed(byte versionMajor, byte versionMinor, string scheme, string host, int port, string? remoteAddress, double elapsedMilliseconds, string exception);
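
If such an event existed, an in-process consumer could be a plain EventListener. A sketch; "ConnectionFailed" here is the event proposed above, not an API that exists in System.Net.Http today:

// Requires: using System.Diagnostics.Tracing;
internal sealed class ConnectionFailureListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // "System.Net.Http" is the existing EventSource the proposed event would live on.
        if (eventSource.Name == "System.Net.Http")
        {
            EnableEvents(eventSource, EventLevel.Error);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName == "ConnectionFailed") // hypothetical event name
        {
            Console.WriteLine($"{eventData.EventName}: {string.Join(", ", eventData.Payload!)}");
        }
    }
}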

Does this seem reasonable, @antonfirsov? Or maybe we have better ways of exposing such data now?

dotnet-policy-service bot added the untriaged label Dec 3, 2024

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.


antonfirsov commented Dec 3, 2024

maybe we have better ways of exposing such data now?

We are reporting the same exceptions on the HTTP connection_setup Activity when a connection attempt fails:

finally
{
    if (activity is not null)
    {
        ConnectionSetupDistributedTracing.StopConnectionSetupActivity(activity, exception, remoteEndPoint);
    }
}

Whether it's a "better way" depends on the use case. Some may prefer to consume distributed traces, others may want EventSource logs. IMO it would be valuable to expose the error information in both telemetry legs, but it should probably be logged at the same point, and I would prefer it to happen around the line I linked. The difference is that at that point we don't catch exceptions coming from PlainTextStreamFilter, and cancellations are not wrapped yet by CreateConnectTimeoutException.
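
For reference, those failures could be observed in-process with an ActivityListener along these lines. A sketch; the source name "Experimental.System.Net.Http.Connections" is my assumption for where the 9.0 connection_setup activity is published, so verify it against your runtime version:

// Requires: using System.Diagnostics;
ActivitySource.AddActivityListener(new ActivityListener
{
    ShouldListenTo = source => source.Name == "Experimental.System.Net.Http.Connections",
    Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData,
    ActivityStopped = activity =>
    {
        // StopConnectionSetupActivity records the exception, which should surface as an error status.
        if (activity.Status == ActivityStatusCode.Error)
        {
            Console.WriteLine($"{activity.DisplayName} failed: {activity.StatusDescription}");
        }
    }
});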

@MihaZupan

Ah, thanks, I forgot about this one. An event could still be useful but I'm not as worried in that case.

are not wrapped yet by CreateConnectTimeoutException.

Looks like we're inconsistent here right now; we do report the ConnectTimeout on H/3.

catch (Exception e)
{
    connectionException = e is OperationCanceledException oce && oce.CancellationToken == cts.Token && !waiter.CancelledByOriginatingRequestCompletion ?
        CreateConnectTimeoutException(oce) :
        e;

    // On success path connectionSetupActivity is stopped before calling InitQuicConnection().
    // This assertion makes sure that InitQuicConnection() does not throw unexpectedly.
    Debug.Assert(connectionSetupActivity?.IsStopped is not true);
    if (connectionSetupActivity is not null) ConnectionSetupDistributedTracing.StopConnectionSetupActivity(connectionSetupActivity, connectionException, null);

I could see it being useful if we included info on whether the cancellation was due to ConnectTimeout or due to CancelledByOriginatingRequestCompletion (we don't include that info today anyway, but we could improve that).

One difference that could matter is that we also exclude H/2 handshake failures here.

@antonfirsov

An event could still be useful but I'm not as worried in that case.

So are you aware of a specific user scenario you wanted this to help with? If yes, does this mean they would be OK consuming distributed traces?


MihaZupan commented Dec 5, 2024

Yes. I'm helping in an investigation where a service isn't establishing connections for some reason.
In trying to advise what sort of logging might help us investigate the issue (outside of internal diagnostics), I can't give a good, clear, simple answer at the moment.
The ConnectCallback can cover some, but not all, cases. Existing EventCounters & events can also help.

does this mean they would be OK consuming distributed traces?

I'll try to find out, but the fact that it's only been available since 9.0 is likely going to be an issue.

(Whatever improvements we make here are unlikely to help in this particular investigation, but I'd like us to have a better experience in the future, if for no other reason than to make investigating things easier for ourselves.)

antonfirsov added this to the 10.0.0 milestone Dec 5, 2024
antonfirsov removed the untriaged label Dec 5, 2024
@antonfirsov

Ok, tentatively triaged this to 10.0 assuming it's easy to implement. We can always change the priority.
