Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Activities for Http Connections, Dns, Sockets and SslStream #103922

Merged
merged 50 commits into from
Jul 13, 2024

Conversation

antonfirsov
Copy link
Member

@antonfirsov antonfirsov commented Jun 24, 2024

Edit: for the final design see:

#103922 (comment)
open-telemetry/semantic-conventions#1192


This is a replacement for #101814 following our agreement that Http Connection sub-operations (Dns, Socket, SslStream) should be represented with their sub-activities instead of emitting events on the Http Connection activity.

The PR implements the following activity-tree across multiple libraries:

System.Net.Http.Connections.HttpConnection
  |- System.Net.NameResolution.DsnLookup
  |- System.Net.Sockets.Connect
  |- System.Net.Security.TlsHandshake

Where System.Net.Http.Connections.HttpConnection doesn't have a parent (has it's own Trace ID), and every request is linked to their connection via activity.AddLink() when the connection is being dispatched to a request.

The 4 new activities have their own ActivitySources which have to be enabled separately. By using OpenTelemetry.Exporter.Console this can be done via wildcards:

using var tracerProvider = Sdk.CreateTracerProviderBuilder()
                .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("ConnAct"))
                .AddSource("System.Net.*")
                .AddConsoleExporter()
                .Build();

A sampe program & output can be found here..

Fixes #93832.

/cc @samsp-msft @noahfalk @lmolkova

@davidfowl
Copy link
Member

A sampe program & output can be found here..

Can I ask you to show an example with the aspire dashboard? https://aspiredashboard.com/

It should be a simple docker command to run, pointing the app at it with the otlp exporter. Looking at console output for traces is quite hard.

# Conflicts:
#	src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/Http3Connection.cs
#	src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionBase.cs
@antonfirsov
Copy link
Member Author

@davidfowl here you go:

image

Where expanding the connection will see:

image

A few notes:

  • The connection span covers the entire connection lifetime until closure, while the subspans cover the networking operations (DNS lookup, TCP connection establishment, TLS handshake)
  • There are no attributes since thy are not yet implemented. We are currently discussing what is reasonable to include.
  • Currently, it is not possible to navigate to connections from requests. For that, Aspire dashboard needs to implement Aspire Dashboard: Traces: Span links aren't shown in span details aspire#2577.
    image
  • With the current design, you will not be able to navigate to failing connections from their initiating requests. For that we would need to add another Activity Link and tag those links with names ("initiated connection" vs "serving connection"). This might raise the concern that we bake too many connection pool implementation details into the telemetry we produce.

@stephentoub
Copy link
Member

stephentoub commented Jun 26, 2024

The connection span covers the entire connection lifetime until closure

Is this going to be useful in a trace? I don't know one way or the other, but it seems like it could result in a lot of noise.

Is there an equivalent span that's only about setting up the connection, from the time it was requested until the time it became usable?

@davidfowl
Copy link
Member

Is this going to be useful in a trace? I don't know one way or the other, but it seems like it could result in a lot of noise.

Is there an equivalent span that's only about setting up the connection, from the time it was requested until the time it became usable?

I agree, we care about how long it took to connect, not how long the connection itself lasts. That feels like it could be another span, but not the most useful one. I think about this as zooming into why a request may have taken so long.

@antonfirsov
Copy link
Member Author

antonfirsov commented Jun 26, 2024

That feels like it could be another span, but not the most useful one.

The proposed connection span matches the semantics of the http.client.request.duration metric.

One solution to have more useful traces could be to decompose it as following:

[http] connection {peer}                 // Covers the entire pooled connection lifetime
  |- [http] connection_initiation {peer} // Covers the connection establishment
      |- DNS {host}
      |- [socket] connection_initiation {peer}
      |- TLS {server_name}

Another is to only introduce connection_initiation, however that feels problematic to link from follow-up requests reusing the same connection.

cc @noahfalk @lmolkova

@antonfirsov antonfirsov marked this pull request as draft June 26, 2024 23:29
@lmolkova
Copy link

lmolkova commented Jun 26, 2024

The overall connection span could be useful to understand:

  • which connection my request was done over (if we can link connection span to request span). alternatives could be:
    • some connection id attribute on the request span
    • just logs
  • could help debugging abandoned response streams and connections that are not returned to the pool (some additional stuff is probably needed to make it obvious)
  • not sure if HTTP pipelining is still a thing, but it makes it easy to detect
  • how connection ended (server closed, error happened, etc), when did it end comparing to the request

Many other useful things are already exposed as metrics (number of active connections, connection duration), so I agree span usefulness is limited.

Regarding the noise: connections should be rare comparing to requests (if there is any significant load). The spans are reported by a different ActivitySource, so users decide if they want them.

Anyway, I don't have a strong opinion on how necessary it is - it can always be added later.

Copy link
Member

@MihaZupan MihaZupan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll let Mana double check the H3 part, but this LGTM, thank you

P.S. When we eventually remove the experimental prefix, my small preference would be to use the existing activity source for HTTP.

Copy link
Member

@ManickaP ManickaP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTTP for now, I have skipped tests though.
I'm continuing on DNS/Sockets/TLS.


public static void AddConnectionLinkToRequestActivity(Activity connectionSetupActivity)
{
Debug.Assert(connectionSetupActivity is not null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like in ReportError, why does this require the null check before every call instead of having:

if (connectionSetupActivity is null) return;

here.

Copy link
Member Author

@antonfirsov antonfirsov Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern seems to be generally around for our telemetry everywhere:

if (HttpTelemetry.Log.IsEnabled()) HttpTelemetry.Log.SomeMethod();

I assumed this exists to save a few cycles by avoiding calling a method and deepening the stack on the happy path, so I also applied it for metrics and distributed tracing.


namespace System.Net.Http
{
internal static class ConnectionSetupDiagnostics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think this warrants yet another diagnostics related class in HTTP, but I'm not gonna die on this hill.
Also I have a small preference for renaming to something like ConnectionDistributedTracing or something.


public static void StopConnectionSetupActivity(Activity activity, Exception? exception, IPEndPoint? remoteEndPoint)
{
Debug.Assert(activity is not null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like in ReportError, why does this require the null check before every call instead of having:

if (connectionSetupActivity is null) return;

here.

Copy link
Member Author

@antonfirsov antonfirsov Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment. IMO if we want to get rid of this pattern because we prefer clarity/readability over saving a few cycles, we should do it for all telemetry pillars not only distributed tracing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@antonfirsov antonfirsov Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm being inconsistent since the Start() methods would also need a check if we want to strictly follow the pattern. I will wait for more feedback before changing anything about this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReportError has the check inside and not at the call site, that's one that is bothering me (or rather Stop and AddConnectionLink). With Start, the check and code is more complicated so it makes sense to keep it inside.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReportError is already on an expensive Exception path, so no point bothering with micro-optimizations.
I leave this for now as-is, but I believe we should revisit our telemetry coding patterns/guidelines.

My preference would be clarity over microoptimizations everywhere, but some may disagree.

stream = await ConnectHelper.EstablishSslConnectionAsync(_sslOptionsProxy, request, async, stream, cancellationToken).ConfigureAwait(false);
}
break;
Activity? activity = ConnectionSetupDiagnostics.StartConnectionSetupActivity(IsSecure, OriginAuthority);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is IsSecure enough here? It assumes only HTTP goes through this, but what about web sockets (ws / wss)? And what about socks proxies?

Copy link
Member Author

@antonfirsov antonfirsov Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, looks like ws/wss/socks will report http/https wrongly in both Metrics and Distributed Tracing. If we want to address this in this PR, I would prefer to do it by opting out from creating an activity for those for now, and cover other protocols later.

IMHO in practice this will unlikely hurt anyone for an experimental feature, so my preference would be to leave it as-is for now, and create a tracking issue for .NET 10, when we could fix this for both Metrics and Tracing.

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it as-is for now, and create a tracking issue for .NET 10

👍

Copy link
Member

@ManickaP ManickaP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments.

@antonfirsov
Copy link
Member Author

@ManickaP addressed:

#103922 (comment)
#103922 (comment)
class renaming from: #103922 (comment)

Copy link
Member

@ManickaP ManickaP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last batch of comments. Thank you for your patience!

}
}

static (string?, string?) GetNameAndVersionString(SslProtocols protocol) => protocol switch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SslProtocols is flags, so is this behaving as expected? E.g. returning nothing in case there are more protocols set, or we're sure that at this point there will always be just one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EventSource telemetry does the same after acquiring the enum with GetSslProtocolInternal() so I assumed it should be a single value after handshake completion normally. If not, our telemetry doesn't emit the protocol info.

switch (protocol)
{
#pragma warning disable SYSLIB0039 // TLS 1.0 and 1.1 are obsolete
case SslProtocols.Tls:
protocolSessionsOpen = ref _sessionsOpenTls10;
handshakeDurationCounter = _handshakeDurationTls10Counter;
break;
case SslProtocols.Tls11:
protocolSessionsOpen = ref _sessionsOpenTls11;
handshakeDurationCounter = _handshakeDurationTls11Counter;
break;
#pragma warning restore SYSLIB0039
case SslProtocols.Tls12:
protocolSessionsOpen = ref _sessionsOpenTls12;
handshakeDurationCounter = _handshakeDurationTls12Counter;
break;
case SslProtocols.Tls13:
protocolSessionsOpen = ref _sessionsOpenTls13;
handshakeDurationCounter = _handshakeDurationTls13Counter;
break;
}

@@ -15,6 +15,11 @@ namespace System.Net.Security
{
public partial class SslStream
{
private const string ActivitySourceName = "Experimental.System.Net.Security";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this part of NetSecurityTelemetry like in Dns and Socket?
So we have:

  • an extra class in Http
  • part of telemetry class in Dns and Socket
  • part of production class in Ssl

Copy link
Member Author

@antonfirsov antonfirsov Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These pieces of code evolved like dust balls in a dirty room, but there is some logic:

  • In HTTP, various telemetry pillars follow complex requirements across multiple protocols and various aspects (request vs connection telemetry, Metrics vs EventSource vs Activities, http1-2 vs http3). When there is cohesion, a class has been defined to put related things into a single place, when there isn't, things were kept separate.
  • In DNS and Sockets, telemetry can be tracked together across the 3 pillars
  • In SSL, the Activity logic is so tiny, that I didn't bother to move it to a separate class.

I will do a second round of thinking about the last point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved around things a bit for SslStream

antonfirsov and others added 8 commits July 12, 2024 16:26
Co-authored-by: Marie Píchová <11718369+ManickaP@users.noreply.github.com>
Co-authored-by: Marie Píchová <11718369+ManickaP@users.noreply.github.com>
Co-authored-by: Marie Píchová <11718369+ManickaP@users.noreply.github.com>
@antonfirsov
Copy link
Member Author

/azp run runtime-libraries-coreclr outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider adding activities for potentially long running network operations
9 participants