Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ActivityId missing in a newly added HttpTelemetry Stop events. #40776

Closed
7 of 11 tasks
ManickaP opened this issue Aug 13, 2020 · 19 comments
Closed
7 of 11 tasks

ActivityId missing in a newly added HttpTelemetry Stop events. #40776

ManickaP opened this issue Aug 13, 2020 · 19 comments
Assignees
Milestone

Comments

@ManickaP
Copy link
Member

ManickaP commented Aug 13, 2020

In non-trivial number of cases, the RequestStop event has empty activity id.

9 945 out of 35 452 requests sent during HTTP FunctionalTests were missing activity id.

Other Start/Stop event pairs introduced with the new telemetry are not properly linked with ActivityId as well.

Also easily reproducible with a simple HttpClient scenario:

class Program
{
    static async Task Main(string[] args)
    {
        using var listener = new HttpEventListener();
        using var client = new HttpClient();
        
        await client.GetAsync(@"https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.Http/src/HttpDiagnosticsGuide.md");
    }
}

internal sealed class HttpEventListener : EventListener
{
    public const EventKeywords TasksFlowActivityIds = (EventKeywords)0x80;
    //public const EventKeywords Debug = (EventKeywords)0x20000;

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "System.Net.Http" ||
            eventSource.Name == "System.Net.Sockets" ||
            eventSource.Name == "System.Net.Security" ||
            eventSource.Name == "System.Net.NameResolution")
        {
            EnableEvents(eventSource, EventLevel.LogAlways);
        }
        else if (eventSource.Name == "System.Threading.Tasks.TplEventSource")
        {
            EnableEvents(eventSource, EventLevel.LogAlways, TasksFlowActivityIds /*| Debug*/);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        var sb = new StringBuilder().Append($"{eventData.TimeStamp:HH:mm:ss.fffffff}[{eventData.EventName}]{eventData.ActivityId}.{eventData.RelatedActivityId} ");
        for (int i = 0; i < eventData.Payload?.Count; i++)
        {
            if (i > 0)
                sb.Append(", ");
            sb.Append(eventData.PayloadNames?[i]).Append(": ").Append(eventData.Payload[i]);
        }
        Console.WriteLine(sb.ToString());
    }
}

Result (with the change in #41022)

09:34:45.3129327[RequestStart]00000011-0000-0000-0000-0000093f9d59.00000000-0000-0000-0000-000000000000 scheme: https, host: github.com, port: 443, pathAndQuery: /dotnet/runtime/blob/master/src/libraries/System.Net.Http/src/HttpDiagnosticsGuide.md, versionMajor: 1, versionMinor: 1
09:34:45.4344957[ResolutionStart]00001011-0000-0000-0000-0000090f9d59.00000011-0000-0000-0000-0000093f9d59 hostNameOrAddress: github.com
09:34:45.4752982[ResolutionStop]00001011-0000-0000-0000-0000090f9d59.00000000-0000-0000-0000-000000000000 
09:34:45.4848747[ConnectStart]00002011-0000-0000-0000-0000091f9d59.00000011-0000-0000-0000-0000093f9d59 address: InterNetworkV6:28:{1,187,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,140,82,118,3,0,0,0,0}
09:34:45.5208470[ConnectStop]00002011-0000-0000-0000-0000091f9d59.00000000-0000-0000-0000-000000000000 
09:34:45.5347626[HandshakeStart]00003011-0000-0000-0000-0000096f9d59.00000011-0000-0000-0000-0000093f9d59 isServer: False, targetHost: github.com
09:34:45.8030641[HandshakeStop]00003011-0000-0000-0000-0000096f9d59.00000000-0000-0000-0000-000000000000 protocol: Tls13
09:34:45.8070942[Http11ConnectionEstablished]00000011-0000-0000-0000-0000093f9d59.00000000-0000-0000-0000-000000000000 
09:34:46.2040073[ResponseHeadersBegin]00000011-0000-0000-0000-0000093f9d59.00000000-0000-0000-0000-000000000000 
09:34:46.3278444[RequestStop]00000011-0000-0000-0000-0000093f9d59.00000000-0000-0000-0000-000000000000

Un-paired Stop events:
- ResolutionStop: ran in continuation: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.NameResolution/src/System/Net/Dns.cs#L485
- ConnectStop: ran in SocketsEventArgs: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.cs#L744
- HandshakeStop: ran in continuation: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs#L240
They're paired, I misinterpreted the logs. The Stop event doesn't carry parent activity id, only the start one.

What needs to be done to achieve desired statistics:

We can also add or leave existing unpaired telemetry (though it shouldn't be named Start/Stop then) for events that cannot be bound by a single async context. We can link them to the Start/Stop events manually either via remembering ActivityId or using some other custom ID.

cc: @samsp-msft @MihaZupan @karelz

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Diagnostics.Tracing untriaged New issue has not been triaged by the area owner labels Aug 13, 2020
@ghost
Copy link

ghost commented Aug 13, 2020

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

@ghost
Copy link

ghost commented Aug 13, 2020

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

@karelz karelz added this to the 5.0.0 milestone Aug 13, 2020
@karelz karelz added bug and removed untriaged New issue has not been triaged by the area owner labels Aug 13, 2020
@ManickaP
Copy link
Member Author

I might have found the root cause of this problem.
When I fetch an HTTP response without content, the ActivityId is present in the Stop event.

The RequestStop is called only after the response has been read to the end, which happens outside of the async scope started when RequestStart was called.

Details:

RequestStart is called in HttpConnectionPoolManager, several levels down from HttpClient.SendAsync.
RequestStop is called in HttpRequestMessage called when the response content stream has been read: e.g.: ContentLengthReadStream. Which is either called by the user, or in case of response buffering, in HttpClient.FinishSendAsync.

Several options how to solve it:

  • track only request part of request / response exchange, i.e. call stop sooner.
    • maybe add separate tracking for response (no idea if it's doable)
  • manually handle activity id and pass it to the stop event (no idea if it's doable)
  • keep it as it is, pairing won't work
    • manually add custom identifier into the logging so at least something can be used

@noahfalk
Copy link
Member

@ManickaP nice investigating 👍
Indeed as you observed, ActivityId is async-local so it flows from parent to child when creating a Task but changes do not flow back from child to parent when an await completes.

I'm not knowledgeable about the networking code, but is there an option to call Start() sooner so that it runs in the same Task where Stop() will eventually be called? Other more customized solutions may also be possible, but I've never attempted to author one and my suspicion is it would get messy.

@ManickaP
Copy link
Member Author

is there an option to call Start() sooner so that it runs in the same Task where Stop() will eventually be called?

We could, theoretically, run it from HttpClient.SendAsync, but that would only help us with buffered responses. In the other case, the request gets finished after the user either disposes it or reads the response stream till the end and we have no control over when and where this happens.

Is there a way to manually retrieve and later attach the ActivityId to the stop event?

@MihaZupan
Copy link
Member

As @ManickaP said, we could only do that for buffered responses. For unbuffered ones, we hand out the response that the user can interact with that will eventaully trigger a Stop event.

If I'm understanding this correctly, we can't flow the ActivityID up to the parent?
In that case, is the HttpClient design (handing out the response to the caller to process) incompatible with a Start/Stop activity tracking the entire processing lifetime?

If we give up on tracking the response portion, we would cut into the value of the activity. At that point we would be telling users to log their own Start/Stop activity enclosing the processing of a request.

Alternatively, would introducing a "SendAndProcessWithTelemetryAsync" method that takes a callback make any sense? In that can we could ensure that everything happens in the same scope.

Something along the lines of (just a rough idea):

// On HttpClient
Task<HttpResponseMessage> SendAndProcessWithTelemetryAsync(HttpRequestMessage request, HttpCompletionOption completionOption, Func<HttpResponseMessage, Task> responseCallback, CancellationToken cancellationToken)
{
    TelemetryStart();
    try
    {
        HttpResponseMessage response = await this.SendAsync(request, completionOption, cancellationToken);
        await responseCallback(response);
        TelemetryStop();
        return response;
    }
    catch
    {
        TelemetryAbort(); // Which also logs Stop
        throw;
    }
}

@ManickaP
Copy link
Member Author

Alternatively, would introducing a "SendAndProcessWithTelemetryAsync" method that takes a callback make any sense? In that can we could ensure that everything happens in the same scope.

That would measure/trace different things depending of HttpCompletionOptions, for buffered it'd include reading the response, for read-headers it wouldn't.
It's similar the first solution I suggested in here: #40776 (comment).

@MihaZupan
Copy link
Member

MihaZupan commented Aug 14, 2020

for read-headers it wouldn't.

The idea is that it would be captured in the user-provided callback.

Essentially if a user had code like

Task DoSomethingWithResponseAsync(HttpResponseMessage response) {}
var response = await client.SendAsync(request, completionOption);
await DoSomethingWithResponseAsync(response);

it would turn into

await client.SendAndProcessWithTelemetryAsync(request, completionOption, DoSomethingWithResponseAsync);

That way the Telemetry Start/Stop would be in the enclosing scope of the response processing.

@ManickaP
Copy link
Member Author

Now I see.

@KalleOlaviNiemitalo
Copy link

KalleOlaviNiemitalo commented Aug 14, 2020

Is there a way to manually retrieve and later attach the ActivityId to the stop event?

One could save EventSource.CurrentThreadActivityId after the start event, use EventSource.SetCurrentThreadActivityId to temporarily change the activity ID before and after the stop event, and set EventActivityOptions.Disable on the stop event to prevent ActivityTracker from overriding the activity ID. However, if the request causes any other activities, then those would not get the correct parent in their hierarchical GUIDs, so the async solution seems safer.

Related: dotnet/diagnostics#1347

@stephentoub
Copy link
Member

Alternatively, would introducing a "SendAndProcessWithTelemetryAsync" method that takes a callback make any sense?

Good question, but no :-) We do not want to encourage that style of programming, and it won't fix the 99% case (on top of which it's more expensive).

@noahfalk
Copy link
Member

One could save EventSource.CurrentThreadActivityId after the start event ...

What @KalleOlaviNiemitalo suggests would work to set the activity id of the stop event, however there is also the consideration of what activity id is attached to all events that happen to occur in between the start and stop. Perhaps we are in position to know that we can control all of those events because we own 100% of the code that executes between those two points? I didn't get a chance to read all the updates above and it is almost 5:30am so I need to sleep before I'll be able to do a good job thinking about this. When I'm awake again I'll dig into it further : )

@ManickaP
Copy link
Member Author

We can move the RequestStart somewhere within the start of HttpClient.SendAsync. This should put all of our other activities in the same async scope. After that, the control is in user's hands, we don't need to track those activities, we just need to link the RequestStop once it happens. For that, we can use the SetCurrentThreadActivityId. The only things that would be excluded are content reading methods (LoadIntoBuffer, SerializeToStream, CreateContentReadStream, etc.), which, AFAIK, don't have any activities, so we should be fine.

All of this is based on premise, that tracing the request until the whole response in consumed is what we want. Alternatively, we could narrow the scope only to the point of creating the response and exclude the response content reading.

@samsp-msft
Copy link
Member

If the response is unbuffered, then we should make the stop be when we hand back control to the user code, as its not ultimately us in control.
If we want to make start/stop events for this that can be correlated, we could create a new event with its own activity ID, that is not parented (so as not to break the async generated IDs) and then reference either it or the other ID in one of other of the payloads, so somebody can track it if they need to. I would put this uber-event at a lower priority for being completed. It may not make sense to be an activity ID, just some other ID that can be used to correlate.

@noahfalk
Copy link
Member

noahfalk commented Aug 17, 2020

In that case, is the HttpClient design (handing out the response to the caller to process) incompatible with a Start/Stop activity tracking the entire processing lifetime?

Yes (at least in the cases where the caller will still need to download the remainder of the response after SendAsync task completes). Apologies for not recognizing the potential for this issue earlier.

The main premises of Activities are:

  • Contiguous time periods of work on a logical thread (an async Task and any Task children). This requires them to have a start time, an end time, and an implied duration between.
  • They nest. When a logical thread starts working on a new activity, the new activity needs to finish before work resumes on the parent.

Even if we are capable creating events that don't adhere to these rules, the tools doing analysis are likely to struggle because we expect Activities form a tree. So a generic pattern that is hard for the library to represent with a single Activity looks like this:

class Worker
{
    Task DoWorkPart1(...); // in our case this is HttpClient.SendAsync()
    Task DoWorkPart2(...); // and Stream.ReadAsync() on the returned stream
}

A user can consume that API and do very not-treeish things with it:

List<string> workList = new List<string>();
foreach(string work in workList) { await Worker.DoWorkPart1(work); }
foreach(string work in workList) { await Worker.DoWorkPart2(work); }

The individual parts of the work can be sequenced/nested, but if part1 and part2 are combined into a single activity they don't nest:

time ---->

| work 1.1 | work 2.1 | work 3.1 |  work 1.2 | work 2.2 | work 3.2 |
< -------------- work 1 --------------------->
            <------------------ work 2 ----------------->
                       < ------------------ work 3 ---------------->

However, there are still options we could look at to improve the situation, now and/or in the future:

  • Create additional (optional) Activities with start/stop events when the response stream is read. We could also create a correlation id so that analysis tools can connect those reads back to the SendAsync which initiated the request. Stream.GetHashCode() might be unique enough to use? (this mirrors @samsp-msft suggestion above)
  • Collaborate with Visual Studio profiler and/or perfview to add logic correlating the different chunks of work and improve analysis/UI. This type of IO analysis seems likely to be useful so if there isn't already some existing support I can advocate that we build it.
  • Scope down the activity created by SendAsync so that it sends the stop event at the end of SendAsync.
  • Create single activity scopes for situations where the library can enforce that all the work happens contiguously, for example GetByteArrayAsync or GetStringAsync
  • Provide guidance to higher level libraries and app developers showing them how they can create single activities that group together multiple portions of the work when they are in position to know it happens contiguously.

I also want to loop in @brianrob who has probably had more experience with this type of issue than I have. He may know of some pre-existing approaches that handle this better without needing to do anything novel. Hope this helps!

@samsp-msft
Copy link
Member

For 5, I'd recommend we scope down and send the stop at the end of SendAsync, and make it clear either with the event name, or parameters in the payload as to what type of request it is, and so what the Stop will actually correspond to.

@samsp-msft
Copy link
Member

samsp-msft commented Aug 19, 2020 via email

@karelz karelz assigned MihaZupan and unassigned ManickaP Aug 27, 2020
@MihaZupan
Copy link
Member

Closing as resolved by #41022

@karelz
Copy link
Member

karelz commented Sep 1, 2020

Fixed in 6.0/master (PR #41022) and in 5.0 (PR #41610).

@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants