Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DiagnosticsHandler doesn't support W3C trace context correctly with aspnetcore #31862

Closed
gislikonrad opened this issue Feb 6, 2020 · 15 comments

Comments

@gislikonrad
Copy link

gislikonrad commented Feb 6, 2020

I have two aspnetcore applications using the W3C activity id format. When I have one call the other, the tracing context in the logs is missing a link in the chain.

Expected behavior
Trace id is the same in the logs of both services.
Parent id in the logs for the second service matches the span id in the logs of the first service.

Observed behavior
Trace id is the same in the logs of both services.
Parent id in the logs for the second service does not match the span id in the logs of the first service.

This is due to what is happening in the DiagnosticsHandler.

On line 52, the DiagnosticListener (s_diagnosticListener) is queried for whether it's enabled. This is always false for .net core. In this statement, a new activity is started and not logged. The part of the method that we want to run (line 95-99) is never run.

I have currently hacked this to work in my project with the following code

internal static class TraceContextHack 
{
    private static readonly string _diagnosticsHandlerTypeFullName = "System.Net.Http.DiagnosticsHandler";
    private static readonly string _diagnosticListenerFieldName = "s_diagnosticListener";

    public static void Initialize()
    {
        var handlerType = typeof(HttpClient).Assembly.GetType(_diagnosticsHandlerTypeFullName);
        var listenerField = handlerType.GetField(_diagnosticListenerFieldName, BindingFlags.NonPublic | BindingFlags.Static);
        var listener = listenerField.GetValue(null) as DiagnosticListener;

        if (!listener.IsEnabled())
            listener.Subscribe(new NullObserver(), _ => false);
    }
}

class NullObserver : IObserver<KeyValuePair<string, object>>
{
    public void OnCompleted()
    {
    }

    public void OnError(Exception error)
    {
    }

    public void OnNext(KeyValuePair<string, object> value)
    {
    }
}
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Diagnostics.Tracing untriaged New issue has not been triaged by the area owner labels Feb 6, 2020
@davidfowl
Copy link
Member

You shouldn’t need to do that, can you provide a repro project that shows the problem?

@gislikonrad
Copy link
Author

@gislikonrad
Copy link
Author

I've now reproduced the issue using my sample solution on Windows 10 Enterprise and Mac OsX Catalina.

@jnovick
Copy link

jnovick commented Apr 28, 2020

I am experiencing the same issue as well.

@wosfilm
Copy link

wosfilm commented May 4, 2020

Me too. Using the hack solved the issue.

@jnovick
Copy link

jnovick commented May 5, 2020

@davidfowl If one of us wanted to make a PR to fix this issue, would we make the PR to master or is there a 3.1 procedure? What is the procedure? Also, would the fix just be to say "Is diagnostic listener enabled OR is .net core" instead of just checking if the listener is enabled?

@tommcdon tommcdon added this to Needs Triage in .NET Core Diagnostics via automation May 11, 2020
@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label May 11, 2020
@tommcdon tommcdon added this to the 5.0 milestone May 11, 2020
@noahfalk
Copy link
Member

I think the code is working 'by design', but the feedback would then be that this current design didn't match your expectations and we can discuss if it is a good design or not. The mismatched assumptions appear to be something like this:

Current design - Any code is allowed to create an Activity. If you want to log an outgoing Activity ID then you need to use a DiagnosticListener (or starting in 5.0 an ActivityListener) to note when the id changes and log the new id.
The behavior you expected? - Only the customer code should create a new Activity. This ensures that if customer code logged an ID somewhere it would never change later before getting to the wire.

Having a separate Activity wrap each outbound request has the benefit that if you make multiple requests the dependencies of each can be distinguished. At this point there is also a significant back-compat consideration, we probably wouldn't want to change the default behavior but we could make it easier to opt-in to some alternate behavior.

One other concern is that HttpClient is one example of the problem, but any library that transmits data can (and probably will) have similar behavior. For example I assume gRPC library, Azure/AWS/GCP SDKs, and database clients would all be examples that create a new Activity sometime after the last ILogger message is emitted. A general purpose solution to this probably isn't to prevent all these libraries from creating an Activity, but instead to ensure that we can log a message when the Activity ID changes. Perhaps what would be useful is a built-in asp.net core logger that does that? As a shorter term workaround we could probably put together an example showing what any developer could add to their own app to log the Activity ID?

Glad to hear thoughts on this and thanks for the feedback!

@noahfalk noahfalk modified the milestones: 5.0.0, 6.0.0 Aug 12, 2020
@mikedoug
Copy link

I agree that the Activity system within the .NET context makes sense that anybody anywhere at anytime can create a "sub" Activity to embody a piece of work. However, the original poster and others who have "me too" responded have a VERY valid point and I've spent the better part of two days tracking this down. We have services that are logging trace-id, parent-id, and span-id so that we can do correlation from logs -- and when sub activities go UNLOGGED you suddenly have child spans that link to non-existent parents. That just doesn't work.

There's two big issues at play for me here:

First, the W3C standard expects that you have logging for every parent/child in the hierarchy. By default this logging is NOT working as the original poster has shown. Something has to be done so that we can trivially hook into and log EVERY Activity that is created in the flow of processing a request in order to prevent some-third-party-library from messing up our logging of this crucial information.

Second, the work around provided exposes a really BAD "side effect" driven behavior of the DiagnosticHandler. In order to disable the creation of the Activity inside of the DiagnosticHandler you have to meet two conditions: First, you MUST have a registered listener on System.Net.Http.DiagnosticsHandler; this triggers this if block to fail and not create the Activity found within. Second, this handler MUST have a predicate that returns False when asked if it handles the System.Net.Http.HttpRequestOut event; this triggers this if block to fail. By having both of those if blocks fail, no sub Activity is generated.

@gislikonrad bravo for figuring that out, any many thanks for your discovery!

But, I want to go back to how this is using a REALLY BAD side effect driven behavior. Let's look at a few scenarios:

  • No listener on System.Net.Http.DiagnosticsHandler results in new Activity
  • A listener on System.Net.Http.DiagnosticsHandler that returns False for handling System.Net.Http.HttpRequestOut events results in NO new Activity.
  • A listener on System.Net.Http.DiagnosticsHandler that returns True for handling System.Net.Http.HttpRequestOut events results in new Activity.

My primary complaint is that the first two scenarios should NOT be different -- that is violation of any sense of consistency.

I appreciate the work around, but I'm concerned without a proper solution that we're all going to suffer massive failures if A) they ever fix this violation of consistency or B) any of us add some third-party library that registers a listener that handles that event (maybe a library for doing automatic telemetry)... As soon as we add that type of library into our app boom these unlogged Activities come back.

The only way I have to work around that is to change the Null observer into one that actually handles the event and logs the new Activity.Current.* information. That keeps us out of the workaround that could fail under either of the above two scenarios.

That's at least my findings after two days of hunting for this problem. I don't understand why this default behavior is acceptable in a world of W3C tracing, and I'm glad to find others hitting the same problem.

@mikedoug
Copy link

mikedoug commented Aug 19, 2020

I would also offer this as a more appropriate way of connecting the null listener:

DiagnosticListener.AllListeners.Subscribe(delegate (DiagnosticListener listener)
{
    if (listener.Name == "HttpHandlerDiagnosticListener")
    {
        listener.Subscribe(new NullObserver(), name => false);
    }
});

You need the System.Reactive.Core package in order to have the delegate be automatically able to cover for the IObservable interface. I spent two days tracking this stuff down... I feel a blog post coming to save my sanity in the future and MAYBE help someone else out in the future because this DiagnosticListener stuff was a slog.

It also looks like you can just pass null instead of NullObserver() since the predicate always returning false means it will never try to call it.

@noahfalk
Copy link
Member

Thanks for the feedback @mikedoug!

if A) they ever fix this violation of consistency

I assume 'they' means Microsoft? 'They' = me in this case : ) Although I wasn't the original author I do have a large say in determining how the code evolves forward. I'm also in position to say that I (or Microsoft as a whole) has no plans to change that default behavior for exactly the reason you raised - consistent or no it is the behavior people have come to take a dependency on in their apps and we want to be respectful of that.

Something has to be done so that we can trivially hook into and log EVERY Activity that is created in the flow of processing a request in order to prevent some-third-party-library from messing up our logging of this crucial information.

We have been working on improving this during .NET 5. Anyone who references the new version of System.Diagnostics.DiagnosticSource.dll (either by upgrading to .NET 5 or by referencing the updated NuGet package regardless of the .NET version you use) gains access to a new type called ActivityListener. The goal of this listener is to make it simple and consistent to receive notifications when Activities start and stop. For example you could write:

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;
class Program
{
    static async Task Main(string[] args)
    {
        ActivitySource.AddActivityListener(new ActivityListener()
        {
            // This controls which ActivitySources we want to listen to. We'd like all of them
            ShouldListenTo = (activitySource) => true,

            // These callbacks get invoked whenever an activity created by a source we are listening
            // to starts or stops
            ActivityStarted = (activity) => Console.WriteLine($"Start: {activity.OperationName,-30} {activity.Id}"),
            ActivityStopped = (activity) => Console.WriteLine($"Stop:  {activity.OperationName,-30} {activity.Id}")
        });

        Activity a = new Activity("Example");
        a.Start();
        HttpClient client = new HttpClient();
        await client.GetAsync("http://bing.com");
        a.Stop();
    }
}

Example output:

Start: Example                        00-fb0b3a7b36832a4eb0bc2cb29f3f27e2-841cf3a68170eb44-00
Start: System.Net.Http.HttpRequestOut 00-fb0b3a7b36832a4eb0bc2cb29f3f27e2-7e1af74e22077d45-00
Stop:  System.Net.Http.HttpRequestOut 00-fb0b3a7b36832a4eb0bc2cb29f3f27e2-7e1af74e22077d45-00
Stop:  Example                        00-fb0b3a7b36832a4eb0bc2cb29f3f27e2-841cf3a68170eb44-00

What do you think of this and is there more/different stuff you'd like to see to improve the story here?

@noahfalk
Copy link
Member

Also while not a direct solution to this problem there is some related work in .NET to Add Activity Ids and Context to all logs

@noahfalk
Copy link
Member

cc @tarekgh

@mikedoug
Copy link

@noahfalk Yes -- "they" was referring to Microsoft or whomever wrote that original code. For me it is nonsensical.

I am REALLY happy to hear that .NET 5 is going to give us the ActivityListener that will allow us to follow ANY Activity regardless of its source; that's what I have been searching for to help answer this issue of mine. Also, really glad to hear that you're working on adding all of that stuff into logs. Right now it is rather clunky and the biggest issue I have (at least with Serilog and it's Enrich.WithLogContext() mechanism is that the LogContext appears to be fixed at some point in time -- so when logging comes after a sub-Activity is created it still logs the original context. Hopefully you are looking to solve that!

That said, we really do need a mechanism to configure it so that the DiagnosticHandler skips the creation of a new Action so that we can stop using this hacky work around. That way, if we want to opt out of a sub-Action for every outgoing HTTP call (which is a legitimate desire) we can do so in a way that won't be broken by 3rd party libraries who might add a listener for the events and inadvertently enable it.

The behavior bothers me because NONE of the documentation I could find (Microsoft or otherwise) made me expect to find the core libraries doing this. Everybody talks about simply service A having a spanId and when it calls service B the parentId will be the spanId from service A -- it's always kept simple. I spent a lot of time thinking Refit was doing it, and hunting through that was fun. Then, not finding anything there, I was hunting through Serilog trying to figure out why it would be doing something. It was a surprise to me to find this behavior as the default behavior in the core of the HttpClient (through the DiagnosticsHandler) of the language. I'll be less bothered if an option to explicitly disable it is provided, and the documentation shored up slightly to make this "additional Activity" more clear.

Thanks for the feedback as well!

@noahfalk
Copy link
Member

I am REALLY happy

Excellent! That is what we were hoping for!

the LogContext appears to be fixed at some point in time... Hopefully you are looking to solve that!

Yep, that feature I referenced logs the Activity which is current on the thread at the time the log message is being emitted : )

I'll be less bothered if an option to explicitly disable it is provided, and the documentation shored up slightly to make this "additional Activity" more clear.

Thanks, this is great feedback. I've captured your requests here and here. If you want to add any more details about what you are looking for feel free to add or fix anything I wrote there.

I'll leave this issue open in case there is more feedback, but for now I am assuming that the work already in flight + the new issues I linked above covers everything that needs to happen. Thanks again!

@noahfalk
Copy link
Member

Closing this issue as there been no further feedback.

Issues 41072 and docs/20195 remain open to track adding an opt-out mechanism and better docs respectively.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants