
System.Diagnostics.DiagnosticSource.HttpHandlerDiagnosticListener bug fixes & edge cases #33732

Conversation

CodeBlanch (Contributor) commented Mar 18, 2020

Fixes:

  • corefx#40777 ("HttpHandlerDiagnosticListener did not send DiagnosticSource Stop event on netfx in W3C mode") missed a couple of spots where RequestId was checked for telemetry but TraceParent was not. This could lead to duplicate request events, or no response events, being fired when W3C mode is turned on.
  • If an exception is thrown while creating the connection (DNS issue, TLS issue, etc.), the listener sends a Start event but nothing else. There is now an Exception event so tracers can stop/clean up their telemetry.
  • Activity was stopped at the end of the request event, so Activity.Current was unavailable in response & exception events.
  • Fixed a couple race conditions:
    • If request is canceled/aborted during the Start event processing, we will end up leaking a connection because m_WriteList.Add fires after the abort logic looks to remove/clear the request.
    • Start event fires before headers are added so if the cancel/abort logic is triggered a Stop or Exception event can fire on another thread which will be looking for headers the start logic might not have added yet.
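A minimal sketch of how a tracer might consume the events described above. The listener name "System.Net.Http.Desktop" is the one used by this listener on .NET Framework; the exact name and payload of the new Exception event are illustrative assumptions here, not confirmed by this PR text:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Observes events written by the HttpHandlerDiagnosticListener.
class HttpEventObserver : IObserver<KeyValuePair<string, object>>
{
    public void OnNext(KeyValuePair<string, object> evt)
    {
        // With the change above, an exception during connection creation
        // (DNS, TLS, ...) surfaces as an event here, so a tracer can
        // stop/clean up the telemetry it started on the Start event.
        if (evt.Key.EndsWith(".Exception"))      // event name is illustrative
        {
            Console.WriteLine($"request failed: {evt.Key}");
        }
    }
    public void OnCompleted() { }
    public void OnError(Exception error) { }
}

// Hooks up the observer when the desktop HTTP listener appears.
class ListenerSubscriber : IObserver<DiagnosticListener>
{
    public void OnNext(DiagnosticListener listener)
    {
        if (listener.Name == "System.Net.Http.Desktop")
            listener.Subscribe(new HttpEventObserver());
    }
    public void OnCompleted() { }
    public void OnError(Exception error) { }
}

// Usage: DiagnosticListener.AllListeners.Subscribe(new ListenerSubscriber());
```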

@noahfalk noahfalk requested a review from davidsh March 19, 2020 05:28
noahfalk (Member) left a comment

Thanks @CodeBlanch! This looked mostly fine to me with a few questions inline and some requests to beef up the testing.
Historically I think @lmolkova and/or @davidsh would have looked over this code and have more experience spotting issues. @lmolkova is out on parental leave but I am hoping @davidsh is available to take a look. In a pinch I can be the approver though.

@@ -583,7 +610,7 @@ private HttpHandlerDiagnosticListener() : base(DiagnosticListenerName)

private void RaiseRequestEvent(HttpWebRequest request)
{
- if (request.Headers.Get(RequestIdHeaderName) != null)
+ if (IsRequestInstrumented(request))
noahfalk (Member):

Does anything bad happen if we don't change this line?

If there was an instrumenter that used traceparent previously, they could rely on this code (maybe even unintentionally) to set tracestate and correlationcontext and to send a Start event. Now that we have suppressed all that behavior, it may break their scenario?

CodeBlanch (Contributor, author):

@noahfalk You are right, it could be breaking. BUT, check out dotnet/corefx#40777. That just went out with .NET 5 Preview 1 (pretty sure?). That PR fixes/breaks 1 of the 2 events. This PR adds 1 new event and then makes it consistent for all 3. IMO better to have the consistency than 1 working differently? LMK

noahfalk (Member):

If it is consistency just for consistency's sake I'd prioritize the back-compat risk higher. If the lack of consistency results in a specific harm to some customer or tool I'd want to give that greater consideration.

check out dotnet/corefx#40777

I am making an educated guess that adding new events where they were previously missing/absent has a much lower risk of breaking compat than removing functionality does. As far as I could tell this was the only spot where we potentially stop doing work that happened before whereas all the other edits in this PR and in 40777 are adding things. Also I expect it is relatively easy to workaround getting too many events/activities because tools can ignore them whereas working around a missing event is very hard.

(Another reviewer):

I'd say taking dependency on that behavior was wrong... Let's make it consistent and predictable

noahfalk (Member):

I'd say taking dependency on that behavior was wrong

Valuing back-compat means I won't second-guess whether dependencies on our current behavior are wise, reasonable, or intentional. I only care how likely it is that the dependencies exist and weigh the harm it would cause customers to have their working app break vs. the value the change brings.

Let's make it consistent and predictable

These aren't the attributes I value much when modifying existing shipping code. If there are specific scenarios that customers are likely to hit and making this change will benefit them then that could be compelling. Otherwise I am quite content sticking with the devil I know : )

@@ -706,6 +740,8 @@ private static void PrepareReflectionObjects()
  s_connectionListField = s_connectionGroupType?.GetField("m_ConnectionList", BindingFlags.Instance | BindingFlags.NonPublic);
  s_connectionType = systemNetHttpAssembly?.GetType("System.Net.Connection");
  s_writeListField = s_connectionType?.GetField("m_WriteList", BindingFlags.Instance | BindingFlags.NonPublic);
+ s_errorField = s_connectionType?.GetField("m_Error", BindingFlags.Instance | BindingFlags.NonPublic);
+ s_innerExceptionField = s_connectionType?.GetField("m_InnerException", BindingFlags.Instance | BindingFlags.NonPublic);
noahfalk (Member):

do we have confidence these fields exist in every past version of HttpClient? (I don't have any specific reason to believe there is an issue, solely making sure we've done due diligence)

CodeBlanch (Contributor, author):

@noahfalk Good call. I checked the .NET 4.5.1 reference source, they are present (ndp\fx\src\Net\System\Net_Connection.cs). Happy to check others, but not sure where to find them. Here's where I got 4.5.1: https://referencesource.microsoft.com/download.html

noahfalk (Member):

I confirmed they exist in 4.5 and the library doesn't target anything earlier than 4.5. For releases after 4.5.1 there are more buttons on the reference source page you linked.
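The diff above relies on the null-conditional lookup pattern the listener already uses: a missing type or field yields null rather than throwing, and callers are expected to check for null before using the result. A short sketch of that pattern (helper name is illustrative, not from the source):

```csharp
using System;
using System.Reflection;

static class ReflectionHelper
{
    // Returns the private instance field, or null if the type or field is
    // absent on this framework version -- degrading gracefully instead of
    // throwing during listener setup.
    public static FieldInfo TryGetPrivateField(Type type, string fieldName)
    {
        return type?.GetField(fieldName, BindingFlags.Instance | BindingFlags.NonPublic);
    }
}

// Mirroring the diff above:
// s_errorField = ReflectionHelper.TryGetPrivateField(s_connectionType, "m_Error");
```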

}

public override int Add(object value)
{
// Add before firing events so if some user code cancels/aborts the request it will be found in the outstanding list.
int index = base.Add(value);
noahfalk (Member):

I didn't notice any test that would fail if this code got reverted, I would add one.

CodeBlanch (Contributor, author) commented Mar 19, 2020:

If reverted, TestCanceledRequest will fail because no Exception event is fired. To cancel the request, the test does this: new EventObserverAndRecorder(_ => { cts.Cancel(); })

Before:

  1. m_WriteList.Add is called
  2. Start event fired
  3. User code called (cts.Cancel)
  4. Request is aborted
  5. Connection looks for an active request in m_WriteList, finds nothing (https://referencesource.microsoft.com/#system/net/System/Net/_Connection.cs,2816). If it finds something a lot more code happens including a potential call to m_WriteList.Clear. No Exception event is fired
  6. Request is added to the ArrayList. It might be leaked at this point, no cleanup will happen because abort/cleanup has already completed?

After:

  1. m_WriteList.Add is called
  2. Request is added to the ArrayList
  3. Start event fired
  4. User code called (cts.Cancel)
  5. Request is aborted
  6. Connection looks for a request to remove in m_WriteList, successfully finds the active one. Is able to do its cleanup which causes the Exception event to fire correctly
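The before/after sequences boil down to reordering base.Add relative to the Start event. A sketch of the reordered override, assuming the surrounding types (the excerpted diff above omits the event-firing call, so the hook name below is hypothetical):

```csharp
using System.Collections;
using System.Net;

// Subclass of ArrayList swapped in for m_WriteList via reflection.
class InstrumentedWriteList : ArrayList
{
    public override int Add(object value)
    {
        // Add to the underlying list BEFORE raising the Start event, so that
        // if user code cancels/aborts the request while handling the event,
        // the abort path can find the request in m_WriteList and run its
        // cleanup -- which in turn lets the Exception event fire.
        int index = base.Add(value);

        if (value is HttpWebRequest request)
        {
            RaiseRequestEvent(request); // hypothetical hook; not shown in the excerpt
        }

        return index;
    }

    private void RaiseRequestEvent(HttpWebRequest request) { /* fires Start */ }
}
```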

@davidsh davidsh requested review from stephentoub and a team March 19, 2020 15:03
CodeBlanch (Contributor, author):

@noahfalk I've been testing this a bunch; there are cases where an exception is thrown during connection and the request will automatically retry. If the retry succeeds, you will get a response event. If the retry fails, you will get another exception event. Both cases, I think, will lead to confusion. There is also an issue with the Activity being stopped at the end of Start. This kills OpenTelemetry, where I really want to utilize this listener. I think I have a solution, but I'm still working on it. I'm going to close the PR and reopen it when I get it working.

@CodeBlanch CodeBlanch closed this Mar 21, 2020
noahfalk (Member):

@CodeBlanch - no problem : )

@CodeBlanch CodeBlanch reopened this Mar 25, 2020
CodeBlanch (Contributor, author):

@noahfalk A few days of hair-pulling and I'm back!

This got a whole lot better or a whole lot worse, depending on how you look at it :) I was able to hook into the AsyncCallback used by the HttpWebRequest. That made this infinitely more reliable because it fires at the end of the request processing for both sync & async flows and gives us a nice place to do our work. Hours of trying stuff before I found that little magical beauty.

Anyway, everything is nice and consistent now and follows how a DiagnosticListener should really work, IMO. It's not without thorns, though: the Ex.Stop event is gone (now that we get a proper response for "no-content" we don't need that extra event), an Exception event is added, and IsEnabled is checked before sending all events. That is SOP for DiagnosticListener, no? Some of these things could cause breakage, but really, from my seat, it's fixing bugs and inconsistencies that probably shouldn't have been in there from the beginning.

LMK what you think. Going to do more testing.
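The "IsEnabled before sending" convention mentioned above looks roughly like this. The listener and event names mirror the desktop listener's naming; the payload shape is an illustrative assumption:

```csharp
using System.Diagnostics;
using System.Net;

static class Sketch
{
    static readonly DiagnosticListener s_listener =
        new DiagnosticListener("System.Net.Http.Desktop");

    static void RaiseStart(HttpWebRequest request)
    {
        // Ask IsEnabled first so we only pay the cost of building the
        // payload and writing the event when someone is actually listening.
        const string eventName = "System.Net.Http.Desktop.HttpRequestOut.Start";
        if (s_listener.IsEnabled(eventName))
        {
            s_listener.Write(eventName, new { Request = request }); // payload is illustrative
        }
    }
}
```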


private bool IsRequestInstrumented(HttpWebRequest request)
{
    return request.Headers.Get(TraceParentHeaderName) != null || request.Headers.Get(RequestIdHeaderName) != null;
}
(Reviewer):

should we check what mode we are operating in? It may allow less look ups in the collection.

CodeBlanch (Contributor, author):

Done. Adding the check opened up a new avenue where you can have RequestId on there already but we'll still add TraceParent if the W3C mode is enabled so I added a test case for that.
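A sketch of what the mode-aware check could look like. This is an assumption based on the discussion (the real change may gate on the listener's own W3C flag rather than Activity.DefaultIdFormat); the header-name constants are the ones used elsewhere in the source:

```csharp
using System.Diagnostics;
using System.Net;

static class InstrumentationCheck
{
    const string TraceParentHeaderName = "traceparent";
    const string RequestIdHeaderName = "Request-Id";

    // In W3C mode, only the traceparent header marks a request as
    // instrumented; otherwise only Request-Id does -- one lookup per call
    // instead of two.
    public static bool IsRequestInstrumented(HttpWebRequest request)
    {
        return Activity.DefaultIdFormat == ActivityIdFormat.W3C
            ? request.Headers.Get(TraceParentHeaderName) != null
            : request.Headers.Get(RequestIdHeaderName) != null;
    }
}
```

Note the wrinkle CodeBlanch describes: a request can arrive with Request-Id already set, and in W3C mode the listener will still add traceparent, which is why the extra test case was added.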

noahfalk (Member) commented Apr 1, 2020

@CodeBlanch @SergeyKanzhelev - Is the goal to consume this code from OpenTelemetry? I'm open to discussing alternatives, but how would you guys feel about putting your own (improved) implementation of the listener into OT and not relying on the one in DiagnosticSource?

This is how things look from my point of view:

  1. The runtime team only does limited bug fixing on full framework; back-compat is a high priority, whereas new functionality has virtually zero priority.
  2. This updated round of changes looks like more than low risk bug fixes. Event names are being added/removed/changed, the integration points are changing and there are plenty of spots where code dependent on details of the old behavior could stop working. It suggests you are pursuing a path where you want new functionality and are willing to accept more risk to achieve it. If the code is in OT you have the freedom to take whatever change/risk you deem appropriate and can continue to do so in the future.
  3. I am not an expert on the internals of HttpClient. Unless I spend a bunch of time learning those internals you won't be getting much value from my code review of the instrumentation.
  4. Given that the framework is taking very little change at this point the risk of a future framework servicing update breaking you appears quite low.

@davidsh - I'd love to get your feedback on this too. Thanks!

CodeBlanch (Contributor, author):

@noahfalk My thoughts for what little they are worth :)

I have this version of HttpHandlerDiagnosticListener mirrored in OT already so if we don't want to take it into dotnet/runtime, that's totally cool. I'll probably rename it and change the namespace there if that ends up being the case, but not a big deal.

The reason I opened this PR is I don't think the existing version really does its job well enough to be in the official lib. It has a lot of issues! Not communicating exceptions thrown while establishing the connection is huge for me as far as instrumenting my production code; I want to know about those issues ASAP. Other than that, for HttpClient, it's going to work reasonably well. If you try to use it for the underlying HttpWebRequest, though, forget it! HttpWebRequest throws exceptions for HTTP status codes other than success. Since there's not an Exception event, you won't get any of that data in your telemetry. What I needed was the ability to add OT to a legacy website and automatically catch all its outgoing HttpWebRequests.
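To illustrate the HttpWebRequest behavior described above (the URL is purely illustrative): unlike HttpClient, HttpWebRequest surfaces non-success status codes as exceptions, so without an Exception event those requests never reach telemetry.

```csharp
using System;
using System.Net;

class Example
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("https://example.com/missing");
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                // Only success status codes reach this branch.
            }
        }
        catch (WebException ex) when (ex.Response is HttpWebResponse errorResponse)
        {
            // A 404 (or any non-success status) lands here as an exception,
            // not as a normal response -- which is why a listener with only
            // Start/Stop events misses it entirely.
            Console.WriteLine((int)errorResponse.StatusCode);
        }
    }
}
```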

I don't know of anything in here that would break anyone, per se. They should continue to get Start/Stop events as they did before (with the same data) and they can just ignore the Exception event. But I totally understand the risk aversion to breakage!

noahfalk (Member) commented Apr 2, 2020

I have this version of HttpHandlerDiagnosticListener mirrored in OT already so if we don't want to take it into dotnet/runtime, that's totally cool

That was easier than I expected, but glad to hear it : )

I don't think the existing version really does its job well enough to be in the official lib. It has a lot of issues!

Yeah I can believe it : / So far nobody has raised an issue (aside from you guys) so I have to assume either they aren't using it, aren't affected by it, or they worked around it already and might have brittle dependencies on the existing behavior. If we did see broader feedback that other customers would also benefit from having an updated version of this instrumentation in a centralized lib I am not closing the door on that option, but I'd prefer to wait-and-see rather than doing it eagerly since it isn't blocking you guys.

I do appreciate the time you spent working on the PR and I hope in hindsight the majority of it was still time well spent to get a good implementation for OpenTelemetry. I'm going to leave this open a little longer in hopes we hear something from @davidsh but I am anticipating we'll close the issue without taking the PR at this point. Thanks @CodeBlanch!

davidsh (Contributor) commented Apr 2, 2020

So far nobody has raised an issue (aside from you guys) so I have to assume either they aren't using it, aren't affected by it, or they worked around it already and might have brittle dependencies on the existing behavior.

I have always been concerned about this diagnostic listener using reflection against .NET Framework internals in order to provide telemetry. But it is something I agreed to a few years ago based on the need for it at that time.

We do not anticipate making any changes to .NET Framework that might upset the use of reflection that this listener uses. We are not making any changes to .NET Framework except security fixes and those fixes that are directly driven by customer feedback where the fix is needed due to broad impact to a service or business.

That said I am worried that this change might break someone else that is dependent on the current behavior of this listener. Since OT has already created its own copy of the listener, they are in the best position to modify it and fully test it. So, I would not support modifying this in .NET Core unless we see other customer feedback.

noahfalk (Member) commented Apr 2, 2020

Thanks @davidsh! It sounds like we are all agreed that we will leave DiagnosticSource as-is (bugs and all) and continue to monitor customer feedback in this area. In the meantime OpenTelemetry will apply the improvements here to their own private copy of the listener which satisfies the telemetry needs there.
