Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support OpenTelemetry concepts on Activity #31373

Closed
tarekgh opened this issue Nov 2, 2019 · 45 comments · Fixed by #35220
Closed

Support OpenTelemetry concepts on Activity #31373

tarekgh opened this issue Nov 2, 2019 · 45 comments · Fixed by #35220
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Diagnostics
Milestone

Comments

@tarekgh
Copy link
Member

tarekgh commented Nov 2, 2019

This issue tracking the work as the part of the issue https://github.com/dotnet/corefx/issues/42305

Please look at dotnet/designs#98 for the detailed design.

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the Future milestone Feb 1, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@noahfalk noahfalk assigned tarekgh and unassigned noahfalk Mar 3, 2020
@noahfalk
Copy link
Member

noahfalk commented Mar 3, 2020

@tarekgh is the primary dev owner, I am actively assisting. So far design feedback has been occurring in #32660

@tarekgh tarekgh added api-needs-work API needs work before it is approved, it is NOT ready for implementation and removed enhancement Product code improvement that does NOT require public API changes/additions untriaged New issue has not been triaged by the area owner labels Mar 11, 2020
@tarekgh tarekgh modified the milestones: Future, 5.0 Mar 11, 2020
@reyang
Copy link

reyang commented Mar 12, 2020

Looks great! Thanks @tarekgh and @noahfalk!

One question related to

using (Activity foo = new Activity(...))
{
    throw new FoobarException(...);
}

If it is possible to capture the exception and associate it with the Activity?

In OpenTelemetry Python there is such feature (although I think it is more like a question to the language/runtime rather than API) where we know if there is exception and derive status code.

@tarekgh
Copy link
Member Author

tarekgh commented Mar 12, 2020

Thanks @reyang for your review.

I don't think we can automatically capture the exception inside using block without try-catch. The user can write the code like:

using (Activity foo = Activity.StartNew(...))
{
    try
    {
        throw new FoobarException(...);
    }
    catch (Exception e)
    {
        foo?.SetCustomProperty("Exception", e);
    }
}

and if there is any listener, will get the notification in stopping the activity and then can check if the exception is set there using GetCustomProperty. I'll think if we'll have a better way to handle it.

@reyang
Copy link

reyang commented Mar 12, 2020

If developers perform exception handling in a central place (e.g. the evil global exception handler, or catching exception from top level code), is there a way for them to know what's the enclosing activity? (I guess no?)

try {
    using (var foo = new Activity("foo"))
    {
        using (var baz = new Activity("baz")
        {
            DoSomething();
        }
        DoSomething();
    }
    using (var bar = new Activity("bar"))
    {
        DoSomething();
    }
} catch (FoobarException ex) {
    log(
        ex,
        MagicalFunctionWhichReturnsTheEnclosingActivity(ex).Name, // this is hard since activity is disposed already
        MagicalFunctionWhichReturnsTheVeryContext(ex).SpanId,
    );
}

@chandramouleswaran
Copy link
Contributor

Can you also share the examples for creating an Activity from a parent activity ID only? Specifically around async workflows.

There are lots of use cases for us where we might spawn off an async task while the sync portion of the API needs to resume back in existing Activity context. In that case, we would still want to link the async task's activity to the API activity.

There are couple of options I can think of
Option 1 (taking inspiration from internal Ifx tools)
Create an Activity and do not dispose it.
Provide the ability to Unload it which returns to the older context.
Activity created can be passed to the task
Allow the option to Load the activity back and continue using a dispose pattern

Option 2
Get the serialized Activity data (may be just the current full ID is good enough if it has parent information)
Pass that along to the async task
Create a new activity with the serialized data as the parent
Use the dispose pattern

@noahfalk noahfalk changed the title Make Activity and OpenTelemetry Span the same Support OpenTelemetry Span operations on Activity Mar 12, 2020
@noahfalk
Copy link
Member

Thanks Tarek! Its a little tricky to comment on in GitHub because all the text is in the body of an issue. I'll do quotes but if the comments grow it might be helpful to put the text as a markdown document in a PR to allow inline commenting.

We also enhance and simplify the usage of the Activity

I think there is another major value prop worth calling out explicitly: it is now reasonable for a listener such as OpenTelemetry to observe Activity instrumentation from any code component in the app. Previously OpenTelemetry would have needed a-priori knowledge of the strings that would be used for DiagnosticSource name and IsEnabled() callbacks for each Activity it wanted to receive callbacks for.

The current usage pattern of the Activity is not neat and not that performant

I don't think the current proposal is making any performance improvement? Previously when we had ActivitySource that offered a faster path, but we've removed it.

Also, the listener should be more performant than the current listening way (through DiagnosticListener)

Similar to above, I don't think the current proposal is faster?

// The following pattern is useful for the Dependency Injection scenario to allow easily creating the listener with the reflection without the need to subclass and override the listener methods.
        ActivityListener listener = ActivityListener.CreateAndStart(shouldCreateActivity: (name, ContextBoundObject, links) => true, onActivityStarted: (activity) => { ... }, onActivityStopped: (activity) => { ... });
  1. I think ContextBoundObject was unintended? Lower in the spec it is ActivityContext
  2. The strongly typed objects in the parameters may make it tricky to use via reflection but something like expression trees might make it doable. I'll play with it.

public virtual bool ShouldCreateActivity(string activitySourceName, ActivityContext context, IEnumerable? links)

Naming suggestions:
activitySourceName -> activityName
context -> parent

For ActivityListener I assume the callback behavior is that OnActivityStarted/OnActivityStopped get invoked for every Activity, but ShouldCreateActivity only gets invoked for Activities created via Activity.StartNew? We should clarify that.

I've been thinking a bit about how we handle cases where someone is using Hierarchial ids rather than W3C ids. I think there may be an easy path where we add an overload to Activity.StartNew and ActivityListener.ShouldCreateActivity that uses a string for the id instead of the ActivityContext.

@noahfalk
Copy link
Member

If it is possible to capture the exception and associate it with the Activity?

If we had a defined place on the Activity where Exception should be stored then the BCL (or any of our users) could make a helper method that lets you write code like this:

Activity.RunWithActivity("MyActivityName", activity =>
{
   ... code that does whatever work inside activity scope
   throw new Exception("oopsie");
});

That helper could be implemented something like:

class Activity
{
    public static RunWithActivity(string name, Action<Activity?> action)
    {
        try
        {
             Activity? a = Activity.StartNew(name);
             action(a);
        }
        catch(Exception e) when AssociateException(a, e)
        { Assert.Fail("This should be unreachable code") }
        finally
        {
             a?.Dispose();
        }
    }

    static bool AssociateException(Activity a, Exception e)
    {
         a.SetCustomProperty("Exception", e);
         return false; // this function gets evaluated as an exception filter, we don't want to catch the exception
    }
}

If developers perform exception handling in a central place (e.g. the evil global exception handler, or catching exception from top level code), is there a way for them to know what's the enclosing activity? (I guess no?)

In a catch handler no, but in an exception filter potentially yes. However if the exception is caught and then rethrown at some intermediate frame (as often happens with async) then the filter won't help either.

@noahfalk noahfalk changed the title Support OpenTelemetry Span operations on Activity Support OpenTelemetry concepts on Activity Mar 12, 2020
@tarekgh
Copy link
Member Author

tarekgh commented Mar 12, 2020

@noahfalk thanks for the feedback. I have updated the description according to your comments. please let me know if I missed anything there.

@chandramouleswaran thanks for the feedback too, I'll try to get back to you soon.

@pakrym
Copy link
Contributor

pakrym commented Mar 12, 2020

Is there a fast way to check if activity should be created? Sometimes collecting context and links might be expensive, it would be nice to have a performant way to know beforehand if someone is listening on the activity.

@tarekgh
Copy link
Member Author

tarekgh commented Mar 12, 2020

@pakrym we Activity.StartNew(string name) which doesn't take a context or links.

if you are talking about the Listener side and want to avoid dealing with context/links there, we may think to add a callback to the listener which doesn't take a context/links. but I am not sure if this case concerns you.

namespace System.Diagnostics
{
    public class ActivityListener : IDisposable
    {
        ...
        public virtual bool ShouldCreateActivity(string activityName, ActivityContext parent, IEnumerable<ActivityLink>? links)
        public virtual bool ShouldCreateActivity(string activityName)
        ...   
   }
}

@pakrym
Copy link
Contributor

pakrym commented Mar 12, 2020

As a producer of activity I want to know when any particular activity is enabled so I don't have to extract links/ActivityContext from, for example an incoming request, when no one is listening.

@tarekgh
Copy link
Member Author

tarekgh commented Mar 12, 2020

got it. would adding something like:

public static bool ActivityListener.IsEnabled(string activityName)

will it address your scenario?

@pakrym
Copy link
Contributor

pakrym commented Mar 12, 2020

would adding something like:
public static bool ActivityListener.IsEnabled(string activityName)

Would it be as fast as DiagnosticSource.IsEnabled() check currently is (a single non-virtual field read)?

@tarekgh
Copy link
Member Author

tarekgh commented Mar 12, 2020

Would it be as fast as DiagnosticSource.IsEnabled() check currently is (a single non-virtual field read)?

If there are no registered listeners, then yes, we'll just check a static field value and if it is null we'll just return false. if there is any registered listener, we'll need to call them to know if any of the listeners interested in creating such an activity with the input name.

@pakrym
Copy link
Contributor

pakrym commented Mar 12, 2020

Then I don't think it addresses my scenario, N virtual calls just to know if an activity should be created seems way to high for something like Kestrel.

@noahfalk
Copy link
Member

in the best case but also the way it scales

Agreed it certainly scales differently. So far I can't come up with a scenario where that scaling appears to matter? If there are scenarios that are creating 100s or 1000s of Activities per request but somehow the requests are also really fast it would be an issue - but I don't forsee that happening in practice. If we had a plausible hypothetical example I might see it differently.

@tarekgh
Copy link
Member Author

tarekgh commented Mar 17, 2020

Summary to the status of the discussions

  • SpanContext.Tracestate will keep it as string per @reyang recommendation
  • SpanContext.IsValid/IsRemote will not be added until we find a value for it. per @reyang recommendation. IsValid is a trivial check (context == default). IsRemote, was recommended not to added before by Liudmila.
  • Added public bool TryGetContext(out ActivityContext context) to Activity class to retrieve the Activity context.
  • Status/Event will not be added to the Activity for now.
  • I added a couple of overloads that allow passing Kind, attributes, start time, context and links.
  • IsRecording would not be needed per @noahfalk comment: Support OpenTelemetry concepts on Activity #31373 (comment). @SergeyKanzhelev please let's know if you have a comment regarding that.
  • I added a setter to OperationName (to address UpdateName). @noahfalk please advise if you have any concern allowing the setter.
  • Removed the ActivityListener and added public static IDisposable StartListening(Func<string, ActivityContext, IEnumerable<ActivityLink>?, bool> shouldCreateActivity, Action<Activity> onActivityStarted, Action<Activity> onActivityStopped); to the Activity class.
  • Will not add Tracer for the mentioned complexity. @SergeyKanzhelev let's know if you have a comment about that.
  • @pakrym concern about the perf which @noahfalk replied.

@noahfalk
Copy link
Member

Thanks @tarekgh!

I added a setter to OperationName (to address UpdateName). @noahfalk please advise if you have any concern allowing the setter.

I am concerned that if we add that setter people will use it in ways that break pre-existing code. All .NET code for the past decade has been able to rely on OperationName being an immutable value. I recommend we do not add the setter and get feedback from @SergeyKanzhelev or @reyang to understand what the use case is to search for an alternate solution.

IsRecording would not be needed per @noahfalk comment: #31373 (comment).

I had an open question (to @SergeyKanzhelev and @reyang) in that portion that needs to be answered before we can resolve IsRecording: "The spec isn't clear about what the value prop is to create Spans where IsRecording is false given that they won't be reported to SpanProcessors?"

I added a couple of overloads that allow passing Kind, attributes, start time, context and links.

It is hard to provide feedback on this before the code changes are visible. Ideally new work would should up as changes to the APIs in the design doc. Changes that people are proposing but we ultimately decide not to do can show up in the Q&A section to record the request and the rationale for why it didn't happen or what the alternative is.

Earlier I commented about needing to handle the hierarchial ID case. I think we need an overload ShouldCreateActivity which uses a string instead of ActivityContext.

@reyang
Copy link

reyang commented Mar 17, 2020

I am concerned that if we add that setter people will use it in ways that break pre-existing code. All .NET code for the past decade has been able to rely on OperationName being an immutable value. I recommend we do not add the setter and get feedback from @SergeyKanzhelev or @reyang to understand what the use case is to search for an alternate solution.

FYI - this PR has some context about the span name semantic. I don't think users would be blocked if there is no UpdateName. It could be a bit inconvenient as I could imagine, as we have to give users some guidance which is not aligned with other OpenTelemetry SDKs.

I had an open question (to @SergeyKanzhelev and @reyang) in that portion that needs to be answered before we can resolve IsRecording: "The spec isn't clear about what the value prop is to create Spans where IsRecording is false given that they won't be reported to SpanProcessors?"

I think the spec does not ask SDK to create a new span. Most implementations would return a dummy span singleton when we know it is not being reported to SpanProcessor.

@benaadams
Copy link
Member

in the best case but also the way it scales

Agreed it certainly scales differently. So far I can't come up with a scenario where that scaling appears to matter? If there are scenarios that are creating 100s or 1000s of Activities per request but somehow the requests are also really fast it would be an issue. If we had a plausible hypothetical example I might see it differently.

Isn't it that entire design though? This is tracing through a latency chain of many components not a single request; its N requests occurring sequentially so is directly N x additive to the latency.

@tarekgh
Copy link
Member Author

tarekgh commented Mar 18, 2020

@noahfalk I have changed the design doc to reflect the latest changes. we may need to add more bullets in the Q & A section to capture the updates in the discussion here.

@noahfalk
Copy link
Member

noahfalk commented Mar 18, 2020

[@reyang] FYI - this PR has some context about the span name semantic. I don't think users would be blocked if there is no UpdateName. It could be a bit inconvenient as I could imagine, as we have to give users some guidance which is not aligned with other OpenTelemetry SDKs.

Thanks! Reading thought the PR I still felt like I was grasping at the scenario where a user would need to call UpdateName(). The closest explanation I could find was this sentence at the end "Instrumentation MUST NOT default to using URI path as span name, but MAY provide hooks to allow custom logic to override the default span name." If I follow correctly the worry was that the code creating the Span either can't access the URI or doesn't know what portions of it are dynamic. Some code later on does have that information and wants to use it to improve descriptive text shown in the UI. If that is accurate it suggests there are two different name concepts being squeezed into one property, the initial one is for classification and sampling within the app, and the later improved one is only intended to aid end-user understanding. If we wanted to model that in .Net my first thought would be to have two properties, Activity.OperationName and Activity.DisplayName. The default value of DisplayName would be identical to OperationName, but it can be changed. However this feels like a place we'd want to engage on the broader spec rather than coming up with our scheme in isolation. I agree that the best path in the immediate future would be to leave it out.

[@reyang] I think the spec does not ask SDK to create a new span. Most implementations would return a dummy span singleton when we know it is not being reported to SpanProcessor.

Cool, then it sounds like there is no issue if .Net returns a null Activity to represent this case rather than a dummy.

[@benaadams ] Isn't it that entire design though? This is tracing through a latency chain of many components not a single request; its N requests occurring sequentially so is directly N x additive to the latency.

By single request I am referring to a single trace-id from the point it enters the .Net Core app until they point we send a response message that completes the request. Certainly it is possible the code that implements this request could have many sequential steps within it, including the creation of many Activities. However I don't think there is a scenario where the implementation both creates a large number of Activities and it is already extremely fast? For example do we expect a scenario where a request completes in 100us needs 100 Activities to instrument all of its dependencies? Being fast is usually mutually exclusive with doing a large amount of work.

@noahfalk I have changed the design doc to reflect the latest changes. we may need to add more bullets in the Q & A section to capture the updates in the discussion here.

Thanks!

@SergeyKanzhelev
Copy link

I don't know how to comment on this thread. I'll try. Pretty-please, can we switch to google doc?

  1. UpdateName is absolutely needed. Asp.Net Core listener wouldn't work without it: https://github.com/open-telemetry/opentelemetry-dotnet/blob/633dda37bdb171f337596ba4b53021b5d35f117c/src/OpenTelemetry.Collector.AspNetCore/Implementation/HttpInListener.cs#L148
  2. IsRecording vs. no activity allocation. I commented before that not creating activity breaks propagation of the context. Which will break user scenarios. So we need Activity to be created or at least context propagated. Thus it needs some indication on whether recording enabled or not.
  3. Having specific callback for sampling instead of having the sampling decision made in creation callback is not aligned with OpenTelemetry spec. Some features like attributes returned by Sampler to be associated with the Span will simply not work.

@noahfalk
Copy link
Member

Pretty-please, can we switch to google doc?

I agree a single inline thread is hard to follow - I am already trying to move discussion to the PR which does allow for inline annotations and replies to different threads of conversation. I think this is similar to properties we'd get from a google doc. I don't really want to move yet again to google doc because (a) it is different than the convention every other .NET design issue uses, (b) every time we move the discussion forum it breaks the flow and makes it that much harder to understand the history or accumulate issues in one place.

It is very tempting (I am guilty too) to keep replying back to this issue. To try and break that pattern I copied each of the issues you raised to comments in relevant portions of the PR and responded there instead.

@SergeyKanzhelev
Copy link

@noahfalk yes, PR in designs repo is OK.

@tarekgh
Copy link
Member Author

tarekgh commented Mar 29, 2020

Just FYI, I have updated the design proposal dotnet/designs#98 to reflect the latest discussions and decisions.

@tarekgh tarekgh added api-ready-for-review and removed api-needs-work API needs work before it is approved, it is NOT ready for implementation labels Apr 2, 2020
@chandramouleswaran
Copy link
Contributor

@chandramouleswaran thanks for the feedback too, I'll try to get back to you soon.

Any updates here?

@tarekgh
Copy link
Member Author

tarekgh commented Apr 3, 2020

@chandramouleswaran, sorry for the late reply I have missed this one unintentionally. thanks for the reminder.

ActivitySource.StartActivity overload APIs provide multiple options to specify the parent context.

        /// Creates a new Activity object if there is any listener to the Activity, returns null otherwise.
        public Activity? StartActivity(string name, ActivityKind kind = ActivityKind.Internal);

        /// Creates a new <see cref="Activity"/> object if there is any listener to the Activity events, returns null otherwise.
        public Activity? StartActivity(
                            string name,
                            ActivityKind kind,
                            ActivityContext parentContext, IEnumerable<KeyValuePair<string, string?>>? tags = null,
                            IEnumerable<ActivityLink>? links = null,
                            DateTimeOffset startTime = default);

        /// Creates a new <see cref="Activity"/> object if there is any listener to the Activity events, returns null otherwise.
        public Activity? StartActivity(
                            string name,
                            ActivityKind kind,
                            string parentId,
                            IEnumerable<KeyValuePair<string, string?>>? tags = null,
                            IEnumerable<ActivityLink>? links = null,
                            DateTimeOffset startTime = default);

Here are the options:

  • If you use the first overload which not providing ActivityContext nor parent Id, we'll create the Activity and have the parent pointing at the existing Activity.Current. This should address the async scenario as Activity.Current is defined as AsyncLocal which transfers with the context of the async call.
  • If you use the second overload, you can provide the parent ActivityContext (which includes the parent trace Id, span Id, flags and state). You have the control to set the parent context to the newly created Activity.
  • If you use the third overload, you can create the Activity object using the parent Id only.

Please let me know if you have more questions or anything unclear here.

@tarekgh
Copy link
Member Author

tarekgh commented Apr 14, 2020

dotnet/apireviews#117

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review labels Apr 17, 2020
@terrajobst
Copy link
Member

(marked as approved because that's what we decided in the notes but I didn't update the item during the review)

@tarekgh
Copy link
Member Author

tarekgh commented May 18, 2020

Use new Activity to Replace OT Span: open-telemetry/opentelemetry-dotnet#660

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Diagnostics
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants