Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support customized Activity trace id generation #46704

Closed
lupengamzn opened this issue Jan 7, 2021 · 35 comments · Fixed by #48138
Closed

Support customized Activity trace id generation #46704

lupengamzn opened this issue Jan 7, 2021 · 35 comments · Fixed by #48138
Labels
api-approved API was approved in API review, it can be implemented area-System.Diagnostics.Activity
Milestone

Comments

@lupengamzn
Copy link

lupengamzn commented Jan 7, 2021

Background and Motivation

The .NET runtime provides APIs to generate Activity, which can be adopted and processed as trace context and eventually emitted to multiple backend vendors by OpenTelemetry .NET SDK for the purpose of tracing. Currently .NET runtime only allows an activity's trace id to be randomly generated and there is no public API available to override it. While for some backend vendors, trace context with a non-random customized trace id is more preferred. For instance, AWS X-Ray backend will only process trace context with a trace id containing a timestamp. In this case, is it possible for the runtime to support customized activity trace id in Activity class, like adding API similar as Activity.SetTraceId()?

Previous discussion: open-telemetry/opentelemetry-specification#896

Proposal Edited by @tarekgh

Proposed API

namespace System.Diagnostics
{
    public partial class Activity : IDisposable
    {
        // alternative proposed names: 
        //     GenerateTraceId 
        //     TraceIdCreationFunction
        public static Func<ActivityTraceId>? TraceIdGenerationFunction { get; set; } 
    }
}

Usage Examples

    Activity.TraceIdGenerationFunction = () => ActivityTraceId.CreateRandom();
    
    ...


    using (Activity activity = source.StartActivity("OperationName"))
    {
        Console.WriteLine(activity?.TraceId);
    }
@lupengamzn lupengamzn added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Jan 7, 2021
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Diagnostics.Tracing untriaged New issue has not been triaged by the area owner labels Jan 7, 2021
@ghost
Copy link

ghost commented Jan 7, 2021

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and Motivation

The .NET runtime provides APIs to generate Activity, which can be adopted and processed as trace context and eventually emitted to multiple backend vendors by OpenTelemetry .NET SDK for the purpose of tracing. Currently .NET runtime only allows an activity's trace id to be randomly generated and there is no public API available to override it. While for some backend vendors, trace context with a non-random customized trace id is more preferred. For instance, AWS X-Ray backend will only process trace context with a trace id containing a timestamp. In this case, is it possible for the runtime to support customized activity trace id, like adding API similar as Activity.SetTraceId()?

Previous discussion: open-telemetry/opentelemetry-specification#896

Proposed API

pseudocode

public Activity SetTraceId(string traceId)
{
    _traceId = traceId
}

Usage Examples

Alternative Designs

Risks

Author: lupengamzn
Assignees: -
Labels:

api-suggestion, area-System.Diagnostics.Tracing, untriaged

Milestone: -

@tarekgh tarekgh removed the untriaged New issue has not been triaged by the area owner label Jan 7, 2021
@tarekgh tarekgh added this to the 6.0.0 milestone Jan 7, 2021
@tarekgh
Copy link
Member

tarekgh commented Jan 7, 2021

@lupengamzn I don't think overriding the trace id is a good idea here. if you do so, it can break other consumers of the Activity objects which used the original trace id before overriding it. Why you can't create the Activity object passing the ActivityContext with the desired trace id?

CC @noahfalk

@cijothomas
Copy link
Contributor

I think the ask is not to override traceid. The ask to control trace_id generation algorithm. Its currently random, but some backends like AWS, requires the traceid to be in a particular format. So essentially asking for a 'hook', which will be called whenever Activity is ever going to generate a TraceId, and use that, instead of currently randmonid generation.

@noahfalk
Copy link
Member

noahfalk commented Jan 8, 2021

So essentially asking for a 'hook'

The proposed API above isn't in the form of a hook? (I think hook == callback or event) @lupengamzn's design assumes that there are certain Activities the developer will create directly and only those Activities will have the customized trace ids. The API above feels simpler to me relative to a hook because there is no ambiguity about who has control of the ID - whatever code creates the Activity has the control. In the case of hooks we get issues such as what if two different components try to set the hook and they don't agree about which ID to create. Its pretty awkward if multiple components are fighting over which of them gets to control singleton global state.

Thanks @lupengamzn! I can somewhat see how this scenario would work but I am still trying to fit together the pieces : ) Don't take this as me pushing back, I just want to ensure that if we build something its really going to solve the problems that it needs to solve.

  1. How does AWS handle when incoming requests already have a trace id and the id is not in the form that AWS would generate itself?
  2. A variation of (1) assuming that AWS supports distributed tracing for apps created by 3rd party developers containing arbitrary .NET code: how would you handle if the app developer creates a new Activity and does not force an ID that complies with your rules for IDs?
  3. We are promoting ActivitySource.StartActivity() as an improved mechanism to manage Activity creation, the main benefit being that it supports sampling in a way that is compatible with OpenTelemetry. The new API proposed above wouldn't handle this case because StartActivity() returns an Activity object that already has an ID and is already started. Would we need to extend the design proposal to handle that case or there are reasons it doesn't matter?

@tarekgh
Copy link
Member

tarekgh commented Jan 8, 2021

Discussion moved from the #46706:

From @tarekgh

@lupengamzn I think this is reasonable. The only concern is this can affect the current value we return from Activity.ParentId which I am not sure if this will be high risk for the app compatibility.

Could you please elaborate more on why you need this change? what scenario you think current behavior would be problematic?

CC @noahfalk

From @lupengamzn

Hey @tarekgh ,

Thanks for the quick response!

We're planing to override the trace id of a root activity which is created by .NET runtime and we found out that there is no public API for us to do so. The only API available for us to touch the trace id is through SetParentId() so we use it and pass new trace id, default parent span id, and the original trace flags of activity hoping to get an activity with the new trace id, but parent span id and original trace flags the same as previous. We found out that, if we pass default as parent span id to this API, the activity's parent span id is no longer null, but overridden to 0000000000000000 instead, which will give us 00-new trace id-0000000000000000-00 when we call activity.ParentId, instead of null as we expected.

We're thinking if it's more suitable for this API to keep the parent span id as null when we pass default as the parent span id as the second parameter. After that, when we call this API again and only passing the new trace id and keep the second and third parameters the same as before, only the trace id of activity will be overridden and the activity.ParentId will return null still.

From @tarekgh

@lupengamzn thanks for clarifying that.

@noahfalk what you think about this change? I think the app compat would be minimal and the change looks desired.

From @noahfalk

I'd like to find out why the broader scenario needs the dev to have direct control of the trace id for the Activity? I'm only aware of two typical scenarios for activities and neither of them needs that ability:

root -> create a new random id
not a root -> there is a parent and the id flows from the parent
I'm not inherently opposed to supporting a 3rd option where the developer specifies the id but there are some costs/consequences for us to do that so its important to understand what is the benefit for .NET devs. One consequence is that we have to resolve what is supposed to happen if anyone combines this with setting Activity.DefaultFormat = ActivityFormat.Hierarchical and Activity.ForceDefaultFormat = true. A 2nd is that the new ActivitySource.StartActivity() code path wouldn't support it so either we need some more changes there or we need to decide why it doesn't matter in that case.

Fwiw if we do this I think we should consider adding a new SetTraceId() API rather than modifying SetParentId(). It feels counter-intuitive to me that after calling SetParentId(x,y) it might still be the case that ParentId == null. I get that is the goal in this case, but I don't think it is what anyone else looking at this API would expect to have happen. Anyone else who wanted the same behavior would probably not be able to discover how to do it.

From @lupengamzn

Hey @noahfalk ,

Thanks for the response!

Changing the trace id of a root activity by calling SetParentId() turns out counter-intuitive for us too, so adding a new API such as SetTraceId() would be the most ideal way as currently there seems no direct APIs for us to touch the trace id of an activity. We're working with OTel .NET SDK team and as OTel .NET SDK adopts trace context from .NET runtime, it would be great that you can provide OTel .NET developers an API to override the original random trace id as other OTel SDKs already support such feature. The use case of such API so far is to override the original trace id with a non-random one for a root activity so that it (and its child activities) can be further recognized and processed by specific backend vendor (like AWS X-Ray).

I opened an issue discussing this and the workaround on it would be much appreciated if it's decided to target .NET 6.

From @tarekgh

@lupengamzn do you control the root activity creation? I mean how to decide which Activity is the root activity?

Can we close this issue and continue the discussion in #46704 to not have teared discussion?

From @lupengamzn

Hey @tarekgh ,

We're using activity listener to detect every activity and we're able to know if it's a root activity by checking if it's parentId is null or not. If it's root activity, we're calling SetParentId() to it to override the trace id. It seems a little bit tricky but is so far the only way we're able to come up with based on the APIs we can utilize from activity class.

@noahfalk
Copy link
Member

noahfalk commented Jan 9, 2021

We're using activity listener to detect every activity and we're able to know if it's a root activity by checking if it's parentId is null or not. If it's root activity, we're calling SetParentId() to it to override the trace id. It seems a little bit tricky but is so far the only way we're able to come up with based on the APIs we can utilize from activity class.

Ah I didn't realize you were intending to call this from the callback inside ActivityListener. I think you are relying on unintended behavior - the SetParentId() function is missing a check whether the Activity is already started. The activity's ID and parent are both intended to be immutable at that point. For OpenTelemetry the requirement is a bit stronger because the ID is used for sampling decisions before the Activity is started. If we gave the sampler id X and it makes it decision based on that we don't want to transmute the activity to have new ID Y because that might invalidate the sampling decision. We can look at designs that set the ID earlier to avoid these kinds of issues, it just probably wouldn't be the API above any longer (at least not for the ActivitySource.StartActivity() case).

It also sounds like your goal is to control the IDs of all Activities generated within the app, not just the ones that your code is directly responsible for creating? Datadog asked for something similar and I initially rebuffed them, but now that we've got multiple companies asking I feel like it at least deserves a 2nd look. I'm still trying to figure out what happens when an external trace ID comes in on the wire and it doesn't follow the ID generation rules your system knows how to handle. Do you fail to log it or you hope to convert it using some extensibility mechanism for intercepting inbound IDs?

@lupengamzn
Copy link
Author

lupengamzn commented Jan 12, 2021

Hey @noahfalk ,

Sorry for the late reply and thank you so much for the explanation. 😃

For your questions:

How does AWS handle when incoming requests already have a trace id and the id is not in the form that AWS would generate itself?

If using X-Ray SDK, AWS X-Ray backend will abandon all trace context which doesn't have a compatible trace id and log them out. If using OTel SDK, X-Ray collector will discard them and log them out.

A variation of (1) assuming that AWS supports distributed tracing for apps created by 3rd party developers containing arbitrary .NET code: how would you handle if the app developer creates a new Activity and does not force an ID that complies with your rules for IDs?

We plan to provide APIs which can generate X-Ray compatible trace id that app developers can use to pass in the API provided by .NET runtime to eventually override the root activity's trace id. And developers will need to have such instrumentation in their application so as to get trace context recognized by X-Ray backend (if they would like to use X-Ray). Actually X-Ray's trace id is in similar format as activity's trace id, they're both 32 character hex string, except that X-Ray requires the first 8 character to be the epoch time and the remaining 24 character to be random. That being said, X-Ray's trace id should be compatible for other backend vendors as long as they're compatible with activity's trace id. In this case, instrumenting this API will not affect the behavior when developers are also using other tracing services for their application at the same time.

In fact, other OTel SDKs have already provided such API to achieve this, to which developer can pass X-Ray compatiable trace id to override the root's trace id, so we're contacting with OTel .NET team to discuss about this feature. And since OTel .NET SDK adopts trace context from .NET runtime, we may need your assistance to realize this as well. 😃

We are promoting ActivitySource.StartActivity() as an improved mechanism to manage Activity creation, the main benefit being that it supports sampling in a way that is compatible with OpenTelemetry. The new API proposed above wouldn't handle this case because StartActivity() returns an Activity object that already has an ID and is already started. Would we need to extend the design proposal to handle that case or there are reasons it doesn't matter?

We realized this behavior that sampling decision is made before an activity is created while using StartActivity(), so currently in this case, we'll not only replace the trace id, but also update the sampling decision based on the new trace id. And yes, it would be great that you may extend the design proposal to create activities earlier before sampling decision is made to them as some sampler is traceid-based.

It also sounds like your goal is to control the IDs of all Activities generated within the app, not just the ones that your code is directly responsible for creating?

Yes, trace id needs to be uniform in format as X-Ray will trace the entire application and all its downstream services, which could either be AWS services or the other instrumented applications. So far, only the root activity's trace id needs to be updated as the new ids will be inherited by its child activity.

Do you fail to log it or you hope to convert it using some extensibility mechanism for intercepting inbound IDs?

So far for X-Ray the only mechanism for manipulating trace id lies in the SDK, and there is no external mechanism for our backend system to convert them except logging.

@noahfalk
Copy link
Member

noahfalk commented Jan 13, 2021

Thanks @lupengamzn, that clears up/confirms everything I was curious about. Also wanted to add @macrogreg and mention #42786 which is a similar (though slightly larger scoped) request from DataDog. I did push back when @macrogreg asked for similar functionality earlier but I suppose my resistance is dwindling as I think about it more and see the request replicated. Based on what I know now my suggestion is that we add some API similar to this:

partial class Activity
{
    // Whenever a root Activity is first sampled or started, this delegate will be invoked to determine the new ID. This property
    // will be initialized to a default random number generation function provided by .NET Runtime which should be suitable
    // for most use-cases.
    public static Func<ActivityTraceId> TraceIdGenerationFunction { get; set; }
}

I'll admit I grimace a little bit suggesting this because I worry that it creates opportunities for incompatibility. For example if someone tried to log to both DataDog and XRay your systems have mutually incompatible expectations for what proper IDs look like. To me this feels like it goes against the intent of the W3C TraceContext specification which intended to let vendors interoperate using a standardized definition of the trace id. However for practical purposes I'd guess that most developers are going to pick a single vendor and your efforts to increase compatibility can be decided based on requests from your customers rather abstract mandates of a specification.

From the .NET side I do want to be up-front that we have no ability to control what code would set that property. Hopefully a cottage industry of NuGet library authors won't spring up with new uses for this, but if they do and their property setting overrides your setting then it would be up to you to figure out how to handle the incompatibilities.

@tarekgh @cijothomas - what do you think of this?

@cijothomas
Copy link
Contributor

From Otel spec, it allows customization of both TraceId and SpanId, although it looks like AWS X-Ray requirement is only around TraceId. We may allow similar for SpanId generation as well.

@tarekgh
Copy link
Member

tarekgh commented Jan 13, 2021

I am not comfortable with this proposal as this delegate would be global and can cause confusion if more than one party opt-in doing that. Also, there will not be easy way for anyone else to create the Ids differently when this delegate is set.

Would it be better to have this delegate passed to the Activity creation instead and not be global? or at least we should have option to the Activity creators to opt-out of using the global delegate if needed.

Or we have the mix of these 2 options, have the global static delegate as the default and then Activity creation can have option to choose something else if needed.

Last, we need to get some feedback from different parties about this behavior and if they think if they can break because of that. Also, we need to know the perf implication if someone provided a delegate which consume a lot of time.

@cijothomas
Copy link
Contributor

we have the same issue with Activity.IdFormat as well, right? Its global and anyone could set/change it affecting everyone else.

@cijothomas
Copy link
Contributor

Or we have the mix of these 2 options, have the global static delegate as the default and then Activity creation can have option to choose something else if needed.
👍

@noahfalk
Copy link
Member

Also, there will not be easy way for anyone else to create the Ids differently when this delegate is set.

I agree this is possible, but I am relying on it being both unlikely and resolvable when it does occur. Do you think such a scenario would be likely?

Would it be better to have this delegate passed to the Activity creation instead and not be global?

This is conceptually easier but it doesn't solve the issue XRay and DataDog are trying to solve. Their code isn't the one that calls ActivitySource.StartActivity(), they are only a listener.

Activity creation can have option to choose something else if needed.

My guess is that that this requirement wouldn't occur often enough to be worth adding that extra complexity, but I'm not opposed to it if we needed it. A rough comparision might be looking at how often existing code invokes Activity.SetIdFormat() to override the global setting.

Last, we need to get some feedback from different parties about this behavior

I'm glad to loop anyone you think we should be talking to but not sure who you had in mind?

we need to know the perf implication if someone provided a delegate which consume a lot of time

I think the perf implications are the same as writing slow delegates for many of the other ActivityListener callbacks. If customers value high performance but a vendor implements something slow then customers will complain and/or switch to a new vendor.

@tarekgh
Copy link
Member

tarekgh commented Jan 14, 2021

I agree this is possible, but I am relying on it being both unlikely and resolvable when it does occur. Do you think such a scenario would be likely?

We don't know yet if this scenario is likely or not. As soon as we enable this feature, we'll start discover that I guess. That is why I am trying to be conservative to avoid any surprises. but if you think this is really unlikely, of course I trust your judgment.

This is conceptually easier but it doesn't solve the issue XRay and DataDog are trying to solve. Their code isn't the one that calls ActivitySource.StartActivity(), they are only a listener.

Yes I know that and that is why I suggested to add the ability for overriding the TraceId during the creation too.
As the scenario we are talking about is when listening, which means the Activity is not created yet. should we allow overriding the TraceId at that time instead? We already passing the ActivityCreationOptions to the listeners. We can allow them set the trace id at that time.

I'm glad to loop anyone you think we should be talking to but not sure who you had in mind?

aspnet, Azure SDK or any other Azure teams. I think you know better than me here :-)

I think the perf implications are the same as writing slow delegates for many of the other ActivityListener callbacks. If customers value high performance but a vendor implements something slow then customers will complain and/or switch to a new vendor.

That make sense. do we have to care about generating colliding Ids especially Cijo asking to allow generating the Span Ids too? or would be same to complain about the owner of the generator?

@tarekgh
Copy link
Member

tarekgh commented Jan 28, 2021

@lupengamzn @cijothomas

I have discussed this issue offline with @noahfalk and we agree with the Noah's proposal #46704 (comment). One change think is we'll make this delegate nullable

    public static Func<ActivityTraceId>? TraceIdGenerationFunction { get; set; } // alternative name would be GenerateTraceId per https://docs.microsoft.com/en-us/dynamicsax-2012/developer/naming-conventions-delegates-and-event-handlers?

If you agree, I'll modify the proposal in the top with that and we can proceed.

For the generating span id, we would wait to find more demand on that and we can discuss the details as needed as there is some concerns with it if we take the same approach as we did with the trace Id.

@lupengamzn
Copy link
Author

lupengamzn commented Jan 28, 2021

@lupengamzn @cijothomas

I have discussed this issue offline with @noahfalk and we agree with the Noah's proposal #46704 (comment). One little change is we'll make this delegate nullable

    public static Func<ActivityTraceId>? TraceIdGenerationFunction { get; set; } // alternative name would be GenerateTraceId per https://docs.microsoft.com/en-us/dynamicsax-2012/developer/naming-conventions-delegates-and-event-handlers?

If you agree, I'll modify the proposal in the top with that and we can proceed.

For the generating span id, we would wait to find more demand on that and we can discuss the details as needed as there is some concerns with it if we take the same approach as we did with the trace Id.

This sounds great for my side, thanks for the workaround!

@tarekgh
Copy link
Member

tarekgh commented Jan 28, 2021

@macrogreg please let's know if the proposal will work for DataDog before we proceed.

@tarekgh tarekgh added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Feb 2, 2021
@macrogreg
Copy link

@tarekgh sorry, I was out for a few days, just seeing this. I'll take a close look in the next 2 days.

@tarekgh
Copy link
Member

tarekgh commented Feb 2, 2021

Thanks @macrogreg. Sorry interrupting you during your time off. I hope you can get back to us before Thursday as I expect the review will be scheduled that day. If you cannot do it, I can try to remove the ready for review mark till you respond. Please let me know what work best for you.

@macrogreg
Copy link

Ok, @tarekgh , I'll review tomorrow morning and if it takes longer I'll add a comment here.

@macrogreg
Copy link

Hi, again, folks, :)

Sorry for having missed this discussion previously. It is, indeed, very similar to what I requested previously in #42786, slightly scoped down.

This particular discussion is about the Trace Id only. Customising Span Id may (or may not) be necessary, but it is out of score here. :)

The API looks good, I like that the change is small. However, I would like to make sure that we have a clear answer on how it would address the relevant scenarios. If the answer is already known and I overlooked it while catching up on the discussion, then sorry, please just point it out. If not, perhaps we can discuss a little more before locking it in. :)

I. Considering parent activity when generating the Trace Id.

So the scenario is that a service received a request. The request may or may not already have a trace context in the headers. Let's think through the possibilities.

  1. The incoming request does not have any trace context info, and the vendor does not want to influence the Trace Id generation.

The application server (or vendor code) handling the service call entry point will start the local root activity, which is in this case also the global trace root activity. That code will call one of the existing ActivitySource.StartActivity(..) overloads, just like today.

  1. The incoming request does have some trace context info, and the vendor does not want to influence the Trace Id generation.

The application server (or vendor code) handling the service call entry point will extract the trace context info from the request. It will create an instance of ActivityContext and then pass it to one of the existing ActivitySource.StartActivity(..) overloads, just like today.

  1. The incoming request does not have any trace context info, and the vendor does want to influence the Trace Id generation.

The vendor will make a call of the form

Activity.TraceIdGenerationFunction = ( () => { ActivityTraceId id = DoStuff(); return id; } );

They will want to do this once, during initialization.
During execution, the application server (or vendor code) handling the service call entry point will start the local root activity, which is in this case also the global trace root activity. That code will call one of the existing ActivitySource.StartActivity(..) overloads, just like today.
The ActivitySource.StartActivity(..) will (internally) notice that Activity.TraceIdGenerationFunction is not null and call it to generate the Trace Id. Q: If the TraceIdGenerationFunction throws, should we crash, or should be swallow and fall back to the default method? Seems like the runtime should not try to be smart and allow the exception to proparage (likely resulting in a crash).

  1. The incoming request does have some trace context info, and the vendor does want to influence the Trace Id generation.

This is the bit I an concerned about in the context of the current proposal.
The application server (or vendor code) handling the service call entry point will extract the trace context info from the request. The activity being created is the local root, but logically it is not the global trace root. The vendor may (or may not) need to use the existing trace context information when creating their customized Trace Id.
If they do not want to use that information, the existing proposal works just fine.
If they do want to use it, there is currently no way for them to do so (please let me know if I am missing it).

So it seems that TraceIdGenerationFunction needs to be able to receive some state. What type should it have?

a) ActivityContext.
I.e. the API is public static Func<ActivityContext, ActivityTraceId> TraceIdGenerationFunction { get; set; }
In that case, the application server (or vendor code) will construct the trace context from the request just like in case (2) above. It will be passed to ActivitySource.StartActivity(..) and the TraceIdGenerationFunction can decide to take it into consideration (or not).

This my preferred suggestion because it the the simplest. We should be aware, however, that it does not allow a vendor to consider information for Trace Id creation, that does not fit into ActivityContext. If we wanted to allow that, we would need to offer an overload of ActivitySource.StartActivity(..) that takes an additional parameter that is then passed to the TraceIdGenerationFunction. Two types for that parameter are possible:

b) object.
I.e. the API is public static Func<object, ActivityTraceId> TraceIdGenerationFunction { get; set; }
However, that would result in an allocation.

c) T.
I.e. the API is public static Func<T, ActivityTraceId> TraceIdGenerationFunction { get; set; }
In that case T is also a type param on StartActivity(..). Note the additional complexity around handling the fact that the T specified when setting TraceIdGenerationFunction might not match the T used when invoking StartActivity(..).

Personally, I tend to prefer (a) for its simplicity or (c) to avoid the allocation, despite the complexity.

based on the above, I would suggest the API shape to be

public static Func<ActivityContext, ActivityTraceId> TraceIdGenerationFunction { get; set; }

or

public static Func<T, ActivityTraceId> TraceIdGenerationFunction { get; set; }

What do you guys think?

P.S. I have two more considerations to raise. I will post them in separate comments later today. :)

@macrogreg
Copy link

Minor point: TraceIdCreationFunction, instead of TraceIdGenerationFunction?
It's shorter and communicates that an ActivityTraceId instance is being created.

@tarekgh tarekgh added api-needs-work API needs work before it is approved, it is NOT ready for implementation and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Feb 3, 2021
@tarekgh
Copy link
Member

tarekgh commented Feb 3, 2021

I removed ready for review tag for now till we resolve the feedback comments.

Q: If the TraceIdGenerationFunction throws, should we crash, or should be swallow and fall back to the default method? Seems like the runtime should not try to be smart and allow the exception to proparage (likely resulting in a crash).

That is a good question. I can see the both ways. if TraceIdGenerationFunction throws frequently, that would make the app un-usable I guess. but I am not sure if this will be a concern as we have many other callbacks (like in ActivityListener) which can do the same effect. So, my take here is the runtime never catch the exception and let it propagate as any other exceptions. The top stack should be pointing to the culprit.

For point #4, I believe if anyone specified a context when calling StartActivity this context should be respected regardless. Otherwise, the aggregation of the tracing data across the distributed system would be broken. It will be very surprising behavior if the context get changed underneath I guess.

By the way, the c proposal cannot work. You cannot define that in Activity class without specifying a specific T type. We'll have to define all types we can accept. like

public static Func<ActivityContex, ActivityTraceId> TraceIdGenerationContextFunction { get; set; }
public static Func<string, ActivityTraceId> TraceIdGenerationstringFunction { get; set; }

Minor point: TraceIdCreationFunction, instead of TraceIdGenerationFunction?

I'll add this as alternative proposed name in the proposal above.

@macrogreg
Copy link

Q: If the TraceIdGenerationFunction throws, should we crash, or should be swallow and fall back to the default method? Seems like the runtime should not try to be smart and allow the exception to proparage (likely resulting in a crash).

That is a good question. I can see the both ways. if TraceIdGenerationFunction throws frequently, that would make the app un-usable I guess. but I am not sure if this will be a concern as we have many other callbacks (like in ActivityListener) which can do the same effect. So, my take here is the runtime never catch the exception and let it propagate as any other exceptions. The top stack should be pointing to the culprit.

Agreed.

For point #4, I believe if anyone specified a context when calling StartActivity this context should be respected regardless. Otherwise, the aggregation of the tracing data across the distributed system would be broken. It will be very surprising behavior if the context get changed underneath I guess.

Yes. For the activities that have a remote parent (and thus a ActivityContex is passed to StartActivity(..)), the TraceId comes from the context, doesn't it? So what I am saying is that the custom creator/generator should be able to consider that parent context. :)

By the way, the c proposal cannot work. You cannot define that in Activity class without specifying a specific T type. We'll have to define all types we can accept. like

public static Func<ActivityContex, ActivityTraceId> TraceIdGenerationContextFunction { get; set; }
public static Func<string, ActivityTraceId> TraceIdGenerationstringFunction { get; set; }

Good point. We could consider some workaround similar to:

class Activity
{
    private static object s_traceIdCreationFunction = null;

    public static Func<T, ActivityTraceId> GetTraceIdCreationFunction<T>()
    {
        if (_traceIdCreationFunction  == null)
        {
            return null;
        }

        if (! typeof(Func<T, ActivityTraceId>).Equals(_traceIdCreationFunction.GetType()))
        {
            throw new InvalidOperationException($"You are requesting a TraceIdCreationFunction of type"
                                              + $" Func<T, ActivityTraceId> where T = {typeof(T).GetName()}."
                                              + $" However, for the the current TraceIdCreationFunction"
                                              + $" T is a different type. Make sure the respective types match.}");
        }

       return _traceIdCreationFunction;
    }

    public static void SetTraceIdCreationFunction<T>(Func<T, ActivityTraceId> func)
    {
        _traceIdCreationFunction  = func;
    }
}

However, I am not saying that the complexity is warranted. I am only saying that the option exists. :)

@tarekgh
Copy link
Member

tarekgh commented Feb 3, 2021

the TraceId comes from the context, doesn't it?

That is right.

So what I am saying is that the custom creator/generator should be able to consider that parent context. :)

This is the point that is unclear to me till now. Should the custom creator/generator be able to change the trace id of the current request? if so, then that is the concern I raised that we should honor the context passed to the StartActivity call without doing any change to it. Or you are saying the custom creator/generator will consider that in the future trace id generation when there is no trace id passed to StartActivity? Or I missing something else here?

@macrogreg
Copy link

I am not sure I understand this bit:

Or you are saying the custom creator/generator will consider that in the future trace id generation when there is no trace id passed to StartActivity?

Overall, that is exactly the topic of my second larger point which I am writing up right now.
The premise is that the guy who creates the trace-id-value value for the local root activity may want to do that both:
a) In cases where the local root activity is also the global trace-root activity - everything is "easy" in that case.
b) In cases where there is an existing remote span that is the global race-root activity. In such cases, we have an "incoming" trace context that needs to be considered.

If we explicitly do not support (b) we can simplify, but then we should state that explicitly. However, the way I understand AWS's request (b) is a requirement. In a generic case for .NET, (b) can certainly occur, but whether or not .NET wants to support it right now is, of course to be decided. :)

I'll finish my other point and post it :)

@noahfalk
Copy link
Member

noahfalk commented Feb 4, 2021

However, the way I understand AWS's request (b) is a requirement.

I don't think your (b) was a requirement for AWS. From above in the conversation:

[Noah] How does AWS handle when incoming requests already have a trace id and the id is not in the form that AWS would generate itself?

[lupengamzn] If using X-Ray SDK, AWS X-Ray backend will abandon all trace context which doesn't have a compatible trace id and log them out. If using OTel SDK, X-Ray collector will discard them and log them out.

I would be hesitant to use this extensibility point for your (b) goal. I suspect we might want a separate feature that looks at an incoming ActivityContext and decides whether or not to create a new root activity. If it does create a new root Activity it probably also wants to generate various customizations of links, tags, baggage, etc which this callback would not allow you to do. This may also be related to extensible propagation that still needs to be discussed. I'd recommend that we keep this scoped to just the (a) goal for now.

@macrogreg
Copy link

macrogreg commented Feb 4, 2021

I am absolutely OK with scoping to only (a) as described in the earlier comment, however, are we sure it is enough then:

Say, AWS X-Ray cannot process Trace IDs that are not "like they need them". If a call chain begins in an X-Ray monitored service, we cover everything by (a) and all is good. What if the call chain does not begin in a X-Ray monitored service?

From what I gather from the above discussion, either X-Ray gets the Trace ID to be the way they need it to be, or they won't log the data. Am I missing something? (I am not sure that "log them out" above means. :) )

@macrogreg
Copy link

As mentioned earlier, here my second consideration. :)

II. Making sure custom Trace Id plays nice with trace context propagation in a heterogeneous monitoring environment.

Consider the scenario where vendor A uses default Trace Id creation, vendor B uses a custom Trace Id creation and vendor C uses a custom Trace Id creation that is different from B. Consider the following distributed call chains where Sn(X) means a service n monitored by vendor X.

Request 1: user -> S1(A) -> S2(B) -> S3(C) -> S4(B) -> S5(A)
Request 2: user -> S11(A) -> S12(B) -> S13(A)

Note:

  • B needs to recognize that the invocations of S2 and S4 belong to the same distributed trace.
  • A needs to recognize that the invocations of S1 and S5 belong to the same distributed trace.
  • A needs to recognize that the invocations of S11 and S13 belong to the same distributed trace.

Consider S2.
B has set TraceIdCreationFunction to generate whatever trace ID is appropriate and replace that came in from S1 in the trace context. B does not know who monitors S3, as well as any other downstream dependencies of S2. If B knew that they are monitored by B too, it would simply use its new Trace Id. But in a world where somewhere down the chain a service may be monitored by A, S2 must make sure not to break the chain.

I have previously discussed how that can be achieved here and there. That particular discussion was Datadog specific, but I think it can be generalized easily. .NET cannot enforce that vendors adhere to the process I describe in those links (or any other guidance that may or may not be based on that discussion), however:

  • Guidance would make it easier for vendors to co-operate in the interest of the customer (perhaps it's OTel and not .NET who should be issues such guidance?)
  • If in this discussion we can agree on a recommended interop process, we can validate if the proposed API covers the scenarios.

When I generalize the conclusions made in the above links in my mind, I end up with the following.

  • TraceIdCreationFunction needs access to any potential incoming trace context.
  • If TraceIdCreationFunction modified an incoming trace id, then EVERY component that enriches downstream invocation payloads to include context propagation info into them, needs to know:
    • The fact that the trace id is "custom" (i.e. different from default).
    • What exactly was the data overwritten (it needs to be resurrected and propagated).

There are different ways to achieve it. I will make a proposal, but there may be better ones:

  • ActivityTraceId gets a byte-sized Kind property. There are some reserved values for the default/automatic cases. Custom implementations of TraceIdCreationFunction must use other values.
  • The shape of the trace id generator can be like this:
delegate bool TryCreateCustomActivityTraceId(ActivityContext? parentContext, out ActivityTraceId customTraceId);
  • If the method returns false, the default algorithm for id generation is used. That way the vendor can selectively fall back to it without reasoning about how it works.
  • parentContext is the additional parameter from my discussion point (I). It would be of generic type T instead of ActivityContext? if we wanted to go with option (c) there.
  • customTraceId is the ID generated. It's Kind property must be as described above.

The Activity class gets a new non-static API:

public ActivityTraceId GetNonCustomTraceId();

If a custom generator was not set OR if TryCreateCustomActivityTraceId(..) returned false, then GetNonCustomTraceId() will return the same value as the "normal" Trace Id property.
If a custom Trace Id was set, then GetNonCustomTraceId() will return the value that would be the Trace Id if the custom was not set. Such "non-custom id" will need to be propagated to child activities in the same way as the trace id is now.

This will allow context propagators to propagate the "non-custom id" using OTel propagation mechanisms and the custom generated id in vendor specific propagation headers. This should allow things to work in a heterogeneous environment in a manner discussed in the above links.

@macrogreg
Copy link

My last point is shorter :)

III. The full ID of a Span is really the ActivityContext, not the ActivityTraceId and not the ActivitySpanId. Consider the custom ID creation API to generate ActivityContext not the ActivityTraceId.

Basically, we are having this discussion because an important vendor cannot use the default format for the Trace Id. That's reasonable. I am a big fan of not solving problems that do not yet exist. However, encountering the same issue with Span IDs, as we just did with Trace IDs is very conceivable. If we were to add a custom Span ID creation API in the future, we would end up with more APIs => more complexity. I recommend to seriously consider making the "creator" apply to the ActivityContext as a whole. The above discussion, can be applied accordingly:

  • a "Kind" property would need to exist only once on the ActivityContext type.
  • the choice of focussing only on (a) or on both, (a) and (b) mentioned earlier can still be made either way.
  • really, all of the above can be transferred 1:1 to ActivityContext. I think (please correct me if I am wrong) without adding any complexity.

@noahfalk
Copy link
Member

noahfalk commented Feb 6, 2021

Thanks for feedback Greg!

From what I gather from the above discussion, either X-Ray gets the Trace ID to be the way they need it to be, or they won't log the data. Am I missing something?

My understanding is that the X-Ray developers acknowledge this as a limitation in the scenarios they can support. I don't think you are missing anything.

B has set TraceIdCreationFunction to generate whatever trace ID is appropriate and replace that came in from S1 in the trace context

The current scope of this feature does not include any ID replacement capability. It is solely intended to handle creating the id for a new root activity. I would handle those other out-of-scope scenarios by either:

  • Designing a new future feature that is broader in scope that allows the creation of a new root activity and corresponding new ID, triggered via some extensibility mechanism. The new activity might have tags, links, or baggage that in part are inherited from the remote activity and in part are populated by the extensibility mechanism. Such a feature appears non-trivial, probably rising to the level of OTel specification changes in which case it would best be designed as part of the OTel design process and not as a .NET specific feature. The OTel spec currently says "For a Span with a parent, the TraceId MUST be the same as the parent"
  • Leave the ID as-is and use tags, custom properties, baggage, or trace state to attach vendor specific data, such as a vendor specific alternate id

Guidance would make it easier for vendors to co-operate in the interest of the customer (perhaps it's OTel and not .NET who should be issues such guidance?)

I suspect OTel or the W3C trace id spec discussion could be a better forum since I am guessing the interop questions you are raising aren't .NET specific? If there are areas that are .NET specific I'm happy to chat.

However, encountering the same issue with Span IDs, as we just did with Trace IDs is very conceivable

I agree its conceivable, but don't think that should be the bar for adding more features.

If we were to add a custom Span ID creation API in the future, we would end up with more APIs => more complexity

This part didn't feel persuasive for a few reasons:

  1. Adding a feature before people need it often means we don't understand the goals and requirements very well. This makes it more likely we design an inferior solution or solve a problem that nobody winds up having. Unused or poorly suited APIs are also a form of complexity.
  2. I care about complexity proportional to the number of .NET developers that will be impacted by that complexity. I expect the audience for the APIs we are discussing here will be <100 people, quite possibly < 10. I don't feel bad at all asking 10 people to call a few extra APIs because we allowed the design to grow organically.

@tarekgh
Copy link
Member

tarekgh commented Feb 9, 2021

@macrogreg per discussion we'll stick with the original proposal. and for the more broader scenarios you brought we can track addressing those in future.

@tarekgh tarekgh added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-needs-work API needs work before it is approved, it is NOT ready for implementation labels Feb 9, 2021
@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Feb 9, 2021
@terrajobst
Copy link
Member

terrajobst commented Feb 9, 2021

  • Let's name this property TraceIdGenerator
  • The method will neither throw when it's set more than once nor when it's set after an ID was generated. The consumer has to set it early, ideally in main. But if they don't we'll honor it next time an ID is generated. It just means that part of the activities may not be processable if the format of the ID is important to them.
namespace System.Diagnostics
{
    public partial class Activity
    {
        public static Func<ActivityTraceId>? TraceIdGenerator { get; set; } 
    }
}

@macrogreg
Copy link

Thanks for the responses.

I care about complexity proportional to the number of .NET developers that will be impacted by that complexity. I expect the audience for the APIs we are discussing here will be <100 people, quite possibly < 10. I don't feel bad at all asking 10 people to call a few extra APIs because we allowed the design to grow organically.

This is, actually, a very good point.
However, those 10 developers are probably writing code that is more impactful that a single application, so it would be good to educate them well. I would love to see some of the points we discussed, including API limitations and reasons for the design decisions, find their way into the standard API docs for this API. :)

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 11, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 12, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Mar 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Diagnostics.Activity
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants