Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reworked Distributed Tracing Extension #607

Merged
merged 4 commits into from
May 9, 2020

Conversation

slinkydeveloper
Copy link
Member

Fixes #603

Signed-off-by: Francesco Guardiani francescoguard@gmail.com

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
Copy link
Contributor

@ian-mi ian-mi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear what the motivation is for these changes or how they would make the extension more useful. I also don't think we should adopt such a large change to this extension's specification especially since the proposed change appears to severely limit the usefulness of this extension as a trace propagation mechanism.


To integrate with existing tracing libraries, the Distributed Tracing attributes
MUST be encoded over HTTP(S) as headers. E.g.
The Distributed Tracing Extension is not intended to replace the protocol specific headers for tracing,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more accurate to say that it is intended to function like the HTTP tracecontext headers in non-HTTP protocols or for event storage.

Given a multi hop event transmission, the Distributed Tracing Extension, if used, MUST
carry the trace information of the starting trace of the transmission.
In other words, It MUST NOT carry trace information of the actual hop, since these information are usually
carried using protocol specific defined headers, understood by tools like [OpenTelemetry](https://opentelemetry.io/).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many protocols do not define a trace propagation mechanism, hence the need for this extension as the eventing trace propagation standard. If this extension is changed to specify that it should not track the actual trace context, then we would need a new protocol agnostic standard for trace propagation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure cloudevents is the right place where this standardization should happen

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case where do you propose that this happens?

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
@bsideup
Copy link
Contributor

bsideup commented Apr 30, 2020

Hey @adriancole, perhaps you can provide some input here?


To integrate with existing tracing libraries, the Distributed Tracing attributes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2p and take it for what its worth. I don't see the update of this section bringing clarity the linked specification doesn't already cover. Maybe just underscore that the headers themselves are only the context of the trace.

Other event attributes may be helpful to add to trace data (or even some as log correlations or metrics dimensions), but specifically how to do log correlation, metrics or tracing is out of scope for this spec.

I'm aware that mentioning Zipkin may not be ok here, but if you did there's a lot of existing art with messaging tracing. Your call.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tracing is out of scope for this spec.

That's what we're trying to establish here 😄 The reason of this PR is exactly because somebody thought that the goal of this spec extension is to enable tracing.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. If the goal is to enable tracing, it seems limiting to only discuss the tracecontext spec. many years of tools, thousands of users are familiar with B3. You might consider linking to this also, if adoption is a concern. https://github.com/openzipkin/b3-propagation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another reason why i feel we should not touch the distributed tracing argument at all, otherwise we have a clear overlap with what openzipkin/opentelemetry/w3c is doing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that specifically how to do log correlation, metrics, and tracing are out of scope of this extension. That does not imply, however, that trace context propagation is out of scope. Since this extension is modeled after the W3C trace propagation format it seems that trace context propagation must be in-scope here. Nor do I think that adding a trace propagation mechanism for events conflicts with those other projects especially given that they are unlikely to define trace propagation standards either for cloudevents or for all possible eventing protocols.

@codefromthecrypt
Copy link

@bsideup sure, hope it helped.

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
@duglin
Copy link
Collaborator

duglin commented May 6, 2020

A couple of rambling thoughts/comments:

  • I think the current shape of the PR (modulo the small syntax issues I noted) seems to correctly align with our original intent
  • there appears to still be some questions around the value of this. In particular, I believe the concerns come from 2 perspectives:
    • the tracing info feels like something outside of the CE domain as it's not really about the event as much as it's about the transport path it took. And as such, feels like an incorrect bit of metadata for CE to deal with
    • if this information does not contain all of the tracing info (from entire entire transport path) then what is it's value? And this comes down to whether or not there is value in knowing just the tracing info from the initial sender.

In my chats with people, I've heard pros and cons for this extension. However, it is JUST an extension, which means we actually don't need to all agree. And, nor is there any requirement for any SDK to support it, or any other extension.

From that alone, it seems like it's ok for it to stick-around since at least some people do seem to find it useful. However, it is concerning that we seem to be revisiting this so often. What I'd like to propose is this:

  • if there are no concerns with the PR itself (ignoring the confusion around the purpose of the PR), and the PR makes the text better than it was before, then I think we should merge it
  • ask for people to come forward with interest or intent to support this extension. I'd like to see if we can get more than "this sounds interesting" type of feedback. If no one has any intention of actually supporting it then we could consider dropping it via some deprecation/warning period. Actually, we may want to define a process in which all extensions go thru some periodic review to ensure we remove stale ones.

Thoughts?

Copy link
Contributor

@ian-mi ian-mi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • the tracing info feels like something outside of the CE domain as it's not really about the event as much as it's about the transport path it took. And as such, feels like an incorrect bit of metadata for CE to deal with

The CE spec is not just a data format, it also deals with transmission and protocols. Therefore I don't think it's necessarily out of bounds for an extension to handle metadata related to event transmission. Such extensions have already proved to be useful for other purposes in knative such as for TTL and cycle detection.

  • if this information does not contain all of the tracing info (from entire entire transport path) then what is it's value? And this comes down to whether or not there is value in knowing just the tracing info from the initial sender.

I agree that there seems to be little purpose in storing an expressly immutable trace context.

In my chats with people, I've heard pros and cons for this extension. However, it is JUST an extension, which means we actually don't need to all agree. And, nor is there any requirement for any SDK to support it, or any other extension.

I agree that one possible solution would be to be less prescriptive about how the extension is used. This PR does not do that, however. It instead adds additional restrictions that prevent this extension from being a useful mechanism for trace propagation.

From that alone, it seems like it's ok for it to stick-around since at least some people do seem to find it useful. However, it is concerning that we seem to be revisiting this so often. What I'd like to propose is this:

  • if there are no concerns with the PR itself (ignoring the confusion around the purpose of the PR), and the PR makes the text better than it was before, then I think we should merge it

I do have concerns beyond the purpose of the PR. The requirements that would be added around transmission make the extension less useful than before as they would prevent the use of this extension for trace propagation.

  • ask for people to come forward with interest or intent to support this extension. I'd like to see if we can get more than "this sounds interesting" type of feedback. If no one has any intention of actually supporting it then we could consider dropping it via some deprecation/warning period. Actually, we may want to define a process in which all extensions go thru some periodic review to ensure we remove stale ones.

Thoughts?

I think there is clearly value in an extension which can provide trace propagation across event transmission and I have yet to see a compelling use-case presented for an immutable trace context which could not be used for trace propagation. I think this discussion would be more fruitful if such a use-case was brought forward. In the meantime I don't see why we should replace an extension with a clear use case (and one that is already being used in knative eventing) with an extension that seems to be strictly less useful.

Given a multi hop event transmission, the Distributed Tracing Extension, if used, MUST
carry the trace information of the starting trace of the transmission.
In other words, It MUST NOT carry trace information of the actual hop, since these information are usually
carried using protocol specific defined headers, understood by tools like [OpenTelemetry](https://opentelemetry.io/).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case where do you propose that this happens?


To integrate with existing tracing libraries, the Distributed Tracing attributes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that specifically how to do log correlation, metrics, and tracing are out of scope of this extension. That does not imply, however, that trace context propagation is out of scope. Since this extension is modeled after the W3C trace propagation format it seems that trace context propagation must be in-scope here. Nor do I think that adding a trace propagation mechanism for events conflicts with those other projects especially given that they are unlikely to define trace propagation standards either for cloudevents or for all possible eventing protocols.

@duglin
Copy link
Collaborator

duglin commented May 7, 2020

The CE spec is not just a data format, it also deals with transmission and protocols. Therefore I don't think it's necessarily out of bounds for an extension to handle metadata related to event transmission.

It really only deals with transmission and protocol enough to show how CE metadata appears in certain protocols. We specifically tried to avoid transport level metadata as CE properties. See the discussions of "destination" in the primer, e.g. https://github.com/cloudevents/spec/blob/master/primer.md#cloudevents-concepts

That's not to say some of those bits of metadata aren't useful - but they're not in-scope for CE right now.

The requirements that would be added around transmission make the extension less useful than before as they would prevent the use of this extension for trace propagation.

which requirements are you referring to?

I kind of feel like there are two conversations going on here:
1 - trying to add clarity around what the intent of this extension was meant to do/say - not any more than that. And this extension, I believe, was not intended to be a tracing mechanism - it was meant to echo some of the tracing info into CE metadata.
2 - trying to turn this extension into a full tracing mechanism

I'd prefer to let @slinkydeveloper developer off the hook and let him do just 1 since that's all he signed up for. If someone wants to propose 2 then I think that's a different PR.

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
@slinkydeveloper
Copy link
Member Author

Actually, we may want to define a process in which all extensions go thru some periodic review to ensure we remove stale ones.

I like the sound of that: we add an extension, we check the adoption after a while and then we understand if we want to keep it or not

I agree that one possible solution would be to be less prescriptive about how the extension is used. This PR does not do that, however. It instead adds additional restrictions that prevent this extension from being a useful mechanism for trace propagation.

I see your concerns @ian-mi but I feel that, if we don't clarify the scope of this extension, we still can't use it for our production use case, because we'll still have unanswered questions:

  • When we receive an event with the ce-traceparent, should we trust its value and build our span from it?
  • When we receive an event with both ce-traceparent and traceparent, which of the two should we trust?
  • When we receive an event with ce-traceparent, should we replace it?

The whole goal of this PR is to clearly answers these questions, following the original intentions of the extension.

@codefromthecrypt
Copy link

I'm one of the few people implementing tracestate, literally am right now actually, but you can ask advice from instana who notably also implement this dance.

instana/nodejs@ef6fdf3

In general, you need to be able to identify your own entry in tracestate. Once you can assert that you ignore the traceparent mostly as it is not helpful. I think they prefer you to re-use the same trace ID, but the spec is squishy.

The other things to consider are just "pass through" meaning you apply zero semantic value to anything. This is what jaeger and elastic do.

@duglin
Copy link
Collaborator

duglin commented May 7, 2020

@adriancole are you implementing this CE extension?

Also, without getting into whether or not the CE attribute is useful, does this PR help clarify what its expect value is and how it can be used? Or at least not make it worse? https://github.com/cloudevents/spec/pull/607/files#r418372301 seems to indicate that it doesn't make it any worse :-)

@codefromthecrypt
Copy link

@duglin my only commentary is about the usefulness of the "traceparent" field in the specification that this current work is bound to. I implement trace propagation for dozens of things including messaging rpc you name it. The implementation will be limited to the semantics available in the spec. In other words, by joining this directly to the trace-context spec, it is probably best to either punt to that spec or decide some application here. If you do decide to apply it here, I'd suggest looking at the code I mentioned from instana, but this is just a suggestion. I feel like some of this chat is cyclic as I'm really not sure who is implementing anything and it is odd to choose a concrete header spec without experience with it first. How you navigate this is up to you. I've only put commentary here as someone asked me to. For now, I'll stop giving advice and let y'all just decide whatever it is you are going to do.

@duglin
Copy link
Collaborator

duglin commented May 9, 2020

Approved on 5/7 call

@duglin duglin merged commit ec8f1d7 into cloudevents:master May 9, 2020
n3wscott pushed a commit to n3wscott/cloudevents-spec that referenced this pull request Nov 30, 2020
* Reworked Distributed Tracing Extension

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* suggestions

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* suggestions

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* Grammar

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
@duglin duglin mentioned this pull request Dec 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clarification in Distributed Tracing Extension
5 participants