Skip to content
This repository has been archived by the owner on Oct 3, 2023. It is now read-only.

Proposal: record first local span's ID in SpanContext #229

Open
axw opened this issue Jan 21, 2019 · 27 comments
Open

Proposal: record first local span's ID in SpanContext #229

axw opened this issue Jan 21, 2019 · 27 comments
Labels
data-model Schema, attribute names, tracestate keys, metrics, units, etc.

Comments

@axw
Copy link

axw commented Jan 21, 2019

I maintain the Elastic APM Go Agent. I've spent some time investigating whether we can implement an OpenCensus exporter, and I've hit a blocker that I think requires a change to opencensus-go.

The issue is that we have an additional concept of "transaction", which you can think of as the entry span in a service (e.g. an incoming HTTP request). Every span within a transaction records the trace ID, transaction ID, and (if it has one) parent span ID. Every span has a transaction ID, which may used for grouping. For each service in a trace, we would map the first local OpenCensus span (i.e. spans with a remote parent) to an Elastic APM transaction.

Because the IDs are all flowing to the exporter from opencensus-go, which has no concept of transaction, we cannot propagate the transaction ID to child spans in process.

Describe the solution you'd like

Record the first local span's ID in SpanContext / record the IDs of spans which have remote parents.

Describe alternatives you've considered

Alternative 1

I have considered keeping a map of OpenCensus span IDs to Elastic APM transaction IDs. This would be expensive to maintain, and would only work if the parent ends (i.e. is exported) before the child, which would be unusual.

Alternative 2

Another option would be something long the lines of the callbacks described in #70. Our exporter/plugin would be invoked when a span is created, along with the parent span context; we would store the transaction ID in tracestate, and propagate via this callback. There are a couple of issues with this approach:

  • it adds the transaction ID to tracestate, which will be propagated between services. This is not necessary, so just adds overhead
  • precludes https://github.com/census-instrumentation/opencensus-service, or at least diminishes its value by continuing to require vendor-specific code in the application

Alternative 3

Change Elastic APM to make the transaction ID on spans optional. As of its current release (6.6), they are required. While we may lose some future functionality for trace data originating in the OpenCensus, but for now the impact would be minimal.

@axw
Copy link
Author

axw commented Jan 21, 2019

Brave very recently introduced the concept of "local root", which is conceptually the same thing: openzipkin/brave#801

@codefromthecrypt
Copy link

codefromthecrypt commented Jan 21, 2019 via email

@SergeyKanzhelev
Copy link
Member

perf implications aside, I wonder if this will be the job for tracestate? The whole idea of tracestate is to hold a position of a span in a multiple distributed traces graphs. Holding Elasic's transaction ID it in tracestate will also give you parent/child relationship between transactions as tracestate will be propagated across processes.

@axw
Copy link
Author

axw commented Jan 22, 2019

perf implications aside, I wonder if this will be the job for tracestate? The whole idea of tracestate is to hold a position of a span in a multiple distributed traces graphs. Holding Elasic's transaction ID it in tracestate will also give you parent/child relationship between transactions as tracestate will be propagated across processes.

FYI, we are implementing the W3C Trace-Context draft, but we currently only use traceparent and not tracestate.

It could hypothetically be useful to propagate transaction IDs between processes, but we do not currently have a need for that. In our model, a transaction's parent may be either another (remote) transaction, or a (remote) span. We set the "span" ID in traceparent to either the remote transaction or span ID. If I understand @adriancole's comment correctly, we're doing something very similar: in some cases we elide a transaction's details (i.e. its local spans), and connect the transactions directly.

Regardless, it would require vendor-specific code in the application, right? And so the exporter service wouldn't be possible?

@codefromthecrypt
Copy link

codefromthecrypt commented Jan 22, 2019 via email

@SergeyKanzhelev
Copy link
Member

Got it. So it's not an issue of how to implement it - there are at least two ways OC SDK should be able to support this scenario - via local-only scoped tags or via tracestate. It is about a new concept to be introduced natively.

Is transaction ID the only thing needed? In Application Insights we used to have a notion of local operation name (basically localRoot.Name). This name is also useful (maybe even more) for aggregations. User ID and Session ID obtained by the incoming request handler are also good candidates to store in this localRoot context natively. Do you think those are in scope of this issue?

@codefromthecrypt
Copy link

codefromthecrypt commented Jan 23, 2019 via email

@axw
Copy link
Author

axw commented Jan 23, 2019

Is transaction ID the only thing needed?

For us: currently, yes.

In Application Insights we used to have a notion of local operation name (basically localRoot.Name). This name is also useful (maybe even more) for aggregations. User ID and Session ID obtained by the incoming request handler are also good candidates to store in this localRoot context natively. Do you think those are in scope of this issue?

I think User ID and Session ID might fit better in Correlation-Context, since they may be useful for aggregations across services. IIANM, this is already handled with instrumentation-provided tags.

For Elastic APM's exception reporting, we could make use of a locally propagated transaction name/type for aggregation/filtering. That doesn't fit with OpenCensus anyway though, so that's very hypothetical.

@SergeyKanzhelev
Copy link
Member

so your question is whether to add more scope than the id?

@adriancole I'm wondering if local root ID is just one of properties alongside other things that may be promoted to local/remote tags or to the tracestate. There may be elaborate configuration for this kind of attributes to tags promotion.

From API cleaness - implementing local root ID natively may require new methods like StartScopedLocalRoot or ResetLocalRootId on span builder. When implementing it via tags or tracestate extension module will keep APIs simpler.

That doesn't fit with OpenCensus anyway though, so that's very hypothetical.

Can you please elaborate on why local root ID fits and name/type doesn't from your perspective?

@axw
Copy link
Author

axw commented Jan 24, 2019

That doesn't fit with OpenCensus anyway though, so that's very hypothetical.

Can you please elaborate on why local root ID fits and name/type doesn't from your perspective?

I was referring to exception reporting here, not local root IDs. Elastic APM provides a means of reporting exceptions that occur within a transaction or span; OpenCensus does not (aside from setting the status of spans). If there were some mapping, then there might be additional local root/transaction properties that we would want to propagate locally.

From API cleaness - implementing local root ID natively may require new methods like StartScopedLocalRoot or ResetLocalRootId on span builder. When implementing it via tags or tracestate extension module will keep APIs simpler.

I'm relatively inexperienced with OpenCensus, so can you please expand on that a bit? Specifically how is that cleaner than my proposal? In case my proposal was not clear, here it is again with references to the Tracer and SpanBuilder from the Java API:

For Spans created via a SpanBuilder obtained with Tracer.spanBuilderWithRemoteParent, the span's ID will be recorded in its SpanContext as "local root ID". This will be inherited by descendant spans created within the same process.

@codefromthecrypt
Copy link

codefromthecrypt commented Jan 24, 2019 via email

@SergeyKanzhelev
Copy link
Member

Yeah, I meant "fullness", not "cleaness". Sorry. I looked at brave PR @adriancole posted above. My comment was that with any new concept there will be people who want to have better control over when local root being set and reset. One example may be if incoming request is tracked by automatic module, but one want to signal that the "real" local root is the span that is tracked from customer code and which has all the information representing local root like span name and attributes. Another scenario is a trace that initiate another trace as a background job. Given those may be just theoretical - @adriancole I wonder if anybody asked for this after this concept was introduced?

Another two questions for the new concept I was trying to raise with my questions:

  1. Is it universal and needed for every (or most) open census user?
  2. Is it generic enough and whether the very next request will be to expose other local root span properties?

First is enough to have it. Second is nice to think about in advance.

I'm all for the scenarios enabled via local root. And I understand desire to make this feature to be pre-installed. My first reaction on this is that configurable extension approach would be a better option here. What are your thoughts on whether it's universal and generic.

@codefromthecrypt
Copy link

codefromthecrypt commented Jan 28, 2019 via email

@SergeyKanzhelev
Copy link
Member

I would offer that the local root functionality is technically difficult if possible to do as a bolt on.

Is it what Skywalking requested here? I might be missing bigger context here

@axw
Copy link
Author

axw commented Feb 4, 2019

Another two questions for the new concept I was trying to raise with my questions:

Is it universal and needed for every (or most) open census user?

As Adrian says above, the target is exporters and not (directly) end-users. For anyone wanting to export OpenCensus spans to Elastic APM, this is necessary.

I can't speak for any other solution, though it's clearly not universal as there are other exporters without this need. Every solution is a little different of course, taking advantage of their particular constraints.

Is it generic enough and whether the very next request will be to expose other local root span properties?

I do not anticipate such requests coming from our end.

@bogdandrutu
Copy link
Contributor

Hi @axw and everyone.

First: Sorry that I am late in this discussion.

Second: The current design of the library assumes that SpanContext is the "propagated" between processes (client to server RPCs), which I don't think it is the case for this. I am trying to think to not add any other non-propagated property in the SpanContext, what if add this to the Span?

PS: this is a bit of brainstorming so ideas are welcome.

@axw
Copy link
Author

axw commented Feb 13, 2019

@bogdandrutu no worries.

The current design of the library assumes that SpanContext is the "propagated" between processes (client to server RPCs), which I don't think it is the case for this.

Makes sense, and you're right that this does not need to be propagated inter-process.

I am trying to think to not add any other non-propagated property in the SpanContext, what if add this to the Span?

Yes, that would work: record it on Span, and copy across to SpanData for the exporter. I've updated the opencensus-go PR accordingly.

@codefromthecrypt
Copy link

codefromthecrypt commented Feb 13, 2019 via email

@axw
Copy link
Author

axw commented Feb 15, 2019

@bogdandrutu census-instrumentation/opencensus-go#1029 was merged; not my intention, but that's fine by me if you're happy with it. How should we proceed?

@tsloughter
Copy link
Member

Rather than extending the span data can't this be an attribute?

@axw
Copy link
Author

axw commented Feb 20, 2019

Rather than extending the span data can't this be an attribute?

Attributes aren't mentioned at https://opencensus.io/tracing/span/, but based on what I've gleaned from code, I understand that attributes are meant to describe properties of the operation that a span represents, and should not be required for describing causality or anything else crucial to the tracing system.

At least in the OpenCensus Go implementation, attributes are capped and can be dropped: https://github.com/census-instrumentation/opencensus-go/blob/57c09932883846047fd542903575671cb6b75070/trace/trace.go#L47-L49

@tsloughter
Copy link
Member

Yea, true, it is meant to to drop when a limit is reached. If it isn't a "good to have" to improve experience then I guess that isn't an option.

@tsloughter
Copy link
Member

Is this information used during the trace? Or only on the elastic backend when bringing everything together?

Spans can already be set that their parent is remote, not in the same process, which would signal that the span id for that span is the transaction id. It just wouldn't be available if needed when other children are creating/handling spans.

@felixbarny
Copy link

The API spec of the Elastic APM Server requires the transaction_id of the span: https://github.com/elastic/apm-server/blob/c9bf185f68ccb318d93897208a4611bc83d95afc/docs/spec/spans/span.json#L14

The UI needs that information in order to be able to see all spans of a transaction. This view is especially interesting to service owners which can use it to dig into why a particular transaction was slow, without the noise of the rest of the trace. This view also allows to "zoom out" and see the full trace.

Another thing where the entry span can be useful is to collect skeleton traces (Adrian already elaborated about this use case). Basically, you'd only send up the entry spans aka transactions. In order for the parent ids to make sense, the client spans don't propagate their ids but rather their transaction ids.

This request is not about adding support for skeleton traces, which may be a follow up for this. This is to enable a basic Elastic APM exporter. As mentioned, the APM Server requires the transaction_id. As currently there's no way to get the transaction id aka span id of the entry span, we're unable to integrate with OpenCensus.

@tsloughter
Copy link
Member

Yea, I wasn't suggesting it would fit the api today but what I thought was an alternative way for the Elastic APM Server to put this information together, instead of extending the opencensus span definition.

rghetia added a commit to rghetia/opencensus-go that referenced this issue Feb 20, 2019
rghetia added a commit to census-instrumentation/opencensus-go that referenced this issue Feb 20, 2019
This reverts commit 57c0993.

It was merged without updating the spec and resolving
census-instrumentation/opencensus-specs#229
@axw
Copy link
Author

axw commented Mar 7, 2019

After some more investigation, we've come up with an alternative that unblocks creating an exporter for Elastic APM: make the transaction ID optional for spans. This may mean loss of future functionality, but the initial impact will be minimal.

So this can be downgraded to "nice to have" for us.

@codefromthecrypt
Copy link

by the way, this is actively desired by zipkin users. I'm not sure why people are theoretical about this. It exists in multiple APMs systems and there seems to be a disbelief in the value. anyway I've seen you've reverted this, so we can add it to the "no idea why census' doesn't get it list" along with messaging spans

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
data-model Schema, attribute names, tracestate keys, metrics, units, etc.
Projects
None yet
Development

No branches or pull requests

6 participants