Skip to content

Conversation

@SylvainJuge
Copy link
Member

@SylvainJuge SylvainJuge commented Sep 29, 2021

Add specification for OpenTelemetry API agent bridges.

Approval from agents is optional here, only Java agent will use this in the short term to provide an OpenTelemetry bridge.

Implementation on APM server : elastic/apm-server#6259

Things to clarify

  • clarify: should we define all the vendor-specific attributes that we could support ? For example using attributes with OTel API user.id or span.type could allow to access vendor fields in Elastic agent protocol without direct API access. This mapping could also be handled on the server.
  • properly document unsupported features: events, span links, bagage, ...
  • Capturing events not supported, but exceptions reporting should be supported if possible OTel bridge spec #516 (comment)
  • mapping between OTel status and Elastic outcome: mapping at agent level, in apm-server for protocol-specific inference
  • add to spec that the feature should be configurable, disabled by default until proven mature and allows users to opt-in.
  • Should we implement an automatic fallback with labels for apm-server < 7.16 ? Yes, see comment for details.

@SylvainJuge SylvainJuge requested review from a team as code owners September 29, 2021 12:56
@ghost
Copy link

ghost commented Sep 29, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-03-14T05:52:00.851+0000

  • Duration: 7 min 31 sec

Co-authored-by: Benjamin Wohlwend <bw@piquadrat.ch>
SylvainJuge and others added 2 commits September 30, 2021 11:16
Co-authored-by: Felix Barnsteiner <felixbarny@users.noreply.github.com>
Co-authored-by: Felix Barnsteiner <felixbarny@users.noreply.github.com>
Co-authored-by: Trent Mick <trentm@gmail.com>
@SylvainJuge SylvainJuge marked this pull request as draft October 4, 2021 11:59
- Java Agent: Elastic active context is implemented as a thread-local stack
- Java OTel API: active context is implemented as a key-value map propagated through thread local

In order to avoid potentially complex and tedious synchronization issues between OTel and our existing agent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into a little trickiness when implementing the "if there's not a parent span start a transaction" behavior when working on the Node.js Bridge related to the context objects. I believe I have a path forward for the Node.js agent -- I'm mainly sharing this as background in case other folks run into it (and also forcing myself to say everything out loud in case there's some corner I'm not considering ;))

First -- there's no span builder object in any of the Node.js API or SDK code. Second, neither the startSpan or the startActiveSpan methods of the Node.js tracer allow you to pass in the span that should be a parent. Instead, this information is stored on the passed in open telemetry context object. Specifically, the intended pattern here is that you can fetch the span that should be the parent (i.e. the span that's active) by using the context's getSpanContext method.

You can see an example of this in the default Node.js SDK/tracer:

The becomes a bit complicated because, currently, this span context is initially set by whatever tracecontext propagator is currently being used

https://github.com/open-telemetry/opentelemetry-js/blob/04f9edd12fb2f42898bee7086691e47b3ab9b629/packages/opentelemetry-core/src/trace/W3CTraceContextPropagator.ts#L120

https://github.com/open-telemetry/opentelemetry-js/blob/04f9edd12fb2f42898bee7086691e47b3ab9b629/packages/opentelemetry-propagator-b3/src/B3SinglePropagator.ts#L86

https://github.com/open-telemetry/opentelemetry-js/blob/04f9edd12fb2f42898bee7086691e47b3ab9b629/packages/opentelemetry-propagator-b3/src/B3MultiPropagator.ts#L137

https://github.com/open-telemetry/opentelemetry-js/blob/04f9edd12fb2f42898bee7086691e47b3ab9b629/packages/opentelemetry-propagator-jaeger/src/JaegerPropagator.ts#L111

The trickiness came from the fact that we're not using these W3C propagators in the Node.js agent bridge -- trace context needs to be propagated when creating a transaction. This means that in cases where we're starting a transaction and assigning it to the ElasticOtelSpan object, we also need to set that span context on the open telemetry context object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that in cases where we're starting a transaction and assigning it to the ElasticOtelSpan object, we also need to set that span context on the open telemetry context object.

The idea for the bridges is that there's just a single context storage - the internal one of the agent.

The case you mentioned is handled by this part of the spec:

After activating an Elastic span via the agent's API, the [Context] returned via the [get current context API] should contain that Elastic span

What it means is that when ContextManager::active is called, the bridge implementation of the interface would create a Context on-the-fly which contains the Elastic span.

The trickiness came from the fact that we're not using these W3C propagators in the Node.js agent bridge -- trace context needs to be propagated when creating a transaction.

I don't think there's a need to implement a custom W3C propagators in the Node.js agent bridge. You can just rely on the provided implementation.

The becomes a bit complicated because, currently, this span context is initially set by whatever tracecontext propagator is currently being used

The trace context propagators don't alter the context storage. It's a bit confusing but trace.setSpanContext, which is called by the propagators, only creates an immutable copy of the provided Context and sets the span. In the case of W3CTraceContextPropagator, the span is a "fake" NonRecordingSpan. The whole purpose of this is to populate a Context object that has the trace ids from the traceparent header.

Personally, I find that part of the OTel API a bit bloated and a source of unnecessary ceremony and allocation.

Happy to jump on a zoom if that was unclear.

@basepi
Copy link
Contributor

basepi commented Jan 26, 2022

Have there been any thoughts on tracking uptake of these bridges in telemetry? Obviously it would have to wait for telemetry v2, but should we be setting an otel attribute or something so that we can know if a span/transaction was created via the bridge?

Edit: the Span Kind section assumes that the APM Server can tell that a Transaction/Span is from the otel bridge (as it is expected to infer the span_kind if it's not provided). How would the APM Server know, if there are no (optional) otel attributes on the event?

if (otel_span.remote_contex != null) {
span_or_transaction = createTransactionWithParent(otel_span.remote_context);
} else if (otel_span.parent == null) {
span_or_transaction = createRootTransaction();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The go otel library allows users to set NewRoot as an option when creating a new span (https://pkg.go.dev/go.opentelemetry.io/otel/trace#WithNewRoot). According to the documentation:

Any existing parent span context will be ignored when defining the Span's trace identifiers.

I'm assuming this would definitely create a new root transaction, even if otel_span.remote_contex != null

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That option seems to be specific to Go and I'm not intimately familiar with it. But I suppose you're right.

@felixbarny
Copy link
Member

How would the APM Server know, if there are no (optional) otel attributes on the event?

APM Server would only know if at least one attribute is set.

@basepi
Copy link
Contributor

basepi commented Feb 1, 2022

APM Server would only know if at least one attribute is set.

In that case, does it make sense to set an elastic-namespaced attribute so we always know if a span was created with opentelemetry? I think it would be really useful to be able to track this via telemetry v2 and I have no idea whether the average user would add any attributes or not, but I'd rather avoid that confounding variable altogether.

Edit: Seems like we can use the fact that otel spans will always have otel.span_kind set -- if the APM Server sees this field (prior to auto-inferring this field for non-otel spans), then it could set a field for telemetry to pick up later.

@basepi
Copy link
Contributor

basepi commented Feb 8, 2022

In the Python otel agent, Span.record_exception() has an attributes argument. Should we be putting these attributes on the Elastic exception object (presumably at otel.attributes)? Putting them in the exception context? Ignoring them?

Currently in the python implementation I'm just putting them in the context dictionary. But if we want them to be on the exception object at otel.attributes I need to add some more logic in the agent's internals.

Edit: chatted with @felixbarny, seems like there's no otel.attributes on the error intake. We should put them in labels instead.

Co-authored-by: Colton Myers <colton.myers@gmail.com>
@SylvainJuge SylvainJuge marked this pull request as ready for review February 24, 2022 15:01
OTel relies on key-value pairs for span attributes.
Keys and values are protocol-specific and are defined in [semantic convention](https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/trace/semantic_conventions) specification.

In order to minimize the mapping complexity in agents, most of the mapping between OTel attributes and agent protocol will be delegated to APM server:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, but looking at the inference algorithm (which is an additional one to one we already have), and result/outcome inference - it seems reasonable to reconsider moving more login from agents to the server.
Not blocking this PR of course, but I think something to reconsider

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you, being able to delegate all of the mapping and only implement it once would be great.

The inference algorithm on the agent is only required for the "essential" fields as they are reused by other agent features.

  • type (and subtype) are required for breakdown metrics: making this feature otel-compliant seems challenging.
  • outcome is required as we map it from otel status and don't store it as-is. (in hindsight maybe copying this one to otel.status next to otel.attributes would have been a good idea.

@SylvainJuge
Copy link
Member Author

As discussed in last week agents meeting, it is now time to merge this PR.
If there is no comment or vote against this in the next couple of days I'll merge it as-is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[META 524] Spec for OTel APM Bridge