OTel bridge spec #516
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,277 @@ | ||
| ## OpenTelemetry API (Tracing) | ||
|
|
||
| [OpenTelemetry](https://opentelemetry.io) (OTel for short) provides a vendor-neutral API for capturing traces, logs and metrics. | ||
|
|
||
| Agents MAY provide a bridge implementation of the OpenTelemetry Tracing API following this specification. | ||
| When available, the implementation MUST be configurable and should be disabled by default when marked as `experimental`. | ||
|
|
||
| The bridge implementation relies on APM Server version 7.16 or later. Agents SHOULD recommend this minimum version to users in bridge documentation. | ||
|
|
||
| Bridging here means that for each OTel span created with the API, a native span/transaction will be created and sent to the APM server. | ||
|
|
||
| ### User experience | ||
|
|
||
| At a high level, from the perspective of the application code, using the OTel bridge should not differ from using the | ||
| OTel API for tracing. See [limitations](#limitations) below for details on the currently unsupported OTel features. | ||
| For tracing the support should include: | ||
| - creating spans with attributes | ||
| - context propagation | ||
| - capturing errors | ||
|
|
||
| The aim of the bridge is to allow any application/library that is instrumented with the OTel API to seamlessly | ||
| delegate its OTel spans to Elastic APM spans/transactions. It also provides a vendor-neutral alternative to any existing | ||
| manual agent API with similar features. | ||
|
|
||
| One major difference, though: because the implementation of the OTel API is delegated to the Elastic APM agent, any | ||
| OTel configuration that might be present in the application code (OTel processor pipeline) or deployment | ||
| (env. variables) will be ignored. | ||
|
|
||
| ### Limitations | ||
|
|
||
| The OTel API/specification goes beyond tracing; as a result, the following OTel features are not supported: | ||
| - metrics | ||
| - logs | ||
| - span events | ||
| - span links | ||
|
|
||
| ### Spans and Transactions | ||
|
|
||
| OTel only defines Spans, whereas Elastic APM relies on both Spans and Transactions. | ||
| OTel allows users to provide the _remote context_ when creating a span, which is equivalent to providing a parent to a transaction or span. | ||
| It also allows users to provide a (local) parent span. | ||
|
|
||
| As a result, when creating Spans through the OTel API with a bridge, agents must implement the following algorithm: | ||
|
|
||
| ```javascript | ||
| // otel_span contains the properties set through the OTel API | ||
| span_or_transaction = null; | ||
| if (otel_span.remote_context != null) { | ||
| span_or_transaction = createTransactionWithParent(otel_span.remote_context); | ||
| } else if (otel_span.parent == null) { | ||
| span_or_transaction = createRootTransaction(); | ||
|
Contributor
@SylvainJuge @felixbarny Do we want/need to spec any behavior about how the bridge will interact with existing instrumentation? I'm not sure what the situation is in java-land, but here's a few use cases that outline why I'm asking. In the Node.js agent we automatically start transactions whenever there's an HTTP(S) request being handled regardless of the framework (express, hapi, fastify), and I've been thinking we might want/need to suppress this default behavior if we're also going to start a transaction due to an OTel span being created. (otherwise that's two transactions per request being handled) Similarly, there's a question of what to do when there's an OTel instrumentation package that we also instrument. The Node.js agent has instrumentation for the
Member
Author
In Java, there is a clear distinction between OTel API (a regular java library) and the OTel instrumentation (also known as OTel instrumentation agent), thus:
If there is no such distinction between API and instrumentation on other platforms, we should then probably clarify in the spec which instrumentation has priority. If that would be the case, it also means that we can't distinguish usages of OTel API to instrument custom spans from OTel instrumentation, and that we would just disable our own instrumentations and replace them with OTel ones. This would go way beyond the "create a bridge for the OTel API" goal, thus I hope it's not the case here.
Contributor
The go otel library allows users to set
I'm assuming this would definitely create a new root transaction, even if
Member
That option seems to be specific to Go and I'm not intimately familiar with it. But I suppose you're right. |
||
| } else { | ||
| span_or_transaction = createSpanWithParent(otel_span.parent); | ||
| } | ||
| ``` | ||
|
|
||
| ### Span Kind | ||
|
|
||
| OTel spans have a `SpanKind` property ([specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#spankind)) which is close but not strictly equivalent to our definition of spans and transactions. | ||
|
|
||
| For both transactions and spans, an optional `otel.span_kind` property will be provided by agents when set through | ||
| the OTel API. | ||
| This value should be stored into Elasticsearch documents to preserve OTel semantics and help future OTel integration. | ||
|
|
||
| Possible values are `CLIENT`, `SERVER`, `PRODUCER`, `CONSUMER` and `INTERNAL`; refer to the [specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#spankind) for details on semantics. | ||
|
|
||
| By default, OTel spans have their `SpanKind` set to `INTERNAL` by the OTel API implementation, so it is assumed to always be provided when using the bridge. | ||
|
|
||
| For existing agents without OTel bridge or for data captured without the bridge, the APM server has to infer the value of `otel.span_kind` with the following algorithm: | ||
|
|
||
| ```javascript | ||
| span_kind = null; | ||
| if (isTransaction(item)) { | ||
| if (item.type == "messaging") { | ||
| span_kind = "CONSUMER"; | ||
| } else if (item.type == "request") { | ||
| span_kind = "SERVER"; | ||
| } | ||
| } else { | ||
| // span | ||
| if (item.type == "external" || item.type == "storage" || item.type == "db") { | ||
| span_kind = "CLIENT"; | ||
| } | ||
| } | ||
|
|
||
| if (span_kind == null) { | ||
| span_kind = "INTERNAL"; | ||
| } | ||
|
|
||
| ``` | ||
|
|
||
| Although optional, inferring the value of `otel.span_kind` helps to keep the data model closer to the OTel specification, even if the original data was sent using the native agent protocol. | ||
|
|
||
| ### Span status | ||
|
|
||
| OTel spans have a [Status](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status) | ||
| field to indicate the status of the underlying task they represent. | ||
|
|
||
| When [Set Status](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status) is used on the OTel API, we can map it directly to `span.outcome`: | ||
| - OK => Success | ||
| - Error => Failure | ||
| - Unset (default) => Unknown | ||
|
|
||
| However, when the outcome is not provided explicitly, agents can infer it from the presence of a reported error. | ||
| This inference is not expected with the OTel API, which has an explicit status; thus bridged spans/transactions should NOT have their outcome | ||
| altered by the reporting (or lack of reporting) of an error. Here the behavior should be identical to when the end-user provides | ||
| the outcome explicitly, which takes priority over the inferred value. | ||
|
|
||
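The status mapping and its priority over error-based inference can be sketched as follows. This is a minimal illustration rather than an agent implementation: `outcomeFromOtelStatus`, `resolveOutcome` and the fields of the `item` object are hypothetical names, the lowercase outcome strings are an assumption, and the native fallback (`failure` when an error was reported, `success` otherwise) is a simplification.

```javascript
// Hypothetical helper: maps an OTel status code to an Elastic outcome.
// Status names follow the list above and are matched case-insensitively;
// the lowercase outcome strings are an assumption of this sketch.
function outcomeFromOtelStatus(statusCode) {
  switch (String(statusCode).toUpperCase()) {
    case 'OK':
      return 'success';
    case 'ERROR':
      return 'failure';
    default:
      // 'UNSET', or no status at all
      return 'unknown';
  }
}

// Hypothetical helper: decides the final outcome of a span/transaction.
function resolveOutcome(item) {
  if (item.isBridged) {
    // Bridged items: the OTel status always wins, reported errors are ignored.
    return outcomeFromOtelStatus(item.otelStatus);
  }
  if (item.explicitOutcome !== undefined) {
    // Native items: an outcome set by the end-user has priority.
    return item.explicitOutcome;
  }
  // Native items without explicit outcome: simplified error-based inference.
  return item.hasReportedError ? 'failure' : 'success';
}

// A bridged span with a reported error but no status keeps an unknown outcome.
console.log(resolveOutcome({ isBridged: true, otelStatus: 'UNSET', hasReportedError: true }));
// A native span with a reported error gets its outcome inferred.
console.log(resolveOutcome({ isBridged: false, hasReportedError: true }));
```

The point of the sketch is the first branch: for bridged items, the presence or absence of a reported error never reaches the outcome.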
| ### Attributes mapping | ||
|
|
||
| OTel relies on key-value pairs for span attributes. | ||
| Keys and values are protocol-specific and are defined in the [semantic conventions](https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/trace/semantic_conventions) specification. | ||
|
|
||
| In order to minimize the mapping complexity in agents, most of the mapping between OTel attributes and agent protocol will be delegated to APM server: | ||
|
Contributor
Makes sense, but looking at the inference algorithm (which is an additional one to one we already have), and result/outcome inference - it seems reasonable to reconsider moving more logic from agents to the server.
Member
Author
I agree with you, being able to delegate all of the mapping and only implement it once would be great. The inference algorithm on the agent is only required for the "essential" fields as they are reused by other agent features.
|
||
| - All OTel span attributes should be captured as-is and written to the agent protocol. | ||
| - APM server will handle the mapping between OTel attributes and their native transaction/span equivalents. | ||
| - Some native span/transaction attributes will still require mapping within agents for [compatibility with existing features](#compatibility-mapping). | ||
|
|
||
| OpenTelemetry attributes should be stored in `otel.attributes` as a flat key-value pair mapping added to `span` and `transaction` objects: | ||
| ```json | ||
| { | ||
| // [...] other span/transaction attributes | ||
| "otel": { | ||
| "span_kind": "CLIENT", | ||
| "attributes": { | ||
| "db.system": "mysql", | ||
| "db.statement": "SELECT * from table_1" | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| Starting with version 7.16, APM server must provide a mapping that is equivalent to the native OpenTelemetry Protocol (OTLP) intake for the | ||
| fields provided in `otel.attributes`. | ||
|
|
||
| When sending data to an APM server version earlier than 7.16, agents MAY use span and transaction labels as a fallback to store OTel attributes and avoid dropping information. | ||
|
|
||
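The version-dependent storage of OTel attributes can be sketched as follows. This is only an illustration under stated assumptions: `supportsOtelAttributes` and `storeOtelAttributes` are hypothetical helpers, the server version is assumed to be known to the agent as a `[major, minor]` pair, and the replacement of `.`, `*` and `"` in label keys is an assumption about label key restrictions rather than something this specification defines.

```javascript
// Hypothetical helper: true when the APM server version is 7.16 or later.
function supportsOtelAttributes([major, minor]) {
  return major > 7 || (major === 7 && minor >= 16);
}

// Hypothetical helper: writes OTel attributes on a span/transaction object,
// either to `otel.attributes` or, as a fallback, to labels.
function storeOtelAttributes(item, attributes, serverVersion) {
  if (supportsOtelAttributes(serverVersion)) {
    item.otel = item.otel || {};
    item.otel.attributes = { ...attributes };
  } else {
    item.labels = item.labels || {};
    for (const [key, value] of Object.entries(attributes)) {
      // Assumption: label keys must not contain '.', '*' or '"'.
      item.labels[key.replace(/[.*"]/g, '_')] = value;
    }
  }
  return item;
}

const attributes = { 'db.system': 'mysql', 'db.statement': 'SELECT * from table_1' };
console.log(storeOtelAttributes({}, attributes, [7, 16]));
console.log(storeOtelAttributes({}, attributes, [7, 15]));
```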
| ### Compatibility mapping | ||
|
|
||
| Agents should ensure compatibility with the following features: | ||
| - breakdown metrics | ||
| - [dropped spans statistics](handling-huge-traces/tracing-spans-dropped-stats.md) | ||
| - [compressed spans](handling-huge-traces/tracing-spans-compress.md) | ||
|
|
||
| As a consequence, agents must provide values for the following attributes: | ||
| - `transaction.name` or `span.name` : value directly provided by OTel API | ||
| - `transaction.type` : see inference algorithm below | ||
| - `span.type` and `span.subtype` : see inference algorithm below | ||
| - `span.destination.service.resource` : see inference algorithm below | ||
|
|
||
| #### Transaction type | ||
|
|
||
| ```javascript | ||
| a = transaction.otel.attributes; | ||
| span_kind = transaction.otel.span_kind; | ||
| isRpc = a['rpc.system'] !== undefined; | ||
| isHttp = a['http.url'] !== undefined || a['http.scheme'] !== undefined; | ||
| isMessaging = a['messaging.system'] !== undefined; | ||
| if (span_kind == 'SERVER' && (isRpc || isHttp)) { | ||
| type = 'request'; | ||
| } else if (span_kind == 'CONSUMER' && isMessaging) { | ||
| type = 'messaging'; | ||
| } else { | ||
| type = 'unknown'; | ||
| } | ||
| ``` | ||
|
|
||
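As a worked example, the transaction type inference can be wrapped in a function and exercised with a few attribute sets. The function name `inferTransactionType` and the sample inputs are hypothetical; the logic is meant to follow the algorithm above without adding rules.

```javascript
// Hypothetical wrapper around the transaction type inference algorithm.
function inferTransactionType(transaction) {
  const otel = transaction.otel || {};
  const a = otel.attributes || {};
  const isRpc = a['rpc.system'] !== undefined;
  const isHttp = a['http.url'] !== undefined || a['http.scheme'] !== undefined;
  const isMessaging = a['messaging.system'] !== undefined;

  if (otel.span_kind === 'SERVER' && (isRpc || isHttp)) {
    return 'request';
  }
  if (otel.span_kind === 'CONSUMER' && isMessaging) {
    return 'messaging';
  }
  return 'unknown';
}

// Incoming HTTP request handled by a server span
console.log(inferTransactionType({
  otel: { span_kind: 'SERVER', attributes: { 'http.scheme': 'https' } },
})); // request

// Message received by a consumer span
console.log(inferTransactionType({
  otel: { span_kind: 'CONSUMER', attributes: { 'messaging.system': 'kafka' } },
})); // messaging

// Span kind and attributes do not match any rule
console.log(inferTransactionType({
  otel: { span_kind: 'INTERNAL', attributes: { 'http.scheme': 'https' } },
})); // unknown
```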
| #### Span type, sub-type and destination service resource | ||
|
|
||
| ```javascript | ||
| a = span.otel.attributes; | ||
| type = undefined; | ||
| subtype = undefined; | ||
| resource = undefined; | ||
|
|
||
| httpPortFromScheme = function (scheme) { | ||
| if ('http' == scheme) { | ||
| return 80; | ||
| } else if ('https' == scheme) { | ||
| return 443; | ||
| } | ||
| return -1; | ||
| } | ||
|
|
||
| // extracts 'host' or 'host:port' from URL | ||
| parseNetName = function (url) { | ||
| var u = new URL(url); // https://developer.mozilla.org/en-US/docs/Web/API/URL | ||
| if (u.port != '') { | ||
| return u.host; // host:port already in URL | ||
| } else { | ||
| var port = httpPortFromScheme(u.protocol.substring(0, u.protocol.length - 1)); | ||
| return port > 0 ? u.host + ':'+ port : u.host; | ||
| } | ||
| } | ||
|
|
||
| peerPort = a['net.peer.port']; | ||
| netName = a['net.peer.name'] || a['net.peer.ip']; | ||
|
|
||
| if (netName && peerPort > 0) { | ||
| netName += ':'; | ||
| netName += peerPort; | ||
| } | ||
|
|
||
| if (a['db.system']) { | ||
| type = 'db' | ||
| subtype = a['db.system']; | ||
| resource = netName || subtype; | ||
| if (a['db.name']) { | ||
| resource += '/' | ||
| resource += a['db.name']; | ||
| } | ||
|
|
||
| } else if (a['messaging.system']) { | ||
| type = 'messaging'; | ||
| subtype = a['messaging.system']; | ||
|
|
||
| if (!netName && a['messaging.url']) { | ||
| netName = parseNetName(a['messaging.url']); | ||
| } | ||
| resource = netName || subtype; | ||
| if (a['messaging.destination']) { | ||
| resource += '/'; | ||
| resource += a['messaging.destination']; | ||
| } | ||
|
|
||
| } else if (a['rpc.system']) { | ||
| type = 'external'; | ||
| subtype = a['rpc.system']; | ||
| resource = netName || subtype; | ||
| if (a['rpc.service']) { | ||
| resource += '/'; | ||
| resource += a['rpc.service']; | ||
| } | ||
|
|
||
| } else if (a['http.url'] || a['http.scheme']) { | ||
| type = 'external'; | ||
| subtype = 'http'; | ||
|
|
||
| if (a['http.host'] && a['http.scheme']) { | ||
| resource = a['http.host'] + ':' + httpPortFromScheme(a['http.scheme']); | ||
| } else if (a['http.url']) { | ||
| resource = parseNetName(a['http.url']); | ||
| } | ||
| } | ||
|
|
||
| if (type === undefined) { | ||
| if (span.otel.span_kind == 'INTERNAL') { | ||
| type = 'app'; | ||
| subtype = 'internal'; | ||
| } else { | ||
| type = 'unknown'; | ||
| } | ||
| } | ||
| span.type = type; | ||
| span.subtype = subtype; | ||
| span.destination.service.resource = resource; | ||
| ``` | ||
|
|
||
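To illustrate the host and port handling used for the destination resource, the following sketch isolates `parseNetName` and runs it on a few sample URLs. It relies on the WHATWG `URL` class, where `host` includes an explicit port and `hostname` does not, so `host` is returned when the URL carries a port. The sample URLs are hypothetical.

```javascript
// Default port for the HTTP schemes, -1 when the scheme is not known.
function httpPortFromScheme(scheme) {
  if (scheme === 'http') {
    return 80;
  } else if (scheme === 'https') {
    return 443;
  }
  return -1;
}

// Extracts 'host' or 'host:port' from a URL.
function parseNetName(url) {
  const u = new URL(url);
  if (u.port !== '') {
    // Explicit, non-default port: `host` already has the 'host:port' form.
    return u.host;
  }
  // `protocol` ends with ':', strip it to get the scheme.
  const port = httpPortFromScheme(u.protocol.slice(0, -1));
  return port > 0 ? u.hostname + ':' + port : u.hostname;
}

console.log(parseNetName('http://example.com:8080/api')); // example.com:8080
console.log(parseNetName('https://example.com/api'));     // example.com:443
console.log(parseNetName('amqp://broker.local/orders'));  // broker.local
```

Note that `URL` drops a port equal to the scheme default (for example `https://example.com:443/`), which is why the default port is added back from the scheme in the second branch.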
| ### Active Spans and Context | ||
|
|
||
| When possible, the bridge implementation MUST ensure proper interoperability between Elastic transactions/spans and OTel spans when | ||
| used from their respective APIs: | ||
| - After activating an Elastic span via the agent's API, the [`Context`] returned via the [get current context API] should contain that Elastic span. | ||
| - When an OTel context is [attached] (aka activated), the [get current context API] should return the same [`Context`] instance. | ||
| - Starting an OTel span in the scope of an active Elastic span should make the OTel span a child of the Elastic span. | ||
| - Starting an Elastic span in the scope of an active OTel span should make the Elastic span a child of the OTel span. | ||
|
|
||
| [`Context`]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/context.md | ||
| [attached]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/context.md#attach-context | ||
| [get current context API]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/context.md#get-current-context | ||
|
|
||
| Both OTel and our agents have their own definition of what "active context" is, for example: | ||
| - Java Agent: Elastic active context is implemented as a thread-local stack | ||
| - Java OTel API: active context is implemented as a key-value map propagated through thread local | ||
|
|
||
| In order to avoid potentially complex and tedious synchronization issues between OTel and our existing agent | ||
|
Contributor
I ran into a little trickiness when implementing the "if there's not a parent span start a transaction" behavior when working on the Node.js Bridge related to the context objects. I believe I have a path forward for the Node.js agent -- I'm mainly sharing this as background in case other folks run into it (and also forcing myself to say everything out loud in case there's some corner I'm not considering ;)) First -- there's no span builder object in any of the Node.js API or SDK code. Second, neither the You can see an example of this in the default Node.js SDK/tracer: This becomes a bit complicated because, currently, this span context is initially set by whatever tracecontext propagator is currently being used. The trickiness came from the fact that we're not using these W3C propagators in the Node.js agent bridge -- trace context needs to be propagated when creating a transaction. This means that in cases where we're starting a transaction and assigning it to the ElasticOtelSpan object, we also need to set that span context on the open telemetry context object.
Member
The idea for the bridges is that there's just a single context storage - the internal one of the agent. The case you mentioned is handled by this part of the spec:
What it means is that when
I don't think there's a need to implement a custom W3C propagators in the Node.js agent bridge. You can just rely on the provided implementation.
The trace context propagators don't alter the context storage. It's a bit confusing but Personally, I find that part of the OTel API a bit bloated and a source of unnecessary ceremony and allocation. Happy to jump on a zoom if that was unclear. |
||
| implementations, the bridge implementation SHOULD provide an abstraction to have a single "active context" storage. | ||
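
A single "active context" storage shared by both APIs can be sketched as follows. This is a deliberately simplified, synchronous illustration: every name in it (`ActiveContextStore`, `elasticApi`, `otelApi`, `ROOT_CONTEXT`) is hypothetical, spans are plain objects, and a real agent would rely on its own context propagation mechanism instead of a plain stack.

```javascript
// Hypothetical empty context returned when nothing is active.
const ROOT_CONTEXT = Object.freeze({ span: null });

// Single storage used by both the Elastic and the OTel facing APIs.
class ActiveContextStore {
  constructor() {
    this.stack = [];
  }
  current() {
    return this.stack.length > 0 ? this.stack[this.stack.length - 1] : ROOT_CONTEXT;
  }
  // Activates a context and returns a function that deactivates it.
  activate(context) {
    this.stack.push(context);
    return () => {
      const index = this.stack.lastIndexOf(context);
      if (index !== -1) this.stack.splice(index, 1);
    };
  }
}

const store = new ActiveContextStore();

// Elastic-facing API: activating a span wraps it in a context.
const elasticApi = {
  startSpan: (name) => ({ name, parent: store.current().span }),
  activate: (span) => store.activate({ span }),
};

// OTel-facing API: attach/get current context delegate to the same store.
const otelApi = {
  startSpan: (name) => ({ name, parent: store.current().span }),
  contextWithSpan: (span) => ({ span }),
  attach: (context) => store.activate(context),
  getCurrentContext: () => store.current(),
};

// Elastic span active, then an OTel span started in its scope.
const tx = elasticApi.startSpan('elastic-transaction');
const deactivateTx = elasticApi.activate(tx);
const otelChild = otelApi.startSpan('otel-child'); // child of tx

// OTel context attached, then an Elastic span started in its scope.
const ctx = otelApi.contextWithSpan(otelChild);
const detach = otelApi.attach(ctx);
const sameInstance = otelApi.getCurrentContext() === ctx; // same Context instance
const elasticGrandchild = elasticApi.startSpan('elastic-grandchild'); // child of otelChild

detach();
deactivateTx();
console.log(otelChild.parent === tx, elasticGrandchild.parent === otelChild, sameInstance);
```

Because both APIs read and write the same store, none of the four interoperability rules listed earlier needs any synchronization between two separate notions of "active".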