Merged
29 commits
08688fa
add OpenTracing spec from apm#32
SylvainJuge Sep 29, 2021
4932d5a
add otel bridge spec
SylvainJuge Sep 29, 2021
5831f7b
cleanup + add label fallback
SylvainJuge Sep 29, 2021
b94c05b
add some clarification on server-side mapping
SylvainJuge Sep 29, 2021
071253e
fix wording specs/agents/tracing-api.md
SylvainJuge Sep 29, 2021
27d79c1
extend spec with fallbacks + context activations
SylvainJuge Sep 29, 2021
e72ad6d
Apply suggestions from code review
SylvainJuge Sep 30, 2021
1a59b14
remove opentracing from spec
SylvainJuge Sep 30, 2021
e8e008b
Apply suggestions from code review
SylvainJuge Oct 1, 2021
bb3b8ae
Fix typo specs/agents/tracing-api-otel.md
SylvainJuge Oct 4, 2021
fcabe0b
Update tracing-api-otel.md
stuartnelson3 Oct 11, 2021
d6a9e98
add span type 'db' to spec
SylvainJuge Oct 11, 2021
7318a1b
add type,subtype & resource algorithm
SylvainJuge Nov 2, 2021
ac510ac
add sub-sections for algorithms
SylvainJuge Nov 2, 2021
97b6ec0
active context impl
SylvainJuge Nov 2, 2021
93a4aa5
add gherkin spec
SylvainJuge Nov 2, 2021
3701625
Update specs/agents/README.md
SylvainJuge Nov 15, 2021
433e2c0
Update specs/agents/tracing-api-otel.md
SylvainJuge Nov 15, 2021
d934065
Merge branch 'master' of github.com:elastic/apm into otel-bridge-spec
SylvainJuge Dec 7, 2021
5a0cd1b
add status mapping + configurability
SylvainJuge Dec 7, 2021
a0aba19
Merge branch 'otel-bridge-spec' of github.com:SylvainJuge/apm into ot…
SylvainJuge Dec 7, 2021
c02707f
update gherkin spec
SylvainJuge Dec 7, 2021
a376d5e
add a few clarifications
SylvainJuge Feb 7, 2022
d1ac141
Merge branch 'main' of github.com:elastic/apm into otel-bridge-spec
SylvainJuge Feb 7, 2022
da7447a
clarify user-experience
SylvainJuge Feb 7, 2022
7b9e59b
clarify bridge limitations
SylvainJuge Feb 7, 2022
a476547
MAY use labels for server < 7.16
SylvainJuge Feb 9, 2022
c9e3004
Update specs/agents/tracing-api-otel.md
SylvainJuge Feb 10, 2022
b195414
clarify error capture + supported features
SylvainJuge Feb 23, 2022
1 change: 1 addition & 0 deletions specs/agents/README.md
@@ -55,6 +55,7 @@ You can find details about each of these in the [APM Data Model](https://www.ela
- [Messaging systems](tracing-instrumentation-messaging.md)
- [gRPC](tracing-instrumentation-grpc.md)
- [GraphQL](tracing-instrumentation-graphql.md)
- [OpenTelemetry API Bridge](tracing-api-otel.md)
- [Error/exception tracking](error-tracking.md)
- [Metrics](metrics.md)
- [Logging Correlation](log-correlation.md)
277 changes: 277 additions & 0 deletions specs/agents/tracing-api-otel.md
@@ -0,0 +1,277 @@
## OpenTelemetry API (Tracing)

[OpenTelemetry](https://opentelemetry.io) (OTel for short) provides a vendor-neutral API for capturing traces, logs and metrics data.

Agents MAY provide a bridge implementation of OpenTelemetry Tracing API following this specification.
When available, the implementation MUST be configurable and should be disabled by default when marked as `experimental`.

The bridge implementation relies on APM Server version 7.16 or later. Agents SHOULD recommend this minimum version to users in bridge documentation.

Bridging here means that for each OTel span created with the API, a native span/transaction will be created and sent to APM server.

### User experience

At a high level, from the perspective of the application code, using the OTel bridge should not differ from using the
OTel API for tracing. See [limitations](#limitations) below for details on the currently unsupported OTel features.
For tracing, the support should include:
- creating spans with attributes
- context propagation
- capturing errors

The aim of the bridge is to allow any application or library that is instrumented with the OTel API to have its OTel spans
seamlessly delegated to Elastic APM spans/transactions. It also provides a vendor-neutral alternative to any existing
manual agent API with similar features.

One major difference, though: because the implementation of the OTel API is delegated to the Elastic APM agent, any
OTel configuration present in the application code (such as an OTel processor pipeline) or in the deployment
(environment variables) is ignored.

### Limitations

The OTel API/specification goes beyond tracing. As a result, the following OTel features are not supported:
- metrics
- logs
- span events
- span links

### Spans and Transactions

OTel only defines Spans, whereas Elastic APM relies on both Spans and Transactions.
OTel allows users to provide the _remote context_ when creating a span, which is equivalent to providing a parent to a transaction or span;
it also allows users to provide a (local) parent span.

As a result, when creating Spans through OTel API with a bridge, agents must implement the following algorithm:

```javascript
// otel_span contains the properties set through the OTel API
span_or_transaction = null;
if (otel_span.remote_context != null) {
    span_or_transaction = createTransactionWithParent(otel_span.remote_context);
} else if (otel_span.parent == null) {
    span_or_transaction = createRootTransaction();
Contributor

@SylvainJuge @felixbarny Do we want/need to spec any behavior about how the bridge will interact with existing instrumentation?

I'm not sure what the situation is in java-land, but here's a few use cases that outline why I'm asking.

In the Node.js agent we automatically start transactions whenever there's an HTTP(S) request being handled regardless of the framework (express, hapi, fastify), and I've been thinking we might want/need to suppress this default behavior if we're also going to start a transaction due to an oTel span being created. (otherwise that's two transactions per request being handled)

Similarly, there's a question of what to do when there's an oTel instrumentation package that we also instrument. The Node.js agent has instrumentation for the express web framework, but OTel also has instrumentation for the express web framework. What should happen when the Node.js is instrumenting an express application and the bridge is active?

Member Author

In Java, there is a clear distinction between OTel API (a regular java library) and the OTel instrumentation (also known as OTel instrumentation agent), thus:

  • the OTel bridge only makes the OTel API delegate to our own tracing implementation
  • when creating a bridge, there is no such "double instrumentation" as the existing instrumentations are part of the instrumentation agent and are not included in our agent.
  • reusing OTel instrumentations is not part of the current proposal, but might be in the future.

If there is no such distinction between API and instrumentation on other platforms, we should then probably clarify in the spec which instrumentation has priority. If that would be the case, it also means that we can't distinguish usages of OTel API to instrument custom spans from OTel instrumentation, and that we would just disable our own instrumentations and replace them with OTel ones. This would go way beyond the "create a bridge for the OTel API" goal, thus I hope it's not the case here.

Contributor

The go otel library allows users to set NewRoot as an option when creating a new span (https://pkg.go.dev/go.opentelemetry.io/otel/trace#WithNewRoot). According to the documentation:

> Any existing parent span context will be ignored when defining the Span's trace identifiers.

I'm assuming this would definitely create a new root transaction, even if otel_span.remote_context != null

Member

That option seems to be specific to Go and I'm not intimately familiar with it. But I suppose you're right.

} else {
    span_or_transaction = createSpanWithParent(otel_span.parent);
}
```
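
The algorithm above can be made runnable with a few stand-ins. In this sketch, `createTransactionWithParent`, `createRootTransaction` and `createSpanWithParent` are hypothetical stubs that only return plain objects, and the shape of `otelSpan` (`remote_context`, `parent`) is assumed from the pseudocode rather than taken from any agent's API.

```javascript
// Hypothetical stubs standing in for the agent's native API.
function createTransactionWithParent(remoteContext) {
  return { kind: 'transaction', parent: remoteContext };
}
function createRootTransaction() {
  return { kind: 'transaction', parent: null };
}
function createSpanWithParent(parent) {
  return { kind: 'span', parent: parent };
}

// Decides whether an OTel span is bridged to a transaction or to a span.
function bridgeOtelSpan(otelSpan) {
  if (otelSpan.remote_context != null) {
    // A remote context always starts a transaction that continues the trace.
    return createTransactionWithParent(otelSpan.remote_context);
  } else if (otelSpan.parent == null) {
    // No remote context and no local parent: start a new trace.
    return createRootTransaction();
  }
  return createSpanWithParent(otelSpan.parent);
}

const root = bridgeOtelSpan({});
const continued = bridgeOtelSpan({ remote_context: { traceId: 'abc' } });
const child = bridgeOtelSpan({ parent: root });
console.log(root.kind, continued.kind, child.kind);
// prints "transaction transaction span"
```

Note that the remote context is checked first, so it wins over a local parent when both are set.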

### Span Kind

OTel spans have a `SpanKind` property ([specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#spankind)) which is close but not strictly equivalent to our definition of spans and transactions.

For both transactions and spans, an optional `otel.span_kind` property will be provided by agents when set through
the OTel API.
This value should be stored into Elasticsearch documents to preserve OTel semantics and help future OTel integration.

Possible values are `CLIENT`, `SERVER`, `PRODUCER`, `CONSUMER` and `INTERNAL`; refer to the [specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#spankind) for details on semantics.

By default, OTel spans have their `SpanKind` set to `INTERNAL` by OTel API implementation, so it is assumed to always be provided when using the bridge.

For existing agents without OTel bridge or for data captured without the bridge, the APM server has to infer the value of `otel.span_kind` with the following algorithm:

```javascript
span_kind = null;
if (isTransaction(item)) {
    if (item.type == "messaging") {
        span_kind = "CONSUMER";
    } else if (item.type == "request") {
        span_kind = "SERVER";
    }
} else {
    // span
    if (item.type == "external" || item.type == "storage" || item.type == "db") {
        span_kind = "CLIENT";
    }
}

if (span_kind == null) {
    span_kind = "INTERNAL";
}
```

While being optional, inferring the value of `otel.span_kind` helps to keep the data model closer to the OTel specification, even if the original data was sent using the native agent protocol.

### Span status

OTel spans have a [Status](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status)
field to indicate the status of the underlying task they represent.

When the [Set Status](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status) operation of the OTel API is used, we can map it directly to `span.outcome`:
- OK => Success
- Error => Failure
- Unset (default) => Unknown

However, when the outcome is not set explicitly, agents can infer it from the presence of a reported error.
This inference is not expected when the OTel API status is used, so bridged spans/transactions should NOT have their outcome
altered by the reporting (or lack of reporting) of an error. The behavior should be identical to when the end-user provides
the outcome explicitly: the mapped status takes priority over any inferred value.
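
As an illustration of the mapping and of the priority rule, here is a minimal sketch. The function names and the span shape are hypothetical, and the lowercase outcome strings are an assumption about how the values are serialized.

```javascript
// Maps an OTel status code to an outcome value.
// The lowercase outcome strings are an assumption of this sketch.
const OUTCOME_BY_STATUS = { OK: 'success', ERROR: 'failure', UNSET: 'unknown' };

function outcomeFromOtelStatus(statusCode) {
  // Anything unrecognized (including a missing status) is treated as unset.
  return OUTCOME_BY_STATUS[String(statusCode).toUpperCase()] || 'unknown';
}

// A bridged span keeps the outcome derived from its status, even when an
// error is reported on it afterwards (hypothetical span shape).
function reportError(bridgedSpan, error) {
  bridgedSpan.errors.push(error);
  // Intentionally no change to bridgedSpan.outcome here.
}

const span = { outcome: outcomeFromOtelStatus('Unset'), errors: [] };
reportError(span, new Error('boom'));
console.log(span.outcome); // prints "unknown"
```

A native (non-bridged) span would typically switch to a failure outcome in `reportError`; the bridge deliberately skips that step.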

### Attributes mapping

OTel relies on key-value pairs for span attributes.
Keys and values are protocol-specific and are defined in the [semantic conventions](https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/trace/semantic_conventions) specification.

In order to minimize the mapping complexity in agents, most of the mapping between OTel attributes and agent protocol will be delegated to APM server:
Contributor

Makes sense, but looking at the inference algorithm (which is an additional one to the one we already have), and result/outcome inference - it seems reasonable to reconsider moving more logic from agents to the server.
Not blocking this PR of course, but I think something to reconsider

Member Author

I agree with you, being able to delegate all of the mapping and only implement it once would be great.

The inference algorithm on the agent is only required for the "essential" fields as they are reused by other agent features.

  • type (and subtype) are required for breakdown metrics: making this feature otel-compliant seems challenging.
  • outcome is required as we map it from otel status and don't store it as-is (in hindsight, maybe copying this one to otel.status next to otel.attributes would have been a good idea).

- All OTel span attributes should be captured as-is and written to agent protocol.
- APM server will handle the mapping between OTel attributes and their native transaction/spans equivalents
- Some native span/transaction attributes will still require mapping within agents for [compatibility with existing features](#compatibility-mapping)

OpenTelemetry attributes should be stored in `otel.attributes` as a flat key-value pair mapping added to `span` and `transaction` objects:
```json
{
    // [...] other span/transaction attributes
    "otel": {
        "span_kind": "CLIENT",
        "attributes": {
            "db.system": "mysql",
            "db.statement": "SELECT * from table_1"
        }
    }
}
```

From version 7.16 onwards, APM server must provide a mapping that is equivalent to the native OpenTelemetry Protocol (OTLP) intake for the
fields provided in `otel.attributes`.

When sending data to an APM server version earlier than 7.16, agents MAY use span and transaction labels as a fallback to store OTel attributes and avoid dropping information.
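
The label fallback could look like the following sketch. Both helpers and the `serverSupportsOtel` flag are hypothetical, and the sketch assumes that label keys must not contain `.`, `*` or `"` and that label values are limited to strings, numbers and booleans.

```javascript
// Converts OTel attributes into label key/value pairs (hypothetical helper).
function otelAttributesToLabels(attributes) {
  const labels = {};
  for (const [key, value] of Object.entries(attributes)) {
    // Assumption: '.', '*' and '"' are not allowed in label keys.
    const labelKey = key.replace(/[.*"]/g, '_');
    // Assumption: labels only hold strings, numbers and booleans, so any
    // other value (for example an array) is serialized to a string.
    const isScalar = ['string', 'number', 'boolean'].includes(typeof value);
    labels[labelKey] = isScalar ? value : JSON.stringify(value);
  }
  return labels;
}

// Chooses where attributes are written, given a flag that tells whether the
// APM Server is 7.16 or later (how the flag is obtained is out of scope).
function placeAttributes(item, attributes, serverSupportsOtel) {
  if (serverSupportsOtel) {
    item.otel = Object.assign({}, item.otel, { attributes: attributes });
  } else {
    item.labels = Object.assign({}, item.labels, otelAttributesToLabels(attributes));
  }
  return item;
}

const legacy = placeAttributes({}, { 'db.system': 'mysql', 'net.peer.port': 3306 }, false);
console.log(legacy.labels); // { db_system: 'mysql', net_peer_port: 3306 }
```

Rewriting keys this way is lossy: two attributes that differ only by a replaced character would collide, which is one reason the fallback is optional.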

### Compatibility mapping

Agents should ensure compatibility with the following features:
- breakdown metrics
- [dropped spans statistics](handling-huge-traces/tracing-spans-dropped-stats.md)
- [compressed spans](handling-huge-traces/tracing-spans-compress.md)

As a consequence, agents must provide values for the following attributes:
- `transaction.name` or `span.name` : value directly provided by OTel API
- `transaction.type` : see inference algorithm below
- `span.type` and `span.subtype` : see inference algorithm below
- `span.destination.service.resource` : see inference algorithm below

#### Transaction type

```javascript
a = transaction.otel.attributes;
span_kind = transaction.otel.span_kind;
isRpc = a['rpc.system'] !== undefined;
isHttp = a['http.url'] !== undefined || a['http.scheme'] !== undefined;
isMessaging = a['messaging.system'] !== undefined;
if (span_kind == 'SERVER' && (isRpc || isHttp)) {
    type = 'request';
} else if (span_kind == 'CONSUMER' && isMessaging) {
    type = 'messaging';
} else {
    type = 'unknown';
}
```
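
The transaction type inference above can be wrapped in a function and exercised with a few inputs. The function name is hypothetical, and the sketch assumes the span kind is read from `otel.span_kind`, as in the JSON example of the attributes mapping section.

```javascript
// Infers transaction.type from the OTel span kind and attributes.
function inferTransactionType(transaction) {
  const a = transaction.otel.attributes || {};
  const spanKind = transaction.otel.span_kind;
  const isRpc = a['rpc.system'] !== undefined;
  const isHttp = a['http.url'] !== undefined || a['http.scheme'] !== undefined;
  const isMessaging = a['messaging.system'] !== undefined;
  if (spanKind === 'SERVER' && (isRpc || isHttp)) {
    return 'request';
  }
  if (spanKind === 'CONSUMER' && isMessaging) {
    return 'messaging';
  }
  return 'unknown';
}

const httpServer = { otel: { span_kind: 'SERVER', attributes: { 'http.scheme': 'https' } } };
const consumer = { otel: { span_kind: 'CONSUMER', attributes: { 'messaging.system': 'kafka' } } };
const bareServer = { otel: { span_kind: 'SERVER', attributes: {} } };

console.log(inferTransactionType(httpServer)); // prints "request"
console.log(inferTransactionType(consumer));   // prints "messaging"
console.log(inferTransactionType(bareServer)); // prints "unknown"
```

The last case shows that the span kind alone is not enough: without HTTP or RPC attributes, a `SERVER` transaction falls back to `unknown`.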

#### Span type, sub-type and destination service resource

```javascript
a = span.otel.attributes;
type = undefined;
subtype = undefined;
resource = undefined;

httpPortFromScheme = function (scheme) {
    if ('http' == scheme) {
        return 80;
    } else if ('https' == scheme) {
        return 443;
    }
    return -1;
}

// extracts 'host' or 'host:port' from URL
parseNetName = function (url) {
    var u = new URL(url); // https://developer.mozilla.org/en-US/docs/Web/API/URL
    if (u.port != '') {
        return u.host; // host:port already in URL
    } else {
        var port = httpPortFromScheme(u.protocol.substring(0, u.protocol.length - 1));
        return port > 0 ? u.hostname + ':' + port : u.hostname;
    }
}

peerPort = a['net.peer.port'];
netName = a['net.peer.name'] || a['net.peer.ip'];

if (netName && peerPort > 0) {
    netName += ':';
    netName += peerPort;
}

if (a['db.system']) {
    type = 'db';
    subtype = a['db.system'];
    resource = netName || subtype;
    if (a['db.name']) {
        resource += '/';
        resource += a['db.name'];
    }

} else if (a['messaging.system']) {
    type = 'messaging';
    subtype = a['messaging.system'];

    if (!netName && a['messaging.url']) {
        netName = parseNetName(a['messaging.url']);
    }
    resource = netName || subtype;
    if (a['messaging.destination']) {
        resource += '/';
        resource += a['messaging.destination'];
    }

} else if (a['rpc.system']) {
    type = 'external';
    subtype = a['rpc.system'];
    resource = netName || subtype;
    if (a['rpc.service']) {
        resource += '/';
        resource += a['rpc.service'];
    }

} else if (a['http.url'] || a['http.scheme']) {
    type = 'external';
    subtype = 'http';

    if (a['http.host'] && a['http.scheme']) {
        resource = a['http.host'] + ':' + httpPortFromScheme(a['http.scheme']);
    } else if (a['http.url']) {
        resource = parseNetName(a['http.url']);
    }
}

if (type === undefined) {
    if (span.otel.span_kind == 'INTERNAL') {
        type = 'app';
        subtype = 'internal';
    } else {
        type = 'unknown';
    }
}
span.type = type;
span.subtype = subtype;
span.destination.service.resource = resource;
```
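
The URL handling is the part of this algorithm most prone to subtle mistakes, so here is a standalone sketch of `parseNetName`. It assumes that an explicit port in the URL should be kept in the result, and it relies on the WHATWG `URL` class, where `host` includes the port and `hostname` does not.

```javascript
function httpPortFromScheme(scheme) {
  if (scheme === 'http') return 80;
  if (scheme === 'https') return 443;
  return -1;
}

// Returns 'host:port' when a port is known, otherwise just 'host'.
function parseNetName(url) {
  const u = new URL(url);
  if (u.port !== '') {
    return u.host; // 'host' already includes the explicit port
  }
  // u.protocol ends with ':', which is stripped before the lookup.
  const port = httpPortFromScheme(u.protocol.slice(0, -1));
  return port > 0 ? u.hostname + ':' + port : u.hostname;
}

console.log(parseNetName('http://example.com:8080/path')); // example.com:8080
console.log(parseNetName('https://example.com/path'));     // example.com:443
console.log(parseNetName('amqp://broker/queue'));          // broker
```

The `URL` class drops a port that matches the scheme default, so `http://example.com:80/` takes the second branch and still yields `example.com:80`.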

### Active Spans and Context

When possible, bridge implementation MUST ensure proper interoperability between Elastic transactions/spans and OTel spans when
used from their respective APIs:
- After activating an Elastic span via the agent's API, the [`Context`] returned via the [get current context API] should contain that Elastic span
- When an OTel context is [attached] (aka activated), the [get current context API] should return the same [`Context`] instance.
- Starting an OTel span in the scope of an active Elastic span should make the OTel span a child of the Elastic span.
- Starting an Elastic span in the scope of an active OTel span should make the Elastic span a child of the OTel span.

[`Context`]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/context.md
[attached]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/context.md#attach-context
[get current context API]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/context.md#get-current-context

Both OTel and our agents have their own definition of what "active context" is, for example:
- Java Agent: Elastic active context is implemented as a thread-local stack
- Java OTel API: active context is implemented as a key-value map propagated through thread local

In order to avoid potentially complex and tedious synchronization issues between OTel and our existing agent
Contributor

I ran into a little trickiness when implementing the "if there's not a parent span start a transaction" behavior when working on the Node.js Bridge related to the context objects. I believe I have a path forward for the Node.js agent -- I'm mainly sharing this as background in case other folks run into it (and also forcing myself to say everything out loud in case there's some corner I'm not considering ;))

First -- there's no span builder object in any of the Node.js API or SDK code. Second, neither the startSpan nor the startActiveSpan methods of the Node.js tracer allow you to pass in the span that should be a parent. Instead, this information is stored on the passed in open telemetry context object. Specifically, the intended pattern here is that you can fetch the span that should be the parent (i.e. the span that's active) by using the context's getSpanContext method.

You can see an example of this in the default Node.js SDK/tracer:

This becomes a bit complicated because, currently, this span context is initially set by whatever tracecontext propagator is currently being used

https://github.com/open-telemetry/opentelemetry-js/blob/04f9edd12fb2f42898bee7086691e47b3ab9b629/packages/opentelemetry-core/src/trace/W3CTraceContextPropagator.ts#L120

https://github.com/open-telemetry/opentelemetry-js/blob/04f9edd12fb2f42898bee7086691e47b3ab9b629/packages/opentelemetry-propagator-b3/src/B3SinglePropagator.ts#L86

https://github.com/open-telemetry/opentelemetry-js/blob/04f9edd12fb2f42898bee7086691e47b3ab9b629/packages/opentelemetry-propagator-b3/src/B3MultiPropagator.ts#L137

https://github.com/open-telemetry/opentelemetry-js/blob/04f9edd12fb2f42898bee7086691e47b3ab9b629/packages/opentelemetry-propagator-jaeger/src/JaegerPropagator.ts#L111

The trickiness came from the fact that we're not using these W3C propagators in the Node.js agent bridge -- trace context needs to be propagated when creating a transaction. This means that in cases where we're starting a transaction and assigning it to the ElasticOtelSpan object, we also need to set that span context on the open telemetry context object.

Member

> This means that in cases where we're starting a transaction and assigning it to the ElasticOtelSpan object, we also need to set that span context on the open telemetry context object.

The idea for the bridges is that there's just a single context storage - the internal one of the agent.

The case you mentioned is handled by this part of the spec:

> After activating an Elastic span via the agent's API, the [Context] returned via the [get current context API] should contain that Elastic span

What it means is that when ContextManager::active is called, the bridge implementation of the interface would create a Context on-the-fly which contains the Elastic span.

> The trickiness came from the fact that we're not using these W3C propagators in the Node.js agent bridge -- trace context needs to be propagated when creating a transaction.

I don't think there's a need to implement custom W3C propagators in the Node.js agent bridge. You can just rely on the provided implementation.

> This becomes a bit complicated because, currently, this span context is initially set by whatever tracecontext propagator is currently being used

The trace context propagators don't alter the context storage. It's a bit confusing but trace.setSpanContext, which is called by the propagators, only creates an immutable copy of the provided Context and sets the span. In the case of W3CTraceContextPropagator, the span is a "fake" NonRecordingSpan. The whole purpose of this is to populate a Context object that has the trace ids from the traceparent header.

Personally, I find that part of the OTel API a bit bloated and a source of unnecessary ceremony and allocation.

Happy to jump on a zoom if that was unclear.

implementations, the bridge implementation SHOULD provide an abstraction to have a single "active context" storage.
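
One way to picture a single "active context" storage is sketched below. Everything here is hypothetical (`ActiveContextStore`, `startElasticSpan`, `startOtelSpan`, `otelCurrentContext`): the store is a plain stack, the OTel "current context" is built on the fly from it, and the sketch does not attempt to preserve `Context` instance identity or to handle threads or asynchronous execution.

```javascript
// Single stack of active elements shared by both APIs (hypothetical design).
class ActiveContextStore {
  constructor() {
    this.stack = [];
  }
  activate(element) {
    this.stack.push(element);
  }
  deactivate(element) {
    // Only the most recently activated element may be deactivated.
    if (this.stack[this.stack.length - 1] !== element) {
      throw new Error('deactivation out of order');
    }
    this.stack.pop();
  }
  current() {
    return this.stack.length > 0 ? this.stack[this.stack.length - 1] : null;
  }
}

const store = new ActiveContextStore();

// Native API view: a new span is a child of whatever is currently active.
function startElasticSpan(name) {
  return { name: name, api: 'elastic', parent: store.current() };
}
// Bridge view: the OTel current context is derived from the same store.
function otelCurrentContext() {
  return { span: store.current() };
}
function startOtelSpan(name) {
  return { name: name, api: 'otel', parent: otelCurrentContext().span };
}

const transaction = startElasticSpan('GET /users');
store.activate(transaction);
const otelChild = startOtelSpan('load-users');
store.activate(otelChild);
const elasticGrandchild = startElasticSpan('SELECT users');
console.log(otelChild.parent.name, elasticGrandchild.parent.name);
// prints "GET /users load-users"
```

Because both APIs read from the same store, parent/child relationships stay consistent in both directions without any synchronization between two separate context implementations.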
7 changes: 5 additions & 2 deletions specs/agents/tracing-api.md
@@ -1,6 +1,9 @@
## Tracer APIs

All agents must provide an API to enable developers to instrument their applications manually, in addition to any automatic instrumentation. Agents document their APIs in the elastic.co docs:
All agents must provide a native API to enable developers to instrument their applications manually, in addition to any
automatic instrumentation.

Agents document their APIs in the elastic.co docs:

- [Node.js Agent](https://www.elastic.co/guide/en/apm/agent/nodejs/current/api.html)
- [Go Agent](https://www.elastic.co/guide/en/apm/agent/go/current/api.html)
@@ -10,4 +13,4 @@ All agents must provide an API to enable developers to instrument their applicat
- [Ruby Agent](https://www.elastic.co/guide/en/apm/agent/ruby/current/api.html)
- [RUM JS Agent](https://www.elastic.co/guide/en/apm/agent/js-base/current/api.html)

In addition to each agent having a "native" API for instrumentation, they also implement the [OpenTracing APIs](https://opentracing.io). Agents should align implementations according to https://github.com/elastic/apm/issues/32.
In addition, each agent may provide "bridge" implementations of the vendor-neutral [OpenTelemetry API](tracing-api-otel.md).