diff --git a/specs/agents/README.md b/specs/agents/README.md index a0506717..b2a29585 100644 --- a/specs/agents/README.md +++ b/specs/agents/README.md @@ -55,6 +55,7 @@ You can find details about each of these in the [APM Data Model](https://www.ela - [Messaging systems](tracing-instrumentation-messaging.md) - [gRPC](tracing-instrumentation-grpc.md) - [GraphQL](tracing-instrumentation-graphql.md) + - [OpenTelemetry API Bridge](tracing-api-otel.md) - [Error/exception tracking](error-tracking.md) - [Metrics](metrics.md) - [Logging Correlation](log-correlation.md) diff --git a/specs/agents/tracing-api-otel.md b/specs/agents/tracing-api-otel.md new file mode 100644 index 00000000..f78f81c6 --- /dev/null +++ b/specs/agents/tracing-api-otel.md @@ -0,0 +1,277 @@ +## OpenTelemetry API (Tracing) + +[OpenTelemetry](https://opentelemetry.io) (OTel for short) provides a vendor-neutral API for capturing traces, logs and metrics. + +Agents MAY provide a bridge implementation of the OpenTelemetry Tracing API following this specification. +When available, the implementation MUST be configurable and should be disabled by default when marked as `experimental`. + +The bridge implementation relies on APM Server version 7.16 or later. Agents SHOULD recommend this minimum version to users in bridge documentation. + +Bridging here means that for each OTel span created with the API, a native span/transaction will be created and sent to APM server. + +### User experience + +At a high level, from the perspective of the application code, using the OTel bridge should not differ from using the +OTel API for tracing. See [limitations](#limitations) below for details on the currently unsupported OTel features. 
+For tracing, the support should include: +- creating spans with attributes +- context propagation +- capturing errors + +The aim of the bridge is to allow any application/library that is instrumented with OTel API to capture OTel spans to +seamlessly delegate to Elastic APM span/transactions. Also, it provides a vendor-neutral alternative to any existing +manual agent API with similar features. + +One major difference though is that since the implementation of OTel API will be delegated to Elastic APM agent, the +whole OTel configuration that might be present in the application code (OTel processor pipeline) or deployment +(env. variables) will be ignored. + +### Limitations + +The OTel API/specification goes beyond tracing; as a result, the following OTel features are not supported: +- metrics +- logs +- span events +- span links + +### Spans and Transactions + +OTel only defines Spans, whereas Elastic APM relies on both Spans and Transactions. +OTel allows users to provide the _remote context_ when creating a span, which is equivalent to providing a parent to a transaction or span; +it also allows users to provide a (local) parent span. + +As a result, when creating Spans through OTel API with a bridge, agents must implement the following algorithm: + +```javascript +// otel_span contains the properties set through the OTel API +span_or_transaction = null; +if (otel_span.remote_context != null) { + span_or_transaction = createTransactionWithParent(otel_span.remote_context); +} else if (otel_span.parent == null) { + span_or_transaction = createRootTransaction(); +} else { + span_or_transaction = createSpanWithParent(otel_span.parent); +} +``` + +### Span Kind + +OTel spans have a `SpanKind` property ([specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#spankind)) which is close but not strictly equivalent to our definition of spans and transactions. 
+ +For both transactions and spans, an optional `otel.span_kind` property will be provided by agents when set through +the OTel API. +This value should be stored into Elasticsearch documents to preserve OTel semantics and help future OTel integration. + +Possible values are `CLIENT`, `SERVER`, `PRODUCER`, `CONSUMER` and `INTERNAL`, refer to [specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#spankind) for details on semantics. + +By default, OTel spans have their `SpanKind` set to `INTERNAL` by OTel API implementation, so it is assumed to always be provided when using the bridge. + +For existing agents without OTel bridge or for data captured without the bridge, the APM server has to infer the value of `otel.span_kind` with the following algorithm: + +```javascript +span_kind = null; +if (isTransaction(item)) { + if (item.type == "messaging") { + span_kind = "CONSUMER"; + } else if (item.type == "request") { + span_kind = "SERVER"; + } +} else { + // span + if (item.type == "external" || item.type == "storage" || item.type == "db") { + span_kind = "CLIENT"; + } +} + +if (span_kind == null) { + span_kind = "INTERNAL"; +} + +``` + +While being optional, inferring the value of `otel.span_kind` helps to keep the data model closer to the OTel specification, even if the original data was sent using the native agent protocol. + +### Span status + +OTel spans have a [Status](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status) +field to indicate the status of the underlying task they represent. 
+ +When the [Set Status](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status) on OTel API is used, we can map it directly to `span.outcome`: +- OK => Success +- Error => Failure +- Unset (default) => Unknown + +However, when not provided explicitly agents can infer the outcome from the presence of a reported error. +This behavior is not expected with OTel API with status, thus bridged spans/transactions should NOT have their outcome +altered by reporting (or lack of reporting) of an error. Here the behavior should be identical to when the end-user provides +the outcome explicitly and thus have higher priority over the inferred value. + +### Attributes mapping + +OTel relies on key-value pairs for span attributes. +Keys and values are protocol-specific and are defined in [semantic convention](https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/trace/semantic_conventions) specification. + +In order to minimize the mapping complexity in agents, most of the mapping between OTel attributes and agent protocol will be delegated to APM server: +- All OTel span attributes should be captured as-is and written to agent protocol. +- APM server will handle the mapping between OTel attributes and their native transaction/spans equivalents +- Some native span/transaction attributes will still require mapping within agents for [compatibility with existing features](#compatibility-mapping) + +OpenTelemetry attributes should be stored in `otel.attributes` as a flat key-value pair mapping added to `span` and `transaction` objects: +```json +{ + // [...] 
other span/transaction attributes + "otel": { + "span_kind": "CLIENT", + "attributes": { + "db.system": "mysql", + "db.statement": "SELECT * from table_1" + } + } +} +``` + +Starting with version 7.16, APM server must provide a mapping that is equivalent to the native OpenTelemetry Protocol (OTLP) intake for the +fields provided in `otel.attributes`. + +When sending data to an APM server version earlier than 7.16, agents MAY use span and transaction labels as a fallback to store OTel attributes to avoid dropping information. + +### Compatibility mapping + +Agents should ensure compatibility with the following features: +- breakdown metrics +- [dropped spans statistics](handling-huge-traces/tracing-spans-dropped-stats.md) +- [compressed spans](handling-huge-traces/tracing-spans-compress.md) + +As a consequence, agents must provide values for the following attributes: +- `transaction.name` or `span.name` : value directly provided by OTel API +- `transaction.type` : see inference algorithm below +- `span.type` and `span.subtype` : see inference algorithm below +- `span.destination.service.resource` : see inference algorithm below + +#### Transaction type + +```javascript +a = transaction.otel.attributes; +span_kind = transaction.otel.span_kind; +isRpc = a['rpc.system'] !== undefined; +isHttp = a['http.url'] !== undefined || a['http.scheme'] !== undefined; +isMessaging = a['messaging.system'] !== undefined; +if (span_kind == 'SERVER' && (isRpc || isHttp)) { + type = 'request'; +} else if (span_kind == 'CONSUMER' && isMessaging) { + type = 'messaging'; +} else { + type = 'unknown'; +} +``` + +#### Span type, sub-type and destination service resource + +```javascript +a = span.otel.attributes; +type = undefined; +subtype = undefined; +resource = undefined; + +httpPortFromScheme = function (scheme) { + if ('http' == scheme) { + return 80; + } else if ('https' == scheme) { + return 443; + } + return -1; +} + +// extracts 'host' or 'host:port' from URL +parseNetName = function 
(url) { + var u = new URL(url); // https://developer.mozilla.org/en-US/docs/Web/API/URL + if (u.port != '') { + return u.host; // host:port already in URL (u.hostname would drop the port) + } else { + var port = httpPortFromScheme(u.protocol.substring(0, u.protocol.length - 1)); + return port > 0 ? u.host + ':' + port : u.host; + } +} + +peerPort = a['net.peer.port']; +netName = a['net.peer.name'] || a['net.peer.ip']; + +if (netName && peerPort > 0) { + netName += ':'; + netName += peerPort; +} + +if (a['db.system']) { + type = 'db'; + subtype = a['db.system']; + resource = netName || subtype; + if (a['db.name']) { + resource += '/'; + resource += a['db.name']; + } + +} else if (a['messaging.system']) { + type = 'messaging'; + subtype = a['messaging.system']; + + if (!netName && a['messaging.url']) { + netName = parseNetName(a['messaging.url']); + } + resource = netName || subtype; + if (a['messaging.destination']) { + resource += '/'; + resource += a['messaging.destination']; + } + +} else if (a['rpc.system']) { + type = 'external'; + subtype = a['rpc.system']; + resource = netName || subtype; + if (a['rpc.service']) { + resource += '/'; + resource += a['rpc.service']; + } + +} else if (a['http.url'] || a['http.scheme']) { + type = 'external'; + subtype = 'http'; + + if (a['http.host'] && a['http.scheme']) { + resource = a['http.host'] + ':' + httpPortFromScheme(a['http.scheme']); + } else if (a['http.url']) { + resource = parseNetName(a['http.url']); + } +} + +if (type === undefined) { + if (span.otel.span_kind == 'INTERNAL') { + type = 'app'; + subtype = 'internal'; + } else { + type = 'unknown'; + } +} +span.type = type; +span.subtype = subtype; +span.destination.service.resource = resource; +``` + +### Active Spans and Context + +When possible, bridge implementation MUST ensure proper interoperability between Elastic transactions/spans and OTel spans when +used from their respective APIs: +- After activating an Elastic span via the agent's API, the [`Context`] returned via the [get current 
context API] should contain that Elastic span +- When an OTel context is [attached] (aka activated), the [get current context API] should return the same [`Context`] instance. +- Starting an OTel span in the scope of an active Elastic span should make the OTel span a child of the Elastic span. +- Starting an Elastic span in the scope of an active OTel span should make the Elastic span a child of the OTel span. + +[`Context`]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/context.md +[attached]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/context.md#attach-context +[get current context API]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/context.md#get-current-context + +Both OTel and our agents have their own definition of what "active context" is, for example: +- Java Agent: Elastic active context is implemented as a thread-local stack +- Java OTel API: active context is implemented as a key-value map propagated through thread local + +In order to avoid potentially complex and tedious synchronization issues between OTel and our existing agent +implementations, the bridge implementation SHOULD provide an abstraction to have a single "active context" storage. diff --git a/specs/agents/tracing-api.md b/specs/agents/tracing-api.md index df961c8d..961236b3 100644 --- a/specs/agents/tracing-api.md +++ b/specs/agents/tracing-api.md @@ -1,6 +1,9 @@ ## Tracer APIs -All agents must provide an API to enable developers to instrument their applications manually, in addition to any automatic instrumentation. Agents document their APIs in the elastic.co docs: +All agents must provide a native API to enable developers to instrument their applications manually, in addition to any +automatic instrumentation. 
+ +Agents document their APIs in the elastic.co docs: - [Node.js Agent](https://www.elastic.co/guide/en/apm/agent/nodejs/current/api.html) - [Go Agent](https://www.elastic.co/guide/en/apm/agent/go/current/api.html) @@ -10,4 +13,4 @@ All agents must provide an API to enable developers to instrument their applicat - [Ruby Agent](https://www.elastic.co/guide/en/apm/agent/ruby/current/api.html) - [RUM JS Agent](https://www.elastic.co/guide/en/apm/agent/js-base/current/api.html) -In addition to each agent having a "native" API for instrumentation, they also implement the [OpenTracing APIs](https://opentracing.io). Agents should align implementations according to https://github.com/elastic/apm/issues/32. +In addition, each agent may provide "bridge" implementations of vendor-neutral [OpenTelemetry API](tracing-api-otel.md). \ No newline at end of file diff --git a/tests/agents/gherkin-specs/otel_bridge.feature b/tests/agents/gherkin-specs/otel_bridge.feature new file mode 100644 index 00000000..8d65772b --- /dev/null +++ b/tests/agents/gherkin-specs/otel_bridge.feature @@ -0,0 +1,246 @@ +@opentelemetry-bridge +Feature: OpenTelemetry bridge + + # --- Creating Elastic span or transaction from OTel span + + Scenario: Create transaction from OTel span with remote context + Given an agent + And OTel span is created with remote context as parent + Then Elastic bridged object is a transaction + Then Elastic bridged transaction has remote context as parent + + Scenario: Create root transaction from OTel span without parent + Given an agent + And OTel span is created without parent + And OTel span ends + Then Elastic bridged object is a transaction + Then Elastic bridged transaction is a root transaction + # outcome should not be inferred from the lack/presence of errors + Then Elastic bridged transaction outcome is "unknown" + + Scenario: Create span from OTel span + Given an agent + And OTel span is created with local context as parent + And OTel span ends + Then Elastic 
bridged object is a span + Then Elastic bridged span has local context as parent + # outcome should not be inferred from the lack/presence of errors + Then Elastic bridged span outcome is "unknown" + + # --- OTel span kind mapping for spans & transactions + + Scenario Outline: OTel span kind for spans & default span type & subtype + Given an agent + And an active transaction + And OTel span is created with kind "<kind>" + And OTel span ends + Then Elastic bridged object is a span + Then Elastic bridged span OTel kind is "<kind>" + Then Elastic bridged span type is "<default_type>" + Then Elastic bridged span subtype is "<default_subtype>" + Examples: + | kind | default_type | default_subtype | + | INTERNAL | app | internal | + | SERVER | unknown | | + | CLIENT | unknown | | + | PRODUCER | unknown | | + | CONSUMER | unknown | | + + Scenario Outline: OTel span kind for transactions & default transaction type + Given an agent + And OTel span is created with kind "<kind>" + And OTel span ends + Then Elastic bridged object is a transaction + Then Elastic bridged transaction OTel kind is "<kind>" + Then Elastic bridged transaction type is 'unknown' + Examples: + | kind | + | INTERNAL | + | SERVER | + | CLIENT | + | PRODUCER | + | CONSUMER | + + # OTel span status mapping for spans & transactions + + Scenario Outline: OTel span mapping with status for transactions + Given an agent + And OTel span is created with kind 'SERVER' + And OTel span status set to "<status>" + And OTel span ends + Then Elastic bridged object is a transaction + Then Elastic bridged transaction outcome is "<outcome>" + Examples: + | status | outcome | + | unset | unknown | + | ok | success | + | error | failure | + + Scenario Outline: OTel span mapping with status for spans + Given an agent + Given an active transaction + And OTel span is created with kind 'INTERNAL' + And OTel span status set to "<status>" + And OTel span ends + Then Elastic bridged object is a span + Then Elastic bridged span outcome is "<outcome>" + Examples: + | status | outcome | + | unset | unknown | + | ok | 
success | + | error | failure | + + # --- span type, subtype and action inference from OTel attributes + + # --- HTTP server + # https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/http.md#http-server + Scenario Outline: HTTP server [ ] + Given an agent + And OTel span is created with kind 'SERVER' + And OTel span has following attributes + | http.url | <http.url> | + | http.scheme | <http.scheme> | + And OTel span ends + Then Elastic bridged object is a transaction + Then Elastic bridged transaction type is "request" + Examples: + | http.url | http.scheme | + | http://testing.invalid/ | | + | | http | + + # --- HTTP client + # https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/http.md#http-client + Scenario Outline: HTTP client [ ] + Given an agent + And an active transaction + And OTel span is created with kind 'CLIENT' + And OTel span has following attributes + | http.url | <http.url> | + | http.scheme | <http.scheme> | + | http.host | <http.host> | + | net.peer.ip | <net.peer.ip> | + | net.peer.name | <net.peer.name> | + | net.peer.port | <net.peer.port> | + And OTel span ends + Then Elastic bridged span type is 'external' + Then Elastic bridged span subtype is 'http' + Then Elastic bridged span OTel attributes are copied as-is + Then Elastic bridged span destination resource is set to "<resource>" + Examples: + | http.url | http.scheme | http.host | net.peer.ip | net.peer.name | net.peer.port | resource | + | https://testing.invalid:8443/ | | | | | | testing.invalid:8443 | + | https://[::1]/ | | | | | | [::1]:443 | + | http://testing.invalid/ | | | | | | testing.invalid:80 | + | | http | testing.invalid | | | | testing.invalid:80 | + | | https | testing.invalid | 127.0.0.1 | | | testing.invalid:443 | + | | http | | 127.0.0.1 | | 81 | 127.0.0.1:81 | + | | https | | 127.0.0.1 | | 445 | 127.0.0.1:445 | + | | http | | 127.0.0.1 | host1 | 445 | host1:445 | + | | https | | 127.0.0.1 | host2 | 445 | host2:445 | + + # --- DB client + # 
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/database.md + Scenario Outline: DB client [ ] + Given an agent + And an active transaction + And OTel span is created with kind 'CLIENT' + And OTel span has following attributes + | db.system | <db.system> | + | db.name | <db.name> | + | net.peer.ip | <net.peer.ip> | + | net.peer.name | <net.peer.name> | + | net.peer.port | <net.peer.port> | + And OTel span ends + Then Elastic bridged span type is 'db' + Then Elastic bridged span subtype is "<db.system>" + Then Elastic bridged span OTel attributes are copied as-is + Then Elastic bridged span destination resource is set to "<resource>" + Examples: + | db.system | db.name | net.peer.ip | net.peer.name | net.peer.port | resource | + | mysql | | | | | mysql | + | oracle | | | oracledb | | oracledb | + | oracle | | 127.0.0.1 | | | 127.0.0.1 | + | mysql | | 127.0.0.1 | dbserver | 3307 | dbserver:3307 | + | mysql | myDb | | | | mysql/myDb | + | oracle | myDb | | oracledb | | oracledb/myDb | + | oracle | myDb | 127.0.0.1 | | | 127.0.0.1/myDb | + | mysql | myDb | 127.0.0.1 | dbserver | 3307 | dbserver:3307/myDb | + + # --- Messaging consumer (transaction consuming/receiving a message) + # https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md + Scenario: Messaging consumer + Given an agent + And an active transaction + And OTel span is created with kind 'CONSUMER' + And OTel span has following attributes + | messaging.system | anything | + And OTel span ends + Then Elastic bridged transaction type is 'messaging' + + # --- Messaging producer (client span emitting a message) + # https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md + Scenario Outline: Messaging producer [ ] + Given an agent + And an active transaction + And OTel span is created with kind 'PRODUCER' + And OTel span has following attributes + | messaging.system | <messaging.system> | + | messaging.destination | <messaging.destination> | + 
| messaging.url | <messaging.url> | + | net.peer.ip | <net.peer.ip> | + | net.peer.name | <net.peer.name> | + | net.peer.port | <net.peer.port> | + And OTel span ends + Then Elastic bridged span type is 'messaging' + Then Elastic bridged span subtype is "<messaging.system>" + Then Elastic bridged span OTel attributes are copied as-is + Then Elastic bridged span destination resource is set to "<resource>" + Examples: + | messaging.system | messaging.destination | messaging.url | net.peer.ip | net.peer.name | net.peer.port | resource | + | rabbitmq | | amqp://carrot:4444/q1 | | | | carrot:4444 | + | rabbitmq | | | 127.0.0.1 | carrot-server | 7777 | carrot-server:7777 | + | rabbitmq | | | | carrot-server | | carrot-server | + | rabbitmq | | | 127.0.0.1 | | | 127.0.0.1 | + | rabbitmq | myQueue | amqp://carrot:4444/q1 | | | | carrot:4444/myQueue | + | rabbitmq | myQueue | | 127.0.0.1 | carrot-server | 7777 | carrot-server:7777/myQueue | + | rabbitmq | myQueue | | | carrot-server | | carrot-server/myQueue | + | rabbitmq | myQueue | | 127.0.0.1 | | | 127.0.0.1/myQueue | + + # --- RPC client + # https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/rpc.md + Scenario Outline: RPC client [ ] + Given an agent + And an active transaction + And OTel span is created with kind 'CLIENT' + And OTel span has following attributes + | rpc.system | <rpc.system> | + | rpc.service | <rpc.service> | + | net.peer.ip | <net.peer.ip> | + | net.peer.name | <net.peer.name> | + | net.peer.port | <net.peer.port> | + And OTel span ends + Then Elastic bridged span type is 'external' + Then Elastic bridged span subtype is "<rpc.system>" + Then Elastic bridged span OTel attributes are copied as-is + Then Elastic bridged span destination resource is set to "<resource>" + Examples: + | rpc.system | rpc.service | net.peer.ip | net.peer.name | net.peer.port | resource | + | grpc | | | | | grpc | + | grpc | myService | | | | grpc/myService | + | grpc | myService | | rpc-server | | rpc-server/myService | + | grpc | myService | 127.0.0.1 | rpc-server | | rpc-server/myService | + | grpc | | 127.0.0.1 | rpc-server | 7777 | 
rpc-server:7777 | + | grpc | myService | 127.0.0.1 | rpc-server | 7777 | rpc-server:7777/myService | + | grpc | myService | 127.0.0.1 | | 7777 | 127.0.0.1:7777/myService | + + # --- RPC server + # https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/rpc.md + Scenario: RPC server + Given an agent + And OTel span is created with kind 'SERVER' + And OTel span has following attributes + | rpc.system | grpc | + And OTel span ends + Then Elastic bridged transaction type is 'request' + +