Add troubleshooting guide for trace context header propagation#4920
alexandra5000 wants to merge 2 commits into elastic:main
✅ Vale Linting Results: No issues found on modified lines! The Vale linter checks documentation changes against the Elastic Docs style guide. To use Vale locally or report issues, refer to the Elastic style guide for Vale.
|
|
||
| - `continue` (default): Continue incoming traces. | ||
| - `restart`: Always start a new trace. | ||
| - `restart_external`: Restart only for non-Elastic sources. |
For the reviewer's consideration: this looks like behavior specific to the .NET agent. Is there an equivalent for other agents, or do they follow different logic for "untrusted" headers?
This is a common feature of the Elastic APM agents across platforms. We don't have an equivalent yet for OTel/EDOT, but it is on our roadmap.
For `restart_external`, the only thing that matters is whether the upstream service (the one initiating the call) uses an Elastic APM agent. This makes it possible, for example, to deal with an upstream service that uses W3C-compliant trace propagation but sends its data to another backend, which would otherwise leave the top-level transaction on the downstream service without its upstream call.
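To make the three strategies concrete, here is a minimal Python sketch of the decision they describe. It is not agent code: the function name is hypothetical, and it assumes (my understanding of the Elastic agent docs, to be verified) that an Elastic upstream is recognized by an `es` vendor entry in the W3C `tracestate` header.

```python
from typing import Mapping


def should_continue_trace(strategy: str, headers: Mapping[str, str]) -> bool:
    """Return True to continue the incoming trace, False to start a new one.

    Hypothetical helper for illustration only; real agents implement this internally.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    if "traceparent" not in normalized:
        return False  # no incoming trace context, nothing to continue
    if strategy == "continue":
        return True
    if strategy == "restart":
        return False
    if strategy == "restart_external":
        # Assumption: an `es=` entry in `tracestate` marks an Elastic upstream.
        entries = normalized.get("tracestate", "").split(",")
        return any(entry.strip().startswith("es=") for entry in entries)
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With `restart_external`, a request whose `tracestate` only carries another vendor's entry would start a new trace, while one carrying an `es=` entry would be continued.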
While we don't have a similar feature yet, a possible workaround is to remove or reset the trace context HTTP headers on incoming requests, which forces the downstream service to start a new trace independently of the upstream service.
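A rough sketch of that workaround, assuming the headers are available as a plain mapping before the instrumentation sees them (for example in a gateway or an early middleware). The function name and the exact header set are assumptions for illustration:

```python
from typing import Dict, Mapping

# Assumption: these are the headers that carry trace context in this setup.
TRACE_HEADERS = frozenset({"traceparent", "tracestate", "elastic-apm-traceparent"})


def strip_trace_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of the headers without any trace context headers."""
    # Compare case-insensitively because HTTP header names are case-insensitive.
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in TRACE_HEADERS
    }
```

With no trace context left on the request, the downstream service has nothing to continue and starts a new trace.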
|
Hi @SylvainJuge! I’ve added this guide to help users migrating from Elastic APM agents to OpenTelemetry. Since you know A LOT about the OTel Java SDK and the Elastic agent internals, would you mind reviewing it? Specifically, I want to ensure that:
I would really appreciate your input!
| - Traces appear split or uncorrelated in the UI. | ||
| - Parent–child relationships are missing when traffic crosses between: | ||
| - OpenTelemetry-instrumented services and Elastic {{product.apm}} agents | ||
| - New and legacy Elastic {{product.apm}} agents in the same call chain |
What do we mean by "new and legacy" here?
| - Downstream spans start new traces instead of continuing the existing one. | ||
| - Traces appear split or uncorrelated in the UI. | ||
| - Parent–child relationships are missing when traffic crosses between: | ||
| - OpenTelemetry-instrumented services and Elastic {{product.apm}} agents |
Maybe we should state once and for all that "OpenTelemetry-instrumented" here covers both "EDOT Java" and the upstream (vanilla) "OpenTelemetry Java instrumentation".
|
|
||
| Propagation issues often occur when: | ||
|
|
||
| - Older (pre‑W3C) Elastic agents are still in use. |
"Older" here means before version 1.14.0, which was released in March 2020.
| ## Recommended and migration-only patterns | ||
|
|
||
| Before going fully OTel-native, you can use the OpenTelemetry Bridge offered by Elastic agents as a transitional solution: | ||
|
|
||
| ### OpenTelemetry Bridge (temporary) | ||
|
|
||
| The bridge lets you use the OpenTelemetry API for manual instrumentation while still using an Elastic {{apm-agent}} for auto‑instrumentation and exporting. | ||
|
|
||
| With the bridge: | ||
|
|
||
| - The Elastic agent implements the OpenTelemetry API. | ||
| - Spans created through the OTel API become native Elastic spans. | ||
| - Parent–child relationships are preserved across manual and auto‑instrumentation. | ||
|
|
||
| The bridge is available in major Elastic agents (Java, .NET, Node.js, Python). Prefer moving to {{edot}} (OTel-native) when you can. | ||
|
|
||
| ### Avoid running a full OpenTelemetry SDK alongside an Elastic agent | ||
|
|
||
| Do not run a full OpenTelemetry SDK in the same process as an Elastic {{apm-agent}}. | ||
|
|
||
| This setup causes: | ||
|
|
||
| - Duplicate instrumentation and added overhead. | ||
| - Trace fragmentation (conflicting trace IDs). | ||
| - Startup conflicts (instrumentation, exporters, environment variables). | ||
|
|
||
| Each SDK might try to manage propagation independently, breaking distributed tracing. For an OpenTelemetry-native setup, use {{edot}} instead of mixing SDKs. | ||
|
|
||
| ## Resolution | ||
|
|
||
| The preferred resolution is to complete your migration to OpenTelemetry and use {{edot}} (OTel-native). However, if you are still in a gradual migration and need traces to connect across mixed services, the following steps might help. |
This is not related to trace context propagation. While it may be relevant, I would suggest removing it or moving it to a dedicated "general migration recommendations" section.
I also think we may be missing a general "high-level migration strategies" part that describes:
- adding an OTLP endpoint and making it accessible to applications
- reviewing the migration documentation and applying the required changes to the application, for example replacing the Elastic APM API with the OpenTelemetry API (in the short term, the "OTel bridge" keeps things working even if the Elastic APM agent is not yet replaced with EDOT)
- replacing Elastic APM agents (if any) with OpenTelemetry-based agents (either EDOT or upstream)
| ::::{step} Use the OpenTelemetry Bridge | ||
|
|
||
| If you're in transition and still use the OpenTelemetry API with an Elastic agent: | ||
|
|
||
| - Turn on the OpenTelemetry Bridge in the agent. | ||
| - Do not install a separate OpenTelemetry SDK in the same process. | ||
|
|
||
| This can help maintain context propagation during the migration. Plan to move to {{edot}} (OTel-native) when possible. |
As far as I know, the OpenTelemetry bridge does not handle custom context propagation. It only allows capturing metrics and traces through explicit calls to the OTel API from within the application, so we should probably remove this.
| ::::{step} Keep dual‑propagation active during migrations | ||
|
|
||
| In mixed environments with OpenTelemetry SDKs (W3C only) and earlier versions of Elastic agents, keep the default dual‑propagation mode turned on so that: | ||
|
|
||
| - New services read W3C headers. | ||
| - Legacy services read the `elastic-apm-traceparent` header. | ||
|
|
||
| Turning off the legacy header too early can break trace continuity. | ||
|
|
||
| :::: | ||
|
|
||
| ::::{step} Turn off legacy headers after full migration | ||
|
|
||
| When all services support W3C Trace Context, you might turn off emission of the legacy header to reduce header size and network overhead. | ||
|
|
||
| Refer to agent-specific documentation to turn off legacy header output. |
As a simpler recommendation, maybe we should advise keeping the `elastic-apm-traceparent` header always provided in this kind of "mixed" environment, as it maximizes compatibility without adding overhead or complication.
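To illustrate what "always provided" could look like in practice, here is a small Python sketch. The function names are hypothetical, and it assumes the legacy `elastic-apm-traceparent` header carries the same value format as the W3C `traceparent` header, which should be double-checked against the agent documentation.

```python
import re
from typing import Dict, Mapping, Optional

# W3C traceparent layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")


def outgoing_trace_headers(traceparent: str) -> Dict[str, str]:
    """Build outgoing headers that carry both the W3C and the legacy header."""
    if not _TRACEPARENT.match(traceparent):
        raise ValueError(f"not a valid traceparent: {traceparent!r}")
    # Assumption: the legacy header uses the same value as `traceparent`.
    return {"traceparent": traceparent, "elastic-apm-traceparent": traceparent}


def incoming_traceparent(headers: Mapping[str, str]) -> Optional[str]:
    """Prefer the W3C header and fall back to the legacy one."""
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("traceparent") or normalized.get("elastic-apm-traceparent")
```

A receiver written this way keeps working whether the caller sends the W3C header, the legacy header, or both.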
We should also mention that any HTTP proxy or gateway in the path must forward this header as-is (some opt-in configuration might be required to explicitly allow the header).
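For a proxy or gateway that filters headers through an allow-list, the idea can be sketched as follows. This is an illustrative Python sketch, not the configuration of any specific proxy; the function name and the `allowed` parameter are hypothetical.

```python
from typing import Dict, Iterable, Mapping

# Trace context headers that should pass through unchanged.
ALWAYS_FORWARD = frozenset({"traceparent", "tracestate", "elastic-apm-traceparent"})


def forward_headers(incoming: Mapping[str, str], allowed: Iterable[str]) -> Dict[str, str]:
    """Keep allow-listed headers plus the trace context headers, values untouched."""
    allow = {name.lower() for name in allowed} | ALWAYS_FORWARD
    return {name: value for name, value in incoming.items() if name.lower() in allow}
```

The point is that the trace headers are added to the allow-list explicitly and their values are never rewritten.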
|
|
||
| ## Best practices | ||
|
|
||
| - Use {{edot}} for full OTel support and to avoid mixed-configuration issues. |
For APM agents, updating to the latest available version would be a good first step to minimize possible compatibility issues. It also ensures that only the W3C standard header is used.
Summary
This PR introduces a new troubleshooting page focused on trace context header propagation issues when using OpenTelemetry with Elastic APM agents. It includes symptoms, causes, supported mixing patterns, and best practices to ensure proper trace continuity.
Closes #2359
Generative AI disclosure
Tool(s) and model(s) used: Claude Sonnet 4.5 via Cursor