Add troubleshooting guide for trace context header propagation #4920

Open
alexandra5000 wants to merge 2 commits into elastic:main from alexandra5000:trace-propagation-errors

@alexandra5000
Contributor

Summary

This PR introduces a new troubleshooting page focused on trace context header propagation issues when using OpenTelemetry with Elastic APM agents. It includes symptoms, causes, supported mixing patterns, and best practices to ensure proper trace continuity.

Closes #2359

Generative AI disclosure

  1. Did you use a generative AI (GenAI) tool to assist in creating this contribution?
  • Yes
  • No
  2. If you answered "Yes" to the previous question, please specify the tool(s) and model(s) used (e.g., Google Gemini, OpenAI ChatGPT-4, etc.).

Tool(s) and model(s) used: Claude Sonnet 4.5 via Cursor

@github-actions
Contributor

github-actions bot commented Feb 1, 2026

✅ Vale Linting Results

No issues found on modified lines!


The Vale linter checks documentation changes against the Elastic Docs style guide.

To use Vale locally or report issues, refer to Elastic style guide for Vale.

@alexandra5000 alexandra5000 marked this pull request as ready for review February 1, 2026 18:22
@alexandra5000 alexandra5000 requested review from a team as code owners February 1, 2026 18:22

- `continue` (default): Continue incoming traces.
- `restart`: Always start a new trace.
- `restart_external`: Restart only for non-Elastic sources.
Contributor Author

For the reviewer's consideration - seems like this is a specific .NET agent behavior. Is there an equivalent for other agent(s) or do they follow a different logic for "untrusted" headers?

Member

This is a common feature across the Elastic APM agents on all platforms. We don't have an equivalent yet for OTel/EDOT, but it is on our roadmap.

For `restart_external`, what matters is whether the upstream service (the one initiating the call) uses an Elastic APM agent. This handles, for example, an upstream service that uses W3C-compliant trace propagation but sends its data to another backend, which would otherwise leave the top-level transaction on the downstream service pointing at an upstream call that is missing from the trace.
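
To make the three strategies concrete, here is a minimal Python sketch of the decision an agent makes for an incoming request. It is not the agents' actual implementation. The function name `should_continue_trace` is hypothetical, and the sketch assumes an Elastic upstream can be recognized by an `es` vendor entry in the `tracestate` header.

```python
from typing import Mapping


def should_continue_trace(strategy: str, headers: Mapping[str, str]) -> bool:
    """Decide whether to continue the incoming trace or start a new one."""
    # HTTP header names are case-insensitive, so normalize before looking them up.
    normalized = {name.lower(): value for name, value in headers.items()}
    if "traceparent" not in normalized:
        return False  # No incoming context, so there is nothing to continue.
    if strategy == "continue":
        return True
    if strategy == "restart":
        return False
    if strategy == "restart_external":
        # Assumption: an Elastic upstream is recognized by an `es` entry in tracestate.
        entries = normalized.get("tracestate", "").split(",")
        return any(entry.strip().split("=", 1)[0] == "es" for entry in entries)
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With `restart_external`, a request from a W3C-compliant service that reports to another backend carries no `es` entry, so the downstream service would start its own trace.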

While we don't yet have a similar feature, a possible workaround is to remove or reset the trace context HTTP headers on incoming requests, which forces the downstream service to start a new trace independently of the upstream service.
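
As a rough illustration of that workaround, the Python sketch below drops the trace context headers before the request reaches the instrumented handler. The helper name `strip_trace_context` is hypothetical, and the header list is an assumption (the two W3C headers plus the legacy Elastic one). Where this runs (gateway, middleware, or filter) depends on the stack.

```python
from typing import Dict, Mapping

# Assumed set of headers that carry trace context (lowercase for comparison).
TRACE_CONTEXT_HEADERS = frozenset(
    {"traceparent", "tracestate", "elastic-apm-traceparent"}
)


def strip_trace_context(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of the headers without any trace context headers."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in TRACE_CONTEXT_HEADERS
    }
```

Because the downstream service then sees no incoming context, it starts a new trace, at the cost of losing the link to the upstream call.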

@alexandra5000
Contributor Author

Hi @SylvainJuge!

I’ve added this guide to help users migrating from Elastic APM agents to OpenTelemetry.

Since you know A LOT about the OTel Java SDK and the Elastic agent internals, would you mind reviewing it? Specifically, I want to ensure that:

  • The description of dual-propagation mode (inbound/outbound) accurately reflects how our agents handle the legacy headers.
  • The advice regarding the OpenTelemetry Bridge vs. EDOT is aligned with our current migration recommendations.
  • The version table for W3C support is accurate for the primary agents.

I would really appreciate your input!

- Traces appear split or uncorrelated in the UI.
- Parent–child relationships are missing when traffic crosses between:
- OpenTelemetry-instrumented services and Elastic {{product.apm}} agents
- New and legacy Elastic {{product.apm}} agents in the same call chain
Member

What do we mean by "new and legacy" here?

- Downstream spans start new traces instead of continuing the existing one.
- Traces appear split or uncorrelated in the UI.
- Parent–child relationships are missing when traffic crosses between:
- OpenTelemetry-instrumented services and Elastic {{product.apm}} agents
Member

Maybe we should state once and for all that "OpenTelemetry-instrumented" here covers both "EDOT Java" and the upstream (vanilla) "OpenTelemetry Java instrumentation".


Propagation issues often occur when:

- Older (pre‑W3C) Elastic agents are still in use.
Member

"Older" here means before version 1.14.0, which was released in March 2020.

Comment on lines +59 to +89
## Recommended and migration-only patterns

Before going fully OTel-native, you can use the OpenTelemetry Bridge offered by Elastic agents as a transitional solution:

### OpenTelemetry Bridge (temporary)

The bridge lets you use the OpenTelemetry API for manual instrumentation while still using an Elastic {{apm-agent}} for auto‑instrumentation and exporting.

With the bridge:

- The Elastic agent implements the OpenTelemetry API.
- Spans created through the OTel API become native Elastic spans.
- Parent–child relationships are preserved across manual and auto‑instrumentation.

The bridge is available in major Elastic agents (Java, .NET, Node.js, Python). Prefer moving to {{edot}} (OTel-native) when you can.

### Avoid running a full OpenTelemetry SDK alongside an Elastic agent

Do not run a full OpenTelemetry SDK in the same process as an Elastic {{apm-agent}}.

This setup causes:

- Duplicate instrumentation and added overhead.
- Trace fragmentation (conflicting trace IDs).
- Startup conflicts (instrumentation, exporters, environment variables).

Each SDK might try to manage propagation independently, breaking distributed tracing. For an OpenTelemetry-native setup, use {{edot}} instead of mixing SDKs.

## Resolution

The preferred resolution is to complete your migration to OpenTelemetry and use {{edot}} (OTel-native). However, if you are still in a gradual migration and need traces to connect across mixed services, the following steps might help.
Member

This is unrelated to trace context propagation. While it may be relevant, I would suggest removing it or moving it to a dedicated "general migration recommendations" section.

I also think we may be missing a general "high-level migration strategies" part where we describe:

  • adding an OTLP endpoint and making it accessible to applications
  • reviewing the migration documentation and applying the required changes to the application: for example, migrating away from the Elastic APM API and replacing it with the OpenTelemetry API (in the short term, the "OTel bridge" keeps things working even if the Elastic APM agent is not replaced with EDOT).
  • replacing Elastic APM agents (if any) with OpenTelemetry-based agents (either EDOT or upstream)


Comment on lines +130 to +137
::::{step} Use the OpenTelemetry Bridge

If you're in transition and still use the OpenTelemetry API with an Elastic agent:

- Turn on the OpenTelemetry Bridge in the agent.
- Do not install a separate OpenTelemetry SDK in the same process.

This can help maintain context propagation during the migration. Plan to move to {{edot}} (OTel-native) when possible.
Member

As far as I know, the OpenTelemetry Bridge does not handle custom context propagation. It only allows capturing metrics and traces through explicit calls to the OTel API from within the application, so we should probably remove this.

Comment on lines +141 to +156
::::{step} Keep dual‑propagation active during migrations

In mixed environments with OpenTelemetry SDKs (W3C only) and earlier versions of Elastic agents, keep the default dual‑propagation mode turned on so that:

- New services read W3C headers.
- Legacy services read the `elastic-apm-traceparent` header.

Turning off the legacy header too early can break trace continuity.

::::

::::{step} Turn off legacy headers after full migration

When all services support W3C Trace Context, you might turn off emission of the legacy header to reduce header size and network overhead.

Refer to agent-specific documentation to turn off legacy header output.
Member

Maybe as a simpler recommendation, we should advise always sending the `elastic-apm-traceparent` header in this kind of "mixed" environment, as it maximizes compatibility without adding overhead or complication.
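
A small Python sketch of what "always sending both headers" could look like follows. It assumes the legacy `elastic-apm-traceparent` header carries the same value format as the W3C `traceparent` header. The helper names are hypothetical, and the regular expression is a simplified shape check, not full W3C validation.

```python
import re
from typing import Dict, Mapping, Optional

W3C_HEADER = "traceparent"
LEGACY_HEADER = "elastic-apm-traceparent"

# Simplified shape check: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(r"^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")


def extract_traceparent(headers: Mapping[str, str]) -> Optional[str]:
    """Prefer the W3C header and fall back to the legacy Elastic header."""
    normalized = {name.lower(): value for name, value in headers.items()}
    for name in (W3C_HEADER, LEGACY_HEADER):
        value = normalized.get(name, "").strip()
        if _TRACEPARENT_RE.match(value):
            return value
    return None


def inject_traceparent(headers: Mapping[str, str], traceparent: str) -> Dict[str, str]:
    """Return a copy of the outgoing headers carrying both header variants."""
    outgoing = dict(headers)
    outgoing[W3C_HEADER] = traceparent
    outgoing[LEGACY_HEADER] = traceparent  # Assumption: same value format as W3C.
    return outgoing
```

Reading either header and writing both keeps a W3C-only service and a legacy-only service in the same trace, whichever one sits downstream.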

Also, we should mention that any HTTP proxy or gateway in the request path must pass this header through as-is (some opt-in configuration might be required to explicitly allow the header).
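
To show why the opt-in matters, here is a hedged Python sketch of a proxy that forwards only allow-listed headers. The allow-list contents and the `forward_headers` function are hypothetical and do not model any specific proxy product.

```python
from typing import AbstractSet, Dict, Mapping

# Headers that carry trace context and must survive the proxy hop unchanged.
TRACE_HEADERS = frozenset({"traceparent", "tracestate", "elastic-apm-traceparent"})

# Hypothetical default allow-list of a strict proxy or gateway.
BASE_ALLOW_LIST = frozenset({"accept", "content-type", "user-agent"})


def forward_headers(
    incoming: Mapping[str, str],
    allow_list: AbstractSet[str] = BASE_ALLOW_LIST,
) -> Dict[str, str]:
    """Forward only allow-listed headers, leaving their values untouched."""
    allowed = {name.lower() for name in allow_list}
    return {
        name: value
        for name, value in incoming.items()
        if name.lower() in allowed
    }
```

With the default list the trace headers are silently dropped and the downstream service starts a new trace. Calling `forward_headers(incoming, BASE_ALLOW_LIST | TRACE_HEADERS)` passes them through unchanged.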


## Best practices

- Use {{edot}} for full OTel support and to avoid mixed-configuration issues.
Member

For APM agents, updating to the latest available version would be a good first step to minimize possible compatibility issues. This also ensures that only the W3C standard header is used.
