Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry access logs: Missing span ID breaks trace-context correlation #33906

Closed
mmanciop opened this issue May 1, 2024 · 3 comments · Fixed by #33909
Closed

OpenTelemetry access logs: Missing span ID breaks trace-context correlation #33906

mmanciop opened this issue May 1, 2024 · 3 comments · Fixed by #33909
Labels
area/opentelemetry area/tracing bug stale stalebot believes this issue/PR has not been touched recently

Comments

@mmanciop
Copy link

mmanciop commented May 1, 2024

Description:

When an Envoy is configured to use both OpenTelemetry tracing and access_log exporting to OpenTelemetry, the log records sent downstream do not contain the span identifier, and that means that the trace context referenced in the OpenTelemetry log records is invalid. The following is a snippet of the debug output of the OpenTelemetry collector debugger from the updated OpenTelemetry example from my fork:

opentelemetry-opentelemetry-1      | Resource SchemaURL: 
opentelemetry-opentelemetry-1      | Resource attributes:
opentelemetry-opentelemetry-1      |      -> log_name: Str(otel_envoy_accesslog)
opentelemetry-opentelemetry-1      |      -> zone_name: Str()
opentelemetry-opentelemetry-1      |      -> cluster_name: Str()
opentelemetry-opentelemetry-1      |      -> node_name: Str()
opentelemetry-opentelemetry-1      | ScopeLogs #0
opentelemetry-opentelemetry-1      | ScopeLogs SchemaURL: 
opentelemetry-opentelemetry-1      | InstrumentationScope  
opentelemetry-opentelemetry-1      | LogRecord #0
opentelemetry-opentelemetry-1      | ObservedTimestamp: 1970-01-01 00:00:00 +0000 UTC
opentelemetry-opentelemetry-1      | Timestamp: 2024-05-01 09:41:34.775021 +0000 UTC
opentelemetry-opentelemetry-1      | SeverityText: 
opentelemetry-opentelemetry-1      | SeverityNumber: Unspecified(0)
opentelemetry-opentelemetry-1      | Body: Empty()
opentelemetry-opentelemetry-1      | Trace ID: 766dd16149e6fb945d09db78b7dc07e5
opentelemetry-opentelemetry-1      | Span ID:       <------ BORKED
opentelemetry-opentelemetry-1      | Flags: 0
opentelemetry-opentelemetry-1      |    {"kind": "exporter", "data_type": "logs", "name": "debug"}
opentelemetry-opentelemetry-1      | 2024-05-01T09:41:35.773Z   info    ResourceSpans #0
opentelemetry-opentelemetry-1      | Resource SchemaURL: 
opentelemetry-opentelemetry-1      | Resource attributes:
opentelemetry-opentelemetry-1      |      -> service.name: Str(front-envoy)
opentelemetry-opentelemetry-1      | ScopeSpans #0
opentelemetry-opentelemetry-1      | ScopeSpans SchemaURL: 
opentelemetry-opentelemetry-1      | InstrumentationScope  
opentelemetry-opentelemetry-1      | Span #0
opentelemetry-opentelemetry-1      |     Trace ID       : 766dd16149e6fb945d09db78b7dc07e5
opentelemetry-opentelemetry-1      |     Parent ID      : 
opentelemetry-opentelemetry-1      |     ID             : fdbb20b1c5d77be5
opentelemetry-opentelemetry-1      |     Name           : egress localhost:10000
opentelemetry-opentelemetry-1      |     Kind           : Client
opentelemetry-opentelemetry-1      |     Start time     : 2024-05-01 09:41:34.775021 +0000 UTC
opentelemetry-opentelemetry-1      |     End time       : 2024-05-01 09:41:34.778252 +0000 UTC
opentelemetry-opentelemetry-1      |     Status code    : Unset
opentelemetry-opentelemetry-1      |     Status message : 
opentelemetry-opentelemetry-1      | Attributes:
opentelemetry-opentelemetry-1      |      -> node_id: Str()
opentelemetry-opentelemetry-1      |      -> zone: Str()
opentelemetry-opentelemetry-1      |      -> guid:x-request-id: Str(ff62aa28-ba62-9083-b21a-16e16424c298)
opentelemetry-opentelemetry-1      |      -> http.url: Str(http://localhost:10000/trace/2)
opentelemetry-opentelemetry-1      |      -> http.method: Str(GET)
opentelemetry-opentelemetry-1      |      -> downstream_cluster: Str(-)
opentelemetry-opentelemetry-1      |      -> user_agent: Str(curl/8.1.2)
opentelemetry-opentelemetry-1      |      -> http.protocol: Str(HTTP/1.1)
opentelemetry-opentelemetry-1      |      -> peer.address: Str(192.168.65.1)
opentelemetry-opentelemetry-1      |      -> request_size: Str(0)
opentelemetry-opentelemetry-1      |      -> response_size: Str(37)
opentelemetry-opentelemetry-1      |      -> component: Str(proxy)
opentelemetry-opentelemetry-1      |      -> upstream_cluster: Str(envoy_cluster2)
opentelemetry-opentelemetry-1      |      -> upstream_cluster.name: Str(envoy_cluster2)
opentelemetry-opentelemetry-1      |      -> http.status_code: Str(200)
opentelemetry-opentelemetry-1      |      -> response_flags: Str(-)

The problem is the empty span id in the log, which instead should be fdbb20b1c5d77be5.
In terms of OpenTelemetry, the trace context reported in the log record is invalid, and will be ignored by tools that would use it, like Dash0.

The issues seems to me to be here: there is no setting of the span id. Confusingly, it seems that the log_context abstraction does not even have a field for the span id, which makes me wonder how other tracing systems, like Datadog's proprietary ones, are supposed to successfully use this correlation, as afaik, pretty much all modern distributed trace systems use (at least) a pair of trace identifier and span identifier to specify correlation with spans.

The way I would fix it, is to add a getSpanIdAsHex() to the Span abstraction in Envoy, leave it unimplemented for other tracers (although, as I wrote above, I suspect a few would need it), and use that to invoke log_entry.mutable_span_id() on the OpenTelemetry SDK.

Repro steps:

Try the updated OpenTelemetry example from my fork, send a request to one of the envoys, and check the debug output on the console.

Admin and Stats Output: N/A

Config: see the updated OpenTelemetry example from my fork

Logs: See above.

Call Stack: N/A

@ashishb-solo
Copy link
Contributor

ashishb-solo commented May 1, 2024

Thanks for posting this @mmanciop! I think the fix is fairly straightforward. I was able to find a bit of spare time to look at this. I think I have fixed the issue in the source code (though I'm not completely sure yet as I haven't had the time to build/test it). The tests are going to take a little longer to fix though, but I'll try to address them whenever I get the time

@phlax
Copy link
Member

phlax commented May 2, 2024

related to #30268

Copy link

github-actions bot commented Jun 2, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Jun 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/opentelemetry area/tracing bug stale stalebot believes this issue/PR has not been touched recently
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants