X-ray / opentelemetry issues #618

@mreimer

Description

Hi devs, I have a few issues using X-ray with the lambda runtime. For context, I'm using a custom lambda image with aws-otel-collector loaded as an extension, and my function sets up an opentelemetry pipeline to trace and export to the collector. I'm pretty new to rust, lambda, x-ray, and opentelemetry so please correct me if I've misunderstood how this all works. Here are my issues:

  1. With active tracing enabled on the Lambda, each execution produces two unconnected traces in the X-Ray console. One is the active trace with the initialization, invocation, and overhead segments. The other comes from the "Lambda runtime invoke" instrumentation built into the lambda runtime. I see that the runtime pulls a request id and an X-Ray trace id from the invocation headers, but that doesn't seem to be sufficient to connect the instrumented span to the active trace. By contrast, if within my handler I use OpenTelemetry to extract a context from those headers and create a span with that context, the span lands in the active trace, which I think is the better behaviour. I don't know exactly what use cases this runtime was designed for, but IMO we only ever want one trace, whether or not active tracing is enabled.

  2. The lambda runtime instrumentation from the most recent Lambda execution isn't viewable in the X-Ray console. It shows up in the list of traces, but attempting to view it just gives an "An error occurred fetching your data" message. I'm inferring that this is because the trace is incomplete: execution of the lambda runtime was suspended before the end of the instrumented section (and the subsequent flush). If I execute the Lambda again, the previous trace then works, but the new most recent trace doesn't. If the execution environment is unloaded after an idle period, the latest trace stays in that broken state forever.

  3. I'm explicitly flushing the OpenTelemetry pipeline at the end of my Lambda handler. This has nothing to do with the Rust lambda runtime, but I've found that the flush takes about 40ms, which is a huge performance hit. It would be a lot easier to stomach if there were a way to do it after sending the response to the end user.
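For what it's worth, issue 1 seems to hinge on what the runtime does with the X-Ray trace header. Here's a minimal std-only sketch of parsing the documented `X-Amzn-Trace-Id` / `_X_AMZN_TRACE_ID` format (`Root=...;Parent=...;Sampled=...`) into the pieces a parent span context would need; this is just to illustrate what's in the header, not code from the runtime, and the field names are taken from the AWS-documented header format:

```rust
/// Parse an X-Ray trace header of the documented form
/// "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
/// into (root trace id, parent segment id, sampled flag).
fn parse_xray_header(header: &str) -> (Option<String>, Option<String>, bool) {
    let mut root = None;
    let mut parent = None;
    let mut sampled = false;
    for part in header.split(';') {
        match part.split_once('=') {
            Some(("Root", v)) => root = Some(v.to_string()),
            Some(("Parent", v)) => parent = Some(v.to_string()),
            Some(("Sampled", v)) => sampled = v == "1",
            _ => {} // ignore unknown fields
        }
    }
    (root, parent, sampled)
}

fn main() {
    let (root, parent, sampled) = parse_xray_header(
        "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1",
    );
    println!("root={:?} parent={:?} sampled={}", root, parent, sampled);
}
```

In other words, the header the runtime already reads appears to carry enough to set the `Root` id and `Parent` id as the parent context of the instrumented span, which is effectively what my handler-side workaround does via the OpenTelemetry propagator.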

I'd be happy to work on a PR for these issues if it would help, but I don't know what the appropriate solutions would be in this ecosystem. It seems to me that 2 and 3 are straightforward to solve if lambda offers any mechanism to execute code after sending a response to the end user. Is that possible or is execution suspended immediately at that point?
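To make the shape of the ask concrete, here is a std-only sketch of what "flush after responding" would look like with a worker thread and a channel. To be clear, this pattern does not actually solve issue 3 as-is, since my understanding is that Lambda freezes the execution environment as soon as the response is returned; it only shows the hot path I'd like to protect, with a `sleep` standing in for the ~40ms exporter flush:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Stand-in for an exporter flush, assumed here to cost ~40ms.
fn fake_flush() {
    thread::sleep(Duration::from_millis(40));
}

/// Hypothetical handler: queue the flush on a worker thread so the
/// response path doesn't pay for it, and return the hot-path duration.
fn handler_with_deferred_flush() -> Duration {
    let (tx, rx) = mpsc::channel::<()>();
    let worker = thread::spawn(move || {
        for _ in rx {
            fake_flush();
        }
    });

    let start = Instant::now();
    // ... build and send the response here ...
    tx.send(()).unwrap(); // queue the flush without blocking the response
    let hot_path = start.elapsed();

    drop(tx); // close the channel so the worker loop exits
    worker.join().unwrap();
    hot_path
}

fn main() {
    let hot = handler_with_deferred_flush();
    println!("response path took {:?}", hot);
}
```

If the runtime (or an extension hook) could run the deferred part after the response is sent but before the environment is frozen, issues 2 and 3 would both fall out of that.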
