Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark OpenTelemetry as ready for production use #24672

Open
ashishb-solo opened this issue Dec 22, 2022 · 18 comments
Open

Mark OpenTelemetry as ready for production use #24672

ashishb-solo opened this issue Dec 22, 2022 · 18 comments
Labels
area/opentelemetry enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue

Comments

@ashishb-solo
Copy link
Contributor

Title: Mark OpenTelemetry as ready for production use

Description: We would like to transition the OpenTelemetry plugin status from wip to stable and label security_posture as robust_to_untrusted_downstream. According to the plugin's original author, it is essentially feature complete[1], and only awaiting fixes for reported issues. As it currently stands, this plugin has been available for multiple releases by now and there have not been many major issues reported, so we are interested in encouraging adoption of this plugin and declaring it suitable for public use.

Relevant Links:
[1]: #9958 (comment)

@ashishb-solo ashishb-solo added enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels Dec 22, 2022
@ashishb-solo
Copy link
Contributor Author

@AlexanderEllis - we have some customers that are interested in using this plugin, but are hesitant to do so until this issue is addressed. Do you have any input regarding whether there is any further work needed on this plugin before the status can be changed? Additionally, do you have any other thoughts regarding whether we can change the status of this plugin? Thanks!

@snowp
Copy link
Contributor

snowp commented Dec 31, 2022

cc @htuch as extension owner

@snowp snowp added area/opentelemetry and removed triage Issue requires triage labels Dec 31, 2022
@htuch
Copy link
Member

htuch commented Jan 6, 2023

Looking at https://github.com/envoyproxy/envoy/blob/main/EXTENSION_POLICY.md#extension-stability-and-security-posture, I think the main open questions from me would be:

  1. Do we have any production use cases yet with live traffic?
  2. Does @AlexanderEllis feel this is ready for production use cases and/or has someone done a code audit?
  3. Do we need any fuzzing?

@yanavlasov @krajshiva I think we eventually do want to have this as stable / production-ready for our own use cases (but nothing pressing today).

@ashishb-solo
Copy link
Contributor Author

I'll try to take a shot at these to keep this from falling off the radar....

  1. We have it enabled it very recently via our control plane, but are not currently aware of anyone using it just yet. Given how recently this has been made available, I doubt it so far, but hope this will change soon.

  2. It would be wonderful if we could hear from @AlexanderEllis, but it doesn't look like he's been too active on github lately. Is there some guidance in terms of who would be qualified to perform this kind of an audit? I've spent quite a fair amount of time analysing it myself, but doubt that I have the sufficient authority to sign it off.

  3. The SpanContextExtractor parses the span header from the request, so there's probably a need for fuzzing there. I think we would probably be happy to assist with setting that up! Beyond that, the plugin seems to weigh more on the side of emitting rather than digesting data, so I'm inclined to say that there probably is not a need for any fuzzing outside of that context. But I think a second look here (either from myself or someone else) would be worthwhile.

@htuch
Copy link
Member

htuch commented Jan 24, 2023

OK, @yanavlasov do you think we can get someone from EPT to help with working through 2/3 here? I'm happy to review any PRs.

@AlexanderEllis
Copy link
Contributor

Sorry for the delayed response here — holidays and travel kept me away from GH for a bit.

@ashishb-solo I believe the functionality is all set for production use cases — the only potential improvement I'm aware of is a nice-to-have (supporting either HTTP exporting or the existing gRPC exporting). Maybe @K-Prabha can weigh in here re: use in production?

@htuch I don't believe anyone has done a code audit, but that would be great to see (and I would definitely appreciate the additional eyes on it). Similarly, fuzzing for the context extractor would be great to see, as it's doing some work parsing headers. Happy to review PRs as well!

@github-actions
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Feb 23, 2023
@AlexanderEllis
Copy link
Contributor

From this comment on the parent tracking bug, it sounds like @lakamsani may also be using the extension with live traffic. @lakamsani, we're working towards marking the extension for production use, and it would be great to hear how it has done in production so far and any potential issues that have come up (or lack thereof).

@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Feb 28, 2023
@lakamsani
Copy link
Contributor

Hi @AlexanderEllis sure. We are in pre-production testing right now. Expect to use it in production from May onwards.

@github-actions
Copy link

github-actions bot commented Apr 4, 2023

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Apr 4, 2023
@jmacd
Copy link
Contributor

jmacd commented Apr 5, 2023

I have a report that an Envoy configured with 100% sampling will drop spans compared with a similarly configured Lightstep/OpenTracing tracer, the one that was removed several releases ago as we put our focus on the OTel tracer. See #25102

It's not clear what kind of tuning parameters there are for the OTel tracer library, but none are documented for Envoy here: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/trace/v3/opentelemetry.proto.html

@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Apr 6, 2023
@AlexanderEllis
Copy link
Contributor

Hey @jmacd, thanks for bringing up that case, very interesting. I can take a look at adding more documentation on the tracer configuration. It does support two additional runtime settings that it uses for span exporting, tracing.opentelemetry.flush_interval_ms and tracing.opentelemetry.min_flush_spans, but those aren't included in the autogenerated proto docs you linked 👍

@DanTulovsky
Copy link

DanTulovsky commented Apr 7, 2023 via email

@AlexanderEllis
Copy link
Contributor

@DanTulovsky The plot thickens! Just to confirm for the sake of posterity, these are separate from the configuration settings in the OpenTelemetry config proto (the docs @jmacd linked above), but if you're setting the runtime values and they're not being respected, something must be up.

I'm traveling for the next week or so, but once I'm back at my usual computer I can dig in a little more to see why these runtime configurations aren't being read correctly.

@DanTulovsky
Copy link

DanTulovsky commented Apr 10, 2023 via email

@DanTulovsky
Copy link

DanTulovsky commented Apr 10, 2023 via email

@doppleware
Copy link

Hi Envoy team!
I suggest adding an instrumentation scope to the span with the OTEL library name to follow the recommended OTEL convention. Adding instrumentation library details is a convention that helps understand the source for providing the traces. I'm not sure its a requirement as per the OTEL spec but its generally adopted.

Thanks!

@github-actions
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label May 12, 2023
@htuch htuch added no stalebot Disables stalebot from closing an issue and removed stale stalebot believes this issue/PR has not been touched recently labels May 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/opentelemetry enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue
Projects
None yet
Development

No branches or pull requests

8 participants