parity between otel/opentelemetry-collector && amazon/aws-otel-collector #1480

Closed
elasticdotventures opened this issue Sep 13, 2022 · 5 comments
Labels: ADOT collector, documentation, stale

Comments

@elasticdotventures

elasticdotventures commented Sep 13, 2022

Describe the question

I am hoping to make recommendations for my organization's overall strategy with regard to future Amazon products.
We do not currently have a support contact so please don't refer me there. This is part question, part feedback.

The question: should AWS customers expect the ADOT collector to reach feature parity with the OpenTelemetry Collector by "fast-following" it, whether in the near term or the distant future? Or is ADOT a reimplementation of the same specification with a completely different feature matrix, i.e. are the two code bases (ADOT collector and OpenTelemetry Collector) destined to diverge and be full of undocumented "donut hole" feature variances?

This ADOT OTEL collector repo afaik does not appear to be a fork that is tracking upstream changes of the otel/opentelemetry-collector so there is at least some divergence of feature behavior between the opentelemetry collector and ADOT OTEL.

Speaking as a person who is trying to use AWS Managed Prometheus along with OpenTelemetry, I'm sort of forced to use ADOT because the Sigv4 extensions only appear in AWS. The OTEL Collector is far more mature, especially its examples. ADOT is significantly more obtuse and also seems keen to enable services such as X-Ray when I don't want them. It also lacks any Terraform documentation or examples, among numerous other things that make it quite obtuse to use with Amazon Managed Prometheus.

The documentation for ADOT is horrendous, I've found literally hundreds of mistakes, broken links, the same basic tutorial that is well written 2/3rds of the way -- then is rushed & unfinished in the last 1/3rd .. and there is no way except the dumb feedback button to suggest fixes which go into a black hole (so there is no community involvement).

ADOT does not do any validation on the parameters in the YAML. This alone is maddening.
For this reason alone the usability of the otel/opentelemetry-collector to me as a neophyte seems to far exceed the ADOT for OTEL.

Specifically key validation in the config.yaml (the otel collector validates keys, catching a plurality of yaml syntax errors)
The ADOT OTEL collector additionally seems to not support logging "loglevel" debug, and I have not been able to get the receivers for OTLP / GRPC to work (yet) using the opentelemetry-python sdk

(note: I am running a custom build / non-official release, but I am able to make python otlp grpc calls work with the official opentelemetry collector, but not ADOT, I'm not even sure if receiving OTLP over GRPC is supported/possible)

I desperately want to run AWS Managed Prometheus, that is the reason I'm running ADOT (for the Sigv4 prometheusremotewrite)

Steps to reproduce if your question is related to an action

This is a question, nothing needs to be reproduced, except OTEL collector compatibility.

What did you expect to see?

I expect to see the ADOT OTEL collector be a fork of the official OpenTelemetry collector which periodically synchronizes its changes with upstream and adds Amazon-specific service extensions. ADOT feels inferior to & less mature than the official OTEL collector.

Environment
IMAGE="otel/opentelemetry-collector:0.54.0"
IMAGE="amazon/aws-otel-collector:v0.21.1"

Additional context
Add any other context about the problem here.

@mhausenblas
Member

ADOT PM here. Thanks for your feedback and I appreciate the time you took to share, in detail!

First off, ADOT is not a fork but a distribution of OpenTelemetry (one of the 20 odd vendors available). We do monthly releases, tracking upstream (on average two releases). So the code base is the same, we just happen to add a few telemetry related things, harden it, and provide performance & resource usage info along with it.

Agreed on the UX of the docs; this is a problem we've identified and are working to fix, no argument there.

Concerning the components we include: we are very security focused, pen testing each of the receivers, processors, exporters, and extensions we include and the prioritization of what to include is driven by customer demand. Again, happy to learn what is missing for your use case (via GH issue feature request).

If you're thinking of a production use case and you do not want to use the ADOT collector (and with it support) I'd recommend building your own collector over using upstream/contrib directly as the security impact of the over hundred components included there is something you may not want to expose yourself to.

In case the above is not satisfying, I'm happy to jump on a call with you and walk you through the details (I'm based out of Ireland), so feel free to ping me on the CNCF Slack (mhausenblas) or send me a mail to hausenbl@amazon.com to arrange for a time slot, please.

mhausenblas self-assigned this Sep 13, 2022
mhausenblas added the documentation and ADOT collector labels Sep 13, 2022
@elasticdotventures
Author

👋 @mhausenblas
Thanks for the answer(s) - I'll definitely DM you on CNCF Slack.
In my message below "ADOT" refers to ADOT collector, and I'll say OTEL for the community collector.

UPDATE: Shortly after I wrote the original message I discovered the otel-collector-contrib where AWS @erichsueh3 had put the AuthSigV4 extensions. 😁 I realized I could simply run the otel-collector-contrib container which can consume the same ADOT config! 🥳 That was incredibly useful for isolating the ADOT vs OTEL behaviour differences (because I can use the same config file with SigV4 extensions) and then I could safely use the OTEL community docs & OTEL collector version numbers with respect to the code in the library.

Ultimately this led to my culprit: which (as usual) was an issue in how I had structured our business metrics instrumentation code, specifically I didn't have the Python OTLPMetricsExporter initialized correctly, so it wasn't a problem in ADOT at all.
Despite this - ADOT did obscure the issue for me for ~6 hours of my life I will never get back. I found an example.py in the python library for metrics and it worked with ADOT and OTEL-community versions .. (metrics in opentelemetry-python is only in RC stage, not even fully released so I was thrilled to find a working example!)

Beyond that, from my journey my suggestions:

Continue to Increase Parameter Validation

When I mentioned parameter validation, I've noticed the AWS OTEL does the same or similar validation on top level keys in the config file. More is always better, so not just KEY but also VALUE validation. The OTEL files don't follow strict YAML guidelines so there is a lot of room for mis-interpretation (if such a thing as strict yaml even exists). -- did I do this value correctly/typo catching!

OTEL & ADOT + AWS is a whole level of complexity further than a simple OTEL community version.
Generally OTEL is a deep topic, jargon heavy, and more specifically instrumenting business applications is some weird fusion that IMHO goes well beyond "Full Stack dev" since there are many moving pieces and people like myself are often "jack of many, master of few". I think it safe to assume at least for the next few months/years OTEL will be a fast moving specification where nobody is fluent and the docs will be equally ambiguous, hopefully by next decade this topic will be less dynamic, codebases mature with better examples & less sharp edges, lest every noob experience the painful death by 1,000 cuts.

So having ADOT perform a basic diagnostic (i.e. internally, how is ADOT OTEL configured with respect to AWS), echo back the configuration parameters it understood.

Specifically echoing the settings ADOT understood, even as a feature flag/optional parameter, or a mode where it says "no data" was received in ~X minutes and that itself is an exported metric to exercise a "known working path" (or to further diagnose transmission failures)
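The "no data received in ~X minutes" idea can be sketched in a few lines. This is only an illustration of the suggestion, not something ADOT or the upstream collector provides: the class name `NoDataWatchdog`, its methods, and the status strings are all hypothetical, and the clock is injectable purely so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Optional


class NoDataWatchdog:
    """Hypothetical helper: tracks when telemetry last arrived and reports silence."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        self._started = clock()
        self._last_seen: Optional[float] = None

    def record_data(self) -> None:
        """Call whenever a receiver accepts a batch of telemetry."""
        self._last_seen = self._clock()

    def seconds_silent(self) -> float:
        """Seconds since the last batch, or since startup if none has arrived."""
        reference = self._last_seen if self._last_seen is not None else self._started
        return self._clock() - reference

    def status(self) -> str:
        """Return "ok", or a message once the silence reaches the timeout."""
        silent = self.seconds_silent()
        if silent < self._timeout:
            return "ok"
        return f"no data received in {silent:.0f}s"
```

The value of `seconds_silent()` is the kind of number that could itself be exported as a metric, which would exercise the "known working path" even when the application sends nothing.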

Beyond checking YAML structure, tight parameter validation, AWS SDK parameter VALUES which are invalid/incorrect should also generate similar messages to:

Error: failed to get config: cannot unmarshal the configuration: unknown extensions type "sigv4auth" for "sigv4auth" (valid values: [health_check pprof zpages memory_ballast])
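A minimal sketch of the kind of key validation that produces a message like the one above, assuming the config has already been parsed into a Python dict. The allow-lists `TOP_LEVEL_KEYS` and `KNOWN_COMPONENTS` and the function `validate_config` are hypothetical and deliberately incomplete; a real collector build derives the valid component types from whatever is compiled into it.

```python
from typing import Any, Dict, List

# Assumed, incomplete allow-lists used only for illustration.
TOP_LEVEL_KEYS = {"extensions", "receivers", "processors", "exporters", "service"}
KNOWN_COMPONENTS: Dict[str, List[str]] = {
    "extensions": ["health_check", "pprof", "zpages", "memory_ballast"],
    "receivers": ["otlp"],
    "exporters": ["prometheusremotewrite", "logging"],
}


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable errors; an empty list means no problems found."""
    errors: List[str] = []
    # Catch misspelled top-level sections such as "recievers".
    for key in config:
        if key not in TOP_LEVEL_KEYS:
            errors.append(f'unknown top-level key "{key}"')
    # Check that every configured component is of a known type.
    for section, valid in KNOWN_COMPONENTS.items():
        for name in config.get(section) or {}:
            # Component IDs are "type" or "type/name"; only the type is checked.
            component_type = name.split("/", 1)[0]
            if component_type not in valid:
                errors.append(
                    f'unknown {section} type "{component_type}" for "{name}" '
                    f"(valid values: [{' '.join(valid)}])"
                )
    return errors
```

Value validation (for example, rejecting an AWS region that does not exist) would need a per-component schema on top of this, which is the harder part of the request.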

In the OTEL-contrib it looks like it internally generates its own metrics about the collection of metrics (but I couldn't figure out how to get them to export 😉) .. but that was my inspiration -- something useful during setup "bare metal" circumstances to demo/test behaviors, IAM access, etc. .. and/or a one time (at startup) "hello world" ping on type of internal signal that can be dumped to console & also exported to AMP, etc. (with transmission confirmed/failed)

Synchronize Version Numbers

Additionally I would humbly propose that in the future ADOT, as an OTEL vendor distribution, might attempt to synchronize version/release numbers, i.e. stay more in sync with OTEL, so at least I can follow OTEL docs, not constantly be needing to figure out what is a bug/issue, known behavior, blabla, and have a quick place to reference/find documented errata/differences/divergence between OTEL & ADOT. As you said, the ADOT cadence, with its meticulous audit & review process, is likely to be slower than the community version due to sec audits, etc., so internally saying "this is our vendor release of ADOT vX.X.X" should correspond with OTEL vX.X.X. In the event ADOT gets ahead, just document that in errata or add a letter to the sem-ver style (a,b,c,d) .. sem-ver are more like pirate code: yes they are rules, but also these are guidelines.

Presently it's not possible to (easily, without reading code) determine where an ADOT release fits with respect to OTEL code maturity & feature lifecycle .. this would greatly simplify reporting issues since the expectation is OTEL-contrib behavior (with sigv4, etc.) should more or less match the ADOT vendor distribution. .. ADOT already uses a 'v' prefix in its release tags, so there's no reason not to add an alpha suffix if/when ADOT requires it. I.e. I think this would help engagement since it's clear what is a known bug or difference and what isn't.
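To make the proposed tag scheme concrete, here is a rough sketch of how a tag such as `v0.54.0b` could be split into the upstream version it tracks and a vendor letter suffix. The scheme is only the proposal from this comment, not how ADOT actually tags releases, and `VendorVersion` and `parse_vendor_tag` are hypothetical names.

```python
import re
from typing import NamedTuple


class VendorVersion(NamedTuple):
    upstream: str  # upstream collector version the release tracks, e.g. "0.54.0"
    revision: str  # vendor letter suffix, "" when the release matches upstream


# Optional "v" prefix, a three-part version, and an optional single-letter suffix.
_TAG = re.compile(r"^v?(\d+\.\d+\.\d+)([a-z]?)$")


def parse_vendor_tag(tag: str) -> VendorVersion:
    """Split a release tag into its upstream version and vendor revision."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unrecognized release tag: {tag!r}")
    return VendorVersion(upstream=match.group(1), revision=match.group(2))
```

With tags shaped like this, anyone reporting an issue could read the matching upstream version straight off the release tag instead of digging through the code.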

Because the ADOT plan is to stay more or less in sync then do a best effort to document known incompatibilities, also maintain a list of known errata in the ADOT OTEL collector repo so we have a quick place to look and don't need to put in an AWS Ticket or cut myself on undocumented sharp edges. Any sharp edges to be fixed/known errata imho should also have a link to the issue on github for discussion, +1 votes, blabla. (yes, I know this is what github issues is for - but a concise dump of known/acknowledged errata "roadmap" vs. untriaged issues, or maybe this is simply a tag in github issues, but I think a link from the ROADMAP would still be warranted to say "this is known issues") .. this way ADOT creates the expectation it intends to stay in sync (mostly, as soon as commercially feasible) with OTEL.

And finally, this goes without saying --

DOCUMENTATION

ADOT OTEL docs should be community accessible/revisable, (in fact what I'm saying should be ALL AMAZON DOCS) .. but at least for ADOT, you've set up a separate website & portal, but it still follows the Amazon corporate format.

Stop/eliminate the AWS org Blog posts, etc. which "rot" and clutter the Internet, at least for ADOT -- consider every single post, blabla, should link to a project directory with a README.md containing "known errata" on the github repo where we can easily post "hey, this doesn't work" and/or submit fix examples. Whenever possible include more than one way to do it -- i.e. "aws cli" OR cloud formation. There must be a way to synchronize the content of these posts to/from git .. but before publishing sending a PR with the files & code should be an expectation for anybody on your team @mhausenblas .

Amazon Corp policy basically publishes docs that are well suited to be PDFs; it's opaque and difficult to provide feedback.
In a perfect world - blog posts that aren't maintained or are known to be wrong should either be removed or fixed.
I can't say how infuriating it is (for me) because often I'm converting cloudformation syntax to terraform only to find that the setting/repo/blabla has changed .. or was never correct (I have no way to know this, and no way to know if the problem was in my transcription or the example simply doesn't work)

Thanks for reading! Feel free to close this issue.

@mhausenblas
Member

Thanks again for your detailed feedback @elasticdotventures and it's much appreciated. One thing I'd like to point out in the context of the validation topic (which is something super important and we need to do more, in upstream): ICYMI, check out https://github.com/lightstep/otel-config-validator which is a really good start. Let's work together to get this done!

@github-actions
Contributor

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions bot added the stale label Nov 13, 2022
@github-actions
Contributor

This issue was closed because it has been marked as stale for 30 days with no activity.

github-actions bot closed this as not planned Dec 18, 2022

2 participants