Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(docs): Update opentelemetry example to show access logs #33925

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

mmanciop
Copy link

@mmanciop mmanciop commented May 2, 2024

Commit Message: Update the OpenTelemetry demo to show access log correlation and a more recent collector
Additional Description: This PR updates the OpenTelemetry demo to show how to send access logs to an OpenTelemetry collector, and how to correlate those logs with tracing data (pending #30268). Also, use a more recent build of the OpenTelemetry collector.
Risk Level: Low
Testing: Manual
Docs Changes: None
Release Notes: The OpenTelemetry demo has been updated to show access log correlation with tracing data and to use a more recent build of the OpenTelemetry collector
Platform Specific Features: N/A
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

Copy link

Hi @mmanciop, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #33925 was opened by mmanciop.

see: more, trace.

@mmanciop mmanciop force-pushed the opentelemetry-log-correlation branch from f36493c to cbf5b79 Compare May 2, 2024 06:53
@phlax phlax self-assigned this May 2, 2024
Copy link
Member

@phlax phlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for working on this @mmanciop

can i suggest removing all changes other than the access log additions

as said below, if we make the access log changes then this should be reflected in the docs for the sandbox, and tested

examples/opentelemetry/Dockerfile-opentelemetry Outdated Show resolved Hide resolved
examples/opentelemetry/Dockerfile-opentelemetry Outdated Show resolved Hide resolved
examples/opentelemetry/Dockerfile-opentelemetry Outdated Show resolved Hide resolved
@@ -63,9 +63,11 @@ services:
healthcheck:
test: ["CMD-SHELL", "curl -sf http://localhost:13133 || exit 1"]
interval: 1s
timeout: 120s
timeout: 10s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im not sure why this change is here or whether its desirable - i think it would make it retry every 1/12s

that said - 120 retries is probably too high - not sure where these values came from originally

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The collector comes up in a second or two tops. 120s is IMO just cause for FUD for the end user.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool - probably in that case tho - my reading of docs (maybe incorrectly) is that we would want also to change the retries values also to prevent it spinning out with healthchecks - happy to be wrong about that

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experience, the Otel collector comes up or not. I don't remember I saw a collector crashing for issues that were not related with configuration being invalid. And those won't be helped with retries. I could happily see myself drop the retry entirely, although the original reason might have been to bring another collector up when the current goes OOM; whether that precaution in an example warrants the humongous delay on crash loop when the user edits the configs to be invalid... I am unconvinced.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - nm - i misread - interval is set

in which case i think retries is most likely not required at all, but im possibly missing something there

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interval

The health check will first run interval seconds after the container is started, and then again interval seconds after each previous check completes.

Timeout

If a single run of the check takes longer than timeout seconds then the check is considered to have failed.

Retries

It takes retries consecutive failures of the health check for the container to be considered unhealthy.

From StackOverflow and this cheatsheet. I feel retries are useless. The collector will virtually always answer if it's alive (unless under the most massive GC I ever saw it doing), and if it's not alive, it's 99% a config issue, which are not helped by retries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks - i didnt read docs properly obviously - not sure in that case but what we want to avoid at all costs is making the sandbox flakey in ci

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel getting rid of retries in this case should make anything flakier tbh. If anything, it'd fail faster.

examples/opentelemetry/docker-compose.yaml Outdated Show resolved Hide resolved
@@ -25,6 +25,16 @@ static_resources:
cluster_name: opentelemetry_collector
timeout: 0.250s
service_name: envoy-1
access_log:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think these access_log changes are the real change in the PR

im not totally clear what the purpose is but if we are going to change these then the docs and verify.sh script should reflect that

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think these access_log changes are the real change in the PR

Pretty much, but also updating the collector to a more recent version and adjusting the configs is worthy in its own right.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless i misunderstand tho - the version isnt updated - just made explicit

happy to carry some cleanups, but i think the change is probably great and requires focus

to land, we really need to update the docs to explain, and relatedly to test what the docs say to prevent bitrot etc

Copy link
Author

@mmanciop mmanciop May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, the version did not actually change.

Before updating the docs, I would love for the correlation to... well actually work (see #30268). My idea to open the PR this early was to give @ashishb-solo a quick way to test the changes, and to motivate why the PR to actually fix the correlation was needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, thanks for context

is it the same issue as #33906 ?

i reopened the one you posted but i have a feeling that they are now duping

with these sandboxes they are documentation first, so to justify adding a load of configuration it needs to be documented and tested

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phlax : #33906 is about the correlation not working. And it still doesn't, it's broken at a technical level. This demo will start working as expected, and showcase the correlation, when the underlying bug in the opentelemetry access log exporter is fixed :-)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI #33909 solves the problem pretty nicely, see #33909 (comment)

examples/opentelemetry/otel-collector-config.yaml Outdated Show resolved Hide resolved
examples/opentelemetry/otel-collector-config.yaml Outdated Show resolved Hide resolved
Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
@mmanciop mmanciop force-pushed the opentelemetry-log-correlation branch from 8e723eb to 03fc3a4 Compare May 2, 2024 12:28
@mmanciop mmanciop requested a review from phlax May 2, 2024 12:31
@phlax
Copy link
Member

phlax commented May 2, 2024

/docs

Copy link

Docs for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-pr/33925/docs/index.html

The docs are (re-)rendered each time the CI envoy-presubmit (precheck docs) job completes.

🐱

Caused by: a #33925 (comment) was created by @phlax.

see: more, trace.

Copy link
Member

@phlax phlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the file that needs to be edited to update docs https://github.com/envoyproxy/envoy/blob/main/docs/root/start/sandboxes/opentelemetry.rst

i think lets insert a new step 4 (maybe more) that demonstrates what the config is doing

and this is the rendered docs for this sandbox at current commit https://storage.googleapis.com/envoy-pr/a712ed7/docs/start/sandboxes/opentelemetry.html

once we update we can look at the https://github.com/envoyproxy/envoy/blob/main/examples/opentelemetry/verify.sh which is intended to ~reflect the steps in the docs

Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
@mmanciop mmanciop force-pushed the opentelemetry-log-correlation branch from abc07d6 to 8abd1e3 Compare May 2, 2024 15:17
@mmanciop
Copy link
Author

mmanciop commented May 2, 2024

this is the file that needs to be edited to update docs https://github.com/envoyproxy/envoy/blob/main/docs/root/start/sandboxes/opentelemetry.rst

i think lets insert a new step 4 (maybe more) that demonstrates what the config is doing

and this is the rendered docs for this sandbox at current commit https://storage.googleapis.com/envoy-pr/a712ed7/docs/start/sandboxes/opentelemetry.html

once we update we can look at the https://github.com/envoyproxy/envoy/blob/main/examples/opentelemetry/verify.sh which is intended to ~reflect the steps in the docs

Done in 8abd1e3

@mmanciop
Copy link
Author

mmanciop commented May 2, 2024

Preparing the docs I noticed that maybe a little more setup is needed to add attributes. It will take me a little to hammer out :-)

Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
@mmanciop
Copy link
Author

mmanciop commented May 3, 2024

/docs

Copy link

Docs for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-pr/33925/docs/index.html

The docs are (re-)rendered each time the CI envoy-presubmit (precheck docs) job completes.

🐱

Caused by: a #33925 (comment) was created by @mmanciop.

see: more, trace.

@phlax
Copy link
Member

phlax commented May 3, 2024

@mmanciop will review more thoroughly shortly - but copying out ci suggestion to get past (one of) ci fails

  Please fix your editor to ensure:

      - no trailing whitespace
      - no preceding mixed tabs/spaces
      - all files end with a newline

Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
@mmanciop mmanciop force-pushed the opentelemetry-log-correlation branch from 051d660 to 2344c8f Compare May 3, 2024 11:20
@mmanciop mmanciop requested a review from phlax May 3, 2024 11:23
Copy link
Member

@phlax phlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for iterating @mmanciop - adding/updating these sandboxes generally takes a few

main feedback is to use the existing yaml files from the examples/opentelemtry dir as any user checking out repo and following along will have them directly

these get added automagically to the rst build dir (by bazel) so can be referenced with _include/openetelemetry/*yaml

existing sandbox docs mostly dont have captions but elsewhere we use this for downloading the file - in that case i think its "name" should be examples/opentelemetry/*.yaml - ie repo-relative

docs/root/start/sandboxes/opentelemetry.rst Outdated Show resolved Hide resolved
examples/opentelemetry/docker-compose.yaml Outdated Show resolved Hide resolved
examples/opentelemetry/envoy-1.yaml Show resolved Hide resolved
mmanciop and others added 5 commits May 3, 2024 13:58
Co-authored-by: phlax <phlax@users.noreply.github.com>
Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
… for docs

Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
@mmanciop mmanciop requested a review from phlax May 3, 2024 12:26
Copy link
Member

@phlax phlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mmanciop quick docs review

step 5 comes after the see also section which should be last - i would move step 5 -> step 4 and increment current step 5

wrt step 6 - its misplaced here - this should be in the config reference section - but when i checked that it seems that this page is currently missing or ~empty - this is the only config reference i could see https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/stat_sinks/open_telemetry_stat_sink.html

@mmanciop mmanciop changed the title chore(ui): Update opentelemetry example to show access logs chore(docs): Update opentelemetry example to show access logs May 8, 2024
@mmanciop
Copy link
Author

mmanciop commented May 8, 2024

@mmanciop quick docs review

step 5 comes after the see also section which should be last - i would move step 5 -> step 4 and increment current step 5

wrt step 6 - its misplaced here - this should be in the config reference section - but when i checked that it seems that this page is currently missing or ~empty - this is the only config reference i could see https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/stat_sinks/open_telemetry_stat_sink.html

Moved see_also to the end of the page in 93676b8, and deleted the configuration section that is misplaced.

Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
@mmanciop mmanciop force-pushed the opentelemetry-log-correlation branch from b7f93cf to 93676b8 Compare May 8, 2024 07:24
Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
Copy link
Member

@phlax phlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really close, thanks @mmanciop

for the verify.sh script i think we just want to add another section reflecting the addition to the docs - the tests can be pretty simple - just grep in the logs for the highlighted lines - something like

"${DOCKER_COMPOSE[@]}" logs opentelemetry | grep "Trace ID"

docs/root/start/sandboxes/opentelemetry.rst Outdated Show resolved Hide resolved
docs/root/start/sandboxes/opentelemetry.rst Outdated Show resolved Hide resolved
docs/root/start/sandboxes/opentelemetry.rst Outdated Show resolved Hide resolved
docs/root/start/sandboxes/opentelemetry.rst Outdated Show resolved Hide resolved
Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
@phlax
Copy link
Member

phlax commented May 10, 2024

@mmanciop could you update the verify.sh script and fix formatting so we can see the verify ci running

for ref you can run/test locally (assumes local user has access to Docker socket)

$ cd examples/opentelemetry
$ ./verify.sh

@mmanciop
Copy link
Author

@mmanciop could you update the verify.sh script and fix formatting so we can see the verify ci running

for ref you can run/test locally (assumes local user has access to Docker socket)

$ cd examples/opentelemetry

$ ./verify.sh

@phlax that's what I am working on, but it won't be pretty. The collector debug output cannot currently be formatted to JSON and that's gonna hurt.

@phlax
Copy link
Member

phlax commented May 10, 2024

in that case just use grep i reckon - that is ~equivalent to the highlighted lines in console output docs i think

@phlax
Copy link
Member

phlax commented May 10, 2024

for a stricter test you could fish out the ids and check they match - that would be pretty cool - but i reckon optional

…ccess logs match those of spans

Signed-off-by: Michele Mancioppi <michele.mancioppi@dash0.com>
@mmanciop
Copy link
Author

mmanciop commented May 10, 2024

in that case just use grep i reckon - that is ~equivalent to the highlighted lines in console output docs i think

Multiline grep is a portability nightmare in my experience. I cobbled up something that I think is reasonable in 0a71a08. It won't succeed until #33909 is merged and this PR is rebased on top of it.

@mmanciop mmanciop requested a review from phlax May 10, 2024 11:47
@phlax
Copy link
Member

phlax commented May 15, 2024

/wait for #33909

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants