
feat(tracing): add support for otlp exporter #2918

Merged — 4 commits into bentoml:main on Aug 21, 2022

Conversation

@dbuades (Contributor) commented on Aug 17, 2022

What does this PR address?

Mainly, it fixes #2917 by adding support for the otlp trace exporter (see the sketch after the list below).
It also addresses the following concerns:

  • Add opentelemetry-exporter-jaeger-thrift as an extra dependency in order to solve this issue
  • Update OpenTelemetry to 1.12 (no incompatibilities found on my side)
  • Log a warning message when sample_rate is not set for tracing.
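For context, wiring up the OTLP span exporter in the OpenTelemetry Python SDK looks roughly like the sketch below. This is a standalone illustration of the exporter this PR enables, not the BentoML integration code; the endpoint is a placeholder.

    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Export spans over OTLP/gRPC to a collector (placeholder endpoint).
    provider = TracerProvider()
    exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)
    provider.add_span_processor(BatchSpanProcessor(exporter))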

Before submitting:

Who can help review?

It was mentioned in Slack that @bojiang worked on the tracing feature.

@dbuades dbuades requested a review from a team as a code owner August 17, 2022 17:43
@dbuades dbuades requested review from parano and removed request for a team August 17, 2022 17:43
@@ -374,6 +388,9 @@ def tracer_provider(

    if sample_rate is None:
        sample_rate = 0.0
        logger.warning(
Member:
👍

@parano (Member):
Looks like this will show up for all users that don't use tracing?

@dbuades (author):
@parano You are right, I'm going to change it.

@ssheng (Collaborator):
Maybe we should warn if the user explicitly sets the sample_rate to 0.0. If the sample_rate is None, we can leave the behavior as-is.

@dbuades (author):
@ssheng I still think we should warn in both situations, since it took us a while to debug why traces weren't being sent. I'm open to discussing it, though.

I just pushed a fix for this; let me know what you think.
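To make the behavior under discussion concrete, here is a minimal sketch: warn whenever sampling is effectively disabled, whether sample_rate was left unset or explicitly set to 0.0. The message text and the function body around it are assumptions for illustration, not the merged code.

    import logging
    from typing import Optional

    logger = logging.getLogger(__name__)

    def tracer_provider(sample_rate: Optional[float] = None):
        # Treat an unset sample rate as 0.0 (no sampling), as in the diff above.
        if sample_rate is None:
            sample_rate = 0.0
        if sample_rate == 0.0:
            # Hypothetical message: surface that no traces will be exported.
            logger.warning(
                "Tracing is configured but sample_rate is 0.0; "
                "no traces will be collected or exported."
            )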

opentelemetry-sdk==1.9.0
opentelemetry-semantic-conventions==0.28b0
opentelemetry-util-http==0.28b0
opentelemetry-api>=1.9.0
Member:
@bojiang what's the reason we locked the opentelemetry version before? Was it to prevent accidentally installing pre-release versions? If that's the case, it should be OK to change this to >= now.

@aarnphm (Member) commented on Aug 17, 2022:
There were some issues with the opentelemetry pre-releases before; I can't remember what they were, and I don't know if they have been fixed upstream.

Is there any reason why we need to bump this up, @dbuades?

@bojiang (Member) commented on Aug 18, 2022:
We have to lock the versions of the otlp libs/contribs because the opentelemetry community does not organize its dependency map well. We should lock them to verified working versions.

@dbuades (author):
@aarnphm Unfortunately I needed to bump it, since the newer version is required by opentelemetry-exporter-otlp.
@bojiang Would you be comfortable with locking them to their current versions until the e2e tests are added? Although we don't have e2e tests yet, we tested tracing on our side with the jaeger, zipkin, and otlp exporters on these versions and didn't find any errors.
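For illustration, locking to verified working versions in setup.cfg would extend the pins shown above to the 1.12 release train (the contrib packages on that train are versioned 0.33b0). The exact pin set below is an assumption for the sake of example, not the final diff:

    opentelemetry-api==1.12.0
    opentelemetry-sdk==1.12.0
    opentelemetry-exporter-otlp==1.12.0
    opentelemetry-exporter-jaeger-thrift==1.12.0
    opentelemetry-semantic-conventions==0.33b0
    opentelemetry-util-http==0.33b0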

@aarnphm (Member) left a review comment:
LGTM with some minor changes

@ssheng (Collaborator) previously approved these changes on Aug 18, 2022:
I'm happy to merge and update the warning change for you in a separate PR.

codecov bot commented on Aug 18, 2022:

Codecov Report

Merging #2918 (70ecf88) into main (372b530) will decrease coverage by 0.97%.
The diff coverage is 28.57%.


@@            Coverage Diff             @@
##             main    #2918      +/-   ##
==========================================
- Coverage   70.82%   69.84%   -0.98%     
==========================================
  Files         104      121      +17     
  Lines        9435     9936     +501     
==========================================
+ Hits         6682     6940     +258     
- Misses       2753     2996     +243     
Impacted Files                                   Coverage Δ
bentoml/_internal/configuration/containers.py    76.53% <28.57%> (-3.26%) ⬇️
bentoml/_internal/service/loader.py              71.85% <0.00%> (-1.49%) ⬇️
bentoml/lightgbm.py                               0.00% <0.00%> (ø)
bentoml/bentos.py                                48.68% <0.00%> (ø)
bentoml/serve.py                                  0.00% <0.00%> (ø)
bentoml/transformers.py                           0.00% <0.00%> (ø)
bentoml/keras.py                                  0.00% <0.00%> (ø)
bentoml/sklearn.py                                0.00% <0.00%> (ø)
bentoml/catboost.py                               0.00% <0.00%> (ø)
bentoml/pytorch_lightning.py                      0.00% <0.00%> (ø)
... and 16 more

@bojiang bojiang self-requested a review August 18, 2022 03:39
@bojiang (Member) commented on Aug 18, 2022:

It seems that opentelemetry-exporter-otlp requires a newer version of opentelemetry-api.
We should fully test the otlp features, since the otlp plugins are at an early stage and have strict requirements on the API, and our e2e tests don't cover the otlp part.

setup.cfg (outdated review thread, resolved)
@aarnphm (Member) commented on Aug 19, 2022:

Hmm maybe you need to rebase this onto main to remove the merge commit?

@ssheng (Collaborator) commented on Aug 21, 2022:

I have verified that the change here is compatible with the existing jaeger and zipkin tracers. The otlp exporter is also working as expected. I will go ahead and merge this PR. Thanks again for your contribution, @dbuades!

Following this PR, I will lock the opentelemetry dependencies to 0.33b0 until they stabilize. I also discovered that the service name field was not being set to the API service or runner names, causing the UI to display "unknown_service". I will address this in the same follow-up PR. In the meantime, setting the service name through an environment variable works. Did you use the env var mechanism, @dbuades?
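For reference, the OpenTelemetry Python SDK picks up the service name from standard environment variables, so a workaround along these lines is possible today (values are placeholders):

    OTEL_SERVICE_NAME=my-bento-api-server
    OTEL_RESOURCE_ATTRIBUTES=service.name=my-bento-api-server,service.namespace=my-bento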

@ssheng ssheng merged commit 2552c8b into bentoml:main Aug 21, 2022
@dbuades (author) commented on Aug 21, 2022:


Thanks, @ssheng!
That's a very good point about the service name. We are currently using the env var mechanism, since we rely on that strategy in all of our microservices, so it felt natural to use it here as well.
However, it makes total sense to infer it from the Bento service name when the env var is not set. Let me know if you'd like me to help with the PR or with the review.

@ssheng (Collaborator) commented on Aug 21, 2022:

Thanks for the feedback, @dbuades! We'd love to get your review on this: #2928.

ssheng added a commit that referenced this pull request Aug 23, 2022
## What does this PR address?

Following #2918, verified that the latest OpenTelemetry version `0.33b0` is compatible with the existing `otlp`, `jaeger` and `zipkin` exporters. Locking the dependency versions until OTEL stabilizes.

In addition, this PR configures the OpenTelemetry resource automatically if the user has not explicitly configured it through environment variables (see the sketch below):
- API servers and runners are treated as OpenTelemetry services; the service name is set to the API server or runner name.
- The service namespace is set to the Bento name.
- The service version is set to the Bento version.
- The service instance ID is set to the worker ID.
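A minimal sketch of what this automatic resource configuration amounts to in the OpenTelemetry SDK; the attribute values are hypothetical placeholders standing in for the runner name, Bento name, Bento version, and worker ID:

    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider

    # Placeholder values; the real ones come from the Bento and worker context.
    resource = Resource.create(
        {
            "service.name": "iris_classifier_runner",
            "service.namespace": "iris_classifier",
            "service.version": "v1.0.0",
            "service.instance.id": "1",
        }
    )
    tracer_provider = TracerProvider(resource=resource)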

Individual trace page on Jaeger. API server and runners are labeled as separate services.
![Screen Shot 2022-08-21 at 02 54 17](https://user-images.githubusercontent.com/861225/185787072-25e01da3-41a3-40a5-8277-fb8653311571.png)

Search page:
![Screen Shot 2022-08-21 at 03 00 02](https://user-images.githubusercontent.com/861225/185787156-5960e43c-a071-421d-9ba1-1752108d83b0.png)

## Before submitting:
- [ ] Does the Pull Request follow the [Conventional Commits specification](https://www.conventionalcommits.org/en/v1.0.0/#summary) naming? Here is [GitHub's guide](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request) on how to create a pull request.
- [ ] Does the code follow BentoML's code style, and have both the `make format` and `make lint` scripts passed ([instructions](https://github.com/bentoml/BentoML/blob/main/DEVELOPMENT.md#style-check-auto-formatting-type-checking))?
- [ ] Did you read through the [contribution guidelines](https://github.com/bentoml/BentoML/blob/main/CONTRIBUTING.md#ways-to-contribute) and follow the [development guidelines](https://github.com/bentoml/BentoML/blob/main/DEVELOPMENT.md#start-developing)?
- [ ] Did your changes require updates to the documentation? Have you updated those accordingly? Here are the [documentation guidelines](https://github.com/bentoml/BentoML/tree/main/docs) and [tips on writing docs](https://github.com/bentoml/BentoML/tree/main/docs#writing-documentation).
- [ ] Did you write tests to cover your changes?

## Who can help review?

Feel free to tag members/contributors who can help review your PR.

Co-authored-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com>