Performance/Throughput Impact with auto instrumentation in Spring 5 applications #59

Closed
kurvatch opened this issue May 18, 2021 · 7 comments

@kurvatch

kurvatch commented May 18, 2021

Describe the bug
We are seeing more than 50% performance degradation after instrumenting our application with the OTel agent. The instrumented application runs on an EKS cluster. An OTel Collector running as a DaemonSet in the same EKS cluster collects the traces and ingests the data into AWS X-Ray.

Steps to reproduce
This is a Spring 5 project with WebFlux and Spring Cloud Stream support, interacting with SQS, DynamoDB, and AWS MSK.

What did you expect to see?
Without the OTel agent, the application could reach up to 250 requests per second with 2Gi of memory.

What did you see instead?
With the OTel agent, we are seeing ~65 requests per second with the same settings. I was expecting some degradation in throughput, but this is more than 50%.
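
For reference, the drop implied by the two figures in this report (250 and ~65 requests per second) can be worked out directly. The snippet below is only an illustrative calculation; the class and method names are made up for this example.

```java
public class ThroughputDrop {
    // Percentage drop from a baseline rate to an observed rate.
    static double dropPercent(double baseline, double observed) {
        return (baseline - observed) * 100.0 / baseline;
    }

    public static void main(String[] args) {
        // Figures from this report: 250 req/s without the agent, ~65 req/s with it.
        System.out.println(dropPercent(250, 65) + "% lower throughput");
    }
}
```

With these numbers the drop works out to roughly 74%, well beyond the 50% mentioned above.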

Additional context
We are using aws-opentelemetry-agent-1.1.0 with default settings for the batch span processor (BSP). Sampling is set to 100%, and the metrics exporter is set to logging.

@stnor

stnor commented May 19, 2021

What sampler are you using? Why 100% sampling? That will have a perf impact.

I used -Dotel.traces.sampler.traceidratio=true -Dotel.traces.sampler.arg=0.005 and had perf issues (CPU utilization rose by 50%).

Switching to the parent-based ratio sampler, parentbased_traceidratio, had a huge impact for me, but I am using a lot of internal spans.

@kurvatch
Author

kurvatch commented May 19, 2021

@stnor The sampler is parentbased_always_on. I am testing with the default configuration. I expected some performance impact with 100% sampling, but was shocked to see throughput degrade by more than 50%. I am doing another round of testing with parentbased_traceidratio and -Dotel.traces.sampler.arg=0.25 for the same application.
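
To make the sampler names in this exchange concrete, here is a simplified sketch of how a parent-based, trace-ID-ratio decision can work: a span with a parent follows the parent's decision, and a root span is kept when the low 64 bits of its trace ID fall under a threshold derived from the ratio. This is an illustration only, not the agent's actual code; the class name, method names, and trace ID are hypothetical.

```java
public class ParentBasedRatioSketch {
    // Ratio decision for a root span, based on the low 64 bits of a
    // 32-character hex trace ID. Simplified; real samplers may differ.
    static boolean ratioDecision(String traceIdHex, double ratio) {
        if (ratio >= 1.0) return true;
        if (ratio <= 0.0) return false;
        long upperBound = (long) (ratio * Long.MAX_VALUE);
        long lowBits = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        // Clear the sign bit so the value lies in [0, Long.MAX_VALUE].
        return (lowBits & Long.MAX_VALUE) < upperBound;
    }

    // parentSampled is null for a root span (no parent context).
    static boolean shouldSample(Boolean parentSampled, String traceIdHex, double ratio) {
        if (parentSampled != null) {
            return parentSampled; // child spans inherit the parent's decision
        }
        return ratioDecision(traceIdHex, ratio);
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736"; // made-up trace ID
        System.out.println(shouldSample(null, traceId, 0.25)); // root span
        System.out.println(shouldSample(true, traceId, 0.25)); // child of a sampled parent
    }
}
```

Under this model, and assuming uniformly random trace IDs, a ratio of 0.25 keeps about one in four root traces, and every child span shares the outcome of its parent.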

@stnor

stnor commented May 19, 2021

25% is a very high sampling frequency in my experience.

@kurvatch
Author

Is there a standard recommendation available? It would definitely help to publish benchmark results for different samplers and ratios, using a demo application that interacts with a database and a messaging system.

@anuraaga
Contributor

Hi @kurvatch - I agree that the performance impact seems much larger than we'd expect. Lowering the sampling rate is great for reducing load on backends, but we wouldn't expect that much overhead at hundreds of QPS.

I have filed open-telemetry/opentelemetry-java-instrumentation#3047, since that repo is where the actual code lives and where the performance bottlenecks can be investigated.

@github-actions

github-actions bot commented Oct 2, 2022

This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

@github-actions github-actions bot added the stale label Oct 2, 2022
@github-actions

github-actions bot commented Nov 6, 2022

This issue was closed because it has been marked as stale for 30 days with no activity.

@github-actions github-actions bot closed this as completed Nov 6, 2022