Add opentelemetry+elastic agent overhead benchmark on a weekly basis #3369

Closed
jackshirazi opened this issue Oct 18, 2023 · 17 comments
@jackshirazi
Contributor

The OpenTelemetry overhead benchmark can easily be configured to include the Elastic agent, so it is a good candidate to run on a weekly basis.

@jackshirazi
Contributor Author

Here's a script that runs all the current agents (baseline, otel release, otel snapshot, elastic release, elastic snapshot, elastic release async) on a new VM. The script requires GitHub token access, with the credentials set in the variables gituser and gittoken.

sudo apt update
sudo apt install -y openjdk-17-jdk-headless
sudo apt install -y docker.io
sudo apt install -y jq
sudo apt install -y unzip
git clone https://github.com/open-telemetry/opentelemetry-java-instrumentation.git
cd opentelemetry-java-instrumentation/
./gradlew assemble
cd benchmark-overhead
ELASTIC_SNAPSHOT_URL=$(curl -s -u $gituser:$gittoken "https://api.github.com/repos/elastic/apm-agent-java/actions/workflows/49838992/runs?branch=main" | jq -c '.workflow_runs[] | {conclusion, updated_at, display_title, url}' | grep -v null  | grep -v pending | grep -v cancelled | grep success | head -1 | awk -F'":"' '{print $5}' | tr -d '"}')
ELASTIC_SNAPSHOT_ARTIFACTS=$(curl -s -u $gituser:$gittoken "$ELASTIC_SNAPSHOT_URL" | grep artifacts_url | awk -F'":' '{print $2}' | tr -d '"} ,')
ELASTIC_SNAPSHOT_ZIPFILE=$(curl -s -u $gituser:$gittoken "$ELASTIC_SNAPSHOT_ARTIFACTS" | jq -c ".artifacts[] | {name,archive_download_url}" | grep '"elastic-apm-agent"' | awk -F'":' '{print $3}' | tr -d '"}')
curl -s --output "elastic-agent.zip"  -L -H "Accept: application/vnd.github+json" -H "Authorization: Bearer $gittoken" -H "X-GitHub-Api-Version: 2022-11-28" -u $gituser:$gittoken "$ELASTIC_SNAPSHOT_ZIPFILE"
unzip elastic-agent.zip
ELASTIC_SNAPSHOT_JAR=$(ls -1 elastic-apm-agent-*.jar)
ELASTIC_SNAPSHOT_ENTRY="new Agent(\\\"elastic-snapshot\\\",\\\"latest available snapshot version from elastic main\\\",\\\"file://$PWD/$ELASTIC_SNAPSHOT_JAR\\\")"
ELASTIC_LATEST_VERSION=$(curl -s https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/ | perl -ne 's/<.*?>//g; if(s/^([\d\.]+).*$/$1/){print}' | sort -V | tail -1)
ELASTIC_LATEST_ENTRY="new Agent(\\\"elastic-latest\\\",\\\"latest available released version from elastic main\\\",\\\"https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/$ELASTIC_LATEST_VERSION/elastic-apm-agent-$ELASTIC_LATEST_VERSION.jar\\\")"
ELASTIC_LATEST_ENTRY2="new Agent(\\\"elastic-async\\\",\\\"latest available released version from elastic main\\\",\\\"https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/$ELASTIC_LATEST_VERSION/elastic-apm-agent-$ELASTIC_LATEST_VERSION.jar\\\", java.util.List.of(\\\"-Delastic.apm.delay_agent_premain_ms=15000\\\"))"
NEW_LINE="              .withAgents(Agent.NONE, Agent.LATEST_RELEASE, Agent.LATEST_SNAPSHOT, $ELASTIC_LATEST_ENTRY, $ELASTIC_LATEST_ENTRY2, $ELASTIC_SNAPSHOT_ENTRY)"
echo $NEW_LINE
perl -i -ne "if (/withAgents/) {print \"$NEW_LINE\n\"}else{print}" src/test/java/io/opentelemetry/config/Configs.java
sudo ./gradlew test
perl -ne '/Standard output/ && $on++; /\<\/pre\>/ && ($on=0);$on && s/\<.*\>//;$on && !/^\s*$/ && print' build/reports/tests/test/classes/io.opentelemetry.OverheadTests.html
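
As a possible alternative to scraping the directory listing with perl, the latest released version could be read from the artifact's maven-metadata.xml. This is a minimal sketch, assuming the standard Maven repository layout in which that file carries a `<release>` element; `latest_release` is a hypothetical helper name and is not part of the script above.

```shell
# Hypothetical helper: reads a maven-metadata.xml document on stdin and
# prints the content of its first <release> element.
latest_release() {
  sed -n 's:.*<release>\(.*\)</release>.*:\1:p' | head -n 1
}

# Usage (assumes network access and the standard metadata location):
# ELASTIC_LATEST_VERSION=$(curl -s https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/maven-metadata.xml | latest_release)
```

This avoids depending on the HTML layout of the directory listing, at the cost of trusting the repository's own notion of the latest release.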

@jackshirazi
Contributor Author

jackshirazi commented Oct 18, 2023

Here's the output from a run. The script will need adjusting to emit this in a CSV or JSON format so that we can see trends. Note this run is on a VM running on a shared host, so the variability could be down to resource contention. The weekly script needs to run on an isolated, dedicated host.

----------------------------------------------------------
 Run at Wed Oct 18 12:23:13 UTC 2023
 release : compares no agent, latest stable, and latest snapshot agents
 5 users, 5000 iterations
----------------------------------------------------------
Agent               :              none           latest         snapshot   elastic-latest    elastic-async elastic-snapshot
Run duration        :          00:00:55         00:01:05         00:01:07         00:00:59         00:01:00         00:00:59
Avg. CPU (user) %   :        0.36701292       0.42148226       0.41992074        0.3964025        0.4192863       0.39579248
Max. CPU (user) %   :            0.5275       0.56296295         0.566416             0.56        0.5785536           0.5425
Avg. mch tot cpu %  :         0.9447133       0.95868945       0.94469744       0.94119316       0.95311606         0.947395
Startup time (ms)   :              9066            12809            12752            12841             8626            12790
Total allocated MB  :          15065.52         20208.01         20320.67         15992.55         15519.32         15927.61
Min heap used (MB)  :            180.06           119.93           114.68           188.48           110.91           141.44
Max heap used (MB)  :            549.36           429.93           374.98           639.65           446.14           532.69
Thread switch rate  :         56717.484        56390.805        54232.383        55348.195         56344.96          54675.2
GC time (ms)        :               820              571              786              933              425              844
GC pause time (ms)  :               820              571              786              933              425              844
Req. mean (ms)      :              4.05             4.84             5.01             4.37             4.46             4.37
Req. p95 (ms)       :             10.47            12.82            13.09            11.33            11.51            11.42
Iter. mean (ms)     :             53.19            63.49            65.44            57.30            58.35            57.31
Iter. p95 (ms)      :             81.21            96.39           100.21            86.74            91.41            86.21
Net read avg (bps)  :       14473952.00      12654705.00      12173681.00      12410859.00      11300186.00      12229296.00
Net write avg (bps) :       19364629.00      69109205.00      66806772.00      16584103.00      15095361.00      16333396.00
Peak threads        :                40               53               54               47               47               47
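
One way the table above could be turned into CSV is sketched below. It assumes the layout shown in this run: a header block enclosed by two dashed lines, followed by rows of the form "label : value value ...". `table_to_csv` is a hypothetical helper name, not something the benchmark provides.

```shell
# Hypothetical helper: converts the benchmark's text table (on stdin) to CSV.
table_to_csv() {
  awk '
    /^-+$/ { rules++; next }          # count the dashed separator lines
    rules < 2 { next }                # skip the header block between them
    {
      i = index($0, ":")              # the first colon ends the metric label
      if (i == 0) next
      label = substr($0, 1, i - 1)
      sub(/[ \t]+$/, "", label)       # trim the padding after the label
      n = split(substr($0, i + 1), v, " ")
      line = "\"" label "\""
      for (k = 1; k <= n; k++) line = line "," v[k]
      print line
    }'
}

# Usage (results.txt is an assumed file holding the output shown above):
# table_to_csv < results.txt > results.csv
```

Splitting on the first colon only matters for rows such as "Run duration", whose values contain colons themselves.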

@jackshirazi
Contributor Author

Note the otel tests include a collector; the elastic tests need to include a mock apm server (eg the one in this project's tests should work).

@jackshirazi
Contributor Author

@v1v this is the test we'd like to run on an isolated, specific hardware configuration. The suggested Ubuntu 20.04 - 6 CPU Cores / 64117 MB Memory would be fine, and we can go with that in Buildkite or wait for runners if those are likely to be available in the next couple of months.

@v1v
Member

v1v commented Oct 18, 2023

A few questions:

sudo ./gradlew test

What's the reason for sudo? I cannot see anything related to sudo in https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/benchmark-overhead#setup-and-usage

the elastic tests need to include a mock apm server

Can you provide the set of steps to run the mock apm-server?

So far I managed to test the above-mentioned steps in Buildkite, see https://buildkite.com/elastic/apm-agent-java-load-testing/builds/147#018b446f-ba68-4049-9a8e-78b07b3d4eb3

Those steps have been coded in #3371

@jackshirazi
Contributor Author

What's the reason for sudo

I didn't actually try to work out why it failed without sudo, but the docker images wouldn't run. It's probably something to do with the docker install; it might not be the best choice of docker install.

Can you provide the set of steps to run the mock apm-server?

Will do, I'll update the script when I get there

@v1v
Member

v1v commented Oct 19, 2023

Status update

This Buildkite build produced:

image

In addition it archives the html report, see here

Next steps:

  • Store benchmarks in ES

@jackshirazi
Contributor Author

jackshirazi commented Oct 25, 2023

To add the APM mock server, we need to run the following before the test (eg any time after docker is installed but before the test is run):

git clone https://github.com/elastic/apm-mutating-webhook.git
cd apm-mutating-webhook/test/mock
docker build -t mock-apm-server .
docker run -dp 127.0.0.1:8027:8027 mock-apm-server

and then return to the root directory for the test script. At the end of the test, for cleanup, we want to stop and remove the container:

MOCK_APM_SERVER=$(docker ps | grep mock-apm-server | awk '{print $1}')
docker stop $MOCK_APM_SERVER
docker rm $MOCK_APM_SERVER
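
A variant that avoids grepping the `docker ps` output is to give the container a fixed name. This is only a sketch: the container name and the two function names are assumptions, while the image name and port mapping are the ones used above.

```shell
# Assumed container name; the image name mock-apm-server comes from the
# docker build step above.
MOCK_APM_NAME=mock-apm-server

# Start the mock server detached, under a known name.
start_mock_apm_server() {
  docker run -d --name "$MOCK_APM_NAME" -p 127.0.0.1:8027:8027 mock-apm-server
}

# Remove the container; -f stops it first if it is still running.
stop_mock_apm_server() {
  docker rm -f "$MOCK_APM_NAME"
}

# Usage: start before the benchmark and clean up even if the test fails.
# start_mock_apm_server
# trap stop_mock_apm_server EXIT
```

With the trap in place, the cleanup also runs when the benchmark step exits with an error.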

@jackshirazi jackshirazi mentioned this issue Oct 25, 2023
19 tasks
@jackshirazi
Contributor Author

The class in #3384 will process the output and convert it for sending to ES the same way that PostProcessBenchmarkResults in run-benchmarks does

@v1v
Member

v1v commented Oct 26, 2023

For adding the APM mock server,

That's now done and working like a charm, see this build

Image

The class in #3384 will process the output and convert it for sending to ES

I'm gonna work on this now

@jackshirazi
Contributor Author

There's one further change to the existing script: these 3 bash variables need to be changed as follows

ELASTIC_SNAPSHOT_ENTRY="new Agent(\\\"elastic-snapshot\\\",\\\"latest available snapshot version from elastic main\\\",\\\"file://$PWD/$ELASTIC_SNAPSHOT_JAR\\\", java.util.List.of(\\\"-Delastic.apm.server_url=http://host.docker.internal:8027/\\\"))"
ELASTIC_LATEST_ENTRY="new Agent(\\\"elastic-latest\\\",\\\"latest available released version from elastic main\\\",\\\"https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/$ELASTIC_LATEST_VERSION/elastic-apm-agent-$ELASTIC_LATEST_VERSION.jar\\\", java.util.List.of(\\\"-Delastic.apm.server_url=http://host.docker.internal:8027/\\\"))"
ELASTIC_LATEST_ENTRY2="new Agent(\\\"elastic-async\\\",\\\"latest available released version from elastic main\\\",\\\"https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/$ELASTIC_LATEST_VERSION/elastic-apm-agent-$ELASTIC_LATEST_VERSION.jar\\\", java.util.List.of(\\\"-Delastic.apm.delay_agent_premain_ms=15000\\\",\\\"-Delastic.apm.server_url=http://host.docker.internal:8027/\\\"))"
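
The escaped quoting in these entries is easy to get wrong by hand, so a small helper along these lines could build them. `agent_entry` is a hypothetical name, and the sketch assumes the same `\"` escaping that the perl substitution in the script expects.

```shell
# Hypothetical helper: prints one 'new Agent(...)' entry with \" escapes.
# Arguments: name, description, jar URL, then zero or more JVM arguments.
agent_entry() {
  local name="$1" desc="$2" url="$3"
  shift 3
  local entry="new Agent(\\\"$name\\\",\\\"$desc\\\",\\\"$url\\\""
  if [ "$#" -gt 0 ]; then
    local args="" arg
    for arg in "$@"; do
      # Join the JVM arguments with commas, each wrapped in \" ... \"
      args="${args:+$args,}\\\"$arg\\\""
    done
    entry="$entry, java.util.List.of($args)"
  fi
  printf '%s)\n' "$entry"
}

# Usage (JAR_URL stands for the Maven Central jar URL built in the script):
# ELASTIC_LATEST_ENTRY=$(agent_entry "elastic-latest" "latest available released version from elastic main" "$JAR_URL" "-Delastic.apm.server_url=http://host.docker.internal:8027/")
```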

@jackshirazi
Contributor Author

And the final steps are to add the following during setup

git clone https://github.com/elastic/apm-agent-java.git
cd apm-agent-java
./mvnw clean install -DskipTests=true -Dmaven.javadoc.skip=true > mvn-log.log 2> mvn-err.log

and then after the benchmark is run

java -cp ~/apm-agent-java/apm-agent-benchmarks/target/benchmarks.jar co.elastic.apm.agent.benchmark.ProcessOtelBenchmarkResults build/reports/tests/test/classes/io.opentelemetry.OverheadTests.html output.json $ELASTIC_LATEST_VERSION opentelemetry-javaagent.jar

@v1v
Member

v1v commented Oct 26, 2023

Status update

All the bits and pieces have been put in place and this build ran successfully and ingested the documents in the observability-benchmarks cluster.

Index name: otel-microbenchmarks

image

There is just one minor improvement remaining: using a prebuilt benchmarks.jar rather than building it from source code. See #3386

The reason is that we already use the GitHub api/cli to fetch elastic-apm-agent.jar as described in #3369 (comment)

@v1v
Member

v1v commented Oct 26, 2023

I've started to see some failures when integrating a couple of new changes:



Failed to map supported failure 'org.opentest4j.AssertionFailedError: Unhandled exception in release' with mapper 'org.gradle.api.internal.tasks.testing.failure.mappers.OpenTestAssertionFailedMapper@4277127c': Cannot invoke "Object.getClass()" because "obj" is null

> Task :test

OverheadTests > runAllTestConfigurations() > release FAILED
org.opentest4j.AssertionFailedError at OverheadTests.java:72
Caused by: com.github.dockerjava.api.exception.ConflictException at OverheadTests.java:147

1 test completed, 1 failed

> Task :test FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':test'.
> There were failing tests. See the report at: file:///var/lib/buildkite-agent/.buildkite-agent/builds/worker-1799330-build-hel1-dc1-hetzner-elasticnet-co/elastic/apm-agent-java-load-testing/opentelemetry-java-instrumentation/benchmark-overhead/build/reports/tests/test/index.html

* Try:
> Run with --scan to get full insights.

Deprecated Gradle features were used in this build, making it incompatible with Gradle 9.0.

You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.

For more on this, please refer to https://docs.gradle.org/8.4/userguide/command_line_interface.html#sec:command_line_warnings in the Gradle documentation.

BUILD FAILED in 8m 14s
3 actionable tasks: 3 executed


See https://buildkite.com/elastic/apm-agent-java-load-testing/builds/178#018b6cb0-53af-4835-a865-69b52c30ddc2/111-112

@jackshirazi , do you happen to know what's the reason?

@v1v
Member

v1v commented Oct 26, 2023

@jackshirazi , do you happen to know what's the reason?

It worked in the next run https://buildkite.com/elastic/apm-agent-java-load-testing/builds/179, so maybe it was some weird environmental issue. To help with debugging, I added archiving for the index.html that contains the test results

@jackshirazi
Contributor Author

I don't know what the failure was, and that index.html file won't help. It's the result file build/reports/tests/test/classes/io.opentelemetry.OverheadTests.html (the one that holds the results) that would have the details of the error. I think I saw a similar failure in one of my tests; it was caused by a failure to download the otel jar file from maven, ie maven flakiness
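
If the failure really is transient (which is only a guess at this point), one option is to wrap the flaky step in a retry. The sketch below is generic shell; `retry` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary.

```shell
# Hypothetical helper: retry <max-attempts> <delay-seconds> <command...>
# Re-runs the command until it succeeds or the attempts are used up.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "command failed after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage: up to 3 attempts, 30 seconds apart.
# retry 3 30 ./gradlew test
```

Retrying the whole test run makes each attempt expensive, so it is probably worth keeping the result file from every failed attempt to check whether the cause is actually the download.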

@jackshirazi
Contributor Author

Completed with #3371.

I'll spin out 2 follow-up tasks: the dashboard and adding continuous profiling
