Implement JMH benchmark for support process instance creation on larger state #12241
🧪 Small breakthrough on the performance test side. I was able to write a JMH benchmark, which I want to use to test the improvements. I ran it once with the base (no state) and once with a big state to see the differences. As expected, the base is ~3-4 times faster than the run with the bigger state. I will try to play around with the RocksDB config to reproduce the big-state test more easily, without keeping the big state in the repo (as resources).
Related branch https://github.com/camunda/zeebe/tree/zell-performance-test-engine
I merged the fixes branch into the branch with the JMH benchmark to see what difference it makes with the large state snapshot.
Looks pretty good, as we also saw on our performance hack day 🙂 👍 I will now try whether I can reproduce the slow performance with a smaller RocksDB config, instead of having a large state in the repo. 😅
🔧 **Benchmark Improvements**

I have changed the output time unit to seconds to make the results more representative. The time unit for the test runs has also been changed to seconds, which means JMH tries to run as many operations as possible in one second. Furthermore, I increased the iterations for warmup and benchmark measurement (100 & 200). This means we already build up a lot of state in the warm-up phase, which our base results already show: the throughput drastically decreases. The base result now looks like this (with default configs):
This means we can on average start ~20 PI per second; in the benchmark output we can see that it actually starts strong with ~370 PI/s, but drops quite fast. The idea and goal is that, even with a longer warm-up phase or a big state to start with, the operation count per second stays stable (throughput shouldn't decrease), which is currently not the case.

🕵️ **Changing config**

After reducing the RocksDB memory size:
As a result, we can see the performance dropped in the same way as it did before with more iterations (in the benchmark we also had a high iteration count). BTW, I had played around before with just the configuration, without changing the iterations, and got a similar effect.
I think the base benchmark is enough for now, since we have already seen with many iterations that the throughput goes down. We can start with the first increment of changes and improvements and validate it with the JMH benchmarks, but of course also with our normal Zeebe benchmarks. Later, if we are confident that we have found several solutions which make the base look good, we can play around again with starting directly with a bigger state and/or different configs, which should then have a negligible effect.
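For reference, the warmup/measurement setup described above could look roughly like this in JMH (a sketch only: the benchmark method name follows the JMH output quoted in this thread, but the exact annotation layout in the Zeebe repo may differ):

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

public class EnginePerformanceTest {

  @Benchmark
  // Throughput mode: run as many operations as possible per time unit.
  @BenchmarkMode(Mode.Throughput)
  // Report results in operations per second.
  @OutputTimeUnit(TimeUnit.SECONDS)
  // Long warmup (100 one-second iterations) so a lot of state is already
  // built up before measurement; 200 measured iterations afterwards.
  @Warmup(iterations = 100, time = 1, timeUnit = TimeUnit.SECONDS)
  @Measurement(iterations = 200, time = 1, timeUnit = TimeUnit.SECONDS)
  public void measureProcessExecutionTime() {
    // start a process instance and await its completion (omitted here)
  }
}
```

With this setup, a sustained drop of the per-iteration score over the run is exactly the throughput degradation described above.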
Since I have recently changed the configs of the JMH benchmark, I have rerun the previous benchmark with the state fixes. Furthermore, I pushed the JMH benchmark with the fixes to a separate branch here, to rerun it later.
I looked into this and found out that the RecordingExporter was an issue; I now reset it after every run, and it no longer slows the benchmark down. This also shows again that the POC changes perform pretty well. I will update this comment continuously to reflect our progress and how the changes affect performance.
With the JMH tests running in IntelliJ, I'm also able to run the async profiler inside the IDE. This gave me some more insight into what might be pressing and which potential issues to fix. For example, we can see that seek is a big part of the equation, and that we check for existence where it might not be necessary yet. This brings up again the topic of rearranging data for better operability: if we prefixed most keys, for example variables with their scope key, we could do a range delete instead of iterating over the column family. Update: without the consistency check it becomes even more obvious. Note that the benchmarks don't use any variables.
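The rearranging idea can be illustrated with a sorted map standing in for a RocksDB column family: if variable keys are prefixed by their scope key, all variables of a scope form one contiguous key range that can be dropped with a single range delete instead of iterating the whole column family. The key layout below is hypothetical, purely for illustration:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public final class ScopePrefixedVariables {

  // Stand-in for a RocksDB column family: keys stored in sorted order.
  // Hypothetical key layout: "<scopeKey>|<variableName>".
  private final NavigableMap<String, String> columnFamily = new TreeMap<>();

  public void put(final long scopeKey, final String name, final String value) {
    columnFamily.put(scopeKey + "|" + name, value);
  }

  // Because all variables of a scope share the same prefix, they occupy one
  // contiguous range and can be removed with a single range delete, without
  // visiting unrelated entries of other scopes.
  public void deleteScope(final long scopeKey) {
    columnFamily.subMap(scopeKey + "|", scopeKey + "|\uffff").clear();
  }

  public int size() {
    return columnFamily.size();
  }
}
```

In RocksDB the same idea maps to `deleteRange` over the prefix bounds rather than clearing a submap, but the ordering argument is identical.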
12475: Introduce cache for process versions r=megglos a=Zelldon

## Description

When executing a process instance we often have to get the process model and its related version. When a cluster runs for a while (creating a lot of state etc.), the process version will eventually be migrated to a low level of RocksDB (potentially L3), because process models are usually not deployed that often. In other words, if a key is not updated, RocksDB moves it to a lower level over time. Accessing lower levels of RocksDB is slower than accessing higher levels or the memtables.

You might ask: if we repeatedly access the key via RocksDB, why is it not in the cache? There are multiple reasons:

1. We only have caches configured for L0 and L1 (not for lower levels).
2. We have limited the cache sizes to a certain amount, which might cause continuous eviction.

To avoid running into issues with such cold, mostly static data, we can introduce our own caches to work around this. This allows us to avoid unnecessary RocksDB access, unnecessary IO, etc.

**This PR does the following:**

* Refactors the NextValueManager, including renaming, as its only purpose is to be used for process versions.
* Introduces a new cache for the version of each process, in order to avoid access to cold data.

**Performance:**

We ran the JMH benchmark again with the changes and can see that the performance _slightly_ increased (potentially 8%). Not significant on its own, but it will likely come into play with other changes later. [See more details here](#12241 (comment))

## Related issues

closes #12034

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
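The read-through caching of cold, mostly-static data described in this PR can be sketched in plain Java. The class and method names here are illustrative, not the actual Zeebe API; the `stateReader` stands in for the RocksDB lookup:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongUnaryOperator;

// Illustrative sketch: cache process versions in memory so repeated lookups
// do not hit RocksDB, where cold keys may sit in low, slow-to-access levels.
public final class ProcessVersionCache {

  private final Map<Long, Long> versionsByProcessKey = new ConcurrentHashMap<>();
  private final LongUnaryOperator stateReader; // stands in for the RocksDB read

  public ProcessVersionCache(final LongUnaryOperator stateReader) {
    this.stateReader = stateReader;
  }

  public long getVersion(final long processKey) {
    // Only the first lookup per key goes to the (slow) state; afterwards the
    // value is served from the in-memory cache.
    return versionsByProcessKey.computeIfAbsent(processKey, stateReader::applyAsLong);
  }

  // Deployments are rare; on a new deployment the cached entry is refreshed.
  public void onNewVersionDeployed(final long processKey, final long version) {
    versionsByProcessKey.put(processKey, version);
  }
}
```

Because deployments are infrequent, the cache stays valid almost all the time and only needs to be updated on the (rare) deployment path.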
12483: Introduce experimental SST partitioning r=Zelldon a=Zelldon

## Description

Discovered this in [a RocksDB Google group post](https://groups.google.com/g/rocksdb/c/l3CzFD4YBYQ#:~:text=another%20way%20that%20might%20be%20helpful%20is%20using%20sst_partitioner_factory%20.%20By%20using%20this%20experimental%20feature%2C%20you%20can%20partition%20the%20ssts%20based%20on%20your%20desired%20prefix%20which%20means%20you%20would%20only%20have%20to%20tell%20how%20many%20entries%20are%20in%20that%20sst.).

[From the Javadoc](https://javadoc.io/static/org.rocksdb/rocksdbjni/6.20.3/org/rocksdb/ColumnFamilyOptionsInterface.html#setSstPartitionerFactory(org.rocksdb.SstPartitionerFactory)):

> use the specified factory for a function to determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).

### Details

SST partitioning based on the column family prefix (virtual column family) allows splitting up key ranges into separate SST files, which should improve compaction and make propagation of SST files less write amplifying. It will create more files at runtime and in snapshots: at least one SST file for each column family we use at runtime.

As discussed here https://camunda.slack.com/archives/C04T7T0RPLY/p1681931668446069, we want to add this as an experimental feature for now, so people can play around with it and we can do so as well. The benchmark results so far looked quite promising. The feature itself is marked as experimental in RocksDB, so it makes sense to mark it as experimental on our side as well.

Open questions:

1. The config is marked as an experimental feature in RocksDB. I don't know what that exactly means. Is this a problem for us? Would we just stay on the current version when they remove it? Is it unstable? Not sure yet.
2. 
The maximum throughput seems to be degraded a bit: as I mentioned earlier, we are currently able to reach around ~240 PI/s, while [with this configuration we reach ~220 PI/s.](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&refresh=10s&from=now-6h&to=now&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-max-out-sst-partitioner&var-pod=All&var-partition=All) I think it depends on what our priority is right now: the maximum throughput, or stable performance on a larger state. Is it OK to hurt our maximum throughput a little? We will need to investigate this further.

### JMH Benchmarks

I tried it with the JMH benchmark and it gave impressive results:

```
Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  656.639 ±(99.9%) 91.394 ops/s [Average]
  (min, avg, max) = (1.775, 656.639, 1163.635), stdev = 386.967
  CI (99.9%): [565.246, 748.033] (assumes normal distribution)

# Run complete. Total time: 00:07:12

Benchmark                                           Mode  Cnt    Score    Error  Units
EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  656.639 ± 91.394  ops/s
```

[Remember, the base was ~230.](#12241 (comment))

### Zeebe Benchmarks

After the JMH benchmark I started some new benchmarks, e.g. for the large state. I wanted to see how it would survive when we continuously just start instances. Remember: previously we died after ~1 hour, when reaching 800 MB of state. [In the benchmark we reached at least ~4.5 gig and were still able to handle the same load (over 6 hours).](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&from=1681912207012&to=1681930704963&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-large-state-sst-partition&var-pod=All&var-partition=All) :exploding_head:

![snapshot](https://user-images.githubusercontent.com/2758593/235164591-0ba3cb40-aa47-4bf4-b647-9992ac5d7e88.png)
![general](https://user-images.githubusercontent.com/2758593/235164598-5da0906e-a50f-4235-a5b8-48181dffc9d5.png)

#### Maxing out benchmark

![maxgeneral](https://user-images.githubusercontent.com/2758593/235164601-bab9f40c-20be-4cbe-8530-c0ba791ec0f0.png)

## Related issues

related to #12033

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
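For reference, enabling such an SST partitioner via the RocksDB Java API looks roughly like this (a sketch: Zeebe wires this through its own configuration, and the prefix length of 8 bytes is an assumption for illustration):

```java
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.SstPartitionerFixedPrefixFactory;

// Sketch: partition SST files on a fixed key prefix (here assumed to be the
// 8-byte column-family prefix), so each virtual column family ends up in its
// own SST files during compaction.
final ColumnFamilyOptions options =
    new ColumnFamilyOptions()
        .setSstPartitionerFactory(new SstPartitionerFixedPrefixFactory(8));
```

The trade-off described above follows directly from this: more, smaller SST files (per prefix) in exchange for compactions that no longer rewrite the whole key space.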
Just to document what I have discussed with @oleschoenburg.

Notes:

Next steps:
13121: Performance test for large state and stable performance r=Zelldon a=Zelldon

## Description

`@oleschoenburg` I created this PR in order to get the first increment merged and not to overwhelm you with more changes. Most of the changes here are refactorings and preparation for the JMH test, plus of course the JMH test itself. I already applied several things we have discussed, like creating the large state in the setup phase, keeping the test in the engine module, etc. In an upcoming PR I will execute the JMH benchmark inside a unit test, which is executed in a separate CI job; please see the [related comment](#12241 (comment)).

## Related issues

related to #12241

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
With #13135 we added a unit test which allows us to integrate the JMH benchmark into our CI and run it every time the CI runs. I would mark this issue as resolved, then.
related to #12033
Description
In the POC for determining whether the found issues can help us resolve the problem, I created a simple unit test which also showed us the performance regression on a unit-test level.
The test is on a branch; you had to run it in a loop to see the performance differences, to allow the JIT to kick in, etc.
Ideally, we have a JMH benchmark which allows us to verify the performance of new process instance creations when we have a larger state, or a similar setup (maybe with a reduced RocksDB size which mimics the large-state scenario).
Later, we should think about how we can automate such JMH benchmarks to avoid regressions. For the first iteration, it should be enough to build such a JMH setup and use it to validate certain changes we plan with #12033
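Automating the benchmark could mean invoking the JMH runner programmatically from a unit test, as later done in #13135. A rough sketch using the JMH runner API (the include pattern, fork count, and threshold are assumptions, not the actual CI setup):

```java
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.openjdk.jmh.results.RunResult;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

class EnginePerformanceRegressionTest {

  @Test
  void shouldNotRegressProcessInstanceCreationThroughput() throws Exception {
    // Run the JMH benchmark programmatically as part of the CI test run.
    final Options options =
        new OptionsBuilder()
            .include("EnginePerformanceTest") // assumed benchmark class name
            .forks(1)
            .build();

    final RunResult result = new Runner(options).runSingle();

    // Fail the build if the measured throughput (ops/s) drops below an
    // assumed baseline; the concrete threshold would be tuned to CI hardware.
    final double opsPerSecond = result.getPrimaryResult().getScore();
    Assertions.assertTrue(opsPerSecond > 100.0);
  }
}
```

A fixed threshold is fragile across machines, so in practice one would calibrate it against the CI runners or compare against a stored baseline instead.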
Resources: