
Implement JMH benchmark for support process instance creation on larger state #12241

Closed · Tracked by #12033
Zelldon opened this issue Apr 4, 2023 · 7 comments
Assignees: Zelldon
Labels: area/performance, component/engine, component/stream-platform, version:8.3.0-alpha3, version:8.3.0

Comments

Zelldon (Member) commented Apr 4, 2023

related to #12033

Description

In the POC for determining whether the issues we found can help us resolve the problem, I created a simple unit test which also showed the performance regression at the unit-test level.

The test is on a branch; you have to run it in a loop to see the performance differences, to allow the JIT to kick in, etc.

Ideally, we have a JMH benchmark which allows verifying the performance of new process instance creations when we have a larger state, or a similar setup (maybe with a reduced RocksDB size that mimics the large-state scenario).
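For reference, a minimal sketch of what such a JMH benchmark could look like; the class, method names, and setup details below are illustrative and not the actual implementation on the branch:

```java
package io.camunda.zeebe.engine.perf;

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class ProcessInstanceCreationBenchmark {

  @Setup
  public void setUp() {
    // Hypothetical setup: start an embedded engine, deploy the process, and
    // optionally pre-populate RocksDB to mimic the large-state scenario.
  }

  @Benchmark
  public void measureProcessInstanceCreation() {
    // Hypothetical benchmark body: create one process instance and wait until
    // it completes, so one JMH operation corresponds to one full instance.
  }

  @TearDown
  public void tearDown() {
    // Stop the engine and clean up temporary state directories.
  }
}
```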

Later, we should think about how we can automate such JMH benchmarks to avoid regressions. For the first iteration, it should be enough to build such a JMH setup and use it to validate certain changes we plan with #12033.

Resources:

Zelldon changed the title from "Find a good and fast way to validate changes and assumptions" to "Find a good and fast way to validate changes and assumptions regarding performance" on Apr 4, 2023
Zelldon changed the title from "Find a good and fast way to validate changes and assumptions regarding performance" to "Implement JMH benchmark for support process instance creation on larger state" on Apr 4, 2023
Zelldon self-assigned this on Apr 4, 2023
Zelldon (Member, Author) commented Apr 4, 2023

🧪 Small breakthrough on the performance test side. I was able to write a JMH benchmark which I want to use to test the improvements. I ran it once with the base (no state) and once with a big state to see the differences.

We can see that the base (as expected) is ~3-4 times faster than with the bigger state. I will try to play around with the RocksDB config to reproduce the big-state test more easily, without having the big state in the repo (as resources).

  // BASE
  //  Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  //      0.370 ±(99.9%) 0.029 ops/ms [Average]
  //      (min, avg, max) = (0.088, 0.370, 0.670), stdev = 0.125
  //  CI (99.9%): [0.340, 0.399] (assumes normal distribution)
  //
  //  # Run complete. Total time: 00:00:04
  //
  //  Benchmark                                           Mode  Cnt  Score   Error   Units
  //  EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  0.370 ± 0.029  ops/ms
  //
  //  Process finished with exit code 0

  //  BIG STATE
  //  Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  //      0.093 ±(99.9%) 0.003 ops/ms [Average]
  //      (min, avg, max) = (0.051, 0.093, 0.117), stdev = 0.014
  //  CI (99.9%): [0.090, 0.097] (assumes normal distribution)
  //  # Run complete. Total time: 00:00:05
  //  Benchmark                                           Mode  Cnt  Score   Error   Units
  //  EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  0.093 ± 0.003  ops/ms

Related branch https://github.com/camunda/zeebe/tree/zell-performance-test-engine

Zelldon (Member, Author) commented Apr 4, 2023

I merged the fixes branch into the branch with the JMH benchmark to see what difference it makes with the large-state snapshot.

// fixes
//  Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
//      0.417 ±(99.9%) 0.035 ops/ms [Average]
//      (min, avg, max) = (0.081, 0.417, 0.768), stdev = 0.146
//  CI (99.9%): [0.383, 0.452] (assumes normal distribution)
//
//
//      # Run complete. Total time: 00:00:04
//
//  Benchmark                                           Mode  Cnt  Score   Error   Units
//  EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  0.417 ± 0.035  ops/ms
//

Looks pretty good, as we also saw during our performance hack day 🙂 👍

I will now try to reproduce the slow performance with a smaller RocksDB config instead of having a large state in the repo. 😅

Zelldon (Member, Author) commented Apr 5, 2023

🔧 Benchmark Improvements

I have changed the output time unit to seconds to make the results more representative. The time unit for the test runs has also been changed to seconds, which means each iteration tries to run as many operations as possible in one second. Furthermore, I increased the iteration counts for warmup and measurement (100 and 200). This means we already build up a lot of state in the warm-up phase, and the base results already show that the throughput drastically decreases.
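As a rough sketch, these settings correspond to JMH annotations along the following lines; whether the real test configures this via annotations or an OptionsBuilder, and the fork count of 1 (inferred from the total runtime), are assumptions:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

// 100 warm-up iterations of one second each, 200 measured iterations of one
// second each, results reported as operations per second.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 100, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 200, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
public class EnginePerformanceTest {
  // benchmark methods as before
}
```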

The Base result now looks like this (with default configs):

INFO  io.camunda.zeebe.engine.perf - Started 10720 process instances

Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  20.182 ±(99.9%) 0.868 ops/s [Average]
  (min, avg, max) = (12.536, 20.182, 29.170), stdev = 3.674
  CI (99.9%): [19.314, 21.050] (assumes normal distribution)


# Run complete. Total time: 00:05:15

Benchmark                                           Mode  Cnt   Score   Error  Units
EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  20.182 ± 0.868  ops/s

This means we can start ~20 PI per second on average. In the benchmark output we can see that it actually starts strong with ~370 PI/s, but it drops quite fast.

The idea and goal is that the operations count per second (throughput) shouldn't decrease, even if we have a longer warm-up phase or start with a big state; right now it does decrease.

🕵️ Changing config

After reducing the RocksDB memory size:

    // enable consistency checks for tests - this is enabled in our benchmarks as well
    final var consistencyChecks = new ConsistencyChecksSettings(true, true);
    // use a very small RocksDB memory limit so data leaves the memtables early,
    // mimicking the large-state scenario without shipping a big state as resources
    final var zbFactory =
        new ZeebeRocksDbFactory<ZbColumnFamilies>(
            new RocksDbConfiguration().setMemoryLimit(5 * 1024), consistencyChecks);
    streamProcessingComposite =
        new StreamProcessingComposite(testStreams, 1, zbFactory, actorScheduler);

As a result, we can see the performance drop in the same way as it did before with more iterations (the benchmark already used a high iteration count). BTW, I had previously played with just the configuration, without changing the iterations, and got a similar effect.

Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  20.099 ±(99.9%) 0.784 ops/s [Average]
  (min, avg, max) = (13.999, 20.099, 28.292), stdev = 3.319
  CI (99.9%): [19.315, 20.883] (assumes normal distribution)


# Run complete. Total time: 00:05:15

Benchmark                                           Mode  Cnt   Score   Error  Units
EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  20.099 ± 0.784  ops/s

I think the base benchmark is enough for now, since we have already seen with many iterations that the throughput goes down. We can start with the first increment of changes and improvements and validate it with the JMH benchmarks, but of course also with our normal Zeebe benchmarks.

Later, if we are confident that we have found solutions which make the base case look good, we can experiment again with starting directly with a bigger state and/or different configs, which should then have a negligible effect.

Zelldon (Member, Author) commented Apr 17, 2023

Since I recently changed the JMH configs, I have rerun the previous benchmark with the state fixes. Furthermore, I pushed the JMH benchmark with the fixes to a separate branch here, to rerun it later.

What is interesting is that we would expect much better performance with our POC; we can see that it improved by ~1.5x, but the drop in performance over time is still significant. Either we need to adjust/reevaluate our benchmark, or we need to check the POC changes again with our other Zeebe cluster benchmark where we start instances endlessly.

I looked into this and found out that the RecordingExporter was the issue. I now reset it after every run, and it no longer slows the benchmark down. This also shows again that the POC changes perform pretty well.
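For illustration, resetting the exporter could look roughly like this; the import path and the per-invocation hook are assumptions on my side:

```java
import io.camunda.zeebe.test.util.record.RecordingExporter;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Setup;

public class EnginePerformanceTest {

  // Clear the exporter's static in-memory record list before every benchmark
  // invocation so it does not grow unbounded and slow down the benchmark.
  @Setup(Level.Invocation)
  public void resetRecordingExporter() {
    RecordingExporter.reset();
  }
}
```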


I will update this comment continuously in order to reflect our progress and how the changes affect performance.

| Test | Mode | Cnt | Score | Error | Units |
| --- | --- | --- | --- | --- | --- |
| Base | thrpt | 200 | 230.023 | ± 26.432 | ops/s |
| POC | thrpt | 200 | 540.681 | ± 103.742 | ops/s |
| Blacklist fix | thrpt | 200 | 227.188 | ± 19.717 | ops/s |
| Disable consistency check (base) | thrpt | 200 | 442.961 | ± 27.054 | ops/s |
| Cache static process version | thrpt | 200 | 249.552 | ± 45.303 | ops/s |
| POC: Cache key generator values | thrpt | 200 | 251.027 | ± 55.103 | ops/s |
| Increase fixed prefix extractor | thrpt | 200 | 552.536 | ± 89.015 | ops/s |
| Introduce SST partitioning | thrpt | 200 | 656.639 | ± 91.394 | ops/s |
| POC: Enhance CF prefix check | thrpt | 200 | 223.584 | ± 2.052 | ops/s |
| POC: Introduce Key Formats | thrpt | 200 | 227.234 | ± 15.272 | ops/s |
| POC: Zeebe db cache | thrpt | 200 | 225.579 | ± 2.684 | ops/s |
| POC: Zeebe db cache + disable consistency checks | thrpt | 200 | 442.611 | ± 15.342 | ops/s |
| POC: Create additional CF | thrpt | 200 | 681.872 | ± 77.924 | ops/s |

Zelldon (Member, Author) commented Apr 17, 2023

With the JMH tests running in IntelliJ, I'm also able to run the async profiler inside the IDE.

This gave me some more insight into where the pressure is and which potential issues to fix.

[async profiler flame graphs: jmh2, jmh]

For example, we can see that seek is a big part of the equation, and that we check for existence where it might not be necessary.

This brings up again the topic of rearranging data for better operability. For example, if we prefixed most keys, e.g. variables with their scope key, we could do a range delete instead of iterating over the column family.
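To illustrate the range-delete idea with the plain RocksDB Java API; the key layout and the helper class are hypothetical, not how Zeebe encodes keys today:

```java
import java.nio.ByteBuffer;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class ScopeCleanup {

  // Deletes all keys prefixed with the given scope key in one call, instead of
  // iterating over the column family and deleting entries one by one.
  static void deleteVariablesOfScope(
      final RocksDB db, final ColumnFamilyHandle cf, final long scopeKey)
      throws RocksDBException {
    // assumes big-endian key encoding, so [scopeKey, scopeKey + 1) covers
    // exactly the keys that start with this scope key prefix
    final byte[] begin = ByteBuffer.allocate(Long.BYTES).putLong(scopeKey).array();
    final byte[] end = ByteBuffer.allocate(Long.BYTES).putLong(scopeKey + 1).array();
    db.deleteRange(cf, begin, end);
  }
}
```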


Update: Without consistency checks it becomes even more obvious.

[async profiler flame graph (without consistency checks): jmh3]

Note that the benchmarks don't use any variables.

zeebe-bors-camunda bot added a commit that referenced this issue Apr 20, 2023
12475: Introduce cache for process versions r=megglos a=Zelldon

## Description

When executing a process instance we often have to get the process model and related version to it.

When running a cluster for a while (creating a lot of state, etc.), the process version will eventually be migrated to a lower level of RocksDB (potentially L3), because process models are usually not deployed that often. In other words, if a key is not updated, it will be moved to a lower level by RocksDB. Accessing lower levels of RocksDB is slower than accessing higher levels or the memtables.

You might ask: if we repeatedly access it via RocksDB, why is it slow, and why is it not in the cache? There are multiple reasons for it.

1. We only have caches configured for L0 and L1 (not for lower levels)
2. We have limited the cache sizes to a certain amount, which might cause continuous eviction

In order to avoid running into issues with cold data, which is mostly static data, we can introduce our own caches to work around this. This allows us to avoid unnecessary RocksDB access, unnecessary I/O, etc.
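As a minimal sketch of the caching idea; the names and the fallback interface are illustrative, not the PR's actual classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

final class ProcessVersionCache {

  private final Map<String, Long> latestVersionByProcessId = new ConcurrentHashMap<>();
  private final ToLongFunction<String> stateLookup; // falls back to RocksDB on a miss

  ProcessVersionCache(final ToLongFunction<String> stateLookup) {
    this.stateLookup = stateLookup;
  }

  // Serve the version from memory; only the first access per process id (or an
  // access after a restart) touches RocksDB and its potentially cold levels.
  long getLatestVersion(final String processId) {
    return latestVersionByProcessId.computeIfAbsent(processId, stateLookup::applyAsLong);
  }

  // Keep the cache consistent when a new process version is deployed.
  void onProcessDeployed(final String processId, final long version) {
    latestVersionByProcessId.put(processId, version);
  }
}
```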

**This PR does the following:**

 * Refactors the NextValueManager, including renaming it, as its only purpose is to be used for process versions
 * Introduces a new cache for the version of each process, in order to avoid access to cold data.

 **Performance:**

We ran the JMH benchmark again with the changes and can see that the performance _slightly_ increased (roughly 8%); not significant on its own, but it will likely come into play with other changes later.

[See more details here ](#12241 (comment))

## Related issues


closes #12034



Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
zeebe-bors-camunda bot added a commit that referenced this issue May 2, 2023
12483: Introduce experimental SST partitioning r=Zelldon a=Zelldon

## Description

Discovered this via [the RocksDB Google group post](https://groups.google.com/g/rocksdb/c/l3CzFD4YBYQ#:~:text=another%20way%20that%20might%20be%20helpful%20is%20using%20sst_partitioner_factory%20.%20By%20using%20this%20experimental%20feature%2C%20you%20can%20partition%20the%20ssts%20based%20on%20your%20desired%20prefix%20which%20means%20you%20would%20only%20have%20to%20tell%20how%20many%20entries%20are%20in%20that%20sst.)

[From the Java docs](https://javadoc.io/static/org.rocksdb/rocksdbjni/6.20.3/org/rocksdb/ColumnFamilyOptionsInterface.html#setSstPartitionerFactory(org.rocksdb.SstPartitionerFactory))
> use the specified factory for a function to determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).

### Details

SST partitioning based on the column family prefix (virtual column family) allows splitting key ranges into separate SST files, which should improve compaction and make propagation of SST files less write amplifying.

It will cause more files to be created at runtime and in snapshots, since at least one SST file is created per column family used at runtime.
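A sketch of how this can be wired up via the RocksDB Java API; using `Long.BYTES` as the prefix length assumes the virtual column family identifier is encoded as an 8-byte long at the start of every key, which may differ from the actual configuration in the PR:

```java
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.SstPartitionerFixedPrefixFactory;

final class SstPartitioningConfig {

  // Partition SST files on the column-family prefix so compaction can split
  // files on these key boundaries instead of covering the whole key space.
  static ColumnFamilyOptions enableSstPartitioning(final ColumnFamilyOptions options) {
    return options.setSstPartitionerFactory(new SstPartitionerFixedPrefixFactory(Long.BYTES));
  }
}
```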

As discussed here https://camunda.slack.com/archives/C04T7T0RPLY/p1681931668446069, we want to add this as an experimental feature for now, so people can play around with it and we can as well. The benchmark results so far looked quite promising. The feature itself is marked as experimental in RocksDB, so it makes sense to mark it as experimental on our side too.

Open questions:

1. The config is marked as an experimental feature in RocksDB. I don't know what this exactly means. Is this a problem for us? Would we just stay on the last version if they remove it? Is it unstable? Not sure yet.
2. The maximum throughput seems to be degraded a bit. As I mentioned earlier, we are currently able to reach around ~240 PI/s; [with the configuration we are reaching ~220 PI/s.](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&refresh=10s&from=now-6h&to=now&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-max-out-sst-partitioner&var-pod=All&var-partition=All) I think it depends on what our priority is right now: the maximum throughput, or providing stable performance on a larger state. Is it ok to hurt our maximum throughput a little? We will need to investigate this further.

### JMH Benchmarks

I tried it with the JMH benchmark and it gave impressive results:
```
Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  656.639 ±(99.9%) 91.394 ops/s [Average]
  (min, avg, max) = (1.775, 656.639, 1163.635), stdev = 386.967
  CI (99.9%): [565.246, 748.033] (assumes normal distribution)
# Run complete. Total time: 00:07:12
Benchmark                                           Mode  Cnt    Score    Error  Units
EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  656.639 ± 91.394  ops/s
```

[Remember the base was ~230](#12241 (comment))

### Zeebe Benchmarks

After the JMH benchmark, I started some new Zeebe benchmarks, e.g. for the large state. I wanted to see how it survives when we just continuously start instances.

Remember: Previously we died after ~1 hour, when reaching 800 MB of state.
[In this benchmark we reached at least ~4.5 GB and were still able to handle the same load (over 6 hours).](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&from=1681912207012&to=1681930704963&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-large-state-sst-partition&var-pod=All&var-partition=All) :exploding_head:
![snapshot](https://user-images.githubusercontent.com/2758593/235164591-0ba3cb40-aa47-4bf4-b647-9992ac5d7e88.png)
![general](https://user-images.githubusercontent.com/2758593/235164598-5da0906e-a50f-4235-a5b8-48181dffc9d5.png)

#### Maxing out benchmark

![maxgeneral](https://user-images.githubusercontent.com/2758593/235164601-bab9f40c-20be-4cbe-8530-c0ba791ec0f0.png)



## Related issues

related to #12033

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
This was referenced May 3, 2023
Zelldon (Member, Author) commented Jun 14, 2023

Just to document what I discussed with @oleschoenburg:

Notes:

  • It is important that the tests all run with the same setup (same large state)
  • We lean towards recreating the state in the JMH setup code, because:
    • it is less complex; with a pre-built state we would need to figure out where to store it, how to retrieve it, etc.
    • state creation ensures we always have a compatible state
    • it makes variation simpler
    • we will create the state in setup code, not via warm-up, to guarantee we always have the same setting and the same large state, e.g. the same number of instances
  • JUnit
    • the test should stay in the engine module, but with extra naming (like the randomized tests), so the performance test only runs once via CI
    • integrate it in CI as an extra job, to give it its own resources etc.
    • if possible, verify whether the JUnit test can be excluded from IntelliJ, so it only runs via a Maven profile and is not executed by accident when running all tests in IntelliJ
    • set up the unit test similar to
      https://www.retit.de/continuous-benchmarking-with-jmh-and-junit-2/
      (see the sketch after this list)
      Last run:
      Iteration 200: 11:17:20.385 [] INFO io.camunda.zeebe.engine.perf - Started 227176 process instances
      Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
      615.279 ±(99.9%) 28.398 ops/s [Average]
      (min, avg, max) = (322.794, 615.279, 945.527), stdev = 120.239
      CI (99.9%): [586.881, 643.677] (assumes normal distribution)
      Current:
      https://camunda.slack.com/archives/D02MKQEC30D/p1686662505116359
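Following the approach from the linked article, the JUnit wrapper could look roughly like this; the class name, reference score, and allowed deviation are assumptions for illustration, not the actual implementation:

```java
import static org.assertj.core.api.Assertions.assertThat;

import org.junit.jupiter.api.Test;
import org.openjdk.jmh.results.RunResult;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

final class EnginePerformanceJUnitTest {

  // Reference throughput taken from the last run above; the allowed deviation
  // is a made-up tolerance to keep the check stable across CI machines.
  private static final double REFERENCE_SCORE_OPS_PER_SEC = 615.0;
  private static final double ALLOWED_DEVIATION = 0.25;

  @Test
  void shouldNotRegressProcessInstanceCreationThroughput() throws Exception {
    final Options options =
        new OptionsBuilder()
            .include(EnginePerformanceTest.class.getSimpleName())
            .forks(1)
            .build();

    // runSingle() expects exactly one matching benchmark and returns its result
    final RunResult result = new Runner(options).runSingle();
    final double score = result.getPrimaryResult().getScore();

    assertThat(score)
        .isGreaterThanOrEqualTo(REFERENCE_SCORE_OPS_PER_SEC * (1 - ALLOWED_DEVIATION));
  }
}
```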

Next steps

  • Move state creation into setup code
  • Clean up code
  • Create a JUnit test which executes the benchmark
  • Ignore for IntelliJ, exclude from JUnit, define a new pattern to run as a separate CI job
  • Create CI job
  • Run as separate CI job

zeebe-bors-camunda bot added a commit that referenced this issue Jun 14, 2023
13121: Performance test for large state and stable performance r=Zelldon a=Zelldon

## Description

`@oleschoenburg` I created this PR in order to get the first increment merged and to not overwhelm you with more changes. Most of the changes here are refactorings and preparation for the JMH test, plus of course the JMH test itself.

I already applied several things we discussed, like creating the large state in the setup phase, keeping the test in the engine module, etc. In an upcoming PR I will execute the JMH benchmark inside a unit test, which will be executed in a separate CI job; please see the [related comment](#12241 (comment)).



## Related issues


related to #12241 


Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Zelldon (Member, Author) commented Jun 28, 2023

With #13135 we added a unit test which allows us to integrate the JMH benchmark into our CI and run it every time our CI runs.

I would mark this issue as resolved then.

Zelldon closed this as completed on Jun 28, 2023
Zelldon added the version:8.3.0-alpha3 label on Jul 6, 2023
megglos added the version:8.3.0 label on Oct 5, 2023