
Implement JMH benchmark for support process instance creation on larger state #12241

Closed · Tracked by #12033
Zelldon opened this issue Apr 4, 2023 · 7 comments
Assignees: Zelldon
Labels: area/performance, component/engine, component/stream-platform, version:8.3.0-alpha3, version:8.3.0

Comments

Zelldon (Member) commented Apr 4, 2023

related to #12033

Description

In the POC for determining whether the issues we found can help us resolve the problem, I created a simple unit test which also showed the performance regression at the unit-test level.

The test is on a branch; you have to run it in a loop to see the performance differences, to allow the JIT to kick in, etc.

Ideally, we have a JMH benchmark which allows verifying the performance of new process instance creations when we have a larger state, or a similar setup (maybe with a reduced RocksDB size that mimics the large-state scenario).
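For reference, a minimal sketch of what such a JMH benchmark could look like; the class, method names, and setup details below are illustrative and not the actual implementation on the branch:

```java
package io.camunda.zeebe.engine.perf;

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class ProcessInstanceCreationBenchmark {

  @Setup
  public void setUp() {
    // Hypothetical setup: start an embedded engine, deploy the process, and
    // optionally pre-populate RocksDB to mimic the large-state scenario.
  }

  @Benchmark
  public void measureProcessInstanceCreation() {
    // Hypothetical benchmark body: create one process instance and wait until
    // it completes, so one JMH operation corresponds to one full instance.
  }

  @TearDown
  public void tearDown() {
    // Stop the engine and clean up temporary state directories.
  }
}
```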

Later, we should think about how we can automate such JMH benchmarks to avoid regressions. For the first iteration, it should be enough to build such a JMH setup and use it to validate certain changes we plan with #12033.

Resources:

Zelldon changed the title from "Find a good and fast way to validate changes and assumptions" to "Find a good and fast way to validate changes and assumptions regarding performance" on Apr 4, 2023
Zelldon changed the title from "Find a good and fast way to validate changes and assumptions regarding performance" to "Implement JMH benchmark for support process instance creation on larger state" on Apr 4, 2023
Zelldon self-assigned this on Apr 4, 2023
Zelldon (Member, Author) commented Apr 4, 2023

🧪 Small breakthrough on the performance test side. I was able to write a JMH benchmark which I want to use to test the improvements. I ran it once with the base (no state) and once with a big state to see the differences.

We can see that the base (as expected) is ~3-4 times faster than with the bigger state. I will try to play around with the RocksDB config to reproduce the big-state test more easily, without having the big state in the repo (as resources).

  // BASE
  //  Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  //      0.370 ±(99.9%) 0.029 ops/ms [Average]
  //      (min, avg, max) = (0.088, 0.370, 0.670), stdev = 0.125
  //  CI (99.9%): [0.340, 0.399] (assumes normal distribution)
  //
  //  # Run complete. Total time: 00:00:04
  //
  //  Benchmark                                           Mode  Cnt  Score   Error   Units
  //  EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  0.370 ± 0.029  ops/ms
  //
  //  Process finished with exit code 0

  //  BIG STATE
  //  Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  //      0.093 ±(99.9%) 0.003 ops/ms [Average]
  //      (min, avg, max) = (0.051, 0.093, 0.117), stdev = 0.014
  //  CI (99.9%): [0.090, 0.097] (assumes normal distribution)
  //  # Run complete. Total time: 00:00:05
  //  Benchmark                                           Mode  Cnt  Score   Error   Units
  //  EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  0.093 ± 0.003  ops/ms

Related branch https://github.com/camunda/zeebe/tree/zell-performance-test-engine

Zelldon (Member, Author) commented Apr 4, 2023

I merged the fixes branch into the branch with the JMH benchmark to see what difference it makes with the large-state snapshot.

// fixes
//  Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
//      0.417 ±(99.9%) 0.035 ops/ms [Average]
//      (min, avg, max) = (0.081, 0.417, 0.768), stdev = 0.146
//  CI (99.9%): [0.383, 0.452] (assumes normal distribution)
//
//
//      # Run complete. Total time: 00:00:04
//
//  Benchmark                                           Mode  Cnt  Score   Error   Units
//  EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  0.417 ± 0.035  ops/ms
//

Looks pretty good, as we also saw during our performance hack day 🙂 👍

I will now try to reproduce the slow performance with a smaller RocksDB config instead of having a large state in the repo. 😅

Zelldon (Member, Author) commented Apr 5, 2023

🔧 Benchmark Improvements

I have changed the output time unit to seconds to make the results more representative. The time unit for the test runs has also been changed to seconds, which means each iteration tries to run as many operations as possible in one second. Furthermore, I increased the iteration counts for warmup and measurement (100 and 200). This means we already build up a lot of state in the warm-up phase, and the base results already show that the throughput drastically decreases.
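As a rough sketch, these settings correspond to JMH annotations along the following lines; whether the real test configures this via annotations or an OptionsBuilder, and the fork count of 1 (inferred from the total runtime), are assumptions:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

// 100 warm-up iterations of one second each, 200 measured iterations of one
// second each, results reported as operations per second.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 100, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 200, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
public class EnginePerformanceTest {
  // benchmark methods as before
}
```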

The Base result now looks like this (with default configs):

INFO  io.camunda.zeebe.engine.perf - Started 10720 process instances

Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  20.182 ±(99.9%) 0.868 ops/s [Average]
  (min, avg, max) = (12.536, 20.182, 29.170), stdev = 3.674
  CI (99.9%): [19.314, 21.050] (assumes normal distribution)


# Run complete. Total time: 00:05:15

Benchmark                                           Mode  Cnt   Score   Error  Units
EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  20.182 ± 0.868  ops/s

This means we can start ~20 PI per second on average. In the benchmark output we can see that it actually starts strong with ~370 PI/s, but it drops quite fast.

The idea and goal is that the operations count per second (throughput) shouldn't decrease, even if we have a longer warm-up phase or start with a big state; right now it does decrease.

🕵️ Changing config

After reducing the RocksDB memory size:

    // enable consistency checks for tests - this is enabled in our benchmarks as well
    final var consistencyChecks = new ConsistencyChecksSettings(true, true);
    // use a very small RocksDB memory limit so data leaves the memtables early,
    // mimicking the large-state scenario without shipping a big state as resources
    final var zbFactory =
        new ZeebeRocksDbFactory<ZbColumnFamilies>(
            new RocksDbConfiguration().setMemoryLimit(5 * 1024), consistencyChecks);
    streamProcessingComposite =
        new StreamProcessingComposite(testStreams, 1, zbFactory, actorScheduler);

As a result, we can see the performance drop in the same way as it did before with more iterations (the benchmark already used a high iteration count). BTW, I had previously played with just the configuration, without changing the iterations, and got a similar effect.

Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  20.099 ±(99.9%) 0.784 ops/s [Average]
  (min, avg, max) = (13.999, 20.099, 28.292), stdev = 3.319
  CI (99.9%): [19.315, 20.883] (assumes normal distribution)


# Run complete. Total time: 00:05:15

Benchmark                                           Mode  Cnt   Score   Error  Units
EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  20.099 ± 0.784  ops/s

I think the base benchmark is enough for now, since we have already seen with many iterations that the throughput goes down. We can start with the first increment of changes and improvements and validate it with the JMH benchmarks, but of course also with our normal Zeebe benchmarks.

Later, if we are confident that we have found solutions which make the base case look good, we can experiment again with starting directly with a bigger state and/or different configs, which should then have a negligible effect.

Zelldon (Member, Author) commented Apr 17, 2023

Since I recently changed the JMH configs, I have rerun the previous benchmark with the state fixes. Furthermore, I pushed the JMH benchmark with the fixes to a separate branch here, to rerun it later.

What is interesting is that we would expect much better performance with our POC; we can see that it improved by ~1.5x, but the drop in performance over time is still significant. Either we need to adjust/reevaluate our benchmark, or we need to check the POC changes again with our other Zeebe cluster benchmark where we start instances endlessly.

I looked into this and found out that the RecordingExporter was the issue. I now reset it after every run, and it no longer slows the benchmark down. This also shows again that the POC changes perform pretty well.
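For illustration, resetting the exporter could look roughly like this; the import path and the per-invocation hook are assumptions on my side:

```java
import io.camunda.zeebe.test.util.record.RecordingExporter;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Setup;

public class EnginePerformanceTest {

  // Clear the exporter's static in-memory record list before every benchmark
  // invocation so it does not grow unbounded and slow down the benchmark.
  @Setup(Level.Invocation)
  public void resetRecordingExporter() {
    RecordingExporter.reset();
  }
}
```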


I will update this comment continuously in order to reflect our progress and how the changes affect performance.

| Test | Mode | Cnt | Score | Error | Units |
| --- | --- | --- | --- | --- | --- |
| Base | thrpt | 200 | 230.023 | ± 26.432 | ops/s |
| POC | thrpt | 200 | 540.681 | ± 103.742 | ops/s |
| Blacklist fix | thrpt | 200 | 227.188 | ± 19.717 | ops/s |
| Disable consistency check (base) | thrpt | 200 | 442.961 | ± 27.054 | ops/s |
| Cache static process version | thrpt | 200 | 249.552 | ± 45.303 | ops/s |
| POC: Cache key generator values | thrpt | 200 | 251.027 | ± 55.103 | ops/s |
| Increase fixed prefix extractor | thrpt | 200 | 552.536 | ± 89.015 | ops/s |
| Introduce SST partitioning | thrpt | 200 | 656.639 | ± 91.394 | ops/s |
| POC: Enhance CF prefix check | thrpt | 200 | 223.584 | ± 2.052 | ops/s |
| POC: Introduce Key Formats | thrpt | 200 | 227.234 | ± 15.272 | ops/s |
| POC: Zeebe db cache | thrpt | 200 | 225.579 | ± 2.684 | ops/s |
| POC: Zeebe db cache + disable consistency checks | thrpt | 200 | 442.611 | ± 15.342 | ops/s |
| POC: Create additional CF | thrpt | 200 | 681.872 | ± 77.924 | ops/s |

Zelldon (Member, Author) commented Apr 17, 2023

With the JMH tests running in IntelliJ, I'm also able to run the async profiler inside the IDE.

This gave me some more insight into where the pressure is and which potential issues to fix.

[async profiler flame graphs: jmh2, jmh]

For example, we can see that seek is a big part of the equation, and that we check for existence where it might not be necessary.

This brings up again the topic of rearranging data for better operability. For example, if we prefixed most keys, e.g. variables with their scope key, we could do a range delete instead of iterating over the column family.
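To illustrate the range-delete idea with the plain RocksDB Java API; the key layout and the helper class are hypothetical, not how Zeebe encodes keys today:

```java
import java.nio.ByteBuffer;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class ScopeCleanup {

  // Deletes all keys prefixed with the given scope key in one call, instead of
  // iterating over the column family and deleting entries one by one.
  static void deleteVariablesOfScope(
      final RocksDB db, final ColumnFamilyHandle cf, final long scopeKey)
      throws RocksDBException {
    // assumes big-endian key encoding, so [scopeKey, scopeKey + 1) covers
    // exactly the keys that start with this scope key prefix
    final byte[] begin = ByteBuffer.allocate(Long.BYTES).putLong(scopeKey).array();
    final byte[] end = ByteBuffer.allocate(Long.BYTES).putLong(scopeKey + 1).array();
    db.deleteRange(cf, begin, end);
  }
}
```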


Update: Without consistency checks it becomes even more obvious.

[async profiler flame graph (without consistency checks): jmh3]

Note that the benchmarks don't use any variables.

zeebe-bors-camunda bot added a commit that referenced this issue Apr 20, 2023
12475: Introduce cache for process versions r=megglos a=Zelldon

## Description

When executing a process instance we often have to get the process model and related version to it.

When running a cluster for a while (creating a lot of state, etc.), the process version will eventually be migrated to a lower level of RocksDB (potentially L3), because process models are usually not deployed that often. In other words, if a key is not updated, it will be moved to a lower level by RocksDB. Accessing lower levels of RocksDB is slower than accessing higher levels or the memtables.

You might ask: if we repeatedly access it via RocksDB, why is it slow, and why is it not in the cache? There are multiple reasons for it.

1. We only have caches configured for L0 and L1 (not for lower levels)
2. We have limited the cache sizes to a certain amount, which might cause continuous eviction

In order to avoid running into issues with cold data, which is mostly static data, we can introduce our own caches to work around this. This allows us to avoid unnecessary RocksDB access, unnecessary I/O, etc.
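As a minimal sketch of the caching idea; the names and the fallback interface are illustrative, not the PR's actual classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

final class ProcessVersionCache {

  private final Map<String, Long> latestVersionByProcessId = new ConcurrentHashMap<>();
  private final ToLongFunction<String> stateLookup; // falls back to RocksDB on a miss

  ProcessVersionCache(final ToLongFunction<String> stateLookup) {
    this.stateLookup = stateLookup;
  }

  // Serve the version from memory; only the first access per process id (or an
  // access after a restart) touches RocksDB and its potentially cold levels.
  long getLatestVersion(final String processId) {
    return latestVersionByProcessId.computeIfAbsent(processId, stateLookup::applyAsLong);
  }

  // Keep the cache consistent when a new process version is deployed.
  void onProcessDeployed(final String processId, final long version) {
    latestVersionByProcessId.put(processId, version);
  }
}
```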

**This PR does the following:**

 * Refactors the NextValueManager, including renaming it, as its only purpose is to be used for process versions
 * Introduces a new cache for the version of each process, in order to avoid access to cold data.

 **Performance:**

We ran the JMH benchmark again with the changes and can see that the performance _slightly_ increased (roughly 8%); not significant on its own, but it will likely come into play with other changes later.

[See more details here ](#12241 (comment))

## Related issues


closes #12034



Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
zeebe-bors-camunda bot added a commit that referenced this issue May 2, 2023
12483: Introduce experimental SST partitioning r=Zelldon a=Zelldon

## Description

Discovered this via [the RocksDB Google group post](https://groups.google.com/g/rocksdb/c/l3CzFD4YBYQ#:~:text=another%20way%20that%20might%20be%20helpful%20is%20using%20sst_partitioner_factory%20.%20By%20using%20this%20experimental%20feature%2C%20you%20can%20partition%20the%20ssts%20based%20on%20your%20desired%20prefix%20which%20means%20you%20would%20only%20have%20to%20tell%20how%20many%20entries%20are%20in%20that%20sst.)

[From the Java docs](https://javadoc.io/static/org.rocksdb/rocksdbjni/6.20.3/org/rocksdb/ColumnFamilyOptionsInterface.html#setSstPartitionerFactory(org.rocksdb.SstPartitionerFactory))
> use the specified factory for a function to determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).

### Details

SST partitioning based on the column family prefix (virtual column family) allows splitting key ranges into separate SST files, which should improve compaction and make propagation of SST files less write amplifying.

It will cause more files to be created at runtime and in snapshots, since at least one SST file is created per column family used at runtime.
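A sketch of how this can be wired up via the RocksDB Java API; using `Long.BYTES` as the prefix length assumes the virtual column family identifier is encoded as an 8-byte long at the start of every key, which may differ from the actual configuration in the PR:

```java
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.SstPartitionerFixedPrefixFactory;

final class SstPartitioningConfig {

  // Partition SST files on the column-family prefix so compaction can split
  // files on these key boundaries instead of covering the whole key space.
  static ColumnFamilyOptions enableSstPartitioning(final ColumnFamilyOptions options) {
    return options.setSstPartitionerFactory(new SstPartitionerFixedPrefixFactory(Long.BYTES));
  }
}
```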

As discussed here https://camunda.slack.com/archives/C04T7T0RPLY/p1681931668446069, we want to add this as an experimental feature for now, so people can play around with it and we can as well. The benchmark results so far looked quite promising. The feature itself is marked as experimental in RocksDB, so it makes sense to mark it as experimental on our side too.

Open questions:

1. The config is marked as an experimental feature in RocksDB. I don't know what this exactly means. Is this a problem for us? Would we just stay on the last version if they remove it? Is it unstable? Not sure yet.
2. The maximum throughput seems to be degraded a bit. As I mentioned earlier, we are currently able to reach around ~240 PI/s; [with the configuration we are reaching ~220 PI/s.](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&refresh=10s&from=now-6h&to=now&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-max-out-sst-partitioner&var-pod=All&var-partition=All) I think it depends on what our priority is right now: the maximum throughput, or providing stable performance on a larger state. Is it ok to hurt our maximum throughput a little? We will need to investigate this further.

### JMH Benchmarks

I tried it with the JMH benchmark and it gave impressive results:
```
Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  656.639 ±(99.9%) 91.394 ops/s [Average]
  (min, avg, max) = (1.775, 656.639, 1163.635), stdev = 386.967
  CI (99.9%): [565.246, 748.033] (assumes normal distribution)
# Run complete. Total time: 00:07:12
Benchmark                                           Mode  Cnt    Score    Error  Units
EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  656.639 ± 91.394  ops/s
```

[Remember the base was ~230](#12241 (comment))

### Zeebe Benchmarks

After the JMH benchmark, I started some new Zeebe benchmarks, e.g. for the large state. I wanted to see how it survives when we just continuously start instances.

Remember: Previously we died after ~1 hour, when reaching 800 MB of state.
[In this benchmark we reached at least ~4.5 GB and were still able to handle the same load (over 6 hours).](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&from=1681912207012&to=1681930704963&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-large-state-sst-partition&var-pod=All&var-partition=All) :exploding_head:
![snapshot](https://user-images.githubusercontent.com/2758593/235164591-0ba3cb40-aa47-4bf4-b647-9992ac5d7e88.png)
![general](https://user-images.githubusercontent.com/2758593/235164598-5da0906e-a50f-4235-a5b8-48181dffc9d5.png)

#### Maxing out benchmark

![maxgeneral](https://user-images.githubusercontent.com/2758593/235164601-bab9f40c-20be-4cbe-8530-c0ba791ec0f0.png)



## Related issues

related to #12033

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
This was referenced May 3, 2023
Zelldon (Member, Author) commented Jun 14, 2023

Just to document what I discussed with @oleschoenburg:

Notes:

  • It is important that the tests all run with the same setup (same large state)
  • We lean towards recreating the state in the JMH setup code, because:
    • it is less complex; with a pre-built state we would need to figure out where to store it, how to retrieve it, etc.
    • state creation ensures we always have a compatible state
    • it makes variation simpler
    • we will create the state in setup code, not via warm-up, to guarantee we always have the same setting and the same large state, e.g. the same number of instances
  • JUnit
    • the test should stay in the engine module, but with extra naming (like the randomized tests), so the performance test only runs once via CI
    • integrate it in CI as an extra job, to give it its own resources etc.
    • if possible, verify whether the JUnit test can be excluded from IntelliJ, so it only runs via a Maven profile and is not executed by accident when running all tests in IntelliJ
    • set up the unit test similar to
      https://www.retit.de/continuous-benchmarking-with-jmh-and-junit-2/
      (see the sketch after this list)
      Last run:
      Iteration 200: 11:17:20.385 [] INFO io.camunda.zeebe.engine.perf - Started 227176 process instances
      Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
      615.279 ±(99.9%) 28.398 ops/s [Average]
      (min, avg, max) = (322.794, 615.279, 945.527), stdev = 120.239
      CI (99.9%): [586.881, 643.677] (assumes normal distribution)
      Current:
      https://camunda.slack.com/archives/D02MKQEC30D/p1686662505116359
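Following the approach from the linked article, the JUnit wrapper could look roughly like this; the class name, reference score, and allowed deviation are assumptions for illustration, not the actual implementation:

```java
import static org.assertj.core.api.Assertions.assertThat;

import org.junit.jupiter.api.Test;
import org.openjdk.jmh.results.RunResult;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

final class EnginePerformanceJUnitTest {

  // Reference throughput taken from the last run above; the allowed deviation
  // is a made-up tolerance to keep the check stable across CI machines.
  private static final double REFERENCE_SCORE_OPS_PER_SEC = 615.0;
  private static final double ALLOWED_DEVIATION = 0.25;

  @Test
  void shouldNotRegressProcessInstanceCreationThroughput() throws Exception {
    final Options options =
        new OptionsBuilder()
            .include(EnginePerformanceTest.class.getSimpleName())
            .forks(1)
            .build();

    // runSingle() expects exactly one matching benchmark and returns its result
    final RunResult result = new Runner(options).runSingle();
    final double score = result.getPrimaryResult().getScore();

    assertThat(score)
        .isGreaterThanOrEqualTo(REFERENCE_SCORE_OPS_PER_SEC * (1 - ALLOWED_DEVIATION));
  }
}
```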

Next steps

  • Move state creation into setup code
  • Clean up code
  • Create a JUnit test which executes the benchmark
  • Ignore for IntelliJ, exclude from JUnit, define a new pattern to run as a separate CI job
  • Create CI job
  • Run as separate CI job

zeebe-bors-camunda bot added a commit that referenced this issue Jun 14, 2023
13121: Performance test for large state and stable performance r=Zelldon a=Zelldon

## Description

`@oleschoenburg` I created this PR in order to get the first increment merged and to not overwhelm you with more changes. Most of the changes here are refactorings and preparation for the JMH test, plus of course the JMH test itself.

I already applied several things we discussed, like creating the large state in the setup phase, keeping the test in the engine module, etc. In an upcoming PR I will execute the JMH benchmark inside a unit test, which will be executed in a separate CI job; please see the [related comment](#12241 (comment)).



## Related issues


related to #12241 


Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Zelldon (Member, Author) commented Jun 28, 2023

With #13135 we added a unit test which allows us to integrate the JMH benchmark into our CI and run it every time our CI runs.

I would mark this issue as resolved then.

Zelldon closed this as completed on Jun 28, 2023
Zelldon added the version:8.3.0-alpha3 label on Jul 6, 2023
megglos added the version:8.3.0 label on Oct 5, 2023