[EPIC] Support stable performance for new instances even on larger state #12033

Closed
16 of 22 tasks
Zelldon opened this issue Mar 15, 2023 · 11 comments
Labels
area/performance Marks an issue as performance related component/db component/engine kind/epic Categorizes an issue as an umbrella issue (e.g. OKR) which references other, smaller issues kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. version:8.2.4 Marks an issue as being completely or in parts released in 8.2.4

Comments

@Zelldon
Member

Zelldon commented Mar 15, 2023

Problem Description

It came up in #11813 that our system currently runs into performance issues when it has a bigger state. In particular, we have observed that once we hit a certain amount of state in RocksDB, the performance drops suddenly, which is unexpected.

Goal / Focus

This EPIC can be seen as the first iteration to resolve this. Our focus is: "Hot data should always be processed fast/performantly. Cold data or a big state shouldn't have an impact on new data and its performance."

In other words, if we continuously create new instances, the creation and execution of these instances should always perform the same as, or similar to, the previously created instances. We want stable performance for new instances.

To be specific, something like the following shouldn't happen:

(screenshot: instance creation throughput dropping after about one hour)

Here we can see that the performance dropped after ~1 hour of accumulating instances and state. We are stable in the sense that the cluster still accepts new instances and doesn't crash, but the performance has degraded significantly.

Next / Upcoming

The following shows the next planned steps; this list is updated incrementally.

Next

Break down

Discover

  1. component/db component/zeebe kind/question kind/toil
    abbasadel megglos
  2. area/performance component/db kind/research kind/toil
  3. area/performance component/db kind/toil
  4. area/performance kind/research
  5. area/performance backport stable/8.0 backport stable/8.1 backport stable/8.2 benchmark kind/research version:8.0.14 version:8.1.12 version:8.2.4 version:8.3.0-alpha1
  6. benchmark

As part of #11813, potential issues and solutions have already been identified. Besides the identified issues, we are planning to run some POCs as part of a discovery phase, which might identify more potential tasks.

Testing

  1. area/performance component/engine component/stream-platform version:8.3.0 version:8.3.0-alpha3
    Zelldon

In order to validate our changes, we will need to implement some tests (later automated). This should allow us to make changes iteratively and should also prevent later regressions.

Improvements

  1. area/performance backport stable/8.1 backport stable/8.2 component/engine kind/toil version:8.1.11 version:8.2.3
    Zelldon
  2. area/performance backport stable/8.1 component/engine kind/toil version:8.2.3 version:8.3.0 version:8.3.0-alpha1
    Zelldon
  3. area/performance backport stable/8.0 backport stable/8.1 backport stable/8.2 benchmark kind/research version:8.0.14 version:8.1.12 version:8.2.4 version:8.3.0-alpha1
  4. 1 of 5
    area/performance component/engine component/zeebe kind/toil
  5. area/performance component/engine component/zeebe kind/toil
  6. area/performance component/db component/engine component/zeebe kind/toil

The above breakdown should help us to achieve our goals.

@korthout
Member

@Zelldon As our teams will have to collaborate on this topic, it would be helpful for us to have a product-hub issue. This would help product management understand that we're both involved and must spend time on it. That should help us find the needed time.

@Zelldon
Member Author

Zelldon commented Mar 30, 2023

There is already https://github.com/camunda/product-hub/issues/989, which either needs to be adjusted or a new one needs to be created. I will align soon with @felix-mueller.

BTW, I don't think that everything we do needs to have a product-hub issue. You need to reflect that in your planning and make it transparent. I see that this could help, but it seems to duplicate work. At least it wasn't communicated to me/us that this is a new process.

@felix-mueller
Member

From my perspective, we should reuse camunda/product-hub#989 for this and adjust the scope of the first iteration of the product-hub epic accordingly.
If we believe there will be future iterations (e.g. in Zeebe or also in other components like Operate), then we can create further epics in product-hub.

@Zelldon
Member Author

Zelldon commented Apr 3, 2023

POC outcomes

Performance Hackday

We ran a performance hack day as a team and tried out the POC branch provided by @romansmirnov.

The results look quite promising: we were able to verify that instance creation was still performant and the throughput didn't break; it was stable over a long period of time.

(screenshot: throughput comparison, POC branch vs. base)

When using workers, the performance was not on par with the run without workers, but it was at least much better than without the changes.

(screenshot: throughput comparison with job activation)

Results:

  • We were able to show that implementing the highlighted issues in #11813 is enough to reach our goals. You can see in the screenshot that we reached stable creation performance, whereas the base benchmark broke after some time.
  • We used a new unit test with a big state and our benchmarks to verify the performance goals and impact.
  • We have seen that job activation is still an issue, but it is out of scope for the first iteration.
    • The POC branch still performed better than the base, but hasn't reached the 150 PI/s.
    • We are working on a job push model, which might resolve this issue as well.
  • We discussed several other ideas and improvements, which might be interesting for the next iteration.

Next:

Some of the solutions used in the POC are not fully working, and we need to think a bit more about how to implement them in the right way. We will start with some smaller easy picks that already sound quite promising, like blacklist check improvements etc.

Running with TMPFS

At the beginning of the year we added a new configuration which allows separating the runtime and snapshot directories (#11772). This allows us to put the runtime directory into tmpfs. We wanted to verify whether this would give us better or more stable performance for creating new instances even on a larger state.

TLDR; Unfortunately, we were not able to show this.

Changes for the values file:

```
$ diff default/values.yaml zell-larger-state/values.yaml
9c9
<   replicas: 3
---
>   replicas: 0
14c14
<   rate: 75
---
>   rate: 150
111,112c89,90
<         cpu: 1350m
<         memory: 4Gi
---
>         cpu: 2
>         memory: 8Gi
125a104,123
> 
>     extraVolumes:
>       - name: zeebe-config
>         configMap:
>           name: zeebe-config
>           defaultMode: 0754
>       - name: pyroscope
>         emptyDir: {}
>       - name: tmpfs
>         emptyDir:
>           medium: Memory
> 
>     extraVolumeMounts:
>       - name: pyroscope
>         mountPath: /pyroscope
>       - name: zeebe-config
>         mountPath: /usr/local/zeebe/config/application.yaml
>         subPath: application.yml
>       - mountPath: /usr/local/zeebe/runtime
>         name: tmpfs
```

We can see based on the metrics that the throughput was still breaking down after some time.

(screenshot: general benchmark metrics for the tmpfs run)

To rule out that swapping might be an issue, I gave the container 32 gig of memory, but this didn't help.

@oleschoenburg and I realized that a tmpfs mount will get half of the memory of the node. This might also be an influencing factor, and it might swap when there is not enough memory. But here we are not sure and haven't investigated further.

@koevskinikola
Member

The ZPA team will support @Zelldon on this topic through @koevskinikola and @berkaycanbc.

zeebe-bors-camunda bot added a commit that referenced this issue May 2, 2023
12483: Introduce experimental SST partitioning r=Zelldon a=Zelldon

## Description

Discovered this in [the RocksDB google group post](https://groups.google.com/g/rocksdb/c/l3CzFD4YBYQ#:~:text=another%20way%20that%20might%20be%20helpful%20is%20using%20sst_partitioner_factory%20.%20By%20using%20this%20experimental%20feature%2C%20you%20can%20partition%20the%20ssts%20based%20on%20your%20desired%20prefix%20which%20means%20you%20would%20only%20have%20to%20tell%20how%20many%20entries%20are%20in%20that%20sst.)

[From the java docs](https://javadoc.io/static/org.rocksdb/rocksdbjni/6.20.3/org/rocksdb/ColumnFamilyOptionsInterface.html#setSstPartitionerFactory(org.rocksdb.SstPartitionerFactory))
> use the specified factory for a function to determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).

### Details

SST partitioning based on the column family prefix (virtual column family) allows splitting up key ranges into separate SST files, which should improve compaction and make propagation of SST files less write-amplifying.

It will cause more files to be created in the runtime and snapshot directories, as it creates more SST files: at least one for each column family we use at runtime.

As discussed here https://camunda.slack.com/archives/C04T7T0RPLY/p1681931668446069, we want to add this as an experimental feature for now, so people can play around with it and so can we. The benchmark results so far looked quite promising. The feature itself is marked as experimental in RocksDB as well, so it makes sense to mark it as experimental on our side too.
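
For illustration, a minimal standalone RocksJava sketch of what enabling this could look like (not the actual Zeebe wiring; the path and class name are made up, and the prefix length assumes the 8-byte long we use as column family prefix):

```
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstPartitionerFixedPrefixFactory;

public final class SstPartitioningSketch {
  public static void main(final String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    // Split SST files on the first 8 bytes of the key, i.e. the long that is
    // used as the (virtual) column family prefix.
    try (final SstPartitionerFixedPrefixFactory partitioner =
            new SstPartitionerFixedPrefixFactory(Long.BYTES);
        final Options options =
            new Options().setCreateIfMissing(true).setSstPartitionerFactory(partitioner);
        final RocksDB db = RocksDB.open(options, "/tmp/sst-partitioning-sketch")) {
      db.put("key".getBytes(), "value".getBytes());
    }
  }
}
```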

Open questions:

1. It seems that the config is marked as an experimental feature at RocksDB. I don't know what exactly this means. Is this a problem for us? Would we just stay on the current version if they remove it? Is it unstable? Not sure yet.
2. The maximum throughput seems to be degraded a bit. As I mentioned earlier, we are currently able to reach around ~240 PI/s; [with this configuration we reach ~220 PI/s.](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&refresh=10s&from=now-6h&to=now&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-max-out-sst-partitioner&var-pod=All&var-partition=All) I think it depends on what our priority is right now: the maximum throughput, or providing stable performance on larger state. Is it ok to hurt our maximum throughput a little? We will need to investigate this further.

### JMH Benchmarks

I tried it with the JMH benchmark and it gave impressive results
```
Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  656.639 ±(99.9%) 91.394 ops/s [Average]
  (min, avg, max) = (1.775, 656.639, 1163.635), stdev = 386.967
  CI (99.9%): [565.246, 748.033] (assumes normal distribution)
# Run complete. Total time: 00:07:12
Benchmark                                           Mode  Cnt    Score    Error  Units
EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  656.639 ± 91.394  ops/s
```

[Remember the base was ~230](#12241 (comment))

### Zeebe Benchmarks

After the JMH benchmark I started some new benchmarks, e.g. for the large state. I wanted to see how it would survive when we continuously just start instances.

Remember: Previously we died after ~1 hour, when reaching 800 MB of state.
[In the benchmark we had reached at least ~4.5 gig and were still able to handle the same load (over 6 hours). ](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&from=1681912207012&to=1681930704963&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-large-state-sst-partition&var-pod=All&var-partition=All):exploding_head:
![snapshot](https://user-images.githubusercontent.com/2758593/235164591-0ba3cb40-aa47-4bf4-b647-9992ac5d7e88.png)
![general](https://user-images.githubusercontent.com/2758593/235164598-5da0906e-a50f-4235-a5b8-48181dffc9d5.png)

#### Maxing out benchmark

![maxgeneral](https://user-images.githubusercontent.com/2758593/235164601-bab9f40c-20be-4cbe-8530-c0ba791ec0f0.png)



## Related issues

related to #12033

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
zeebe-bors-camunda bot added a commit that referenced this issue May 2, 2023
12625: [Backport stable/8.2] Introduce experimental SST partitioning r=Zelldon a=backport-action

# Description
Backport of #12483 to `stable/8.2`.

relates to #12033

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
zeebe-bors-camunda bot added a commit that referenced this issue May 3, 2023
12629: [Backport 8.1]: Introduce experimental SST partitioning r=remcowesterhoud a=Zelldon

## Description
Backports #12483 

## Related issues


relates to #12033



12646: [Backport 8.1]: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description
Backports #12606

Merge conflicts because of imports.

## Related issues

closes #8263



Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
zeebe-bors-camunda bot added a commit that referenced this issue May 3, 2023
12630: [Backport 8.0]: Introduce experimental SST partitioning r=remcowesterhoud a=Zelldon

## Description
Backports  #12483

## Related issues


closes #12033



12645: [Backport 8.0]: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description

Backports #12606

The PR https://github.com/camunda/zeebe/pull/12306/files wasn't backported to 8.0, which caused some conflicts. I had to add the onRecovered method and call it in the ZeebeDbState.

## Related issues


closes #8263



Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
@remcowesterhoud remcowesterhoud added the version:8.2.4 Marks an issue as being completely or in parts released in 8.2.4 label May 3, 2023
@Zelldon
Member Author

Zelldon commented May 5, 2023

Results of POC Week

TL;DR;

What I can say after this week of deeper investigation is that splitting up the key space will give us the biggest performance boost, and this can be achieved either by enabling SST partitioning or by introducing new real column families, which allows RocksDB to split them up correctly. Note: we will always trade performance against resources, either space or memory or both.

What I have seen this week is that if we use one of these solutions, all other issues we found earlier seem to be negligible. If we are not using them, we can clearly see that consistency checks are the most prominent issue in larger states (especially searching for non-existing keys), followed by iterator seek, which is also impacted enormously by a larger state due to using one column family and having all keys together. The seek issue can only be handled either by not iterating at all (if not necessary), where it is currently hard to determine whether this is really possible, or by using one of the options above, SST partitioning or extra column families. Right now the SST partitioning large-state benchmark has been running for longer than 6 hours and still looks stable.

Using the JMH benchmark setup really helped me to understand the impact of solutions in a fast way and to easily profile the solutions; all results can be found here.
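
For context, such a JMH throughput benchmark has roughly the following shape (a minimal, self-contained sketch with illustrative names, parameters, and body, not the actual EnginePerformanceTest):

```
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 20, time = 1)
@Fork(1)
public class ProcessExecutionBenchmark {

  private long counter;

  @Setup
  public void setUp() {
    // In the real test this would start the engine, deploy a process, and pre-fill state.
    counter = 0;
  }

  @Benchmark
  public long measureProcessExecutionTime() {
    // Placeholder for "create and complete one process instance".
    return ++counter;
  }
}
```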

Details of the Week:

Investigating Prefix seek

  • Roman pinged me that he was able to reproduce that iterating doesn’t work as expected: we iterate over column family boundaries if we are not actively checking the prefix, as described here. I was able to reproduce this as well; when removing the prefix check, some tests fail because they iterate into a different column family (see the sketch below).
  • The issue is related to dirty transactions and the missing support for prefix iterating inside transactions.
  • Related to this, I found another bug on our side: checkers use the same transaction.
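
For illustration, this is the kind of explicit prefix check needed when iterating with plain RocksJava (a generic sketch, not the actual ZeebeDb transaction code):

```
import java.util.Arrays;
import java.util.function.BiConsumer;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

final class PrefixIteration {

  /** Visits all entries whose key starts with the given prefix, stopping at the boundary. */
  static void forEachWithPrefix(
      final RocksDB db, final byte[] prefix, final BiConsumer<byte[], byte[]> visitor) {
    try (final RocksIterator it = db.newIterator()) {
      for (it.seek(prefix); it.isValid(); it.next()) {
        final byte[] key = it.key();
        // Without this check, the iteration would run over the boundary into
        // the keys of the next (virtual) column family.
        if (key.length < prefix.length
            || !Arrays.equals(key, 0, prefix.length, prefix, 0, prefix.length)) {
          return;
        }
        visitor.accept(key, it.value());
      }
    }
  }
}
```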

DeleteRange not supported:

Discussed SST partitioning with Roman:

We agreed that the solutions look quite promising, but there are still some open questions we should answer and clarify. Thanks @romansmirnov for your input!

Questions:

  • What does the experimental marker mean?
  • Can we turn the feature on and off without issues?
  • Why did the previous benchmark die?
  • Because of disk space. I started a new benchmark with much larger disk space in order to see how it behaves, how long it will survive, and whether the disk is the only limit.
  • My assumption is that splitting into several CFs behaves similarly; is this true? We have shown this in a POC below.

I asked in the RocksDB google group about the experimental feature
Important responses:

  • Typically, experimental features are marked as such meaning that the API is subject to change. Looking back through the history file, I believe this class was introduced in 6.12 in July 2020. Since that time I cannot see any changes to this class. YMMV

  • Breaking files by prefix is a good idea in some environments (for example MyRocks that uses prefix to define tables) . The iterator gain is only when you seek for all the keys in this prefix. The other gains may also be in compaction if you write multiple sequences (one for each prefix) The only pitfall is to ensure you do not create too many small files. We have tested this feature and did not found any issue with it,,,

The answers reflect our current understanding of this feature. It is interesting to note that other vendors or applications use this feature as well, which gives us more trust in the feature itself, especially since they mention it hasn’t changed since 2020 and they have tested it. I think the use case in general fits our needs as well.

POC enhance column family (CF) prefix check

  • We currently use a long as the prefix for all keys, whereas the CF ordinal fits in a byte, and we need to compare that prefix every time we iterate. The idea here was to reduce the overhead of checking the prefix by only checking the single significant byte; see the sketch below.
  • Result: No significant impact, also not on max-out performance.
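
A minimal sketch of the idea (illustrative helper only, assuming a big-endian encoded long prefix; the actual byte position depends on the key encoding used):

```
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PrefixCheck {

  /** Full check: decode the leading 8-byte big-endian long and compare it. */
  static boolean hasPrefixFull(final byte[] key, final long cfPrefix) {
    return key.length >= Long.BYTES
        && ByteBuffer.wrap(key, 0, Long.BYTES).order(ByteOrder.BIG_ENDIAN).getLong() == cfPrefix;
  }

  /**
   * Cheaper check: if every column family ordinal fits into a single byte, only the least
   * significant byte of the encoded long can differ between column families, so comparing
   * that one byte is enough.
   */
  static boolean hasPrefixByteOnly(final byte[] key, final long cfPrefix) {
    return key.length >= Long.BYTES && key[Long.BYTES - 1] == (byte) cfPrefix;
  }
}
```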

POC introduce key formats

  • In several places we use invalid keys, searching for keys which do not exist, like -1. This causes RocksDB to search for such a key in several places, whereas we could avoid this overhead if we could easily detect that the key is not valid. Related to point lookups with non-existing keys.
  • Idea: respond faster to invalid keys, like -1 (simply return null on an invalid format); see the sketch below.
  • Result: No significant impact.
  • Profiling: Shows again that iterator.seek is a huge problem.
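
A minimal sketch of the idea (hypothetical wrapper and interface, not the actual ZeebeDb API):

```
final class GuardedLookup {

  interface KeyValueStore {
    byte[] get(long key);
  }

  private final KeyValueStore store;

  GuardedLookup(final KeyValueStore store) {
    this.store = store;
  }

  byte[] get(final long key) {
    // Keys are assumed to always be positive in this format; -1 is used as "no key".
    // Returning null directly avoids a RocksDB point lookup that can never succeed.
    if (key < 0) {
      return null;
    }
    return store.get(key);
  }
}
```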

POC cache data in ZeebeDB

  • The idea was that, instead of adding several caches on top of ZeebeDB (which we already did in some parts), we add caching inside ZeebeDB itself. It turned out not to be as easy as we thought.
  • For example, I again ran into issues with reusing value and key instances; the cache was easily corrupted by this. This means we always need to make a copy of instances we want to keep.
  • Furthermore, it is not that easy to add caching across transaction bounds. A transaction might be committed or rolled back, and in this case we might need several layers of caches.
  • For the POC I went the easy route and added caching for the transaction, which allows returning values that have been written or requested earlier (which happens multiple times in our processing). The cache is invalidated after commit or rollback. A rough sketch of this is shown below.
  • Result:
  • Profiling:
    • This shows again that the consistency checks are the dominators, especially Variable.createScope, where we check for non-existing keys, which of course can’t be in the cache. This means this cache will not help here.
    • Adding caching for nonexistence also didn’t help, since it will be updated afterward, and will always be a cache miss on the first check.
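
A rough sketch of the transaction-scoped cache idea (hypothetical class, not the actual POC code):

```
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class TransactionCache {

  private final Map<String, byte[]> cache = new HashMap<>();

  /** Returns the cached value or loads and caches it; misses on non-existing keys stay misses. */
  byte[] get(final String key, final Function<String, byte[]> loader) {
    // Values must be defensive copies, otherwise reused key/value instances corrupt the cache.
    return cache.computeIfAbsent(key, k -> copy(loader.apply(k)));
  }

  void put(final String key, final byte[] value) {
    cache.put(key, copy(value));
  }

  /** Invalidate everything once the transaction ends, regardless of the outcome. */
  void onCommit() {
    cache.clear();
  }

  void onRollback() {
    cache.clear();
  }

  private static byte[] copy(final byte[] value) {
    return value == null ? null : value.clone();
  }
}
```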

POC Create additional CF

  • The hypothesis is that we can achieve the same as with SST partitioning by separating some column families from the default, meaning we create more real column families for our virtual/logical column families. A minimal sketch follows below.
  • In the POC I implemented it in a way that lets us configure, based on a prefix, which column families should be separated. I started with jobs and element instances, resulting in three column families: jobs (where all virtual column families starting with JOBS_ end up), element_instance (where all virtual column families starting with ELEMENT_INSTANCE_ end up), and default (the rest).
  • Results: Again impressive results, comparable to SST partitioning (even a bit better).
    • The Zeebe benchmarks clearly show that we need more resources due to configuring more column families, which is expected of course. This means we either need to add more resources or spend some time to find a good configuration that splits up the resources between the configured column families.
    • The large state benchmark looks pretty good so far as well.
  • Note: Migrating an existing data set to this might be a bit more complicated, but it is possible (we need to migrate all key-value pairs). Turning it off again might be problematic as well. Maybe this is acceptable since it is a database schema change, and we need to document that going back is not possible (?), only with clean data, similar to normal relational schema changes. It could be an alternative to SST partitioning if the experimental flag is too hot, but as shown above that seems to be stable as well.
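
A minimal RocksJava sketch of the separation idea (illustrative names and prefix mapping, not the actual POC implementation):

```
import java.util.List;
import org.rocksdb.ColumnFamilyDescriptor;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.DBOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class SeparatedColumnFamiliesSketch {

  static RocksDB openWithSeparatedFamilies(final String path, final List<ColumnFamilyHandle> out)
      throws RocksDBException {
    RocksDB.loadLibrary();
    // One real column family per configured prefix, plus the default for everything else.
    final List<ColumnFamilyDescriptor> descriptors =
        List.of(
            new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY),
            new ColumnFamilyDescriptor("jobs".getBytes()),
            new ColumnFamilyDescriptor("element_instance".getBytes()));
    final DBOptions options =
        new DBOptions().setCreateIfMissing(true).setCreateMissingColumnFamilies(true);
    return RocksDB.open(options, path, descriptors, out);
  }

  /** Picks the real column family for a virtual column family name, based on its prefix. */
  static int familyIndexFor(final String virtualColumnFamily) {
    if (virtualColumnFamily.startsWith("JOBS_")) {
      return 1;
    }
    if (virtualColumnFamily.startsWith("ELEMENT_INSTANCE_")) {
      return 2;
    }
    return 0; // default
  }
}
```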

Conclusion:

  • Based on the results of the POCs and benchmarks, I think it makes more sense to follow the results of the profiling. We have seen that caching doesn't bring much; it is questionable whether migrating to WriteBatch would.
  • The biggest blocker is the repetitive queries for keys that don’t exist, especially with larger data sets. This can be removed by disabling consistency checks.
  • When disabling the checks, the next blocker (based on profiling) is the iterator seek, which slows the computation/processing down. This can be overcome by SST partitioning or separating column families, or by completely avoiding iterating, if possible.
  • We have seen that SST partitioning and separating column families also help in the first case (we no longer need to disable the checks); this is because they reduce the time of seeking and searching for a key (which is the slow part of the consistency checks).
  • When SST partitioning is enabled, the profile looks pretty good; there are some seeks which could be removed or avoided to improve further.
  • Based on an answer from the RocksDB Google group, the SST partitioning feature is quite stable (not changed since 2020) and used in other products as well (it has been tested). The only pitfall might be creating too many small files.
  • A good alternative might be the separation of column families, which gives us similar results, but has its own drawbacks of course.
  • It might make sense to test all of these POCs in combination, but in order to focus I would move forward with investigating SST partitioning and CF separation further for now. With these, we might have found solutions that don't impact the general usage in the engine itself, meaning we don't need to change the engine code; the changes are quite focused on ZeebeDB itself.

Next:

Investigate and compare SST partitioning and Column family separation

@Zelldon Zelldon added the area/performance Marks an issue as performance related label May 5, 2023
@aivinog1
Contributor

aivinog1 commented May 8, 2023

Hey @Zelldon!
It is probably not my business, but I haven't seen any dashboards about memory and GC. I wrote some notes with my experience of using Zeebe, and optimizations for a Zeebe Broker. Maybe you will find something useful here.

@Zelldon
Member Author

Zelldon commented May 15, 2023

@aivinog1 actually we have panels in our Zeebe dashboard about memory AND GC :)

(screenshot: memory and GC panels from the Zeebe dashboard)

@Zelldon
Member Author

Zelldon commented Jun 5, 2023

Small update on what I did in between my recent FTOs (I just want to share this more publicly):

As the next step, I plan to compare the potential solutions of separating column families vs. SST partitioning in more depth and try to conclude this week.

@Zelldon
Member Author

Zelldon commented Jun 14, 2023

For posterity and to post it in public:

Comparing solutions

I compared the two possible approaches and concluded that we will move forward with SST partitioning; see Slack.

My conclusion was the following:

Based on the results above I will conclude this comparison and mark SST partitioning as the preferred solution for now.
It gives a lot of gains, like performance improvements on a larger state (which was the goal), with smaller risk and implementation effort compared to splitting up the column families.

We were able to show a stable performance for new instances even on a large state with the SST partitioning, which means we reached our goal here. For more details take a look at the referenced issue.

Enabling SST per default

A PR was created to enable SST partitioning by default.

Next

With #12241 I created a JMH benchmark which helped me to determine which solutions are worth investigating further. I would like to spend some time migrating that into a unit test and integrating it into our CI. This should allow us to detect when we introduce performance degradations.

I will pair on this with @oleschoenburg

Afterward, I will mark this topic as complete for now. Other topics which have been raised during this project are still valid, but will be tackled later depending on their priority.

@Zelldon
Member Author

Zelldon commented Jun 28, 2023

With the most recent PRs #13135 and #13121, we added a JMH benchmark which should allow us to prevent regressions. The JMH test can be executed as a unit test, which enables us to run it regularly via our CI. A rough sketch of how that could look is shown below.
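
As a rough sketch of the idea (illustrative include pattern and threshold, not the actual implementation), the JMH runner can be invoked from a plain JUnit test:

```
import org.junit.jupiter.api.Test;
import org.openjdk.jmh.results.RunResult;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

class EnginePerformanceRegressionTest {

  @Test
  void shouldKeepThroughputAboveThreshold() throws Exception {
    // Run the JMH benchmark programmatically so a normal JUnit runner (and CI) can execute it.
    final Options options = new OptionsBuilder().include("EnginePerformanceTest").build();
    final RunResult result = new Runner(options).run().iterator().next();

    // The threshold is illustrative; a real test would compare against an agreed baseline.
    if (result.getPrimaryResult().getScore() < 100.0) {
      throw new AssertionError("Throughput regression detected");
    }
  }
}
```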

With that, I will mark this EPIC as closed. There are several issues that came up or were discovered that we still need to take a look at or discuss, like #12203, but this will be done separately.
