[EPIC] Support stable performance for new instances even on larger state #12033

Closed
16 of 22 tasks
Zelldon opened this issue Mar 15, 2023 · 11 comments
Labels
area/performance Marks an issue as performance related component/db component/engine kind/epic Categorizes an issue as an umbrella issue (e.g. OKR) which references other, smaller issues kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. version:8.2.4 Marks an issue as being completely or in parts released in 8.2.4

Comments

@Zelldon
Member

Zelldon commented Mar 15, 2023

Problem Description

It came up in #11813 that our system currently runs into performance issues when it has a bigger state. In particular, we have observed that once we hit a certain amount of state in RocksDB, the performance drops suddenly, which is unexpected.

Goal / Focus

This EPIC can be seen as the first iteration to resolve this. Our focus is: "Hot data should always be processed fast/performantly. Cold data or a big state shouldn't have an impact on new data and its performance."

In other words, if we continuously create new instances, the creation and execution of these instances should always perform the same as, or similar to, the previously created instances. We want stable performance for new instances.

To be specific, something like the following shouldn't happen:

(screenshot: instance creation throughput dropping after about one hour)

Here we can see that the performance dropped after ~1 hour of accumulating instances and state. We are stable in the sense that the cluster still accepts new instances and doesn't crash, but the performance has degraded significantly.

Next / Upcoming

The following shows the next planned steps; this list is updated incrementally.

Next

Break down

Discover

  1. component/db component/zeebe kind/question kind/toil
    abbasadel megglos
  2. area/performance component/db kind/research kind/toil
  3. area/performance component/db kind/toil
  4. area/performance kind/research
  5. area/performance backport stable/8.0 backport stable/8.1 backport stable/8.2 benchmark kind/research version:8.0.14 version:8.1.12 version:8.2.4 version:8.3.0-alpha1
  6. benchmark

As part of #11813, potential issues and solutions have already been identified. Besides the identified issues, we are planning to run some POCs as part of a discovery phase, which might identify more potential tasks.

Testing

  1. area/performance component/engine component/stream-platform version:8.3.0 version:8.3.0-alpha3
    Zelldon

In order to validate our changes, we will need to implement some tests (later automated). This should allow us to make changes iteratively and should also prevent later regressions.

Improvements

  1. area/performance backport stable/8.1 backport stable/8.2 component/engine kind/toil version:8.1.11 version:8.2.3
    Zelldon
  2. area/performance backport stable/8.1 component/engine kind/toil version:8.2.3 version:8.3.0 version:8.3.0-alpha1
    Zelldon
  3. area/performance backport stable/8.0 backport stable/8.1 backport stable/8.2 benchmark kind/research version:8.0.14 version:8.1.12 version:8.2.4 version:8.3.0-alpha1
  4. 1 of 5
    area/performance component/engine component/zeebe kind/toil
  5. area/performance component/engine component/zeebe kind/toil
  6. area/performance component/db component/engine component/zeebe kind/toil

The above breakdown should help us to achieve our goals.

@korthout
Member

@Zelldon As our teams will have to collaborate on this topic, it would be helpful for us to have a product-hub issue. This would help product management understand that we're both involved and must spend time on it. That should help us find the needed time.

@Zelldon
Member Author

Zelldon commented Mar 30, 2023

There is already https://github.com/camunda/product-hub/issues/989, which either needs to be adjusted or a new one needs to be created. I will align soon with @felix-mueller.

BTW, I don't think that everything we do needs to have a product-hub issue. You need to reflect that in your planning and make it transparent. I see that this could help, but it seems to duplicate work. At least it wasn't communicated to me/us that this is a new process.

@felix-mueller
Member

From my perspective, we should reuse camunda/product-hub#989 for this and adjust the scope of the first iteration of the product-hub epic accordingly.
If we believe there will be future iterations (e.g. in Zeebe or also in other components like Operate), then we can create further epics in product-hub.

@Zelldon
Member Author

Zelldon commented Apr 3, 2023

POC outcomes

Performance Hackday

We ran a performance hack day as a team and tried out the POC branch provided by @romansmirnov.

The results look quite promising: we were able to verify that instance creation was still performant and the throughput didn't break; it was stable over a long period of time.

(screenshot: throughput comparison, POC branch vs. base)

When using workers, the performance was not on par with the run without workers, but it was at least much better than without the changes.

(screenshot: throughput comparison with job activation)

Results:

  • We were able to show that implementing the highlighted issues in #11813 is enough to reach our goals. You can see in the screenshot that we reached stable creation performance, whereas the base benchmark broke after some time.
  • We used a new unit test with a big state and our benchmarks to verify the performance goals and impact.
  • We have seen that job activation is still an issue, but it is out of scope for the first iteration.
    • The POC branch still performed better than the base, but hasn't reached the 150 PI/s.
    • We are working on a job push model, which might resolve this issue as well.
  • We discussed several other ideas and improvements, which might be interesting for the next iteration.

Next:

Some of the solutions used in the POC are not fully working, and we need to think a bit more about how to implement them in the right way. We will start with some smaller easy picks that already sound quite promising, like blacklist check improvements etc.

Running with TMPFS

At the beginning of the year we added a new configuration which allows separating the runtime and snapshot directories (#11772). This allows us to put the runtime directory into tmpfs. We wanted to verify whether this would give us better or more stable performance for creating new instances even on a larger state.

TLDR; Unfortunately, we were not able to show this.

Changes for the values file:

```
$ diff default/values.yaml zell-larger-state/values.yaml
9c9
<   replicas: 3
---
>   replicas: 0
14c14
<   rate: 75
---
>   rate: 150
111,112c89,90
<         cpu: 1350m
<         memory: 4Gi
---
>         cpu: 2
>         memory: 8Gi
125a104,123
> 
>     extraVolumes:
>       - name: zeebe-config
>         configMap:
>           name: zeebe-config
>           defaultMode: 0754
>       - name: pyroscope
>         emptyDir: {}
>       - name: tmpfs
>         emptyDir:
>           medium: Memory
> 
>     extraVolumeMounts:
>       - name: pyroscope
>         mountPath: /pyroscope
>       - name: zeebe-config
>         mountPath: /usr/local/zeebe/config/application.yaml
>         subPath: application.yml
>       - mountPath: /usr/local/zeebe/runtime
>         name: tmpfs
```

We can see based on the metrics that the throughput was still breaking down after some time.

(screenshot: general benchmark metrics for the tmpfs run)

To rule out that swapping might be an issue, I gave the container 32 gig of memory, but this didn't help.

@oleschoenburg and I realized that a tmpfs mount will get half of the memory of the node. This might also be an influencing factor, and it might swap when there is not enough memory. But here we are not sure and haven't investigated further.

@koevskinikola
Member

The ZPA team will support @Zelldon on this topic through @koevskinikola and @berkaycanbc.

zeebe-bors-camunda bot added a commit that referenced this issue May 2, 2023
12483: Introduce experimental SST partitioning r=Zelldon a=Zelldon

## Description

Discovered this in [the RocksDB google group post](https://groups.google.com/g/rocksdb/c/l3CzFD4YBYQ#:~:text=another%20way%20that%20might%20be%20helpful%20is%20using%20sst_partitioner_factory%20.%20By%20using%20this%20experimental%20feature%2C%20you%20can%20partition%20the%20ssts%20based%20on%20your%20desired%20prefix%20which%20means%20you%20would%20only%20have%20to%20tell%20how%20many%20entries%20are%20in%20that%20sst.)

[From the java docs](https://javadoc.io/static/org.rocksdb/rocksdbjni/6.20.3/org/rocksdb/ColumnFamilyOptionsInterface.html#setSstPartitionerFactory(org.rocksdb.SstPartitionerFactory))
> use the specified factory for a function to determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).

### Details

SST partitioning based on the column family prefix (virtual column family) allows splitting up key ranges into separate SST files, which should improve compaction and make propagation of SST files less write-amplifying.

It will cause more files to be created in the runtime and snapshot directories, as it creates more SST files: at least one for each column family we use at runtime.

As discussed here https://camunda.slack.com/archives/C04T7T0RPLY/p1681931668446069, we want to add this as an experimental feature for now, so people can play around with it and so can we. The benchmark results so far looked quite promising. The feature itself is marked as experimental in RocksDB as well, so it makes sense to mark it as experimental on our side too.
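
For illustration, a minimal standalone RocksJava sketch of what enabling this could look like (not the actual Zeebe wiring; the path and class name are made up, and the prefix length assumes the 8-byte long we use as column family prefix):

```
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstPartitionerFixedPrefixFactory;

public final class SstPartitioningSketch {
  public static void main(final String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    // Split SST files on the first 8 bytes of the key, i.e. the long that is
    // used as the (virtual) column family prefix.
    try (final SstPartitionerFixedPrefixFactory partitioner =
            new SstPartitionerFixedPrefixFactory(Long.BYTES);
        final Options options =
            new Options().setCreateIfMissing(true).setSstPartitionerFactory(partitioner);
        final RocksDB db = RocksDB.open(options, "/tmp/sst-partitioning-sketch")) {
      db.put("key".getBytes(), "value".getBytes());
    }
  }
}
```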

Open questions:

1. It seems that the config is marked as an experimental feature at RocksDB. I don't know what exactly this means. Is this a problem for us? Would we just stay on the current version if they remove it? Is it unstable? Not sure yet.
2. The maximum throughput seems to be degraded a bit. As I mentioned earlier, we are currently able to reach around ~240 PI/s; [with this configuration we reach ~220 PI/s.](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&refresh=10s&from=now-6h&to=now&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-max-out-sst-partitioner&var-pod=All&var-partition=All) I think it depends on what our priority is right now: the maximum throughput, or providing stable performance on larger state. Is it ok to hurt our maximum throughput a little? We will need to investigate this further.

### JMH Benchmarks

I tried it with the JMH benchmark and it gave impressive results
```
Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  656.639 ±(99.9%) 91.394 ops/s [Average]
  (min, avg, max) = (1.775, 656.639, 1163.635), stdev = 386.967
  CI (99.9%): [565.246, 748.033] (assumes normal distribution)
# Run complete. Total time: 00:07:12
Benchmark                                           Mode  Cnt    Score    Error  Units
EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  656.639 ± 91.394  ops/s
```

[Remember the base was ~230](#12241 (comment))

### Zeebe Benchmarks

After the JMH benchmark I started some new benchmarks, e.g. for the large state. I wanted to see how it would survive when we continuously just start instances.

Remember: Previously we died after ~1 hour, when reaching 800 MB of state.
[In the benchmark we had reached at least ~4.5 gig and were still able to handle the same load (over 6 hours). ](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&from=1681912207012&to=1681930704963&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-large-state-sst-partition&var-pod=All&var-partition=All):exploding_head:
![snapshot](https://user-images.githubusercontent.com/2758593/235164591-0ba3cb40-aa47-4bf4-b647-9992ac5d7e88.png)
![general](https://user-images.githubusercontent.com/2758593/235164598-5da0906e-a50f-4235-a5b8-48181dffc9d5.png)

#### Maxing out benchmark

![maxgeneral](https://user-images.githubusercontent.com/2758593/235164601-bab9f40c-20be-4cbe-8530-c0ba791ec0f0.png)



## Related issues

related to #12033

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
zeebe-bors-camunda bot added a commit that referenced this issue May 2, 2023
12625: [Backport stable/8.2] Introduce experimental SST partitioning r=Zelldon a=backport-action

# Description
Backport of #12483 to `stable/8.2`.

relates to #12033

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
zeebe-bors-camunda bot added a commit that referenced this issue May 3, 2023
12629: [Backport 8.1]: Introduce experimental SST partitioning r=remcowesterhoud a=Zelldon

## Description
Backports #12483 

## Related issues


relates to #12033



12646: [Backport 8.1]: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description
Backports #12606

Merge conflicts because of imports.

## Related issues

closes #8263



Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
zeebe-bors-camunda bot added a commit that referenced this issue May 3, 2023
12630: [Backport 8.0]: Introduce experimental SST partitioning r=remcowesterhoud a=Zelldon

## Description
Backports  #12483

## Related issues


closes #12033



12645: [Backport 8.0]: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description

Backports #12606

The PR https://github.com/camunda/zeebe/pull/12306/files wasn't backported to 8.0, which caused some conflicts. I had to add the onRecovered method and call it in the ZeebeDbState.

## Related issues


closes #8263



Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
@remcowesterhoud remcowesterhoud added the version:8.2.4 Marks an issue as being completely or in parts released in 8.2.4 label May 3, 2023
@Zelldon
Member Author

Zelldon commented May 5, 2023

Results of POC Week

TL;DR;

What I can say after this week of deeper investigation is that splitting up the key space will give us the biggest performance boost, and this can be achieved either by enabling SST partitioning or by introducing new real column families, which allows RocksDB to split them up correctly. Note: we will always trade performance against resources, either space or memory or both.

What I have seen this week is that if we use one of these solutions, all other issues we found earlier seem to be negligible. If we are not using them, we can clearly see that consistency checks are the most prominent issue in larger states (especially searching for non-existing keys), followed by iterator seek, which is also impacted enormously by a larger state due to using one column family and having all keys together. The seek issue can only be handled either by not iterating at all (if not necessary), where it is currently hard to determine whether this is really possible, or by using one of the options above, SST partitioning or extra column families. Right now the SST partitioning large-state benchmark has been running for longer than 6 hours and still looks stable.

Using the JMH benchmark setup really helped me to understand the impact of solutions in a fast way and to easily profile the solutions; all results can be found here.
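
For context, such a JMH throughput benchmark has roughly the following shape (a minimal, self-contained sketch with illustrative names, parameters, and body, not the actual EnginePerformanceTest):

```
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 20, time = 1)
@Fork(1)
public class ProcessExecutionBenchmark {

  private long counter;

  @Setup
  public void setUp() {
    // In the real test this would start the engine, deploy a process, and pre-fill state.
    counter = 0;
  }

  @Benchmark
  public long measureProcessExecutionTime() {
    // Placeholder for "create and complete one process instance".
    return ++counter;
  }
}
```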

Details of the Week:

Investigating Prefix seek

  • Roman pinged me that he was able to reproduce that iterating doesn’t work as expected: we iterate over column family boundaries if we are not actively checking the prefix, as described here. I was able to reproduce this as well; when removing the prefix check, some tests fail because they iterate into a different column family (see the sketch below).
  • The issue is related to dirty transactions and the missing support for prefix iterating inside transactions.
  • Related to this, I found another bug on our side: checkers use the same transaction.
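
For illustration, this is the kind of explicit prefix check needed when iterating with plain RocksJava (a generic sketch, not the actual ZeebeDb transaction code):

```
import java.util.Arrays;
import java.util.function.BiConsumer;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

final class PrefixIteration {

  /** Visits all entries whose key starts with the given prefix, stopping at the boundary. */
  static void forEachWithPrefix(
      final RocksDB db, final byte[] prefix, final BiConsumer<byte[], byte[]> visitor) {
    try (final RocksIterator it = db.newIterator()) {
      for (it.seek(prefix); it.isValid(); it.next()) {
        final byte[] key = it.key();
        // Without this check, the iteration would run over the boundary into
        // the keys of the next (virtual) column family.
        if (key.length < prefix.length
            || !Arrays.equals(key, 0, prefix.length, prefix, 0, prefix.length)) {
          return;
        }
        visitor.accept(key, it.value());
      }
    }
  }
}
```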

DeleteRange not supported:

Discussed SST partitioning with Roman:

We agreed that the solutions look quite promising, but there are still some open questions we should answer and clarify. Thanks @romansmirnov for your input!

Questions:

  • What does the experimental marker mean?
  • Can we turn the feature on and off without issues?
  • Why did the previous benchmark die?
  • Because of disk space. I started a new benchmark with much larger disk space in order to see how it behaves, how long it will survive, and whether the disk is the only limit.
  • My assumption is that splitting into several CFs behaves similarly; is this true? We have shown this in a POC below.

I asked in the RocksDB google group about the experimental feature
Important responses:

  • Typically, experimental features are marked as such meaning that the API is subject to change. Looking back through the history file, I believe this class was introduced in 6.12 in July 2020. Since that time I cannot see any changes to this class. YMMV

  • Breaking files by prefix is a good idea in some environments (for example MyRocks that uses prefix to define tables) . The iterator gain is only when you seek for all the keys in this prefix. The other gains may also be in compaction if you write multiple sequences (one for each prefix) The only pitfall is to ensure you do not create too many small files. We have tested this feature and did not found any issue with it,,,

The answers reflect our current understanding of this feature. It is interesting to note that other vendors or applications use this feature as well, which gives us more trust in the feature itself, especially since they mention it hasn’t changed since 2020 and they have tested it. I think the use case in general fits our needs as well.

POC enhance column family (CF) prefix check

  • We currently use a long as the prefix for all keys, whereas the CF ordinal fits in a byte, and we need to compare that prefix every time we iterate. The idea here was to reduce the overhead of checking the prefix by only checking the single significant byte; see the sketch below.
  • Result: No significant impact, also not on max-out performance.
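
A minimal sketch of the idea (illustrative helper only, assuming a big-endian encoded long prefix; the actual byte position depends on the key encoding used):

```
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PrefixCheck {

  /** Full check: decode the leading 8-byte big-endian long and compare it. */
  static boolean hasPrefixFull(final byte[] key, final long cfPrefix) {
    return key.length >= Long.BYTES
        && ByteBuffer.wrap(key, 0, Long.BYTES).order(ByteOrder.BIG_ENDIAN).getLong() == cfPrefix;
  }

  /**
   * Cheaper check: if every column family ordinal fits into a single byte, only the least
   * significant byte of the encoded long can differ between column families, so comparing
   * that one byte is enough.
   */
  static boolean hasPrefixByteOnly(final byte[] key, final long cfPrefix) {
    return key.length >= Long.BYTES && key[Long.BYTES - 1] == (byte) cfPrefix;
  }
}
```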

POC introduce key formats

  • In several places we use invalid keys, searching for keys which do not exist, like -1. This causes RocksDB to search for such a key in several places, whereas we could avoid this overhead if we could easily detect that the key is not valid. Related to point lookups with non-existing keys.
  • Idea: respond faster to invalid keys, like -1 (simply return null on an invalid format); see the sketch below.
  • Result: No significant impact.
  • Profiling: Shows again that iterator.seek is a huge problem.
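
A minimal sketch of the idea (hypothetical wrapper and interface, not the actual ZeebeDb API):

```
final class GuardedLookup {

  interface KeyValueStore {
    byte[] get(long key);
  }

  private final KeyValueStore store;

  GuardedLookup(final KeyValueStore store) {
    this.store = store;
  }

  byte[] get(final long key) {
    // Keys are assumed to always be positive in this format; -1 is used as "no key".
    // Returning null directly avoids a RocksDB point lookup that can never succeed.
    if (key < 0) {
      return null;
    }
    return store.get(key);
  }
}
```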

POC cache data in ZeebeDB

  • The idea was that, instead of adding several caches on top of ZeebeDB (which we already did in some parts), we add caching inside ZeebeDB itself. It turned out not to be as easy as we thought.
  • For example, I again ran into issues with reusing value and key instances; the cache was easily corrupted by this. This means we always need to make a copy of instances we want to keep.
  • Furthermore, it is not that easy to add caching across transaction bounds. A transaction might be committed or rolled back, and in this case we might need several layers of caches.
  • For the POC I went the easy route and added caching for the transaction, which allows returning values that have been written or requested earlier (which happens multiple times in our processing). The cache is invalidated after commit or rollback. A rough sketch of this is shown below.
  • Result:
  • Profiling:
    • This shows again that the consistency checks are the dominators, especially Variable.createScope, where we check for non-existing keys, which of course can’t be in the cache. This means this cache will not help here.
    • Adding caching for nonexistence also didn’t help, since it will be updated afterward, and will always be a cache miss on the first check.
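
A rough sketch of the transaction-scoped cache idea (hypothetical class, not the actual POC code):

```
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class TransactionCache {

  private final Map<String, byte[]> cache = new HashMap<>();

  /** Returns the cached value or loads and caches it; misses on non-existing keys stay misses. */
  byte[] get(final String key, final Function<String, byte[]> loader) {
    // Values must be defensive copies, otherwise reused key/value instances corrupt the cache.
    return cache.computeIfAbsent(key, k -> copy(loader.apply(k)));
  }

  void put(final String key, final byte[] value) {
    cache.put(key, copy(value));
  }

  /** Invalidate everything once the transaction ends, regardless of the outcome. */
  void onCommit() {
    cache.clear();
  }

  void onRollback() {
    cache.clear();
  }

  private static byte[] copy(final byte[] value) {
    return value == null ? null : value.clone();
  }
}
```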

POC Create additional CF

  • The hypothesis is that we can achieve the same as with SST partitioning by separating some column families from the default, meaning we create more real column families for our virtual/logical column families. A minimal sketch follows below.
  • In the POC I implemented it in a way that lets us configure, based on a prefix, which column families should be separated. I started with jobs and element instances, resulting in three column families: jobs (where all virtual column families starting with JOBS_ end up), element_instance (where all virtual column families starting with ELEMENT_INSTANCE_ end up), and default (the rest).
  • Results: Again impressive results, comparable to SST partitioning (even a bit better).
    • The Zeebe benchmarks clearly show that we need more resources due to configuring more column families, which is expected of course. This means we either need to add more resources or spend some time to find a good configuration that splits up the resources between the configured column families.
    • The large state benchmark looks pretty good so far as well.
  • Note: Migrating an existing data set to this might be a bit more complicated, but it is possible (we need to migrate all key-value pairs). Turning it off again might be problematic as well. Maybe this is acceptable since it is a database schema change, and we need to document that going back is not possible (?), only with clean data, similar to normal relational schema changes. It could be an alternative to SST partitioning if the experimental flag is too hot, but as shown above that seems to be stable as well.
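
A minimal RocksJava sketch of the separation idea (illustrative names and prefix mapping, not the actual POC implementation):

```
import java.util.List;
import org.rocksdb.ColumnFamilyDescriptor;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.DBOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class SeparatedColumnFamiliesSketch {

  static RocksDB openWithSeparatedFamilies(final String path, final List<ColumnFamilyHandle> out)
      throws RocksDBException {
    RocksDB.loadLibrary();
    // One real column family per configured prefix, plus the default for everything else.
    final List<ColumnFamilyDescriptor> descriptors =
        List.of(
            new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY),
            new ColumnFamilyDescriptor("jobs".getBytes()),
            new ColumnFamilyDescriptor("element_instance".getBytes()));
    final DBOptions options =
        new DBOptions().setCreateIfMissing(true).setCreateMissingColumnFamilies(true);
    return RocksDB.open(options, path, descriptors, out);
  }

  /** Picks the real column family for a virtual column family name, based on its prefix. */
  static int familyIndexFor(final String virtualColumnFamily) {
    if (virtualColumnFamily.startsWith("JOBS_")) {
      return 1;
    }
    if (virtualColumnFamily.startsWith("ELEMENT_INSTANCE_")) {
      return 2;
    }
    return 0; // default
  }
}
```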

Conclusion:

  • Based on the results of the POCs and benchmarks, I think it makes more sense to follow the results of the profiling. We have seen that caching doesn't bring much; it is questionable whether migrating to WriteBatch would.
  • The biggest blocker is the repetitive queries for keys that don’t exist, especially with larger data sets. This can be removed by disabling consistency checks.
  • When disabling the checks, the next blocker (based on profiling) is the iterator seek, which slows the computation/processing down. This can be overcome by SST partitioning or separating column families, or by completely avoiding iterating, if possible.
  • We have seen that SST partitioning and separating column families also help in the first case (we no longer need to disable the checks); this is because they reduce the time of seeking and searching for a key (which is the slow part of the consistency checks).
  • When SST partitioning is enabled, the profile looks pretty good; there are some seeks which could be removed or avoided to improve further.
  • Based on an answer from the RocksDB Google group, the SST partitioning feature is quite stable (not changed since 2020) and used in other products as well (it has been tested). The only pitfall might be creating too many small files.
  • A good alternative might be the separation of column families, which gives us similar results, but has its own drawbacks of course.
  • It might make sense to test all of these POCs in combination, but in order to focus I would move forward with investigating SST partitioning and CF separation further for now. With these, we might have found solutions that don't impact the general usage in the engine itself, meaning we don't need to change the engine code; the changes are quite focused on ZeebeDB itself.

Next:

Investigate and compare SST partitioning and Column family separation

@Zelldon Zelldon added the area/performance Marks an issue as performance related label May 5, 2023
@aivinog1
Contributor

aivinog1 commented May 8, 2023

Hey @Zelldon!
It is probably not my business, but I haven't seen any dashboards about memory and GC. I wrote some notes with my experience of using Zeebe, and optimizations for a Zeebe Broker. Maybe you will find something useful here.

@Zelldon
Member Author

Zelldon commented May 15, 2023

@aivinog1 actually we have panels in our Zeebe dashboard about memory AND GC :)

(screenshot: memory and GC panels from the Zeebe dashboard)

@Zelldon
Member Author

Zelldon commented Jun 5, 2023

Small update on what I did in between my recent FTOs (I just want to share this more publicly):

As the next step, I plan to compare the potential solutions of separating column families vs. SST partitioning in more depth and try to conclude this week.

@Zelldon
Member Author

Zelldon commented Jun 14, 2023

For posterity and to post it in public:

Comparing solutions

I compared the two possible approaches and concluded that we will move forward with SST partitioning; see Slack.

My conclusion was the following:

Based on the results above I will conclude this comparison and mark SST partitioning as the preferred solution for now.
It gives a lot of gains, like performance improvements on a larger state (which was the goal), with smaller risk and implementation effort compared to splitting up the column families.

We were able to show a stable performance for new instances even on a large state with the SST partitioning, which means we reached our goal here. For more details take a look at the referenced issue.

Enabling SST per default

A PR was created to enable SST partitioning by default.

Next

With #12241 I created a JMH benchmark which helped me to determine which solutions are worth investigating further. I would like to spend some time migrating that into a unit test and integrating it into our CI. This should allow us to detect when we introduce performance degradations.

I will pair on this with @oleschoenburg

Afterward, I will mark this topic as complete for now. Other topics which have been raised during this project are still valid, but will be tackled later depending on their priority.

@Zelldon
Member Author

Zelldon commented Jun 28, 2023

With the most recent PRs #13135 and #13121, we added a JMH benchmark which should allow us to prevent regressions. The JMH test can be executed as a unit test, which enables us to run it regularly via our CI. A rough sketch of how that could look is shown below.
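
As a rough sketch of the idea (illustrative include pattern and threshold, not the actual implementation), the JMH runner can be invoked from a plain JUnit test:

```
import org.junit.jupiter.api.Test;
import org.openjdk.jmh.results.RunResult;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

class EnginePerformanceRegressionTest {

  @Test
  void shouldKeepThroughputAboveThreshold() throws Exception {
    // Run the JMH benchmark programmatically so a normal JUnit runner (and CI) can execute it.
    final Options options = new OptionsBuilder().include("EnginePerformanceTest").build();
    final RunResult result = new Runner(options).run().iterator().next();

    // The threshold is illustrative; a real test would compare against an agreed baseline.
    if (result.getPrimaryResult().getScore() < 100.0) {
      throw new AssertionError("Throughput regression detected");
    }
  }
}
```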

With that, I will mark this EPIC as closed. There are several issues that came up or were discovered that we still need to take a look at or discuss, like #12203, but this will be done separately.
