[EPIC] Support stable performance for new instances even on larger state #12033
Comments
@Zelldon As our teams will have to collaborate on this topic, it would be helpful for us to have a product-hub issue. This would help product management understand that we're both involved and must spend time on it. That should help us find the needed time.
There is already https://github.com/camunda/product-hub/issues/989, which either needs to be adjusted or a new one needs to be created. I will align soon with @felix-mueller. BTW, I don't think that everything we do needs a product-hub issue; you need to reflect that in your planning and make it transparent. But I see that this could help, even if it seems to duplicate work. At least it wasn't communicated to me/us that this is a new process.
From my perspective, we should reuse camunda/product-hub#989 for this and adjust the scope of the first iteration of the product-hub epic accordingly.
## POC outcomes

### Performance Hackday

We ran a performance hack day as a team and tried out the POC branch provided by @romansmirnov. The results look quite promising: we were able to verify that instance creation was still performant and the throughput didn't break down; it was stable over a long period of time. With workers, the performance was not on par with the run without workers, but it was still much better than without the changes. Results:
**Next:** Some of the solutions used in the POC are not fully working, and we need to think a bit more about how to implement them in the right way. We will start with some smaller easy picks that already sound quite promising, like blacklist-check improvements.

### Running with tmpfs

At the beginning of the year we added a new configuration that allows separating the runtime and snapshot directories (#11772). This makes it possible to put the runtime directory into tmpfs. We wanted to verify whether this would give us better or more stable performance for creating new instances even on larger state. TL;DR: unfortunately, we were not able to show this. Changes to the values file:

```
$ diff default/values.yaml zell-larger-state/values.yaml
9c9
< replicas: 3
---
> replicas: 0
14c14
< rate: 75
---
> rate: 150
111,112c89,90
< cpu: 1350m
< memory: 4Gi
---
> cpu: 2
> memory: 8Gi
125a104,123
>
> extraVolumes:
> - name: zeebe-config
> configMap:
> name: zeebe-config
> defaultMode: 0754
> - name: pyroscope
> emptyDir: {}
> - name: tmpfs
> emptyDir:
> medium: Memory
>
> extraVolumeMounts:
> - name: pyroscope
> mountPath: /pyroscope
> - name: zeebe-config
> mountPath: /usr/local/zeebe/config/application.yaml
> subPath: application.yml
> - mountPath: /usr/local/zeebe/runtime
>   name: tmpfs
```

We can see from the metrics that the throughput still broke down after some time. To rule out swapping as an issue, I gave the container 32 GiB of memory, but this didn't help. @oleschoenburg and I realized that a tmpfs mount gets half of the node's memory. This might also be a contributing factor, causing swapping when there is not enough memory. But we are not sure here and haven't investigated this further.
The ZPA team will support @Zelldon on this topic through @koevskinikola and @berkaycanbc.
12483: Introduce experimental SST partitioning r=Zelldon a=Zelldon

## Description

Discovered this via [the RocksDB google group post](https://groups.google.com/g/rocksdb/c/l3CzFD4YBYQ#:~:text=another%20way%20that%20might%20be%20helpful%20is%20using%20sst_partitioner_factory%20.%20By%20using%20this%20experimental%20feature%2C%20you%20can%20partition%20the%20ssts%20based%20on%20your%20desired%20prefix%20which%20means%20you%20would%20only%20have%20to%20tell%20how%20many%20entries%20are%20in%20that%20sst.). [From the Javadocs](https://javadoc.io/static/org.rocksdb/rocksdbjni/6.20.3/org/rocksdb/ColumnFamilyOptionsInterface.html#setSstPartitionerFactory(org.rocksdb.SstPartitionerFactory)):

> use the specified factory for a function to determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).

### Details

SST partitioning based on the column family prefix (virtual column family) allows splitting key ranges into separate SST files, which should improve compaction and make propagation of SST files less write-amplifying. It will create more files in the runtime and in snapshots, at least one SST file for each column family we use at runtime.

As discussed in https://camunda.slack.com/archives/C04T7T0RPLY/p1681931668446069, we want to add this as an experimental feature for now, so that people can play around with it and we can as well. The benchmark results so far look quite promising. The feature itself is marked as experimental in RocksDB, so it makes sense to mark it as experimental on our side too.

Open questions:

1. The config is marked as an experimental feature in RocksDB. I don't know what exactly this means. Is it a problem for us? Would we just stay on the current version if they remove it? Is it unstable? Not sure yet.
2. The maximum throughput seems to be degraded a bit. As I mentioned earlier, we are currently able to reach around ~240 PI/s; [with this configuration we reach ~220 PI/s.](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&refresh=10s&from=now-6h&to=now&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-max-out-sst-partitioner&var-pod=All&var-partition=All) I think it depends on what our priority is right now: maximum throughput, or stable performance on larger state. Is it OK to hurt our maximum throughput a little? We will need to investigate this further.

### JMH Benchmarks

I tried it with the JMH benchmark and it gave impressive results:

```
Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  656.639 ±(99.9%) 91.394 ops/s [Average]
  (min, avg, max) = (1.775, 656.639, 1163.635), stdev = 386.967
  CI (99.9%): [565.246, 748.033] (assumes normal distribution)

# Run complete. Total time: 00:07:12

Benchmark                                           Mode  Cnt    Score    Error  Units
EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  656.639 ± 91.394  ops/s
```

[Remember, the base was ~230](#12241 (comment))

### Zeebe Benchmarks

After the JMH benchmark I started some new benchmarks, like the one for the large state. I wanted to see how it would survive when we continuously just start instances. Remember: previously we died after ~1 hour, when reaching 800 MB of state. [In this benchmark we reached at least ~4.5 GiB and were still able to handle the same load (over 6 hours).](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&from=1681912207012&to=1681930704963&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-large-state-sst-partition&var-pod=All&var-partition=All) :exploding_head:

![snapshot](https://user-images.githubusercontent.com/2758593/235164591-0ba3cb40-aa47-4bf4-b647-9992ac5d7e88.png)
![general](https://user-images.githubusercontent.com/2758593/235164598-5da0906e-a50f-4235-a5b8-48181dffc9d5.png)

#### Maxing out benchmark

![maxgeneral](https://user-images.githubusercontent.com/2758593/235164601-bab9f40c-20be-4cbe-8530-c0ba791ec0f0.png)

## Related issues

related to #12033

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
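For illustration, a minimal sketch of how this feature is enabled through the RocksDB Java API follows. The prefix length (`Long.BYTES`, matching an assumed 8-byte virtual column family identifier), the database path, and the class name are assumptions for this example, not necessarily Zeebe's actual configuration code.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstPartitionerFixedPrefixFactory;

public final class SstPartitioningSketch {
  public static void main(final String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    // Partition SST files on a fixed-length key prefix. Assumption for this
    // sketch: every key starts with an 8-byte "virtual column family"
    // identifier, so compaction can split files on those boundaries.
    try (final SstPartitionerFixedPrefixFactory partitioner =
            new SstPartitionerFixedPrefixFactory(Long.BYTES);
        final Options options =
            new Options().setCreateIfMissing(true).setSstPartitionerFactory(partitioner);
        final RocksDB db = RocksDB.open(options, "/tmp/sst-partitioning-example")) {
      db.put(new byte[] {0, 0, 0, 0, 0, 0, 0, 1, 'k'}, "value".getBytes());
    }
  }
}
```

Running the same workload with and without the partitioner factory makes the trade-off visible: more, smaller SST files in the runtime and snapshot directories in exchange for less write amplification during compaction.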
12629: [Backport 8.1]: Introduce experimental SST partitioning r=remcowesterhoud a=Zelldon

## Description

Backports #12483

## Related issues

relates to #12033

12646: [Backport 8.1]: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description

Backports #12606. There were merge conflicts because of imports.

## Related issues

closes #8263

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
12630: [Backport 8.0]: Introduce experimental SST partitioning r=remcowesterhoud a=Zelldon

## Description

Backports #12483

## Related issues

closes #12033

12645: [Backport 8.0]: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description

Backports #12606. The PR https://github.com/camunda/zeebe/pull/12306/files wasn't backported to 8.0, which caused some conflicts. I had to add the onRecovered method and call it in the ZeebeDbState.

## Related issues

closes #8263

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
## Results of POC Week

### TL;DR

What I can say after this week of deeper investigation is that splitting up the key space will give us the biggest performance boost. This can be achieved either by enabling SST partitioning or by introducing new real column families, which allows RocksDB to split them up correctly. Note: we will always trade performance against resources, either space or memory or both.

What I have seen this week is that if we use one of these solutions, all the other issues we found earlier seem to be negligible. If we don't use them, we can clearly see that consistency checks are the most prominent issue on larger state (especially searching for non-existing keys), followed by iterator seeks, which are also impacted enormously by larger state because we use one column family and keep all keys together. The seek issue can only be handled either by not iterating at all (if not necessary), where it is currently hard to determine whether this is really possible, or by using one of the options above: SST partitioning or extra column families.

Right now the SST partitioning large-state benchmark has been running for longer than 6 hours and still looks stable. Using the JMH benchmark setup really helped me understand the impact of the solutions quickly and profile them easily; all results can be found here.

### Details of the Week

#### Investigating Prefix seek
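To make the seek problem from the TL;DR concrete, here is a minimal sketch of a prefix seek over a single shared column family. The `hasPrefix` helper, class name, and key layout are illustrative assumptions, not Zeebe's actual abstractions.

```java
import java.util.Arrays;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

final class PrefixSeekSketch {

  // Hypothetical helper: does `key` start with `prefix`?
  static boolean hasPrefix(final byte[] key, final byte[] prefix) {
    if (key.length < prefix.length) {
      return false;
    }
    return Arrays.equals(Arrays.copyOfRange(key, 0, prefix.length), prefix);
  }

  // Visits every entry whose key starts with the given virtual column family
  // prefix. With a single real column family holding all keys, the seek has
  // to navigate the whole key space, which is what degrades on larger state.
  static void visitVirtualColumnFamily(final RocksDB db, final byte[] prefix) {
    try (final RocksIterator it = db.newIterator()) {
      for (it.seek(prefix); it.isValid(); it.next()) {
        if (!hasPrefix(it.key(), prefix)) {
          break; // left the virtual column family's key range
        }
        // process it.key() / it.value() here ...
      }
    }
  }
}
```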
#### DeleteRange not supported
#### Discussed SST partitioning with Roman

We agreed that the solutions look quite promising, but there are still some open questions we should answer and clarify. Thanks @romansmirnov for your input! Questions:
I asked in the RocksDB google group about the experimental feature
The answers reflect our current understanding of this feature. It is interesting to note that other vendors and applications use this feature as well, which gives us more trust in it, especially since they mention it hasn't changed since 2020 and they have tested it too. I think the use case in general fits our needs as well.

#### POC: enhance column family (CF) prefix check
#### POC: introduce key formats

#### POC: cache data in ZeebeDB

#### POC: create additional CF

### Conclusion

### Next

Investigate and compare SST partitioning and column family separation.
@aivinog1 actually we have panels in our Zeebe dashboard about memory AND GC :)
Small update on what I did in between my recent FTOs (I just want to share this in public as well):
As the next step, I plan to compare in more depth the potential solutions of separating column families vs. SST partitioning, and try to conclude this week.
For posterity and to post it in public:

## Comparing solutions

I compared the two possible approaches and concluded that we will move forward with SST partitioning, see Slack. My conclusion was the following:
With SST partitioning we were able to show stable performance for new instances even on a large state, which means we reached our goal here. For more details, take a look at the referenced issue.

## Enabling SST partitioning per default

A PR was created to enable SST partitioning by default.

## Next

With #12241 I created a JMH benchmark which helped me determine which solutions are worth investigating further. I would like to spend some time migrating that into a unit test and integrating it into our CI. This should allow us to detect when we introduce performance degradations. I will pair on this with @oleschoenburg. Afterward, I will mark this topic as complete for now. Other topics that were raised during this project are still valid, but depending on priority will be tackled later.
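For context, a JMH benchmark of this kind has roughly the following shape. This is a placeholder skeleton under assumed parameters (class name, fork, warmup, and iteration counts), not the actual EnginePerformanceTest:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput) // report operations per second (ops/s)
@OutputTimeUnit(TimeUnit.SECONDS)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 200) // assumed to match the 200 measured iterations in the results above
public class ProcessExecutionBenchmark {

  @Setup
  public void setUp() {
    // Start an in-memory engine and deploy the benchmark process here.
  }

  @Benchmark
  public void measureProcessExecutionTime() {
    // Create one process instance and wait until it completes, so the
    // reported ops/s corresponds to completed process instances per second.
  }
}
```

Running the same benchmark class as a unit test in CI, with a threshold on the measured score, is one way such a setup can flag performance regressions automatically.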
With the most recent PRs #13135 and #13121 we added a JMH benchmark which should allow us to prevent regressions. The JMH test can be executed as a unit test, which enables us to run it regularly via our CI. With that, I will mark this EPIC as closed. There are several issues that came up or were discovered that we still need to look at or discuss, like #12203, but this will be done separately.
Problem Description
It came up in #11813 that our system currently runs into performance issues when it has a bigger state. In particular, we have observed that when we hit a certain amount of state in RocksDB, the performance drops suddenly, which is unexpected.
Goal / Focus
This EPIC can be seen as the first iteration of resolving this. We are setting our focus on: "Hot data should always be executed fast/performantly. Cold data or big state shouldn't have an impact on new data and its performance."
In other words, if we continuously create new instances, the creation and execution of these instances should always happen with the same or similar performance as for previously created instances. We want stable performance for new instances.
To be specific, something like the following shouldn't happen:
Here we can see that the performance dropped after ~1 hour of accumulating instances and state. We are stable in the sense that the cluster still accepts new instances and doesn't crash, but the performance has significantly degraded.
Next / Upcoming
The following shows the next planned steps; this is updated incrementally.
Next
Break down
Discover
As part of #11813, potential issues and solutions have already been identified. Besides these, we are planning to run some POCs as part of a discovery phase, which might identify more potential tasks.
Testing
In order to validate our changes, we will need to implement some tests (later automated). This should allow us to make changes iteratively and should also prevent later regressions.
Improvements
The above breakdown should help us to achieve our goals.