Improve ZeebePartition recovery time with large state #13775
Comments
Useful link as well: https://github.com/facebook/rocksdb/wiki/Speed-Up-DB-Open
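For reference, the linked page is mostly about trading off what RocksDB does eagerly at `DB::Open`. A minimal Java sketch of the kind of options it discusses (the specific values are illustrative assumptions, not what Zeebe configures):

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public final class FastOpenExample {
  public static void main(final String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final Options options =
            new Options()
                .setCreateIfMissing(false)
                // open table files with several threads instead of one
                .setMaxFileOpeningThreads(16)
                // skip scanning all files to update stats during open
                .setSkipStatsUpdateOnDbOpen(true)
                // a capped value avoids eagerly opening every SST file at open time
                .setMaxOpenFiles(256);
        final RocksDB db = RocksDB.open(options, "/path/to/runtime-dir")) {
      // ... use the DB as usual
    }
  }
}
```

Whether any of these knobs actually help with a 7GB state would need to be measured; the sketch only shows where they live in the RocksDB Java API.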
ZDP-Triage:
=> relates to the epic to support large state - maintaining consistent performance on larger states
relates to another iteration on https://github.com/camunda/product-hub/issues/989
@megglos to follow up with PM regarding a potential new epic on improving Zeebe's ability to cope with a large state
I think in the incident above the snapshot replication was more the issue, along with the restarting of replications (due to pod restarts, new snapshots, etc.). So we have multiple root causes whose shared symptom is that the broker is unable to start when the state is large.
Yeah, Remco and I talked about it, but I still felt it was relevant to post here for now, to highlight that there are multiple issues with startup time and large state (even if the original issue also talks about liveness).
ZDP-Planning:
=> just fixing recovery time might not be enough => we need to assess this in the context of the epic https://github.com/camunda/product-hub/issues/1480
13884: test(broker): add state controller performance tests r=npepinpe a=npepinpe

## Description

This PR adds a new performance test to track the `StateController#recover` process, which includes copying a snapshot and opening the underlying RocksDB state. Specifically, it measures the case where the state is considered "large", up to 6GB.

> **Note**
> The way we compute the snapshot size while generating it is far from accurate, since RocksDB is working in the background, performing compaction, flushing, etc., so the size is constantly fluctuating. This means that, typically, the final state size is less than the desired target size. We could add some factor as a heuristic, but it's unclear to me if that's super useful.
>
> For now, we log the desired and actual snapshot size. I'm a bit worried, however, that this will significantly skew results, making it hard to assign a correct reference score and maximum deviation.

To simplify adding this test and future tests, a new `JMHTestCase` utility is added, which lets you easily write a JUnit test that runs a benchmark and asserts its result. Additionally, a new `JMHTest` annotation is added, which is coupled with a JUnit 5 extension that injects a pre-configured `JMHTestCase` into a test method as a parameter. The annotation ensures all JMH tests have an appropriate tag and are excluded from the main CI pipeline, but included in the performance test job.

> **Note**
> All of this can be expanded later as we need more.

There are some notable downsides to this approach of running JMH tests, however:

1. Since JMH forks, it's not possible to debug the tests easily locally. To do so, you have to disable forking. Not too difficult, but not obvious when you're not familiar with it.
2. The reference score is typically different locally and on CI, and we already see that it sometimes fluctuates randomly on CI. It would be nice to find a better approach.
3. When a benchmark fails with an error, the actual assertion error is that no benchmarks were run. This is pretty bad; luckily, the benchmark logs are present in the test output, but it means you have to wait for the complete workflow to finish before being able to grab them.

## Related issues

related #13775

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
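The `JMHTestCase` and `JMHTest` utilities are introduced by that PR, so their exact API isn't reproduced here; as a rough illustration of the underlying idea, this is what running a benchmark from a JUnit 5 test and asserting on its score looks like with the plain JMH runner (the benchmark name and reference score below are made up):

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.Collection;
import org.junit.jupiter.api.Test;
import org.openjdk.jmh.results.RunResult;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

final class StateControllerRecoveryPerformanceTest {

  @Test
  void shouldRecoverWithinReferenceScore() throws Exception {
    // run only the (hypothetical) recover benchmark, with a single fork
    final Options options =
        new OptionsBuilder().include("StateControllerRecoverBenchmark").forks(1).build();

    final Collection<RunResult> results = new Runner(options).run();

    // compare the primary score against a hypothetical reference value;
    // whether lower or higher is better depends on the benchmark mode
    assertThat(results).isNotEmpty();
    final double score = results.iterator().next().getPrimaryResult().getScore();
    assertThat(score).isLessThanOrEqualTo(1.5);
  }
}
```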
The PR merged above is related, but does not complete this issue.
13937: Allow opening an existing DB in snapshot-only mode r=npepinpe a=npepinpe

## Description

This PR adds a new API call to `ZeebeDbFactory` which allows opening existing databases in a snapshot-only mode. This returns a `ZeebeDb` which can ONLY take snapshots, and will throw an `UnsupportedOperationException` for any other operation.

While it would be nice to have a proper read-only DB implementation, there are no transactions in such a DB, and our whole access to the DB is coupled with the transaction API. As we are thinking of switching away from transactions, it will likely be easier to implement a read-only DB after that.

This will be used as part of the fix for #13775 to create the runtime directory faster via hard links instead of a full copy.

## Related issues

related to #13775

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
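For context, the hard-link approach amounts to linking the snapshot's immutable SST files into the runtime directory instead of copying them, which is constant time per file regardless of file size. A minimal sketch with `java.nio.file` (not the actual Zeebe implementation; it assumes the snapshot and runtime directories are on the same file system):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class SnapshotLinker {

  /** Builds the runtime directory from a snapshot by hard-linking SST files instead of copying. */
  static void linkSnapshotIntoRuntime(final Path snapshotDir, final Path runtimeDir)
      throws IOException {
    Files.createDirectories(runtimeDir);
    try (final Stream<Path> files = Files.list(snapshotDir)) {
      for (final Path file : (Iterable<Path>) files::iterator) {
        final Path target = runtimeDir.resolve(file.getFileName());
        if (file.getFileName().toString().endsWith(".sst")) {
          // SST files are immutable once written, so a hard link is safe and O(1) per file
          Files.createLink(target, file);
        } else {
          // small mutable files (MANIFEST, OPTIONS, CURRENT, ...) are still copied
          Files.copy(file, target);
        }
      }
    }
  }
}
```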
Description
We had an incident where the liveness probes failed after 45s due to large state. This was a cluster with 3 partitions, each with a RocksDB state of over 7GB. For each partition, it took about 10 seconds just to recover the DB, which includes copying the snapshot and opening the DB. Furthermore, from the logs it seems that the partitions were recovered sequentially rather than in parallel, but I haven't verified that.
The goal here would be to either optimize the recovery time, or make it independent of the size of the state (ideally the latter).
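On the parallelism point: if sequential recovery is confirmed, the per-partition work could in principle be fanned out. A rough sketch of that idea only, with a hypothetical `recoverPartition` helper standing in for the snapshot copy and RocksDB open (this is not Zeebe's actual startup code):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ParallelRecoverySketch {

  /** Recovers all partitions concurrently instead of one after the other. */
  static void recoverAll(final List<Integer> partitionIds) {
    final ExecutorService executor = Executors.newFixedThreadPool(partitionIds.size());
    try {
      final List<CompletableFuture<Void>> recoveries =
          partitionIds.stream()
              .map(id -> CompletableFuture.runAsync(() -> recoverPartition(id), executor))
              .toList();
      CompletableFuture.allOf(recoveries.toArray(new CompletableFuture[0])).join();
    } finally {
      executor.shutdown();
    }
  }

  private static void recoverPartition(final int partitionId) {
    // hypothetical stand-in for: copy (or hard-link) the snapshot, then open RocksDB
  }
}
```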
One issue that might be the biggest culprit in this case is #5682.