
Improve ZeebePartition recovery time with large state #13775

Closed · npepinpe opened this issue Aug 2, 2023 · 9 comments · Fixed by #13884 or #13938

Labels: area/performance, area/reliability, component/broker, component/db, incident, kind/toil, version:8.3.0-alpha5, version:8.3.0-alpha6, version:8.3.0

Comments


npepinpe commented Aug 2, 2023

Description

We had an incident where the liveness probes failed after 45s due to large state. This is a cluster with 3 partitions, each with a RocksDB state of over 7GB. Each partition took about 10 seconds just to recover its state, which includes copying the snapshot and opening the DB. Furthermore, from the logs it seems like the partitions were recovered sequentially rather than in parallel, but I haven't verified that.

The goal here would be to either optimize the recovery time, or, ideally, make it independent of the size of the state.

One issue which might be the biggest culprit in this case is this one: #5682
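
For reference, recovery currently pays for a full copy of the snapshot into the runtime directory, which scales with the number of bytes. The direction discussed later in this thread (#13937) is to hard-link instead of copy. Below is a minimal sketch of that idea in plain `java.nio`, not the actual Zeebe implementation; the assumption that only `.sst` files are safe to link (they are immutable once written) is mine, and mutable files like the MANIFEST still get a real copy.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class SnapshotLinker {
  // Sketch: build the runtime directory from a snapshot using hard links for
  // the immutable SST files, so the cost is O(number of files) instead of
  // O(bytes copied). Both paths must live on the same file system.
  static void buildRuntime(final Path snapshotDir, final Path runtimeDir) throws IOException {
    Files.createDirectories(runtimeDir);
    try (final Stream<Path> files = Files.list(snapshotDir)) {
      for (final Path file : (Iterable<Path>) files::iterator) {
        final Path target = runtimeDir.resolve(file.getFileName());
        if (file.getFileName().toString().endsWith(".sst")) {
          Files.createLink(target, file); // hard link: no data is copied
        } else {
          Files.copy(file, target); // MANIFEST, OPTIONS, etc. may be mutated later
        }
      }
    }
  }
}
```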

@npepinpe npepinpe added kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. area/performance Marks an issue as performance related area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) component/db component/broker planning/discuss To be discussed at the next planning. labels Aug 2, 2023


megglos commented Aug 3, 2023

ZDP-Triage:

  • sequential partition recovery prolongs startup
  • such situations may require increasing the timeout for the liveness probe
  • we need to double-check whether we disabled the WAL of RocksDB for all versions or just 8.2 (see the API sketch after this list)
  • the increased number of SST files due to partitioning is the default in 8.3 => this may make it worse
  • it raised the question of whether we aim to support such a large snapshot state
  • at such snapshot sizes, snapshot replication is also problematic

=> relates to the epic to support large state - maintaining consistent performance on larger states
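
For context on the WAL point above, this is roughly what disabling the write-ahead log looks like at the RocksDB Java API level; it is a generic sketch, not Zeebe's actual configuration code. With the WAL off, writes skip the log entirely, and crash durability comes from snapshots plus the Raft log instead.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteOptions;

final class WalDisabledExample {
  public static void main(final String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final Options options = new Options().setCreateIfMissing(true);
        final RocksDB db = RocksDB.open(options, "/tmp/example-db");
        final WriteOptions writeOptions = new WriteOptions().setDisableWAL(true)) {
      // this write never touches the WAL, trading crash durability for speed
      db.put(writeOptions, "key".getBytes(), "value".getBytes());
    }
  }
}
```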


megglos commented Aug 3, 2023

relates to another iteration on https://github.com/camunda/product-hub/issues/989

@megglos megglos self-assigned this Aug 3, 2023

megglos commented Aug 3, 2023

@megglos to follow up with PM with regard to a potential new epic on improving the ability of Zeebe to cope with a large state

@npepinpe npepinpe added the incident Marks an issue or PR as related to an incident report label Aug 9, 2023
remcowesterhoud commented

Large state also proves problematic for the startup probe.

```
Warning  Unhealthy  4m28s (x288 over 71m)  kubelet  Startup probe failed: HTTP probe failed with statuscode: 503
```

The snapshot size of the cluster is very large.

The last log before the pod got killed:

```
RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Started receiving new snapshot FileBasedReceivedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/pending/307189154-1021-698451241-698450032-1, snapshotStore=SnapshotStore-2, metadata=FileBasedSnapshotId{index=307189154, term=1021, processedPosition=698451241, exporterPosition=698450032}} from 1
```

My hypothesis is that receiving the snapshot takes longer than the startup probe allows before the pod is killed, resulting in a restart loop.


Zelldon commented Aug 10, 2023

I think in the incident above, snapshot replication was more the issue, along with replications being restarted (due to pod restarts, new snapshots, etc.). So we have multiple root causes which can all show the same symptom: the broker is unable to start when the state is large.

npepinpe commented

Yeah, Remco and I talked about it, but we still felt it was relevant to post here to highlight the multiple issues with startup time and large state (even if the original issue also talks about liveness).


megglos commented Aug 14, 2023

ZDP-Planning:

  • the initial issue was that partition startup was sequential - this may already be mitigated by parallel startup (not intended to be backported), though it might then be bound by I/O limits
  • in another case, snapshot replication took a long time
  • this should be handled as part of the large-state work
  • probe timeouts were increased to give more headroom (from 45s to 1m30s)
  • it would be worth assessing the risk for other customers and current prospects

=> just fixing recovery time might not be enough => we need to assess this in the context of the epic https://github.com/camunda/product-hub/issues/1480

@megglos megglos assigned npepinpe and Zelldon and unassigned megglos Aug 14, 2023
@megglos megglos removed the planning/discuss To be discussed at the next planning. label Aug 14, 2023
@npepinpe npepinpe linked a pull request Aug 14, 2023 that will close this issue
zeebe-bors-camunda bot added a commit that referenced this issue Aug 16, 2023
13884: test(broker): add state controller performance tests r=npepinpe a=npepinpe

## Description

This PR adds a new performance test to track the `StateController#recover` process, which includes copying a snapshot and opening the underlying RocksDB state. Specifically, it measures the case where the state is considered "large", up to 6GB.

> **Note**
> The way to compute the snapshot size while we're generating it is far from accurate: since RocksDB is working in the background, performing compaction, flushing, etc., the size is constantly fluctuating. This means that, typically, the final state size is less than the desired target size. We could add some factor as a heuristic, but it's unclear to me if that's super useful.
>
> For now, we log the desired and actual snapshot size. I'm a bit worried however that this will significantly skew results, making it hard to assign a correct reference score and maximum deviation.
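
For illustration, the size estimate described in the note boils down to something like the following sketch (not the test's actual code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class StateSize {
  // Sum the sizes of all files under the state directory. Because RocksDB
  // compacts and flushes in the background, two successive calls can return
  // noticeably different values, which is why the generated snapshot tends
  // to undershoot the target size.
  static long stateSizeInBytes(final Path stateDir) throws IOException {
    try (final Stream<Path> files = Files.walk(stateDir)) {
      return files
          .filter(Files::isRegularFile)
          .mapToLong(file -> file.toFile().length())
          .sum();
    }
  }
}
```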

To simplify adding this test and future tests, a new `JMHTestCase` utility is added, which lets you easily write a JUnit test that runs a benchmark and asserts its result. Additionally, a new `JMHTest` annotation is added, coupled with a JUnit 5 extension that injects a pre-configured `JMHTestCase` into a test method as a parameter. The annotation ensures all JMH tests have an appropriate tag and are excluded from the main CI pipeline, but included in the performance test job.

> **Note**
> All of this can be expanded later as we need more stuff.
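
As a rough usage sketch of the utilities described above — the annotation value and the `JMHTestCase` method calls below are placeholders, since the exact API isn't shown in this thread:

```java
import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.withinPercentage;

// Hypothetical usage of the @JMHTest extension; the benchmark name and the
// run() method are assumptions for illustration, not the real API.
final class StateControllerPerformanceTest {

  @JMHTest("measureRecoveryTime") // tagged so it only runs in the performance test job
  void shouldRecoverWithinReferenceScore(final JMHTestCase testCase) {
    // run the benchmark and compare against a reference score with some deviation
    final double score = testCase.run(); // placeholder method
    assertThat(score).isCloseTo(5.0, withinPercentage(25));
  }
}
```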

There are some notable downsides to this approach of running JMH tests however:

1. Since JMH forks, it's not possible to debug the tests easily locally. To do so, you have to disable forking. Not too difficult, but not obvious when you're not familiar with it (see the sketch after this list).
2. The reference score is typically different locally and on CI, and we already see that it sometimes randomly fluctuates on CI. It would be nice to find a better approach.
3. When a benchmark fails with an error, the actual assertion error is just that no benchmarks were run. This is pretty bad; luckily, the benchmark logs are present in the test output, but it means you have to wait for the complete workflow to finish before being able to grab them.
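
Referring back to point 1, a minimal sketch of running a benchmark in-process with the standard JMH runner API; the included benchmark class name is a placeholder:

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

final class DebugBenchmark {
  public static void main(final String[] args) throws RunnerException {
    final Options options = new OptionsBuilder()
        .include("StateControllerBenchmark") // hypothetical benchmark class
        .forks(0) // run in the current JVM so a debugger can attach; skews results
        .build();
    new Runner(options).run();
  }
}
```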

## Related issues

related #13775 



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
@npepinpe npepinpe reopened this Aug 16, 2023
npepinpe commented

The PR we merged is related, but does not complete this issue.

@npepinpe npepinpe linked a pull request Aug 17, 2023 that will close this issue
zeebe-bors-camunda bot added a commit that referenced this issue Aug 21, 2023
13937: Allow opening an existing DB in snapshot-only mode r=npepinpe a=npepinpe

## Description

This PR adds a new API call to `ZeebeDbFactory` which allows opening existing databases in a snapshot-only mode. This returns a `ZeebeDb` which can ONLY take snapshots, and will throw an `UnsupportedOperationException` for any other operation.

While it would be nice to have a proper read-only DB implementation, there are no transactions in such a DB, and all of our DB access is coupled to the transaction API. As we were already thinking of switching away from transactions, it will likely be easier to implement a read-only DB after that.

This will be used as part of the fix for #13775 to create the runtime directory faster via hard-links instead of a full copy.
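
To illustrate the underlying mechanism (not the actual `ZeebeDbFactory` change), a snapshot-only handle over plain RocksDB can be reduced to the checkpoint API, which builds snapshots from hard links when source and target share a file system:

```java
import org.rocksdb.Checkpoint;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class SnapshotOnlyDb implements AutoCloseable {
  private final Options options;
  private final RocksDB db;

  SnapshotOnlyDb(final String path) throws RocksDBException {
    RocksDB.loadLibrary();
    options = new Options().setCreateIfMissing(false);
    db = RocksDB.open(options, path);
  }

  // the only supported operation: checkpoints hard-link the immutable files,
  // so taking a snapshot is cheap even for a multi-GB state
  void takeSnapshot(final String targetDir) throws RocksDBException {
    try (final Checkpoint checkpoint = Checkpoint.create(db)) {
      checkpoint.createCheckpoint(targetDir);
    }
  }

  // everything else throws, mirroring the snapshot-only contract above
  byte[] get(final byte[] key) {
    throw new UnsupportedOperationException("DB was opened in snapshot-only mode");
  }

  @Override
  public void close() {
    db.close();
    options.close();
  }
}
```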

## Related issues

related to #13775 



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
@abbasadel abbasadel added version:8.3.0-alpha5 Marks an issue as being completely or in parts released in 8.3.0-alpha5 version:8.3.0-alpha6 Marks an issue as being completely or in parts released in 8.3.0-alpha6 labels Sep 7, 2023
@megglos megglos added the version:8.3.0 Marks an issue as being completely or in parts released in 8.3.0 label Oct 5, 2023