
Improve ZeebePartition recovery time with large state #13775

Closed · npepinpe opened this issue Aug 2, 2023 · 9 comments · Fixed by #13884 or #13938

Labels: area/performance, area/reliability, component/broker, component/db, incident, kind/toil, version:8.3.0-alpha5, version:8.3.0-alpha6, version:8.3.0

Comments


npepinpe commented Aug 2, 2023

Description

We had an incident where the liveness probes failed after 45s due to large state. This is a cluster with 3 partitions, each with a RocksDB state of over 7GB. Each partition took about 10 seconds just to recover its state, which includes copying the snapshot and opening the DB. Furthermore, from the logs it seems like the partitions were recovered sequentially rather than in parallel, but I haven't verified that.

The goal here would be to either optimize the recovery time, or, ideally, make it independent of the size of the state.

One issue which might be the biggest culprit in this case is this one: #5682
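
For reference, recovery currently pays for a full copy of the snapshot into the runtime directory, which scales with the number of bytes. The direction discussed later in this thread (#13937) is to hard-link instead of copy. Below is a minimal sketch of that idea in plain `java.nio`, not the actual Zeebe implementation; the assumption that only `.sst` files are safe to link (they are immutable once written) is mine, and mutable files like the MANIFEST still get a real copy.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class SnapshotLinker {
  // Sketch: build the runtime directory from a snapshot using hard links for
  // the immutable SST files, so the cost is O(number of files) instead of
  // O(bytes copied). Both paths must live on the same file system.
  static void buildRuntime(final Path snapshotDir, final Path runtimeDir) throws IOException {
    Files.createDirectories(runtimeDir);
    try (final Stream<Path> files = Files.list(snapshotDir)) {
      for (final Path file : (Iterable<Path>) files::iterator) {
        final Path target = runtimeDir.resolve(file.getFileName());
        if (file.getFileName().toString().endsWith(".sst")) {
          Files.createLink(target, file); // hard link: no data is copied
        } else {
          Files.copy(file, target); // MANIFEST, OPTIONS, etc. may be mutated later
        }
      }
    }
  }
}
```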

@npepinpe npepinpe added kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. area/performance Marks an issue as performance related area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) component/db component/broker planning/discuss To be discussed at the next planning. labels Aug 2, 2023


megglos commented Aug 3, 2023

ZDP-Triage:

  • sequential partition recovery prolongs startup
  • such situations may require increasing the timeout for the liveness probe
  • we need to double-check whether we disabled the WAL of RocksDB for all versions or just 8.2 (see the API sketch after this list)
  • the increased number of SST files due to partitioning is the default in 8.3 => this may make it worse
  • it raised the question of whether we aim to support such a large snapshot state
  • at such snapshot sizes, snapshot replication is also problematic

=> relates to the epic to support large state - maintaining consistent performance on larger states
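
For context on the WAL point above, this is roughly what disabling the write-ahead log looks like at the RocksDB Java API level; it is a generic sketch, not Zeebe's actual configuration code. With the WAL off, writes skip the log entirely, and crash durability comes from snapshots plus the Raft log instead.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteOptions;

final class WalDisabledExample {
  public static void main(final String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final Options options = new Options().setCreateIfMissing(true);
        final RocksDB db = RocksDB.open(options, "/tmp/example-db");
        final WriteOptions writeOptions = new WriteOptions().setDisableWAL(true)) {
      // this write never touches the WAL, trading crash durability for speed
      db.put(writeOptions, "key".getBytes(), "value".getBytes());
    }
  }
}
```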


megglos commented Aug 3, 2023

relates to another iteration on https://github.com/camunda/product-hub/issues/989

@megglos megglos self-assigned this Aug 3, 2023

megglos commented Aug 3, 2023

@megglos to follow up with PM with regard to a potential new epic on improving the ability of Zeebe to cope with a large state

@npepinpe npepinpe added the incident Marks an issue or PR as related to an incident report label Aug 9, 2023
remcowesterhoud commented

Large state also proves problematic for the startup probe.

```
Warning  Unhealthy  4m28s (x288 over 71m)  kubelet  Startup probe failed: HTTP probe failed with statuscode: 503
```

The snapshot size of the cluster is very large.

The last log before the pod got killed:

```
RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Started receiving new snapshot FileBasedReceivedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/pending/307189154-1021-698451241-698450032-1, snapshotStore=SnapshotStore-2, metadata=FileBasedSnapshotId{index=307189154, term=1021, processedPosition=698451241, exporterPosition=698450032}} from 1
```

My hypothesis is that receiving the snapshot takes longer than the startup probe allows before the pod is killed, resulting in a restart loop.


Zelldon commented Aug 10, 2023

I think in the incident above, snapshot replication was more the issue, along with replications being restarted (due to pod restarts, new snapshots, etc.). So we have multiple root causes which can all show the same symptom: the broker is unable to start when the state is large.

npepinpe commented

Yeah, Remco and I talked about it, but we still felt it was relevant to post here to highlight the multiple issues with startup time and large state (even if the original issue also talks about liveness).


megglos commented Aug 14, 2023

ZDP-Planning:

  • the initial issue was that partition startup was sequential - this may already be mitigated by parallel startup (not intended to be backported), though it might then be bound by I/O limits
  • in another case, snapshot replication took a long time
  • this should be handled as part of the large-state work
  • probe timeouts were increased to give more headroom (from 45s to 1m30s)
  • it would be worth assessing the risk for other customers and current prospects

=> just fixing recovery time might not be enough => we need to assess this in the context of the epic https://github.com/camunda/product-hub/issues/1480

@megglos megglos assigned npepinpe and Zelldon and unassigned megglos Aug 14, 2023
@megglos megglos removed the planning/discuss To be discussed at the next planning. label Aug 14, 2023
@npepinpe npepinpe linked a pull request Aug 14, 2023 that will close this issue
zeebe-bors-camunda bot added a commit that referenced this issue Aug 16, 2023
13884: test(broker): add state controller performance tests r=npepinpe a=npepinpe

## Description

This PR adds a new performance test to track the `StateController#recover` process, which includes copying a snapshot and opening the underlying RocksDB state. Specifically, it measures the case where the state is considered "large", up to 6GB.

> **Note**
> The way to compute the snapshot size while we're generating it is far from accurate: since RocksDB is working in the background, performing compaction, flushing, etc., the size is constantly fluctuating. This means that, typically, the final state size is less than the desired target size. We could add some factor as a heuristic, but it's unclear to me if that's super useful.
>
> For now, we log the desired and actual snapshot size. I'm a bit worried however that this will significantly skew results, making it hard to assign a correct reference score and maximum deviation.
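
For illustration, the size estimate described in the note boils down to something like the following sketch (not the test's actual code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class StateSize {
  // Sum the sizes of all files under the state directory. Because RocksDB
  // compacts and flushes in the background, two successive calls can return
  // noticeably different values, which is why the generated snapshot tends
  // to undershoot the target size.
  static long stateSizeInBytes(final Path stateDir) throws IOException {
    try (final Stream<Path> files = Files.walk(stateDir)) {
      return files
          .filter(Files::isRegularFile)
          .mapToLong(file -> file.toFile().length())
          .sum();
    }
  }
}
```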

To simplify adding this test and future tests, a new `JMHTestCase` utility is added, which lets you easily write a JUnit test that runs a benchmark and asserts its result. Additionally, a new `JMHTest` annotation is added, coupled with a JUnit 5 extension that injects a pre-configured `JMHTestCase` into a test method as a parameter. The annotation ensures all JMH tests have an appropriate tag and are excluded from the main CI pipeline, but included in the performance test job.

> **Note**
> All of this can be expanded later as we need more stuff.
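
As a rough usage sketch of the utilities described above — the annotation value and the `JMHTestCase` method calls below are placeholders, since the exact API isn't shown in this thread:

```java
import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.withinPercentage;

// Hypothetical usage of the @JMHTest extension; the benchmark name and the
// run() method are assumptions for illustration, not the real API.
final class StateControllerPerformanceTest {

  @JMHTest("measureRecoveryTime") // tagged so it only runs in the performance test job
  void shouldRecoverWithinReferenceScore(final JMHTestCase testCase) {
    // run the benchmark and compare against a reference score with some deviation
    final double score = testCase.run(); // placeholder method
    assertThat(score).isCloseTo(5.0, withinPercentage(25));
  }
}
```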

There are some notable downsides to this approach of running JMH tests however:

1. Since JMH forks, it's not possible to debug the tests easily locally. To do so, you have to disable forking. Not too difficult, but not obvious when you're not familiar with it (see the sketch after this list).
2. The reference score is typically different locally and on CI, and we already see that it sometimes randomly fluctuates on CI. It would be nice to find a better approach.
3. When a benchmark fails with an error, the actual assertion error is just that no benchmarks were run. This is pretty bad; luckily, the benchmark logs are present in the test output, but it means you have to wait for the complete workflow to finish before being able to grab them.
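
Referring back to point 1, a minimal sketch of running a benchmark in-process with the standard JMH runner API; the included benchmark class name is a placeholder:

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

final class DebugBenchmark {
  public static void main(final String[] args) throws RunnerException {
    final Options options = new OptionsBuilder()
        .include("StateControllerBenchmark") // hypothetical benchmark class
        .forks(0) // run in the current JVM so a debugger can attach; skews results
        .build();
    new Runner(options).run();
  }
}
```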

## Related issues

related #13775 



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
@npepinpe npepinpe reopened this Aug 16, 2023
npepinpe commented

The PR we merged is related, but does not complete this issue.

@npepinpe npepinpe linked a pull request Aug 17, 2023 that will close this issue
zeebe-bors-camunda bot added a commit that referenced this issue Aug 21, 2023
13937: Allow opening an existing DB in snapshot-only mode r=npepinpe a=npepinpe

## Description

This PR adds a new API call to `ZeebeDbFactory` which allows opening existing databases in a snapshot-only mode. This returns a `ZeebeDb` which can ONLY take snapshots, and will throw an `UnsupportedOperationException` for any other operation.

While it would be nice to have a proper read-only DB implementation, there are no transactions in such a DB, and all of our DB access is coupled to the transaction API. As we were already thinking of switching away from transactions, it will likely be easier to implement a read-only DB after that.

This will be used as part of the fix for #13775 to create the runtime directory faster via hard-links instead of a full copy.
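
To illustrate the underlying mechanism (not the actual `ZeebeDbFactory` change), a snapshot-only handle over plain RocksDB can be reduced to the checkpoint API, which builds snapshots from hard links when source and target share a file system:

```java
import org.rocksdb.Checkpoint;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class SnapshotOnlyDb implements AutoCloseable {
  private final Options options;
  private final RocksDB db;

  SnapshotOnlyDb(final String path) throws RocksDBException {
    RocksDB.loadLibrary();
    options = new Options().setCreateIfMissing(false);
    db = RocksDB.open(options, path);
  }

  // the only supported operation: checkpoints hard-link the immutable files,
  // so taking a snapshot is cheap even for a multi-GB state
  void takeSnapshot(final String targetDir) throws RocksDBException {
    try (final Checkpoint checkpoint = Checkpoint.create(db)) {
      checkpoint.createCheckpoint(targetDir);
    }
  }

  // everything else throws, mirroring the snapshot-only contract above
  byte[] get(final byte[] key) {
    throw new UnsupportedOperationException("DB was opened in snapshot-only mode");
  }

  @Override
  public void close() {
    db.close();
    options.close();
  }
}
```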

## Related issues

related to #13775 



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
@abbasadel abbasadel added version:8.3.0-alpha5 Marks an issue as being completely or in parts released in 8.3.0-alpha5 version:8.3.0-alpha6 Marks an issue as being completely or in parts released in 8.3.0-alpha6 labels Sep 7, 2023
@megglos megglos added the version:8.3.0 Marks an issue as being completely or in parts released in 8.3.0 label Oct 5, 2023