Unstable cluster on bigger state #4560

Closed
Zelldon opened this issue May 20, 2020 · 2 comments · Fixed by #6162
Labels: area/performance, kind/bug, scope/broker, severity/high

Comments

Zelldon (Member) commented on May 20, 2020

Describe the bug

I had set up a stable cluster on GKE (no preemptible nodes) and expected the benchmark to run for a while. Unfortunately, it died after around 12 hours.

[screenshot: general]

Investigating this, I saw that we had unexpectedly frequent leader changes in the cluster.

[screenshot: to-many-leader-changes]

These also caused the disk usage to grow, because we are not able to compact the log when leader changes happen too often.

[screenshot: growing-resources]

As we can see above, only a few of the leader changes are actually related to restarts. The restarts were caused by reaching the memory limit; probably RocksDB reached its limits again.

[screenshot: memory]

We had a problem with our Helm charts where the gateway used only one thread to process requests, which means we were not able to work off the created jobs fast enough (see related issue #4524; a sketch of the relevant gateway setting follows below). This caused a growing count of running instances.

[screenshot: running]

As you can see, we have around 2 million running instances. This big state also causes a bigger snapshot size.

[screenshot: snapshot-size]

This might be related to the leader changes, because it takes a long time to replicate such snapshots. Furthermore, it increases the disk usage as well.
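
For reference, a rough sketch of where the gateway thread setting mentioned above lives in the standalone gateway configuration. The `threads.managementThreads` key (default 1) follows the 0.2x `gateway.yaml` template, so treat the exact key names and the environment-variable override as assumptions to verify against the template shipped with the release; the Helm-chart side of the fix is tracked in #4524.

```yaml
# Rough sketch of a standalone gateway configuration (gateway.yaml), assuming
# the 0.2x key layout -- verify the key names against the shipped template.
zeebe:
  gateway:
    network:
      host: 0.0.0.0
      port: 26500
    threads:
      # Defaults to 1, which serializes request handling in the gateway and
      # keeps workers from activating jobs fast enough under benchmark load.
      managementThreads: 4
# In a Kubernetes/Helm deployment the same value can presumably be injected as
# the environment variable ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS=4.
```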

To Reproduce

Steps to reproduce the behavior:
Start a usual benchmark with non-preemptible nodes and start fewer workers.

Expected behavior
A stable running system.

Environment:

  • OS: Kubernetes (GKE)
  • Zeebe Version: 0.24.0-SNAPSHOT
  • Configuration: Elasticsearch exporter, standalone gateway (see the sketch below)
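
To make the "Elasticsearch exporter, standalone gateway" setup a bit more concrete, here is a rough broker-side sketch. The exporter class name and key layout are assumptions based on the 0.2x configuration templates (the exporters section changed its shape between 0.2x releases), and the URL is only a placeholder.

```yaml
# Hypothetical broker configuration excerpt, assuming the 0.2x key layout --
# verify against the broker template for the release in use.
zeebe:
  broker:
    gateway:
      # The embedded gateway is disabled because a standalone gateway is deployed.
      enable: false
    exporters:
      elasticsearch:
        className: io.zeebe.exporter.ElasticsearchExporter
        args:
          url: http://elasticsearch:9200   # placeholder endpoint
```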
Zelldon added the kind/bug, scope/broker, area/performance, Impact: Availability, and severity/high labels on May 20, 2020
Zelldon (Member, Author) commented on May 22, 2020

It happened again with a new benchmark. We are not able to work off all jobs, which means we are building up a bigger state. This also causes bigger snapshot sizes.

[screenshot: bigger-load-snapshot]

We see a lot of leader changes and missed heartbeats in the benchmark.

[screenshot: role-changes]

In the end, we ran out of disk space.

[screenshot: disk]

korthout removed their assignment on Aug 19, 2020
npepinpe (Member) commented on Nov 9, 2020

Downgrading: while this is important to figure out, it does not impact our own production environment (thanks, preemptible nodes), and as such isn't critical to our OKRs.

zeebe-bors bot closed this as completed in 1324f89 on Jan 28, 2021