Unstable cluster on bigger state #4560

Closed
Zelldon opened this issue May 20, 2020 · 2 comments · Fixed by #6162
Labels: area/performance, kind/bug, scope/broker, severity/high

Comments

Zelldon (Member) commented on May 20, 2020

Describe the bug

I had set up a stable cluster on GKE (no preemptible nodes) and expected the benchmark to run for a while. Unfortunately, it died after around 12 hours.

[screenshot: general]

Investigating this, I saw that we had unexpectedly frequent leader changes in the cluster.

[screenshot: to-many-leader-changes]

These also caused the disk usage to grow, because we are not able to compact the log when leader changes happen too often.

[screenshot: growing-resources]

As we can see above, only a few of the leader changes are actually related to restarts. The restarts were caused by reaching the memory limit; probably RocksDB reached its limits again.

[screenshot: memory]

We had a problem with our Helm charts where the gateway used only one thread to process requests, which means we were not able to work off the created jobs fast enough (see related issue #4524; a sketch of the relevant gateway setting follows below). This caused a growing count of running instances.

[screenshot: running]

As you can see, we have around 2 million running instances. This big state also causes a bigger snapshot size.

[screenshot: snapshot-size]

This might be related to the leader changes, because it takes a long time to replicate such snapshots. Furthermore, it increases the disk usage as well.
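
For reference, a rough sketch of where the gateway thread setting mentioned above lives in the standalone gateway configuration. The `threads.managementThreads` key (default 1) follows the 0.2x `gateway.yaml` template, so treat the exact key names and the environment-variable override as assumptions to verify against the template shipped with the release; the Helm-chart side of the fix is tracked in #4524.

```yaml
# Rough sketch of a standalone gateway configuration (gateway.yaml), assuming
# the 0.2x key layout -- verify the key names against the shipped template.
zeebe:
  gateway:
    network:
      host: 0.0.0.0
      port: 26500
    threads:
      # Defaults to 1, which serializes request handling in the gateway and
      # keeps workers from activating jobs fast enough under benchmark load.
      managementThreads: 4
# In a Kubernetes/Helm deployment the same value can presumably be injected as
# the environment variable ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS=4.
```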

To Reproduce

Steps to reproduce the behavior:
Start a usual benchmark with non-preemptible nodes and start fewer workers.

Expected behavior
A stable running system.

Environment:

  • OS: Kubernetes (GKE)
  • Zeebe Version: 0.24.0-SNAPSHOT
  • Configuration: Elasticsearch exporter, standalone gateway (see the sketch below)
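
To make the "Elasticsearch exporter, standalone gateway" setup a bit more concrete, here is a rough broker-side sketch. The exporter class name and key layout are assumptions based on the 0.2x configuration templates (the exporters section changed its shape between 0.2x releases), and the URL is only a placeholder.

```yaml
# Hypothetical broker configuration excerpt, assuming the 0.2x key layout --
# verify against the broker template for the release in use.
zeebe:
  broker:
    gateway:
      # The embedded gateway is disabled because a standalone gateway is deployed.
      enable: false
    exporters:
      elasticsearch:
        className: io.zeebe.exporter.ElasticsearchExporter
        args:
          url: http://elasticsearch:9200   # placeholder endpoint
```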
Zelldon added the kind/bug, scope/broker, area/performance, Impact: Availability, and severity/high labels on May 20, 2020
Zelldon (Member, Author) commented on May 22, 2020

It happened again with a new benchmark. We are not able to work off all jobs, which means we are building up a bigger state. This also causes bigger snapshot sizes.

[screenshot: bigger-load-snapshot]

We see a lot of leader changes and missed heartbeats in the benchmark.

[screenshot: role-changes]

In the end, we ran out of disk space.

[screenshot: disk]

korthout removed their assignment on Aug 19, 2020
npepinpe (Member) commented on Nov 9, 2020

Downgrading: while this is important to figure out, it does not impact our own production environment (thanks, preemptible nodes), and as such isn't critical to our OKRs.

zeebe-bors bot closed this as completed in 1324f89 on Jan 28, 2021