Unstable cluster on bigger state #4560
Labels
area/performance
Marks an issue as performance related
kind/bug
Categorizes an issue or PR as a bug
scope/broker
Marks an issue or PR to appear in the broker section of the changelog
severity/high
Marks a bug as having a noticeable impact on the user with no known workaround
Describe the bug
I set up a stable cluster on GKE (no preemptible nodes) and expected to run a benchmark for a while. Unfortunately it died after around 12 hours.
Investigating this, I saw that we had unexpectedly frequent leader changes in the cluster.
These also caused disk usage to grow, because we are not able to compact the log when leader changes happen too often.
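To illustrate why frequent leader changes block compaction: the log can only be truncated up to the lowest position that is both covered by a snapshot and replicated to every follower, and a follower that falls behind (e.g. while catching up after a leader change) holds that bound down. A minimal sketch of this constraint (not Zeebe's actual code, names are illustrative):

```java
public class CompactionBound {
    // Illustrative sketch, not the actual Zeebe implementation:
    // the log may only be compacted up to the lowest position that is
    // both snapshotted and replicated to every follower.
    static long compactionBound(long snapshotPosition, long[] followerPositions) {
        long bound = snapshotPosition;
        for (long p : followerPositions) {
            bound = Math.min(bound, p);
        }
        return bound;
    }

    public static void main(String[] args) {
        // A follower that is far behind (e.g. re-replicating after a
        // leader change) pins the bound low, so the log keeps growing.
        System.out.println(compactionBound(1000, new long[] {1000, 120})); // prints 120
    }
}
```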
As we can see above, only a few of the leader changes are actually related to restarts. The restarts were caused by reaching the memory limit; probably RocksDB reached its limits again.
With our Helm charts we had the problem that the gateway used only one thread to process requests, which meant we could not work through the created jobs fast enough. See related issue #4524. This caused a growing count of running instances.
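One way to work around the single-thread gateway from #4524 would be to raise the gateway thread count in the chart values; a minimal sketch, assuming an environment-variable override (the variable name and chart structure below are assumptions, check the chart's values.yaml):

```yaml
# Hypothetical values.yaml fragment — key names are assumptions,
# not verified against the actual Zeebe Helm chart.
gateway:
  env:
    - name: ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS  # assumed variable name
      value: "4"
```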
As you can see, we have around 2 million running instances. This big state also causes a bigger snapshot size.
This might be related to the leader changes, because it takes a long time to replicate these snapshots. Furthermore, it increases the disk usage as well.
To Reproduce
Start a usual benchmark on non-preemptible nodes and start fewer workers.
Expected behavior
A stable running system.
Environment: