Filing as an enhancement since I'm not certain this is due to a defect; feel free to recategorize.
Recently we encountered an issue where a cluster became unavailable and the user lost access to metrics following a cold restart. The logs were spammed with
@bdarnell noticed there was a relatively high (and stable) number of cgo calls:
A pprof flame graph and a goroutine dump indicated high lock contention in the raft scheduler.
Setting the following three env variables and restarting the entire cluster cleared the contention and allowed replicas to quiesce:
I'm not certain what specific enhancement is needed, but ideally the cluster would become available again without manual intervention.
I believe this is the same as #35063, although that was with rolling restarts rather than a whole-cluster cold start.
There are two separate issues here.