
core: Improve experience during cold starts in the presence of large datasets #39117

Open
tim-o opened this issue Jul 26, 2019 · 1 comment

Comments

tim-o (Contributor) commented Jul 26, 2019

Filing as an enhancement since I'm not certain this is due to a defect; feel free to recategorize.

Recently we encountered an issue where a cluster became unavailable and the user lost access to metrics following a cold restart. The logs were spammed with "refusing gossip from x; forwarding to y" errors, despite the fact that the nodes were all connected to each other and reporting the same connectivity graph. No ranges were available because r1 itself was unavailable.

@bdarnell noticed a relatively high (and stable) rate of cgo calls:

I190723 17:04:32.587894 1087 server/status/runtime.go:500  [n1] runtime stats: 12 GiB RSS, 623 goroutines, 784 MiB/612 MiB/1.7 GiB GO alloc/idle/total, 8.5 GiB/10 GiB CGO alloc/total, 9010.8 CGO/sec, 219.5/13.5 %(u/s)time, 0.0 %gc (2x), 1.1 MiB/2.5 MiB (r/w)net
I190723 17:04:33.800635 1719 server/status/runtime.go:500  [n4] runtime stats: 11 GiB RSS, 630 goroutines, 909 MiB/471 MiB/1.7 GiB GO alloc/idle/total, 8.1 GiB/9.6 GiB CGO alloc/total, 9515.1 CGO/sec, 216.3/11.5 %(u/s)time, 0.0 %gc (2x), 2.2 MiB/3.8 MiB (r/w)net
I190723 17:04:35.379305 3464 server/status/runtime.go:500  [n2] runtime stats: 12 GiB RSS, 643 goroutines, 1.1 GiB/1.3 GiB/2.7 GiB GO alloc/idle/total, 8.4 GiB/10 GiB CGO alloc/total, 10387.9 CGO/sec, 229.9/15.0 %(u/s)time, 0.0 %gc (2x), 3.5 MiB/2.1 MiB (r/w)net

A pprof flame graph and a goroutine dump indicated high lock contention in the raft scheduler.

Setting the following three env variables and restarting the entire cluster cleared the contention and allowed replicas to quiesce:

COCKROACH_SCHEDULER_CONCURRENCY=100
COCKROACH_SCAN_INTERVAL=60m
COCKROACH_SCAN_MIN_IDLE_TIME=1s
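
For context, a minimal Go sketch (illustrative only, not CockroachDB's actual code; the function name and plumbing here are made up) of how an override like COCKROACH_SCHEDULER_CONCURRENCY caps a worker count that otherwise scales with CPU count (8 goroutines per CPU, per the comment below):

package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// schedulerConcurrency returns the raft scheduler worker count: the
// COCKROACH_SCHEDULER_CONCURRENCY override if set and valid, otherwise a
// CPU-derived default (8 per CPU, which is where 384 goroutines on a
// 48-CPU node comes from).
func schedulerConcurrency() int {
	if v := os.Getenv("COCKROACH_SCHEDULER_CONCURRENCY"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return 8 * runtime.NumCPU()
}

func main() {
	fmt.Println("raft scheduler workers:", schedulerConcurrency())
}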

I'm not certain what specific enhancement is needed here, but ideally we wouldn't require manual intervention to get the cluster available again.

tim-o added this to Incoming in KV via automation Jul 26, 2019
tim-o added a commit to cockroachdb/docs that referenced this issue Jul 26, 2019

bdarnell (Member) commented Jul 29, 2019

I believe this is the same as #35063, although that was with rolling restarts rather than a whole-cluster cold start.

There are two separate issues here.

  1. The nodes had 48 CPUs, and we saw a lot of lock contention in the raft scheduler (which creates nCPU*8 = 384 goroutines). Setting COCKROACH_SCHEDULER_CONCURRENCY is an attempt to alleviate this. If it works, we could simply put a max on the number of goroutines here, or rethink the locking. Eight goroutines per CPU also seems like a lot, so we should evaluate whether this default is too high even on smaller machines. Note that I think this is a secondary issue; without the scanner issue I don't think we would have seen problems here.
  2. The scanner wakes up every range on a 10-minute cycle, which for nodes with 100k replicas causes raft elections to occur faster than we can handle them, saturating the scheduler threads and causing new leaders to time out as soon as their elections complete. Ideally we'd have some sort of backpressure here, preventing the scanner from waking up too many replicas at once. Tweaking the scanner constants might be a simpler way to mitigate the problem, although there's a tradeoff with responsiveness to tasks that the scanner needs to process. I think increasing the min idle time from its default of 10ms is probably the best move here, although it needs testing; a rough sketch of this kind of pacing follows below.
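
For illustration, here is a rough Go sketch of the kind of pacing I mean (not the actual scanner code; the names and structure below are made up), where the scan interval and min idle time together bound how quickly replicas get woken:

package main

import (
	"fmt"
	"time"
)

// paceScan visits every replica once per pass, spacing the wakeups so the pass
// is spread across scanInterval but never waking replicas closer together than
// minIdleTime. With 100k replicas and a 10m interval the computed gap is ~6ms,
// so a larger min idle time (10ms default, 1s in the workaround above) becomes
// the effective throttle on election traffic.
func paceScan(replicaIDs []int, scanInterval, minIdleTime time.Duration, visit func(id int)) {
	if len(replicaIDs) == 0 {
		return
	}
	gap := scanInterval / time.Duration(len(replicaIDs))
	if gap < minIdleTime {
		gap = minIdleTime
	}
	for _, id := range replicaIDs {
		visit(id) // e.g. enqueue the replica, possibly triggering a raft election
		time.Sleep(gap)
	}
}

func main() {
	ids := []int{1, 2, 3, 4, 5}
	paceScan(ids, 50*time.Millisecond, 5*time.Millisecond, func(id int) {
		fmt.Println("waking replica", id)
	})
}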