kv: bound minimum raft scheduler workers per store #120162
Conversation
In a448edc, we switched from replicating COCKROACH_SCHEDULER_CONCURRENCY on each store to evenly distributing the workers across all stores. For example, on an 8-store, 32-vCPU node, the number of Raft scheduler workers per store went from 96 to 12. This was done to avoid scheduler thrashing and excessive memory usage on many-store nodes.

Unfortunately, we have seen that this change could also lead to situations where workers were spread so thin across stores in a many-store system that no single store's worker pool could keep up with temporarily imbalanced load. In extreme cases where the `PreIngestDelay` mechanism kicks in, this could lead to high scheduler latency across replicas on the store.

This commit establishes an intermediate solution. We will continue to distribute workers across stores, but we will also ensure that each store has at least `COCKROACH_SCHEDULER_MIN_CONCURRENCY_PER_STORE` workers. This ensures that each store's worker pool is able to absorb imbalanced load. The value defaults to `GOMAXPROCS`, so that in the previous example, each store would have at least 32 workers.

Epic: None

Release note (ops change): a minimum Raft scheduler concurrency is now enforced per store so that nodes with many stores do not spread workers too thin. This avoids high scheduler latency across replicas on a store when load is imbalanced.
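The clamped per-store worker calculation described in the commit message can be sketched as follows. Function and variable names here are illustrative, not the actual CockroachDB identifiers:

```go
package main

import "fmt"

// storeWorkers divides the node-wide scheduler concurrency evenly across
// stores, then clamps the result to a per-store minimum, mirroring the
// behavior described above. Names are hypothetical.
func storeWorkers(totalConcurrency, numStores, minPerStore int) int {
	workers := totalConcurrency / numStores
	if workers < minPerStore {
		workers = minPerStore
	}
	return workers
}

func main() {
	// On an 8-store, 32-vCPU node with 96 total workers, even distribution
	// alone gives 12 workers per store; a GOMAXPROCS-derived minimum of 32
	// raises each store to 32 workers.
	fmt.Println(storeWorkers(96, 8, 32)) // 32
}
```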
Sad about the fallout here. I have a few questions though.

- This will increase the lower concurrency bound from 16 to 32 workers on 2-CPU nodes. Are we ok with that?
- Do we understand why the AddSSTable commands were so imbalanced? Presumably the replica/store allocation should be roughly random. The reason I ask is that the scheduler sharding will have the same effect, and it defaults to 16 workers per shard. So if we saw pathological imbalance with random distribution across 12-worker stores, we'll presumably see similar pathological imbalance with random distribution across 16-worker shards, which won't be addressed by this change.
- I believe the primary reason to cap worker scaling is to avoid mutex contention. That's also addressed by sharding. Should we use more aggressive worker scaling now that we have sharding, to avoid starvation, e.g. by removing the worker cap at 128?
Reviewable status: complete! 0 of 0 LGTMs obtained
Follow-up question: should we add a setting to disable the below-Raft throttling too?
The per-store low bound defaults to `GOMAXPROCS`.
It seems much more common to me that hotspots might emerge on some stores than on some worker shards. Stores use range partitioning, requiring active rebalancing to combat hotspots that emerge from sudden load on a keyspace and that keyspace being split into many ranges. These new ranges won't move off the store unless the allocator decides to move them, which might take time. Worker shards use hash partitioning, so the ranges will naturally be balanced based on their ID. If a single range splits into 100s of new ranges quickly, there is nothing that would encourage them to land on the same worker shard — we would expect uniform distribution due to the round-robin effect of the range's leaseholder's Range ID allocator.
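The hash-partitioning point can be illustrated with a minimal sketch. Assigning a range to a worker shard by range ID modulo shard count (an assumed scheme for illustration; the scheduler's actual sharding function may differ) spreads the consecutive IDs produced by a burst of splits nearly uniformly across shards:

```go
package main

import "fmt"

// shardForRange sketches hash-partitioned shard assignment by range ID.
// The modulo scheme is an assumption for illustration; the real sharding
// function in the scheduler may differ.
func shardForRange(rangeID, numShards int) int {
	return rangeID % numShards
}

func main() {
	// 100 consecutive range IDs, as produced by a burst of splits, land
	// nearly evenly across 8 shards: no shard receives more than 13.
	counts := make([]int, 8)
	for id := 1000; id < 1100; id++ {
		counts[shardForRange(id, 8)]++
	}
	fmt.Println(counts) // [13 13 13 13 12 12 12 12]
}
```

By contrast, those same 100 ranges all start on the store where the split occurred, which is exactly the store-affinity imbalance discussed above.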
I think we probably should at some point. We already went up from 96 to 128 after that change. Increasing the concurrency again is probably a good idea. At the same time, we've removed raft log fsync blocking and PreIngestDelay blocking from under this worker pool, so there's less and less reason why we need more workers than vCPUs.
We have it. In v23.1, you can set
> It seems much more common to me that hotspots might emerge on some stores than on some worker shards. Stores use range partitioning, requiring active rebalancing to combat hotspots which emerge from sudden load on a keyspace and that keyspace being split into many ranges.
Yeah, I realized the split case shortly after writing this. This is of course the common case with imports and other bulk operations on new tables.
> I think we probably should at some point. We already went up from 96 to 128 after that change. Increasing the concurrency again is probably a good idea. At the same time, we've removed raft log fsync blocking and PreIngestDelay blocking from under this worker pool, so there's less and less reason why we need more workers than vCPUs.
On a 32-vCPU system this would result in 256 workers, up from 128, which doesn't sound too excessive to me. It could conceivably be problematic on 96-vCPU systems though (768 workers), but I believe we only officially support up to 32 vCPUs. We could consider upping the cap to 256 perhaps, in a separate non-backport PR?
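The arithmetic here can be checked with a short sketch, assuming workers scale at 8x vCPUs up to a cap (the 8x multiplier and the names are assumptions for illustration, consistent with the 128/256/768 figures in this thread):

```go
package main

import "fmt"

// schedulerConcurrency sketches capped worker scaling: workers grow with
// vCPU count (8x here, an assumed multiplier consistent with the numbers
// in this thread) up to a fixed cap; maxWorkers <= 0 means uncapped.
func schedulerConcurrency(vcpus, maxWorkers int) int {
	n := 8 * vcpus
	if maxWorkers > 0 && n > maxWorkers {
		n = maxWorkers
	}
	return n
}

func main() {
	fmt.Println(schedulerConcurrency(32, 128)) // 128: 32 vCPUs, current cap
	fmt.Println(schedulerConcurrency(32, 0))   // 256: 32 vCPUs, uncapped
	fmt.Println(schedulerConcurrency(96, 0))   // 768: 96 vCPUs, uncapped
}
```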
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
PS: consider adding a sentence saying that splits have store affinity, and will frequently cause imbalances with bulk operations on empty tables.
I think we should wait and see if this is a concern. A 96 vCPU system will have proportionally more resources to run these workers, so the primary concern would be contention between them. Your recent work to shard the per-store raft schedulers into fixed size shards should alleviate that concern. TFTR! bors r+