
kv: bound minimum raft scheduler workers per store #120162

Merged
merged 1 commit into cockroachdb:master from nvanbenschoten/defRaftConc on Mar 20, 2024

Conversation

nvanbenschoten
Member

In a448edc, we switched from replicating `COCKROACH_SCHEDULER_CONCURRENCY` on each store to evenly distributing the workers across the node's stores. For example, on an 8-store, 32-vCPU node, the number of Raft scheduler workers per store went from 96 to 12.

This was done to avoid scheduler thrashing and excessive memory usage on many-store nodes. Unfortunately, we have seen that this change could also lead to situations where workers were spread so thin across the stores of a many-store system that no single store's worker pool could keep up with temporary load imbalances. In extreme cases where the `PreIngestDelay` mechanism kicks in, this could lead to high scheduler latency across replicas on the store.

This commit establishes an intermediate solution. We will continue to distribute workers across stores, but we will also ensure that each store has at least `COCKROACH_SCHEDULER_MIN_CONCURRENCY_PER_STORE` workers. This prevents any single store's worker pool from being too thin to absorb imbalanced load. The value defaults to `GOMAXPROCS`, so in the previous example each store would now have at least 32 workers.

Epic: None

Release note (ops change): a minimum Raft scheduler concurrency is now enforced per store so that nodes with many stores do not spread workers too thin. This avoids high scheduler latency across replicas on a store when load is imbalanced.
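
For illustration, here is a minimal sketch of the sizing logic described above. The function and variable names are hypothetical and do not mirror the actual CockroachDB code.

```go
// Hypothetical sketch of dividing Raft scheduler workers across stores while
// enforcing a per-store minimum. Not the actual CockroachDB implementation.
package main

import "fmt"

// workersPerStore spreads the node-wide scheduler concurrency evenly across
// the node's stores, but never lets a store drop below the per-store minimum
// (COCKROACH_SCHEDULER_MIN_CONCURRENCY_PER_STORE, which defaults to GOMAXPROCS).
func workersPerStore(totalWorkers, numStores, minPerStore int) int {
	perStore := totalWorkers / numStores
	if perStore < minPerStore {
		perStore = minPerStore
	}
	return perStore
}

func main() {
	// The 8-store, 32-vCPU example above: 96 total workers, GOMAXPROCS = 32.
	fmt.Println(workersPerStore(96, 8, 32)) // prints 32; was 12 before the minimum was enforced
}
```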

@nvanbenschoten nvanbenschoten added the O-support (Originated from a customer), backport-23.1.x (Flags PRs that need to be backported to 23.1), and backport-23.2.x (Flags PRs that need to be backported to 23.2) labels Mar 9, 2024
@nvanbenschoten nvanbenschoten requested a review from a team as a code owner March 9, 2024 00:11
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@erikgrinaker erikgrinaker left a comment


Sad about the fallout here. I have a few questions though.

  • This will increase the lower concurrency bound from 16 to 32 workers on 2-CPU nodes. Are we ok with that?

  • Do we understand why the AddSSTable commands were so imbalanced? Presumably the replica/store allocation should be roughly random. The reason I ask is that the scheduler sharding will have the same effect, and it defaults to 16 workers per shard. So if we saw pathological imbalance with random distribution across 12-worker stores, we'll presumably see similar pathological imbalance with random distribution across 16-worker shards, which won't be addressed by this change.

  • I believe the primary reason to cap worker scaling is to avoid mutex contention. That's also addressed by sharding. Should we use more aggressive worker scaling now that we have sharding to avoid starvation, e.g. by removing the worker cap at 128?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@erikgrinaker
Contributor

Follow-up question: should we add a setting to disable the below-Raft throttling too?

@nvanbenschoten
Member Author

This will increase the lower concurrency bound from 16 to 32 workers on 2-CPU nodes. Are we ok with that?

The per-store lower bound defaults to `GOMAXPROCS`, so it only kicks in for many-store configurations. You're saying the case where there are more than 8 stores? I think I'm ok with that. One goroutine per store is really low, and I don't think a 2-vCPU, 10-store configuration is realistic.

Do we understand why the AddSSTable commands were so imbalanced? Presumably the replica/store allocation should be roughly random. The reason I ask is that the scheduler sharding will have the same effect, and it defaults to 16 workers per shard. ...

It seems much more common to me that hotspots might emerge on some stores than on some worker shards. Stores use range partitioning, requiring active rebalancing to combat hotspots that emerge when sudden load hits a keyspace and that keyspace is split into many ranges. These new ranges won't move off the store unless the allocator decides to move them, which might take time. Worker shards use hash partitioning, so the ranges will naturally be balanced based on their IDs. If a single range quickly splits into hundreds of new ranges, there is nothing that would encourage them to land on the same worker shard; we would expect a uniform distribution due to the round-robin effect of the range's leaseholder's range ID allocator (idalloc).
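
For illustration, a minimal sketch of why hash partitioning by range ID spreads a burst of new ranges across worker shards (hypothetical names, not the actual scheduler code):

```go
// Hypothetical illustration of hash-partitioned shard selection by range ID.
// A burst of consecutively allocated range IDs lands roughly uniformly across
// shards, unlike range-partitioned stores, which rely on the allocator to rebalance.
package main

import "fmt"

func shardForRange(rangeID, numShards int) int {
	return rangeID % numShards // modulo/hash partitioning by range ID
}

func main() {
	const numShards = 8 // e.g. 128 workers split into shards of 16
	counts := make([]int, numShards)
	// A hotspot that splits into 100 new ranges with consecutive IDs.
	for rangeID := 500; rangeID < 600; rangeID++ {
		counts[shardForRange(rangeID, numShards)]++
	}
	fmt.Println(counts) // roughly uniform: [12 12 12 12 13 13 13 13]
}
```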

I believe the primary reason to cap worker scaling is to avoid mutex contention. That's also addressed by sharding. Should we use more aggressive worker scaling now that we have sharding to avoid starvation, e.g. by removing the worker cap at 128?

I think we probably should at some point. We already went up from 96 to 128 after that change. Increasing the concurrency again is probably a good idea. At the same time, we've removed raft log fsync blocking and PreIngestDelay blocking from under this worker pool, so there's less and less reason why we need more workers than vCPUs.

should we add a setting to disable the below-Raft throttling too?

We have it. In v23.1, you can set `rocksdb.ingest_backpressure.l0_file_count_threshold` to a very high value to prevent it from ever kicking in. In v23.2, you can set `pebble.pre_ingest_delay.enabled` directly.

Contributor

@erikgrinaker erikgrinaker left a comment


It seems much more common to me that hotspots might emerge on some stores than on some worker shards. Stores use range partitioning, requiring active rebalancing to combat hotspots which emerge from sudden load on a keyspace and that keyspace being split into many ranges.

Yeah, I realized the split case shortly after writing this. This is of course the common case with imports and other bulk operations on new tables.

I think we probably should at some point. We already went up from 96 to 128 after that change. Increasing the concurrency again is probably a good idea. At the same time, we've removed raft log fsync blocking and PreIngestDelay blocking from under this worker pool, so there's less and less reason why we need more workers than vCPUs.

On a 32-vCPU system this would result in 256 workers, up from 128, which doesn't sound too excessive to me. It could conceivably be problematic on 96-vCPU systems though (768 workers), but I believe we only officially support up to 32 vCPUs. We could consider upping the cap to 256 perhaps, in a separate non-backport PR?
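
For illustration, a quick back-of-envelope check of these figures. The 8-workers-per-vCPU scaling below is inferred from the 32-vCPU-to-256-worker number above, not taken from the actual defaults.

```go
// Hypothetical check of the worker counts discussed above, assuming the
// default concurrency scales at 8 workers per vCPU (an inference from this
// thread, not the real formula).
package main

import "fmt"

func defaultWorkers(vcpus, maxWorkers int) int {
	w := 8 * vcpus
	if maxWorkers > 0 && w > maxWorkers {
		w = maxWorkers
	}
	return w
}

func main() {
	fmt.Println(defaultWorkers(32, 128)) // 128: 32 vCPUs with the current cap
	fmt.Println(defaultWorkers(32, 0))   // 256: 32 vCPUs with the cap removed
	fmt.Println(defaultWorkers(96, 0))   // 768: 96 vCPUs with the cap removed
}
```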

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@erikgrinaker
Contributor

PS: consider adding a sentence saying that splits have store affinity, and will frequently cause imbalances with bulk operations on empty tables.

@nvanbenschoten
Member Author

It could conceivably be problematic on 96-vCPU systems though (768 workers), but I believe we only officially support up to 32 vCPUs. We could consider upping the cap to 256 perhaps, in a separate non-backport PR?

I think we should wait and see if this is a concern. A 96-vCPU system will have proportionally more resources to run these workers, so the primary concern would be contention between them. Your recent work to shard the per-store Raft schedulers into fixed-size shards should alleviate that concern.

TFTR!

bors r+

@craig
Contributor

craig bot commented Mar 20, 2024

@craig craig bot merged commit 6889563 into cockroachdb:master Mar 20, 2024
19 checks passed
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/defRaftConc branch March 24, 2024 01:07