
kv: bound minimum raft scheduler workers per store #120162

Merged
merged 1 commit into cockroachdb:master from nvanbenschoten/defRaftConc on Mar 20, 2024

Conversation

nvanbenschoten
Member

In a448edc, we switched from replicating `COCKROACH_SCHEDULER_CONCURRENCY` on each store to evenly distributing the workers across the node's stores. For example, on an 8-store, 32-vCPU node, the number of Raft scheduler workers per store went from 96 to 12.

This was done to avoid scheduler thrashing and excessive memory usage on many-store nodes. Unfortunately, we have seen that this change could also lead to situations where workers were spread so thin across the stores of a many-store system that no single store's worker pool could keep up with temporary load imbalances. In extreme cases where the `PreIngestDelay` mechanism kicks in, this could lead to high scheduler latency across replicas on the store.

This commit establishes an intermediate solution. We will continue to distribute workers across stores, but we will also ensure that each store has at least `COCKROACH_SCHEDULER_MIN_CONCURRENCY_PER_STORE` workers. This prevents any single store's worker pool from being too thin to absorb imbalanced load. The value defaults to `GOMAXPROCS`, so in the previous example each store would now have at least 32 workers.

Epic: None

Release note (ops change): a minimum Raft scheduler concurrency is now enforced per store so that nodes with many stores do not spread workers too thin. This avoids high scheduler latency across replicas on a store when load is imbalanced.
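
For illustration, here is a minimal sketch of the sizing logic described above. The function and variable names are hypothetical and do not mirror the actual CockroachDB code.

```go
// Hypothetical sketch of dividing Raft scheduler workers across stores while
// enforcing a per-store minimum. Not the actual CockroachDB implementation.
package main

import "fmt"

// workersPerStore spreads the node-wide scheduler concurrency evenly across
// the node's stores, but never lets a store drop below the per-store minimum
// (COCKROACH_SCHEDULER_MIN_CONCURRENCY_PER_STORE, which defaults to GOMAXPROCS).
func workersPerStore(totalWorkers, numStores, minPerStore int) int {
	perStore := totalWorkers / numStores
	if perStore < minPerStore {
		perStore = minPerStore
	}
	return perStore
}

func main() {
	// The 8-store, 32-vCPU example above: 96 total workers, GOMAXPROCS = 32.
	fmt.Println(workersPerStore(96, 8, 32)) // prints 32; was 12 before the minimum was enforced
}
```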

@nvanbenschoten nvanbenschoten added the O-support (Originated from a customer), backport-23.1.x (Flags PRs that need to be backported to 23.1), and backport-23.2.x (Flags PRs that need to be backported to 23.2) labels Mar 9, 2024
@nvanbenschoten nvanbenschoten requested a review from a team as a code owner March 9, 2024 00:11
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@erikgrinaker erikgrinaker left a comment


Sad about the fallout here. I have a few questions though.

  • This will increase the lower concurrency bound from 16 to 32 workers on 2-CPU nodes. Are we ok with that?

  • Do we understand why the AddSSTable commands were so imbalanced? Presumably the replica/store allocation should be roughly random. The reason I ask is that the scheduler sharding will have the same effect, and it defaults to 16 workers per shard. So if we saw pathological imbalance with random distribution across 12-worker stores, we'll presumably see similar pathological imbalance with random distribution across 16-worker shards, which won't be addressed by this change.

  • I believe the primary reason to cap worker scaling is to avoid mutex contention. That's also addressed by sharding. Should we use more aggressive worker scaling now that we have sharding to avoid starvation, e.g. by removing the worker cap at 128?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@erikgrinaker
Contributor

Follow-up question: should we add a setting to disable the below-Raft throttling too?

@nvanbenschoten
Member Author

This will increase the lower concurrency bound from 16 to 32 workers on 2-CPU nodes. Are we ok with that?

The per-store lower bound defaults to `GOMAXPROCS`, so it only kicks in for many-store configurations. You're saying the case where there are more than 8 stores? I think I'm ok with that. One goroutine per store is really low, and I don't think a 2-vCPU, 10-store configuration is realistic.

Do we understand why the AddSSTable commands were so imbalanced? Presumably the replica/store allocation should be roughly random. The reason I ask is that the scheduler sharding will have the same effect, and it defaults to 16 workers per shard. ...

It seems much more common to me that hotspots might emerge on some stores than on some worker shards. Stores use range partitioning, requiring active rebalancing to combat hotspots that emerge when sudden load hits a keyspace and that keyspace is split into many ranges. These new ranges won't move off the store unless the allocator decides to move them, which might take time. Worker shards use hash partitioning, so the ranges will naturally be balanced based on their IDs. If a single range quickly splits into hundreds of new ranges, there is nothing that would encourage them to land on the same worker shard; we would expect a uniform distribution due to the round-robin effect of the range's leaseholder's range ID allocator (idalloc).
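
For illustration, a minimal sketch of why hash partitioning by range ID spreads a burst of new ranges across worker shards (hypothetical names, not the actual scheduler code):

```go
// Hypothetical illustration of hash-partitioned shard selection by range ID.
// A burst of consecutively allocated range IDs lands roughly uniformly across
// shards, unlike range-partitioned stores, which rely on the allocator to rebalance.
package main

import "fmt"

func shardForRange(rangeID, numShards int) int {
	return rangeID % numShards // modulo/hash partitioning by range ID
}

func main() {
	const numShards = 8 // e.g. 128 workers split into shards of 16
	counts := make([]int, numShards)
	// A hotspot that splits into 100 new ranges with consecutive IDs.
	for rangeID := 500; rangeID < 600; rangeID++ {
		counts[shardForRange(rangeID, numShards)]++
	}
	fmt.Println(counts) // roughly uniform: [12 12 12 12 13 13 13 13]
}
```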

I believe the primary reason to cap worker scaling is to avoid mutex contention. That's also addressed by sharding. Should we use more aggressive worker scaling now that we have sharding to avoid starvation, e.g. by removing the worker cap at 128?

I think we probably should at some point. We already went up from 96 to 128 after that change. Increasing the concurrency again is probably a good idea. At the same time, we've removed raft log fsync blocking and PreIngestDelay blocking from under this worker pool, so there's less and less reason why we need more workers than vCPUs.

should we add a setting to disable the below-Raft throttling too?

We have it. In v23.1, you can set `rocksdb.ingest_backpressure.l0_file_count_threshold` to a very high value to prevent it from ever kicking in. In v23.2, you can set `pebble.pre_ingest_delay.enabled` directly.

Contributor

@erikgrinaker erikgrinaker left a comment


It seems much more common to me that hotspots might emerge on some stores than on some worker shards. Stores use range partitioning, requiring active rebalancing to combat hotspots which emerge from sudden load on a keyspace and that keyspace being split into many ranges.

Yeah, I realized the split case shortly after writing this. This is of course the common case with imports and other bulk operations on new tables.

I think we probably should at some point. We already went up from 96 to 128 after that change. Increasing the concurrency again is probably a good idea. At the same time, we've removed raft log fsync blocking and PreIngestDelay blocking from under this worker pool, so there's less and less reason why we need more workers than vCPUs.

On a 32-vCPU system this would result in 256 workers, up from 128, which doesn't sound too excessive to me. It could conceivably be problematic on 96-vCPU systems though (768 workers), but I believe we only officially support up to 32 vCPUs. We could consider upping the cap to 256 perhaps, in a separate non-backport PR?
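
For illustration, a quick back-of-envelope check of these figures. The 8-workers-per-vCPU scaling below is inferred from the 32-vCPU-to-256-worker number above, not taken from the actual defaults.

```go
// Hypothetical check of the worker counts discussed above, assuming the
// default concurrency scales at 8 workers per vCPU (an inference from this
// thread, not the real formula).
package main

import "fmt"

func defaultWorkers(vcpus, maxWorkers int) int {
	w := 8 * vcpus
	if maxWorkers > 0 && w > maxWorkers {
		w = maxWorkers
	}
	return w
}

func main() {
	fmt.Println(defaultWorkers(32, 128)) // 128: 32 vCPUs with the current cap
	fmt.Println(defaultWorkers(32, 0))   // 256: 32 vCPUs with the cap removed
	fmt.Println(defaultWorkers(96, 0))   // 768: 96 vCPUs with the cap removed
}
```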

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@erikgrinaker
Contributor

PS: consider adding a sentence saying that splits have store affinity, and will frequently cause imbalances with bulk operations on empty tables.

@nvanbenschoten
Member Author

It could conceivably be problematic on 96-vCPU systems though (768 workers), but I believe we only officially support up to 32 vCPUs. We could consider upping the cap to 256 perhaps, in a separate non-backport PR?

I think we should wait and see if this is a concern. A 96-vCPU system will have proportionally more resources to run these workers, so the primary concern would be contention between them. Your recent work to shard the per-store Raft schedulers into fixed-size shards should alleviate that concern.

TFTR!

bors r+

@craig
Contributor

craig bot commented Mar 20, 2024

@craig craig bot merged commit 6889563 into cockroachdb:master Mar 20, 2024
19 checks passed
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/defRaftConc branch March 24, 2024 01:07