kvserver: remove below-raft throttling #57247

Closed · Fixed by #98762

sumeerbhola opened this issue Nov 30, 2020 · 8 comments

Labels: A-kv-replication (Relating to Raft, consensus, and coordination.) · C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are an exception) · T-kv-replication (KV Replication Team)

Comments

sumeerbhola (Collaborator) commented Nov 30, 2020

We currently throttle the addition of SSTables to the LSM by background bulk operations. This throttling happens in two places: in Store.Send, prior to the proposal, and in addSSTablePreApply, when applying to the state machine at each replica. The latter is problematic in that it (a) delays application of later entries in the Raft log at that replica, and (b) blocks a worker on the Raft scheduler, which can impact other ranges.

The Store.Send throttling only looks at the health of the store on the leaseholder, and we have seen cases where one node (or a few nodes) in a cluster has an unhealthy LSM store. If we maintained soft state in each node about the health of other stores in the system (this state would be slow-changing), we could use it in Store.Send and remove the apply-time throttling.

This store health information can also be useful for throttling in the GC queue.
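
Not part of the proposal as written, just a minimal sketch of what such a Store.Send-time check could look like. It assumes slow-changing per-store health info (say, L0 sub-level counts) is already available locally; every type and name below is hypothetical.

package kvserver

import (
	"context"
	"time"
)

// storeHealth is hypothetical, slow-changing soft state about a remote store.
type storeHealth struct {
	L0Sublevels int       // coarse proxy for an inverted LSM
	Updated     time.Time // when this info was last refreshed
}

// healthTracker caches the most recently known health info per store.
type healthTracker struct {
	byStore map[int32]storeHealth
}

// anyOverloaded reports whether any of the given replica stores looks unhealthy.
func (t *healthTracker) anyOverloaded(storeIDs []int32, maxL0 int) bool {
	for _, id := range storeIDs {
		if h, ok := t.byStore[id]; ok && h.L0Sublevels > maxL0 {
			return true
		}
	}
	return false
}

// maybeThrottleAddSSTable blocks a bulk-ingestion request before the proposal
// while any of the range's replica stores is overloaded, so that no throttling
// is needed below Raft at apply time.
func maybeThrottleAddSSTable(ctx context.Context, t *healthTracker, replicaStores []int32) error {
	const maxL0Sublevels = 20 // illustrative threshold
	for t.anyOverloaded(replicaStores, maxL0Sublevels) {
		select {
		case <-time.After(time.Second): // back off, then re-check
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}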

Jira issue: CRDB-2847

Epic: CRDB-15069

@sumeerbhola sumeerbhola added A-disaster-recovery A-kv-replication Relating to Raft, consensus, and coordination. labels Nov 30, 2020
blathers-crl (bot) commented Nov 30, 2020

Hi @sumeerbhola, please add a C-ategory label to your issue. Check out the label system docs.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@sumeerbhola sumeerbhola added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Nov 30, 2020
@mwang1026 mwang1026 added this to Triage in Disaster Recovery Backlog via automation Dec 8, 2020
@mwang1026 mwang1026 added this to Incoming in Storage via automation Dec 15, 2020
@mwang1026 mwang1026 removed this from Triage in Disaster Recovery Backlog Dec 15, 2020
@jlinder jlinder added the T-storage Storage Team label Jun 16, 2021
@mwang1026 mwang1026 added A-kv-distribution Relating to rebalancing and leasing. T-kv KV Team and removed A-kv-replication Relating to Raft, consensus, and coordination. T-storage Storage Team labels Dec 8, 2021
@blathers-crl blathers-crl bot added this to Incoming in KV Dec 8, 2021
@mwang1026 mwang1026 added this to Triage in Disaster Recovery Backlog via automation Dec 8, 2021
blathers-crl (bot) commented Dec 8, 2021

cc @cockroachdb/bulk-io

@erikgrinaker erikgrinaker added A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication KV Replication Team and removed A-kv-distribution Relating to rebalancing and leasing. T-kv KV Team labels Dec 9, 2021
@erikgrinaker erikgrinaker added this to Incoming in Replication via automation Dec 9, 2021
@erikgrinaker erikgrinaker removed this from Incoming in KV Dec 9, 2021
@erikgrinaker erikgrinaker removed this from Triage in Disaster Recovery Backlog Dec 9, 2021
erikgrinaker (Contributor) commented

@cockroachdb/repl discussed this a bit last week. We definitely need to avoid below-Raft throttling while holding the Raft mutex. Posting some preliminary thoughts below, nothing concrete yet.

We think that the store-level throttling likely happens a bit early -- it seems like the ideal throttling point would be right as the request is about to acquire latches in the concurrency manager. Requests can linger for a long time between traversing the store and being evaluated, e.g. because they're waiting for latches. Throttling any later than that basically means we'd have to choose between head-of-line blocking (because we've acquired latches), reordering problems (because we've already evaluated the request), or request re-evaluation (to handle reordering).

We also felt like a distributed throttling scheme to take into account the health of follower stores would likely be too complex (and possibly more in the purview of the allocator, which should move replicas off of hot nodes). On followers, we can often defer command application as long as commands get committed to the log, which would allow us to do some level of local Raft throttling. But we would have to do this throttling without holding onto a scheduler goroutine, by returning early and backing off.
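
Not a design, just a rough sketch of the "return early and back off" idea for local apply-time deferral on followers; the scheduler shape and all names below are invented.

package kvserver

import "time"

type rangeID int64

// raftWorkScheduler re-enqueues ranges for later processing instead of
// parking a worker goroutine while the local store is overloaded.
type raftWorkScheduler struct {
	enqueueAfter func(rangeID, time.Duration) // hypothetical re-scheduling hook
}

// maybeApplyCommittedEntries applies committed entries unless the local store
// is overloaded, in which case it yields the worker and retries the range
// later. The entries remain committed in the log either way.
func (s *raftWorkScheduler) maybeApplyCommittedEntries(
	id rangeID, storeOverloaded func() bool, apply func(),
) {
	if storeOverloaded() {
		// Return early rather than sleeping: this frees the scheduler worker
		// for other ranges instead of blocking it while the store recovers.
		s.enqueueAfter(id, 100*time.Millisecond)
		return
	}
	apply()
}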

We'll also need to keep other throttling mechanisms in mind here, e.g. admission control and the proposal quota pool.

erikgrinaker (Contributor) commented

Note to self: if we were to do store-level throttling based on the state of other replica stores, we should consider using the existing store gossip mechanism:

// Gossip store descriptor.
return s.cfg.Gossip.AddInfoProto(gossipStoreKey, storeDesc, gossip.StoreTTL)

// StoreDescriptor holds store information including store attributes, node
// descriptor and store capacity.
message StoreDescriptor {
  optional int32 store_id = 1 [(gogoproto.nullable) = false,
    (gogoproto.customname) = "StoreID", (gogoproto.casttype) = "StoreID"];
  optional Attributes attrs = 2 [(gogoproto.nullable) = false];
  optional NodeDescriptor node = 3 [(gogoproto.nullable) = false];
  optional StoreCapacity capacity = 4 [(gogoproto.nullable) = false];
  optional StoreProperties properties = 5 [(gogoproto.nullable) = false];
}

Remote store info is maintained in StorePool (primarily for use by the allocator/rebalancer) by subscribing to gossip updates:

storeRegex := gossip.MakePrefixPattern(gossip.KeyStorePrefix)
g.RegisterCallback(storeRegex, sp.storeGossipUpdate, gossip.Redundant)
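
A hypothetical sketch (not actual CockroachDB code) of subscribing to that same store gossip and caching each store's gossiped capacity, so Store.Send could consult it before admitting bulk writes. The gossip calls mirror the snippets above; the callback signature and the bookkeeping are assumptions.

package kvserver

import (
	"github.com/cockroachdb/cockroach/pkg/gossip"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/syncutil"
)

// remoteStoreHealth caches the last gossiped capacity per store.
type remoteStoreHealth struct {
	mu         syncutil.Mutex
	capacities map[roachpb.StoreID]roachpb.StoreCapacity
}

func newRemoteStoreHealth() *remoteStoreHealth {
	return &remoteStoreHealth{capacities: map[roachpb.StoreID]roachpb.StoreCapacity{}}
}

// registerGossipCallback subscribes to store descriptor gossip, following the
// StorePool pattern quoted above.
func (h *remoteStoreHealth) registerGossipCallback(g *gossip.Gossip) {
	storeRegex := gossip.MakePrefixPattern(gossip.KeyStorePrefix)
	g.RegisterCallback(storeRegex, func(_ string, content roachpb.Value) {
		var desc roachpb.StoreDescriptor
		if err := content.GetProto(&desc); err != nil {
			return // ignore malformed gossip entries
		}
		h.mu.Lock()
		defer h.mu.Unlock()
		h.capacities[desc.StoreID] = desc.Capacity
	}, gossip.Redundant)
}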

erikgrinaker (Contributor) commented

#74254 is related.

erikgrinaker (Contributor) commented

Store read amp is gossiped as of #77040.

@tbg tbg changed the title bulk, kv: improve throttling for background write operations to LSM store by using store state of all replicas kvserver: remove below-raft throttling May 31, 2022
tbg (Member) commented May 31, 2022

Updating this issue to link it properly into #79215.

tbg (Member) commented Jun 2, 2022

#75066 (comment) is also concerned with removing below-raft throttling, so closing this one.

@tbg tbg closed this as completed Jun 2, 2022
Replication automation moved this from 22.2 to Done Jun 2, 2022
Storage automation moved this from Incoming to Done Jun 2, 2022
tbg added a commit to tbg/cockroach that referenced this issue Mar 16, 2023
**This is for 23.2 only**

We shouldn't artificially delay SST ingestion below raft because this
exacerbates memory pressure (cockroachdb#81834).

Anecdotally I see in my "typical experiment" (restores under I/O
bandwidth constraints) that `PreIngestDelay` is mostly fairly small
compared to the slowness that comes from write bandwidth overload
itself, so at least in those experiments removing this has little
to no effect.

As we are also working on replication admission control[^1] and are
looking into improving the memory footprint under unchecked overload[^2],
now is a good time to rip this out, as we'll be in a good place to
address any detrimental fallout from doing so.

[^1]: cockroachdb#95563
[^2]: cockroachdb#98576

Touches cockroachdb#81834.
Touches cockroachdb#57247.

Epic: none
Release note: None
@tbg tbg reopened this Mar 16, 2023
@tbg tbg self-assigned this Mar 16, 2023
tbg added a commit to tbg/cockroach that referenced this issue Apr 14, 2023
@craig craig bot closed this as completed in 479ee62 Apr 18, 2023