*: implement replication admission control #95563

Closed
10 tasks done
irfansharif opened this issue Jan 19, 2023 · 0 comments
irfansharif commented Jan 19, 2023

This is the tracking issue to merge the prototype for replication admission control: #93102. Internal experiments (see #admission-control) demonstrate its ability to provide throughput isolation in the face of large index backfills, where none exists today. The motivating issues are #82556 and #85641. The design doc for this work can be found internally. We expect the work here to break down into the following PRs (-ish, and in no particular order):

  • raftlog: introduce EntryEncoding{Standard,Sideloaded}WithAC #95748. This includes changes to raft encodings, needed for the protocol changes described next.
  • kvflowcontrol,raftlog: interfaces for replication control #95637. Protocol changes for data sent back and forth over the raft transport, tying flow token deductions to specific raft log positions (term+index). Includes various interfaces.
    • See kvflowcontrol/doc.go in PR above. Tech-note/large overview comment for replication admission control.
  • kvflowcontroller: implement kvflowcontrol.Controller #95905. Implement kvflowcontrol.Controller: a per-node flow token bucket, internally segmented by work class, tenant, and store it's controlling write traffic to.
    • See pkg/util/asciitsdb and pkg/kv/kvserver/kvflowcontrol/kvflowcontroller:simulator in PR above. Write a simulator for the flow tokens component, showing how throughput varies across each unit as tokens flow back, etc. Do it for multiple tenants.
  • kvflowhandle: implement kvflowcontrol.Handle #96642. Implement kvflowcontrol.Handle: Per-replica tracker for flow token deductions. The lifecycle of this handle is tied to a leaseholder replica also being the raft leader.
  • kvflowcontrol: implement kvflowcontrol.Dispatch #97766. Implement kvflowcontrol.Dispatch: Message box used to dispatch information about admitted raft log entries to specific nodes.
  • admission: support non-blocking {Store,}WorkQueue.Admit() #97599. Implement kvadmission.AdmitRaftEntry: Change admission.{Store,}WorkQueue to support logical admission/enqueuing of virtual work items, an "async admit" interface.
  • kvserver,kvflowcontrol: integrate flow control #98308. Integrate various components end-to-end, and add cluster settings to disable replication admission control entirely, or just for regular requests.
    • [ ] Also support a mode where we do end-to-end flow control token tracking but don't actually block at admit time due to a lack of requisite flow tokens. It'll let us look at production systems and understand whether we are losing performance isolation due to a lack of flow control (see the sketch after this list).
    • Support and add tests to make sure we don't leak flow tokens, or return them repeatedly, in the face of node failures, gRPC streams breaking (including intermittently), reproposals, snapshots, log truncations, splits, merges, lease transfers, leadership transfers, raft membership changing, follower pausing, prolonged leaseholder != leader, etc.
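
As a purely illustrative Go sketch of the tracking-but-non-blocking mode mentioned above (none of these names are actual CockroachDB APIs; it only shows the idea of deducting flow tokens without waiting on them):

```go
// Illustrative sketch only: a hypothetical admit path that always deducts and
// tracks flow tokens, but only waits on their availability when blocking is
// enabled. All names here are made up for the example.
package main

import "fmt"

type mode int

const (
	trackOnly     mode = iota // deduct and track tokens, but never wait
	trackAndBlock             // deduct, track, and wait for tokens first
)

type tokenBucket struct{ available int64 }

func (b *tokenBucket) hasTokens() bool { return b.available > 0 }

// admit deducts tokens for a proposal of the given size. In trackOnly mode it
// never blocks, so we can observe (via negative token counts) where blocking
// would have kicked in, without paying the cost of actually waiting.
func admit(b *tokenBucket, m mode, size int64) {
	if m == trackAndBlock && !b.hasTokens() {
		fmt.Println("waiting for flow tokens (elided)") // real code would block here
	}
	b.available -= size
}

func main() {
	b := &tokenBucket{available: 1 << 20} // 1 MiB of tokens
	admit(b, trackOnly, 4<<20)            // over-deducts without blocking
	fmt.Println(b.available)              // -3145728: a blocking mode would have throttled here
}
```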

The remaining "rollout" steps (and the validation needed) are being tracked in #98703.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-16542

Jira issue: CRDB-23589

Epic CRDB-25348

irfansharif added the C-enhancement and A-admission-control labels Jan 19, 2023
irfansharif self-assigned this Jan 19, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 23, 2023
Follower replication work, today, is not subject to admission control.
It consumes IO tokens without waiting, which both (i) does not prevent
the LSM from being inverted, and (ii) can cause priority inversion where
low-pri follower write work ends up causing IO token exhaustion, which
in turn causes throughput and latency impact for high-pri non-follower
write work on that same store. This latter behavior was especially
noticeable with large index backfills (cockroachdb#82556) where >2/3rds of write
traffic on stores could be follower work for large AddSSTs, causing IO
token exhaustion for regular write work being proposed on those stores.

We last looked at this problem as part of cockroachdb#79215, settling on cockroachdb#83851
which pauses replication traffic to stores close to exceeding their IO
overload threshold (data that's periodically gossiped). In large index
backfill experiments we found this to help slightly, but it's still a
coarse and imperfect solution -- we're deliberately causing
under-replication instead of being able to shape the rate of incoming
writes for low-pri work closer to the origin.

As part of cockroachdb#95563 we're introducing machinery for "replication admission
control" -- end-to-end flow control for replication traffic. With it we
expect to no longer need to bypass follower write work in admission
control and solve the issues mentioned above. Some small degree of
familiarity with the design is assumed below. In this first,
proto{col,buf}/interface-only PR, we introduce:

1. Package kvflowcontrol{,pb}, which will provide flow control for
   replication traffic in KV. It will be part of the integration layer
   between KV and admission control. In it we have a few central
   interfaces:

   - kvflowcontrol.Controller, held at the node level, which holds all
     kvflowcontrol.Tokens for each kvflowcontrol.Stream (one per store
     we're sending raft traffic to and tenant we're sending it for).
   - kvflowcontrol.Handle, which will be held at the replica level (only
     on replicas that are both leaseholder and raft leader), and will be
     used to interface with the node-level kvflowcontrol.Controller.
     When replicating log entries, these replicas choose the log
     position (term+index) the data is to end up at, and use this handle
     to track the token deductions on a per-log-position basis. Later,
     when freeing up tokens (after being informed that said log entries
     were admitted on the receiving end of the stream), we do so by
     specifying the log position up to which all deducted tokens are
     freed.

   type Controller interface {
     Admit(admissionpb.WorkPriority, ...Stream)
     DeductTokens(admissionpb.WorkPriority, Tokens, ...Stream)
     ReturnTokens(admissionpb.WorkPriority, Tokens, ...Stream)
   }

   type Handle interface {
     Admit(admissionpb.WorkPriority)
     DeductTokensFor(admissionpb.WorkPriority, kvflowcontrolpb.RaftLogPosition, Tokens)
     ReturnTokensUpto(admissionpb.WorkPriority, kvflowcontrolpb.RaftLogPosition)
     TrackLowWater(Stream, kvflowcontrolpb.RaftLogPosition)
     Close()
   }
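
   To make the per-log-position token tracking described above concrete,
   here is a minimal, self-contained sketch; the types below are local
   simplifications that merely mirror the interface names, not the actual
   kvflowcontrol implementation:

```go
package main

import "fmt"

// RaftLogPosition mirrors kvflowcontrolpb.RaftLogPosition for this sketch.
type RaftLogPosition struct{ Term, Index uint64 }

func (p RaftLogPosition) lessEq(o RaftLogPosition) bool {
	return p.Term < o.Term || (p.Term == o.Term && p.Index <= o.Index)
}

type deduction struct {
	pos    RaftLogPosition
	tokens int64
}

// handle is a toy stand-in for kvflowcontrol.Handle: it tracks deductions in
// log-position order and returns them "up to" a given position.
type handle struct {
	deductions []deduction
	available  int64
}

func (h *handle) DeductTokensFor(pos RaftLogPosition, t int64) {
	h.available -= t
	h.deductions = append(h.deductions, deduction{pos, t})
}

// ReturnTokensUpto frees every deduction at or below pos, which is how tokens
// are returned once the receiving store admits the corresponding entries.
func (h *handle) ReturnTokensUpto(pos RaftLogPosition) {
	var remaining []deduction
	for _, d := range h.deductions {
		if d.pos.lessEq(pos) {
			h.available += d.tokens
		} else {
			remaining = append(remaining, d)
		}
	}
	h.deductions = remaining
}

func main() {
	h := &handle{available: 16 << 20} // say, 16 MiB of flow tokens
	h.DeductTokensFor(RaftLogPosition{Term: 5, Index: 10}, 1<<20)
	h.DeductTokensFor(RaftLogPosition{Term: 5, Index: 11}, 2<<20)
	h.ReturnTokensUpto(RaftLogPosition{Term: 5, Index: 10})
	fmt.Println(h.available) // 14680064: only the first deduction was freed
}
```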

2. kvflowcontrolpb.RaftAdmissionMeta and relevant encoding/decoding
   routines. RaftAdmissionMeta is 'embedded' within a
   kvserverpb.RaftCommand, and includes necessary AC metadata on a per
   raft entry basis. Entries that contain this metadata will make use of
   the AC-specific raft log entry encodings described earlier. The AC
   metadata is decoded below-raft when looking to admit the write work.
   Also included is the node where this command originated, which wants to
   eventually learn of this command's admission.

   message RaftAdmissionMeta {
     int32 admission_priority = ...;
     int64 admission_create_time = ...;
     int32 admission_origin_node = ...;
   }

3. kvflowcontrolpb.AdmittedRaftLogEntries, which now features in
   kvserverpb.RaftMessageRequest, the unit of what's sent
   back-and-forth between two nodes over their two uni-directional raft
   transport streams. AdmittedRaftLogEntries, just like raft
   heartbeats, is coalesced information about all raft log entries that
   were admitted below raft. We'll use the origin node encoded in the
   raft entry (admission_origin_node from above) to know where to
   send these. This information is used on the origin node to release
   flow tokens that were acquired when replicating the original log
   entries.

   message AdmittedRaftLogEntries {
     int64 range_id = ...;
     int32 admission_priority = ...;
     RaftLogPosition up_to_raft_log_position = ...;
     uint64 store_id = ...;
   }

   message RaftLogPosition {
     uint64 term = ...;
     uint64 index = ...;
   }

4. kvflowcontrol.Dispatch, which is used to dispatch information about
   admitted raft log entries (see AdmittedRaftLogEntries from above) to
   specific nodes where (i) said entries originated, (ii) flow tokens
   were deducted and (iii) are waiting to be returned. The interface is
   also used to read pending dispatches, which will be used in the raft
   transport layer when looking to piggyback information on traffic
   already bound to specific nodes. Since timely dispatching (read:
   piggybacking) is not guaranteed, we allow querying for all
   long-overdue dispatches. The interface looks roughly like:

   type Dispatch interface {
     Dispatch(roachpb.NodeID, kvflowcontrolpb.AdmittedRaftLogEntries)
     PendingDispatch() []roachpb.NodeID
     PendingDispatchFor(roachpb.NodeID) []kvflowcontrolpb.AdmittedRaftLogEntries
   }
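
   A rough, self-contained sketch of the piggybacking idea (local
   stand-in types, not the real Dispatch): notifications about admitted
   entries are buffered per destination node and drained whenever raft
   traffic is already headed that way:

```go
package main

import "fmt"

type NodeID int

// AdmittedRaftLogEntries is a toy stand-in for the proto message above: just
// enough to identify what was admitted and up to where.
type AdmittedRaftLogEntries struct {
	RangeID          int64
	UpToRaftLogIndex uint64
}

// dispatch is a toy stand-in for kvflowcontrol.Dispatch: a per-node mailbox of
// admitted-entry notifications awaiting piggybacking.
type dispatch struct {
	pending map[NodeID][]AdmittedRaftLogEntries
}

func (d *dispatch) Dispatch(n NodeID, e AdmittedRaftLogEntries) {
	d.pending[n] = append(d.pending[n], e)
}

// PendingDispatchFor drains and returns everything buffered for a node.
func (d *dispatch) PendingDispatchFor(n NodeID) []AdmittedRaftLogEntries {
	entries := d.pending[n]
	delete(d.pending, n)
	return entries
}

// sendRaftMessage sketches the transport-level integration: right before
// sending raft traffic to a node, attach whatever notifications are pending
// for it, returning flow tokens on traffic we were sending anyway.
func sendRaftMessage(d *dispatch, to NodeID, msg string) {
	piggybacked := d.PendingDispatchFor(to)
	fmt.Printf("to n%d: %s (+%d admitted-entries notifications)\n", to, msg, len(piggybacked))
}

func main() {
	d := &dispatch{pending: map[NodeID][]AdmittedRaftLogEntries{}}
	d.Dispatch(1, AdmittedRaftLogEntries{RangeID: 7, UpToRaftLogIndex: 42})
	sendRaftMessage(d, 1, "raft message") // piggybacks the pending notification
	sendRaftMessage(d, 1, "raft message") // nothing pending anymore
}
```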

5. Two new encodings for raft log entries,
   EntryEncoding{Standard,Sideloaded}WithAC. Raft log entries have a
   prefix byte that informs decoding routines how to interpret the
   subsequent bytes. To date we've had two,
   EntryEncoding{Standard,Sideloaded} (now renamed to
   EntryEncoding{Standard,Sideloaded}WithoutAC), to indicate whether
   the entry came with sideloaded data (these are typically AddSSTs, the
   storage for which is treated differently for performance). Our two
   additions here will be used to indicate whether the particular entry
   is subject to replication admission control. If so, right as we
   persist entries into the raft log storage, we'll admit the work
   without blocking.
   - We'll come back to this non-blocking admission in the
     AdmitRaftEntry section below, even though the implementation is
     left for a future PR.
   - The decision to use replication admission control happens above
     raft, and AC-specific metadata is plumbed down as part of the
     marshaled raft command, as described for RaftAdmissionMeta above.

6. An unused version gate (V23_1UseEncodingWithBelowRaftAdmissionData)
   to use replication admission control. Since we're using a different
   prefix byte for raft commands (see EntryEncodings above), one not
   recognized in earlier CRDB versions, we need explicit versioning.

7. AdmitRaftEntry, on the kvadmission.Controller interface. We'll
   use this as the integration point for log entries received below
   raft, right as they're being written to storage. This will be
   non-blocking since we'll be below raft in the raft.Ready() loop,
   and will effectively enqueue a "virtual" work item in the underlying
   StoreWorkQueue mediating store IO. This virtual work item is what
   later gets dequeued once the store granter informs the work queue of
   newly available IO tokens. For standard work queue ordering, our work
   item needs to include the create time and admission priority. The tenant
   ID is plumbed to find the right tenant heap to queue it under (for
   inter-tenant isolation); the store ID to find the right store work
   queue on multi-store nodes. The raftpb.Entry encodes within it its
   origin node (see RaftAdmissionMeta above), which is used
   post-admission to inform the right node of said admission. It looks
   like:

   // AdmitRaftEntry informs admission control of a raft log entry being
   // written to storage.
   AdmitRaftEntry(roachpb.TenantID, roachpb.StoreID, roachpb.RangeID, raftpb.Entry)
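
   The non-blocking admission in point 7 can be sketched with a toy
   "virtual work item" queue (a simplification, not the real
   StoreWorkQueue): enqueueing returns immediately, and a callback fires
   later, once the granter hands out IO tokens, which is where the
   token-return dispatch to the origin node would hook in:

```go
package main

import "fmt"

// virtualWorkItem is a toy stand-in for what a store work queue would enqueue
// for a replicated entry: enough metadata to order it (create time, priority)
// and to notify the origin node once it's admitted.
type virtualWorkItem struct {
	createTime int64
	priority   int8
	originNode int
	onAdmitted func()
}

// asyncQueue admits work logically: Admit never blocks, and items are only
// "granted" later, when grant() is called as IO tokens become available.
type asyncQueue struct {
	items []virtualWorkItem
}

func (q *asyncQueue) Admit(w virtualWorkItem) {
	// Non-blocking: just record the virtual work item.
	q.items = append(q.items, w)
}

func (q *asyncQueue) grant(n int) {
	// Called when the granter hands out tokens; admit up to n items in order.
	for i := 0; i < n && len(q.items) > 0; i++ {
		w := q.items[0]
		q.items = q.items[1:]
		w.onAdmitted() // e.g. dispatch AdmittedRaftLogEntries to w.originNode
	}
}

func main() {
	q := &asyncQueue{}
	q.Admit(virtualWorkItem{createTime: 1, priority: 0, originNode: 1,
		onAdmitted: func() { fmt.Println("notify n1: entry admitted") }})
	fmt.Println("raft.Ready() handling continues without blocking")
	q.grant(1) // later, when IO tokens free up
}
```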
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 24, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 24, 2023
Part of cockroachdb#95563. Predecessor to cockroachdb#95637.

This commit introduces two new encodings for raft log entries,
EntryEncoding{Standard,Sideloaded}WithAC. Raft log entries have a prefix
byte that informs decoding routines how to interpret the subsequent
bytes. To date we've had two, EntryEncoding{Standard,Sideloaded}[^1], to
indicate whether the entry came with sideloaded data[^2]. Our two
additions here will be used to indicate whether the particular entry is
subject to replication admission control. If so, right as we persist
entries into the raft log storage, we'll "admit the work without
blocking", which is further explained in cockroachdb#95637.

The decision to use replication admission control happens above raft, on
a per-entry basis. If using replication admission control, AC-specific
metadata will be plumbed down as part of the marshaled raft command.
This too is explained in cockroachdb#95637, specifically the
'RaftAdmissionMeta' section. This commit then adds an unused version
gate (V23_1UseEncodingWithBelowRaftAdmissionData) to use replication
admission control. Since we're using a different prefix byte for raft
commands (see EntryEncodings above), one not recognized in earlier CRDB
versions, we need explicit versioning. We add it out of development
convenience -- adding version gates is most prone to merge conflicts. We
expect to use it shortly, before alpha/beta cuts.

[^1]: Now renamed to EntryEncoding{Standard,Sideloaded}WithoutAC.
[^2]: These are typically AddSSTs, the storage for which is treated
      differently for performance reasons.

Release note: None
irfansharif changed the title from "*: replication admission control" to "*: productionize replication admission control" Jan 24, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 25, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 25, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 30, 2023
Part of cockroachdb#95563. This PR integrates various kvflowcontrol components into
the critical path for replication traffic. It does so by introducing two
"integration interfaces" in the kvserver package to intercept various
points of a replica's lifecycle, using them to manage the underlying
replication streams and flow tokens. The integration is mediated through
two cluster settings:

  - kvadmission.flow_control.enabled
      This is a top-level kill-switch to revert to pre-kvflowcontrol
      behavior where follower writes unilaterally deducted IO tokens
      without blocking.

  - kvadmission.flow_control.mode
      It can take on one of two settings, each exercising the flow
      control machinery to varying degrees.

      - apply_to_elastic
          Only applies admission delays to elastic traffic.

      - apply_to_all
          Applies admission delays to {regular,elastic} traffic.

When the mode is changed, we simply admit all waiting requests. This
risks possibly over-admitting work, but that's ok -- we assume these
mode changes are rare events and done under supervision. These
settings are hooked into the kvadmission and kvflowcontroller
packages (a sketch of how these settings might gate admission appears at
the end of this message). As for the actual integration interfaces in
kvserver, they
are:

  - replicaFlowControlIntegration: used to integrate with replication
    flow control. It intercepts various points in a replica's
    lifecycle, like acquiring or losing raft leadership, or its
    raft membership changing, etc. Accessing it requires Replica.mu to
    be held, exclusively (this is asserted on in the canonical
    implementation).

      type replicaFlowControlIntegration interface {
        handle() (kvflowcontrol.Handle, bool)
        onBecameLeader(context.Context)
        onBecameFollower(context.Context)
        onDescChanged(context.Context)
        onFollowersPaused(context.Context)
        onReplicaDestroyed(context.Context)
        onProposalQuotaUpdated(context.Context)
      }

  - replicaForFlowControl abstracts the interface of an individual
    Replica, as needed by replicaFlowControlIntegration.

      type replicaForFlowControl interface {
        assertLocked()
        annotateCtx(context.Context) context.Context
        getTenantID() roachpb.TenantID
        getReplicaID() roachpb.ReplicaID
        getRangeID() roachpb.RangeID
        getDescriptor() *roachpb.RangeDescriptor
        pausedFollowers() map[roachpb.ReplicaID]struct{}
        isFollowerActive(context.Context, roachpb.ReplicaID) bool
        appliedLogPosition() kvflowcontrolpb.RaftLogPosition
        withReplicaProgress(f func(roachpb.ReplicaID, rafttracker.Progress))
      }

Release note: None
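
To make the settings above concrete, here is a minimal sketch (not the actual kvadmission code; names and types are local simplifications) of how the kill-switch and mode might decide whether a request waits for flow tokens:

```go
package main

import "fmt"

type workClass int

const (
	regularWorkClass workClass = iota
	elasticWorkClass
)

type flowControlMode int

const (
	applyToElastic flowControlMode = iota // only elastic traffic waits for flow tokens
	applyToAll                            // both regular and elastic traffic wait
)

// shouldWaitForFlowTokens mirrors the kill-switch + mode behavior described
// above: if flow control is disabled we never wait; otherwise the mode
// decides which work classes are subject to admission delays.
func shouldWaitForFlowTokens(enabled bool, mode flowControlMode, wc workClass) bool {
	if !enabled {
		return false
	}
	switch mode {
	case applyToAll:
		return true
	case applyToElastic:
		return wc == elasticWorkClass
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldWaitForFlowTokens(true, applyToElastic, regularWorkClass)) // false
	fmt.Println(shouldWaitForFlowTokens(true, applyToAll, regularWorkClass))     // true
	fmt.Println(shouldWaitForFlowTokens(false, applyToAll, elasticWorkClass))    // false
}
```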
irfansharif added a commit to irfansharif/cockroach that referenced this issue May 7, 2023
craig bot pushed a commit that referenced this issue May 11, 2023
101786: workload: introduce timeout for pre-warming connection pool r=sean- a=sean-

Interrupting target instances during prewarming shouldn't cause workload to proceed: introduce a timeout when prewarming connections.  Connections will have between 15s and 5min to warm up before the context expires.

Epic: none

101987: cli/sql: new option autocerts for TLS client cert auto-discovery r=rafiss a=knz

Fixes #101986.

See the release note below.
An additional benefit not mentioned in the release note is that
it simplifies switching from one tenant to another when using
shared-process multitenancy. For example, this becomes possible:

```
> CREATE TENANT foo;
> ALTER TENANT foo START SERVICE SHARED;
> \c cluster:foo root - - autocerts
```

Alternatively, this can also be used to quickly switch from a non-root
user in an app tenant to the root user in the system tenant:
```
> \c cluster:system root - - autocerts
```

This works because (currently) all tenant servers running side-by-side
use the same TLS CA to validate SQL client certs.

----

Release note (cli change): The `\connect` client-side command for the
SQL shell (included in `cockroach sql`, `cockroach demo`,
`cockroach-sql`) now recognizes an option `autocerts` as last
argument.

When provided, `\c` will now try to discover a TLS client
certificate and key in the same directory(ies) as used by the previous
connection URL.

This feature makes it easier to switch usernames when
TLS client/key files are available for both the previous and the new
username.

102382: c2c: deflake c2c/shutdown roachtests r=stevendanna a=msbutler

   c2c: deflake c2c/shutdown roachtests

    This patch addresses two roachtest failure modes:
    - Prevents roachtest failure if a query fails during a node shutdown.

    - Prevents the src cluster from returning a single node topology, which could
      cause the stream ingestion job to hang if the participating src node gets
    shut down. Longer term, automatic replanning will prevent this. Specifically,
    this patch changes the kv workload to split and scatter the kv table across the
    cluster before the c2c job begins.

    Fixes #101898
    Fixes #102111

    This patch also makes it easier to reproduce c2c roachtest failures by plumbing
    a random seed to several components of the roachtest driver.

    Release note: None


    c2c: rename completeStreamIngestion to applyCutoverTime

    Release note: none


    workload: add --scatter flag to kv workload

    The user can now run `./workload init kv --scatter ....` which scatters the kv
    table across the cluster after the initial data load. This flag is best used
    with `--splits`, `--max-block-bytes`, and `--insert-count`.

    Epic: none

    Release note: none

102819: admission: move CreateTime-sequencing below-raft r=irfansharif a=irfansharif

These are already reviewed commits from #98308. Part of #95563.

---

**admission: move CreateTime-sequencing below-raft**

We move kvflowsequencer.Sequencer and its use in kvflowhandle.Handle (above-raft) to admission.sequencer, now used by admission.StoreWorkQueue (below-raft). This variant appeared in an earlier revision of #97599 where we first introduced monotonically increasing CreateTimes for a given raft group.

In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll observe that it's quite difficult to sequence CreateTimes[^1] above raft. This is because these sequence numbers are encoded as part of the raft proposal[^2], and at encode-time, we don't actually know what log position the proposal is going to end up in. It's hard to explicitly guarantee that a proposal with log-position P1 will get encoded before another with log position P2, where P1 < P2.

Naively sequencing CreateTimes at proposal-encode-time could result in over-admission. This is because of how we return flow tokens -- up to some log index[^3], and how we use these sequence numbers in below-raft WorkQueues. If P2 ends up with a lower sequence number/CreateTime, it would get admitted first, and when returning flow tokens by log position, in specifying up-to-P2, we'll early return P1's flow tokens despite it not being admitted. So we'd over-admit at the sender. This is all within a <tenant,priority> pair (a small sequencer sketch follows the footnotes below).

[^1]: We use CreateTimes as "sequence numbers" in replication admission control. We want to assign each AC-queued work below-raft a "sequence number" for FIFO ordering within a <tenant,priority>. We ensure these timestamps are roughly monotonic with respect to log positions of replicated work by sequencing work in log position order.
[^2]: In kvflowcontrolpb.RaftAdmissionMeta.
[^3]: See kvflowcontrolpb.AdmittedRaftLogEntries.
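
The sequencing described above can be sketched as follows; this is a simplification of the idea (per raft group, hand back create times that only ever increase), not necessarily the exact admission.sequencer implementation:

```go
package main

import "fmt"

// sequencer is a sketch of the CreateTime-sequencing idea: per raft group,
// return create times that are monotonically increasing, bumping them
// minimally when an entry arrives with a create time at or below the last
// one handed out.
type sequencer struct {
	lastSeq int64 // last sequence number (a nanosecond "create time") returned
}

func (s *sequencer) sequence(createTime int64) int64 {
	if createTime <= s.lastSeq {
		createTime = s.lastSeq + 1
	}
	s.lastSeq = createTime
	return createTime
}

func main() {
	var s sequencer
	fmt.Println(s.sequence(100)) // 100
	fmt.Println(s.sequence(90))  // 101: bumped to preserve log-position ordering
	fmt.Println(s.sequence(200)) // 200
}
```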

---

**admission: add intercept points for when replicated work gets admitted**

In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll set up the return of flow tokens from the receiver node back to the sender once log entries get (asynchronously) admitted[^4]. So we need to intercept the exact points at which the virtually enqueued work items get admitted, since it all happens asynchronously[^5]. To that end we introduce the following interface:
```go
    // OnLogEntryAdmitted is used to observe the specific entries
    // (identified by rangeID + log position) that were admitted. Since
    // admission control for log entries is asynchronous/non-blocking,
    // this allows callers to do requisite post-admission
    // bookkeeping.
    type OnLogEntryAdmitted interface {
     AdmittedLogEntry(
       origin roachpb.NodeID, /* node where the entry originated */
       pri admissionpb.WorkPriority, /* admission priority of the entry */
       storeID roachpb.StoreID, /* store on which the entry was admitted */
       rangeID roachpb.RangeID, /* identifying range for the log entry */
       pos LogPosition, /* log position of the entry that was admitted*/
     )
    }
```
For now we pass in a no-op implementation in production code, but this will change shortly.

Seeing as how the asynchronous admit interface is going to be the primary one once we enable replication admission control by default, for IO control, we no longer need the storeWriteDone interfaces and corresponding types. They're being used by our current (and soon-to-be legacy) above-raft IO admission control to inform granters of when the write was actually done, post-admission. For above-raft IO control, at admit-time we do not have sizing info for the writes, so by intercepting these writes at write-done time we're able to make any outstanding token adjustments in the granter.

To reflect this new world, we:
- Rename setAdmittedDoneModels to setLinearModels.
- Introduce a storeReplicatedWorkAdmittedInfo[^6]. It provides information about the size of replicated work once it's admitted (which happens asynchronously from the work itself). This lets us use the underlying linear models for L0 {writes,ingests} to deduct an appropriate number of tokens from the granter, for the admitted work size[^7].
- Rename the granterWithStoreWriteDone interface to granterWithStoreReplicatedWorkAdmitted. We'll still intercept the actual point of admission for some token adjustments, through the storeReplicatedWorkAdmittedLocked API shown below. There are two callstacks through which this API gets invoked, one where the coord.mu is already held, and one where it isn't. We plumb this information through so the lock is acquired if not already held. The locking structure is unfortunate, but this was a minimally invasive diff.
```go
   storeReplicatedWorkAdmittedLocked(
    originalTokens int64,
    admittedInfo storeReplicatedWorkAdmittedInfo,
   ) (additionalTokens int64)
```
While here, we also export an admission.TestingReverseWorkPriorityDict. There are at least three tests that have re-invented the wheel.

[^4]: This will happen through the kvflowcontrol.Dispatch interface introduced back in #97766, after integrating it with the RaftTransport layer.
[^5]: Introduced in #97599, for replicated write work.
[^6]: Identical to the previous StoreWorkDoneInfo.
[^7]: There's a peculiarity here in that at enqueuing-time we actually know the size of the write, so we could have deducted the right number of tokens upfront and avoid this post-admit granter token adjustment. We inherit this structure from earlier, and just leave a TODO for now.
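
As a sketch of the token adjustment this enables (a toy linear model, not the granter's actual models or units), once the admitted work's size is known we can recompute its token cost and adjust by the difference from what was deducted up front:

```go
package main

import "fmt"

// linearModel is a toy mapping from logical write bytes to L0 write tokens,
// in the spirit of the admitted-work token adjustment described above.
type linearModel struct {
	multiplier float64
	constant   int64 // per-request constant overhead, in tokens
}

func (m linearModel) tokens(bytes int64) int64 {
	return int64(m.multiplier*float64(bytes)) + m.constant
}

// adjustment computes how many additional tokens to deduct (positive) or
// return (negative) once the admitted work's actual size is known, relative
// to what was deducted up front.
func adjustment(m linearModel, originalTokens, admittedBytes int64) int64 {
	return m.tokens(admittedBytes) - originalTokens
}

func main() {
	m := linearModel{multiplier: 1.5, constant: 1024}
	fmt.Println(adjustment(m, 4096, 8192)) // 9216: deduct more, the write was bigger than assumed
}
```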


103116: generate-logic-test: fix incorrect timeout in logictests template r=rickystewart a=healthy-pod

In #102719, we changed the way we set `-test.timeout` but didn't update the logictests template. This code change updates the template.

Release note: None
Epic: none

Co-authored-by: Sean Chittenden <sean@chittenden.org>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: healthy-pod <ahmad@cockroachlabs.com>
irfansharif added a commit to irfansharif/cockroach that referenced this issue May 31, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue May 31, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 6, 2023
Part of cockroachdb#95563. This PR integrates various kvflowcontrol components into
the critical path for replication traffic. It does so by introducing two
"integration interfaces" in the kvserver package to intercept various
points of a replica's lifecycle, using it to manage the underlying
replication streams and flow tokens. The integration is mediated through
two cluster settings:

  - kvadmission.flow_control.enabled
      This is a top-level kill-switch to revert to pre-kvflowcontrol
      behavior where follower writes unilaterally deducted IO tokens
      without blocking.

  - kvadmission.flow_control.mode
      It can take on one of three settings, each exercising the flow
      control machinery to varying degrees.

      - apply_to_elastic
          Only applies admission delays to elastic traffic.

      - apply_to_all
          Applies admission delays to {regular,elastic} traffic.

When the mode is changed, we simply admit all waiting requests. This
risks possibly over-admitting work, but that's ok -- we assume these
mode changes are rare events and done under supervision. These
settings are hooked into in the kvadmission and kvflowcontroller
packages. As for the actual integration interfaces in kvserver, they
are:

  - replicaFlowControlIntegration: used to integrate with replication
    flow control. It's intercepts various points in a replica's
    lifecycle, like it acquiring raft leadership or losing it, or its
    raft membership changing, etc. Accessing it requires Replica.mu to
    be held, exclusively (this is asserted on in the canonical
    implementation).

      type replicaFlowControlIntegration interface {
        handle() (kvflowcontrol.Handle, bool)
        onBecameLeader(context.Context)
        onBecameFollower(context.Context)
        onDescChanged(context.Context)
        onFollowersPaused(context.Context)
        onReplicaDestroyed(context.Context)
        onProposalQuotaUpdated(context.Context)
      }

  - replicaForFlowControl abstracts the interface of an individual
    Replica, as needed by replicaFlowControlIntegration.

      type replicaForFlowControl interface {
        assertLocked()
        annotateCtx(context.Context) context.Context
        getTenantID() roachpb.TenantID
        getReplicaID() roachpb.ReplicaID
        getRangeID() roachpb.RangeID
        getDescriptor() *roachpb.RangeDescriptor
        pausedFollowers() map[roachpb.ReplicaID]struct{}
        isFollowerActive(context.Context, roachpb.ReplicaID) bool
        appliedLogPosition() kvflowcontrolpb.RaftLogPosition
        withReplicaProgress(f func(roachpb.ReplicaID, rafttracker.Progress))
      }

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 8, 2023
Part of cockroachdb#95563. This PR integrates various kvflowcontrol components into
the critical path for replication traffic. It does so by introducing two
"integration interfaces" in the kvserver package to intercept various
points of a replica's lifecycle, using it to manage the underlying
replication streams and flow tokens. The integration is mediated through
two cluster settings:

  - kvadmission.flow_control.enabled
      This is a top-level kill-switch to revert to pre-kvflowcontrol
      behavior where follower writes unilaterally deducted IO tokens
      without blocking.

  - kvadmission.flow_control.mode
      It can take on one of two settings, each exercising the flow
      control machinery to varying degrees.

      - apply_to_elastic
          Only applies admission delays to elastic traffic.

      - apply_to_all
          Applies admission delays to {regular,elastic} traffic.

When the mode is changed, we simply admit all waiting requests. This
risks possibly over-admitting work, but that's ok -- we assume these
mode changes are rare events and done under supervision. These
settings are hooked into in the kvadmission and kvflowcontroller
packages. As for the actual integration interfaces in kvserver, they
are:

  - replicaFlowControlIntegration: used to integrate with replication
    flow control. It's intercepts various points in a replica's
    lifecycle, like it acquiring raft leadership or losing it, or its
    raft membership changing, etc. Accessing it requires Replica.mu to
    be held, exclusively (this is asserted on in the canonical
    implementation).

      type replicaFlowControlIntegration interface {
        handle() (kvflowcontrol.Handle, bool)
        onBecameLeader(context.Context)
        onBecameFollower(context.Context)
        onDescChanged(context.Context)
        onFollowersPaused(context.Context)
        onReplicaDestroyed(context.Context)
        onProposalQuotaUpdated(context.Context)
      }

  - replicaForFlowControl abstracts the interface of an individual
    Replica, as needed by replicaFlowControlIntegration.

      type replicaForFlowControl interface {
        assertLocked()
        annotateCtx(context.Context) context.Context
        getTenantID() roachpb.TenantID
        getReplicaID() roachpb.ReplicaID
        getRangeID() roachpb.RangeID
        getDescriptor() *roachpb.RangeDescriptor
        pausedFollowers() map[roachpb.ReplicaID]struct{}
        isFollowerActive(context.Context, roachpb.ReplicaID) bool
        appliedLogPosition() kvflowcontrolpb.RaftLogPosition
        withReplicaProgress(f func(roachpb.ReplicaID, rafttracker.Progress))
      }

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 9, 2023
Part of cockroachdb#95563. This PR integrates various kvflowcontrol components into
the critical path for replication traffic. It does so by introducing two
"integration interfaces" in the kvserver package to intercept various
points of a replica's lifecycle, using it to manage the underlying
replication streams and flow tokens. The integration is mediated through
two cluster settings:

  - kvadmission.flow_control.enabled
      This is a top-level kill-switch to revert to pre-kvflowcontrol
      behavior where follower writes unilaterally deducted IO tokens
      without blocking.

  - kvadmission.flow_control.mode
      It can take on one of two settings, each exercising the flow
      control machinery to varying degrees.

      - apply_to_elastic
          Only applies admission delays to elastic traffic.

      - apply_to_all
          Applies admission delays to {regular,elastic} traffic.

When the mode is changed, we simply admit all waiting requests. This
risks possibly over-admitting work, but that's ok -- we assume these
mode changes are rare events and done under supervision. These
settings are hooked into in the kvadmission and kvflowcontroller
packages. As for the actual integration interfaces in kvserver, they
are:

  - replicaFlowControlIntegration: used to integrate with replication
    flow control. It's intercepts various points in a replica's
    lifecycle, like it acquiring raft leadership or losing it, or its
    raft membership changing, etc. Accessing it requires Replica.mu to
    be held, exclusively (this is asserted on in the canonical
    implementation).

      type replicaFlowControlIntegration interface {
        handle() (kvflowcontrol.Handle, bool)
        onBecameLeader(context.Context)
        onBecameFollower(context.Context)
        onDescChanged(context.Context)
        onFollowersPaused(context.Context)
        onReplicaDestroyed(context.Context)
        onProposalQuotaUpdated(context.Context)
      }

  - replicaForFlowControl abstracts the interface of an individual
    Replica, as needed by replicaFlowControlIntegration.

      type replicaForFlowControl interface {
        assertLocked()
        annotateCtx(context.Context) context.Context
        getTenantID() roachpb.TenantID
        getReplicaID() roachpb.ReplicaID
        getRangeID() roachpb.RangeID
        getDescriptor() *roachpb.RangeDescriptor
        pausedFollowers() map[roachpb.ReplicaID]struct{}
        isFollowerActive(context.Context, roachpb.ReplicaID) bool
        appliedLogPosition() kvflowcontrolpb.RaftLogPosition
        withReplicaProgress(f func(roachpb.ReplicaID, rafttracker.Progress))
      }

Release note: None
craig bot pushed a commit that referenced this issue Jun 9, 2023
98308: kvserver,kvflowcontrol: integrate flow control r=irfansharif a=irfansharif

Part of #95563. See individual commits.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 14, 2023
Fixes cockroachdb#102683. Part of cockroachdb#104154.

These above-raft AddSST concurrency limiters were added way back in
cockroachdb#36403 and cockroachdb#73904, pre-dating much of IO
admission control for leaseholder writes. With cockroachdb#95563, we
now have IO admission control for follower writes. Put together, we
have ample LSM read-amp protection through AC alone. These concurrency
limiters are now redundant and oblivious to more sophisticated AC
measures. We recently removed the below-raft equivalents of these
limiters (cockroachdb#98762), and as mentioned there, these limiters
can exacerbate memory pressure. Separately, we're looking to work on
speedier restores, and these limiters are starting to get in the way.

While here, we also disable the pre-ingest delay mechanism in pebble,
which also pre-dates AC, introduced way back in cockroachdb#34258 for
RocksDB and in cockroachdb#41839 for Pebble. IO AC is able to limit the
number of L0 files, and this pre-ingest delay, with its maximum
per-request delay time of 5s, can be less than effective. It's worth
noting that the L0 file count threshold at which this pre-ingest delay
mechanism kicked in was 20, while AC aims for 1000[^1].

This commit doesn't go as far as removing these limiters outright,
merely disabling them. This is just out of an overabundance of caution.
We can probably remove them once kvflowcontrol.enabled has had >1
release worth of baking time. Until then, it's nice to know we have
these old safety hatches. We have ample time in the release to assess
fallout from this commit, and also use this increased AddSST concurrency
to stress the kvflowcontrol machinery.

[^1]: The 1000 file limit exists to bound how long it takes to clear L0
      completely. Envelope math cribbed from elsewhere: With 2MiB files,
      1000 files is ~2GB, which at 40MB/s of compaction throughput (with
      a compaction slot consistently dedicated to L0) takes < 60s to
      clear the backlog. So the 'recovery' time is modest in that
      operators should not need to take manual action.

Release note: None
craig bot pushed a commit that referenced this issue Jul 4, 2023
104861: kvserver: disable pre-AC above-raft AddSST throttling r=irfansharif a=irfansharif

Fixes #102683. Part of #104154.

These were added way back in #36403 and #73904, pre-dating much of IO admission control for leaseholder writes. With #95563, we now have IO admission control for follower writes. Put together, we have ample LSM read-amp protection through AC alone. These concurrency limiters are now redundant and oblivious to more sophisticated AC measures. We recently removed the below-raft equivalents of these limiters (#98762), and as mentioned there, these limiters can exacerbate memory pressure. Separately, we're looking to work on speedier restores, and these limiters are starting to get in the way.

While here, we also disable the pre-ingest delay mechanism in pebble, which also pre-dates AC, introduced way back in #34258 for RocksDB and in #41839 for Pebble. IO AC is able to limit the number of L0 files, and this pre-ingest delay, with its maximum per-request delay time of 5s, can be less than effective. It's worth noting that the L0 file count threshold at which this pre-ingest delay mechanism kicked in was 20, while AC aims for 1000[^1].

This commit doesn't go as far as removing these limiters outright, merely disabling them. This is just out of an overabundance of caution. We can probably remove them once kvflowcontrol.enabled has had >1 release worth of baking time. Until then, it's nice to know we have these old safety hatches. We have ample time in the release to assess fallout from this commit, and also use this increased AddSST concurrency to stress the kvflowcontrol machinery.

[^1]: The 1000 file limit exists to bound how long it takes to clear L0 completely. Envelope math cribbed from elsewhere: With 2MiB files, 1000 files is ~2GB, which at 40MB/s of compaction throughput (with a compaction slot consistently dedicated to L0) takes < 60s to clear the backlog. So the 'recovery' time is modest in that operators should not need to take manual action.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>