*: implement replication admission control #95563

Closed
10 tasks done
irfansharif opened this issue Jan 19, 2023 · 0 comments
irfansharif commented Jan 19, 2023

This is the tracking issue to merge the prototype for replication admission control: #93102. Internal experiments (see #admission-control) demonstrate its ability to provide throughput isolation in the face of large index backfills, where none exists today. The motivating issues are #82556 and #85641. The design doc for this work can be found internally. We expect the work here to break down into the following PRs (-ish, and in no particular order):

  • raftlog: introduce EntryEncoding{Standard,Sideloaded}WithAC #95748. This includes changes to raft encodings, needed for the protocol changes described next.
  • kvflowcontrol,raftlog: interfaces for replication control #95637. Protocol changes for data sent back and forth over the raft transport, tying flow token deductions to specific raft log positions (term+index). Includes various interfaces.
    • See kvflowcontrol/doc.go in PR above. Tech-note/large overview comment for replication admission control.
  • kvflowcontroller: implement kvflowcontrol.Controller #95905. Implement kvflowcontrol.Controller: a per-node flow token bucket, internally segmented by work class, tenant, and store it's controlling write traffic to.
    • See pkg/util/asciitsdb and pkg/kv/kvserver/kvflowcontrol/kvflowcontroller:simulator in PR above. Write a simulator for the flow tokens component, showing how throughput varies across each unit as tokens flow back, etc. Do it for multiple tenants.
  • kvflowhandle: implement kvflowcontrol.Handle #96642. Implement kvflowcontrol.Handle: Per-replica tracker for flow token deductions. The lifecycle of this handle is tied to a leaseholder replica also being the raft leader.
  • kvflowcontrol: implement kvflowcontrol.Dispatch #97766. Implement kvflowcontrol.Dispatch: Message box used to dispatch information about admitted raft log entries to specific nodes.
  • admission: support non-blocking {Store,}WorkQueue.Admit() #97599. Implement kvadmission.AdmitRaftEntry: Change admission.{Store,}WorkQueue to support logical admission/enqueuing of virtual work items, an "async admit" interface.
  • kvserver,kvflowcontrol: integrate flow control #98308. Integrate various components end-to-end, and add cluster settings to disable replication admission control entirely, or just for regular requests.
    • [ ] Also support a mode where we do end-to-end flow control token tracking but don't actually block at admit time due to a lack of requisite flow tokens. It'll let us look at production systems and understand whether we are losing performance isolation due to a lack of flow control (see the sketch after this list).
    • Support and add tests to make sure we don't leak flow tokens, or return them repeatedly, in the face of node failures, gRPC streams breaking (including intermittently), reproposals, snapshots, log truncations, splits, merges, lease transfers, leadership transfers, raft membership changing, follower pausing, prolonged leaseholder != leader, etc.
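
As a purely illustrative Go sketch of the tracking-but-non-blocking mode mentioned above (none of these names are actual CockroachDB APIs; it only shows the idea of deducting flow tokens without waiting on them):

```go
// Illustrative sketch only: a hypothetical admit path that always deducts and
// tracks flow tokens, but only waits on their availability when blocking is
// enabled. All names here are made up for the example.
package main

import "fmt"

type mode int

const (
	trackOnly     mode = iota // deduct and track tokens, but never wait
	trackAndBlock             // deduct, track, and wait for tokens first
)

type tokenBucket struct{ available int64 }

func (b *tokenBucket) hasTokens() bool { return b.available > 0 }

// admit deducts tokens for a proposal of the given size. In trackOnly mode it
// never blocks, so we can observe (via negative token counts) where blocking
// would have kicked in, without paying the cost of actually waiting.
func admit(b *tokenBucket, m mode, size int64) {
	if m == trackAndBlock && !b.hasTokens() {
		fmt.Println("waiting for flow tokens (elided)") // real code would block here
	}
	b.available -= size
}

func main() {
	b := &tokenBucket{available: 1 << 20} // 1 MiB of tokens
	admit(b, trackOnly, 4<<20)            // over-deducts without blocking
	fmt.Println(b.available)              // -3145728: a blocking mode would have throttled here
}
```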

The remaining "rollout" steps (and the validation needed) are being tracked in #98703.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-16542

Jira issue: CRDB-23589

Epic CRDB-25348

irfansharif added the C-enhancement and A-admission-control labels Jan 19, 2023
irfansharif self-assigned this Jan 19, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 23, 2023
Follower replication work, today, is not subject to admission control.
It consumes IO tokens without waiting, which both (i) does not prevent
the LSM from being inverted, and (ii) can cause priority inversion where
low-pri follower write work ends up causing IO token exhaustion, which
in turn causes throughput and latency impact for high-pri non-follower
write work on that same store. This latter behavior was especially
noticeable with large index backfills (cockroachdb#82556) where >2/3rds of write
traffic on stores could be follower work for large AddSSTs, causing IO
token exhaustion for regular write work being proposed on those stores.

We last looked at this problem as part of cockroachdb#79215, settling on cockroachdb#83851
which pauses replication traffic to stores close to exceeding their IO
overload threshold (data that's periodically gossiped). In large index
backfill experiments we found this to help slightly, but it's still a
coarse and imperfect solution -- we're deliberately causing
under-replication instead of being able to shape the rate of incoming
writes for low-pri work closer to the origin.

As part of cockroachdb#95563 we're introducing machinery for "replication admission
control" -- end-to-end flow control for replication traffic. With it we
expect to no longer need to bypass follower write work in admission
control and solve the issues mentioned above. Some small degree of
familiarity with the design is assumed below. In this first,
proto{col,buf}/interface-only PR, we introduce:

1. Package kvflowcontrol{,pb}, which will provide flow control for
   replication traffic in KV. It will be part of the integration layer
   between KV and admission control. In it we have a few central
   interfaces:

   - kvflowcontrol.Controller, held at the node level, which holds all
     kvflowcontrol.Tokens for each kvflowcontrol.Stream (one per store
     we're sending raft traffic to and tenant we're sending it for).
   - kvflowcontrol.Handle, which will be held at the replica level (only
     on replicas that are both leaseholder and raft leader), and will be
     used to interface with the node-level kvflowcontrol.Controller.
     When replicating log entries, these replicas choose the log
     position (term+index) the data is to end up at, and use this handle
     to track the token deductions on a per-log-position basis. Later,
     when freeing up tokens (after being informed that said log entries
     were admitted on the receiving end of the stream), we do so by
     specifying the log position up to which all deducted tokens are
     freed.

   type Controller interface {
     Admit(admissionpb.WorkPriority, ...Stream)
     DeductTokens(admissionpb.WorkPriority, Tokens, ...Stream)
     ReturnTokens(admissionpb.WorkPriority, Tokens, ...Stream)
   }

   type Handle interface {
     Admit(admissionpb.WorkPriority)
     DeductTokensFor(admissionpb.WorkPriority, kvflowcontrolpb.RaftLogPosition, Tokens)
     ReturnTokensUpto(admissionpb.WorkPriority, kvflowcontrolpb.RaftLogPosition)
     TrackLowWater(Stream, kvflowcontrolpb.RaftLogPosition)
     Close()
   }
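
   To make the per-log-position token tracking described above concrete,
   here is a minimal, self-contained sketch; the types below are local
   simplifications that merely mirror the interface names, not the actual
   kvflowcontrol implementation:

```go
package main

import "fmt"

// RaftLogPosition mirrors kvflowcontrolpb.RaftLogPosition for this sketch.
type RaftLogPosition struct{ Term, Index uint64 }

func (p RaftLogPosition) lessEq(o RaftLogPosition) bool {
	return p.Term < o.Term || (p.Term == o.Term && p.Index <= o.Index)
}

type deduction struct {
	pos    RaftLogPosition
	tokens int64
}

// handle is a toy stand-in for kvflowcontrol.Handle: it tracks deductions in
// log-position order and returns them "up to" a given position.
type handle struct {
	deductions []deduction
	available  int64
}

func (h *handle) DeductTokensFor(pos RaftLogPosition, t int64) {
	h.available -= t
	h.deductions = append(h.deductions, deduction{pos, t})
}

// ReturnTokensUpto frees every deduction at or below pos, which is how tokens
// are returned once the receiving store admits the corresponding entries.
func (h *handle) ReturnTokensUpto(pos RaftLogPosition) {
	var remaining []deduction
	for _, d := range h.deductions {
		if d.pos.lessEq(pos) {
			h.available += d.tokens
		} else {
			remaining = append(remaining, d)
		}
	}
	h.deductions = remaining
}

func main() {
	h := &handle{available: 16 << 20} // say, 16 MiB of flow tokens
	h.DeductTokensFor(RaftLogPosition{Term: 5, Index: 10}, 1<<20)
	h.DeductTokensFor(RaftLogPosition{Term: 5, Index: 11}, 2<<20)
	h.ReturnTokensUpto(RaftLogPosition{Term: 5, Index: 10})
	fmt.Println(h.available) // 14680064: only the first deduction was freed
}
```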

2. kvflowcontrolpb.RaftAdmissionMeta and relevant encoding/decoding
   routines. RaftAdmissionMeta is 'embedded' within a
   kvserverpb.RaftCommand, and includes necessary AC metadata on a per
   raft entry basis. Entries that contain this metadata will make use of
   the AC-specific raft log entry encodings described earlier. The AC
   metadata is decoded below-raft when looking to admit the write work.
   Also included is the node where this command originated, which wants to
   eventually learn of this command's admission.

   message RaftAdmissionMeta {
     int32 admission_priority = ...;
     int64 admission_create_time = ...;
     int32 admission_origin_node = ...;
   }

3. kvflowcontrolpb.AdmittedRaftLogEntries, which now features in
   kvserverpb.RaftMessageRequest, the unit of what's sent
   back-and-forth between two nodes over their two uni-directional raft
   transport streams. AdmittedRaftLogEntries, just like raft
   heartbeats, is coalesced information about all raft log entries that
   were admitted below raft. We'll use the origin node encoded in the
   raft entry (admission_origin_node from above) to know where to
   send these. This information is used on the origin node to release
   flow tokens that were acquired when replicating the original log
   entries.

   message AdmittedRaftLogEntries {
     int64 range_id = ...;
     int32 admission_priority = ...;
     RaftLogPosition up_to_raft_log_position = ...;
     uint64 store_id = ...;
   }

   message RaftLogPosition {
     uint64 term = ...;
     uint64 index = ...;
   }

4. kvflowcontrol.Dispatch, which is used to dispatch information about
   admitted raft log entries (see AdmittedRaftLogEntries from above) to
   specific nodes where (i) said entries originated, (ii) flow tokens
   were deducted and (iii) are waiting to be returned. The interface is
   also used to read pending dispatches, which will be used in the raft
   transport layer when looking to piggyback information on traffic
   already bound to specific nodes. Since timely dispatching (read:
   piggybacking) is not guaranteed, we allow querying for all
   long-overdue dispatches. The interface looks roughly like:

   type Dispatch interface {
     Dispatch(roachpb.NodeID, kvflowcontrolpb.AdmittedRaftLogEntries)
     PendingDispatch() []roachpb.NodeID
     PendingDispatchFor(roachpb.NodeID) []kvflowcontrolpb.AdmittedRaftLogEntries
   }
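
   A rough, self-contained sketch of the piggybacking idea (local
   stand-in types, not the real Dispatch): notifications about admitted
   entries are buffered per destination node and drained whenever raft
   traffic is already headed that way:

```go
package main

import "fmt"

type NodeID int

// AdmittedRaftLogEntries is a toy stand-in for the proto message above: just
// enough to identify what was admitted and up to where.
type AdmittedRaftLogEntries struct {
	RangeID          int64
	UpToRaftLogIndex uint64
}

// dispatch is a toy stand-in for kvflowcontrol.Dispatch: a per-node mailbox of
// admitted-entry notifications awaiting piggybacking.
type dispatch struct {
	pending map[NodeID][]AdmittedRaftLogEntries
}

func (d *dispatch) Dispatch(n NodeID, e AdmittedRaftLogEntries) {
	d.pending[n] = append(d.pending[n], e)
}

// PendingDispatchFor drains and returns everything buffered for a node.
func (d *dispatch) PendingDispatchFor(n NodeID) []AdmittedRaftLogEntries {
	entries := d.pending[n]
	delete(d.pending, n)
	return entries
}

// sendRaftMessage sketches the transport-level integration: right before
// sending raft traffic to a node, attach whatever notifications are pending
// for it, returning flow tokens on traffic we were sending anyway.
func sendRaftMessage(d *dispatch, to NodeID, msg string) {
	piggybacked := d.PendingDispatchFor(to)
	fmt.Printf("to n%d: %s (+%d admitted-entries notifications)\n", to, msg, len(piggybacked))
}

func main() {
	d := &dispatch{pending: map[NodeID][]AdmittedRaftLogEntries{}}
	d.Dispatch(1, AdmittedRaftLogEntries{RangeID: 7, UpToRaftLogIndex: 42})
	sendRaftMessage(d, 1, "raft message") // piggybacks the pending notification
	sendRaftMessage(d, 1, "raft message") // nothing pending anymore
}
```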

5. Two new encodings for raft log entries,
   EntryEncoding{Standard,Sideloaded}WithAC. Raft log entries have a
   prefix byte that informs decoding routines how to interpret the
   subsequent bytes. To date we've had two,
   EntryEncoding{Standard,Sideloaded} (now renamed to
   EntryEncoding{Standard,Sideloaded}WithoutAC), to indicate whether
   the entry came with sideloaded data (these are typically AddSSTs, the
   storage for which is treated differently for performance). Our two
   additions here will be used to indicate whether the particular entry
   is subject to replication admission control. If so, right as we
   persist entries into the raft log storage, we'll admit the work
   without blocking.
   - We'll come back to this non-blocking admission in the
     AdmitRaftEntry section below, even though the implementation is
     left for a future PR.
   - The decision to use replication admission control happens above
     raft, and AC-specific metadata is plumbed down as part of the
     marshaled raft command, as described for RaftAdmissionMeta above.

6. An unused version gate (V23_1UseEncodingWithBelowRaftAdmissionData)
   to use replication admission control. Since we're using a different
   prefix byte for raft commands (see EntryEncodings above), one not
   recognized in earlier CRDB versions, we need explicit versioning.

7. AdmitRaftEntry, on the kvadmission.Controller interface. We'll
   use this as the integration point for log entries received below
   raft, right as they're being written to storage. This will be
   non-blocking since we'll be below raft in the raft.Ready() loop,
   and will effectively enqueue a "virtual" work item in the underlying
   StoreWorkQueue mediating store IO. This virtual work item is what
   later gets dequeued once the store granter informs the work queue of
   newly available IO tokens. For standard work queue ordering, our work
   item needs to include the create time and admission priority. The tenant
   ID is plumbed to find the right tenant heap to queue it under (for
   inter-tenant isolation); the store ID to find the right store work
   queue on multi-store nodes. The raftpb.Entry encodes within it its
   origin node (see RaftAdmissionMeta above), which is used
   post-admission to inform the right node of said admission. It looks
   like:

   // AdmitRaftEntry informs admission control of a raft log entry being
   // written to storage.
   AdmitRaftEntry(roachpb.TenantID, roachpb.StoreID, roachpb.RangeID, raftpb.Entry)
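
   The non-blocking admission in point 7 can be sketched with a toy
   "virtual work item" queue (a simplification, not the real
   StoreWorkQueue): enqueueing returns immediately, and a callback fires
   later, once the granter hands out IO tokens, which is where the
   token-return dispatch to the origin node would hook in:

```go
package main

import "fmt"

// virtualWorkItem is a toy stand-in for what a store work queue would enqueue
// for a replicated entry: enough metadata to order it (create time, priority)
// and to notify the origin node once it's admitted.
type virtualWorkItem struct {
	createTime int64
	priority   int8
	originNode int
	onAdmitted func()
}

// asyncQueue admits work logically: Admit never blocks, and items are only
// "granted" later, when grant() is called as IO tokens become available.
type asyncQueue struct {
	items []virtualWorkItem
}

func (q *asyncQueue) Admit(w virtualWorkItem) {
	// Non-blocking: just record the virtual work item.
	q.items = append(q.items, w)
}

func (q *asyncQueue) grant(n int) {
	// Called when the granter hands out tokens; admit up to n items in order.
	for i := 0; i < n && len(q.items) > 0; i++ {
		w := q.items[0]
		q.items = q.items[1:]
		w.onAdmitted() // e.g. dispatch AdmittedRaftLogEntries to w.originNode
	}
}

func main() {
	q := &asyncQueue{}
	q.Admit(virtualWorkItem{createTime: 1, priority: 0, originNode: 1,
		onAdmitted: func() { fmt.Println("notify n1: entry admitted") }})
	fmt.Println("raft.Ready() handling continues without blocking")
	q.grant(1) // later, when IO tokens free up
}
```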
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 24, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 24, 2023
Part of cockroachdb#95563. Predecessor to cockroachdb#95637.

This commit introduces two new encodings for raft log entries,
EntryEncoding{Standard,Sideloaded}WithAC. Raft log entries have a prefix
byte that informs decoding routines how to interpret the subsequent
bytes. To date we've had two, EntryEncoding{Standard,Sideloaded}[^1], to
indicate whether the entry came with sideloaded data[^2]. Our two
additions here will be used to indicate whether the particular entry is
subject to replication admission control. If so, right as we persist
entries into the raft log storage, we'll "admit the work without
blocking", which is further explained in cockroachdb#95637.

The decision to use replication admission control happens above raft, on
a per-entry basis. If using replication admission control, AC-specific
metadata will be plumbed down as part of the marshaled raft command.
This too is explained in cockroachdb#95637, specifically the
'RaftAdmissionMeta' section. This commit then adds an unused version
gate (V23_1UseEncodingWithBelowRaftAdmissionData) to use replication
admission control. Since we're using a different prefix byte for raft
commands (see EntryEncodings above), one not recognized in earlier CRDB
versions, we need explicit versioning. We add it out of development
convenience -- adding version gates is most prone to merge conflicts. We
expect to use it shortly, before alpha/beta cuts.

[^1]: Now renamed to EntryEncoding{Standard,Sideloaded}WithoutAC.
[^2]: These are typically AddSSTs, the storage for which is treated
      differently for performance reasons.

Release note: None
irfansharif changed the title from "*: replication admission control" to "*: productionize replication admission control" Jan 24, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 25, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 25, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 30, 2023
Part of cockroachdb#95563. This PR integrates various kvflowcontrol components into
the critical path for replication traffic. It does so by introducing two
"integration interfaces" in the kvserver package to intercept various
points of a replica's lifecycle, using them to manage the underlying
replication streams and flow tokens. The integration is mediated through
two cluster settings:

  - kvadmission.flow_control.enabled
      This is a top-level kill-switch to revert to pre-kvflowcontrol
      behavior where follower writes unilaterally deducted IO tokens
      without blocking.

  - kvadmission.flow_control.mode
      It can take on one of two settings, each exercising the flow
      control machinery to varying degrees.

      - apply_to_elastic
          Only applies admission delays to elastic traffic.

      - apply_to_all
          Applies admission delays to {regular,elastic} traffic.

When the mode is changed, we simply admit all waiting requests. This
risks possibly over-admitting work, but that's ok -- we assume these
mode changes are rare events and done under supervision. These
settings are hooked into the kvadmission and kvflowcontroller
packages (a sketch of how these settings might gate admission appears at
the end of this message). As for the actual integration interfaces in
kvserver, they
are:

  - replicaFlowControlIntegration: used to integrate with replication
    flow control. It intercepts various points in a replica's
    lifecycle, like acquiring or losing raft leadership, or its
    raft membership changing, etc. Accessing it requires Replica.mu to
    be held, exclusively (this is asserted on in the canonical
    implementation).

      type replicaFlowControlIntegration interface {
        handle() (kvflowcontrol.Handle, bool)
        onBecameLeader(context.Context)
        onBecameFollower(context.Context)
        onDescChanged(context.Context)
        onFollowersPaused(context.Context)
        onReplicaDestroyed(context.Context)
        onProposalQuotaUpdated(context.Context)
      }

  - replicaForFlowControl abstracts the interface of an individual
    Replica, as needed by replicaFlowControlIntegration.

      type replicaForFlowControl interface {
        assertLocked()
        annotateCtx(context.Context) context.Context
        getTenantID() roachpb.TenantID
        getReplicaID() roachpb.ReplicaID
        getRangeID() roachpb.RangeID
        getDescriptor() *roachpb.RangeDescriptor
        pausedFollowers() map[roachpb.ReplicaID]struct{}
        isFollowerActive(context.Context, roachpb.ReplicaID) bool
        appliedLogPosition() kvflowcontrolpb.RaftLogPosition
        withReplicaProgress(f func(roachpb.ReplicaID, rafttracker.Progress))
      }

Release note: None
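
To make the settings above concrete, here is a minimal sketch (not the actual kvadmission code; names and types are local simplifications) of how the kill-switch and mode might decide whether a request waits for flow tokens:

```go
package main

import "fmt"

type workClass int

const (
	regularWorkClass workClass = iota
	elasticWorkClass
)

type flowControlMode int

const (
	applyToElastic flowControlMode = iota // only elastic traffic waits for flow tokens
	applyToAll                            // both regular and elastic traffic wait
)

// shouldWaitForFlowTokens mirrors the kill-switch + mode behavior described
// above: if flow control is disabled we never wait; otherwise the mode
// decides which work classes are subject to admission delays.
func shouldWaitForFlowTokens(enabled bool, mode flowControlMode, wc workClass) bool {
	if !enabled {
		return false
	}
	switch mode {
	case applyToAll:
		return true
	case applyToElastic:
		return wc == elasticWorkClass
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldWaitForFlowTokens(true, applyToElastic, regularWorkClass)) // false
	fmt.Println(shouldWaitForFlowTokens(true, applyToAll, regularWorkClass))     // true
	fmt.Println(shouldWaitForFlowTokens(false, applyToAll, elasticWorkClass))    // false
}
```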
irfansharif added a commit to irfansharif/cockroach that referenced this issue May 7, 2023
craig bot pushed a commit that referenced this issue May 11, 2023
101786: workload: introduce timeout for pre-warming connection pool r=sean- a=sean-

Interrupting target instances during prewarming shouldn't cause workload to proceed: introduce a timeout when prewarming connections.  Connections will have between 15s and 5min to warm up before the context expires.

Epic: none

101987: cli/sql: new option autocerts for TLS client cert auto-discovery r=rafiss a=knz

Fixes #101986.

See the release note below.
An additional benefit not mentioned in the release note is that
it simplifies switching from one tenant to another when using
shared-process multitenancy. For example, this becomes possible:

```
> CREATE TENANT foo;
> ALTER TENANT foo START SERVICE SHARED;
> \c cluster:foo root - - autocerts
```

Alternatively, this can also be used to quickly switch from a non-root
user in an app tenant to the root user in the system tenant:
```
> \c cluster:system root - - autocerts
```

This works because (currently) all tenant servers running side-by-side
use the same TLS CA to validate SQL client certs.

----

Release note (cli change): The `\connect` client-side command for the
SQL shell (included in `cockroach sql`, `cockroach demo`,
`cockroach-sql`) now recognizes an option `autocerts` as last
argument.

When provided, `\c` will now try to discover a TLS client
certificate and key in the same directory(ies) as used by the previous
connection URL.

This feature makes it easier to switch usernames when
TLS client/key files are available for both the previous and the new
username.

102382: c2c: deflake c2c/shutdown roachtests r=stevendanna a=msbutler

   c2c: deflake c2c/shutdown roachtests

    This patch addresses two roachtest failure modes:
    - Prevents roachtest failure if a query fails during a node shutdown.

    - Prevents the src cluster from returning a single node topology, which could
      cause the stream ingestion job to hang if the participating src node gets
    shut down. Longer term, automatic replanning will prevent this. Specifically,
    this patch changes the kv workload to split and scatter the kv table across the
    cluster before the c2c job begins.

    Fixes #101898
    Fixes #102111

    This patch also makes it easier to reproduce c2c roachtest failures by plumbing
    a random seed to several components of the roachtest driver.

    Release note: None


    c2c: rename completeStreamIngestion to applyCutoverTime

    Release note: none


    workload: add --scatter flag to kv workload

    The user can now run `./workload init kv --scatter ....` which scatters the kv
    table across the cluster after the initial data load. This flag is best used
    with `--splits`, `--max-block-bytes`, and `--insert-count`.

    Epic: none

    Release note: none

102819: admission: move CreateTime-sequencing below-raft r=irfansharif a=irfansharif

These are already reviewed commits from #98308. Part of #95563.

---

**admission: move CreateTime-sequencing below-raft**

We move kvflowsequencer.Sequencer and its use in kvflowhandle.Handle (above-raft) to admission.sequencer, now used by admission.StoreWorkQueue (below-raft). This variant appeared in an earlier revision of #97599 where we first introduced monotonically increasing CreateTimes for a given raft group.

In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll observe that it's quite difficult to sequence CreateTimes[^1] above raft. This is because these sequence numbers are encoded as part of the raft proposal[^2], and at encode-time, we don't actually know what log position the proposal is going to end up in. It's hard to explicitly guarantee that a proposal with log-position P1 will get encoded before another with log position P2, where P1 < P2.

Naively sequencing CreateTimes at proposal-encode-time could result in over-admission. This is because of how we return flow tokens -- up to some log index[^3], and how we use these sequence numbers in below-raft WorkQueues. If P2 ends up with a lower sequence number/CreateTime, it would get admitted first, and when returning flow tokens by log position, in specifying up-to-P2, we'll early return P1's flow tokens despite it not being admitted. So we'd over-admit at the sender. This is all within a <tenant,priority> pair (a small sequencer sketch follows the footnotes below).

[^1]: We use CreateTimes as "sequence numbers" in replication admission control. We want to assign each AC-queued work below-raft a "sequence number" for FIFO ordering within a <tenant,priority>. We ensure these timestamps are roughly monotonic with respect to log positions of replicated work by sequencing work in log position order.
[^2]: In kvflowcontrolpb.RaftAdmissionMeta.
[^3]: See kvflowcontrolpb.AdmittedRaftLogEntries.
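
The sequencing described above can be sketched as follows; this is a simplification of the idea (per raft group, hand back create times that only ever increase), not necessarily the exact admission.sequencer implementation:

```go
package main

import "fmt"

// sequencer is a sketch of the CreateTime-sequencing idea: per raft group,
// return create times that are monotonically increasing, bumping them
// minimally when an entry arrives with a create time at or below the last
// one handed out.
type sequencer struct {
	lastSeq int64 // last sequence number (a nanosecond "create time") returned
}

func (s *sequencer) sequence(createTime int64) int64 {
	if createTime <= s.lastSeq {
		createTime = s.lastSeq + 1
	}
	s.lastSeq = createTime
	return createTime
}

func main() {
	var s sequencer
	fmt.Println(s.sequence(100)) // 100
	fmt.Println(s.sequence(90))  // 101: bumped to preserve log-position ordering
	fmt.Println(s.sequence(200)) // 200
}
```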

---

**admission: add intercept points for when replicated work gets admitted**

In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll set up the return of flow tokens from the receiver node back to the sender once log entries get (asynchronously) admitted[^4]. So we need to intercept the exact points at which the virtually enqueued work items get admitted, since it all happens asynchronously[^5]. To that end we introduce the following interface:
```go
    // OnLogEntryAdmitted is used to observe the specific entries
    // (identified by rangeID + log position) that were admitted. Since
    // admission control for log entries is asynchronous/non-blocking,
    // this allows callers to do requisite post-admission
    // bookkeeping.
    type OnLogEntryAdmitted interface {
     AdmittedLogEntry(
       origin roachpb.NodeID, /* node where the entry originated */
       pri admissionpb.WorkPriority, /* admission priority of the entry */
       storeID roachpb.StoreID, /* store on which the entry was admitted */
       rangeID roachpb.RangeID, /* identifying range for the log entry */
       pos LogPosition, /* log position of the entry that was admitted*/
     )
    }
```
For now we pass in a no-op implementation in production code, but this will change shortly.

Seeing as how the asynchronous admit interface is going to be the primary one once we enable replication admission control by default, for IO control, we no longer need the storeWriteDone interfaces and corresponding types. They're being used by our current (and soon-to-be legacy) above-raft IO admission control to inform granters of when the write was actually done, post-admission. For above-raft IO control, at admit-time we do not have sizing info for the writes, so by intercepting these writes at write-done time we're able to make any outstanding token adjustments in the granter.

To reflect this new world, we:
- Rename setAdmittedDoneModels to setLinearModels.
- Introduce a storeReplicatedWorkAdmittedInfo[^6]. It provides information about the size of replicated work once it's admitted (which happens asynchronously from the work itself). This lets us use the underlying linear models for L0 {writes,ingests} to deduct an appropriate number of tokens from the granter, for the admitted work size[^7].
- Rename the granterWithStoreWriteDone interface to granterWithStoreReplicatedWorkAdmitted. We'll still intercept the actual point of admission for some token adjustments, through the storeReplicatedWorkAdmittedLocked API shown below. There are two callstacks through which this API gets invoked, one where the coord.mu is already held, and one where it isn't. We plumb this information through so the lock is acquired if not already held. The locking structure is unfortunate, but this was a minimally invasive diff.
```go
   storeReplicatedWorkAdmittedLocked(
    originalTokens int64,
    admittedInfo storeReplicatedWorkAdmittedInfo,
   ) (additionalTokens int64)
```
While here, we also export an admission.TestingReverseWorkPriorityDict. There are at least three tests that have re-invented the wheel.

[^4]: This will happen through the kvflowcontrol.Dispatch interface introduced back in #97766, after integrating it with the RaftTransport layer.
[^5]: Introduced in #97599, for replicated write work.
[^6]: Identical to the previous StoreWorkDoneInfo.
[^7]: There's a peculiarity here in that at enqueuing-time we actually know the size of the write, so we could have deducted the right number of tokens upfront and avoid this post-admit granter token adjustment. We inherit this structure from earlier, and just leave a TODO for now.
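
As a sketch of the token adjustment this enables (a toy linear model, not the granter's actual models or units), once the admitted work's size is known we can recompute its token cost and adjust by the difference from what was deducted up front:

```go
package main

import "fmt"

// linearModel is a toy mapping from logical write bytes to L0 write tokens,
// in the spirit of the admitted-work token adjustment described above.
type linearModel struct {
	multiplier float64
	constant   int64 // per-request constant overhead, in tokens
}

func (m linearModel) tokens(bytes int64) int64 {
	return int64(m.multiplier*float64(bytes)) + m.constant
}

// adjustment computes how many additional tokens to deduct (positive) or
// return (negative) once the admitted work's actual size is known, relative
// to what was deducted up front.
func adjustment(m linearModel, originalTokens, admittedBytes int64) int64 {
	return m.tokens(admittedBytes) - originalTokens
}

func main() {
	m := linearModel{multiplier: 1.5, constant: 1024}
	fmt.Println(adjustment(m, 4096, 8192)) // 9216: deduct more, the write was bigger than assumed
}
```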


103116: generate-logic-test: fix incorrect timeout in logictests template r=rickystewart a=healthy-pod

In #102719, we changed the way we set `-test.timeout` but didn't update the logictests template. This code change updates the template.

Release note: None
Epic: none

Co-authored-by: Sean Chittenden <sean@chittenden.org>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: healthy-pod <ahmad@cockroachlabs.com>
irfansharif added a commit to irfansharif/cockroach that referenced this issue May 31, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue May 31, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 6, 2023
Part of cockroachdb#95563. This PR integrates various kvflowcontrol components into
the critical path for replication traffic. It does so by introducing two
"integration interfaces" in the kvserver package to intercept various
points of a replica's lifecycle, using it to manage the underlying
replication streams and flow tokens. The integration is mediated through
two cluster settings:

  - kvadmission.flow_control.enabled
      This is a top-level kill-switch to revert to pre-kvflowcontrol
      behavior where follower writes unilaterally deducted IO tokens
      without blocking.

  - kvadmission.flow_control.mode
      It can take on one of three settings, each exercising the flow
      control machinery to varying degrees.

      - apply_to_elastic
          Only applies admission delays to elastic traffic.

      - apply_to_all
          Applies admission delays to {regular,elastic} traffic.

When the mode is changed, we simply admit all waiting requests. This
risks possibly over-admitting work, but that's ok -- we assume these
mode changes are rare events and done under supervision. These
settings are hooked into in the kvadmission and kvflowcontroller
packages. As for the actual integration interfaces in kvserver, they
are:

  - replicaFlowControlIntegration: used to integrate with replication
    flow control. It's intercepts various points in a replica's
    lifecycle, like it acquiring raft leadership or losing it, or its
    raft membership changing, etc. Accessing it requires Replica.mu to
    be held, exclusively (this is asserted on in the canonical
    implementation).

      type replicaFlowControlIntegration interface {
        handle() (kvflowcontrol.Handle, bool)
        onBecameLeader(context.Context)
        onBecameFollower(context.Context)
        onDescChanged(context.Context)
        onFollowersPaused(context.Context)
        onReplicaDestroyed(context.Context)
        onProposalQuotaUpdated(context.Context)
      }

  - replicaForFlowControl abstracts the interface of an individual
    Replica, as needed by replicaFlowControlIntegration.

      type replicaForFlowControl interface {
        assertLocked()
        annotateCtx(context.Context) context.Context
        getTenantID() roachpb.TenantID
        getReplicaID() roachpb.ReplicaID
        getRangeID() roachpb.RangeID
        getDescriptor() *roachpb.RangeDescriptor
        pausedFollowers() map[roachpb.ReplicaID]struct{}
        isFollowerActive(context.Context, roachpb.ReplicaID) bool
        appliedLogPosition() kvflowcontrolpb.RaftLogPosition
        withReplicaProgress(f func(roachpb.ReplicaID, rafttracker.Progress))
      }

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 8, 2023
Part of cockroachdb#95563. This PR integrates various kvflowcontrol components into
the critical path for replication traffic. It does so by introducing two
"integration interfaces" in the kvserver package to intercept various
points of a replica's lifecycle, using it to manage the underlying
replication streams and flow tokens. The integration is mediated through
two cluster settings:

  - kvadmission.flow_control.enabled
      This is a top-level kill-switch to revert to pre-kvflowcontrol
      behavior where follower writes unilaterally deducted IO tokens
      without blocking.

  - kvadmission.flow_control.mode
      It can take on one of two settings, each exercising the flow
      control machinery to varying degrees.

      - apply_to_elastic
          Only applies admission delays to elastic traffic.

      - apply_to_all
          Applies admission delays to {regular,elastic} traffic.

When the mode is changed, we simply admit all waiting requests. This
risks possibly over-admitting work, but that's ok -- we assume these
mode changes are rare events and done under supervision. These
settings are hooked into in the kvadmission and kvflowcontroller
packages. As for the actual integration interfaces in kvserver, they
are:

  - replicaFlowControlIntegration: used to integrate with replication
    flow control. It's intercepts various points in a replica's
    lifecycle, like it acquiring raft leadership or losing it, or its
    raft membership changing, etc. Accessing it requires Replica.mu to
    be held, exclusively (this is asserted on in the canonical
    implementation).

      type replicaFlowControlIntegration interface {
        handle() (kvflowcontrol.Handle, bool)
        onBecameLeader(context.Context)
        onBecameFollower(context.Context)
        onDescChanged(context.Context)
        onFollowersPaused(context.Context)
        onReplicaDestroyed(context.Context)
        onProposalQuotaUpdated(context.Context)
      }

  - replicaForFlowControl abstracts the interface of an individual
    Replica, as needed by replicaFlowControlIntegration.

      type replicaForFlowControl interface {
        assertLocked()
        annotateCtx(context.Context) context.Context
        getTenantID() roachpb.TenantID
        getReplicaID() roachpb.ReplicaID
        getRangeID() roachpb.RangeID
        getDescriptor() *roachpb.RangeDescriptor
        pausedFollowers() map[roachpb.ReplicaID]struct{}
        isFollowerActive(context.Context, roachpb.ReplicaID) bool
        appliedLogPosition() kvflowcontrolpb.RaftLogPosition
        withReplicaProgress(f func(roachpb.ReplicaID, rafttracker.Progress))
      }

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 9, 2023
Part of cockroachdb#95563. This PR integrates various kvflowcontrol components into
the critical path for replication traffic. It does so by introducing two
"integration interfaces" in the kvserver package to intercept various
points of a replica's lifecycle, using it to manage the underlying
replication streams and flow tokens. The integration is mediated through
two cluster settings:

  - kvadmission.flow_control.enabled
      This is a top-level kill-switch to revert to pre-kvflowcontrol
      behavior where follower writes unilaterally deducted IO tokens
      without blocking.

  - kvadmission.flow_control.mode
      It can take on one of two settings, each exercising the flow
      control machinery to varying degrees.

      - apply_to_elastic
          Only applies admission delays to elastic traffic.

      - apply_to_all
          Applies admission delays to {regular,elastic} traffic.

When the mode is changed, we simply admit all waiting requests. This
risks possibly over-admitting work, but that's ok -- we assume these
mode changes are rare events and done under supervision. These
settings are hooked into in the kvadmission and kvflowcontroller
packages. As for the actual integration interfaces in kvserver, they
are:

  - replicaFlowControlIntegration: used to integrate with replication
    flow control. It's intercepts various points in a replica's
    lifecycle, like it acquiring raft leadership or losing it, or its
    raft membership changing, etc. Accessing it requires Replica.mu to
    be held, exclusively (this is asserted on in the canonical
    implementation).

      type replicaFlowControlIntegration interface {
        handle() (kvflowcontrol.Handle, bool)
        onBecameLeader(context.Context)
        onBecameFollower(context.Context)
        onDescChanged(context.Context)
        onFollowersPaused(context.Context)
        onReplicaDestroyed(context.Context)
        onProposalQuotaUpdated(context.Context)
      }

  - replicaForFlowControl abstracts the interface of an individual
    Replica, as needed by replicaFlowControlIntegration.

      type replicaForFlowControl interface {
        assertLocked()
        annotateCtx(context.Context) context.Context
        getTenantID() roachpb.TenantID
        getReplicaID() roachpb.ReplicaID
        getRangeID() roachpb.RangeID
        getDescriptor() *roachpb.RangeDescriptor
        pausedFollowers() map[roachpb.ReplicaID]struct{}
        isFollowerActive(context.Context, roachpb.ReplicaID) bool
        appliedLogPosition() kvflowcontrolpb.RaftLogPosition
        withReplicaProgress(f func(roachpb.ReplicaID, rafttracker.Progress))
      }

Release note: None
craig bot pushed a commit that referenced this issue Jun 9, 2023
98308: kvserver,kvflowcontrol: integrate flow control r=irfansharif a=irfansharif

Part of #95563. See individual commits.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 14, 2023
Fixes cockroachdb#102683. Part of cockroachdb#104154.

These above-raft AddSST concurrency limiters were added way back in
cockroachdb#36403 and cockroachdb#73904, pre-dating much of IO
admission control for leaseholder writes. With cockroachdb#95563, we
now have IO admission control for follower writes. Put together, we
have ample LSM read-amp protection through AC alone. These concurrency
limiters are now redundant and oblivious to more sophisticated AC
measures. We recently removed the below-raft equivalents of these
limiters (cockroachdb#98762), and as mentioned there, these limiters
can exacerbate memory pressure. Separately, we're looking to work on
speedier restores, and these limiters are starting to get in the way.

While here, we also disable the pre-ingest delay mechanism in pebble,
which also pre-dates AC, introduced way back in cockroachdb#34258 for
RocksDB and in cockroachdb#41839 for Pebble. IO AC is able to limit the
number of L0 files, and this pre-ingest delay, with its maximum
per-request delay time of 5s, can be less than effective. It's worth
noting that the L0 file count threshold at which this pre-ingest delay
mechanism kicked in was 20, while AC aims for 1000[^1].

This commit doesn't go as far as removing these limiters outright,
merely disabling them. This is just out of an overabundance of caution.
We can probably remove them once kvflowcontrol.enabled has had >1
release worth of baking time. Until then, it's nice to know we have
these old safety hatches. We have ample time in the release to assess
fallout from this commit, and also use this increased AddSST concurrency
to stress the kvflowcontrol machinery.

[^1]: The 1000 file limit exists to bound how long it takes to clear L0
      completely. Envelope math cribbed from elsewhere: With 2MiB files,
      1000 files is ~2GB, which at 40MB/s of compaction throughput (with
      a compaction slot consistently dedicated to L0) takes < 60s to
      clear the backlog. So the 'recovery' time is modest in that
      operators should not need to take manual action.

Release note: None
craig bot pushed a commit that referenced this issue Jul 4, 2023
104861: kvserver: disable pre-AC above-raft AddSST throttling r=irfansharif a=irfansharif

Fixes #102683. Part of #104154.

These were added way back in #36403 and #73904, pre-dating much of IO admission control for leaseholder writes. With #95563, we now have IO admission control for follower writes. Put together, we have ample LSM read-amp protection through AC alone. These concurrency limiters are now redundant and oblivious to more sophisticated AC measures. We recently removed the below-raft equivalents of these limiters (#98762), and as mentioned there, these limiters can exacerbate memory pressure. Separately, we're looking to work on speedier restores, and these limiters are starting to get in the way.

While here, we also disable the pre-ingest delay mechanism in pebble, which also pre-dates AC, introduced way back in #34258 for RocksDB and in #41839 for Pebble. IO AC is able to limit the number of L0 files, and this pre-ingest delay, with its maximum per-request delay time of 5s, can be less than effective. It's worth noting that the L0 file count threshold at which this pre-ingest delay mechanism kicked in was 20, while AC aims for 1000[^1].

This commit doesn't go as far as removing these limiters outright, merely disabling them. This is just out of an overabundance of caution. We can probably remove them once kvflowcontrol.enabled has had >1 release worth of baking time. Until then, it's nice to know we have these old safety hatches. We have ample time in the release to assess fallout from this commit, and also use this increased AddSST concurrency to stress the kvflowcontrol machinery.

[^1]: The 1000 file limit exists to bound how long it takes to clear L0 completely. Envelope math cribbed from elsewhere: With 2MiB files, 1000 files is ~2GB, which at 40MB/s of compaction throughput (with a compaction slot consistently dedicated to L0) takes < 60s to clear the backlog. So the 'recovery' time is modest in that operators should not need to take manual action.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>