kvserver: handle stalled or unresponsive replicas #104262

Closed
erikgrinaker opened this issue Jun 2, 2023 · 3 comments · Fixed by #118943
Labels
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception).
  • O-postmortem: Originated from a Postmortem action item.
  • O-support: Originated from a customer.
  • P-3: Issues/test failures with no fix SLA.
  • T-kv-replication: KV Replication Team.

Comments


erikgrinaker commented Jun 2, 2023

Replicas may stall indefinitely, i.e. become unresponsive. Typical causes are:

  • Disk stalls.
  • Log stalls.
  • Deadlocks.
  • Blocked syscalls (e.g. network hardware failure or kernel bug).

During a stall, all replica processing (both reads and writes) halts, because callers are unable to acquire the relevant replica mutexes (e.g. Replica.mu or Replica.readOnlyCmdMu).

In the general case, this can cause permanent range unavailability (see e.g. the failover/non-system/deadlock unavailability benchmark). The exception is disk stalls, which have dedicated handling in both Pebble and CockroachDB that self-terminates the node after 20 seconds, resolving the stall.
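As an illustration of that disk-stall handling, here is a minimal sketch of the idea only (not Pebble's or CockroachDB's actual implementation): time a small probe write to the store directory and terminate the process if a single write exceeds a threshold. The probe file name, the probe interval, and the use of the standard library's `log.Fatalf` are assumptions for the example.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// watchDiskStalls periodically writes a small probe file to dir and
// terminates the process if any single write takes longer than threshold,
// mirroring the "kill the node" behavior described above. The probe file
// name and log.Fatalf are illustrative choices, not Pebble's.
func watchDiskStalls(dir string, interval, threshold time.Duration) {
	probe := filepath.Join(dir, "disk-stall-probe")
	for range time.Tick(interval) {
		done := make(chan struct{})
		go func() {
			defer close(done)
			_ = os.WriteFile(probe, []byte("ok"), 0644)
		}()
		select {
		case <-done:
		case <-time.After(threshold):
			log.Fatalf("disk stall detected: write to %s exceeded %s", dir, threshold)
		}
	}
}

func main() {
	go watchDiskStalls(os.TempDir(), time.Second, 20*time.Second)
	select {} // stand-in for the rest of the server
}
```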

A stalled replica may or may not lose its lease:

  • Epoch lease: if the node can send a heartbeat RPC to the liveness range leaseholder, and can perform a noop write on all local disks, the replica retains its lease.

  • Expiration lease: the replica will lose its lease within 6 seconds since it can't replicate the lease extension.

As long as the replica retains its lease, the range will be unavailable.

However, even if the replica loses its lease and it is acquired elsewhere, this can still cause widespread unavailability. Any request that arrived while the replica still held its lease will hang forever (or until the client times out), blocked on mutexes, latches, syscalls, and so on. Stale DistSender caches may also direct requests to the old, stalled leaseholder, where they get stuck. If there is a bounded worker pool, e.g. a connection pool, there is a chance that many or all workers stall on the replica (especially if several replicas stall), starving out clients and, in the worst case, causing a complete workload stall.

There are two main approaches we can take here:

  1. Kill the node. This is the approach taken both by the Pebble disk stall detector and by other databases (e.g. FoundationDB). It is simple, effective, and reliable, but it affects all processing on the node (including otherwise healthy replicas and stores), and it can cause quorum loss with correlated stalls across multiple nodes (e.g. network disk outages). Pebble also defaults to a fairly conservative 20-second threshold to avoid false positives, which many users consider too high.

  2. Cancel all requests. This is harder, because we have to keep track of the contexts of all in-flight requests (this had a non-negligible latency cost the last time we prototyped it), and the requests have to actually respond to context cancellation (they don't if they're waiting on a mutex or a syscall). But if we can do this, it would limit the impact to the stalled replicas. A sketch of the required bookkeeping follows this list.
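A minimal sketch of the bookkeeping approach 2 implies, assuming a per-replica registry of cancel functions; the names (`inflightRegistry`, `track`, `cancelAll`) are hypothetical and not CockroachDB's API.

```go
package main

import (
	"context"
	"sync"
)

// inflightRegistry is a sketch (not CockroachDB code) of the bookkeeping
// approach 2 would need: every request registers a cancel function while it
// is being served by a replica, and a stall detector can cancel them all.
type inflightRegistry struct {
	mu      sync.Mutex
	nextID  int
	cancels map[int]context.CancelFunc
}

func newInflightRegistry() *inflightRegistry {
	return &inflightRegistry{cancels: map[int]context.CancelFunc{}}
}

// track wraps ctx so it can be cancelled en masse; the returned release
// function must be called when the request finishes normally.
func (r *inflightRegistry) track(ctx context.Context) (context.Context, func()) {
	ctx, cancel := context.WithCancel(ctx)
	r.mu.Lock()
	id := r.nextID
	r.nextID++
	r.cancels[id] = cancel
	r.mu.Unlock()
	return ctx, func() {
		r.mu.Lock()
		delete(r.cancels, id)
		r.mu.Unlock()
		cancel()
	}
}

// cancelAll is what a stall detector would call for a wedged replica. Note
// the caveat from the issue: this only helps if request processing actually
// checks ctx.Done(), which blocked mutex acquisitions and syscalls do not.
func (r *inflightRegistry) cancelAll() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for id, cancel := range r.cancels {
		cancel()
		delete(r.cancels, id)
	}
}

func main() {
	reg := newInflightRegistry()
	ctx, release := reg.track(context.Background())
	defer release()
	_ = ctx         // the request would run with ctx here
	reg.cancelAll() // what the stall detector would do for this replica
}
```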

There is also the question of how to detect stalls. Pebble does this by timing individual syscalls. At the replica level, we could e.g. have a replica mutex watchdog, or monitor latency in the Raft scheduler and during request processing.
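To make the mutex-watchdog idea concrete, here is a rough sketch under the assumption that a periodic `TryLock` probe is acceptable; `replicaSketch`, the interval, and the `log.Printf` reporting are placeholders, not CockroachDB code.

```go
package main

import (
	"log"
	"sync"
	"time"
)

// replicaSketch stands in for a replica with a heavily contended mutex;
// the real Replica.mu is an RWMutex with more structure.
type replicaSketch struct {
	mu sync.Mutex
}

// watchMutex sketches the "replica mutex watchdog" idea: periodically try to
// acquire the mutex and report a stall if it stays unavailable for longer
// than threshold. The reporting action (log.Printf) is a placeholder for
// whatever remediation is chosen (killing the node, cancelling requests).
func watchMutex(r *replicaSketch, interval, threshold time.Duration) {
	var stalledSince time.Time
	for range time.Tick(interval) {
		if r.mu.TryLock() {
			r.mu.Unlock()
			stalledSince = time.Time{}
			continue
		}
		if stalledSince.IsZero() {
			stalledSince = time.Now()
		} else if d := time.Since(stalledSince); d > threshold {
			log.Printf("replica mutex unavailable for %s: possible stall", d)
		}
	}
}

func main() {
	r := &replicaSketch{}
	r.mu.Lock() // simulate a stalled holder that never unlocks
	go watchMutex(r, 100*time.Millisecond, time.Second)
	time.Sleep(3 * time.Second)
}
```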

Jira issue: CRDB-28435

Epic CRDB-25199

erikgrinaker added the C-bug and T-kv-replication labels Jun 2, 2023

blathers-crl bot commented Jun 2, 2023

cc @cockroachdb/replication

ajd12342 commented

Hi @erikgrinaker!
I am Anuj Diwan, a Computer Science PhD student at UT Austin. I am part of a team with @arjunrs1 (Arjun Somayazulu), and we're taking a graduate Distributed Systems course. For our course project, we are interested in contributing to CockroachDB, and this issue is related to our course material. Could we work on this issue? Any pointers to get started would also be appreciated.

Thanks and regards,
Anuj.

erikgrinaker commented

Hi @ajd12342! This is a complex issue that requires a lot of context and design work, so I don't think it's suitable for external contributors to pick up. You may be more interested in a simpler issue, like #99783.

nvanbenschoten added the P-3 label Jan 10, 2024
exalate-issue-sync bot added the C-enhancement label Jan 26, 2024
craig bot pushed a commit that referenced this issue Feb 20, 2024
119130: batcheval: don't take out latches for `LeaseInfo` r=erikgrinaker a=erikgrinaker

Extracted from #118943.

---

This patch removes the `LeaseInfo` read latch on `RangeLeaseKey`. This makes the request latchless, which allows the request to be used as a simple replica health check without worrying about latch delays (e.g. with DistSender circuit breaker probes).

This latch was ineffectual anyway, since lease requests don't take out write latches (they're evaluated on the proposer). The lease is read from the replica's in-memory state.

Touches #104262.
Touches #118943.
Epic: none
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
craig bot closed this as completed in bf013ea Mar 4, 2024
craig bot pushed a commit that referenced this issue Mar 12, 2024
119865: kvcoord: don't probe idle replicas with tripped circuit breakers r=erikgrinaker a=erikgrinaker

Previously, DistSender circuit breakers would continually probe replicas with tripped circuit breakers. This could result in a large number of concurrent probe goroutines.

To reduce the number of probes, this patch instead only probes replicas that have seen traffic in the past few probe intervals. Otherwise, the probe exits (leaving the breaker tripped), and a new probe will be launched on the next request to the replica. This will increase latency for the first request(s) following recovery, as it must wait for the probe to succeed and the DistSender to retry the replica. In the common case (e.g. with a disk stall), the lease will shortly have moved to a different replica and the DistSender will stop sending requests to it.

Replicas with tripped circuit breakers are also made eligible for garbage collection, but with a higher threshold of 1 hour to avoid overeager GC.

Resolves #119917.
Touches #104262.
Touches #105168.
Epic: none
Release note: None
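For illustration, a rough sketch of the probing policy described in #119865 above, not the actual DistSender implementation; `breakerSketch`, the probe interval, and the idle threshold are assumptions for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// breakerSketch illustrates the policy from #119865: a tripped breaker only
// keeps probing while the replica has seen recent traffic; otherwise the
// probe exits (leaving the breaker tripped) and is relaunched lazily by the
// next request to the replica.
type breakerSketch struct {
	mu        sync.Mutex
	tripped   bool
	probing   bool
	lastSeen  time.Time
	idleAfter time.Duration
	probeFn   func() bool // returns true once the replica is healthy again
}

// onRequest records traffic, relaunches the probe if needed, and reports
// whether the request may be sent to this replica.
func (b *breakerSketch) onRequest() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.lastSeen = time.Now()
	if b.tripped && !b.probing {
		b.probing = true
		go b.probeLoop()
	}
	return !b.tripped
}

func (b *breakerSketch) probeLoop() {
	for {
		time.Sleep(100 * time.Millisecond) // probe interval (illustrative)
		b.mu.Lock()
		idle := time.Since(b.lastSeen) > b.idleAfter
		b.mu.Unlock()
		if idle {
			// Replica is idle: stop probing; the breaker stays tripped.
			b.mu.Lock()
			b.probing = false
			b.mu.Unlock()
			return
		}
		if b.probeFn() {
			b.mu.Lock()
			b.tripped, b.probing = false, false
			b.mu.Unlock()
			return
		}
	}
}

func main() {
	b := &breakerSketch{tripped: true, idleAfter: time.Second,
		probeFn: func() bool { return false }} // replica still unhealthy
	fmt.Println("request allowed:", b.onRequest()) // relaunches the probe
	time.Sleep(2 * time.Second)                    // probe exits once idle
}
```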

120168: sql: reserve tenant 2 r=dt a=dt

We might want this later, and it is easier to reserve it now than it will be once we let 24.1 UA use it.

Release note: none.
Epic: none.

120197: db-console/debug: clarify cpu profile link r=dt a=dt

The all-caps MEMORY OVERHEAD in the title of this one profile looks out of place compared to the rest of the page and is somewhat confusing: is it profiling the memory overhead, as implied by its placement in the title (which for all other links says what the tool does), or does the tool itself have overhead? In practice it is the latter, but serving any request has _some_ overhead, so this hardly warrants the unusual treatment.

Release note: none.
Epic: none.

120263: restore: record provenance of restored tenant for pcr r=dt a=dt

Release note (enterprise change): a restored backup of a virtual cluster, produced via BACKUP VIRTUAL CLUSTER and RESTORE VIRTUAL CLUSTER, can now be used to start PCR replication without an initial scan.
Epic: none.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>