kvserver: handle stalled or unresponsive replicas #104262

Closed
erikgrinaker opened this issue Jun 2, 2023 · 3 comments · Fixed by #118943
Labels
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception).
  • O-postmortem: Originated from a Postmortem action item.
  • O-support: Originated from a customer.
  • P-3: Issues/test failures with no fix SLA.
  • T-kv-replication: KV Replication Team.

Comments


erikgrinaker commented Jun 2, 2023

Replicas may stall indefinitely, i.e. become unresponsive. Typical causes are:

  • Disk stalls.
  • Log stalls.
  • Deadlocks.
  • Blocked syscalls (e.g. network hardware failure or kernel bug).

During a stall, all replica processing (both reads and writes) halts, because callers are unable to acquire the relevant replica mutexes (e.g. Replica.mu or Replica.readOnlyCmdMu).

In the general case, this can cause permanent range unavailability (see e.g. the failover/non-system/deadlock unavailability benchmark). The exception is disk stalls, which have dedicated handling in both Pebble and CockroachDB that self-terminates the node after 20 seconds, resolving the stall.
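As an illustration of that disk-stall handling, here is a minimal sketch of the idea only (not Pebble's or CockroachDB's actual implementation): time a small probe write to the store directory and terminate the process if a single write exceeds a threshold. The probe file name, the probe interval, and the use of the standard library's `log.Fatalf` are assumptions for the example.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// watchDiskStalls periodically writes a small probe file to dir and
// terminates the process if any single write takes longer than threshold,
// mirroring the "kill the node" behavior described above. The probe file
// name and log.Fatalf are illustrative choices, not Pebble's.
func watchDiskStalls(dir string, interval, threshold time.Duration) {
	probe := filepath.Join(dir, "disk-stall-probe")
	for range time.Tick(interval) {
		done := make(chan struct{})
		go func() {
			defer close(done)
			_ = os.WriteFile(probe, []byte("ok"), 0644)
		}()
		select {
		case <-done:
		case <-time.After(threshold):
			log.Fatalf("disk stall detected: write to %s exceeded %s", dir, threshold)
		}
	}
}

func main() {
	go watchDiskStalls(os.TempDir(), time.Second, 20*time.Second)
	select {} // stand-in for the rest of the server
}
```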

A stalled replica may or may not lose its lease:

  • Epoch lease: if the node can send a heartbeat RPC to the liveness range leaseholder, and can perform a noop write on all local disks, the replica retains its lease.

  • Expiration lease: the replica will lose its lease within 6 seconds since it can't replicate the lease extension.

As long as the replica retains its lease, the range will be unavailable.

However, even if the replica loses its lease and it is acquired elsewhere, this can still cause widespread unavailability. Any request that arrived while the replica still held its lease will hang forever (or until the client times out), blocked on mutexes, latches, syscalls, and so on. Stale DistSender caches may also direct requests to the old, stalled leaseholder, where they get stuck. If there is a bounded worker pool, e.g. a connection pool, there is a chance that many or all workers stall on the replica (especially if several replicas stall), starving out clients and, in the worst case, causing a complete workload stall.

There are two main approaches we can take here:

  1. Kill the node. This is the approach taken both by the Pebble disk stall detector and by other databases (e.g. FoundationDB). It is simple, effective, and reliable, but it affects all processing on the node (including otherwise healthy replicas and stores), and it can cause quorum loss with correlated stalls across multiple nodes (e.g. network disk outages). Pebble also defaults to a fairly conservative 20-second threshold to avoid false positives, which many users consider too high.

  2. Cancel all requests. This is harder, because we have to keep track of the contexts of all in-flight requests (this had a non-negligible latency cost the last time we prototyped it), and the requests have to actually respond to context cancellation (they don't if they're waiting on a mutex or a syscall). But if we can do this, it would limit the impact to the stalled replicas. A sketch of the required bookkeeping follows this list.
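A minimal sketch of the bookkeeping approach 2 implies, assuming a per-replica registry of cancel functions; the names (`inflightRegistry`, `track`, `cancelAll`) are hypothetical and not CockroachDB's API.

```go
package main

import (
	"context"
	"sync"
)

// inflightRegistry is a sketch (not CockroachDB code) of the bookkeeping
// approach 2 would need: every request registers a cancel function while it
// is being served by a replica, and a stall detector can cancel them all.
type inflightRegistry struct {
	mu      sync.Mutex
	nextID  int
	cancels map[int]context.CancelFunc
}

func newInflightRegistry() *inflightRegistry {
	return &inflightRegistry{cancels: map[int]context.CancelFunc{}}
}

// track wraps ctx so it can be cancelled en masse; the returned release
// function must be called when the request finishes normally.
func (r *inflightRegistry) track(ctx context.Context) (context.Context, func()) {
	ctx, cancel := context.WithCancel(ctx)
	r.mu.Lock()
	id := r.nextID
	r.nextID++
	r.cancels[id] = cancel
	r.mu.Unlock()
	return ctx, func() {
		r.mu.Lock()
		delete(r.cancels, id)
		r.mu.Unlock()
		cancel()
	}
}

// cancelAll is what a stall detector would call for a wedged replica. Note
// the caveat from the issue: this only helps if request processing actually
// checks ctx.Done(), which blocked mutex acquisitions and syscalls do not.
func (r *inflightRegistry) cancelAll() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for id, cancel := range r.cancels {
		cancel()
		delete(r.cancels, id)
	}
}

func main() {
	reg := newInflightRegistry()
	ctx, release := reg.track(context.Background())
	defer release()
	_ = ctx         // the request would run with ctx here
	reg.cancelAll() // what the stall detector would do for this replica
}
```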

There is also the question of how to detect stalls. Pebble does this by timing individual syscalls. At the replica level, we could e.g. have a replica mutex watchdog, or monitor latency in the Raft scheduler and during request processing.
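To make the mutex-watchdog idea concrete, here is a rough sketch under the assumption that a periodic `TryLock` probe is acceptable; `replicaSketch`, the interval, and the `log.Printf` reporting are placeholders, not CockroachDB code.

```go
package main

import (
	"log"
	"sync"
	"time"
)

// replicaSketch stands in for a replica with a heavily contended mutex;
// the real Replica.mu is an RWMutex with more structure.
type replicaSketch struct {
	mu sync.Mutex
}

// watchMutex sketches the "replica mutex watchdog" idea: periodically try to
// acquire the mutex and report a stall if it stays unavailable for longer
// than threshold. The reporting action (log.Printf) is a placeholder for
// whatever remediation is chosen (killing the node, cancelling requests).
func watchMutex(r *replicaSketch, interval, threshold time.Duration) {
	var stalledSince time.Time
	for range time.Tick(interval) {
		if r.mu.TryLock() {
			r.mu.Unlock()
			stalledSince = time.Time{}
			continue
		}
		if stalledSince.IsZero() {
			stalledSince = time.Now()
		} else if d := time.Since(stalledSince); d > threshold {
			log.Printf("replica mutex unavailable for %s: possible stall", d)
		}
	}
}

func main() {
	r := &replicaSketch{}
	r.mu.Lock() // simulate a stalled holder that never unlocks
	go watchMutex(r, 100*time.Millisecond, time.Second)
	time.Sleep(3 * time.Second)
}
```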

Jira issue: CRDB-28435

Epic CRDB-25199

erikgrinaker added the C-bug and T-kv-replication labels Jun 2, 2023

blathers-crl bot commented Jun 2, 2023

cc @cockroachdb/replication

ajd12342 commented

Hi @erikgrinaker!
I am Anuj Diwan, a Computer Science PhD student at UT Austin. I am part of a team with @arjunrs1 (Arjun Somayazulu), and we're taking a graduate Distributed Systems course. For our course project, we are interested in contributing to CockroachDB, and this issue is related to our course material. Could we work on this issue? Any pointers to get started would also be appreciated.

Thanks and regards,
Anuj.

erikgrinaker commented

Hi @ajd12342! This is a complex issue that requires a lot of context and design work, so I don't think it's suitable for external contributors to pick up. You may be more interested in a simpler issue, like #99783.

nvanbenschoten added the P-3 label Jan 10, 2024
exalate-issue-sync bot added the C-enhancement label Jan 26, 2024
craig bot pushed a commit that referenced this issue Feb 20, 2024
119130: batcheval: don't take out latches for `LeaseInfo` r=erikgrinaker a=erikgrinaker

Extracted from #118943.

---

This patch removes the `LeaseInfo` read latch on `RangeLeaseKey`. This makes the request latchless, which allows the request to be used as a simple replica health check without worrying about latch delays (e.g. with DistSender circuit breaker probes).

This latch was ineffectual anyway, since lease requests don't take out write latches (they're evaluated on the proposer). The lease is read from the replica's in-memory state.

Touches #104262.
Touches #118943.
Epic: none
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
craig bot closed this as completed in bf013ea Mar 4, 2024
craig bot pushed a commit that referenced this issue Mar 12, 2024
119865: kvcoord: don't probe idle replicas with tripped circuit breakers r=erikgrinaker a=erikgrinaker

Previously, DistSender circuit breakers would continually probe replicas with tripped circuit breakers. This could result in a large number of concurrent probe goroutines.

To reduce the number of probes, this patch instead only probes replicas that have seen traffic in the past few probe intervals. Otherwise, the probe exits (leaving the breaker tripped), and a new probe will be launched on the next request to the replica. This will increase latency for the first request(s) following recovery, as it must wait for the probe to succeed and the DistSender to retry the replica. In the common case (e.g. with a disk stall), the lease will shortly have moved to a different replica and the DistSender will stop sending requests to it.

Replicas with tripped circuit breakers are also made eligible for garbage collection, but with a higher threshold of 1 hour to avoid overeager GC.

Resolves #119917.
Touches #104262.
Touches #105168.
Epic: none
Release note: None
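For illustration, a rough sketch of the probing policy described in #119865 above, not the actual DistSender implementation; `breakerSketch`, the probe interval, and the idle threshold are assumptions for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// breakerSketch illustrates the policy from #119865: a tripped breaker only
// keeps probing while the replica has seen recent traffic; otherwise the
// probe exits (leaving the breaker tripped) and is relaunched lazily by the
// next request to the replica.
type breakerSketch struct {
	mu        sync.Mutex
	tripped   bool
	probing   bool
	lastSeen  time.Time
	idleAfter time.Duration
	probeFn   func() bool // returns true once the replica is healthy again
}

// onRequest records traffic, relaunches the probe if needed, and reports
// whether the request may be sent to this replica.
func (b *breakerSketch) onRequest() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.lastSeen = time.Now()
	if b.tripped && !b.probing {
		b.probing = true
		go b.probeLoop()
	}
	return !b.tripped
}

func (b *breakerSketch) probeLoop() {
	for {
		time.Sleep(100 * time.Millisecond) // probe interval (illustrative)
		b.mu.Lock()
		idle := time.Since(b.lastSeen) > b.idleAfter
		b.mu.Unlock()
		if idle {
			// Replica is idle: stop probing; the breaker stays tripped.
			b.mu.Lock()
			b.probing = false
			b.mu.Unlock()
			return
		}
		if b.probeFn() {
			b.mu.Lock()
			b.tripped, b.probing = false, false
			b.mu.Unlock()
			return
		}
	}
}

func main() {
	b := &breakerSketch{tripped: true, idleAfter: time.Second,
		probeFn: func() bool { return false }} // replica still unhealthy
	fmt.Println("request allowed:", b.onRequest()) // relaunches the probe
	time.Sleep(2 * time.Second)                    // probe exits once idle
}
```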

120168: sql: reserve tenant 2 r=dt a=dt

We might want this later, and it is easier to reserve it now than it will be once we let 24.1 UA use it.

Release note: none.
Epic: none.

120197: db-console/debug: clarify cpu profile link r=dt a=dt

The all-caps MEMORY OVERHEAD in the title of this one profile looks out of place compared to the rest of the page and is somewhat confusing: is it profiling the memory overhead, as implied by its placement in the title (which for all other links says what the tool does), or does the tool itself have overhead? In practice it is the latter, but serving any request has _some_ overhead, so this hardly warrants the unusual treatment.

Release note: none.
Epic: none.

120263: restore: record provenance of restored tenant for pcr r=dt a=dt

Release note (enterprise change): a restored backup of a virtual cluster, produced via BACKUP VIRTUAL CLUSTER and RESTORE VIRTUAL CLUSTER, can now be used to start PCR replication without an initial scan.
Epic: none.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>