
kvserver: disk stall prevents lease transfer #81100

Closed · erikgrinaker opened this issue May 6, 2022 · 3 comments · Fixed by #118943
Labels: A-kv Anything in KV that doesn't belong in a more specific category. · C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. · T-kv KV Team

Comments


erikgrinaker commented May 6, 2022

A node experiencing a persistent disk stall will fail to maintain liveness, since we do a sync disk write before each heartbeat:

// We synchronously write to all disks before updating liveness because we
// don't want any excessively slow disks to prevent leases from being
// shifted to other nodes. A slow/stalled disk would block here and cause
// the node to lose its leases.
if err := storage.WriteSyncNoop(ctx, eng); err != nil {
    return Record{}, errors.Wrapf(err, "couldn't update node liveness because disk write failed")
}

This will in turn cause it to lose all leases. However, it will also prevent anyone else from acquiring the lease, causing all relevant ranges to get stuck until the disk recovers or the node is shut down (e.g. due to the Pebble disk stall detector which panics).
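For illustration, here is a standalone sketch of what such a sync write probe boils down to (illustrative only: the file name and helper are made up, and this is not the actual `storage.WriteSyncNoop` implementation):

```go
package diskprobe

import (
	"os"
	"path/filepath"
)

// diskProbe is an illustrative stand-in for the noop sync write above: it
// writes a byte to a scratch file on the store's volume and fsyncs it, which
// blocks for as long as the underlying disk is stalled.
func diskProbe(storeDir string) error {
	f, err := os.OpenFile(filepath.Join(storeDir, "probe.tmp"), os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write([]byte{0}); err != nil {
		return err
	}
	// Sync (fsync) is the call that never returns while the disk is stalled.
	return f.Sync()
}
```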

Notes from elsewhere below.


Reproduction Details

I'm intentionally trying to reduce the failure modes to the simplest case here -- we can look into more complicated ones when we've picked the low-hanging fruit.

  • 6 nodes: n6 only used for workload, n5 is the faulty node.
  • All system ranges are pinned to n1-4, to avoid interactions with them.
  • Logs are stored on a separate disk, so logging is not stalled.
  • All client traffic is directed at n1-4, to avoid interactions with client processing.

Instructions:

  • Create a 6-node cluster and stage it with necessary binaries:

    roachprod create -n 6 grinaker-diskstall
    roachprod stage grinaker-diskstall release v21.2.9
    roachprod put grinaker-diskstall bin.docker_amd64/workload
    
  • Wrap the SSDs in a delayable device, with 0 delay initially:

    roachprod run grinaker-diskstall 'sudo umount /mnt/data1'
    roachprod run grinaker-diskstall 'echo "0 `sudo blockdev --getsz /dev/nvme0n1` delay /dev/nvme0n1 0 0" | sudo dmsetup create delayed'
    roachprod run grinaker-diskstall 'sudo mount /dev/mapper/delayed /mnt/data1'
    
  • Start the cluster and configure it:

    roachprod start --racks 5 grinaker-diskstall:1-5
    
    ALTER RANGE meta CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER RANGE liveness CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER RANGE system CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER RANGE timeseries CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER DATABASE system CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER RANGE default CONFIGURE ZONE USING gc.ttlseconds = 600;
    
  • Start a KV workload from n6 against n1-4:

    ./workload run kv --init --splits 100 --read-percent 95 --batch 10 --min-block-bytes 4000 --max-block-bytes 4000 --tolerate-errors <pgurls>
    
  • Introduce an IO delay of 100s per operation on n5 (immediately takes effect, but command hangs for a while):

    echo "0 `sudo blockdev --getsz /dev/nvme0n1` delay /dev/nvme0n1 0 100000" | sudo dmsetup reload delayed && sudo dmsetup resume delayed
    
    # Inspect delay
    sudo dmsetup table delayed
    
    # Disable delay
    echo "0 `sudo blockdev --getsz /dev/nvme0n1` delay /dev/nvme0n1 0 0" | sudo dmsetup reload delayed && sudo dmsetup resume delayed
    

The workload throughput will now have dropped to ~0.

When n5 stalls, all other nodes have it cached as the leaseholder, and will send RPCs to it first. These nodes rely on n5 to return a NotLeaseHolderError to inform them that the lease is invalid, at which point they will try a different node. When this other node receives the RPC, it will detect the invalid lease, acquire a new lease, then return a successful response to the caller who updates its lease cache.

What happens here is that n5 never returns a NotLeaseHolderError, which prevents the rest from happening -- including another node claiming the lease. The lease thus remains invalid, and the workload stalls.
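To make the retry flow concrete, here is a heavily simplified, hypothetical sketch of the client-side loop (the types and function names are made up and far simpler than the real DistSender):

```go
package leaseretry

import "context"

// Replica is a hypothetical stand-in for a per-replica RPC client; it is not
// CockroachDB's actual DistSender API.
type Replica interface {
	Send(ctx context.Context, req string) (string, error)
}

// send sketches the client-side behavior described above: try the cached
// leaseholder first, and fall back to the other replicas only after it
// answers with a NotLeaseHolderError-style rejection. If the cached
// leaseholder never answers at all (e.g. its disk is stalled), the loop never
// advances and the request hangs with it.
func send(ctx context.Context, replicas []Replica, cachedLeaseholder int, req string) (string, error) {
	order := []int{cachedLeaseholder}
	for i := range replicas {
		if i != cachedLeaseholder {
			order = append(order, i)
		}
	}
	var lastErr error
	for _, i := range order {
		resp, err := replicas[i].Send(ctx, req)
		if err == nil {
			return resp, nil // this replica held, or just acquired, the lease
		}
		lastErr = err // treat the error as "not the leaseholder, try the next one"
	}
	return "", lastErr
}
```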

So why doesn't n5 return a NotLeaseHolderError? It notices that its lease is invalid and tries to reacquire it for itself first. However, before acquiring the lease it sends a heartbeat to make sure it's live. Recall that sending a heartbeat performs a sync disk write (which is what made it lose the lease in the first place), and because the disk is stalled, this heartbeat also stalls. Forever.

// If this replica is previous & next lease holder, manually heartbeat to become live.
if status.OwnedBy(nextLeaseHolder.StoreID) &&
    p.repl.store.StoreID() == nextLeaseHolder.StoreID {
    if err = p.repl.store.cfg.NodeLiveness.Heartbeat(ctx, status.Liveness); err != nil {
        log.Errorf(ctx, "failed to heartbeat own liveness record: %s", err)
If I add a simple 5-second timeout for lease acquisitions, this operation fails and returns a NotLeaseHolderError, allowing a different node to claim the lease. You may ask: how can context cancellation work when we're blocked on a sync disk write? The answer is that we're not actually blocked on the sync write itself, we're blocked on a semaphore waiting for a previous heartbeat to complete, and that heartbeat is blocked on a sync write:

// Allow only one heartbeat at a time.
nodeID := nl.gossip.NodeID.Get()
sem := nl.sem(nodeID)
select {
case sem <- struct{}{}:
case <-ctx.Done():
    return ctx.Err()
}
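As a rough sketch of that experiment (the `heartbeat` function is a hypothetical stand-in, not the actual `NodeLiveness` API):

```go
package leasetimeout

import (
	"context"
	"time"
)

// acquireLeaseWithTimeout sketches the experiment described above: bound the
// lease acquisition (and the liveness heartbeat inside it) by a deadline so
// that a stalled disk turns into an error instead of an indefinite hang.
func acquireLeaseWithTimeout(ctx context.Context, heartbeat func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	// If a previous heartbeat is stuck on a sync disk write, the semaphore
	// select shown above returns ctx.Err() once this deadline expires, and
	// the caller can turn that into a NotLeaseHolderError for the client.
	return heartbeat(ctx)
}
```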

With this timeout in place, things are much better:

  • A read-only workload recovers within 15 seconds of the disk stall.
  • A write-only workload hangs forever, because all clients are blocked on Raft application. However, new reads and writes succeed after 15 seconds.
  • A write-only workload with 10-second statement timeouts recovers within 30 seconds.
  • A read/write workload with 10-second statement timeouts recovers within 15 seconds.

To completely fix this, such that clients don't get stuck on Raft application (or waiting on their latches), we'll need circuit breaker support (#74712). However, even without that, this is a much better state of affairs where clients with timeouts will recover within a reasonable time.

I'm not sure a timeout is necessarily the best approach here, so I'm going to look into a few options, but this is definitely tractable in the short term and likely backportable in some form.

Jira issue: CRDB-15145

gz#13737

Epic CRDB-19227

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication KV Replication Team labels May 6, 2022
@erikgrinaker erikgrinaker added this to Incoming in Replication via automation May 6, 2022
@erikgrinaker erikgrinaker self-assigned this May 6, 2022
@erikgrinaker erikgrinaker moved this from Incoming to Current Milestone / In Progress in Replication May 7, 2022
craig bot pushed a commit that referenced this issue May 18, 2022
81133: liveness: improve disk probes during node liveness updates r=tbg a=erikgrinaker

When `NodeLiveness` updates the liveness record (e.g. during
heartbeats), it first does a noop sync write to all disks. This ensures
that a node with a stalled disk will fail to maintain liveness and lose
its leases.

However, this sync write could block indefinitely, and would not respect
the caller's context, which could cause the caller to stall rather than
time out. This in turn could lead to stalls higher up in the stack,
in particular with lease acquisitions that do a synchronous heartbeat.

This patch does the sync write in a separate goroutine in order to
respect the caller's context. The write operation itself will not
(can not) respect the context, and may thus leak a goroutine. However,
concurrent sync writes will coalesce onto an in-flight write.

Additionally, this runs the sync writes in parallel across all disks,
since we can now trivially do so. This may be advantageous on nodes with
many stores, to avoid spurious heartbeat failures under load.

Touches #81100.

Release note (bug fix): Disk write probes during node liveness
heartbeats will no longer get stuck on stalled disks, instead returning
an error once the operation times out. Additionally, disk probes now run
in parallel on nodes with multiple stores.
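The goroutine-plus-coalescing pattern that #81133 describes can be sketched roughly as follows (illustrative types only, not the actual implementation; a real node would run one such prober per store and kick the probes off in parallel, per the release note above):

```go
package livenessprobe

import (
	"context"
	"sync"
)

// inflightProbe represents one blocking disk probe that callers can wait on.
type inflightProbe struct {
	done chan struct{} // closed when the probe completes
	err  error         // set before done is closed
}

// prober coalesces concurrent probes onto a single in-flight sync write, in
// the spirit of #81133. probe is an illustrative stand-in for the blocking
// sync write; it cannot be interrupted, so callers wait on a channel instead.
type prober struct {
	mu       sync.Mutex
	inflight *inflightProbe
	probe    func() error
}

// probeWithContext returns when the probe completes or the caller's context
// is done, whichever happens first.
func (p *prober) probeWithContext(ctx context.Context) error {
	p.mu.Lock()
	ip := p.inflight
	if ip == nil {
		// No probe in flight: start one. The goroutine may outlive the caller
		// (and effectively leak for the duration of a stall), but the caller
		// itself returns as soon as its context expires.
		ip = &inflightProbe{done: make(chan struct{})}
		p.inflight = ip
		go func() {
			ip.err = p.probe() // blocks for as long as the disk is stalled
			p.mu.Lock()
			p.inflight = nil
			p.mu.Unlock()
			close(ip.done)
		}()
	}
	p.mu.Unlock()

	select {
	case <-ip.done:
		return ip.err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```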

81134: kvserver: don't return lease when failing to acquire r=nvanbenschoten,tbg a=erikgrinaker

**kvserver: improve lease acquisition in `Store.ManuallyEnqueue()`**

Manually enqueuing a range through a queue may sometimes require
acquiring the lease first, since it can only be run on the leaseholder.
Previously this was done through `getLeaseForGossip`, which was meant
for acquiring the gossip lease and would hide certain errors from the
queue.

This patch instead uses `redirectOnOrAcquireLease` to acquire the lease,
and returns an error if the local replica failed to acquire the lease
(even if there is a valid leaseholder elsewhere).

Release note: None

**kvserver: don't return lease when failing to acquire**

This patch does not return the previous, expired lease when a replica
fails to acquire it for itself. In particular, this prevents the calling
DistSender from continually retrying the old leaseholder if it fails to
reacquire the lease (e.g. due to a stalled disk).

Touches #81100.

Release note: None

81416: roachtest: do not fail costfuzz test on perturbed error r=msirek a=michae2

The costfuzz test repeatedly executes a randomly generated query twice,
once with a normal plan and once with a perturbed-costs plan. We were
considering a successful first execution followed by an errored second
execution a test failure. But there are certain queries that
legitimately return errors nondeterministically depending on plan. For
example:

```sql
CREATE TABLE t (a INT PRIMARY KEY);
CREATE TABLE u (b INT PRIMARY KEY);
INSERT INTO t VALUES (0);
SELECT * FROM t INNER HASH JOIN u ON a = b + 1 WHERE 1 / a = 1;
SELECT * FROM u INNER HASH JOIN t ON a = b + 1 WHERE 1 / a = 1;
```

The first plan will successfully return 0 rows because the hash join can
short-circuit without evaluating 1 / a. The second plan will return a
divide-by-zero error because 1 / a is evaluated while building the hash
table. This is not a bug.

Internal errors are a little more interesting than normal SQL-level
errors, so we still consider this a test failure on internal errors.

Fixes: #81032

Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>

erikgrinaker commented May 19, 2022

Short-term fixes have been submitted in:

These improve the situation somewhat, in that a single non-system range will typically recover within ~20 seconds. However, if a query touches multiple affected ranges then the ~20s wait is sequential by range, and can easily add up to several minutes. Additionally, system ranges may not recover at all because the RPC caller may time out before the NLHE is returned, preventing anyone else from acquiring the lease (#80713). Also, in-flight requests will block indefinitely until we implement replica circuit breaker support for this scenario (#77366).

I think a full solution here may require rethinking our lease acquisition/transfer protocol such that it isn't driven by client requests and dependent on the active cooperation of the previous (stalled) leaseholder. We'll need to allocate some resources for this, so I'm going to leave this here for now.

@erikgrinaker erikgrinaker removed their assignment May 19, 2022
@erikgrinaker erikgrinaker removed this from Current Milestone / In Progress in Replication May 19, 2022
@erikgrinaker erikgrinaker added this to Incoming in KV via automation May 19, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label May 19, 2022
@erikgrinaker erikgrinaker added A-kv Anything in KV that doesn't belong in a more specific category. and removed T-kv-replication KV Replication Team A-kv-replication Relating to Raft, consensus, and coordination. labels May 19, 2022
@jlinder jlinder added sync-me and removed sync-me labels May 20, 2022
craig bot pushed a commit that referenced this issue May 25, 2022
81136: kvserver: add timeout for lease acquisitions  r=tbg a=erikgrinaker

**kvserver: deflake `TestLeasePreferencesDuringOutage`**

This patch deflakes `TestLeasePreferencesDuringOutage` by allowing nodes
to acquire a lease even when they are not the Raft leader.

Release note: None

**kvserver: add timeout for lease acquisitions**

This patch adds a timeout for lease acquisitions. It is set to
twice the Raft election timeout (6 seconds total), since it may need to
hold a Raft election and repropose the lease acquisition command, each
of which may take up to one election timeout.

Without this timeout, it's possible for a lease acquisition to stall
indefinitely (e.g. in the case of a stalled disk). This prevents a
`NotLeaseHolderError` from being returned to the client DistSender,
which in turn prevents it from trying other replicas that could acquire
the lease instead. This can cause a lease to remain invalid forever.

Release note (bug fix): Fixed a bug where an unresponsive node (e.g.
with a stalled disk) could prevent other nodes from acquiring its
leases, effectively stalling these ranges until the node was shut down
or recovered.

Touches #81100.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@mwang1026 mwang1026 moved this from Incoming to Prioritized in KV Jun 1, 2022
craig bot pushed a commit that referenced this issue Dec 28, 2022
94240: roachtest: add `disk-stall` variant of `failover` tests r=erikgrinaker a=erikgrinaker

**roachtest: reorganize failer interface**

This patch reorganizes the `failer` interface with `Setup`, `Ready`, and `Cleanup` hooks, providing it with access to the monitor. It also passes the test and cluster references via the constructor.

Epic: none
Release note: None
  
**roachtest: add `disk-stall` variant of `failover` tests**

This patch adds `disk-stall` variants of the `failover` tests, which benchmark the pMax unavailability when a leaseholder experiences a persistent disk stall.

Resolves #94231.
Touches #81100.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@erikgrinaker

This will be fixed by #118943.

@craig craig bot closed this as completed in bf013ea Mar 4, 2024
KV automation moved this from Prioritized to Closed Mar 4, 2024