
kvserver: disk stall prevents lease transfer #81100

Closed · erikgrinaker opened this issue May 6, 2022 · 3 comments · Fixed by #118943
Labels: A-kv Anything in KV that doesn't belong in a more specific category. · C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. · T-kv KV Team

Comments


erikgrinaker commented May 6, 2022

A node experiencing a persistent disk stall will fail to maintain liveness, since we do a sync disk write before each heartbeat:

// We synchronously write to all disks before updating liveness because we
// don't want any excessively slow disks to prevent leases from being
// shifted to other nodes. A slow/stalled disk would block here and cause
// the node to lose its leases.
if err := storage.WriteSyncNoop(ctx, eng); err != nil {
    return Record{}, errors.Wrapf(err, "couldn't update node liveness because disk write failed")
}

This will in turn cause it to lose all leases. However, it will also prevent anyone else from acquiring the lease, causing all relevant ranges to get stuck until the disk recovers or the node is shut down (e.g. due to the Pebble disk stall detector which panics).
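For illustration, here is a standalone sketch of what such a sync write probe boils down to (illustrative only: the file name and helper are made up, and this is not the actual `storage.WriteSyncNoop` implementation):

```go
package diskprobe

import (
	"os"
	"path/filepath"
)

// diskProbe is an illustrative stand-in for the noop sync write above: it
// writes a byte to a scratch file on the store's volume and fsyncs it, which
// blocks for as long as the underlying disk is stalled.
func diskProbe(storeDir string) error {
	f, err := os.OpenFile(filepath.Join(storeDir, "probe.tmp"), os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write([]byte{0}); err != nil {
		return err
	}
	// Sync (fsync) is the call that never returns while the disk is stalled.
	return f.Sync()
}
```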

Notes from elsewhere below.


Reproduction Details

I'm intentionally trying to reduce the failure modes to the simplest case here -- we can look into more complicated ones when we've picked the low-hanging fruit.

  • 6 nodes: n6 only used for workload, n5 is the faulty node.
  • All system ranges are pinned to n1-4, to avoid interactions with them.
  • Logs are stored on a separate disk, so logging is not stalled.
  • All client traffic is directed at n1-4, to avoid interactions with client processing.

Instructions:

  • Create a 6-node cluster and stage it with necessary binaries:

    roachprod create -n 6 grinaker-diskstall
    roachprod stage grinaker-diskstall release v21.2.9
    roachprod put grinaker-diskstall bin.docker_amd64/workload
    
  • Wrap the SSDs in a delayable device, with 0 delay initially:

    roachprod run grinaker-diskstall 'sudo umount /mnt/data1'
    roachprod run grinaker-diskstall 'echo "0 `sudo blockdev --getsz /dev/nvme0n1` delay /dev/nvme0n1 0 0" | sudo dmsetup create delayed'
    roachprod run grinaker-diskstall 'sudo mount /dev/mapper/delayed /mnt/data1'
    
  • Start the cluster and configure it:

    roachprod start --racks 5 grinaker-diskstall:1-5
    
    ALTER RANGE meta CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER RANGE liveness CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER RANGE system CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER RANGE timeseries CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER DATABASE system CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER RANGE default CONFIGURE ZONE USING gc.ttlseconds = 600;
    
  • Start a KV workload from n6 against n1-4:

    ./workload run kv --init --splits 100 --read-percent 95 --batch 10 --min-block-bytes 4000 --max-block-bytes 4000 --tolerate-errors <pgurls>
    
  • Introduce an IO delay of 100s per operation on n5 (immediately takes effect, but command hangs for a while):

    echo "0 `sudo blockdev --getsz /dev/nvme0n1` delay /dev/nvme0n1 0 100000" | sudo dmsetup reload delayed && sudo dmsetup resume delayed
    
    # Inspect delay
    sudo dmsetup table delayed
    
    # Disable delay
    echo "0 `sudo blockdev --getsz /dev/nvme0n1` delay /dev/nvme0n1 0 0" | sudo dmsetup reload delayed && sudo dmsetup resume delayed
    

The workload throughput will now have dropped to ~0.

When n5 stalls, all other nodes have it cached as the leaseholder, and will send RPCs to it first. These nodes rely on n5 to return a NotLeaseHolderError to inform them that the lease is invalid, at which point they will try a different node. When this other node receives the RPC, it will detect the invalid lease, acquire a new lease, then return a successful response to the caller who updates its lease cache.

What happens here is that n5 never returns a NotLeaseHolderError, which prevents the rest from happening -- including another node claiming the lease. The lease thus remains invalid, and the workload stalls.
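To make the retry flow concrete, here is a heavily simplified, hypothetical sketch of the client-side loop (the types and function names are made up and far simpler than the real DistSender):

```go
package leaseretry

import "context"

// Replica is a hypothetical stand-in for a per-replica RPC client; it is not
// CockroachDB's actual DistSender API.
type Replica interface {
	Send(ctx context.Context, req string) (string, error)
}

// send sketches the client-side behavior described above: try the cached
// leaseholder first, and fall back to the other replicas only after it
// answers with a NotLeaseHolderError-style rejection. If the cached
// leaseholder never answers at all (e.g. its disk is stalled), the loop never
// advances and the request hangs with it.
func send(ctx context.Context, replicas []Replica, cachedLeaseholder int, req string) (string, error) {
	order := []int{cachedLeaseholder}
	for i := range replicas {
		if i != cachedLeaseholder {
			order = append(order, i)
		}
	}
	var lastErr error
	for _, i := range order {
		resp, err := replicas[i].Send(ctx, req)
		if err == nil {
			return resp, nil // this replica held, or just acquired, the lease
		}
		lastErr = err // treat the error as "not the leaseholder, try the next one"
	}
	return "", lastErr
}
```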

So why doesn't n5 return a NotLeaseHolderError? It notices that its lease is invalid and tries to reacquire it for itself first. However, before acquiring the lease it sends a heartbeat to make sure it's live. Recall that sending a heartbeat performs a sync disk write (which is what made it lose the lease in the first place), and because the disk is stalled, this heartbeat also stalls. Forever.

// If this replica is previous & next lease holder, manually heartbeat to become live.
if status.OwnedBy(nextLeaseHolder.StoreID) &&
    p.repl.store.StoreID() == nextLeaseHolder.StoreID {
    if err = p.repl.store.cfg.NodeLiveness.Heartbeat(ctx, status.Liveness); err != nil {
        log.Errorf(ctx, "failed to heartbeat own liveness record: %s", err)
If I add a simple 5-second timeout for lease acquisitions, this operation fails and returns a NotLeaseHolderError, allowing a different node to claim the lease. You may ask: how can context cancellation work when we're blocked on a sync disk write? The answer is that we're not actually blocked on the sync write itself, we're blocked on a semaphore waiting for a previous heartbeat to complete, and that heartbeat is blocked on a sync write:

// Allow only one heartbeat at a time.
nodeID := nl.gossip.NodeID.Get()
sem := nl.sem(nodeID)
select {
case sem <- struct{}{}:
case <-ctx.Done():
    return ctx.Err()
}
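As a rough sketch of that experiment (the `heartbeat` function is a hypothetical stand-in, not the actual `NodeLiveness` API):

```go
package leasetimeout

import (
	"context"
	"time"
)

// acquireLeaseWithTimeout sketches the experiment described above: bound the
// lease acquisition (and the liveness heartbeat inside it) by a deadline so
// that a stalled disk turns into an error instead of an indefinite hang.
func acquireLeaseWithTimeout(ctx context.Context, heartbeat func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	// If a previous heartbeat is stuck on a sync disk write, the semaphore
	// select shown above returns ctx.Err() once this deadline expires, and
	// the caller can turn that into a NotLeaseHolderError for the client.
	return heartbeat(ctx)
}
```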

With this timeout in place, things are much better:

  • A read-only workload recovers within 15 seconds of the disk stall.
  • A write-only workload hangs forever, because all clients are blocked on Raft application. However, new reads and writes succeed after 15 seconds.
  • A write-only workload with 10-second statement timeouts recovers within 30 seconds.
  • A read/write workload with 10-second statement timeouts recovers within 15 seconds.

To completely fix this, such that clients don't get stuck on Raft application (or waiting on their latches), we'll need circuit breaker support (#74712). However, even without that, this is a much better state of affairs where clients with timeouts will recover within a reasonable time.

I'm not sure a timeout is necessarily the best approach here, so I'm going to look into a few options, but this is definitely tractable in the short term and likely backportable in some form.

Jira issue: CRDB-15145

gz#13737

Epic CRDB-19227

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication KV Replication Team labels May 6, 2022
@erikgrinaker erikgrinaker added this to Incoming in Replication via automation May 6, 2022
@erikgrinaker erikgrinaker self-assigned this May 6, 2022
@erikgrinaker erikgrinaker moved this from Incoming to Current Milestone / In Progress in Replication May 7, 2022
craig bot pushed a commit that referenced this issue May 18, 2022
81133: liveness: improve disk probes during node liveness updates r=tbg a=erikgrinaker

When `NodeLiveness` updates the liveness record (e.g. during
heartbeats), it first does a noop sync write to all disks. This ensures
that a node with a stalled disk will fail to maintain liveness and lose
its leases.

However, this sync write could block indefinitely, and would not respect
the caller's context, which could cause the caller to stall rather than
time out. This in turn could lead to stalls higher up in the stack,
in particular with lease acquisitions that do a synchronous heartbeat.

This patch does the sync write in a separate goroutine in order to
respect the caller's context. The write operation itself will not
(can not) respect the context, and may thus leak a goroutine. However,
concurrent sync writes will coalesce onto an in-flight write.

Additionally, this runs the sync writes in parallel across all disks,
since we can now trivially do so. This may be advantageous on nodes with
many stores, to avoid spurious heartbeat failures under load.

Touches #81100.

Release note (bug fix): Disk write probes during node liveness
heartbeats will no longer get stuck on stalled disks, instead returning
an error once the operation times out. Additionally, disk probes now run
in parallel on nodes with multiple stores.
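The goroutine-plus-coalescing pattern that #81133 describes can be sketched roughly as follows (illustrative types only, not the actual implementation; a real node would run one such prober per store and kick the probes off in parallel, per the release note above):

```go
package livenessprobe

import (
	"context"
	"sync"
)

// inflightProbe represents one blocking disk probe that callers can wait on.
type inflightProbe struct {
	done chan struct{} // closed when the probe completes
	err  error         // set before done is closed
}

// prober coalesces concurrent probes onto a single in-flight sync write, in
// the spirit of #81133. probe is an illustrative stand-in for the blocking
// sync write; it cannot be interrupted, so callers wait on a channel instead.
type prober struct {
	mu       sync.Mutex
	inflight *inflightProbe
	probe    func() error
}

// probeWithContext returns when the probe completes or the caller's context
// is done, whichever happens first.
func (p *prober) probeWithContext(ctx context.Context) error {
	p.mu.Lock()
	ip := p.inflight
	if ip == nil {
		// No probe in flight: start one. The goroutine may outlive the caller
		// (and effectively leak for the duration of a stall), but the caller
		// itself returns as soon as its context expires.
		ip = &inflightProbe{done: make(chan struct{})}
		p.inflight = ip
		go func() {
			ip.err = p.probe() // blocks for as long as the disk is stalled
			p.mu.Lock()
			p.inflight = nil
			p.mu.Unlock()
			close(ip.done)
		}()
	}
	p.mu.Unlock()

	select {
	case <-ip.done:
		return ip.err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```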

81134: kvserver: don't return lease when failing to acquire r=nvanbenschoten,tbg a=erikgrinaker

**kvserver: improve lease acquisition in `Store.ManuallyEnqueue()`**

Manually enqueuing a range through a queue may sometimes require
acquiring the lease first, since it can only be run on the leaseholder.
Previously this was done through `getLeaseForGossip`, which was meant
for acquiring the gossip lease and would hide certain errors from the
queue.

This patch instead uses `redirectOnOrAcquireLease` to acquire the lease,
and returns an error if the local replica failed to acquire the lease
(even if there is a valid leaseholder elsewhere).

Release note: None

**kvserver: don't return lease when failing to acquire**

This patch does not return the previous, expired lease when a replica
fails to acquire it for itself. In particular, this prevents the calling
DistSender from continually retrying the old leaseholder if it fails to
reacquire the lease (e.g. due to a stalled disk).

Touches #81100.

Release note: None

81416: roachtest: do not fail costfuzz test on perturbed error r=msirek a=michae2

The costfuzz test repeatedly executes a randomly generated query twice,
once with a normal plan and once with a perturbed-costs plan. We were
considering a successful first execution followed by an errored second
execution a test failure. But there are certain queries that
legitimately return errors nondeterministically depending on plan. For
example:

```sql
CREATE TABLE t (a INT PRIMARY KEY);
CREATE TABLE u (b INT PRIMARY KEY);
INSERT INTO t VALUES (0);
SELECT * FROM t INNER HASH JOIN u ON a = b + 1 WHERE 1 / a = 1;
SELECT * FROM u INNER HASH JOIN t ON a = b + 1 WHERE 1 / a = 1;
```

The first plan will successfully return 0 rows because the hash join can
short-circuit without evaluating 1 / a. The second plan will return a
divide-by-zero error because 1 / a is evaluated while building the hash
table. This is not a bug.

Internal errors are a little more interesting than normal SQL-level
errors, so we still consider this a test failure on internal errors.

Fixes: #81032

Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>

erikgrinaker commented May 19, 2022

Short-term fixes have been submitted in:

These improve the situation somewhat, in that a single non-system range will typically recover within ~20 seconds. However, if a query touches multiple affected ranges then the ~20s wait is sequential by range, and can easily add up to several minutes. Additionally, system ranges may not recover at all because the RPC caller may time out before the NLHE is returned, preventing anyone else from acquiring the lease (#80713). Also, in-flight requests will block indefinitely until we implement replica circuit breaker support for this scenario (#77366).

I think a full solution here may require rethinking our lease acquisition/transfer protocol such that it isn't driven by client requests and dependent on the active cooperation of the previous (stalled) leaseholder. We'll need to allocate some resources for this, so I'm going to leave this here for now.

@erikgrinaker erikgrinaker removed their assignment May 19, 2022
@erikgrinaker erikgrinaker removed this from Current Milestone / In Progress in Replication May 19, 2022
@erikgrinaker erikgrinaker added this to Incoming in KV via automation May 19, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label May 19, 2022
@erikgrinaker erikgrinaker added A-kv Anything in KV that doesn't belong in a more specific category. and removed T-kv-replication KV Replication Team A-kv-replication Relating to Raft, consensus, and coordination. labels May 19, 2022
@jlinder jlinder added sync-me and removed sync-me labels May 20, 2022
craig bot pushed a commit that referenced this issue May 25, 2022
81136: kvserver: add timeout for lease acquisitions  r=tbg a=erikgrinaker

**kvserver: deflake `TestLeasePreferencesDuringOutage`**

This patch deflakes `TestLeasePreferencesDuringOutage` by allowing nodes
to acquire a lease even when they are not the Raft leader.

Release note: None

**kvserver: add timeout for lease acquisitions**

This patch adds a timeout for lease acquisitions. It is set to
twice the Raft election timeout (6 seconds total), since it may need to
hold a Raft election and repropose the lease acquisition command, each
of which may take up to one election timeout.

Without this timeout, it's possible for a lease acquisition to stall
indefinitely (e.g. in the case of a stalled disk). This prevents a
`NotLeaseHolderError` from being returned to the client DistSender,
which in turn prevents it from trying other replicas that could acquire
the lease instead. This can cause a lease to remain invalid forever.

Release note (bug fix): Fixed a bug where an unresponsive node (e.g.
with a stalled disk) could prevent other nodes from acquiring its
leases, effectively stalling these ranges until the node was shut down
or recovered.

Touches #81100.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@mwang1026 mwang1026 moved this from Incoming to Prioritized in KV Jun 1, 2022
craig bot pushed a commit that referenced this issue Dec 28, 2022
94240: roachtest: add `disk-stall` variant of `failover` tests r=erikgrinaker a=erikgrinaker

**roachtest: reorganize failer interface**

This patch reorganizes the `failer` interface with `Setup`, `Ready`, and `Cleanup` hooks, providing it with access to the monitor. It also passes the test and cluster references via the constructor.

Epic: none
Release note: None
  
**roachtest: add `disk-stall` variant of `failover` tests**

This patch adds `disk-stall` variants of the `failover` tests, which benchmark the pMax unavailability when a leaseholder experiences a persistent disk stall.

Resolves #94231.
Touches #81100.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@erikgrinaker

This will be fixed by #118943.

@craig craig bot closed this as completed in bf013ea Mar 4, 2024
KV automation moved this from Prioritized to Closed Mar 4, 2024