
kvserver: persistent outage when liveness leaseholder deadlocks #80713

Closed
tbg opened this issue Apr 28, 2022 · 8 comments · Fixed by #118943

@tbg
Member

tbg commented Apr 28, 2022

Describe the problem

See #79648.

Quoting an internal support issue:

RCA: n1 experienced a disk stall. It held the lease for the liveness range r2, and this information was cached by all nodes. When n1 locked up, all liveness requests continued to be redirected to n1. Since all of these requests had a timeout on them, they eventually returned from DistSender.Send, but this did not invalidate the cached leaseholder entry, rendering the outage permanent. It is likely that the liveness range wasn't technically unavailable, if only nodes had figured out to talk to the new leaseholder.
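
A minimal, purely illustrative model of this failure mode (hypothetical names; the real logic lives in DistSender and the range caches):

// A minimal, self-contained model of the failure mode (hypothetical names;
// the real logic lives in DistSender and the range descriptor/lease caches).
package main

import (
    "context"
    "fmt"
    "time"
)

type rangeCache struct{ leaseholder string }

// send models a liveness request routed to the cached leaseholder, which is
// disk-stalled and never answers. The only thing that ends the call is the
// client-side timeout, so no NotLeaseHolderError ever comes back.
func send(ctx context.Context, rc *rangeCache) error {
    <-ctx.Done()
    return ctx.Err()
}

func main() {
    rc := &rangeCache{leaseholder: "n1"} // n1 holds the liveness lease, then stalls
    for i := 0; i < 3; i++ {
        ctx, cancel := context.WithTimeout(context.Background(), time.Second)
        err := send(ctx, rc)
        cancel()
        // Nothing here evicts rc.leaseholder, so every retry goes back to
        // n1 and the outage persists until n1 is taken down.
        fmt.Printf("attempt %d to %s: %v\n", i+1, rc.leaseholder, err)
    }
}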

To Reproduce

See the PR above (note that its second commit contains a hacky fix; remove that commit to get a repro).

Expected behavior

Failover as usual

Additional data / screenshots
Environment:

  • master at time of writing, and presumably all past versions

Additional context

Persistent total cluster outage since all nodes failed heartbeats.
Resolved only when deadlocked node was brought down.

Jira issue: CRDB-15539

gz#13737

Epic CRDB-19227

gz#19526

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. A-kv-client Relating to the KV client and the KV interface. labels Apr 28, 2022
@tbg tbg added this to Incoming in KV via automation Apr 28, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Apr 28, 2022
@mwang1026 mwang1026 moved this from Incoming to Prioritized in KV May 2, 2022
@tbg
Member Author

tbg commented May 4, 2022

What's interesting is that we do have a disk stall roachtest, but it didn't catch this:

// Copyright 2018 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

package tests

import (
    "context"
    "fmt"
    "math/rand"
    "runtime"
    "strings"
    "time"

    "github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
    "github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
    "github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
    "github.com/cockroachdb/cockroach/pkg/roachprod/install"
    "github.com/cockroachdb/cockroach/pkg/util/timeutil"
)

// RegisterDiskStalledDetection registers the disk stall test.
func RegisterDiskStalledDetection(r registry.Registry) {
    for _, affectsLogDir := range []bool{false, true} {
        for _, affectsDataDir := range []bool{false, true} {
            // Grab copies of the args because we'll pass them into a closure.
            // Everyone's favorite bug to write in Go.
            affectsLogDir := affectsLogDir
            affectsDataDir := affectsDataDir
            r.Add(registry.TestSpec{
                Name: fmt.Sprintf(
                    "disk-stalled/log=%t,data=%t",
                    affectsLogDir, affectsDataDir,
                ),
                Owner:   registry.OwnerStorage,
                Cluster: r.MakeClusterSpec(1),
                Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
                    runDiskStalledDetection(ctx, t, c, affectsLogDir, affectsDataDir)
                },
            })
        }
    }
}

func runDiskStalledDetection(
    ctx context.Context, t test.Test, c cluster.Cluster, affectsLogDir bool, affectsDataDir bool,
) {
    if c.IsLocal() && runtime.GOOS != "linux" {
        t.Fatalf("must run on linux os, found %s", runtime.GOOS)
    }
    n := c.Node(1)
    c.Put(ctx, t.Cockroach(), "./cockroach")
    c.Run(ctx, n, "sudo umount -f {store-dir}/faulty || true")
    c.Run(ctx, n, "mkdir -p {store-dir}/{real,faulty} || true")
    // Make sure the actual logs are downloaded as artifacts.
    c.Run(ctx, n, "rm -f logs && ln -s {store-dir}/real/logs logs || true")

    t.Status("setting up charybdefs")
    if err := c.Install(ctx, t.L(), n, "charybdefs"); err != nil {
        t.Fatal(err)
    }
    c.Run(ctx, n, "sudo charybdefs {store-dir}/faulty -oallow_other,modules=subdir,subdir={store-dir}/real")
    c.Run(ctx, n, "sudo mkdir -p {store-dir}/real/logs")
    c.Run(ctx, n, "sudo chmod -R 777 {store-dir}/{real,faulty}")

    errCh := make(chan install.RunResultDetails)

    // NB: charybdefs' delay nemesis introduces 50ms per syscall. It would
    // be nicer to introduce a longer delay, but this works.
    tooShortSync := 40 * time.Millisecond

    maxLogSync := time.Hour
    logDir := "real/logs"
    if affectsLogDir {
        logDir = "faulty/logs"
        maxLogSync = tooShortSync
    }
    maxDataSync := time.Hour
    dataDir := "real"
    if affectsDataDir {
        maxDataSync = tooShortSync
        dataDir = "faulty"
    }

    tStarted := timeutil.Now()
    dur := 10 * time.Minute
    if !affectsDataDir && !affectsLogDir {
        dur = 30 * time.Second
    }

    go func() {
        t.WorkerStatus("running server")
        result, err := c.RunWithDetailsSingleNode(ctx, t.L(), n,
            fmt.Sprintf("timeout --signal 9 %ds env "+
                "COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT=%s "+
                "COCKROACH_LOG_MAX_SYNC_DURATION=%s "+
                "COCKROACH_AUTO_BALLAST=false "+
                "./cockroach start-single-node --insecure --store {store-dir}/%s --log '{sinks: {stderr: {filter: INFO}}, file-defaults: {dir: \"{store-dir}/%s\"}}'",
                int(dur.Seconds()), maxDataSync, maxLogSync, dataDir, logDir,
            ),
        )
        if err != nil {
            result.Err = err
        }
        errCh <- result
    }()

    time.Sleep(time.Duration(rand.Intn(5)) * time.Second)

    t.Status("blocking storage")
    c.Run(ctx, n, "charybdefs-nemesis --delay")

    result := <-errCh
    if result.Err == nil {
        t.Fatalf("expected an error: %s", result.Stdout)
    }
    // This test can also run in sanity check mode to make sure it doesn't fail
    // due to the aggressive env vars above.
    expectMsg := affectsDataDir || affectsLogDir
    if expectMsg != strings.Contains(result.Stderr, "disk stall detected") {
        t.Fatalf("unexpected output: %v", result.Err)
    } else if elapsed := timeutil.Since(tStarted); !expectMsg && elapsed < dur {
        t.Fatalf("no disk stall injected, but process terminated too early after %s (expected >= %s)", elapsed, dur)
    }

    c.Run(ctx, n, "charybdefs-nemesis --clear")
    c.Run(ctx, n, "sudo umount {store-dir}/faulty")
}

@erikgrinaker
Contributor

This seems closely related to #81100, so I'll pick this one up while I'm at it.

@erikgrinaker
Contributor

Took a first stab at this over in #81137, where DistSender prioritizes available nodes over unavailable ones (since the liveness leaseholder becomes unavailable when its disk stalls). However, this has a bunch of issues -- not least that when the liveness leaseholder goes down, all nodes are considered unavailable, so requests can still get stuck on the old leaseholder if it happens to sort first (e.g. by latency).
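
In rough, illustrative terms, the sort amounts to something like the following (the types and the healthy signal below are made up; this is not the actual #81137 change):

package distsketch

import (
    "sort"
    "time"
)

type replicaInfo struct {
    nodeID  int
    latency time.Duration
}

// sortByHealthThenLatency sorts replicas so that ones on nodes currently
// believed available come before ones on nodes believed unavailable, with
// ties broken by latency. healthy stands in for whatever liveness or
// dialability signal the DistSender would consult.
func sortByHealthThenLatency(replicas []replicaInfo, healthy func(nodeID int) bool) {
    sort.SliceStable(replicas, func(i, j int) bool {
        hi, hj := healthy(replicas[i].nodeID), healthy(replicas[j].nodeID)
        if hi != hj {
            return hi // available nodes first
        }
        return replicas[i].latency < replicas[j].latency
    })
    // Caveat from the comment above: if the liveness leaseholder itself is
    // down, every node may look unavailable, and the stalled leaseholder
    // can still sort first (e.g. by latency).
}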

I think this issue points towards a more fundamental flaw in the whole model we use around lease acquisition, in that we rely on KV clients to detect failed leaseholders and contact new candidates to trigger lease acquisitions. If the clients fail to do either of these two tasks then leases won't move.

Some other random ideas:

  • Tag the contexts of internal operations like heartbeats, lease acquisition, and range descriptor lookups. If these contexts time out, invalidate the lease cache. However, it's easy to fail to tag relevant operations, and we'll have to avoid hitting the old leaseholder again on the next attempt (see the sketch after this list).

  • Have the server-side RPC stack reject requests when the node fails liveness. However, this requires active cooperation from the server, and won't handle e.g. deadlocks or network outages (black holes).

  • Have the DistSender track error statistics for replicas and prioritize more functional ones. However, in this case we'd have to include context cancellation errors, so a bunch of clients disconnecting could severely disrupt performance.

  • Pick random replicas when the cached lease is considered invalid. However, this could cause poor tail latencies, since in many cases it would be more efficient to contact the previous leaseholder.
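
As a concrete but purely illustrative sketch of the first idea (the context tag, leaseCache interface, and helper are all hypothetical, not existing CockroachDB APIs):

package leasecache

import (
    "context"
    "errors"
)

// ctxKeyInternal marks contexts of internal operations (heartbeats, lease
// acquisition, range descriptor lookups). Hypothetical; no such tag exists today.
type ctxKeyInternal struct{}

func WithInternalOp(ctx context.Context) context.Context {
    return context.WithValue(ctx, ctxKeyInternal{}, true)
}

type leaseCache interface {
    EvictLease(rangeID int64)
}

// maybeEvictLease sketches the first idea above: if a tagged internal
// operation times out, treat that like a stale lease and evict the cache
// entry so the next attempt can try a different replica. The risks called
// out above still apply: it is easy to miss tagging an operation, and the
// next attempt must avoid resolving straight back to the stalled node.
func maybeEvictLease(ctx context.Context, rangeID int64, cache leaseCache, sendErr error) {
    if ctx.Value(ctxKeyInternal{}) == nil {
        return
    }
    timedOut := errors.Is(sendErr, context.DeadlineExceeded) ||
        errors.Is(ctx.Err(), context.DeadlineExceeded)
    if timedOut {
        cache.EvictLease(rangeID)
    }
}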

Thoughts or ideas?

@joshimhoff
Collaborator

joshimhoff commented Jun 10, 2022

Looking for something to play with over breather week. Maybe it's this one? Perhaps we can limit the scope by focusing on system ranges if needed.

In #81100 (comment), @erikgrinaker says:

Additionally, system ranges may not recover at all because the RPC caller may time out before the NLHE is returned, preventing anyone else from acquiring the lease (#80713).

To me, this suggests a slightly different understanding of the outage than what @tbg said in this ticket:

It is likely that the liveness range wasn't technically unavailable, if only nodes had figured out to talk to the new leaseholder.

Which is right? Was there a new available leaseholder, or was there not one, because no kvclient ever poked a functional kvserver into acquiring the lease (since the NLHE was never returned)? Erik's theory makes concrete sense to me FWIW.

Some other random ideas:

I like all these ideas.

I have a weird related idea:

  • Make batch a streaming RPC.
  • Server returns "I'm leaseholder", then actually evaluates the batch and returns any results
  • If no "I'm leaseholder" within a CRDB-eng-controlled timeout (N=2s), then the client invalidates its cache + tries a different replica

Then no need for NLHE.
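
A purely illustrative sketch of the client side of that handshake, with made-up event and stream types (not the actual BatchRequest/RPC API):

package streamsketch

import "time"

type batchEvent struct {
    IsLeaseholderAck bool   // sent immediately if this replica holds the lease
    Result           []byte // the evaluated batch response, sent later
}

type batchStream interface {
    Recv() (batchEvent, error)
}

// waitForLeaseholderAck waits for the early "I'm leaseholder" event. If it
// doesn't arrive within ackTimeout (e.g. 2s), the client gives up on this
// replica, invalidates its lease cache entry, and tries another replica,
// with no NotLeaseHolderError needed from the server.
func waitForLeaseholderAck(stream batchStream, ackTimeout time.Duration) bool {
    type recvResult struct {
        ev  batchEvent
        err error
    }
    ch := make(chan recvResult, 1)
    go func() {
        ev, err := stream.Recv()
        ch <- recvResult{ev, err}
    }()
    select {
    case r := <-ch:
        return r.err == nil && r.ev.IsLeaseholderAck
    case <-time.After(ackTimeout):
        return false // stalled or non-leaseholder replica: move on
    }
}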

At some level, is the problem that we require the kvserver to send a NLHE (as the kvserver may be non-functional), not that we require a kvclient to send a message to kvserver in order to kick off a lease acquisition attempt?

I sort of wonder if the cost of sending back an "I'm a leaseholder" message on every RPC to the kvserver is too high to stomach? But IDK; perf stuff like this is hard for me to judge. I thought I'd share the idea anyway.

@erikgrinaker
Contributor

erikgrinaker commented Jun 10, 2022

Which is right? Was there a new available leaseholder, or was there not one, because no kvclient ever poked a functional kvserver into acquiring the lease (since the NLHE was never returned)? Erik's theory makes concrete sense to me FWIW.

Most likely there was no other leaseholder, since we've found that nodes will keep retrying the previous leaseholder until they get a response. Unless someone else happened to send a request to a different replica, for some reason. Can probably find out by digging through the debug.zip, don't remember the details of the original incident. The no-leaseholder scenario will need a solution in any case, which I suspect will indirectly solve the other-leaseholder scenario.

Some other random ideas:

I like all these ideas.

Meh. I don't think any of them are particularly good, but given the current state this is what I could come up with for short/medium-term fixes. What I'd ideally want here is for replicas to sort out the lease even in the absence of client requests, similarly to how vanilla Raft will transfer leadership even if there are no proposals.
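
A very rough illustration of that (all names hypothetical; this is not how lease acquisition is wired up today): each replica checks on its own tick whether the range's lease has expired and, if it is a viable candidate, requests the lease itself, without waiting for a client request to trigger acquisition.

package leasetick

import "time"

type replicaLease interface {
    LeaseExpired(now time.Time) bool
    EligibleForLease() bool // e.g. live, not draining, a voter
    RequestLease()
}

// maybeAcquireLeaseOnTick is called on each replica tick, independently of
// any client traffic, mirroring how Raft transfers leadership even when
// there are no proposals.
func maybeAcquireLeaseOnTick(r replicaLease, now time.Time) {
    if r.LeaseExpired(now) && r.EligibleForLease() {
        // A real implementation would need to avoid thundering herds
        // (e.g. defer to the Raft leader) and integrate with epoch-based
        // leases and quiescence.
        r.RequestLease()
    }
}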

I have a weird related idea:

  • Make batch a streaming RPC.
  • Server returns "I'm leaseholder", then actually evaluates the batch and returns any results
  • If no "I'm leaseholder" within a CRDB-eng-controlled timeout (N=2s), then the client invalidates its cache + tries a different replica

Then no need for NLHE.

We sort of have this channel already, in the RPC heartbeats. But it doesn't carry lease information currently.

At some level, is the problem that we require the kvserver to send a NLHE (as the kvserver may be non-functional), not that we require a kvclient to send a message to kvserver in order to kick off a lease acquisition attempt?

I think both. Ideally, we'd want replicas to sort out the lease between themselves, and for clients to efficiently discover lease changes. These concerns are currently conflated, but they're really separate concerns.

I sort of wonder if the cost of sending back an "I'm a leaseholder" message on every RPC to the kvserver is too high to stomach? But IDK; perf stuff like this is hard for me to judge. I thought I'd share the idea anyway.

Been a while since I looked at RPC heartbeats, but piggybacking on them might be viable. I think doing it per-batch would possibly be too expensive, and it's unclear whether it'd be better than an aggregate signal. We could e.g. send a sparse bitmap of ranges that we're actively asserting leases for, or something. This would have to integrate with quiescence somehow.
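
A rough illustration of such an aggregate signal (all names hypothetical): each heartbeat carries the set of ranges the node is actively asserting leases for, and the client evicts cached leases that the node no longer claims.

package heartbeatsketch

type heartbeatPayload struct {
    NodeID       int
    LeasedRanges map[int64]struct{} // could be a sparse bitmap in practice
}

type leaseCache interface {
    CachedLeaseholder(rangeID int64) (nodeID int, ok bool)
    EvictLease(rangeID int64)
}

// reconcileLeaseCache drops cached lease entries that point at hb.NodeID for
// ranges the node no longer claims to hold a lease for, so the next request
// is free to try other replicas instead of waiting for an NLHE.
func reconcileLeaseCache(cache leaseCache, hb heartbeatPayload, cachedRangeIDs []int64) {
    for _, rangeID := range cachedRangeIDs {
        if nodeID, ok := cache.CachedLeaseholder(rangeID); ok && nodeID == hb.NodeID {
            if _, stillLeased := hb.LeasedRanges[rangeID]; !stillLeased {
                cache.EvictLease(rangeID)
            }
        }
    }
}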

@joshimhoff
Collaborator

All makes sense! Good point re: aggregate signal maybe being just as good.

@tbg
Member Author

tbg commented Sep 23, 2022

Most likely there was no other leaseholder, since we've found that nodes will keep retrying the previous leaseholder until they get a response. Unless someone else happened to send a request to a different replica, for some reason. Can probably find out by digging through the debug.zip, don't remember the details of the original incident. The no-leaseholder scenario will need a solution in any case, which I suspect will indirectly solve the other-leaseholder scenario.

There may not have been here, but it looks like https://github.com/cockroachlabs/support/issues/1808 is a possible other instance of the same problem. Here, it looks as though only n5 didn't manage to update its cache; other nodes seem to be reaching liveness OK.

But as Erik pointed out, solving one solves the other - the issue is that we can't rely on "proactive" invalidation of a cached leaseholder.

@daniel-crlabs
Contributor

A customer ran into this again recently: https://cockroachdb.zendesk.com/agent/tickets/19526
