storage: range failed to catch up due to invalid term #37056

Closed
ajwerner opened this issue Apr 23, 2019 · 1 comment · Fixed by #37055

Assignees
ajwerner

Labels
A-kv-replication: Relating to Raft, consensus, and coordination.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@ajwerner
Contributor

Describe the problem

@mberhault observed a CREATE DATABASE command hang. Upon further investigation, we discovered that range 3 was hung in redirectOnOrAcquireLease.

From logspy, we observed that the replica was rejecting the raft entries it was sent due to a term mismatch.

I190423 18:17:41.330605 176 storage/raft.go:142  [n3,s3,r3/2:/System/{NodeLive…-tsd}] raft ready (must-sync=false)
  Outgoing Message[0]: 2->1 MsgAppResp Term:9 Log:0/396611 Rejected (Hint: 396611)
I190423 18:17:41.330576 176 vendor/go.etcd.io/etcd/raft/raft.go:1317  [n3,s3,r3/2:/System/{NodeLive…-tsd}] 2 [logterm: 7, index: 396611] rejected msgApp [logterm: 8, index: 396611] from 1
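
The second log line is etcd/raft's log-matching check rejecting the append. As a rough illustration (a toy sketch, not the actual etcd/raft code; the map of terms below is made up to mirror the values in the logs), the follower only accepts an append if it agrees on the term of the entry at the probed index:

```go
package main

import "fmt"

// matchesLog mimics raft's log-matching check on the follower side: a MsgApp
// claiming that the entry at prevIndex was written in prevTerm is accepted
// only if the follower's own log records the same term at that index.
// This is a toy illustration, not the real etcd/raft implementation.
func matchesLog(followerTerms map[uint64]uint64, prevIndex, prevTerm uint64) bool {
	return followerTerms[prevIndex] == prevTerm
}

func main() {
	// The stuck replica thinks index 396611 was written in term 7, while the
	// leader's log has term 8 there, so every MsgApp is rejected with a hint,
	// as in the log lines above.
	followerTerms := map[uint64]uint64{396611: 7}
	fmt.Println("accepts append:", matchesLog(followerTerms, 396611, 8)) // false
}
```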

The replica in question believed that its log entry at index 396611 was from term 7, while the leader expected term 8. We also observed that the range had no log entries: the snapshot it had applied contained none.

us-west1-node-0000> I190420 23:13:50.610694 7971602 storage/replica_raftstorage.go:814  [n3,s3,r3/2:/System/{NodeLive…-tsd}] applying Raft snapshot at index 396611 (id=dc7d62ba, encoded size=76009, 1 rocksdb batches, 0 log entries)
us-west1-node-0000> I190420 23:13:50.616660 7971602 storage/replica_raftstorage.go:820  [n3,s3,r3/2:/System/{NodeLive…-tsd}] applied Raft snapshot in 6ms [clear=1ms batch=0ms entries=0ms commit=4ms]

Interestingly, we were able to confirm from other replicas that the hung replica should have had a term of at least 8 for the log position in question. The replica must have determined its term here:

if r.mu.lastIndex == i && r.mu.lastTerm != invalidLastTerm {
	return r.mu.lastTerm, nil
}
// Try to retrieve the term for the desired entry from the entry cache.
if e, ok := r.store.raftEntryCache.Get(r.RangeID, i); ok {
	return e.Term, nil
}

We also noticed that Raft snapshots which contain no log entries leave the last term set to the invalid term:

var lastTerm uint64

This implies that the incorrect term was derived from an entry in the raftentry.Cache. This observation led us to realize that there was no code to clear the cache when receiving a snapshot.
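
To make the failure mode concrete, here is a minimal, self-contained sketch (a toy cache type, not the real raftentry.Cache API; the index and term values mirror the logs above) of how an entry cache that is never cleared on snapshot application can keep serving a stale term:

```go
package main

import "fmt"

// entry is a toy stand-in for a cached raft log entry: an (index, term) pair.
type entry struct {
	index, term uint64
}

// entryCache is a toy per-range raft entry cache keyed by log index.
type entryCache map[uint64]entry

func main() {
	cache := entryCache{}

	// The replica appends an entry at index 396611 in term 7 and caches it.
	cache[396611] = entry{index: 396611, term: 7}

	// Later it applies a Raft snapshot at index 396611 whose entry was written
	// in term 8. The snapshot contains no log entries, so lastTerm is left as
	// the invalid term -- and, crucially, the cache is never cleared.

	// A subsequent term lookup falls through to the cache and returns the
	// stale term 7, so the replica keeps rejecting the leader's MsgApps at
	// term 8 and can never catch up.
	if e, ok := cache[396611]; ok {
		fmt.Printf("cached term at index %d: %d (other replicas say it should be >= 8)\n",
			e.index, e.term)
	}
}
```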

ajwerner self-assigned this Apr 23, 2019
craig bot pushed a commit that referenced this issue Apr 24, 2019
37055: storage: drop raftentry.Cache data in applySnapshot r=nvanbenschoten a=ajwerner

This PR adds a new `.Drop` method to the `raftentry.Cache` which will clear all data associated with a range more efficiently than calling `.Clear` with a large index. The second commit then uses this call when applying a snapshot to ensure that stale cached raft entries are never used.

Fixes #37056.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
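
The shape of the fix, in toy form (this is not the actual raftentry.Cache implementation or the exact signature of the new Drop method; it only illustrates evicting a range's cached entries while applying a snapshot):

```go
package main

import "fmt"

// cacheKey identifies a cached entry by (rangeID, index).
type cacheKey struct {
	rangeID, index uint64
}

// storeEntryCache is a toy, store-wide cache mapping (rangeID, index) to the
// entry's term, loosely modeled on the idea behind raftentry.Cache.
type storeEntryCache map[cacheKey]uint64

// drop evicts every cached entry for the given range, mirroring the intent of
// the Drop method added in #37055.
func (c storeEntryCache) drop(rangeID uint64) {
	for k := range c {
		if k.rangeID == rangeID {
			delete(c, k)
		}
	}
}

func main() {
	cache := storeEntryCache{
		{rangeID: 3, index: 396611}: 7, // stale entry cached before the snapshot
	}

	// While applying a snapshot for range 3, drop its cached entries so that
	// later term lookups cannot observe state the snapshot has overwritten.
	cache.drop(3)

	_, ok := cache[cacheKey{rangeID: 3, index: 396611}]
	fmt.Println("stale entry still cached after snapshot:", ok) // false
}
```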
craig bot closed this as completed in #37055 Apr 24, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 24, 2019
This PR adds a regression test for cockroachdb#37056. In doing so, it also confirms
the theory that cockroachdb#37055 is the proper fix for that bug.

Before cockroachdb#37055, the test would get stuck with the following logs repeatedly
firing:
```
I190424 00:15:52.338808 12 storage/client_test.go:1242  SucceedsSoon: expected [21 21 21], got [12 21 21]
I190424 00:15:52.378060 78 vendor/go.etcd.io/etcd/raft/raft.go:1317  [s1,r1/1:/M{in-ax}] 1 [logterm: 6, index: 31] rejected msgApp [logterm: 8, index: 31] from 2
I190424 00:15:52.378248 184 vendor/go.etcd.io/etcd/raft/raft.go:1065  [s2,r1/2:/M{in-ax}] 2 received msgApp rejection(lastindex: 31) from 1 for index 31
I190424 00:15:52.378275 184 vendor/go.etcd.io/etcd/raft/raft.go:1068  [s2,r1/2:/M{in-ax}] 2 decreased progress of 1 to [next = 31, match = 31, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
```

After cockroachdb#37055, the test passes.

Release note: None
nvanbenschoten added the A-kv-replication and C-bug labels Apr 24, 2019
@bdarnell
Contributor

Some of the flakes in the overloaded scrub test (#35985) didn't make any progress at all (leading to OOM). I guessed that this might be some kind of deadlock, but maybe it's the same issue. I wonder if the overloaded tpcc scenario is still flaky with this fixed.

craig bot pushed a commit that referenced this issue Apr 24, 2019
37058: storage: create new TestSnapshotAfterTruncationWithUncommittedTail test r=nvanbenschoten a=nvanbenschoten

This PR adds a regression test for #37056. In doing so, it also confirms
the theory that #37055 is the proper fix for that bug.

Before #37055, the test would get stuck with the following logs repeatedly
firing:
```
I190424 00:15:52.338808 12 storage/client_test.go:1242  SucceedsSoon: expected [21 21 21], got [12 21 21]
I190424 00:15:52.378060 78 vendor/go.etcd.io/etcd/raft/raft.go:1317  [s1,r1/1:/M{in-ax}] 1 [logterm: 6, index: 31] rejected msgApp [logterm: 8, index: 31] from 2
I190424 00:15:52.378248 184 vendor/go.etcd.io/etcd/raft/raft.go:1065  [s2,r1/2:/M{in-ax}] 2 received msgApp rejection(lastindex: 31) from 1 for index 31
I190424 00:15:52.378275 184 vendor/go.etcd.io/etcd/raft/raft.go:1068  [s2,r1/2:/M{in-ax}] 2 decreased progress of 1 to [next = 31, match = 31, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
```

After #37055, the test passes.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
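
For reference, the SucceedsSoon lines in the quoted logs come from a retry-until-condition helper in the test harness. A simplified stand-alone sketch of that polling pattern (not CockroachDB's actual testutils implementation) looks roughly like this:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// succeedsSoon retries fn with a short pause until it returns nil or the
// timeout expires, roughly mirroring the SucceedsSoon pattern referenced in
// the log lines above. This is a simplified stand-in, not the real helper.
func succeedsSoon(timeout time.Duration, fn func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := fn()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition never met: %w", err)
		}
		time.Sleep(50 * time.Millisecond)
	}
}

func main() {
	// Example: wait until the lagging replica catches up, the same style of
	// check as "expected [21 21 21], got [12 21 21]" in the test logs.
	attempts := 0
	err := succeedsSoon(5*time.Second, func() error {
		attempts++
		if attempts < 3 {
			return errors.New("expected [21 21 21], got [12 21 21]")
		}
		return nil
	})
	fmt.Println("result:", err) // result: <nil>
}
```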