storage: range failed to catch up due to invalid term #37056

Closed
ajwerner opened this issue Apr 23, 2019 · 1 comment · Fixed by #37055

Assignees
ajwerner

Labels
A-kv-replication: Relating to Raft, consensus, and coordination.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@ajwerner
Contributor

Describe the problem

@mberhault observed a CREATE DATABASE command hang. Upon further investigation, we discovered that range 3 was hung in redirectOnOrAcquireLease.

From logspy, we observed that the replica was rejecting the raft entries it was sent due to a term mismatch.

I190423 18:17:41.330605 176 storage/raft.go:142  [n3,s3,r3/2:/System/{NodeLive…-tsd}] raft ready (must-sync=false)
  Outgoing Message[0]: 2->1 MsgAppResp Term:9 Log:0/396611 Rejected (Hint: 396611)
I190423 18:17:41.330576 176 vendor/go.etcd.io/etcd/raft/raft.go:1317  [n3,s3,r3/2:/System/{NodeLive…-tsd}] 2 [logterm: 7, index: 396611] rejected msgApp [logterm: 8, index: 396611] from 1
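
The second log line is etcd/raft's log-matching check rejecting the append. As a rough illustration (a toy sketch, not the actual etcd/raft code; the map of terms below is made up to mirror the values in the logs), the follower only accepts an append if it agrees on the term of the entry at the probed index:

```go
package main

import "fmt"

// matchesLog mimics raft's log-matching check on the follower side: a MsgApp
// claiming that the entry at prevIndex was written in prevTerm is accepted
// only if the follower's own log records the same term at that index.
// This is a toy illustration, not the real etcd/raft implementation.
func matchesLog(followerTerms map[uint64]uint64, prevIndex, prevTerm uint64) bool {
	return followerTerms[prevIndex] == prevTerm
}

func main() {
	// The stuck replica thinks index 396611 was written in term 7, while the
	// leader's log has term 8 there, so every MsgApp is rejected with a hint,
	// as in the log lines above.
	followerTerms := map[uint64]uint64{396611: 7}
	fmt.Println("accepts append:", matchesLog(followerTerms, 396611, 8)) // false
}
```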

The replica in question believed that its log entry at index 396611 was from term 7, while the leader expected term 8. We also observed that the range had no log entries: the snapshot it had applied contained none.

us-west1-node-0000> I190420 23:13:50.610694 7971602 storage/replica_raftstorage.go:814  [n3,s3,r3/2:/System/{NodeLive…-tsd}] applying Raft snapshot at index 396611 (id=dc7d62ba, encoded size=76009, 1 rocksdb batches, 0 log entries)
us-west1-node-0000> I190420 23:13:50.616660 7971602 storage/replica_raftstorage.go:820  [n3,s3,r3/2:/System/{NodeLive…-tsd}] applied Raft snapshot in 6ms [clear=1ms batch=0ms entries=0ms commit=4ms]

Interestingly, we were able to confirm from other replicas that the hung replica should have had a term of at least 8 for the log position in question. The replica must have determined its term here:

if r.mu.lastIndex == i && r.mu.lastTerm != invalidLastTerm {
	return r.mu.lastTerm, nil
}
// Try to retrieve the term for the desired entry from the entry cache.
if e, ok := r.store.raftEntryCache.Get(r.RangeID, i); ok {
	return e.Term, nil
}

We also noticed that Raft snapshots which contain no log entries leave the last term set to the invalid term:

var lastTerm uint64

This implies that the incorrect term was derived from an entry in the raftentry.Cache. This observation led us to realize that there was no code to clear the cache when receiving a snapshot.
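
To make the failure mode concrete, here is a minimal, self-contained sketch (a toy cache type, not the real raftentry.Cache API; the index and term values mirror the logs above) of how an entry cache that is never cleared on snapshot application can keep serving a stale term:

```go
package main

import "fmt"

// entry is a toy stand-in for a cached raft log entry: an (index, term) pair.
type entry struct {
	index, term uint64
}

// entryCache is a toy per-range raft entry cache keyed by log index.
type entryCache map[uint64]entry

func main() {
	cache := entryCache{}

	// The replica appends an entry at index 396611 in term 7 and caches it.
	cache[396611] = entry{index: 396611, term: 7}

	// Later it applies a Raft snapshot at index 396611 whose entry was written
	// in term 8. The snapshot contains no log entries, so lastTerm is left as
	// the invalid term -- and, crucially, the cache is never cleared.

	// A subsequent term lookup falls through to the cache and returns the
	// stale term 7, so the replica keeps rejecting the leader's MsgApps at
	// term 8 and can never catch up.
	if e, ok := cache[396611]; ok {
		fmt.Printf("cached term at index %d: %d (other replicas say it should be >= 8)\n",
			e.index, e.term)
	}
}
```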

ajwerner self-assigned this Apr 23, 2019
craig bot pushed a commit that referenced this issue Apr 24, 2019
37055: storage: drop raftentry.Cache data in applySnapshot r=nvanbenschoten a=ajwerner

This PR adds a new `.Drop` method to the `raftentry.Cache` which will clear all data associated with a range more efficiently than calling `.Clear` with a large index. The second commit then uses this call when applying a snapshot to ensure that stale cached raft entries are never used.

Fixes #37056.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
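
The shape of the fix, in toy form (this is not the actual raftentry.Cache implementation or the exact signature of the new Drop method; it only illustrates evicting a range's cached entries while applying a snapshot):

```go
package main

import "fmt"

// cacheKey identifies a cached entry by (rangeID, index).
type cacheKey struct {
	rangeID, index uint64
}

// storeEntryCache is a toy, store-wide cache mapping (rangeID, index) to the
// entry's term, loosely modeled on the idea behind raftentry.Cache.
type storeEntryCache map[cacheKey]uint64

// drop evicts every cached entry for the given range, mirroring the intent of
// the Drop method added in #37055.
func (c storeEntryCache) drop(rangeID uint64) {
	for k := range c {
		if k.rangeID == rangeID {
			delete(c, k)
		}
	}
}

func main() {
	cache := storeEntryCache{
		{rangeID: 3, index: 396611}: 7, // stale entry cached before the snapshot
	}

	// While applying a snapshot for range 3, drop its cached entries so that
	// later term lookups cannot observe state the snapshot has overwritten.
	cache.drop(3)

	_, ok := cache[cacheKey{rangeID: 3, index: 396611}]
	fmt.Println("stale entry still cached after snapshot:", ok) // false
}
```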
craig bot closed this as completed in #37055 Apr 24, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 24, 2019
This PR adds a regression test for cockroachdb#37056. In doing so, it also confirms
the theory that cockroachdb#37055 is the proper fix for that bug.

Before cockroachdb#37055, the test would get stuck with the following logs repeatedly
firing:
```
I190424 00:15:52.338808 12 storage/client_test.go:1242  SucceedsSoon: expected [21 21 21], got [12 21 21]
I190424 00:15:52.378060 78 vendor/go.etcd.io/etcd/raft/raft.go:1317  [s1,r1/1:/M{in-ax}] 1 [logterm: 6, index: 31] rejected msgApp [logterm: 8, index: 31] from 2
I190424 00:15:52.378248 184 vendor/go.etcd.io/etcd/raft/raft.go:1065  [s2,r1/2:/M{in-ax}] 2 received msgApp rejection(lastindex: 31) from 1 for index 31
I190424 00:15:52.378275 184 vendor/go.etcd.io/etcd/raft/raft.go:1068  [s2,r1/2:/M{in-ax}] 2 decreased progress of 1 to [next = 31, match = 31, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
```

After cockroachdb#37055, the test passes.

Release note: None
nvanbenschoten added the A-kv-replication and C-bug labels Apr 24, 2019
@bdarnell
Contributor

Some of the flakes in the overloaded scrub test (#35985) didn't make any progress at all (leading to OOM). I guessed that this might be some kind of deadlock, but maybe it's the same issue. I wonder if the overloaded tpcc scenario is still flaky with this fixed.

craig bot pushed a commit that referenced this issue Apr 24, 2019
37058: storage: create new TestSnapshotAfterTruncationWithUncommittedTail test r=nvanbenschoten a=nvanbenschoten

This PR adds a regression test for #37056. In doing so, it also confirms
the theory that #37055 is the proper fix for that bug.

Before #37055, the test would get stuck with the following logs repeatedly
firing:
```
I190424 00:15:52.338808 12 storage/client_test.go:1242  SucceedsSoon: expected [21 21 21], got [12 21 21]
I190424 00:15:52.378060 78 vendor/go.etcd.io/etcd/raft/raft.go:1317  [s1,r1/1:/M{in-ax}] 1 [logterm: 6, index: 31] rejected msgApp [logterm: 8, index: 31] from 2
I190424 00:15:52.378248 184 vendor/go.etcd.io/etcd/raft/raft.go:1065  [s2,r1/2:/M{in-ax}] 2 received msgApp rejection(lastindex: 31) from 1 for index 31
I190424 00:15:52.378275 184 vendor/go.etcd.io/etcd/raft/raft.go:1068  [s2,r1/2:/M{in-ax}] 2 decreased progress of 1 to [next = 31, match = 31, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
```

After #37055, the test passes.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
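
For reference, the SucceedsSoon lines in the quoted logs come from a retry-until-condition helper in the test harness. A simplified stand-alone sketch of that polling pattern (not CockroachDB's actual testutils implementation) looks roughly like this:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// succeedsSoon retries fn with a short pause until it returns nil or the
// timeout expires, roughly mirroring the SucceedsSoon pattern referenced in
// the log lines above. This is a simplified stand-in, not the real helper.
func succeedsSoon(timeout time.Duration, fn func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := fn()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition never met: %w", err)
		}
		time.Sleep(50 * time.Millisecond)
	}
}

func main() {
	// Example: wait until the lagging replica catches up, the same style of
	// check as "expected [21 21 21], got [12 21 21]" in the test logs.
	attempts := 0
	err := succeedsSoon(5*time.Second, func() error {
		attempts++
		if attempts < 3 {
			return errors.New("expected [21 21 21], got [12 21 21]")
		}
		return nil
	})
	fmt.Println("result:", err) // result: <nil>
}
```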