Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: drop raftentry.Cache data in applySnapshot #37055

Merged
merged 2 commits into from
Apr 24, 2019

Conversation

ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Apr 23, 2019

This PR adds a new .Drop method to the raftentry.Cache which will clear all data associated with a range more efficiently than calling .Clear with a large index. The second commit then uses this call when applying a snapshot to ensure that stale cached raft entries are never used.

Fixes #37056.

@ajwerner ajwerner requested review from nvanbenschoten and a team April 23, 2019 21:33
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajwerner
Copy link
Contributor Author

This PR could benefit from a storage test

@ajwerner ajwerner force-pushed the ajwerner/raft-entry-cache-drop branch from a6fa12e to e917cc3 Compare April 23, 2019 22:25
@ajwerner
Copy link
Contributor Author

storage unit test coming in a separate PR courtesy of @nvanbenschoten.

@ajwerner ajwerner force-pushed the ajwerner/raft-entry-cache-drop branch 3 times, most recently from 083666b to 7c7d0a9 Compare April 23, 2019 23:17
@ajwerner ajwerner requested review from a team April 23, 2019 23:17
@ajwerner ajwerner requested a review from a team as a code owner April 23, 2019 23:17
@ajwerner ajwerner requested review from a team April 23, 2019 23:17
@ajwerner ajwerner force-pushed the ajwerner/raft-entry-cache-drop branch 2 times, most recently from 5546e70 to 9a6b4fa Compare April 23, 2019 23:21
@ajwerner ajwerner removed request for a team April 23, 2019 23:22
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: mod two nits. Thanks for getting this up so quickly.

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/storage/replica_raftstorage.go, line 861 at r2 (raw file):

	// Clear the cached raft log entries to ensure that old or uncommitted
	// entries don't impact the in-memory state.
	r.store.raftEntryCache.Drop(r.RangeID)

You might consider adding a TODO to the method to use it in more places.


pkg/storage/raftentry/cache_test.go, line 456 at r1 (raw file):

	var wg sync.WaitGroup
	// Test using both Clear and Drop to remove the added entries.
	testutils.RunTrueAndFalse(t, "dropOrClear", func(t *testing.T, useClear bool) {

nit: this isn't a great use of RunTrueAndFalse because it will be printed as:

TestConcurrentUpdates/dropOrClear=false
TestConcurrentUpdates/dropOrClear=true

It's not really clear what either sub-test does. Maybe consider:

for _, t := range struct{
    name string
    clear func()
}{
    {"drop", func() { c.Drop(1) }},
    {"clear", func() { c.Clear(1, 22) }},
}{
    ...

Add logic to clear cached raft log entries when applying a snapshot.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/raft-entry-cache-drop branch from 9a6b4fa to f8de052 Compare April 23, 2019 23:55
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick review and for writing the test!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/replica_raftstorage.go, line 861 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

You might consider adding a TODO to the method to use it in more places.

Anywhere in particular come to mind? A quick glance at the uses of Clear didn't reveal much to me in terms of replacement outside of some tests. One place we could use it is in replica_destroy. Where would the TODO go? Here or on the method? Either one feels like pollution without something specific in mind. If the thought was just about replica gc, I can just type a PR instead of a TODO.


pkg/storage/raftentry/cache_test.go, line 456 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: this isn't a great use of RunTrueAndFalse because it will be printed as:

TestConcurrentUpdates/dropOrClear=false
TestConcurrentUpdates/dropOrClear=true

It's not really clear what either sub-test does. Maybe consider:

for _, t := range struct{
    name string
    clear func()
}{
    {"drop", func() { c.Drop(1) }},
    {"clear", func() { c.Clear(1, 22) }},
}{
    ...

Done

@ajwerner
Copy link
Contributor Author

@nvanbenschoten
Is this where you had in mind for the additional Drop call?

err := clearRangeData(ctx, desc, reader, batch, destroyData)
if err != nil {
return err
}

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this where you had in mind for the additional Drop call?

I was thinking one level above that, in Replica.destroyRaftMuLocked. After the batch is actually committed.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to make that change now though. We'll want this to be small enough to backport.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)

@ajwerner
Copy link
Contributor Author

bors r=nvanbenschoten

craig bot pushed a commit that referenced this pull request Apr 24, 2019
37055: storage: drop raftentry.Cache data in applySnapshot r=nvanbenschoten a=ajwerner

This PR adds a new `.Drop` method to the `raftentry.Cache` which will clear all data associated with a range more efficiently than calling `.Clear` with a large index. The second commit then uses this call when applying a snapshot to ensure that stale cached raft entries are never used.

Fixes #37056.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@craig
Copy link
Contributor

craig bot commented Apr 24, 2019

Build succeeded

@craig craig bot merged commit f8de052 into cockroachdb:master Apr 24, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 24, 2019
This PR adds a regression test for cockroachdb#37056. In doing so, it also confirms
the theory that cockroachdb#37055 is the proper fix for that bug.

Before cockroachdb#37055, the test would get stuck with the following logs repeatedly
firing:
```
I190424 00:15:52.338808 12 storage/client_test.go:1242  SucceedsSoon: expected [21 21 21], got [12 21 21]
I190424 00:15:52.378060 78 vendor/go.etcd.io/etcd/raft/raft.go:1317  [s1,r1/1:/M{in-ax}] 1 [logterm: 6, index: 31] rejected msgApp [logterm: 8, index: 31] from 2
I190424 00:15:52.378248 184 vendor/go.etcd.io/etcd/raft/raft.go:1065  [s2,r1/2:/M{in-ax}] 2 received msgApp rejection(lastindex: 31) from 1 for index 31
I190424 00:15:52.378275 184 vendor/go.etcd.io/etcd/raft/raft.go:1068  [s2,r1/2:/M{in-ax}] 2 decreased progress of 1 to [next = 31, match = 31, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
```

After cockroachdb#37055, the test passes.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 24, 2019
This PR adds a regression test for cockroachdb#37056. In doing so, it also confirms
the theory that cockroachdb#37055 is the proper fix for that bug.

Before cockroachdb#37055, the test would get stuck with the following logs repeatedly
firing:
```
I190424 00:15:52.338808 12 storage/client_test.go:1242  SucceedsSoon: expected [21 21 21], got [12 21 21]
I190424 00:15:52.378060 78 vendor/go.etcd.io/etcd/raft/raft.go:1317  [s1,r1/1:/M{in-ax}] 1 [logterm: 6, index: 31] rejected msgApp [logterm: 8, index: 31] from 2
I190424 00:15:52.378248 184 vendor/go.etcd.io/etcd/raft/raft.go:1065  [s2,r1/2:/M{in-ax}] 2 received msgApp rejection(lastindex: 31) from 1 for index 31
I190424 00:15:52.378275 184 vendor/go.etcd.io/etcd/raft/raft.go:1068  [s2,r1/2:/M{in-ax}] 2 decreased progress of 1 to [next = 31, match = 31, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
```

After cockroachdb#37055, the test passes.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 24, 2019
This PR adds a regression test for cockroachdb#37056. In doing so, it also confirms
the theory that cockroachdb#37055 is the proper fix for that bug.

Before cockroachdb#37055, the test would get stuck with the following logs repeatedly
firing:
```
I190424 00:15:52.338808 12 storage/client_test.go:1242  SucceedsSoon: expected [21 21 21], got [12 21 21]
I190424 00:15:52.378060 78 vendor/go.etcd.io/etcd/raft/raft.go:1317  [s1,r1/1:/M{in-ax}] 1 [logterm: 6, index: 31] rejected msgApp [logterm: 8, index: 31] from 2
I190424 00:15:52.378248 184 vendor/go.etcd.io/etcd/raft/raft.go:1065  [s2,r1/2:/M{in-ax}] 2 received msgApp rejection(lastindex: 31) from 1 for index 31
I190424 00:15:52.378275 184 vendor/go.etcd.io/etcd/raft/raft.go:1068  [s2,r1/2:/M{in-ax}] 2 decreased progress of 1 to [next = 31, match = 31, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
```

After cockroachdb#37055, the test passes.

Release note: None
craig bot pushed a commit that referenced this pull request Apr 24, 2019
37058: storage: create new TestSnapshotAfterTruncationWithUncommittedTail test r=nvanbenschoten a=nvanbenschoten

This PR adds a regression test for #37056. In doing so, it also confirms
the theory that #37055 is the proper fix for that bug.

Before #37055, the test would get stuck with the following logs repeatedly
firing:
```
I190424 00:15:52.338808 12 storage/client_test.go:1242  SucceedsSoon: expected [21 21 21], got [12 21 21]
I190424 00:15:52.378060 78 vendor/go.etcd.io/etcd/raft/raft.go:1317  [s1,r1/1:/M{in-ax}] 1 [logterm: 6, index: 31] rejected msgApp [logterm: 8, index: 31] from 2
I190424 00:15:52.378248 184 vendor/go.etcd.io/etcd/raft/raft.go:1065  [s2,r1/2:/M{in-ax}] 2 received msgApp rejection(lastindex: 31) from 1 for index 31
I190424 00:15:52.378275 184 vendor/go.etcd.io/etcd/raft/raft.go:1068  [s2,r1/2:/M{in-ax}] 2 decreased progress of 1 to [next = 31, match = 31, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
```

After #37055, the test passes.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 24, 2019
This PR adds a regression test for cockroachdb#37056. In doing so, it also confirms
the theory that cockroachdb#37055 is the proper fix for that bug.

Before cockroachdb#37055, the test would get stuck with the following logs repeatedly
firing:
```
I190424 00:15:52.338808 12 storage/client_test.go:1242  SucceedsSoon: expected [21 21 21], got [12 21 21]
I190424 00:15:52.378060 78 vendor/go.etcd.io/etcd/raft/raft.go:1317  [s1,r1/1:/M{in-ax}] 1 [logterm: 6, index: 31] rejected msgApp [logterm: 8, index: 31] from 2
I190424 00:15:52.378248 184 vendor/go.etcd.io/etcd/raft/raft.go:1065  [s2,r1/2:/M{in-ax}] 2 received msgApp rejection(lastindex: 31) from 1 for index 31
I190424 00:15:52.378275 184 vendor/go.etcd.io/etcd/raft/raft.go:1068  [s2,r1/2:/M{in-ax}] 2 decreased progress of 1 to [next = 31, match = 31, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
```

After cockroachdb#37055, the test passes.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

storage: range failed to catch up due to invalid term
3 participants