Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: fix asserts around log truncations #43314

Merged
merged 1 commit into from Dec 19, 2019

Conversation

@irfansharif
Copy link
Member

irfansharif commented Dec 18, 2019

Stop relying on getQuorumIndex, as it has subtle behaviour around the addition/removal of replicas to the raft group. Assert instead that we never truncate the log past the committed index.

This change was prompted by a faulty assertion using QuorumIndex, where we previously asserted that NewFirstIndex <= QuorumIndex. But this was incorrect, to see why consider a Raft group with first-last indexes [(100-112), (116-124), (116-124)]. When a new replica is added, it starts off with a match index of 0, i.e. [(0,0), (100-112), (116-124), (116-124)]. Naively, we'd expect the quorum index to be 116, but it's 100. The FirstIndex at the leaseholder will be 116, and thus we have QuorumIndex < FirstIndex <= NewFirstIndex. How should we think about the invariants in this situation? We don't want to truncate past the "quorum index", but new replicas muddle our computation of this "quorum index".

What about raftstatus.Commit? It's more appropriate here: raftstatus.Commit tells us the index that a quorum of replicas have safely committed, thus making it safe to truncate. We now assert that we're not truncating past this index.

Release note (bug fix): Faulty assertion could cause panics when a log truncation took place concurrently with a replica being added to a raft group.

@cockroach-teamcity

This comment has been minimized.

Copy link
Member

cockroach-teamcity commented Dec 18, 2019

This change is Reviewable

@irfansharif irfansharif requested a review from nvanbenschoten Dec 18, 2019
@irfansharif

This comment has been minimized.

Copy link
Member Author

irfansharif commented Dec 18, 2019

I'll be the first to admit this area makes me uneasy. I'm also ok removing the faulty assertion altogether.

Copy link
Member

nvanbenschoten left a comment

I'll be the first to admit this area makes me uneasy. I'm also ok removing the faulty assertion altogether.

Agreed. At this point, I'd feel more comfortable just removing it altogether. We'll need to backport this to other release branches as well, right?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

Stop relying on `getQuorumIndex` for assertions, as it has subtle
behaviour around the addition/removal of replicas to the raft group.

This change was prompted by a faulty assertion using `QuorumIndex`,
where we previously asserted that `NewFirstIndex <= QuorumIndex`. But
this was incorrect, to see why consider a Raft group with first-last
indexes `[(100-112), (116-124), (116-124)]`. When a new replica is
added, it starts off with a match index of 0, i.e. `[(0,0), (100-112),
(116-124), (116-124)]`.  Naively, we'd expect the quorum index to be
116, but it's 100. The `FirstIndex` at the leaseholder will be 116, and
thus we have `QuorumIndex < FirstIndex <= NewFirstIndex`.
We don't want to truncate past the "quorum index", but new replicas
muddle our computation of this "quorum index".

Release note (bug fix): Faulty assertion could cause panics when a log
truncation took place concurrently with a replica being added to a raft
group.
@irfansharif irfansharif force-pushed the 191218-2.1.trunc-assert branch from 99b310f to fe48fcb Dec 19, 2019
Copy link
Member Author

irfansharif left a comment

At this point, I'd feel more comfortable just removing it altogether.

Done.

We'll need to backport this to other release branches as well, right?

No, this assertion was added to catch a malformed snapshot error that was partly caused due to raft log entries being sent as part of the snapshot. That's not the case for subsequent releases, so was not backported. The assertions on master weren't backported either.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Copy link
Member

nvanbenschoten left a comment

:lgtm:

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@irfansharif irfansharif merged commit b02795e into release-2.1 Dec 19, 2019
2 checks passed
2 checks passed
GitHub CI (Cockroach) TeamCity build finished
Details
license/cla Contributor License Agreement is signed.
Details
@irfansharif irfansharif deleted the 191218-2.1.trunc-assert branch Dec 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.