
kvserver: synchronize replica removal with read-write requests #64471

Merged (1 commit) on May 3, 2021

Conversation


@erikgrinaker erikgrinaker commented Apr 30, 2021

Replica removal did not synchronize with in-flight read-write requests,
which could cause them to be evaluated on a removed (empty) replica. The
request would not be able to persist any writes, since it's unable to
submit Raft proposals. However, it can affect conditional writes, for
example causing a ConditionalPutRequest to error because it finds a
missing value instead of the expected one.

This patch fixes the problem by taking out Replica.readOnlyCmdMu
during pre-Raft evaluation, to synchronize with replica removal. This
can cause such requests to return AmbiguousResultError as the write is
evaluated.

Resolves #46329, follow-up from #64324.

Release note (bug fix): Fixed a race condition where read-write requests
during replica removal (e.g. during range merges or rebalancing) could
be evaluated on the removed replica. These will not have been able to
write any data to persistent storage, but could behave unexpectedly,
e.g. returning errors that they should not have returned.
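
For illustration, here is a minimal, self-contained sketch of the synchronization this patch introduces (the types and names below are illustrative stand-ins, not CockroachDB's actual code): requests hold a mutex in read mode for the duration of evaluation and check the replica's destroy status under it, while removal takes the same mutex in write mode, so an in-flight request either completes before removal or fails fast instead of evaluating against an empty replica.

package main

import (
	"errors"
	"fmt"
	"sync"
)

// replica is a toy stand-in for kvserver.Replica.
type replica struct {
	readOnlyCmdMu sync.RWMutex
	destroyed     bool
	data          map[string]string
}

var errRemoved = errors.New("replica removed")

// evaluate holds readOnlyCmdMu in read mode across evaluation and checks the
// destroy status under the lock (loosely mirroring checkExecutionCanProceed).
func (r *replica) evaluate(key string) (string, error) {
	r.readOnlyCmdMu.RLock()
	defer r.readOnlyCmdMu.RUnlock()
	if r.destroyed {
		return "", errRemoved
	}
	return r.data[key], nil
}

// remove takes readOnlyCmdMu in write mode, excluding in-flight evaluations
// before the replica's state is torn down.
func (r *replica) remove() {
	r.readOnlyCmdMu.Lock()
	defer r.readOnlyCmdMu.Unlock()
	r.destroyed = true
	r.data = nil
}

func main() {
	r := &replica{data: map[string]string{"k1": "v1"}}
	fmt.Println(r.evaluate("k1")) // v1 <nil>
	r.remove()
	fmt.Println(r.evaluate("k1")) // "" replica removed
}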

@erikgrinaker erikgrinaker self-assigned this Apr 30, 2021
@cockroach-teamcity (Member)

This change is Reviewable

@tbg mentioned this pull request Apr 30, 2021
@erikgrinaker (Contributor, Author)

Seems like the AmbiguousResultError might be the intended result in this case, as described here:

// disconnectReplicationRaftMuLocked is called when a Replica is being removed.
// It cancels all outstanding proposals, closes the proposalQuota if there
// is one, and removes the in-memory raft state.
func (r *Replica) disconnectReplicationRaftMuLocked(ctx context.Context) {
	r.raftMu.AssertHeld()
	r.mu.Lock()
	defer r.mu.Unlock()
	// NB: In the very rare scenario that we're being removed but currently
	// believe we are the leaseholder and there are more requests waiting for
	// quota than total quota then failure to close the proposal quota here could
	// leave those requests stuck forever.
	if pq := r.mu.proposalQuota; pq != nil {
		pq.Close("destroyed")
	}
	r.mu.proposalBuf.FlushLockedWithoutProposing(ctx)
	for _, p := range r.mu.proposals {
		r.cleanupFailedProposalLocked(p)
		// NB: each proposal needs its own version of the error (i.e. don't try to
		// share the error across proposals).
		p.finishApplication(ctx, proposalResult{
			Err: roachpb.NewError(
				roachpb.NewAmbiguousResultError(
					apply.ErrRemoved.Error())),
		})
	}
	r.mu.internalRaftGroup = nil
}

So holding readOnlyCmdMu during Replica.evalAndPropose may not be terrible. The only problem I've found thus far is Replica.maybeSetCorrupt, which is called here:

if pErr != nil {
	pErr = r.maybeSetCorrupt(ctx, pErr)

This takes out raftMu, which violates the locking order since we're already holding readOnlyCmdMu. maybeSetCorrupt is going to shut down the node with a fatal error, so I'm thinking we might just return the error up to executeWriteBatch and call maybeSetCorrupt after we release readOnlyCmdMu.
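
A rough, self-contained sketch of that restructuring (the names are illustrative stand-ins, not the actual kvserver code): the raftMu-acquiring helper runs only after the read lock has been released, preserving the raftMu-before-readOnlyCmdMu ordering.

package main

import (
	"errors"
	"fmt"
	"sync"
)

type replica struct {
	raftMu        sync.Mutex // by convention acquired before readOnlyCmdMu
	readOnlyCmdMu sync.RWMutex
}

// maybeSetCorrupt acquires raftMu, so it must never run while the caller
// still holds readOnlyCmdMu.
func (r *replica) maybeSetCorrupt(err error) error {
	r.raftMu.Lock()
	defer r.raftMu.Unlock()
	return fmt.Errorf("replica corrupt: %w", err)
}

// executeWriteBatch evaluates under readOnlyCmdMu, then releases it before
// escalating the error, so the lock order is never violated.
func (r *replica) executeWriteBatch() error {
	r.readOnlyCmdMu.RLock()
	err := errors.New("evaluation failed") // stand-in for evalAndPropose
	r.readOnlyCmdMu.RUnlock()
	if err != nil {
		return r.maybeSetCorrupt(err) // safe: readOnlyCmdMu already released
	}
	return nil
}

func main() {
	fmt.Println((&replica{}).executeWriteBatch())
}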

Haven't found any other obvious deadlocks yet, and CI passes, so I'll start going down this path. Feel free to stop me if you know any reason why I shouldn't @tbg @nvanbenschoten.

@erikgrinaker erikgrinaker force-pushed the remove-replica-writes branch 3 times, most recently from 1464657 to e82f4ae on May 1, 2021 15:03
@erikgrinaker (Contributor, Author)

This should be ready for review now. I ran the kvserver tests with deadlock detection via -tags deadlock, which didn't show any relevant failures, and I've audited the evaluation code without finding any lock conflicts. The Raft proposal and execution code paths appear to have sufficient locks and checks that we're not at risk of evaluating Raft commands on a removed replica.
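
For reference, the deadlock detector is enabled via a build tag; an invocation along these lines (the package path is illustrative) is what's meant by -tags deadlock:

	go test -tags deadlock ./pkg/kv/kvserver/...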

@erikgrinaker erikgrinaker marked this pull request as ready for review May 1, 2021 15:10
@erikgrinaker erikgrinaker force-pushed the remove-replica-writes branch 2 times, most recently from 0eb92f9 to a20cc3d on May 2, 2021 10:46
@tbg (Member) left a comment


Reviewed 7 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)


pkg/kv/kvserver/client_relocate_range_test.go, line 376 at r1 (raw file):

// which makes sure an in-flight read operation during replica removal won't
// return empty results.
func TestReplicaRemovalDuringGet(t *testing.T) {

Does this test also still repro the bug if you taint the fix?


pkg/kv/kvserver/replica_write.go, line 82 at r1 (raw file):

	st, err := r.checkExecutionCanProceed(ctx, ba, g)
	if err != nil {
		r.readOnlyCmdMu.RUnlock()

How about

needRUnlock := true
defer func() {
  if needRUnlock {
    r.readOnlyCmdMu.RUnlock()
  }
}()

// later..
needRUnlock = false

or does that escape (and even then can use singleton pointers to true and false)?

The need to unlock before each return invites errors.

@erikgrinaker (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/client_relocate_range_test.go, line 376 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Does this test also still repro the bug if you taint the fix?

Yep


pkg/kv/kvserver/replica_write.go, line 82 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

How about

needRUnlock := true
defer func() {
  if needRUnlock {
    r.readOnlyCmdMu.RUnlock()
  }
}()

// later..
needRUnlock = false

or does that escape (and even then can use singleton pointers to true and false)?

The need to unlock before each return invites errors.

I agree, but we already do this all over the place. Personally I think utility functions that cover the mutex lifetime (so that we could use defer) would be cleaner; see the sketch below. Anyway, I've opened #64459 to review and clean up this locking; I suggest we revisit this then, since we may want to make broader changes (e.g. the name readOnlyCmdMu is itself a contradiction here).
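
A minimal sketch of what such a utility function might look like (illustrative only, not a concrete API proposal): the helper owns the RLock/RUnlock pair via defer, so no return path can leak the lock.

package main

import (
	"fmt"
	"sync"
)

type replica struct {
	readOnlyCmdMu sync.RWMutex
}

// withReadOnlyCmdMu runs fn while holding readOnlyCmdMu in read mode; the
// deferred unlock covers every return path, including panics.
func (r *replica) withReadOnlyCmdMu(fn func() error) error {
	r.readOnlyCmdMu.RLock()
	defer r.readOnlyCmdMu.RUnlock()
	return fn()
}

func main() {
	r := &replica{}
	err := r.withReadOnlyCmdMu(func() error {
		fmt.Println("evaluating under readOnlyCmdMu")
		return nil
	})
	fmt.Println(err)
}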

@erikgrinaker (Contributor, Author)

bors r=tbg

@nvanbenschoten (Member)

bors r-

Sorry to do this, but I'm midway through a review and there's at least one serious issue here.


craig bot commented May 3, 2021

Canceled.

@erikgrinaker (Contributor, Author)

Sorry to do this, but I'm midway through a review and there's at least one serious issue here.

No worries, I appreciate it -- sorry for being trigger-happy here.

@nvanbenschoten (Member) left a comment


Reviewed 6 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


pkg/kv/kvserver/client_relocate_range_test.go, line 417 at r1 (raw file):

	// Perform delayed conditional put during replica removal. This will cause
	// an ambiguous result error, as outstanding proposals in the leaseholder
	// replica's proposal queue will be aborted when the replica is removed.

Mind mentioning what this would return if the bug was not fixed here as well?


pkg/kv/kvserver/client_test.go, line 66 at r1 (raw file):

		var e roachpb.Value
		e.SetBytes(expValue)
		expBytes = e.TagAndDataBytes()

nit: expBytes = roachpb.MakeValueFromBytes(expValue).TagAndDataBytes() if you want


pkg/kv/kvserver/replica_raft.go, line 205 at r1 (raw file):

	}

	maxLeaseIndex, pErr := r.propose(ctx, proposal, tok.Move(ctx))

Sorry for the false alarm. I was thinking that the propBuf's flushRLocked path could acquire the raftMu in cases where the propBuf hit a size limit, but I see now that it's not grabbing the raftMu, it's just upgrading a read lock on the Replica.mu to a write lock. So I think we're ok. Do you mind just adding a reminder in a comment somewhere in this method that nothing at or below this level should acquire the raftMu?
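
For context, a self-contained sketch of the read-to-write lock handoff described above (illustrative, not the actual propBuf code): Go's sync.RWMutex cannot be upgraded atomically, so the pattern is to drop the read lock, take the write lock, and re-check the condition, since another goroutine may have intervened.

package main

import (
	"fmt"
	"sync"
)

type propBuf struct {
	mu    sync.RWMutex
	n     int
	limit int
}

func (b *propBuf) insert() {
	b.mu.RLock()
	full := b.n >= b.limit // check the size limit under the read lock
	b.mu.RUnlock()

	b.mu.Lock()
	defer b.mu.Unlock()
	if full {
		// Re-check under the write lock: another goroutine may already
		// have flushed (or refilled) the buffer in between.
		if b.n >= b.limit {
			b.n = 0 // stand-in for flushing the full buffer
		}
	}
	b.n++
}

func main() {
	b := &propBuf{limit: 2}
	for i := 0; i < 5; i++ {
		b.insert()
	}
	fmt.Println(b.n) // 1: the buffer was flushed twice along the way
}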

@erikgrinaker (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/client_test.go, line 66 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: expBytes = roachpb.MakeValueFromBytes(expValue).TagAndDataBytes() if you want

Thanks!


pkg/kv/kvserver/replica_raft.go, line 205 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Sorry for the false alarm. I was thinking that the propBuf's flushRLocked path could acquire the raftMu in cases where the propBuf hit a size limit, but I see now that it's not grabbing the raftMu, it's just upgrading a read lock on the Replica.mu to a write lock. So I think we're ok. Do you mind just adding a reminder in a comment somewhere in this method that nothing at or below this level should acquire the raftMu?

Not at all, thanks for checking! Added a note, good idea.

@erikgrinaker (Contributor, Author)

bors r=tbg,nvanbenschoten

The CI failure (TestLogic/5node-metadata/distsql_datetime) appears to be a flake; I stressed it for a few thousand runs without any failures.


craig bot commented May 3, 2021

Build succeeded.

@erikgrinaker erikgrinaker deleted the remove-replica-writes branch May 12, 2021 11:07