
kv/kvserver: ignore discovered intents from prior lease sequences #51615

Merged

Conversation

nvanbenschoten (Member)

Fixes #50996.

This commit fixes what I suspect to be the underlying cause of the crash in #50996 (whose only occurrence was seen on a known customer's cluster).

The fix addresses an assertion failure caused by a scan that discovers an intent but does not inform the lock-table about the discovery until after the corresponding range's lease has been transferred away and back. If, while the lease is remote, the intent that the scan discovered is removed and replaced, the scan can end up placing the wrong intent into the lock-table once the lease is returned. If another request has already discovered the new intent and placed it in the lock-table by the time the original scan informs the lockTable, then the "discovered lock by different transaction than existing lock" assertion would be hit.

This PR demonstrates that this sequence of events could trigger the assertion in two ways. First, it demonstrates it in the new discover_lock_after_lease_race ConcurrencyManager data-driven test. This one is fairly synthetic, as it doesn't actually prove that it is possible for a scan to inform the lockTable about a stale intent after a lease is transferred away and back, only that if it did, it would hit the assertion. The PR then demonstrates, using a TestCluster in TestDiscoverIntentAcrossLeaseTransferAwayAndBack, that because there is no latching between lease transfers and requests, such an ordering is possible and, without proper protection in the lockTable, could trigger the assertion. Without the fix, both tests panic every time they are run.

The PR then fixes the issue. The fix is to track not only an "enabled" boolean in the lockTable, but also the lease sequence that the lockTable is operating under when it is enabled. We then also provide the lease sequence that a scan ran under when discovering intents (AddDiscoveredLock), and ignore discovered intents from prior lease sequences. If a request provides a stale discovered intent, it is safe to retry its scan – the intent is either incorrect and shouldn't be added to the lockTable, or it is correct and will be added once the scan retries under the new lease sequence.
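As a minimal sketch of that gate (the names lockTable, enabledSeq, and addDiscoveredLock mirror this description, but the types and signatures are simplified illustrations, not the actual CockroachDB API):

	package main

	import "fmt"

	// LeaseSequence is a simplified stand-in for a replica's lease sequence.
	type LeaseSequence int

	// lockTable is a simplified stand-in for the kvserver lock table.
	type lockTable struct {
		enabled    bool          // true while this replica holds the lease
		enabledSeq LeaseSequence // lease sequence the table was enabled under
	}

	// enable is called when the replica (re)acquires the lease.
	func (t *lockTable) enable(seq LeaseSequence) {
		t.enabled = true
		t.enabledSeq = seq
	}

	// disable is called when the lease transfers away; the table stops
	// tracking locks because its state may go stale while the lease is remote.
	func (t *lockTable) disable() {
		t.enabled = false
	}

	// addDiscoveredLock reports whether a discovered intent was added.
	// Discoveries made under prior lease sequences are ignored; the caller
	// simply retries its scan under the current lease.
	func (t *lockTable) addDiscoveredLock(seq LeaseSequence) (added bool, err error) {
		if !t.enabled {
			return false, nil
		}
		if seq < t.enabledSeq {
			// Stale discovery from before the lease moved away and back.
			return false, nil
		}
		if seq > t.enabledSeq {
			// A discovery from a newer lease should be impossible; see the
			// review discussion below.
			return false, fmt.Errorf("unexpected lease sequence: %d > %d", seq, t.enabledSeq)
		}
		// ... insert the lock into the table ...
		return true, nil
	}

	func main() {
		var t lockTable
		t.enable(1)
		t.disable()                        // lease transferred away
		t.enable(2)                        // lease transferred back
		added, _ := t.addDiscoveredLock(1) // scan evaluated under the old lease
		fmt.Println(added)                 // false: stale discovery ignored, scan retries
	}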

Release note (bug fix): It is no longer possible for rapid Range lease movement to trigger a rare assertion failure under contended workloads. The assertion contained the text: "discovered lock by different transaction than existing lock".


@andreimatei (Contributor) left a comment

LGTM

Avoids a heap allocation.

Curious - which allocation was that? The buf.String() call at the end?

Release note (bug fix): It is no longer possible for rapid Range lease
movement to trigger a rare assertion failure under contended workloads.

Put the text of the assertion in the release note pls.

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)


pkg/kv/kvserver/concurrency/concurrency_control.go, line 195 at r3 (raw file):

	// concurrency manager about the replicated write intent that was missing
	// from its lock table which was found during request evaluation (while
	// holding latches under the provided lease sequence). After doing so, it

well, one doesn't "hold latches under a lease sequence", does it? I think you want to say that the intent was found while evaluating


pkg/kv/kvserver/concurrency/concurrency_control.go, line 470 at r3 (raw file):

	// AddDiscoveredLock informs the lockTable of a lock that was discovered
	// during evaluation under the provided lease sequence which the lockTable

this sentence has now become confusing because it's no longer clear whether the lock table wasn't previously tracking an intent or a lease sequence.


pkg/kv/kvserver/concurrency/concurrency_control.go, line 470 at r3 (raw file):

	// AddDiscoveredLock informs the lockTable of a lock that was discovered
	// during evaluation under the provided lease sequence which the lockTable

Say that the LeaseSeq is used to detect whether a lease change occurred since... and say why.


pkg/kv/kvserver/concurrency/lock_table.go, line 1863 at r3 (raw file):

		return true, nil
	} else if seq > t.enabledSeq {
		return false, errors.AssertionFailedf("unexpected lease sequence: %d > %d", seq, t.enabledSeq)

maybe put the intent in this error


pkg/kv/kvserver/concurrency/lock_table.go, line 1863 at r3 (raw file):

		return true, nil
	} else if seq > t.enabledSeq {
		return false, errors.AssertionFailedf("unexpected lease sequence: %d > %d", seq, t.enabledSeq)

I assume this assertion is valid because the table is supposed to have been disabled the moment a new lease applied, and the lease application must have preceded the writing of any intent for that sequence. Right? Pls put a comment.
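One way the requested comment might read, on a simplified version of the quoted check (illustrative; the comment actually added in the PR may be worded differently):

	} else if seq > t.enabledSeq {
		// Applying a new lease disables the lock table before any request
		// can evaluate (and so discover intents) under that lease, so a
		// discovery can never carry a sequence newer than the one the
		// table was enabled with.
		return false, errors.AssertionFailedf("unexpected lease sequence: %d > %d", seq, t.enabledSeq)
	}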

@nvanbenschoten (Member, Author) left a comment

TFTR!

Curious - which allocation was that? The buf.String() call at the end?

Yes, see https://medium.com/@thuc/8-notes-about-strings-builder-in-golang-65260daae6e9 or https://medium.com/@felipedutratine/string-concatenation-in-golang-since-1-10-bytes-buffer-vs-strings-builder-2b3081848c45.
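For reference, a small self-contained illustration of the difference: bytes.Buffer.String() copies the buffer's contents into a new string, while strings.Builder.String() returns its underlying bytes without a copy. This is standard-library behavior, not code from this PR:

	package main

	import (
		"bytes"
		"fmt"
		"strings"
	)

	func withBuffer(parts []string) string {
		var buf bytes.Buffer
		for _, p := range parts {
			buf.WriteString(p)
		}
		return buf.String() // allocates: copies the buffer's bytes into a new string
	}

	func withBuilder(parts []string) string {
		var sb strings.Builder
		for _, p := range parts {
			sb.WriteString(p)
		}
		return sb.String() // no copy: reinterprets the builder's bytes as a string
	}

	func main() {
		parts := []string{"foo", "bar", "baz"}
		fmt.Println(withBuffer(parts) == withBuilder(parts)) // true: same result, fewer allocations
	}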

Put the text of the assertion in the release note pls.

Done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/kv/kvserver/concurrency/concurrency_control.go, line 195 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

well, one doesn't "hold latches under a lease sequence", does it? I think you want to say that the intent was found while evaluating

Done.


pkg/kv/kvserver/concurrency/concurrency_control.go, line 470 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this sentence now became confusing cause it's no longer clear whether the lock table wasn't previously tracking an intent or a lease sequence.

Done.


pkg/kv/kvserver/concurrency/concurrency_control.go, line 470 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Say that the LeaseSeq is used to detect whether a lease changed occurred since... and say why.

Done.


pkg/kv/kvserver/concurrency/lock_table.go, line 1863 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I assume this assertion is valid because the table is supposed to have been disabled the moment a new lease applied, and the lease application must have preceded the writing of any intent for that sequence. Right? Pls put a comment.

Done.

@nvanbenschoten (Member, Author)

bors r+


craig bot commented Jul 23, 2020

Build failed (retrying...):


craig bot commented Jul 24, 2020

Build succeeded:

@craig craig bot merged commit bc288d4 into cockroachdb:master Jul 24, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/lockTableAssert branch July 24, 2020 05:02
Successfully merging this pull request may close these issues.

kvserver: v20.1.1: assertion failure caused by concurrency_manager.go:261