
release-20.1: kv/kvserver: ignore discovered intents from prior lease sequences #51869

Merged

Conversation

nvanbenschoten
Member

Backport 3/3 commits from #51615.

/cc @cockroachdb/release


Fixes #50996.

This commit fixes what I suspect to be the underlying cause of the crash in #50996 (whose only occurrence was seen on a known customer's cluster).

The fix addresses an assertion failure caused by a scan that discovers an intent but does not inform the lock-table about the discovery until after the corresponding range's lease has been transferred away and back. If, during the time that the lease was remote, the intent that the scan discovered is removed and replaced, the scan can end up placing the wrong intent into the lock-table once the lease is returned. If another request had already discovered the new intent and placed that in the lock-table by the time that the original scan informs the lockTable, then the "discovered lock by different transaction than existing lock" assertion would be hit.
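To make the failure mode concrete, here is a minimal, self-contained Go sketch of that ordering. This is not the real lockTable code; the `discoveredLock` type, the simplified `addDiscoveredLock` method, and the map of lock holders are all invented for illustration, but the check mirrors the assertion quoted above.

```go
package main

import "fmt"

// discoveredLock is a hypothetical, simplified stand-in for an intent that a
// scan reports back to the lock table.
type discoveredLock struct {
	key   string
	txnID string
}

// lockTable is a toy model that remembers at most one known lock per key.
type lockTable struct {
	locks map[string]string // key -> txn ID of the known lock holder
}

func (lt *lockTable) addDiscoveredLock(l discoveredLock) {
	if existing, ok := lt.locks[l.key]; ok && existing != l.txnID {
		// This mirrors the assertion described above.
		panic(fmt.Sprintf("discovered lock by different transaction (%s) than existing lock (%s)",
			l.txnID, existing))
	}
	lt.locks[l.key] = l.txnID
}

func main() {
	lt := &lockTable{locks: map[string]string{}}

	// While the lease was remote, txn A's intent on "k" was resolved and txn B
	// wrote a new intent there. A newer request discovers txn B's intent first...
	lt.addDiscoveredLock(discoveredLock{key: "k", txnID: "txn-B"})

	// ...and then the original, stale scan reports txn A's stale intent and panics.
	lt.addDiscoveredLock(discoveredLock{key: "k", txnID: "txn-A"})
}
```

Running the sketch panics on the second call, which is exactly the ordering the stale scan produces once the lease has moved away and back.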

This PR demonstrates in two ways that this sequence of events could trigger the assertion. First, it demonstrates it in the new discover_lock_after_lease_race ConcurrencyManager data-driven test. This test is fairly synthetic: it doesn't prove that a scan can actually inform the lockTable about a stale intent after a lease is transferred away and back, only that if it did, it would hit the assertion. The PR then uses a TestCluster in TestDiscoverIntentAcrossLeaseTransferAwayAndBack to demonstrate that, because of the lack of latching during lease transfers and requests, such an ordering is possible and, without proper protection in the lockTable, could trigger the assertion. Without the fix, both tests panic every time they are run.

The PR then fixes the issue. The fix is to track not only an "enabled" boolean in the lockTable, but also the lease sequence that the lockTable was operating under when it was enabled. We then also provide the lease sequence that a scan ran under when it informs the lockTable of discovered intents (AddDiscoveredLock), and ignore discovered intents from prior lease sequences. If a request provides a stale discovered intent, it is safe for it to retry its scan: either the intent is incorrect and shouldn't be added to the lockTable, or it is correct and will be added once the scan retries under the new lease sequence.
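As a rough illustration of the shape of the fix, the same toy model can remember the lease sequence it was enabled under and reject discoveries made under an older sequence. Again, this is a sketch under assumed names (`enable`, `enabledSeq`, the integer sequence, and the boolean return are invented here), not the actual lockTable implementation.

```go
package main

import "fmt"

// lockTable is the same toy model as above, extended with the lease sequence
// it was most recently enabled under.
type lockTable struct {
	enabled    bool
	enabledSeq int               // lease sequence at the time the table was enabled
	locks      map[string]string // key -> txn ID of the known lock holder
}

// enable is called when the replica (re)acquires the lease.
func (lt *lockTable) enable(leaseSeq int) {
	lt.enabled = true
	lt.enabledSeq = leaseSeq
	lt.locks = map[string]string{}
}

// addDiscoveredLock reports whether the discovery was admitted. A false return
// tells the caller to retry its scan under the current lease sequence.
func (lt *lockTable) addDiscoveredLock(key, txnID string, scanLeaseSeq int) bool {
	if !lt.enabled || scanLeaseSeq < lt.enabledSeq {
		// Stale discovery from a prior lease sequence: ignore it.
		return false
	}
	if existing, ok := lt.locks[key]; ok && existing != txnID {
		panic("discovered lock by different transaction than existing lock")
	}
	lt.locks[key] = txnID
	return true
}

func main() {
	lt := &lockTable{}

	// The scan that will later report txn A's intent runs under lease seq 1.
	lt.enable(1)

	// The lease moves away and comes back before that scan reports anything.
	lt.enable(2)

	// Under the new lease, another request discovers txn B's intent first.
	fmt.Println(lt.addDiscoveredLock("k", "txn-B", 2)) // true

	// The stale discovery from lease seq 1 is ignored instead of tripping the
	// assertion; that request retries its scan and rediscovers txn B's intent.
	fmt.Println(lt.addDiscoveredLock("k", "txn-A", 1)) // false
}
```

A rejected discovery is not an error: the caller simply retries its scan under the current lease sequence and reports whatever intent is actually there.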

Release note (bug fix): It is no longer possible for rapid Range lease movement to trigger a rare assertion failure under contended workloads.

@cockroach-teamcity
Member

This change is Reviewable

@andreimatei
Contributor

andreimatei commented Jul 24, 2020 via email

@nvanbenschoten force-pushed the backport20.1-51615 branch 2 times, most recently from cef59c9 to 963f343, on July 24, 2020 23:47
Fixes cockroachdb#50996.

This commit fixes what I suspect to be the underlying cause of the crash
in cockroachdb#50996 (whose only occurrence was seen on a known customer's cluster).

The fix addresses an assertion failure caused by a scan that discovers
an intent but does not inform the lock-table about the discovery until
after the corresponding range's lease has been transferred away and
back. If, during the time that the lease was remote, the intent that the
scan discovered is removed and replaced, the scan can end up placing the
wrong intent into the lock-table once the lease is returned. If another
request had already discovered the new intent and placed that in the
lock-table by the time that the original scan informs the lockTable,
then the "discovered lock by different transaction than existing lock"
assertion would be hit.

This PR demonstrates that this sequence of events could trigger the
assertion in two ways. First, it demonstrates it in the new
`discover_lock_after_lease_race` ConcurrencyManager data driven test.
This one is fairly synthetic, as it doesn't actually prove that it is
possible for a scan to inform the lockTable about a stale intent after a
lease is transferred away and back, only that if it did, it would hit
the assertion. The PR then demonstrates using a TestCluster in
`TestDiscoverIntentAcrossLeaseTransferAwayAndBack` that because of the
lack of latching during lease transfers and requests, such an ordering
is possible and, without proper protection in the lockTable, could
trigger the assertion. Without the fix, both tests panic every time they
are run.

The PR then fixes the issue. The fix is to not only track an "enabled"
boolean in the lockTable, but also the lease sequence that the lockTable
is operating under when the lockTable is enabled. We then also provide
the lease sequence that a scan ran under when discovering intents
(`AddDiscoveredLock`), and ignore discovered intents from prior lease
sequences. If a request provides a stale discovered intent, it is safe
to try its scan again – the intent is either incorrect and shouldn't be
added to the lockTable or it is correct and will be added once the scan
retries under the new lease sequence.

Release note (bug fix): It is no longer possible for rapid Range lease
movement to trigger a rare assertion failure under contended workloads.
The assertion contained the text: "discovered lock by different transaction
than existing lock".
@nvanbenschoten merged commit 76bb7cd into cockroachdb:release-20.1 on Jul 27, 2020
@nvanbenschoten deleted the backport20.1-51615 branch on July 27, 2020 15:58