release-20.1: kv/kvserver: ignore discovered intents from prior lease sequences #51869
Backport 3/3 commits from #51615.
/cc @cockroachdb/release
Fixes #50996.
This commit fixes what I suspect to be the underlying cause of the crash in #50996 (whose only occurrence was seen on a known customer's cluster).
The fix addresses an assertion failure caused by a scan that discovers an intent but does not inform the lock-table about the discovery until after the corresponding range's lease has been transferred away and back. If, during the time that the lease was remote, the intent that the scan discovered is removed and replaced, the scan can end up placing the wrong intent into the lock-table once the lease is returned. If another request had already discovered the new intent and placed that in the lock-table by the time that the original scan informs the lockTable, then the "discovered lock by different transaction than existing lock" assertion would be hit.
This PR demonstrates that this sequence of events could trigger the assertion in two ways. First, it demonstrates it in the new `discover_lock_after_lease_race` ConcurrencyManager data-driven test. This one is fairly synthetic, as it doesn't actually prove that it is possible for a scan to inform the lockTable about a stale intent after a lease is transferred away and back, only that if it did, it would hit the assertion. The PR then demonstrates using a TestCluster in `TestDiscoverIntentAcrossLeaseTransferAwayAndBack` that because of the lack of latching during lease transfers and requests, such an ordering is possible and, without proper protection in the lockTable, could trigger the assertion. Without the fix, both tests panic every time they are run.
The PR then fixes the issue. The fix is to not only track an "enabled" boolean in the lockTable, but also the lease sequence that the lockTable is operating under when the lockTable is enabled. We then also provide the lease sequence that a scan ran under when discovering intents (`AddDiscoveredLock`), and ignore discovered intents from prior lease sequences. If a request provides a stale discovered intent, it is safe to try its scan again: the intent is either incorrect and shouldn't be added to the lockTable or it is correct and will be added once the scan retries under the new lease sequence.
Release note (bug fix): It is no longer possible for rapid Range lease movement to trigger a rare assertion failure under contended workloads.
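For readers unfamiliar with the lockTable, the lease-sequence check described in this PR can be sketched roughly as follows. This is a toy model, not the code in this PR: the `lockTable` struct, its fields, `LeaseSequence`, and the `Enable`/`Clear` methods here are all hypothetical simplifications, and only the name `AddDiscoveredLock` comes from the description. The real method has a different signature.

```go
package main

import (
	"fmt"
	"sync"
)

// LeaseSequence is a hypothetical stand-in for a range lease sequence number.
type LeaseSequence int64

// lockTable is a toy model for illustration only; it is not the real lockTable.
type lockTable struct {
	mu         sync.Mutex
	enabled    bool
	enabledSeq LeaseSequence     // lease sequence the table was enabled under
	locks      map[string]string // key -> ID of the transaction holding the lock
}

func newLockTable() *lockTable {
	return &lockTable{locks: map[string]string{}}
}

// Enable marks the table as active under the given lease sequence.
func (t *lockTable) Enable(seq LeaseSequence) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.enabled = true
	t.enabledSeq = seq
}

// Clear disables the table and drops its locks, modeling the lease moving away.
func (t *lockTable) Clear() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.enabled = false
	t.locks = map[string]string{}
}

// AddDiscoveredLock records an intent found by a scan that ran under lease
// sequence seq. It reports false when the discovery is ignored, in which case
// the caller is expected to retry its scan.
func (t *lockTable) AddDiscoveredLock(key, txnID string, seq LeaseSequence) (bool, error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if !t.enabled {
		return false, nil
	}
	if seq < t.enabledSeq {
		// Discovered under a prior lease sequence: the intent may be stale.
		return false, nil
	}
	if holder, ok := t.locks[key]; ok && holder != txnID {
		return false, fmt.Errorf("discovered lock by different transaction than existing lock")
	}
	t.locks[key] = txnID
	return true, nil
}

func main() {
	lt := newLockTable()
	lt.Enable(1) // a scan under sequence 1 sees txnA's intent but reports it late
	lt.Clear()   // lease transferred away; txnA's intent is replaced by txnB's
	lt.Enable(3) // lease transferred back under a newer sequence

	added, err := lt.AddDiscoveredLock("k", "txnB", 3)
	fmt.Printf("current discovery: added=%t err=%v\n", added, err)

	added, err = lt.AddDiscoveredLock("k", "txnA", 1)
	fmt.Printf("stale discovery:   added=%t err=%v\n", added, err)
}
```

In this sketch, dropping the `seq < t.enabledSeq` check makes the second call reach the "different transaction" branch, which is the shape of the assertion failure described above.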