
release-20.1: kv/kvserver: ignore discovered intents from prior lease sequences #51869

Merged

Conversation

nvanbenschoten
Member

Backport 3/3 commits from #51615.

/cc @cockroachdb/release


Fixes #50996.

This commit fixes what I suspect to be the underlying cause of the crash in #50996 (whose only occurrence was seen on a known customer's cluster).

The fix addresses an assertion failure caused by a scan that discovers an intent but does not inform the lock-table about the discovery until after the corresponding range's lease has been transferred away and back. If, during the time that the lease was remote, the intent that the scan discovered is removed and replaced, the scan can end up placing the wrong intent into the lock-table once the lease is returned. If another request had already discovered the new intent and placed that in the lock-table by the time that the original scan informs the lockTable, then the "discovered lock by different transaction than existing lock" assertion would be hit.
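To make the failure mode concrete, here is a minimal, self-contained Go sketch of that ordering. This is not the real lockTable code; the `discoveredLock` type, the simplified `addDiscoveredLock` method, and the map of lock holders are all invented for illustration, but the check mirrors the assertion quoted above.

```go
package main

import "fmt"

// discoveredLock is a hypothetical, simplified stand-in for an intent that a
// scan reports back to the lock table.
type discoveredLock struct {
	key   string
	txnID string
}

// lockTable is a toy model that remembers at most one known lock per key.
type lockTable struct {
	locks map[string]string // key -> txn ID of the known lock holder
}

func (lt *lockTable) addDiscoveredLock(l discoveredLock) {
	if existing, ok := lt.locks[l.key]; ok && existing != l.txnID {
		// This mirrors the assertion described above.
		panic(fmt.Sprintf("discovered lock by different transaction (%s) than existing lock (%s)",
			l.txnID, existing))
	}
	lt.locks[l.key] = l.txnID
}

func main() {
	lt := &lockTable{locks: map[string]string{}}

	// While the lease was remote, txn A's intent on "k" was resolved and txn B
	// wrote a new intent there. A newer request discovers txn B's intent first...
	lt.addDiscoveredLock(discoveredLock{key: "k", txnID: "txn-B"})

	// ...and then the original, stale scan reports txn A's stale intent and panics.
	lt.addDiscoveredLock(discoveredLock{key: "k", txnID: "txn-A"})
}
```

Running the sketch panics on the second call, which is exactly the ordering the stale scan produces once the lease has moved away and back.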

This PR demonstrates in two ways that this sequence of events could trigger the assertion. First, it demonstrates it in the new discover_lock_after_lease_race ConcurrencyManager data-driven test. This test is fairly synthetic: it doesn't prove that a scan can actually inform the lockTable about a stale intent after a lease is transferred away and back, only that if it did, it would hit the assertion. The PR then uses a TestCluster in TestDiscoverIntentAcrossLeaseTransferAwayAndBack to demonstrate that, because of the lack of latching during lease transfers and requests, such an ordering is possible and, without proper protection in the lockTable, could trigger the assertion. Without the fix, both tests panic every time they are run.

The PR then fixes the issue. The fix is to track not only an "enabled" boolean in the lockTable, but also the lease sequence that the lockTable was operating under when it was enabled. We then also provide the lease sequence that a scan ran under when it informs the lockTable of discovered intents (AddDiscoveredLock), and ignore discovered intents from prior lease sequences. If a request provides a stale discovered intent, it is safe for it to retry its scan: either the intent is incorrect and shouldn't be added to the lockTable, or it is correct and will be added once the scan retries under the new lease sequence.
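As a rough illustration of the shape of the fix, the same toy model can remember the lease sequence it was enabled under and reject discoveries made under an older sequence. Again, this is a sketch under assumed names (`enable`, `enabledSeq`, the integer sequence, and the boolean return are invented here), not the actual lockTable implementation.

```go
package main

import "fmt"

// lockTable is the same toy model as above, extended with the lease sequence
// it was most recently enabled under.
type lockTable struct {
	enabled    bool
	enabledSeq int               // lease sequence at the time the table was enabled
	locks      map[string]string // key -> txn ID of the known lock holder
}

// enable is called when the replica (re)acquires the lease.
func (lt *lockTable) enable(leaseSeq int) {
	lt.enabled = true
	lt.enabledSeq = leaseSeq
	lt.locks = map[string]string{}
}

// addDiscoveredLock reports whether the discovery was admitted. A false return
// tells the caller to retry its scan under the current lease sequence.
func (lt *lockTable) addDiscoveredLock(key, txnID string, scanLeaseSeq int) bool {
	if !lt.enabled || scanLeaseSeq < lt.enabledSeq {
		// Stale discovery from a prior lease sequence: ignore it.
		return false
	}
	if existing, ok := lt.locks[key]; ok && existing != txnID {
		panic("discovered lock by different transaction than existing lock")
	}
	lt.locks[key] = txnID
	return true
}

func main() {
	lt := &lockTable{}

	// The scan that will later report txn A's intent runs under lease seq 1.
	lt.enable(1)

	// The lease moves away and comes back before that scan reports anything.
	lt.enable(2)

	// Under the new lease, another request discovers txn B's intent first.
	fmt.Println(lt.addDiscoveredLock("k", "txn-B", 2)) // true

	// The stale discovery from lease seq 1 is ignored instead of tripping the
	// assertion; that request retries its scan and rediscovers txn B's intent.
	fmt.Println(lt.addDiscoveredLock("k", "txn-A", 1)) // false
}
```

A rejected discovery is not an error: the caller simply retries its scan under the current lease sequence and reports whatever intent is actually there.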

Release note (bug fix): It is no longer possible for rapid Range lease movement to trigger a rare assertion failure under contended workloads.

@cockroach-teamcity
Member

This change is Reviewable

@andreimatei
Contributor

andreimatei commented Jul 24, 2020 via email

@nvanbenschoten force-pushed the backport20.1-51615 branch 2 times, most recently from cef59c9 to 963f343, on July 24, 2020 23:47
Fixes cockroachdb#50996.

This commit fixes what I suspect to be the underlying cause of the crash
in cockroachdb#50996 (whose only occurrence was seen on a known customer's cluster).

The fix addresses an assertion failure caused by a scan that discovers
an intent but does not inform the lock-table about the discovery until
after the corresponding range's lease has been transferred away and
back. If, during the time that the lease was remote, the intent that the
scan discovered is removed and replaced, the scan can end up placing the
wrong intent into the lock-table once the lease is returned. If another
request had already discovered the new intent and placed that in the
lock-table by the time that the original scan informs the lockTable,
then the "discovered lock by different transaction than existing lock"
assertion would be hit.

This PR demonstrates that this sequence of events could trigger the
assertion in two ways. First, it demonstrates it in the new
`discover_lock_after_lease_race` ConcurrencyManager data driven test.
This one is fairly synthetic, as it doesn't actually prove that it is
possible for a scan to inform the lockTable about a stale intent after a
lease is transferred away and back, only that if it did, it would hit
the assertion. The PR then demonstrates using a TestCluster in
`TestDiscoverIntentAcrossLeaseTransferAwayAndBack` that because of the
lack of latching during lease transfers and requests, such an ordering
is possible and, without proper protection in the lockTable, could
trigger the assertion. Without the fix, both tests panic every time they
are run.

The PR then fixes the issue. The fix is to not only track an "enabled"
boolean in the lockTable, but also the lease sequence that the lockTable
is operating under when the lockTable is enabled. We then also provide
the lease sequence that a scan ran under when discovering intents
(`AddDiscoveredLock`), and ignore discovered intents from prior lease
sequences. If a request provides a stale discovered intent, it is safe
to try its scan again – the intent is either incorrect and shouldn't be
added to the lockTable or it is correct and will be added once the scan
retries under the new lease sequence.

Release note (bug fix): It is no longer possible for rapid Range lease
movement to trigger a rare assertion failure under contended workloads.
The assertion contained the text: "discovered lock by different transaction
than existing lock".
@nvanbenschoten merged commit 76bb7cd into cockroachdb:release-20.1 on Jul 27, 2020
@nvanbenschoten deleted the backport20.1-51615 branch on July 27, 2020 15:58