-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv/kvserver: ignore discovered intents from prior lease sequences #51615
kv/kvserver: ignore discovered intents from prior lease sequences #51615
Conversation
Avoids a heap allocation.
The corresponding issue cockroachdb#10493 has since been closed because we seem to have made peace with the filter.
b816a5a
to
e4e1463
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Avoids a heap allocation.
Curious - which allocation was that? The buf.String()
call at the end?
Release note (bug fix): It is no longer possible for rapid Range lease
movement to trigger a rare assertion failure under contended workloads.
Put the text of the assertion in the release note pls.
Reviewed 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvserver/concurrency/concurrency_control.go, line 195 at r3 (raw file):
// concurrency manager about the replicated write intent that was missing // from its lock table which was found during request evaluation (while // holding latches under the provided lease sequence). After doing so, it
well, one doesn't "hold latches under a lease sequence", does it? I think you want to say that the intent was found while evaluating
pkg/kv/kvserver/concurrency/concurrency_control.go, line 470 at r3 (raw file):
// AddDiscoveredLock informs the lockTable of a lock that was discovered // during evaluation under the provided lease sequence which the lockTable
this sentence now became confusing cause it's no longer clear whether the lock table wasn't previously tracking an intent or a lease sequence.
pkg/kv/kvserver/concurrency/concurrency_control.go, line 470 at r3 (raw file):
// AddDiscoveredLock informs the lockTable of a lock that was discovered // during evaluation under the provided lease sequence which the lockTable
Say that the LeaseSeq
is used to detect whether a lease changed occurred since... and say why.
pkg/kv/kvserver/concurrency/lock_table.go, line 1863 at r3 (raw file):
return true, nil } else if seq > t.enabledSeq { return false, errors.AssertionFailedf("unexpected lease sequence: %d > %d", seq, t.enabledSeq)
maybe put the intent in this error
pkg/kv/kvserver/concurrency/lock_table.go, line 1863 at r3 (raw file):
return true, nil } else if seq > t.enabledSeq { return false, errors.AssertionFailedf("unexpected lease sequence: %d > %d", seq, t.enabledSeq)
I assume this assertion is valid because the table is supposed to have been disabled the moment a new lease applied, and the lease application must have preceded the writing of any intent for that sequence. Right? Pls put a comment.
Fixes cockroachdb#50996. This commit fixes what I suspect to be the underlying cause of the crash in cockroachdb#50996 (whose only occurrence was seen on a known customer's cluster). The fix addresses an assertion failure caused by a scan that discovers an intent but does not inform the lock-table about the discovery until after the corresponding range's lease has been transferred away and back. If, during the time that the lease was remote, the intent that the scan discovered is removed and replaced, the scan can end up placing the wrong intent into the lock-table once the lease is returned. If another request had already discovered the new intent and placed that in the lock-table by the time that the original scan informs the lockTable, then the "discovered lock by different transaction than existing lock" assertion would be hit. This PR demonstrates that this sequence of events could trigger the assertion in two ways. First, it demonstrates it in the new `discover_lock_after_lease_race` ConcurrencyManager data driven test. This one is fairly synthetic, as it doesn't actually prove that it is possible for a scan to inform the lockTable about a stale intent after a lease is transferred away and back, only that if it did, it would hit the assertion. The PR then demonstrates using a TestCluster in `TestDiscoverIntentAcrossLeaseTransferAwayAndBack` that because of the lack of latching during lease transfers and requests, such an ordering is possible and, without proper protection in the lockTable, could trigger the assertion. Without the fix, both tests panic every time they are run. The PR then fixes the issue. The fix is to not only track an "enabled" boolean in the lockTable, but also the lease sequence that the lockTable is operating under when the lockTable is enabled. We then also provide the lease sequence that a scan ran under when discovering intents (`AddDiscoveredLock`), and ignore discovered intents from prior lease sequences. If a request provides a stale discovered intent, it is safe to try its scan again – the intent is either incorrect and shouldn't be added to the lockTable or it is correct and will be added once the scan retries under the new lease sequence. Release note (bug fix): It is no longer possible for rapid Range lease movement to trigger a rare assertion failure under contended workloads. The assertion contained the text: "discovered lock by different transaction than existing lock".
e4e1463
to
4ab98a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Curious - which allocation was that? The buf.String() call at the end?
Yes, see https://medium.com/@thuc/8-notes-about-strings-builder-in-golang-65260daae6e9 or https://medium.com/@felipedutratine/string-concatenation-in-golang-since-1-10-bytes-buffer-vs-strings-builder-2b3081848c45.
Put the text of the assertion in the release note pls.
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/kv/kvserver/concurrency/concurrency_control.go, line 195 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
well, one doesn't "hold latches under a lease sequence", does it? I think you want to say that the intent was found while evaluating
Done.
pkg/kv/kvserver/concurrency/concurrency_control.go, line 470 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
this sentence now became confusing cause it's no longer clear whether the lock table wasn't previously tracking an intent or a lease sequence.
Done.
pkg/kv/kvserver/concurrency/concurrency_control.go, line 470 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Say that the
LeaseSeq
is used to detect whether a lease changed occurred since... and say why.
Done.
pkg/kv/kvserver/concurrency/lock_table.go, line 1863 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I assume this assertion is valid because the table is supposed to have been disabled the moment a new lease applied, and the lease application must have preceded the writing of any intent for that sequence. Right? Pls put a comment.
Done.
bors r+ |
Build failed (retrying...): |
Build succeeded: |
Fixes #50996.
This commit fixes what I suspect to be the underlying cause of the crash in #50996 (whose only occurrence was seen on a known customer's cluster).
The fix addresses an assertion failure caused by a scan that discovers an intent but does not inform the lock-table about the discovery until after the corresponding range's lease has been transferred away and back. If, during the time that the lease was remote, the intent that the scan discovered is removed and replaced, the scan can end up placing the wrong intent into the lock-table once the lease is returned. If another request had already discovered the new intent and placed that in the lock-table by the time that the original scan informs the lockTable, then the "discovered lock by different transaction than existing lock" assertion would be hit.
This PR demonstrates that this sequence of events could trigger the assertion in two ways. First, it demonstrates it in the new
discover_lock_after_lease_race
ConcurrencyManager data-driven test. This one is fairly synthetic, as it doesn't actually prove that it is possible for a scan to inform the lockTable about a stale intent after a lease is transferred away and back, only that if it did, it would hit the assertion. The PR then demonstrates using a TestCluster inTestDiscoverIntentAcrossLeaseTransferAwayAndBack
that because of the lack of latching during lease transfers and requests, such an ordering is possible and, without proper protection in the lockTable, could trigger the assertion. Without the fix, both tests panic every time they are run.The PR then fixes the issue. The fix is to not only track an "enabled" boolean in the lockTable, but also the lease sequence that the lockTable is operating under when the lockTable is enabled. We then also provide the lease sequence that a scan ran under when discovering intents (
AddDiscoveredLock
), and ignore discovered intents from prior lease sequences. If a request provides a stale discovered intent, it is safe to try its scan again – the intent is either incorrect and shouldn't be added to the lockTable or it is correct and will be added once the scan retries under the new lease sequence.Release note (bug fix): It is no longer possible for rapid Range lease movement to trigger a rare assertion failure under contended workloads.