release-20.1: kv: don't mix prefix and non-prefix iters when collecting intents #47301
Fixes cockroachdb#47219.
This commit addresses the bug diagnosed and explained in cockroachdb#47219. In that issue, we saw an assertion failure all the way up in the concurrency manager because a READ_UNCOMMITTED scan was hitting a WriteIntentError, which should not be possible. The root cause of this issue was that READ_UNCOMMITTED scans were mixing prefix and non-prefix iterators pulled from a read-only engine between the time that they were collecting intent keys and they were returning to fetch the provisional values for those keys. This mixing of iterators did not guarantee that the two stages of the operation would observe a consistent snapshot of the underlying engine, and because the READ_UNCOMMITTED scans also did not acquire latches, writes were able to slip in and change the intent while the scan wasn't looking. This caused the scan to throw a WriteIntentError for the new intent transaction, which badly confused other parts of the system (rightfully so).
This commit fixes this issue in a few different ways:
1. it ensures that we always use the same iterator type (prefix or non-prefix) when retrieving the provisional values for a collection of intents retrieved by an earlier scan during READ_UNCOMMITTED operations.
2. it adds an assertion inside of batcheval.CollectIntentRows that the function never returns a WriteIntentError. This would have caught the bug much more easily, especially back before we had the concurrency manager assertion and this bug could have materialized as stuck range lookups and potentially even deadlocked splits due to the dependency cycle between those two operations.
3. it documents the limited guarantees that read-only engines provide with respect to consistent engine snapshots across iterator instances.
We'll want to backport this fix as far back as possible. It won't crash earlier releases of Cockroach, but as stated above, it might cause even more disastrous results.
REMINDER: when backporting, remember to change the release note.
Release notes (bug fix): a bug that could cause Cockroach processes to crash due to an assertion failure with the text "expected latches held, found none" has been fixed.
Release justification: fixes a high-priority bug in existing functionality. The bug became louder (now crashes servers) due to recent changes that added new assertions into the code.
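The root cause described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not CockroachDB's storage code: the `store`, `readOnly`, and `scanThenFetch` names are hypothetical, and it assumes that a read-only engine lazily creates and caches one iterator per type, each pinning its own view of the engine when first created.

```go
package main

import "fmt"

// store is a toy stand-in for the underlying engine. It maps a key to the ID
// of the transaction whose intent currently sits on that key.
type store struct {
	intents map[string]string
}

// snapshot copies the current state. It models the view that an iterator
// pins at the moment it is created.
func (s *store) snapshot() map[string]string {
	cp := make(map[string]string, len(s.intents))
	for k, v := range s.intents {
		cp[k] = v
	}
	return cp
}

// readOnly is a hypothetical read-only engine. It caches one iterator per
// type (assumption), so the two types may pin different snapshots.
type readOnly struct {
	s      *store
	prefix map[string]string // view pinned by the cached prefix iterator
	normal map[string]string // view pinned by the cached non-prefix iterator
}

// iter returns the cached view for the requested iterator type, creating it
// on first use.
func (r *readOnly) iter(prefix bool) map[string]string {
	if prefix {
		if r.prefix == nil {
			r.prefix = r.s.snapshot()
		}
		return r.prefix
	}
	if r.normal == nil {
		r.normal = r.s.snapshot()
	}
	return r.normal
}

// scanThenFetch collects the intent on "k" with a non-prefix iterator, lets an
// unlatched write replace the intent, then fetches the key again with the
// iterator type selected by fetchWithPrefix.
func scanThenFetch(fetchWithPrefix bool) (collected, fetched string) {
	s := &store{intents: map[string]string{"k": "txn1"}}
	ro := &readOnly{s: s}
	collected = ro.iter(false)["k"]
	s.intents["k"] = "txn2" // a concurrent write slips in
	fetched = ro.iter(fetchWithPrefix)["k"]
	return collected, fetched
}

func main() {
	c, f := scanThenFetch(true)
	fmt.Println("mixed iterators:", c, f) // mixed iterators: txn1 txn2
	c, f = scanThenFetch(false)
	fmt.Println("same iterator type:", c, f) // same iterator type: txn1 txn1
}
```

In this model, reusing the iterator type from the first stage keeps both stages on the same pinned view, which is the property that fix 1 relies on.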
Updates documentation on:
- GetResponse.IntentValue
- ScanResponse.IntentRows
- ReverseScanResponse.IntentRows
Release justification: comment-only.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @irfansharif)
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @ajwerner and @irfansharif)
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @irfansharif)
Backport 2/2 commits from #47247.
/cc @cockroachdb/release
Fixes #47219.
This commit addresses the bug diagnosed and explained in #47219. In that issue, we saw an assertion failure all the way up in the concurrency manager because a READ_UNCOMMITTED scan was hitting a WriteIntentError, which should not be possible. The root cause of this issue was that READ_UNCOMMITTED scans were mixing prefix and non-prefix iterators pulled from a read-only engine between the time that they were collecting intent keys and they were returning to fetch the provisional values for those keys. This mixing of iterators did not guarantee that the two stages of the operation would observe a consistent snapshot of the underlying engine, and because the READ_UNCOMMITTED scans also did not acquire latches, writes were able to slip in and change the intent while the scan wasn't looking. This caused the scan to throw a WriteIntentError for the new intent transaction, which badly confused other parts of the system (rightfully so).
This commit fixes this issue in a few different ways:
1. it ensures that we always use the same iterator type (prefix or non-prefix) when retrieving the provisional values for a collection of intents retrieved by an earlier scan during READ_UNCOMMITTED operations.
2. it adds an assertion inside of batcheval.CollectIntentRows that the function never returns a WriteIntentError. This would have caught the bug much more easily, especially back before we had the concurrency manager assertion and this bug could have materialized as stuck range lookups and potentially even deadlocked splits due to the dependency cycle between those two operations.
3. it documents the limited guarantees that read-only engines provide with respect to consistent engine snapshots across iterator instances.
We'll want to backport this fix as far back as possible. It won't crash earlier releases of Cockroach, but as stated above, it might cause even more disastrous results.
REMINDER: when backporting, remember to change the release note.
Release notes (bug fix): a bug that could cause Cockroach processes to crash due to an assertion failure with the text "expected latches held, found none" has been fixed.
Release justification: fixes a high-priority bug in existing functionality. The bug became louder (now crashes servers) due to recent changes that added new assertions into the code.
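The assertion described in fix 2 could look roughly like the following. This is a hedged sketch, not the code in batcheval.CollectIntentRows: `writeIntentError`, `getFn`, and `collectIntentRows` are hypothetical stand-ins, and the real function's signature and failure mechanism may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// writeIntentError is a hypothetical stand-in for the WriteIntentError
// discussed in the text.
type writeIntentError struct{ key string }

func (e *writeIntentError) Error() string {
	return "conflicting intent on key " + e.key
}

// getFn is a hypothetical lookup that fetches the provisional value written
// by the intent on a key.
type getFn func(key string) (string, error)

// collectIntentRows fetches the provisional value for every intent key. If a
// lookup reports a writeIntentError, the intent changed between the two
// stages, so the error is turned into an assertion failure instead of being
// returned to the caller as a write intent error.
func collectIntentRows(keys []string, get getFn) ([]string, error) {
	rows := make([]string, 0, len(keys))
	for _, k := range keys {
		v, err := get(k)
		if err != nil {
			var wiErr *writeIntentError
			if errors.As(err, &wiErr) {
				// %v (not %w) so callers can no longer unwrap a
				// writeIntentError from the returned error.
				return nil, fmt.Errorf(
					"assertion failed: unexpected WriteIntentError while collecting intent rows: %v", err)
			}
			return nil, err
		}
		rows = append(rows, v)
	}
	return rows, nil
}

func main() {
	// A lookup that simulates the intent on "b" having been replaced.
	get := func(key string) (string, error) {
		if key == "b" {
			return "", &writeIntentError{key: key}
		}
		return "value-" + key, nil
	}
	_, err := collectIntentRows([]string{"a", "b"}, get)
	fmt.Println(err)
}
```

Failing loudly at this point keeps the unexpected error from propagating to callers that assume READ_UNCOMMITTED scans never conflict with intents.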