Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-20.1: kv: don't mix prefix and non-prefix iters when collecting intents #47301

Merged

Conversation

nvanbenschoten
Copy link
Member

Backport 2/2 commits from #47247.

/cc @cockroachdb/release


Fixes #47219.

This commit addresses the bug diagnosed and explained in #47219. In that issue, we saw an assertion failure all the way up in the concurrency manager because a READ_UNCOMMITTED scan was hitting a WriteIntentError, which should not be possible. The root cause of this issue was that READ_UNCOMMITTED scans were mixing prefix and non-prefix iterators pulled from a read-only engine between the time that they were collecting intent keys and they were returning to fetch the provisional values for those keys. This mixing of iterators did not guarantee that the two stages of the operation would observe a consistent snapshot of the underlying engine, and because the READ_UNCOMMITTED scans also did not acquire latches, writes were able to slip in and change the intent while the scan wasn't looking. This caused the scan to throw a WriteIntentError for the new intent transaction, which badly confused other parts of the system (rightfully so).

This commit fixes this issue in a few different ways:

  1. it ensures that we always use the same iterator type (prefix or non-prefix) when retrieving the provisional values for a collection of intents retrieved by an earlier scan during READ_UNCOMMITTED operations.
  2. it adds an assertion inside of batcheval.CollectIntentRows that the function never returns a WriteIntentError. This would have caught the bug much more easily, especially back before we had the concurrency manager assertion and this bug could have materialized as stuck range lookups and potentially even deadlocked splits due to the dependency cycle between those two operations.
  3. it documents the limited guarantees that read-only engines provide with respect to consistent engine snapshots across iterator instances.

We'll want to backport this fix as far back as possible. It won't crash earlier releases of Cockroach, but as stated above, it might cause even more disastrous results. REMINDER: when backporting, remember to change the release note.

Release notes (bug fix): a bug that could cause Cockroach processes to crash due to an assertion failure with the text "expected latches held, found none" has been fixed.

Release justification: fixes a high-priority bug in existing functionality. The bug became louder (now crashes servers) due to recent changes that added new assertions into the code.

Fixes cockroachdb#47219.

This commit addresses the bug diagnosed and explained in cockroachdb#47219. In that
issue, we saw an assertion failure all the way up in the concurrency
manager because a READ_UNCOMMITTED scan was hitting a WriteIntentError,
which should not be possible. The root cause of this issue was that
READ_UNCOMMITTED scans were mixing prefix and non-prefix iterators
pulled from a read-only engine between the time that they were
collecting intent keys and they were returning to fetch the provisional
values for those keys. This mixing of iterators did not guarantee that
the two stages of the operation would observe a consistent snapshot of
the underlying engine, and because the READ_UNCOMMITTED scans also did
not acquire latches, writes were able to slip in and change the intent
while the scan wasn't looking. This caused the scan to throw a
WriteIntentError for the new intent transaction, which badly confused
other parts of the system (rightfully so).

This commit fixes this issue in a few different ways:
1. it ensures that we always use the same iterator type (prefix or non-prefix)
   when retrieving the provisional values for a collection of intents retrieved
   by an earlier scan during READ_UNCOMMITTED operations.
2. it adds an assertion inside of batcheval.CollectIntentRows that the
   function never returns a WriteIntentError. This would have caught the bug
   much more easily, especially back before we had the concurrency manager
   assertion and this bug could have materialized as stuck range lookups and
   potentially even deadlocked splits due to the dependency cycle between
   those two operations.
3. it documents the limited guarantees that read-only engines provide with
   respect to consistent engine snapshots across iterator instances.

We'll want to backport this fix as far back as possible. It won't crash
earlier releases of Cockroach, but as stated above, it might cause even
more disastrous results. REMINDER: when backporting, remember to change
the release note.

Release notes (bug fix): a bug that could cause Cockroach processes to
crash due to an assertion failure with the text "expected latches held,
found none" has been fixed.

Release justification: fixes a high-priority bug in existing
functionality. The bug became louder (now crashes servers) due to recent
changes that added new assertions into the code.
Updates documentation on:
- GetResponse.IntentValue
- ScanResponse.IntentRows
- ReverseScanResponse.IntentRows

Release justification: comment-only.
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:, clean backport

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @irfansharif)

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @ajwerner and @irfansharif)

Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong:

Reviewable status: :shipit: complete! 3 of 0 LGTMs obtained (waiting on @irfansharif)

@nvanbenschoten nvanbenschoten merged commit e593160 into cockroachdb:release-20.1 Apr 9, 2020
@nvanbenschoten nvanbenschoten deleted the backport20.1-47247 branch April 10, 2020 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants