Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: don't disable the merge queue needlessly in more tests #46431

Merged
merged 1 commit into from Mar 24, 2020

Conversation

nvanbenschoten
Copy link
Member

Follow up to #46383.

These tests were disabling the queue to not interfere with its
AdminSplits, but since the tests were written, AdminSplit got
a TTL.

Release note: None
Release justification: test only

Follow up to cockroachdb#46383.

These tests were disabling the queue to not interfere with its
AdminSplits, but since the tests were written, AdminSplit got
a TTL.

Release note: None
Release justification: test only
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)

@nvanbenschoten
Copy link
Member Author

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 23, 2020

Build failed (retrying...)

@craig
Copy link
Contributor

craig bot commented Mar 24, 2020

Build succeeded

@craig craig bot merged commit 21e6fa6 into cockroachdb:master Mar 24, 2020
andreimatei added a commit to andreimatei/cockroach that referenced this pull request Mar 25, 2020
The scenario that this patch addresses is the following (from cockroachdb#46431):
1. txn1 sends Put(a) + Put(b) + EndTxn
2. DistSender splits the Put(a) from the rest.
3. Put(a) succeeds, but the rest catches some retriable error.
4. TxnCoordSender gets the retriable error. The fact that a sub-batch
  succeeded is lost. We used to care about that fact, but we've
  successively gotten rid of that tracking across cockroachdb#35140 and cockroachdb#44661.
5. we refresh everything that came before this batch. The refresh
  succeeds.
6. we re-send the batch. It gets split again. The part with the EndTxn
  executes first. The transaction is now STAGING. More than that, the txn
  is in fact implicitly committed - the intent on a is already there since
  the previous attempt and, because it's at a lower timestamp than the txn
  record, it counts as golden for the purposes of verifying the implicit
  commit condition.
7. some other transaction wonders in, sees that txn1 is in its way, and
  transitions it to explicitly committed.
8. the Put(a) now tries to evaluate. It gets really confused. I guess
  that different things can happen; none of them good. One thing that I
  believe we've observed in cockroachdb#46299 is that, if there's another txn's
  intent there already, the Put will try to push it, enter the
  txnWaitQueue, eventually observe that its own txn is committed and
  return an error. The client thus gets an error (and a non-ambiguous one
  to boot) although the txn is committed. Even worse perhaps, I think it's
  possible for a request to return wrong results instead of an error.

This patch fixes it by inhibiting the parallel commit when the EndTxn
batch is retried. This way, there's never a STAGING record.

Release note (bug fix): A rare bug causing errors to be returned for
successfully committed transactions was fixed. The most common error
message was "TransactionStatusError: already committed".

Release justification: serious bug fix

Fixes cockroachdb#46341
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/mergeQueue branch March 30, 2020 19:22
andreimatei added a commit to andreimatei/cockroach that referenced this pull request Mar 31, 2020
The scenario that this patch addresses is the following (from cockroachdb#46431):
1. txn1 sends Put(a) + Put(b) + EndTxn
2. DistSender splits the Put(a) from the rest.
3. Put(a) succeeds, but the rest catches some retriable error.
4. TxnCoordSender gets the retriable error. The fact that a sub-batch
  succeeded is lost. We used to care about that fact, but we've
  successively gotten rid of that tracking across cockroachdb#35140 and cockroachdb#44661.
5. we refresh everything that came before this batch. The refresh
  succeeds.
6. we re-send the batch. It gets split again. The part with the EndTxn
  executes first. The transaction is now STAGING. More than that, the txn
  is in fact implicitly committed - the intent on a is already there since
  the previous attempt and, because it's at a lower timestamp than the txn
  record, it counts as golden for the purposes of verifying the implicit
  commit condition.
7. some other transaction wonders in, sees that txn1 is in its way, and
  transitions it to explicitly committed.
8. the Put(a) now tries to evaluate. It gets really confused. I guess
  that different things can happen; none of them good. One thing that I
  believe we've observed in cockroachdb#46299 is that, if there's another txn's
  intent there already, the Put will try to push it, enter the
  txnWaitQueue, eventually observe that its own txn is committed and
  return an error. The client thus gets an error (and a non-ambiguous one
  to boot) although the txn is committed. Even worse perhaps, I think it's
  possible for a request to return wrong results instead of an error.

This patch fixes it by inhibiting the parallel commit when the EndTxn
batch is retried. This way, there's never a STAGING record.

Release note (bug fix): A rare bug causing errors to be returned for
successfully committed transactions was fixed. The most common error
message was "TransactionStatusError: already committed".

Release justification: serious bug fix

Fixes cockroachdb#46341
andreimatei added a commit to andreimatei/cockroach that referenced this pull request Mar 31, 2020
The scenario that this patch addresses is the following (from cockroachdb#46431):
1. txn1 sends Put(a) + Put(b) + EndTxn
2. DistSender splits the Put(a) from the rest.
3. Put(a) succeeds, but the rest catches some retriable error.
4. TxnCoordSender gets the retriable error. The fact that a sub-batch
  succeeded is lost. We used to care about that fact, but we've
  successively gotten rid of that tracking across cockroachdb#35140 and cockroachdb#44661.
5. we refresh everything that came before this batch. The refresh
  succeeds.
6. we re-send the batch. It gets split again. The part with the EndTxn
  executes first. The transaction is now STAGING. More than that, the txn
  is in fact implicitly committed - the intent on a is already there since
  the previous attempt and, because it's at a lower timestamp than the txn
  record, it counts as golden for the purposes of verifying the implicit
  commit condition.
7. some other transaction wonders in, sees that txn1 is in its way, and
  transitions it to explicitly committed.
8. the Put(a) now tries to evaluate. It gets really confused. I guess
  that different things can happen; none of them good. One thing that I
  believe we've observed in cockroachdb#46299 is that, if there's another txn's
  intent there already, the Put will try to push it, enter the
  txnWaitQueue, eventually observe that its own txn is committed and
  return an error. The client thus gets an error (and a non-ambiguous one
  to boot) although the txn is committed. Even worse perhaps, I think it's
  possible for a request to return wrong results instead of an error.

This patch fixes it by inhibiting the parallel commit when the EndTxn
batch is retried. This way, there's never a STAGING record.

Release note (bug fix): A rare bug causing errors to be returned for
successfully committed transactions was fixed. The most common error
message was "TransactionStatusError: already committed".

Release justification: serious bug fix

Fixes cockroachdb#46341
andreimatei added a commit to andreimatei/cockroach that referenced this pull request Apr 1, 2020
The scenario that this patch addresses is the following (from cockroachdb#46431):
1. txn1 sends Put(a) + Put(b) + EndTxn
2. DistSender splits the Put(a) from the rest.
3. Put(a) succeeds, but the rest catches some retriable error.
4. TxnCoordSender gets the retriable error. The fact that a sub-batch
  succeeded is lost. We used to care about that fact, but we've
  successively gotten rid of that tracking across cockroachdb#35140 and cockroachdb#44661.
5. we refresh everything that came before this batch. The refresh
  succeeds.
6. we re-send the batch. It gets split again. The part with the EndTxn
  executes first. The transaction is now STAGING. More than that, the txn
  is in fact implicitly committed - the intent on a is already there since
  the previous attempt and, because it's at a lower timestamp than the txn
  record, it counts as golden for the purposes of verifying the implicit
  commit condition.
7. some other transaction wonders in, sees that txn1 is in its way, and
  transitions it to explicitly committed.
8. the Put(a) now tries to evaluate. It gets really confused. I guess
  that different things can happen; none of them good. One thing that I
  believe we've observed in cockroachdb#46299 is that, if there's another txn's
  intent there already, the Put will try to push it, enter the
  txnWaitQueue, eventually observe that its own txn is committed and
  return an error. The client thus gets an error (and a non-ambiguous one
  to boot) although the txn is committed. Even worse perhaps, I think it's
  possible for a request to return wrong results instead of an error.

This patch fixes it by inhibiting the parallel commit when the EndTxn
batch is retried. This way, there's never a STAGING record.

Release note (bug fix): A rare bug causing errors to be returned for
successfully committed transactions was fixed. The most common error
message was "TransactionStatusError: already committed".

Release justification: serious bug fix

Fixes cockroachdb#46341
craig bot pushed a commit that referenced this pull request Apr 1, 2020
46596: kvclient/kvcoord: inhibit parallel commit when retrying EndTxn request r=andreimatei a=andreimatei

The scenario that this patch addresses is the following (from #46431):
1. txn1 sends Put(a) + Put(b) + EndTxn
2. DistSender splits the Put(a) from the rest.
3. Put(a) succeeds, but the rest catches some retriable error.
4. TxnCoordSender gets the retriable error. The fact that a sub-batch
  succeeded is lost. We used to care about that fact, but we've
  successively gotten rid of that tracking across #35140 and #44661.
5. we refresh everything that came before this batch. The refresh
  succeeds.
6. we re-send the batch. It gets split again. The part with the EndTxn
  executes first. The transaction is now STAGING. More than that, the txn
  is in fact implicitly committed - the intent on a is already there since
  the previous attempt and, because it's at a lower timestamp than the txn
  record, it counts as golden for the purposes of verifying the implicit
  commit condition.
7. some other transaction wonders in, sees that txn1 is in its way, and
  transitions it to explicitly committed.
8. the Put(a) now tries to evaluate. It gets really confused. I guess
  that different things can happen; none of them good. One thing that I
  believe we've observed in #46299 is that, if there's another txn's
  intent there already, the Put will try to push it, enter the
  txnWaitQueue, eventually observe that its own txn is committed and
  return an error. The client thus gets an error (and a non-ambiguous one
  to boot) although the txn is committed. Even worse perhaps, I think it's
  possible for a request to return wrong results instead of an error.

This patch fixes it by inhibiting the parallel commit when the EndTxn
batch is retried. This way, there's never a STAGING record.

Release note (bug fix): A rare bug causing errors to be returned for
successfully committed transactions was fixed. The most common error
message was "TransactionStatusError: already committed".

Release justification: serious bug fix

Fixes #46341

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants