storage: don't clear lastTxnMeta on WriteIntentError to different key #32773
Good catch. You're going to be adding better unit testing and such, right?
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/intent_resolver.go, line 349 at r1 (raw file):
    // contended key (i.e. newWIErr != nil).
    contended.setLastTxnMeta(nil)
    }
Doesn't there need to be a case where we set it to nil or does it not matter?
I'm also confused generally by the semantics of `setLastTxnMeta`. For example, there's no synchronization between the callers of `add`, so the following can happen:
- read1 runs into intent of txn1, almost calls add but gets preempted
- txn1 resolves, txn2 writes another intent
- read2 runs into intent of txn2, calls add and thus setLastMeta(txn2)
- read1 continues and clobbers by calling setLastMeta(txn1)
I would likely have more questions if I really dug into the code. While you're here, I'd appreciate a round of comments on how it all works.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/intent_resolver.go, line 349 at r1 (raw file):
Nil is a dangerous value here, since it doesn't participate in cycle detection. I don't think it's ever required to set it to nil. Instead, we leave it set to the last-known transaction, so that if that transaction is pending we'll set up the cycle-detection loop, and if it's not pending we'll break out of the contention queue and retry.
I don't think that race would be a problem as described: read1 would try to push txn1, see that it's no longer pending, and be able to retry. But I'm not sure if all instances of that race would be so benign. I wouldn't be surprised if there were cases in which this could break a necessary link in the waiting graph.
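The interleaving discussed above can be walked through with a minimal sketch. The types and the `nextAction` helper here are hypothetical stand-ins, not the actual `intent_resolver.go` code, and ignoring a nil argument is an assumption made to illustrate the "never set it to nil" policy rather than the way the change implements it.

```go
package main

import (
	"fmt"
	"sync"
)

// txnStatus and txnMeta are simplified, hypothetical stand-ins for real
// transaction metadata.
type txnStatus int

const (
	pending txnStatus = iota
	finalized
)

type txnMeta struct {
	id     string
	status txnStatus
}

// contendedKey tracks the last transaction known to hold an intent on a key.
type contendedKey struct {
	mu          sync.Mutex
	lastTxnMeta *txnMeta
}

// setLastTxnMeta records the most recently observed intent holder. In this
// sketch (an assumption) a nil argument is ignored, so the key never loses
// its last-known transaction.
func (c *contendedKey) setLastTxnMeta(m *txnMeta) {
	if m == nil {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastTxnMeta = m
}

// nextAction reports what a waiter would do based on the last-known txn.
func (c *contendedKey) nextAction() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	switch {
	case c.lastTxnMeta == nil:
		return "no dependency recorded"
	case c.lastTxnMeta.status == pending:
		return "push " + c.lastTxnMeta.id + " and wait"
	default:
		return "retry: " + c.lastTxnMeta.id + " is no longer pending"
	}
}

func main() {
	txn1 := &txnMeta{id: "txn1", status: pending}
	txn2 := &txnMeta{id: "txn2", status: pending}
	key := &contendedKey{}

	// txn1 resolves, read2 records txn2, then the delayed read1 clobbers
	// the entry with the stale txn1.
	txn1.status = finalized
	key.setLastTxnMeta(txn2)
	key.setLastTxnMeta(txn1)
	fmt.Println(key.nextAction()) // retry: txn1 is no longer pending

	// A nil update does not erase the last-known transaction.
	key.setLastTxnMeta(nil)
	fmt.Println(key.nextAction()) // retry: txn1 is no longer pending
}
```

In this interleaving the stale entry is benign: the waiter finds txn1 no longer pending and retries, at which point it would observe txn2's intent.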
f89676b to 8311e97
This took way longer than expected, but I finally got a reliable reproduction of the failure in a test. It doesn't reproduce quite as reliably now that I ripped out the randomized sleeps that helped guide it in the right direction, but it still fails under stress without the fix. I wasn't able to create a scenario that reproduced the issue with fewer than 5 unique txns. Here are the steps:
There are probably a few simplifications that we could make here. The tricky part is ensuring that nothing corrected the
I agree. The change rips out all code paths that allow it.
We never need to set it to nil. It's always safe to allow the push and go on from there.
I don't think pushing will ever break links in the waiting graph. As long as we don't stop pushing and we continue handling cycles correctly in the `txnWaitQueue`, we'll eventually converge on the correct dependency graph.
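To make the cycle-detection argument concrete, here is a minimal sketch of a waits-for graph in which each push adds an edge from the waiter to the intent holder. The `waitGraph` type and its methods are hypothetical and much simpler than the real `txnWaitQueue`; the point is only that a deadlock becomes detectable once every edge is present, and stays hidden while one is missing.

```go
package main

import "fmt"

// waitGraph maps a waiting txn to the txns it is currently pushing.
// This is a hypothetical illustration, not the txnWaitQueue itself.
type waitGraph map[string][]string

// addWait records that waiter is blocked on an intent held by holder.
func (g waitGraph) addWait(waiter, holder string) {
	g[waiter] = append(g[waiter], holder)
}

// hasCycle reports whether start can reach itself by following wait edges.
func (g waitGraph) hasCycle(start string) bool {
	visited := map[string]bool{}
	var visit func(txn string) bool
	visit = func(txn string) bool {
		for _, next := range g[txn] {
			if next == start {
				return true
			}
			if !visited[next] {
				visited[next] = true
				if visit(next) {
					return true
				}
			}
		}
		return false
	}
	return visit(start)
}

func main() {
	g := waitGraph{}
	g.addWait("txnA", "txnB")
	g.addWait("txnB", "txnC")

	// With the txnC -> txnA edge missing, the deadlock is invisible.
	fmt.Println(g.hasCycle("txnA")) // false

	// Once txnC keeps pushing txnA, the cycle can be found and broken.
	g.addWait("txnC", "txnA")
	fmt.Println(g.hasCycle("txnA")) // true
}
```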
8311e97 to 3793bfa
I opened #32814 to create a test that regularly generates these kinds of scenarios.
Fixes cockroachdb#32582. This change removes a faulty optimization in the `contentionQueue`. The optimization removed the txnMeta associated with a contended key in the queue when it found a `WriteIntentError` from a different key. It didn't take into account that this error could be from an earlier request within the same batch, meaning that we can't make any assumptions about the state of the previously contended intent simply because we see a different `WriteIntentError`. Release note (bug fix): Fix a bug where metadata about contended keys was inadvertently ignored, allowing for a failure in txn cycle detection and transaction deadlocks in rare cases.
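The failure mode in the commit message can be sketched as follows, assuming a batch is evaluated in request order and stops at the first intent it hits. All names here (`writeIntentError`, `evalBatch`, `nextLastTxn`, the `clearOnOtherKey` flag) are hypothetical and only model the different-key case; the nil-error case is out of scope and simply keeps the previous value.

```go
package main

import "fmt"

// writeIntentError is a simplified, hypothetical stand-in for the real error.
type writeIntentError struct {
	key string // key on which the intent was found
	txn string // txn that holds the intent
}

// evalBatch evaluates the batch's keys in order and returns an error for the
// first key that carries an intent (an assumption of this sketch).
func evalBatch(keys []string, intents map[string]string) *writeIntentError {
	for _, k := range keys {
		if txn, ok := intents[k]; ok {
			return &writeIntentError{key: k, txn: txn}
		}
	}
	return nil
}

// nextLastTxn returns the txn to remember for contendedKey after a retry.
// clearOnOtherKey models the removed optimization.
func nextLastTxn(contendedKey, last string, err *writeIntentError, clearOnOtherKey bool) string {
	switch {
	case err != nil && err.key == contendedKey:
		return err.txn
	case err != nil && clearOnOtherKey:
		return "" // faulty: assumes the intent on contendedKey is gone
	default:
		return last
	}
}

func main() {
	// The batch touches "a" then "b". We were queued on "b" behind txn2,
	// and an intent from txn3 has since appeared on "a".
	intents := map[string]string{"a": "txn3", "b": "txn2"}
	err := evalBatch([]string{"a", "b"}, intents)
	fmt.Println(err.key, err.txn) // a txn3

	// The error came from an earlier request, yet "b" is still held by txn2.
	fmt.Printf("%q\n", nextLastTxn("b", "txn2", err, true))  // ""
	fmt.Printf("%q\n", nextLastTxn("b", "txn2", err, false)) // "txn2"
}
```

With the optimization enabled, the dependency on txn2 is dropped even though its intent on "b" was never examined, which is the kind of lost metadata the release note describes.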
3793bfa to 1252ac7
Reviewed 2 of 2 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained
TFTR! bors r+
32773: storage: don't clear lastTxnMeta on WriteIntentError to different key r=nvanbenschoten a=nvanbenschoten

Fixes #32582.

This change removes a faulty optimization in the `contentionQueue`. The optimization removed the txnMeta associated with a contended key in the queue when it found a `WriteIntentError` from a different key. It didn't take into account that this error could be from an earlier request within the same batch, meaning that we can't make any assumptions about the state of the previously contended intent simply because we see a different `WriteIntentError`.

Release note (bug fix): Fix a bug where metadata about contended keys was inadvertently ignored, allowing for a failure in txn cycle detection and transaction deadlocks in rare cases.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Build succeeded
Fixes #32582.
This change removes a faulty optimization in the `contentionQueue`. The optimization removed the txnMeta associated with a contended key in the queue when it found a `WriteIntentError` from a different key. It didn't take into account that this error could be from an earlier request within the same batch, meaning that we can't make any assumptions about the state of the previously contended intent simply because we see a different `WriteIntentError`.

Release note (bug fix): Fix a bug where metadata about contended keys was inadvertently ignored, allowing for a failure in txn cycle detection and transaction deadlocks in rare cases.