New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tla-plus: QueryTxn on ambiguous QueryIntent failure during ParallelCommit #37900
tla-plus: QueryTxn on ambiguous QueryIntent failure during ParallelCommit #37900
Conversation
This commit adds write pipelining and concurrent intent resolution by conflicting transactions to the parallel commits spec. This causes an assertion to be triggered by the hazard discussed in cockroachdb#37866. The assertion fires when the pre-commit QueryIntent for a pipelined write gets confused about an already resolved intent. This triggers a transaction retry, at which point the transaction record is unexpectedly already committed. Running the model right now hits this issue. The next commit will propose a solution to fixing it. Release note: None
…mmit This commit proposes a medium-term solution to cockroachdb#37866. In doing so, it resolved the model failure from the previous commit. The medium-term solution is to catch `IntentMissingErrors` in `DistSender`'s `divideAndSendParallelCommit` method coming from the pre-commit QueryIntent batch. When we see one of these errors, we immediately send a `QueryTxn` request to the transaction record. This will result in one of the four statuses: 1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore. 2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore. 3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto. 4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return. This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed. This is a significant reduction in the cases where an AmbiguousCommitError will be needed and I suspect it will be able to tide us over until we're able to eliminate the loss of information caused by intent resolution almost entirely (e.g. by storing transaction IDs in committed values). There will still be some loss of information if we're not careful about MVCC GC, and it's still not completely clear to me how we'll need to handle that in every case. Release note: None
Reviewed 2 of 2 files at r1, 1 of 1 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
docs/tla-plus/ParallelCommits/ParallelCommits.tla, line 296 at r1 (raw file):
while to_write /= {} do with key \in to_write do if ~intent_writes[key].resolved then
Just curious what forced you to add this. You're only mutating intent_writes once in this label, so...?
This commit adds the possibility of async consensus failures when pipelining writes in the parallel commits spec. This doesn't cause any new issues, but expands the model to be more accurate. The commit also documents the complications caused by information loss during intent resolution. Release note: None
TFTR!
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)
docs/tla-plus/ParallelCommits/ParallelCommits.tla, line 296 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Just curious what forced you to add this. You're only mutating intent_writes once in this label, so...?
At some point I'm going to add multiple resolved states, so checking that the intent is not resolved before resolving it to be more explicit seemed like an appropriate change. You're correct that this wasn't necessary.
37900: tla-plus: QueryTxn on ambiguous QueryIntent failure during ParallelCommit r=nvanbenschoten a=nvanbenschoten This PR starts by adding write pipelining and concurrent intent resolution by conflicting transactions to the parallel commits TLA+ spec. This causes an assertion to be triggered by the hazard discussed in #37866. The assertion fires when the pre-commit QueryIntent for a pipelined write gets confused about an already resolved intent. This triggers a transaction retry, at which point the transaction record is unexpectedly already committed. This PR then proposes a medium-term solution to #37866. In doing so, it resolved the model failure from the previous commit. The medium-term solution is to catch `IntentMissingErrors` in `DistSender`'s `divideAndSendParallelCommit` method coming from the pre-commit QueryIntent batch (right around [here](https://github.com/cockroachdb/cockroach/blob/2428567cba5e0838615521cbc9d0e1310f0ee6ad/pkg/kv/dist_sender.go#L916)). When we see one of these errors, we immediately send a `QueryTxn` request to the transaction record. This will result in one of the four statuses: 1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore. 2. STAGING: Unambiguously not the issue from #37866. Ignore. 3. COMMITTED: Unambiguously the issue from #37866. Strip the error and return the updated proto. 4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return. This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed. This is a significant reduction in the cases where an AmbiguousCommitError will be needed and I suspect it will be able to tide us over until we're able to eliminate the loss of information caused by intent resolution almost entirely (e.g. by storing transaction IDs in committed values). There will still be some loss of information if we're not careful about MVCC GC, and it's still not completely clear to me how we'll need to handle that in every case. That's a discussion for a different time. 37918: roachtest: Move jepsen-initialized marker out of defer r=tbg a=bdarnell This was running whether the test succeeded or failed, which is silly. Fixes #37831 Release note: None Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Ben Darnell <ben@bendarnell.com>
Build succeeded |
…lel commit Fixes cockroachdb#37866. This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled in cockroachdb#37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s `divideAndSendParallelCommit` method coming from a parallel commit's pre-commit QueryIntent batch. When we see one of these errors, we immediately send a `QueryTxn` request to the transaction record. This will result in one of the four statuses: 1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore. 2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore. 3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto. 4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return. This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed. Release note: None
…lel commit Fixes cockroachdb#37866. This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled in cockroachdb#37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s `divideAndSendParallelCommit` method coming from a parallel commit's pre-commit QueryIntent batch. When we see one of these errors, we immediately send a `QueryTxn` request to the transaction record. This will result in one of the four statuses: 1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore. 2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore. 3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto. 4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return. This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed. Release note: None
37987: kv: detect if missing intent is due to intent resolution during parallel commit r=nvanbenschoten a=nvanbenschoten Fixes #37866. This commit implements the medium-term solution to #37866 proposed and modeled in #37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s `divideAndSendParallelCommit` method coming from a parallel commit's pre-commit QueryIntent batch. When we see one of these errors, we immediately send a `QueryTxn` request to the transaction record. This will result in one of the four statuses: 1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore. 2. STAGING: Unambiguously not the issue from #37866. Ignore. 3. COMMITTED: Unambiguously the issue from #37866. Strip the error and return the updated proto. 4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return. This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed. Release note: None Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
This PR starts by adding write pipelining and concurrent intent resolution by conflicting transactions to the parallel commits TLA+ spec. This causes an assertion to be triggered by the hazard discussed in #37866.
The assertion fires when the pre-commit QueryIntent for a pipelined write gets confused about an already resolved intent. This triggers a transaction retry, at which point the transaction record is unexpectedly already committed.
This PR then proposes a medium-term solution to #37866. In doing so, it resolved the model failure from the previous commit.
The medium-term solution is to catch
IntentMissingErrors
inDistSender
'sdivideAndSendParallelCommit
method coming from the pre-commit QueryIntent batch (right around here). When we see one of these errors, we immediately send aQueryTxn
request to the transaction record. This will result in one of the four statuses:EndTransactionRequest
succeeded. Ignore.This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed.
This is a significant reduction in the cases where an AmbiguousCommitError will be needed and I suspect it will be able to tide us over until we're able to eliminate the loss of information caused by intent resolution almost entirely (e.g. by storing transaction IDs in committed values). There will still be some loss of information if we're not careful about MVCC GC, and it's still not completely clear to me how we'll need to handle that in every case. That's a discussion for a different time.