Merged
Merged

# tla-plus: QueryTxn on ambiguous QueryIntent failure during ParallelCommit#37900

+478 −220

## Conversation

Member

### nvanbenschoten commented May 29, 2019

 This PR starts by adding write pipelining and concurrent intent resolution by conflicting transactions to the parallel commits TLA+ spec. This causes an assertion to be triggered by the hazard discussed in #37866. The assertion fires when the pre-commit QueryIntent for a pipelined write gets confused about an already resolved intent. This triggers a transaction retry, at which point the transaction record is unexpectedly already committed. This PR then proposes a medium-term solution to #37866. In doing so, it resolved the model failure from the previous commit. The medium-term solution is to catch IntentMissingErrors in DistSender's divideAndSendParallelCommit method coming from the pre-commit QueryIntent batch (right around here). When we see one of these errors, we immediately send a QueryTxn request to the transaction record. This will result in one of the four statuses: PENDING: Unexpected because the parallel commit EndTransactionRequest succeeded. Ignore. STAGING: Unambiguously not the issue from #37866. Ignore. COMMITTED: Unambiguously the issue from #37866. Strip the error and return the updated proto. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return. This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed. This is a significant reduction in the cases where an AmbiguousCommitError will be needed and I suspect it will be able to tide us over until we're able to eliminate the loss of information caused by intent resolution almost entirely (e.g. by storing transaction IDs in committed values). There will still be some loss of information if we're not careful about MVCC GC, and it's still not completely clear to me how we'll need to handle that in every case. That's a discussion for a different time.
 tla-plus: model pipelined writes and concurrent intent resolution 
 74c36e5 
This commit adds write pipelining and concurrent intent resolution
by conflicting transactions to the parallel commits spec. This causes
an assertion to be triggered by the hazard discussed in #37866.

The assertion fires when the pre-commit QueryIntent for a pipelined
a transaction retry, at which point the transaction record is

Running the model right now hits this issue. The next commit will
propose a solution to fixing it.

Release note: None
 tla-plus: QueryTxn on ambiguous QueryIntent failure during ParallelCo… 
 8dc06b6 
…mmit

This commit proposes a medium-term solution to #37866. In doing so, it
resolved the model failure from the previous commit.

The medium-term solution is to catch IntentMissingErrors in DistSender's
divideAndSendParallelCommit method coming from the pre-commit QueryIntent
batch. When we see one of these errors, we immediately send a QueryTxn
request to the transaction record. This will result in one of the four
statuses:
1. PENDING: Unexpected because the parallel commit EndTransactionRequest
succeeded. Ignore.
2. STAGING: Unambiguously not the issue from #37866. Ignore.
3. COMMITTED: Unambiguously the issue from #37866. Strip the error and
return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError
and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all intents
were resolved, and the transaction record was GCed.

This is a significant reduction in the cases where an AmbiguousCommitError will
be needed and I suspect it will be able to tide us over until we're able to eliminate
the loss of information caused by intent resolution almost entirely (e.g. by storing
transaction IDs in committed values). There will still be some loss of information if
we're not careful about MVCC GC, and it's still not completely clear to me how we'll
need to handle that in every case.

Release note: None
requested review from andreimatei and tbg May 29, 2019
Member

### cockroach-teamcity commented May 29, 2019

 This change is
mentioned this pull request May 29, 2019
approved these changes
Member

### tbg left a comment

 Reviewed 2 of 2 files at r1, 1 of 1 files at r2. Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten) docs/tla-plus/ParallelCommits/ParallelCommits.tla, line 296 at r1 (raw file):  while to_write /= {} do with key \in to_write do if ~intent_writes[key].resolved then Just curious what forced you to add this. You're only mutating intent_writes once in this label, so...?
 tla-plus: model pipelined write failures during async consensus 
 c6b641c 
This commit adds the possibility of async consensus failures when
pipelining writes in the parallel commits spec. This doesn't cause
any new issues, but expands the model to be more accurate.

The commit also documents the complications caused by information
loss during intent resolution.

Release note: None
reviewed
Member Author

### nvanbenschoten left a comment

 TFTR! bors r+ Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg) docs/tla-plus/ParallelCommits/ParallelCommits.tla, line 296 at r1 (raw file): Previously, tbg (Tobias Grieger) wrote… Just curious what forced you to add this. You're only mutating intent_writes once in this label, so...? At some point I'm going to add multiple resolved states, so checking that the intent is not resolved before resolving it to be more explicit seemed like an appropriate change. You're correct that this wasn't necessary.
bot pushed a commit that referenced this pull request Jun 3, 2019
 Merge #37900 #37918 
 22da7da 
37900: tla-plus: QueryTxn on ambiguous QueryIntent failure during ParallelCommit r=nvanbenschoten a=nvanbenschoten

This PR starts by adding write pipelining and concurrent intent resolution by conflicting transactions to the parallel commits TLA+ spec. This causes an assertion to be triggered by the hazard discussed in #37866.

The assertion fires when the pre-commit QueryIntent for a pipelined write gets confused about an already resolved intent. This triggers a transaction retry, at which point the transaction record is unexpectedly already committed.

This PR then proposes a medium-term solution to #37866. In doing so, it resolved the model failure from the previous commit.

The medium-term solution is to catch IntentMissingErrors in DistSender's divideAndSendParallelCommit method coming from the pre-commit QueryIntent batch (right around [here](https://github.com/cockroachdb/cockroach/blob/2428567cba5e0838615521cbc9d0e1310f0ee6ad/pkg/kv/dist_sender.go#L916)). When we see one of these errors, we immediately send a QueryTxn request to the transaction record. This will result in one of the four statuses:
1. PENDING: Unexpected because the parallel commit EndTransactionRequest succeeded. Ignore.
2. STAGING: Unambiguously not the issue from #37866. Ignore.
3. COMMITTED: Unambiguously the issue from #37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed.

This is a significant reduction in the cases where an AmbiguousCommitError will be needed and I suspect it will be able to tide us over until we're able to eliminate the loss of information caused by intent resolution almost entirely (e.g. by storing transaction IDs in committed values). There will still be some loss of information if we're not careful about MVCC GC, and it's still not completely clear to me how we'll need to handle that in every case. That's a discussion for a different time.

37918: roachtest: Move jepsen-initialized marker out of defer r=tbg a=bdarnell

This was running whether the test succeeded or failed, which is silly.

Fixes #37831

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Ben Darnell <ben@bendarnell.com>

# Build succeeded

bot merged commit c6b641c into cockroachdb:master Jun 3, 2019
3 checks passed
3 checks passed
GitHub CI (Cockroach) TeamCity build finished
Details
bors Build succeeded
Details
Details
deleted the nvanbenschoten:nvanbenschoten/parCommitTLA branch Jun 3, 2019
added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 4, 2019
 kv: detect if missing intent is due to intent resolution during paral… 
 dc8c2b8 
…lel commit

Fixes cockroachdb#37866.

This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled
in cockroachdb#37900. The solution is to catch IntentMissingErrors in DistSender's
divideAndSendParallelCommit method coming from a parallel commit's pre-commit
QueryIntent batch. When we see one of these errors, we immediately send a
QueryTxn request to the transaction record. This will result in one of the
four statuses:
1. PENDING: Unexpected because the parallel commit EndTransactionRequest succeeded. Ignore.
2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore.
3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all
intents were resolved, and the transaction record was GCed.

Release note: None
added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 4, 2019
 kv: detect if missing intent is due to intent resolution during paral… 
 0d9a5b8 
…lel commit

Fixes cockroachdb#37866.

This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled
in cockroachdb#37900. The solution is to catch IntentMissingErrors in DistSender's
divideAndSendParallelCommit method coming from a parallel commit's pre-commit
QueryIntent batch. When we see one of these errors, we immediately send a
QueryTxn request to the transaction record. This will result in one of the
four statuses:
1. PENDING: Unexpected because the parallel commit EndTransactionRequest succeeded. Ignore.
2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore.
3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all
intents were resolved, and the transaction record was GCed.

Release note: None
mentioned this pull request Jun 4, 2019
added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 4, 2019
 kv: detect if missing intent is due to intent resolution during paral… 
 a83d408 
…lel commit

Fixes cockroachdb#37866.

This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled
in cockroachdb#37900. The solution is to catch IntentMissingErrors in DistSender's
divideAndSendParallelCommit method coming from a parallel commit's pre-commit
QueryIntent batch. When we see one of these errors, we immediately send a
QueryTxn request to the transaction record. This will result in one of the
four statuses:
1. PENDING: Unexpected because the parallel commit EndTransactionRequest succeeded. Ignore.
2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore.
3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all
intents were resolved, and the transaction record was GCed.

Release note: None
bot pushed a commit that referenced this pull request Jun 4, 2019
 Merge #37987 
 fe1a38c 
37987: kv: detect if missing intent is due to intent resolution during parallel commit r=nvanbenschoten a=nvanbenschoten

Fixes #37866.

This commit implements the medium-term solution to #37866 proposed and modeled in #37900. The solution is to catch IntentMissingErrors in DistSender's divideAndSendParallelCommit method coming from a parallel commit's pre-commit QueryIntent batch. When we see one of these errors, we immediately send a QueryTxn request to the transaction record. This will result in one of the four statuses:
1. PENDING: Unexpected because the parallel commit EndTransactionRequest succeeded. Ignore.
2. STAGING: Unambiguously not the issue from #37866. Ignore.
3. COMMITTED: Unambiguously the issue from #37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
added a commit to tyleroberts/cockroach that referenced this pull request Jul 31, 2019
 kv: detect if missing intent is due to intent resolution during paral… 
 7fc2370 
…lel commit

Fixes cockroachdb#37866.

This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled
in cockroachdb#37900. The solution is to catch IntentMissingErrors in DistSender's
divideAndSendParallelCommit method coming from a parallel commit's pre-commit
QueryIntent batch. When we see one of these errors, we immediately send a
QueryTxn request to the transaction record. This will result in one of the
four statuses:
1. PENDING: Unexpected because the parallel commit EndTransactionRequest succeeded. Ignore.
2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore.
3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all
intents were resolved, and the transaction record was GCed.

Release note: None
added a commit to tyleroberts/cockroach that referenced this pull request Aug 6, 2019
 kv: detect if missing intent is due to intent resolution during paral… 
 c3e978c 
…lel commit

Fixes cockroachdb#37866.

This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled
in cockroachdb#37900. The solution is to catch IntentMissingErrors in DistSender's
divideAndSendParallelCommit method coming from a parallel commit's pre-commit
QueryIntent batch. When we see one of these errors, we immediately send a
QueryTxn request to the transaction record. This will result in one of the
four statuses:
1. PENDING: Unexpected because the parallel commit EndTransactionRequest succeeded. Ignore.
2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore.
3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all
intents were resolved, and the transaction record was GCed.

Release note: None