Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tla-plus: QueryTxn on ambiguous QueryIntent failure during ParallelCommit #37900

Merged

Conversation

@nvanbenschoten
Copy link
Member

nvanbenschoten commented May 29, 2019

This PR starts by adding write pipelining and concurrent intent resolution by conflicting transactions to the parallel commits TLA+ spec. This causes an assertion to be triggered by the hazard discussed in #37866.

The assertion fires when the pre-commit QueryIntent for a pipelined write gets confused about an already resolved intent. This triggers a transaction retry, at which point the transaction record is unexpectedly already committed.

This PR then proposes a medium-term solution to #37866. In doing so, it resolved the model failure from the previous commit.

The medium-term solution is to catch IntentMissingErrors in DistSender's divideAndSendParallelCommit method coming from the pre-commit QueryIntent batch (right around here). When we see one of these errors, we immediately send a QueryTxn request to the transaction record. This will result in one of the four statuses:

  1. PENDING: Unexpected because the parallel commit EndTransactionRequest succeeded. Ignore.
  2. STAGING: Unambiguously not the issue from #37866. Ignore.
  3. COMMITTED: Unambiguously the issue from #37866. Strip the error and return the updated proto.
  4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed.

This is a significant reduction in the cases where an AmbiguousCommitError will be needed and I suspect it will be able to tide us over until we're able to eliminate the loss of information caused by intent resolution almost entirely (e.g. by storing transaction IDs in committed values). There will still be some loss of information if we're not careful about MVCC GC, and it's still not completely clear to me how we'll need to handle that in every case. That's a discussion for a different time.

This commit adds write pipelining and concurrent intent resolution
by conflicting transactions to the parallel commits spec. This causes
an assertion to be triggered by the hazard discussed in #37866.

The assertion fires when the pre-commit QueryIntent for a pipelined
write gets confused about an already resolved intent. This triggers
a transaction retry, at which point the transaction record is
unexpectedly already committed.

Running the model right now hits this issue. The next commit will
propose a solution to fixing it.

Release note: None
…mmit

This commit proposes a medium-term solution to #37866. In doing so, it
resolved the model failure from the previous commit.

The medium-term solution is to catch `IntentMissingErrors` in `DistSender`'s
`divideAndSendParallelCommit` method coming from the pre-commit QueryIntent
batch. When we see one of these errors, we immediately send a `QueryTxn`
request to the transaction record. This will result in one of the four
statuses:
1. PENDING: Unexpected because the parallel commit `EndTransactionRequest`
    succeeded. Ignore.
2. STAGING: Unambiguously not the issue from #37866. Ignore.
3. COMMITTED: Unambiguously the issue from #37866. Strip the error and
    return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError
    and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all intents
were resolved, and the transaction record was GCed.

This is a significant reduction in the cases where an AmbiguousCommitError will
be needed and I suspect it will be able to tide us over until we're able to eliminate
the loss of information caused by intent resolution almost entirely (e.g. by storing
transaction IDs in committed values). There will still be some loss of information if
we're not careful about MVCC GC, and it's still not completely clear to me how we'll
need to handle that in every case.

Release note: None
@nvanbenschoten nvanbenschoten requested review from andreimatei and tbg May 29, 2019
@cockroach-teamcity

This comment has been minimized.

Copy link
Member

cockroach-teamcity commented May 29, 2019

This change is Reviewable

@tbg
tbg approved these changes May 29, 2019
Copy link
Member

tbg left a comment

Reviewed 2 of 2 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)


docs/tla-plus/ParallelCommits/ParallelCommits.tla, line 296 at r1 (raw file):

    while to_write /= {} do
      with key \in to_write do
        if ~intent_writes[key].resolved then

Just curious what forced you to add this. You're only mutating intent_writes once in this label, so...?

This commit adds the possibility of async consensus failures when
pipelining writes in the parallel commits spec. This doesn't cause
any new issues, but expands the model to be more accurate.

The commit also documents the complications caused by information
loss during intent resolution.

Release note: None
Copy link
Member Author

nvanbenschoten left a comment

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)


docs/tla-plus/ParallelCommits/ParallelCommits.tla, line 296 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Just curious what forced you to add this. You're only mutating intent_writes once in this label, so...?

At some point I'm going to add multiple resolved states, so checking that the intent is not resolved before resolving it to be more explicit seemed like an appropriate change. You're correct that this wasn't necessary.

craig bot pushed a commit that referenced this pull request Jun 3, 2019
37900: tla-plus: QueryTxn on ambiguous QueryIntent failure during ParallelCommit r=nvanbenschoten a=nvanbenschoten

This PR starts by adding write pipelining and concurrent intent resolution by conflicting transactions to the parallel commits TLA+ spec. This causes an assertion to be triggered by the hazard discussed in #37866.

The assertion fires when the pre-commit QueryIntent for a pipelined write gets confused about an already resolved intent. This triggers a transaction retry, at which point the transaction record is unexpectedly already committed.

This PR then proposes a medium-term solution to #37866. In doing so, it resolved the model failure from the previous commit.

The medium-term solution is to catch `IntentMissingErrors` in `DistSender`'s `divideAndSendParallelCommit` method coming from the pre-commit QueryIntent batch (right around [here](https://github.com/cockroachdb/cockroach/blob/2428567cba5e0838615521cbc9d0e1310f0ee6ad/pkg/kv/dist_sender.go#L916)). When we see one of these errors, we immediately send a `QueryTxn` request to the transaction record. This will result in one of the four statuses:
1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore.
2. STAGING: Unambiguously not the issue from #37866. Ignore.
3. COMMITTED: Unambiguously the issue from #37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed.

This is a significant reduction in the cases where an AmbiguousCommitError will be needed and I suspect it will be able to tide us over until we're able to eliminate the loss of information caused by intent resolution almost entirely (e.g. by storing transaction IDs in committed values). There will still be some loss of information if we're not careful about MVCC GC, and it's still not completely clear to me how we'll need to handle that in every case. That's a discussion for a different time.

37918: roachtest: Move jepsen-initialized marker out of defer r=tbg a=bdarnell

This was running whether the test succeeded or failed, which is silly.

Fixes #37831

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Ben Darnell <ben@bendarnell.com>
@craig

This comment has been minimized.

Copy link

craig bot commented Jun 3, 2019

Build succeeded

@craig craig bot merged commit c6b641c into cockroachdb:master Jun 3, 2019
3 checks passed
3 checks passed
GitHub CI (Cockroach) TeamCity build finished
Details
bors Build succeeded
Details
license/cla Contributor License Agreement is signed.
Details
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten:nvanbenschoten/parCommitTLA branch Jun 3, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 4, 2019
…lel commit

Fixes cockroachdb#37866.

This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled
in cockroachdb#37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s
`divideAndSendParallelCommit` method coming from a parallel commit's pre-commit
QueryIntent batch. When we see one of these errors, we immediately send a
`QueryTxn` request to the transaction record. This will result in one of the
four statuses:
1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore.
2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore.
3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all
intents were resolved, and the transaction record was GCed.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 4, 2019
…lel commit

Fixes cockroachdb#37866.

This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled
in cockroachdb#37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s
`divideAndSendParallelCommit` method coming from a parallel commit's pre-commit
QueryIntent batch. When we see one of these errors, we immediately send a
`QueryTxn` request to the transaction record. This will result in one of the
four statuses:
1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore.
2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore.
3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all
intents were resolved, and the transaction record was GCed.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 4, 2019
…lel commit

Fixes cockroachdb#37866.

This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled
in cockroachdb#37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s
`divideAndSendParallelCommit` method coming from a parallel commit's pre-commit
QueryIntent batch. When we see one of these errors, we immediately send a
`QueryTxn` request to the transaction record. This will result in one of the
four statuses:
1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore.
2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore.
3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all
intents were resolved, and the transaction record was GCed.

Release note: None
craig bot pushed a commit that referenced this pull request Jun 4, 2019
37987: kv: detect if missing intent is due to intent resolution during parallel commit r=nvanbenschoten a=nvanbenschoten

Fixes #37866.

This commit implements the medium-term solution to #37866 proposed and modeled in #37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s `divideAndSendParallelCommit` method coming from a parallel commit's pre-commit QueryIntent batch. When we see one of these errors, we immediately send a `QueryTxn` request to the transaction record. This will result in one of the four statuses:
1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore.
2. STAGING: Unambiguously not the issue from #37866. Ignore.
3. COMMITTED: Unambiguously the issue from #37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
tyleroberts added a commit to tyleroberts/cockroach that referenced this pull request Jul 31, 2019
…lel commit

Fixes cockroachdb#37866.

This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled
in cockroachdb#37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s
`divideAndSendParallelCommit` method coming from a parallel commit's pre-commit
QueryIntent batch. When we see one of these errors, we immediately send a
`QueryTxn` request to the transaction record. This will result in one of the
four statuses:
1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore.
2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore.
3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all
intents were resolved, and the transaction record was GCed.

Release note: None
tyleroberts added a commit to tyleroberts/cockroach that referenced this pull request Aug 6, 2019
…lel commit

Fixes cockroachdb#37866.

This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled
in cockroachdb#37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s
`divideAndSendParallelCommit` method coming from a parallel commit's pre-commit
QueryIntent batch. When we see one of these errors, we immediately send a
`QueryTxn` request to the transaction record. This will result in one of the
four statuses:
1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore.
2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore.
3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all
intents were resolved, and the transaction record was GCed.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.