sql: tpcc returns retriable errors to the user #28898

Closed
jordanlewis opened this issue Aug 21, 2018 · 10 comments · Fixed by #29455
Labels
A-kv-client Relating to the KV client and the KV interface. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

@jordanlewis
Member

Running TPCC with latest master sometimes produces retriable errors that get propagated all the way up to the user, like the "replay txn" error.

@asubiotto theorizes that this could be because of #24798.

@andreimatei, what do you think?

@jordanlewis jordanlewis added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. labels Aug 21, 2018
@jordanlewis jordanlewis added this to the 2.1 milestone Aug 21, 2018
@jordanlewis jordanlewis added this to Triage in (DEPRECATED) SQL execution via automation Aug 21, 2018
@andreimatei
Contributor

I don't think "replay txn" errors can be caused by #24798; this sounds like #28270.

@jordanlewis jordanlewis moved this from Triage to Bugfix milestone in (DEPRECATED) SQL execution Aug 21, 2018
@jordanlewis jordanlewis self-assigned this Aug 21, 2018
@jordanlewis
Member Author

I've added more logging to the client - soon we'll know what pgerror code these errors are returning. Plenty of errors are being returned properly as 40001 - it's just a matter of tracking down which ones aren't.

@jordanlewis
Member Author

As far as I can tell, the only spurious error that gets returned is replay txn, which is marked as XX000 but is caused by a separate issue: see #28389.

I could have sworn I saw a non-replay-txn error pop up, but now I can't reproduce it. I'll keep trying though.
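For context on why the code matters: a typical client retry loop only retries on SQLSTATE 40001 (serialization_failure), so a retryable error surfaced with any other code (like the XX000 above) escapes the loop and hits the application. Below is a minimal sketch of such a loop using database/sql and lib/pq; the helper names and overall structure are illustrative, not taken from this issue.

// A minimal client-side retry loop, sketched with database/sql and lib/pq.
// Nothing here is CockroachDB-internal code; the helper names are made up
// for illustration. The point: only SQLSTATE 40001 is treated as retryable,
// so a retryable error that arrives with a different code (e.g. XX000)
// escapes the loop and reaches the application.
package retryexample

import (
	"database/sql"

	"github.com/lib/pq"
)

// runWithRetry runs fn inside a transaction and retries it whenever the
// error reported by the server carries the serialization_failure code.
func runWithRetry(db *sql.DB, fn func(*sql.Tx) error) error {
	for {
		tx, err := db.Begin()
		if err != nil {
			return err
		}
		if err := fn(tx); err != nil {
			_ = tx.Rollback()
			if isRetryable(err) {
				continue // retryable: start the transaction over
			}
			return err
		}
		if err := tx.Commit(); err != nil {
			if isRetryable(err) {
				continue
			}
			return err
		}
		return nil
	}
}

// isRetryable reports whether err is a PostgreSQL-protocol error with
// SQLSTATE 40001 (serialization_failure).
func isRetryable(err error) bool {
	if pqErr, ok := err.(*pq.Error); ok {
		return pqErr.Code == "40001"
	}
	return false
}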

@jordanlewis
Member Author

I've figured out the issue. Errors emitted from remote flows don't get properly unwrapped and turned into pgwire errors. I've written a test that demonstrates this:

root@localhost:26257/kv> select *, crdb_internal.force_retry('100ms') from kv limit 1;
pq: HandledRetryableTxnError: forced by crdb_internal.force_retry()

This should automatically retry, but it doesn't, since I've forced the kv table's data onto a remote node.

What I don't understand is why this hasn't always been a problem. Nothing about this was changed during the DistSQL merge. Perhaps another error reporting mechanism was changed recently?
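To make the failure mode concrete, here is a rough sketch of the translation that has to happen at the pgwire boundary. The types are hypothetical stand-ins, not CockroachDB's actual ones; the idea is that if a retryable error comes back from a remote flow in a form the translator doesn't recognize, it falls through to the generic internal-error code.

// A minimal sketch (not CockroachDB's actual code) of the mapping that has
// to happen before errors reach the client: retryable transaction errors
// must be translated to SQLSTATE 40001. retryableTxnError and pgError are
// hypothetical stand-ins for the real internal types.
package example

import "errors"

// retryableTxnError stands in for an internal retryable error type such as
// HandledRetryableTxnError.
type retryableTxnError struct{ msg string }

func (e *retryableTxnError) Error() string { return e.msg }

// pgError is a hypothetical pgwire-level error with a SQLSTATE code.
type pgError struct {
	Code    string
	Message string
}

func (e *pgError) Error() string { return e.Message }

// toPGError converts an error produced by query execution into something
// with the right SQLSTATE. If a retryable error arrives here without being
// recognized (for example, coming back from a remote DistSQL flow without
// being unwrapped), it falls through to the generic internal-error code --
// which is the symptom described in this issue.
func toPGError(err error) *pgError {
	var rte *retryableTxnError
	if errors.As(err, &rte) {
		return &pgError{Code: "40001", Message: rte.Error()}
	}
	return &pgError{Code: "XX000", Message: err.Error()}
}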

@jordanlewis
Member Author

Looks like this has always been a problem. I confirmed that it reproduces on release-2.0. The next question is why it didn't come up in TPCC back then. Perhaps these sorts of errors were inadvertently made more likely.

@jordanlewis
Member Author

The root cause seems to be that the TxnCoordSender is currently misconfigured to return a HandledRetryableTxnError even in the "leaf txn" configuration, which confuses downstream components into not returning a retryable error.

We found this by adding the following diff:

diff --git a/pkg/kv/txn_coord_sender.go b/pkg/kv/txn_coord_sender.go
index 5b34d5935f..299894fbab 100644
--- a/pkg/kv/txn_coord_sender.go
+++ b/pkg/kv/txn_coord_sender.go
@@ -21,6 +21,8 @@ import (
 
        "github.com/pkg/errors"
 
+       "runtime/debug"
+
        "github.com/cockroachdb/cockroach/pkg/base"
        "github.com/cockroachdb/cockroach/pkg/internal/client"
        "github.com/cockroachdb/cockroach/pkg/keys"
@@ -711,6 +713,9 @@ func (tc *TxnCoordSender) maybeRejectClientLocked(
                        // priority is not used for aborted errors
                        roachpb.NormalUserPriority,
                        tc.clock)
+               if tc.typ == client.LeafTxn {
+                       log.Infof(ctx, "ALERT %s", debug.Stack())
+               }
                return roachpb.NewError(roachpb.NewHandledRetryableTxnError(
                        abortedErr.Message, tc.mu.txn.ID, newTxn))
        }

which produced a stack trace on an in-the-wild cluster:

I180822 19:47:31.791747 634332 kv/txn_coord_sender.go:717  [n1] ALERT goroutine 634332 [running]:
runtime/debug.Stack(0x2f4e2c0, 0xc4517f0570, 0xc42e47ca00)
       /usr/local/go/src/runtime/debug/stack.go:24 +0xa7
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).maybeRejectClientLocked(0xc45d3ff800, 0x2f4e2c0, 0xc4517f0570, 0xc421643318, 0x0)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:717 +0x2c5
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).Send(0xc45d3ff800, 0x2f4e2c0, 0xc4517f0570, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:550 +0xc2
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).sendUsingSender(0xc42076e380, 0x2f4e2c0, 0xc4517f0570, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:622 +0x135
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Send(0xc45ccae6e0, 0x2f4e2c0, 0xc4517f0570, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:783 +0x161
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*txnKVFetcher).fetch(0xc463bf9030, 0x2f4e2c0, 0xc4517f0570, 0xc433704180, 0x26b9da0)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/kvfetcher.go:312 +0x597
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*txnKVFetcher).nextBatch(0xc463bf9030, 0x2f4e2c0, 0xc4517f0570, 0x26b9da0, 0x46e8118, 0x29aef40, 0xc4408e8d00, 0x2f4e2c0, 0xc464fcb500, 0xc428ddddb8, ...)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/kvfetcher.go:400 +0x70
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).nextKV(0xc42e232c90, 0x2f4e2c0, 0xc4517f0570, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:486 +0x3ab
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).NextKey(0xc42e232c90, 0x2f4e2c0, 0xc4517f0570, 0x0, 0x0, 0x0)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:524 +0x8a
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).StartScanFrom(0xc42e232c90, 0x2f4e2c0, 0xc4517f0570, 0x2f2d6c0, 0xc463bf9030, 0x0, 0x28ffc01)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:447 +0xd9
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).StartScan(0xc42e232c90, 0x2f4e2c0, 0xc4517f0570, 0xc45ccae6e0, 0xc4338853b0, 0x1, 0x1, 0xc435d86901, 0x0, 0xc44fad6600, ...)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:435 +0x1e8
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*tableReader).Start(0xc42e232800, 0x2f4e2c0, 0xc433884e40, 0xc4425b7f70, 0xbe897f)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:226 +0x173
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*aggregatorBase).start(0xc44d532a80, 0x2f4e2c0, 0xc433884e40, 0x2a2681c, 0x12, 0x2f4e2c0, 0xc4425b7f88)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/aggregator.go:400 +0x52
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*orderedAggregator).Start(0xc44d532a80, 0x2f4e2c0, 0xc433884e40, 0xc4340a4b00, 0x0)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/aggregator.go:396 +0x56
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc44d532a80, 0x2f4e2c0, 0xc433884e40, 0xc44b031f38)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:738 +0x58
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).StartAsync

This should never happen.
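For intuition, here is a toy sketch of the leaf/root distinction the stack trace points at. The types are illustrative stand-ins, not the real roachpb/kv ones: the root coordinator is the one that prepares the txn for a retry and may hand the SQL layer a "handled" error, while a leaf used by a remote flow should return the raw retryable error so it can travel back through the DistSQL streams.

// A rough sketch, using hypothetical types, of the intended behavior: only
// the root transaction coordinator wraps a retryable condition into a
// "handled" error; a leaf coordinator (used by remote DistSQL flows) hands
// back the raw retryable error instead, because downstream code treats the
// wrapped form as already handled and stops propagating it as retryable.
package example

type txnType int

const (
	rootTxn txnType = iota
	leafTxn
)

// abortedTxnError and handledRetryableError are illustrative stand-ins for
// the real roachpb error types.
type abortedTxnError struct{}

func (*abortedTxnError) Error() string { return "txn aborted" }

type handledRetryableError struct{ cause error }

func (e *handledRetryableError) Error() string {
	return "HandledRetryableTxnError: " + e.cause.Error()
}

// maybeRejectClient sketches the check: reject an aborted txn, but only the
// root produces the wrapped form.
func maybeRejectClient(typ txnType, aborted bool) error {
	if !aborted {
		return nil
	}
	raw := &abortedTxnError{}
	if typ == leafTxn {
		return raw // raw retryable error; the root will handle it
	}
	return &handledRetryableError{cause: raw}
}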

@andreimatei
Contributor

Btw, I've verified that the cause of the "replay txn" error is the scenario described in #28270 (comment).
A big-ass batch with the BeginTxn is sent out and split into sub-batches for different ranges; the sub-batch with the Begin is real slow, so an intent from another range manages to get pushed, the heartbeat loop then notices that the txn was aborted and sends a rollback, and eventually the original BeginTxn gets this error.
On a 4-node roachprod cluster with 2 tpcc workers (no -nowait), I see the Begin batch take 20s :S
The Begin corresponds to a transaction doing:

UPDATE "order" SET o_carrier_id = 2 WHERE (o_w_id = 375) AND ((o_d_id, o_id) IN ((1, 2104), (2, 2104), (10, 2104), (7, 2104), (8, 2104), (9, 2104), (3, 2104), (4, 2104), (5, 2104), (6, 2104))) RETURNING o_d_id, o_c_id

@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label Aug 28, 2018
@tbg tbg assigned andreimatei and unassigned jordanlewis Aug 28, 2018
@andreimatei
Contributor

So I'm looking at the badly wrapped HandledRetryableTxn issues. I can't quite tell what's going on. What appears to be happening is that TxnCoordSenderFactory.TransactionalSender() is being called with a TxnCoordMeta that has an aborted txn in it. But that seems cray cray... Are we starting a flow in an aborted txn? More digging tomorrow.

@andreimatei
Contributor

I verified that we are indeed trying to start a flow in an aborted txn. We should add protection against that, as it will obviously result in a bad time.
But the hole only gets deeper - why exactly are we trying to do that? Presumably there's some sort of race, but what? There are no parallel statements in TPCC, so how does a client find out that its txn is aborted asynchronously with respect to processing queries? It has to be the heartbeat loop, but how is a transaction whose BeginTxn presumably succeeded getting aborted? The only thing I can think of is that the hb loop failed to heartbeat for more than 2s (but then suddenly succeeded, only to notice that someone else had aborted the txn). Verifying...

@andreimatei
Contributor

Well, in the cluster I'm running (roachprod, 4 nodes, 3 tpcc workloads running concurrently) I'm seeing heartbeats take up to 5s. And those heartbeats find their transactions to be aborted.
I'm now satisfied with the theory that these losing heartbeats race with the creation of a new flow, which then captures an already-aborted txn.
Will have a patch with protections at different levels soon.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 31, 2018
Before this patch, we had a race between a heartbeat finding out (async)
that a txn has been aborted and a client using the txn to create DistSQL
flows. We could end up creating flows in aborted txns, which are going
to have a bad time: at the moment, leaf TxnCoordSenders don't react well
to what appears to be an async abort: tcs.maybeRejectClientLocked()
returns a HandledRetryableError, but the DistSQL infrastructure only
expects raw retriable errors - so the HandledRetryableError was
ironically losing its "retryable" character and was making it to a
sql client with the wrong pgwire code.
This patch resolves the situation by atomically checking that the txn is
still good (i.e. status==PENDING) before extracting its metadata that is
sent to leaves.

Fixes cockroachdb#28898

Release note: Fix an issue where, under severe load, clients were
sometimes receiving retryable errors with a non-retryable error code (a
client would get an error with the message "HandledRetryableError: ..."
but an internal error code instead of the expected retryable error
code).
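For illustration, here is a minimal sketch of the shape of that fix, with hypothetical types rather than the actual TxnCoordSender code: the status check and the metadata snapshot happen under the same mutex, so a concurrent heartbeat that marks the txn aborted can no longer slip in between them.

// A minimal sketch, with hypothetical types, of the fix described above:
// check that the transaction is still PENDING under the same lock that
// guards its state before exporting the metadata used to set up leaf
// coordinators for DistSQL flows. This closes the race with the heartbeat
// loop observing an abort.
package example

import (
	"errors"
	"sync"
)

type txnStatus int

const (
	statusPending txnStatus = iota
	statusAborted
	statusCommitted
)

// txnMeta stands in for the metadata (TxnCoordMeta-like) shipped to leaves.
type txnMeta struct{ id string }

type txnCoord struct {
	mu     sync.Mutex
	status txnStatus
	meta   txnMeta
}

var errTxnNotPending = errors.New("cannot start DistSQL flow: txn is no longer pending")

// exportMeta atomically verifies the txn is still usable and snapshots its
// metadata; callers refuse to create flows if an error is returned.
func (tc *txnCoord) exportMeta() (txnMeta, error) {
	tc.mu.Lock()
	defer tc.mu.Unlock()
	if tc.status != statusPending {
		return txnMeta{}, errTxnNotPending
	}
	return tc.meta, nil
}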
@tbg tbg added this to Backlog in KV Sep 4, 2018
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 5, 2018
Before this patch, we had a race between a heartbeat finding out (async)
that a txn has been aborted and a client using the txn to create DistSQL
flows. We could end up creating flows in aborted txns, which are going
to have a bad time: at the moment, leaf TxnCoordSenders don't react well
to what appears to be an async abort: tcs.maybeRejectClientLocked()
returns a HandledRetryableError, but the DistSQL infrastructure only
expects raw retriable errors - so the HandledRetryableError was
ironically losing its "retryable" character and was making it to a
sql client with the wrong pgwire code.
This patch resolves the situation by atomically checking that the txn is
still good (i.e. status==PENDING) before extracting its metadata that is
sent to leaves.

Fixes cockroachdb#28898
Fixes cockroachdb#29271

Release note (bug fix): Fix an issue where, under severe load, clients
were sometimes receiving retryable errors with a non-retryable error
code (a client would get an error with the message
"HandledRetryableError: ..." but an internal error code instead of the
expected retryable error code).
craig bot pushed a commit that referenced this issue Sep 5, 2018
29455: distsqlrun,kv,client: don't create flows in aborted txns r=andreimatei a=andreimatei

Before this patch, we had a race between a heartbeat finding out (async)
that a txn has been aborted and a client using the txn to create DistSQL
flows. We could end up creating flows in aborted txns, which are going
to have a bad time: at the moment, leaf TxnCoordSenders don't react well
to what appears to be an async abort: tcs.maybeRejectClientLocked()
returns a HandledRetryableError, but the DistSQL infrastructure only
expects raw retriable errors - so the HandledRetryableError was
ironically losing its "retryable" character and was making it to a
sql client with the wrong pgwire code.
This patch resolves the situation by atomically checking that the txn is
still good (i.e. status==PENDING) before extracting its metadata that is
sent to leaves.

Fixes #28898
Fixes #29271

Release note: Fix an issue where, under severe load, clients were
sometimes receiving retryable errors with a non-retryable error code (a
client would get an error with the message "HandledRetryableError: ..."
but an internal error code instead of the expected retryable error
code).

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@craig craig bot closed this as completed in #29455 Sep 5, 2018
@jordanlewis jordanlewis moved this from Bugfix milestone to Finished (milestone 4) in (DEPRECATED) SQL execution Sep 6, 2018
craig bot pushed a commit that referenced this issue Sep 10, 2018
29036: kv: prevent a leaf TCS from returning HandledRetryableError r=andreimatei a=andreimatei

A root TxnCoordSender takes retryable errors, prepares the transaction
for a retry, and then returns a HandledRetryableError to the client.
A leaf TCS does not prepare transactions for retries, and so has no
business creating HandledRetryableErrors. Instead, it returns raw errors
which have to make their way (through DistSQL streams) to the root,
which will "handle" them.
Before this patch, there was a codepath where a leaf TCS would
erroneously return a HandledRetryableError. This was confusing DistSQL,
which was not plumbing it properly, and the error was making its way to
the client with a non-retriable code.

Fixes #28898

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Sep 10, 2018
29998: release-2.1: kv: prevent a leaf TCS from returning HandledRetryableError r=andreimatei a=andreimatei

Backport 1/1 commits from #29036.

/cc @cockroachdb/release

---

A root TxnCoordSender takes retryable errors, prepares the transaction
for a retry, and then returns a HandledRetryableError to the client.
A leaf TCS does not prepare transactions for retries, and so has no
business creating HandledRetryableErrors. Instead, it returns raw errors
which have to make their way (through DistSQL streams) to the root,
which will "handle" them.
Before this patch, there was a codepath where a leaf TCS would
erroneously return a HandledRetryableError. This was confusing DistSQL,
which was not plumbing it properly, and the error was making its way to
the client with a non-retriable code.

Fixes #28898

Release note: None


Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@tbg tbg moved this from Backlog to Closed in KV Oct 8, 2018