sql: tpcc returns retriable errors to the user #28898

Closed
jordanlewis opened this issue Aug 21, 2018 · 10 comments · Fixed by #29455
Labels
A-kv-client Relating to the KV client and the KV interface. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

@jordanlewis
Member

Running TPCC with latest master sometimes produces retriable errors that get propagated all the way up to the user, like the "replay txn" error.

@asubiotto theorizes that this could be because of #24798.

@andreimatei, what do you think?

@jordanlewis jordanlewis added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. labels Aug 21, 2018
@jordanlewis jordanlewis added this to the 2.1 milestone Aug 21, 2018
@jordanlewis jordanlewis added this to Triage in (DEPRECATED) SQL execution via automation Aug 21, 2018
@andreimatei
Contributor

I don't think "replay txn" errors can be caused by #24798; this sounds like #28270.

@jordanlewis jordanlewis moved this from Triage to Bugfix milestone in (DEPRECATED) SQL execution Aug 21, 2018
@jordanlewis jordanlewis self-assigned this Aug 21, 2018
@jordanlewis
Member Author

I've added more logging to the client - soon we'll know what pgerror code these errors are returning. Plenty of errors are being returned properly as 40001 - it's just a matter of tracking down which ones aren't.

@jordanlewis
Member Author

As far as I can tell, the only spurious error that gets returned is replay txn, which is marked as XX000 but is caused by a separate issue: see #28389.

I could have sworn I saw a non-replay-txn error pop up, but now I can't reproduce it. I'll keep trying though.
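For context on why the code matters: a typical client retry loop only retries on SQLSTATE 40001 (serialization_failure), so a retryable error surfaced with any other code (like the XX000 above) escapes the loop and hits the application. Below is a minimal sketch of such a loop using database/sql and lib/pq; the helper names and overall structure are illustrative, not taken from this issue.

// A minimal client-side retry loop, sketched with database/sql and lib/pq.
// Nothing here is CockroachDB-internal code; the helper names are made up
// for illustration. The point: only SQLSTATE 40001 is treated as retryable,
// so a retryable error that arrives with a different code (e.g. XX000)
// escapes the loop and reaches the application.
package retryexample

import (
	"database/sql"

	"github.com/lib/pq"
)

// runWithRetry runs fn inside a transaction and retries it whenever the
// error reported by the server carries the serialization_failure code.
func runWithRetry(db *sql.DB, fn func(*sql.Tx) error) error {
	for {
		tx, err := db.Begin()
		if err != nil {
			return err
		}
		if err := fn(tx); err != nil {
			_ = tx.Rollback()
			if isRetryable(err) {
				continue // retryable: start the transaction over
			}
			return err
		}
		if err := tx.Commit(); err != nil {
			if isRetryable(err) {
				continue
			}
			return err
		}
		return nil
	}
}

// isRetryable reports whether err is a PostgreSQL-protocol error with
// SQLSTATE 40001 (serialization_failure).
func isRetryable(err error) bool {
	if pqErr, ok := err.(*pq.Error); ok {
		return pqErr.Code == "40001"
	}
	return false
}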

@jordanlewis
Member Author

I've figured out the issue. Errors emitted from remote flows don't get properly unwrapped and turned into pgwire errors. I've written a test that demonstrates this:

root@localhost:26257/kv> select *, crdb_internal.force_retry('100ms') from kv limit 1;
pq: HandledRetryableTxnError: forced by crdb_internal.force_retry()

This should automatically retry, but it doesn't, since I've forced the kv table's data onto a remote node.

What I don't understand is why this hasn't always been a problem. Nothing about this was changed during the DistSQL merge. Perhaps another error reporting mechanism was changed recently?
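To make the failure mode concrete, here is a rough sketch of the translation that has to happen at the pgwire boundary. The types are hypothetical stand-ins, not CockroachDB's actual ones; the idea is that if a retryable error comes back from a remote flow in a form the translator doesn't recognize, it falls through to the generic internal-error code.

// A minimal sketch (not CockroachDB's actual code) of the mapping that has
// to happen before errors reach the client: retryable transaction errors
// must be translated to SQLSTATE 40001. retryableTxnError and pgError are
// hypothetical stand-ins for the real internal types.
package example

import "errors"

// retryableTxnError stands in for an internal retryable error type such as
// HandledRetryableTxnError.
type retryableTxnError struct{ msg string }

func (e *retryableTxnError) Error() string { return e.msg }

// pgError is a hypothetical pgwire-level error with a SQLSTATE code.
type pgError struct {
	Code    string
	Message string
}

func (e *pgError) Error() string { return e.Message }

// toPGError converts an error produced by query execution into something
// with the right SQLSTATE. If a retryable error arrives here without being
// recognized (for example, coming back from a remote DistSQL flow without
// being unwrapped), it falls through to the generic internal-error code --
// which is the symptom described in this issue.
func toPGError(err error) *pgError {
	var rte *retryableTxnError
	if errors.As(err, &rte) {
		return &pgError{Code: "40001", Message: rte.Error()}
	}
	return &pgError{Code: "XX000", Message: err.Error()}
}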

@jordanlewis
Member Author

Looks like this has always been a problem. I confirmed that it reproduces on release-2.0. The next question is why it didn't come up in TPCC back then. Perhaps these sorts of errors were inadvertently made more likely.

@jordanlewis
Member Author

The root cause seems to be that the TxnCoordSender is currently misconfigured to return a HandledRetryableTxnError even in the "leaf txn" configuration, which confuses downstream components into not returning a retryable error.

We found this by adding the following diff:

diff --git a/pkg/kv/txn_coord_sender.go b/pkg/kv/txn_coord_sender.go
index 5b34d5935f..299894fbab 100644
--- a/pkg/kv/txn_coord_sender.go
+++ b/pkg/kv/txn_coord_sender.go
@@ -21,6 +21,8 @@ import (
 
        "github.com/pkg/errors"
 
+       "runtime/debug"
+
        "github.com/cockroachdb/cockroach/pkg/base"
        "github.com/cockroachdb/cockroach/pkg/internal/client"
        "github.com/cockroachdb/cockroach/pkg/keys"
@@ -711,6 +713,9 @@ func (tc *TxnCoordSender) maybeRejectClientLocked(
                        // priority is not used for aborted errors
                        roachpb.NormalUserPriority,
                        tc.clock)
+               if tc.typ == client.LeafTxn {
+                       log.Infof(ctx, "ALERT %s", debug.Stack())
+               }
                return roachpb.NewError(roachpb.NewHandledRetryableTxnError(
                        abortedErr.Message, tc.mu.txn.ID, newTxn))
        }

which produced a stack trace on an in-the-wild cluster:

I180822 19:47:31.791747 634332 kv/txn_coord_sender.go:717  [n1] ALERT goroutine 634332 [running]:
runtime/debug.Stack(0x2f4e2c0, 0xc4517f0570, 0xc42e47ca00)
       /usr/local/go/src/runtime/debug/stack.go:24 +0xa7
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).maybeRejectClientLocked(0xc45d3ff800, 0x2f4e2c0, 0xc4517f0570, 0xc421643318, 0x0)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:717 +0x2c5
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).Send(0xc45d3ff800, 0x2f4e2c0, 0xc4517f0570, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:550 +0xc2
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).sendUsingSender(0xc42076e380, 0x2f4e2c0, 0xc4517f0570, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:622 +0x135
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Send(0xc45ccae6e0, 0x2f4e2c0, 0xc4517f0570, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:783 +0x161
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*txnKVFetcher).fetch(0xc463bf9030, 0x2f4e2c0, 0xc4517f0570, 0xc433704180, 0x26b9da0)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/kvfetcher.go:312 +0x597
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*txnKVFetcher).nextBatch(0xc463bf9030, 0x2f4e2c0, 0xc4517f0570, 0x26b9da0, 0x46e8118, 0x29aef40, 0xc4408e8d00, 0x2f4e2c0, 0xc464fcb500, 0xc428ddddb8, ...)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/kvfetcher.go:400 +0x70
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).nextKV(0xc42e232c90, 0x2f4e2c0, 0xc4517f0570, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:486 +0x3ab
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).NextKey(0xc42e232c90, 0x2f4e2c0, 0xc4517f0570, 0x0, 0x0, 0x0)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:524 +0x8a
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).StartScanFrom(0xc42e232c90, 0x2f4e2c0, 0xc4517f0570, 0x2f2d6c0, 0xc463bf9030, 0x0, 0x28ffc01)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:447 +0xd9
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).StartScan(0xc42e232c90, 0x2f4e2c0, 0xc4517f0570, 0xc45ccae6e0, 0xc4338853b0, 0x1, 0x1, 0xc435d86901, 0x0, 0xc44fad6600, ...)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:435 +0x1e8
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*tableReader).Start(0xc42e232800, 0x2f4e2c0, 0xc433884e40, 0xc4425b7f70, 0xbe897f)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:226 +0x173
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*aggregatorBase).start(0xc44d532a80, 0x2f4e2c0, 0xc433884e40, 0x2a2681c, 0x12, 0x2f4e2c0, 0xc4425b7f88)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/aggregator.go:400 +0x52
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*orderedAggregator).Start(0xc44d532a80, 0x2f4e2c0, 0xc433884e40, 0xc4340a4b00, 0x0)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/aggregator.go:396 +0x56
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc44d532a80, 0x2f4e2c0, 0xc433884e40, 0xc44b031f38)
       /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:738 +0x58
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).StartAsync

This should never happen.
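For intuition, here is a toy sketch of the leaf/root distinction the stack trace points at. The types are illustrative stand-ins, not the real roachpb/kv ones: the root coordinator is the one that prepares the txn for a retry and may hand the SQL layer a "handled" error, while a leaf used by a remote flow should return the raw retryable error so it can travel back through the DistSQL streams.

// A rough sketch, using hypothetical types, of the intended behavior: only
// the root transaction coordinator wraps a retryable condition into a
// "handled" error; a leaf coordinator (used by remote DistSQL flows) hands
// back the raw retryable error instead, because downstream code treats the
// wrapped form as already handled and stops propagating it as retryable.
package example

type txnType int

const (
	rootTxn txnType = iota
	leafTxn
)

// abortedTxnError and handledRetryableError are illustrative stand-ins for
// the real roachpb error types.
type abortedTxnError struct{}

func (*abortedTxnError) Error() string { return "txn aborted" }

type handledRetryableError struct{ cause error }

func (e *handledRetryableError) Error() string {
	return "HandledRetryableTxnError: " + e.cause.Error()
}

// maybeRejectClient sketches the check: reject an aborted txn, but only the
// root produces the wrapped form.
func maybeRejectClient(typ txnType, aborted bool) error {
	if !aborted {
		return nil
	}
	raw := &abortedTxnError{}
	if typ == leafTxn {
		return raw // raw retryable error; the root will handle it
	}
	return &handledRetryableError{cause: raw}
}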

@andreimatei
Contributor

Btw, I've verified that the cause of the "replay txn" error is the scenario described in #28270 (comment).
A big-ass batch with the BeginTxn is sent out and split into sub-batches for different ranges; the sub-batch with the Begin is real slow, so an intent from another range manages to get pushed, the heartbeat loop then notices that the txn was aborted and sends a rollback, and eventually the original BeginTxn gets this error.
On a 4-node roachprod cluster with 2 tpcc workers (no -nowait), I see the Begin batch take 20s :S
The Begin corresponds to a transaction doing:

UPDATE "order" SET o_carrier_id = 2 WHERE (o_w_id = 375) AND ((o_d_id, o_id) IN ((1, 2104), (2, 2104), (10, 2104), (7, 2104), (8, 2104), (9, 2104), (3, 2104), (4, 2104), (5, 2104), (6, 2104))) RETURNING o_d_id, o_c_id

@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label Aug 28, 2018
@tbg tbg assigned andreimatei and unassigned jordanlewis Aug 28, 2018
@andreimatei
Contributor

So I'm looking at the badly wrapped HandledRetryableTxn issues. I can't quite tell what's going on. What appears to be happening is that TxnCoordSenderFactory.TransactionalSender() is being called with a TxnCoordMeta that has an aborted txn in it. But that seems cray cray... Are we starting a flow in an aborted txn? More digging tomorrow.

@andreimatei
Contributor

I verified that we are indeed trying to start a flow in an aborted txn. We should add protection against that, as it will obviously result in a bad time.
But the hole only gets deeper - why exactly are we trying to do that? Presumably there's some sort of race, but what? There are no parallel statements in TPCC, so how does a client find out that its txn is aborted asynchronously with respect to processing queries? It has to be the heartbeat loop, but how is a transaction whose BeginTxn presumably succeeded getting aborted? The only thing I can think of is that the hb loop failed to heartbeat for more than 2s (but then suddenly succeeded, only to notice that someone else had aborted the txn). Verifying...

@andreimatei
Contributor

Well, in the cluster I'm running (roachprod, 4 nodes, 3 tpcc workloads running concurrently) I'm seeing heartbeats take up to 5s. And those heartbeats find their transactions to be aborted.
I'm now satisfied with the theory that these losing heartbeats race with the creation of a new flow, which then captures an already-aborted txn.
Will have a patch with protections at different levels soon.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 31, 2018
Before this patch, we had a race between a heartbeat finding out (async)
that a txn has been aborted and a client using the txn to create DistSQL
flows. We could end up creating flows in aborted txns, which are going
to have a bad time: at the moment, leaf TxnCoordSenders don't react well
to what appears to be an async abort: tcs.maybeRejectClientLocked()
returns a HandledRetryableError, but the DistSQL infrastructure only
expects raw retriable errors - so the HandledRetryableError was
ironically losing its "retryable" character and was making it to a
sql client with the wrong pgwire code.
This patch resolves the situation by atomically checking that the txn is
still good (i.e. status==PENDING) before extracting its metadata that is
sent to leaves.

Fixes cockroachdb#28898

Release note: Fix an issue where, under severe load, clients were
sometimes receiving retryable errors with a non-retryable error code (a
client would get an error with the message "HandledRetryableError: ..."
but an internal error code instead of the expected retryable error
code).
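For illustration, here is a minimal sketch of the shape of that fix, with hypothetical types rather than the actual TxnCoordSender code: the status check and the metadata snapshot happen under the same mutex, so a concurrent heartbeat that marks the txn aborted can no longer slip in between them.

// A minimal sketch, with hypothetical types, of the fix described above:
// check that the transaction is still PENDING under the same lock that
// guards its state before exporting the metadata used to set up leaf
// coordinators for DistSQL flows. This closes the race with the heartbeat
// loop observing an abort.
package example

import (
	"errors"
	"sync"
)

type txnStatus int

const (
	statusPending txnStatus = iota
	statusAborted
	statusCommitted
)

// txnMeta stands in for the metadata (TxnCoordMeta-like) shipped to leaves.
type txnMeta struct{ id string }

type txnCoord struct {
	mu     sync.Mutex
	status txnStatus
	meta   txnMeta
}

var errTxnNotPending = errors.New("cannot start DistSQL flow: txn is no longer pending")

// exportMeta atomically verifies the txn is still usable and snapshots its
// metadata; callers refuse to create flows if an error is returned.
func (tc *txnCoord) exportMeta() (txnMeta, error) {
	tc.mu.Lock()
	defer tc.mu.Unlock()
	if tc.status != statusPending {
		return txnMeta{}, errTxnNotPending
	}
	return tc.meta, nil
}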
@tbg tbg added this to Backlog in KV Sep 4, 2018
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 5, 2018
Before this patch, we had a race between a heartbeat finding out (async)
that a txn has been aborted and a client using the txn to create DistSQL
flows. We could end up creating flows in aborted txns, which are going
to have a bad time: at the moment, leaf TxnCoordSenders don't react well
to what appears to be an async abort: tcs.maybeRejectClientLocked()
returns a HandledRetryableError, but the DistSQL infrastructure only
expects raw retriable errors - so the HandledRetryableError was
ironically losing its "retryable" character and was making it to a
sql client with the wrong pgwire code.
This patch resolves the situation by atomically checking that the txn is
still good (i.e. status==PENDING) before extracting its metadata that is
sent to leaves.

Fixes cockroachdb#28898
Fixes cockroachdb#29271

Release note (bug fix): Fix an issue where, under severe load, clients
were sometimes receiving retryable errors with a non-retryable error
code (a client would get an error with the message
"HandledRetryableError: ..." but an internal error code instead of the
expected retryable error code).
craig bot pushed a commit that referenced this issue Sep 5, 2018
29455: distsqlrun,kv,client: don't create flows in aborted txns r=andreimatei a=andreimatei

Before this patch, we had a race between a heartbeat finding out (async)
that a txn has been aborted and a client using the txn to create DistSQL
flows. We could end up creating flows in aborted txns, which are going
to have a bad time: at the moment, leaf TxnCoordSenders don't react well
to what appears to be an async abort: tcs.maybeRejectClientLocked()
returns a HandledRetryableError, but the DistSQL infrastructure only
expects raw retriable errors - so the HandledRetryableError was
ironically losing its "retryable" character and was making it to a
sql client with the wrong pgwire code.
This patch resolves the situation by atomically checking that the txn is
still good (i.e. status==PENDING) before extracting its metadata that is
sent to leaves.

Fixes #28898
Fixes #29271

Release note: Fix an issue where, under severe load, clients were
sometimes receiving retryable errors with a non-retryable error code (a
client would get an error with the message "HandledRetryableError: ..."
but an internal error code instead of the expected retryable error
code).

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@craig craig bot closed this as completed in #29455 Sep 5, 2018
@jordanlewis jordanlewis moved this from Bugfix milestone to Finished (milestone 4) in (DEPRECATED) SQL execution Sep 6, 2018
craig bot pushed a commit that referenced this issue Sep 10, 2018
29036: kv: prevent a leaf TCS from returning HandledRetryableError r=andreimatei a=andreimatei

A root TxnCoordSender takes retryable errors, prepares the transaction
for a retry, and then returns a HandledRetryableError to the client.
A leaf TCS does not prepare transactions for retries, and so has no
business creating HandledRetryableErrors. Instead, it returns raw errors
which have to make their way (through DistSQL streams) to the root,
which will "handle" them.
Before this patch, there was a codepath where a leaf TCS would
erroneously return a HandledRetryableError. This was confusing DistSQL,
which was not plumbing it properly, and the error was making its way to
the client with a non-retriable code.

Fixes #28898

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Sep 10, 2018
29998: release-2.1: kv: prevent a leaf TCS from returning HandledRetryableError r=andreimatei a=andreimatei

Backport 1/1 commits from #29036.

/cc @cockroachdb/release

---

A root TxnCoordSender takes retryable errors, prepares the transaction
for a retry, and then returns a HandledRetryableError to the client.
A leaf TCS does not prepare transactions for retries, and so has no
business creating HandledRetryableErrors. Instead, it returns raw errors
which have to make their way (through DistSQL streams) to the root,
which will "handle" them.
Before this patch, there was a codepath where a leaf TCS would
erroneously return a HandledRetryableError. This was confusing DistSQL,
which was not plumbing it properly, and the error was making its way to
the client with a non-retriable code.

Fixes #28898

Release note: None


Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@tbg tbg moved this from Backlog to Closed in KV Oct 8, 2018