sql: clarify DB error messages that seep through the client connection #4421

Closed
knz opened this issue Feb 16, 2016 · 11 comments
Labels
A-sql-executor SQL txn logic A-sql-pgwire pgwire protocol issues. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) docs-todo

Comments

@knz
Contributor

knz commented Feb 16, 2016

As mentioned in today's meeting, we should document the messages that SQL users may see through their connection. Right now the messages convey a lot of debugging information about the DB internals, even though in many cases the proper course of action is simply to retry the transaction. So what we should do is:

  1. simplify the text of the error reported through the SQL connections
  2. document the error messages, to explain what they mean
  3. put the detailed message in a per-client trace, if we ever support that type of logging

For point (1) it was suggested during the meeting to start by making an inventory. Here is what I directly experienced during my work on #4036:

query error: pq: there is no transaction in progress

This happens when the server has already reported an error and the client tries to send more queries instead of closing the failed transaction (a client-side sketch of this sequence follows the inventory).

pq: storage/store.go:xxx: Rejecting command with timestamp in the future: NNN (NNN.NNNms ahead)

This happens when the cluster is not "on time" (with a big offset).

pq: read at time NNN.NNN,NNN encountered previous write with future timestamp NNN.NNN,NNN within uncertainty interval

This happens when the cluster is not "on time" (with a small offset).

pq: retry txn "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=true pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN
pq: txn aborted "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/NNN/NNN/NNN rw=true pri=NNN.NNN iso=SERIALIZABLE stat=ABORTED epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN
pq: txn aborted "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=true pri=NNN.NNN iso=SNAPSHOT stat=ABORTED epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN
pq: txn "sql/lease.go:xx LeaseStore.Acquire" id=NN key=/Table/NNN/NNN/NNN/NNN/Thu Feb  NNN NNN:3NNN:5NNN UTC NNN/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN: non-aborted transaction exists already
pq: txn "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN failed to push "sql/executor.go:xx (*Executor).execStmt" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN
@knz
Contributor Author

knz commented Feb 16, 2016

I think we should first spend some time figuring out exactly when these messages occur before we settle on better / more informative phrasing for them.

Also, in some cases it might be possible to hide them entirely by automatically retrying the transaction. @andreimatei, care to comment?

@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) docs-todo labels Feb 16, 2016
@tbg
Member

tbg commented Feb 16, 2016

I think we've already decided (?) to expose all transaction retry/abort errors in a way that lets SQL clients parse them (that includes clock uncertainty restarts). That exhausts your list already. All other errors from KV should probably be exposed through some kind of "internal error" which we can then document (we'll want people to be able to give us error messages that we can make sense of).
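To make the split concrete, here is a rough sketch of the client-side branching this would allow, assuming the retry/abort class eventually surfaces under one SQLSTATE (40001, serialization_failure, is used as a placeholder) and the catch-all under another (XX000, internal_error). Neither code is decided in this thread; the example just uses the pq.Error fields that lib/pq already exposes.

```go
package main

import (
	"errors"
	"fmt"

	"github.com/lib/pq"
)

// classify sorts a server error into the two buckets proposed above:
// a parseable retry/abort class vs. a documented "internal error" catch-all.
func classify(err error) string {
	var pqErr *pq.Error
	if !errors.As(err, &pqErr) {
		return "client-side or connection error"
	}
	switch pqErr.Code {
	case "40001": // hypothetical single SQLSTATE for the retryable class
		return "retryable: roll back and run the transaction again"
	case "XX000": // hypothetical SQLSTATE for the internal-error class
		return "internal error: report the full, unobfuscated message"
	default:
		return fmt.Sprintf("other (%s): %s", pqErr.Code, pqErr.Message)
	}
}

func main() {
	fmt.Println(classify(&pq.Error{Code: "40001", Message: "restart transaction"}))
}
```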


@knz
Contributor Author

knz commented Feb 16, 2016

@tschottdorf I'm not sure I fully understand your answer. Regardless of which syntax we use for error messages, they still need to be simplified and documented. Right now (correct me if I am wrong) we do not have a proper user-level explanation for each of these messages, nor a summary of what the user-programmer should do in response to each of them. IMHO this description work should precede any discussion about which format to report the errors in.

@petermattis petermattis changed the title Clarify DB error messages that seep through the client connection sql: clarify DB error messages that seep through the client connection Feb 17, 2016
@petermattis petermattis added this to the Beta milestone Feb 17, 2016
@tbg
Member

tbg commented Feb 17, 2016

@knz yes, just a heads up that the errors you collected will all melt into one (which then needs documentation).
Any other errors bubbling up from KV (except for some "system down/giving up" errors) are basically "unexpected errors", so we can't really enumerate them all. Maybe there are some that are worth special-casing, but when SQL clients get most of these, we're in bug-tracking territory and want the client to be able to report the error without obfuscation.

For example, I just scanned your list again and these two are "bug territory":

pq: txn "sql/lease.go:xx LeaseStore.Acquire" id=NN key=/Table/NNN/NNN/NNN/NNN/Thu Feb NNN NNN:3NNN:5NNN UTC NNN/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN: non-aborted transaction exists already

^- looks like an honest bug. Can you repro that?

pq: storage/store.go:xxx: Rejecting command with timestamp in the future: NNN (NNN.NNNms ahead)

^- this is #3087, which I haven't been able to track down (failure to repro).

@knz
Contributor Author

knz commented Feb 17, 2016

pq: txn "sql/lease.go:xx LeaseStore.Acquire" id=NN key=/Table/NNN/NNN/NNN/NNN/Thu Feb NNN NNN:3NNN:5NNN UTC NNN/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN: non-aborted transaction exists already

^- looks like an honest bug. Can you repro that?

After upgrading to a more recent version I can't find this specific instance, but instead I see a couple instances of the following:

W0217 16:52:19.808122 15926 kv/txn_coord_sender.go:637  heartbeat to "sql/lease.go:149 LeaseStore.Acquire" id=8448a7d2 key=/Table/11/1/52/1/Wed Feb 17 16:56:13 UTC 2016/3/0 rw=true pri=0.02687466 iso=SERIALIZABLE stat=PENDING epo=1 ts=1455727903.217180779,1 orig=1455727903.217180779,1 max=1455727903.244803122,0 failed: retry txn "sql/lease.go:149 LeaseStore.Acquire" id=8448a7d2 key=/Table/11/1/52/1/Wed Feb 17 16:56:13 UTC 2016/3/0 rw=true pri=0.02687466 iso=SERIALIZABLE stat=PENDING epo=1 ts=1455727904.246115227,1 orig=1455727903.217180779,1 max=1455727903.244803122,0

is that similar/equivalent?

pq: storage/store.go:xxx: Rejecting command with timestamp in the future: NNN (NNN.NNNms ahead)

^- this is #3087 I haven't been able to track down (failure to repro).

I get this one when I forget to reset my clock drift before starting a new test -- it seems to happen when I let the clocks drift slowly but surely while the cluster is idle, then send a request once there is enough drift.

It hasn't been happening as much lately (my code has gotten better at resetting the test env), but if I catch it again I'll let you know.

@andreimatei
Contributor

What I'm planning to do is expose all the retryable errors (mostly printed with "pq: retry" in the CLI), plus some errors that are non-retryable from the KV perspective (I think the ones printed with "pq: txn aborted"), to the user as a single error message with a single error code. The user can then issue something like BEGIN TRANSACTION RETRY and try everything again (a sketch of that client-side loop follows the list below).
All the other errors, I believe, are bugs.

So let's go through your list again:

query error: pq: there is no transaction in progress
This happens when the server already has reported an error and the client tries to send more queries instead of closing the failed transaction.
  • ummm, I don't think this is what you're supposed to see when you're sending queries in an aborted txn (you're supposed to see something like "txn already aborted").
    So it sounds like a bug to me. If you can repro, that'd be very helpful. Unless you're seeing this with the CLI due to txn timeouts (sql: ./cockroach sql should heartbeat during open transaction #3031).
pq: storage/store.go:xxx: Rejecting command with timestamp in the future: NNN (NNN.NNNms ahead)
This happens when the cluster is not "on time" (with a big offset).
  • don't know. Toby says it's a bug? Sounds like a reasonable error to me, in which case we should document it.
pq: read at time NNN.NNN,NNN encountered previous write with future timestamp NNN.NNN,NNN within uncertainty interval
This happens when the cluster is not "on time" (with a small offset).
  • this is ReadWithinUncertaintyIntervalError, a retryable one.
pq: retry txn "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=true 
pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN 
max=NNN.NNN,NNN
  • cool, retryable
pq: txn aborted "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/NNN/NNN/NNN rw=true pri=NNN.NNN iso=SERIALIZABLE stat=ABORTED epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN
  • cool, aborted. We'll surface as retryable.
pq: txn aborted "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=true pri=NNN.NNN iso=SNAPSHOT stat=ABORTED epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN
  • cool, aborted. We'll surface as retryable.
pq: txn "sql/lease.go:xx LeaseStore.Acquire" id=NN key=/Table/NNN/NNN/NNN/NNN/Thu Feb  NNN NNN:3NNN:5NNN UTC NNN/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN: non-aborted transaction exists already
  • sounds like a bug
pq: txn "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN failed to push "sql/executor.go:xx (*Executor).execStmt" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN
  • this is TransactionPushError, a retryable one.
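Here is a minimal sketch of the client-side loop this implies, under the assumption that the single error code for the retryable class ends up being SQLSTATE 40001 (a placeholder, not something decided in this thread). BEGIN TRANSACTION RETRY is also only hypothetical syntax, so the sketch simply re-runs the whole transaction from the client whenever that code comes back; the DSN and statement are placeholders too.

```go
package main

import (
	"database/sql"
	"errors"
	"log"

	"github.com/lib/pq" // driver registration plus the pq.Error type
)

// runTxnWithRetry runs fn inside a transaction and, if the single retryable
// error code comes back, rolls back and tries everything again.
func runTxnWithRetry(db *sql.DB, fn func(*sql.Tx) error) error {
	for {
		tx, err := db.Begin()
		if err != nil {
			return err
		}
		err = fn(tx)
		if err == nil {
			err = tx.Commit()
		}
		if err == nil {
			return nil
		}
		_ = tx.Rollback()
		var pqErr *pq.Error
		if errors.As(err, &pqErr) && pqErr.Code == "40001" {
			continue // retryable class: try everything again
		}
		return err // anything else: surface to the caller (likely a bug, per above)
	}
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = runTxnWithRetry(db, func(tx *sql.Tx) error {
		_, err := tx.Exec("UPDATE kv SET v = v + 1 WHERE k = $1", "a")
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}
```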

@tbg
Member

tbg commented Feb 17, 2016

I get this one when I forget to reset my clock drifting before starting
a new test

Ah, ok. If you're letting your clocks drift close to or above the offset, then that's an expected error. In that issue, it happens during Docker acceptance tests, where all nodes share the same clock - no bueno.

After upgrading to a more recent version I can't find this specific instance, but instead I see a couple instances of the following:

This is what happens if the (central) transaction record indicates that you need to restart. It's a bit unusual to see in practice unless the transaction is taking a lot of time and isn't doing much - for example, BEGIN; INSERT(....); WAIT, WAIT, WAIT.... could catch that error if someone pushes the txn when running into the intent in the meantime. Were you doing something where that would be expected?

pq: txn "sql/lease.go:xx LeaseStore.Acquire" id=NN key=/Table/NNN/NNN/NNN/NNN/Thu Feb NNN NNN:3NNN:5NNN UTC NNN/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN: non-aborted transaction exists already

^- looks like an honest bug. Can you repro that?

I actually know how this can happen, but I would still classify it as a bug: if the first write in a txn is multi-range and fails retryably on the second range, the command will fail with that error on the first range (since the previous attempt actually succeeded there, so we already have a transaction table entry from BeginTransaction). I believe this should not reoccur once #4443 is fixed.

@knz
Contributor Author

knz commented Feb 25, 2016

(While looking at #4036)

@knz
Contributor Author

knz commented Mar 23, 2016

New error messages cropping up recently:

ERROR: failed to send RPC: too many errors encountered (1 of 1 total): rpc error: code = 13 desc = "transport is closing"

ERROR: failed to send RPC: too many errors encountered (1 of 1 total): rpc error: code = 4 desc = "context deadline exceeded"

(These happen when nodes become unreachable on the network during a transaction - see also #5452.)

@petermattis petermattis modified the milestones: Beta, Q2 Apr 7, 2016
@petermattis petermattis modified the milestones: Q2, Q3 Jul 11, 2016
@knz knz modified the milestones: 1.0, Q3 Mar 7, 2017
@spencerkimball spencerkimball removed this from the 1.0 milestone Mar 29, 2017
@spencerkimball
Member

@knz, @andreimatei I moved this off of the 1.0 milestone. If you disagree, please comment on how this work will be scheduled and reset the 1.0 milestone.

@dianasaur323 dianasaur323 added this to the Later milestone Apr 20, 2017
@knz knz added this to Triage in (DEPRECATED) SQL Front-end, Lang & Semantics via automation Apr 28, 2018
@knz knz moved this from Triage to Backlog in (DEPRECATED) SQL Front-end, Lang & Semantics May 3, 2018
@knz knz added A-sql-pgwire pgwire protocol issues. A-sql-executor SQL txn logic labels May 15, 2018
@petermattis petermattis removed this from the Later milestone Oct 5, 2018
@jordanlewis
Member

We've made huge strides with this over the last 3 years, closing.
