sql: clarify DB error messages that seep through the client connection #4421

Closed
knz opened this issue Feb 16, 2016 · 11 comments
Labels
A-sql-executor SQL txn logic A-sql-pgwire pgwire protocol issues. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) docs-todo

Comments

@knz
Contributor

knz commented Feb 16, 2016

As mentioned in today's meeting, we should document the messages that SQL users may see through their connection. Right now the messages convey a lot of debugging information about the DB internals, even though in many cases the proper course of action is simply to retry the transaction. So what we should do is:

  1. simplify the text of the error reported through the SQL connections
  2. document the error messages, to explain what they mean
  3. put the detailed message in a per-client trace, if we ever support that type of logging

For point (1) it was suggested during the meeting to start by making an inventory. Here is what I directly experienced during my work on #4036:

query error: pq: there is no transaction in progress

This happens when the server has already reported an error and the client tries to send more queries instead of closing the failed transaction (a client-side sketch of this sequence follows the inventory).

pq: storage/store.go:xxx: Rejecting command with timestamp in the future: NNN (NNN.NNNms ahead)

This happens when the cluster is not "on time" (with a big offset).

pq: read at time NNN.NNN,NNN encountered previous write with future timestamp NNN.NNN,NNN within uncertainty interval

This happens when the cluster is not "on time" (with a small offset).

pq: retry txn "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=true pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN
pq: txn aborted "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/NNN/NNN/NNN rw=true pri=NNN.NNN iso=SERIALIZABLE stat=ABORTED epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN
pq: txn aborted "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=true pri=NNN.NNN iso=SNAPSHOT stat=ABORTED epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN
pq: txn "sql/lease.go:xx LeaseStore.Acquire" id=NN key=/Table/NNN/NNN/NNN/NNN/Thu Feb  NNN NNN:3NNN:5NNN UTC NNN/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN: non-aborted transaction exists already
pq: txn "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN failed to push "sql/executor.go:xx (*Executor).execStmt" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN
@knz
Contributor Author

knz commented Feb 16, 2016

I think we should first spend some time figuring out exactly when these messages occur before we settle on better / more informative phrasing for them.

Also, in some cases it might be possible to hide them entirely by automatically retrying the transaction. @andreimatei, care to comment?

@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) docs-todo labels Feb 16, 2016
@tbg
Member

tbg commented Feb 16, 2016

I think we've already decided (?) to expose all transaction retry/abort errors in a way that lets SQL clients parse them (that includes clock uncertainty restarts). That exhausts your list already. All other errors from KV should probably be exposed through some kind of "internal error" which we can then document (we'll want people to be able to give us error messages that we can make sense of).
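To make the split concrete, here is a rough sketch of the client-side branching this would allow, assuming the retry/abort class eventually surfaces under one SQLSTATE (40001, serialization_failure, is used as a placeholder) and the catch-all under another (XX000, internal_error). Neither code is decided in this thread; the example just uses the pq.Error fields that lib/pq already exposes.

```go
package main

import (
	"errors"
	"fmt"

	"github.com/lib/pq"
)

// classify sorts a server error into the two buckets proposed above:
// a parseable retry/abort class vs. a documented "internal error" catch-all.
func classify(err error) string {
	var pqErr *pq.Error
	if !errors.As(err, &pqErr) {
		return "client-side or connection error"
	}
	switch pqErr.Code {
	case "40001": // hypothetical single SQLSTATE for the retryable class
		return "retryable: roll back and run the transaction again"
	case "XX000": // hypothetical SQLSTATE for the internal-error class
		return "internal error: report the full, unobfuscated message"
	default:
		return fmt.Sprintf("other (%s): %s", pqErr.Code, pqErr.Message)
	}
}

func main() {
	fmt.Println(classify(&pq.Error{Code: "40001", Message: "restart transaction"}))
}
```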


@knz
Contributor Author

knz commented Feb 16, 2016

@tschottdorf I'm not sure I fully understand your answer. Regardless of which syntax we use for error messages, they still need to be simplified and documented. Right now (correct me if I am wrong) we do not have a proper user-level explanation for each of these messages, nor a summary of what the user-programmer should do in response to each of them. IMHO this description work should precede any discussion about which format to report the errors in.

@petermattis petermattis changed the title Clarify DB error messages that seep through the client connection sql: clarify DB error messages that seep through the client connection Feb 17, 2016
@petermattis petermattis added this to the Beta milestone Feb 17, 2016
@tbg
Member

tbg commented Feb 17, 2016

@knz yes, just a heads up that the errors you collected will all melt into one (which then needs documentation).
Any other errors bubbling up from KV (except for some "system down/giving up" errors) are basically "unexpected errors", so we can't really enumerate them all. Maybe there are some that are worth special-casing, but when SQL clients get most of these, we're in bug-tracking territory and want the client to be able to report the error without obfuscation.

For example, I just scanned your list again and these two are "bug territory":

pq: txn "sql/lease.go:xx LeaseStore.Acquire" id=NN key=/Table/NNN/NNN/NNN/NNN/Thu Feb NNN NNN:3NNN:5NNN UTC NNN/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN: non-aborted transaction exists already

^- looks like an honest bug. Can you repro that?

pq: storage/store.go:xxx: Rejecting command with timestamp in the future: NNN (NNN.NNNms ahead)

^- this is #3087, which I haven't been able to track down (failure to repro).

@knz
Contributor Author

knz commented Feb 17, 2016

pq: txn "sql/lease.go:xx LeaseStore.Acquire" id=NN key=/Table/NNN/NNN/NNN/NNN/Thu Feb NNN NNN:3NNN:5NNN UTC NNN/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN: non-aborted transaction exists already

^- looks like an honest bug. Can you repro that?

After upgrading to a more recent version I can't find this specific instance, but instead I see a couple instances of the following:

W0217 16:52:19.808122 15926 kv/txn_coord_sender.go:637  heartbeat to "sql/lease.go:149 LeaseStore.Acquire" id=8448a7d2 key=/Table/11/1/52/1/Wed Feb 17 16:56:13 UTC 2016/3/0 rw=true pri=0.02687466 iso=SERIALIZABLE stat=PENDING epo=1 ts=1455727903.217180779,1 orig=1455727903.217180779,1 max=1455727903.244803122,0 failed: retry txn "sql/lease.go:149 LeaseStore.Acquire" id=8448a7d2 key=/Table/11/1/52/1/Wed Feb 17 16:56:13 UTC 2016/3/0 rw=true pri=0.02687466 iso=SERIALIZABLE stat=PENDING epo=1 ts=1455727904.246115227,1 orig=1455727903.217180779,1 max=1455727903.244803122,0

is that similar/equivalent?

pq: storage/store.go:xxx: Rejecting command with timestamp in the future: NNN (NNN.NNNms ahead)

^- this is #3087 I haven't been able to track down (failure to repro).

I get this one when I forget to reset my clock drift before starting a new test -- it seems to happen when I let the clocks drift slowly but surely while the cluster is idle, then send a request once there is enough drift.

It hasn't been happening as much lately (my code has gotten better at resetting the test env), but if I catch it again I'll let you know.

@andreimatei
Contributor

What I'm planning to do is expose all the retryable errors (mostly printed with "pq: retry" in the CLI), plus some errors that are non-retryable from the KV perspective (I think the ones printed with "pq: txn aborted"), to the user as a single error message with a single error code. The user can then issue something like BEGIN TRANSACTION RETRY and try everything again (a sketch of that client-side loop follows the list below).
All the other errors, I believe, are bugs.

So let's go through your list again:

query error: pq: there is no transaction in progress
This happens when the server already has reported an error and the client tries to send more queries instead of closing the failed transaction.
  • ummm, I don't think this is what you're supposed to see when you're sending queries in an aborted txn (you're supposed to see something like "txn already aborted").
    So it sounds like a bug to me. If you can repro, that'd be very helpful. Unless you're seeing this with the CLI due to txn timeouts (sql: ./cockroach sql should heartbeat during open transaction #3031).
pq: storage/store.go:xxx: Rejecting command with timestamp in the future: NNN (NNN.NNNms ahead)
This happens when the cluster is not "on time" (with a big offset).
  • don't know. Toby says it's a bug? Sounds like a reasonable error to me, in which case we should document it.
pq: read at time NNN.NNN,NNN encountered previous write with future timestamp NNN.NNN,NNN within uncertainty interval
This happens when the cluster is not "on time" (with a small offset).
  • this is ReadWithinUncertaintyIntervalError, a retryable one.
pq: retry txn "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=true 
pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN 
max=NNN.NNN,NNN
  • cool, retryable
pq: txn aborted "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/NNN/NNN/NNN rw=true pri=NNN.NNN iso=SERIALIZABLE stat=ABORTED epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN
  • cool, aborted. We'll surface as retryable.
pq: txn aborted "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=true pri=NNN.NNN iso=SNAPSHOT stat=ABORTED epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN
  • cool, aborted. We'll surface as retryable.
pq: txn "sql/lease.go:xx LeaseStore.Acquire" id=NN key=/Table/NNN/NNN/NNN/NNN/Thu Feb  NNN NNN:3NNN:5NNN UTC NNN/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN: non-aborted transaction exists already
  • sounds like a bug
pq: txn "sql/executor.go:xx sql" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN failed to push "sql/executor.go:xx (*Executor).execStmt" id=NN key=/Table/NNN/NNN/"a"/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN
  • this is TransactionPushError, a retryable one.
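Here is a minimal sketch of the client-side loop this implies, under the assumption that the single error code for the retryable class ends up being SQLSTATE 40001 (a placeholder, not something decided in this thread). BEGIN TRANSACTION RETRY is also only hypothetical syntax, so the sketch simply re-runs the whole transaction from the client whenever that code comes back; the DSN and statement are placeholders too.

```go
package main

import (
	"database/sql"
	"errors"
	"log"

	"github.com/lib/pq" // driver registration plus the pq.Error type
)

// runTxnWithRetry runs fn inside a transaction and, if the single retryable
// error code comes back, rolls back and tries everything again.
func runTxnWithRetry(db *sql.DB, fn func(*sql.Tx) error) error {
	for {
		tx, err := db.Begin()
		if err != nil {
			return err
		}
		err = fn(tx)
		if err == nil {
			err = tx.Commit()
		}
		if err == nil {
			return nil
		}
		_ = tx.Rollback()
		var pqErr *pq.Error
		if errors.As(err, &pqErr) && pqErr.Code == "40001" {
			continue // retryable class: try everything again
		}
		return err // anything else: surface to the caller (likely a bug, per above)
	}
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = runTxnWithRetry(db, func(tx *sql.Tx) error {
		_, err := tx.Exec("UPDATE kv SET v = v + 1 WHERE k = $1", "a")
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}
```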

@tbg
Member

tbg commented Feb 17, 2016

I get this one when I forget to reset my clock drifting before starting
a new test

Ah, ok. If you're letting your clocks drift close to or above the offset, then that's an expected error. In that issue, it happens during Docker acceptance tests, where all nodes share the same clock - no bueno.

After upgrading to a more recent version I can't find this specific instance, but instead I see a couple instances of the following:

This is what happens if the (central) transaction record indicates that you need to restart. It's a bit unusual to see in practice unless the transaction is taking a lot of time and isn't doing much - for example, BEGIN; INSERT(....); WAIT, WAIT, WAIT.... could catch that error if someone pushes the txn when running into the intent in the meantime. Were you doing something where that would be expected?

pq: txn "sql/lease.go:xx LeaseStore.Acquire" id=NN key=/Table/NNN/NNN/NNN/NNN/Thu Feb NNN NNN:3NNN:5NNN UTC NNN/NNN/NNN rw=false pri=NNN.NNN iso=SERIALIZABLE stat=PENDING epo=NNN ts=NNN.NNN,NNN orig=NNN.NNN,NNN max=NNN.NNN,NNN: non-aborted transaction exists already

^- looks like an honest bug. Can you repro that?

I actually know how this can happen, but I would still classify it as a bug: if the first write in a txn is multi-range and fails retryably on the second range, the command will fail with that error on the first range (since the previous attempt actually succeeded there, so we already have a transaction table entry from BeginTransaction). I believe this should not reoccur once #4443 is fixed.

@knz
Contributor Author

knz commented Feb 25, 2016

(While looking at #4036)

@knz
Contributor Author

knz commented Mar 23, 2016

New error messages cropping up recently:

ERROR: failed to send RPC: too many errors encountered (1 of 1 total): rpc error: code = 13 desc = "transport is closing"

ERROR: failed to send RPC: too many errors encountered (1 of 1 total): rpc error: code = 4 desc = "context deadline exceeded"

(These happen when nodes become unreachable on the network during a transaction - see also #5452.)

@petermattis petermattis modified the milestones: Beta, Q2 Apr 7, 2016
@petermattis petermattis modified the milestones: Q2, Q3 Jul 11, 2016
@knz knz modified the milestones: 1.0, Q3 Mar 7, 2017
@spencerkimball spencerkimball removed this from the 1.0 milestone Mar 29, 2017
@spencerkimball
Member

@knz, @andreimatei I moved this off of the 1.0 milestone. If you disagree, please comment on how this work will be scheduled and reset the 1.0 milestone.

@dianasaur323 dianasaur323 added this to the Later milestone Apr 20, 2017
@knz knz added this to Triage in (DEPRECATED) SQL Front-end, Lang & Semantics via automation Apr 28, 2018
@knz knz moved this from Triage to Backlog in (DEPRECATED) SQL Front-end, Lang & Semantics May 3, 2018
@knz knz added A-sql-pgwire pgwire protocol issues. A-sql-executor SQL txn logic labels May 15, 2018
@petermattis petermattis removed this from the Later milestone Oct 5, 2018
@jordanlewis
Member

We've made huge strides with this over the last 3 years, closing.
