unexpected SQLITE_INTERRUPT #18

Closed

fasterthanlime opened this issue Jun 21, 2018 · 8 comments
@fasterthanlime
Contributor

Lines 267 and 268 here: stmt.Reset() and stmt.ClearBindings() can both return errors (notably SQLITE_INTERRUPT), but those errors are never returned from conn.Prepare()

sqlite/sqlite.go

Lines 265 to 281 in 4e3e3e8

func (conn *Conn) Prepare(query string) (*Stmt, error) {
	if stmt := conn.stmts[query]; stmt != nil {
		stmt.Reset()
		stmt.ClearBindings()
		return stmt, nil
	}
	stmt, trailingBytes, err := conn.prepare(query, C.SQLITE_PREPARE_PERSISTENT)
	if err != nil {
		return nil, err
	}
	if trailingBytes != 0 {
		stmt.Finalize()
		return nil, reserr("Conn.Prepare", query, "statement has trailing bytes", C.SQLITE_ERROR)
	}
	conn.stmts[query] = stmt
	return stmt, nil
}

Is this on purpose? If so, can we add a comment that says it's on purpose and where the error will actually be handled?

(I'm currently trying to figure out why some of my connections always end up returning SQLITE_INTERRUPTED, this might or might not be related)

@crawshaw
Owner

The errors from those methods should be repeated by anything that is called on the returned Stmt after prepare. As those errors are properly annotated with source, it should be safe to return them. I'll make a change to do that.
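Roughly what I have in mind (a sketch only, the actual commit may end up looking different): propagate the two errors out of the cached-statement branch instead of discarding them.

func (conn *Conn) Prepare(query string) (*Stmt, error) {
	if stmt := conn.stmts[query]; stmt != nil {
		// Surface Reset/ClearBindings failures (e.g. SQLITE_INTERRUPT)
		// instead of silently handing back a possibly unusable Stmt.
		if err := stmt.Reset(); err != nil {
			return nil, err
		}
		if err := stmt.ClearBindings(); err != nil {
			return nil, err
		}
		return stmt, nil
	}
	// ... rest unchanged ...
}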

If you're seeing SQLITE_INTERRUPT, that almost certainly means the context associated with your connection has been cancelled.
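For context, this is roughly how a deadline gets tied to a connection (an illustrative sketch, not copied from your code):

	ctx, cancel := context.WithTimeout(context.Background(), 150*time.Millisecond)
	defer cancel()
	conn.SetInterrupt(ctx.Done()) // once ctx is cancelled, statements on this conn start returning SQLITE_INTERRUPT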

crawshaw added a commit that referenced this issue Jun 29, 2018
Should not affect program operation but may be helpful in debugging.
For #18.
@fasterthanlime
Contributor Author

If you're seeing SQLITE_INTERRUPT, that almost certainly means the context associated with your connection has been cancelled.

I made sure it wasn't that! See https://github.com/fasterthanlime/chao and this long-ass twitter thread.

tl;dr: I was seeing a lot of SQLITE_INTERRUPT before the context was cancelled; I'm now pretty sure what I'm seeing is a duplicate of #17

@crawshaw crawshaw reopened this Jun 29, 2018
@crawshaw crawshaw changed the title Swallowed errors in Conn.Prepare unexpected SQLITE_INTERRUPT Jun 29, 2018
@crawshaw
Owner

Could you add a printf here and see if you're hitting this code path?

if ErrCode(err) == SQLITE_INTERRUPT {

From what you're describing I don't think that's the problem, but after staring at it for a while I think that code path is problematic.
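Something as simple as this would do (purely illustrative; the surrounding code isn't shown here):

	if ErrCode(err) == SQLITE_INTERRUPT {
		log.Printf("hit the suspect SQLITE_INTERRUPT path: %v", err) // temporary debug print
		// ... existing handling ...
	}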

@crawshaw
Owner

Ah I see from your twitter thread that the SQLITE_INTERRUPT is coming from the sqlite3_prepare call. Hmm.

@crawshaw
Owner

One thought is these sqlite docs:

Any new SQL statements that are started after the sqlite3_interrupt() call and before the running statements reaches zero are interrupted as if they had been running prior to the sqlite3_interrupt() call.

This makes it sound like the problem is that your connection has an in-progress Stmt somewhere that is keeping the underlying connection interrupted when you go to prepare another statement.

One way to test this: when you go to reset the context on the connection, do it by calling dbpool.Put and then dbpool.Get again. The new code I just added will catch the problematic in-progress statement when you call Put and panic, which will tell you where it is.

(I should do the same thing when someone calls SetInterrupt.)
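In code, the test looks roughly like this (a sketch; the exact Pool method signatures may differ from what you have):

	// Instead of re-arming SetInterrupt on a connection you're already holding:
	dbpool.Put(conn)              // the new debug code panics here if a stmt is still in progress
	conn = dbpool.Get(ctx.Done()) // check it out again under the new context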

@fasterthanlime
Contributor Author

Ah I see from your twitter thread that the SQLITE_INTERRUPT is coming from the sqlite3_prepare call. Hmm.

At least it did at some point! I've also seen SQLITE_INTERRUPT returned from _step(), from _reset(), from _clearbinding(), etc.

Any new SQL statements that are started after the sqlite3_interrupt() call and before the running statements reaches zero are interrupted as if they had been running prior to the sqlite3_interrupt() call.

Yep, that plus the sqlite3_stmt_busy docs made me think that's what was happening: cycling through all cached statements and resetting them on dbpool.Get seems to have fixed that problem.
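Concretely, the workaround looks something like this (a sketch; conn.stmts is unexported, so myStmts below stands for whatever app-side list of prepared statements you keep, which is an assumption about the wiring):

	conn := dbpool.Get(ctx.Done())
	for _, stmt := range myStmts[conn] { // myStmts: hypothetical app-side bookkeeping
		stmt.Reset() // make sure nothing on this conn still counts as a running statement
	}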

...but then I ran into other issues, like SetInterrupt hanging forever on a cancelCh receive (in the goroutine, not in cancelInterrupt), and other times I was just getting consistent SQLITE_LOCKED out of both connections of the pool. I don't think I can invest any more time in exploring this for now, sorry 😞

@fasterthanlime
Contributor Author

Update:

I've updated 🔥 https://github.com/fasterthanlime/chao; here's a rundown of what it does and what I found:


What chao does

chao tries to flirt with the limit of how many INSERT ON CONFLICT DO UPDATE SET statements (upserts) it can run in memory within a given deadline.

The goal is to have some of the queries succeed and some of them get interrupted, hopefully at various stages:

  • It'll never interrupt in Prepare() because all the statements are cached
  • It sometimes gets interrupted in pool.Get() - but there's retry logic built-in
  • It sometimes gets interrupted in Stmt.Reset()
  • It sometimes gets interrupted in Stmt.Step()
  • etc.

It tries to use a small (but > 2) connection pool (so it's easy to investigate in dlv) and a small number of workers (so the logs aren't too cluttered), and varies the workload (see the sketch after this list):

  • If the query succeeds too early, it increases the number of updates
  • If the query is interrupted, it decreases the number of updates
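Conceptually, something like this (illustrative factors, not the actual chao constants):

	// Hypothetical feedback loop adjusting the workload between queries.
	if finishedBeforeDeadline {
		numUpdates = numUpdates * 6 / 5 // succeeded too early: push harder
	} else {
		numUpdates = numUpdates * 9 / 10 // interrupted: back off a little
	}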

It runs several "rounds" of heavy querying, each separated by a few seconds of "recess".

After all rounds are done, it runs a very simple query on each connection of the pool, with a 2-second timeout. If this query succeeds, the connection is deemed "healthy".

Without the workaround

Without the workaround, a lot of queries get interrupted, because we intentionally try to push the workload until we can't complete it within the deadline anymore:

(-) numUpdates = 4605
  0 ...- [0xc4200a2780] succeeded 97.696724ms early
(+) numUpdates = 5526
  0 .... [0xc4200a2320] interrupted 172.563µs after deadline (173.172563ms duration)
(-) numUpdates = 4973
  0 ...- [0xc4200a2780] succeeded 140.473827ms early
(+) numUpdates = 5967
  0 .... [0xc4200a2640] interrupted 263.001µs after deadline (111.263001ms duration)
(-) numUpdates = 5370
  0 .... [0xc4200a25a0] interrupted 316.853µs after deadline (136.316853ms duration)
(-) numUpdates = 4833
  0 .... [0xc4200a2460] interrupted 387.727µs after deadline (199.387727ms duration)
(-) numUpdates = 4349
  0 .... [0xc4200a2500] interrupted 264.914µs after deadline (180.264914ms duration)
(-) numUpdates = 3914
  0 .... [0xc4200a26e0] interrupted 400.294µs after deadline (125.400294ms duration)
  0 ...- [0xc4200a2780] succeeded 133.105419ms early
(+) numUpdates = 4696
  0 .... [0xc4200a2190] interrupted 368.89µs after deadline (147.36889ms duration)
(-) numUpdates = 4226
  0 .... [0xc4200a23c0] interrupted 364.055µs after deadline (125.364055ms duration)
(-) numUpdates = 3803
  0 .... [0xc4200a2320] interrupted 438.773µs after deadline (129.438773ms duration)
  0 .... [0xc4200a2280] interrupted 299.835µs after deadline (168.299835ms duration)

At the final connection health check, I'm usually seeing "1 / 10 healthy conns":

Now testing connections...
Testing connection 0...
[0] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 1...
[1] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 2...
[2] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 3...
[3] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 4...
[4] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 5...
Testing connection 6...
[6] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 7...
[7] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 8...
[8] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 9...
[9] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Found 1 / 10 healthy conns
 :(

With the workaround

The workaround is to just... not set a deadline, i.e. never cancel the connection's context.

So it's not doing quite the same thing: it effectively robs us of the ability to cancel requests early (for my use case, that's ok).
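In code terms, the workaround boils down to something like this (a sketch; I'm assuming a nil done channel means "never interrupt"):

	ctx := context.Background()   // no deadline, never cancelled
	conn.SetInterrupt(ctx.Done()) // Done() is nil here, so nothing can ever fire an interrupt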

As a result, all the queries succeed (the only failure condition is us not waiting on the pool long enough):

  5 .... [0x0] pool timeout (105.102637ms / 105ms)
  5 .... [0x0] pool timeout (105.156695ms / 105ms)
  3 .... [0x0] pool timeout (129.141502ms / 129ms)
  5 .... [0x0] pool timeout (161.115876ms / 161ms)
  5 ...+ [0xc4200a2460] succeeded 223.406454ms late
  0 ...+ [0xc4200a2280] succeeded 64.341857ms late
  1 ...+ [0xc4200a2280] succeeded 120.187759ms late
  7 ...+ [0xc4200a2460] succeeded 145.737485ms late
  5 .... [0x0] pool timeout (122.159863ms / 122ms)
  3 .... [0x0] pool timeout (143.103395ms / 143ms)
  8 .... [0x0] pool timeout (176.092786ms / 176ms)
  1 ...+ [0xc4200a25a0] succeeded 462.855671ms late
 10 ...+ [0xc4200a2640] succeeded 206.411653ms late
  1 ...+ [0xc4200a2640] succeeded 162.279826ms late
  6 ...+ [0xc4200a2780] succeeded 213.400416ms late
  0 ...+ [0xc4200a2780] succeeded 101.938924ms late
 10 ...+ [0xc4200a2320] succeeded 228.328918ms late
  7 ...+ [0xc4200a2500] succeeded 238.080748ms late
  5 ...+ [0xc4200a2190] succeeded 242.975096ms late
  7 ...+ [0xc4200a2500] succeeded 241.771349ms late
 15 ...+ [0xc4200a2780] succeeded 116.676525ms late
  9 ...+ [0xc4200a2500] succeeded 239.820994ms late
 18 ...+ [0xc4200a2460] succeeded 201.173352ms late
  1 ...+ [0xc4200a26e0] succeeded 187.169667ms late
  1 ...+ [0xc4200a2320] succeeded 232.396621ms late
 18 ...+ [0xc4200a2640] succeeded 226.160701ms late
  8 ...+ [0xc4200a2190] succeeded 210.415338ms late
 24 ...+ [0xc4200a2320] succeeded 136.373109ms late
  9 ...+ [0xc4200a23c0] succeeded 153.115875ms late
  1 ...+ [0xc4200a2280] succeeded 224.001045ms late
  0 ...+ [0xc4200a2190] succeeded 101.808727ms late
 10 ...+ [0xc4200a26e0] succeeded 273.496389ms late
  1 ...+ [0xc4200a2320] succeeded 176.150518ms late
All done in 7.47433119s
Sleeping 2 seconds to let SQLite settle...

At the end, all connections are healthy:

Now testing connections...
Testing connection 0...
Testing connection 1...
Testing connection 2...
Testing connection 3...
Testing connection 4...
Testing connection 5...
Testing connection 6...
Testing connection 7...
Testing connection 8...
Testing connection 9...
Found 10 / 10 healthy conns
 :)

What could cause SQLite to return SQLITE_INTERRUPT?

The main suspect (as far as I'm concerned) is this:

Any new SQL statements that are started after the sqlite3_interrupt() call and before the running statements reaches zero are interrupted as if they had been running prior to the sqlite3_interrupt() call

🔗 https://www.sqlite.org/c3ref/interrupt.html

What does SQLite consider a "running statement"?

The sqlite3_stmt_busy(S) interface returns true (non-zero) if the prepared statement S has been stepped at least once using sqlite3_step(S) but has neither run to completion (returned SQLITE_DONE from sqlite3_step(S)) nor been reset using sqlite3_reset(S). The sqlite3_stmt_busy(S) interface returns false if S is a NULL pointer. If S is not a NULL pointer and is not a pointer to a valid prepared statement object, then the behavior is undefined and probably undesirable.

This interface can be used in combination with sqlite3_next_stmt() to locate all prepared statements associated with a database connection that are in need of being reset. This can be used, for example, in diagnostic routines to search for prepared statements that are holding a transaction open.

🔗 https://www.sqlite.org/c3ref/stmt_busy.html
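Inside the package, doing what these docs describe would look roughly like this (a cgo sketch, not code from the repo):

	// Walk every prepared statement on the connection and reset the busy ones,
	// so the connection's count of running statements can drop back to zero.
	for s := C.sqlite3_next_stmt(conn.conn, nil); s != nil; s = C.sqlite3_next_stmt(conn.conn, s) {
		if C.sqlite3_stmt_busy(s) != 0 {
			C.sqlite3_reset(s)
		}
	}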

What about the pool.Put check?

I don't think the check is strong enough. Consider this:

sqlite/sqlite.go

Lines 501 to 534 in 3f55bce

func (stmt *Stmt) Step() (rowReturned bool, err error) {
	defer func() {
		stmt.lastHasRow = rowReturned
	}()
	if stmt.bindErr != nil {
		err = stmt.bindErr
		stmt.bindErr = nil
		return false, err
	}
	for {
		stmt.conn.count++
		if err := stmt.interrupted("Stmt.Step"); err != nil {
			return false, err
		}
		switch res := C.sqlite3_step(stmt.stmt); uint8(res) { // reduce to non-extended error code
		case C.SQLITE_LOCKED:
			if res := C.wait_for_unlock_notify(stmt.conn.conn, stmt.conn.unlockNote); res != C.SQLITE_OK {
				return false, stmt.conn.reserr("Stmt.Step(Wait)", stmt.query, res)
			}
			C.sqlite3_reset(stmt.stmt)
			// loop
		case C.SQLITE_ROW:
			return true, nil
		case C.SQLITE_DONE:
			return false, nil
		case C.SQLITE_BUSY, C.SQLITE_INTERRUPT, C.SQLITE_CONSTRAINT:
			// TODO: embed some of these errors into the stmt for zero-alloc errors?
			return false, stmt.conn.reserr("Stmt.Step", stmt.query, res)
		default:
			return false, stmt.conn.extreserr("Stmt.Step", stmt.query, res)
		}
	}
}

In the codepath where we get SQLITE_INTERRUPT from wait_for_unlock_notify:

  • we return false for rowReturned
  • lastHasRow is set to false
  • ...even if a previous Step() call did return a row

The same thing happens (I think) when stmt.interrupted returns a non-nil error.

What doesn't fix this?

Things I've tried:

  • Resetting all cached statements (conn.stmts) on Pool.Put
  • Resetting all cached statements (conn.stmts) on Pool.Get
  • defer stmt.Reset() in exec
    • although I suspect this doesn't work because stmt.Reset() never calls sqlite3_reset if the conn is interrupted by doneCh

What might fix this?

As I'm writing this (I haven't tried it yet), I suspect something along the lines of:

defer func() {
	if rowReturned {
		stmt.lastHasRow = true
	}
}()

...might fix it. I'll try it now, but I did want to leave my findings somewhere.

How do I run chao myself?

  • Make sure you have the latest crawshaw/sqlite in your $GOPATH
  • Review the code to make sure I didn't sneak in any harmful stuff
    • Heck, run it in a VM
  • Review the Makefile
    • It runs go get -v -x with gcflags to disable inlining and optimization
  • Run make
  • Run chao

You might want to run rr chao to capture an execution trace instead. mozilla's rr is Linux-only afaict, and your distro probably has outdated packages for it:

  • https://github.com/mozilla/rr
  • Note that I've seen rr chao get stuck for seemingly no reason
  • You won't be able to dlv attach to an instance of chao already being recorded by rr

Then:

  • Study vars.go
  • Enable the workaround
  • Run make and chao again
  • All connections should be healthy at the end

If you want to debug:

  • Run dlv replay ~/.local/share/rr/chao-XXX if you captured an execution trace (recommended)

or:

  • Enable facilitateDebugging in vars.go
  • Rebuild (make)
  • chao will print a dlv command you can copy-paste, with the right PID
  • Run the command (something like dlv attach 999)
  • Weep whenever delve says it can't read part of the memory

I was able to debug with delve on both Windows 10 64-bit & Linux 64-bit, but I wasn't able to gather any surprising information.

@fasterthanlime
Contributor Author

I found what I believe to be the two root causes so I opened #20 and #21 which are more palatable.

crawshaw added a commit that referenced this issue Jun 30, 2018
According to the SQLite docs:

	Any new SQL statements that are started after the
	sqlite3_interrupt() call and before the running statements
	reaches zero are interrupted as if they had been running
	prior to the sqlite3_interrupt() call.

When we detect a closed interrupt and return false from Step without
resetting a statement, we leave the statement open. This means future
calls to SetInterrupt don't succeed, because of an outstanding sqlite
statement.

To fix that condition, reset stmts when we interrupt Step without
calling into cgo. To fix the more general condition, reset any
open statements on call to SetInterrupt.

Finally, make sure Reset always resets, as discovered by Amos Wenger.

For #18
For #20