unexpected SQLITE_INTERRUPT #18

Closed

fasterthanlime opened this issue Jun 21, 2018 · 8 comments
@fasterthanlime
Contributor

Lines 267 and 268 here: stmt.Reset() and stmt.ClearBindings() can both return errors (notably SQLITE_INTERRUPT), but those errors are never returned from conn.Prepare()

sqlite/sqlite.go

Lines 265 to 281 in 4e3e3e8

func (conn *Conn) Prepare(query string) (*Stmt, error) {
	if stmt := conn.stmts[query]; stmt != nil {
		stmt.Reset()
		stmt.ClearBindings()
		return stmt, nil
	}
	stmt, trailingBytes, err := conn.prepare(query, C.SQLITE_PREPARE_PERSISTENT)
	if err != nil {
		return nil, err
	}
	if trailingBytes != 0 {
		stmt.Finalize()
		return nil, reserr("Conn.Prepare", query, "statement has trailing bytes", C.SQLITE_ERROR)
	}
	conn.stmts[query] = stmt
	return stmt, nil
}

Is this on purpose? If so, can we add a comment that says it's on purpose and where the error will actually be handled?

(I'm currently trying to figure out why some of my connections always end up returning SQLITE_INTERRUPTED, this might or might not be related)

@crawshaw
Owner

The errors from those methods should be repeated by anything that is called on the returned Stmt after prepare. As those errors are properly annotated with source, it should be safe to return them. I'll make a change to do that.
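Roughly what I have in mind (a sketch only, the actual commit may end up looking different): propagate the two errors out of the cached-statement branch instead of discarding them.

func (conn *Conn) Prepare(query string) (*Stmt, error) {
	if stmt := conn.stmts[query]; stmt != nil {
		// Surface Reset/ClearBindings failures (e.g. SQLITE_INTERRUPT)
		// instead of silently handing back a possibly unusable Stmt.
		if err := stmt.Reset(); err != nil {
			return nil, err
		}
		if err := stmt.ClearBindings(); err != nil {
			return nil, err
		}
		return stmt, nil
	}
	// ... rest unchanged ...
}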

If you're seeing SQLITE_INTERRUPT, that almost certainly means the context associated with your connection has been cancelled.
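For context, this is roughly how a deadline gets tied to a connection (an illustrative sketch, not copied from your code):

	ctx, cancel := context.WithTimeout(context.Background(), 150*time.Millisecond)
	defer cancel()
	conn.SetInterrupt(ctx.Done()) // once ctx is cancelled, statements on this conn start returning SQLITE_INTERRUPT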

crawshaw added a commit that referenced this issue Jun 29, 2018
Should not affect program operation but may be helpful in debugging.
For #18.
@fasterthanlime
Contributor Author

If you're seeing SQLITE_INTERRUPT, that almost certainly means the context associated with your connection has been cancelled.

I made sure it wasn't that! See https://github.com/fasterthanlime/chao and this long-ass twitter thread.

tl;dr: I was seeing a lot of SQLITE_INTERRUPT before the context was cancelled; I'm now pretty sure what I'm seeing is a duplicate of #17

@crawshaw crawshaw reopened this Jun 29, 2018
@crawshaw crawshaw changed the title Swallowed errors in Conn.Prepare unexpected SQLITE_INTERRUPT Jun 29, 2018
@crawshaw
Owner

Could you add a printf here and see if you're hitting this code path?

if ErrCode(err) == SQLITE_INTERRUPT {

From what you're describing I don't think that's the problem, but after staring at it for a while I think that code path is problematic.
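Something as simple as this would do (purely illustrative; the surrounding code isn't shown here):

	if ErrCode(err) == SQLITE_INTERRUPT {
		log.Printf("hit the suspect SQLITE_INTERRUPT path: %v", err) // temporary debug print
		// ... existing handling ...
	}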

@crawshaw
Owner

Ah I see from your twitter thread that the SQLITE_INTERRUPT is coming from the sqlite3_prepare call. Hmm.

@crawshaw
Owner

One thought is these sqlite docs:

Any new SQL statements that are started after the sqlite3_interrupt() call and before the running statements reaches zero are interrupted as if they had been running prior to the sqlite3_interrupt() call.

This makes it sound like the problem is that your connection has an in-progress Stmt somewhere that is keeping the underlying connection interrupted when you go to prepare another statement.

One way to test this: when you go to reset the context on the connection, do it by calling dbpool.Put and then dbpool.Get again. The new code I just added will catch the problematic in-progress statement when you call Put and panic, which will tell you where it is.

(I should do the same thing when someone calls SetInterrupt.)
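In code, the test looks roughly like this (a sketch; the exact Pool method signatures may differ from what you have):

	// Instead of re-arming SetInterrupt on a connection you're already holding:
	dbpool.Put(conn)              // the new debug code panics here if a stmt is still in progress
	conn = dbpool.Get(ctx.Done()) // check it out again under the new context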

@fasterthanlime
Contributor Author

Ah I see from your twitter thread that the SQLITE_INTERRUPT is coming from the sqlite3_prepare call. Hmm.

At least it did at some point! I've also seen SQLITE_INTERRUPT returned from _step(), from _reset(), from _clearbinding(), etc.

Any new SQL statements that are started after the sqlite3_interrupt() call and before the running statements reaches zero are interrupted as if they had been running prior to the sqlite3_interrupt() call.

Yep, that plus the sqlite3_stmt_busy docs made me think that's what was happening: cycling through all cached statements and resetting them on dbpool.Get seems to have fixed that problem.
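Concretely, the workaround looks something like this (a sketch; conn.stmts is unexported, so myStmts below stands for whatever app-side list of prepared statements you keep, which is an assumption about the wiring):

	conn := dbpool.Get(ctx.Done())
	for _, stmt := range myStmts[conn] { // myStmts: hypothetical app-side bookkeeping
		stmt.Reset() // make sure nothing on this conn still counts as a running statement
	}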

...but then I ran into other issues, like SetInterrupt hanging forever on a cancelCh receive (in the goroutine, not in cancelInterrupt), and other times I was just getting consistent SQLITE_LOCKED out of both connections of the pool. I don't think I can invest any more time in exploring this for now, sorry 😞

@fasterthanlime
Contributor Author

Update:

I've updated 🔥 https://github.com/fasterthanlime/chao; here's a rundown of what it does and what I found:


What chao does

chao tries to flirt with the limit of how many INSERT ON CONFLICT DO UPDATE SET statements (upserts) it can run in memory within a given deadline.

The goal is to have some of the queries succeed and some of them get interrupted, hopefully at various stages:

  • It'll never interrupt in Prepare() because all the statements are cached
  • It sometimes gets interrupted in pool.Get() - but there's retry logic built-in
  • It sometimes gets interrupted in Stmt.Reset()
  • It sometimes gets interrupted in Stmt.Step()
  • etc.

It tries to use a small (but > 2) connection pool (so it's easy to investigate in dlv) and a small number of workers (so the logs aren't too cluttered), and varies the workload (see the sketch after this list):

  • If the query succeeds too early, it increases the number of updates
  • If the query is interrupted, it decreases the number of updates
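Conceptually, something like this (illustrative factors, not the actual chao constants):

	// Hypothetical feedback loop adjusting the workload between queries.
	if finishedBeforeDeadline {
		numUpdates = numUpdates * 6 / 5 // succeeded too early: push harder
	} else {
		numUpdates = numUpdates * 9 / 10 // interrupted: back off a little
	}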

It runs several "rounds" of heavy querying, each separated by a few seconds of "recess".

After all rounds are done, it runs a very simple query on each connection of the pool, with a 2-second timeout. If this query succeeds, the connection is deemed "healthy".

Without the workaround

Without the workaround, a lot of queries get interrupted, because we intentionally try to push the workload until we can't complete it within the deadline anymore:

(-) numUpdates = 4605
  0 ...- [0xc4200a2780] succeeded 97.696724ms early
(+) numUpdates = 5526
  0 .... [0xc4200a2320] interrupted 172.563µs after deadline (173.172563ms duration)
(-) numUpdates = 4973
  0 ...- [0xc4200a2780] succeeded 140.473827ms early
(+) numUpdates = 5967
  0 .... [0xc4200a2640] interrupted 263.001µs after deadline (111.263001ms duration)
(-) numUpdates = 5370
  0 .... [0xc4200a25a0] interrupted 316.853µs after deadline (136.316853ms duration)
(-) numUpdates = 4833
  0 .... [0xc4200a2460] interrupted 387.727µs after deadline (199.387727ms duration)
(-) numUpdates = 4349
  0 .... [0xc4200a2500] interrupted 264.914µs after deadline (180.264914ms duration)
(-) numUpdates = 3914
  0 .... [0xc4200a26e0] interrupted 400.294µs after deadline (125.400294ms duration)
  0 ...- [0xc4200a2780] succeeded 133.105419ms early
(+) numUpdates = 4696
  0 .... [0xc4200a2190] interrupted 368.89µs after deadline (147.36889ms duration)
(-) numUpdates = 4226
  0 .... [0xc4200a23c0] interrupted 364.055µs after deadline (125.364055ms duration)
(-) numUpdates = 3803
  0 .... [0xc4200a2320] interrupted 438.773µs after deadline (129.438773ms duration)
  0 .... [0xc4200a2280] interrupted 299.835µs after deadline (168.299835ms duration)

At the final connection health check, I'm usually seeing "1 / 10 healthy conns":

Now testing connections...
Testing connection 0...
[0] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 1...
[1] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 2...
[2] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 3...
[3] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 4...
[4] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 5...
Testing connection 6...
[6] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 7...
[7] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 8...
[8] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Testing connection 9...
[9] sqlite.Exec: Stmt.Step: SQLITE_INTERRUPT (SELECT * FROM thieves LIMIT 1)
Found 1 / 10 healthy conns
 :(

With the workaround

The workaround is to just... not set a deadline, i.e. never cancel the connection's context.

So it's not doing quite the same thing: it effectively robs us of the ability to cancel requests early (for my use case, that's ok).
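In code terms, the workaround boils down to something like this (a sketch; I'm assuming a nil done channel means "never interrupt"):

	ctx := context.Background()   // no deadline, never cancelled
	conn.SetInterrupt(ctx.Done()) // Done() is nil here, so nothing can ever fire an interrupt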

As a result, all the queries succeed (the only failure condition is us not waiting on the pool long enough):

  5 .... [0x0] pool timeout (105.102637ms / 105ms)
  5 .... [0x0] pool timeout (105.156695ms / 105ms)
  3 .... [0x0] pool timeout (129.141502ms / 129ms)
  5 .... [0x0] pool timeout (161.115876ms / 161ms)
  5 ...+ [0xc4200a2460] succeeded 223.406454ms late
  0 ...+ [0xc4200a2280] succeeded 64.341857ms late
  1 ...+ [0xc4200a2280] succeeded 120.187759ms late
  7 ...+ [0xc4200a2460] succeeded 145.737485ms late
  5 .... [0x0] pool timeout (122.159863ms / 122ms)
  3 .... [0x0] pool timeout (143.103395ms / 143ms)
  8 .... [0x0] pool timeout (176.092786ms / 176ms)
  1 ...+ [0xc4200a25a0] succeeded 462.855671ms late
 10 ...+ [0xc4200a2640] succeeded 206.411653ms late
  1 ...+ [0xc4200a2640] succeeded 162.279826ms late
  6 ...+ [0xc4200a2780] succeeded 213.400416ms late
  0 ...+ [0xc4200a2780] succeeded 101.938924ms late
 10 ...+ [0xc4200a2320] succeeded 228.328918ms late
  7 ...+ [0xc4200a2500] succeeded 238.080748ms late
  5 ...+ [0xc4200a2190] succeeded 242.975096ms late
  7 ...+ [0xc4200a2500] succeeded 241.771349ms late
 15 ...+ [0xc4200a2780] succeeded 116.676525ms late
  9 ...+ [0xc4200a2500] succeeded 239.820994ms late
 18 ...+ [0xc4200a2460] succeeded 201.173352ms late
  1 ...+ [0xc4200a26e0] succeeded 187.169667ms late
  1 ...+ [0xc4200a2320] succeeded 232.396621ms late
 18 ...+ [0xc4200a2640] succeeded 226.160701ms late
  8 ...+ [0xc4200a2190] succeeded 210.415338ms late
 24 ...+ [0xc4200a2320] succeeded 136.373109ms late
  9 ...+ [0xc4200a23c0] succeeded 153.115875ms late
  1 ...+ [0xc4200a2280] succeeded 224.001045ms late
  0 ...+ [0xc4200a2190] succeeded 101.808727ms late
 10 ...+ [0xc4200a26e0] succeeded 273.496389ms late
  1 ...+ [0xc4200a2320] succeeded 176.150518ms late
All done in 7.47433119s
Sleeping 2 seconds to let SQLite settle...

At the end, all connections are healthy:

Now testing connections...
Testing connection 0...
Testing connection 1...
Testing connection 2...
Testing connection 3...
Testing connection 4...
Testing connection 5...
Testing connection 6...
Testing connection 7...
Testing connection 8...
Testing connection 9...
Found 10 / 10 healthy conns
 :)

What could cause SQLite to return SQLITE_INTERRUPT?

The main suspect (as far as I'm concerned) is this:

Any new SQL statements that are started after the sqlite3_interrupt() call and before the running statements reaches zero are interrupted as if they had been running prior to the sqlite3_interrupt() call

🔗 https://www.sqlite.org/c3ref/interrupt.html

What does SQLite consider a "running statement"?

The sqlite3_stmt_busy(S) interface returns true (non-zero) if the prepared statement S has been stepped at least once using sqlite3_step(S) but has neither run to completion (returned SQLITE_DONE from sqlite3_step(S)) nor been reset using sqlite3_reset(S). The sqlite3_stmt_busy(S) interface returns false if S is a NULL pointer. If S is not a NULL pointer and is not a pointer to a valid prepared statement object, then the behavior is undefined and probably undesirable.

This interface can be used in combination with sqlite3_next_stmt() to locate all prepared statements associated with a database connection that are in need of being reset. This can be used, for example, in diagnostic routines to search for prepared statements that are holding a transaction open.

🔗 https://www.sqlite.org/c3ref/stmt_busy.html
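Inside the package, doing what these docs describe would look roughly like this (a cgo sketch, not code from the repo):

	// Walk every prepared statement on the connection and reset the busy ones,
	// so the connection's count of running statements can drop back to zero.
	for s := C.sqlite3_next_stmt(conn.conn, nil); s != nil; s = C.sqlite3_next_stmt(conn.conn, s) {
		if C.sqlite3_stmt_busy(s) != 0 {
			C.sqlite3_reset(s)
		}
	}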

What about the pool.Put check?

I don't think the check is strong enough. Consider this:

sqlite/sqlite.go

Lines 501 to 534 in 3f55bce

func (stmt *Stmt) Step() (rowReturned bool, err error) {
	defer func() {
		stmt.lastHasRow = rowReturned
	}()
	if stmt.bindErr != nil {
		err = stmt.bindErr
		stmt.bindErr = nil
		return false, err
	}
	for {
		stmt.conn.count++
		if err := stmt.interrupted("Stmt.Step"); err != nil {
			return false, err
		}
		switch res := C.sqlite3_step(stmt.stmt); uint8(res) { // reduce to non-extended error code
		case C.SQLITE_LOCKED:
			if res := C.wait_for_unlock_notify(stmt.conn.conn, stmt.conn.unlockNote); res != C.SQLITE_OK {
				return false, stmt.conn.reserr("Stmt.Step(Wait)", stmt.query, res)
			}
			C.sqlite3_reset(stmt.stmt)
			// loop
		case C.SQLITE_ROW:
			return true, nil
		case C.SQLITE_DONE:
			return false, nil
		case C.SQLITE_BUSY, C.SQLITE_INTERRUPT, C.SQLITE_CONSTRAINT:
			// TODO: embed some of these errors into the stmt for zero-alloc errors?
			return false, stmt.conn.reserr("Stmt.Step", stmt.query, res)
		default:
			return false, stmt.conn.extreserr("Stmt.Step", stmt.query, res)
		}
	}
}

In the codepath where we get SQLITE_INTERRUPT from wait_for_unlock_notify:

  • we return false for rowReturned
  • lastHasRow is set to false
  • ...even if a previous Step() call did return a row

The same thing happens (I think) when stmt.interrupted returns a non-nil error.

What doesn't fix this?

Things I've tried:

  • Resetting all cached statements (conn.stmts) on Pool.Put
  • Resetting all cached statements (conn.stmts) on Pool.Get
  • defer stmt.Reset() in exec
    • although I suspect this doesn't work because stmt.Reset() never calls sqlite3_reset if the conn is interrupted by doneCh

What might fix this?

As I'm writing this (I haven't tried it yet), I suspect something along the lines of:

defer func() {
	if rowReturned {
		stmt.lastHasRow = true
	}
}()

...might fix it. I'll try it now, but I did want to leave my findings somewhere.

How do I run chao myself?

  • Make sure you have the latest crawshaw/sqlite in your $GOPATH
  • Review the code to make sure I didn't sneak in any harmful stuff
    • Heck, run it in a VM
  • Review the Makefile
    • It runs go get -v -x with gcflags to disable inlining and optimization
  • Run make
  • Run chao

You might want to run rr chao to capture an execution trace instead. mozilla's rr is Linux-only afaict, and your distro probably has outdated packages for it:

  • https://github.com/mozilla/rr
  • Note that I've seen rr chao get stuck for seemingly no reason
  • You won't be able to dlv attach to an instance of chao already being recorded by rr

Then:

  • Study vars.go
  • Enable the workaround
  • Run make and chao again
  • All connections should be healthy at the end

If you want to debug:

  • Run dlv replay ~/.local/share/rr/chao-XXX if you captured an execution trace (recommended)

or:

  • Enable facilitateDebugging in vars.go
  • Rebuild (make)
  • chao will print a dlv command you can copy-paste, with the right PID
  • Run the command (something like dlv attach 999)
  • Weep whenever delve says it can't read part of the memory

I was able to debug with delve on both Windows 10 64-bit & Linux 64-bit, but I wasn't able to gather any surprising information.

@fasterthanlime
Contributor Author

I found what I believe to be the two root causes so I opened #20 and #21 which are more palatable.

crawshaw added a commit that referenced this issue Jun 30, 2018
According to the SQLite docs:

	Any new SQL statements that are started after the
	sqlite3_interrupt() call and before the running statements
	reaches zero are interrupted as if they had been running
	prior to the sqlite3_interrupt() call.

When we detect a closed interrupt and return false from Step without
resetting a statement, we leave the statement open. This means future
calls to SetInterrupt don't succeed, because of an outstanding sqlite
statement.

To fix that condition, reset stmts when we interrupt Step without
calling into cgo. To fix the more general condition, reset any
open statements on call to SetInterrupt.

Finally, make sure Reset always resets, as discovered by Amos Wenger.

For #18
For #20