
Prevent segfault after mmap failure #58

Closed

Conversation

lukechampine
Contributor

Copied from boltdb/bolt#706:

Filling the disk with a bolt database causes a segfault on Windows. This script reproduces the bug (tested on Windows 10).

stack trace:

unexpected fault address 0x7fff1040
fatal error: fault
[signal 0xc0000005 code=0x0 addr=0x7fff1040 pc=0x45b9e8]

goroutine 1 [running]:
runtime.throw(0x4ecf03, 0x5)
       C:/Go/src/runtime/panic.go:566 +0x9c fp=0xc0420d5c38 sp=0xc0420d5c18
runtime.sigpanic()
       C:/Go/src/runtime/signal_windows.go:164 +0x10b fp=0xc0420d5c68 sp=0xc0420d5c38
github.com/boltdb/bolt.(*DB).meta(0xc04207e000, 0x1ec)
       C:/Users/nebul/go/src/github.com/boltdb/bolt/db.go:811 +0x38 fp=0xc0420d5cc0 sp=0xc0420d5c68
github.com/boltdb/bolt.(*Tx).rollback(0xc0420841c0)
       C:/Users/nebul/go/src/github.com/boltdb/bolt/tx.go:255 +0x79 fp=0xc0420d5ce8 sp=0xc0420d5cc0
github.com/boltdb/bolt.(*Tx).Commit(0xc0420841c0, 0x0, 0x0)
       C:/Users/nebul/go/src/github.com/boltdb/bolt/tx.go:164 +0x8b2 fp=0xc0420d5e38 sp=0xc0420d5ce8
github.com/boltdb/bolt.(*DB).Update(0xc04207e000, 0xc0420d5ec0, 0x0, 0x0)
       C:/Users/nebul/go/src/github.com/boltdb/bolt/db.go:605 +0x114 fp=0xc0420d5e88 sp=0xc0420d5e38

I admit that I haven't tested whether bbolt suffers from the same bug, but it seems likely. I don't have immediate access to a Windows machine, so I'd be grateful to anyone willing to run the script linked above (after switching the import to bbolt, of course).

This PR implements the fix that I came up with for the issue. I submitted it to the original bolt repo (boltdb/bolt#707) but got no response, so I went ahead and merged it into my own fork. We've been running it in production for a while now and it seems to work as intended.
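
For reference, the kind of fill loop the script runs looks roughly like this (a sketch only, not the exact linked script; the bucket and key names are illustrative):

```go
package main

import (
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("test.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Keep writing 1 MB values until the disk fills up and an error
	// (or, before this fix, a segfault) occurs.
	val := make([]byte, 1<<20)
	for i := 0; ; i++ {
		err := db.Update(func(tx *bolt.Tx) error {
			b, err := tx.CreateBucketIfNotExists([]byte("fill"))
			if err != nil {
				return err
			}
			return b.Put([]byte(fmt.Sprintf("key-%d", i)), val)
		})
		if err != nil {
			log.Fatal(err)
		}
	}
}
```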

@codecov-io

Codecov Report

❗ No coverage uploaded for pull request base (master@3eac9d3).
The diff coverage is 0%.


@@            Coverage Diff            @@
##             master      #58   +/-   ##
=========================================
  Coverage          ?   85.43%           
=========================================
  Files             ?       10           
  Lines             ?     1860           
  Branches          ?        0           
=========================================
  Hits              ?     1589           
  Misses            ?      162           
  Partials          ?      109
Impacted Files Coverage Δ
db.go 82.44% <0%> (ø)
errors.go 0% <0%> (ø)


@@ -332,7 +332,12 @@ func (db *DB) mmap(minsz int) error {

// Memory-map the data file as a byte slice.
if err := mmap(db, size); err != nil {
return err
Contributor

The previously returned error will cause the db library to panic? If that is the case, shouldn't we fix that behavior instead?

Contributor

Like, if the db fails on mmap, then all following calls into it should return the mmap failure error instead of causing panics?

Contributor Author

As far as I can tell, the reason it panics after returning the error is that calling unmap invalidates db.meta0 and/or db.meta1. So we could also fix this by modifying the logic to never access those variables if mmap fails, and putting the db in an "error" state, as you suggested.

That's probably a better solution than attempting to remap the old db. Just recently we received a bug report where it seems the remap failed as well, causing a panic. The one advantage of remapping is that you can still read from the db, you just can't write any more. But seeing as it's a failure state anyway, I don't think being able to read the db adds a ton of value.

Contributor

@lukechampine

You are right. I am fine with your current solution, as I mentioned. But I want to make sure this does fix the original issue. Also, ideally we should have some sort of test for this.

thank you!

@xiang90
Contributor

xiang90 commented Sep 29, 2017

@lukechampine

The proposed solution here does not seem ideal, but it is probably OK to accept since it is better than before.

But can you please confirm it does fix the problem on Windows?

@lukechampine
Contributor Author

Hmm, I ran the test script on Windows 10 and got a different panic:

fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_SemacquireMutex(0xc04205672c, 0x429300)
        C:/Go/src/runtime/sema.go:71 +0x44
sync.(*Mutex).Lock(0xc042056728)
        C:/Go/src/sync/mutex.go:134 +0xf5
github.com/coreos/bbolt.(*DB).Close(0xc0420565a0, 0x0, 0x0)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/db.go:446 +0x56
panic(0x4d6180, 0x512570)
        C:/Go/src/runtime/panic.go:491 +0x291
github.com/coreos/bbolt.(*DB).meta(0xc0420565a0, 0x6)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/db.go:890 +0xc9
github.com/coreos/bbolt.(*Tx).rollback(0xc04206c1c0)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/tx.go:267 +0x88
github.com/coreos/bbolt.(*DB).Update.func1(0xc04206c1c0)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/db.go:659 +0x45
panic(0x4d6180, 0x512570)
        C:/Go/src/runtime/panic.go:491 +0x291
github.com/coreos/bbolt.(*DB).meta(0xc0420565a0, 0x6)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/db.go:890 +0xc9
github.com/coreos/bbolt.(*Tx).rollback(0xc04206c1c0)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/tx.go:267 +0x88
github.com/coreos/bbolt.(*Tx).Commit(0xc04206c1c0, 0x0, 0x0)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/tx.go:161 +0x5fe
github.com/coreos/bbolt.(*DB).Update(0xc0420565a0, 0x505ef0, 0x0, 0x0)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/db.go:674 +0xf9
main.main()
        C:/Users/lukec/go/src/github.com/coreos/bbolt/cmd/fill.go:24 +0xed

Seems like a similar underlying problem -- poor handling of an out-of-disk-space scenario.

However, the error changed when I used a disk with a different size. For a 100MB disk, I got the error above, even after applying the fix in this PR. With a 500MB disk, I got the same segfault as in the original issue, but after applying the fix, I got a clean error instead:

mmap allocate error: truncate: truncate test.db: There is not enough space on the disk.

So this PR fixes the problem in some scenarios, but not all. I think it's safe to say that we need a more robust solution to this problem.

re: testing, I'm not sure if this would work, but perhaps we could use build tags to supply a faulty mmap function that returns an error? Obviously it's infeasible for the tests to actually fill up your disks.
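
Something like this is what I have in mind (purely a sketch; the file name and build tag are hypothetical, and the existing bolt_unix.go / bolt_windows.go files would need a matching !faultymmap constraint so only one mmap is compiled in):

```go
// +build faultymmap

package bolt

import "errors"

// mmap replaces the real platform mmap when the "faultymmap" build tag
// is set, simulating an out-of-disk-space failure so tests can exercise
// the error path without actually filling a disk.
func mmap(db *DB, sz int) error {
	return errors.New("mmap: simulated out-of-space failure")
}
```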

btw, on Windows you can create a small disk for testing purposes like so:
Computer Management > Disk Management > Action Menu > Create VHD > Right click disk > Initialize > Right click Free Space > New Simple Volume

@xiang90
Contributor

xiang90 commented Oct 1, 2017

@lukechampine would you like to investigate the deadlock problem (I feel it could be a simple fix)? I hope we can fix the original Windows issue in one PR. I do not have access to any Windows machines for now, so it is hard for me to reproduce.

@lukechampine
Contributor Author

It looks like the root cause is indeed the same. Briefly:

  • In tx.Commit(), the call to tx.root.spill() fails (likely due to an mmap failure)
  • tx.Commit() attempts to clean up by calling tx.rollback()
  • tx.rollback() calls db.meta()
  • both db.meta0 and db.meta1 fail to validate, resulting in a "should never happen" panic
  • the panic unwinds into the db.Update call, triggering another deferred tx.rollback()
  • tx.rollback() panics again, this time unwinding into the main function
  • main calls a deferred db.Close()

The deadlock occurs because the stack unwinds "uncleanly." Normally, tx.rollback() calls db.rwlock.Unlock(), but in this case it gets interrupted by the panic triggered by db.meta(). And since db.Close() calls db.rwlock.Lock(), we deadlock.
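
Stripped down to just the locking pattern (a toy illustration, not actual bbolt code), it looks like this:

```go
package main

import "sync"

// fakeDB is a stand-in for bolt.DB, showing only the mutex behavior.
type fakeDB struct {
	rwlock sync.Mutex
}

func (db *fakeDB) update() {
	db.rwlock.Lock()
	// tx.rollback() would normally call db.rwlock.Unlock() here, but the
	// panic from db.meta() fires first, so the mutex is never released.
	panic("bolt.DB.meta(): invalid meta pages")
}

func (db *fakeDB) close() {
	db.rwlock.Lock() // blocks forever: the lock was never released
	defer db.rwlock.Unlock()
}

func main() {
	db := &fakeDB{}
	defer db.close() // runs while the panic unwinds -> deadlock instead of a clean crash
	db.update()
}
```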

So the root of the problem is invalid meta pages. In the 500MB case, it seems that the meta pages wind up in inaccessible memory (hence the segfault), whereas in the 100MB case, they remain accessible, but are corrupted somehow, causing the validate method to fail. When I get a chance, I'll add some debug statements to my Windows tests to see if I can establish exactly why validation fails. But I suspect it doesn't really matter. We should proceed with the plan to put the db in a "failure state" as soon as the mmap call fails. I'll make some progress on this later this week. Open to suggestions too.
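
Roughly, the failure-state idea would have this shape (a standalone sketch with made-up names, not a concrete implementation):

```go
package main

import (
	"errors"
	"fmt"
)

// toyDB stands in for bolt.DB, showing only the "failure state" idea:
// once mmap fails, remember the error and return it from every later
// call instead of dereferencing the now-invalid meta pages.
type toyDB struct {
	mmapErr error
}

func (db *toyDB) mmap(size int) error {
	// Pretend the mmap/truncate syscall failed (e.g. the disk is full).
	db.mmapErr = errors.New("mmap allocate error: not enough space on the disk")
	return db.mmapErr
}

func (db *toyDB) Update(fn func() error) error {
	if db.mmapErr != nil {
		// Fail fast: the old mapping may be gone, so don't touch it.
		return fmt.Errorf("db is in a failed state: %v", db.mmapErr)
	}
	return fn()
}

func main() {
	db := &toyDB{}
	_ = db.mmap(1 << 30)
	if err := db.Update(func() error { return nil }); err != nil {
		fmt.Println(err) // a clean error instead of a panic or segfault
	}
}
```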

@xiang90
Contributor

xiang90 commented Oct 2, 2017

We should proceed with the plan to put the db in a "failure state" as soon as the mmap call fails. I'll make some progress on this later this week. Open to suggestions too.

the plan sounds good to me.
