
Prevent segfault after mmap failure #58

Closed
wants to merge 1 commit

Conversation

lukechampine
Contributor

Copied from boltdb/bolt#706:

Filling the disk with a bolt database causes a segfault on Windows. This script reproduces the bug (tested on Windows 10).
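(The reproduction script linked in the original issue isn't included in this copy. The sketch below is only an approximation of what such a fill program might look like; the file name, bucket name, and value size are assumptions, not the original script.)

package main

import (
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("test.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	buf := make([]byte, 1<<20) // 1 MiB of filler per key
	for i := 0; ; i++ {
		err := db.Update(func(tx *bolt.Tx) error {
			b, err := tx.CreateBucketIfNotExists([]byte("fill"))
			if err != nil {
				return err
			}
			return b.Put([]byte(fmt.Sprintf("key-%d", i)), buf)
		})
		if err != nil {
			// On a full disk this should report an error cleanly rather
			// than segfaulting inside the library.
			log.Fatal(err)
		}
	}
}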

stack trace:

unexpected fault address 0x7fff1040
fatal error: fault
[signal 0xc0000005 code=0x0 addr=0x7fff1040 pc=0x45b9e8]

goroutine 1 [running]:
runtime.throw(0x4ecf03, 0x5)
       C:/Go/src/runtime/panic.go:566 +0x9c fp=0xc0420d5c38 sp=0xc0420d5c18
runtime.sigpanic()
       C:/Go/src/runtime/signal_windows.go:164 +0x10b fp=0xc0420d5c68 sp=0xc0420d5c38
github.com/boltdb/bolt.(*DB).meta(0xc04207e000, 0x1ec)
       C:/Users/nebul/go/src/github.com/boltdb/bolt/db.go:811 +0x38 fp=0xc0420d5cc0 sp=0xc0420d5c68
github.com/boltdb/bolt.(*Tx).rollback(0xc0420841c0)
       C:/Users/nebul/go/src/github.com/boltdb/bolt/tx.go:255 +0x79 fp=0xc0420d5ce8 sp=0xc0420d5cc0
github.com/boltdb/bolt.(*Tx).Commit(0xc0420841c0, 0x0, 0x0)
       C:/Users/nebul/go/src/github.com/boltdb/bolt/tx.go:164 +0x8b2 fp=0xc0420d5e38 sp=0xc0420d5ce8
github.com/boltdb/bolt.(*DB).Update(0xc04207e000, 0xc0420d5ec0, 0x0, 0x0)
       C:/Users/nebul/go/src/github.com/boltdb/bolt/db.go:605 +0x114 fp=0xc0420d5e88 sp=0xc0420d5e38

I admit that I haven't tested whether bbolt suffers from the same bug, but it seems likely. I don't have immediate access to a Windows machine, so I'd be grateful to anyone willing to run the script linked above (after switching the import to bbolt, of course).

This PR implements the fix that I came up with for the issue. I submitted it to the original bolt repo (boltdb/bolt#707) but got no response, so I went ahead and merged it into my own fork. We've been running it in production for a while now and it seems to work as intended.

@codecov-io

Codecov Report

❗ No coverage uploaded for pull request base (master@3eac9d3).
The diff coverage is 0%.

@@            Coverage Diff            @@
##             master      #58   +/-   ##
=========================================
  Coverage          ?   85.43%           
=========================================
  Files             ?       10           
  Lines             ?     1860           
  Branches          ?        0           
=========================================
  Hits              ?     1589           
  Misses            ?      162           
  Partials          ?      109
Impacted Files   Coverage   Δ
db.go            82.44%     <0%> (ø)
errors.go        0%         <0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -332,7 +332,12 @@ func (db *DB) mmap(minsz int) error {

// Memory-map the data file as a byte slice.
if err := mmap(db, size); err != nil {
return err

Contributor

The previously returned error will cause the db library to panic? If that is the case, shouldn't we fix that behavior instead?

Contributor


That is, if the db fails on mmap, then all following calls into it should return the mmap failure error instead of causing panics?

Contributor Author


As far as I can tell, the reason it panics after returning the error is that calling unmap invalidates db.meta0 and/or db.meta1. So we could also fix this by modifying the logic to never access those variables if mmap fails, putting the db in an "error" state as you suggested.

That's probably a better solution than attempting to remap the old db. Just recently we received a bug report where it seems the remap failed as well, causing a panic. The one advantage of remapping is that you can still read from the db, you just can't write any more. But seeing as it's a failure state anyway, I don't think being able to read the db adds a ton of value.
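For illustration, here is a rough sketch of what the remap-on-failure approach looks like conceptually inside db.mmap. This is not the exact diff in this PR, and prevSize is a hypothetical variable that would have to be captured before the old mapping is torn down:

// Memory-map the data file as a byte slice.
if err := mmap(db, size); err != nil {
    // Growing the mapping failed (e.g. the disk is full). Try to restore
    // a mapping at the previous size so db.meta0/db.meta1 stay valid,
    // then surface the original failure to the caller.
    if remapErr := mmap(db, prevSize); remapErr != nil {
        return remapErr // both attempts failed; nothing to fall back to
    }
    return err
}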

Contributor


@lukechampine

You are right. I am fine with your current solution, as I mentioned. But I want to make sure this does fix the original issue. Ideally we also need some sort of test for this.

thank you!

@xiang90
Contributor

xiang90 commented Sep 29, 2017

@lukechampine

The proposed solution here does not seem ideal, but it is probably OK to accept since it is better than before.

But can you please confirm it does fix the problem on Windows?

@lukechampine
Contributor Author

Hmm, I ran the test script on Windows 10 and got a different panic:

fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_SemacquireMutex(0xc04205672c, 0x429300)
        C:/Go/src/runtime/sema.go:71 +0x44
sync.(*Mutex).Lock(0xc042056728)
        C:/Go/src/sync/mutex.go:134 +0xf5
github.com/coreos/bbolt.(*DB).Close(0xc0420565a0, 0x0, 0x0)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/db.go:446 +0x56
panic(0x4d6180, 0x512570)
        C:/Go/src/runtime/panic.go:491 +0x291
github.com/coreos/bbolt.(*DB).meta(0xc0420565a0, 0x6)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/db.go:890 +0xc9
github.com/coreos/bbolt.(*Tx).rollback(0xc04206c1c0)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/tx.go:267 +0x88
github.com/coreos/bbolt.(*DB).Update.func1(0xc04206c1c0)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/db.go:659 +0x45
panic(0x4d6180, 0x512570)
        C:/Go/src/runtime/panic.go:491 +0x291
github.com/coreos/bbolt.(*DB).meta(0xc0420565a0, 0x6)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/db.go:890 +0xc9
github.com/coreos/bbolt.(*Tx).rollback(0xc04206c1c0)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/tx.go:267 +0x88
github.com/coreos/bbolt.(*Tx).Commit(0xc04206c1c0, 0x0, 0x0)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/tx.go:161 +0x5fe
github.com/coreos/bbolt.(*DB).Update(0xc0420565a0, 0x505ef0, 0x0, 0x0)
        C:/Users/lukec/go/src/github.com/coreos/bbolt/db.go:674 +0xf9
main.main()
        C:/Users/lukec/go/src/github.com/coreos/bbolt/cmd/fill.go:24 +0xed

Seems like a similar underlying problem -- poor handling of an out-of-disk-space scenario.

However, the error changed when I used a disk with a different size. For a 100MB disk, I got the error above, even after applying the fix in this PR. With a 500MB disk, I got the same segfault as in the original issue, but after applying the fix, I got a clean error instead:

mmap allocate error: truncate: truncate test.db: There is not enough space on the disk.

So this PR fixes the problem in some scenarios, but not all. I think it's safe to say that we need a more robust solution to this problem.

re: testing, I'm not sure if this would work, but perhaps we could use build tags to supply a faulty mmap function that always returns an error? Obviously it's infeasible for the tests to actually fill up a disk.
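To make that idea concrete, here is a sketch of what a fault-injecting mmap behind a build tag could look like. The file name bolt_failmmap.go and the failmmap tag are made up for illustration; the real platform files (bolt_unix.go, bolt_windows.go) would also need the matching negative constraint so that only one mmap implementation is compiled in:

//go:build failmmap

package bolt

import "errors"

// mmap always fails, simulating an out-of-disk-space condition so tests can
// exercise the mmap error path without actually filling a disk.
func mmap(db *DB, sz int) error {
    return errors.New("mmap: simulated allocation failure")
}

A test built with go test -tags failmmap would then hit the failure path on every mmap call.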

btw, on Windows you can create a small disk for testing purposes like so:
Computer Management > Disk Management > Action Menu > Create VHD > Right click disk > Initialize > Right click Free Space > New Simple Volume

@xiang90
Contributor

xiang90 commented Oct 1, 2017

@lukechampine would you like to investigate the deadlock problem? (I feel it could be a simple fix.) I hope we can fix the original Windows issue in one PR. I do not have access to any Windows machines for now, so it is hard for me to reproduce.

@lukechampine
Contributor Author

It looks like the root cause is indeed the same. Briefly:

  • In tx.Commit(), the call to tx.root.spill() fails (likely due to an mmap failure)
  • tx.Commit() attempts to clean up by calling tx.rollback()
  • tx.rollback() calls db.meta()
  • both db.meta0 and db.meta1 fail to validate, resulting in a "should never happen" panic
  • the panic unwinds into the db.Update call, triggering another deferred tx.rollback()
  • tx.rollback() panics again, this time unwinding into the main function
  • main calls a deferred db.Close()

The deadlock occurs because the stack unwinds "uncleanly." Normally, tx.rollback() calls db.rwlock.Unlock(), but in this case it gets interrupted by the panic triggered by db.meta(). And since db.Close() calls db.rwlock.Lock(), we deadlock.
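The pattern is easy to show in isolation. The following standalone program (not bbolt code, just an illustration) panics while a mutex is held, and the deferred close then blocks on that mutex, producing the same "all goroutines are asleep - deadlock!" fatal error:

package main

import (
    "fmt"
    "sync"
)

type db struct{ rwlock sync.Mutex }

func (d *db) update() {
    d.rwlock.Lock()
    // Stand-in for tx.rollback() -> db.meta() panicking before the rollback
    // path ever reaches db.rwlock.Unlock().
    panic("bolt.DB.meta(): invalid meta pages")
}

func (d *db) close() {
    fmt.Println("closing...")
    d.rwlock.Lock() // blocks forever: the lock held by update() was never released
}

func main() {
    d := &db{}
    defer d.close() // runs while the panic unwinds through main, like the deferred db.Close()
    d.update()
}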

So the root of the problem is invalid meta pages. In the 500MB case, it seems that the meta pages wind up in inaccessible memory (hence the segfault), whereas in the 100MB case, they remain accessible, but are corrupted somehow, causing the validate method to fail. When I get a chance, I'll add some debug statements to my Windows tests to see if I can establish exactly why validation fails. But I suspect it doesn't really matter. We should proceed with the plan to put the db in a "failure state" as soon as the mmap call fails. I'll make some progress on this later this week. Open to suggestions too.
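As a rough sketch of that failure-state idea (all names here are invented stand-ins, not bbolt's actual fields or API): record the error when the mapping fails, and return it from subsequent calls instead of ever touching the invalidated meta pages.

package main

import (
    "errors"
    "fmt"
)

// store is a toy stand-in for bolt.DB; fatalErr mimics the proposed failure state.
type store struct {
    fatalErr error
    data     []byte
}

// remap simulates db.mmap: on failure it records the error instead of
// leaving the store in a state that later panics.
func (s *store) remap(size int) error {
    if size > 1<<20 { // pretend the disk / address space ran out
        s.fatalErr = errors.New("mmap allocate error: not enough space on the disk")
        return s.fatalErr
    }
    s.data = make([]byte, size)
    return nil
}

// update refuses to run once the store is poisoned, returning the original
// mmap error rather than touching invalid state.
func (s *store) update() error {
    if s.fatalErr != nil {
        return s.fatalErr
    }
    return nil
}

func main() {
    s := &store{}
    _ = s.remap(2 << 20)    // fails and poisons the store
    fmt.Println(s.update()) // cleanly reports the earlier mmap failure
}

Applied to bbolt, the equivalent would presumably be a field on DB that is set when mmap fails and checked at the start of Update/Begin/View before any meta page is read.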

@xiang90
Contributor

xiang90 commented Oct 2, 2017

We should proceed with the plan to put the db in a "failure state" as soon as the mmap call fails. I'll make some progress on this later this week. Open to suggestions too.

the plan sounds good to me.
