Prevent segfault after mmap failure #58
Conversation
Codecov Report
```
@@            Coverage Diff            @@
##             master      #58   +/-   ##
=========================================
  Coverage          ?   85.43%
=========================================
  Files             ?       10
  Lines             ?     1860
  Branches          ?        0
=========================================
  Hits              ?     1589
  Misses            ?      162
  Partials          ?      109
=========================================
```
Continue to review full report at Codecov.
```diff
@@ -332,7 +332,12 @@ func (db *DB) mmap(minsz int) error {

 	// Memory-map the data file as a byte slice.
 	if err := mmap(db, size); err != nil {
 		return err
```
The previously returned error will cause the db library to panic? If that is the case, shouldn't we fix that behavior instead?
Like, if the db fails on mmap, then all following calls into it should return the mmap failure error instead of causing panics?
As far as I can tell, the reason it panics after returning the error is that calling `unmap` invalidates `db.meta0` and/or `db.meta1`. So we could also fix this by modifying the logic to never access those variables if `mmap` fails, and put the db in an "error" state, as you suggested.
That's probably a better solution than attempting to remap the old db. Just recently we received a bug report where it seems the remap failed as well, causing a panic. The one advantage of remapping is that you can still read from the db, you just can't write any more. But seeing as it's a failure state anyway, I don't think being able to read the db adds a ton of value.
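To make the "error state" idea concrete, here is a minimal standalone sketch of the pattern, with a hypothetical `fatalErr` field; this is not how bolt is actually structured, just an illustration of poisoning the db after a failed remap:

```go
package main

import (
	"errors"
	"fmt"
)

// DB models the relevant part of bolt: some mapped state that becomes
// invalid after a failed remap. fatalErr is the hypothetical "poison"
// field; once set, every call returns it instead of touching the state.
type DB struct {
	data     []byte // stands in for db.meta0 / db.meta1
	fatalErr error  // set on the first unrecoverable mmap failure
}

// remap simulates db.mmap failing and poisoning the DB.
func (db *DB) remap() error {
	err := errors.New("mmap: cannot allocate memory")
	db.data = nil     // the old mapping is gone
	db.fatalErr = err // remember the failure
	return err
}

// Get refuses to touch db.data once the DB is poisoned, so callers get
// a clean error instead of a segfault or panic.
func (db *DB) Get(i int) (byte, error) {
	if db.fatalErr != nil {
		return 0, db.fatalErr
	}
	return db.data[i], nil
}

func main() {
	db := &DB{data: []byte{1, 2, 3}}
	fmt.Println(db.remap()) // mmap: cannot allocate memory
	fmt.Println(db.Get(0))  // 0 mmap: cannot allocate memory
}
```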
You are right. I am fine with your current solution, as I mentioned. But I want to make sure this does fix the original issue. Also, ideally we need some sort of test for this.
thank you!
The proposed solution here does not seem ideal, but it is probably OK to accept since it is better than before. But can you please confirm it does fix the problem on Windows?
Hmm, I ran the test script on Windows 10 and got a different panic:
Seems like a similar underlying problem -- poor handling of an out-of-disk-space scenario. However, the error changed when I used a disk with a different size. For a 100MB disk, I got the error above, even after applying the fix in this PR. With a 500MB disk, I got the same segfault as in the original issue, but after applying the fix, I got a clean error instead:
So this PR fixes the problem in some scenarios, but not all. I think it's safe to say that we need a more robust solution to this problem.

re: testing, I'm not sure if this would work, but perhaps we could use build tags to supply a faulty `mmap` implementation (a sketch of the idea follows below).

btw, on Windows you can create a small virtual disk for testing purposes, e.g. with diskpart.
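A sketch of what that build-tag idea might look like. The `faultymmap` tag and file name are hypothetical, and the `mmap(db, size)` signature is taken from the diff above; nothing here is actual bbolt code:

```go
//go:build faultymmap

package bolt

import "errors"

// This file would live alongside bolt_unix.go / bolt_windows.go, whose
// real mmap implementations would carry the inverse //go:build !faultymmap
// constraint. With the tag set, every mmap call fails deterministically,
// so tests can exercise the failure path without filling a real disk.
func mmap(db *DB, sz int) error {
	return errors.New("mmap: simulated failure")
}
```

Tests would then be run with `go test -tags faultymmap` to hit the error path.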
@lukechampine would you like to investigate the deadlock problem (I feel it could be a simple fix)? I hope we could fix the original Windows issue in one PR. I do not have access to any Windows machines right now, so it is hard for me to reproduce.
It looks like the root cause is indeed the same. Briefly:
The deadlock occurs because the stack unwinds "uncleanly," skipping the code that would normally release the db's locks. So the root of the problem is invalid meta pages. In the 500MB case, it seems that the meta pages wind up in inaccessible memory (hence the segfault), whereas in the 100MB case, they remain accessible, but are corrupted somehow, causing the deadlock.
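To illustrate the "unclean unwind" mechanism in isolation (this is not bolt's actual locking code, just a generic Go demonstration under that assumption): a panic that escapes past a Lock/Unlock pair leaves the mutex held, and the next Lock blocks forever.

```go
package main

import (
	"fmt"
	"sync"
)

var mu sync.Mutex

// commit takes the lock and then panics before the unlock runs, standing
// in for an "unclean" unwind through corrupted meta pages.
func commit() {
	mu.Lock()
	panic("corrupted meta page")
	// mu.Unlock() is never reached
}

func main() {
	func() {
		defer func() { recover() }() // a caller that swallows the panic
		commit()
	}()
	fmt.Println("the lock is still held...")
	mu.Lock() // blocks forever: the runtime reports a deadlock
}
```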
The plan sounds good to me.
Copied from boltdb/bolt#706:
I admit that I haven't tested whether bbolt suffers from the same bug, but it seems likely. I don't have immediate access to a Windows machine, so I'd be grateful to anyone willing to run the script linked above (after switching the import to bbolt, of course).
This PR implements the fix that I came up with for the issue. I submitted it to the original bolt repo (boltdb/bolt#707) but got no response, so I went ahead and merged it into my own fork. We've been running it in production for a while now and it seems to work as intended.
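For readers skimming the thread, here is a standalone sketch of the remap-on-failure idea this PR implements, as described in the discussion above. The mmap is faked with a plain allocation, and none of these names are bolt's actual internals:

```go
package main

import (
	"errors"
	"fmt"
)

// fakeMmap stands in for the platform mmap helper, failing above a
// threshold so the fallback path can be exercised without a full disk.
func fakeMmap(size int) ([]byte, error) {
	if size > 1<<20 {
		return nil, errors.New("mmap: cannot allocate memory")
	}
	return make([]byte, size), nil
}

type DB struct {
	data []byte
}

// grow tries to remap at newSize; on failure it restores a mapping at the
// old size so existing readers keep working. As noted in the thread, this
// second mmap can itself fail, which is why an explicit "error state" is
// arguably the more robust fix.
func (db *DB) grow(newSize int) error {
	oldSize := len(db.data)
	data, err := fakeMmap(newSize)
	if err != nil {
		if old, err2 := fakeMmap(oldSize); err2 == nil {
			db.data = old
		}
		return err
	}
	db.data = data
	return nil
}

func main() {
	db := &DB{data: make([]byte, 4096)}
	fmt.Println(db.grow(2 << 20)) // mmap: cannot allocate memory
	fmt.Println(len(db.data))     // 4096: still readable
}
```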