-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove the possibility for merge and open to race. #150
Conversation
-1 as-is, sorry. I'm seeing nondeterministic failures of the bitcask_qc_fsm:prop_bitcask() test: the failing cases appear to have zero failures while shrinking, which means that the original case == shrunk case, and attempts to get it to fail again don't work, e.g. One counterexample:
Another, with a more common postcondition failure:
|
Hrm. The 2nd non-deterministic case that I describe above doesn't appear to happen on |
There are two ways to go with this, both seem to fix the issue. Still doing a basic PULSE run on 1, but it passed 450 seconds of the fsm test without throwing either counter.
Opinions? My preference, due to it being simpler, is to do 2, but I can see an argument for 1 being that it's a smaller change and we're very close to release (also it's easier to backport). |
https://github.com/basho/bitcask/tree/pevm-avoid-merge-open-race-opt1 |
Hi, Evan. pevm-avoid-merge-open-race-opt1 hasn't fixed the race, sorry. Every failure that I see has a common pattern:
|
For example:
|
Sorry about the multiple tries here, my machine hasn't done a great job of hitting these. |
Spent a lot of time today trying to get this to happen reliably, but haven't seen it once. I pushed an exploratory commit to the opt1 branch tweaking the locking behavior. Scott, if you have the time, could you run it and see if it turns the counter from a model check failure to a badmatch on the lock? Also, any repro tips would be appreciated. I tried running it on a ramdisk, running several testers in parallel to slow the disk down, but nothing seems to help, even on the original code. |
FWIW, bitcask_qc_fsm's To avoid tearing my SSD to shreds, I've started using a RAM disk for QuickCheck testing. My OS X steps are borrowed from http://bogner.sh/2012/12/os-x-create-a-ram-disk-the-easy-way/ Looking at commits going forward in time:
If things are working correctly as of 6a549bf, is it worth declaring victory there? (Assuming that yet-to-be-finished long-term QuickCheck & PULSE testing at that commit doesn't find another problem.) |
commit 6a549bf fixes #95, but leaves #149 hanging. I am OK with |
FWIW, I see commit 6a549bf passing the |
PULSE testing doesn't close and reopen fast enough to hit this issue, I don't think. I've found the issue and will likely rebase this branch Real Soon Now. The issue is that I'd really like to get my remove-merge-delete branch in (opt 2 above), but there are some additional issues with it, and I am out of time. I am going to clean up the opt one branch and make that the review candidate today, and I'll revisit the cleanup later. |
Other than the PULSE comment, ignore what I said in the last thing. There's strong evidence (I shouldn't have been so definite last time, but I am pretty sure here), that this is a test-design problem. The issue seems to be that the delete worker isn't getting reset at the start of the test, so it can still be carrying delete orders from several test ago when it finally triggers. Here's some debugging prints showing the issue:
As you can see, there are a number of deletes spread throughout the various opens and merges. A delete queued in this execution would be prefixed with FF, and there are none. You can see that we're seeing deletes of file 7 and 10 even though we've only seen a few puts. Seeing Anyway, I am feeling non-confident about anything, but I think that I might actually be on to it this time. I am going to let things churn for a couple of hours on this branch, then rebase if everything is still looking good. The branch that I am currently testing is: poll-race-try2 |
opens always work.
- avoid fate-sharing by not caring about the delete queue. - clean up an unused variable warning in the maybe_new nif.
rebased to current develop, fixed a few more test errors, should be good to go now. pre-rebase it did 3 hours of qc_fsm and 4 hours of pulse. Restarting PULSE now. |
My testing over the weekend found another race. Quickly, this exact pattern match needs to be more flexible: https://github.com/basho/bitcask/blob/pevm-avoid-merge-open-race/src/bitcask.erl#L1349 ... the failure case isn't 100% deterministic, so I'm going to try to hunt for where the nondeterminism is creeping in.
|
Hrm, no that pattern match doesn't need flexibility. But that case writes a key, deletes it, merges it away, close & reopen ... and the reopen discovers that all data files are setuid. So the keydir has never seen a call to increment_file_id() and thus the keydir's view of the world is wrong. I have a patch testing right now..... |
…minism in PULSE test
Commit f085660 has passed 8 CPU days without failing, yay. |
ping? has there been code review or just testing? do we need someone else |
@evanmcc I'm waiting for review of my 5 commits from last week. |
They look good to me. |
+1 f085660 |
Remove the possibility for merge and open to race. Reviewed-by: slfritchie
+1 f085660 |
Hrm, I hope I haven't confused borshop with an extra approval. |
@borshop merge |
As described in #95, if a vnode dies and takes down your bitcask instance, it's possible that a merge will get there first, causing repeated open failures that can take the whole node down. This change alters how open and merge work, disallowing merges if the bitcask instances is not currently opened by a pid in the same VM.
Additionally, as described in #149, initialization can hang on startup if the deletion queue is being blocked. Since that polling was intended to fix an issue on the merge init path (which no longer exists), we can remove it. The removal requires making some other pathways more robust.
This PR also contains some test cleanup.