Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Fix race with deleting stale input files from merge, h/t @evanmcc #86

Merged
merged 1 commit into from

3 participants

@slfritchie
Owner

Evan found a failing QuickCheck + PULSE test case that causes a
fold to fail when racing with a merge. The race appears to be
very difficult to trigger via QC + PULSE, alas, even when
tweaking the operation weights to get more puts + folds + merge
operations to happen on average.

My proposed fix is to move the call to purge_setuid_files/1
to the init_keydir() function. init_keydir() knows if
we're opening the keydir for the first time (not_ready or
not), so purge the stale setuid files only in the not_ready
case.

@slfritchie slfritchie Fix race with deleting stale input files from merge, h/t @evanmcc
Evan found a failing QuickCheck + PULSE test case that causes a
fold to fail when racing with a merge.  The race appears to be
very difficult to trigger via QC + PULSE, alas, even when
tweaking the operation weights to get more puts + folds + merge
operations to happen on average.

My proposed fix is to move the call to `purge_setuid_files/1`
to the `init_keydir()` function.  `init_keydir()` knows if
we're opening the keydir for the first time (`not_ready` or
not), so purge the stale setuid files only in the `not_ready`
case.
6dc6cad
@evanmcc

This looks good to me, but best to have another reviewer in the loop. PULSE checking ongoing here as well.

More detail on the race:
This can occur when there's an ongoing merge on a keydir that was frozen prior to the beginning of the merge. If the merge has scheduled deferred deletions, a new thread opening the cask read_write can come in and force those deletion to happen immediately. This is too early in some cases; an ongoing fold may still be iterating over these files, and if one is missing, all of the keys that a deleted file owns at that moment will be omitted from the fold.

@jonmeredith
Owner

+1 merge. All unit tests and quickcheck tests pass. Ran it through it's paces on pulse for 30 mins

OK, passed 10323 tests

54.8238% incr_clock
9.0563% puts
8.9450% put
4.5027% get
3.6507% fork
3.3076% {needs_merge,false}
3.2887% {fork_merge,not_needed}
2.7448% bc_open
2.6975% delete
2.1199% {needs_merge,true}
1.8473% {fork_merge,ok}
0.8787% sync
0.4668% fold_keys
0.4580% bc_close
0.4422% fold
0.3107% {kill,merger}
0.3033% {fork_merge,already_queued}
0.1459% {kill,reader}
0.0101% {fork_merge,'EXIT'}
schedule:    Count: 10323   Min: 57   Max: 7097   Avg: 189.45   Total: 1955733
true
2> q().
@slfritchie slfritchie merged commit 469cd7d into master

1 check passed

Details default The Travis build passed
@engelsanchez engelsanchez deleted the slf-purge-stale-file-race branch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Mar 28, 2013
  1. @slfritchie

    Fix race with deleting stale input files from merge, h/t @evanmcc

    slfritchie authored
    Evan found a failing QuickCheck + PULSE test case that causes a
    fold to fail when racing with a merge.  The race appears to be
    very difficult to trigger via QC + PULSE, alas, even when
    tweaking the operation weights to get more puts + folds + merge
    operations to happen on average.
    
    My proposed fix is to move the call to `purge_setuid_files/1`
    to the `init_keydir()` function.  `init_keydir()` knows if
    we're opening the keydir for the first time (`not_ready` or
    not), so purge the stale setuid files only in the `not_ready`
    case.
This page is out of date. Refresh to see the latest.
Showing with 10 additions and 5 deletions.
  1. +10 −5 src/bitcask.erl
View
15 src/bitcask.erl
@@ -111,8 +111,6 @@ open(Dirname, Opts) ->
%% and loading anyway: if later someone tries to write
%% something, that someone will get a write_locked exception.
_ = bitcask_lockops:delete_stale_lock(write, Dirname),
- %% This purge will acquire the write lock prior to doing anything.
- purge_setuid_files(Dirname),
fresh;
false -> undefined
end,
@@ -124,7 +122,7 @@ open(Dirname, Opts) ->
WaitTime = timer:seconds(get_opt(open_timeout, Opts)),
%% Loop and wait for the keydir to come available.
- case init_keydir(Dirname, WaitTime) of
+ case init_keydir(Dirname, WaitTime, WritingFile /= undefined) of
{ok, KeyDir, ReadFiles} ->
%% Ensure that expiry_secs is in Opts and not just application env
ExpOpts = [{expiry_secs,get_opt(expiry_secs,Opts)}|Opts],
@@ -877,7 +875,7 @@ scan_key_files([Filename | Rest], KeyDir, Acc, CloseFile, EnoentOK) ->
%%
%% Initialize a keydir for a given directory.
%%
-init_keydir(Dirname, WaitTime) ->
+init_keydir(Dirname, WaitTime, ReadWriteModeP) ->
%% Get the named keydir for this directory. If we get it and it's already
%% marked as ready, that indicates another caller has already loaded
%% all the data from disk and we can short-circuit scanning all the files.
@@ -903,6 +901,13 @@ init_keydir(Dirname, WaitTime) ->
Lock = poll_for_merge_lock(Dirname),
try
poll_deferred_delete_queue_empty(),
+ if ReadWriteModeP ->
+ %% This purge will acquire the write lock
+ %% prior to doing anything.
+ purge_setuid_files(Dirname);
+ true ->
+ ok
+ end,
init_keydir_scan_key_files(Dirname, KeyDir)
after
ok = bitcask_lockops:release(Lock)
@@ -919,7 +924,7 @@ init_keydir(Dirname, WaitTime) ->
Value when is_integer(Value), Value =< 0 -> %% avoids 'infinity'!
{error, timeout};
_ ->
- init_keydir(Dirname, WaitTime - 100)
+ init_keydir(Dirname, WaitTime - 100, ReadWriteModeP)
end
end.
Something went wrong with that request. Please try again.