Change behavior of merge when merging for data expiration #87

Merged
merged 13 commits into master

4 participants

@bsparrow435
Collaborator

If and only if a data file's reason for merge is data_expiry, remove its entries from the keydir and delete the file instead of running the normal merge process. In conjunction with this change, the expiry merge threshold is now calculated from the newest timestamp in the file rather than the oldest, so files will only be expired once all of their entries have expired. This eliminates the need to fold over the data file when merging for expiry and avoids folding over the same values repeatedly.

Currently L#694 will add the file to the list of expired files only if the sole reason for merge is data_expiry. In the future, however, this could be changed to add the file if any one of its merge reasons is data_expiry, since in that case all the entries in the file have expired anyway.
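For reference, this is the threshold predicate as it ends up in the diff below: a file gains the data_expired reason only when even its newest entry is older than the expiration cutoff, i.e. when every entry in the file has expired.

    expired_threshold(Cutoff) ->
        fun(F) ->
            if F#file_status.newest_tstamp < Cutoff ->
                   [{data_expired, F#file_status.newest_tstamp, Cutoff}];
               true ->
                   []
            end
        end.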

@bsparrow435 bsparrow435 commented on the diff
src/bitcask.erl
((9 lines not shown))
[catch set_setuid_bit(F) || F <- FileNames],
bitcask_merge_delete:defer_delete(Dirname, IterGeneration, FileNames),
+ if InFiles == [] ->
@bsparrow435 Collaborator

This is necessary because, if all files in the merge are being deleted, we need to delete the output file of the merge as well. Otherwise, we create zero-length files on each merge.
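For reference, the guard as it appears in the diff below: when every input file went down the expiry path, InFiles is empty, the freshly created output file holds nothing, and it is deleted rather than left behind.

    if InFiles == [] ->
           bitcask_fileops:delete(Outfile);
       true ->
           ok
    end,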

src/bitcask.erl
((5 lines not shown))
FileNames = [Filename || {Filename, _Reasons} <- MergableFiles],
- {true, FileNames};
+ ExpiredFiles = [Filename || {Filename, [{data_expired,_,_}]} <- MergableFiles],
@bsparrow435 Collaborator

Possibly this could be changed to a lists:member check for data_expired.

@evanmcc
evanmcc added a note

fwiw, my understanding is that list comprehensions are faster than the stdlib lists stuff.

@bsparrow435 Collaborator

So this will only match if data_expired is the only reason for the merge. But if data_expired is a reason (no matter what the other reasons are), we are going to remove all entries anyway, so it would be more efficient to call the expiry code path. Suggestions on the best way to implement this change are welcome.

@evanmcc
evanmcc added a note

Does this work? Or am I misunderstanding the question?

    FileNames0 = [Filename || {Filename, _Reasons} <- MergableFiles],
    ExpiredFiles = [Filename || {Filename, [{data_expired,_,_}]} <- MergableFiles],
    FileNames = FileNames0 -- ExpiredFiles, 
@bsparrow435 Collaborator

MergableFiles contains a list of {Reason, Threshold, Value} tuples, and this will only match if the list has exactly one element and that element is data_expired. This works, but we could have files that pass the merge thresholds for fragmentation AND data_expired, and those will be put through the old merge process. It would be a nice improvement if those instances also went through the merge_expiry function, since all of their data is expired. The current implementation works, but some cases (such as a high-update workload that also uses data_expiry) will not see the improvement implemented by this change.

@evanmcc
evanmcc added a note

Will the file name be duplicated in the list?

@bsparrow435 Collaborator

Yes, sorry. Each element is {Filename, [ListofTuples]}

@bsparrow435 Collaborator

The order of tuples in ListofTuples is not guaranteed, so you can't match on the first element or anything like that.
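Since the order is not guaranteed, a predicate over the whole reasons list avoids positional matching. A minimal sketch of that shape, which is essentially what the final diff adopts via lists:any:

    IsExpired = fun({data_expired, _, _}) -> true;
                   (_) -> false
                end,
    ExpiredFiles = [Filename || {Filename, Reasons} <- MergableFiles,
                                lists:any(IsExpired, Reasons)],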

@bsparrow435 bsparrow435 commented on the diff
src/bitcask_merge_worker.erl
@@ -132,13 +132,14 @@ do_merge(Args) ->
Start = os:timestamp(),
Result = (catch apply(bitcask, merge, Args)),
ElapsedSecs = timer:now_diff(os:timestamp(), Start) / 1000000,
+ [_,_,{Pargs,_}] = Args,
@bsparrow435 Collaborator

Only matching on the filenames that are merged. There's no reason to log all the args; you can read them from the app.config or get_all_env.
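For context, a sketch of the shape being matched, assuming the worker is always invoked with the three-argument merge form introduced in this PR:

    %% Args = [Dirname, Opts, {FilesToMerge, ExpiredFiles}]
    [_, _, {Pargs, _}] = Args,   %% Pargs is just the list of merged filenames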

@evanmcc

Two things:

  • Does this need to be configurable?
  • Have you run the pulse checks?
@bsparrow435
Collaborator

I did not see a reason for it to be configurable. There is a slight tradeoff: you will now use disk space until all entries in a data file are expired, instead of spending the disk I/O to fold over the same data potentially multiple times before it is all expired. Unless you have a very long expiry_secs and are adding data slowly, I don't see a reason why this behavior would be undesirable.

I have not run the PULSE tests; I pinged you internally about it.

@evanmcc

I think at this point this looks decent. I've run ~20 hours of PULSE testing on it.

It'd be awesome if @slfritchie or @gburd could give it a once-over before it's merged.

@bsparrow435
Collaborator

Added a lists:any check, so now we'll use the new expiry merge path for any MergableFile with a {data_expired,_,_} reason.

src/bitcask.erl
((37 lines not shown))
%% Inner merge function, assumes that bitcask is running and all files exist.
-merge1(_Dirname, _Opts, []) ->
+merge1(_Dirname, _Opts, [], _ExpiredFiles) ->
ok;
@slfritchie Owner

This seems odd: if FilesToMerge = [], then taking an ExpiredFiles argument at all seems strange. After all, if ExpiredFiles is not empty, isn't there stuff to do?

What if this entire function clause were deleted?

@bsparrow435 Collaborator

ExpiredFiles is always a subset of FilesToMerge, so this really should be merge1(_Dirname, _Opts, [], []).
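In other words, since an empty FilesToMerge implies an empty ExpiredFiles, a single clause covers the base case; this is what the final diff below ends up with:

    merge1(_Dirname, _Opts, [], []) ->
        ok;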

@bsparrow435 bsparrow435 merged commit 31126cc into master

1 check failed

default: The Travis CI build could not complete due to an error
Commits on Nov 19, 2012
  1. @bsparrow435

    initial try

    bsparrow435 authored
  2. @bsparrow435

    More changes

    bsparrow435 authored
Commits on Nov 20, 2012
  1. @bsparrow435

    More fixes

    bsparrow435 authored
  2. @bsparrow435

    More fixes

    bsparrow435 authored
  3. @bsparrow435

    more fixes

    bsparrow435 authored
Commits on Nov 21, 2012
  1. @bsparrow435

    more fixes

    bsparrow435 authored
  2. @bsparrow435
Commits on Apr 15, 2013
  1. @joecaswell
Commits on Apr 24, 2013
  1. @bsparrow435
Commits on Apr 25, 2013
  1. @bsparrow435
  2. @bsparrow435
Commits on May 10, 2013
  1. @bsparrow435
  2. @bsparrow435

Addressed review comments

    bsparrow435 authored
Showing with 68 additions and 27 deletions.
  1. +65 −25 src/bitcask.erl
  2. +3 −2 src/bitcask_merge_worker.erl
src/bitcask.erl
@@ -140,9 +140,6 @@ open(Dirname, Opts) ->
{error, Reason}
end.
-
-
-
%% @doc Close a bitcask data store and flush any pending writes to disk.
-spec close(reference()) -> ok.
close(Ref) ->
@@ -446,34 +443,39 @@ iterator_release(Ref) ->
%% into a more compact form.
-spec merge(Dirname::string()) -> ok.
merge(Dirname) ->
- merge(Dirname, [], readable_files(Dirname)).
+ merge(Dirname, [], {readable_files(Dirname), []}).
%% @doc Merge several data files within a bitcask datastore
%% into a more compact form.
-spec merge(Dirname::string(), Opts::[_]) -> ok.
merge(Dirname, Opts) ->
- merge(Dirname, Opts, readable_files(Dirname)).
+ merge(Dirname, Opts, {readable_files(Dirname), []}).
%% @doc Merge several data files within a bitcask datastore
%% into a more compact form.
-spec merge(Dirname::string(), Opts::[_], FilesToMerge::[string()]) -> ok.
merge(_Dirname, _Opts, []) ->
ok;
-merge(Dirname, Opts, FilesToMerge0) ->
+merge(Dirname,Opts,FilesToMerge) when is_list(FilesToMerge) ->
+ merge(Dirname,Opts,{FilesToMerge,[]});
+merge(_Dirname, _Opts, {[],_}) ->
+ ok;
+merge(Dirname, Opts, {FilesToMerge0, ExpiredFiles0}) ->
%% Make sure bitcask app is started so we can pull defaults from env
ok = start_app(),
-
%% Filter the files to merge and ensure that they all exist. It's
%% possible in some circumstances that we'll get an out-of-date
%% list of files.
FilesToMerge = [F || F <- FilesToMerge0,
filelib:is_file(F)],
- merge1(Dirname, Opts, FilesToMerge).
+ ExpiredFiles = [F || F <- ExpiredFiles0,
+ filelib:is_file(F)],
+ merge1(Dirname, Opts, FilesToMerge, ExpiredFiles).
%% Inner merge function, assumes that bitcask is running and all files exist.
-merge1(_Dirname, _Opts, []) ->
+merge1(_Dirname, _Opts, [], []) ->
ok;
-merge1(Dirname, Opts, FilesToMerge) ->
+merge1(Dirname, Opts, FilesToMerge, ExpiredFiles) ->
%% Test to see if this is a complete or partial merge
Partial = not(lists:usort(readable_files(Dirname)) ==
lists:usort(FilesToMerge)),
@@ -544,18 +546,22 @@ merge1(Dirname, Opts, FilesToMerge) ->
TooNew = [F#file_status.filename ||
F <- Summary,
F#file_status.newest_tstamp >= MergeStart],
- InFiles = lists:reverse(
- lists:foldl(fun(F, Acc) ->
+ {InFiles,InExpiredFiles} = lists:foldl(fun(F, {InFilesAcc,InExpiredAcc} = Acc) ->
case lists:member(F#filestate.filename,
TooNew) of
false ->
- [F|Acc];
+ case lists:member(F#filestate.filename,
+ ExpiredFiles) of
+ false ->
+ {[F|InFilesAcc],InExpiredAcc};
+ true ->
+ {InFilesAcc,[F|InExpiredAcc]}
+ end;
true ->
bitcask_fileops:close(F),
Acc
end
- end, [], InFiles1)),
-
+ end, {[],[]}, InFiles1),
%% Setup our first output merge file and update the merge lock accordingly
{ok, Outfile} = bitcask_fileops:create_file(Dirname, Opts),
ok = bitcask_lockops:write_activefile(
@@ -579,6 +585,7 @@ merge1(Dirname, Opts, FilesToMerge) ->
opts = Opts },
%% Finally, start the merge process
+ ExpiredFilesFinished = expiry_merge(InExpiredFiles, LiveKeyDir, []),
State1 = merge_files(State),
%% Make sure to close the final output file
@@ -587,11 +594,16 @@ merge1(Dirname, Opts, FilesToMerge) ->
%% Close the original input files, schedule them for deletion,
%% close keydirs, and release our lock
- [bitcask_fileops:close(F) || F <- State#mstate.input_files],
+ [bitcask_fileops:close(F) || F <- State#mstate.input_files ++ ExpiredFilesFinished],
{_, _, _, {IterGeneration, _, _}} = bitcask_nifs:keydir_info(LiveKeyDir),
- FileNames = [F#filestate.filename || F <- State#mstate.input_files],
+ FileNames = [F#filestate.filename || F <- State#mstate.input_files ++ ExpiredFilesFinished],
[catch set_setuid_bit(F) || F <- FileNames],
bitcask_merge_delete:defer_delete(Dirname, IterGeneration, FileNames),
+ if InFiles == [] ->
@bsparrow435 Collaborator

This is necessary because, if all files in the merge are being deleted, we need to delete the output file of the merge as well. Otherwise, we create zero-length files on each merge.

+ bitcask_fileops:delete(Outfile);
+ true ->
+ ok
+ end,
%% Explicitly release our keydirs instead of waiting for GC
bitcask_nifs:keydir_release(LiveKeyDir),
@@ -604,8 +616,8 @@ consider_for_merge(FragTrigger, DeadBytesTrigger, ExpirationGraceTime) ->
fun (F) ->
(F#file_status.fragmented >= FragTrigger)
orelse (F#file_status.dead_bytes >= DeadBytesTrigger)
- orelse ( (F#file_status.oldest_tstamp > 0) %% means that the file has data
- andalso (F#file_status.oldest_tstamp < ExpirationGraceTime)
+ orelse ((F#file_status.oldest_tstamp > 0) andalso %% means that the file has data
+ (F#file_status.newest_tstamp < ExpirationGraceTime)
)
end.
@@ -678,9 +690,17 @@ needs_merge(Ref) ->
_ ->
ok
end,
-
FileNames = [Filename || {Filename, _Reasons} <- MergableFiles],
- {true, FileNames};
+ F = fun(X) ->
+ case X of
+ {data_expired,_,_} ->
+ true;
+ _ ->
+ false
+ end
+ end,
+ ExpiredFiles = [Filename || {Filename, Reasons} <- MergableFiles, lists:any(F,Reasons)],
+ {true, {FileNames, ExpiredFiles}};
false ->
false
end.
@@ -726,8 +746,8 @@ small_file_threshold(Opts) ->
expired_threshold(Cutoff) ->
fun(F) ->
- if F#file_status.oldest_tstamp < Cutoff ->
- [{oldest_tstamp, F#file_status.oldest_tstamp, Cutoff}];
+ if F#file_status.newest_tstamp < Cutoff ->
+ [{data_expired, F#file_status.newest_tstamp, Cutoff}];
true ->
[]
end
@@ -1283,6 +1303,26 @@ poll_deferred_delete_queue_empty() ->
_ -> receive after 1100 -> poll_deferred_delete_queue_empty() end
end.
+%% Internal merge function for expiry merge functionality.
+expiry_merge([], _LiveKeyDir, Acc) ->
+ Acc;
+
+expiry_merge([File | Files], LiveKeyDir, Acc0) ->
+ FileId = bitcask_fileops:file_tstamp(File),
+ Fun = fun(K, Tstamp, {Offset, _TotalSz}, Acc) ->
+ bitcask_nifs:keydir_remove(LiveKeyDir, K, Tstamp, FileId, Offset),
+ Acc
+ end,
+ case bitcask_fileops:fold_keys(File, Fun, ok, default) of
+ {error, Reason} ->
+ error_logger:error_msg("Error folding keys for ~p: ~p\n", [File#filestate.filename,Reason]),
+ Acc = Acc0;
+ _ ->
+ error_logger:info_msg("All keys expired in: ~p scheduling file for deletion\n", [File#filestate.filename]),
+ Acc = lists:append(Acc0, [File])
+ end,
+ expiry_merge(Files, LiveKeyDir, Acc).
+
%% ===================================================================
%% EUnit tests
%% ===================================================================
@@ -1692,9 +1732,9 @@ delete_partial_merge_test() ->
%% selective merge, hit all of the files with deletes but not
%% all of the ones with deleted data
timer:sleep(1100),
- ok = merge("/tmp/bc.test.pardel",[],lists:reverse(lists:nthtail(2,
+ ok = merge("/tmp/bc.test.pardel",[],{lists:reverse(lists:nthtail(2,
lists:reverse(readable_files(
- "/tmp/bc.test.pardel"))))),
+ "/tmp/bc.test.pardel")))),[]}),
%% Verify we've now only got one item left
B2 = bitcask:open("/tmp/bc.test.pardel"),
src/bitcask_merge_worker.erl
@@ -132,13 +132,14 @@ do_merge(Args) ->
Start = os:timestamp(),
Result = (catch apply(bitcask, merge, Args)),
ElapsedSecs = timer:now_diff(os:timestamp(), Start) / 1000000,
+ [_,_,{Pargs,_}] = Args,
@bsparrow435 Collaborator

Only matching on the filenames that are merged. There's no reason to log all the args; you can read them from the app.config or get_all_env.

case Result of
ok ->
error_logger:info_msg("Merged ~p in ~p seconds.\n",
- [Args, ElapsedSecs]);
+ [Pargs, ElapsedSecs]);
{Error, Reason} when Error == error; Error == 'EXIT' ->
error_logger:error_msg("Failed to merge ~p: ~p\n",
- [Args, Reason])
+ [Pargs, Reason])
end;
false ->
ok