Change behavior of merge when merging for data expiration #87

Merged
merged 13 commits into from May 10, 2013

Conversation

Projects
None yet
3 participants
Contributor

bsparrow435 commented Apr 25, 2013

If and only if a data files reason for merge is data_expiry, remove entries from keydir and delete the file instead of doing the normal merge process. In conjunction with this change, instead of expiry merge threshold being calculated off of oldest timestamp in file, use newest. Therefore. files will only be expired if all entries in the file are expired. This eliminates the need to fold over the data file when merging for expiry and saves the need to fold over the same values.

Currently L#694 will add the file to the list of expired files if the only reason for merge is data_expiry. However, in the future this could be changed to add to list if any one of the reasons for merge is data_expiry as all the entries in the file have been expired.

@bsparrow435 bsparrow435 commented on the diff Apr 25, 2013

src/bitcask.erl
[catch set_setuid_bit(F) || F <- FileNames],
bitcask_merge_delete:defer_delete(Dirname, IterGeneration, FileNames),
+ if InFiles == [] ->
@bsparrow435

bsparrow435 Apr 25, 2013

Contributor

This is necessary if all files in the merge are being deleted, we need to delete the output file for the merge. Otherwise, we create zero length files on each merge.

@bsparrow435 bsparrow435 and 1 other commented on an outdated diff Apr 25, 2013

src/bitcask.erl
FileNames = [Filename || {Filename, _Reasons} <- MergableFiles],
- {true, FileNames};
+ ExpiredFiles = [Filename || {Filename, [{data_expired,_,_}]} <- MergableFiles],
@bsparrow435

bsparrow435 Apr 25, 2013

Contributor

Possibly could be changed to lists:members check for data_expired.

@evanmcc

evanmcc May 7, 2013

Contributor

fwiw, my understanding is that list comprehensions are faster than the stdlib lists stuff.

@bsparrow435

bsparrow435 May 8, 2013

Contributor

So this will only match if data_expired is the only reason for the merge. But, if data_expired is a reason(no matter what the other reasons) we are going to remove all entries anyway so it would be more efficient to call the expiry code path. Suggestions on the best way to implement this change are welcome.

@evanmcc

evanmcc May 8, 2013

Contributor

Does this work? Or am I misunderstanding the question?

    FileNames0 = [Filename || {Filename, _Reasons} <- MergableFiles],
    ExpiredFiles = [Filename || {Filename, [{data_expired,_,_}]} <- MergableFiles],
    FileNames = FileNames0 -- ExpiredFiles, 
@bsparrow435

bsparrow435 May 8, 2013

Contributor

MergableFiles contains a list of lists {Reason, threshold, value} and this will only match if the list has one element and that element is data_expired. This works but we could have some files who pass merge thresholds for fragmentation AND data_expired which will be put through the old merge process. It would be a nice improvement if those instances also went through the merge_expiry function as all the data is expired. It works in the current implementation but there are some cases(such as high update AND using data_expiry use case) that will not see the improvement implemented by this change.

@evanmcc

evanmcc May 8, 2013

Contributor

Will the file name be duplicated in the list?

@bsparrow435

bsparrow435 May 8, 2013

Contributor

Yes, sorry. Each element is {Filename, [ListofTuples]}

@bsparrow435

bsparrow435 May 8, 2013

Contributor

The order of tuples in ListofTuples is not guaranteed so you cant do a match on the first element or something like that.

@evanmcc

evanmcc May 8, 2013

Contributor

In that case you should just likely write a function that scans a list for
{data_expired, _, _} and do something like:

ExpiredFiles = [Filename || {Filename, Reasons} <- MergableFiles,
scan_for_expired(Reasons)],

On Wed, May 8, 2013 at 3:03 PM, Brian Sparrow notifications@github.comwrote:

In src/bitcask.erl:

         FileNames = [Filename || {Filename, _Reasons} <- MergableFiles],
  •        {true, FileNames};
    
  •        ExpiredFiles = [Filename || {Filename, [{data_expired,_,_}]} <- MergableFiles],
    

The order of tuples in ListofTuples is not guaranteed so you cant do a
match on the first element or something like that.


Reply to this email directly or view it on GitHubhttps://github.com/basho/bitcask/pull/87/files#r4144733
.

@bsparrow435 bsparrow435 commented on the diff Apr 25, 2013

src/bitcask_merge_worker.erl
@@ -132,13 +132,14 @@ do_merge(Args) ->
Start = os:timestamp(),
Result = (catch apply(bitcask, merge, Args)),
ElapsedSecs = timer:now_diff(os:timestamp(), Start) / 1000000,
+ [_,_,{Pargs,_}] = Args,
@bsparrow435

bsparrow435 Apr 25, 2013

Contributor

Only matching on the filenames that are merged. No reason to note all the args, you can read them from the app.config or get_all_env

Contributor

evanmcc commented Apr 25, 2013

Two things:

  • Does this need to be configurable?
  • Have you run the pulse checks?
Contributor

bsparrow435 commented Apr 25, 2013

I did not see a reason for it to be configurable. There is a slight tradeoff in that you will now use disk space until all entries in a data file are expired instead of doing the disk io to fold over the same data potentially multiple times before it's all expired. Unless you have a very long expiry_sec and are adding data slowly I dont see a reason why this behavior is not desirable.

I have not run the PULSE tests, pinged you internally about it.

Contributor

evanmcc commented May 7, 2013

I think at this point this looks decent. I've run ~20 hours of PULSE testing on it.

It'd be awesome if @slfritchie or @gburd could give it a once-over before it's merged.

Contributor

bsparrow435 commented May 10, 2013

Added a lists:any check so now we'll use the new expiry merge path for any MergableFile with Reason {data_expired,,}.

@slfritchie slfritchie and 1 other commented on an outdated diff May 10, 2013

src/bitcask.erl
%% Inner merge function, assumes that bitcask is running and all files exist.
-merge1(_Dirname, _Opts, []) ->
+merge1(_Dirname, _Opts, [], _ExpiredFiles) ->
ok;
@slfritchie

slfritchie May 10, 2013

Contributor

This seems odd: if FilesToMerge = [], then doing the ExpiredFiles list seems odd. After all, if ExpiredFiles is not empty, isn't there stuff to do?

What if this entire function clause were deleted?

@bsparrow435

bsparrow435 May 10, 2013

Contributor

ExpiredFiles is always a subset of FilesToMerge so this really should be merge1(_Dirname, _Opts, [], []).

@bsparrow435 bsparrow435 added a commit that referenced this pull request May 10, 2013

@bsparrow435 bsparrow435 Merge pull request #87 from basho/bs-merge-expiration-change
Change behavior of merge when merging for data expiration
31126cc

@bsparrow435 bsparrow435 merged commit 31126cc into master May 10, 2013

1 check failed

default The Travis CI build could not complete due to an error
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment