Improve tombstone management #128

Closed
wants to merge 39 commits into
from

Conversation

Projects
None yet
3 participants
Contributor

slfritchie commented Dec 26, 2013

Addresses #82. See also #102.

This PR is the result of two efforts, the second building upon the first:

  1. A fix for all known merge mis-behavior and races with respect to tombstones and merging. (This stopped roughly at the tip of the slf-merge-panopticon branch on 24 December.
  2. Change the format of hint files (in a backward-compatible way) to make it possible to avoid storing tombstone'ed keys in the keydir.

Scott Lystig Fritchie added some commits Dec 15, 2013

Scott Lystig Fritchie Add regression test for GH #82 eb182ba
Scott Lystig Fritchie Fix race with merge & bitcask:delete(): remove from keydir *first* 1a9c994
Scott Lystig Fritchie Change order of scan & fold: now oldest -> youngest d594f90
Scott Lystig Fritchie Fix rough edges on test/bitcask_qc.erl & test/bitcask_qc_expiry.erl a331587
Scott Lystig Fritchie Write tombstone on delete only if there's a keydir entry, plus race b…
…ugfix
239ed71
Scott Lystig Fritchie New regression test (currently broken) for recent scan change of olde…
…st -> youngest
8805790
Scott Lystig Fritchie Fix remaining bug when bitcask scan/fold order was changed to oldest-…
…>youngest

The NIF change fixes a long-standing latent bug: when put'ing a key
that does not exist, if there's a race with a merge, keydir_put_int()
would return 'ok' (error) rather than 'already_exists' (correct).  The
'already_exists' return value is a signal to the read-write owner of
the bitcask that the current append file must be closed and a new one
opened (with a larger fileid than any merge).

The tombstone change adds a new tombstone data format.  Old tombstones
will be handled correctly.  New tombstones for any key K contain
the fileid & offset of the key that it is deleting.  If the fileid
F still exists, then the tombstone will always be merged forward.
If the fileid F does not exist, then merging forward is not
necessary. When F was merged, the on-disk representation of key K
not be merged forward: K does not exist in the keydir (because it
was deleted by this tombstone), or it was replaced by a newer put.
afc5c3f
Scott Lystig Fritchie FML: fix tombstone merging bug when K & K's tombstone are in same fileid
Originally found with bitcask_pulse, I deconstructed the test case to
help understand what was happening: the new EUnit test is
new_20131217_a_test_.

As a result of the puts, key #13 is written 3x to fileid #1 (normal,
tombstone, normal) and 1x to fileid #2 (normal @ the very beginning
of the file).  The merge creates fileid #3 and copies only the
tombstone (the normal entry isn't copied because it is out-of-date).
Before the close, the internal keydir contains the correct info
about key #13, but after the close and re-open, we see key #13's
entries: normal (and most recent) in fileid 32, and tombstone in
fileid #3, oops.

The fix is to remove all of the merge input fileids from the set of fileids
that will survive/exist after the merge is finished.
1712650
Scott Lystig Fritchie WIP: new test new_20131217_c_test_() is broken, need to experiment wi…
…th bitcask:delete & NIF usage before continuing
dd1c8b4
Scott Lystig Fritchie Fix non-determinism in the fold_visits_unfrozen_test test ... now 100…
…% broken
a99916f
Scott Lystig Fritchie Fix iterator freeze bug demonstrated by last commit: remove kh_put_wi…
…ll_resize() predicate test for creating a keydir->pending
2af49f0
Scott Lystig Fritchie Fix bug introduced by almost-bugfix in commit 1a9c99 that was suppose…
…d to be a partial fix for GH #82
662fba8
Scott Lystig Fritchie Fix unexported function errors, thanks Buildbot 990b5ca
Scott Lystig Fritchie Omnibus bugfixing, including:
* Add 'already_exists' return to bitcask_nifs_keydir_remove(): we need
it to signal races with merge, alas.

* Add state to #filestate to be able to 'undo' last update to both a
data file and its hint file.  This probably means that we're going
to have to play some games with merge file naming, TBD, stay tuned.

* For bitcask:delete(), make the keydir delete conditional: if it fails,
redo the entire thing again.

* inner_merge_write() can have a race that, if a partial merge happens
at the proper time after, we see an old value reappearing.  Fix by
checking the return value of the keydir put, and if 'already_exists',
then undo the write.

* When do_put() has a race and gets 'already_exists' from the keydir,
undo the write before retrying.  If this key is deleted sometime later,
and then a partial merge happens after that, we might see this value
reappear after the merge is done.

* Add file_truncate() to bitcask_file.erl.  TODO: do the same for the
NIF style I/O.

* Better robustness (I hope) to EUnit tests in bitcask_merge_delete.erl
4019fd0
Scott Lystig Fritchie Add UNIX epoch time to put_int NIF call, to avoid C use of time(3).
I hope this will eliminate a nasty source of nondeterminism during
PULSE testing.
89711f3
Scott Lystig Fritchie Fix buildbot ?TESTDIR definition location problem dd068bf
Scott Lystig Fritchie Uff da: most PULSE non-determinism problems fixed? 0695ba1
Scott Lystig Fritchie Fix merge race when keydir_put specifies old fileid & offset races wi…
…th delete

Scenario, with 3 writers, 2 & 3 are racing:

* Writer 1: Put K, write @ {file 1, offset 63}
* Writer 2: Delete operation starts ... but there is no write to disk yet
* Writer 3: Merge scans file 1, sees K @ {1,30} -> not out of date ->
  but there is no write to disk yet
* Writer 2: writes a tombstone @ {3,48}
* Writer 2: Keydir conditional delete @ old location @ {1,63} is ok
* Writer 2: keydir delete returns from NIF-land
* Writer 3: merge copies data from {1, 63} -> {4, 42}
* Writer 3: keydir put {4, 42} conditional on {1,63} succeeds due to
  incorrect conditional validation: the record is gone, but bug
  permits put to return 'ok'.
0c3699a
Scott Lystig Fritchie Fix bug in previous commit: offset=0 is valid, do not check it c0f8760
Scott Lystig Fritchie Use larger retry # for open_fold_files() and keydir_wait_ready() when…
… PULSE testing
91a781b
Scott Lystig Fritchie Small bitcask_pulse changes: file size, pretty the keys used, slightl…
…y less verbosity
3c4ffa5
Scott Lystig Fritchie Adjust command frequency weightings for PULSE model f27c640
Scott Lystig Fritchie Add no_tombstones_after_reopen_test() to test desired condition a981d53
Scott Lystig Fritchie Avoid storing tombstones in the keydir when possible
When a Bitcask is opened and scan_key_files() is reading data
from disk and loading the RAM keydir, we now detect if the key
is a tombstone and, if so, do not store it in the keydir.

Normally, only hint files are scanned during startup.  However,
hint files have not stored enough information to confirm that
a key is/is not a tombstone.  I have added such a flag in a
backward-compatible way: the offset size has been reduced from
64 -> 63 bits, and the uppermost bit (which is assumed to be
0 in all cases -- we assume nobody has actually written a file
bit enough to require 64 bits to describe the offset) is used
to signal tombstone status.

An optional argument was given to the increment_file_id() NIF
to communicate to the NIF that a data file exists ... a fact
that would otherwise be lost if a hint/data file contains
only tombstones.

For testing purposes, fold() and fold_keys() are extended with
another argument to expose the presence of keydir tombstones.

Adjust timeouts and concurrency limits in bitcask_pulse.erl
to avoid the worst of false-positive errors when using the
PULSE model: {badrpc,timeout} nonsense.
2690354
Scott Lystig Fritchie Add checks to the PULSE model to try to catch improper tombstones in …
…keydir
cef207b
Scott Lystig Fritchie Dialyzer fixes 72d87aa
Scott Lystig Fritchie Add failing test: tombstones exist in keydir if hint files are destroyed 78a2667
Scott Lystig Fritchie Add {tombstone,Key} wrapper when folding .data files; fixes test fail…
…ure in last commit
7bb2db0

@engelsanchez engelsanchez and 2 others commented on an outdated diff Dec 27, 2013

c_src/bitcask_nifs.c
UNLOCK(keydir);
return ATOM_ALREADY_EXISTS;
}
- // If put would resize and iterating, start pending hash
- if (kh_put_will_resize(entries, keydir->entries) &&
- keydir->keyfolders != 0 &&
+ if (keydir->keyfolders != 0 &&
@engelsanchez

engelsanchez Dec 27, 2013

Contributor

This seems to kill the new multiple folds improvement. Is it intentional?

@evanmcc

evanmcc Dec 27, 2013

Contributor

Yeah, we can't do this, given how reliant we now are on AAE.

@slfritchie

slfritchie Jan 6, 2014

Contributor

I don't understand, sorry, both based on my sketchy knowledge of khash's internals, the multi-fold work, and the fact that if you restore that ky_put_will_resize() check then the PULSE test will start spewing failures very quickly because the keydir isn't properly frozen.

If the put is going to resize the hash, I certainly agree that a freeze is required in order to keep the folders' sorting order stable. However, this function is doing a mutation, and with a mutation of any kind (if there are any keyfolders) we must freeze. My brain can't think of a reason why you'd have keyfolders != 0 and doing a mutation where you would not want to freeze ... but then again, it's -18F here in Minneapolis today, my brain may be slushy.

@engelsanchez

engelsanchez Jan 6, 2014

Contributor

Most of the work for multi-folds was aimed at making this possible: delaying the inevitable freeze until a put changed the number of khash slots. Iterators are linked to a timestamp and tolerate finding entries added after the iteration was started. Entries can now be simple entries or linked lists, which can contain multiple versions of an entry. Puts then do not replace the entry, but add another version to this list (or convert a plain entry to a list). Iterators will choose from this list of timestamped entries the one from the snapshot they belong to. When the pending hash is merged (freeze is over!), all entries are merged back to good ol' plain entries. It's a lot of fun, you should try it.

@slfritchie

slfritchie Jan 6, 2014

Contributor

It's a lot of fun, you should try it.

Ha, then how about a change to the PULSE model to deal with freezes that aren't really freezes? Or change all of the other logic to compare timestamps in the same way so that real frozenness returns?

Is the goal of delaying the freeze is simply a RAM-consumption optimization?

@engelsanchez

engelsanchez Jan 6, 2014

Contributor

I believe it is there to make sure the number of concurrent keyfolders is not limited by when the next write is coming, which is usually "there, it's already here". In fact, memory consumption could be higher with this, as you can end up with many versions of values. It depends on initial value of the khash table and typical growth during freeze I suppose. I have not looked at the implications for the PULSE model, but it does sound like we'll need to work on that soon to merge this work before 2.0.

@slfritchie

slfritchie Jan 7, 2014

Contributor

I apparently do not understand the goals of the multiple folds work. I'll try to review the original PR & comments today, because I'm stuck in a pit of befuddlement.

@slfritchie

slfritchie Jan 8, 2014

Contributor

Alright ... I put the will_resize check back in the code. The bitcask_pulse test finds this problem with a single folder (the fork_merge step in the counterexample):

[[{set,{var,11},{call,bitcask_pulse,bc_open,[true]}},
  {set,{var,13},
       {call,bitcask_pulse,puts,[{var,11},{1,16},<<0>>]}},
  {set,{var,18},{call,bitcask_pulse,bc_close,[{var,11}]}},
  {set,{var,23},{call,bitcask_pulse,incr_clock,[]}},
  {set,{var,24},{call,bitcask_pulse,bc_open,[true]}},
  {set,{var,26},
       {call,bitcask_pulse,puts,[{var,24},{4,13},<<0>>]}},
  {set,{var,31},{call,bitcask_pulse,fork_merge,[{var,24}]}},
  {set,{var,34},{call,bitcask_pulse,fold,[{var,24}]}},
  {set,{var,37},{call,bitcask_pulse,fold,[{var,24}]}}],
 {19747,13791,98974},
 [{events,[]}]]

The problem is that the fold in step # 37 sees the key 14 twice.

@slfritchie

slfritchie Jan 8, 2014

Contributor

So, I've a different counterexample that skips keys: a fold doesn't see keys that have been put and have never been deleted.

7> C9c.
[[{set,{var,32},{call,bitcask_pulse,bc_open,[true]}},
  {set,{var,46},
       {call,bitcask_pulse,puts,[{var,32},{1,1},<<0,0,0>>]}},
  {set,{var,48},{call,bitcask_pulse,bc_close,[{var,32}]}},
  {set,{var,64},{call,bitcask_pulse,bc_open,[true]}},
  {set,{var,65},
       {call,bitcask_pulse,puts,[{var,64},{1,22},<<0>>]}},
  {set,{var,66},{call,bitcask_pulse,fork_merge,[{var,64}]}},
  {set,{var,81},{call,bitcask_pulse,incr_clock,[]}},
  {set,{var,82},{call,bitcask_pulse,bc_close,[{var,64}]}},
  {set,{var,85},{call,bitcask_pulse,bc_open,[true]}},
  {set,{var,87},{call,bitcask_pulse,fold,[{var,85}]}}],
 {86273,69841,50357},
 [{events,[]}]]

Here is an annotated timeline of the race:

folding proc:                      merging proc:
-------------------------------    ------------------------------------------

%% Fold starts.
{list_data_files,[1,2,3]}
{subfold,processing_file_number,1}

                                  %% Merge starts
                                  {merge_files,input_list,[2]}
                                  {inner_merge_write,fresh,new_file_is,4}
                                  {merge_single_entry,<<"kk02">>,
                                   old_location,file,2,offset,19}
                                  {merge_single_entry,<<"kk02">>,
                                   not_out_of_date}
                                  {inner_merge_write,<<"kk02">>,before_write}

{subfold,processing_file_number,2}

                                  {inner_merge_write,<<"kk02">>,
                                   new_location,file,4,offset,19,
                                   old_location,file,2,offset,19}
                                  {inner_merge_write,<<"kk02">>,keydir_put,ok}

%% The fold reaches the place in
%% file #2 where key <<"kk02">> is stored.
{fold,<<"kk02">>,keydir_get,not_exact,
 file_number,2,offset,19,
 keydir_location_is_now,file,4,offset,19
 fold_does_not_see_this_entry}

{subfold,processing_file_number,3}

%% key <<"kk02">> is not found in file #3, nor is it found in
%% any other file that the fold can process (in this case, file 3 is
%% the last file in the fold's list of files).

If the keydir is frozen, then the keydir update by the merge causes the keydir to freeze, and the folder sees <<"kk02">>'s entry in file # 2, the keydir's frozenness will allow the folder to see that the <<"kk02">>'s keydir entry is that same place in file # 2.

But I'm not seeing an easy way to fix this merge race without consuming more memory ... I need to sleep on this.

@slfritchie

slfritchie Jan 9, 2014

Contributor

Hrm, well, my comment about a fix without using more RAM may or may not survive.

For the 2nd problem mentioned above, the one with the annotated timeline. I am not understanding why the mutation made by the "merging proc" isn't visible by the "folding proc" at the place where the fold reaches that key in file # 2.

Contributor

engelsanchez commented Dec 27, 2013

@slfritchie A PR so long in the making, involving so many commits and so many files touched should really have a good high level summary of the changes and a rationale. See any of the LevelDB changes by MvM in the last year. Could you cook up something similar to help with the review?

Contributor

evanmcc commented Dec 27, 2013

Drive-by review comment (will take a closer look on monday): would be great to break this up into several different PRs (or at least rebase, logically grouping the commits into groups of unrelated changes).

@engelsanchez engelsanchez and 1 other commented on an outdated diff Dec 27, 2013

c_src/bitcask_nifs.c
if (enif_get_resource(env, argv[0], bitcask_keydir_RESOURCE, (void**)&handle))
{
+ if (argv[1] != 0) {
@engelsanchez

engelsanchez Dec 27, 2013

Contributor

Is it really valid to test that a ERL_NIF_TERM is non-zero?

@engelsanchez

engelsanchez Dec 27, 2013

Contributor

I imagine that what you really want to do here is check argc > 1

@slfritchie

slfritchie Jan 7, 2014

Contributor

Fixed in a soon-to-be-poooshed commit.

@engelsanchez engelsanchez commented on the diff Dec 27, 2013

src/bitcask.erl
@@ -1167,16 +1273,35 @@ inner_merge_write(K, V, Tstamp, OldFileId, OldOffset, State) ->
{ok, Outfile, Offset, Size} =
bitcask_fileops:write(State1#mstate.out_file,
- K, V, Tstamp),
-
- %% Update live keydir for the current out
- %% file. It's possible that this is a noop, as
- %% someone else may have written a newer value
- %% whilst we were processing.
- _ = bitcask_nifs:keydir_put(State#mstate.live_keydir, K,
- bitcask_fileops:file_tstamp(Outfile),
- Size, Offset, Tstamp, OldFileId, OldOffset),
- State1#mstate { out_file = Outfile }.
+ K, V, Tstamp, is_tombstone(V)),
+ case bitcask_fileops:file_tstamp(Outfile) =< OldFileId of
+ true ->
+ exit({invariant_violation, K, V, OldFileId, OldOffset, "->",
+ bitcask_fileops:file_tstamp(Outfile), Offset});
@engelsanchez

engelsanchez Dec 27, 2013

Contributor

We should think of a way to handle invariant violations. It would be better to drop all data in the backend than have it crap out on problematic data over and over again.

@engelsanchez engelsanchez and 1 other commented on an outdated diff Dec 27, 2013

@@ -0,0 +1,37 @@
+-module(foo).
@engelsanchez

engelsanchez Dec 27, 2013

Contributor

Hello foo. :)

@slfritchie

slfritchie Jan 7, 2014

Contributor

Oops, it's deleted in a soon-to-be-poooshed commit.

@engelsanchez engelsanchez commented on an outdated diff Jan 2, 2014

c_src/bitcask_nifs.c
find_result f;
- find_keydir_entry(keydir, &key, time, handle->iterating, &f);
+ find_keydir_entry(keydir, &key, time, iterating_status, &f);
@engelsanchez

engelsanchez Jan 2, 2014

Contributor

To make names clear, we should probably change some names. In find_keydir_entry, iterating should be called use_snapshot, get_from_snapshot or similar. This is not from your PR. I wrote that one. My bad :). A similar name should be given to iterating_status.

@engelsanchez engelsanchez and 1 other commented on an outdated diff Jan 2, 2014

src/bitcask.erl
case ?MODULE:get(Ref, Key) of
- not_found -> Acc;
+ not_found when not SeeTombstonesP ->
+ Acc;
+ not_found when SeeTombstonesP ->
+ Fun({tombstone, BCEntry}, Acc);
@engelsanchez

engelsanchez Jan 2, 2014

Contributor

I'm not sure that it matters as I still haven't grokked what the see tombstones on key fold flag is for, but: we always skipped values that have already expired (new line 366 above), but if they happen to pass that test (about to expire on near second boundary), have the same size as a tombstone, and then expire by the time we call get(Ref, Key), they will be returned as tombstones whether they really were a tombstone or not.

@slfritchie

slfritchie Jan 7, 2014

Contributor

AFAIK, the tests that care about 100%-never-see-a-tombstone-in-the-keydir aren't using expiry, and the expiry tests don't yet care. So I think that we agree that it doesn't matter.

@slfritchie

slfritchie Jan 7, 2014

Contributor

what the see tombstones on key fold flag is for

It's for testing purposes only: if we want to make certain that tombstones don't linger in the NIF keydir side of the world, then I needed a way to tell the Erlang side not to hide them from test code.

Contributor

slfritchie commented Jan 7, 2014

[...] would be great to break this up into several different PRs (or at least rebase, logically grouping the commits into groups of unrelated changes).

I'd just squish it all down to a single commit. There's just one goal: eliminate tombstones from the keydir.

Scott Lystig Fritchie Add an 'epoch' counter to subdivide Bitcask timestamps
In order for the multi-folder work to operate correctly, it
needs to be able to keep track of exactly when a fold started
relative to any mutations that happen during the same 1-second
time interval.  If a merge process cannot tell exactly if a
mutation happened before or after its fold started, then merge
may do the wrong thing: operate on a key zero times, or
operate on a key multiple times.

An epoch counter now subdivides Bitcask timestamps.  The epoch
counter is incremented whenever an iterator is formed.  A new NIF
was added, keydir_fold_is_starting(), to inform a fold what the
current epoch is.  The fold starting timestamp + epoch are used for
all get operations that the folder performs.

If the keydir contains only a single entry for a key, there's
no need to store the epoch with that key.  The epoch is stored
in the struct bitcask_keydir_entry_sib, when there are multiple
entries per key.

Things are very tricky (alas) when keeping entries in the
siblings 'next' linked list in newest-to-oldest timewise order.
A merge can do something "newer" wall-clock-wise with a mutation
that is "older" by that same wall-clock view.  The 'tstamp'
stored in the keydir is the wall-clock-time when the entry
was written for any reason, including a merge.  However, we
do NOT want a merge to change a key's expiration time, thus a
merge may not change a key's tstamp -- the solution is to have
the keydir also store the key's 'orig_tstamp' to keep a copy
of they key's specified-by-the-client timestamp for expiration
purposes.

To avoid mis-behavior of merging when the OS system clock
moves backward across a 1-second boundary, there is new checking
for mutations where Now < keydir->biggest_timestamp.  Operations
are retried when this condition is detected.

Try to avoid false-positives in the
bitcask_pulse:check_no_tombstones() predicate by calling only
when opened in read-write mode.

Remove the fold_visits_unfrozen_test_() and replace with a
corrected fold_snapshotX_test_()
2518628
Contributor

slfritchie commented Jan 22, 2014

Hi, reviewers. I've got a branch that's based on this one that contains the hackery required to add an epoch counter to distinguish writes within Bitcask's 1-second timestamp granularity. Would you prefer to see that work tacked onto this branch and reviewed together, or as a PR to merge into this one, or ... something else?

https://github.com/basho/bitcask/compare/slf-tombstone-management...slf-tombstone-management%2Bsub-second-epochs?expand=1

Scott Lystig Fritchie added some commits Jan 24, 2014

Scott Lystig Fritchie Add sibling->regular entry conversion sweeper, to GC keydir siblings
The amortization mechanism attempts to limit itself to less than
1 msec of additional latency for any get/put/delete call, but as
shown below, it doesn't always stay strictly under that limit when
you're freeing hundreds of megabytes of data.

Below are histograms showing the NIF function latency as recorded
on OS X on my MBP for a couple of different mutation rates: 90%
and 9%.  The workload is:

* Put 10M keys
* Create an iterator
* Modify some percentage of the keys (e.g. 90%, 9%)
* Close the iterator
* Fetch all 10M keys, measuring the NIF latency time of each call.

---- snip ---- snip ---- snip ---- snip ---- snip ----

By my back-of-the-envelope calculations (1 word = 8 bytes)

      bitcask_keydir_entry = 3 words + key bytes
    vs.
      bitcask_keydir_entry_head = 2 words + key bytes
      plus
      bitcask_keydir_entry_sib = 5 words

So each 2-sibling entry is using 2 + (2*5) = 12 words, not counting
the key bytes.
After the sibling sweep, we're going from 12 words -> 3 words per key.

So for 10M keys, that's a savings of 9 words -> 687MBytes.

(RSS for the 10M keys @ 90% mutation tests peaks at 1.45GB of RAM.  By
 comparison, the 10M key @ 0% mutation test peaks at ~850MByte RSS.
 So, those numbers roughly match, yay.)

No wonder that there are a very small number of
bitcask_nifs_keydir_get_int() calls that are taking > 10msec to
finish: it may be the OS getting involved with side-effects from
free(3) calls??

** 90% mutation

10M keys, ~90% mutated, 0% deleted
via:
    bitcask_nifs:yoo(10*1000000, 9*1000*1000, 0).

*** Tracing on for sequence 1...

    bitcask_nifs_keydir_get_int latency with off-cpu (usec)
             value  ------------- Distribution ------------- count
                 8 |                                         0
                16 |@@@@@@@@@@@@@@@                          40
                32 |@@@@@@                                   17
                64 |@@@                                      8
               128 |@@                                       5
               256 |@@                                       5
               512 |@@@@@@@@@@                               27
              1024 |@@                                       5
              2048 |                                         0

    bitcask_nifs_keydir_get_int latency (usec)
             value  ------------- Distribution ------------- count
                 0 |                                         0
                 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         7954469
                 2 |@@@@@@@@                                 2051656
                 4 |                                         10440
                 8 |                                         25446
                16 |                                         209
                32 |                                         5
                64 |                                         0
               128 |                                         9
               256 |                                         0
               512 |                                         11698
              1024 |                                         723
              2048 |                                         12
              4096 |                                         0
              8192 |                                         0
             16384 |                                         1
             32768 |                                         1
             65536 |                                         0

*** Tracing on for sequence 4...

    bitcask_nifs_keydir_get_int latency with off-cpu (usec)
             value  ------------- Distribution ------------- count
                 8 |                                         0
                16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              56
                32 |@@@@@@                                   13
                64 |@@@@                                     8
               128 |@@                                       5
               256 |                                         0

    bitcask_nifs_keydir_get_int latency (usec)
             value  ------------- Distribution ------------- count
                 0 |                                         0
                 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          7754589
                 2 |@@@@@@@@@                                2211364
                 4 |                                         15906
                 8 |                                         25269
                16 |                                         375
                32 |                                         8
                64 |                                         0

** 9% mutation

10M keys, ~9% mutated, 0% deleted
via:
    bitcask_nifs:yoo(10*1000000, 9 * 100*1000, 0).

*** Tracing on for sequence 1...

    bitcask_nifs_keydir_get_int latency with off-cpu (usec)
             value  ------------- Distribution ------------- count
                 8 |                                         0
                16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             64
                32 |@@@@@                                    12
                64 |@@@                                      7
               128 |@@                                       4
               256 |@                                        3
               512 |@                                        2
              1024 |                                         0

    bitcask_nifs_keydir_get_int latency (usec)
             value  ------------- Distribution ------------- count
                 0 |                                         0
                 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           7530608
                 2 |@@@@@@@@@@                               2465556
                 4 |                                         12014
                 8 |                                         24721
                16 |                                         234
                32 |                                         6
                64 |                                         0
               128 |                                         0
               256 |                                         1
               512 |                                         1473
              1024 |                                         3
              2048 |                                         0

*** Tracing on for sequence 4...

    bitcask_nifs_keydir_get_int latency with off-cpu (usec)
             value  ------------- Distribution ------------- count
                 4 |                                         0
                 8 |@                                        3
                16 |@@@@@@@@@@@@@@@@@@@@@@@@@                55
                32 |@@@@@@@                                  15
                64 |@@                                       5
               128 |@@@                                      6
               256 |@                                        2
               512 |                                         1
              1024 |                                         0

    bitcask_nifs_keydir_get_int latency (usec)
             value  ------------- Distribution ------------- count
                 0 |                                         0
                 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           7631794
                 2 |@@@@@@@@@                                2345179
                 4 |                                         13769
                 8 |                                         26056
                16 |                                         324
                32 |                                         22
                64 |                                         2
               128 |                                         0
1563feb
Scott Lystig Fritchie Add new EUnit test and fix PULSE model
The EUnit test is freeze_close_reopen_test(), which forces an
actual old-style freeze of the keydir and then checks the sanity
of folds while frozen.

The PULSE model change is something I'm not 100% happy with,
but anytime the PULSE model has false positives, it takes a huge
amount of time to determine that it's a false alarm.  So, this
change should eliminate a rare source of false reports ... but I
hope I haven't introduced something that will also hide a real
error.

The problem comes from having a read-only proc folding a cask
and having it frozen, then the read-write proc closes & reopens
the cask and does a fold.  If the keydir has been frozen the
entire time, the PULSE model doesn't know about the freezing and
thus reports an error in fold/fold_keys results.

This model change will discover if there has ever been a fold
in progress while the 1st/read-write pid opens the cask.  If yes,
then fold/fold_keys mismatches are excused.
9db9248
Scott Lystig Fritchie Fix last two Dialyzer errors by removing dead code 432f3b1
Scott Lystig Fritchie Add error filter for harmless race a50a1c0
Scott Lystig Fritchie Add gets() to PULSE model, analogus to puts() for multiple keys 959ecce
Scott Lystig Fritchie Add new NIF, bitcask_nifs_keydir_global_pending_frozen/0, to assist P…
…ULSE test

Bitcask's fold semantics are difficult enough to try to predict,
but when a keydir is actually frozen, the job is even more difficult.
This NIF is added to reliably inspect if the keydir was frozen during
a PULSE test case, and if we find a fold or fold_keys problem, we
let it pass.

The new NIF is also tested by an existing EUnit test,
keydir_wait_pending_test().
5f4d145
Contributor

slfritchie commented Jan 31, 2014

PR for the above-mentioned branch: #134

Contributor

engelsanchez commented Feb 19, 2014

#134 has now been merged into this. Scott's description of those commits:

Just when I thought I was done last week with this branch-of-a-branch, PULSE found another problem.

The 'epoch' subdivision of Bitcask timestamps: this fixes fold bugs where merge mutations can be seen by folding procs.

The sibling->regular entry conversion sweeper helps reclaim memory by converting siblings back to single entries whenever mutations have happened during a fold and then the keydir is quiescent long enough to allow sweeping during get/put/remove operations. If a new fold starts in the middle of the sweep, the sweep stops and will resume when it's able to.

Model corrections to avoid false positive reports. Until we get a temporal logic hacker skilled enough to build a real model, or fix the silly keydir frozen feature, I'm afraid that this is as good as I can do.

@engelsanchez engelsanchez and 1 other commented on an outdated diff Feb 19, 2014

c_src/bitcask_nifs.c
@@ -127,6 +155,7 @@ struct bitcask_keydir_entry_sib
uint32_t total_sz;
uint64_t offset;
uint32_t tstamp;
+ uint32_t tstamp_epoch;
@engelsanchez

engelsanchez Feb 19, 2014

Contributor

(Deja vu) This should be uint8_t instead of uint32_t like in the other places, right?

@slfritchie

slfritchie Feb 20, 2014

Contributor

Yup, will fix in a review-fixer-upper commit.

@engelsanchez engelsanchez commented on the diff Feb 19, 2014

c_src/bitcask_nifs.c
@@ -37,6 +38,26 @@
#include <stdio.h>
+void DEBUG2(const char *fmt, ...) { }
+/* #include <stdarg.h> */
+/* void DEBUG2(const char *fmt, ...) */
+/* { */
+/* va_list ap; */
+/* va_start(ap, fmt); */
+/* vfprintf(stderr, fmt, ap); */
+/* va_end(ap); */
+/* } */
+#include <stdarg.h>
+void DEBUG3(const char *fmt, ...) { }
@engelsanchez

engelsanchez Feb 19, 2014

Contributor

DEBUG3 is really just DEBUG2 with a tweak. Should we bite the bullet and clean this up so there is just one macro and a flag that can be passed to activate it?

@evanmcc

evanmcc Feb 19, 2014

Contributor

probably overkill, but might be reasonable to move to an actual logging library at some point, if there is one that plays well with erlang.

@evanmcc

evanmcc Feb 19, 2014

Contributor

also some naming caremad: _2, _3, etc. might be better to name the variants by how they vary from the original function so it's easier to remember which one to use.

@slfritchie

slfritchie Feb 20, 2014

Contributor

All of the fprintf()-style tracing that I add ends up being in a dozen different places. Every time I yank it out, I keep a diff so I can put it back in if I need it. Some hours or days later, I need to put it back in ... but "patch" can't apply some hunks because of edits. So I finally decided to leave it all in and call it DEBUG2() and the even newer DEBUG3().

I don't use a single one, because I'm not interested in the stuff that the older DEBUG() macros use. And I'm usually not interested in DEBUG2() output, either. But I keep all 3 because those macros might be useful to other folks.

A maintenance issue/ticket/thingie for cleaner fprintf() wrappers is fine with me. Is there anything that y'all feel must be done DEBUG*()-wise in this PR?

@engelsanchez engelsanchez and 1 other commented on an outdated diff Feb 19, 2014

src/bitcask_merge_worker.erl
@@ -245,8 +253,9 @@ prop_in_window() ->
true
end).
-prop_in_window_test() ->
- ?assert(eqc:quickcheck(prop_in_window())).
+prop_in_window_test_() ->
+ {timeout, 30,
+ [fun() -> ?assert(eqc:quickcheck(prop_in_window())) end]}.
@engelsanchez

engelsanchez Feb 19, 2014

Contributor

Just curious: why not use _assert() here instead?

@slfritchie

slfritchie Feb 20, 2014

Contributor

Didn't know it existed? ... fixer-upper....

@evanmcc evanmcc commented on the diff Feb 19, 2014

c_src/bitcask_nifs.c
@@ -101,7 +126,10 @@ struct bitcask_keydir_entry_sib
uint32_t file_id;
uint32_t total_sz;
uint64_t offset;
- uint32_t tstamp;
+ uint32_t orig_tstamp; // Original tstamp from original put
@evanmcc

evanmcc Feb 19, 2014

Contributor

isn't this just going to be the tstamp of the first sib in the list?

@slfritchie

slfritchie Feb 20, 2014

Contributor

Hrm. I'd added orig_tstamp to solve a problem that I've forgotten now. Perhaps the commit history says why? Nope. ... Ah, right, we need to keep track of the difference of tstamp for comparison for freezing purposes vs. tstamp that's visible to the outside world (because that tstamp can be used for expiration purposes, and merging a key shouldn't change the user's expectation of when that key will expire).

Hrm, do you mean the oldest sibling? That would be the sib at the end of the list, not the first on the list.

If I use this hack patch to check the sib at the head of the list, https://gist.github.com/slfritchie/324d955b53b1a581352f , then I see "Diff @ line 725" typically once per minute. If I change line 725 to look at the tstamp of the sibling at the end of the list instead, then I also see "Diff @ " messages.

So, that's a brute-force approach to suggesting that the answer to your question is probably "no". But it's quite possible there's a flaw in my hack.

@evanmcc evanmcc commented on the diff Feb 19, 2014

c_src/bitcask_nifs.c
@@ -1084,39 +1264,81 @@ ERL_NIF_TERM bitcask_nifs_keydir_put_int(ErlNifEnv* env, int argc, const ERL_NIF
(int)entry.total_sz, (unsigned) entry.tstamp, (int)old_file_id);
DEBUG_KEYDIR(keydir);
+ perhaps_sweep_siblings(handle->keydir);
@evanmcc

evanmcc Feb 19, 2014

Contributor

Won't putting this in the put path impact high percentile latencies?

@slfritchie

slfritchie Feb 20, 2014

Contributor

Yes. That's why there's a periodic gettimeofday() check in perhaps_sweep_siblings() to try to avoid running for too long. See the commit message for 1563feb for an exploration of the added latency. The commit excerpts from a script that runs IIRC 4 times. I show histograms from the 1st run and 4th run in two difference cases, 90% of keys mutated and 9% of keys mutated.

At the time, I was also measuring off-CPU latencies with DTrace. It turned out to be a fruitless investigation, but I did include the histograms from off-cpu stats in the commit message, but I don't believe they show anything useful.

@evanmcc evanmcc commented on the diff Feb 19, 2014

c_src/bitcask_nifs.c
@@ -58,6 +79,8 @@ void DEBUG(const char *fmt, ...)
#include "pulse_c_send.h"
#endif
+static uint32_t keydir_pending_frozen;
@evanmcc

evanmcc Feb 19, 2014

Contributor

shouldn't all of the machinery surrounding this be ifdefed away into pulse mode?

@slfritchie

slfritchie Feb 20, 2014

Contributor

{shrug} It feels like a small amount of code that would add more clutter with #ifdef statements than just keeping it as-is?

#ifdef'ing would be implicit documentation that the item is really only useful for testing, but ... the NIF exposes only read-only behavior, so I don't believe there are safety reasons to keep it out of reach of regular users.

@engelsanchez

engelsanchez Feb 20, 2014

Contributor

I suppose it is very unlikely that this variable will ever be concurrently incremented during the pulse test. There's still an infinitesimal chance that it will contain the wrong value since there is no synchronization around the code that increments it, possibly from different CPUs.

@slfritchie

slfritchie Feb 21, 2014

Contributor

Yes, that's true. PULSE only needs to know if the value is non-zero -- there isn't a case where all updates can be lost, is there? Also, PULSE is only scheduling any single process sequentially, so racing updates can't happen when PULSE is in control.

@engelsanchez engelsanchez commented on the diff Feb 19, 2014

test/bitcask_qc.erl
@@ -325,5 +328,16 @@ prop_fold_test_() ->
get_keydir(Ref) ->
element(9, erlang:get(Ref)).
+delete_bitcask_dir(Dir) ->
+ [file:delete(X) || X <- filelib:wildcard(Dir ++ "/*")],
+ file:del_dir(Dir),
+ case file:read_file_info(Dir) of
+ {error, enoent} -> ok;
+ {ok, _} ->
+ timer:sleep(10),
+ delete_bitcask_dir(Dir)
@engelsanchez

engelsanchez Feb 19, 2014

Contributor

This could become an infinite loop. We should add a parachute

@slfritchie

slfritchie Feb 20, 2014

Contributor

Well, yes, but it's for a QuickCheck test only. IIRC, it exists to guard against races with silly background merge activity creating new files. The background merge would have to be doing enough work to keep creating new files, which that test is highly unlikely to do. And there's the EUnit time wrapper around the whole thing.

@evanmcc evanmcc commented on the diff Feb 19, 2014

test/bitcask_pulse.erl
+ %%
+ %% 1. Proc A opens cask in read-write mode
+ %% 2. Proc B opens cask in read-only mode
+ %% 3. Proc B starts a fold
+ %% 4. Proc A causes enough mutations to freeze the keydir.
+ %% 4b. Proc A deletes key K.
+ %% 5. Proc A closes the cask.
+ %% 6. Proc A re-opens the cask in read-write mode.
+ %% 7. Proc A starts a fold.
+ %% 8. Proc B finishes its fold, releases itr, and closes.
+ %% 9. Proc A visits key K but doesn't see the delete
+ %% from step 4b because the keydir was originally
+ %% frozen way back early in step 4 and has since
+ %% never been unfrozen.
+ %%
+ %% Our model cannot (???) keep enough state about step #4 & #45 to
@evanmcc

evanmcc Feb 19, 2014

Contributor

I don't see why 9 wouldn't see the pending keydir? Recall if frozen we should be blocking at step 7 until after step 8 when the pending and main keydirs should be merged.

Also: Is there a way to model each keydir as a dict, which we then retain a pointer to when a fold happens?

@slfritchie

slfritchie Feb 20, 2014

Contributor

That prose is as direct a translation as I could make of a PULSE test case failure. It was important that proc A closes & reopens the cask while B's fold is running. A reopens in read-write mode and thus isn't subject to blocking-due-to-frozen-ness.

@evanmcc evanmcc added a commit that referenced this pull request Feb 25, 2014

@evanmcc evanmcc - pull in multifold regression test from #134
- pull in more aggressive pulse tunings from #128
- remove all test sleeps related to old 1 second timestamps, so that
  things will break if old code is retained.
a9e9af4

@evanmcc evanmcc added a commit that referenced this pull request Mar 5, 2014

@evanmcc evanmcc - pull in multifold regression test from #134
- pull in more aggressive pulse tunings from #128
- remove all test sleeps related to old 1 second timestamps, so that
  things will break if old code is retained.
cf78204

@evanmcc evanmcc added a commit that referenced this pull request Mar 7, 2014

@evanmcc evanmcc - pull in multifold regression test from #134
- pull in more aggressive pulse tunings from #128
- remove all test sleeps related to old 1 second timestamps, so that
  things will break if old code is retained.
1ef2d9c

evanmcc closed this Mar 20, 2014

seancribbs deleted the slf-tombstone-management branch Apr 1, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment