
Slf tombstone management+sub second epochs #134

slfritchie (Contributor):

Just when I thought I was done last week with this branch-of-a-branch, PULSE found another problem.

  • The 'epoch' subdivision of Bitcask timestamps: this fixes fold bugs where merge mutations can be seen by folding procs.
  • The sibling->regular entry conversion sweeper helps reclaim memory by converting siblings back to single entries whenever mutations happened during a fold and the keydir has since been quiescent long enough to allow sweeping during get/put/remove operations. If a new fold starts in the middle of a sweep, the sweep stops and resumes when it's able to.
  • Model corrections to avoid false positive reports. Until we get a temporal logic hacker skilled enough to build a real model, or fix the silly keydir frozen feature, I'm afraid that this is as good as I can do.

cc: @engelsanchez @evanmcc @jonmeredith @justinsheehy

In order for the multi-folder work to operate correctly, it
needs to be able to keep track of exactly when a fold started
relative to any mutations that happen during the same 1-second
time interval.  If a merge process cannot tell exactly if a
mutation happened before or after its fold started, then merge
may do the wrong thing: operate on a key zero times, or
operate on a key multiple times.

An epoch counter now subdivides Bitcask timestamps.  The epoch
counter is incremented whenever an iterator is formed.  A new NIF
was added, keydir_fold_is_starting(), to inform a fold what the
current epoch is.  The fold starting timestamp + epoch are used for
all get operations that the folder performs.
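
Conceptually, the fold's snapshot point and each entry's write point
become (tstamp, epoch) pairs compared lexicographically, with the
epoch breaking ties inside the same 1-second timestamp.  A minimal
sketch of that ordering (names here are illustrative, not the actual
NIF code):

    #include <stdint.h>

    /* Sketch only: compare two (tstamp, epoch) points in time,
     * returning <0, 0, >0 like strcmp.  The epoch counter, bumped
     * every time an iterator is created, orders events that fall
     * inside the same 1-second timestamp. */
    static int
    compare_times(uint32_t ts_a, uint32_t ep_a,
                  uint32_t ts_b, uint32_t ep_b)
    {
        if (ts_a != ts_b)
            return ts_a < ts_b ? -1 : 1;
        if (ep_a != ep_b)
            return ep_a < ep_b ? -1 : 1;
        return 0;
    }

    /* A folder's get accepts only entries at or before the fold's
     * snapshot point. */
    static int
    visible_to_fold(uint32_t entry_ts, uint32_t entry_ep,
                    uint32_t fold_ts, uint32_t fold_ep)
    {
        return compare_times(entry_ts, entry_ep, fold_ts, fold_ep) <= 0;
    }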

If the keydir contains only a single entry for a key, there's
no need to store the epoch with that key.  When there are multiple
entries per key, the epoch is stored in struct
bitcask_keydir_entry_sib.
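
For orientation, a simplified sketch of the two layouts is below.
The struct names match the text and the diff later on this page, but
the exact field set (file_id, key_sz, and where orig_tstamp lives) is
assumed; the real definitions in bitcask_nifs.c differ in detail.

    #include <stdint.h>

    /* Sketch only.  Single-entry case: ~3 words + key bytes, no epoch. */
    struct entry_sketch {
        uint32_t file_id;
        uint32_t total_sz;
        uint64_t offset;
        uint32_t tstamp;
        uint16_t key_sz;
        char     key[];              /* key bytes follow inline */
    };

    /* Multi-entry case: a head (~2 words + key bytes) points at a
     * newest-to-oldest sibling list (~5 words per sibling); only the
     * siblings need to carry the epoch. */
    struct sib_sketch {
        struct sib_sketch *next;     /* newest-to-oldest order */
        uint64_t offset;
        uint32_t file_id;
        uint32_t total_sz;
        uint32_t tstamp;
        uint32_t tstamp_epoch;       /* subdivides the 1-second tstamp */
        uint32_t orig_tstamp;        /* expiry time; placement assumed */
    };

    struct entry_head_sketch {
        struct sib_sketch *sibs;
        uint16_t key_sz;
        char     key[];
    };

With 8-byte words these sketches come to 3 words, 2 words, and 5
words respectively (plus key bytes), matching the back-of-the-envelope
numbers further down.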

Things are very tricky (alas) when keeping entries in the
siblings' 'next' linked list in newest-to-oldest timewise order.
A merge can do something "newer" wall-clock-wise with a mutation
that is "older" by that same wall-clock view.  The 'tstamp'
stored in the keydir is the wall-clock time when the entry
was written for any reason, including a merge.  However, we
do NOT want a merge to change a key's expiration time, so a
merge may not change a key's tstamp -- the solution is to have
the keydir also store the key's 'orig_tstamp', a copy
of the key's specified-by-the-client timestamp, for expiration
purposes.
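
Stated as an update rule, it is roughly the following (a sketch, not
the real keydir code):

    #include <stdint.h>

    /* Sketch of the invariant: a client put refreshes both timestamps,
     * while a merge rewrite refreshes tstamp but must carry orig_tstamp
     * forward unchanged, so expiry still keys off the client's write. */
    struct times_sketch {
        uint32_t tstamp;       /* wall clock of latest write, any cause */
        uint32_t orig_tstamp;  /* client's write time; drives expiration */
    };

    static void
    apply_write_sketch(struct times_sketch *t, uint32_t now, int is_merge)
    {
        t->tstamp = now;            /* merges may move this forward */
        if (!is_merge)
            t->orig_tstamp = now;   /* only client writes reset expiry */
    }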

To avoid misbehavior of merging when the OS system clock
moves backward across a 1-second boundary, there is new checking
for mutations where Now < keydir->biggest_timestamp.  Operations
are retried when this condition is detected.
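
A rough sketch of that check, assuming a time(2)-style 1-second
clock; the real NIF retries the whole operation rather than spinning
in place:

    #include <stdint.h>
    #include <time.h>

    /* Sketch: re-read the clock until it is not behind the keydir's
     * high-water mark, so a backward step of the OS clock across a
     * 1-second boundary cannot produce out-of-order entries. */
    static uint32_t
    wait_for_sane_clock(uint32_t biggest_timestamp)
    {
        for (;;) {
            uint32_t now = (uint32_t)time(NULL);
            if (now >= biggest_timestamp)
                return now;
            /* Now < biggest_timestamp: clock went backward, retry */
        }
    }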

Try to avoid false positives in the
bitcask_pulse:check_no_tombstones() predicate by calling it only
when the cask is opened in read-write mode.

Remove the fold_visits_unfrozen_test_() and replace it with a
corrected fold_snapshotX_test_().

The amortization mechanism attempts to limit itself to less than
1 msec of additional latency for any get/put/delete call, but as
shown below, it doesn't always stay strictly under that limit when
you're freeing hundreds of megabytes of data.
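
The general shape of that amortization is sketched below;
sweep_one_sketch() stands in for the real sibling-to-single
conversion of one key, and the cursor makes the sweep resumable
across calls:

    #include <stddef.h>
    #include <stdint.h>
    #include <time.h>

    #define BUDGET_NSEC 1000000ull   /* ~1 msec per get/put/delete call */

    static uint64_t
    now_nsec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    }

    /* Sketch: resumable sweep.  Convert pending entries until either
     * the work is done or the latency budget for this call is spent;
     * the cursor lets the next get/put/delete pick up where we left
     * off. */
    static void
    sweep_some_sketch(size_t *cursor, size_t total,
                      void (*sweep_one_sketch)(size_t))
    {
        uint64_t deadline = now_nsec() + BUDGET_NSEC;

        while (*cursor < total && now_nsec() < deadline)
            sweep_one_sketch((*cursor)++);
        /* The real sweeper also stops here if a new fold starts,
         * resuming only when the keydir is quiescent again. */
    }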

Below are histograms showing the NIF function latency as recorded
on OS X on my MBP for a couple of different mutation rates: 90%
and 9%.  The workload is:

* Put 10M keys
* Create an iterator
* Modify some percentage of the keys (e.g. 90%, 9%)
* Close the iterator
* Fetch all 10M keys, measuring the NIF latency time of each call.

---- snip ---- snip ---- snip ---- snip ---- snip ----

By my back-of-the-envelope calculations (1 word = 8 bytes)

      bitcask_keydir_entry = 3 words + key bytes
    vs.
      bitcask_keydir_entry_head = 2 words + key bytes
      plus
      bitcask_keydir_entry_sib = 5 words

So each 2-sibling entry is using 2 + (2*5) = 12 words, not counting
the key bytes.
After the sibling sweep, we're going from 12 words -> 3 words per key.

So for 10M keys, that's a savings of 9 words per key:
10M * 9 words * 8 bytes/word = 720,000,000 bytes -> ~687MBytes.

(RSS for the 10M keys @ 90% mutation tests peaks at 1.45GB of RAM.  By
 comparison, the 10M key @ 0% mutation test peaks at ~850MByte RSS.
 So, those numbers roughly match, yay.)

No wonder there's a very small number of
bitcask_nifs_keydir_get_int() calls that take > 10msec to
finish: it may be the OS getting involved with side effects from
free(3) calls?

** 90% mutation

10M keys, ~90% mutated, 0% deleted
via:
    bitcask_nifs:yoo(10*1000000, 9*1000*1000, 0).

*** Tracing on for sequence 1...

    bitcask_nifs_keydir_get_int latency with off-cpu (usec)
             value  ------------- Distribution ------------- count
                 8 |                                         0
                16 |@@@@@@@@@@@@@@@                          40
                32 |@@@@@@                                   17
                64 |@@@                                      8
               128 |@@                                       5
               256 |@@                                       5
               512 |@@@@@@@@@@                               27
              1024 |@@                                       5
              2048 |                                         0

    bitcask_nifs_keydir_get_int latency (usec)
             value  ------------- Distribution ------------- count
                 0 |                                         0
                 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         7954469
                 2 |@@@@@@@@                                 2051656
                 4 |                                         10440
                 8 |                                         25446
                16 |                                         209
                32 |                                         5
                64 |                                         0
               128 |                                         9
               256 |                                         0
               512 |                                         11698
              1024 |                                         723
              2048 |                                         12
              4096 |                                         0
              8192 |                                         0
             16384 |                                         1
             32768 |                                         1
             65536 |                                         0

*** Tracing on for sequence 4...

    bitcask_nifs_keydir_get_int latency with off-cpu (usec)
             value  ------------- Distribution ------------- count
                 8 |                                         0
                16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              56
                32 |@@@@@@                                   13
                64 |@@@@                                     8
               128 |@@                                       5
               256 |                                         0

    bitcask_nifs_keydir_get_int latency (usec)
             value  ------------- Distribution ------------- count
                 0 |                                         0
                 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          7754589
                 2 |@@@@@@@@@                                2211364
                 4 |                                         15906
                 8 |                                         25269
                16 |                                         375
                32 |                                         8
                64 |                                         0

** 9% mutation

10M keys, ~9% mutated, 0% deleted
via:
    bitcask_nifs:yoo(10*1000000, 9 * 100*1000, 0).

*** Tracing on for sequence 1...

    bitcask_nifs_keydir_get_int latency with off-cpu (usec)
             value  ------------- Distribution ------------- count
                 8 |                                         0
                16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             64
                32 |@@@@@                                    12
                64 |@@@                                      7
               128 |@@                                       4
               256 |@                                        3
               512 |@                                        2
              1024 |                                         0

    bitcask_nifs_keydir_get_int latency (usec)
             value  ------------- Distribution ------------- count
                 0 |                                         0
                 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           7530608
                 2 |@@@@@@@@@@                               2465556
                 4 |                                         12014
                 8 |                                         24721
                16 |                                         234
                32 |                                         6
                64 |                                         0
               128 |                                         0
               256 |                                         1
               512 |                                         1473
              1024 |                                         3
              2048 |                                         0

*** Tracing on for sequence 4...

    bitcask_nifs_keydir_get_int latency with off-cpu (usec)
             value  ------------- Distribution ------------- count
                 4 |                                         0
                 8 |@                                        3
                16 |@@@@@@@@@@@@@@@@@@@@@@@@@                55
                32 |@@@@@@@                                  15
                64 |@@                                       5
               128 |@@@                                      6
               256 |@                                        2
               512 |                                         1
              1024 |                                         0

    bitcask_nifs_keydir_get_int latency (usec)
             value  ------------- Distribution ------------- count
                 0 |                                         0
                 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           7631794
                 2 |@@@@@@@@@                                2345179
                 4 |                                         13769
                 8 |                                         26056
                16 |                                         324
                32 |                                         22
                64 |                                         2
               128 |                                         0

The EUnit test is freeze_close_reopen_test(), which forces an
actual old-style freeze of the keydir and then checks the sanity
of folds while frozen.

The PULSE model change is something I'm not 100% happy with,
but anytime the PULSE model has false positives, it takes a huge
amount of time to determine that it's a false alarm.  So, this
change should eliminate a rare source of false reports ... but I
hope I haven't introduced something that will also hide a real
error.

The problem comes from a read-only proc folding a cask and
keeping it frozen, while the read-write proc closes & reopens
the cask and does a fold.  If the keydir has been frozen the
entire time, the PULSE model doesn't know about the freezing and
thus reports an error in the fold/fold_keys results.

This model change discovers whether there has ever been a fold
in progress while the 1st/read-write pid opens the cask.  If so,
fold/fold_keys mismatches are excused.
…ULSE test

Bitcask's fold semantics are difficult enough to predict,
but when a keydir is actually frozen, the job is even more difficult.
This NIF is added to reliably inspect whether the keydir was frozen
during a PULSE test case; if we find a fold or fold_keys problem,
we let it pass.

The new NIF is also tested by an existing EUnit test,
keydir_wait_pending_test().
@@ -137,6 +155,7 @@ struct bitcask_keydir_entry_sib
     uint32_t total_sz;
     uint64_t offset;
     uint32_t tstamp;
+    uint32_t tstamp_epoch;

Contributor:

I imagine you want uint8_t here too, not uint32_t


slfritchie (Author):

Yup.

@engelsanchez engelsanchez merged commit 5f4d145 into slf-tombstone-management Feb 19, 2014
@engelsanchez (Contributor):

Merged this onto its parent branch to continue review on all changes at once.

evanmcc added a commit that referenced this pull request Feb 25, 2014
- pull in more aggressive pulse tunings from #128
- remove all test sleeps related to old 1 second timestamps, so that
  things will break if old code is retained.
evanmcc added a commit that referenced this pull request Mar 5, 2014
- pull in more aggressive pulse tunings from #128
- remove all test sleeps related to old 1 second timestamps, so that
  things will break if old code is retained.
evanmcc added a commit that referenced this pull request Mar 7, 2014
- pull in more aggressive pulse tunings from #128
- remove all test sleeps related to old 1 second timestamps, so that
  things will break if old code is retained.
@seancribbs seancribbs deleted the slf-tombstone-management+sub-second-epochs branch April 1, 2015 22:35