Slf tombstone management+sub second epochs #134

Merged: engelsanchez merged 7 commits into slf-tombstone-management from slf-tombstone-management+sub-second-epochs on Feb 19, 2014
Conversation
In order for the multi-folder work to operate correctly, it needs to keep track of exactly when a fold started relative to any mutations that happen during the same 1-second time interval. If a merge process cannot tell whether a mutation happened before or after its fold started, the merge may do the wrong thing: operate on a key zero times, or operate on a key multiple times.

An epoch counter now subdivides Bitcask timestamps. The epoch counter is incremented whenever an iterator is formed. A new NIF, keydir_fold_is_starting(), was added to inform a fold what the current epoch is. The fold's starting timestamp + epoch are used for all get operations that the folder performs.

If the keydir contains only a single entry for a key, there's no need to store the epoch with that key. When there are multiple entries per key, the epoch is stored in the struct bitcask_keydir_entry_sib.

Things are very tricky (alas) when keeping entries in the siblings' 'next' linked list in newest-to-oldest timewise order: a merge can do something "newer" wall-clock-wise with a mutation that is "older" by that same wall-clock view. The 'tstamp' stored in the keydir is the wall-clock time when the entry was written for any reason, including a merge. However, we do NOT want a merge to change a key's expiration time, so a merge may not change a key's tstamp. The solution is to have the keydir also store the key's 'orig_tstamp', a copy of the key's specified-by-the-client timestamp, for expiration purposes.

To avoid misbehavior of merging when the OS system clock moves backward across a 1-second boundary, there is new checking for mutations where Now < keydir->biggest_timestamp. Operations are retried when this condition is detected.

Try to avoid false positives in the bitcask_pulse:check_no_tombstones() predicate by calling it only when the cask is opened in read-write mode. Remove fold_visits_unfrozen_test_() and replace it with a corrected fold_snapshotX_test_().
The amortization mechanism attempts to limit itself to less than 1 msec of additional latency for any get/put/delete call, but as shown below, it doesn't always stay strictly under that limit when you're freeing hundreds of megabytes of data. Below are histograms showing the NIF function latency as recorded on OS X on my MBP for a couple of different mutation rates: 90% and 9%.

The workload is:

* Put 10M keys
* Create an iterator
* Modify some percentage of the keys (e.g. 90%, 9%)
* Close the iterator
* Fetch all 10M keys, measuring the NIF latency time of each call.

---- snip ---- snip ---- snip ---- snip ---- snip ----

By my back-of-the-envelope calculations (1 word = 8 bytes):

* bitcask_keydir_entry = 3 words + key bytes
* bitcask_keydir_entry_head = 2 words + key bytes
* bitcask_keydir_entry_sib = 5 words

So each 2-sibling entry uses 2 + (2*5) = 12 words, not counting the key bytes. After the sibling sweep, we're going from 12 words to 3 words per key. So for 10M keys, that's a savings of 9 words per key, or about 687 MB. (RSS for the 10M keys @ 90% mutation test peaks at 1.45 GB of RAM. By comparison, the 10M keys @ 0% mutation test peaks at ~850 MB RSS. So those numbers roughly match, yay.) No wonder a very small number of bitcask_nifs_keydir_get_int() calls take > 10 msec to finish: it may be the OS getting involved with side effects from free(3) calls.

** 90% mutation

10M keys, ~90% mutated, 0% deleted via: bitcask_nifs:yoo(10*1000000, 9*1000*1000, 0).

*** Tracing on for sequence 1...
```
bitcask_nifs_keydir_get_int latency with off-cpu (usec)
 value  ------------- Distribution ------------- count
     8 |                                         0
    16 |@@@@@@@@@@@@@@@                          40
    32 |@@@@@@                                   17
    64 |@@@                                      8
   128 |@@                                       5
   256 |@@                                       5
   512 |@@@@@@@@@@                               27
  1024 |@@                                       5
  2048 |                                         0

bitcask_nifs_keydir_get_int latency (usec)
 value  ------------- Distribution ------------- count
     0 |                                         0
     1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         7954469
     2 |@@@@@@@@                                 2051656
     4 |                                         10440
     8 |                                         25446
    16 |                                         209
    32 |                                         5
    64 |                                         0
   128 |                                         9
   256 |                                         0
   512 |                                         11698
  1024 |                                         723
  2048 |                                         12
  4096 |                                         0
  8192 |                                         0
 16384 |                                         1
 32768 |                                         1
 65536 |                                         0
```

*** Tracing on for sequence 4...

```
bitcask_nifs_keydir_get_int latency with off-cpu (usec)
 value  ------------- Distribution ------------- count
     8 |                                         0
    16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              56
    32 |@@@@@@                                   13
    64 |@@@@                                     8
   128 |@@                                       5
   256 |                                         0

bitcask_nifs_keydir_get_int latency (usec)
 value  ------------- Distribution ------------- count
     0 |                                         0
     1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          7754589
     2 |@@@@@@@@@                                2211364
     4 |                                         15906
     8 |                                         25269
    16 |                                         375
    32 |                                         8
    64 |                                         0
```

** 9% mutation

10M keys, ~9% mutated, 0% deleted via: bitcask_nifs:yoo(10*1000000, 9 * 100*1000, 0).

*** Tracing on for sequence 1...

```
bitcask_nifs_keydir_get_int latency with off-cpu (usec)
 value  ------------- Distribution ------------- count
     8 |                                         0
    16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             64
    32 |@@@@@                                    12
    64 |@@@                                      7
   128 |@@                                       4
   256 |@                                        3
   512 |@                                        2
  1024 |                                         0

bitcask_nifs_keydir_get_int latency (usec)
 value  ------------- Distribution ------------- count
     0 |                                         0
     1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           7530608
     2 |@@@@@@@@@@                               2465556
     4 |                                         12014
     8 |                                         24721
    16 |                                         234
    32 |                                         6
    64 |                                         0
   128 |                                         0
   256 |                                         1
   512 |                                         1473
  1024 |                                         3
  2048 |                                         0
```

*** Tracing on for sequence 4...
```
bitcask_nifs_keydir_get_int latency with off-cpu (usec)
 value  ------------- Distribution ------------- count
     4 |                                         0
     8 |@                                        3
    16 |@@@@@@@@@@@@@@@@@@@@@@@@@                55
    32 |@@@@@@@                                  15
    64 |@@                                       5
   128 |@@@                                      6
   256 |@                                        2
   512 |                                         1
  1024 |                                         0

bitcask_nifs_keydir_get_int latency (usec)
 value  ------------- Distribution ------------- count
     0 |                                         0
     1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           7631794
     2 |@@@@@@@@@                                2345179
     4 |                                         13769
     8 |                                         26056
    16 |                                         324
    32 |                                         22
    64 |                                         2
   128 |                                         0
```
The EUnit test is freeze_close_reopen_test(), which forces an actual old-style freeze of the keydir and then checks the sanity of folds while frozen.

The PULSE model change is something I'm not 100% happy with, but any time the PULSE model has false positives, it takes a huge amount of time to determine that it's a false alarm. This change should eliminate a rare source of false reports, though I hope I haven't introduced something that will also hide a real error. The problem comes from a read-only proc folding a cask and freezing it, after which the read-write proc closes & reopens the cask and does a fold. If the keydir has been frozen the entire time, the PULSE model doesn't know about the freezing and thus reports an error in fold/fold_keys results. The model change discovers whether there has ever been a fold in progress while the 1st/read-write pid opens the cask; if yes, fold/fold_keys mismatches are excused.
…ULSE test

Bitcask's fold semantics are difficult enough to predict, but when a keydir is actually frozen, the job is even harder. This NIF is added to reliably inspect whether the keydir was frozen during a PULSE test case; if we find a fold or fold_keys problem while frozen, we let it pass. The new NIF is also exercised by an existing EUnit test, keydir_wait_pending_test().
```diff
@@ -137,6 +155,7 @@ struct bitcask_keydir_entry_sib
     uint32_t total_sz;
     uint64_t offset;
     uint32_t tstamp;
+    uint32_t tstamp_epoch;
```
I imagine you want uint8_t here too, not uint32_t
Yup.
Merged this onto its parent branch to continue review on all changes at once.
evanmcc added a commit that referenced this pull request on Feb 25, 2014:

> - pull in more aggressive pulse tunings from #128
> - remove all test sleeps related to old 1 second timestamps, so that things will break if old code is retained.
evanmcc added a commit that referenced this pull request on Mar 5, 2014:

> - pull in more aggressive pulse tunings from #128
> - remove all test sleeps related to old 1 second timestamps, so that things will break if old code is retained.
evanmcc added a commit that referenced this pull request on Mar 7, 2014:

> - pull in more aggressive pulse tunings from #128
> - remove all test sleeps related to old 1 second timestamps, so that things will break if old code is retained.
Just when I thought I was done last week with this branch-of-a-branch, PULSE found another problem.
cc: @engelsanchez @evanmcc @jonmeredith @justinsheehy