Slf tombstone management+sub second epochs #134

Merged: engelsanchez merged 7 commits into slf-tombstone-management from slf-tombstone-management+sub-second-epochs on Feb 19, 2014
Conversation
In order for the multi-folder work to operate correctly, it needs to keep track of exactly when a fold started relative to any mutations that happen during the same 1-second time interval. If a merge process cannot tell whether a mutation happened before or after its fold started, the merge may do the wrong thing: operate on a key zero times, or operate on a key multiple times.

An epoch counter now subdivides Bitcask timestamps. The epoch counter is incremented whenever an iterator is formed. A new NIF, keydir_fold_is_starting(), was added to inform a fold what the current epoch is. The fold's starting timestamp + epoch are used for all get operations that the folder performs.

If the keydir contains only a single entry for a key, there's no need to store the epoch with that key. When there are multiple entries per key, the epoch is stored in the struct bitcask_keydir_entry_sib.

Things are very tricky (alas) when keeping entries in the siblings' 'next' linked list in newest-to-oldest timewise order: a merge can do something "newer" wall-clock-wise with a mutation that is "older" by that same wall-clock view. The 'tstamp' stored in the keydir is the wall-clock time when the entry was written for any reason, including a merge. However, we do NOT want a merge to change a key's expiration time, so a merge may not change a key's tstamp. The solution is to have the keydir also store the key's 'orig_tstamp', a copy of the key's specified-by-the-client timestamp, for expiration purposes.

To avoid misbehavior of merging when the OS system clock moves backward across a 1-second boundary, there is new checking for mutations where Now < keydir->biggest_timestamp. Operations are retried when this condition is detected.

Try to avoid false positives in the bitcask_pulse:check_no_tombstones() predicate by calling it only when the cask is opened in read-write mode. Remove fold_visits_unfrozen_test_() and replace it with a corrected fold_snapshotX_test_().
The amortization mechanism attempts to limit itself to less than 1 msec of additional latency for any get/put/delete call, but as shown below, it doesn't always stay strictly under that limit when you're freeing hundreds of megabytes of data. Below are histograms showing the NIF function latency as recorded on OS X on my MBP for a couple of different mutation rates: 90% and 9%.

The workload is:

* Put 10M keys
* Create an iterator
* Modify some percentage of the keys (e.g. 90%, 9%)
* Close the iterator
* Fetch all 10M keys, measuring the NIF latency time of each call.

---- snip ---- snip ---- snip ---- snip ---- snip ----

By my back-of-the-envelope calculations (1 word = 8 bytes):

* bitcask_keydir_entry = 3 words + key bytes
* bitcask_keydir_entry_head = 2 words + key bytes
* bitcask_keydir_entry_sib = 5 words

So each 2-sibling entry uses 2 + (2*5) = 12 words, not counting the key bytes. After the sibling sweep, we're going from 12 words to 3 words per key. So for 10M keys, that's a savings of 9 words per key, or about 687 MB. (RSS for the 10M keys @ 90% mutation test peaks at 1.45 GB of RAM. By comparison, the 10M keys @ 0% mutation test peaks at ~850 MB RSS. So those numbers roughly match, yay.) No wonder a very small number of bitcask_nifs_keydir_get_int() calls take > 10 msec to finish: it may be the OS getting involved with side effects from free(3) calls.

** 90% mutation

10M keys, ~90% mutated, 0% deleted via: bitcask_nifs:yoo(10*1000000, 9*1000*1000, 0).

*** Tracing on for sequence 1...
```
bitcask_nifs_keydir_get_int latency with off-cpu (usec)
 value  ------------- Distribution ------------- count
     8 |                                         0
    16 |@@@@@@@@@@@@@@@                          40
    32 |@@@@@@                                   17
    64 |@@@                                      8
   128 |@@                                       5
   256 |@@                                       5
   512 |@@@@@@@@@@                               27
  1024 |@@                                       5
  2048 |                                         0

bitcask_nifs_keydir_get_int latency (usec)
 value  ------------- Distribution ------------- count
     0 |                                         0
     1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         7954469
     2 |@@@@@@@@                                 2051656
     4 |                                         10440
     8 |                                         25446
    16 |                                         209
    32 |                                         5
    64 |                                         0
   128 |                                         9
   256 |                                         0
   512 |                                         11698
  1024 |                                         723
  2048 |                                         12
  4096 |                                         0
  8192 |                                         0
 16384 |                                         1
 32768 |                                         1
 65536 |                                         0
```

*** Tracing on for sequence 4...

```
bitcask_nifs_keydir_get_int latency with off-cpu (usec)
 value  ------------- Distribution ------------- count
     8 |                                         0
    16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              56
    32 |@@@@@@                                   13
    64 |@@@@                                     8
   128 |@@                                       5
   256 |                                         0

bitcask_nifs_keydir_get_int latency (usec)
 value  ------------- Distribution ------------- count
     0 |                                         0
     1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          7754589
     2 |@@@@@@@@@                                2211364
     4 |                                         15906
     8 |                                         25269
    16 |                                         375
    32 |                                         8
    64 |                                         0
```

** 9% mutation

10M keys, ~9% mutated, 0% deleted via: bitcask_nifs:yoo(10*1000000, 9 * 100*1000, 0).

*** Tracing on for sequence 1...

```
bitcask_nifs_keydir_get_int latency with off-cpu (usec)
 value  ------------- Distribution ------------- count
     8 |                                         0
    16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             64
    32 |@@@@@                                    12
    64 |@@@                                      7
   128 |@@                                       4
   256 |@                                        3
   512 |@                                        2
  1024 |                                         0

bitcask_nifs_keydir_get_int latency (usec)
 value  ------------- Distribution ------------- count
     0 |                                         0
     1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           7530608
     2 |@@@@@@@@@@                               2465556
     4 |                                         12014
     8 |                                         24721
    16 |                                         234
    32 |                                         6
    64 |                                         0
   128 |                                         0
   256 |                                         1
   512 |                                         1473
  1024 |                                         3
  2048 |                                         0
```

*** Tracing on for sequence 4...
```
bitcask_nifs_keydir_get_int latency with off-cpu (usec)
 value  ------------- Distribution ------------- count
     4 |                                         0
     8 |@                                        3
    16 |@@@@@@@@@@@@@@@@@@@@@@@@@                55
    32 |@@@@@@@                                  15
    64 |@@                                       5
   128 |@@@                                      6
   256 |@                                        2
   512 |                                         1
  1024 |                                         0

bitcask_nifs_keydir_get_int latency (usec)
 value  ------------- Distribution ------------- count
     0 |                                         0
     1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           7631794
     2 |@@@@@@@@@                                2345179
     4 |                                         13769
     8 |                                         26056
    16 |                                         324
    32 |                                         22
    64 |                                         2
   128 |                                         0
```
The EUnit test is freeze_close_reopen_test(), which forces an actual old-style freeze of the keydir and then checks the sanity of folds while frozen.

The PULSE model change is something I'm not 100% happy with, but any time the PULSE model has false positives, it takes a huge amount of time to determine that it's a false alarm. This change should eliminate a rare source of false reports, though I hope I haven't introduced something that will also hide a real error. The problem comes from a read-only proc folding a cask and freezing it, after which the read-write proc closes & reopens the cask and does a fold. If the keydir has been frozen the entire time, the PULSE model doesn't know about the freezing and thus reports an error in fold/fold_keys results. The model change discovers whether there has ever been a fold in progress while the 1st/read-write pid opens the cask; if yes, fold/fold_keys mismatches are excused.
…ULSE test

Bitcask's fold semantics are difficult enough to predict, but when a keydir is actually frozen, the job is even harder. This NIF is added to reliably inspect whether the keydir was frozen during a PULSE test case; if we find a fold or fold_keys problem while frozen, we let it pass. The new NIF is also exercised by an existing EUnit test, keydir_wait_pending_test().
```diff
@@ -137,6 +155,7 @@ struct bitcask_keydir_entry_sib
     uint32_t total_sz;
     uint64_t offset;
     uint32_t tstamp;
+    uint32_t tstamp_epoch;
```
I imagine you want uint8_t here too, not uint32_t
Yup.
Merged this onto its parent branch to continue review on all changes at once.
evanmcc added a commit that referenced this pull request on Feb 25, 2014:

> - pull in more aggressive pulse tunings from #128
> - remove all test sleeps related to old 1 second timestamps, so that things will break if old code is retained.
evanmcc added a commit that referenced this pull request on Mar 5, 2014:

> - pull in more aggressive pulse tunings from #128
> - remove all test sleeps related to old 1 second timestamps, so that things will break if old code is retained.
evanmcc added a commit that referenced this pull request on Mar 7, 2014:

> - pull in more aggressive pulse tunings from #128
> - remove all test sleeps related to old 1 second timestamps, so that things will break if old code is retained.
Just when I thought I was done last week with this branch-of-a-branch, PULSE found another problem.
cc: @engelsanchez @evanmcc @jonmeredith @justinsheehy