- log when non-existent fstats entries are trimmed - dedup the fstats list so the above logging isn't constantly triggered.
files update to remove entries for deleted files so they are no longer considered for merge or for status
The fold code was creating a timestamp *before* the keydir was frozen and using that to look at a snapshot of the data. This was hidden by the fact that the snapshot code would give in and give you a closest match if no entry found for the given timestamp, even if this violated the snapshot logic. When I cleaned up that logic, this surfaced. Also, fixing get to return not found in the case an entry for the given snapshot time was not found.
This is an attempt to clean up the multifold logic to make it more maintanable. - New find_result struct holds the results of a find operation. This simplifies the logic of most operations that use a find and later have to figure out exactly where things were found, whether they were a tombstone, etc. - Removes the need to malloc temporary entries while manipulating the keydir. For this, a new proxy struct was created to hold the values of a regular or list entry when it's found. - Removes the need to malloc temporary entries when looking up stuff from the keydir. A new khash operation was added to allow custom hash and equal functions to be used, so you can look up from something other than a variable length entry. - Adds locking to iteration now that it can happen while puts are updating the entries hash. Found during review of original PR
====================================== The problem: ------------ Bitcask was designed to have a single iteration in progress at any one time (multiple concurrent iterations are allowed, but must explicitly ask to use the old snapshot). Effectively this means that to have a trustworthy snapshot, subsequent iterations must wait upon the first to complete, whereupon they're awoken and can attemt to iterate. The fix: -------- The code change here makes a second form of bitcask keydir entry, the entry list. Entry lists are a deduplicated version of entries. The head contains the key, the key length, and a pointer to a list of siblings, which contain the keydir values that can vary from instance to instance of a key. When a put or delete happens while a fold is under way, the entry is converted to an entry list with the original entry and the update. Gets (including fold gets) now discriminate between versions via timestamp. An arguable implementation decision here was to implement entry lists as tagged pointers (LSB set to 1). This does make reasoning about types more complicated, but in a space-constrained environment, the extra complexity is worth the space overhead incurred by a union type with an internal tag. _A complication_ Keydir entries are stored in an in-memory hashtable using the khash library. Unfortunately, khash isn't iteration-safe across resizes. Thus, when we detect that a new put would cause a resize, we fall back on the old path, where a pending keydir is instantiated and all new updates are noted there until all running iterations complete and the keydirs are folded back in. While this still means that folds can block for a potentially long time, since khash doubles the size of its storage at each resize, that populated keydirs are much less likely to run into blocking behavior. Testing: -------- Bitcask has a reasonable test suite. It should be run repeatedly in both IO modes. Note that #93 is still open and fails occasionally. That code is not changed here, so should not be counted. A long PULSE testing run should be done. I've accumulated ~9 hours or so have been done so far, but ideally more should be done. Some riak tests that involve iteration and listing would be good to attempt. Real-world fold-speed tests like handoff and AAE rebuilds would also be interesting to see. The problematic sitation that we're attempting to avoid here is something like this: - on a large keydir, a slow iteration, like a merge, is happening. - at the same time, AAE is waiting with a thrown-away tree - and someone is trying to do a MR job that requires a keyfold iteration You should be able to run riak without this changeand see increased latencies on fold-dependent operations during merges and AAE tree rebuilds. With this change there should be fewer and shorter latency spikes while these things are going on. Alternate approaches: --------------------- _Switch to a hash structure with snapshots_ This likely is a good idea, rather than the current hybrid approach but the existing options are honestly somewhat uninspiring. Because of time limitations, it was decided that there simply wasn't enough time to do this. _layers of pending keydirs_ While this is intuitively appealing, there are two major problems with this approach. The first is the change of get behavior from O(1) to O(N), which could be problematic if N is large. The second is more serious, which is that it's impossible to establish a consistent multi-level iteration snapshot without a lot of memory or time overhead. Eg. with levels 1, 2, and 3, and values A, B, and C ``` 1 A:3 B:4 2 B:del C:7 3 A:1 B:5 C:nf ``` You need to either make a new keydir and merge all the pending keydirs into that temporary keydir for the snapshot, or keep track of seen keys, which is essentially the same approach. Some thinking about the issue suggests that instead of N keydirs, you'd really need N**2 keydirs. Open work: ---------- _rw locks_ This approach requires taking a lock per step of the iterator. It's possible that this could slow down folds considerably due to lock contention. It may be better to move from the `enif_mutex` to a `posix_rw_lock`. This isn't a trivial change, I don't think, so careful measurement should be done to see if it's a problem before we do the work. This change has broader appeal, as it could potentially make concurrent readers (once we've added multiple readers per vnode) faster. _gc thread_ At the moment, we only clean up entry-lists when the value is updated, so if you had a large run of writes during a fold that were then never updated, the overhead there might never be reclaimed. It may be worth it to add a thread that slowly iterates over the keydir when there are no active folders and cleans up existing entry lists. Files changed, w/ brief summary of changes: ------------------------------------------- - `c_src/bitcask_nifs.c` the bulk of the changes described above are here - `src/bitcask.erl` remove keydir git before put instrument folds with time for snapshots update tests, add new tests - `src/bitcask_nifs.erl` add keydir_get/3, timestamp specific get make itr timestamps consitent with the other bitcask timestamps update tests to work with new behavior - `c_src/kash.h` add will_resize check - `src/bitcask_fileops.erl` merges expect hintfiles to always be created, but multiple concurrent folds seem to violate this invariant. sleep for a short time and then abort the merge if the hintfile doesn't get created. - `test/bitcask_qc.erl` changes to how deletion is accounted for, as the keydir put before delete has been removed to reduce the number of entry list members