Avoid re-fragmenting memtable range tombstones on each read #4808
Do I understand this correctly? Fragmented tombstones should be computed during writes and cached. Reads would then reference the cache instead of recomputing.
@abhimadan Can you elaborate? Is this different from what I have proposed above?
What I meant by caching the fragmented tombstones was that, during the first read after a range deletion, the fragmented memtable tombstones would be stored and used for subsequent reads, until the next range deletion invalidates the cache. So the "caching" happens on the read path.

On the write path, I was thinking of using a data structure that would fragment a tombstone appropriately after it is written. So instead of doing the fragmentation in a separate pass and caching the result, we could go even further and fragment the tombstones in place, which I expect to be cheaper, as it wouldn't require re-creating a new data structure after every range deletion. The tricky part is that this would modify entries already in the memtable, which the skiplist memtable forbids. @ajkr had an idea to make a special range deletion memtable that would give each range tombstone fragment entry two sequence numbers, so that tombstones inserted into the memtable could represent either new tombstones or fragments of existing tombstones (I think we may have a more detailed description of this somewhere, if you're interested).

TL;DR: I was proposing either caching on the read path or an in-place fragmented data structure on the write path.
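For intuition about what "fragmenting" means here, a minimal sketch under simplifying assumptions (hypothetical `Tombstone` struct and `Fragment` function, string keys, and only the newest sequence number kept per fragment; the real fragmented list keeps all covering sequence numbers so reads at older snapshots still work):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical simplified model: a range tombstone deletes keys in
// [start, end) and carries the sequence number of its DeleteRange.
struct Tombstone {
  std::string start, end;
  uint64_t seq;
};

// Split possibly-overlapping tombstones into non-overlapping fragments.
// For brevity, each fragment keeps only the newest covering seqno.
std::vector<Tombstone> Fragment(const std::vector<Tombstone>& ts) {
  std::set<std::string> bounds;  // every start/end key is a fragment boundary
  for (const auto& t : ts) {
    bounds.insert(t.start);
    bounds.insert(t.end);
  }
  std::vector<Tombstone> out;
  for (auto it = bounds.begin(); it != bounds.end(); ++it) {
    auto next = std::next(it);
    if (next == bounds.end()) break;
    uint64_t max_seq = 0;
    bool covered = false;
    for (const auto& t : ts) {  // which tombstones cover [*it, *next)?
      if (t.start <= *it && *next <= t.end) {
        covered = true;
        max_seq = std::max<uint64_t>(max_seq, t.seq);
      }
    }
    if (covered) out.push_back({*it, *next, max_seq});
  }
  return out;
}
```

For example, tombstones `["a","d")@5` and `["b","e")@7` fragment into `["a","b")@5`, `["b","d")@7`, and `["d","e")@7`; a point lookup then only needs to find the one fragment containing its key.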
Yes, I would be interested in seeing this. Read path caching seems like it would be the easiest to implement in the short term. I have to look more closely at the write path idea.
@abhimadan @ajkr, I have implemented read "caching" for the fragmented tombstones as a preliminary step towards closing this issue. Can you link me to the detailed description of the proposal for a special range deletion memtable? It sounds very interesting. Thanks,
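The read-path caching described above can be sketched roughly as follows (a hypothetical class, not RocksDB's actual implementation: integer ranges stand in for keys, and "fragmentation" is simplified to sorting and coalescing overlaps; the point is the lazy rebuild plus invalidation-on-write pattern):

```cpp
#include <algorithm>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical read-path cache: the first read after a DeleteRange pays the
// fragmentation cost and stores the result; subsequent reads reuse it until
// the next DeleteRange invalidates the cache.
class FragmentedTombstoneCache {
 public:
  void AddTombstone(int start, int end) {  // write path
    std::lock_guard<std::mutex> l(mu_);
    raw_.emplace_back(start, end);
    valid_ = false;  // a new range deletion invalidates the cached fragments
  }

  std::vector<std::pair<int, int>> GetFragments() {  // read path
    std::lock_guard<std::mutex> l(mu_);
    if (!valid_) {  // rebuild only after a write invalidated the cache
      cached_ = Coalesce(raw_);
      valid_ = true;
    }
    return cached_;
  }

 private:
  // Simplified stand-in for fragmentation: sort and merge overlapping ranges.
  static std::vector<std::pair<int, int>> Coalesce(
      std::vector<std::pair<int, int>> v) {
    std::sort(v.begin(), v.end());
    std::vector<std::pair<int, int>> out;
    for (const auto& r : v) {
      if (!out.empty() && r.first <= out.back().second) {
        out.back().second = std::max(out.back().second, r.second);
      } else {
        out.push_back(r);
      }
    }
    return out;
  }

  std::mutex mu_;
  std::vector<std::pair<int, int>> raw_;
  std::vector<std::pair<int, int>> cached_;
  bool valid_ = false;
};
```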
Hi Jared, sorry about the delay. Here's the doc I was talking about: (just the "Range Tombstone Data Format" section). |
Thanks @abhimadan :) ! |
Summary:
- Right now each read fragments the memtable range tombstones (#4808). This PR explores fragmenting memtable range tombstones on the write path, so reads can use the cached fragmented tombstones without paying any fragmenting cost. This PR only does the caching for immutable memtables, and does so right before a memtable is added to the immutable memtable list. The fragmentation is done without holding the mutex to minimize its performance impact.
- db_bench is updated to print out the number of range deletions executed, if there are any.

Pull Request resolved: #10380

Test Plan:
- CI; added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
- Benchmark: as this PR only optimizes the immutable memtable path, the number of writes in the benchmark is chosen such that an immutable memtable is created and range tombstones are in that memtable.

```
# single thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100
# multi-thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
```

Commit 99cdf16 is included in the benchmark results. It was an earlier attempt in which tombstones were fragmented on each write operation; reader threads shared the result through a shared_ptr, which slowed multi-threaded read performance, as seen in the benchmark results. Results are averaged over 5 runs.
Single thread result:

| Max # tombstones | main fillrandom micros/op | 99cdf16 | Post PR | main readrandom micros/op | 99cdf16 | Post PR |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 6.68 | 6.57 | 6.72 | 4.72 | 4.79 | 4.54 |
| 1 | 6.67 | 6.58 | 6.62 | 5.41 | 4.74 | 4.72 |
| 10 | 6.59 | 6.5 | 6.56 | 7.83 | 4.69 | 4.59 |
| 100 | 6.62 | 6.75 | 6.58 | 29.57 | 5.04 | 5.09 |
| 1000 | 6.54 | 6.82 | 6.61 | 320.33 | 5.22 | 5.21 |

32-thread result (note that "Max # tombstones" is per thread):

| Max # tombstones | main fillrandom micros/op | 99cdf16 | Post PR | main readrandom micros/op | 99cdf16 | Post PR |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 234.52 | 260.25 | 239.42 | 5.06 | 5.38 | 5.09 |
| 1 | 236.46 | 262.0 | 231.1 | 19.57 | 22.14 | 5.45 |
| 10 | 236.95 | 263.84 | 251.49 | 151.73 | 21.61 | 5.73 |
| 100 | 268.16 | 296.8 | 280.13 | 2308.52 | 22.27 | 6.57 |

Reviewed By: ajkr
Differential Revision: D37916564
Pulled By: cbi42
fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
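The approach the PR summary describes (and why multi-threaded reads improved) can be sketched as follows, under stated assumptions: the names are illustrative rather than RocksDB's, and the real fragmentation pass is replaced by a sort. The key idea is building the fragmented list once, at the moment the memtable becomes immutable, so readers share a read-only structure instead of each paying a fragmentation (or shared_ptr contention) cost per read:

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical fragment: a non-overlapping [start, end) range.
struct Fragment {
  int start, end;
};

class ImmutableMemtable {
 public:
  explicit ImmutableMemtable(std::vector<Fragment> raw)
      // Built exactly once, before the memtable is published to the
      // immutable list, and (per the PR) without holding the DB mutex.
      : fragments_(std::make_shared<const std::vector<Fragment>>(
            BuildFragmentedList(std::move(raw)))) {}

  // Readers just grab a reference to the already-fragmented, read-only list;
  // no per-read fragmentation work is needed.
  std::shared_ptr<const std::vector<Fragment>> RangeTombstoneList() const {
    return fragments_;
  }

 private:
  static std::vector<Fragment> BuildFragmentedList(std::vector<Fragment> raw) {
    // Stand-in for the real fragmentation pass; here we just sort by start.
    std::sort(raw.begin(), raw.end(),
              [](const Fragment& a, const Fragment& b) {
                return a.start < b.start;
              });
    return raw;
  }

  std::shared_ptr<const std::vector<Fragment>> fragments_;
};
```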
Each time `MemTable::NewRangeTombstoneIterator` is called, the tombstones in the dedicated range deletion memtable are fragmented. Range deletions are probably issued as frequently as other forms of writes, so it should be possible to do some more work to avoid this in the average case. The fragmented tombstones can be cached, though I think doing more work on the write path is preferable, since we currently do the bare minimum of just writing a single entry to the memtable.