Avoid re-fragmenting memtable range tombstones on each read #4808
Do I understand this correctly? Fragmented tombstones should be computed during writes and cached. Reads would then reference the cache instead of recomputing.
@abhimadan Can you elaborate? Is this different from what I have proposed above?
What I meant by caching the fragmented tombstones was that, during the first read after a range deletion, the fragmented memtable tombstones would be stored and used for subsequent reads, until the next range deletion invalidates the cache. So the "caching" happens on the read path.

On the write path, I was thinking of using a data structure that would fragment a tombstone appropriately after it is written. So instead of doing the fragmentation in a separate pass and caching the result, we could go even further and fragment the tombstones in place, which I expect to be cheaper, as it wouldn't require re-creating a new data structure after every range deletion. The tricky part is that this would modify entries already in the memtable, which the skiplist memtable forbids. @ajkr had an idea to make a special range deletion memtable that would give each range tombstone fragment entry two sequence numbers, so that tombstones inserted into the memtable could represent either new tombstones or fragments of existing tombstones (I think we may have a more detailed description of this somewhere, if you're interested).

TL;DR: I was proposing either caching on the read path or an in-place fragmented data structure on the write path.
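For intuition about what "fragmenting" means here, a minimal sketch under simplifying assumptions (hypothetical `Tombstone` struct and `Fragment` function, string keys, and only the newest sequence number kept per fragment; the real fragmented list keeps all covering sequence numbers so reads at older snapshots still work):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical simplified model: a range tombstone deletes keys in
// [start, end) and carries the sequence number of its DeleteRange.
struct Tombstone {
  std::string start, end;
  uint64_t seq;
};

// Split possibly-overlapping tombstones into non-overlapping fragments.
// For brevity, each fragment keeps only the newest covering seqno.
std::vector<Tombstone> Fragment(const std::vector<Tombstone>& ts) {
  std::set<std::string> bounds;  // every start/end key is a fragment boundary
  for (const auto& t : ts) {
    bounds.insert(t.start);
    bounds.insert(t.end);
  }
  std::vector<Tombstone> out;
  for (auto it = bounds.begin(); it != bounds.end(); ++it) {
    auto next = std::next(it);
    if (next == bounds.end()) break;
    uint64_t max_seq = 0;
    bool covered = false;
    for (const auto& t : ts) {  // which tombstones cover [*it, *next)?
      if (t.start <= *it && *next <= t.end) {
        covered = true;
        max_seq = std::max<uint64_t>(max_seq, t.seq);
      }
    }
    if (covered) out.push_back({*it, *next, max_seq});
  }
  return out;
}
```

For example, tombstones `["a","d")@5` and `["b","e")@7` fragment into `["a","b")@5`, `["b","d")@7`, and `["d","e")@7`; a point lookup then only needs to find the one fragment containing its key.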
Yes, I would be interested in seeing this. Read path caching seems like it would be the easiest to implement in the short term. I have to look more closely at the write path idea.
@abhimadan @ajkr, I have implemented read "caching" for the fragmented tombstones as a preliminary step towards closing this issue. Can you link me to the detailed description of the proposal for a special range deletion memtable? It sounds very interesting. Thanks,
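The read-path caching described above can be sketched roughly as follows (a hypothetical class, not RocksDB's actual implementation: integer ranges stand in for keys, and "fragmentation" is simplified to sorting and coalescing overlaps; the point is the lazy rebuild plus invalidation-on-write pattern):

```cpp
#include <algorithm>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical read-path cache: the first read after a DeleteRange pays the
// fragmentation cost and stores the result; subsequent reads reuse it until
// the next DeleteRange invalidates the cache.
class FragmentedTombstoneCache {
 public:
  void AddTombstone(int start, int end) {  // write path
    std::lock_guard<std::mutex> l(mu_);
    raw_.emplace_back(start, end);
    valid_ = false;  // a new range deletion invalidates the cached fragments
  }

  std::vector<std::pair<int, int>> GetFragments() {  // read path
    std::lock_guard<std::mutex> l(mu_);
    if (!valid_) {  // rebuild only after a write invalidated the cache
      cached_ = Coalesce(raw_);
      valid_ = true;
    }
    return cached_;
  }

 private:
  // Simplified stand-in for fragmentation: sort and merge overlapping ranges.
  static std::vector<std::pair<int, int>> Coalesce(
      std::vector<std::pair<int, int>> v) {
    std::sort(v.begin(), v.end());
    std::vector<std::pair<int, int>> out;
    for (const auto& r : v) {
      if (!out.empty() && r.first <= out.back().second) {
        out.back().second = std::max(out.back().second, r.second);
      } else {
        out.push_back(r);
      }
    }
    return out;
  }

  std::mutex mu_;
  std::vector<std::pair<int, int>> raw_;
  std::vector<std::pair<int, int>> cached_;
  bool valid_ = false;
};
```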
Hi Jared, sorry about the delay. Here's the doc I was talking about: (just the "Range Tombstone Data Format" section). |
Thanks @abhimadan :) ! |
Summary:
- Right now each read fragments the memtable range tombstones (#4808). This PR explores fragmenting memtable range tombstones on the write path, so reads can use the cached fragmented tombstones without paying any fragmenting cost. This PR only does the caching for immutable memtables, and does so right before a memtable is added to the immutable memtable list. The fragmentation is done without holding the mutex to minimize its performance impact.
- db_bench is updated to print out the number of range deletions executed, if there are any.

Pull Request resolved: #10380

Test Plan:
- CI; added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
- Benchmark: as this PR only optimizes the immutable memtable path, the number of writes in the benchmark is chosen such that an immutable memtable is created and range tombstones are in that memtable.

```
# single thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100
# multi-thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
```

Commit 99cdf16 is included in the benchmark results. It was an earlier attempt in which tombstones were fragmented on each write operation; reader threads shared the result through a shared_ptr, which slowed multi-threaded read performance, as seen in the benchmark results. Results are averaged over 5 runs.
Single thread result:

| Max # tombstones | main fillrandom micros/op | 99cdf16 | Post PR | main readrandom micros/op | 99cdf16 | Post PR |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 6.68 | 6.57 | 6.72 | 4.72 | 4.79 | 4.54 |
| 1 | 6.67 | 6.58 | 6.62 | 5.41 | 4.74 | 4.72 |
| 10 | 6.59 | 6.5 | 6.56 | 7.83 | 4.69 | 4.59 |
| 100 | 6.62 | 6.75 | 6.58 | 29.57 | 5.04 | 5.09 |
| 1000 | 6.54 | 6.82 | 6.61 | 320.33 | 5.22 | 5.21 |

32-thread result (note that "Max # tombstones" is per thread):

| Max # tombstones | main fillrandom micros/op | 99cdf16 | Post PR | main readrandom micros/op | 99cdf16 | Post PR |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 234.52 | 260.25 | 239.42 | 5.06 | 5.38 | 5.09 |
| 1 | 236.46 | 262.0 | 231.1 | 19.57 | 22.14 | 5.45 |
| 10 | 236.95 | 263.84 | 251.49 | 151.73 | 21.61 | 5.73 |
| 100 | 268.16 | 296.8 | 280.13 | 2308.52 | 22.27 | 6.57 |

Reviewed By: ajkr
Differential Revision: D37916564
Pulled By: cbi42
fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
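The approach the PR summary describes (and why multi-threaded reads improved) can be sketched as follows, under stated assumptions: the names are illustrative rather than RocksDB's, and the real fragmentation pass is replaced by a sort. The key idea is building the fragmented list once, at the moment the memtable becomes immutable, so readers share a read-only structure instead of each paying a fragmentation (or shared_ptr contention) cost per read:

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical fragment: a non-overlapping [start, end) range.
struct Fragment {
  int start, end;
};

class ImmutableMemtable {
 public:
  explicit ImmutableMemtable(std::vector<Fragment> raw)
      // Built exactly once, before the memtable is published to the
      // immutable list, and (per the PR) without holding the DB mutex.
      : fragments_(std::make_shared<const std::vector<Fragment>>(
            BuildFragmentedList(std::move(raw)))) {}

  // Readers just grab a reference to the already-fragmented, read-only list;
  // no per-read fragmentation work is needed.
  std::shared_ptr<const std::vector<Fragment>> RangeTombstoneList() const {
    return fragments_;
  }

 private:
  static std::vector<Fragment> BuildFragmentedList(std::vector<Fragment> raw) {
    // Stand-in for the real fragmentation pass; here we just sort by start.
    std::sort(raw.begin(), raw.end(),
              [](const Fragment& a, const Fragment& b) {
                return a.start < b.start;
              });
    return raw;
  }

  std::shared_ptr<const std::vector<Fragment>> fragments_;
};
```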
Each time `MemTable::NewRangeTombstoneIterator` is called, the tombstones in the dedicated range deletion memtable are fragmented. Range deletions are probably issued as frequently as other forms of writes, so it should be possible to do some more work to avoid this in the average case. The fragmented tombstones can be cached, though I think doing more work on the write path is preferable, since we currently do the bare minimum of just writing a single entry to the memtable.