
Avoid re-fragmenting memtable range tombstones on each read #4808

Open
abhimadan opened this issue Dec 20, 2018 · 6 comments
Labels
up-for-grabs Up for grabs

Comments

@abhimadan
Contributor

Each time MemTable::NewRangeTombstoneIterator is called, the tombstones in the dedicated range deletion memtable are fragmented. Range deletions are probably issued as frequently as other forms of writes, so it should be possible to do some more work to avoid this in the average case. The fragmented tombstones can be cached, though I think doing more work on the write path is preferable, since we currently do the bare minimum of just writing a single entry to the memtable.
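For context, "fragmentation" turns a set of possibly overlapping range tombstones into disjoint fragments, each annotated with the sequence numbers of the tombstones covering it. A minimal self-contained sketch of that pass (illustrative types and names, not RocksDB's actual `FragmentedRangeTombstoneList` implementation):

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Illustrative stand-ins, not RocksDB's internal types.
struct Tombstone { std::string start, end; uint64_t seq; };  // deletes [start, end)
struct Fragment  { std::string start, end; std::vector<uint64_t> seqs; };

// Split possibly-overlapping tombstones into disjoint fragments, each
// carrying the sequence numbers of every tombstone that covers it.
std::vector<Fragment> FragmentTombstones(const std::vector<Tombstone>& ts) {
  std::set<std::string> bounds;
  for (const auto& t : ts) { bounds.insert(t.start); bounds.insert(t.end); }
  std::vector<Fragment> out;
  if (bounds.empty()) return out;
  for (auto it = bounds.begin(), next = std::next(it); next != bounds.end();
       ++it, ++next) {
    Fragment f{*it, *next, {}};
    for (const auto& t : ts) {
      if (t.start <= f.start && f.end <= t.end) f.seqs.push_back(t.seq);
    }
    if (!f.seqs.empty()) {
      std::sort(f.seqs.rbegin(), f.seqs.rend());  // newest first
      out.push_back(std::move(f));
    }
  }
  return out;
}
```

For example, `[a,e)@10` and `[c,g)@5` fragment into `[a,c)@{10}`, `[c,e)@{10,5}`, and `[e,g)@{5}`. The complaint in this issue is that a pass like this runs on every call to `NewRangeTombstoneIterator` rather than once per batch of range deletions.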

@abhimadan abhimadan added the up-for-grabs Up for grabs label Dec 20, 2018
@jwasinger

Do I understand this correctly? Fragmented tombstones should be computed during writes and cached; reads would then reference the cache instead of recomputing.

> more work on the write path is preferable

@abhimadan Can you elaborate? Is this different than what I have proposed above?

@abhimadan
Contributor Author

What I meant by caching the fragmented tombstones was that, during the first read after the range deletion, the fragmented memtable tombstones would be stored and used for subsequent reads until the next range deletion invalidates the cache. So basically the "caching" happens on the read path.
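That read-path caching idea might look like the following sketch (illustrative names, not RocksDB's API; the actual fragmentation pass is replaced by a copy so the invalidation logic stands out):

```cpp
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch of read-path caching (not RocksDB's API): the
// fragmented list is rebuilt lazily on the first read after a range
// deletion and reused until the next range deletion invalidates it.
using Range = std::pair<std::string, std::string>;

class CachedTombstoneView {
 public:
  void AddRangeDeletion(Range r) {
    raw_.push_back(std::move(r));
    cache_valid_ = false;        // a new range deletion invalidates the cache
  }
  const std::vector<Range>& Fragments() {
    if (!cache_valid_) {         // first read after a write: (re)fragment
      cache_ = raw_;             // stand-in for the real fragmentation pass
      ++rebuilds_;
      cache_valid_ = true;
    }
    return cache_;               // subsequent reads hit the cache
  }
  int rebuilds() const { return rebuilds_; }

 private:
  std::vector<Range> raw_, cache_;
  bool cache_valid_ = false;
  int rebuilds_ = 0;
};
```

Under this scheme a read-heavy workload pays the fragmentation cost once per burst of range deletions instead of once per read.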

On the write path, I was thinking of using a data structure that would fragment a tombstone appropriately as it is written. So instead of doing the fragmentation in a separate pass and caching the result, we could go even further and fragment the tombstones in-place, which I expect to be cheaper, as it wouldn't require re-creating a new data structure on every range deletion. The tricky part is that this would modify entries already in the memtable, which the skiplist memtable forbids. @ajkr had an idea for a special range deletion memtable that would give each range tombstone fragment entry two sequence numbers, allowing tombstones inserted into the memtable to represent either new tombstones or fragments of existing tombstones (I think we may have a more detailed description of this somewhere if you're interested).
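One way to picture the in-place approach is the hypothetical sketch below, using a plain `std::map` as a stand-in for the skiplist memtable (this is not the special two-sequence-number memtable design, just the "fragment on write" idea): each new tombstone splits the fragments it partially overlaps, so the fragment set is always query-ready and reads never refragment.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of keeping tombstone fragments disjoint as they are
// written; std::map stands in for the skiplist. Not RocksDB's design.
class InPlaceFragments {
 public:
  // Record tombstone [start, end) at sequence number seq, splitting any
  // existing fragment it partially overlaps so fragments stay disjoint.
  void Insert(const std::string& start, const std::string& end, uint64_t seq) {
    SplitAt(start);
    SplitAt(end);
    std::string cur = start;
    auto it = frags_.lower_bound(start);
    while (cur < end) {
      if (it != frags_.end() && it->first == cur) {
        it->second.seqs.push_back(seq);  // existing fragment: annotate it
      } else {                           // gap: covered only by the new tombstone
        std::string gap_end =
            (it != frags_.end() && it->first < end) ? it->first : end;
        it = frags_.emplace(cur, Frag{gap_end, {seq}}).first;
      }
      cur = it->second.end;
      ++it;
    }
  }
  // Sequence numbers of tombstones covering `key` (order not significant here).
  std::vector<uint64_t> SeqsAt(const std::string& key) const {
    auto it = frags_.upper_bound(key);
    if (it == frags_.begin()) return {};
    --it;
    return key < it->second.end ? it->second.seqs : std::vector<uint64_t>{};
  }
  size_t FragmentCount() const { return frags_.size(); }

 private:
  struct Frag { std::string end; std::vector<uint64_t> seqs; };
  // If a fragment straddles `key`, split it at `key`.
  void SplitAt(const std::string& key) {
    auto it = frags_.lower_bound(key);
    if (it == frags_.begin()) return;
    --it;
    if (key < it->second.end) {
      Frag right{it->second.end, it->second.seqs};
      it->second.end = key;
      frags_.emplace(key, std::move(right));
    }
  }
  std::map<std::string, Frag> frags_;  // fragment start -> (end, seqs)
};
```

The split-on-insert step is exactly the mutation of existing entries that a skiplist memtable disallows, which is what motivates the special memtable idea above.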

TL;DR: I was proposing either caching on the read path or an in-place fragmented data structure on the write path.

@jwasinger

> (I think we may have a more detailed description of this somewhere if you're interested)

Yes, I would be interested in seeing this.

Read path caching seems like it would be the easiest to implement in the short term. I have to look more closely at the write path idea.

@jwasinger

@abhimadan @ajkr , I have implemented read "caching" for the fragmented tombstones as a preliminary step towards closing this issue. Can you link me to the detailed description of the proposal for a special range deletion memtable? It sounds very interesting.

Thanks,
Jared

@abhimadan
Contributor Author

Hi Jared, sorry about the delay. Here's the doc I was talking about: (just the "Range Tombstone Data Format" section).

@jwasinger

Thanks @abhimadan :) !

facebook-github-bot pushed a commit that referenced this issue Aug 5, 2022
Summary:
- Right now, each read re-fragments the memtable range tombstones (#4808). This PR explores fragmenting memtable range tombstones on the write path, so that reads can use the cached fragmented tombstones without any fragmenting cost. This PR only does the caching for immutable memtables, right before a memtable is added to the immutable memtable list. The fragmentation is done without holding the mutex to minimize its performance impact.
- db_bench is updated to print out the number of range deletions executed if there is any.

Pull Request resolved: #10380

Test Plan:
- CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
- Benchmark: as this PR only optimizes the immutable memtable path, the number of writes in the benchmark is chosen such that an immutable memtable is created and range tombstones land in that memtable.

```
single thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100

multi-thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
```
Commit 99cdf16 is included in the benchmark results. It was an earlier attempt where tombstones were fragmented on each write operation; reader threads share the result through a shared_ptr, which slows down multi-threaded read performance, as seen in the benchmark results.
Results are averaged over 5 runs.

Single thread result:
| Max # tombstones  | main fillrandom micros/op | 99cdf16 | Post PR | main readrandom micros/op |  99cdf16 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0    |6.68     |6.57     |6.72     |4.72     |4.79     |4.54     |
| 1    |6.67     |6.58     |6.62     |5.41     |4.74     |4.72     |
| 10   |6.59     |6.5      |6.56     |7.83     |4.69     |4.59     |
| 100  |6.62     |6.75     |6.58     |29.57    |5.04     |5.09     |
| 1000 |6.54     |6.82     |6.61     |320.33   |5.22     |5.21     |

32-thread result: note that "Max # tombstones" is per thread.
| Max # tombstones  | main fillrandom micros/op | 99cdf16 | Post PR | main readrandom micros/op |  99cdf16 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0    |234.52   |260.25   |239.42   |5.06     |5.38     |5.09     |
| 1    |236.46   |262.0    |231.1    |19.57    |22.14    |5.45     |
| 10   |236.95   |263.84   |251.49   |151.73   |21.61    |5.73     |
| 100  |268.16   |296.8    |280.13   |2308.52  |22.27    |6.57     |

Reviewed By: ajkr

Differential Revision: D37916564

Pulled By: cbi42

fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e