
Delete duplicated key/value pairs recursively #2829

Open · wants to merge 1 commit into main

Conversation

@lixiaoy1 commented Sep 4, 2017

This implements the idea described at http://pad.ceph.com/p/rocksdb-wal-improvement.
It adds a new flush style, kFlushStyleDedup, which users can enable by setting
flush_style=kFlushStyleDedup. When a flush is triggered, it dedups the key/value
pairs in the oldest memtable against the newer memtables before flushing the
oldest memtable into L0.

This flush style benefits workloads whose data is duplicated across memtables;
for such data it can substantially decrease the amount of data flushed into L0.

Signed-off-by: Xiaoyan Li <xiaoyan.li@intel.com>
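As a minimal standalone sketch of the dedup idea (not the PR's actual implementation; memtables are simplified here to ordered maps, whereas real memtables are skiplists carrying sequence numbers):

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified memtable: key -> value. Deletions could be modeled as a
// sentinel value; real memtables also carry sequence numbers.
using MemTable = std::map<std::string, std::string>;

// Returns the entries of the oldest memtable that are not superseded by
// any newer memtable; only these need to be written to the L0 file.
MemTable DedupOldest(const MemTable& oldest,
                     const std::vector<MemTable>& newer) {
  MemTable survivors;
  for (const auto& [key, value] : oldest) {
    bool superseded = false;
    for (const auto& mt : newer) {
      if (mt.count(key) > 0) {  // a newer version or deletion exists
        superseded = true;
        break;
      }
    }
    if (!superseded) {
      survivors.emplace(key, value);
    }
  }
  return survivors;
}
```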

@facebook-github-bot

@lixiaoy1 updated the pull request.

@lixiaoy1 commented Sep 5, 2017

retest please.

@siying requested a review from ajkr on September 11, 2017.
@ajkr commented Sep 13, 2017

I didn't understand why we're adding write_buffer_number_to_flush; it seems to create the very problem this PR intends to solve. Restricting the number of immutable memtables in a flush job makes it likely that older versions of a key are flushed while newer versions are not yet flushed. If we leave it unrestricted, we also get the benefits of larger L0 files: they accommodate bigger write bursts and lower write amplification.

@ajkr commented Sep 13, 2017

Also for the benchmark results, do you mind sharing the full options used for before and after? You can find them either in a file whose name begins with "OPTIONS" in the db directory or near the top of the info log.

Also, did you use upstream rocksdb as the baseline (i.e., without limiting how many memtables a flush can contain)? Thanks!

@lixiaoy1 commented Sep 14, 2017

@ajkr Thank you for your comments.
Sorry, the name write_buffer_number_to_flush may be confusing.
When there are N immutable memtables to flush and write_buffer_number_to_flush is set to M (M < N), this PR first merges the M oldest tables (as in the master branch) and then compares the merged data against the remaining (N - M) tables. If a key/value pair is still valid (not deleted or updated in those N - M tables), it is flushed into L0 SST files. Flushing then waits until there are again N immutable memtables and repeats these steps, as in the sketch below.
This decreases the data flushed into L0 but increases the number of files in L0.
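A rough sketch of that flow, reusing the MemTable alias and DedupOldest helper from the earlier sketch (imm is assumed to be ordered oldest-first; the names are hypothetical, not the PR's):

```cpp
#include <algorithm>

// Merge the m oldest immutable memtables, then keep only the merged
// entries that are not updated/deleted in the remaining N-m tables.
MemTable FlushWithDedup(const std::vector<MemTable>& imm, size_t m) {
  m = std::min(m, imm.size());
  MemTable merged;
  for (size_t i = 0; i < m; ++i) {
    for (const auto& [key, value] : imm[i]) {
      merged[key] = value;  // imm[i] is newer than imm[i-1], so it wins
    }
  }
  std::vector<MemTable> newer(imm.begin() + m, imm.end());
  return DedupOldest(merged, newer);  // the result goes into an L0 file
}
```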

I used this branch as baseline: https://github.com/ceph/rocksdb/tree/e15382c09c87a65eaeca9bda233bab503f1e5772

For the test obj40960.xlsx:
https://drive.google.com/drive/folders/0B6jqFc7e2yxVdUQ2aEpCR3ItbG8

There are 4 scenarios in the tests: normal_merge* and dup*. The normal_merge* scenarios were tested on the baseline branch e15382c above, and the dup* scenarios were tested with this PR.

(The test environment no longer exists, but I recorded the options.)
The common changed options are:
compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=2,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,disableWAL=true,stats_dump_period_sec=600,

And with following different options:
normal_merge2:
min_write_buffer_number_to_merge=2

dup2:
max_background_flushes=1, level0_file_num_compaction_trigger=8

normal_merge3:
min_write_buffer_number_to_merge=3

dup3:
max_background_flushes=1, min_write_buffer_number_to_merge=3, level0_file_num_compaction_trigger=8

Note: I set write_buffer_number_to_flush to 1 in the dup* scenarios, as shown in the sketch after this list.

The default level0_file_num_compaction_trigger is 4. I changed it to 8 in the dup* scenarios because they generate far fewer L0 files than normal_merge*.
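For reference, these option strings can be applied programmatically with rocksdb::GetOptionsFromString. The sketch below builds the dup2 scenario and assumes this PR is applied (flush_style is not an upstream option); disableWAL is a per-write WriteOptions flag rather than an options-string entry, so it is set separately:

```cpp
#include <cassert>
#include "rocksdb/convenience.h"
#include "rocksdb/options.h"

rocksdb::Options MakeDup2Options() {
  rocksdb::Options base, opts;
  rocksdb::Status s = rocksdb::GetOptionsFromString(
      base,
      "compression=kNoCompression;max_write_buffer_number=4;"
      "min_write_buffer_number_to_merge=2;recycle_log_file_num=4;"
      "write_buffer_size=268435456;writable_file_max_buffer_size=0;"
      "stats_dump_period_sec=600;"
      "max_background_flushes=1;level0_file_num_compaction_trigger=8;"
      "flush_style=kFlushStyleDedup",  // only valid with this PR applied
      &opts);
  assert(s.ok());
  return opts;
}
// disableWAL is set per write: rocksdb::WriteOptions wo; wo.disableWAL = true;
```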

[Charts: results for dup3 and normal_merge3]

@ajkr commented Sep 19, 2017

Thanks, @lixiaoy1, I understand the use case better now.

@ajkr left a comment

Can we always set write_buffer_number_to_flush to one when kFlushStyleDedup is enabled? We want to minimize the number of options introduced.

@ajkr commented Sep 25, 2017

Also, btw, we plan to extend this feature to repeatedly compact the oldest two immutable memtables into one larger immutable memtable. We'll flush the compacted memtable into an L0 file only once it exceeds some size (maybe just write_buffer_size). The point is to get the same benefits without creating smaller L0 files, which generally have caused problems like write stalling. Let us know if you have any thoughts on this :).
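For concreteness, a hedged sketch of that scheme, reusing the simplified MemTable type from the sketches above; ApproximateSize and the commented-out flush call are hypothetical placeholders:

```cpp
#include <deque>

// Rough byte estimate for a simplified memtable.
size_t ApproximateSize(const MemTable& mt) {
  size_t bytes = 0;
  for (const auto& [key, value] : mt) bytes += key.size() + value.size();
  return bytes;
}

// Repeatedly compact the two oldest immutable memtables into one; flush
// to L0 only once the compacted table exceeds the size threshold.
void CompactImmutables(std::deque<MemTable>& imm, size_t write_buffer_size) {
  while (imm.size() >= 2) {
    MemTable merged = std::move(imm[0]);
    for (const auto& [key, value] : imm[1]) {
      merged[key] = value;  // imm[1] is newer, so its entries win
    }
    imm.pop_front();                  // drop the old imm[0]
    imm.front() = std::move(merged);  // replace old imm[1] with the merge
    if (ApproximateSize(imm.front()) > write_buffer_size) {
      // WriteLevel0File(imm.front()); imm.pop_front();  // hypothetical flush
      break;
    }
  }
}
```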

@lixiaoy1

@ajkr Good idea to repeatedly compact the oldest immutable memtables! It seems the repeated compaction would work well with the current merge style in the master branch rather than with this PR. In the master branch, when merging two, three, or more immutable memtables, the merged result could be kept in memory instead of being flushed into L0; a flush would then be triggered when the merged memtable exceeds its size limit, the number of logs exceeds its limit, or db_write_buffer exceeds its limit.

@facebook-github-bot

@lixiaoy1 has updated the pull request.

@facebook-github-bot

@lixiaoy1 has updated the pull request.

@facebook-github-bot

@lixiaoy1 has updated the pull request.

@lixiaoy1

@ajkr The option write_buffer_number_to_flush has been removed.

@lixiaoy1 changed the title from "[WIP] Delete duplicated key/value pairs recursively" to "Delete duplicated key/value pairs recursively" on Sep 29, 2017.
@lixiaoy1 commented Oct 10, 2017

I also ran the following test:

  1. Generated the KV pair sequences produced by 30 minutes of 4k IO with Ceph/BlueStore.
  2. Created a new RocksDB database.
  3. Injected the above KV pairs into the DB one by one.
  4. Compared the sizes of all L0 SST files.

I ran steps 3 and 4 with the following settings, once with flush_style=kFlushStyleDedup and once with flush_style=kFlushStyleMerge:
compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=2,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
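A minimal sketch of steps 2 and 3, assuming a hypothetical ReadTrace() helper that returns the recorded BlueStore KV sequence and assuming the WAL was disabled as in the earlier tests:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>
#include "rocksdb/db.h"

// Hypothetical: yields the KV pairs recorded from Ceph/BlueStore.
std::vector<std::pair<std::string, std::string>> ReadTrace();

void ReplayTrace(const rocksdb::Options& opts, const std::string& path) {
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, path, &db);
  assert(s.ok());
  rocksdb::WriteOptions wo;
  wo.disableWAL = true;  // assumption: WAL disabled, as in the earlier tests
  for (const auto& [key, value] : ReadTrace()) {
    s = db->Put(wo, key, value);
    assert(s.ok());
  }
  // Step 4: compare L0 SST sizes, e.g. from the files under `path` or
  // via db->GetProperty("rocksdb.levelstats", &stats) before closing.
  delete db;
}
```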

The total size of data written into L0 SST files with flush_style=kFlushStyleMerge was 46653 MB, while with flush_style=kFlushStyleDedup it was 36313 MB.

This PR decreases the data written into SST files, which can improve performance when the disk is busy.

@lixiaoy1

@ajkr Any further questions about the PR?

@facebook-github-bot

@lixiaoy1 has updated the pull request.

@lixiaoy1

This update revises the range_del parts.

@facebook-github-bot

@lixiaoy1 has updated the pull request.

@lixiaoy1

I got this message from the AppVeyor build: "Build execution time has reached the maximum allowed time for your plan (60 minutes)."
Please retest.

@lixiaoy1

retest please.

@facebook-github-bot

@lixiaoy1 has updated the pull request.

@ajkr commented Jul 6, 2023

We revisited this internally, as it is still an interesting idea for reducing flush bytes. One thing we realized this time that we didn't notice previously: the approach assumes that the newer memtable data used to deduplicate during flush will be recoverable after a crash. If an older version of a key is deduplicated away but the newer version is lost in a crash, recovery will have a hole at the seqno of the older version. The newer version could be lost simply because WriteOptions::disableWAL was used, or for a more complicated reason, such as the host crashing before the newer key version was fsynced.
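For a concrete illustration with hypothetical seqnos: suppose Put("k", "v1") lands at seqno 10 in the oldest memtable and Put("k", "v2") lands at seqno 20 in a newer memtable, with the WAL disabled. A dedup flush drops seqno 10 because seqno 20 supersedes it, so the L0 file contains neither version; if the process then crashes before the newer memtable is flushed, seqno 20 is also lost, and recovery sees no value for "k" even though seqno 10 would have been durably flushed under the non-dedup scheme.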
