-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce heap operations for range tombstone keys in iterator #10877
Conversation
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
1489267
to
4af7be3
Compare
@cbi42 has updated the pull request. You must reimport the pull request before landing. |
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@cbi42 has updated the pull request. You must reimport the pull request before landing. |
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@cbi42 has updated the pull request. You must reimport the pull request before landing. |
@cbi42 has updated the pull request. You must reimport the pull request before landing. |
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
…#10877) Summary: Right now in MergingIterator, for each range tombstone start and end key, we pop one end from heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely cases when a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them are dropped already. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases. Pull Request resolved: facebook#10877 Test Plan: - existing UT - ran all flavors of stress test through sandcastle - benchmark: ``` TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64 Level Files Size(MB) -------------------- 0 8 152 1 0 0 2 0 0 3 0 0 4 0 0 5 0 0 6 0 0 TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true readseq [AVG 5 runs] : 1432116 (± 59664) ops/sec; 224.0 (± 9.3) MB/sec readseq [MEDIAN 5 runs] : 1454886 ops/sec; 227.5 MB/sec readseq [AVG 5 runs] : 1944425 (± 29521) ops/sec; 304.1 (± 4.6) MB/sec readseq [MEDIAN 5 runs] : 1959430 ops/sec; 306.5 MB/sec ``` Reviewed By: ajkr Differential Revision: D40710936 Pulled By: cbi42 fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
Summary: Right now in MergingIterator, for each range tombstone start and end key, we pop one end from heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely cases when a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them are dropped already. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases. Pull Request resolved: #10877 Test Plan: - existing UT - ran all flavors of stress test through sandcastle - benchmark: ``` TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64 Level Files Size(MB) -------------------- 0 8 152 1 0 0 2 0 0 3 0 0 4 0 0 5 0 0 6 0 0 TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true readseq [AVG 5 runs] : 1432116 (± 59664) ops/sec; 224.0 (± 9.3) MB/sec readseq [MEDIAN 5 runs] : 1454886 ops/sec; 227.5 MB/sec readseq [AVG 5 runs] : 1944425 (± 29521) ops/sec; 304.1 (± 4.6) MB/sec readseq [MEDIAN 5 runs] : 1959430 ops/sec; 306.5 MB/sec ``` Reviewed By: ajkr Differential Revision: D40710936 Pulled By: cbi42 fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
Summary: Right now in MergingIterator, for each range tombstone start and end key, we pop one end from heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely cases when a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them are dropped already. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases. Pull Request resolved: facebook/rocksdb#10877 Test Plan: - existing UT - ran all flavors of stress test through sandcastle - benchmark: ``` TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64 Level Files Size(MB) -------------------- 0 8 152 1 0 0 2 0 0 3 0 0 4 0 0 5 0 0 6 0 0 TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true readseq [AVG 5 runs] : 1432116 (± 59664) ops/sec; 224.0 (± 9.3) MB/sec readseq [MEDIAN 5 runs] : 1454886 ops/sec; 227.5 MB/sec readseq [AVG 5 runs] : 1944425 (± 29521) ops/sec; 304.1 (± 4.6) MB/sec readseq [MEDIAN 5 runs] : 1959430 ops/sec; 306.5 MB/sec ``` Reviewed By: ajkr Differential Revision: D40710936 Pulled By: cbi42 fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
Summary: Right now in MergingIterator, for each range tombstone start and end key, we pop one end from heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely cases when a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them are dropped already. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases. Pull Request resolved: facebook/rocksdb#10877 Test Plan: - existing UT - ran all flavors of stress test through sandcastle - benchmark: ``` TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64 Level Files Size(MB) -------------------- 0 8 152 1 0 0 2 0 0 3 0 0 4 0 0 5 0 0 6 0 0 TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true readseq [AVG 5 runs] : 1432116 (± 59664) ops/sec; 224.0 (± 9.3) MB/sec readseq [MEDIAN 5 runs] : 1454886 ops/sec; 227.5 MB/sec readseq [AVG 5 runs] : 1944425 (± 29521) ops/sec; 304.1 (± 4.6) MB/sec readseq [MEDIAN 5 runs] : 1959430 ops/sec; 306.5 MB/sec ``` Reviewed By: ajkr Differential Revision: D40710936 Pulled By: cbi42 fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
Summary: Right now in MergingIterator, for each range tombstone start and end key, we pop one end from heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely cases when a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them are dropped already. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases. Pull Request resolved: facebook/rocksdb#10877 Test Plan: - existing UT - ran all flavors of stress test through sandcastle - benchmark: ``` TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64 Level Files Size(MB) -------------------- 0 8 152 1 0 0 2 0 0 3 0 0 4 0 0 5 0 0 6 0 0 TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true readseq [AVG 5 runs] : 1432116 (± 59664) ops/sec; 224.0 (± 9.3) MB/sec readseq [MEDIAN 5 runs] : 1454886 ops/sec; 227.5 MB/sec readseq [AVG 5 runs] : 1944425 (± 29521) ops/sec; 304.1 (± 4.6) MB/sec readseq [MEDIAN 5 runs] : 1959430 ops/sec; 306.5 MB/sec ``` Reviewed By: ajkr Differential Revision: D40710936 Pulled By: cbi42 fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
Summary: Right now in MergingIterator, for each range tombstone start and end key, we pop one end from heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely cases when a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them are dropped already. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases. Pull Request resolved: facebook/rocksdb#10877 Test Plan: - existing UT - ran all flavors of stress test through sandcastle - benchmark: ``` TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64 Level Files Size(MB) -------------------- 0 8 152 1 0 0 2 0 0 3 0 0 4 0 0 5 0 0 6 0 0 TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true readseq [AVG 5 runs] : 1432116 (± 59664) ops/sec; 224.0 (± 9.3) MB/sec readseq [MEDIAN 5 runs] : 1454886 ops/sec; 227.5 MB/sec readseq [AVG 5 runs] : 1944425 (± 29521) ops/sec; 304.1 (± 4.6) MB/sec readseq [MEDIAN 5 runs] : 1959430 ops/sec; 306.5 MB/sec ``` Reviewed By: ajkr Differential Revision: D40710936 Pulled By: cbi42 fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
Summary: Right now in MergingIterator, for each range tombstone start and end key, we pop one end from heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely cases when a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them are dropped already. This PR uses
replace_top()
when inserting new range tombstone keys, which is more efficient in these common cases.Test plan: