db_bench v5.2 shows fillrandom performance degradation on windows #1976
@yuslepukhin, I will take a look at the numbers for Linux but I noticed that you are using Direct I/O for reads and writes in 5.2 |
Also for 5.2 if we enable Direct IO then we should set --skip_table_builder_flush=1 |
@IslamAbdelRahman Thanks for responding. 4.8 also makes use of direct I/O; it's just that the option name was different back then, before direct I/O became an official symbol, so you will need to change that in the 4.8 command line. As for the skip-flush option, my understanding is that it was removed recently and the default is 1. |
@IslamAbdelRahman I am willing to publish more scenarios. However, fillrandom is the only one that exhibits this kind of degradation. |
@IslamAbdelRahman Not sure direct io was available on Linux at 4.8 time frame |
We deprecated --disable_data_sync, so that after each SST file, data is fsynced. Could that be the problem? If you revert eb912a9 in 5.2 and the numbers come back, then that is the problem. |
@yuslepukhin did you paste the wrong link? |
@siying I believe that's the right one. It comes before disable_data_sync and already exhibits the issue |
@yuslepukhin try increasing the number of flush and compaction threads. |
@yuslepukhin A DB with disable_data_sync=true doesn't have the persistence guarantees users usually expect from a DB (after a machine crash the DB is re-openable, data is durable since the last flush, and data is durable since the last WAL sync). One reason we removed it is to avoid confusing people. This option is designed for fast bulk loading: load all the data with this option on, force a full compaction, and trigger a file system sync through the command line. Most people won't understand it, so we would rather not have it. |
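The bulk-loading recipe above can be sketched with db_bench itself (a hypothetical sequence: the path is a placeholder, and the flags are the ones already used elsewhere in this thread):

```shell
# Sketch of the fast-bulkload recipe described above (path is hypothetical).
# 1) Load with durability off: no WAL, no auto compactions.
db_bench --benchmarks=fillrandom --disable_wal=1 --disable_auto_compactions=1 \
         --num=10000000 --db=/data/bulkload
# 2) Force a full compaction on the loaded DB.
db_bench --benchmarks=compact --use_existing_db=1 --disable_auto_compactions=1 \
         --db=/data/bulkload
# 3) Flush OS buffers to stable storage from the command line.
sync
```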
@siying Thanks for the explanation. This regression takes place exactly at the scenario you described. |
@yuslepukhin I assume you turn this option off in your production. Do you care about a benchmark that has it on? With Direct I/O on, isn't Fsync() a no-op? Why would it be slower then? |
@siying RocksDB provides storage in our production but not fault tolerance or replication. |
@yuslepukhin by no-op, I didn't mean it doesn't need to do anything; I mean it isn't waiting for many significant I/Os, so it should be fast. |
@siying The commit date was probably the commit time to a private fork? |
I have confirmed that the regression comes from 907f24d. This causes ~40% perf degradation in a single-thread benchmark with the settings we run it with. We will evaluate whether single-thread bulk insert is important to us and whether this commit delivers what it is intended for. What concerns me is that the commit mentions the following benchmark options
Is the allow_concurrent_memtable_write option necessary, and does it involve atomic spins, for this commit to deliver its worth? |
@yuslepukhin oh, the regression may be caused by the fact that we turned the default of allow_concurrent_memtable_write from false to true. If you load data in a single thread, you can turn it off. Is the regression still there if you explicitly set --allow_concurrent_memtable_write=false? |
@siying It is false by default. The notable difference is that in this test the memtable representation is vector, which would not support concurrent writes. So far, it is clear that this commit does not work well for vector. |
@yuslepukhin I don't understand why there is a regression from it. For concurrent writes = false, the behavior should have no change. All the code paths should stay the same. This should return null: 907f24d#diff-c524911eb80f5abb845bf10cae90b210R1065 and this code path is not executed: 907f24d#diff-d256ef1f6dce451fa3906a4aaba21390R412 I must have missed something. |
@yuslepukhin I can reproduce it locally on Linux too. Let me investigate it. |
"--skip_table_builder_flush=" is deprecated in master but not in 5.2, so you need to specify it if you run the benchmark in the 5.2 branch. If you specify the parameter, will the performance gap narrow?
@siying "--skip_table_builder_flush=" is on currently. I am not running a 5.2 build; it is not necessary. So my scripts for 4.8 have it on. I have a build off the commit in question. |
@siying On the same build switching to skiplist with 16 threads and concurrent updates makes results worse. |
@yuslepukhin In the initial db_bench command you pasted, --skip_table_builder_flush was not specified in the run with 5.2. Do you mean you actually have it on, or that you didn't run with 5.2 but with master?
@siying I ran it from master and this is how I learned it was deprecated. |
@yuslepukhin I'm not able to reproduce it with the command line you pasted with master. Is it Windows specific? |
@siying The only thing windows specific is a --db path. In 4.8 there was a --bufferedio which was replaced by direct io everywhere. |
@siying It is also a build with jemalloc |
@yuslepukhin there is another Windows-specific thing: we didn't support Direct I/O on Linux until 5.1. And we moved the write-buffering logic from Windows to shared code, so this code path changed on Windows, although it is supposed to be the same code.
@siying Sure, though the profile shows Write() as more expensive, not the file access path.
@siying I think this is the place where we want to think about how to get rid of data structures that allocate node by node. Perhaps a map based on a single allocation may help. |
@yuslepukhin let me try to figure out why. If you use --allow_concurrent_writes=false, the allocation should not happen. |
@siying This is exactly that. This one is using vector memtable |
@yuslepukhin I can't see this in my profiling under Linux. Is it because the construction of this class requires an allocation? https://github.com/facebook/rocksdb/blob/master/db/write_batch.cc#L778-L779 |
@yuslepukhin if this theory is correct, then this diff can eliminate the allocation: |
@siying Construction of the class does not require an allocation, and one allocation would not be an issue. The profile shows many inserts/erases into the map even though the switch is false. Perhaps we have some problem with switch value propagation. |
@siying let me step into it |
This is what I see on commit 907f24d |
This can only be executed if
so immutable_db_options_.allow_concurrent_memtable_write is true? Let me see how we can end up with this.
@yuslepukhin I still don't get how this code path can be triggered. Can you figure it out by stepping into it? |
@siying I do not specify the switch on the command line, and db_bench has it. Here is the exact command line I ran for both runs. You should be able to run it if --bufferedio=0 is removed: d:\dmitrism\perf>db_bench_je --benchmarks=fillrandom --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --hi |
@siying It appears that your theory is correct. For some reason std::map, at least on Microsoft's STL, does an allocation when MemTableInserter is instantiated. The value of concurrent updates is still false. |
@siying Many thanks for your patience. I may try on Monday, as the internet is not working today. I will ask the STL team what they think. |
Summary: MemTableInserter default-constructs its post-processing-info std::map. However, on Windows with the 2015 STL, the default-constructed map still dynamically allocates one node, which shows up in a profiler, and we lose ~40% throughput on the fillrandom benchmark. Solution: declare the map as std::aligned_storage and construct it optionally. This addresses #1976
Before:
-------------------------------------------------------------------
Initializing RocksDB Options from command-line flags
DB path: [k:\data\BulkLoadRandom_10M_fillonly]
fillrandom : 2.775 micros/op 360334 ops/sec; 280.4 MB/s
Microseconds per write:
Count: 10000000 Average: 2.7749 StdDev: 39.92
Min: 1 Median: 2.0826 Max: 26051
Percentiles: P50: 2.08 P75: 2.55 P99: 3.55 P99.9: 9.58 P99.99: 51.56
------------------------------------------------------
After:
Initializing RocksDB Options from command-line flags
DB path: [k:\data\BulkLoadRandom_10M_fillon
Closes #2011
Differential Revision: D4740823
Pulled By: siying
fbshipit-source-id: 1daaa2c
As compared with 4.8. I am wondering if this reproduces on Linux.
5.2 bench command line (everything is the same minus deprecated options):
db_bench_je --benchmarks=fillrandom --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=10000000 --threads=1 --value_size=800 --block_size=65536 --cache_size=1048576 --bloom_bits=10 --cache_numshardbits=4 --open_files=500000 --verify_checksum=1 --db=k:\data\BulkLoadRandom_10M --sync=0 --disable_wal=1 --compression_type=snappy --stats_interval=1000000 --compression_ratio=0.5 --write_buffer_size=268435456 --target_file_size_base=1073741824 --max_write_buffer_number=30 --max_background_compactions=20 --max_background_flushes=4 --level0_file_num_compaction_trigger=10000000 --level0_slowdown_writes_trigger=10000000 --level0_stop_writes_trigger=10000000 --num_levels=2 --delete_obsolete_files_period_micros=300000000 --min_level_to_compress=2 --max_compaction_bytes=0 --stats_per_interval=1 --max_bytes_for_level_base=10485760 --memtablerep=vector --use_existing_db=0 --disable_auto_compactions=1 --use_direct_reads=1 --use_direct_writes=1 --compaction_readahead_size=3145728 --writable_file_max_buffer_size=1048576 --random_access_max_buffer_size=0 --new_table_reader_for_compaction_inputs=1
4.8 bench command
db_bench_je --benchmarks=fillrandom --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=10000000 --threads=1 --value_size=800 --block_size=65536 --cache_size=1048576 --bloom_bits=10 --cache_numshardbits=4 --open_files=500000 --verify_checksum=1 --db=k:\data\BulkLoadRandom_10M --sync=0 --disable_wal=1 --compression_type=snappy --stats_interval=1000000 --compression_ratio=0.5 --disable_data_sync=1 --write_buffer_size=268435456 --target_file_size_base=1073741824 --max_write_buffer_number=30 --max_background_compactions=20 --max_background_flushes=4 --level0_file_num_compaction_trigger=10000000 --level0_slowdown_writes_trigger=10000000 --level0_stop_writes_trigger=10000000 --num_levels=2 --delete_obsolete_files_period_micros=300000000 --min_level_to_compress=2 --max_grandparent_overlap_factor=10 --stats_per_interval=1 --max_bytes_for_level_base=10485760 --memtablerep=vector --use_existing_db=0 --disable_auto_compactions=1 --source_compaction_factor=10000000 --bufferedio=0 --compaction_readahead_size=3145728 --writable_file_max_buffer_size=1048576 --random_access_max_buffer_size=0 --new_table_reader_for_compaction_inputs=1 --skip_table_builder_flush=1
The results show the following:
4.8
DB path: [k:\data\BulkLoadRandom_10M]
fillrandom : 1.723 micros/op 580378 ops/sec; 451.6 MB/s
Microseconds per write:
Count: 10000000 Average: 1.7227 StdDev: 46.11
Min: 0 Median: 0.9637 Max: 26036
Percentiles: P50: 0.96 P75: 1.50 P99: 2.63 P99.9: 9.00 P99.99: 17.34
Latest GitHub at the time of this writing:
DB path: [k:\data\BulkLoadRandom_10M]
fillrandom : 2.769 micros/op 361160 ops/sec; 281.1 MB/s
Microseconds per write:
Count: 10000000 Average: 2.7686 StdDev: 37.23
Min: 1 Median: 2.0641 Max: 17129
Percentiles: P50: 2.06 P75: 2.55 P99: 3.66 P99.9: 11.10 P99.99: 28.33