
db_bench v5.2 shows fillrandom performance degradation on windows #1976

Closed · yuslepukhin opened this issue Mar 9, 2017 · 48 comments

@yuslepukhin (Contributor) commented Mar 9, 2017

Compared with 4.8. I am wondering whether this reproduces on Linux.

5.2 bench command line (everything is the same minus the deprecated options):

db_bench_je --benchmarks=fillrandom --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=10000000 --threads=1 --value_size=800 --block_size=65536 --cache_size=1048576 --bloom_bits=10 --cache_numshardbits=4 --open_files=500000 --verify_checksum=1 --db=k:\data\BulkLoadRandom_10M --sync=0 --disable_wal=1 --compression_type=snappy --stats_interval=1000000 --compression_ratio=0.5 --write_buffer_size=268435456 --target_file_size_base=1073741824 --max_write_buffer_number=30 --max_background_compactions=20 --max_background_flushes=4 --level0_file_num_compaction_trigger=10000000 --level0_slowdown_writes_trigger=10000000 --level0_stop_writes_trigger=10000000 --num_levels=2 --delete_obsolete_files_period_micros=300000000 --min_level_to_compress=2 --max_compaction_bytes=0 --stats_per_interval=1 --max_bytes_for_level_base=10485760 --memtablerep=vector --use_existing_db=0 --disable_auto_compactions=1 --use_direct_reads=1 --use_direct_writes=1 --compaction_readahead_size=3145728 --writable_file_max_buffer_size=1048576 --random_access_max_buffer_size=0 --new_table_reader_for_compaction_inputs=1

4.8 bench command:

db_bench_je --benchmarks=fillrandom --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=10000000 --threads=1 --value_size=800 --block_size=65536 --cache_size=1048576 --bloom_bits=10 --cache_numshardbits=4 --open_files=500000 --verify_checksum=1 --db=k:\data\BulkLoadRandom_10M --sync=0 --disable_wal=1 --compression_type=snappy --stats_interval=1000000 --compression_ratio=0.5 --disable_data_sync=1 --write_buffer_size=268435456 --target_file_size_base=1073741824 --max_write_buffer_number=30 --max_background_compactions=20 --max_background_flushes=4 --level0_file_num_compaction_trigger=10000000 --level0_slowdown_writes_trigger=10000000 --level0_stop_writes_trigger=10000000 --num_levels=2 --delete_obsolete_files_period_micros=300000000 --min_level_to_compress=2 --max_grandparent_overlap_factor=10 --stats_per_interval=1 --max_bytes_for_level_base=10485760 --memtablerep=vector --use_existing_db=0 --disable_auto_compactions=1 --source_compaction_factor=10000000 --bufferedio=0 --compaction_readahead_size=3145728 --writable_file_max_buffer_size=1048576 --random_access_max_buffer_size=0 --new_table_reader_for_compaction_inputs=1 --skip_table_builder_flush=1

The results show the following:
4.8

DB path: [k:\data\BulkLoadRandom_10M]
fillrandom : 1.723 micros/op 580378 ops/sec; 451.6 MB/s
Microseconds per write:
Count: 10000000 Average: 1.7227 StdDev: 46.11
Min: 0 Median: 0.9637 Max: 26036
Percentiles: P50: 0.96 P75: 1.50 P99: 2.63 P99.9: 9.00 P99.99: 17.34

Latest GitHub master at the time of this writing:

DB path: [k:\data\BulkLoadRandom_10M]
fillrandom : 2.769 micros/op 361160 ops/sec; 281.1 MB/s
Microseconds per write:
Count: 10000000 Average: 2.7686 StdDev: 37.23
Min: 1 Median: 2.0641 Max: 17129
Percentiles: P50: 2.06 P75: 2.55 P99: 3.66 P99.9: 11.10 P99.99: 28.33

@IslamAbdelRahman (Contributor)

@yuslepukhin, I will take a look at the numbers for Linux, but I noticed that you are using Direct I/O for reads and writes in 5.2.

@IslamAbdelRahman (Contributor)

Also, for 5.2, if we enable Direct I/O then we should set --skip_table_builder_flush=1.

@yuslepukhin (Contributor Author)

@IslamAbdelRahman Thanks for responding. 4.8 also makes use of direct I/O; it's just that the option name was different back then, before direct I/O became an official symbol, so you will need to change that in the 4.8 command line. As for the skip-flush option, my understanding is that it was removed recently and the default is 1.

@yuslepukhin (Contributor Author)

@IslamAbdelRahman I am willing to publish more scenarios. However, fillrandom is the only one that exhibits this kind of degradation.

@yuslepukhin (Contributor Author)

@IslamAbdelRahman Not sure direct I/O was available on Linux in the 4.8 time frame.

@siying (Contributor) commented Mar 9, 2017

We deprecated --disable_data_sync, so data is now fsynced after each SST file. Could that be the problem? If you revert eb912a9 in 5.2 and the number comes back, then that is the problem.

@yuslepukhin (Contributor Author)

@siying I already tried a build off 0a4cdde. The number does not come back. Given that inserts are mostly in memory and flushes happen in the background, I doubt that would be the cause.

@siying (Contributor) commented Mar 10, 2017

@yuslepukhin did you paste the wrong link?

@yuslepukhin (Contributor Author)

@siying I believe that's the right one. It comes before the disable_data_sync removal and already exhibits the issue.

@yuslepukhin (Contributor Author)

@siying eb912a9 mentions that performance gains can be achieved in other ways. What other ways might the author have meant, specifically?

@siying (Contributor) commented Mar 13, 2017

@yuslepukhin increasing the number of flush and compaction threads.

@siying (Contributor) commented Mar 13, 2017

@yuslepukhin A DB with disable_data_sync=true doesn't have the persistence guarantees users usually expect from a DB (after a machine crash the DB is re-openable, data is durable up to the last flush, and data is durable up to the last WAL sync). One reason we removed it is to avoid confusing people. This option is designed for fast bulk loading: load all the data with this option, force a full compaction, and trigger a file system sync through a command line. Most people won't understand it, so we would rather not have it.
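
For concreteness, a minimal sketch of that bulk-load recipe (the path, option values, and the shell sync step are illustrative assumptions, not taken from this thread):

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Bulk-load mode: no automatic compactions while the data pours in.
  options.disable_auto_compactions = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/bulkload_db", &db);
  assert(s.ok());

  // ... load all the data here, with WAL and sync disabled for speed ...

  // Force a full compaction over the entire key range.
  s = db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);
  assert(s.ok());
  delete db;
  // Finally, trigger a file system sync from the shell (e.g. `sync`).
  return 0;
}
```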

@yuslepukhin (Contributor Author)

@siying Thanks for the explanation. This regression takes place in exactly the scenario you described.
Increasing the number of flush threads does not have any effect here; I tried it. There is never a pending flush, they are all scheduled right away. Compactions are disabled.
I also tried not syncing with direct I/O. That made very little difference.
We are talking about 1.0-1.4 usec/op of difference.
There is increased wait time on the instrumented mutex, though, for some reason, before we check whether to schedule flushes.

@siying (Contributor) commented Mar 13, 2017

@yuslepukhin I assume you turn this option off in production. Do you care about a benchmark that has it on?

With Direct I/O on, isn't Fsync() a no-op? Why would it be slower then?

@yuslepukhin (Contributor Author) commented Mar 14, 2017

@siying RocksDB provides storage in our production, but not fault tolerance or replication.
With that, bulk data transfer speed is more important.

> With Direct I/O on, isn't Fsync() a no-op?

Not on Windows. If you look at the implementation, it makes use of FlushFileBuffers. Just search the internet.

> Why would it be slower then?

I actually do not believe it affects this specific benchmark.
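
For reference, a minimal sketch of what the Windows Fsync() path boils down to (a simplified illustration built on the documented Win32 FlushFileBuffers call, not the actual RocksDB port code):

```cpp
#include <windows.h>

// FlushFileBuffers forces the file's data and metadata to disk, and on
// many drivers it also flushes the drive's hardware write cache, which
// is why it can affect far more than the single handle passed in.
bool SyncFile(HANDLE handle) {
  return ::FlushFileBuffers(handle) != FALSE;
}
```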

@siying (Contributor) commented Mar 14, 2017

@yuslepukhin by no-op, I didn't mean it doesn't do anything; I meant it doesn't wait for many significant I/Os, so it should be fast.

@yuslepukhin (Contributor Author)

@siying FlushFileBuffers, contrary to its documentation, name, and signature, affects far more than a single file. Sigh :( However, this is not my focus at the moment.

@yuslepukhin (Contributor Author)

Out of curiosity, why is there such a discrepancy between the commit date recorded inside the commit and the actual commit date in master?
[screenshot: commit dates]

@siying (Contributor) commented Mar 14, 2017

The commit date was probably the commit time in a private fork?

@yuslepukhin (Contributor Author)

I have confirmed that the regression comes from 907f24d.

This causes ~40% perf degradation in the single-thread benchmark with the settings we run.
The previous commit does not exhibit the regression.

We will evaluate whether single-thread bulk insert is important to us and whether this commit delivers what it is intended for. What concerns me is that the commit mentions the following benchmark options:

TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -disable_auto_compactions -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -num=10000000 --writes=1000000 -max_background_flushes=16 -max_write_buffer_number=16 --threads=64 --batch_size=128 -allow_concurrent_memtable_write -enable_write_thread_adaptive_yield

Is the allow_concurrent_memtable_write option necessary, and does it involve atomic spins, for this commit to deliver its worth?

@siying (Contributor) commented Mar 15, 2017

@yuslepukhin oh, the regression may be caused by our changing the default of allow_concurrent_memtable_write from false to true. If you load data in a single thread, you can turn it off. Is the regression still there if you explicitly set --allow_concurrent_memtable_write=false?

@yuslepukhin (Contributor Author)

@siying It is false by default. The notable difference is that in this test the memtable representation is vector, which does not support concurrent writes. So far, it is clear that this commit does not work well for vector.

@siying (Contributor) commented Mar 15, 2017

@yuslepukhin I don't understand why it causes a regression. With concurrent writes = false, the behavior should not change. All the code paths should stay the same.

This should return null: 907f24d#diff-c524911eb80f5abb845bf10cae90b210R1065

and this code path is not executed: 907f24d#diff-d256ef1f6dce451fa3906a4aaba21390R412

I must have missed something.

@siying (Contributor) commented Mar 15, 2017

@yuslepukhin I can reproduce it locally on Linux too. Let me investigate it.

@siying (Contributor) commented Mar 15, 2017

"--skip_table_builder_flush" is deprecated in master but not in 5.2, so you need to specify it if you run the benchmark on the 5.2 branch. If you specify the parameter, will the performance gap narrow?

@yuslepukhin (Contributor Author)

@siying "--skip_table_builder_flush" is currently on. I am not running a 5.2 build; it is not necessary. My scripts for 4.8 have it on. I have a build off the commit in question.

@yuslepukhin (Contributor Author)

@siying On the same build, switching to skiplist with 16 threads and concurrent updates makes the results worse.

@siying (Contributor) commented Mar 16, 2017

@yuslepukhin in the initial db_bench command you pasted, --skip_table_builder_flush was not specified for the 5.2 run. Do you mean you actually have it on, or that you ran master rather than 5.2?

@yuslepukhin (Contributor Author)

@siying I ran it from master, and that is how I learned it was deprecated.

@siying (Contributor) commented Mar 16, 2017

@yuslepukhin I'm not able to reproduce it on master with the command line you pasted. Is it Windows-specific?

@yuslepukhin (Contributor Author)

@siying The only Windows-specific thing is the --db path. In 4.8 there was a --bufferedio option, which was replaced by direct I/O everywhere.

@yuslepukhin (Contributor Author)

@siying It is also a build with jemalloc.

@siying (Contributor) commented Mar 16, 2017

@yuslepukhin there is another Windows-specific thing: we didn't support Direct I/O on Linux until 5.1. And we moved the write buffering logic from Windows to shared code, so this code path changed on Windows, although it is supposed to be the same code.

@yuslepukhin (Contributor Author)

@siying Sure, though the profile shows Write() as more expensive, not the file access path.

@yuslepukhin (Contributor Author)

Profiling results (sampled CPU). I can share precise CPU usage as well. One is the run with no regression, the other the run with the regression.

[profiler screenshot: no regression]
[profiler screenshot: regression]

@yuslepukhin (Contributor Author)

@siying I think this is the place where we want to think about how to get rid of data structures that allocate node by node. Perhaps a map based on a single allocation may help.

@siying (Contributor) commented Mar 17, 2017

@yuslepukhin let me try to figure out why. If you use --allow_concurrent_memtable_write=false, the allocation should not happen.

@yuslepukhin (Contributor Author)

@siying This is exactly that. This one is using the vector memtable.

@siying (Contributor) commented Mar 17, 2017

@yuslepukhin I can't see this in my profiling under Linux. Is it because the construction of this class requires an allocation? https://github.com/facebook/rocksdb/blob/master/db/write_batch.cc#L778-L779

@siying (Contributor) commented Mar 17, 2017

@yuslepukhin if this theory is correct, then this diff can eliminate the allocation:

```diff
diff --git a/db/write_batch.cc b/db/write_batch.cc
index 435c621..851eba1 100644
--- a/db/write_batch.cc
+++ b/db/write_batch.cc
@@ -776,7 +776,7 @@ class MemTableInserter : public WriteBatch::Handler {
   const bool concurrent_memtable_writes_;
   bool* has_valid_writes_;
   typedef std::map<MemTable*, MemTablePostProcessInfo> MemPostInfoMap;
-  MemPostInfoMap mem_post_info_map_;
+  std::unique_ptr<MemPostInfoMap> mem_post_info_map_;
   // current recovered transaction we are rebuilding (recovery)
   WriteBatch* rebuilding_trx_;

@@ -805,7 +805,10 @@ class MemTableInserter : public WriteBatch::Handler {
   SequenceNumber get_final_sequence() { return sequence_; }

   void PostProcess() {
-    for (auto& pair : mem_post_info_map_) {
+    if (!mem_post_info_map_) {
+      return;
+    }
+    for (auto& pair : *mem_post_info_map_) {
       pair.first->BatchPostProcess(pair.second);
     }
   }
@@ -1194,7 +1197,10 @@ class MemTableInserter : public WriteBatch::Handler {
       // No need to batch counters locally if we don't use concurrent mode.
       return nullptr;
     }
-    return &mem_post_info_map_[mem];
+    if (!mem_post_info_map_) {
+      mem_post_info_map_.reset(new MemPostInfoMap());
+    }
+    return &(*mem_post_info_map_)[mem];
   }
 };
```
Will it help if you try it?

@yuslepukhin (Contributor Author)

@siying Construction of the class should not require an allocation, and one allocation would not be an issue anyway. The profile shows many inserts/erases into the map even though the switch is false. Perhaps we have some problem with switch value propagation.

@yuslepukhin (Contributor Author)

@siying let me step into it

@yuslepukhin (Contributor Author)

This is what I see on commit 907f24d:

[debugger screenshot]

@siying (Contributor) commented Mar 17, 2017

This can only be executed if parallel is true, but

    bool parallel = immutable_db_options_.allow_concurrent_memtable_write &&
                    write_group.size() > 1;

so immutable_db_options_.allow_concurrent_memtable_write is true? Let me see how we can end up with this.

@siying (Contributor) commented Mar 17, 2017

@yuslepukhin I still don't get how this code path can be triggered. Can you figure it out by stepping into it?

@yuslepukhin (Contributor Author)

@siying I do not specify the switch on the command line, and db_bench has it false by default.

Here is the exact command line I ran for both runs. You should be able to run it if --bufferedio=0 is removed:

d:\dmitrism\perf>db_bench_je --benchmarks=fillrandom --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=10000000 --threads=1 --value_size=800 --block_size=65536 --cache_size=1048576 --bloom_bits=10 --cache_numshardbits=4 --open_files=500000 --verify_checksum=1 --db=k:\data\BulkLoadRandom_10M_fillonly --sync=0 --disable_wal=1 --compression_type=snappy --stats_interval=1000000 --compression_ratio=0.5 --disable_data_sync=1 --write_buffer_size=268435456 --target_file_size_base=1073741824 --max_write_buffer_number=30 --max_background_compactions=20 --max_background_flushes=16 --level0_file_num_compaction_trigger=10000000 --level0_slowdown_writes_trigger=10000000 --level0_stop_writes_trigger=10000000 --num_levels=2 --delete_obsolete_files_period_micros=300000000 --min_level_to_compress=2 --max_grandparent_overlap_factor=10 --stats_per_interval=1 --max_bytes_for_level_base=10485760 --memtablerep=vector --use_existing_db=0 --disable_auto_compactions=1 --source_compaction_factor=10000000 --bufferedio=0 --compaction_readahead_size=3145728 --writable_file_max_buffer_size=1048576 --random_access_max_buffer_size=0 --new_table_reader_for_compaction_inputs=1 --skip_table_builder_flush=1 1>>PerfResult_03172017_121055\BulkLoadRandom_10M_fillonly\stats.txt

@yuslepukhin (Contributor Author) commented Mar 17, 2017

@siying It appears that your theory is correct. For some reason std::map, at least on the Microsoft STL, does an allocation when MemTableInserter is instantiated. The value of concurrent updates is still false.
[debugger screenshots: MemTableInserter construction and the allocating call stack]
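
A stand-alone sketch of how one could verify this outside RocksDB (the counting allocator is purely illustrative): on the 2015 Microsoft STL the default-constructed map is expected to report one allocation, while libstdc++/libc++ report zero.

```cpp
#include <cstddef>
#include <cstdio>
#include <map>

static int g_allocations = 0;

// Minimal C++11 allocator that counts every heap allocation it performs.
template <class T>
struct CountingAllocator {
  using value_type = T;
  CountingAllocator() = default;
  template <class U>
  CountingAllocator(const CountingAllocator<U>&) {}
  T* allocate(std::size_t n) {
    ++g_allocations;
    return static_cast<T*>(::operator new(n * sizeof(T)));
  }
  void deallocate(T* p, std::size_t) { ::operator delete(p); }
};

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) {
  return true;
}
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) {
  return false;
}

int main() {
  using Map = std::map<int, int, std::less<int>,
                       CountingAllocator<std::pair<const int, int>>>;
  Map m;  // default construction only; nothing inserted
  std::printf("allocations after default construction: %d\n", g_allocations);
  return 0;
}
```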

@yuslepukhin (Contributor Author)

@siying Many thanks for your patience. I may try it on Monday, as the internet is not working today. I will ask the STL team what they think.

facebook-github-bot pushed a commit that referenced this issue Mar 22, 2017
Summary:
MemTableInserter default constructs the post-processing info
  std::map. However, on Windows with the 2015 STL the default-
  constructed map still dynamically allocates one node,
  which shows up on a profiler, and we lose ~40% throughput
  on the fillrandom benchmark.
  Solution: declare the map as std::aligned_storage and
  construct it only when needed.

This addresses #1976

Before:
-------------------------------------------------------------------
  Initializing RocksDB Options from command-line flags
  DB path: [k:\data\BulkLoadRandom_10M_fillonly]
  fillrandom   :       2.775 micros/op 360334 ops/sec;  280.4 MB/s
  Microseconds per write:
  Count: 10000000 Average: 2.7749  StdDev: 39.92
  Min: 1  Median: 2.0826  Max: 26051
Percentiles: P50: 2.08 P75: 2.55 P99: 3.55 P99.9: 9.58 P99.99: 51.56
  ------------------------------------------------------

  After:

  Initializing RocksDB Options from command-line flags
  DB path: [k:\data\BulkLoadRandom_10M_fillon
Closes #2011

Differential Revision: D4740823

Pulled By: siying

fbshipit-source-id: 1daaa2c
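
A sketch of the aligned-storage pattern the summary describes, written outside RocksDB (all names here are illustrative): reserve raw inline storage for the map and placement-new it only on first use, so default construction never touches the heap.

```cpp
#include <map>
#include <new>
#include <type_traits>

struct MemTablePostProcessInfo {};  // stand-in for the real struct

class LazyPostInfoMap {
  using Map = std::map<void*, MemTablePostProcessInfo>;
  std::aligned_storage<sizeof(Map), alignof(Map)>::type storage_;
  bool constructed_ = false;

 public:
  // Construct the map in place only when it is first needed, so a
  // default-constructed owner performs no dynamic allocation.
  Map& get() {
    if (!constructed_) {
      new (&storage_) Map();
      constructed_ = true;
    }
    return *reinterpret_cast<Map*>(&storage_);
  }
  ~LazyPostInfoMap() {
    if (constructed_) {
      reinterpret_cast<Map*>(&storage_)->~Map();
    }
  }
};
```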