
db_bench v5.2 shows fillrandom performance degradation on windows #1976

Closed · yuslepukhin opened this issue Mar 9, 2017 · 48 comments

@yuslepukhin (Contributor) commented Mar 9, 2017

Compared with 4.8. I am wondering whether this reproduces on Linux.

5.2 bench command line (everything is the same minus the deprecated options):

db_bench_je --benchmarks=fillrandom --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=10000000 --threads=1 --value_size=800 --block_size=65536 --cache_size=1048576 --bloom_bits=10 --cache_numshardbits=4 --open_files=500000 --verify_checksum=1 --db=k:\data\BulkLoadRandom_10M --sync=0 --disable_wal=1 --compression_type=snappy --stats_interval=1000000 --compression_ratio=0.5 --write_buffer_size=268435456 --target_file_size_base=1073741824 --max_write_buffer_number=30 --max_background_compactions=20 --max_background_flushes=4 --level0_file_num_compaction_trigger=10000000 --level0_slowdown_writes_trigger=10000000 --level0_stop_writes_trigger=10000000 --num_levels=2 --delete_obsolete_files_period_micros=300000000 --min_level_to_compress=2 --max_compaction_bytes=0 --stats_per_interval=1 --max_bytes_for_level_base=10485760 --memtablerep=vector --use_existing_db=0 --disable_auto_compactions=1 --use_direct_reads=1 --use_direct_writes=1 --compaction_readahead_size=3145728 --writable_file_max_buffer_size=1048576 --random_access_max_buffer_size=0 --new_table_reader_for_compaction_inputs=1

4.8 bench command:

db_bench_je --benchmarks=fillrandom --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=10000000 --threads=1 --value_size=800 --block_size=65536 --cache_size=1048576 --bloom_bits=10 --cache_numshardbits=4 --open_files=500000 --verify_checksum=1 --db=k:\data\BulkLoadRandom_10M --sync=0 --disable_wal=1 --compression_type=snappy --stats_interval=1000000 --compression_ratio=0.5 --disable_data_sync=1 --write_buffer_size=268435456 --target_file_size_base=1073741824 --max_write_buffer_number=30 --max_background_compactions=20 --max_background_flushes=4 --level0_file_num_compaction_trigger=10000000 --level0_slowdown_writes_trigger=10000000 --level0_stop_writes_trigger=10000000 --num_levels=2 --delete_obsolete_files_period_micros=300000000 --min_level_to_compress=2 --max_grandparent_overlap_factor=10 --stats_per_interval=1 --max_bytes_for_level_base=10485760 --memtablerep=vector --use_existing_db=0 --disable_auto_compactions=1 --source_compaction_factor=10000000 --bufferedio=0 --compaction_readahead_size=3145728 --writable_file_max_buffer_size=1048576 --random_access_max_buffer_size=0 --new_table_reader_for_compaction_inputs=1 --skip_table_builder_flush=1

The results show the following:
4.8

DB path: [k:\data\BulkLoadRandom_10M]
fillrandom : 1.723 micros/op 580378 ops/sec; 451.6 MB/s
Microseconds per write:
Count: 10000000 Average: 1.7227 StdDev: 46.11
Min: 0 Median: 0.9637 Max: 26036
Percentiles: P50: 0.96 P75: 1.50 P99: 2.63 P99.9: 9.00 P99.99: 17.34

Latest GitHub master at the time of this writing:

DB path: [k:\data\BulkLoadRandom_10M]
fillrandom : 2.769 micros/op 361160 ops/sec; 281.1 MB/s
Microseconds per write:
Count: 10000000 Average: 2.7686 StdDev: 37.23
Min: 1 Median: 2.0641 Max: 17129
Percentiles: P50: 2.06 P75: 2.55 P99: 3.66 P99.9: 11.10 P99.99: 28.33

@IslamAbdelRahman (Contributor)

@yuslepukhin, I will take a look at the numbers for Linux, but I noticed that you are using Direct I/O for reads and writes in 5.2.

@IslamAbdelRahman (Contributor)

Also, for 5.2, if we enable Direct I/O then we should set --skip_table_builder_flush=1.

@yuslepukhin (Contributor Author)

@IslamAbdelRahman Thanks for responding. 4.8 also makes use of direct I/O; it's just that the option name was different back then, before direct I/O became an official symbol, so you will need to change that in the 4.8 command line. As for the skip-flush option, my understanding is that it was removed recently and the default is 1.

@yuslepukhin (Contributor Author)

@IslamAbdelRahman I am willing to publish more scenarios. However, fillrandom is the only one that exhibits this kind of degradation.

@yuslepukhin (Contributor Author)

@IslamAbdelRahman Not sure direct I/O was available on Linux in the 4.8 time frame.

@siying (Contributor) commented Mar 9, 2017

We deprecated --disable_data_sync, so data is now fsynced after each SST file. Could that be the problem? If you revert eb912a9 in 5.2 and the number comes back, then that is the problem.

@yuslepukhin (Contributor Author)

@siying I already tried a build off 0a4cdde. The number does not come back. Given that inserts are mostly in memory and flushes happen in the background, I doubt that would be the cause.

@siying (Contributor) commented Mar 10, 2017

@yuslepukhin did you paste the wrong link?

@yuslepukhin (Contributor Author)

@siying I believe that's the right one. It comes before the disable_data_sync removal and already exhibits the issue.

@yuslepukhin (Contributor Author)

@siying eb912a9 mentions that performance gains can be achieved in other ways. What other ways might the author have meant, specifically?

@siying (Contributor) commented Mar 13, 2017

@yuslepukhin increasing the number of flush and compaction threads.

@siying (Contributor) commented Mar 13, 2017

@yuslepukhin A DB with disable_data_sync=true doesn't have the persistence guarantees users usually expect from a DB (after a machine crash the DB is re-openable, data is durable up to the last flush, and data is durable up to the last WAL sync). One reason we removed it is to avoid confusing people. This option is designed for fast bulk loading: load all the data with this option, force a full compaction, and trigger a file system sync through a command line. Most people won't understand it, so we would rather not have it.
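
For concreteness, a minimal sketch of that bulk-load recipe (the path, option values, and the shell sync step are illustrative assumptions, not taken from this thread):

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Bulk-load mode: no automatic compactions while the data pours in.
  options.disable_auto_compactions = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/bulkload_db", &db);
  assert(s.ok());

  // ... load all the data here, with WAL and sync disabled for speed ...

  // Force a full compaction over the entire key range.
  s = db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);
  assert(s.ok());
  delete db;
  // Finally, trigger a file system sync from the shell (e.g. `sync`).
  return 0;
}
```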

@yuslepukhin (Contributor Author)

@siying Thanks for the explanation. This regression takes place in exactly the scenario you described.
Increasing the number of flush threads does not have any effect here; I tried it. There is never a pending flush, they are all scheduled right away. Compactions are disabled.
I also tried not syncing with direct I/O. That made very little difference.
We are talking about 1.0-1.4 usec/op of difference.
There is increased wait time on the instrumented mutex, though, for some reason, before we check whether to schedule flushes.

@siying (Contributor) commented Mar 13, 2017

@yuslepukhin I assume you turn this option off in production. Do you care about a benchmark that has it on?

With Direct I/O on, isn't Fsync() a no-op? Why would it be slower then?

@yuslepukhin (Contributor Author) commented Mar 14, 2017

@siying RocksDB provides storage in our production, but not fault tolerance or replication.
With that, bulk data transfer speed is more important.

> With Direct I/O on, isn't Fsync() a no-op?

Not on Windows. If you look at the implementation, it makes use of FlushFileBuffers. Just search the internet.

> Why would it be slower then?

I actually do not believe it affects this specific benchmark.
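
For reference, a minimal sketch of what the Windows Fsync() path boils down to (a simplified illustration built on the documented Win32 FlushFileBuffers call, not the actual RocksDB port code):

```cpp
#include <windows.h>

// FlushFileBuffers forces the file's data and metadata to disk, and on
// many drivers it also flushes the drive's hardware write cache, which
// is why it can affect far more than the single handle passed in.
bool SyncFile(HANDLE handle) {
  return ::FlushFileBuffers(handle) != FALSE;
}
```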

@siying (Contributor) commented Mar 14, 2017

@yuslepukhin by no-op, I didn't mean it doesn't do anything; I meant it doesn't wait for many significant I/Os, so it should be fast.

@yuslepukhin (Contributor Author)

@siying FlushFileBuffers, contrary to its documentation, name, and signature, affects far more than a single file. Sigh :( However, this is not my focus at the moment.

@yuslepukhin (Contributor Author)

Out of curiosity, why is there such a discrepancy between the commit date recorded inside the commit and the actual commit date in master?
[screenshot: commit dates]

@siying (Contributor) commented Mar 14, 2017

The commit date was probably the commit time in a private fork?

@yuslepukhin (Contributor Author)

I have confirmed that the regression comes from 907f24d.

This causes ~40% perf degradation in the single-thread benchmark with the settings we run.
The previous commit does not exhibit the regression.

We will evaluate whether single-thread bulk insert is important to us and whether this commit delivers what it is intended for. What concerns me is that the commit mentions the following benchmark options:

TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -disable_auto_compactions -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -num=10000000 --writes=1000000 -max_background_flushes=16 -max_write_buffer_number=16 --threads=64 --batch_size=128 -allow_concurrent_memtable_write -enable_write_thread_adaptive_yield

Is the allow_concurrent_memtable_write option necessary, and does it involve atomic spins, for this commit to deliver its worth?

@siying (Contributor) commented Mar 15, 2017

@yuslepukhin oh, the regression may be caused by our changing the default of allow_concurrent_memtable_write from false to true. If you load data in a single thread, you can turn it off. Is the regression still there if you explicitly set --allow_concurrent_memtable_write=false?

@yuslepukhin (Contributor Author)

@siying It is false by default. The notable difference is that in this test the memtable representation is vector, which does not support concurrent writes. So far, it is clear that this commit does not work well for vector.

@siying (Contributor) commented Mar 15, 2017

@yuslepukhin I don't understand why it causes a regression. With concurrent writes = false, the behavior should not change. All the code paths should stay the same.

This should return null: 907f24d#diff-c524911eb80f5abb845bf10cae90b210R1065

and this code path is not executed: 907f24d#diff-d256ef1f6dce451fa3906a4aaba21390R412

I must have missed something.

@siying (Contributor) commented Mar 15, 2017

@yuslepukhin I can reproduce it locally on Linux too. Let me investigate it.

@siying (Contributor) commented Mar 15, 2017

"--skip_table_builder_flush" is deprecated in master but not in 5.2, so you need to specify it if you run the benchmark on the 5.2 branch. If you specify the parameter, will the performance gap narrow?

@yuslepukhin (Contributor Author)

@siying "--skip_table_builder_flush" is currently on. I am not running a 5.2 build; it is not necessary. My scripts for 4.8 have it on. I have a build off the commit in question.

@yuslepukhin (Contributor Author)

@siying On the same build, switching to skiplist with 16 threads and concurrent updates makes the results worse.

@siying (Contributor) commented Mar 16, 2017

@yuslepukhin in the initial db_bench command you pasted, --skip_table_builder_flush was not specified for the 5.2 run. Do you mean you actually have it on, or that you ran master rather than 5.2?

@yuslepukhin (Contributor Author)

@siying I ran it from master, and that is how I learned it was deprecated.

@siying (Contributor) commented Mar 16, 2017

@yuslepukhin I'm not able to reproduce it on master with the command line you pasted. Is it Windows-specific?

@yuslepukhin (Contributor Author)

@siying The only Windows-specific thing is the --db path. In 4.8 there was a --bufferedio option, which was replaced by direct I/O everywhere.

@yuslepukhin (Contributor Author)

@siying It is also a build with jemalloc.

@siying (Contributor) commented Mar 16, 2017

@yuslepukhin there is another Windows-specific thing: we didn't support Direct I/O on Linux until 5.1. And we moved the write buffering logic from Windows to shared code, so this code path changed on Windows, although it is supposed to be the same code.

@yuslepukhin (Contributor Author)

@siying Sure, though the profile shows Write() as more expensive, not the file access path.

@yuslepukhin (Contributor Author)

Profiling results (sampled CPU). I can share precise CPU usage as well. One is the run with no regression, the other the run with the regression.

[profiler screenshot: no regression]
[profiler screenshot: regression]

@yuslepukhin (Contributor Author)

@siying I think this is the place where we want to think about how to get rid of data structures that allocate node by node. Perhaps a map based on a single allocation may help.

@siying (Contributor) commented Mar 17, 2017

@yuslepukhin let me try to figure out why. If you use --allow_concurrent_memtable_write=false, the allocation should not happen.

@yuslepukhin (Contributor Author)

@siying This is exactly that. This one is using the vector memtable.

@siying (Contributor) commented Mar 17, 2017

@yuslepukhin I can't see this in my profiling under Linux. Is it because the construction of this class requires an allocation? https://github.com/facebook/rocksdb/blob/master/db/write_batch.cc#L778-L779

@siying (Contributor) commented Mar 17, 2017

@yuslepukhin if this theory is correct, then this diff can eliminate the allocation:

```diff
diff --git a/db/write_batch.cc b/db/write_batch.cc
index 435c621..851eba1 100644
--- a/db/write_batch.cc
+++ b/db/write_batch.cc
@@ -776,7 +776,7 @@ class MemTableInserter : public WriteBatch::Handler {
   const bool concurrent_memtable_writes_;
   bool* has_valid_writes_;
   typedef std::map<MemTable*, MemTablePostProcessInfo> MemPostInfoMap;
-  MemPostInfoMap mem_post_info_map_;
+  std::unique_ptr<MemPostInfoMap> mem_post_info_map_;
   // current recovered transaction we are rebuilding (recovery)
   WriteBatch* rebuilding_trx_;

@@ -805,7 +805,10 @@ class MemTableInserter : public WriteBatch::Handler {
   SequenceNumber get_final_sequence() { return sequence_; }

   void PostProcess() {
-    for (auto& pair : mem_post_info_map_) {
+    if (!mem_post_info_map_) {
+      return;
+    }
+    for (auto& pair : *mem_post_info_map_) {
       pair.first->BatchPostProcess(pair.second);
     }
   }
@@ -1194,7 +1197,10 @@ class MemTableInserter : public WriteBatch::Handler {
       // No need to batch counters locally if we don't use concurrent mode.
       return nullptr;
     }
-    return &mem_post_info_map_[mem];
+    if (!mem_post_info_map_) {
+      mem_post_info_map_.reset(new MemPostInfoMap());
+    }
+    return &(*mem_post_info_map_)[mem];
   }
 };
```
Will it help if you try it?

@yuslepukhin (Contributor Author)

@siying Construction of the class should not require an allocation, and one allocation would not be an issue anyway. The profile shows many inserts/erases into the map even though the switch is false. Perhaps we have some problem with switch value propagation.

@yuslepukhin (Contributor Author)

@siying let me step into it

@yuslepukhin (Contributor Author)

This is what I see on commit 907f24d:

[debugger screenshot]

@siying (Contributor) commented Mar 17, 2017

This can only be executed if parallel is true, but

    bool parallel = immutable_db_options_.allow_concurrent_memtable_write &&
                    write_group.size() > 1;

so immutable_db_options_.allow_concurrent_memtable_write is true? Let me see how we can end up with this.

@siying (Contributor) commented Mar 17, 2017

@yuslepukhin I still don't get how this code path can be triggered. Can you figure it out by stepping into it?

@yuslepukhin (Contributor Author)

@siying I do not specify the switch on the command line, and db_bench has it false by default.

Here is the exact command line I ran for both runs. You should be able to run it if --bufferedio=0 is removed:

d:\dmitrism\perf>db_bench_je --benchmarks=fillrandom --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=10000000 --threads=1 --value_size=800 --block_size=65536 --cache_size=1048576 --bloom_bits=10 --cache_numshardbits=4 --open_files=500000 --verify_checksum=1 --db=k:\data\BulkLoadRandom_10M_fillonly --sync=0 --disable_wal=1 --compression_type=snappy --stats_interval=1000000 --compression_ratio=0.5 --disable_data_sync=1 --write_buffer_size=268435456 --target_file_size_base=1073741824 --max_write_buffer_number=30 --max_background_compactions=20 --max_background_flushes=16 --level0_file_num_compaction_trigger=10000000 --level0_slowdown_writes_trigger=10000000 --level0_stop_writes_trigger=10000000 --num_levels=2 --delete_obsolete_files_period_micros=300000000 --min_level_to_compress=2 --max_grandparent_overlap_factor=10 --stats_per_interval=1 --max_bytes_for_level_base=10485760 --memtablerep=vector --use_existing_db=0 --disable_auto_compactions=1 --source_compaction_factor=10000000 --bufferedio=0 --compaction_readahead_size=3145728 --writable_file_max_buffer_size=1048576 --random_access_max_buffer_size=0 --new_table_reader_for_compaction_inputs=1 --skip_table_builder_flush=1 1>>PerfResult_03172017_121055\BulkLoadRandom_10M_fillonly\stats.txt

@yuslepukhin (Contributor Author) commented Mar 17, 2017

@siying It appears that your theory is correct. For some reason std::map, at least on the Microsoft STL, does an allocation when MemTableInserter is instantiated. The value of concurrent updates is still false.
[debugger screenshots: MemTableInserter construction and the allocating call stack]
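
A stand-alone sketch of how one could verify this outside RocksDB (the counting allocator is purely illustrative): on the 2015 Microsoft STL the default-constructed map is expected to report one allocation, while libstdc++/libc++ report zero.

```cpp
#include <cstddef>
#include <cstdio>
#include <map>

static int g_allocations = 0;

// Minimal C++11 allocator that counts every heap allocation it performs.
template <class T>
struct CountingAllocator {
  using value_type = T;
  CountingAllocator() = default;
  template <class U>
  CountingAllocator(const CountingAllocator<U>&) {}
  T* allocate(std::size_t n) {
    ++g_allocations;
    return static_cast<T*>(::operator new(n * sizeof(T)));
  }
  void deallocate(T* p, std::size_t) { ::operator delete(p); }
};

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) {
  return true;
}
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) {
  return false;
}

int main() {
  using Map = std::map<int, int, std::less<int>,
                       CountingAllocator<std::pair<const int, int>>>;
  Map m;  // default construction only; nothing inserted
  std::printf("allocations after default construction: %d\n", g_allocations);
  return 0;
}
```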

@yuslepukhin (Contributor Author)

@siying Many thanks for your patience. I may try it on Monday, as the internet is not working today. I will ask the STL team what they think.

facebook-github-bot pushed a commit that referenced this issue Mar 22, 2017
Summary:
MemTableInserter default constructs the post-processing info
  std::map. However, on Windows with the 2015 STL the default-
  constructed map still dynamically allocates one node,
  which shows up on a profiler, and we lose ~40% throughput
  on the fillrandom benchmark.
  Solution: declare the map as std::aligned_storage and
  construct it only when needed.

This addresses #1976

Before:
-------------------------------------------------------------------
  Initializing RocksDB Options from command-line flags
  DB path: [k:\data\BulkLoadRandom_10M_fillonly]
  fillrandom   :       2.775 micros/op 360334 ops/sec;  280.4 MB/s
  Microseconds per write:
  Count: 10000000 Average: 2.7749  StdDev: 39.92
  Min: 1  Median: 2.0826  Max: 26051
Percentiles: P50: 2.08 P75: 2.55 P99: 3.55 P99.9: 9.58 P99.99: 51.56
  ------------------------------------------------------

  After:

  Initializing RocksDB Options from command-line flags
  DB path: [k:\data\BulkLoadRandom_10M_fillon
Closes #2011

Differential Revision: D4740823

Pulled By: siying

fbshipit-source-id: 1daaa2c
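
A sketch of the aligned-storage pattern the summary describes, written outside RocksDB (all names here are illustrative): reserve raw inline storage for the map and placement-new it only on first use, so default construction never touches the heap.

```cpp
#include <map>
#include <new>
#include <type_traits>

struct MemTablePostProcessInfo {};  // stand-in for the real struct

class LazyPostInfoMap {
  using Map = std::map<void*, MemTablePostProcessInfo>;
  std::aligned_storage<sizeof(Map), alignof(Map)>::type storage_;
  bool constructed_ = false;

 public:
  // Construct the map in place only when it is first needed, so a
  // default-constructed owner performs no dynamic allocation.
  Map& get() {
    if (!constructed_) {
      new (&storage_) Map();
      constructed_ = true;
    }
    return *reinterpret_cast<Map*>(&storage_);
  }
  ~LazyPostInfoMap() {
    if (constructed_) {
      reinterpret_cast<Map*>(&storage_)->~Map();
    }
  }
};
```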