
fillseq throughput is 13% slower from PR 6862 #9351

Closed
mdcallag opened this issue Jan 3, 2022 · 23 comments
Labels: performance, regression


mdcallag commented Jan 3, 2022

fillseq throughput is 13% slower after PR 6862 with git hash e3f953a. The problem is new CPU overhead (user, not system). The diff landed in v6.11.

The test server is a spare UDB host (many cores, fast SSD) and fillseq throughput drops from ~800k/s to ~700k/s with this diff. One example of throughput by version (6.0 to 6.22) is here. There is also a regression in 6.14 whose cause I am still searching for.

The problem is new CPU overhead that shows up as more user CPU time, as measured via /bin/time db_bench ...
The test takes ~1000 seconds and uses ~3000 seconds of user CPU time prior to this diff vs ~3500 seconds at this diff.

I am not sure whether this depends more on the number of files or concurrency because it doesn't show up as a problem on IO-bound or CPU-bound configs on a small server, nor does it show up on a CPU-bound config on this server. The repro here is what I call an IO-bound config and has ~20X more data (and files) than the CPU-bound config.

I don't see an increase in the context switch rate, so mutex contention does not appear to be a problem.

The command line is:
/usr/bin/time -f '%e %U %S' -o bm.lc.nt16.cm1.d0/1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.time numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1641213884

mdcallag added the bug and regression labels Jan 3, 2022

mdcallag commented Jan 3, 2022

Using smaller SST files to get "enough" files with a smaller database might help repro this faster.


mdcallag commented Jan 3, 2022

This shows the insert rate every 20 seconds.

For the diff immediately prior to the PR 6862 diff the rate starts at 840k/s and drops to 761k/s. For the problem diff the rate starts at 822k/s and drops to 617k/s. So the difference is larger when there are more files (at test end).


ltamasi commented Jan 4, 2022

The PR in question does not affect the application traffic directly; the runtime cost it adds is essentially one hash operation per file per background job (flush/compaction). Do you see any write stalls in your tests? If yes, what kind?
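To make the cost concrete, here is a minimal, self-contained sketch (not RocksDB source; the names FileLocation and file_locations_ follow the discussion later in this thread, the rest is assumed) of that per-file bookkeeping: one hash-map insert per file when a background job builds a new version, plus the lookup used by the consistency checks.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct FileLocation {
  int level;
  size_t position_in_level;
};

class VersionStorageInfoSketch {
 public:
  explicit VersionStorageInfoSketch(int num_levels) : files_(num_levels) {}

  // Called once per file when a background job (flush/compaction) builds a new version.
  void AddFile(int level, uint64_t file_number) {
    files_[level].push_back(file_number);
    // The extra work discussed in this issue: one hash insert per file.
    file_locations_.emplace(file_number,
                            FileLocation{level, files_[level].size() - 1});
  }

  // Used by the consistency checks to find where a file lives in the LSM tree.
  const FileLocation* GetFileLocation(uint64_t file_number) const {
    auto it = file_locations_.find(file_number);
    return it == file_locations_.end() ? nullptr : &it->second;
  }

 private:
  std::vector<std::vector<uint64_t>> files_;                   // per-level file lists
  std::unordered_map<uint64_t, FileLocation> file_locations_;  // file number -> location
};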


mdcallag commented Jan 4, 2022

Code appears to be on the user's Put path because CPU/insert is up, throughput is down and stalls are down.

From the test end for the diff immediately prior to this one:
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 964 memtable_slowdown, interval 2 total count

And from this diff:
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 389 memtable_slowdown, interval 0 total count


ltamasi commented Jan 4, 2022

Code appears to be on the user's Put path

VersionStorageInfo is the in-memory structure that contains the metadata for the LSM tree. It is definitely not updated on the write path.

You mentioned stall counts are down; what about total stall time?


ltamasi commented Jan 4, 2022

Looking at your configuration options, max_background_jobs seems very low (especially since it's a box with many cores). Setting it to 8 only allows 8 / 4 = 2 concurrent flushes, even though you have max_write_buffer_number set to 8.
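As a rough illustration of that split (a sketch of the 1:4 ratio described above, not RocksDB's actual scheduling code):

#include <algorithm>
#include <cstdio>

int main() {
  const int max_background_jobs = 8;
  // Roughly one quarter of the job slots go to flushes, the rest to compactions.
  const int max_flushes = std::max(1, max_background_jobs / 4);
  const int max_compactions = max_background_jobs - max_flushes;
  std::printf("max_background_jobs=%d -> flushes=%d, compactions=%d\n",
              max_background_jobs, max_flushes, max_compactions);
  return 0;
}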


mdcallag commented Jan 4, 2022

For cumulative stall:
before this diff: Cumulative stall: 00:00:45.513 H:M:S, 4.6 percent
with this diff: Cumulative stall: 00:00:18.934 H:M:S, 1.7 percent

For max_background_jobs, the server has 36 HW threads and 18 cores. 8 background jobs get almost half of the CPU cores when CPU bound, which isn't very low. If RocksDB hardwires the number of background flushes to jobs/4, then that sounds like a feature that went missing when the code was changed to remove max_background_flushes.

Regardless, the slowdown doesn't appear to be from write stalls.


ltamasi commented Jan 4, 2022

If RocksDB hardwires the number of background flushes to jobs/4, then that sounds like a feature that went missing when the code was changed to remove max_background_flushes.

max_background_flushes and max_background_compactions are still supported. Could you try using them instead of max_background_jobs? I usually set max_background_flushes to max_write_buffer_number - 1 in order to have enough concurrency for all the immutable memtables.
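For reference, a minimal sketch of setting those limits directly (these are real RocksDB option names; the specific values just mirror the suggestion above):

#include <rocksdb/options.h>

rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  options.max_write_buffer_number = 8;
  // Enough flush concurrency for all the immutable memtables, per the suggestion above.
  options.max_background_flushes = options.max_write_buffer_number - 1;
  options.max_background_compactions = 8;
  return options;
}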


siying commented Jan 5, 2022

It looks like @ltamasi is correct that #6862 isn't O(num_total_files) but O(num_new_files), so it's puzzling how this can introduce 13% more CPU.


mdcallag commented Jan 5, 2022

Ran the test for v6.26.1 with max_jobs=16. That did not help and performance was similar to max_jobs=8.


mdcallag commented Jan 5, 2022

The root cause appears to be that this diff increases the amount of time the global mutex is held by background threads, so the user thread waits longer trying to lock that mutex when doing writes. I ran a test that inserts 1B KV pairs via fillseq with PMP sampling every ~60 seconds, using this diff (e3f953a) and the one immediately prior to it (bcefc59), and counted the number of times the user thread was blocked in _lll_lock_wait (pthread mutex internals) in PipelinedWriteImpl -- there were 3 for bcefc59 vs 5 for e3f953a.

When the user thread blocks on _lll_lock_wait, it is here (see "line 477" below)

Status DBImpl::PipelinedWriteImpl(const WriteOptions& write_options,
                                  WriteBatch* my_batch, WriteCallback* callback,
                                  uint64_t* log_used, uint64_t log_ref,
                                  bool disable_memtable, uint64_t* seq_used) {
  PERF_TIMER_GUARD(write_pre_and_post_process_time);
  StopWatch write_sw(env_, immutable_db_options_.statistics.get(), DB_WRITE);

  WriteContext write_context;

  WriteThread::Writer w(write_options, my_batch, callback, log_ref,
                        disable_memtable);
  write_thread_.JoinBatchGroup(&w);
  if (w.state == WriteThread::STATE_GROUP_LEADER) {
    WriteThread::WriteGroup wal_write_group;
    if (w.callback && !w.callback->AllowWriteBatching()) {
      write_thread_.WaitForMemTableWriters();
    }
    mutex_.Lock();                                                                         <-- line 477
    bool need_log_sync = !write_options.disableWAL && write_options.sync;
    bool need_log_dir_sync = need_log_sync && !log_dir_synced_;
    // PreprocessWrite does its own perf timing.
    PERF_TIMER_STOP(write_pre_and_post_process_time);
    w.status = PreprocessWrite(write_options, &need_log_sync, &write_context);
    PERF_TIMER_START(write_pre_and_post_process_time);
    log::Writer* log_writer = logs_.back().writer;
    mutex_.Unlock();

And this is the flattened PMP stack trace for the blocked user thread:

__lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone

And these are the instances, with the filenames I used to save them -- 3 for bcefc59, and for e3f953a the 5 blocked in _lll_lock_wait plus 1 still in pthread_mutex_lock:

1551.bcefc59/benchmark_fillseq.wal_disabled.v400.log.h.220105_082909:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone
1551.bcefc59/benchmark_fillseq.wal_disabled.v400.log.h.220105_083032:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone
1551.bcefc59/benchmark_fillseq.wal_disabled.v400.log.h.220105_083725:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone

1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.h.220105_084926:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone
1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.h.220105_085048:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone
1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.h.220105_085904:      1 pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone
1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.h.220105_090027:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone
1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.h.220105_090720:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone
1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.h.220105_090843:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone

Finally, the stack traces for the thread that I suspect holds the global mutex for e3f953a: I list 6 samples here (and above), 5 where the user thread is blocked in _lll_lock_wait and one where it is still in pthread_mutex_lock. The traces are in a gist.

And the 3 stack traces for bcefc59 are in a gist.

@pdillinger

Some browsing suggests that the (eventual) users of file_locations_ are just doing sanity checks. Would it make sense to condition population and uses of file_locations_ on force_consistency_checks?
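A minimal sketch of that idea (hypothetical, simplified types; not actual RocksDB code): the hash is only populated when force_consistency_checks is on, so workloads that don't need the checks skip the per-file insert.

#include <cstdint>
#include <unordered_map>

struct FileLocationIndexSketch {
  bool force_consistency_checks = false;
  std::unordered_map<uint64_t, int> file_locations;  // file number -> level

  void OnFileAdded(uint64_t file_number, int level) {
    // ... the file is appended to the per-level list as usual ...
    if (force_consistency_checks) {
      // Only pay the hash-insert cost when the checks that need it are enabled.
      file_locations.emplace(file_number, level);
    }
  }
};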


mdcallag commented Jan 5, 2022

Flamegraphs are here. GitHub didn't allow me to attach them directly. The 1550 directory is for e3f953a and the 1551 directory is for bcefc59.

I was unable to make differential flamegraphs, but comparing them side-by-side I see that TryInstallMemtableFlushResults takes more time with e3f953a. Also, one of the stack traces for e3f953a shows the Compaction destructor running longer, due to the deallocation it must do for the STL container holding file_locations_, and I think the global mutex was held during that.


siying commented Jan 5, 2022

It looks like most of the work in flush/compaction threads that can't be parallelized is in VersionBuilder::Rep::SaveTo(), and a good share of it is hash table operations. If we can move them out of the DB mutex, it's possible that the throughput can come back.
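A sketch of that general shape (not the actual fix, just the pattern of moving the hash building out from under the DB mutex): build the table without the lock held, then only the cheap install happens under the lock. All names here are stand-ins.

#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

std::mutex db_mutex;                                  // stand-in for the DB mutex
std::unordered_map<uint64_t, int> current_locations;  // stand-in for the per-version hash

void InstallNewVersion(const std::vector<std::pair<uint64_t, int>>& new_files) {
  // 1) The expensive part: build the hash without holding the DB mutex.
  std::unordered_map<uint64_t, int> staged;
  staged.reserve(new_files.size());
  for (const auto& entry : new_files) {
    staged.emplace(entry.first, entry.second);
  }

  // 2) Only the cheap ownership transfer happens while the mutex is held.
  std::lock_guard<std::mutex> guard(db_mutex);
  current_locations = std::move(staged);
}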


mdcallag commented Jan 5, 2022

I repeated the repro inserting 1.6B KV pairs rather than 800M, with PMP samples every ~60 seconds. For bcefc59 there were 3 samples with DoWrite stuck on _lll_lock_wait. With e3f953a there were 11. So I am reasonably confident that long hold times for the global DB mutex are the issue.


ltamasi commented Jan 5, 2022

Thanks for the detailed results @mdcallag.

Similarly to #9354 , it was a conscious decision to make the consistency checks which rely on this hash structure mandatory. For most workloads, the overhead is negligible; it seems fillseq with vector memtable and no WAL is the exception where it makes a difference (given enough SST files). Since the checks in question have already caught multiple data correctness bugs, I would be extremely reluctant to change them. Having said that, I will look into how we can reduce the overhead.

mdcallag added the performance label and removed the bug label Jan 10, 2022

ltamasi commented Jan 24, 2022

Looking at the code, there is one simple optimization we can do to reduce the overhead. We can preallocate the space for the hash, which should eliminate the costly reallocation/rehashing as the hash expands. Will put up a patch for this in the next couple of days. I expect this to significantly reduce the performance overhead of the consistency checks but in case it doesn't, we can also look into moving the version saving logic out from under the mutex lock, although I would expect that to be a much more involved change.
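A tiny sketch of the preallocation idea (container type assumed): reserving buckets up front replaces the repeated rehashing that otherwise happens as files are inserted one by one.

#include <cstdint>
#include <unordered_map>

std::unordered_map<uint64_t, int> BuildFileLocations(size_t expected_num_files) {
  std::unordered_map<uint64_t, int> file_locations;
  // One up-front allocation instead of multiple rehashes while the map grows.
  file_locations.reserve(expected_num_files);
  // ... inserts of (file number -> location) go here ...
  return file_locations;
}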


ltamasi commented Jan 25, 2022

we can also look into moving the version saving logic out from under the mutex lock, although I would expect that to be a much more involved change.

... actually, it might not be that involved after all. Planning to post a patch in the next few days.


siying commented May 23, 2022

we can also look into moving the version saving logic out from under the mutex lock, although I would expect that to be a much more involved change.

... actually, it might not be that involved after all. Planning to post a patch in the next few days.

@ltamasi any update on whether "it might not be that involved after all" turned out to be true?


ltamasi commented May 23, 2022

Yeah :) this was fixed in #9504 in February.


siying commented May 24, 2022

moving the version saving logic out from under the mutex lock

How about "moving the version saving logic out from under the mutex lock"?


ltamasi commented May 24, 2022

The PR I mentioned moved the hash building part out (which is what was causing the regression and what I had in mind). Not sure how easy it would be to do more of the version saving outside the mutex.


siying commented May 24, 2022

The PR I mentioned moved the hash building part out (which is what was causing the regression and what I had in mind). Not sure how easy it would be to do more of the version saving outside the mutex.

Thanks for the clarification.
