
fillseq throughput is 13% slower from PR 6862 #9351

Closed
mdcallag opened this issue Jan 3, 2022 · 23 comments
Labels: performance, regression


mdcallag commented Jan 3, 2022

fillseq throughput is 13% slower after PR 6862 with git hash e3f953a. The problem is new CPU overhead (user, not system). The diff landed in v6.11.

The test server is a spare UDB host (many cores, fast SSD) and fillseq throughput drops from ~800k/s to ~700k/s with this diff. One example of throughput by version (6.0 to 6.22) is here. There is also a regression in 6.14 whose cause I am still searching for.

The problem is new CPU overhead that shows up as more user CPU time, as measured via /bin/time db_bench ...
The test takes ~1000 seconds and uses ~3000 seconds of user CPU time prior to this diff vs ~3500 seconds at this diff.

I am not sure whether this depends more on the number of files or concurrency because it doesn't show up as a problem on IO-bound or CPU-bound configs on a small server, nor does it show up on a CPU-bound config on this server. The repro here is what I call an IO-bound config and has ~20X more data (and files) than the CPU-bound config.

I don't see an increase in the context switch rate, so mutex contention does not appear to be a problem.

The command line is:
/usr/bin/time -f '%e %U %S' -o bm.lc.nt16.cm1.d0/1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.time numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1641213884

mdcallag added the bug and regression labels Jan 3, 2022

mdcallag commented Jan 3, 2022

Using smaller SST files to get "enough" files with a smaller database might help repro this faster.


mdcallag commented Jan 3, 2022

This shows the insert rate every 20 seconds.

For the diff immediately prior to the PR 6862 diff the rate starts at 840k/s and drops to 761k/s. For the problem diff the rate starts at 822k/s and drops to 617k/s. So the difference is larger when there are more files (at test end).


ltamasi commented Jan 4, 2022

The PR in question does not affect the application traffic directly; the runtime cost it adds is essentially one hash operation per file per background job (flush/compaction). Do you see any write stalls in your tests? If yes, what kind?
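To make the cost concrete, here is a minimal, self-contained sketch (not RocksDB source; the names FileLocation and file_locations_ follow the discussion later in this thread, the rest is assumed) of that per-file bookkeeping: one hash-map insert per file when a background job builds a new version, plus the lookup used by the consistency checks.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct FileLocation {
  int level;
  size_t position_in_level;
};

class VersionStorageInfoSketch {
 public:
  explicit VersionStorageInfoSketch(int num_levels) : files_(num_levels) {}

  // Called once per file when a background job (flush/compaction) builds a new version.
  void AddFile(int level, uint64_t file_number) {
    files_[level].push_back(file_number);
    // The extra work discussed in this issue: one hash insert per file.
    file_locations_.emplace(file_number,
                            FileLocation{level, files_[level].size() - 1});
  }

  // Used by the consistency checks to find where a file lives in the LSM tree.
  const FileLocation* GetFileLocation(uint64_t file_number) const {
    auto it = file_locations_.find(file_number);
    return it == file_locations_.end() ? nullptr : &it->second;
  }

 private:
  std::vector<std::vector<uint64_t>> files_;                   // per-level file lists
  std::unordered_map<uint64_t, FileLocation> file_locations_;  // file number -> location
};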


mdcallag commented Jan 4, 2022

Code appears to be on the user's Put path because CPU/insert is up, throughput is down and stalls are down.

From the test end for the diff immediately prior to this one:
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 964 memtable_slowdown, interval 2 total count

And from this diff:
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 389 memtable_slowdown, interval 0 total count


ltamasi commented Jan 4, 2022

Code appears to be on the user's Put path

VersionStorageInfo is the in-memory structure that contains the metadata for the LSM tree. It is definitely not updated on the write path.

You mentioned stall counts are down; what about total stall time?


ltamasi commented Jan 4, 2022

Looking at your configuration options, max_background_jobs seems very low (especially since it's a box with many cores). Setting it to 8 only allows 8 / 4 = 2 concurrent flushes, even though you have max_write_buffer_number set to 8.
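As a rough illustration of that split (a sketch of the 1:4 ratio described above, not RocksDB's actual scheduling code):

#include <algorithm>
#include <cstdio>

int main() {
  const int max_background_jobs = 8;
  // Roughly one quarter of the job slots go to flushes, the rest to compactions.
  const int max_flushes = std::max(1, max_background_jobs / 4);
  const int max_compactions = max_background_jobs - max_flushes;
  std::printf("max_background_jobs=%d -> flushes=%d, compactions=%d\n",
              max_background_jobs, max_flushes, max_compactions);
  return 0;
}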


mdcallag commented Jan 4, 2022

For cumulative stall:
before this diff: Cumulative stall: 00:00:45.513 H:M:S, 4.6 percent
with this diff: Cumulative stall: 00:00:18.934 H:M:S, 1.7 percent

For max_background_jobs, the server has 36 HW threads and 18 cores. 8 background jobs get almost half of the CPU cores when CPU bound, which isn't very low. If RocksDB hardwires the number of background flushes to jobs/4, then that sounds like a feature that went missing when the code was changed to remove max_background_flushes.

Regardless, the slowdown doesn't appear to be from write stalls.


ltamasi commented Jan 4, 2022

If RocksDB hardwires the number of background flushes to jobs/4, then that sounds like a feature that went missing when the code was changed to remove max_background_flushes.

max_background_flushes and max_background_compactions are still supported. Could you try using them instead of max_background_jobs? I usually set max_background_flushes to max_write_buffer_number - 1 in order to have enough concurrency for all the immutable memtables.
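For reference, a minimal sketch of setting those limits directly (these are real RocksDB option names; the specific values just mirror the suggestion above):

#include <rocksdb/options.h>

rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  options.max_write_buffer_number = 8;
  // Enough flush concurrency for all the immutable memtables, per the suggestion above.
  options.max_background_flushes = options.max_write_buffer_number - 1;
  options.max_background_compactions = 8;
  return options;
}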


siying commented Jan 5, 2022

It looks like @ltamasi is correct that #6862 isn't O(num_total_files) but O(num_new_files), so it's puzzling how this can introduce 13% more CPU.


mdcallag commented Jan 5, 2022

Ran the test for v6.26.1 with max_jobs=16. That did not help and performance was similar to max_jobs=8.


mdcallag commented Jan 5, 2022

The root cause appears to be that this diff increases the amount of time the global mutex is held by background threads, so the user thread waits longer trying to lock that mutex when doing writes. I ran a test that inserts 1B KV pairs via fillseq with PMP sampling every ~60 seconds, using this diff (e3f953a) and the one immediately prior to it (bcefc59), and counted the number of times the user thread was blocked in _lll_lock_wait (pthread mutex internals) in PipelinedWriteImpl -- there were 3 for bcefc59 vs 5 for e3f953a.

When the user thread blocks on _lll_lock_wait, it is here (see "line 477" below)

Status DBImpl::PipelinedWriteImpl(const WriteOptions& write_options,
                                  WriteBatch* my_batch, WriteCallback* callback,
                                  uint64_t* log_used, uint64_t log_ref,
                                  bool disable_memtable, uint64_t* seq_used) {
  PERF_TIMER_GUARD(write_pre_and_post_process_time);
  StopWatch write_sw(env_, immutable_db_options_.statistics.get(), DB_WRITE);

  WriteContext write_context;

  WriteThread::Writer w(write_options, my_batch, callback, log_ref,
                        disable_memtable);
  write_thread_.JoinBatchGroup(&w);
  if (w.state == WriteThread::STATE_GROUP_LEADER) {
    WriteThread::WriteGroup wal_write_group;
    if (w.callback && !w.callback->AllowWriteBatching()) {
      write_thread_.WaitForMemTableWriters();
    }
    mutex_.Lock();                                                                         <-- line 477
    bool need_log_sync = !write_options.disableWAL && write_options.sync;
    bool need_log_dir_sync = need_log_sync && !log_dir_synced_;
    // PreprocessWrite does its own perf timing.
    PERF_TIMER_STOP(write_pre_and_post_process_time);
    w.status = PreprocessWrite(write_options, &need_log_sync, &write_context);
    PERF_TIMER_START(write_pre_and_post_process_time);
    log::Writer* log_writer = logs_.back().writer;
    mutex_.Unlock();

And this is the flattened PMP stack trace for the blocked user thread:

__lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone

And these are the instances, with the filenames I used to save them -- 3 for bcefc59, and for e3f953a the 5 blocked in _lll_lock_wait plus 1 still in pthread_mutex_lock:

1551.bcefc59/benchmark_fillseq.wal_disabled.v400.log.h.220105_082909:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone
1551.bcefc59/benchmark_fillseq.wal_disabled.v400.log.h.220105_083032:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone
1551.bcefc59/benchmark_fillseq.wal_disabled.v400.log.h.220105_083725:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone

1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.h.220105_084926:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone
1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.h.220105_085048:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone
1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.h.220105_085904:      1 pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone
1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.h.220105_090027:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone
1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.h.220105_090720:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone
1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.h.220105_090843:      1 __lll_lock_wait,pthread_mutex_lock,rocksdb::port::Mutex::Lock,rocksdb::InstrumentedMutex::LockInternal,rocksdb::InstrumentedMutex::Lock,rocksdb::DBImpl::PipelinedWriteImpl,rocksdb::DBImpl::WriteImpl,rocksdb::DBImpl::Write,rocksdb::Benchmark::DoWrite,rocksdb::Benchmark::ThreadBody,rocksdb::(anonymous,start_thread,clone

Finally, the stack traces for the thread that I suspect holds the global mutex for e3f953a: I list 6 samples here (and above), 5 where the user thread is blocked in _lll_lock_wait and one where it is still in pthread_mutex_lock. The traces are in a gist.

And the 3 stack traces for bcefc59 are in a gist.

@pdillinger

Some browsing suggests that the (eventual) users of file_locations_ are just doing sanity checks. Would it make sense to condition population and uses of file_locations_ on force_consistency_checks?
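A minimal sketch of that idea (hypothetical, simplified types; not actual RocksDB code): the hash is only populated when force_consistency_checks is on, so workloads that don't need the checks skip the per-file insert.

#include <cstdint>
#include <unordered_map>

struct FileLocationIndexSketch {
  bool force_consistency_checks = false;
  std::unordered_map<uint64_t, int> file_locations;  // file number -> level

  void OnFileAdded(uint64_t file_number, int level) {
    // ... the file is appended to the per-level list as usual ...
    if (force_consistency_checks) {
      // Only pay the hash-insert cost when the checks that need it are enabled.
      file_locations.emplace(file_number, level);
    }
  }
};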


mdcallag commented Jan 5, 2022

Flamegraphs are here. GitHub didn't allow me to attach them directly. The 1550 directory is for e3f953a and the 1551 directory is for bcefc59.

I was unable to make differential flamegraphs, but comparing them side-by-side I see that TryInstallMemtableFlushResults takes more time with e3f953a. Also, one of the stack traces for e3f953a shows the Compaction destructor running longer, due to the deallocation it must do for the STL container holding file_locations_, and I think the global mutex was held during that.


siying commented Jan 5, 2022

It looks like most of the work in flush/compaction threads that can't be parallelized is in VersionBuilder::Rep::SaveTo(), and a good share of it is hash table operations. If we can move them out of the DB mutex, it's possible that the throughput can come back.
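A sketch of that general shape (not the actual fix, just the pattern of moving the hash building out from under the DB mutex): build the table without the lock held, then only the cheap install happens under the lock. All names here are stand-ins.

#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

std::mutex db_mutex;                                  // stand-in for the DB mutex
std::unordered_map<uint64_t, int> current_locations;  // stand-in for the per-version hash

void InstallNewVersion(const std::vector<std::pair<uint64_t, int>>& new_files) {
  // 1) The expensive part: build the hash without holding the DB mutex.
  std::unordered_map<uint64_t, int> staged;
  staged.reserve(new_files.size());
  for (const auto& entry : new_files) {
    staged.emplace(entry.first, entry.second);
  }

  // 2) Only the cheap ownership transfer happens while the mutex is held.
  std::lock_guard<std::mutex> guard(db_mutex);
  current_locations = std::move(staged);
}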


mdcallag commented Jan 5, 2022

I repeated the repro inserting 1.6B KV pairs rather than 800M, with PMP samples every ~60 seconds. For bcefc59 there were 3 samples with DoWrite stuck on _lll_lock_wait. With e3f953a there were 11. So I am reasonably confident that long hold times for the global DB mutex are the issue.


ltamasi commented Jan 5, 2022

Thanks for the detailed results @mdcallag.

Similarly to #9354 , it was a conscious decision to make the consistency checks which rely on this hash structure mandatory. For most workloads, the overhead is negligible; it seems fillseq with vector memtable and no WAL is the exception where it makes a difference (given enough SST files). Since the checks in question have already caught multiple data correctness bugs, I would be extremely reluctant to change them. Having said that, I will look into how we can reduce the overhead.

mdcallag added the performance label and removed the bug label Jan 10, 2022

ltamasi commented Jan 24, 2022

Looking at the code, there is one simple optimization we can do to reduce the overhead. We can preallocate the space for the hash, which should eliminate the costly reallocation/rehashing as the hash expands. Will put up a patch for this in the next couple of days. I expect this to significantly reduce the performance overhead of the consistency checks but in case it doesn't, we can also look into moving the version saving logic out from under the mutex lock, although I would expect that to be a much more involved change.
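A tiny sketch of the preallocation idea (container type assumed): reserving buckets up front replaces the repeated rehashing that otherwise happens as files are inserted one by one.

#include <cstdint>
#include <unordered_map>

std::unordered_map<uint64_t, int> BuildFileLocations(size_t expected_num_files) {
  std::unordered_map<uint64_t, int> file_locations;
  // One up-front allocation instead of multiple rehashes while the map grows.
  file_locations.reserve(expected_num_files);
  // ... inserts of (file number -> location) go here ...
  return file_locations;
}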


ltamasi commented Jan 25, 2022

we can also look into moving the version saving logic out from under the mutex lock, although I would expect that to be a much more involved change.

... actually, it might not be that involved after all. Planning to post a patch in the next few days.


siying commented May 23, 2022

we can also look into moving the version saving logic out from under the mutex lock, although I would expect that to be a much more involved change.

... actually, it might not be that involved after all. Planning to post a patch in the next few days.

@ltamasi any update on whether "it might not be that involved after all" turned out to be true?


ltamasi commented May 23, 2022

Yeah :) this was fixed in #9504 in February.


siying commented May 24, 2022

moving the version saving logic out from under the mutex lock

How about "moving the version saving logic out from under the mutex lock"?


ltamasi commented May 24, 2022

The PR I mentioned moved the hash building part out (which is what was causing the regression and what I had in mind). Not sure how easy it would be to do more of the version saving outside the mutex.


siying commented May 24, 2022

The PR I mentioned moved the hash building part out (which is what was causing the regression and what I had in mind). Not sure how easy it would be to do more of the version saving outside the mutex.

Thanks for the clarification.
