
Do not hold mutex when write keys if not necessary #7516

Closed
wants to merge 55 commits

Conversation

Little-Wallace
Contributor

@Little-Wallace commented on Oct 7, 2020

Problem Summary

RocksDB acquires the global mutex of the DB instance every time the user calls Write. When RocksDB schedules many compaction jobs, they compete with the write thread for the mutex, which hurts write performance.

Problem Solution

I want to use log_write_mutex to replace the global mutex in most cases, so that we do not acquire the global mutex in the write thread unless a write-stall or write-buffer-full event occurs.
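
To illustrate the intended split, here is a minimal sketch under assumptions (not the actual DBImpl code; only the name log_write_mutex comes from the description above, everything else is hypothetical):

```cpp
#include <mutex>

// Hypothetical sketch of the locking split: the common write path only
// touches a narrower log/write mutex, while the global DB mutex is taken
// only on the slow path (write stall, full write buffer).
class WritePathSketch {
 public:
  void Write(/* WriteBatch, WriteOptions, ... */) {
    if (NeedStallOrMemtableSwitch()) {
      // Slow path: coordination with flush/compaction still uses the
      // global DB mutex, as before.
      std::lock_guard<std::mutex> db_lock(db_mutex_);
      HandleWriteStallOrSwitchMemtable();
    }
    // Fast path: WAL and memtable bookkeeping is guarded by its own mutex,
    // so writers no longer compete with background compaction scheduling
    // that holds the global DB mutex.
    std::lock_guard<std::mutex> log_lock(log_write_mutex_);
    AppendToWalAndMemtable();
  }

 private:
  bool NeedStallOrMemtableSwitch() { return false; }  // placeholder check
  void HandleWriteStallOrSwitchMemtable() {}           // placeholder
  void AppendToWalAndMemtable() {}                     // placeholder

  std::mutex db_mutex_;         // the global DB mutex (hot before this PR)
  std::mutex log_write_mutex_;  // narrower mutex for the WAL write path
};
```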

Test plan

  1. make check
  2. CI
  3. COMPILE_WITH_TSAN=1 make db_stress
    make crash_test
    make crash_test_with_multiops_wp_txn
    make crash_test_with_multiops_wc_txn
    make crash_test_with_atomic_flush

@Little-Wallace changed the title from "Do not hold mutex when write keys." to "Do not hold mutex when write keys if not necessary" on Oct 7, 2020
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
@Little-Wallace
Contributor Author

@yiwu-arbug PTAL again

@Little-Wallace
Contributor Author

I have fixed the CI failures caused by a data race and a deadlock.

Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
@riversand963 self-requested a review on May 20, 2022 18:18
@riversand963
Contributor

Thanks for the PR. I plan to take a look this week.

@riversand963
Contributor

Thanks @Little-Wallace for the PR! Overall I think this is an improvement.
I will need to take another look. In the meantime, I have proposed #10078 to remove single_column_family_mode_, so that we don't have to make it lock-free.

facebook-github-bot pushed a commit that referenced this pull request May 31, 2022
Summary:
This variable is actually not being used for anything meaningful, thus remove it.

This can make #7516 slightly simpler by reducing the amount of state that must be made lock-free.

Pull Request resolved: #10078

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D36779817

Pulled By: riversand963

fbshipit-source-id: ffb0d9ad6149616917ae5e02bb28102cb90fc406
@riversand963
Contributor

I tried db_bench with simple benchmarks, e.g. fillrandom, overwrite, fillseq, and have not been able to see perf improvement in terms of ops/sec (kind of expected). I will try more tweaking later.

Based on my experience, you need sustained high write throughput (with a large WriteBatch) over a long period and a small target_file_size_base (such as 8MB). When the number of SST files grows to around one hundred thousand, you can observe the slowdown.

Thanks for the suggestion. Will try later.
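
For reference, a run along the lines of that suggestion might look like the following db_bench invocation (illustrative only; the flag values are assumptions, not taken from this thread):

TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=overwrite -duration=1800 -batch_size=100 -target_file_size_base=8388608 -perf_level=5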

@facebook-github-bot
Contributor

@Little-Wallace has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@Little-Wallace has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@riversand963
Contributor

The good thing is that I haven't observed a perf regression so far.

@riversand963 changed the title from "Do not hold mutex when write keys if not necessary" to "[1st attempt] Do not hold mutex when write keys if not necessary" on Jul 21, 2022
@riversand963 changed the title from "[1st attempt] Do not hold mutex when write keys if not necessary" to "Do not hold mutex when write keys if not necessary" on Jul 21, 2022
@tabokie
Contributor

tabokie commented Jul 21, 2022

@riversand963 One suggestion for the benchmark is to use a smaller target_file_size_base: in TiKV we use a compaction guard, which typically cuts SST files into 10MB chunks, much smaller than RocksDB's 64MB default. The higher file count is why we hit the mutex hard. Here's an issue from a production cluster with the same problem: tikv/tikv#12601.

Edit: Oops, apparently Little-Wallace already made this point 🤣 ignore me.

@riversand963
Contributor

riversand963 commented Jul 21, 2022

While showing an end-to-end performance gain requires more effort, it's easy to show that the time spent holding the db mutex has drastically decreased. One of the updated unit tests in this PR, perf_context_test, has an assertion.
Before this PR,

ASSERT_GT(total_db_mutex_nanos, 2000U);

After this PR,

ASSERT_LT(total_db_mutex_nanos, 100U);

I did another simple benchmark on a non-VM host.

TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillseq,overwrite -duration=60 -batch_size=100 -perf_level=5

Results show

db_mutex_lock_nanos = 564408021 (before)
db_mutex_lock_nanos = 11142 (after)
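
The same counter can also be read programmatically through the public perf-context API; here is a minimal sketch (the DB path and option values are placeholders):

```cpp
#include <cassert>
#include <iostream>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/perf_context.h"
#include "rocksdb/perf_level.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/perf_ctx_demo", &db);
  assert(s.ok());

  // PerfLevel::kEnableTime (perf_level=5 in db_bench) is required to time
  // mutex acquisitions; lower levels skip mutex instrumentation.
  rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableTime);
  rocksdb::get_perf_context()->Reset();

  for (int i = 0; i < 1000; ++i) {
    s = db->Put(rocksdb::WriteOptions(), "key" + std::to_string(i), "value");
    assert(s.ok());
  }

  // Nanoseconds the write path spent acquiring the global DB mutex.
  std::cout << "db_mutex_lock_nanos = "
            << rocksdb::get_perf_context()->db_mutex_lock_nanos << std::endl;

  delete db;
  return 0;
}
```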

@facebook-github-bot
Contributor

@Little-Wallace has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

tabokie pushed a commit to tabokie/rocksdb that referenced this pull request Jul 22, 2022
Summary:
This variable is actually not being used for anything meaningful, thus remove it.

This can make facebook#7516 slightly simpler by reducing the amount of state that must be made lock-free.

Pull Request resolved: facebook#10078

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D36779817

Pulled By: riversand963

fbshipit-source-id: ffb0d9ad6149616917ae5e02bb28102cb90fc406
Signed-off-by: tabokie <xy.tao@outlook.com>
tabokie pushed a commit to tabokie/rocksdb that referenced this pull request Jul 25, 2022
…#10187)

Summary:
Resolves facebook#10129

I extracted this fix from facebook#7516 since it's already a bug in the main branch, and we want to separate it from the main part of the PR.

There can be a race condition between two threads. Thread 1 executes
`DBImpl::FindObsoleteFiles()` while thread 2 executes `GetSortedWals()`.
```
Time   thread 1                                thread 2
  |  mutex_.lock
  |  read disable_delete_obsolete_files_
  |  ...
  |  wait on log_sync_cv and release mutex_
  |                                          mutex_.lock
  |                                          ++disable_delete_obsolete_files_
  |                                          mutex_.unlock
  |                                          mutex_.lock
  |                                          while (pending_purge_obsolete_files > 0) { bg_cv.wait;}
  |                                          wake up with mutex_ locked
  |                                          compute WALs tracked by MANIFEST
  |                                          mutex_.unlock
  |  wake up with mutex_ locked
  |  ++pending_purge_obsolete_files_
  |  mutex_.unlock
  |
  |  delete obsolete WAL
  |                                          WAL missing but tracked in MANIFEST.
  V
```

The proposed fix eliminates this possibility by incrementing `pending_purge_obsolete_files_` before `FindObsoleteFiles()` can release the mutex.

Pull Request resolved: facebook#10187

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D37214235

Pulled By: riversand963

fbshipit-source-id: 556ab1b58ae6d19150169dfac4db08195c797184
Signed-off-by: tabokie <xy.tao@outlook.com>
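
To illustrate the ordering fix described in the commit message above, here is a minimal, hypothetical sketch using plain std::mutex/std::condition_variable (not the actual RocksDB implementation):

```cpp
#include <condition_variable>
#include <mutex>

std::mutex mu;
std::condition_variable bg_cv;
std::condition_variable log_sync_cv;
bool log_syncing = false;              // protected by mu
int pending_purge_obsolete_files = 0;  // protected by mu

void FindObsoleteFilesSketch() {
  std::unique_lock<std::mutex> lk(mu);
  // Fix: announce the pending purge *before* any wait that can release mu,
  // so a concurrent reader that checks the counter under the same mutex
  // blocks instead of racing past.
  ++pending_purge_obsolete_files;
  // This wait releases mu; previously the increment happened only after it.
  log_sync_cv.wait(lk, [] { return !log_syncing; });
  // ... collect obsolete WALs under mu, unlock, delete the files ...
  --pending_purge_obsolete_files;
  bg_cv.notify_all();
}

void GetSortedWalsSketch() {
  std::unique_lock<std::mutex> lk(mu);
  // Now sees the purge in flight and waits for it to finish before
  // computing the WALs tracked by the MANIFEST.
  bg_cv.wait(lk, [] { return pending_purge_obsolete_files == 0; });
  // ... compute the list of WALs tracked by the MANIFEST ...
}
```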
tabokie pushed a commit to tabokie/rocksdb that referenced this pull request Jul 25, 2022
Summary:
RocksDB acquires the global mutex of the DB instance every time the user calls `Write`. When RocksDB schedules many compaction jobs, they compete with the write thread for the mutex, which hurts write performance.

I want to use log_write_mutex to replace the global mutex in most cases, so that we do not acquire it in the write thread unless a write-stall or write-buffer-full event occurs.

Pull Request resolved: facebook#7516

Test Plan:
1. make check
2. CI
3. COMPILE_WITH_TSAN=1 make db_stress
make crash_test
make crash_test_with_multiops_wp_txn
make crash_test_with_multiops_wc_txn
make crash_test_with_atomic_flush

Reviewed By: siying

Differential Revision: D36908702

Pulled By: riversand963

fbshipit-source-id: 59b13881f4f5c0a58fd3ca79128a396d9cd98efe
Signed-off-by: tabokie <xy.tao@outlook.com>
@mdcallag
Contributor

I see large improvements with fillseq (leveled, universal) and overwrite (universal) in IO-bound workloads -- up to 1.5X more throughput. Thank you for making RocksDB better.

https://twitter.com/MarkCallaghanDB/status/1574425353564475394
