Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support concurrent CF iteration and drop #6147

Closed
wants to merge 1 commit into from

Conversation

javeme
Copy link
Contributor

@javeme javeme commented Dec 10, 2019

It's easy to cause coredump when closing ColumnFamilyHandle with unreleased iterators, especially iterators release is controlled by java GC when using JNI.

This patch fixed concurrent CF iteration and drop, we let iterators(actually SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being released too early.

fixed #5982

It's easy to cause coredump when closing ColumnFamilyHandle with unreleased
iterators, especially iterators release is controlled by java GC when using JNI.

This patch fixed concurrent CF iteration and drop, we let iterators(actually
SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being
released too early.

fixed facebook#5982
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution!

I took a quick look and it seems to be pretty good to me. I'll take a detailed look and maybe run stress test for some while. If you have time, I suggest you run our crash test. If you have time, I suggest you run it too. It's simply "make crash_test -j" and "make crash_test_with_atomic_flush -j". They will run ours.

@javeme
Copy link
Contributor Author

javeme commented Dec 12, 2019

@siying Thanks for your review.

I tried make crash_test -j, but got an error on Mac: "Please install gflags to run rocksdb", I have installed gflags.

$ brew install gflags
Updating Homebrew...
==> Auto-updated Homebrew!
Updated 2 taps (homebrew/cask and homebrew/core).
==> New Formulae
fastlane

Warning: gflags 2.2.2 is already installed and up-to-date
To reinstall 2.2.2, run `brew reinstall gflags`

$ make crash_test -j 2
$DEBUG_LEVEL is 1
Makefile:168: Warning: Compiling in debug mode. Don't use the resulting binary in production
  GEN      util/build_version.cc
python -u tools/db_crashtest.py --simple whitebox --random_kill_odd \
      888887 
python -u tools/db_crashtest.py --simple blackbox 
Running blackbox-crash-test with 
interval_between_crash=120
total-duration=6000

Running whitebox-crash-test with 
total-duration=10000

Running:./db_stress --periodic_compaction_seconds=0 --max_background_compactions=1 --expected_values_path=/var/folders/fl/gc6hkww94db7sfk79_lnv9xjtdxht2/T/tmpog7U9u --format_version=4 --partition_filters=1 --max_write_buffer_number=3 --cache_index_and_filter_blocks=1 --reopen=20 --acquire_snapshot_one_in=10000 --delpercent=4 --log2_keys_per_lock=10 --compaction_ttl=0 --block_size=16384 --use_multiget=1 --allow_concurrent_memtable_write=0 --compact_files_one_in=1000000 --target_file_size_multiplier=1 --clear_column_family_one_in=0 --max_bytes_for_level_base=67108864 --use_full_merge_v1=1 --target_file_size_base=16777216 --checkpoint_one_in=1000000 --mmap_read=0 --compression_type=snappy --writepercent=35 --index_type=2 --readpercent=50 --pause_background_one_in=1000000 --subcompactions=1 --memtablerep=skip_list --prefix_size=-1 --test_batches_snapshots=0 --column_families=1 --db=/var/folders/fl/gc6hkww94db7sfk79_lnv9xjtdxht2/T/rocksdb_crashtest_whitebox5PPfwJ --use_direct_reads=0 --use_block_based_filter=0 --max_manifest_file_size=1073741824 --delrangepercent=1 --compact_range_one_in=1000000 --open_files=500000 --destroy_db_initially=0 --progress_reports=0 --compression_zstd_max_train_bytes=0 --snapshot_hold_ops=100000 --enable_pipelined_write=1 --nooverwritepercent=1 --compression_max_dict_bytes=0 --max_key=100000000 --prefixpercent=0 --use_merge=1 --flush_one_in=1000000 --ops_per_thread=20000000 --index_block_restart_interval=5 --kill_random_test=888887 --cache_size=1048576 --recycle_log_file_num=1 --use_direct_io_for_flush_and_compaction=0 --verify_checksum=1 --bloom_bits=4 --write_buffer_size=33554432

Running db_stress with pid=92750: ./db_stress --periodic_compaction_seconds=100 --max_background_compactions=1 --expected_values_path=/var/folders/fl/gc6hkww94db7sfk79_lnv9xjtdxht2/T/tmp6ZoVgi --format_version=3 --partition_filters=0 --max_write_buffer_number=3 --verify_checksum=1 --cache_index_and_filter_blocks=1 --reopen=20 --acquire_snapshot_one_in=10000 --delpercent=4 --compaction_ttl=0 --block_size=16384 --use_multiget=1 --allow_concurrent_memtable_write=1 --compact_files_one_in=1000000 --target_file_size_multiplier=1 --clear_column_family_one_in=0 --max_bytes_for_level_base=67108864 --use_full_merge_v1=1 --target_file_size_base=16777216 --checkpoint_one_in=1000000 --mmap_read=0 --compression_type=snappy --writepercent=35 --index_type=2 --readpercent=50 --pause_background_one_in=1000000 --subcompactions=1 --ops_per_thread=100000000 --memtablerep=skip_list --set_options_one_in=0 --prefix_size=-1 --test_batches_snapshots=0 --column_families=1 --db=/var/folders/fl/gc6hkww94db7sfk79_lnv9xjtdxht2/T/rocksdb_crashtest_blackboxN3ukOv --use_direct_reads=0 --max_manifest_file_size=1073741824 --compact_range_one_in=1000000 --open_files=-1 --destroy_db_initially=0 --progress_reports=0 --compression_zstd_max_train_bytes=0 --snapshot_hold_ops=100000 --enable_pipelined_write=0 --nooverwritepercent=1 --compression_max_dict_bytes=16384 --max_key=100000000 --prefixpercent=0 --flush_one_in=1000000 --delrangepercent=1 --index_block_restart_interval=6 --cache_size=1048576 --recycle_log_file_num=1 --use_direct_io_for_flush_and_compaction=0 --use_merge=0 --bloom_bits=1.27621733141 --write_buffer_size=33554432


check_mode=0, kill option=888887, exitcode=1

Please install gflags to run rocksdb tools

TEST FAILED. See kill option and exit code above!!!

make: *** [whitebox_crash_test] Error 1
make: *** Waiting for unfinished jobs....
WARNING: db_stress ended before kill: exitcode=1

stderr has error message:
***Please install gflags to run rocksdb tools***
make: *** [blackbox_crash_test] Error 2

@siying
Copy link
Contributor

siying commented Dec 12, 2019

@javeme never mind. I ran it on your behave and it passed. Going forward, to run db_stress, you need to install gflags. I found the easiest way to get a compatible gflag is to download gflag from github, make it and install. Maybe next time:)

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@siying
Copy link
Contributor

siying commented Dec 13, 2019

I realized that we need to add a unit test for it to cover the case we want to cover.

Since the time difference, to speed it up, I plan to merge it.
Since it's my bad that forgot to give the comment about adding unit test now, I actually wrote one:

diff --git a/db/column_family_test.cc b/db/column_family_test.cc
index 95c43ac5a..15c90c758 100644
--- a/db/column_family_test.cc
+++ b/db/column_family_test.cc
@@ -2365,6 +2365,45 @@ TEST_P(ColumnFamilyTest, ReadDroppedColumnFamily) {
   }
 }

+TEST_P(ColumnFamilyTest, LiveIteratorWithDroppedColumnFamily) {
+  db_options_.create_missing_column_families = true;
+  db_options_.max_open_files = 20;
+  // delete obsolete files always
+  db_options_.delete_obsolete_files_period_micros = 0;
+  Open({"default", "one", "two"});
+  ColumnFamilyOptions options;
+  options.level0_file_num_compaction_trigger = 100;
+  options.level0_slowdown_writes_trigger = 200;
+  options.level0_stop_writes_trigger = 200;
+  options.write_buffer_size = 100000;  // small write buffer size
+  Reopen({options, options, options});
+
+  // 1MB should create ~10 files for each CF
+  int kKeysNum = 10000;
+  PutRandomData(1, kKeysNum, 100);
+
+  {
+    std::unique_ptr<Iterator> iterator(
+        db_->NewIterator(ReadOptions(), handles_[1]));
+    iterator->SeekToFirst();
+
+    DropColumnFamilies({1});
+
+    // Make sure iterator created can still be used.
+    int count = 0;
+    for (; iterator->Valid(); iterator->Next()) {
+      ASSERT_OK(iterator->status());
+      ++count;
+    }
+    ASSERT_OK(iterator->status());
+    ASSERT_EQ(count, kKeysNum);
+  }
+
+  Reopen();
+  Close();
+  Destroy();
+}
+
 TEST_P(ColumnFamilyTest, FlushAndDropRaceCondition) {
   db_options_.create_missing_column_families = true;
   Open({"default", "one"});

I ran it and it passed with or without ASAN.

@siying
Copy link
Contributor

siying commented Dec 13, 2019

You can choose to send a separate PR for unit tests after I merge the PR.

@siying
Copy link
Contributor

siying commented Dec 13, 2019

Also, update HISTORY.md to mention this new improvement.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in c2029f9.

1 similar comment
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in c2029f9.

@javeme
Copy link
Contributor Author

javeme commented Dec 13, 2019

@siying Thanks very much. I'm glad to add unit tests for both c++ and java, I will submit it after a while.

javeme added a commit to javeme/rocksdb that referenced this pull request Dec 14, 2019
@javeme
Copy link
Contributor Author

javeme commented Dec 14, 2019

You can choose to send a separate PR for unit tests after I merge the PR.
Also, update HISTORY.md to mention this new improvement.

Done

facebook-github-bot pushed a commit that referenced this pull request Dec 18, 2019
Summary:
improve #6147
Pull Request resolved: #6180

Differential Revision: D19148936

fbshipit-source-id: f691c9879fd51d54e96c1a99670cf85ca4485a89
wolfkdy pushed a commit to wolfkdy/rocksdb that referenced this pull request Dec 22, 2019
Summary:
It's easy to cause coredump when closing ColumnFamilyHandle with unreleased iterators, especially iterators release is controlled by java GC when using JNI.

This patch fixed concurrent CF iteration and drop, we let iterators(actually SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being released too early.

fixed facebook#5982
Pull Request resolved: facebook#6147

Differential Revision: D18926378

fbshipit-source-id: 1dff6d068c603d012b81446812368bfee95a5e15
wolfkdy pushed a commit to wolfkdy/rocksdb that referenced this pull request Dec 22, 2019
Summary:
improve facebook#6147
Pull Request resolved: facebook#6180

Differential Revision: D19148936

fbshipit-source-id: f691c9879fd51d54e96c1a99670cf85ca4485a89
facebook-github-bot pushed a commit that referenced this pull request Aug 3, 2021
Summary:
The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
the per-thread `SuperVersion` pointers when the only remaining reference
is the back reference from `super_version_`. The idea behind this was to
break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
when the penultimate reference goes away, `ColumnFamilyData` can clean up
the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
is a `SuperVersion` and it is not referenced by anything else.) However,
unlocking the mutex throws a wrench in this plan by making it possible for another thread
to jump in and take another reference to the `ColumnFamilyData`, keeping the
object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
#8440 ,
#8382 ,
and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
destructor we sometimes observe during close in our stress tests.

Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
`SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
#3510 fixed the deadlock but left the
unlocking in place. #6147 then introduced
the circular dependency and associated cleanup logic described above (in order
to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
Finally, #7749 fixed a memory leak but
apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
in `SuperVersion::Cleanup`.

The patch simply eliminates the unlocking and relocking, which has been unnecessary
ever since #3510 made `SuperVersionUnrefHandle` lock-free.
This closes the window during which another thread could increase the reference count,
and hopefully fixes the issues above.

Pull Request resolved: #8605

Test Plan: Ran `make check` and stress tests locally.

Reviewed By: pdillinger

Differential Revision: D30051035

Pulled By: ltamasi

fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528
ltamasi added a commit that referenced this pull request Aug 4, 2021
Summary:
The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
the per-thread `SuperVersion` pointers when the only remaining reference
is the back reference from `super_version_`. The idea behind this was to
break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
when the penultimate reference goes away, `ColumnFamilyData` can clean up
the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
is a `SuperVersion` and it is not referenced by anything else.) However,
unlocking the mutex throws a wrench in this plan by making it possible for another thread
to jump in and take another reference to the `ColumnFamilyData`, keeping the
object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
#8440 ,
#8382 ,
and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
destructor we sometimes observe during close in our stress tests.

Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
`SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
#3510 fixed the deadlock but left the
unlocking in place. #6147 then introduced
the circular dependency and associated cleanup logic described above (in order
to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
Finally, #7749 fixed a memory leak but
apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
in `SuperVersion::Cleanup`.

The patch simply eliminates the unlocking and relocking, which has been unnecessary
ever since #3510 made `SuperVersionUnrefHandle` lock-free.
This closes the window during which another thread could increase the reference count,
and hopefully fixes the issues above.

Pull Request resolved: #8605

Test Plan: Ran `make check` and stress tests locally.

Reviewed By: pdillinger

Differential Revision: D30051035

Pulled By: ltamasi

fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

crash caused by concurrent CF iterations and drops
3 participants