
c4db_endTransaction sometimes results in invalid memory access #776

Closed
borrrden opened this issue May 10, 2019 · 8 comments


@borrrden
Member

commented May 10, 2019

Noticing this during debug testing of the master branch for .NET (post-Iridium). Sometimes a call to c4db_endTransaction results in an invalid memory access, probably due to a garbage (already-deleted) object. Here is the stack trace:

LiteCore.dll!c4Internal::Database::externalTransactionCommitted(const litecore::SequenceTracker & sourceTracker) Line 525
LiteCore.dll!c4Internal::Database::_cleanupTransaction::__l8::(litecore::DataFile * other) Line 514
LiteCore.dll!fleece::function_ref<void __cdecl(litecore::DataFile * __ptr64)>::callback_fn<void (litecore::DataFile *) >(__int64 callable, litecore::DataFile * <params_0>) Line 61
LiteCore.dll!fleece::function_ref<void __cdecl(litecore::DataFile * __ptr64)>::operator()(litecore::DataFile * <params_0>) Line 73
LiteCore.dll!litecore::DataFile::Shared::forOpenDataFiles(litecore::DataFile * except, fleece::function_ref<void __cdecl(litecore::DataFile *)> fn) Line 88
LiteCore.dll!litecore::DataFile::forOtherDataFiles(fleece::function_ref<void __cdecl(litecore::DataFile *)> fn) Line 161
LiteCore.dll!c4Internal::Database::_cleanupTransaction(bool committed) Line 516
LiteCore.dll!c4Internal::Database::endTransaction(bool commit) Line 502

The final LiteCore line of code that is executed is:

lock_guard<mutex> lock(_sequenceTracker->mutex());

Examining _sequenceTracker shows that it is full of garbage values. Another thread is destroying a DBAccess object at the same time (note that the Database destructor has already been entered):

LiteCore.dll!litecore::DataFile::Shared::removeDataFile(litecore::DataFile * dataFile) Line 72
LiteCore.dll!litecore::DataFile::close() Line 136
LiteCore.dll!litecore::SQLiteDataFile::~SQLiteDataFile() Line 175
LiteCore.dll!c4Internal::Database::~Database() Line 187
LiteCore.dll!c4Database::~c4Database() Line 51
LiteCore.dll!fleece::RefCounted::_careful_release() Line 87
LiteCore.dll!fleece::RefCounted::_release() Line 48
LiteCore.dll!fleece::release(c4Database * r) Line 80
LiteCore.dll!c4db_free(c4Database * database) Line 142
LiteCore.dll!litecore::repl::DBAccess::{dtor}::__l5::(c4Database * idb) Line 76
LiteCore.dll!litecore::access_lock<c4Database * __ptr64>::use<void (c4Database *) >(litecore::repl::DBAccess::{dtor}::__l5::void (c4Database *) callback) Line 25
LiteCore.dll!litecore::repl::DBAccess::~DBAccess() Line 74
LiteCore.dll!litecore::repl::Worker::~Worker() Line 108

This thread is on this line:

unique_lock<mutex> lock(_mutex);

Here is an image of all the threads that were running and had LiteCore frames in their backtrace at the time of the crash, with some annotations:

[thread_dump: annotated screenshot of the running threads]

@snej

Member

commented May 10, 2019

What LiteCore commit is this? I fixed a SequenceTracker bug recently but this doesn't sound like the same one.

@snej

Member

commented May 10, 2019

When a Database commits a transaction, it notifies all other Database instances on the same file of the changes it made. Those Databases probably aren't being called on their regular threads, so some locking is needed; looks like there's a race condition if this happens while a Database is being closed.

@borrrden

Member Author

commented May 10, 2019

FYI this was run on commit f6b85a9

@borrrden

Member Author

commented May 17, 2019

The fix is causing a deadlock on Jenkins so reopening until that is resolved.

@borrrden borrrden reopened this May 17, 2019

@borrrden

Member Author

commented May 17, 2019

Was able to reproduce the deadlock locally using a release build, but it doesn't make much sense yet. Here are the two waiting threads (note that this only occurs on Apple platforms):

thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff66dc0e76 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff66e78d6c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 96
    frame #2: 0x00007fff66e764b7 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 226
    frame #3: 0x00007fff64354b29 libc++.1.dylib`std::__1::mutex::lock() + 9
    frame #4: 0x00000001045cee4e CppTests`void fleece::function_ref<void (litecore::DataFile*)>::callback_fn<c4Internal::Database::_cleanupTransaction(bool)::$_1>(long, litecore::DataFile*) [inlined] std::__1::lock_guard<std::__1::mutex>::lock_guard(this=<unavailable>, __m=0x00007fecd050d218) at __mutex_base:104:27 [opt]
    frame #5: 0x00000001045cee46 CppTests`void fleece::function_ref<void (litecore::DataFile*)>::callback_fn<c4Internal::Database::_cleanupTransaction(bool)::$_1>(long, litecore::DataFile*) [inlined] std::__1::lock_guard<std::__1::mutex>::lock_guard(this=<unavailable>, __m=0x00007fecd050d218) at __mutex_base:104 [opt]
    frame #6: 0x00000001045cee46 CppTests`void fleece::function_ref<void (litecore::DataFile*)>::callback_fn<c4Internal::Database::_cleanupTransaction(bool)::$_1>(long, litecore::DataFile*) at Database.cc:532 [opt]
    frame #7: 0x00000001045cee29 CppTests`void fleece::function_ref<void (litecore::DataFile*)>::callback_fn<c4Internal::Database::_cleanupTransaction(bool)::$_1>(long, litecore::DataFile*) [inlined] c4Internal::Database::_cleanupTransaction(this=<unavailable>, other=<unavailable>)::$_1::operator()(litecore::DataFile*) const at Database.cc:520 [opt]
    frame #8: 0x00000001045cee14 CppTests`void fleece::function_ref<void (litecore::DataFile*)>::callback_fn<c4Internal::Database::_cleanupTransaction(callable=<unavailable>, params=<unavailable>)::$_1>(long, litecore::DataFile*) at function_ref.hh:59 [opt]
    frame #9: 0x00000001045fc7ce CppTests`litecore::DataFile::forOtherDataFiles(fleece::function_ref<void (litecore::DataFile*)>) [inlined] fleece::function_ref<void (litecore::DataFile*)>::operator(this=<unavailable>, params=<unavailable>)(litecore::DataFile*) const at function_ref.hh:72:12 [opt]
    frame #10: 0x00000001045fc7c8 CppTests`litecore::DataFile::forOtherDataFiles(fleece::function_ref<void (litecore::DataFile*)>) at DataFile+Shared.hh:88 [opt]
    frame #11: 0x00000001045fc79b CppTests`litecore::DataFile::forOtherDataFiles(this=0x00007fecd0507fe0, fn=function_ref<void (litecore::DataFile *)> @ 0x00007fcc7b98dcb0)>) at DataFile.cc:160 [opt]
    frame #12: 0x00000001045cdf51 CppTests`c4Internal::Database::_cleanupTransaction(this=0x00007fecd050ba20, committed=true) at Database.cc:518:28 [opt]
    frame #13: 0x0000000104739e41 CppTests`c4Internal::tryCatch(C4Error*, fleece::function_ref<void ()>) [inlined] fleece::function_ref<void ()>::operator(this=<unavailable>)() const at function_ref.hh:72:12 [opt]
    frame #14: 0x0000000104739e3c CppTests`c4Internal::tryCatch(error=0x00007ffeeb7a3590, fn=<unavailable>)>) at c4ExceptionUtils.cc:35 [opt]
    frame #15: 0x0000000104736472 CppTests`::c4db_endTransaction(database=<unavailable>, commit=<unavailable>, outError=<unavailable>) at c4Database.cc:219:12 [opt]
    frame #16: 0x00000001045b665a CppTests`C4Test::createFleeceRev(c4Database*, FLSlice, FLSlice, FLSlice, unsigned char) at c4Test.hh:122:13 [opt]
    frame #17: 0x00000001045b6644 CppTests`C4Test::createFleeceRev(c4Database*, FLSlice, FLSlice, FLSlice, unsigned char) [inlined] TransactionHelper::~TransactionHelper(this=<unavailable>) at c4Test.hh:119 [opt]
    frame #18: 0x00000001045b6644 CppTests`C4Test::createFleeceRev(db=0x00007fecd050ba20, docID=(buf = 0x00000001048435b5, size = 8), revID=<unavailable>, json=(buf = 0x00000001048435e8, size = 8), flags='\0') at c4Test.cc:423 [opt]
    frame #19: 0x000000010459597a CppTests`(anonymous namespace)::____C_A_T_C_H____T_E_S_T____947::test(this=0x00007ffeeb7a3a00) at ReplicatorLoopbackTest.cc:959:5 [opt]
    frame #20: 0x00000001045afdf5 CppTests`Catch::MethodTestCase<(anonymous namespace)::____C_A_T_C_H____T_E_S_T____947>::invoke(this=0x00007feccde1d3e0) const at catch.hpp:650:9 [opt]
    frame #21: 0x000000010451481e CppTests`Catch::RunContext::invokeActiveTestCase() [inlined] Catch::TestCase::invoke(this=<unavailable>) const at catch.hpp:7519:15 [opt]
    frame #22: 0x0000000104514811 CppTests`Catch::RunContext::invokeActiveTestCase(this=0x00007ffeeb7a4400) at catch.hpp:6158 [opt]
    frame #23: 0x0000000104514254 CppTests`Catch::RunContext::runCurrentTest(this=0x00007ffeeb7a4400, redirectedCout="", redirectedCerr="") at catch.hpp:6129:21 [opt]
    frame #24: 0x00000001044fccb5 CppTests`Catch::RunContext::runTest(this=0x00007ffeeb7a4400, testCase=<unavailable>) at catch.hpp:5949:21 [opt]
    frame #25: 0x00000001044fc491 CppTests`Catch::runTests(config=<unavailable>) at catch.hpp:6297:35 [opt]
    frame #26: 0x000000010451d674 CppTests`Catch::Session::run(this=<unavailable>) at catch.hpp:6405:42 [opt]
    frame #27: 0x000000010450a768 CppTests`main [inlined] Catch::Session::run(this=0x0000000104873308, argc=<unavailable>, argv=<unavailable>) at catch.hpp:6384:30 [opt]
    frame #28: 0x000000010450a744 CppTests`main(argc=<unavailable>, argv=<unavailable>) at catch.hpp:10333 [opt]
    frame #29: 0x00007fff66c84ed9 libdyld.dylib`start + 1
    frame #30: 0x00007fff66c84ed9 libdyld.dylib`start + 1
thread #4, queue = 'Push<-ws://cli/'
    frame #0: 0x00007fff66dc0e76 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff66e78d6c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 96
    frame #2: 0x00007fff66e764b7 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 226
    frame #3: 0x00007fff64354b29 libc++.1.dylib`std::__1::mutex::lock() + 9
    frame #4: 0x00000001045fc3a3 CppTests`litecore::DataFile::Shared::removeDataFile(litecore::DataFile*) [inlined] std::__1::unique_lock<std::__1::mutex>::unique_lock(this=<unavailable>, __m=0x00007fecd051b1c0) at __mutex_base:131:61 [opt]
    frame #5: 0x00000001045fc39b CppTests`litecore::DataFile::Shared::removeDataFile(litecore::DataFile*) [inlined] std::__1::unique_lock<std::__1::mutex>::unique_lock(this=<unavailable>, __m=0x00007fecd051b1c0) at __mutex_base:131 [opt]
    frame #6: 0x00000001045fc39b CppTests`litecore::DataFile::Shared::removeDataFile(this=0x00007fecd051b0c0, dataFile=0x00007fecd050e400) at DataFile+Shared.hh:72 [opt]
    frame #7: 0x00000001045fc60a CppTests`litecore::DataFile::close(this=0x00007fecd050e400) at DataFile.cc:136:22 [opt]
    frame #8: 0x00000001045cbf16 CppTests`c4Internal::Database::~Database(this=0x00007fecd0514930) at Database.cc:190:24 [opt]
    frame #9: 0x0000000104735c43 CppTests`c4Database::~c4Database() [inlined] c4Database::~c4Database(this=0x00007fecd0514930) at c4Database.cc:50:1 [opt]
    frame #10: 0x0000000104735c1a CppTests`c4Database::~c4Database() [inlined] c4Database::~c4Database(this=0x00007fecd0514930) at c4Database.cc:48 [opt]
    frame #11: 0x0000000104735c1a CppTests`c4Database::~c4Database(this=0x00007fecd0514930) at c4Database.cc:48 [opt]
    frame #12: 0x000000010474cff1 CppTests`litecore::repl::DBWorker::~DBWorker() [inlined] c4::ref<c4Database>::freeRef(c=<unavailable>) at c4.hh:54:61 [opt]
    frame #13: 0x000000010474cfec CppTests`litecore::repl::DBWorker::~DBWorker() [inlined] c4::ref<c4Database>::~ref(this=<unavailable>) at c4.hh:43 [opt]
    frame #14: 0x000000010474cfe5 CppTests`litecore::repl::DBWorker::~DBWorker() [inlined] c4::ref<c4Database>::~ref(this=<unavailable>) at c4.hh:43 [opt]
    frame #15: 0x000000010474cfe5 CppTests`litecore::repl::DBWorker::~DBWorker(this=0x00007feccdd14d80) at DBWorker.hh:37 [opt]
    frame #16: 0x000000010474cc9e CppTests`litecore::repl::DBWorker::~DBWorker() [inlined] litecore::repl::DBWorker::~DBWorker(this=0x00007feccdd14d80) at DBWorker.hh:37:11 [opt]
    frame #17: 0x000000010474cc99 CppTests`litecore::repl::DBWorker::~DBWorker(this=0x00007feccdd14d80) at DBWorker.hh:37 [opt]
    frame #18: 0x000000010475bc53 CppTests`litecore::repl::Pusher::~Pusher() [inlined] fleece::RefCounted::_release(this=<unavailable>) at RefCounted.hh:51:73 [opt]
    frame #19: 0x000000010475bc47 CppTests`litecore::repl::Pusher::~Pusher() [inlined] void fleece::release<litecore::repl::DBWorker>(r=<unavailable>) at RefCounted.hh:79 [opt]
    frame #20: 0x000000010475bc42 CppTests`litecore::repl::Pusher::~Pusher() [inlined] fleece::Retained<litecore::repl::DBWorker>::~Retained(this=<unavailable>) at RefCounted.hh:99 [opt]
    frame #21: 0x000000010475bc3b CppTests`litecore::repl::Pusher::~Pusher() [inlined] fleece::Retained<litecore::repl::DBWorker>::~Retained(this=<unavailable>) at RefCounted.hh:99 [opt]
    frame #22: 0x000000010475bc3b CppTests`litecore::repl::Pusher::~Pusher(this=0x00007feccdd029a0) at Pusher.hh:32 [opt]
    frame #23: 0x000000010475b2ae CppTests`litecore::repl::Pusher::~Pusher() [inlined] litecore::repl::Pusher::~Pusher(this=0x00007feccdd029a0) at Pusher.hh:32:11 [opt]
    frame #24: 0x000000010475b2a9 CppTests`litecore::repl::Pusher::~Pusher(this=0x00007feccdd029a0) at Pusher.hh:32 [opt]
    frame #25: 0x00007fff66c35d53 libdispatch.dylib`_dispatch_call_block_and_release + 12
    frame #26: 0x00007fff66c36dcf libdispatch.dylib`_dispatch_client_callout + 8
    frame #27: 0x00007fff66c3d124 libdispatch.dylib`_dispatch_lane_serial_drain + 618
    frame #28: 0x00007fff66c3dbdc libdispatch.dylib`_dispatch_lane_invoke + 388
    frame #29: 0x00007fff66c46090 libdispatch.dylib`_dispatch_workloop_worker_thread + 603
    frame #30: 0x00007fff66e7760b libsystem_pthread.dylib`_pthread_wqthread + 409
    frame #31: 0x00007fff66e77405 libsystem_pthread.dylib`start_wqthread + 13

Two other threads are running but seem inconsequential:

thread #2, name = 'LiteCore Timer'
    frame #0: 0x00007fff66dc17de libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff66e7b593 libsystem_pthread.dylib`_pthread_cond_wait + 724
    frame #2: 0x00007fff64359bda libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
    frame #3: 0x00000001047b98fb CppTests`litecore::actor::Timer::Manager::run(this=0x00007feccdc22a30) at Timer.cc:51:28 [opt]
    frame #4: 0x00000001047b9f4c CppTests`litecore::actor::Timer::Manager::Manager(this=<unavailable>)::$_0::operator()() const at Timer.cc:33:24 [opt]
    frame #5: 0x00000001047b9f39 CppTests`decltype(__f=<unavailable>)::$_0>(fp)()) std::__1::__invoke<litecore::actor::Timer::Manager::Manager()::$_0>(litecore::actor::Timer::Manager::Manager()::$_0&&) at type_traits:4339:1 [opt]
    frame #6: 0x00000001047b9eed CppTests`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, litecore::actor::Timer::Manager::Manager()::$_0> >(void*) [inlined] void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, litecore::actor::Timer::Manager::Manager()::$_0>(__t=<unavailable>)::$_0>&, std::__1::__tuple_indices<>) at thread:342:5 [opt]
    frame #7: 0x00000001047b9ee4 CppTests`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, litecore::actor::Timer::Manager::Manager()::$_0> >(__vp=0x00007feccdc20df0) at thread:352 [opt]
    frame #8: 0x00007fff66e78305 libsystem_pthread.dylib`_pthread_body + 126
    frame #9: 0x00007fff66e7b26f libsystem_pthread.dylib`_pthread_start + 70
    frame #10: 0x00007fff66e77415 libsystem_pthread.dylib`thread_start + 13
thread #3
    frame #0: 0x00007fff66dbfb6a libsystem_kernel.dylib`__workq_kernreturn + 10
    frame #1: 0x00007fff66e77630 libsystem_pthread.dylib`_pthread_wqthread + 446
    frame #2: 0x00007fff66e77405 libsystem_pthread.dylib`start_wqthread + 13

EDIT: 100% reproducible for me running CppTests "[Push][Conflict]"

@borrrden

Member Author

commented May 17, 2019

Alright, this is fairly convoluted, but here is what happens:

1. [Thread 1] The Pusher is destroyed, which results in ~Database() acquiring its sequence tracker mutex (A) and then yielding.
2. [Thread 2] Creates a document in another instance of the same DB file, resulting in a c4db_endTransaction call that takes its own sequence tracker mutex (B), then takes the shared file mutex (C) via forOtherDataFiles, and finally blocks in externalTransactionCommitted on the other database's sequence tracker mutex (A), because step 1 took it.
3. [Thread 1] Resumes into removeDataFile and blocks on the shared file mutex (C) that was taken in step 2.

@borrrden

Member Author

commented May 17, 2019

What if, as an alternative solution, we simply reversed the order of operations here, so that the data file is removed from shared first and then closed, not the other way around:

void DataFile::close() {
    for (auto& i : _keyStores) {
        i.second->close();
    }
    _close();
    if (_shared->removeDataFile(this))
        logInfo("Closing database");
}

@snej

Member

commented May 17, 2019

I think reversing the order would cause trouble when deleting a database. The implementation waits until it's the only connection before proceeding, and it does this by checking with _shared whether there are other connections. So the change would create a race where one thread could delete the db before the other thread finished closing.

borrrden added a commit that referenced this issue May 18, 2019

Alternative solution for #776
Instead of trying to lock, use another method that indicates that a data file is in the process of closing

borrrden added a commit that referenced this issue May 21, 2019

Alternative solution for close vs transaction race (#785)
* Alternative solution for #776

Instead of trying to lock, use another method that indicates that a data file is in the process of closing

@borrrden borrrden closed this in d55410d May 21, 2019
