
"mutex lock failed" during initial pull replication of large # of docs #813

Open
Grimal opened this issue Jul 31, 2019 · 16 comments

@Grimal commented Jul 31, 2019

This seems like it may be related to #652

CouchbaseLite/2.5.2 (Swift; iOS 12.2; iPad) Build/3 Commit/e613ef5b LiteCore/2.5.2 (3)

On the initial launch of a freshly installed copy of the client app, a large continuous pull replication is performed. Logging indicates the exception is reliably thrown after pulling the same document with multiple revisions. Subsequent launches of the app do not throw the exception and the replication completes; however, the checkpoint never advances. Further delete/reinstall/replicate cycles always result in the same exception and the same checkpoint. No warnings or errors are logged on the Sync Gateway or in the client.

Output of thread backtrace all:

>   thread #1, queue = 'com.apple.main-thread'
>     frame #0: 0x00000001103aa22a libsystem_kernel.dylib`mach_msg_trap + 10
>     frame #1: 0x00000001103aa76c libsystem_kernel.dylib`mach_msg + 60
>     frame #2: 0x000000010ec81684 CoreFoundation`__CFRunLoopServiceMachPort + 212
>     frame #3: 0x000000010ec7bcc9 CoreFoundation`__CFRunLoopRun + 1657
>     frame #4: 0x000000010ec7b302 CoreFoundation`CFRunLoopRunSpecific + 626
>     frame #5: 0x00000001150632fe GraphicsServices`GSEventRunModal + 65
>     frame #6: 0x00000001193f6ba2 UIKitCore`UIApplicationMain + 140
>     frame #7: 0x000000010af8ba2b Live`main at AppDelegate.swift:12:7
>     frame #8: 0x000000011008d541 libdyld.dylib`start + 1
>   thread #3
>     frame #0: 0x00000001103abbfe libsystem_kernel.dylib`__workq_kernreturn + 10
>     frame #1: 0x0000000110400636 libsystem_pthread.dylib`_pthread_wqthread + 458
>     frame #2: 0x00000001104003fd libsystem_pthread.dylib`start_wqthread + 13
>   thread #4
>     frame #0: 0x00000001103abbfe libsystem_kernel.dylib`__workq_kernreturn + 10
>     frame #1: 0x0000000110400636 libsystem_pthread.dylib`_pthread_wqthread + 458
>     frame #2: 0x00000001104003fd libsystem_pthread.dylib`start_wqthread + 13
>   thread #5, name = 'com.apple.uikit.eventfetch-thread'
>     frame #0: 0x00000001103aa22a libsystem_kernel.dylib`mach_msg_trap + 10
>     frame #1: 0x00000001103aa76c libsystem_kernel.dylib`mach_msg + 60
>     frame #2: 0x000000010ec81684 CoreFoundation`__CFRunLoopServiceMachPort + 212
>     frame #3: 0x000000010ec7bcc9 CoreFoundation`__CFRunLoopRun + 1657
>     frame #4: 0x000000010ec7b302 CoreFoundation`CFRunLoopRunSpecific + 626
>     frame #5: 0x000000010ba8c044 Foundation`-[NSRunLoop(NSRunLoop) runMode:beforeDate:] + 277
>     frame #6: 0x000000010ba8c246 Foundation`-[NSRunLoop(NSRunLoop) runUntilDate:] + 79
>     frame #7: 0x00000001194e7417 UIKitCore`-[UIEventFetcher threadMain] + 118
>     frame #8: 0x000000010baa1307 Foundation`__NSThread__start__ + 1197
>     frame #9: 0x00000001104012eb libsystem_pthread.dylib`_pthread_body + 126
>     frame #10: 0x0000000110404249 libsystem_pthread.dylib`_pthread_start + 66
>     frame #11: 0x000000011040040d libsystem_pthread.dylib`thread_start + 13
>   thread #10, name = 'JavaScriptCore bmalloc scavenger'
>     frame #0: 0x00000001103ad86a libsystem_kernel.dylib`__psynch_cvwait + 10
>     frame #1: 0x000000011040456e libsystem_pthread.dylib`_pthread_cond_wait + 722
>     frame #2: 0x000000010b4d7da1 libc++.1.dylib`std::__1::condition_variable::__do_timed_wait(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >) + 93
>     frame #3: 0x000000011b21daa5 JavaScriptCore`std::__1::cv_status std::__1::condition_variable::wait_until<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > > const&) + 117
>     frame #4: 0x000000011b21d96f JavaScriptCore`std::__1::cv_status std::__1::condition_variable_any::wait_until<std::__1::unique_lock<bmalloc::Mutex>, std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >(std::__1::unique_lock<bmalloc::Mutex>&, std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > > const&) + 95
>     frame #5: 0x000000011b21c675 JavaScriptCore`bmalloc::Scavenger::threadRunLoop() + 261
>     frame #6: 0x000000011b21bdb9 JavaScriptCore`bmalloc::Scavenger::threadEntryPoint(bmalloc::Scavenger*) + 9
>     frame #7: 0x000000011b21d757 JavaScriptCore`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(bmalloc::Scavenger*), bmalloc::Scavenger*> >(void*) + 39
>     frame #8: 0x00000001104012eb libsystem_pthread.dylib`_pthread_body + 126
>     frame #9: 0x0000000110404249 libsystem_pthread.dylib`_pthread_start + 66
>     frame #10: 0x000000011040040d libsystem_pthread.dylib`thread_start + 13
>   thread #11, name = 'LiteCore Timer'
>     frame #0: 0x00000001103ad86a libsystem_kernel.dylib`__psynch_cvwait + 10
>     frame #1: 0x000000011040456e libsystem_pthread.dylib`_pthread_cond_wait + 722
>     frame #2: 0x000000010b4d7da1 libc++.1.dylib`std::__1::condition_variable::__do_timed_wait(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >) + 93
>     frame #3: 0x000000010b83b0a3 CouchbaseLiteSwift`litecore::actor::Timer::Manager::run() [inlined] std::__1::cv_status std::__1::condition_variable::wait_for<long long, std::__1::ratio<1l, 1000000000l> >(this=<unavailable>, __lk=0x00006000027be2f8, __d=<unavailable>) at __mutex_base:0:9 [opt]
>     frame #4: 0x000000010b83b08d CouchbaseLiteSwift`litecore::actor::Timer::Manager::run() [inlined] std::__1::cv_status std::__1::condition_variable::wait_until<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >(this=<unavailable>, __lk=0x00006000027be2f8) at __mutex_base:384 [opt]
>     frame #5: 0x000000010b83b08d CouchbaseLiteSwift`litecore::actor::Timer::Manager::run(this=0x00006000027be2e0) at Timer.cc:73 [opt]
>     frame #6: 0x000000010b83af89 CouchbaseLiteSwift`litecore::actor::Timer::Manager::Manager(this=<unavailable>)::$_0::operator()() const at Timer.cc:33:24 [opt]
>     frame #7: 0x000000010b83af79 CouchbaseLiteSwift`decltype(__f=<unavailable>)::$_0>(fp)()) std::__1::__invoke<litecore::actor::Timer::Manager::Manager()::$_0>(litecore::actor::Timer::Manager::Manager()::$_0&&) at type_traits:4339:1 [opt]
>     frame #8: 0x000000010b83af2d CouchbaseLiteSwift`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, litecore::actor::Timer::Manager::Manager()::$_0> >(void*) [inlined] void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, litecore::actor::Timer::Manager::Manager()::$_0>(__t=<unavailable>)::$_0>&, std::__1::__tuple_indices<>) at thread:342:5 [opt]
>     frame #9: 0x000000010b83af28 CouchbaseLiteSwift`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, litecore::actor::Timer::Manager::Manager()::$_0> >(__vp=0x0000600001780e70) at thread:352 [opt]
>     frame #10: 0x00000001104012eb libsystem_pthread.dylib`_pthread_body + 126
>     frame #11: 0x0000000110404249 libsystem_pthread.dylib`_pthread_start + 66
>     frame #12: 0x000000011040040d libsystem_pthread.dylib`thread_start + 13
>   thread #12, queue = 'Repl->wss://sg.mydomain.com:4984/mydb/_blipsync'
>     frame #0: 0x00000001103aeb0e libsystem_kernel.dylib`__psynch_rw_wrlock + 10
>     frame #1: 0x0000000110402dd0 libsystem_pthread.dylib`_pthread_rwlock_lock_wait + 75
>     frame #2: 0x0000000110400bc6 libsystem_pthread.dylib`_pthread_rwlock_lock_slow + 547
>     frame #3: 0x000000010b27598f libBacktraceRecording.dylib`remove_stack_from_backtrace_uniquing_table + 32
>     frame #4: 0x000000010b272a11 libBacktraceRecording.dylib`decrement_work_item_refcount + 80
>     frame #5: 0x000000010b272316 libBacktraceRecording.dylib`gcd_queue_item_complete_hook + 166
>     frame #6: 0x000000011001bcbb libdispatch.dylib`_dispatch_continuation_pop + 590
>     frame #7: 0x000000011002de93 libdispatch.dylib`_dispatch_source_invoke + 2249
>     frame #8: 0x0000000110020046 libdispatch.dylib`_dispatch_lane_serial_drain + 299
>     frame #9: 0x0000000110020e9c libdispatch.dylib`_dispatch_lane_invoke + 425
>     frame #10: 0x000000011002aea3 libdispatch.dylib`_dispatch_workloop_worker_thread + 733
>     frame #11: 0x0000000110400611 libsystem_pthread.dylib`_pthread_wqthread + 421
>     frame #12: 0x00000001104003fd libsystem_pthread.dylib`start_wqthread + 13
>   thread #13, queue = 'inc->wss://sg.mydomain.com:4984/mydb/_blipsync'
>     frame #0: 0x00000001103aeb26 libsystem_kernel.dylib`__psynch_rw_unlock + 10
>     frame #1: 0x0000000110403057 libsystem_pthread.dylib`_pthread_rwlock_unlock_drop + 47
>     frame #2: 0x0000000110403017 libsystem_pthread.dylib`_pthread_rwlock_unlock_slow + 336
>     frame #3: 0x000000010b275614 libBacktraceRecording.dylib`enter_stack_in_backtrace_uniquing_table + 94
>     frame #4: 0x000000010b271eca libBacktraceRecording.dylib`gcd_queue_item_enqueue_hook + 638
>     frame #5: 0x000000011004a25a libdispatch.dylib`_dispatch_introspection_queue_item_enqueue_hook + 46
>     frame #6: 0x000000011001d30f libdispatch.dylib`dispatch_async + 274
>     frame #7: 0x000000010b6e0ba7 CouchbaseLiteSwift`litecore::repl::Worker::changedStatus() [inlined] litecore::actor::GCDMailbox::enqueue(this=<unavailable>) block_pointer) at GCDMailbox.cc:104:9 [opt]
>     frame #8: 0x000000010b6e0b5c CouchbaseLiteSwift`litecore::repl::Worker::changedStatus() [inlined] void litecore::actor::Actor::enqueue<litecore::repl::Worker, litecore::repl::Worker*, litecore::repl::Worker::Status>(this=<unavailable>, fn=51 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00, args=<unavailable>)(litecore::repl::Worker*, litecore::repl::Worker::Status), litecore::repl::Worker*, litecore::repl::Worker::Status) at Actor.hh:108 [opt]
>     frame #9: 0x000000010b6e0af3 CouchbaseLiteSwift`litecore::repl::Worker::changedStatus() [inlined] litecore::repl::Worker::childChangedStatus(this=<unavailable>, task=<unavailable>, status=<unavailable>) at Worker.hh:158 [opt]
>     frame #10: 0x000000010b6e0af3 CouchbaseLiteSwift`litecore::repl::Worker::changedStatus(this=0x00007fc6a1d3d700) at Worker.cc:275 [opt]
>     frame #11: 0x000000010b6e08ea CouchbaseLiteSwift`litecore::repl::Worker::afterEvent(this=0x00007fc6a1d3d700) at Worker.cc:269:13 [opt]
>     frame #12: 0x000000010b83b89e CouchbaseLiteSwift`::___ZN8litecore5actor10GCDMailbox7enqueueEU13block_pointerFvvE_block_invoke() [inlined] litecore::actor::GCDMailbox::afterEvent(this=<unavailable>) at GCDMailbox.cc:126:17 [opt]
>     frame #13: 0x000000010b83b895 CouchbaseLiteSwift`::___ZN8litecore5actor10GCDMailbox7enqueueEU13block_pointerFvvE_block_invoke(.block_descriptor=<unavailable>) at GCDMailbox.cc:102 [opt]
>     frame #14: 0x0000000110017d7f libdispatch.dylib`_dispatch_call_block_and_release + 12
>     frame #15: 0x0000000110018db5 libdispatch.dylib`_dispatch_client_callout + 8
>     frame #16: 0x0000000110020225 libdispatch.dylib`_dispatch_lane_serial_drain + 778
>     frame #17: 0x0000000110020ed0 libdispatch.dylib`_dispatch_lane_invoke + 477
>     frame #18: 0x0000000110020046 libdispatch.dylib`_dispatch_lane_serial_drain + 299
>     frame #19: 0x0000000110020e9c libdispatch.dylib`_dispatch_lane_invoke + 425
>     frame #20: 0x000000011002aea3 libdispatch.dylib`_dispatch_workloop_worker_thread + 733
>     frame #21: 0x0000000110400611 libsystem_pthread.dylib`_pthread_wqthread + 421
>     frame #22: 0x00000001104003fd libsystem_pthread.dylib`start_wqthread + 13
> * thread #14, queue = 'Pull->wss://sg.mydomain.com:4984/mydb/_blipsync', stop reason = breakpoint 1.2
>   * frame #0: 0x000000010cf25441 libc++abi.dylib`__cxa_throw
>     frame #1: 0x000000010b51121b libc++.1.dylib`std::__1::__throw_system_error(int, char const*) + 77
>     frame #2: 0x000000010b50751d libc++.1.dylib`std::__1::mutex::lock() + 29
>     frame #3: 0x000000010b6a33f1 CouchbaseLiteSwift`litecore::actor::Batcher<litecore::repl::Replicator, litecore::repl::ReplicatedRev>::push(litecore::repl::ReplicatedRev*) [inlined] std::__1::lock_guard<std::__1::mutex>::lock_guard(this=<unavailable>, __m=0x00007fc6a3f07148) at __mutex_base:104:27 [opt]
>     frame #4: 0x000000010b6a33e9 CouchbaseLiteSwift`litecore::actor::Batcher<litecore::repl::Replicator, litecore::repl::ReplicatedRev>::push(litecore::repl::ReplicatedRev*) [inlined] std::__1::lock_guard<std::__1::mutex>::lock_guard(this=<unavailable>, __m=0x00007fc6a3f07148) at __mutex_base:104 [opt]
>     frame #5: 0x000000010b6a33e9 CouchbaseLiteSwift`litecore::actor::Batcher<litecore::repl::Replicator, litecore::repl::ReplicatedRev>::push(this=0x00007fc6a3f07120, item=0x0000600002789170) at Batcher.hh:52 [opt]
>     frame #6: 0x000000010b6b8e14 CouchbaseLiteSwift`litecore::repl::Puller::_revsFinished() [inlined] litecore::repl::Worker::finishedDocument(this=<unavailable>, rev=<unavailable>) at Worker.cc:214:27 [opt]
>     frame #7: 0x000000010b6b8da8 CouchbaseLiteSwift`litecore::repl::Puller::_revsFinished(this=0x00007fc6a3f06ee0) at Puller.cc:253 [opt]
>     frame #8: 0x000000010b83ba25 CouchbaseLiteSwift`::___ZN8litecore5actor10GCDMailbox12enqueueAfterENSt3__16chrono8durationIdNS2_5ratioILl1ELl1EEEEEU13block_pointerFvvE_block_invoke() [inlined] litecore::actor::GCDMailbox::safelyCall(this=<unavailable>, block=<unavailable>) block_pointer) const at GCDMailbox.cc:87:13 [opt]
>     frame #9: 0x000000010b83ba1f CouchbaseLiteSwift`::___ZN8litecore5actor10GCDMailbox12enqueueAfterENSt3__16chrono8durationIdNS2_5ratioILl1ELl1EEEEEU13block_pointerFvvE_block_invoke(.block_descriptor=<unavailable>) at GCDMailbox.cc:115 [opt]
>     frame #10: 0x0000000110017d7f libdispatch.dylib`_dispatch_call_block_and_release + 12
>     frame #11: 0x0000000110018db5 libdispatch.dylib`_dispatch_client_callout + 8
>     frame #12: 0x0000000110020225 libdispatch.dylib`_dispatch_lane_serial_drain + 778
>     frame #13: 0x0000000110020e9c libdispatch.dylib`_dispatch_lane_invoke + 425
>     frame #14: 0x000000011002aea3 libdispatch.dylib`_dispatch_workloop_worker_thread + 733
>     frame #15: 0x0000000110400611 libsystem_pthread.dylib`_pthread_wqthread + 421
>     frame #16: 0x00000001104003fd libsystem_pthread.dylib`start_wqthread + 13
>   thread #15, name = 'com.apple.NSURLConnectionLoader'
>     frame #0: 0x00000001103aa22a libsystem_kernel.dylib`mach_msg_trap + 10
>     frame #1: 0x00000001103aa76c libsystem_kernel.dylib`mach_msg + 60
>     frame #2: 0x000000010ec81684 CoreFoundation`__CFRunLoopServiceMachPort + 212
>     frame #3: 0x000000010ec7bcc9 CoreFoundation`__CFRunLoopRun + 1657
>     frame #4: 0x000000010ec7b302 CoreFoundation`CFRunLoopRunSpecific + 626
>     frame #5: 0x0000000110ac2764 CFNetwork`-[__CoreSchedulingSetRunnable runForever] + 210
>     frame #6: 0x000000010baa1307 Foundation`__NSThread__start__ + 1197
>     frame #7: 0x00000001104012eb libsystem_pthread.dylib`_pthread_body + 126
>     frame #8: 0x0000000110404249 libsystem_pthread.dylib`_pthread_start + 66
>     frame #9: 0x000000011040040d libsystem_pthread.dylib`thread_start + 13
>   thread #16
>     frame #0: 0x00000001103abbfe libsystem_kernel.dylib`__workq_kernreturn + 10
>     frame #1: 0x0000000110400636 libsystem_pthread.dylib`_pthread_wqthread + 458
>     frame #2: 0x00000001104003fd libsystem_pthread.dylib`start_wqthread + 13
>   thread #17, name = 'com.apple.CFStream.LegacyThread'
>     frame #0: 0x00000001103aa22a libsystem_kernel.dylib`mach_msg_trap + 10
>     frame #1: 0x00000001103aa76c libsystem_kernel.dylib`mach_msg + 60
>     frame #2: 0x000000010ec81684 CoreFoundation`__CFRunLoopServiceMachPort + 212
>     frame #3: 0x000000010ec7bcc9 CoreFoundation`__CFRunLoopRun + 1657
>     frame #4: 0x000000010ec7b302 CoreFoundation`CFRunLoopRunSpecific + 626
>     frame #5: 0x000000010ec98ce5 CoreFoundation`_legacyStreamRunLoop_workThread + 261
>     frame #6: 0x00000001104012eb libsystem_pthread.dylib`_pthread_body + 126
>     frame #7: 0x0000000110404249 libsystem_pthread.dylib`_pthread_start + 66
>     frame #8: 0x000000011040040d libsystem_pthread.dylib`thread_start + 13
>   thread #18
>     frame #0: 0x0000000000000000
>   thread #19, name = 'com.apple.CFSocket.private'
>     frame #0: 0x00000001103b161a libsystem_kernel.dylib`__select + 10
>     frame #1: 0x000000010ec8a83a CoreFoundation`__CFSocketManager + 666
>     frame #2: 0x00000001104012eb libsystem_pthread.dylib`_pthread_body + 126
>     frame #3: 0x0000000110404249 libsystem_pthread.dylib`_pthread_start + 66
>     frame #4: 0x000000011040040d libsystem_pthread.dylib`thread_start + 13
@Grimal (Author) commented Jul 31, 2019

So the pull replication always sets the final sequence to 7021997. Subsequent pulls pick up from that point, but never update the checkpoint. The following log snippets are from the initial replication:

> 14:47:55.457351| [WS]: <CBLWebSocket: 0x60000088d290>: HasBytesAvailable
> 14:47:55.457365| [WS]: DoRead...
> 14:47:55.457378| [WS]: DoRead read 21396 bytes
> 14:47:55.457382| [WS]: <<< received 21396 bytes [now 21396 pending]
> 14:47:55.457426| [BLIP]: {16} Received frame: REQ #74 M-NC, length 4707
> 14:47:55.457566| [BLIP]: {16} Received frame: REQ #89 M-NC, length 6844
> 14:47:55.457588| [BLIP]: {9} Receiving 'rev' REQ #89 NZ
> 14:47:55.457695| [BLIP]: {16} Received frame: REQ #90 --NC, length 1203
> 14:47:55.457709| [BLIP]: {9} Receiving 'rev' REQ #90 NZ
> 14:47:55.457729| [BLIP]: {9} Finished receiving 'rev' REQ #90 NZ
> 14:47:55.457795| [BLIP]: {16} Received frame: REQ #84 M-NC, length 5018
> 14:47:55.457805| [Sync]: {43} Received revision 'orderHistory::32071785' #1-468625ae2b900430a6f4e574ae3e1dc4 (seq '3534128')
> 14:47:55.457925| [BLIP]: {16} Received frame: REQ #91 --NC, length 927
> 14:47:55.457951| [BLIP]: {9} Receiving 'rev' REQ #91 NZ
> 14:47:55.457972| [BLIP]: {9} Finished receiving 'rev' REQ #91 NZ
> 14:47:55.457994| [BLIP]: {16} Received frame: REQ #75 M-NC, length 2661
> 14:47:55.458097| [Sync]: {44} Received revision 'order::32083891' #2-2711deaec1a62c68b7e8ac4ab52e7f54 (seq '**7021997**')
> 14:47:55.460637| [WS]: <CBLWebSocket: 0x60000088d290>: HasBytesAvailable
> 14:47:55.460651| [WS]: DoRead...
> 14:47:55.460665| [WS]: DoRead read 12282 bytes
> 14:47:55.460670| [WS]: <<< received 12282 bytes [now 12282 pending]
> 14:47:55.460747| [BLIP]: {16} Received frame: REQ #79 M-NC, length 2629
> 14:47:55.460867| [BLIP]: {16} Received frame: REQ #92 M-NC, length 6844
> 14:47:55.460893| [BLIP]: {9} Receiving 'rev' REQ #92 NZ
> 14:47:55.461007| [BLIP]: {16} Received frame: REQ #93 M-NC, length 204
> 14:47:55.461021| [BLIP]: {9} Receiving 'rev' REQ #93 NZ
> 14:47:55.461042| [BLIP]: {16} Received frame: REQ #71 M-NC, length 892
> 14:47:55.461081| [BLIP]: {16} Received frame: REQ #68 --NC, length 1683
> 14:47:55.461160| [BLIP]: {9} Finished receiving 'rev' REQ #68 NZ
> 14:47:55.461271| [Sync]: {41} Received revision 'html:salesmanDashboard::00651771' #1-0c4f71c3a4a3ca20abc2b44139115e4a (seq '7333790')

... sync continues, another 300 lines or so of similar docs coming over, and then ...

> 14:47:55.535717| [WS]: <CBLWebSocket: 0x60000088d290>: HasBytesAvailable
> 14:47:55.535731| [WS]: DoRead...
> 14:47:55.535789| [WS]: DoRead read 18122 bytes
> 14:47:55.535911| [WS]: <<< received 18122 bytes [now 18122 pending]
> 14:47:55.535963| [BLIP]: {16} Received frame: REQ #123 M-NC, length 6134
> 14:47:55.535992| [BLIP]: {9} Receiving 'rev' REQ #123 NZ
> 14:47:55.536217| [BLIP]: {16} Received frame: REQ #99 M-NC, length 2996
> 14:47:55.536318| [BLIP]: {16} Received frame: REQ #100 M-NC, length 203
> 14:47:55.536353| [BLIP]: {16} Received frame: REQ #110 M-NC, length 5633
> 14:47:55.536476| [DB]: {8} Saved 'html:salesmanDashboard::00651807' rev #1-08cb985fb7a28312faccf413762477c9 as seq 69
> 14:47:55.536486| [BLIP]: {16} Received frame: REQ #83 --NC, length 596
> 14:47:55.536497| [Sync]: {11}     {'html:salesmanDashboard::00651807' #1-08cb985fb7a28312faccf413762477c9 <- } seq 69
> 14:47:55.536527| [BLIP]: {9} Finished receiving 'rev' REQ #83 NZ
> 14:47:55.536548| [BLIP]: {16} Received frame: REQ #101 M-NC, length 2524
> 14:47:55.536576| [Sync]: {33} Received revision 'html:salesmanDashboard::00671234' #1-045e96bf163d23c44e0a7c3076c3f0aa (seq '7333805')
> 14:47:55.536990| [DB]: {8} Saved 'html:salesmanDashboard::00671157' rev #1-6a71939366dd29209805825ef5687a9b as seq 70
> 14:47:55.536995| [Sync]: {11}     {'html:salesmanDashboard::00671157' #1-6a71939366dd29209805825ef5687a9b <- } seq 70
> 14:47:55.537397| [DB]: {8} Saved 'html:salesmanDashboard::00671200' rev #408-7ef4a884d3d70c80ee70718046eef812 as seq 71
> 14:47:55.537400| [Sync]: {11}     {'html:salesmanDashboard::00671200' #408-7ef4a884d3d70c80ee70718046eef812 <- 407-253f6b207af4e70f32ec00b311c56071,406-a7f94632d5d5b3f9836ffd3c701c2ab1,405-88d80fa9a9d25954d1d4766896d055bb,404-bd25e15a10a4a48e9984005fdd041262,403-74c78c5ee311db0bdb04a1e21f8db178,402-a623fde306f599b4f9a8645b2c44eca8,401-71e7ab8dadfa51e496588874da2fd261,400-d7c47385a20e6a49adb8d6504b61f57d,399-83d090906ca8237d6624cd7263c273f7,398-3aae49ecd89e83f2cf4bcf48ba2ed2dd,397-e8bae5c15bb13cc7a1d1ffa026088121,396-43f5757c353dff8d59a6b174094d9fab,395-c4e448d9d035da06f14ea1e0395ffe58,394-cd0fa6660f20c1d78704de000209d571,393-0025edb56bc89de11f7aaa78bd2b8719,392-b9403ae79da60a7e079ebcb518b95c3a,391-c2ad75b2cdf579c65eef8807f2a472d8,390-5353b9c4cd323992d86a6c389770dbf8,389-e1e5ac9ea9665e738abd7be660e817b6,388-3355a46fe8f6d45917311f8b2a4416e9} seq 71

... exception thrown (end of log) ...

@djpongh added this to the Cobalt milestone Aug 1, 2019

@borrrden (Member) commented Aug 2, 2019

This is one of those unfortunate situations in which the moving parts are so complex that it's nearly impossible to tell at what point things go wrong. You say this happens every time the initial replication is performed? Is there any way you can isolate the set of remote data and provide us with an endpoint to test against? I ask because if this happened for every replication, we would easily have caught it in our testing suite; the fact that we didn't means there is some unknown factor here causing it. I can suggest some mechanisms for sharing a URL privately if you prefer.

@Grimal (Author) commented Aug 2, 2019

Yes, I can reliably reproduce the exception. It always occurs at the same point, and it is always thrown while trying to lock the mutex in Batcher::push(ITEM *item). I will try to stand up a test environment without a lot of the extraneous noise.

@Grimal (Author) commented Aug 2, 2019

In building a test case, I uncovered a clue. In my app, I need to reconfigure the replicator with a new set of channels at various points in its life cycle; the list of channels is altered based on the contents of certain documents. I'm not sure this is the proper way to do it, but I've been stopping the Replicator and immediately creating a new instance with the new ReplicatorConfiguration.

Leading up to the exception being thrown, the Replicator is on its third restart. If I instead supply the Replicator with the final, full list of channels, without any reconfigure/restart, no exception is thrown. This is a pretty good hint as to why that mutex might be invalid.

However, this also appears to be unrelated to the second problem I've been encountering, where the checkpoint does not advance. I had assumed the two issues coincided, but they do not. I'm still seeing the checkpoint stuck at the exact same point, despite the replication completing with no warnings, errors, or exceptions.
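
For reference, a distilled sketch of the stop-and-recreate pattern described above (the endpoint URL and channel names here are placeholders, not the app's real values; the full repro code is in a later comment):

import CouchbaseLiteSwift

// Sketch only: rebuild the replicator whenever the channel filter changes,
// since the configuration is fixed once a Replicator instance has been created.
final class SyncManager {
    private let database: Database
    private var replicator: Replicator?

    init(database: Database) {
        self.database = database
    }

    func restart(withChannels channels: [String]) {
        replicator?.stop()          // stop the old instance
        replicator = nil            // drop it before recreating

        // Placeholder endpoint; substitute the real Sync Gateway URL.
        let target = URLEndpoint(url: URL(string: "wss://example.com:4984/mydb")!)
        let config = ReplicatorConfiguration(database: database, target: target)
        config.replicatorType = .pull
        config.continuous = true
        config.channels = channels

        replicator = Replicator(config: config)
        replicator?.start()
    }
}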

@borrrden (Member) commented Aug 2, 2019

Don't worry too much about removing noise. As long as it reproduces the issue in a somewhat timely fashion it will be immensely helpful.

@Grimal (Author) commented Aug 3, 2019

I set up a test environment that exhibits the issue. Feel free to hit this setup; if you need console access, I can share that with you as well.

If you want to set this up in your own environment, here are the steps:

  1. Set up Couchbase Server and Sync Gateway and configure them with the beer-sample bucket.
  2. Add a channels property to each document: beers get channels = ["beers"] and breweries get channels = ["breweries"].
  3. Set up a CouchbaseLiteSwift app to pull from SG. Start by pulling only the beers channel, then at some point during the active pull replication, stop the replication and start it again with channels = ["beers", "breweries"]. (e.g. addDocumentReplicationListener, and pick a beer document id to react to)
  4. Set a breakpoint on all exceptions and run the app.

Here's sample Swift code:

import UIKit
import CouchbaseLiteSwift

class ViewController: UIViewController {

    private var db: Database?
    private var replicator: Replicator?

    override func viewDidLoad() {
        super.viewDidLoad()
        self.db = try? Database.init(name: "CBLTEST", config: DatabaseConfiguration())
        startReplicating()
    }

    private func startReplicating(_ channels: [String] = ["beers"]) {
        print("Starting replication (channels: \(channels))")
        let target = URLEndpoint(url: URL(string: "wss://grimal.com:4984/beer-sample")!)
        let config = ReplicatorConfiguration.init(database: self.db!, target: target)
        config.replicatorType = .pull
        config.channels = channels
        config.continuous = true
        config.authenticator = BasicAuthenticator.init(username: "couchbase", password: "support")

        self.replicator = Replicator.init(config: config)

        self.replicator?.addDocumentReplicationListener { (replication) in
            for document in replication.documents {
                var status = ""
                if document.flags.contains(.deleted) { status += " (Deleted)" }
                if document.flags.contains(.accessRemoved) { status += " (Access removed)" }
                if document.error != nil { status += " (ERROR: \(String(describing: document.error)))" }
                print("\(replication.isPush ? "Pushed" : "Pulled") \(document.id) \(status)")

                // Test - once we hit a specific document, restart the replicator with the beers and
                // breweries channels (to 'encourage' the mutex lock in Batcher::push to fail)
                if document.id == "alvinne_picobrouwerij-podge_belgian_imperial_stout" {
                    self.replicator?.stop()
                    self.replicator = nil
                    self.startReplicating(["beers", "breweries"])
                }
            }
        }
        self.replicator?.start()
    }

}

@snej (Member) commented Aug 6, 2019

The cleaned-up call stack of the crash is:

* thread #14, queue = 'Pull->wss://sg.mydomain.com:4984/mydb/_blipsync', stop reason = breakpoint 1.2
0: __cxa_throw
1: std::__1::__throw_system_error(int, char const*) + 77
2: std::__1::mutex::lock() + 29
3: std::__1::lock_guard::lock_guard(this=<unavailable>, __m=0x00007fc6a3f07148) at __mutex_base:104:27
4: std::__1::lock_guard>::lock_guard(this=<unavailable>, __m=0x00007fc6a3f07148) at __mutex_base:104
5: Batcher::push(this=0x00007fc6a3f07120, item=0x0000600002789170) at Batcher.hh:52
6: Worker::finishedDocument(this=<unavailable>, rev=<unavailable>) at Worker.cc:214:27
7: Puller::_revsFinished(this=0x00007fc6a3f06ee0) at Puller.cc:253
8: litecore::actor::GCDMailbox::safelyCall(this=<unavailable>, block=<unavailable>) block_pointer) const at GCDMailbox.cc:87:13

Something weird happens between frames 6 and 5, because Worker::finishedDocument doesn't call Batcher::push. So each half of the backtrace makes sense, but the two halves don't fit together.

@Grimal (Author) commented Aug 6, 2019

In the little chunk of test code, there's no wait for the replicator to stop before setting the new config and starting a new instance. I have tested doing that as well, but waiting for the stop doesn't seem to change the behavior (it still throws the same exception).
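
For completeness, here is roughly what the wait-for-stop variant looks like, expressed against the sample ViewController above (a sketch, not the exact code from my app; it reuses startReplicating(_:) from the earlier snippet):

private var stopToken: ListenerToken?

private func restartWhenStopped(_ newChannels: [String]) {
    guard let current = self.replicator else {
        startReplicating(newChannels)
        return
    }
    // Observe the replicator's activity level and only create the new
    // instance once the old one reports .stopped.
    stopToken = current.addChangeListener { [weak self] change in
        guard change.status.activity == .stopped, let self = self else { return }
        if let token = self.stopToken {
            current.removeChangeListener(withToken: token)
        }
        self.replicator = nil
        self.startReplicating(newChannels)
    }
    current.stop()
}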

@borrrden (Member) commented Aug 6, 2019

Thanks! I was able to reproduce this with your test. Opened CBL-221.

@borrrden (Member) commented Aug 6, 2019

Now that it has happened once, it refuses to happen again even after an app uninstall and reinstall. Is that all that is needed to trigger it again?

@borrrden (Member) commented Aug 6, 2019

@snej Here is how those two frames are connected:

#5 litecore::actor::Batcher<litecore::repl::Replicator, litecore::repl::ReplicatedRev>::push(litecore::repl::ReplicatedRev*) at /Users/borrrden/Development/couchbase-lite-ios-EE/couchbase-lite-ios/vendor/couchbase-lite-core/vendor/BLIP-Cpp/src/util/Batcher.hh:52
#6 litecore::repl::Replicator::endedDocument(litecore::repl::ReplicatedRev*) at /Users/borrrden/Development/couchbase-lite-ios-EE/couchbase-lite-ios/vendor/couchbase-lite-core/Replicator/Replicator.cc:253
#7 litecore::repl::Worker::finishedDocument(litecore::repl::ReplicatedRev*) at /Users/borrrden/Development/couchbase-lite-ios-EE/couchbase-lite-ios/vendor/couchbase-lite-core/Replicator/Worker.cc:214
#8 litecore::repl::Puller::_revsFinished() at /Users/borrrden/Development/couchbase-lite-ios-EE/couchbase-lite-ios/vendor/couchbase-lite-core/Replicator/Puller.cc:253
@Grimal (Author) commented Aug 6, 2019

For me, deleting the app from the simulator and restarting has been close to, but not quite, 100% reliable at throwing the exception. I've also tried doing a checkpoint reset, but that doesn't throw the exception, presumably because the documents are already in the local DB?
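
(For reference, the checkpoint reset I tried is roughly the following, using the 2.5-era Swift API:)

// Checkpoint reset as I understand the 2.5 Swift API: call it on a stopped
// replicator, then start again. Revisions already present in the local DB are
// skipped rather than re-inserted, so no document events fire for them.
replicator?.resetCheckpoint()   // assumes the replicator is currently stopped
replicator?.start()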

@djpongh modified the milestones: Cobalt, Mercury Aug 7, 2019

@borrrden (Member) commented Aug 8, 2019

Your presumption is correct. I don't know if you've followed the linked ticket at all, but I managed to find the root cause. It happens while firing the document listener event, so naturally, if there are no docs to fire for, it won't happen.

@Grimal (Author) commented Aug 8, 2019

I looked at the ticket earlier today; good job finding that one quickly. The more I dig into the platform, the more I appreciate the work you guys are putting in. This is good stuff.

@borrrden (Member) commented Aug 8, 2019

You have your own repro case to thank for that. Repro cases like that are worth their weight in gold (putting aside the fact that their weight is actually zero... let's not ruin the image).

@snej (Member) commented Aug 9, 2019

@Grimal thanks so much for helping us reproduce the problem! We've been getting sporadic reports of this for a long time but could never track it down — the exception type is a red herring, and even the backtraces are confusing due to the bad cast.
