
potential deadlock in roudi Mon+Discover and IPC-msg-process thread #1546

Open
qclzdh opened this issue Jul 26, 2022 · 14 comments
Labels: bug (Something isn't working)


@qclzdh

qclzdh commented Jul 26, 2022

Required information

Operating system:
E.g. Ubuntu 20.04 LTS

Compiler version:
E.g. GCC 9.x

Observed result or behaviour:
The publisher send thread and the RouDi Mon+Discover and IPC-msg-process threads are deadlocked. No more pub/sub/req/res ports can register with RouDi; all registrations fail, since the IPC-msg-process thread is waiting on a lock.

Expected result or behaviour:
a. The publisher send process does not block
b. The RouDi threads do not deadlock

Conditions where it occurred / Performed steps:
During a stress test we found a case where the publisher and RouDi deadlock and no more clients can register with RouDi. The issue occurred when some subscribers crashed for an unexpected reason.
I know we should avoid this kind of crash, but I also want to report this issue to you; maybe you know how to make RouDi and the publisher fail-safe. Fail-safe in this case means:
a. The publisher never blocks, even if a subscriber crashes
b. iox-roudi never deadlocks, even if a subscriber crashes

Following is the backtrace of the publisher:

#0  iox::concurrent::IndexQueue<256ul, unsigned long>::popIfSizeIsAtLeast (this=0x7f596b301e90, requiredSize=256, index=@0x7ffc8ce1b828: 140022067436738)
    at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/concurrent/lockfree_queue/index_queue.inl:254
#1  0x00007f596da90efb in iox::concurrent::ResizeableLockFreeQueue<iox::mepoo::ShmSafeUnmanagedChunk, 256ul>::tryGetUsedIndex (this=0x7f596b301680, index=@0x7ffc8ce1b828: 140022067436738)
    at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/concurrent/lockfree_queue/resizeable_lockfree_queue.inl:175
#2  0x00007f596da90668 in iox::concurrent::ResizeableLockFreeQueue<iox::mepoo::ShmSafeUnmanagedChunk, 256ul>::pushImpl<iox::mepoo::ShmSafeUnmanagedChunk const> (this=0x7f596b301680, value=...)
    at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/concurrent/lockfree_queue/resizeable_lockfree_queue.inl:201
#3  0x00007f596da8fdcf in iox::concurrent::ResizeableLockFreeQueue<iox::mepoo::ShmSafeUnmanagedChunk, 256ul>::push (this=0x7f596b301680, value=...)
    at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/concurrent/lockfree_queue/resizeable_lockfree_queue.inl:182
#4  0x00007f596da8f5c3 in iox::cxx::VariantQueue<iox::mepoo::ShmSafeUnmanagedChunk, 256ul>::push (this=0x7f596b301678, value=...) at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/cxx/variant_queue.inl:86
#5  0x00007f596da8ee9d in iox::popo::ChunkQueuePusher<iox::popo::ChunkQueueData<iox::DefaultChunkQueueConfig, iox::popo::ThreadSafePolicy> >::push (this=0x7ffc8ce1b988, chunk=...)
    at /home/xxx/project/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_queue_pusher.inl:49
#6  0x00007f596da8e3cb in iox::popo::ChunkDistributor<iox::popo::ChunkDistributorData<iox::DefaultChunkDistributorConfig, iox::popo::ThreadSafePolicy, iox::popo::ChunkQueuePusher<iox::popo::ChunkQueueData<iox::DefaultChunkQueueConfig, iox::popo::ThreadSafePolicy> > > >::pushToQueue (this=0x7ffc8ce1dc78, queue=..., chunk=...) at /home/xxx/project/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_distributor.inl:216
#7  0x00007f596da8da50 in iox::popo::ChunkDistributor<iox::popo::ChunkDistributorData<iox::DefaultChunkDistributorConfig, iox::popo::ThreadSafePolicy, iox::popo::ChunkQueuePusher<iox::popo::ChunkQueueData<iox::DefaultChunkQueueConfig, iox::popo::ThreadSafePolicy> > > >::deliverToAllStoredQueues (this=0x7ffc8ce1dc78, chunk=...) at /home/xxx/project/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_distributor.inl:149
#8  0x00007f596da8ceb4 in iox::popo::ChunkSender<iox::popo::ChunkSenderData<8u, iox::popo::ChunkDistributorData<iox::DefaultChunkDistributorConfig, iox::popo::ThreadSafePolicy, iox::popo::ChunkQueuePusher<iox::popo::ChunkQueueData<iox::DefaultChunkQueueConfig, iox::popo::ThreadSafePolicy> > > > >::send (this=0x7ffc8ce1dc78, chunkHeader=0x7f59604e5ed8) at /home/xxx/project/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_sender.inl:189
#9  0x00007f596da8c40b in iox::popo::PublisherPortUser::sendChunk (this=0x7ffc8ce1dc68, chunkHeader=0x7f59604e5ed8) at /home/xxx/project/iceoryx/iceoryx_posh/source/popo/ports/publisher_port_user.cpp:62
#10 0x000055d1cfaad712 in iox::popo::UntypedPublisherImpl<iox::popo::BasePublisher<iox::popo::PublisherPortUser> >::publish (this=0x7ffc8ce1dc60, userPayload=0x7f59604e5f00)
    at /home/xxx/project/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/untyped_publisher_impl.inl:38
#11 0x000055d1cfaac6c3 in <lambda(auto:5&)>::operator()<void*>(void *&) const (__closure=0x7ffc8ce1dc30, userPayload=@0x7ffc8ce1de80: 0x7f59604e5f00) at /home/xxx/project/iceoryx/iceoryx_examples/icedelivery/iox_publisher_untyped.cpp:60
#12 0x000055d1cfaacb92 in iox::cxx::function_ref<void(void*&)>::<lambda(void*, void*&)>::operator()(void *, void *&) const (this=0x0, target=0x7ffc8ce1dc30, args#0=@0x7ffc8ce1de80: 0x7f59604e5f00)
    at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/cxx/function_ref.inl:34
#13 0x000055d1cfaacbbe in iox::cxx::function_ref<void(void*&)>::<lambda(void*, void*&)>::_FUN(void *, void *&) () at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/cxx/function_ref.inl:33
#14 0x000055d1cfaae04e in iox::cxx::function_ref<void (void*&)>::operator()(void*&) const (this=0x7ffc8ce1dc40, args#0=@0x7ffc8ce1de80: 0x7f59604e5f00) at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/cxx/function_ref.inl:80
#15 0x000055d1cfaad77b in iox::cxx::expected<void*, iox::popo::AllocationError>::and_then(iox::cxx::function_ref<void (void*&)> const&) (this=0x7ffc8ce1de80, callable=...)
    at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/cxx/expected.inl:271
#16 0x000055d1cfaac964 in main () at /home/xxx/project/iceoryx/iceoryx_examples/icedelivery/iox_publisher_untyped.cpp:62

and following is the backtrace of the RouDi Mon+Discover thread

#0  __lll_lock_wait (futex=futex@entry=0x7f0c98a79ff0, private=128) at lowlevellock.c:52
#1  0x00007f0c9aec9131 in __GI___pthread_mutex_lock (mutex=0x7f0c98a79ff0) at ../nptl/pthread_mutex_lock.c:115
#2  0x00007f0c9b341e89 in iox::posix::PosixCallBuilder<int, pthread_mutex_t*>::operator()(pthread_mutex_t*) && (this=0x7f0c8c711ec0, arguments#0=0x7f0c98a79ff0)
    at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/posix_wrapper/posix_call.inl:75
#3  0x00007f0c9b3417bd in iox::posix::mutex::lock (this=0x7f0c98a79ff0) at /home/xxx/project/iceoryx/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:68
#4  0x00007f0c9b4a8ee9 in iox::popo::ThreadSafePolicy::lock (this=0x7f0c98a79ff0) at /home/xxx/project/iceoryx/iceoryx_posh/source/popo/building_blocks/locking_policy.cpp:26
#5  0x00007f0c9b4890d0 in std::lock_guard<iox::popo::ChunkDistributorData<iox::DefaultChunkDistributorConfig, iox::popo::ThreadSafePolicy, iox::popo::ChunkQueuePusher<iox::popo::ChunkQueueData<iox::DefaultChunkQueueConfig, iox::popo::ThreadSafePolicy> > > const>::lock_guard (this=0x7f0c8c711fc8, __m=...) at /usr/include/c++/9/bits/std_mutex.h:159
#6  0x00007f0c9b48d790 in iox::popo::ChunkDistributor<iox::popo::ChunkDistributorData<iox::DefaultChunkDistributorConfig, iox::popo::ThreadSafePolicy, iox::popo::ChunkQueuePusher<iox::popo::ChunkQueueData<iox::DefaultChunkQueueConfig, iox::popo::ThreadSafePolicy> > > >::tryRemoveQueue (this=0x7f0c8c712260, queueToRemove=...) at /home/xxx/project/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_distributor.inl:103
#7  0x00007f0c9b48cf99 in iox::popo::PublisherPortRouDi::dispatchCaProMessageAndGetPossibleResponse (this=0x7f0c8c712250, caProMessage=...) at /home/xxx/project/iceoryx/iceoryx_posh/source/popo/ports/publisher_port_roudi.cpp:104
#8  0x00007f0c9b633c5d in iox::roudi::PortManager::sendToAllMatchingPublisherPorts (this=0x7f0c9b6b5a80 <iox::roudi::IceOryxRouDiApp::run()::m_rouDiComponents+56320>, message=..., subscriberSource=...)
    at /home/xxx/project/iceoryx/iceoryx_posh/source/roudi/port_manager.cpp:520
#9  0x00007f0c9b6357f9 in iox::roudi::PortManager::<lambda(auto:14)>::operator()<iox::capro::CaproMessage>(iox::capro::CaproMessage) const (__closure=0x7f0c8c713820, caproMessage=...) at /home/xxx/project/iceoryx/iceoryx_posh/source/roudi/port_manager.cpp:837
#10 0x00007f0c9b637c04 in iox::cxx::function_ref<void(iox::capro::CaproMessage&)>::<lambda(void*, iox::capro::CaproMessage&)>::operator()(void *, iox::capro::CaproMessage &) const (this=0x0, target=0x7f0c8c713820, args#0=...)
    at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/cxx/function_ref.inl:34
#11 0x00007f0c9b637c44 in iox::cxx::function_ref<void(iox::capro::CaproMessage&)>::<lambda(void*, iox::capro::CaproMessage&)>::_FUN(void *, iox::capro::CaproMessage &) ()
    at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/cxx/function_ref.inl:33
#12 0x00007f0c9b63d21e in iox::cxx::function_ref<void (iox::capro::CaproMessage&)>::operator()(iox::capro::CaproMessage&) const (this=0x7f0c8c713830, args#0=...) at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/cxx/function_ref.inl:80
#13 0x00007f0c9b63ad75 in iox::cxx::optional<iox::capro::CaproMessage>::and_then(iox::cxx::function_ref<void (iox::capro::CaproMessage&)> const&) (this=0x7f0c8c7138a0, callable=...)
    at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/cxx/optional.inl:301
#14 0x00007f0c9b635903 in iox::roudi::PortManager::destroySubscriberPort (this=0x7f0c9b6b5a80 <iox::roudi::IceOryxRouDiApp::run()::m_rouDiComponents+56320>, subscriberPortData=0x7f0c98cfb3f0)
    at /home/xxx/project/iceoryx/iceoryx_posh/source/roudi/port_manager.cpp:838
#15 0x00007f0c9b634e2a in iox::roudi::PortManager::deletePortsOfProcess (this=0x7f0c9b6b5a80 <iox::roudi::IceOryxRouDiApp::run()::m_rouDiComponents+56320>, runtimeName=...) at /home/xxx/project/iceoryx/iceoryx_posh/source/roudi/port_manager.cpp:745
#16 0x00007f0c9b65e9be in iox::roudi::ProcessManager::monitorProcesses (this=0x7f0c9b804630 <iox::roudi::IceOryxRouDiApp::run()::roudi+112>) at /home/xxx/project/iceoryx/iceoryx_posh/source/roudi/process_manager.cpp:717
#17 0x00007f0c9b65e586 in iox::roudi::ProcessManager::run (this=0x7f0c9b804630 <iox::roudi::IceOryxRouDiApp::run()::roudi+112>) at /home/xxx/project/iceoryx/iceoryx_posh/source/roudi/process_manager.cpp:667
#18 0x00007f0c9b64f40a in iox::roudi::RouDi::monitorAndDiscoveryUpdate (this=0x7f0c9b8045c0 <iox::roudi::IceOryxRouDiApp::run()::roudi>) at /home/xxx/project/iceoryx/iceoryx_posh/source/roudi/roudi.cpp:155
#19 0x00007f0c9b658887 in std::__invoke_impl<void, void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> (__f=@0x56090dbc6600: (void (iox::roudi::RouDi::*)(class iox::roudi::RouDi * const)) 0x7f0c9b64f3a8 <iox::roudi::RouDi::monitorAndDiscoveryUpdate()>, 
    __t=@0x56090dbc65f8: 0x7f0c9b8045c0 <iox::roudi::IceOryxRouDiApp::run()::roudi>) at /usr/include/c++/9/bits/invoke.h:73
#20 0x00007f0c9b65850f in std::__invoke<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> (__fn=@0x56090dbc6600: (void (iox::roudi::RouDi::*)(class iox::roudi::RouDi * const)) 0x7f0c9b64f3a8 <iox::roudi::RouDi::monitorAndDiscoveryUpdate()>)
    at /usr/include/c++/9/bits/invoke.h:95
#21 0x00007f0c9b658195 in std::thread::_Invoker<std::tuple<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> >::_M_invoke<0ul, 1ul> (this=0x56090dbc65f8) at /usr/include/c++/9/thread:244
#22 0x00007f0c9b6580ad in std::thread::_Invoker<std::tuple<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> >::operator() (this=0x56090dbc65f8) at /usr/include/c++/9/thread:251
#23 0x00007f0c9b65805e in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> > >::_M_run (this=0x56090dbc65f0) at /usr/include/c++/9/thread:195
#24 0x00007f0c9b1abde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#25 0x00007f0c9aec6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#26 0x00007f0c9b000133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

and this is the backtrace of the RouDi IPC-msg-process thread

#0  __lll_lock_wait (futex=futex@entry=0x7f0c9b825cf0 <iox::roudi::IceOryxRouDiApp::run()::roudi+137008>, private=0) at lowlevellock.c:52
#1  0x00007f0c9aec90a3 in __GI___pthread_mutex_lock (mutex=0x7f0c9b825cf0 <iox::roudi::IceOryxRouDiApp::run()::roudi+137008>) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f0c9b638d9b in __gthread_mutex_lock (__mutex=0x7f0c9b825cf0 <iox::roudi::IceOryxRouDiApp::run()::roudi+137008>) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749
#3  0x00007f0c9b63910a in std::mutex::lock (this=0x7f0c9b825cf0 <iox::roudi::IceOryxRouDiApp::run()::roudi+137008>) at /usr/include/c++/9/bits/std_mutex.h:100
#4  0x00007f0c9b653f8b in iox::concurrent::smart_lock<iox::roudi::ProcessManager, std::mutex>::Proxy::Proxy (this=0x7f0c8bf14390, base=..., lock=...) at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/concurrent/smart_lock.inl:127
#5  0x00007f0c9b652b3e in iox::concurrent::smart_lock<iox::roudi::ProcessManager, std::mutex>::operator-> (this=0x7f0c9b804630 <iox::roudi::IceOryxRouDiApp::run()::roudi+112>)
    at /home/xxx/project/iceoryx/iceoryx_hoofs/include/iceoryx_hoofs/internal/concurrent/smart_lock.inl:92
#6  0x00007f0c9b650e2c in iox::roudi::RouDi::processMessage (this=0x7f0c9b8045c0 <iox::roudi::IceOryxRouDiApp::run()::roudi>, message=..., cmd=@0x7f0c8bf1466c: iox::runtime::IpcMessageType::KEEPALIVE, runtimeName=...)
    at /home/xxx/project/iceoryx/iceoryx_posh/source/roudi/roudi.cpp:416
#7  0x00007f0c9b64f609 in iox::roudi::RouDi::processRuntimeMessages (this=0x7f0c9b8045c0 <iox::roudi::IceOryxRouDiApp::run()::roudi>) at /home/xxx/project/iceoryx/iceoryx_posh/source/roudi/roudi.cpp:179
#8  0x00007f0c9b658887 in std::__invoke_impl<void, void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> (__f=@0x56090dbc6760: (void (iox::roudi::RouDi::*)(class iox::roudi::RouDi * const)) 0x7f0c9b64f478 <iox::roudi::RouDi::processRuntimeMessages()>, 
    __t=@0x56090dbc6758: 0x7f0c9b8045c0 <iox::roudi::IceOryxRouDiApp::run()::roudi>) at /usr/include/c++/9/bits/invoke.h:73
#9  0x00007f0c9b65850f in std::__invoke<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> (__fn=@0x56090dbc6760: (void (iox::roudi::RouDi::*)(class iox::roudi::RouDi * const)) 0x7f0c9b64f478 <iox::roudi::RouDi::processRuntimeMessages()>)
    at /usr/include/c++/9/bits/invoke.h:95
#10 0x00007f0c9b658195 in std::thread::_Invoker<std::tuple<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> >::_M_invoke<0ul, 1ul> (this=0x56090dbc6758) at /usr/include/c++/9/thread:244
#11 0x00007f0c9b6580ad in std::thread::_Invoker<std::tuple<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> >::operator() (this=0x56090dbc6758) at /usr/include/c++/9/thread:251
#12 0x00007f0c9b65805e in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (iox::roudi::RouDi::*)(), iox::roudi::RouDi*> > >::_M_run (this=0x56090dbc6750) at /usr/include/c++/9/thread:195
#13 0x00007f0c9b1abde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007f0c9aec6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#15 0x00007f0c9b000133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
@elBoberido
Member

@qclzdh I have to look closer at this, but it looks like a duplicate of #325. There is one last lock left for the registration of applications. When an app crashes at the wrong time, a deadlock can occur. It is on our radar, but it is not easy to remove this last lock without sacrificing some features.

@qclzdh
Author

qclzdh commented Jul 26, 2022

@elBoberido It seems this is a different issue. Instead of calling it a deadlock, it is more accurate to say that the Mon+Discover and IPC-msg-process threads are starving on a lock, since the publisher is stuck in an infinite loop while it holds that lock.

@elfenpiff
Contributor

@qclzdh What version of iceoryx are you using?

Could you please tell us what you mean when you say that the publisher is stuck in an infinite loop? Do you suspect it is caught in the while loop in ResizeableLockFreeQueue::pushImpl, or do you maybe have some more hints for us?

cc @MatthiasKillat

@qclzdh
Author

qclzdh commented Jul 27, 2022

@elfenpiff I am using the iox-2.0 release. Yes, I think the publisher is stuck in the while loop in ResizeableLockFreeQueue::pushImpl. From my analysis, the sequence in which this issue happens is (see the sketch below):
a. The publisher has already taken the m_queues lock.
b. The subscriber queue is full, and the subscriber crashed during an index pop in LockFreeQueue<ElementType, Capacity>::pop(), so one index is lost.
c. The publisher has no chance to exit the ResizeableLockFreeQueue::pushImpl loop.
d. The RouDi monitor detects the subscriber crash and tries to clean up its resources, including removing the queue from the publisher's m_queues. RouDi therefore tries to acquire the m_queues lock, but the publisher is still holding it, so the Mon+Discover thread waits on the lock.
e. A new subscriber starts up and tries to register with RouDi. The RouDi IPC-msg-process thread then calls "m_prcMgr->addSubscriberForProcess", but m_prcMgr is locked by the Mon+Discover thread (m_prcMgr is a smart_lock), so the IPC-msg-process thread also waits on a lock.
f. From then on all registrations with RouDi fail, since the IPC-msg-process thread is blocked.
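
To make the chain above concrete, here is a minimal, self-contained model of steps a-f. It is illustrative only and not iceoryx code: the two mutexes and the flag are hypothetical stand-ins for the ChunkDistributorData mutex, the ProcessManager smart_lock, and the lost index. When run, the program hangs by design.

```cpp
// Minimal model of steps a-f (illustrative only, NOT iceoryx code).
// distributorMutex stands in for the publisher's m_queues lock, processManagerLock
// for the smart_lock around the ProcessManager, indexAvailable for the lost index.
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

std::mutex distributorMutex;
std::mutex processManagerLock;
std::atomic<bool> indexAvailable{false}; // step b: the index died with the subscriber

void publisherSend() // steps a + c: spins inside pushImpl while holding the m_queues lock
{
    std::lock_guard<std::mutex> queues(distributorMutex);
    while (!indexAvailable.load())
    {
        // never terminates, the lost index never comes back
    }
}

void monDiscoverThread() // step d: cleanup needs the m_queues lock while holding the ProcessManager lock
{
    std::this_thread::sleep_for(std::chrono::milliseconds(100)); // crude ordering for the illustration
    std::lock_guard<std::mutex> processManager(processManagerLock);
    std::lock_guard<std::mutex> queues(distributorMutex); // blocks behind the publisher
}

void ipcMsgProcessThread() // steps e + f: every registration needs the ProcessManager lock
{
    std::this_thread::sleep_for(std::chrono::milliseconds(200)); // crude ordering for the illustration
    std::lock_guard<std::mutex> processManager(processManagerLock); // blocks behind Mon+Discover
}

int main()
{
    std::thread pub(publisherSend);
    std::thread mon(monDiscoverThread);
    std::thread ipc(ipcMsgProcessThread);
    pub.join(); // never returns: the whole chain is stuck
    mon.join();
    ipc.join();
}
```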

@MatthiasKillat
Contributor

MatthiasKillat commented Jul 27, 2022

@qclzdh Thanks for discovering this issue.

The short version is that I think a buffer cell in the ResizeableLockFreeQueue can be lost (i.e. no one owns it and it leaks in some way) in case a publisher app crashes. As a consequence, ResizeableLockFreeQueue::pushImpl calls in other publisher apps may run into an endless loop, but I need to check the details to explain why.

If true, this is actually very tricky to fix properly. I will look into it to give a more detailed reason; then we can try to reproduce it and work on a fix.

If it is as I assume, it shows that it is hard to provide the following guarantee, which is somewhat stronger than what regular lock-free code gives: a crashing application/thread cannot block any other thread. Inside a single application in the C++ thread model this is not a problem, as an individual thread that crashes brings down the application itself. But if each thread lives in its own process, this no longer holds and we arrive at this problem.

TLDR: I will look into the details.

@MatthiasKillat
Contributor

MatthiasKillat commented Aug 2, 2022

@qclzdh I looked at the problem and you are right. While the problem occurs in the ResizeableLockFreeQueue, it is similar in the basic LockFreeQueue.

There is an invariant that each index is always either owned by the queue itself or by some application. The problem here is that applications can crash, and then the queue will not reclaim the index. However, since we cannot write arbitrarily large types atomically, we need a mechanism like this (at least I know of no other way).

Assume the queue has capacity 3 and some application crashes during push or pop while holding the index. For simplicity assume it is a subscribing application and the only one. Then eventually there are no free indices anymore as the publishing application fills the queue, i.e. it contains 2 elements now.

The semantics of pushImpl is that it tries to obtain a free index. If it fails, it concludes the queue is (temporarily) full and tries to obtain the least recently used (FIFO) index from the used ones, but ONLY if the queue is still full. This is to keep the invariant that a full queue from which nothing is popped remains full at all times, and that we do not unnecessarily lose elements.

Now the queue can never become full again (full means 3 elements in this case), as the index is lost. So it again tries to obtain a free index and the process repeats. This means that once an index is lost, no push call can make progress unless there is some pop call. But there does not have to be one, as there may not be a subscriber anymore.
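
A minimal single-threaded model of the semantics just described (a sketch under assumptions, not the actual ResizeableLockFreeQueue code; the member names are made up). It shows why neither branch can ever succeed once an index is lost:

```cpp
// Toy model of the push semantics above (NOT the real lock-free implementation).
#include <cstddef>
#include <deque>
#include <iostream>
#include <optional>

struct QueueModel
{
    std::size_t capacity;
    std::deque<std::size_t> freeIndices; // cells owned by the queue, ready to be written
    std::deque<std::size_t> usedIndices; // cells holding data, ready to be read (FIFO)

    std::optional<std::size_t> obtainIndexForPush()
    {
        // the real code retries in a loop; the model does a single pass
        if (!freeIndices.empty())
        {
            auto index = freeIndices.front();
            freeIndices.pop_front();
            return index;
        }
        // no free index -> presumably full; recycle the oldest used index,
        // but ONLY if the queue really holds `capacity` elements
        if (usedIndices.size() >= capacity)
        {
            auto index = usedIndices.front();
            usedIndices.pop_front();
            return index;
        }
        // with one index lost to a crashed process, both checks fail forever:
        // freeIndices stays empty and usedIndices can never reach `capacity` again
        return std::nullopt;
    }

    bool push()
    {
        auto index = obtainIndexForPush();
        if (!index)
        {
            return false; // the real implementation would keep spinning here
        }
        usedIndices.push_back(*index); // "publish" the element
        return true;
    }
};

int main()
{
    // capacity 3, but index 2 was lost when a subscriber crashed inside pop()
    QueueModel queue{3, {0, 1}, {}};
    std::cout << queue.push() << queue.push() << queue.push() << '\n'; // prints "110"
}
```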

In a strict sense it means that the queue is NOT lock-free, as one suspended thread/application can block others from progressing.

Now what can be done about it?

Non-solution
As a band-aid fix we may consider replacing popIfFull (or popIfSizeIsAtLeast) with a plain pop from the used indices. But this only masks the problem: we could now make progress, but we would progressively lose indices until the queue is essentially non-functional (already after the first lost index it no longer has the capacity we configured).

Observation
We cannot reclaim an index when we suspect it is not used anymore (as the process may just have been delayed, but not crashed).

Solution strategy
We may somehow track the used indices and whether the process using them died. We can reclaim an index if we can infer with certainty that the process died. This is non-trivial to do in a lock-free way but should be possible.
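
One conceivable shape for such a tracking mechanism, shown purely as a sketch under assumptions: the per-index owner array, the names, and the isProcessDead callback are hypothetical, and the genuinely hard parts (certain death detection and the races around claiming/releasing an index) are not solved here.

```cpp
// Sketch of "track ownership, reclaim on confirmed death" (not a proposed iceoryx design).
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>

constexpr std::size_t Capacity = 256;
constexpr uint64_t NotOwned = 0; // 0 is never a valid owner id in this sketch

// One slot per buffer index, storing the id of the process currently holding it.
std::array<std::atomic<uint64_t>, Capacity> indexOwner{};

// Called right after an index is taken out of the free/used pool.
void claimIndex(std::size_t index, uint64_t ownerId)
{
    indexOwner[index].store(ownerId, std::memory_order_release);
}

// Called right before the index is handed back to a pool.
void releaseIndex(std::size_t index)
{
    indexOwner[index].store(NotOwned, std::memory_order_release);
}

// Reclaim pass, run by a supervising entity. It may only reclaim when death is certain;
// otherwise a merely delayed process could wake up and use the index concurrently.
std::optional<std::size_t> tryReclaimOneIndex(bool (*isProcessDead)(uint64_t))
{
    for (std::size_t i = 0; i < Capacity; ++i)
    {
        auto owner = indexOwner[i].load(std::memory_order_acquire);
        if (owner != NotOwned && isProcessDead(owner)
            && indexOwner[i].compare_exchange_strong(owner, NotOwned))
        {
            return i; // caller must put the index back into the free pool exactly once
        }
    }
    return std::nullopt;
}
```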

Observation
The free indices do not need to be stored in a queue (FIFO). An arbitrary lock-free container suffices. Does this help us somehow in the tracking of indices and alive processes? Note that ideally we have a data structure that provides a free index with O(1) access time (for performance), but to fix the problem in principle this is not mandatory (could be O(n) or worse).

Observation
Everything would be easier with dynamic memory, as then we would only leak some memory (this can get bad over time though). This is not an option and would undermine the original design.

As a pseudo-fix we could detect a potential problem after a predetermined number of unsuccessful loop runs (see the sketch below). However, the appropriate reaction is unclear and it is not a nice solution (the queue is still not lock-free).
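
A tiny sketch of that bound, with hypothetical names; the open question is precisely what to do once the bound is hit.

```cpp
// Bounded-retry variant of the index acquisition (hypothetical, not a proposed iceoryx
// change). tryObtainIndexOnce stands for one pass of the free-index / popIfSizeIsAtLeast
// attempt described above.
#include <cstdint>
#include <optional>

constexpr uint64_t MaxPushRetries = 100000;

std::optional<uint64_t> tryObtainIndexBounded(std::optional<uint64_t> (*tryObtainIndexOnce)())
{
    for (uint64_t retry = 0; retry < MaxPushRetries; ++retry)
    {
        if (auto index = tryObtainIndexOnce())
        {
            return index;
        }
    }
    // Giving up keeps other threads responsive, but an index (and thus capacity)
    // is still lost and the caller has no obviously correct reaction.
    return std::nullopt;
}
```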

Currently we have to assume that nothing crashes AND the operations are scheduled fairly. If these conditions are met the queue operations should not block indefinitely. The first requirement is fairly restrictive though and I would like to avoid it.

Note that in a single process with fairly scheduled threads this would not be an issue, as a crashing thread will crash the whole application in C++. Due to fair scheduling the situation described above cannot occur in a single process.

FYI @elBoberido I will try to come up with a solution but this is a difficult problem I think. However, my honor as an algorithm designer demands it and hence I will succeed (eventually ....) :-)

@MatthiasKillat
Contributor

MatthiasKillat commented Aug 3, 2022

I thought further about this and I think we need to distinguish two crash cases.

Assume we have publishers A and B on some topic and subscribers X and Y on the same topic. Note that the queues that can block as described above belong to/are associated with the subscribers.

Subscriber crash
If subscriber X crashes it may block publishers A and B as described, but its queue should be cleaned up as it is of no use anymore. I am not sure about all the details of how the current mechanism does this, but the cleanup requires an interprocess mutex to remove the queue from the lists in publishers A and B. I think there might also be some problems there. Furthermore, as long as another subscriber is active, this blocking scenario should not occur (at least if it takes data).

Edit: the final claim is wrong and can be ignored. This other subscriber uses its own queue and the queue of the crashed subscriber cannot be cleaned up until the publish returns (which it will not in the scenario above, hence we have a deadlock)

Publisher crash
This can cause the same problem, as e.g. publisher A may also crash while holding an index as described above (assume it is an index of Q_X, the queue to subscriber X). Now Q_X cannot be removed, as it is still needed by publisher B, but publisher B is blocked in its push for the reason described above.

Conclusion
In the subscriber crash case we just need to ensure we can safely disconnect the queue of the crashed subscriber (which is non-trivial).

In the publisher crash case we need a stronger recovery mechanism to recover the index OR make sure that this kind of index loss that blocks other publishers cannot happen. I think the former is easier but still fairly difficult. With the latter approach we have the problem that we cannot have atomic writes of arbitrary size and have to simulate them, as done with the ownership of the index. I do not see a way around this without severely restricting the data types of the queue (to 64 bit).

The good news is that the IndexQueue itself is not compromised and can still be assumed to be lock-free.

@MatthiasKillat
Contributor

Another problem is that the queue cannot be cleaned up as the publish happens under mutex protection and cleanup needs the same mutex. Now cleanup cannot obtain the lock since publish does not return.

The mutex is only contested during cleanup or generally when queues are added or removed (i.e. rarely).

So we need some solution to leave the loop with reasonable semantics even in case of error (otherwise the whole cleanup must be reconsidered).
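
A condensed sketch of that locking shape (illustrative pseudo-structure, not the actual ChunkDistributor code): the delivery path holds the queues mutex for the whole delivery, including the potentially non-terminating lock-free push, while the cleanup path needs the very same mutex.

```cpp
// Shape of the locking described above (illustrative only, not iceoryx code).
#include <algorithm>
#include <mutex>
#include <vector>

struct Chunk { /* payload handle */ };

struct Queue
{
    // stands in for the lock-free queue; may loop forever if an index was lost
    void push(const Chunk&) { /* ... */ }
};

struct Distributor
{
    std::mutex queuesMutex; // an interprocess mutex in the real implementation
    std::vector<Queue*> queues;

    void deliverToAllStoredQueues(const Chunk& chunk)
    {
        std::lock_guard<std::mutex> guard(queuesMutex); // held for the whole delivery
        for (auto* queue : queues)
        {
            queue->push(chunk); // if this never returns, the mutex is never released
        }
    }

    void tryRemoveQueue(Queue* queueToRemove)
    {
        std::lock_guard<std::mutex> guard(queuesMutex); // cleanup needs the same mutex
        queues.erase(std::remove(queues.begin(), queues.end(), queueToRemove), queues.end());
    }
};
```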

mossmaurice added the "bug" label on Aug 8, 2022
@GerdHirsch

The problem with lost indices has been known from the beginning. If a process crashes while it owns a memory area (i.e. the index to it), the index is lost and with it the memory for the data.
This problem does not occur with SPSC queues, and in addition they are also wait-free. A solution could be to connect the user processes only via SPSC queues and to use the MPMC queues only for internal processes that are appropriately safeguarded and do not terminate abnormally.
You could probably live with the slightly increased latency.
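
For illustration, a minimal wait-free SPSC ring buffer along the lines of this suggestion (a sketch only, not a proposed iceoryx design): each side only ever advances its own index, both operations complete in a bounded number of steps, and no buffer cell ownership is ever transferred to the other side.

```cpp
// Minimal wait-free SPSC ring buffer (illustrative sketch, not iceoryx code).
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename ElementType, std::size_t Capacity>
class SpscQueue
{
  public:
    // Called only by the single producer. Never loops or blocks: if the queue is full
    // it simply reports that, so a crashed consumer cannot trap the producer in a loop.
    bool tryPush(const ElementType& value)
    {
        auto write = m_writeIndex.load(std::memory_order_relaxed);
        auto nextWrite = (write + 1) % (Capacity + 1);
        if (nextWrite == m_readIndex.load(std::memory_order_acquire))
        {
            return false; // full
        }
        m_buffer[write] = value;
        m_writeIndex.store(nextWrite, std::memory_order_release);
        return true;
    }

    // Called only by the single consumer; also a bounded number of steps.
    std::optional<ElementType> tryPop()
    {
        auto read = m_readIndex.load(std::memory_order_relaxed);
        if (read == m_writeIndex.load(std::memory_order_acquire))
        {
            return std::nullopt; // empty
        }
        ElementType value = m_buffer[read];
        m_readIndex.store((read + 1) % (Capacity + 1), std::memory_order_release);
        return value;
    }

  private:
    std::array<ElementType, Capacity + 1> m_buffer{}; // one extra slot to distinguish full from empty
    std::atomic<std::size_t> m_writeIndex{0};
    std::atomic<std::size_t> m_readIndex{0};
};
```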

@MatthiasKillat
Contributor

@GerdHirsch Yes, a collection of SPSC queues is a solution but requires extensive redesign. @elBoberido is considering it but it has design implications.

If MPMC queues are to be used, I considered the following questions:

  1. Can the loss of memory cells be avoided for complex (non-atomic) types?
  2. Can the loss of memory cells be detected?
  3. Can memory cells be reclaimed if the loss is detected?

With some arguments I will not fully state here for brevity, I came to the conclusion that 1. must be answered with NO if a process can die at any time (or even just be interrupted indefinitely, which is equivalent for our purposes).

This leaves considering 2. and 3., which I think can be answered positively. It is, however, quite intricate to come up with a lock-free detection and recovery mechanism. The code of this kind of mechanism will not be elegant, at least the way I am considering now.

@elBoberido
Member

@MatthiasKillat @GerdHirsch yes, an SPSC queue has some advantages but also some disadvantages. The biggest problem is that it might require quite a big refactoring, but it also opens the door for other simplifications, like memory allocation. I think at some point we need to do some brainstorming and decide how to proceed.

cc @budrus

@wangxuemin

wangxuemin commented Jul 19, 2023

We also have this problem.

(gdb) bt

#0  __lll_lock_wait (futex=futex@entry=0xffff90ccd200, private=128) at lowlevellock.c:52
#1  0x0000ffff930c4cec in __GI___pthread_mutex_lock (mutex=0xffff90ccd200) at pthread_mutex_lock.c:115
#2  0x0000ffff935d67f0 in iox::posix::mutex::lock() () from /usr/local/lib/libiceoryx_hoofs.so.2
#3  0x0000ffff9316df50 in iox::popo::ThreadSafePolicy::lock() const () from /usr/local/lib/libiceoryx_posh.so.2
#4  0x0000ffff93152898 in iox::popo::PublisherPortRouDi::tryGetCaProMessage() () from /usr/local/lib/libiceoryx_posh.so.2
#5  0x0000ffff93667f1c in iox::roudi::PortManager::destroyPublisherPort(iox::popo::PublisherPortData*) () from /usr/local/lib/libiceoryx_posh_roudi.so.2
#6  0x0000ffff93668f84 in iox::roudi::PortManager::deletePortsOfProcess(iox::cxx::string<100ul> const&) () from /usr/local/lib/libiceoryx_posh_roudi.so.2
#7  0x0000ffff93689ebc in iox::roudi::ProcessManager::removeProcessAndDeleteRespectiveSharedMemoryObjects(iox::cxx::list<iox::roudi::Process, 300ul>::IteratorBase<false>&, iox::roudi::ProcessManager::TerminationFeedback) () from /usr/local/lib/libiceoryx_posh_roudi.so.2
#8  0x0000ffff9368a4b8 in iox::roudi::ProcessManager::searchForProcessAndRemoveIt(iox::cxx::string<100ul> const&, iox::roudi::ProcessManager::TerminationFeedback) ()
   from /usr/local/lib/libiceoryx_posh_roudi.so.2
#9  0x0000ffff9368a658 in iox::roudi::ProcessManager::unregisterProcess(iox::cxx::string<100ul> const&) () from /usr/local/lib/libiceoryx_posh_roudi.so.2
#10 0x0000ffff9367c4a0 in iox::roudi::RouDi::processMessage(iox::runtime::IpcMessage const&, iox::runtime::IpcMessageType const&, iox::cxx::string<100ul> const&) () from /usr/local/lib/libiceoryx_posh_roudi.so.2
#11 0x0000ffff9367ad68 in iox::roudi::RouDi::processRuntimeMessages() () from /usr/local/lib/libiceoryx_posh_roudi.so.2
#12 0x0000ffff934a1fac in ?? () from /lib/aarch64-linux-gnu/libstdc++.so.6
#13 0x0000ffff930c2624 in start_thread (arg=0xffff934a1f90) at pthread_create.c:477
#14 0x0000ffff9333349c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

(gdb) p *(pthread_mutex_t *)0xffff90ccd200
$2 = {__data = {__lock = 2, __count = 1, __owner = 156616, __nusers = 1, __kind = 129, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\002\000\000\000\001\000\000\000\310c\002\000\001\000\000\000\201", '\000' <repeats 30 times>, __align = 4294967298}

The RouDi Mon+Discover thread waits on a mutex, but this mutex is held by a crashed process, so RouDi's wait never returns ...

@elBoberido

@elBoberido
Member

@wangxuemin unfortunately a bigger refactoring is required to fix this and I'm quite busy with my startup and currently don't have that much time for iceoryx development. The startup offers paid support in case this is critical for you :)

@wangxuemin

> @wangxuemin unfortunately a bigger refactoring is required to fix this and I'm quite busy with my startup and currently don't have that much time for iceoryx development. The startup offers paid support in case this is critical for you :)

Thanks for your reply~
