sometimes got POPO__CHUNK_LOCKING_ERROR #999

Closed
qclzdh opened this issue Dec 30, 2021 · 14 comments · Fixed by #1000
Labels: bug (Something isn't working)


qclzdh commented Dec 30, 2021

Required information

Operating system:
E.g. Ubuntu 18.04 LTS

Compiler version:
E.g. GCC 7.4.0

Observed result or behaviour:
got this error during publishing data:
iceoryx/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:67 { bool iox::posix::mutex::lock() -> pthread_mutex_lock } ::: [ 0 ] Success
ICEORYX error! POPO__CHUNK_LOCKING_ERROR
iceoryx_hoofs/source/error_handling/error_handling.cpp:57: static void iox::ErrorHandler::ReactOnErrorLevel(iox::ErrorLevel, const char*): Assertion `false' failed.

Expected result or behaviour:
What do you expect to happen?
No error happened.

Conditions where it occurred / Performed steps:
Describe how one can reproduce the bug.
It's very difficult to reproduce this error, but I have already seen it several times.

Could you give some hints on why this happened?

elfenpiff added a commit to ApexAI/iceoryx that referenced this issue Jan 1, 2022
…e mutex functions for errno checking

Signed-off-by: Christian Eltzschig <me@elchris.org>
elfenpiff (Contributor) commented Jan 1, 2022

@qclzdh I looked at the code and found a bug in our mutex implementation which causes the mutex to always print [ 0 ] Success as the error message when it fails.
Could you please check out my pull request #1000 with the mutex fix, run your code again, and provide me with the updated error message? Then I can continue the bug hunt.
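
For context, pthread_mutex_lock reports failures through its return value and does not set errno, which is why the failed lock above was logged as "[ 0 ] Success". A minimal sketch of checking the return value instead (illustration only, not the actual posixCall change from #1000; lockOrReport is a made-up helper):

#include <cstring>
#include <iostream>
#include <pthread.h>

// pthread_mutex_lock returns 0 on success and an error code on failure;
// errno is left untouched, so it must not be used for the diagnosis.
bool lockOrReport(pthread_mutex_t& mutex)
{
    const int result = pthread_mutex_lock(&mutex);
    if (result != 0)
    {
        std::cerr << "pthread_mutex_lock failed: [ " << result << " ] " << std::strerror(result) << std::endl;
        return false;
    }
    return true;
}

int main()
{
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    lockOrReport(mutex);
    pthread_mutex_unlock(&mutex);
    return 0;
}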

qclzdh (Author) commented Jan 2, 2022

Thanks @elfenpiff, I will run with the patch again and report the result back to you.

qclzdh (Author) commented Jan 2, 2022

@elfenpiff, with the new patch I got this error:
iceoryx/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:67 { bool iox::posix::mutex::lock() -> pthread_mutex_lock } ::: [ 22 ] Invalid argument
terminate called without an active exception
Aborted

elBoberido (Member) commented:

@qclzdh do you have some more information on how to reproduce this, e.g. are there multiple publishers or multiple subscribers? Was an application stopped, or did one die, before this happened?

elBoberido added the bug label Jan 3, 2022
qclzdh (Author) commented Jan 5, 2022

I compile iceoryx with one-to-many-only, and there are several subscribers in different processes. I re-checked the log: the warning "[Warning]: Application xxxxx_336 not responding (last response 1563 milliseconds ago) --> removing it" appeared before this locking error. Is this related?

elBoberido (Member) commented:

@qclzdh yes, this is related. Can you share more information, like the PublisherOptions and SubscriberOptions? How fast is the publisher sending data?

As far as I know, one-to-many-only is not tested as much as many-to-many. Could you check if the error also occurs when you don't use one-to-many-only? This would help to narrow down the cause of the error.

qclzdh (Author) commented Jan 5, 2022

@elBoberido, here are the options:
'''
PublisherOptions.historyCapacity: 0U
PublisherOptions.nodeName: ""
PublisherOptions.offerOnCreate: true
PublisherOptions.SubscriberTooSlowPolicy: iox::popo::SubscriberTooSlowPolicy::DISCARD_OLDEST_DATA

SubscriberOptions.queueCapacity: 16U
SubscriberOptions.historyRequest: 0U
SubscriberOptions.nodeName: ""
SubscriberOptions.subscribeOnCreate: true
SubscriberOptions.queueFullPolicy: iox::popo::QueueFullPolicy::DISCARD_OLDEST_DATA
'''
How fast is the publisher sending data? ---> Very fast; I test it with while(1) and no delay/sleep.
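
For reference, a minimal sketch of how the options listed above map to the iceoryx C++ API (MyType, the application name, and the service description are placeholders; exact enum and field names may differ slightly between iceoryx versions):

#include "iceoryx_posh/popo/publisher.hpp"
#include "iceoryx_posh/popo/subscriber.hpp"
#include "iceoryx_posh/runtime/posh_runtime.hpp"

// Hypothetical payload type, for illustration only.
struct MyType
{
    double value{0.0};
};

int main()
{
    iox::runtime::PoshRuntime::initRuntime("options-sketch");

    iox::popo::PublisherOptions publisherOptions;
    publisherOptions.historyCapacity = 0U;
    publisherOptions.offerOnCreate = true;
    publisherOptions.subscriberTooSlowPolicy = iox::popo::SubscriberTooSlowPolicy::DISCARD_OLDEST_DATA;

    iox::popo::SubscriberOptions subscriberOptions;
    subscriberOptions.queueCapacity = 16U;
    subscriberOptions.historyRequest = 0U;
    subscriberOptions.subscribeOnCreate = true;
    subscriberOptions.queueFullPolicy = iox::popo::QueueFullPolicy::DISCARD_OLDEST_DATA;

    iox::popo::Publisher<MyType> publisher({"Foo", "Bar", "Baz"}, publisherOptions);
    iox::popo::Subscriber<MyType> subscriber({"Foo", "Bar", "Baz"}, subscriberOptions);

    // ... publish and subscribe as usual ...
    return 0;
}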

It seems it takes a few days on Ubuntu to reproduce this, but only a few hours on arm64-qemu; I just got more logs.
Here is the log format:
(353/350): 353 --> tid, 350 --> pid
(xxx/350): this log is from one process which has some subscribers and publishers.
(xxx/245): this log is from iox-roudi.
From the log, it looks like timedWait didn't wake up in time, so no keep-alive message reached RouDi within 1.5 s and RouDi killed this process.

[2022-01-05 11:51:49.510065] W/iceoryx(353/350): 2022-01-05 11:51:49.509 [Warning]: sending keep alive
[2022-01-05 11:51:49.512830] W/iceoryx(353/350): 2022-01-05 11:51:49.512 [Warning]: sendKeepAliveAndHandleShutdownPreparation
[2022-01-05 11:51:49.849028] W/iceoryx(353/350): keep live thread wake from timedWait
[2022-01-05 11:51:49.851879] W/iceoryx(353/350): keep live thread call m_callable
[2022-01-05 11:51:49.854752] W/iceoryx(353/350): 2022-01-05 11:51:49.854 [Warning]: sending keep alive
[2022-01-05 11:51:49.868315] W/iceoryx(353/350): 2022-01-05 11:51:49.868 [Warning]: sendKeepAliveAndHandleShutdownPreparation
[2022-01-05 11:51:50.179458] W/iceoryx(353/350): keep live thread wake from timedWait
[2022-01-05 11:51:50.179585] W/iceoryx(353/350): keep live thread call m_callable
[2022-01-05 11:51:50.179863] W/iceoryx(353/350): 2022-01-05 11:51:50.179 [Warning]: sending keep alive
[2022-01-05 11:51:50.193033] W/iceoryx(353/350): 2022-01-05 11:51:50.192 [Warning]: sendKeepAliveAndHandleShutdownPreparation
[2022-01-05 11:51:50.378504] W/iceoryx(307/245): keep live thread wake from timedWait
[2022-01-05 11:51:50.378629] W/iceoryx(307/245): keep live thread call m_callable
[2022-01-05 11:51:50.378988] W/iceoryx(308/245): keep live thread wake from timedWait
[2022-01-05 11:51:50.379088] W/iceoryx(308/245): keep live thread call m_callable
[2022-01-05 11:51:50.380017] W/iceoryx(309/245): keep live thread wake from timedWait
[2022-01-05 11:51:50.380301] W/iceoryx(309/245): keep live thread call m_callable
[2022-01-05 11:51:50.504497] W/iceoryx(353/350): keep live thread wake from timedWait
[2022-01-05 11:51:50.504624] W/iceoryx(353/350): keep live thread call m_callable
[2022-01-05 11:51:50.504952] W/iceoryx(353/350): 2022-01-05 11:51:50.504 [Warning]: sending keep alive
[2022-01-05 11:51:50.509818] W/iceoryx(353/350): 2022-01-05 11:51:50.509 [Warning]: sendKeepAliveAndHandleShutdownPreparation
[2022-01-05 11:51:50.810246] W/iceoryx(353/350): keep live thread wake from timedWait
[2022-01-05 11:51:50.810390] W/iceoryx(353/350): keep live thread call m_callable
[2022-01-05 11:51:50.810762] W/iceoryx(353/350): 2022-01-05 11:51:50.810 [Warning]: sending keep alive
[2022-01-05 11:51:50.812424] W/iceoryx(353/350): 2022-01-05 11:51:50.812 [Warning]: sendKeepAliveAndHandleShutdownPreparation
[2022-01-05 11:51:51.379811] W/iceoryx(307/245): keep live thread wake from timedWait
[2022-01-05 11:51:51.379981] W/iceoryx(307/245): keep live thread call m_callable
[2022-01-05 11:51:51.380336] W/iceoryx(308/245): keep live thread wake from timedWait
[2022-01-05 11:51:51.380437] W/iceoryx(308/245): keep live thread call m_callable
[2022-01-05 11:51:51.380767] W/iceoryx(309/245): keep live thread wake from timedWait
[2022-01-05 11:51:51.380869] W/iceoryx(309/245): keep live thread call m_callable
[2022-01-05 11:51:52.380467] W/iceoryx(307/245): keep live thread wake from timedWait
[2022-01-05 11:51:52.380620] W/iceoryx(307/245): keep live thread call m_callable
[2022-01-05 11:51:52.381181] W/iceoryx(308/245): keep live thread wake from timedWait
[2022-01-05 11:51:52.381294] W/iceoryx(308/245): keep live thread call m_callable
[2022-01-05 11:51:52.381468] W/iceoryx(309/245): keep live thread wake from timedWait
[2022-01-05 11:51:52.381562] W/iceoryx(309/245): keep live thread call m_callable
[2022-01-05 11:51:52.403250] W/iceoryx(310/245): 2022-01-05 11:51:52.400 [Warning]: Application ARACOM_350 not responding (last response 1589 milliseconds ago) --> removing it
[2022-01-05 11:51:52.678946] W/iceoryx(351/350): 2022-01-05 11:51:52.678 [Warning]: ICEORYX error! POPO__CHUNK_SENDER_INVALID_CHUNK_TO_SEND_FROM_USER
[2022-01-05 11:51:53.381169] W/iceoryx(307/245): keep live thread wake from timedWait
[2022-01-05 11:51:53.381303] W/iceoryx(307/245): keep live thread call m_callable
[2022-01-05 11:51:53.382108] W/iceoryx(308/245): keep live thread wake from timedWait
[2022-01-05 11:51:53.382210] W/iceoryx(308/245): keep live thread call m_callable
[2022-01-05 11:51:53.382426] W/iceoryx(309/245): keep live thread wake from timedWait
[2022-01-05 11:51:53.382519] W/iceoryx(309/245): keep live thread call m_callable
[2022-01-05 11:51:53.502339] W/iceoryx(355/350): 2022-01-05 11:51:53.502 [Warning]: ICEORYX error! POPO__CHUNK_RECEIVER_INVALID_CHUNK_TO_RELEASE_FROM_USER
[2022-01-05 11:51:53.545493] W/iceoryx(353/350): keep live thread wake from timedWait
[2022-01-05 11:51:53.556738] W/iceoryx(353/350): keep live thread call m_callable
[2022-01-05 11:51:53.560927] W/iceoryx(353/350): 2022-01-05 11:51:53.560 [Warning]: sending keep alive
[2022-01-05 11:51:53.562095] W/iceoryx(312/245): 2022-01-05 11:51:53.561 [Warning]: Received Keepalive from unknown process ARACOM_350
[2022-01-05 11:51:53.567152] W/iceoryx(353/350): 2022-01-05 11:51:53.566 [Warning]: sendKeepAliveAndHandleShutdownPreparation
[2022-01-05 11:51:53.508718] W/iceoryx(351/350): iceoryx/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:67 { bool iox::posix::mutex::lock() -> pthread_mutex_lock } ::: [ 0 ] Success
[2022-01-05 11:51:53.567854] E/iceoryx(351/350): 2022-01-05 11:51:53.567 [ Error ]: ICEORYX error! POPO__CHUNK_LOCKING_ERROR
[2022-01-05 11:51:54.382301] W/iceoryx(307/245): keep live thread wake from timedWait
[2022-01-05 11:51:54.382471] W/iceoryx(307/245): keep live thread call m_callable
[2022-01-05 11:51:54.383022] W/iceoryx(308/245): keep live thread wake from timedWait
[2022-01-05 11:51:54.383132] W/iceoryx(308/245): keep live thread call m_callable
[2022-01-05 11:51:54.383298] W/iceoryx(309/245): keep live thread wake from timedWait

elBoberido (Member) commented Jan 6, 2022

@qclzdh with your earlier question about the thread safety of publisher and subscriber in mind, and given that the errors come from the same pid but different tids, is the publisher guarded by a mutex?

Edit:
The easiest way to make the publisher thread safe is to use the smart_lock, e.g.

#include "iceoryx_hoofs/internal/concurrent/smart_lock.hpp"
...
iox::concurrent::smart_lock<iox::popo::Publisher<MyType>> publisher(
    iox::concurrent::ForwardArgsToCTor, iox::capro::ServiceDescription{"Foo", "Bar", "Baz"});

With this, every method call is guarded by a mutex. If you want to call multiple methods while locking only once, you can do

auto publisherScopeGuard = publisher.getScopeGuard();
publisherScopeGuard->loan().and_then([](auto& sample) { sample.publish(); });

elfenpiff added a commit to ApexAI/iceoryx that referenced this issue Jan 18, 2022
Signed-off-by: Christian Eltzschig <me@elchris.org>
elfenpiff added a commit to ApexAI/iceoryx that referenced this issue Jan 18, 2022
…e mutex functions for errno checking

Signed-off-by: Christian Eltzschig <me@elchris.org>
elfenpiff added a commit to ApexAI/iceoryx that referenced this issue Jan 18, 2022
Signed-off-by: Christian Eltzschig <me@elchris.org>
elfenpiff added a commit to ApexAI/iceoryx that referenced this issue Jan 18, 2022
…e mutex functions for errno checking

Signed-off-by: Christian Eltzschig <me@elchris.org>
elfenpiff added a commit to ApexAI/iceoryx that referenced this issue Jan 18, 2022
Signed-off-by: Christian Eltzschig <me@elchris.org>
elfenpiff added a commit that referenced this issue Jan 18, 2022
…-in-mutex

iox-#999 The posixCall must use the return value of the mutex functio…
elfenpiff reopened this Jan 18, 2022
budrus (Contributor) commented Jan 20, 2022

@qclzdh Have you tried the proposal from @elBoberido? Publishers and subscribers are not thread safe on their API, so you would have to protect them with a mutex or our smart_lock.
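
A minimal sketch of the plain-mutex variant, in case the smart_lock is not an option (MyType and publishThreadSafe are placeholders; every thread that uses the publisher must go through the same mutex):

#include <mutex>

#include "iceoryx_posh/popo/publisher.hpp"

// Hypothetical payload type, for illustration only.
struct MyType
{
    double value{0.0};
};

std::mutex publisherMutex;

void publishThreadSafe(iox::popo::Publisher<MyType>& publisher, const MyType& data)
{
    // Serialize all access to the publisher; its API itself is not thread safe.
    std::lock_guard<std::mutex> guard(publisherMutex);
    publisher.loan()
        .and_then([&](auto& sample) {
            *sample = data;
            sample.publish();
        })
        .or_else([](auto& error) {
            // handle iox::popo::AllocationError here
        });
}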

qclzdh (Author) commented Jan 21, 2022

@budrus Finally, I understand why this happened on my side.

  1. This locking error "pthread_mutex_lock } ::: [ 22 ] Invalid argument" was caused by RouDi killing that process, since RouDi didn't receive a keepalive message within 1.5 s.
  2. There is some socket sending in the test application; in the worst case that sending operation blocks for about ~2 s, which stalls all threads on the same CPU. So when the keepalive thread runs on the same CPU as that blocking operation, there is some chance that the keepalive message won't reach RouDi in time. This also explains why it is difficult to reproduce on Ubuntu, which has 8 processors, while we only use one processor in aarch64-qemu.

qclzdh (Author) commented Jan 21, 2022

@budrus Yes, we used a mutex for those APIs.

elBoberido (Member) commented:

@qclzdh there is an option to run RouDi without monitoring, -m off, but then you (potentially) lose the cleanup of terminated processes. I'm not quite sure if the cleanup kicks in when the application tries to register again at RouDi.

qclzdh (Author) commented Jan 21, 2022

@elBoberido thanks, I noticed this and disabled the monitoring mode.

budrus (Contributor) commented Feb 3, 2022

@qclzdh As far as I understand, you now know what the problem was, so I will close this issue. Feel free to reopen it if you still have a problem.

budrus closed this as completed Feb 3, 2022
MatthiasKillat pushed a commit to ApexAI/iceoryx that referenced this issue Feb 6, 2022
…e mutex functions for errno checking

Signed-off-by: Christian Eltzschig <me@elchris.org>
MatthiasKillat pushed a commit to ApexAI/iceoryx that referenced this issue Feb 6, 2022
Signed-off-by: Christian Eltzschig <me@elchris.org>