
mutex owner died -> POPO__CHUNK_LOCKING_ERROR #2193

Open
niclar opened this issue Feb 14, 2024 · 57 comments
Labels
bug (Something isn't working), needs info (A bug report is waiting for more information)

Comments

@niclar
Contributor

niclar commented Feb 14, 2024

Hi, we just encountered a fatal crash in RouDi (the first time on b184113 since it came out); the RouDi logs are below.

An iceoryx "client" process may have died beforehand, but neither the iceoryx logs of those processes nor the syslog indicate this. In any case, this should not crash RouDi. Do you have any pointers as to why we encountered this and how it can be remedied?

(We're running 21 iceoryx client processes on the box, pinned and isolated, with ~1500 publishers, and RouDi at highest priority.)

Thanks

Required information

Operating system:
Ubuntu 22.04.2 LTS

Compiler version:
Clang 17.0.5

Eclipse iceoryx version:
b184113 (v2.0.x, 20231121)

Observed result or behaviour:

2024-02-14 07:06:02.846 [Error]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/d0c9061204-f53d20e6d8.clean/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:323 { expected<void, MutexLockError> iox::posix::mutex::lock() -> pthread_mutex_lock }  :::  [ 130 ]  Owner died
2024-02-14 07:06:02.846 [Error]: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent()
2024-02-14 07:06:02.846 [Fatal]: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application.
2024-02-14 07:06:02.846 [Error]: ICEORYX error! POPO__CHUNK_LOCKING_ERROR
libc++abi: terminating

Edit by elBoberido:

Related issues

@elBoberido
Member

@niclar this is related to #325. The history feature involves a mutex: RouDi locks it so it can insert samples into the queue without interference from the publisher. This mutex is only contested when subscribers are added or removed, but the publisher nevertheless holds the lock for a short time on every publish. If the application is terminated during this window and RouDi then tries to clean up the resources of the process, it finds the corrupt mutex and terminates. I once tried to fix the problem, but unfortunately the data guarded by the mutex might also be corrupted and more refactoring would have been necessary to fully fix it. Other things came up and the issue never really occurred in our setup.
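For readers less familiar with the mechanism: the "Owner died" / error 130 messages in the log are the POSIX EOWNERDEAD path of a robust, process-shared mutex. A minimal standalone sketch of that path with plain pthreads (illustrative only, not the iceoryx implementation):

// Sketch of a robust, process-shared pthread mutex and the EOWNERDEAD path
// that the RouDi log reports; illustrative only, not iceoryx code.
#include <pthread.h>
#include <cerrno>

void initRobustMutex(pthread_mutex_t* mtx)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED); // usable across processes
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);    // report dead owners
    pthread_mutex_init(mtx, &attr);
    pthread_mutexattr_destroy(&attr);
}

bool lockOrRecover(pthread_mutex_t* mtx)
{
    const int rc = pthread_mutex_lock(mtx);
    if (rc == EOWNERDEAD)
    {
        // The previous owner died while holding the lock. The guarded data may be
        // corrupted and must be repaired before the mutex is marked consistent;
        // this is what Mutex::make_consistent() wraps.
        pthread_mutex_consistent(mtx);
        return true;
    }
    return rc == 0;
}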

This should actually only happen if an application died or if RouDi assumed that an application died. In that case you should see something like "Application foo not responding (last response 1509 milliseconds ago) --> removing it". If monitoring is turned off, RouDi does not detect the dead application immediately but only when the application tries to re-register, and then there is also a log message. If you don't see any log messages for removing or re-registering an application, then I'm a bit at a loss and it could be anything, including cosmic rays.

@mossmaurice added the bug label on Feb 20, 2024
@niclar
Contributor Author

niclar commented Feb 23, 2024

Looking at the logs again, it looks like RouDi incorrectly assumed that 4 processes had died beforehand. I don't know why it would assume this, though. Is there a starvation issue in the detection?

@elBoberido
Member

@niclar this is related to #1361. I already improved the situation on master

@niclar
Contributor Author

niclar commented Feb 26, 2024

Looks like that fix (2023-11-01) was already in the code we are running (built 2023-11-23).

@elBoberido
Member

Oh, I just saw v2.0.x and thought it came from the release branch.

Is the system time changed abruptly, so that jumps can occur?

Another scenario that leads to this error is when the threads which update the heartbeat value are not scheduled.

A workaround might be to massively increase the PROCESS_KEEP_ALIVE_TIMEOUT. Unfortunately, it currently cannot be changed via a CMake parameter, but that would not be difficult to add.

@niclar
Contributor Author

niclar commented Feb 27, 2024

These threads are part of the publishers and not RouDi, right? I reckon they might have been starved in this instance. Being able to hook in some deadline-monotonic scheduling or similar would be nice here.

@elBoberido
Member

It is a thread of the runtime. I think it should not be too difficult to add an updateHeartbeat method to the runtime; one could then call it manually.
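Purely as an illustration of that idea (updateHeartbeat does not exist in iceoryx; the method name and the helper functions below are invented), an application with a pinned loop could then keep the heartbeat alive from a thread that is guaranteed to be scheduled:

// Hypothetical sketch only: a manual heartbeat trigger on the runtime.
// 'updateHeartbeat', 'keepRunning' and 'doDeterministicWork' are made-up names.
#include "iceoryx_posh/runtime/posh_runtime.hpp"

bool keepRunning();          // placeholder: application-specific stop condition
void doDeterministicWork();  // placeholder: the pinned real-time workload

void realtimeLoop()
{
    auto& runtime = iox::runtime::PoshRuntime::initRuntime("pinned-app");
    while (keepRunning())
    {
        doDeterministicWork();
        runtime.updateHeartbeat(); // called from the pinned thread, so it cannot be starved
    }
}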

@niclar
Contributor Author

niclar commented Mar 1, 2024

This morning we encountered this fatal error again, during (or just after) the morning restart routine of some iceoryx client processes.

@elBoberido
Member

Do you have more information, e.g. was the load during that time quite high?

@niclar
Contributor Author

niclar commented Mar 7, 2024

CPU core utilization is low overall, with RouDi running on a dedicated, interrupt-isolated core. We hit this again the other day, seemingly "spontaneously", a few hours after a midday restart of ~20 RouDi client processes.

@elBoberido
Member

Do you have some logs of spikes in the CPU utilization when the issue occurred?

@niclar
Contributor Author

niclar commented Mar 11, 2024

...we've added core utilization logging now for next time it happens...

@elfenpiff
Contributor

@elBoberido I looked at the issue and was wondering if we could remove the mutex completely by sacrificing the history. If no history has to be delivered when a subscriber connects to a publisher, there should also be no need for a mutex.

@niclar Do you require the history when a subscriber connects?

Another solution to get rid of the mutex would be to transfer the samples without a lock (if this is possible), but then a user might receive the same sample twice. This would only happen when connecting to a new publisher (and could maybe be filtered out with sequence numbers?!).
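A rough sketch of the sequence-number idea on the subscriber side (illustrative only; Sample and the helper below are made-up names, not an iceoryx API):

// Illustrative sketch: drop samples that were already delivered, e.g. because they
// were transferred again while connecting to a new publisher. Not an iceoryx API.
#include <cstdint>
#include <optional>

struct Sample
{
    std::uint64_t sequenceNumber;
    // payload ...
};

class DuplicateFilter
{
  public:
    bool isDuplicate(const Sample& sample)
    {
        if (m_lastSeen && sample.sequenceNumber <= *m_lastSeen)
        {
            return true; // already seen, skip it
        }
        m_lastSeen = sample.sequenceNumber;
        return false;
    }

  private:
    std::optional<std::uint64_t> m_lastSeen;
};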

@niclar
Contributor Author

niclar commented Mar 11, 2024

@elfenpiff, good news, we do not. All publishers and subscribers are instantiated with historyCapacity=0 and historyRequest=0, respectively.

@elBoberido
Member

It is not only the history. AFAIK, the mutex also guards the adding of the subscriber queue to the publisher.

Furthermore, the POPO__CHUNK_LOCKING_ERROR is only a symptom. The issue is that RouDi kills running applications because of the heartbeat thread not being scheduled. At least that's my assumption.

@elfenpiff
Contributor

@elBoberido

Furthermore, the POPO__CHUNK_LOCKING_ERROR is only a symptom. The issue is that RouDi kills running applications because of the heartbeat thread not being scheduled. At least that's my assumption.

We can mitigate this by turning monitoring off. But a better alternative would be a liveliness QoS that enforces contracts such as: a subscriber has to collect a sample within X seconds at the latest, and a publisher must publish a sample every Y seconds. This would require some time-consuming refactoring, though.

It is not only the history. AFAIK, the mutex also guards the adding of the subscriber queue to the publisher.

Yes, you are right. But this could be solved by bringing loffli into play together with a lock-free optional, where the atomic signals whether the value is set. This is also major refactoring, but I think it is solvable.

@elBoberido
Member

Well, the monitoring was explicitly turned on as far as I know 😅

@elBoberido
Member

@niclar did you encounter this problem again?

In the meantime I'm a bit hesitant about exposing the heartbeat trigger as a public API. An alternative would be to increase the timeout for the detection of dead processes. Currently it is about 1.5 seconds, and the heartbeat should be sent every 300 ms. If this happens again, can you check the RouDi log output for messages like "Application foo not responding (last response #### milliseconds ago) --> removing it" and report the number of milliseconds?

@niclar
Contributor Author

niclar commented Apr 15, 2024

As a matter of fact, we did on Friday evening:

2024-04-12 18:35:37.238 [Warn ]: Application marketview.prd not responding (last response 1511 milliseconds ago) --> removing it
2024-04-12 18:35:37.306 [Error]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/d0c9061204-bf50be8845.clean/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:323 { expected<void, MutexLockError> iox::posix::mutex::lock() -> pthread_mutex_lock } ::: [ 130 ] Owner died
2024-04-12 18:35:37.306 [Error]: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent()
2024-04-12 18:35:37.306 [Fatal]: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application.
2024-04-12 18:35:37.306 [Error]: ICEORYX error! POPO__CHUNK_LOCKING_ERROR
libc++abi: terminating

@elBoberido
Member

Okay, the deadline was missed only by a little bit. For some reason the runtime thread which sends the heartbeat was not triggered for a long time. Is this on Windows or on Linux?

As a quick workaround, I would suggest changing the 5 in this line in iceoryx_posh_types.hpp to 10:

constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 5 * PROCESS_KEEP_ALIVE_INTERVAL; // > PROCESS_KEEP_ALIVE_INTERVAL

It does not fix the problem but should make it much less likely. You could increase the time even more; with e.g. 200 it would take one minute until RouDi assumes a process is dead. The 1.5 seconds date from a time when RouDi did not clean up the resources when a crashed application tried to re-register, so we kept the timeout low in order not to have to wait too long. Now it could indeed be set to 1 minute, which would not be too bad for cleaning up the remainders of a crashed process but should virtually prevent killing running applications.

The more I think about it, the more I come to the conclusion to just increase this timeout and eliminate this source of leaving a corrupt mutex behind.
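For reference, the workaround boils down to a one-line change of the constant quoted above in iceoryx_posh_types.hpp; the factor 200 is just an example matching the one-minute figure mentioned here:

-- constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 5 * PROCESS_KEEP_ALIVE_INTERVAL;   // ~1.5 s with the 300 ms heartbeat interval
++ constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 200 * PROCESS_KEEP_ALIVE_INTERVAL; // ~1 min, tolerates long heartbeat gaps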

@niclar
Contributor Author

niclar commented Apr 16, 2024

This is on Linux, with dedicated cores. I applied the workaround and will get back if we encounter it again.

Thanks!

@niclar
Contributor Author

niclar commented Apr 22, 2024

Still encountering the issue

2024-04-22 11:32:58.571 [Warn ]: Application X not responding (last response 3046 milliseconds ago) --> removing it
2024-04-22 11:32:58.579 [Error]: /mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/d0c9061204-2ee4ec5b88.clean/iceoryx_hoofs/source/posix_wrapper/mutex.cpp:323 { expected<void, MutexLockError> iox::posix::mutex::lock() -> pthread_mutex_lock } ::: [ 130 ] Owner died
2024-04-22 11:32:58.579 [Error]: The thread/process which owned the mutex died. The mutex is now in an inconsistent state and must be put into a consistent state again with Mutex::make_consistent()
2024-04-22 11:32:58.579 [Fatal]: Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application.
2024-04-22 11:32:58.579 [Error]: ICEORYX error! POPO__CHUNK_LOCKING_ERROR

@elBoberido
Member

This is weird, especially in combination with the high socket usage from #2268.

Just to exclude other possibilities: you are using the plain upstream version from the commit mentioned in the description, and not a custom branch with that commit cherry-picked?

@niclar
Contributor Author

niclar commented Apr 22, 2024

It might very well be related to the IPC queues, yes.

Yes, the only difference being the attached "patches":
increased_process_keep_alive_intervall.patch
service_description.patch

Build options;
-DINTROSPECTION=ON
-DIOX_MAX_PUBLISHERS=4096
-DIOX_MAX_SUBSCRIBERS=4096
-DIOX_MAX_SUBSCRIBERS_PER_PUBLISHER=64
-DIOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY=16
-DIOX_MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY=256
-DIOX_MAX_CLIENTS_PER_SERVER=512
-DIOX_MAX_NUMBER_OF_NOTIFIERS=512

RouDi conf:

iox::config::CmdLineArgs_t cmdLineArgs;
cmdLineArgs.run = conf.net.enabled;
cmdLineArgs.logLevel = iox::log::LogLevel::INFO;
cmdLineArgs.monitoringMode = iox::roudi::MonitoringMode::ON;
cmdLineArgs.compatibilityCheckLevel = iox::version::CompatibilityCheckLevel::PATCH;
//cmdLineArgs.uniqueRouDiId
cmdLineArgs.processTerminationDelay = iox::roudi::PROCESS_DEFAULT_TERMINATION_DELAY;
cmdLineArgs.processKillDelay = iox::roudi::PROCESS_DEFAULT_KILL_DELAY;
//cmdLineArgs.configFilePath;

@niclar
Contributor Author

niclar commented Apr 22, 2024

We'll update to head in a day or two and see how that plays out...

@elBoberido
Member

elBoberido commented Apr 22, 2024

Those patches should not have any negative effect on the heartbeat. Do you run the iceoryx tests after the build? Since you changed some of the compile-time constants, it is strongly recommended to run the tests. I think we need to put some emphasis on this in the documentation.

Oh, and could you increase the timeout to 60_s? If it still fails with such a long timeout, then there is something extremely fishy.

@niclar
Contributor Author

niclar commented Apr 22, 2024

It seems like HEAD might have solved the IPC queue and this cleanup issue; let us run it a while longer to be sure.

@elBoberido
Member

The log does not fit the 1-minute timeout. It still terminates the app after 1.5 seconds.

@niclar
Contributor Author

niclar commented Apr 26, 2024

The 1-minute timeout is set and compiled in, however. No idea why this happens.

constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 200 * PROCESS_KEEP_ALIVE_INTERVAL;

@elBoberido
Member

Can you set it to 60_s?

@elBoberido
Member

Oh, since you are using the main branch, can you also run RouDi with -x commitId to ensure that all applications use the same iceoryx version? If you want to be stricter, you could also deactivate reproducible builds and use -x buildDate. The default is patch, which allows all applications with the same patch level to connect to RouDi.

@PMSRGroup

@niclar Have you managed to solve the issue? If so, how? We are getting the exact same problem and are currently trying the options listed here, but no luck so far. We will update to latest later tonight and let it run for a few days.
Currently running: f602986

@rbdm-qnt

Having the same issue here as well, on commit f6dc4e1 (latest). @niclar @elBoberido is there any way to make it less frequent or solve it completely?

@elBoberido
Member

@rbdm-qnt if the monitoring is turned off, then this will not happen, but in case an application crashes, its resources are only cleaned up when RouDi shuts down or when the application re-registers.

Can you try the following change in iceoryx_posh_types.hpp:

-- constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 5 * PROCESS_KEEP_ALIVE_INTERVAL;
++ constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 60_s;

My guess is that this issue is related to jumps in the system time, as mentioned further above. I was just able to trigger the issue by changing the system time with timedatectl. I think the problem is that we rely on the timestamp of the last heartbeat of the application. If there are jumps in the system time, the heartbeat thread of the application as well as the monitoring thread of RouDi will be woken up, and now there is a race between the application sending the heartbeat with the updated timestamp and RouDi checking the liveliness of the applications. If RouDi wins and the jump was into the future, it assumes that the application did not send a heartbeat for too long and therefore is in a degraded state.

Please check whether your system time jumps and whether increasing PROCESS_KEEP_ALIVE_TIMEOUT reduces the frequency.

The proper solution would probably be to use a counter instead of a timestamp. We could then use a heuristic like: if the counter has not been incremented for 5 consecutive liveliness checks by RouDi, the application can be considered to be in a degraded state.
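A minimal sketch of that counter idea, assuming the heartbeat object lives in shared memory (illustrative only, not the planned implementation):

// Illustrative sketch of a counter-based liveliness check instead of timestamps;
// names are made up. The counter depends only on scheduling, not on any clock.
#include <atomic>
#include <cstdint>

struct Heartbeat
{
    std::atomic<std::uint64_t> counter{0};
};

// application side, called periodically by the heartbeat thread
void sendHeartbeat(Heartbeat& heartbeat)
{
    heartbeat.counter.fetch_add(1, std::memory_order_relaxed);
}

// RouDi side, called on every monitoring cycle
bool isAlive(const Heartbeat& heartbeat, std::uint64_t& lastSeen, std::uint32_t& missedCycles)
{
    const auto current = heartbeat.counter.load(std::memory_order_relaxed);
    if (current != lastSeen)
    {
        lastSeen = current;
        missedCycles = 0;
        return true;
    }
    return ++missedCycles < 5; // degraded after 5 consecutive cycles without an increment
}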

@elBoberido
Member

@rbdm-qnt @PMSRGroup @niclar did you have a chance to check whether my theory with the jumps in system time is correct and the workaround works?

@elBoberido added the needs info label on Jun 3, 2024
@passchaos

@elBoberido Can the timestamp used by the heartbeat be based on the system boot time, so that it is not affected by changes to the system time?

@elBoberido
Member

@passchaos are you also experiencing this issue?

I just had a look at the code and we are already using CLOCK_MONOTONIC, which is related to the boot time on Linux. Unfortunately, it is also affected by adjustments performed by adjtime and NTP. There is CLOCK_MONOTONIC_RAW, which should not be affected, but I can also trigger the error with that clock source by performing these steps:

sudo timedatectl set-ntp false
sudo timedatectl set-time 18:30:00
sudo timedatectl set-ntp true

At least in one of three runs, RouDi assumes the application did not send the heartbeat.

The workaround would be to increase the PROCESS_KEEP_ALIVE_TIMEOUT to a value larger than the jumps in the clock.

A proper fix would be to rework the heartbeat handling. I have an idea how to do it, but I'm currently quite busy with other topics.

@passchaos

@elBoberido I have found the cause of the problem, and it is not RouDi's problem but a problem on the side that sends the keep-alive signal.

The code at https://github.com/eclipse-iceoryx/iceoryx/blob/main/iceoryx_hoofs/time/source/duration.cpp#L55 uses CLOCK_REALTIME. Older versions, such as v2.0.6, also use CLOCK_REALTIME in the timer sources, see https://github.com/eclipse-iceoryx/iceoryx/blob/v2.0.6/iceoryx_hoofs/source/posix_wrapper/timer.cpp#L243 and https://github.com/eclipse-iceoryx/iceoryx/blob/v2.0.6/iceoryx_hoofs/source/posix_wrapper/timer.cpp#L436

@elBoberido
Member

@passchaos the code in duration.cpp relates only to conversions from Duration to timespec and is not related to the heartbeat.

The heartbeat uses CLOCK_MONOTONIC
https://github.com/eclipse-iceoryx/iceoryx/blob/main/iceoryx_posh/source/runtime/heartbeat.cpp#L50-L60

@passchaos

@elBoberido Well, I know this is the heartbeat logic: each process involved in the communication sends a keep-alive signal to RouDi. But because each process uses a timer to send the keep-alive signal, and the timer uses a time source that is not monotonic (see my previous link), the following happens:
if the system time is set back to an earlier time, the timer considers that the interval has not yet elapsed and does not send a keep-alive signal to RouDi.
When I change it to a monotonic clock, the loss of heartbeats due to time adjustments no longer occurs.

@elBoberido
Member

elBoberido commented Jun 6, 2024

@passchaos ah, right. The thread waits on a semaphore and that causes the issue. The problem is that sem_timedwait requires CLOCK_REALTIME, see also https://www.man7.org/linux/man-pages/man3/sem_timedwait.3p.html.
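As a side note, Linux with glibc 2.30 or newer provides sem_clockwait, which accepts CLOCK_MONOTONIC and would avoid both the busy loop and the realtime-clock dependency. A rough sketch of such a wait (an assumption about a possible direction, not the current iceoryx code):

// Sketch: wait on a semaphore against CLOCK_MONOTONIC so that system-time jumps do
// not disturb the periodic wake-up. Requires glibc >= 2.30; not the iceoryx code.
#include <semaphore.h>
#include <time.h>
#include <cerrno>

// Waits up to 'intervalNs' nanoseconds for 'sem', measured on the monotonic clock.
// Returns true if the semaphore was posted (e.g. shutdown requested), false on timeout.
bool waitMonotonic(sem_t* sem, long intervalNs)
{
    timespec deadline{};
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_nsec += intervalNs;
    deadline.tv_sec += deadline.tv_nsec / 1000000000L;
    deadline.tv_nsec %= 1000000000L;

    while (sem_clockwait(sem, CLOCK_MONOTONIC, &deadline) == -1)
    {
        if (errno == EINTR)
        {
            continue; // interrupted by a signal, keep waiting until the deadline
        }
        return false; // ETIMEDOUT: time to send the next heartbeat (or a real error)
    }
    return true;
}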

Have you checked whether one of your cores runs at 100% after the change? I assume that the heartbeat thread does not sleep anymore and returns immediately from the sem_timedwait call.

@passchaos

@elBoberido Yeah, I have also noticed this problem, it is a sad thing

@hrudhansh

hrudhansh commented Jun 7, 2024

This issue shows up for me 7 out of 10 times upon stopping the publisher process.

How to recreate?

  1. Run a pub-sub with no delay in between publishes.
  2. Register a SIGINT signal handler in your main like signal(SIGINT, SignalHandler);
  3. Upon ctrl+c on the pub process, you would see it stall.
  4. Upon ctrl+c again you would see the above error in iox-roudi.

Hopefully this helps you recreate/debug this on your end.

On my end, I ran gdb on the pub process with -exec handle SIGINT nostop and -exec handle SIGINT pass, set a breakpoint on the exit(sigint), and called pkill -SIGINT publisher in a separate terminal. I noticed:

  • In cases where it fails, the signal handler seems to be called in the middle of a publish thread.
  • In both cases (fail vs. no-fail), there is a KeepAlive thread running with the semaphore, so I'm not sure whether that is the cause.

How do I stop/interrupt the publisher thread gracefully? It seems like just letting the destructor run is not enough.
How do I stop the KeepAlive thread gracefully?

Not advocating a quick fix, but I wonder if we can put an atomic bool somewhere in the publish thread that we can set to abruptly stop what it's doing and tear down.
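For what it's worth, a generic shape of that atomic-flag shutdown (plain POSIX/C++ pattern, not iceoryx-specific; the publish loop below is a placeholder):

// Generic sketch of signal-driven shutdown via an atomic flag: the publish loop
// finishes its current iteration and exits instead of being torn down mid-publish.
#include <atomic>
#include <csignal>

std::atomic<bool> g_run{true};

extern "C" void SignalHandler(int)
{
    g_run.store(false); // only async-signal-safe work inside the handler
}

int main()
{
    std::signal(SIGINT, SignalHandler);
    // ... initialize runtime and publisher here ...
    while (g_run.load())
    {
        // loan, fill and publish one sample, then re-check the flag
    }
    // publisher and runtime destructors run here, after the loop has left the critical path
    return 0;
}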

Note: This test was done on commit: b2cd72b which is not latest. But the same issue persists on latest.

@elBoberido
Member

@hrudhansh I think the root of your problem is different from the one in this issue; it just manifests in the same error code. Can you please create a separate issue for your problem?

@elBoberido
Member

@niclar @PMSRGroup @rbdm-qnt @passchaos shall we schedule a community meeting to discuss the problem, solution approaches and how to distribute the work?

@gpalmer-latai

I might be interested as well. I haven't enabled monitoring, but I probably encounter the same error case when shutting down RouDi, likely because some process crashed at some point while holding the mutex.

@niclar
Contributor Author

niclar commented Jun 19, 2024

Yes, please do; it's a serious bug that needs to be addressed.

@gpalmer-latai

What are folks' thoughts on the idea of a short-term mitigation whereby this error becomes "non-fatal" and is treated as a leaked resource? Basically, it means that the affected publisher port is in a corrupted state and may not be reused until the whole system (RouDi) is restarted. The default configuration has an allotment of 512 publisher ports overall, so in practice this seems like a manageable scenario, as long as you aren't repeatedly restarting the failing process and corrupting more publisher ports.

@elBoberido
Member

@gpalmer-latai I've thought about this and had a brief look at the code. Unfortunately, it is also quite some work, since the locking failure is not exposed to RouDi.

@niclar and all interested parties: how about a meeting next week on Tuesday at 5 p.m. CEST?

@gpalmer-latai

gpalmer-latai commented Jun 20, 2024

That works for me - same link as the regular meetup?

@niclar
Contributor Author

niclar commented Jun 20, 2024

@elBoberido a meetup on Tuesday at 5 p.m. CEST works fine.
@gpalmer-latai at least for us, such a stopgap solution with proper warnings (number of free ports, number of corrupted ports) would suffice.

@hrudhansh

I'm interested in the meeting as well; it will possibly enlighten me regarding a fix for #2304.
