Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

icehello example may happen bug when system time rollback #1292

Open
1su1 opened this issue Mar 16, 2022 · 11 comments
Open

icehello example may happen bug when system time rollback #1292

1su1 opened this issue Mar 16, 2022 · 11 comments
Labels
bug Something isn't working

Comments

@1su1
Copy link

1su1 commented Mar 16, 2022

Required information

Operating system:
E.g. Ubuntu 20.04.3 LTS

Compiler version:
E.g. gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Observed result or behaviour:

iox-cpp-publisher-helloworld sent value: 8
iox-cpp-publisher-helloworld sent value: 9
iox-cpp-publisher-helloworld sent value: 10
iox-cpp-publisher-helloworld sent value: 11
iox-cpp-publisher-helloworld sent value: 12
iox-cpp-publisher-helloworld sent value: 13
/home/sxy/ros2_galactic/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/posix_wrapper/mutex.cpp:92 { bool iox::posix::mutex::lock() }  :::  [ 22 ]  Success
2022-03-16 15:58:22.422 [ Error ]: ICEORYX error! POPO__CHUNK_LOCKING_ERROR
iox-cpp-publisher-helloworld: /home/sxy/ros2_galactic/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/error_handling/error_handling.cpp:56: static void iox::ErrorHandler::ReactOnErrorLevel(iox::ErrorLevel, const char*): Assertion `false' failed.
Aborted

image

Expected result or behaviour:
When system time rollback happened, all as usual.

iox-cpp-publisher-helloworld sent value: 4
iox-cpp-publisher-helloworld sent value: 5
iox-cpp-publisher-helloworld sent value: 6
iox-cpp-publisher-helloworld sent value: 7
iox-cpp-publisher-helloworld sent value: 8
iox-cpp-publisher-helloworld sent value: 9
iox-cpp-publisher-helloworld sent value: 10
iox-cpp-publisher-helloworld sent value: 11
iox-cpp-publisher-helloworld sent value: 12
iox-cpp-publisher-helloworld sent value: 13
iox-cpp-publisher-helloworld sent value: 14
iox-cpp-publisher-helloworld sent value: 15
iox-cpp-publisher-helloworld sent value: 16
iox-cpp-publisher-helloworld sent value: 17
^C2022-03-16 18:28:50.752 [Verbose]: RouDi cleaned up resources of iox-cpp-publisher-helloworld. Shutting down gracefully.

Conditions where it occurred / Performed steps:

1. iox-toudi
2. ./iox-cpp-subscriber-helloworld
3. iox-cpp-publisher-helloworld
4. timedatectl set-ntp false 
5. timedatectl set-time 15:58:30 (and maybe need "timedatectl set-time 15:58:20"), to make system time rollback
@1su1
Copy link
Author

1su1 commented Mar 16, 2022

We used iceoryx in TDA4, but in TDA4 sometimes happend system time rollback in few days, and then ice will throw exception. So I try it in x86 Linux, and it also happened, need help.

@elfenpiff
Copy link
Contributor

@1su1

I think we never considered this use case and I think this is something iceoryx does not support. The reason is that we have cyclic task which run with a defined period. Here we use functions which are sleep_until like to solve the problem of jittering or moving periodicity.

Stuff like this can happen:

  1. Time is 11:00 pm, RouDi waits for the heartbeat which has to come every 500ms.
  2. Now we have a time rollback and it is 10:00 pm, our central broker RouDi assumes now that it waits for over an hour for the heartbeat and will kill the application since it seems like it is not responding anymore.

We use internal timers which wait for requests (for instance when you create/destroy ports like the subscriber/publisher)
It may happen when you turn back the time while roudi is waiting that the waiting time increases by several ours and from a user perspective it looks like a deadlock.

Iceoryx does not support a change of the time while it is running. When it is working then it is just by accident.

I will investigate this further and keep this as a bug open. But please do not count on a quick fix in the next days!

For the time being I sadly can only recommend to not change the system time while iceoryx is running.

@1su1
Copy link
Author

1su1 commented Mar 16, 2022

Thanks for the analysis, that's enough. I haven't delved into iceoryx at the moment, I just want to know why it crashes in this case now. and then I'll try to look into iceoryx.

@elfenpiff elfenpiff added the bug Something isn't working label Mar 18, 2022
@1su1
Copy link
Author

1su1 commented Aug 5, 2022

I fixed this bug by modified "iceoryx_posh_types.hpp : 254"
from using BaseClock_t = std::chrono::steady_clock; to using BaseClock_t = std::chrono::system_clock;

Will it affect other modules?

@elBoberido
Copy link
Member

@1su I don't think this is a viable solution. The system clock has the issue that it can be changed by the user while the steady clock should only monotonically increase

@1su1
Copy link
Author

1su1 commented Aug 11, 2022

But using this approach does solve the problem

@elBoberido
Copy link
Member

@1su1 it may solve your specific problem but creates another one. The steady clock is not supposed to change and therefore that one is used. I'm not even sure why this problem happens on your side since you wrote that there is a system time rollback. Is this correct or is there be steady clock reset?

A proper fix for this would be to refactor the liveliness detection to not rely on any timestamps (#1361) since this also makes debugging harder. A quick fix for you would be to disable the monitoring and start RouDi with -m off. Is this a viable workaround until we have a proper fix?

@yanxiaochun1005
Copy link

@1su1 @elBoberido The system time rollback will cause m_keepAliveTask to work abnormally, because the underlying part of m_keepAliveTask uses sem_timedwait to delay sending keeplive packets.I have now changed to using std::this_thread::sleep_for to delay sending the keeplive package while adapting to thread exit logic. The test results are normal.

@yanxiaochun1005
Copy link

inline void PeriodicTask::run() noexcept
{
do
{
IOX_DISCARD_RESULT(m_callable());
std::this_thread::sleep_for(std::chrono::milliseconds(m_interval.toMilliseconds()));
} while (!m_stop_flag)
}
m_stop_flag is adapting to thread exit logic

@wjswjq
Copy link

wjswjq commented Jul 13, 2023

@elfenpiff I've met the same issue, when time rollback on version 2.0.3. I guess this issue may be related to the different clocks used by roudi & helloworld apps. Roudi use steady_clock when checking keepAlive signal, while helloworld apps use PeriodicTask to send keepAlive signal which use TimeSpecReference::Epoch.
I suggest that PeriodicTask/Semaphore shall use TimeSpecReference::Monotonic instead.

@elfenpiff
Copy link
Contributor

@wjswjq This is sadly not so easily possible since the posix semaphore does not support monotonic clocks in sem_timedwait(), see: https://pubs.opengroup.org/onlinepubs/9699919799/functions/sem_timedwait.html

I think we have to use here sem_clockwait (gnu libc extension) and sem_timedwait_monotonic (qnx) but this will not play well with all the other platforms :/ ... I will try to think of something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants