
2024 06 25 Eclipse iceoryx developer meetup

Mathias Kraus edited this page Jun 25, 2024 · 4 revisions

Eclipse iceoryx developer meetup to discuss #2193 and #325

Date: 2024/06/25

Time: 17:00 CET

Link: https://app.element.io/jitsi.html#conferenceDomain=meet.element.io&conferenceId=JitsiVtfrqukadefbqiqfxryxabai&userId=&roomId=!AooDAAwkyNWwkMElpt%3Agitter.im&roomName=eclipse%2Ficeoryx&startWithAudioMuted=true&startWithVideoMuted=true&language=en

https://giphy.com/gifs/hail-hypnotoad-rou0CTAp6Z8VW/fullscreen

Issues to be discussed

Attendees

  • Mathias Kraus, ekxide IO GmbH
  • Graham
  • Niclas
  • Hrudhansh

Agenda

  1. Discuss the root cause of #2193 and #325
  2. Possible solutions
  3. Work distribution

Minutes

1.1 The keep alive thread does not wake up after the system time changes

  • the thread is waiting in a semaphore timed_wait call
  • timed_wait uses CLOCK_REALTIME, which is affected by changes to the system time
  • additionally, the heartbeat is sent as a timestamp
    • the timestamp uses the monotonic clock
    • jumps into the future can still happen with the monotonic clock
    • when RouDi checks the last heartbeat, its own current timestamp might have been taken after such a jump and then lies too far in the future compared to the heartbeat timestamp
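
The effect of such a jump on a timestamp-based check can be illustrated with a minimal sketch. The function name, the `Timestamp` alias, and the threshold are assumptions made for illustration and are not taken from the iceoryx code base.

```cpp
#include <chrono>

// Hypothetical alias: time elapsed since an arbitrary clock epoch.
using Timestamp = std::chrono::nanoseconds;

// Hypothetical check (not the iceoryx implementation): the application counts
// as unresponsive when its last heartbeat is older than the threshold.
inline bool isConsideredUnresponsive(Timestamp now, Timestamp lastHeartbeat, Timestamp threshold) noexcept
{
    // If the clock jumped forward between taking 'lastHeartbeat' and 'now',
    // the difference contains the jump and can exceed the threshold even
    // though the application sent its heartbeat on time.
    return (now - lastHeartbeat) > threshold;
}
```

With a heartbeat taken at 100 s and a threshold of 5 s, a check at 101 s passes, while a check whose clock reading has jumped to 161 s reports the application as unresponsive.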

1.2 There is a mutex to guard adding and removing subscriber queues to the publisher

  • the mutex also needs to be locked when the publisher accesses the subscriber queues for publishing
  • this is quite fast since there is no contention unless subscribers are added to or removed from the publisher
  • nevertheless, there is a small window in which the application could terminate abnormally and leave a locked mutex behind
  • this usually only happens on:
    • a crash -> can be prevented on the application side by e.g. not running threads in parallel when publishing samples
    • not handling signals, so the destructors do not run -> can be prevented by handling signals
    • sending SIGKILL -> unfortunately this is also done by RouDi when monitoring is turned on and RouDi does not receive a heartbeat for some time; RouDi assumes the application is unresponsive and sends a SIGKILL in order to safely reclaim the resources
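
The locking pattern described above can be sketched roughly as follows. This is a simplified, single-process illustration with hypothetical types (`Publisher`, `SubscriberQueue`) and a `std::mutex`; it is not the iceoryx implementation, which works with shared memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical stand-in for a subscriber queue.
using SubscriberQueue = std::queue<uint64_t>;

// Hypothetical publisher; names are assumptions for illustration only.
class Publisher
{
  public:
    void addSubscriber(SubscriberQueue* queue)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        m_queues.push_back(queue);
    }

    void removeSubscriber(SubscriberQueue* queue)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        m_queues.erase(std::remove(m_queues.begin(), m_queues.end(), queue), m_queues.end());
    }

    // Delivers the sample to all queues and returns how many queues received it.
    std::size_t publish(uint64_t sample)
    {
        // The same mutex is taken on the publish path. It is uncontended unless
        // a subscriber is being added or removed, but if the process is killed
        // while the guard is held, the unlock in its destructor never runs.
        std::lock_guard<std::mutex> guard(m_mutex);
        for (auto* queue : m_queues)
        {
            queue->push(sample);
        }
        return m_queues.size();
    }

  private:
    std::mutex m_mutex;
    std::vector<SubscriberQueue*> m_queues;
};
```

The window mentioned in the minutes corresponds to the lifetime of `guard` inside `publish`; with a mutex shared between processes, terminating inside that scope would leave the mutex locked for everyone else.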

2.1 Possible solution for the monitoring issue

  • create a timer_create abstraction
  • use that abstraction in combination with a blocking semaphore wait instead of the semaphore timed_wait
  • use a counter for the heartbeat instead of the timestamp
    • add a mechanism for RouDi to check when the counter does not change for some time to detect unresponsive applications
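
The counter-based heartbeat could look roughly like the sketch below. `HeartbeatCounter`, `HeartbeatMonitor`, and the `maxMissedChecks` parameter are hypothetical names chosen for illustration; in a real setup the counter would live in shared memory, and the timer abstraction from the first bullet is not covered here.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical heartbeat counter, incremented by the application's
// keep-alive thread each time it wakes up.
struct HeartbeatCounter
{
    std::atomic<uint64_t> value{0};

    void beat() noexcept
    {
        value.fetch_add(1, std::memory_order_relaxed);
    }
};

// Hypothetical per-application state on the monitoring side (RouDi in the minutes).
class HeartbeatMonitor
{
  public:
    explicit HeartbeatMonitor(uint32_t maxMissedChecks) noexcept
        : m_maxMissedChecks(maxMissedChecks)
    {
    }

    // Intended to be called once per monitoring interval. Returns false once
    // the counter has stayed unchanged for more than maxMissedChecks
    // consecutive calls; no timestamps are compared.
    bool checkAlive(const HeartbeatCounter& counter) noexcept
    {
        const uint64_t current = counter.value.load(std::memory_order_relaxed);
        if (current != m_lastSeen)
        {
            m_lastSeen = current;
            m_missedChecks = 0;
            return true;
        }
        ++m_missedChecks;
        return m_missedChecks <= m_maxMissedChecks;
    }

  private:
    uint64_t m_lastSeen{0};
    uint32_t m_missedChecks{0};
    uint32_t m_maxMissedChecks;
};
```

Because the monitor only looks at whether the value changed between its own checks, a clock jump on the application side cannot make a heartbeat appear too old. How often the check runs still depends on the monitor's own wake-up mechanism.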

2.2 Possible solution for locking issue

3.1 ekxide might be contracted to fix the monitoring issue

3.2 Graham is looking into implementing the workaround

  • Mathias checks if the workaround is feasible and gives some hints on how to proceed