Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infinite loop: broken internal heap structure #1682

Closed
ambroise-arm opened this issue May 3, 2023 · 5 comments
Closed

Infinite loop: broken internal heap structure #1682

ambroise-arm opened this issue May 3, 2023 · 5 comments
Assignees

Comments

@ambroise-arm
Copy link

Description

CycloneDDS can get stuck in an infinite loop inside of fibheap.c caused by a corruption of the internal state of the heap data structure.

Cause

It appears the code can call ddsrt_fibheap_decrease_key on a node that was previously removed from the heap. The fipheap implementation assumes that the node to decrease is part of its internal structure, hence adding the removed node to the heap without setting its next, prev, and other attributes, leaving them in the same state they were in at the time the node was removed from the heap, thus potentially breaking the internal representation.

In my case this is the backtrace of it happening:

#0  [...] in ddsrt_fibheap_decrease_key (fhdef=fhdef@entry=[...] <evq_xevents_fhdef>, fh=fh@entry=[...], vnode=vnode@entry=[...]) at src/ddsrt/src/fibheap.c:285
#1  [...] in ddsi_delete_xevent_nosync (ev=[...]) at src/core/ddsi/src/ddsi_xevent.c:268
#2  ddsi_delete_xevent (ev=[...]) at src/core/ddsi/src/ddsi_xevent.c:307
#3  [...] in ddsi_free_pwr_rd_match (m=[...]) at src/core/ddsi/src/ddsi_endpoint_match.c:646
#4  [...] in gc_delete_proxy_writer (gcreq=<optimized out>) at src/core/ddsi/src/ddsi_proxy_endpoint.c:401
#5  [...] in gcreq_queue_thread (q=[...]) at src/core/ddsi/src/ddsi_gc.c:182
#6  [...] in create_thread_wrapper (ptr=[...]) at src/core/ddsi/src/ddsi_thread.c:255
#7  [...] in os_startRoutineWrapper (threadContext=threadContext@entry=[...]) at src/ddsrt/src/threads/posix/threads.c:191

Which, on a subsequent call to ddsrt_fibheap_extract_min can lead to the following infinite loop

as n may never get back to mark, depending on the state of the node added by ddsrt_fibheap_decrease_key.

Reproduce

I don't know the root cause of ddsi_delete_xevent_nosync calling ddsrt_fibheap_decrease_key on a node that was already removed from the heap.
My setup has CycloneDDS running on Zephyr (support added with #1615) but I can't share the application code (yet).

I have the following to demonstrate the corruption process, if it helps:

#include "dds/ddsrt/fibheap.h"
#include <stddef.h>

struct event { ddsrt_fibheap_node_t heapnode; int x; };

static int compare_event (const void *va, const void *vb) {
  const struct event *a = va; const struct event *b = vb;
  return (a->x == b->x) ? 0 : (a->x < b->x) ? -1 : 1;
}

static const ddsrt_fibheap_def_t events_fhdef = DDSRT_FIBHEAPDEF_INITIALIZER(offsetof (struct event, heapnode), compare_event);

int main() {
  ddsrt_fibheap_t events;
  ddsrt_fibheap_init (&events_fhdef, &events);

  struct event event_pool[4];
  for (int i = 0; i < (int)(sizeof(event_pool) / sizeof(struct event)); i++) {
    event_pool[i].x = i;
  }

  ddsrt_fibheap_insert(&events_fhdef, &events, &event_pool[0]);
  ddsrt_fibheap_insert(&events_fhdef, &events, &event_pool[1]);
  ddsrt_fibheap_insert(&events_fhdef, &events, &event_pool[2]);

  (void) ddsrt_fibheap_extract_min(&events_fhdef, &events);
  (void) ddsrt_fibheap_extract_min(&events_fhdef, &events);

  ddsrt_fibheap_insert(&events_fhdef, &events, &event_pool[3]);

  (void) ddsrt_fibheap_extract_min(&events_fhdef, &events);

  // Illegal call to decrease_key on a node that was previously extracted from the heap
  event_pool[1].x = -1;
  ddsrt_fibheap_decrease_key(&events_fhdef, &events, &event_pool[1]);

  // Gets stuck in an infinite loop
  (void) ddsrt_fibheap_extract_min(&events_fhdef, &events);

  return 0;
}

Environment

The backtrace above is from 87b3177, but I also verified it happening on current tip of master branch (53a92b1).

cc @PatrickM-ZS

@eboasson
Copy link
Contributor

eboasson commented May 6, 2023

Thanks for the detailed report. Indeed, calling ddsrt_fibheap_decrease_key on a node that was already removed from the heap is deadly, so I'll need to look for the root cause. In the meantime, if it is easily reproducible, would you be able to try it with the version of the sources without #1622?

As (bad) luck has it, I merged that one before I merged the Zephyr support ... One way to solve that would be to take a clone of the repository you don't mind messing around with and doing:

git reset --hard 80d5fc34
git fetch upstream pull/1615/head
git merge FETCH_HEAD

That basically brings it to the state just before #1622 got merged, and then adds the Zephyr support on top. Simply reverting #1622 gives a lot of conflicts and I suspect doesn't bring you much you need to check, but I may be wrong there.

What would be particularly interesting is if it fails even without #1622 in because that variant has been around for years without giving any trouble.

@ambroise-arm
Copy link
Author

Just merging the PR doesn't work because it has the xevent patch in its history. I've cherry-picked the #1615 commits instead to get in the desired state.
I can confirm that without #1622 I don't see the heap issue.

@eboasson
Copy link
Contributor

@ambroise-arm thank you for making the effort to test it again, and also apologies for my doing such an ill thought-through suggestion.

I have spent a good bit of time staring at the code trying to find something that changed in a suspicious manner. Unfortunately, the only thing I have found so far is something that appears harmless (the more so because it did take the expected path in the stack trace with the crash): reading the ev->sync_state field outside the lock in ddsi_delete_xevent to decide whether to take the _nosync or the _sync path.

The reason I think it may well be harmless is that it is either CSODS_NO_SYNC_NEEDED for the entire lifetime of the event, or it is never CSODS_NO_SYNC_NEEDED, and it is a simple integer and so it should never see some temporary "impossible" value when it compares it to NO_SYNC. Simple 32-bit int loads and stores are atomic in modern hardware, so I don't quite see how that could go wrong.

That said, the Zephyr/ARM port needed some work to deal with -fshort-enums being in effect, and that means it is not only a 8-bit int in your case. That does make it different from essentially all platforms that it regularly runs on and so does allow for "works on CI and locally" vs "crashes on Zephyr".

Still, I'd bet that 8-bit accesses are atomic on ARM and given the memory layout:

  ddsrt_mtime_t tsched;
  enum cb_sync_on_delete_state sync_state;
  union {
    ddsi_xevent_cb_t cb;

where we know tsched is a 64-bit integer (and so 32-bit aligned in your case) is changing all the time and; cb is constant (32 or 64 bit, I don't know, but also 32-bit aligned). Doesn't seem likely sync_state would ever contain garbage.

Even though I can't convince myself that is could be the problem, I do think it makes sense to tell you. I might be wrong, after all. The diff below is the change I have locally right now that will likely end up in a PR in the near future.

diff --git a/src/core/ddsi/src/ddsi_xevent.c b/src/core/ddsi/src/ddsi_xevent.c
index ec5d95034..a4745d96e 100644
--- a/src/core/ddsi/src/ddsi_xevent.c
+++ b/src/core/ddsi/src/ddsi_xevent.c
@@ -248,17 +248,14 @@ static int nontimed_xevent_in_queue (struct ddsi_xeventq *evq, struct ddsi_xeven
 }
 #endif
 
-static void free_xevent (struct ddsi_xeventq *evq, struct ddsi_xevent *ev)
+static void free_xevent (struct ddsi_xevent *ev)
 {
-  (void) evq;
   ddsrt_free (ev);
 }
 
-static void ddsi_delete_xevent_nosync (struct ddsi_xevent *ev)
+static void ddsi_delete_xevent_nosync (struct ddsi_xeventq *evq, struct ddsi_xevent *ev)
 {
-  struct ddsi_xeventq *evq = ev->evq;
-  ddsrt_mutex_lock (&evq->lock);
-  assert (ev->sync_state != CSODS_EXECUTING);
+  assert (ev->sync_state == CSODS_NO_SYNC_NEEDED);
   /* Can delete it only once, no matter how we implement it internally */
   assert (ev->tsched.v != TSCHED_DELETE);
   assert (TSCHED_DELETE < ev->tsched.v);
@@ -275,13 +272,10 @@ static void ddsi_delete_xevent_nosync (struct ddsi_xevent *ev)
   /* TSCHED_DELETE is absolute minimum time, so chances are we need to
      wake up the thread.  The superfluous signal is harmless. */
   ddsrt_cond_broadcast (&evq->cond);
-  ddsrt_mutex_unlock (&evq->lock);
 }
 
-static void ddsi_delete_xevent_sync (struct ddsi_xevent *ev)
+static void ddsi_delete_xevent_sync (struct ddsi_xeventq *evq, struct ddsi_xevent *ev)
 {
-  struct ddsi_xeventq *evq = ev->evq;
-  ddsrt_mutex_lock (&evq->lock);
   /* wait until neither scheduled nor executing; loop in case the callback reschedules the event */
   while (ev->tsched.v != DDS_NEVER || ev->sync_state == CSODS_EXECUTING)
   {
@@ -296,16 +290,24 @@ static void ddsi_delete_xevent_sync (struct ddsi_xevent *ev)
       ddsrt_cond_wait (&evq->cond, &evq->lock);
     }
   }
-  ddsrt_mutex_unlock (&evq->lock);
-  free_xevent (evq, ev);
+  free_xevent (ev);
 }
 
 void ddsi_delete_xevent (struct ddsi_xevent *ev)
 {
+  struct ddsi_xeventq * const evq = ev->evq;
+  ddsrt_mutex_lock (&evq->lock);
   if (ev->sync_state == CSODS_NO_SYNC_NEEDED)
-    ddsi_delete_xevent_nosync (ev);
+  {
+    // schedule at TSCHED_DELETE, handler thread will free
+    ddsi_delete_xevent_nosync (evq, ev);
+  }
   else
-    ddsi_delete_xevent_sync (ev);
+  {
+    // wait while executing, then free
+    ddsi_delete_xevent_sync (evq, ev);
+  }
+  ddsrt_mutex_unlock (&evq->lock);
 }
 
 int ddsi_resched_xevent_if_earlier (struct ddsi_xevent *ev, ddsrt_mtime_t tsched)
@@ -463,7 +465,7 @@ void ddsi_xeventq_free (struct ddsi_xeventq *evq)
   struct ddsi_xevent *ev;
   assert (evq->thrst == NULL);
   while ((ev = ddsrt_fibheap_extract_min (&evq_xevents_fhdef, &evq->xevents)) != NULL)
-    free_xevent (evq, ev);
+    free_xevent (ev);
 
   {
     struct ddsi_xpack *xp = ddsi_xpack_new (evq->gv, false);
@@ -568,7 +570,7 @@ static void handle_xevents (struct ddsi_thread_state * const thrst, struct ddsi_
     {
       struct ddsi_xevent *xev = ddsrt_fibheap_extract_min (&evq_xevents_fhdef, &xevq->xevents);
       if (xev->tsched.v == TSCHED_DELETE)
-        free_xevent (xevq, xev);
+        free_xevent (xev);
       else
       {
         ddsi_thread_state_awake_to_awake_no_nest (thrst);

@ambroise-arm
Copy link
Author

Thanks for the feedback. Sorry I haven't had time to look at this issue again. I won't get back to it before September, but I will get back to it.

@ambroise-arm
Copy link
Author

and also apologies for my doing such an ill thought-through suggestion.

Your suggestion was good, I didn't mean to be rude, I just wanted to document the deviation from the instructions. And thanks for looking into it more!

The diff below is the change I have locally right now that will likely end up in a PR in the near future.

I see it made it as 48aa8b3 . You were right, that was not the source of the issue I was seeing.
It appears that the issue was with Zephyr. I was using Zephyr 3.3 at the time. Testing again with Zephyr 3.4 I don't see this anymore. I tried to bisect Zephyr to see what solved it, but there is a good chunk of the history between 3.3 and 3.4 that doesn't seem to work well with CycloneDDS so I couldn't complete the bisection.

So I suppose #1622 made use of a new API that was bugged in Zephyr in 3.3 and was fixed since then. Closing the issue as all is well now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants