Infinite loop: broken internal heap structure #1682
Comments
Thanks for the detailed report. Indeed, calling As (bad) luck has it, I merged that one before I merged the Zephyr support ... One way to solve that would be to take a clone of the repository you don't mind messing around with and doing:
That basically brings it to the state just before #1622 got merged, and then adds the Zephyr support on top. Simply reverting #1622 gives a lot of conflicts and I suspect doesn't bring you much you need to check, but I may be wrong there. What would be particularly interesting is if it fails even without #1622, because that variant has been around for years without giving any trouble.
@ambroise-arm thank you for making the effort to test it again, and apologies for making such an ill-thought-through suggestion.
I have spent a good bit of time staring at the code trying to find something that changed in a suspicious manner. Unfortunately, the only thing I have found so far is something that appears harmless (the more so because it did take the expected path in the stack trace with the crash): reading the The reason I think it may well be harmless is that it is either That said, the Zephyr/ARM port needed some work to deal with Still, I'd bet that 8-bit accesses are atomic on ARM and given the memory layout:
ddsrt_mtime_t tsched;
enum cb_sync_on_delete_state sync_state;
union {
  ddsi_xevent_cb_t cb;
where we know Even though I can't convince myself that it could be the problem, I do think it makes sense to tell you. I might be wrong, after all. The diff below is the change I have locally right now that will likely end up in a PR in the near future.
diff --git a/src/core/ddsi/src/ddsi_xevent.c b/src/core/ddsi/src/ddsi_xevent.c
index ec5d95034..a4745d96e 100644
--- a/src/core/ddsi/src/ddsi_xevent.c
+++ b/src/core/ddsi/src/ddsi_xevent.c
@@ -248,17 +248,14 @@ static int nontimed_xevent_in_queue (struct ddsi_xeventq *evq, struct ddsi_xeven
}
#endif
-static void free_xevent (struct ddsi_xeventq *evq, struct ddsi_xevent *ev)
+static void free_xevent (struct ddsi_xevent *ev)
{
- (void) evq;
ddsrt_free (ev);
}
-static void ddsi_delete_xevent_nosync (struct ddsi_xevent *ev)
+static void ddsi_delete_xevent_nosync (struct ddsi_xeventq *evq, struct ddsi_xevent *ev)
{
- struct ddsi_xeventq *evq = ev->evq;
- ddsrt_mutex_lock (&evq->lock);
- assert (ev->sync_state != CSODS_EXECUTING);
+ assert (ev->sync_state == CSODS_NO_SYNC_NEEDED);
/* Can delete it only once, no matter how we implement it internally */
assert (ev->tsched.v != TSCHED_DELETE);
assert (TSCHED_DELETE < ev->tsched.v);
@@ -275,13 +272,10 @@ static void ddsi_delete_xevent_nosync (struct ddsi_xevent *ev)
/* TSCHED_DELETE is absolute minimum time, so chances are we need to
wake up the thread. The superfluous signal is harmless. */
ddsrt_cond_broadcast (&evq->cond);
- ddsrt_mutex_unlock (&evq->lock);
}
-static void ddsi_delete_xevent_sync (struct ddsi_xevent *ev)
+static void ddsi_delete_xevent_sync (struct ddsi_xeventq *evq, struct ddsi_xevent *ev)
{
- struct ddsi_xeventq *evq = ev->evq;
- ddsrt_mutex_lock (&evq->lock);
/* wait until neither scheduled nor executing; loop in case the callback reschedules the event */
while (ev->tsched.v != DDS_NEVER || ev->sync_state == CSODS_EXECUTING)
{
@@ -296,16 +290,24 @@ static void ddsi_delete_xevent_sync (struct ddsi_xevent *ev)
ddsrt_cond_wait (&evq->cond, &evq->lock);
}
}
- ddsrt_mutex_unlock (&evq->lock);
- free_xevent (evq, ev);
+ free_xevent (ev);
}
void ddsi_delete_xevent (struct ddsi_xevent *ev)
{
+ struct ddsi_xeventq * const evq = ev->evq;
+ ddsrt_mutex_lock (&evq->lock);
if (ev->sync_state == CSODS_NO_SYNC_NEEDED)
- ddsi_delete_xevent_nosync (ev);
+ {
+ // schedule at TSCHED_DELETE, handler thread will free
+ ddsi_delete_xevent_nosync (evq, ev);
+ }
else
- ddsi_delete_xevent_sync (ev);
+ {
+ // wait while executing, then free
+ ddsi_delete_xevent_sync (evq, ev);
+ }
+ ddsrt_mutex_unlock (&evq->lock);
}
int ddsi_resched_xevent_if_earlier (struct ddsi_xevent *ev, ddsrt_mtime_t tsched)
@@ -463,7 +465,7 @@ void ddsi_xeventq_free (struct ddsi_xeventq *evq)
struct ddsi_xevent *ev;
assert (evq->thrst == NULL);
while ((ev = ddsrt_fibheap_extract_min (&evq_xevents_fhdef, &evq->xevents)) != NULL)
- free_xevent (evq, ev);
+ free_xevent (ev);
{
struct ddsi_xpack *xp = ddsi_xpack_new (evq->gv, false);
@@ -568,7 +570,7 @@ static void handle_xevents (struct ddsi_thread_state * const thrst, struct ddsi_
{
struct ddsi_xevent *xev = ddsrt_fibheap_extract_min (&evq_xevents_fhdef, &xevq->xevents);
if (xev->tsched.v == TSCHED_DELETE)
- free_xevent (xevq, xev);
+ free_xevent (xev);
else
{
ddsi_thread_state_awake_to_awake_no_nest (thrst);
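The locking change in the diff above (take the queue lock once in the public entry point, read the state flag under that lock, and have the helpers assume the lock is held) can be shown in isolation. What follows is a minimal sketch using POSIX threads instead of the ddsrt primitives. It is not the CycloneDDS code: `struct queue`, `struct event`, `delete_nosync_locked`, `delete_sync_locked`, `delete_event` and `demo_delete` are hypothetical names, and the condition-variable wait of the real synchronous path is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the event queue and event; not the ddsi types. */
struct queue {
  pthread_mutex_t lock;
  int pending_deletes;   /* events handed over to the handler thread */
};

struct event {
  struct queue *q;
  bool needs_sync;       /* plays the role of the sync-state flag */
  bool deleted;
};

/* Caller must hold q->lock. */
static void delete_nosync_locked (struct queue *q, struct event *ev)
{
  assert (!ev->needs_sync);
  ev->deleted = true;
  q->pending_deletes++;  /* in this sketch: "someone else frees it later" */
}

/* Caller must hold q->lock. A real implementation would wait on a
   condition variable here until the callback is no longer executing. */
static void delete_sync_locked (struct queue *q, struct event *ev)
{
  (void) q;
  ev->deleted = true;
}

/* Single entry point: the flag is read while holding the same lock
   that protects every other access to the event. */
static void delete_event (struct event *ev)
{
  struct queue * const q = ev->q;
  pthread_mutex_lock (&q->lock);
  if (!ev->needs_sync)
    delete_nosync_locked (q, ev);
  else
    delete_sync_locked (q, ev);
  pthread_mutex_unlock (&q->lock);
}

/* Deletes one event of each kind; returns the number handed over. */
static int demo_delete (void)
{
  struct queue q = { PTHREAD_MUTEX_INITIALIZER, 0 };
  struct event a = { &q, false, false };
  struct event b = { &q, true, false };
  delete_event (&a);
  delete_event (&b);
  assert (a.deleted && b.deleted);
  return q.pending_deletes;
}
```

Keeping the lock and unlock in one function makes it easier to see that the flag read and the action taken on it cannot be separated by another thread.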
Thanks for the feedback. Sorry I haven't had time to look at this issue again. I won't get back to it before September, but I will get back to it.
Your suggestion was good, I didn't mean to be rude, I just wanted to document the deviation from the instructions. And thanks for looking into it more!
I see it made it in as 48aa8b3. You were right, that was not the source of the issue I was seeing. So I suppose #1622 made use of a new API that was buggy in Zephyr 3.3 and has been fixed since then. Closing the issue as all is well now.
Description
CycloneDDS can get stuck in an infinite loop inside of fibheap.c caused by a corruption of the internal state of the heap data structure.
Cause
It appears the code can call ddsrt_fibheap_decrease_key on a node that was previously removed from the heap. The fibheap implementation assumes that the node to decrease is part of its internal structure, and so adds the removed node to the heap without setting its next, prev, and other attributes. These are left in the state they were in at the time the node was removed from the heap, potentially breaking the internal representation.
In my case this is the backtrace of it happening:
On a subsequent call to ddsrt_fibheap_extract_min, this corruption can lead to the following infinite loop:
cyclonedds/src/ddsrt/src/fibheap.c
Line 160 in 87b3177
n may never get back to mark, depending on the state of the node added by ddsrt_fibheap_decrease_key.
Reproduce
I don't know the root cause of ddsi_delete_xevent_nosync calling ddsrt_fibheap_decrease_key on a node that was already removed from the heap.
My setup has CycloneDDS running on Zephyr (support added with #1615) but I can't share the application code (yet).
I have the following to demonstrate the corruption process, if it helps:
Environment
The backtrace above is from 87b3177, but I also verified it happening on current tip of master branch (53a92b1).
cc @PatrickM-ZS
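To illustrate the mechanism described under Cause, here is a minimal, self-contained sketch in C. It is not CycloneDDS code and does not use the fibheap API. The `struct node` ring and the helpers `ring_init`, `ring_insert_after`, `ring_remove_stale`, `ring_walk_returns` and `demo_stale_entry_point` are hypothetical names chosen for this example. The only assumption taken from the description is that a removed node keeps its old next/prev pointers and is later used as if it were still linked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node of a circular doubly linked list. */
struct node {
  struct node *prev, *next;
  int key;
};

/* Make n a ring of one element. */
static void ring_init (struct node *n, int key)
{
  n->prev = n->next = n;
  n->key = key;
}

/* Link n into the ring directly after pos. */
static void ring_insert_after (struct node *pos, struct node *n)
{
  n->prev = pos;
  n->next = pos->next;
  pos->next->prev = n;
  pos->next = n;
}

/* Unlink n from its neighbours but leave n->prev and n->next untouched,
   so they keep pointing into the ring (the "stale" state). */
static void ring_remove_stale (struct node *n)
{
  n->prev->next = n->next;
  n->next->prev = n->prev;
}

/* Follow next pointers starting after mark. Returns true if the walk
   gets back to mark within max_steps, false otherwise. The bound only
   exists so that the demo terminates. */
static bool ring_walk_returns (const struct node *mark, size_t max_steps)
{
  const struct node *n = mark->next;
  for (size_t i = 0; i < max_steps; i++)
  {
    if (n == mark)
      return true;
    n = n->next;
  }
  return false;
}

/* Builds the ring a, b, c, d, removes b, then uses b as the entry point
   as if it were still a member. Returns whether the walk gets back to b. */
static bool demo_stale_entry_point (void)
{
  struct node a, b, c, d;
  ring_init (&a, 1);
  ring_init (&b, 2);
  ring_init (&c, 3);
  ring_init (&d, 4);
  ring_insert_after (&a, &b);
  ring_insert_after (&b, &c);
  ring_insert_after (&c, &d);
  assert (ring_walk_returns (&b, 16));  /* intact ring: walk closes */

  ring_remove_stale (&b);               /* ring is now a, c, d */
  return ring_walk_returns (&b, 16);    /* b->next is c, nothing points to b */
}
```

After the removal no node in the ring points back at b, so a loop of the form "advance n until n equals mark" with mark set to b can only end through the step bound. Without such a bound it would spin forever.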