TSCH: fix a bug in slot scheduling #2140

Open
wants to merge 1 commit into
base: master
from


atiselsts
Contributor

The scheduling of a new TSCH timeslot at the end of tsch_slot_operation() will enter a count-to-infinity loop if it is called RTIMER_GUARD ticks or fewer before the start of the next TSCH timeslot (i.e. 1 or 2 ticks on most platforms, where RTIMER_GUARD==2). This patch fixes this behavior by busy-waiting until the start of the next slot in that case.

To avoid busy-waiting twice, the busy-wait code is now executed in TSCH_SCHEDULE_AND_YIELD if and only if it was not already done in tsch_schedule_slot_operation().
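A minimal sketch of the decision described above, under stated assumptions: a 16-bit timer, a guard of 2 ticks standing in for RTIMER_GUARD, and a hypothetical helper choose_slot_action() that is not part of Contiki. clock_lt() mimics a wraparound-safe comparison in the spirit of RTIMER_CLOCK_LT; the real macro may be defined differently per platform.

```c
#include <stdint.h>

typedef uint16_t clock16_t;      /* assumption: 16-bit timer counter */
#define GUARD_TICKS 2            /* assumption: stands in for RTIMER_GUARD */

/* Wraparound-safe "a is before b", valid while the two values are
 * less than half a timer period apart. */
static int clock_lt(clock16_t a, clock16_t b)
{
  return (int16_t)(uint16_t)(a - b) < 0;
}

/* Hypothetical outcome of trying to schedule the next slot. */
typedef enum { ARM_TIMER, BUSY_WAIT, DEADLINE_MISSED } slot_action_t;

/* Hypothetical helper: decide what to do given the current time and
 * the start time of the next slot. */
static slot_action_t choose_slot_action(clock16_t now, clock16_t slot_start)
{
  if(!clock_lt(now, slot_start)) {
    /* The slot start has been reached or passed. */
    return DEADLINE_MISSED;
  }
  if(!clock_lt(now, (clock16_t)(slot_start - GUARD_TICKS))) {
    /* Too close to arm a timer: spin until the slot start instead. */
    return BUSY_WAIT;
  }
  /* Enough margin to arm a hardware timer. */
  return ARM_TIMER;
}
```

With these assumptions, a call made 1 or 2 ticks before the slot start yields BUSY_WAIT rather than being treated as a missed deadline, which is the case the description says used to loop.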

This code was developed as part of the SPHERE project (http://irc-sphere.ac.uk/)

@g-oikonomou g-oikonomou self-assigned this Mar 14, 2017
@simonduq
Member

Hmm, I have seen something similar, I wonder if that was the same issue or not. I think the problem was when skipping a slot (lock or non-active slot), we would directly try and schedule the next slot, although we had not even reached the real start of the current timeslot (we'd be 1 or 2 ticks before, but now looking at tsch_schedule_slot_operation I'm no longer sure why this was the case).

@simonduq
Member

Now about the fix: I'm worried that in cases where we really did miss the deadline (e.g. join, or any slot operation that took too long), we'll end up busy-waiting (waiting for an rtimer wrap).

@atiselsts
Contributor Author

@simonduq that sounds like another thing that would trigger the same behavior: the bug here seems to be triggered because check_timer_miss() only works correctly if the reference time is not in the future. If there is any doubt that this might happen somewhere, it may be better to add a condition like if(!RTIMER_CLOCK_LT(RTIMER_NOW(), ref_time + RTIMER_GUARD)) in the calling code?
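To illustrate the point about the reference time, here is a rough sketch with a 16-bit timer. miss_check_sketch() is a hypothetical miss check that infers wraparound by comparing values against ref_time; it is an assumption about the general shape of such a check, not a copy of check_timer_miss(). ref_time_is_usable() wraps the condition suggested above, with GUARD_TICKS standing in for RTIMER_GUARD and clock_lt() for RTIMER_CLOCK_LT.

```c
#include <stdint.h>

typedef uint16_t clock16_t;      /* assumption: 16-bit timer counter */
#define GUARD_TICKS 2            /* assumption: stands in for RTIMER_GUARD */

/* Wraparound-safe "a is before b" (stand-in for RTIMER_CLOCK_LT). */
static int clock_lt(clock16_t a, clock16_t b)
{
  return (int16_t)(uint16_t)(a - b) < 0;
}

/* Hypothetical miss check: returns 1 if ref_time + offset has passed.
 * It treats any value below ref_time as having wrapped around, which
 * is only a safe reading when ref_time is not in the future. */
static int miss_check_sketch(clock16_t ref_time, clock16_t offset,
                             clock16_t now)
{
  clock16_t target = (clock16_t)(ref_time + offset);
  int now_wrapped = now < ref_time;
  int target_wrapped = target < ref_time;

  if(now_wrapped == target_wrapped) {
    /* Both or neither wrapped: a plain comparison is enough. */
    return target <= now;
  }
  /* Only one wrapped: if it is now, the target is assumed passed. */
  return now_wrapped;
}

/* The guard condition from the comment above: true only when now is
 * at least GUARD_TICKS past ref_time. */
static int ref_time_is_usable(clock16_t now, clock16_t ref_time)
{
  return !clock_lt(now, (clock16_t)(ref_time + GUARD_TICKS));
}
```

In this sketch, a ref_time just 1 tick ahead of now makes now look wrapped, so the check reports a miss even though the target is still hundreds of ticks away. Calling ref_time_is_usable() first would filter out that case.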

(As a side note, I feel that this sort of thing, the core operation of TSCH, is crying for a randomized brute force testing...)

> I'm worried that in cases where we really did miss the deadline (e.g. join, or any slot operation that took too long), we'll end up busy-waiting (waiting for an rtimer wrap).

The waiting will happen while RTIMER_CLOCK_LT(RTIMER_NOW(), (t0) + (offset)) returns true. So if the wait starts more than half the rtimer period after the target time (1 second of the 2-second period on msp430), then indeed it will wait for the wraparound. However, if this huge delay ever happens it just means that there's another bug somewhere and the system is already broken :)
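A small sketch of that half-period behavior, assuming a 16-bit counter at 32768 ticks per second (so a 2-second period, as on msp430). clock_lt() is a stand-in for RTIMER_CLOCK_LT and busy_wait_ticks() is a hypothetical helper that computes how long a loop of the form while(clock_lt(now, target)) would spin.

```c
#include <stdint.h>

typedef uint16_t clock16_t;      /* assumption: 16-bit timer counter */

/* Wraparound-safe "a is before b" using a signed 16-bit difference. */
static int clock_lt(clock16_t a, clock16_t b)
{
  return (int16_t)(uint16_t)(a - b) < 0;
}

/* Hypothetical helper: number of ticks a busy-wait on `target` would
 * spin if it started at time `now`. */
static uint16_t busy_wait_ticks(clock16_t now, clock16_t target)
{
  if(!clock_lt(now, target)) {
    /* Target looks reached or passed: the loop exits immediately. */
    return 0;
  }
  /* Target looks ahead: spin until the counter reaches it. */
  return (uint16_t)(target - now);
}
```

Under these assumptions, a target missed by 100 ticks gives no wait at all, while a target missed by 40000 ticks (more than half of 65536) looks like it is 25536 ticks ahead, so the loop spins for roughly 0.78 seconds until the counter wraps around to it.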

@simonduq
Member

> (As a side note, I feel that this sort of thing, the core operation of TSCH, is crying for a randomized brute force testing...)

I love the idea :)

> However, if this huge delay ever happens it just means that there's another bug somewhere and the system is already broken :)

Right, but the problem remains whenever tsch_schedule_slot_operation is called directly. For instance, when joining, we often have a few missed deadlines.

@atiselsts
Contributor Author

I think it's a reasonable assumption to make that the scheduling will not be missed by more than 1 second, even on msp430.

@simonduq
Member

Right, I agree. 👍


3 participants