New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
arch/stm32f7: Fixes bug in tickless driver where the compare register… #1380
Conversation
g_tickless.pending = true; | ||
|
||
uint64_t counter = ((uint64_t)g_tickless.overflow << 32) | | ||
STM32_TIM_GETCOUNTER(g_tickless.tch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If an overflow from the LS to MS occurs, the fetched value will be wrong. I usually do this (which does not require disabling interrupts):
uint64_t counter;
uint32_t ms1;
uint32_t ms2;
uint32_t ls;
do
{
ms1 = get_ms();
ls = get_ls();
ms2 = get_ms();
}
while (ms1 != ms2);
counter = (ms1 << 32) | ls;
Note also that declarations with code is not c89, but that is not so important in M CU-specific code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gregory-nutt Noted, I'll make this change, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If an overflow from the LS to MS occurs, the fetched value will be wrong. I usually do this (which does not require disabling interrupts):
uint64_t counter; uint32_t ms1; uint32_t ms2; uint32_t ls; do { ms1 = get_ms(); ls = get_ls(); ms2 = get_ms(); } while (ms1 != ms2); counter = (ms1 << 32) | ls;
Well, I didn't notice that the MS part is just a software overflow counter and not a hardware register. This that case, what I propose here is nonsense. If interrupts are disabled, not special precautions needs to be taken.
|
||
if (counter > tm) | ||
{ | ||
stm32_interval_handler(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gregory-nutt Any recommendations on how to call the stm32_interval_handler in the interrupt context instead of the callers context so that the expiration works as it would if called on timer expiration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any recommendations on how to call the stm32_interval_handler in the interrupt context instead of the callers context so that the expiration works as it would if called on timer expiration?
You can't really get into the interrupt context without being in an interrupt. Entering a critical section is about as close as you can get.
The only side effect is per Issue #1138 If that is a concern you can also disable pre-emption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is, this function eventually calls nxsig_timeout, like it should, but because the task is still currently running, the signal is not actually set, which means the task eventually waits and since the timeout already happened, things are locked.
I'm assuming it would not be safe to change line 128 of sig_timedwait.c to allow the signal in the TSTATE_TASK_RUNNING state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you are saying that nxsig_timeout() is being called before up_block_task()? This is the only place that nxsig_timeout is referenced:
358 /* Start the watchdog */
359
360 wd_start(rtcb->waitdog, waitticks,
361 nxsig_timeout, 1, wdparm.pvarg);
362
363 /* Now wait for either the signal or the watchdog, but
364 * first, make sure this is not the idle task,
365 * descheduling that isn't going to end well.
366 */
367
368 DEBUGASSERT(NULL != rtcb->flink);
369 up_block_task(rtcb, TSTATE_WAIT_SIG);
The watchdog should not occur because we are in a critical section. We will stay in the critical section until rtcb is blocked in TSTATE_WAIT_SIG state. I guess I don't understand you problem description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gregory-nutt
The issue is that the wd_start happening line 360 is being called with a very small waitticks (6 = 6us).
By the time we get into wd_start, call up_alarm_start() we wind up trying to schedule a time that has already passed. Before my changes, this just resulted in the work queue being locked. I suppose it would wakeup after a full overflow cycle.
So my change is trying to basically expire the timer before it even starts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that nxsig_timeout can occur before we get into up_block_task. I'm not sure how to protect for this yet.
As I argued above, try using up_udelay(6). There is no loss in performance in this case because the CPU is never relinquished in time by the thread anyway. This corrnercase is certainly not worth a redesign.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that nxsig_timeout can occur before we get into up_block_task. I'm not sure how to protect for this yet.
@patacongo @davids5 @xiaoxiang781216 Any ideas?
nxsig_timedwait is protected by the critical section, so before up_block_task get called, the interrupt service doesn't have the chance to run.
As I argued above, try using up_udelay(6). There is no loss in performance in this case because the CPU is never relinquished in time by the thread anyway. This corrnercase is certainly not worth a redesign.
No, if @antmerlino's report is correct, we need fix this bug:
1.Either fix the race condition or
2.Enlarge timeout to a safe value at wd level
How do you check every place to ensure all timeout isn't too small to hit the race condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@patacongo Can you explain something to me.
In sig_timedwait.c line 258 we enter a critical section, disabling interrupts.
We do not call leave_critical_section(), re-enabling interrupts before we start the watchdog and call up_block_task.
How does this work? Does IRQ context switch somewhere when the task switches to re-enable interrupts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@patacongo Can you explain something to me.
In sig_timedwait.c line 258 we enter a critical section, disabling interrupts.
We do not call leave_critical_section(), re-enabling interrupts before we start the watchdog and call up_block_task.
How does this work? Does IRQ context switch somewhere when the task switches to re-enable interrupts?
The critical section (and interrupt state) is a per thread property. Disabling interrupts only has effect while the task is running. When it suspends and another task runs, then that task's interrupt state is placed into effect. So (usually) interrupts are re-enabled while the task sleeps.
I works in the context switch because the context switch saves and restores the current state of the ARM interrupt control registers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, my report was wrong. There was never a race condition at the sig_timedwait level - Greg's answer cleared that up for me. I just didn't protect for the race condition completely at the STM32 level, there was still a very specific timing sequence that would cause it to lock up.
I've now been running stable with the latest branch.
I can't get the nxstyle line 800 error to go away. Otherwise I fixed all other formatting issues.
This is ready for final review and then merging.
b8146a8
to
c83be2c
Compare
… is set to a value less than the current time.
c83be2c
to
0018e07
Compare
@antmerlino I fixed that nxstyle error. |
LGTM. |
… is set to a value less than the current time.
Summary
When using a very high resolution timer, and very small time intervals, it is possible for the Compare register of the STM32 timer peripheral to be set to a value that has just passed. This means the timer expiration won't happen until a full overflow cycle of the timer.
The proposed solution is to disable interrupts, set the capture register, and then after, but before turning back on interrupts, check to see if the time has already passed. If it has, we force the expiration immediately.
Impact
All tickless drivers should be evaluated for this flaw. The LPC family devices have more protection for this and guided my understanding of this issue.
Testing
Run on STM32F765 processor with 1us clock and interval timers of varying times - some very short (6us)