Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arch/stm32f7: Fixes bug in tickless driver where the compare register… #1380

Merged
merged 2 commits into from Jul 18, 2020

Conversation

antmerlino
Copy link
Contributor

@antmerlino antmerlino commented Jul 7, 2020

… is set to a value less than the current time.

Summary

When using a very high resolution timer, and very small time intervals, it is possible for the Compare register of the STM32 timer peripheral to be set to a value that has just passed. This means the timer expiration won't happen until a full overflow cycle of the timer.

The proposed solution is to disable interrupts, set the capture register, and then after, but before turning back on interrupts, check to see if the time has already passed. If it has, we force the expiration immediately.

Impact

All tickless drivers should be evaluated for this flaw. The LPC family devices have more protection for this and guided my understanding of this issue.

Testing

Run on STM32F765 processor with 1us clock and interval timers of varying times - some very short (6us)

g_tickless.pending = true;

uint64_t counter = ((uint64_t)g_tickless.overflow << 32) |
STM32_TIM_GETCOUNTER(g_tickless.tch);
Copy link
Contributor

@patacongo patacongo Jul 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an overflow from the LS to MS occurs, the fetched value will be wrong. I usually do this (which does not require disabling interrupts):

uint64_t counter;
uint32_t ms1;
uint32_t ms2;
uint32_t ls;

do
  {
    ms1 = get_ms();
    ls  = get_ls();
    ms2 = get_ms();
  }
while (ms1 != ms2);

counter = (ms1 << 32) | ls;

Note also that declarations with code is not c89, but that is not so important in M CU-specific code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gregory-nutt Noted, I'll make this change, thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an overflow from the LS to MS occurs, the fetched value will be wrong. I usually do this (which does not require disabling interrupts):

uint64_t counter;
uint32_t ms1;
uint32_t ms2;
uint32_t ls;

do
  {
    ms1 = get_ms();
    ls  = get_ls();
    ms2 = get_ms();
  }
while (ms1 != ms2);

counter = (ms1 << 32) | ls;

Well, I didn't notice that the MS part is just a software overflow counter and not a hardware register. This that case, what I propose here is nonsense. If interrupts are disabled, not special precautions needs to be taken.


if (counter > tm)
{
stm32_interval_handler();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gregory-nutt Any recommendations on how to call the stm32_interval_handler in the interrupt context instead of the callers context so that the expiration works as it would if called on timer expiration?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any recommendations on how to call the stm32_interval_handler in the interrupt context instead of the callers context so that the expiration works as it would if called on timer expiration?

You can't really get into the interrupt context without being in an interrupt. Entering a critical section is about as close as you can get.

The only side effect is per Issue #1138 If that is a concern you can also disable pre-emption.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gregory-nutt

The problem is, this function eventually calls nxsig_timeout, like it should, but because the task is still currently running, the signal is not actually set, which means the task eventually waits and since the timeout already happened, things are locked.

I'm assuming it would not be safe to change line 128 of sig_timedwait.c to allow the signal in the TSTATE_TASK_RUNNING state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you are saying that nxsig_timeout() is being called before up_block_task()? This is the only place that nxsig_timeout is referenced:

358               /* Start the watchdog */
359
360               wd_start(rtcb->waitdog, waitticks,
361                        nxsig_timeout, 1, wdparm.pvarg);
362
363               /* Now wait for either the signal or the watchdog, but
364                * first, make sure this is not the idle task,
365                * descheduling that isn't going to end well.
366                */
367
368               DEBUGASSERT(NULL != rtcb->flink);
369               up_block_task(rtcb, TSTATE_WAIT_SIG);

The watchdog should not occur because we are in a critical section. We will stay in the critical section until rtcb is blocked in TSTATE_WAIT_SIG state. I guess I don't understand you problem description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gregory-nutt
The issue is that the wd_start happening line 360 is being called with a very small waitticks (6 = 6us).

By the time we get into wd_start, call up_alarm_start() we wind up trying to schedule a time that has already passed. Before my changes, this just resulted in the work queue being locked. I suppose it would wakeup after a full overflow cycle.

So my change is trying to basically expire the timer before it even starts.

Copy link
Contributor

@patacongo patacongo Jul 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that nxsig_timeout can occur before we get into up_block_task. I'm not sure how to protect for this yet.

As I argued above, try using up_udelay(6). There is no loss in performance in this case because the CPU is never relinquished in time by the thread anyway. This corrnercase is certainly not worth a redesign.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that nxsig_timeout can occur before we get into up_block_task. I'm not sure how to protect for this yet.
@patacongo @davids5 @xiaoxiang781216 Any ideas?

nxsig_timedwait is protected by the critical section, so before up_block_task get called, the interrupt service doesn't have the chance to run.

As I argued above, try using up_udelay(6). There is no loss in performance in this case because the CPU is never relinquished in time by the thread anyway. This corrnercase is certainly not worth a redesign.

No, if @antmerlino's report is correct, we need fix this bug:
1.Either fix the race condition or
2.Enlarge timeout to a safe value at wd level
How do you check every place to ensure all timeout isn't too small to hit the race condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@patacongo Can you explain something to me.

In sig_timedwait.c line 258 we enter a critical section, disabling interrupts.

We do not call leave_critical_section(), re-enabling interrupts before we start the watchdog and call up_block_task.

How does this work? Does IRQ context switch somewhere when the task switches to re-enable interrupts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@patacongo Can you explain something to me.

In sig_timedwait.c line 258 we enter a critical section, disabling interrupts.

We do not call leave_critical_section(), re-enabling interrupts before we start the watchdog and call up_block_task.

How does this work? Does IRQ context switch somewhere when the task switches to re-enable interrupts?

The critical section (and interrupt state) is a per thread property. Disabling interrupts only has effect while the task is running. When it suspends and another task runs, then that task's interrupt state is placed into effect. So (usually) interrupts are re-enabled while the task sleeps.

I works in the context switch because the context switch saves and restores the current state of the ARM interrupt control registers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, my report was wrong. There was never a race condition at the sig_timedwait level - Greg's answer cleared that up for me. I just didn't protect for the race condition completely at the STM32 level, there was still a very specific timing sequence that would cause it to lock up.

I've now been running stable with the latest branch.

I can't get the nxstyle line 800 error to go away. Otherwise I fixed all other formatting issues.

This is ready for final review and then merging.

@antmerlino antmerlino force-pushed the stm32f7-tickless branch 3 times, most recently from b8146a8 to c83be2c Compare July 8, 2020 16:12
… is set to a value less than the current time.
@antmerlino antmerlino changed the title [DRAFT] arch/stm32f7: Fixes bug in tickless driver where the compare register… arch/stm32f7: Fixes bug in tickless driver where the compare register… Jul 10, 2020
@Ouss4
Copy link
Member

Ouss4 commented Jul 18, 2020

I can't get the nxstyle line 800 error to go away. Otherwise I fixed all other formatting issues.
This is ready for final review and then merging.

@antmerlino I fixed that nxstyle error.
@davids5 @xiaoxiang781216 @patacongo If you don't have any other concerns, please merge.

@xiaoxiang781216
Copy link
Contributor

LGTM.

@xiaoxiang781216 xiaoxiang781216 merged commit f91372c into apache:master Jul 18, 2020
@btashton btashton added this to To-Add in Release Notes - 10.0.0 Oct 14, 2020
@btashton btashton moved this from To-Add to Added in Release Notes - 10.0.0 Oct 19, 2020
@antmerlino antmerlino deleted the stm32f7-tickless branch March 26, 2022 03:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

5 participants