pm: Fix atomicity of rtos idle lock acquiring (IDFGH-876) #3110

Conversation

tim-nordell-nimbelink (Contributor)

It's possible for esp_pm_impl_isr_hook(...) to be nested, because interrupts are nested
on the ESP32. To fix this we need to place the acquiring of the lock into a critical
section so that it cannot be nested; otherwise the system will never release the idle
lock when this occurs and will not go into lower power states.

A sample backtrace encountering this, from commit d7a7a68 (the code was instrumented to
enter a while(1) loop when the condition was hit in order to capture this backtrace):

    #0  leave_idle () at esp-idf/components/esp32/pm_esp32.c:444
    #1  0x4008143a in esp_pm_impl_isr_hook () at esp-idf/components/esp32/pm_esp32.c:473
    #2  0x40082750 in _xt_medint2 () at esp-idf/components/freertos/xtensa_vectors.S:1243
    #3  0x4000bff0 in ?? ()
    #4  0x40090bb0 in vTaskExitCritical (mux=0x3ffbd230) at esp-idf/components/freertos/tasks.c:4304
    #5  0x40081758 in esp_pm_lock_acquire (handle=0x3ffbd218) at esp-idf/components/esp32/pm_locks.c:126
    #6  0x40081399 in leave_idle () at esp-idf/components/esp32/pm_esp32.c:440
    #7  0x4008143a in esp_pm_impl_isr_hook () at esp-idf/components/esp32/pm_esp32.c:473
    #8  0x400826b8 in _xt_lowint1 () at esp-idf/components/freertos/xtensa_vectors.S:1154
    #9  0x400d14b0 in esp_pm_impl_waiti () at esp-idf/components/esp32/pm_esp32.c:483
    #10 0x400d2c77 in esp_vApplicationIdleHook () at esp-idf/components/esp32/freertos_hooks.c:63
    #11 0x40091008 in prvIdleTask (pvParameters=0x0) at esp-idf/components/freertos/tasks.c:3412
    #12 0x40090344 in vPortTaskWrapper (pxCode=0x40090ffc <prvIdleTask>, pvParameters=0x0) at esp-idf/components/freertos/port.c:143

Signed-off-by: Tim Nordell <tim.nordell@nimbelink.com>
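
For illustration, here is a minimal sketch of the change described above: the idle-lock check and acquisition in leave_idle() are wrapped in a critical section so that a nested esp_pm_impl_isr_hook on the same core cannot interleave with them. The state names (s_core_idle, s_rtos_lock_handle, s_switch_lock) are meant to mirror pm_esp32.c, but the code is simplified and is not the exact patch.

    #include <stdbool.h>
    #include "freertos/FreeRTOS.h"
    #include "esp_pm.h"

    /* Simplified stand-ins for the per-core state kept in pm_esp32.c. */
    static bool s_core_idle[portNUM_PROCESSORS];
    static esp_pm_lock_handle_t s_rtos_lock_handle[portNUM_PROCESSORS];
    static portMUX_TYPE s_switch_lock = portMUX_INITIALIZER_UNLOCKED;

    static void leave_idle(void)
    {
        int core_id = xPortGetCoreID();
        /* The check and the acquire must be atomic with respect to nested
         * interrupts on this core; otherwise a nested esp_pm_impl_isr_hook
         * can acquire the idle lock a second time before s_core_idle is
         * cleared, and that extra reference is never released. */
        portENTER_CRITICAL_ISR(&s_switch_lock);
        if (s_core_idle[core_id]) {
            esp_pm_lock_acquire(s_rtos_lock_handle[core_id]);
            s_core_idle[core_id] = false;
        }
        portEXIT_CRITICAL_ISR(&s_switch_lock);
    }

Without the critical section, the nested invocation visible in frames #1/#7 of the backtrace can run leave_idle() a second time between the lock acquisition and the clearing of s_core_idle, leaving an extra reference on the idle lock.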

@CLAassistant commented Feb 28, 2019

CLA assistant check
All committers have signed the CLA.

@projectgus requested a review from igrr, March 5, 2019 05:43
The github-actions bot changed the title from "pm: Fix atomicity of rtos idle lock acquiring" to "pm: Fix atomicity of rtos idle lock acquiring (IDFGH-876)", Mar 28, 2019
@igrr (Member) commented Apr 2, 2019

Thanks for submitting this @tim-nordell-nimbelink! While running some stress tests with this change applied, we ran into a rare crash that is still being investigated. The fix itself looks correct, but we want to make sure we understand the cause of this new crash before we merge it. Will update when we have some progress.

@tim-nordell-nimbelink (Contributor, Author)

Hi @igrr -

Thanks for looking it over. Hopefully you can find the root cause of the new crash that you're observing! I can understand that a crash that happens rarely is hard to debug, and testing is always a good thing.

Good luck!
Tim

@igrr (Member) commented Apr 15, 2019

@tim-nordell-nimbelink The issues I previously observed were due to the fact that preemption may happen not only inside the leave_idle function, but also inside the update_ccompare function (which is called from esp_pm_impl_isr_hook). This resulted in issues such as #3057 when the following sequence of events occurs:

  1. Level 1 interrupt happens, esp_pm_impl_isr_hook is called
  2. Hook runs up to update_ccompare function
  3. Level 2 interrupt happens on the same CPU, esp_pm_impl_isr_hook is called again.
  4. update_ccompare returns, s_need_update_ccompare[core_id] is set to false.
  5. The other CPU (which was waiting for s_need_update_ccompare[core_id] to become false) sets s_ccount_div = 0 (this happens in the on_freq_update function).
  6. Level 2 interrupt completes and level 1 interrupt resumes in update_ccompare function
  7. Integer division by 0 happens in update_ccompare, because s_ccount_div is now zero.

Please check the attached patch; it moves the interrupt lock into esp_pm_impl_isr_hook and solves the issues we have observed. Please let me know if you have any concerns; if not, we will merge that instead of your PR.
4775.patch.zip
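
A rough sketch of that approach (masking interrupts for the whole ISR hook rather than only inside leave_idle) might look like the following. This is illustrative only and is not the contents of 4775.patch.zip; update_ccompare(), leave_idle() and s_need_update_ccompare are the existing internals of pm_esp32.c.

    /* Sketch of pm_esp32.c's ISR hook with interrupts masked for its whole
     * duration, assuming the existing helpers and per-core flag of that file. */
    void esp_pm_impl_isr_hook(void)
    {
        int core_id = xPortGetCoreID();
        /* Raise the interrupt level for the entire hook so that a
         * higher-priority interrupt on this core cannot run a nested
         * esp_pm_impl_isr_hook in the middle of update_ccompare() or
         * leave_idle(). */
        unsigned state = portENTER_CRITICAL_NESTED();
        if (s_need_update_ccompare[core_id]) {
            update_ccompare();
            s_need_update_ccompare[core_id] = false;
        } else {
            leave_idle();
        }
        portEXIT_CRITICAL_NESTED(state);
    }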

@tim-nordell-nimbelink (Contributor, Author) commented Apr 15, 2019

Hi @igrr -

Your patch should still maintain the fix for the issue I saw; I'm not running the update_ccompare() code here since I'm running on a single core for power consumption reasons.

I do think I see a potential problem within on_freq_update(...) though - is it possible for both cores to run this code simultaneously? If so, then:

  1. Core 1 and 2 are past setting s_ccount_div to old_ticks_per_us
  2. Core 2 gets an interrupt so it doesn't get as far as notifying core 1
  3. Core 1 notifies core 2
  4. Core 2 handles update of ccompare from interrupt
  5. Core 2 gets another interrupt before notifying core 1
  6. Core 1 finishes and sets s_ccount_div to 0
  7. Core 2 notifies core 1 of update for ccompare[...]
  8. Core 1 goes into interrupt for handling ccompare(...), and has a divide by 0

Another scenario:

  1. Core 1 and 2 are past setting s_ccount_div to old_ticks_per_us
  2. Core 2 gets an interrupt before reaching update_ccompare()
  3. Core 1 notifies core 2 for updating
  4. Core 2 processes interrupt for updating ccompare()
  5. Core 1 sets s_ccount_div to 0
  6. Core 2 exits the interrupt service routine and gets into the update_ccompare(...) routine, but s_ccount_div is set to 0 at this point, so it faults.

It would take enough interrupt activity at just the right time for this race to occur, but I bet it's possible. I think you need a lock that spans both cores (instead of one per core) around the setting and clearing of s_ccount_div/s_ccount_mul. I'll never see this race condition myself since I'm running a single core.
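
To make that concrete, a hypothetical cross-core guard could look roughly like this; the helper name set_ccount_scaling and the single shared spinlock are illustration only, not part of any patch in this thread, and the variables are re-declared here only so the sketch stands alone.

    #include <stdint.h>
    #include "freertos/FreeRTOS.h"

    /* s_ccount_div and s_ccount_mul are the existing pm_esp32.c variables. */
    static uint32_t s_ccount_div;
    static uint32_t s_ccount_mul;
    static portMUX_TYPE s_ccount_vars_lock = portMUX_INITIALIZER_UNLOCKED;

    static void set_ccount_scaling(uint32_t div, uint32_t mul)
    {
        /* Unlike a per-core critical section, this spinlock is shared by
         * both CPUs, so one core cannot zero the divider while the other
         * core is still using it. */
        portENTER_CRITICAL_ISR(&s_ccount_vars_lock);
        s_ccount_div = div;
        s_ccount_mul = mul;
        portEXIT_CRITICAL_ISR(&s_ccount_vars_lock);
    }

update_ccompare() would take the same lock around its read of the divider and the division itself, so neither sequence above could reach the division after the other core has cleared s_ccount_div.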

If you do indeed need the multi-core lock, that might allow the lock proposed in this pull request to remain where it is, since the setting of s_need_update_ccompare[...] happens solely within on_freq_update(...).

  • Tim

igrr added a commit that referenced this pull request Apr 26, 2019
Follows the approach proposed in #3110,
but masks the interrupts during the entire ISR hook, not only during
leave_idle. Interrupt nesting during update_ccompare may also cause
issues.

Closes #3057
@igrr (Member) commented Apr 30, 2019

@tim-nordell-nimbelink Sorry that I hadn't noticed your comment earlier.

> potential problem within on_freq_update(...) though - is it possible for both cores to run this code simultaneously?

I think this situation should not happen, due to the check at the top of the do_switch function. If a frequency switch is already in progress, then the other CPU will either cancel its switch or wait until the first one is completed.
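
A loose sketch of that kind of guard is below; the s_is_switching and s_target_mode names and the exact cancel-or-wait policy are assumptions for illustration, not the actual do_switch implementation.

    #include <stdbool.h>
    #include "freertos/FreeRTOS.h"

    typedef int pm_mode_t;   /* simplified stand-in for the real pm_mode_t enum */

    static portMUX_TYPE s_switch_lock = portMUX_INITIALIZER_UNLOCKED;
    static volatile bool s_is_switching;
    static volatile pm_mode_t s_target_mode;

    static void do_switch(pm_mode_t new_mode)
    {
        while (true) {
            portENTER_CRITICAL_ISR(&s_switch_lock);
            if (!s_is_switching) {
                s_is_switching = true;          /* no switch in progress: claim it */
                s_target_mode = new_mode;
                portEXIT_CRITICAL_ISR(&s_switch_lock);
                break;
            }
            bool same_target = (s_target_mode == new_mode);
            portEXIT_CRITICAL_ISR(&s_switch_lock);
            if (same_target) {
                return;                         /* the other CPU is already doing this switch */
            }
            /* otherwise spin until the other CPU finishes, then retry */
        }

        /* ... perform the frequency switch here ... */

        s_is_switching = false;
    }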

The patch attached above has been merged in d31ee80. Please let me know if you see any related issues. And thanks for contributing!

@igrr closed this Apr 30, 2019
igrr added a commit that referenced this pull request Jul 2, 2019 (same commit message as above)
trombik pushed a commit to trombik/esp-idf that referenced this pull request Aug 9, 2019 (same commit message as above)
igrr added a commit that referenced this pull request Sep 21, 2019 (same commit message as above)
igrr added a commit that referenced this pull request Oct 21, 2019 (same commit message as above)