pm: Fix atomicity of rtos idle lock acquiring (IDFGH-876) #3110

Conversation

tim-nordell-nimbelink (Contributor)

It's possible for esp_pm_impl_isr_hook(...) to be nested, because interrupts are nested
on the ESP32. To fix this we need to place the acquiring of the lock into a critical
section so that it cannot be nested; otherwise the system will never release the idle
lock when this occurs and will not go into lower power states.

A sample backtrace encountering this, from commit d7a7a68 (the code was instrumented to
enter a while(1) loop when the condition was hit in order to capture this backtrace):

    #0  leave_idle () at esp-idf/components/esp32/pm_esp32.c:444
    #1  0x4008143a in esp_pm_impl_isr_hook () at esp-idf/components/esp32/pm_esp32.c:473
    #2  0x40082750 in _xt_medint2 () at esp-idf/components/freertos/xtensa_vectors.S:1243
    #3  0x4000bff0 in ?? ()
    #4  0x40090bb0 in vTaskExitCritical (mux=0x3ffbd230) at esp-idf/components/freertos/tasks.c:4304
    #5  0x40081758 in esp_pm_lock_acquire (handle=0x3ffbd218) at esp-idf/components/esp32/pm_locks.c:126
    #6  0x40081399 in leave_idle () at esp-idf/components/esp32/pm_esp32.c:440
    #7  0x4008143a in esp_pm_impl_isr_hook () at esp-idf/components/esp32/pm_esp32.c:473
    #8  0x400826b8 in _xt_lowint1 () at esp-idf/components/freertos/xtensa_vectors.S:1154
    #9  0x400d14b0 in esp_pm_impl_waiti () at esp-idf/components/esp32/pm_esp32.c:483
    #10 0x400d2c77 in esp_vApplicationIdleHook () at esp-idf/components/esp32/freertos_hooks.c:63
    #11 0x40091008 in prvIdleTask (pvParameters=0x0) at esp-idf/components/freertos/tasks.c:3412
    #12 0x40090344 in vPortTaskWrapper (pxCode=0x40090ffc <prvIdleTask>, pvParameters=0x0) at esp-idf/components/freertos/port.c:143

Signed-off-by: Tim Nordell <tim.nordell@nimbelink.com>
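
For illustration, here is a minimal sketch of the change described above: the idle-lock check and acquisition in leave_idle() are wrapped in a critical section so that a nested esp_pm_impl_isr_hook on the same core cannot interleave with them. The state names (s_core_idle, s_rtos_lock_handle, s_switch_lock) are meant to mirror pm_esp32.c, but the code is simplified and is not the exact patch.

    #include <stdbool.h>
    #include "freertos/FreeRTOS.h"
    #include "esp_pm.h"

    /* Simplified stand-ins for the per-core state kept in pm_esp32.c. */
    static bool s_core_idle[portNUM_PROCESSORS];
    static esp_pm_lock_handle_t s_rtos_lock_handle[portNUM_PROCESSORS];
    static portMUX_TYPE s_switch_lock = portMUX_INITIALIZER_UNLOCKED;

    static void leave_idle(void)
    {
        int core_id = xPortGetCoreID();
        /* The check and the acquire must be atomic with respect to nested
         * interrupts on this core; otherwise a nested esp_pm_impl_isr_hook
         * can acquire the idle lock a second time before s_core_idle is
         * cleared, and that extra reference is never released. */
        portENTER_CRITICAL_ISR(&s_switch_lock);
        if (s_core_idle[core_id]) {
            esp_pm_lock_acquire(s_rtos_lock_handle[core_id]);
            s_core_idle[core_id] = false;
        }
        portEXIT_CRITICAL_ISR(&s_switch_lock);
    }

Without the critical section, the nested invocation visible in frames #1/#7 of the backtrace can run leave_idle() a second time between the lock acquisition and the clearing of s_core_idle, leaving an extra reference on the idle lock.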

@CLAassistant commented Feb 28, 2019

CLA assistant check
All committers have signed the CLA.

@projectgus requested a review from igrr, March 5, 2019 05:43
The github-actions bot changed the title from "pm: Fix atomicity of rtos idle lock acquiring" to "pm: Fix atomicity of rtos idle lock acquiring (IDFGH-876)", Mar 28, 2019
@igrr (Member) commented Apr 2, 2019

Thanks for submitting this @tim-nordell-nimbelink! While running some stress tests with this change applied, we ran into a rare crash that is still being investigated. The fix itself looks correct, but we want to make sure we understand the cause of this new crash before we merge it. Will update when we have some progress.

@tim-nordell-nimbelink (Contributor, Author)

Hi @igrr -

Thanks for looking it over. Hopefully you can find the root cause of the new crash that you're observing! I can understand that a crash that happens rarely is hard to debug, and testing is always a good thing.

Good luck!
Tim

@igrr (Member) commented Apr 15, 2019

@tim-nordell-nimbelink The issues I previously observed were due to the fact that preemption may happen not only inside the leave_idle function, but also inside the update_ccompare function (which is called from esp_pm_impl_isr_hook). This resulted in issues such as #3057 when the following sequence of events occurs:

  1. Level 1 interrupt happens, esp_pm_impl_isr_hook is called
  2. Hook runs up to update_ccompare function
  3. Level 2 interrupt happens on the same CPU, esp_pm_impl_isr_hook is called again.
  4. update_ccompare returns, s_need_update_ccompare[core_id] is set to false.
  5. The other CPU (which was waiting for s_need_update_ccompare[core_id] to become false) sets s_ccount_div = 0 (this happens in the on_freq_update function).
  6. Level 2 interrupt completes and level 1 interrupt resumes in update_ccompare function
  7. Integer division by 0 happens in update_ccompare, because s_ccount_div is now zero.

Please check the attached patch; it moves the interrupt lock into esp_pm_impl_isr_hook and solves the issues we have observed. Please let me know if you have any concerns; if not, we will merge that instead of your PR.
4775.patch.zip
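
A rough sketch of that approach (masking interrupts for the whole ISR hook rather than only inside leave_idle) might look like the following. This is illustrative only and is not the contents of 4775.patch.zip; update_ccompare(), leave_idle() and s_need_update_ccompare are the existing internals of pm_esp32.c.

    /* Sketch of pm_esp32.c's ISR hook with interrupts masked for its whole
     * duration, assuming the existing helpers and per-core flag of that file. */
    void esp_pm_impl_isr_hook(void)
    {
        int core_id = xPortGetCoreID();
        /* Raise the interrupt level for the entire hook so that a
         * higher-priority interrupt on this core cannot run a nested
         * esp_pm_impl_isr_hook in the middle of update_ccompare() or
         * leave_idle(). */
        unsigned state = portENTER_CRITICAL_NESTED();
        if (s_need_update_ccompare[core_id]) {
            update_ccompare();
            s_need_update_ccompare[core_id] = false;
        } else {
            leave_idle();
        }
        portEXIT_CRITICAL_NESTED(state);
    }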

@tim-nordell-nimbelink (Contributor, Author) commented Apr 15, 2019

Hi @igrr -

Your patch should still maintain the fix for the issue I saw; I'm not running the update_ccompare() code here since I'm running on a single core for power consumption reasons.

I do think I see a potential problem within on_freq_update(...) though - is it possible for both cores to run this code simultaneously? If so, then:

  1. Core 1 and 2 are past setting s_ccount_div to old_ticks_per_us
  2. Core 2 gets an interrupt so it doesn't get as far as notifying core 1
  3. Core 1 notifies core 2
  4. Core 2 handles update of ccompare from interrupt
  5. Core 2 gets another interrupt before notifying core 1
  6. Core 1 finishes and sets s_ccount_div to 0
  7. Core 2 notifies core 1 of update for ccompare[...]
  8. Core 1 goes into interrupt for handling ccompare(...), and has a divide by 0

Another scenario:

  1. Core 1 and 2 are past setting s_ccount_div to old_ticks_per_us
  2. Core 2 gets an interrupt before reaching update_ccompare()
  3. Core 1 notifies core 2 for updating
  4. Core 2 processes interrupt for updating ccompare()
  5. Core 1 sets s_ccount_div to 0
  6. Core 2 exits the interrupt service routine and gets into the update_ccompare(...) routine, but s_ccount_div is set to 0 at this point, so it faults.

It would take enough interrupt activity at just the right time for this race to occur, but I bet it's possible. I think you need a lock that spans both cores (instead of one per core) around the setting and clearing of s_ccount_div/s_ccount_mul. I'll never see this race condition myself since I'm running a single core.
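
To make that concrete, a hypothetical cross-core guard could look roughly like this; the helper name set_ccount_scaling and the single shared spinlock are illustration only, not part of any patch in this thread, and the variables are re-declared here only so the sketch stands alone.

    #include <stdint.h>
    #include "freertos/FreeRTOS.h"

    /* s_ccount_div and s_ccount_mul are the existing pm_esp32.c variables. */
    static uint32_t s_ccount_div;
    static uint32_t s_ccount_mul;
    static portMUX_TYPE s_ccount_vars_lock = portMUX_INITIALIZER_UNLOCKED;

    static void set_ccount_scaling(uint32_t div, uint32_t mul)
    {
        /* Unlike a per-core critical section, this spinlock is shared by
         * both CPUs, so one core cannot zero the divider while the other
         * core is still using it. */
        portENTER_CRITICAL_ISR(&s_ccount_vars_lock);
        s_ccount_div = div;
        s_ccount_mul = mul;
        portEXIT_CRITICAL_ISR(&s_ccount_vars_lock);
    }

update_ccompare() would take the same lock around its read of the divider and the division itself, so neither sequence above could reach the division after the other core has cleared s_ccount_div.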

If you do indeed need the multi-core lock, that might allow the lock proposed in this pull request to remain where it is, since the setting of s_need_update_ccompare[...] happens solely within on_freq_update(...).

  • Tim

igrr added a commit that referenced this pull request Apr 26, 2019
Follows the approach proposed in #3110,
but masks the interrupts during the entire ISR hook, not only during
leave_idle. Interrupt nesting during update_ccompare may also cause
issues.

Closes #3057
@igrr (Member) commented Apr 30, 2019

@tim-nordell-nimbelink Sorry that I hadn't noticed your comment earlier.

> potential problem within on_freq_update(...) though - is it possible for both cores to run this code simultaneously?

I think this situation should not happen, due to the check at the top of the do_switch function. If a frequency switch is already in progress, then the other CPU will either cancel its switch or wait until the first one is completed.
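
A loose sketch of that kind of guard is below; the s_is_switching and s_target_mode names and the exact cancel-or-wait policy are assumptions for illustration, not the actual do_switch implementation.

    #include <stdbool.h>
    #include "freertos/FreeRTOS.h"

    typedef int pm_mode_t;   /* simplified stand-in for the real pm_mode_t enum */

    static portMUX_TYPE s_switch_lock = portMUX_INITIALIZER_UNLOCKED;
    static volatile bool s_is_switching;
    static volatile pm_mode_t s_target_mode;

    static void do_switch(pm_mode_t new_mode)
    {
        while (true) {
            portENTER_CRITICAL_ISR(&s_switch_lock);
            if (!s_is_switching) {
                s_is_switching = true;          /* no switch in progress: claim it */
                s_target_mode = new_mode;
                portEXIT_CRITICAL_ISR(&s_switch_lock);
                break;
            }
            bool same_target = (s_target_mode == new_mode);
            portEXIT_CRITICAL_ISR(&s_switch_lock);
            if (same_target) {
                return;                         /* the other CPU is already doing this switch */
            }
            /* otherwise spin until the other CPU finishes, then retry */
        }

        /* ... perform the frequency switch here ... */

        s_is_switching = false;
    }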

The patch attached above has been merged in d31ee80. Please let me know if you see any related issues. And thanks for contributing!

@igrr closed this Apr 30, 2019
igrr added a commit that referenced this pull request Jul 2, 2019 (same commit message as above)
trombik pushed a commit to trombik/esp-idf that referenced this pull request Aug 9, 2019 (same commit message as above)
igrr added a commit that referenced this pull request Sep 21, 2019 (same commit message as above)
igrr added a commit that referenced this pull request Oct 21, 2019 (same commit message as above)