Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve thread safety in esp_timer (IDFGH-9920) #11215

Merged

Conversation

JensGutermuth
Copy link
Contributor

Inadequate locking in the esp_timer component allowed corruption of the s_timers linked list:

  1. timer_armed(timer) returns false
  2. another task arms the timer and adds it to s_timers
  3. the list is locked
  4. the timer is inserted into s_timers again

The last step results in a loop in the s_timers list, which causes an infinite loop when iterated. This change always locks the list before checking if the timer is already armed avoiding the data race.

@CLAassistant
Copy link

CLAassistant commented Apr 17, 2023

CLA assistant check
All committers have signed the CLA.

@espressif-bot espressif-bot added the Status: Opened Issue is new label Apr 17, 2023
@github-actions github-actions bot changed the title improve thread safety in esp_timer improve thread safety in esp_timer (IDFGH-9920) Apr 17, 2023
Copy link
Member

@igrr igrr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix @JensGutermuth, the changes looks good to me!

Do you happen to have any piece of code which triggers this race condition? I would like to add a test case for this, so if you have some code already it would save a bit of time. (I.e. something that fails without the fix and works correctly after the fix.)

components/esp_timer/src/esp_timer.c Outdated Show resolved Hide resolved
components/esp_timer/src/esp_timer.c Outdated Show resolved Hide resolved
@JensGutermuth
Copy link
Contributor Author

I'll see if I can put together a small example triggering the race tomorrow. I've seen the issue sporadically in a fairly large proprietary project. Heavily using the NimBLE GATT server seemed to have triggered it most often.

Side note: You guys are fast! Getting a thoughtful response in under 30 minutes is impressive. Very cool!

@JensGutermuth
Copy link
Contributor Author

Replacing main.c of the sample project with the following code and building it with the default sdkconfig should trigger the race and cause a crash within a couple of seconds on an ESP32 using esp-idf ef64e4e (current master). I've added an example console output below.

Code to reproduce the race in esp_timer

#include <stdio.h>

#include <freertos/FreeRTOS.h>
#include <freertos/task.h>

#include <esp_timer.h>
#include <esp_err.h>
#include <esp_log.h>

static esp_timer_handle_t shared_timer = NULL;
static volatile int shared_timer_happened = 0;

void esp_timer_callback(void * arg)
{
    // do something
    shared_timer_happened++;
}

/**
 * @brief Starts and stops the shared_timer in a loop to trigger a race
 *
 * Starting and stopping an esp_timer from different tasks may
 * trigger a race. This function usually triggers it within a couple of seconds
 * if run concurrently in different tasks, perferably on different cores.
 */
void toggle_shared_timer(void * pvParameters)
{
    while(1)
    {
        for (int i = 0; i < 10000; i++)
        {
            esp_timer_start_once(shared_timer, 1000);
            esp_timer_stop(shared_timer);
        }
        ESP_LOGI("check", "ok core %d", xPortGetCoreID());

        // yield to allow the IDLE task to run
        vTaskDelay(1);
    }

}

void app_main(void)
{
    // create shared timer
    esp_timer_create_args_t create_args = {
        .callback = esp_timer_callback,
        .arg = NULL,
        .dispatch_method = ESP_TIMER_TASK,
        .name = "shared_timer",
        .skip_unhandled_events = true
    };
    ESP_ERROR_CHECK(esp_timer_create(&create_args, &shared_timer));

    // start second task
    xTaskCreatePinnedToCore(toggle_shared_timer, "second_task", 4096, NULL, 20, NULL, 1);

    // toggle the shared timer in a loop
    toggle_shared_timer(NULL);
}

Example console output

I (319) app_start: Starting scheduler on CPU0
I (323) app_start: Starting scheduler on CPU1
I (323) main_task: Started on CPU0
I (333) main_task: Calling app_main()
Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0).

Core  0 register dump:
PC      : 0x40084625  PS      : 0x00060034  A0      : 0x80087ea8  A1      : 0x3ffb3bc0
0x40084625: xt_utils_compare_and_set at <...>/esp-idf/components/xtensa/include/xt_utils.h:222
 (inlined by) esp_cpu_compare_and_set at <...>/esp-idf/components/esp_hw_support/cpu.c:438

A2      : 0x3ffb012c  A3      : 0xb33fffff  A4      : 0x0000abab  A5      : 0x00060023
A6      : 0xb33fffff  A7      : 0xb33fffff  A8      : 0x00000000  A9      : 0x00000000
A10     : 0x00000001  A11     : 0x00000000  A12     : 0x80087ff6  A13     : 0x3ffb4a40
A14     : 0x00000003  A15     : 0x00060023  SAR     : 0x00000000  EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000


Backtrace: 0x40084622:0x3ffb3bc0 0x40087ea5:0x3ffb3be0 0x40082dfe:0x3ffb3c10 0x400d33f9:0x3ffb3c30 0x400d3529:0x3ffb3c60 0x40087d0d:0x3ffb3c80
0x40084622: xt_utils_compare_and_set at <...>/esp-idf/components/xtensa/include/xt_utils.h:215
 (inlined by) esp_cpu_compare_and_set at <...>/esp-idf/components/esp_hw_support/cpu.c:438

0x40087ea5: spinlock_acquire at <...>/esp-idf/components/esp_hw_support/include/spinlock.h:112
 (inlined by) xPortEnterCriticalTimeout at <...>/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:501

0x40082dfe: xPortEnterCriticalTimeoutSafe at <...>/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/include/freertos/portmacro.h:589
 (inlined by) vPortEnterCriticalSafe at <...>/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/include/freertos/portmacro.h:596
 (inlined by) timer_list_lock at <...>/esp-idf/components/esp_timer/src/esp_timer.c:356

0x400d33f9: timer_process_alarm at <...>/esp-idf/components/esp_timer/src/esp_timer.c:370

0x400d3529: timer_task at <...>/esp-idf/components/esp_timer/src/esp_timer.c:437 (discriminator 1)

0x40087d0d: vPortTaskWrapper at <...>/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:162



Core  1 register dump:
PC      : 0x40082e4a  PS      : 0x00060034  A0      : 0x80082f85  A1      : 0x3ffb6910
0x40082e4a: timer_insert at <...>/esp-idf/components/esp_timer/src/esp_timer.c:287 (discriminator 2)

A2      : 0x3ffb59dc  A3      : 0x00000000  A4      : 0x80087ff6  A5      : 0x3ffb12a0
A6      : 0x00000003  A7      : 0x00060023  A8      : 0x3ffb59dc  A9      : 0x00000000
A10     : 0x00000000  A11     : 0x0000a7ab  A12     : 0x00000000  A13     : 0x0000a7ab
A14     : 0x02280bb6  A15     : 0xb33fffff  SAR     : 0x00000000  EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000


Backtrace: 0x40082e47:0x3ffb6910 0x40082f82:0x3ffb6930 0x400d50ab:0x3ffb6950 0x40087d0d:0x3ffb6970
0x40082e47: timer_insert at <...>/esp-idf/components/esp_timer/src/esp_timer.c:289

0x40082f82: esp_timer_start_once at <...>/esp-idf/components/esp_timer/src/esp_timer.c:210

0x400d50ab: toggle_shared_timer at <...>/esp_timer_race_condition/main/main.c:32 (discriminator 3)

0x40087d0d: vPortTaskWrapper at <...>/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:162

@igrr
Copy link
Member

igrr commented Apr 18, 2023

Thanks for addressing my comment and for the reproducer, @JensGutermuth!

One final request, could you please combine your two commits into one (git rebase -i, choose fixup) so that we can keep a clean commit history? (I can also do this on my side, but in that case the commit ID of your contribution will change and Github won't consider the PR as "merged" and will show it as "closed".)

Inadequate locking in the esp_timer component allowed corruption
of the s_timers linked list:

1. timer_armed(timer) returns false
2. another task arms the timer and adds it to s_timers
3. the list is locked
4. the timer is inserted into s_timers again

The last step results in a loop in the s_timers list, which causes
an infinite loop when iterated. This change always locks the
list before checking if the timer is already armed avoiding
the data race.
@JensGutermuth
Copy link
Contributor Author

One final request, could you please combine your two commits into one

Sure :)

@igrr igrr added the PR-Sync-Merge Pull request sync as merge commit label Apr 19, 2023
@igrr
Copy link
Member

igrr commented Apr 19, 2023

sha=3ba70490c94f306dc2573965879581b2f3210cc2

@AxelLin
Copy link
Contributor

AxelLin commented Apr 23, 2023

I can reproduce the panic with v4.3 branch using #11215 (comment) .
Please also backport the fix for v4.3.

@espressif-bot espressif-bot added the Status: In Progress Work is in progress label Apr 24, 2023
@espressif-bot espressif-bot added Status: Done Issue is done internally Resolution: NA Issue resolution is unavailable and removed Status: In Progress Work is in progress labels May 2, 2023
@chipweinberger
Copy link
Contributor

chipweinberger commented May 6, 2023

I am also hitting this in v4.4 all the time. In my case it caused a circular list of timers and corresponding infinite loop.

Please get this merged and backported.

My issue: #11338

@espressif-bot espressif-bot merged commit 500b5b2 into espressif:master May 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR-Sync-Merge Pull request sync as merge commit Resolution: NA Issue resolution is unavailable Status: Done Issue is done internally Status: Opened Issue is new
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants