Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vTaskDelayUntil() method assertion failure (IDFGH-2045) #4230

Closed
stuarthatchbaby opened this issue Oct 22, 2019 · 6 comments
Closed

vTaskDelayUntil() method assertion failure (IDFGH-2045) #4230

stuarthatchbaby opened this issue Oct 22, 2019 · 6 comments
Assignees

Comments

@stuarthatchbaby
Copy link

stuarthatchbaby commented Oct 22, 2019

This issue is referenced in #391 which was closed against its original report, but there is a secondary issue in the comments which appears to still be an problem in 3.3.

Occasionally (every few-hundred user-hours where we call every second or so), vTaskDelayUntil() can abort with an assertion failure, as detailed below.

/home/stuart/esp/esp-idf-v3.3-rc/components/freertos/tasks.c:1340 (vTaskDelayUntil)- assert failed!
abort() was called at PC 0x400970f1 on core 0
ELF file SHA256: ed66a63567798d37e4441146b46e2ee6db61b0e03fc5cd18f81c58d21e832e85
Backtrace: 0x4008eef7:0x3ffd4d30 0x4008f20d:0x3ffd4d50 0x400970f1:0x3ffd4d70 0x400f48f1:0x3ffd4d90 0x400f4914:0x3ffd4dd0
Rebooting...

This causes a reboot, or sometimes a crash if the SW_CPU_RESET gets hung-up.

@github-actions github-actions bot changed the title vTaskDelayUntil() method assertion failure vTaskDelayUntil() method assertion failure (IDFGH-2045) Oct 22, 2019
@Alvin1Zhang
Copy link
Collaborator

@stuarthatchbaby Thanks for reporting the issue. Would you please help provide more details as suggested in the issue template? Information like elf, sdk configuration, backtrace, log outputs, commit ID, hardware and etc. would help us debug further. Thanks.

@Dazza0
Copy link
Contributor

Dazza0 commented Oct 23, 2019

@stuarthatchbaby
Could you explain a little more about the task that calls vTaskDelayUntil() (e.g. what the priority of the task is, and whether it is pinned or not). Details about the task that calls vTaskSuspendAll() would also be helpful (might be called indirectly through SPI flash write functions).

configASSERT( uxSchedulerSuspended[ xPortGetCoreID() ] == 0 );

Assuming that vTaskDelayUntil() was not called between calls to vTaskSuspendAll() and xTaskResumeAll(), I can feasibly see one other scenario (outlined below) where this assert can fail. This bug can occur due to configASSERT( uxSchedulerSuspended[ xPortGetCoreID() ] == 0 ); not being placed in a critical section, thus an interrupt or context switch can occur between getting the core ID, and assessing the value of uxSchedulerSuspended

This assert could trigger under the following scenario

  1. An unpinned TaskA() is running on core 0 and calls vTaskDelayUntil()
  2. Within vTaskDelayUntil(), xPortGetCoreID() will be called which should return 0 in this case
  3. A context switch or interrupt occurs immediately after xPortGetCoreID() is called
  4. A higher priority TaskB() now runs on core 0, and it calls vTaskSuspendAll()
  5. Because core 0 is running TaskB(), the unpinned TaskA() now switches to run on core 1
  6. Because xPortGetCoreID() has already been called before the context switch, TaskA() will check for configASSERT( uxSchedulerSuspended[0] == 0 ); which has been currently disabled by TaskB()

As a temporary fix, try putting the assertion inside the critical section of the function (as shown below) and see if that resolves the issue:

configASSERT( pxPreviousWakeTime );
configASSERT( ( xTimeIncrement > 0U ) );

taskENTER_CRITICAL(&xTaskQueueMutex);
configASSERT( uxSchedulerSuspended[ xPortGetCoreID() ] == 0 );
...

@stuarthatchbaby
Copy link
Author

Thanks!

As background here's the backtrace:

$ xtensa-esp32-elf-addr2line -piaf -e *.elf 0x4008eef7:0x3ffd4d30 0x4008f20d:0x3ffd4d50 0x400970f1:0x3ffd4d70 0x400f48f1:0x3ffd4d90 0x400f4914:0x3ffd4dd0
0x4008eef7: invoke_abort at /home/stuart/esp/esp-idf-v3.3-rc/components/esp32/panic.c:695
0x4008f20d: abort at /home/stuart/esp/esp-idf-v3.3-rc/components/esp32/panic.c:695
0x400970f1: vTaskDelayUntil at /home/stuart/esp/esp-idf-v3.3-rc/components/freertos/tasks.c:4513 (discriminator 2)
0x400f48f1: Mic_UDP_fetch at /home/stuart/esp/Our_ESP32/esp32_tasks/tasks/Sound_mic/Sound_mic.c:266
0x400f4914: Sound_mic_UDP at /home/stuart/esp/Our_ESP32/esp32_tasks/tasks/Sound_mic/Sound_mic.c:274

Sound_mic is reasonably low priority (5) and not pinned. It lifts 160 byte chunks from a ring buffer, processes and dispatches over UDP, then waits for the remainder of 20ms. The incidence is high enough to cause pain, but realistically occurs per many-millions of calls so it does sound like a weird context-switch interaction as @Dazza0 suggests.

We don't call vTaskSuspendAll(), or do any meaningful flash operations, but we do use Event Groups extensively and I see vTaskSuspendAll() hiding in there.

I'm happy to move the assert. Does it make sense to pin the task as well? We've only got a couple of calls to vTaskDelayUntil() with the remaining instances being much slower. I don't think it would hurt functionality to pin any of them.

@Alvin1Zhang
Copy link
Collaborator

@stuarthatchbaby Thanks for the updates, since this ticket was already closed and same with 4230, let us track the updates in 4230, thanks.

@Dazza0
Copy link
Contributor

Dazza0 commented Oct 24, 2019

@stuarthatchbaby
Pinning the tasks that call vTaskDelay() of vTaskDelayUntil() will technically prevent this problem. Disabling the configASSERT() is also an option.

But the onus should be on FreeRTOS functions to be thread safe. It seems like the thread safety of the configASSERT( uxSchedulerSuspended[ xPortGetCoreID()] == 0 ); was not considered when we modified FreeRTOS for SMP. The functions vTaskDelete(), vTaskSuspend(), xTaskResumeAll(), appear to also susceptible to this bug.

I'll push a commit to fix this issue on master, then backport it to v3.3

@stuarthatchbaby
Copy link
Author

@Dazza0 - As a quick work-around I've disabled the asserts in the functions above to produce a test build. It'll be several days before we get meaningful A/B metrics back but I'll update here once we have them.

espressif-bot pushed a commit that referenced this issue Dec 16, 2019
This commit fixes thread safety issues with configASSERT() calls
regarding the value of uxSchedulerSuspended. A false negative
occurs if a context switch to the opposite core occurs in between
the getting the core ID and the assesment.

Closes #4230
espressif-bot pushed a commit that referenced this issue Jan 3, 2020
This commit fixes thread safety issues with configASSERT() calls
regarding the value of uxSchedulerSuspended. A false negative
occurs if a context switch to the opposite core occurs in between
the getting the core ID and the assesment.

Closes #4230
jack0c pushed a commit that referenced this issue Jul 29, 2020
This commit fixes thread safety issues with configASSERT() calls
regarding the value of uxSchedulerSuspended. A false negative
occurs if a context switch to the opposite core occurs in between
the getting the core ID and the assesment.

Closes #4230
espressif-bot pushed a commit that referenced this issue Nov 25, 2021
This issue was earlier fixed in commit 79e74e5
but during migration to newer FreeRTOS release, it got introduced again.

This commit fixes thread safety issues with configASSERT() calls
regarding the value of uxSchedulerSuspended. A false negative
occurs if a context switch to the opposite core occurs in between
the getting the core ID and the assesment.

Relevant #4230
Closes #7726
Closes IDFGH-6041
espressif-bot pushed a commit that referenced this issue Dec 9, 2021
This issue was earlier fixed in commit 79e74e5
but during migration to newer FreeRTOS release, it got introduced again.

This commit fixes thread safety issues with configASSERT() calls
regarding the value of uxSchedulerSuspended. A false negative
occurs if a context switch to the opposite core occurs in between
the getting the core ID and the assesment.

Relevant #4230
Closes #7726
Closes IDFGH-6041
espressif-bot pushed a commit that referenced this issue Dec 24, 2021
This issue was earlier fixed in commit 79e74e5
but during migration to newer FreeRTOS release, it got introduced again.

This commit fixes thread safety issues with configASSERT() calls
regarding the value of uxSchedulerSuspended. A false negative
occurs if a context switch to the opposite core occurs in between
the getting the core ID and the assesment.

Relevant #4230
Closes #7726
Closes IDFGH-6041
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants