
esp_ota_begin() starving network task of cpu time with large partitions and SPI RAM (IDFGH-261) #2083

Closed
markwj opened this issue Jun 21, 2018 · 3 comments


markwj commented Jun 21, 2018

We use 16MB ESP32 modules with 4MB OTA partitions and external SPI RAM. Currently our firmware image is approximately 2.8MB in size. The flow of our code is to make an HTTP GET request for the firmware image, read the headers (in particular the download size), call esp_ota_begin() with that download size, then download chunk by chunk, calling esp_ota_write() for each chunk and a final esp_ota_end() when done. Our networking buffers are in INTERNAL RAM, not SPI RAM.
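For illustration, here is a minimal sketch of that flow (the http_content_length() and http_read() helpers are placeholders standing in for our actual HTTP client code, and the 4KB buffer size is arbitrary):

#include <stddef.h>
#include <stdint.h>
#include "esp_err.h"
#include "esp_ota_ops.h"
#include "esp_partition.h"

/* Placeholder HTTP helpers: assume the GET request is already open and the
   Content-Length header has been parsed. These are not ESP-IDF APIs. */
extern size_t http_content_length(void);
extern int http_read(void *buf, size_t len);

static esp_err_t ota_from_http(void)
{
    const esp_partition_t *part = esp_ota_get_next_update_partition(NULL);
    size_t image_size = http_content_length();   /* from the GET response headers */

    esp_ota_handle_t handle;
    /* Passing the real image size makes esp_ota_begin() erase the image-sized
       range up front; this is the call that blocks other tasks for many seconds. */
    esp_err_t err = esp_ota_begin(part, image_size, &handle);
    if (err != ESP_OK) return err;

    static uint8_t buf[4096];                     /* internal RAM, not SPI RAM */
    int n;
    while ((n = http_read(buf, sizeof(buf))) > 0) {
        err = esp_ota_write(handle, buf, n);
        if (err != ESP_OK) return err;
    }

    err = esp_ota_end(handle);
    if (err != ESP_OK) return err;
    return esp_ota_set_boot_partition(part);
}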

Once our firmware exceeded about 2MB in size and we began using SPI RAM more heavily in our application, we started to see random crashes during OTA firmware updates over wifi.

abort() was called at PC 0x401b8e84 on core 0
0x401b8e84: pm_on_beacon_rx at ??:?

Backtrace: 0x40091e6b:0x3ffcc4a0 0x40091fc3:0x3ffcc4c0 0x401b8e84:0x3ffcc4e0 0x401b94ef:0x3ffcc520 0x401b9bd1:0x3ffcc550 0x40089e62:0x3ffcc5a0

0x40091e6b: invoke_abort at /Users/mark/esp/esp-idf/components/esp32/panic.c:669
0x40091fc3: abort at /Users/mark/esp/esp-idf/components/esp32/panic.c:669
0x401b8e84: pm_on_beacon_rx at ??:?
0x401b94ef: ppRxProtoProc at ??:?
0x401b9bd1: ppRxPkt at ??:?
0x40089e62: ppTask at ??:?

We narrowed the issue down to using networking functions (reading from the TCP/IP socket) after calling esp_ota_begin() with large image sizes (over approximately 2MB). The Espressif code performs a single esp_partition_erase_range(), which disables the SPI RAM cache and blocks any task that tries to access it. If the system networking task is blocked for too long, it seems to get confused handling wifi beacons, and then panics when it finally gets some CPU time.

If we change esp_ota_begin() to erase the partition in 256KB chunks in a loop (calling esp_partition_erase_range() multiple times), with a one-tick vTaskDelay() between chunks, the problem goes away and OTA works again. The vTaskDelay() is required; without it the panic in pm_on_beacon_rx still happens.
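The shape of that change is roughly the following (an illustrative sketch rather than the literal patch; erase_in_chunks() is a made-up helper name, and the 256KB chunk size is just what we picked):

#include <stddef.h>
#include "esp_err.h"
#include "esp_partition.h"
#include "esp_spi_flash.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

#define ERASE_CHUNK_SIZE (256 * 1024)

static esp_err_t erase_in_chunks(const esp_partition_t *part, size_t size)
{
    /* esp_partition_erase_range() requires sector-aligned sizes (4KB sectors). */
    size_t aligned = (size + SPI_FLASH_SEC_SIZE - 1) & ~(size_t)(SPI_FLASH_SEC_SIZE - 1);

    for (size_t off = 0; off < aligned; off += ERASE_CHUNK_SIZE) {
        size_t len = aligned - off;
        if (len > ERASE_CHUNK_SIZE) {
            len = ERASE_CHUNK_SIZE;
        }
        esp_err_t err = esp_partition_erase_range(part, off, len);
        if (err != ESP_OK) {
            return err;
        }
        /* Without this one-tick delay the pm_on_beacon_rx panic still occurs. */
        vTaskDelay(1);
    }
    return ESP_OK;
}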

I am not sure how to address this issue. The core spi_flash_erase_range(), which this all depends on, already erases in a loop, sector/block by sector/block, and it calls spi_flash_guard_start() and spi_flash_guard_end() correctly for each sector/block. It seems that other tasks are not getting any (or enough) CPU time between the spi_flash_guard_start()/spi_flash_guard_end() calls in that erase loop.

Adding a delay there seems kludgy. Perhaps a FreeRTOS call to let other blocked tasks run? I tried adding a taskYIELD() after spi_flash_guard_end() in spi_flash_erase_range(), but that didn't solve the problem (presumably because the task that called esp_ota_begin() has higher priority than the networking task). A vTaskDelay(1) in the same place does solve the problem, but seems horribly kludgy.
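To make the placement concrete, the per-sector loop looks roughly like this (paraphrased, not the literal spi_flash_erase_range() source; low_level_erase_sector() stands in for the real erase operation):

#include <stddef.h>
#include "esp_err.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

/* Stand-ins for the internal flash guard hooks and the low-level erase. */
extern void spi_flash_guard_start(void);
extern void spi_flash_guard_end(void);
extern esp_err_t low_level_erase_sector(size_t sector);

static esp_err_t erase_range_sketch(size_t first_sector, size_t count)
{
    for (size_t s = first_sector; s < first_sector + count; ++s) {
        spi_flash_guard_start();              /* flash / SPI RAM cache disabled */
        esp_err_t err = low_level_erase_sector(s);
        spi_flash_guard_end();                /* cache re-enabled */
        if (err != ESP_OK) {
            return err;
        }
        /* Workaround: a full one-tick delay; taskYIELD() here was not enough,
           presumably because the erasing task outranks the wifi task. */
        vTaskDelay(1);
    }
    return ESP_OK;
}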

Overall, I'm just very uncomfortable with how invasive esp_ota_begin() is. With a 2.8MB image size, it blocks all other tasks that touch SPI RAM for about 17 seconds. That is not good. We should be able to OTA flash without starving other tasks in the system so badly.

There is a separate bug in pm_on_beacon_rx that is triggered by this, but fixing that would address a symptom rather than the root cause.

@Patrik-Berglund

We are having the same problem; good that you brought this up.

Although we are using a smaller flash (4MB divided into two OTA partitions), we are hit quite heavily by this, since our application needs to continue working during the OTA download.

Our image is at the moment about 1MB.

@FayeY FayeY changed the title esp_ota_begin() starving network task of cpu time with large partitions and SPI RAM [TW#23636] esp_ota_begin() starving network task of cpu time with large partitions and SPI RAM Jun 27, 2018

markwj commented Jul 4, 2018

The workaround of a vTaskDelay(1) after spi_flash_guard_end() in spi_flash_erase_range() seems to solve the issue, but is horrendously kludgy. Any better solution?

@Patrik-Berglund

@markwj I will test that solution by the end of next week; I'm not in the office right now. But modifying the SDK is a last resort...

Found this old issue: #578

@projectgus projectgus changed the title [TW#23636] esp_ota_begin() starving network task of cpu time with large partitions and SPI RAM esp_ota_begin() starving network task of cpu time with large partitions and SPI RAM (IDFGH-261) Mar 12, 2019
igrr pushed a commit that referenced this issue Apr 26, 2019
Added Kconfig options to enable yield operation during flash erase

Closes: #2083
Closes: IDFGH-261
igrr pushed a commit that referenced this issue Apr 28, 2019
Added Kconfig options to enable yield operation during flash erase.
By default disable.

Closes: #2083
Closes: IDFGH-261
igrr pushed a commit that referenced this issue May 6, 2019
Added Kconfig options to enable yield operation during flash erase.
By default disable.

Closes: #2083
Closes: IDFGH-261
Hallot pushed a commit to Hallot/esp-idf that referenced this issue Apr 23, 2020
Added Kconfig options to enable yield operation during flash erase

Closes: espressif#2083
Closes: IDFGH-261
espressif-bot pushed a commit that referenced this issue May 23, 2020
Added Kconfig options to enable yield operation during flash erase

Closes: #2083
Closes: #4916
Closes: IDFGH-261
espressif-bot pushed a commit that referenced this issue Jun 15, 2020
Added Kconfig options to enable yield operation during flash erase

Closes: #2083
Closes: #4916
Closes: IDFGH-261
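
For readers landing here later: the commits referenced above add Kconfig switches that make long flash erases yield periodically. A hedged sketch of the relevant sdkconfig entries (option names as they appear in recent ESP-IDF; names and defaults may differ between versions, so verify against your tree's components/spi_flash/Kconfig):

# Let the flash driver yield to other tasks during long erase operations.
CONFIG_SPI_FLASH_YIELD_DURING_ERASE=y
# How often to yield and for how long (values shown are illustrative).
CONFIG_SPI_FLASH_ERASE_YIELD_DURATION_MS=20
CONFIG_SPI_FLASH_ERASE_YIELD_TICKS=1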