
esp_ota_begin() starving network task of cpu time with large partitions and SPI RAM (IDFGH-261) #2083

Closed
markwj opened this issue Jun 21, 2018 · 3 comments


markwj commented Jun 21, 2018

We use 16MB ESP32 modules with 4MB OTA partitions and external SPI RAM. Currently our firmware image is approximately 2.8MB in size. The flow of our code is to make an HTTP GET request for the firmware image, read the headers (in particular the download size), call esp_ota_begin() with that download size, then download chunk by chunk, calling esp_ota_write() for each chunk and a final esp_ota_end() when done. Our networking buffers are in INTERNAL RAM, not SPI RAM.
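For illustration, here is a minimal sketch of that flow (the http_content_length() and http_read() helpers are placeholders standing in for our actual HTTP client code, and the 4KB buffer size is arbitrary):

#include <stddef.h>
#include <stdint.h>
#include "esp_err.h"
#include "esp_ota_ops.h"
#include "esp_partition.h"

/* Placeholder HTTP helpers: assume the GET request is already open and the
   Content-Length header has been parsed. These are not ESP-IDF APIs. */
extern size_t http_content_length(void);
extern int http_read(void *buf, size_t len);

static esp_err_t ota_from_http(void)
{
    const esp_partition_t *part = esp_ota_get_next_update_partition(NULL);
    size_t image_size = http_content_length();   /* from the GET response headers */

    esp_ota_handle_t handle;
    /* Passing the real image size makes esp_ota_begin() erase the image-sized
       range up front; this is the call that blocks other tasks for many seconds. */
    esp_err_t err = esp_ota_begin(part, image_size, &handle);
    if (err != ESP_OK) return err;

    static uint8_t buf[4096];                     /* internal RAM, not SPI RAM */
    int n;
    while ((n = http_read(buf, sizeof(buf))) > 0) {
        err = esp_ota_write(handle, buf, n);
        if (err != ESP_OK) return err;
    }

    err = esp_ota_end(handle);
    if (err != ESP_OK) return err;
    return esp_ota_set_boot_partition(part);
}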

Once our firmware exceeded about 2MB in size and we began using SPI RAM more heavily in our application, we started to see random crashes during OTA firmware updates over wifi.

abort() was called at PC 0x401b8e84 on core 0
0x401b8e84: pm_on_beacon_rx at ??:?

Backtrace: 0x40091e6b:0x3ffcc4a0 0x40091fc3:0x3ffcc4c0 0x401b8e84:0x3ffcc4e0 0x401b94ef:0x3ffcc520 0x401b9bd1:0x3ffcc550 0x40089e62:0x3ffcc5a0

0x40091e6b: invoke_abort at /Users/mark/esp/esp-idf/components/esp32/panic.c:669
0x40091fc3: abort at /Users/mark/esp/esp-idf/components/esp32/panic.c:669
0x401b8e84: pm_on_beacon_rx at ??:?
0x401b94ef: ppRxProtoProc at ??:?
0x401b9bd1: ppRxPkt at ??:?
0x40089e62: ppTask at ??:?

We narrowed the issue down to using networking functions (reading from the TCP/IP socket) after calling esp_ota_begin() with large image sizes (over approximately 2MB). The Espressif code performs a single esp_partition_erase_range(), which disables the SPI RAM cache and blocks any task that tries to access it. If the system networking task is blocked for too long, it seems to get confused handling wifi beacons, and then panics when it finally gets some CPU time.

If we change esp_ota_begin() to erase the partition in 256KB chunks in a loop (calling esp_partition_erase_range() multiple times), with a one-tick vTaskDelay() between chunks, the problem goes away and OTA works again. The vTaskDelay() is required; without it the panic in pm_on_beacon_rx still happens.
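The shape of that change is roughly the following (an illustrative sketch rather than the literal patch; erase_in_chunks() is a made-up helper name, and the 256KB chunk size is just what we picked):

#include <stddef.h>
#include "esp_err.h"
#include "esp_partition.h"
#include "esp_spi_flash.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

#define ERASE_CHUNK_SIZE (256 * 1024)

static esp_err_t erase_in_chunks(const esp_partition_t *part, size_t size)
{
    /* esp_partition_erase_range() requires sector-aligned sizes (4KB sectors). */
    size_t aligned = (size + SPI_FLASH_SEC_SIZE - 1) & ~(size_t)(SPI_FLASH_SEC_SIZE - 1);

    for (size_t off = 0; off < aligned; off += ERASE_CHUNK_SIZE) {
        size_t len = aligned - off;
        if (len > ERASE_CHUNK_SIZE) {
            len = ERASE_CHUNK_SIZE;
        }
        esp_err_t err = esp_partition_erase_range(part, off, len);
        if (err != ESP_OK) {
            return err;
        }
        /* Without this one-tick delay the pm_on_beacon_rx panic still occurs. */
        vTaskDelay(1);
    }
    return ESP_OK;
}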

I am not sure how to address this issue. The core spi_flash_erase_range(), which this all depends on, already erases in a loop, sector/block by sector/block, and it calls spi_flash_guard_start() and spi_flash_guard_end() correctly for each sector/block. It seems that other tasks are not getting any (or enough) CPU time between the spi_flash_guard_start()/spi_flash_guard_end() calls in that erase loop.

Adding a delay there seems kludgy. Perhaps a FreeRTOS call to let other blocked tasks run? I tried adding a taskYIELD() after spi_flash_guard_end() in spi_flash_erase_range(), but that didn't solve the problem (presumably because the task that called esp_ota_begin() has higher priority than the networking task). A vTaskDelay(1) in the same place does solve the problem, but seems horribly kludgy.
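To make the placement concrete, the per-sector loop looks roughly like this (paraphrased, not the literal spi_flash_erase_range() source; low_level_erase_sector() stands in for the real erase operation):

#include <stddef.h>
#include "esp_err.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

/* Stand-ins for the internal flash guard hooks and the low-level erase. */
extern void spi_flash_guard_start(void);
extern void spi_flash_guard_end(void);
extern esp_err_t low_level_erase_sector(size_t sector);

static esp_err_t erase_range_sketch(size_t first_sector, size_t count)
{
    for (size_t s = first_sector; s < first_sector + count; ++s) {
        spi_flash_guard_start();              /* flash / SPI RAM cache disabled */
        esp_err_t err = low_level_erase_sector(s);
        spi_flash_guard_end();                /* cache re-enabled */
        if (err != ESP_OK) {
            return err;
        }
        /* Workaround: a full one-tick delay; taskYIELD() here was not enough,
           presumably because the erasing task outranks the wifi task. */
        vTaskDelay(1);
    }
    return ESP_OK;
}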

Overall, I'm just very uncomfortable with how invasive esp_ota_begin() is. With a 2.8MB image size, it blocks all other tasks that touch SPI RAM for about 17 seconds. That is not good. We should be able to OTA flash without starving other tasks in the system so badly.

There is a separate bug in pm_on_beacon_rx that is triggered by this, but fixing that would address a symptom rather than the root cause.

@Patrik-Berglund

We are having the same problem; good that you brought this up.

Although we are using a smaller flash (4MB divided into two OTA partitions), we are hit quite heavily by this, since our application needs to continue working during the OTA download.

Our image is at the moment about 1MB.

@FayeY FayeY changed the title esp_ota_begin() starving network task of cpu time with large partitions and SPI RAM [TW#23636] esp_ota_begin() starving network task of cpu time with large partitions and SPI RAM Jun 27, 2018

markwj commented Jul 4, 2018

The workaround of a vTaskDelay(1) after spi_flash_guard_end() in spi_flash_erase_range() seems to solve the issue, but is horrendously kludgy. Any better solution?

@Patrik-Berglund

@markwj I will test that solution by the end of next week; I'm not in the office right now. But modifying the SDK is a last resort...

Found this old issue: #578

@projectgus projectgus changed the title [TW#23636] esp_ota_begin() starving network task of cpu time with large partitions and SPI RAM esp_ota_begin() starving network task of cpu time with large partitions and SPI RAM (IDFGH-261) Mar 12, 2019
igrr pushed a commit that referenced this issue Apr 26, 2019
Added Kconfig options to enable yield operation during flash erase

Closes: #2083
Closes: IDFGH-261
igrr pushed a commit that referenced this issue Apr 28, 2019
Added Kconfig options to enable yield operation during flash erase.
By default disable.

Closes: #2083
Closes: IDFGH-261
igrr pushed a commit that referenced this issue May 6, 2019
Added Kconfig options to enable yield operation during flash erase.
By default disable.

Closes: #2083
Closes: IDFGH-261
Hallot pushed a commit to Hallot/esp-idf that referenced this issue Apr 23, 2020
Added Kconfig options to enable yield operation during flash erase

Closes: espressif#2083
Closes: IDFGH-261
espressif-bot pushed a commit that referenced this issue May 23, 2020
Added Kconfig options to enable yield operation during flash erase

Closes: #2083
Closes: #4916
Closes: IDFGH-261
espressif-bot pushed a commit that referenced this issue Jun 15, 2020
Added Kconfig options to enable yield operation during flash erase

Closes: #2083
Closes: #4916
Closes: IDFGH-261
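
For readers landing here later: the commits referenced above add Kconfig switches that make long flash erases yield periodically. A hedged sketch of the relevant sdkconfig entries (option names as they appear in recent ESP-IDF; names and defaults may differ between versions, so verify against your tree's components/spi_flash/Kconfig):

# Let the flash driver yield to other tasks during long erase operations.
CONFIG_SPI_FLASH_YIELD_DURING_ERASE=y
# How often to yield and for how long (values shown are illustrative).
CONFIG_SPI_FLASH_ERASE_YIELD_DURATION_MS=20
CONFIG_SPI_FLASH_ERASE_YIELD_TICKS=1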