[v3.3.4] esp_ota_begin causes Wi-Fi stack starvation on 4 MiB partitions (IDFGH-4932) #6723

Closed
chrismerck opened this issue Mar 15, 2021 · 6 comments
Labels
Awaiting Response · Resolution: Done · Status: Done


chrismerck commented Mar 15, 2021

We have not tested on the master branch, as that would require major work to upgrade from IDF v3.3.4. I'm posting this in the hope that it helps someone else hitting the same issue. Hopefully we will find the resources to upgrade our application to v4.x, since this code appears to have been heavily reworked in the latest master.

Environment

  • Development Kit: none
  • Module or chip used: ESP32-WROVER-B & ESP32-WROVER-E (16MB)
  • IDF version v3.3.4
  • Build System: Make
  • Compiler version: 1.22.0-80-g6c4433a
  • Operating System: macOS
  • Using an IDE?: No
  • Power Supply: USB or PoE

Problem Description

The IDF function esp_ota_begin blocks for the duration of the partition erasure, and during that period it causes serious performance problems with other subsystems.

The problem is caused by spi_flash_erase_range, which blocks all other execution while erasing each flash block (via spi_flash_disable_interrupts_caches_and_other_cpu). I speculate that, although the flash guard is released and re-acquired around each block erase, other waiting tasks are not given a chance to run. This is a fairly aggressive manoeuvre that may be acceptable for small partitions, but it causes problems when the partition takes more than, say, 3 seconds to erase. In that case, application-level deadlines may be missed.

Workaround: SPI_FLASH_YIELD

To address this issue, the config option SPI_FLASH_YIELD_DURING_ERASE was introduced (#5171), which effectively adds a vTaskDelay(1) after each block erase. This seems to have helped some users. However, in our case the 1-tick delay was apparently not sufficient to prevent starvation of Wi-Fi and modest application tasks.

One solution is to increase CONFIG_SPI_FLASH_ERASE_YIELD_TICKS so that the delay after each block erasure is longer. However, there's still a shortcoming in this "ERASE_YIELD" trick: the CONFIG_SPI_FLASH_ERASE_YIELD_DURATION_MS variable is not used in a reasonable way. It checks whether a single block erase took longer than a threshold, which defaults to 10 ms; at that setting the yield always executes, but if the threshold is increased it will likely never run, and starvation will occur. Simply increasing the ticks may incur significant delays (64 blocks * number of ticks). It would make much more sense to check whether a certain total erase time has elapsed before calling vTaskDelay. This is exactly what we've done with our internal patchset on top of ESP-IDF.

We found that a 500 ms delay every 2000 ms worked adequately. It could likely also be more fine-grained, perhaps a 200 ms delay every 1000 ms.
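
The core of our patch looks roughly like the sketch below (hedged: this is not the literal patch; erase_block() and erase_blocks_with_yield() are placeholder names for illustration, standing in for the per-block erase that spi_flash_erase_range performs with caches and interrupts disabled, and the 2000/500 ms constants are the values mentioned above):

```c
/* Sketch of the throttling idea: after roughly 2000 ms of cumulative erase
 * work, sleep for 500 ms so Wi-Fi and application tasks can catch up. */
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_timer.h"

#define ERASE_WORK_BUDGET_MS 2000   /* erase time allowed before yielding */
#define ERASE_YIELD_MS        500   /* how long to sleep when we do yield */

extern void erase_block(size_t block);  /* placeholder for the real per-block erase */

static void erase_blocks_with_yield(size_t first_block, size_t num_blocks)
{
    int64_t busy_since = esp_timer_get_time();

    for (size_t i = 0; i < num_blocks; i++) {
        erase_block(first_block + i);

        int64_t busy_ms = (esp_timer_get_time() - busy_since) / 1000;
        if (busy_ms >= ERASE_WORK_BUDGET_MS) {
            vTaskDelay(pdMS_TO_TICKS(ERASE_YIELD_MS));
            busy_since = esp_timer_get_time();  /* restart the work budget */
        }
    }
}
```

The point is that the decision to yield is based on cumulative erase time rather than on the duration of any single block erase, so it neither fires after every block nor never fires at all.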

However, it appears that this logic has been removed in master, with spi_flash_erase_range being deprecated. I suppose we should be upgrading our application to IDF v4.x.

Documentation

I believe that the esp_ota_begin documentation should make it clear that this function blocks for the duration of the partition erasure, and that this may take on the order of 4 sec per megabyte erased.
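
For anyone who wants to confirm the blocking time on their own hardware, here is a minimal sketch (hedged: measure_ota_begin() and the log tag are names I made up; it assumes a standard two-OTA-slot partition table):

```c
/* Measure how long esp_ota_begin() blocks the calling task. */
#include "esp_ota_ops.h"
#include "esp_timer.h"
#include "esp_log.h"

static const char *TAG = "ota_timing";

void measure_ota_begin(void)
{
    const esp_partition_t *update = esp_ota_get_next_update_partition(NULL);
    esp_ota_handle_t handle;

    int64_t t0 = esp_timer_get_time();
    /* OTA_SIZE_UNKNOWN makes esp_ota_begin erase the whole partition up front. */
    esp_err_t err = esp_ota_begin(update, OTA_SIZE_UNKNOWN, &handle);
    int64_t t1 = esp_timer_get_time();

    ESP_LOGI(TAG, "esp_ota_begin(%s, %u bytes): %s, blocked %lld ms",
             update->label, (unsigned) update->size, esp_err_to_name(err),
             (long long) ((t1 - t0) / 1000));

    if (err == ESP_OK) {
        esp_ota_end(handle);  /* release the handle; nothing was written */
    }
}
```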

Similar Fixed Issue not Backported?

There was a claim (#5591 (comment)) in a closely related issue that some Wi-Fi interference from esp_ota_begin was fixed in commit 4761c40. However, it would appear that this fix has not been backported to the v3.x IDF, though I could be mistaken.

Expected Behavior

Calling esp_ota_begin in one task does not cause:

  • disconnection from Wi-Fi
  • disconnection from AWS IoT (MQTT)
  • other transport-level network failure
  • starvation of user tasks
  • triggering task watchdog

However, it is acceptable that it may:

  • cause some performance degradation short of starvation*
  • take up to 15 seconds to complete for a 4 MB partition (in a real-world, loaded application)
  • cause some (recoverable) packet loss on Wi-Fi

(*) Although it is unclear to me why erasing an unused partition requires blocking other task execution.

Actual Behavior

esp_ota_begin against a 4 MB partition, with a modest number of background tasks running (normal priority), causes:

  • disconnection from AWS IoT
  • errors from Wi-Fi stack
  • starvation of user tasks, causing application API deadlines to be missed (say, 3 second reply times on HTTP)
  • task watchdog triggered intermittently (even after setting the timeout to 10 seconds!)

Steps to reproduce

  1. Use 4 MB OTA partitions on a 16 MB module.
  2. Connect to AWS IoT (MQTT) using library included in IDF.
  3. Run a normal-priority "main" task that should execute every 100ms and run for 10ms (10% CPU load).
  4. Call esp_ota_begin.

Monitor:

  • duration of that call
  • execution timing of the main task
  • errors from the Wi-Fi driver and AWS IoT / mbedTLS

Code to reproduce this issue

The steps above summarize what we are doing in our commercial application, which I believe could be used to reproduce the issue in an example. However, at this time I have not prepared a separate public example project.
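
For reference, a minimal sketch of the reproduction (hedged: task names, stack sizes, and the lateness check are illustrative, and the Wi-Fi / AWS IoT bring-up from steps 1–2 is omitted):

```c
/* Sketch of the reproduction: a 100 ms-period "main" task that logs how late
 * it wakes up, plus a task that triggers the blocking OTA erase. */
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_ota_ops.h"
#include "esp_timer.h"
#include "esp_log.h"

static const char *TAG = "ota_repro";

static void main_task(void *arg)
{
    TickType_t last_wake = xTaskGetTickCount();
    for (;;) {
        vTaskDelayUntil(&last_wake, pdMS_TO_TICKS(100));

        TickType_t late = xTaskGetTickCount() - last_wake;
        if (late > pdMS_TO_TICKS(20)) {
            ESP_LOGW(TAG, "main task woke %u ms late",
                     (unsigned) (late * portTICK_PERIOD_MS));
        }

        /* ~10 ms of busy work, i.e. 10% CPU at a 100 ms period */
        int64_t t0 = esp_timer_get_time();
        while (esp_timer_get_time() - t0 < 10 * 1000) { }
    }
}

static void ota_task(void *arg)
{
    vTaskDelay(pdMS_TO_TICKS(10000));  /* let the system settle first */

    const esp_partition_t *update = esp_ota_get_next_update_partition(NULL);
    esp_ota_handle_t handle;
    esp_err_t err = esp_ota_begin(update, OTA_SIZE_UNKNOWN, &handle);
    ESP_LOGI(TAG, "esp_ota_begin returned %s", esp_err_to_name(err));
    vTaskDelete(NULL);
}

void app_main(void)
{
    /* Steps 1-2: connect to Wi-Fi and AWS IoT (MQTT) here. */
    xTaskCreate(main_task, "main_task", 4096, NULL, 5, NULL);
    xTaskCreate(ota_task,  "ota_task",  8192, NULL, 5, NULL);
}
```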

Debug Logs

E () wifi:esf_buf
W () wifi:Next TBTT incorrect!

Other items

Relevant part of our sdkconfig:

CONFIG_SPI_FLASH_YIELD_DURING_ERASE=y
CONFIG_SPI_FLASH_ERASE_YIELD_DURATION_MS=20
CONFIG_SPI_FLASH_ERASE_YIELD_TICKS=1

Relevant part of the partition table, showing the 4 MiB OTA partitions:

ota_0,    0,    ota_0,   0x300000,0x400000
ota_1,    0,    ota_1,   0x700000,0x400000
@Alvin1Zhang (Collaborator)

Thanks for the very detailed report, and sorry for the inconvenience; we will look into it soon.

@shubhamkulkarni97 (Contributor)

Hi @chrismerck,

Thanks for the detailed instructions to reproduce this issue. I'm able to reproduce the task watchdog getting triggered on v3.3.4 (I could not reproduce the TBTT part).

  • In most cases, the firmware size with Wi-Fi, BLE and OTA enabled is around 1.2 MB, so we keep the default partition size at 1.5 MB. With the default size it takes less time to erase the partition, and such issues are not observed.
  • Is there any specific reason for a firmware size of around 4 MB? Have you embedded files into the firmware, leading to a large binary? In that case, you can store the files in a different partition using SPIFFS, which will reduce the binary size.
  • We had received a PR to support incremental flash erase: Modify the OTA implementation to incrementally erase flash (IDFGH-3250) #5246. This feature is merged in the master branch and not backported. I have created a patch on top of v3.3.4 that might help in your use case; please find it attached below (a sketch of how the feature is used on master follows this list). I'll discuss backporting this feature internally, but cannot promise anything as of now.
    0001-esp_https_ota-app_update-Support-incremental-flash-e.patch.zip
  • ESP32-C3 has a SPI flash auto-suspend feature in which the hardware handles the flash cache. With future chips, such issues should not be observed.
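
For reference, usage of the incremental-erase path on master looks roughly like this (a hedged sketch: do_incremental_ota() and read_next_chunk() are hypothetical names, and it assumes the OTA_WITH_SEQUENTIAL_WRITES value that esp_ota_begin accepts after the feature above was merged):

```c
/* Passing OTA_WITH_SEQUENTIAL_WRITES instead of an image size makes
 * esp_ota_begin() return quickly; flash is then erased in chunks as data
 * is written, instead of erasing the whole partition up front. */
#include "esp_ota_ops.h"

extern int read_next_chunk(void *buf, size_t len);  /* hypothetical data source */

esp_err_t do_incremental_ota(void)
{
    const esp_partition_t *update = esp_ota_get_next_update_partition(NULL);
    esp_ota_handle_t handle;

    esp_err_t err = esp_ota_begin(update, OTA_WITH_SEQUENTIAL_WRITES, &handle);
    if (err != ESP_OK) {
        return err;
    }

    char buf[4096];
    int len;
    while ((len = read_next_chunk(buf, sizeof(buf))) > 0) {
        /* Each write erases only the sectors it is about to cover. */
        err = esp_ota_write(handle, buf, len);
        if (err != ESP_OK) {
            esp_ota_abort(handle);  /* v4.x API to release the handle on error */
            return err;
        }
    }

    err = esp_ota_end(handle);
    if (err == ESP_OK) {
        err = esp_ota_set_boot_partition(update);
    }
    return err;
}
```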

Thanks,
Shubham

@chrismerck (Author)

Hi @shubhamkulkarni97, thank you for the incredibly prompt turnaround.

In most of the cases, firmware size with WiFi, BLE and OTA enabled is around 1.2 MB.

That's with just the IDF and some demo code! We then build a really great product on top of it. That takes a lot of flash space :)

We have a local HTTP API to support local control by our app and other smart home systems, plus other protocols (UDP and telnet) for performance and legacy compatibility. Then we bring in Lua and several radio drivers. It adds up quickly.

This is something that seems to be a bit of a blind spot for Espressif, to be totally honest. It's virtually impossible to use the 4 MB modules for any serious application that's more than a thin-client, yet it is quite difficult to obtain dev kits with 8MB or 16MB, and the new mini modules are only available in 4MB.

you can store files in different partition with the use of SPIFFS

We already do store our data in a separate 1MB database partition. We considered SPIFFS along with other embedded database libraries but unfortunately had to roll our own to fit our requirements. (Especially handling database compaction after record deletions.)

I have created a patch on top v3.3.4 that might help in your use case.

Wow, that's amazing, thank you. We may give this a try.

With future chips (ESP32-C3, etc.) such issues should not be observed.

That's excellent news. The flash cache has been tricky for us. The C3 may have the nice side effect of boosting performance of our Bond Bridge product, since we use Core 1 bare-metal for time-sensitive radio operations, where our performance is currently limited by the flash cache (we need to disable all interrupts during critical radio operations, which can last seconds in some circumstances).

The dual-core C3 + 16MB, and a mini module C3 + 8MB (especially this one) are what we are waiting for. Cannot do much with 4MB sadly.

@shubhamkulkarni97 (Contributor)

@chrismerck, did you get a chance to try out the patch provided in the above comment? Does it fix the issue?

@mahavirj (Member)

@chrismerck

This is something that seems to be a bit of a blind spot for Espressif, to be totally honest. It's virtually impossible to use the 4 MB modules for any serious application that's more than a thin-client, yet it is quite difficult to obtain dev kits with 8MB or 16MB, and the new mini modules are only available in 4MB.

Thanks for your feedback. In case you have not seen it, I'd request that you go through our product ordering guide available here. I do see several modules offering 8 MB/16 MB flash parts. For more information on availability, please get in touch with our sales team. However, we do see a lot of traction for 4 MB modules; in fact, we are also getting requests for 2 MB flash modules. There are just too many business verticals and use cases in this domain :-)

The dual-core C3 + 16MB, and a mini module C3 + 8MB (especially this one) are what we are waiting for. Cannot do much with 4MB sadly.

FYI, the C3 is a single-core RISC-V chip. However, the upcoming ESP32-S3 is a dual-core one (more information).

@mahavirj (Member)

@chrismerck

Hopefully the patch provided earlier helped to fix your problem; closing this issue. Please feel free to re-open in case you need further help.
