[v3.3.4] esp_ota_begin causes Wi-Fi stack starvation on 4 MiB partitions (IDFGH-4932) #6723

Closed
chrismerck opened this issue Mar 15, 2021 · 6 comments
Labels
Awaiting Response · Resolution: Done · Status: Done


chrismerck commented Mar 15, 2021

We have not tested on the master branch, as that would require major work to upgrade from IDF v3.3.4. I'm posting this in the hope that it helps someone else hitting the same issue. Hopefully we will find the resources to upgrade our application to v4.x, since this code appears to have been heavily reworked in the latest master.

Environment

  • Development Kit: none
  • Module or chip used: ESP32-WROVER-B & ESP32-WROVER-E (16MB)
  • IDF version v3.3.4
  • Build System: Make
  • Compiler version: 1.22.0-80-g6c4433a
  • Operating System: macOS
  • Using an IDE?: No
  • Power Supply: USB or PoE

Problem Description

The IDF function esp_ota_begin blocks for the duration of the partition erasure, and during that period it causes serious performance problems with other subsystems.

The problem is caused by spi_flash_erase_range, which blocks all other execution while erasing each flash block (via spi_flash_disable_interrupts_caches_and_other_cpu). I speculate that, although the flash guard is released and re-acquired around each block erase, other waiting tasks are not given a chance to run. This is a fairly aggressive manoeuvre that may be acceptable for small partitions, but it causes problems when the partition takes more than, say, 3 seconds to erase. In that case, application-level deadlines may be missed.

Workaround: SPI_FLASH_YIELD

To address this issue, the config option SPI_FLASH_YIELD_DURING_ERASE was introduced (#5171), which effectively adds a vTaskDelay(1) after each block erase. This seems to have helped some users. However, in our case the 1-tick delay was apparently not sufficient to prevent starvation of Wi-Fi and modest application tasks.

One solution is to increase CONFIG_SPI_FLASH_ERASE_YIELD_TICKS so that the delay after each block erasure is longer. However, there's still a shortcoming in this "ERASE_YIELD" trick: the CONFIG_SPI_FLASH_ERASE_YIELD_DURATION_MS variable is not used in a reasonable way. It checks whether a single block erase took longer than a threshold, which defaults to 10 ms; at that setting the yield always executes, but if the threshold is increased it will likely never run, and starvation will occur. Simply increasing the ticks may incur significant delays (64 blocks * number of ticks). It would make much more sense to check whether a certain total erase time has elapsed before calling vTaskDelay. This is exactly what we've done with our internal patchset on top of ESP-IDF.

We found that a 500 ms delay every 2000 ms worked adequately. It could likely also be more fine-grained, perhaps a 200 ms delay every 1000 ms.
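
The core of our patch looks roughly like the sketch below (hedged: this is not the literal patch; erase_block() and erase_blocks_with_yield() are placeholder names for illustration, standing in for the per-block erase that spi_flash_erase_range performs with caches and interrupts disabled, and the 2000/500 ms constants are the values mentioned above):

```c
/* Sketch of the throttling idea: after roughly 2000 ms of cumulative erase
 * work, sleep for 500 ms so Wi-Fi and application tasks can catch up. */
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_timer.h"

#define ERASE_WORK_BUDGET_MS 2000   /* erase time allowed before yielding */
#define ERASE_YIELD_MS        500   /* how long to sleep when we do yield */

extern void erase_block(size_t block);  /* placeholder for the real per-block erase */

static void erase_blocks_with_yield(size_t first_block, size_t num_blocks)
{
    int64_t busy_since = esp_timer_get_time();

    for (size_t i = 0; i < num_blocks; i++) {
        erase_block(first_block + i);

        int64_t busy_ms = (esp_timer_get_time() - busy_since) / 1000;
        if (busy_ms >= ERASE_WORK_BUDGET_MS) {
            vTaskDelay(pdMS_TO_TICKS(ERASE_YIELD_MS));
            busy_since = esp_timer_get_time();  /* restart the work budget */
        }
    }
}
```

The point is that the decision to yield is based on cumulative erase time rather than on the duration of any single block erase, so it neither fires after every block nor never fires at all.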

However, it appears that this logic has been removed in master, with spi_flash_erase_range being deprecated. I suppose we should be upgrading our application to IDF v4.x.

Documentation

I believe that the esp_ota_begin documentation should make it clear that this function blocks for the duration of the partition erasure, and that this may take on the order of 4 sec per megabyte erased.
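
For anyone who wants to confirm the blocking time on their own hardware, here is a minimal sketch (hedged: measure_ota_begin() and the log tag are names I made up; it assumes a standard two-OTA-slot partition table):

```c
/* Measure how long esp_ota_begin() blocks the calling task. */
#include "esp_ota_ops.h"
#include "esp_timer.h"
#include "esp_log.h"

static const char *TAG = "ota_timing";

void measure_ota_begin(void)
{
    const esp_partition_t *update = esp_ota_get_next_update_partition(NULL);
    esp_ota_handle_t handle;

    int64_t t0 = esp_timer_get_time();
    /* OTA_SIZE_UNKNOWN makes esp_ota_begin erase the whole partition up front. */
    esp_err_t err = esp_ota_begin(update, OTA_SIZE_UNKNOWN, &handle);
    int64_t t1 = esp_timer_get_time();

    ESP_LOGI(TAG, "esp_ota_begin(%s, %u bytes): %s, blocked %lld ms",
             update->label, (unsigned) update->size, esp_err_to_name(err),
             (long long) ((t1 - t0) / 1000));

    if (err == ESP_OK) {
        esp_ota_end(handle);  /* release the handle; nothing was written */
    }
}
```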

Similar Fixed Issue not Backported?

There was a claim (#5591 (comment)) in a closely related issue that some Wi-Fi interference from esp_ota_begin was fixed in commit 4761c40. However, it would appear that this fix has not been backported to the v3.x IDF, though I could be mistaken.

Expected Behavior

Calling esp_ota_begin in one task does not cause:

  • disconnection from Wi-Fi
  • disconnection from AWS IoT (MQTT)
  • other transport-level network failure
  • starvation of user tasks
  • triggering task watchdog

However, it is acceptable that it may:

  • cause some performance degradation short of starvation*
  • take up to 15 seconds to complete for a 4 MB partition (in a real-world, loaded application)
  • cause some (recoverable) packet loss on Wi-Fi

(*) Although it is unclear to me why erasing an unused partition requires blocking other task execution.

Actual Behavior

esp_ota_begin against a 4 MB partition, with a modest number of background tasks running (normal priority), causes:

  • disconnection from AWS IoT
  • errors from Wi-Fi stack
  • starvation of user tasks, causing application API deadlines to be missed (say, 3 second reply times on HTTP)
  • task watchdog triggered intermittently (even after setting the timeout to 10 seconds!)

Steps to reproduce

  1. Use 4 MB OTA partitions on a 16 MB module.
  2. Connect to AWS IoT (MQTT) using library included in IDF.
  3. Run a normal-priority "main" task that should execute every 100ms and run for 10ms (10% CPU load).
  4. Call esp_ota_begin.

Monitor:

  • duration of that call
  • execution timing of the main task
  • errors from the Wi-Fi driver and AWS IoT / mbedTLS

Code to reproduce this issue

The steps above summarize what we are doing in our commercial application, which I believe could be used to reproduce the issue in an example. However, at this time I have not prepared a separate public example project.
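
For reference, a minimal sketch of the reproduction (hedged: task names, stack sizes, and the lateness check are illustrative, and the Wi-Fi / AWS IoT bring-up from steps 1–2 is omitted):

```c
/* Sketch of the reproduction: a 100 ms-period "main" task that logs how late
 * it wakes up, plus a task that triggers the blocking OTA erase. */
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_ota_ops.h"
#include "esp_timer.h"
#include "esp_log.h"

static const char *TAG = "ota_repro";

static void main_task(void *arg)
{
    TickType_t last_wake = xTaskGetTickCount();
    for (;;) {
        vTaskDelayUntil(&last_wake, pdMS_TO_TICKS(100));

        TickType_t late = xTaskGetTickCount() - last_wake;
        if (late > pdMS_TO_TICKS(20)) {
            ESP_LOGW(TAG, "main task woke %u ms late",
                     (unsigned) (late * portTICK_PERIOD_MS));
        }

        /* ~10 ms of busy work, i.e. 10% CPU at a 100 ms period */
        int64_t t0 = esp_timer_get_time();
        while (esp_timer_get_time() - t0 < 10 * 1000) { }
    }
}

static void ota_task(void *arg)
{
    vTaskDelay(pdMS_TO_TICKS(10000));  /* let the system settle first */

    const esp_partition_t *update = esp_ota_get_next_update_partition(NULL);
    esp_ota_handle_t handle;
    esp_err_t err = esp_ota_begin(update, OTA_SIZE_UNKNOWN, &handle);
    ESP_LOGI(TAG, "esp_ota_begin returned %s", esp_err_to_name(err));
    vTaskDelete(NULL);
}

void app_main(void)
{
    /* Steps 1-2: connect to Wi-Fi and AWS IoT (MQTT) here. */
    xTaskCreate(main_task, "main_task", 4096, NULL, 5, NULL);
    xTaskCreate(ota_task,  "ota_task",  8192, NULL, 5, NULL);
}
```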

Debug Logs

E () wifi:esf_buf
W () wifi:Next TBTT incorrect!

Other items

Relevant part of our sdkconfig:

CONFIG_SPI_FLASH_YIELD_DURING_ERASE=y
CONFIG_SPI_FLASH_ERASE_YIELD_DURATION_MS=20
CONFIG_SPI_FLASH_ERASE_YIELD_TICKS=1

Relevant part of the partition table, showing the 4 MiB OTA partitions:

ota_0,    0,    ota_0,   0x300000,0x400000
ota_1,    0,    ota_1,   0x700000,0x400000
@Alvin1Zhang (Collaborator)

Thanks for the very detailed report, and sorry for the inconvenience; we will look into it soon.

@shubhamkulkarni97 (Contributor)

Hi @chrismerck,

Thanks for the detailed instructions to reproduce this issue. I'm able to reproduce the task watchdog getting triggered on v3.3.4 (I could not reproduce the TBTT part).

  • In most cases, the firmware size with Wi-Fi, BLE and OTA enabled is around 1.2 MB, so we keep the default partition size at 1.5 MB. With the default size it takes less time to erase the partition, and such issues are not observed.
  • Is there any specific reason for a firmware size of around 4 MB? Have you embedded files into the firmware, leading to a large binary? In that case, you can store the files in a different partition using SPIFFS, which will reduce the binary size.
  • We had received a PR to support incremental flash erase: Modify the OTA implementation to incrementally erase flash (IDFGH-3250) #5246. This feature is merged in the master branch and not backported. I have created a patch on top of v3.3.4 that might help in your use case; please find it attached below (a sketch of how the feature is used on master follows this list). I'll discuss backporting this feature internally, but cannot promise anything as of now.
    0001-esp_https_ota-app_update-Support-incremental-flash-e.patch.zip
  • ESP32-C3 has a SPI flash auto-suspend feature in which the hardware handles the flash cache. With future chips, such issues should not be observed.
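
For reference, usage of the incremental-erase path on master looks roughly like this (a hedged sketch: do_incremental_ota() and read_next_chunk() are hypothetical names, and it assumes the OTA_WITH_SEQUENTIAL_WRITES value that esp_ota_begin accepts after the feature above was merged):

```c
/* Passing OTA_WITH_SEQUENTIAL_WRITES instead of an image size makes
 * esp_ota_begin() return quickly; flash is then erased in chunks as data
 * is written, instead of erasing the whole partition up front. */
#include "esp_ota_ops.h"

extern int read_next_chunk(void *buf, size_t len);  /* hypothetical data source */

esp_err_t do_incremental_ota(void)
{
    const esp_partition_t *update = esp_ota_get_next_update_partition(NULL);
    esp_ota_handle_t handle;

    esp_err_t err = esp_ota_begin(update, OTA_WITH_SEQUENTIAL_WRITES, &handle);
    if (err != ESP_OK) {
        return err;
    }

    char buf[4096];
    int len;
    while ((len = read_next_chunk(buf, sizeof(buf))) > 0) {
        /* Each write erases only the sectors it is about to cover. */
        err = esp_ota_write(handle, buf, len);
        if (err != ESP_OK) {
            esp_ota_abort(handle);  /* v4.x API to release the handle on error */
            return err;
        }
    }

    err = esp_ota_end(handle);
    if (err == ESP_OK) {
        err = esp_ota_set_boot_partition(update);
    }
    return err;
}
```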

Thanks,
Shubham

@chrismerck (Author)

Hi @shubhamkulkarni97, thank you for the incredibly prompt turnaround.

In most of the cases, firmware size with WiFi, BLE and OTA enabled is around 1.2 MB.

That's with just the IDF and some demo code! We then build a really great product on top of it. That takes a lot of flash space :)

We have a local HTTP API to support local control by our app and other smart home systems, plus other protocols (UDP and telnet) for performance and legacy compatibility. Then we bring in Lua and several radio drivers. It adds up quickly.

This is something that seems to be a bit of a blind spot for Espressif, to be totally honest. It's virtually impossible to use the 4 MB modules for any serious application that's more than a thin-client, yet it is quite difficult to obtain dev kits with 8MB or 16MB, and the new mini modules are only available in 4MB.

you can store files in different partition with the use of SPIFFS

We already do store our data in a separate 1MB database partition. We considered SPIFFS along with other embedded database libraries but unfortunately had to roll our own to fit our requirements. (Especially handling database compaction after record deletions.)

I have created a patch on top v3.3.4 that might help in your use case.

Wow, that's amazing, thank you. We may give this a try.

With future chips (ESP32-C3, etc.) such issues should not be observed.

That's excellent news. The flash cache has been tricky for us. The C3 may have the nice side effect of boosting performance of our Bond Bridge product, since we use Core 1 bare-metal for time-sensitive radio operations, where our performance is currently limited by the flash cache (we need to disable all interrupts during critical radio operations, which can last seconds in some circumstances).

The dual-core C3 + 16MB, and a mini module C3 + 8MB (especially this one) are what we are waiting for. Cannot do much with 4MB sadly.

@shubhamkulkarni97 (Contributor)

@chrismerck, did you get a chance to try out the patch provided in the above comment? Does it fix the issue?

@mahavirj (Member)

@chrismerck

This is something that seems to be a bit of a blind spot for Espressif, to be totally honest. It's virtually impossible to use the 4 MB modules for any serious application that's more than a thin-client, yet it is quite difficult to obtain dev kits with 8MB or 16MB, and the new mini modules are only available in 4MB.

Thanks for your feedback. In case you have not seen it, I'd request that you go through our product ordering guide available here. I do see several modules offering 8 MB/16 MB flash parts. For more information on availability, please get in touch with our sales team. However, we do see a lot of traction for 4 MB modules; in fact, we are also getting requests for 2 MB flash modules. There are just too many business verticals and use cases in this domain :-)

The dual-core C3 + 16MB, and a mini module C3 + 8MB (especially this one) are what we are waiting for. Cannot do much with 4MB sadly.

FYI, the C3 is a single-core RISC-V chip. However, the upcoming ESP32-S3 is a dual-core one (more information).

@mahavirj (Member)

@chrismerck

Hopefully the patch provided earlier helped to fix your problem; closing this issue. Please feel free to re-open in case you need further help.
