WROVER-B Flash Corrupted in Field (IDFGH-2932) #4968

Closed
vonnieda opened this issue Mar 20, 2020 · 16 comments

Comments

@vonnieda
Contributor

Environment

  • Development Kit: none
  • Module or chip used: ESP32-WROVER-B
  • IDF version: v3.3.1
  • Build System: Make
  • Compiler version: xtensa-esp32-elf-gcc (crosstool-NG crosstool-ng-1.22.0-80-g6c4433a) 5.2.0
  • Operating System: macOS
  • Power Supply: Battery

Problem Description

Short version: A small percentage of boards in the field have had their flash corrupted so that they are unable to boot. It appears the first half or more of the flash has been overwritten with random data.

Long version: I started this as a forum thread at ( https://esp32.com/viewtopic.php?f=2&t=14719 ) and there are significant details there.

Customers report a boot loop. Upon receipt of the device the console shows a loop of:

rst:0x10 (RTCWDT_RTC_RESET),boot:0x3b (SPI_FAST_FLASH_BOOT)
flash read err, 1000
ets_main.c 371 
ets Jun  8 2016 00:22:57

I used esptool.py to pull the flash image off the board and found that the bootloader, partition table, ota_data, first application image and part of the second application image are overwritten with what appears to be random data. The only flash writing that my firmware does is via NVS and OTA APIs. I do not access the flash directly.

I have a hunch that when this has happened it is during rapid power cycles, perhaps due to low battery brownout. I have not been able to confirm that in person yet, but I see some evidence of it.

Reflashing the board via UART recovers it just fine and it operates normally.

A few findings that might be of significance:

  1. I noticed that even the first 0x1000 bytes of flash contain the random data. On good boards I've seen that this is instead 0xff. I don't flash anything to that area, but I don't know if the ESP uses it internally for anything. If it doesn't, it seems odd to me that it would contain data.
  2. 3.3v EFUSE is set during provisioning, and I verified it was still set on the board. We use MTDI for other purposes so this is required in our use case.
  3. The entropy (calculated with ent command line tool) of the bad image is twice that of a corresponding good image.
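For readers who want to repeat the comparison in finding 3 without the ent tool, here is a minimal sketch that computes Shannon entropy in bits per byte, which is the same kind of figure ent reports. The function name is made up for this illustration, and it is not the exact procedure used above.

```python
import math
from collections import Counter


def shannon_entropy_bits_per_byte(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # histogram of byte values
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A fully erased region (all 0xff) scores 0.0 and uniformly random data approaches 8.0, so a corrupted dump would be expected to score noticeably higher than a good image that still contains erased regions and structured code.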

This issue has affected a small but significant number of devices in the field. It results in a completely bricked device that requires return. I'd really appreciate some help or ideas on how this could be happening.

Thanks,
Jason

@github-actions github-actions bot changed the title WROVER-B Flash Corrupted in Field WROVER-B Flash Corrupted in Field (IDFGH-2932) Mar 20, 2020
@negativekelvin
Contributor

the first half or more of the flash has been overwritten with random data

One possibility is that glitching is causing it to jump into some ROM flash write loop. If so, you may notice that nothing has been erased and all bit changes are 1->0. If you have the dangerous flash writes blocked, then it probably isn't happening due to memory corruption in higher-level functions, although it may still be possible if an out-of-bounds flash address is wrapping around.

@vonnieda
Contributor Author

Thanks @negativekelvin - I do have dangerous writes blocked.

I do believe it's likely that the writes are all 1->0 as there are no regions of 0xff at all. Additionally, there are no human readable strings in the flash either, so I think it's unlikely that it's writing random chunks of RAM to flash - it really seems to be random.
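The two observations above (no regions of 0xff, no human readable strings) can be checked mechanically. The following is a rough sketch, with hypothetical function names and arbitrary minimum lengths, of how a dump could be scanned for both:

```python
import re


def erased_runs(data: bytes, min_len: int = 16):
    """Yield (offset, length) for each run of 0xFF bytes at least min_len long."""
    for m in re.finditer(rb"\xff{%d,}" % min_len, data):
        yield m.start(), m.end() - m.start()


def printable_strings(data: bytes, min_len: int = 8):
    """Yield (offset, text) for each run of printable ASCII at least min_len long."""
    for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        yield m.start(), m.group().decode("ascii")
```

Random data will contain short printable runs by chance, so the minimum string length is an assumption that may need tuning. An image where both generators come back empty is consistent with the "really seems to be random" description.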

Given this, do you have any suggestions for what I can do about it? Is there a way to write lock the flash until power is stable or anything like that?

@negativekelvin
Contributor

There are some things you could try, such as going into deep sleep indefinitely when you detect a low voltage, provided there is a way to hard reboot once someone charges the battery. But you might have to add a voltage supervisor or PMIC.

@projectgus
Contributor

Hi @vonnieda ,

Are you able to privately send me a full flash dump from a "bad" device and a flash dump from a "good" device (ideally one which has never had this issue)? You can email them to angus at espressif com.

You say your device goes into a brownout state due to low battery. Can you please give some more details about how the unit is powered, what other voltage regulator or protection circuitry (if any) is present, etc.? What VCC range(s) is the ESP32 exposed to while "enabled" (i.e. EN not pulled low)?

@vonnieda
Contributor Author

Hi @projectgus, thank you for responding. Unfortunately, sending my images will not be possible due to the proprietary nature of our product.

I do think I could probably send sections of the image, or perhaps images where I've erased every other 16 bytes, or something like that. And I could of course send small sections from specific offsets. Would that be enough of a sample to help?

As for the power situation: We are single cell lithium ion powered with a buck. The board runs at 3.3v. We have an on board IO controller that handles power management and sequencing. We bring up power before EN, but this is where I think that some devices might have had an issue. I believe the IO controller may have had a problem where the output of those signals was unknown or rapidly shifting. I haven't seen this in person - it's just a hunch.

Would you say that a condition where power was unknown and EN was changing, floating, or unknown could cause a situation like this?

Thanks,
Jason

@projectgus
Contributor

projectgus commented Mar 24, 2020

Hi @vonnieda,

Of course, no problem. How about a private email with a flash dump of the first 0x2000 bytes of a "good" and a "bad" board? This is mostly bootloader and the "random" data which appears not to be a valid app, so it shouldn't contain anything proprietary.

Would you say that a condition where power was unknown and EN was changing, floating, or unknown could cause a situation like this?

I can think of two possible scenarios here:

One is that the WROVER-B datasheet recommends using an external supply voltage supervisor if there is any possibility the voltage drops below 2.3V, and in that case the CHIP_PU/EN pin should be driven low. This is because of the flash chip voltage range. If the chip tries to start up in this condition (i.e. VCC < 2.3V but CHIP_PU not driven low), then it's possible the flash chip became corrupted.

The second possibility: the recommended WROVER-B VCC range is 3.0V - 3.6V. If the brownout detector is not enabled (or its threshold is set too low) and the voltage falls below 3.0V while writing to the flash, it may have caused data to be written to an invalid address, similar to what was discussed above. The same could happen if there is a lot of electrical noise for some reason. It may also have corrupted enough of the data signal that the values in the flash look random, although this seems less likely.

If the IO controller you mention is causing EN to fluctuate when a brownout occurs or under some other borderline conditions, then it could make either of these scenarios worse.

@vonnieda
Contributor Author

Thank you @projectgus - files have been sent.

@projectgus
Contributor

projectgus commented Mar 30, 2020

Hi @vonnieda ,

Got the dumps, thanks for that.

I agree that the values written here look quite random.

One thing I see is that starting from address 0x174c, there are bits set in the corrupt dump which are not set in the "good" dump. This is something that a rogue SPI flash write command cannot do - those sectors of the flash need to be erased first (to all 0xFF) with an erase command, and then the bits cleared by a separate write command.
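To illustrate the reasoning: a write to NOR flash can only clear bits (1->0), so any bit that is 1 in the corrupt dump but 0 in the good dump cannot come from a write command alone. Below is a minimal sketch of that kind of check. It is not the attached script, and the function names are invented for this example.

```python
def first_set_bit_offset(good: bytes, bad: bytes):
    """Return the first offset where bad has a 1 bit that is 0 in good, else None."""
    for offset, (g, b) in enumerate(zip(good, bad)):
        if b & ~g & 0xFF:  # bits present in bad but absent in good
            return offset
    return None


def count_bit_transitions(good: bytes, bad: bytes):
    """Return (cleared, set_): counts of 1->0 and 0->1 bit changes from good to bad."""
    cleared = set_ = 0
    for g, b in zip(good, bad):
        cleared += bin(g & ~b & 0xFF).count("1")
        set_ += bin(b & ~g & 0xFF).count("1")
    return cleared, set_
```

If the second count is zero, the damage is at least consistent with stray write commands. A non-zero count points to an erase or to physical corruption of the cells.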

So my best guess is that the integrated flash chip has been operated out of spec at some point, and that this caused physical corruption of the actual flash cells rather than anything the ESP32 software told it to do. The most likely cause is operating while VCC was outside the WROVER datasheet limits. Were there any extreme temperature events or other possible causes?

It does seem odd to me that the whole flash chip would fail in this way, rather than just a few corrupt bits flipping. And I can't explain why the setting of random bits starts at address 0x174c rather than 0x1000 (before 0x1000 the bytes are all 0xFF in the "good" dump so there's no possibility to set any new bits.)

I've attached the little Python script that I used to check for set bits in case you'd like to take a look. It also dumps some other simple bitwise operations applied between new and old - I manually had a quick look at those hex dumps to see if any other patterns stood out, but I don't see any.

check_cleared_bits.py.txt

Angus

@vonnieda
Contributor Author

Thank you @projectgus for looking into this, and for your help and advice. Based on this we will try to reproduce the problem by forcing the condition and work on a fix.

Thanks,
Jason

@Alvin1Zhang
Collaborator

@vonnieda Thanks for reporting. Could you please share any updates or further details on this issue? Thanks.

@vonnieda
Contributor Author

@Alvin1Zhang No further updates from my end. I believe the voltage issue is probably correct, but have not yet had time to try to duplicate it in the lab.

@projectgus
Contributor

Thanks for the update, @vonnieda . I'm going to close this as it seems like the root cause is most likely not specific to ESP-IDF, but if you find further information to the contrary then please let us know or open a new issue if it looks like ESP-IDF is doing something wrong here.

@shreyasbharath

@vonnieda can you please share any information about how you fixed this in your product? We are running into a similar (perhaps same?) issue where we are seeing random flash corruption on our products.

@vonnieda
Contributor Author

vonnieda commented Dec 2, 2021

@shreyasbharath I was never able to reproduce this in the lab on a regular basis, but as a protective measure we started adding a voltage supervisor that ensures ESP_ENABLE stays low until the voltage is good, and that seems to have resolved it.
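Purely as an illustration of what such a supervisor does, here is a small software model of the behaviour: EN is released only after VCC has been good for a while, and is pulled low again as soon as VCC sags. The 3.0 V release level is taken from the recommended VCC range mentioned earlier in this thread. The trip level, the hold count, and the function name are assumptions for the sketch and do not describe the actual part or circuit used.

```python
def supervise_enable(vcc_samples, release_v=3.0, trip_v=2.9, hold_samples=3):
    """Model a supervisor driving EN from a series of VCC readings.

    EN goes high only after VCC has stayed at or above release_v for
    hold_samples consecutive readings, and goes low as soon as VCC falls
    below trip_v. Returns a list of booleans (True = EN high).
    All thresholds are illustrative assumptions.
    """
    en = False
    good = 0
    states = []
    for v in vcc_samples:
        if v < trip_v:
            en = False  # supply too low: hold the ESP32 in reset
            good = 0
        elif v >= release_v:
            good += 1
            if good >= hold_samples:
                en = True  # supply has been stable long enough
        else:
            good = 0  # in the hysteresis band: keep EN as is, restart the count
        states.append(en)
    return states
```

The point of the hold time and hysteresis is that EN never toggles rapidly while the battery voltage hovers around the threshold, which is the condition suspected earlier in the thread.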

@shreyasbharath

@vonnieda thanks for the response!

It turns out that at least one of the issues we are experiencing is related to a change of flash chip (from GigaDevice to XMC) and an over-erase issue in the bootloader.

@winzkigermany
Contributor

We went through this with ~3000 chips in the field :-( It cost us a ton of money.
The fix and explanation are on our website (link below).
Good news: You can integrate it in your own next firmware update
Bad news: You have to do it before your modules fail.

https://en.hoerbert.com/technology/esp32-critical-fatal-problem-source-in-some-wrover-e-modules/
