WROVER-B Flash Corrupted in Field (IDFGH-2932) #4968
Comments
One possibility is that glitching is causing it to jump into some ROM flash write loop; if so, you may notice that nothing has been erased and all bit changes are 1->0. If you have the dangerous flash writes blocked, then it probably isn't happening due to memory corruption in higher-level functions, although that may still be possible if it is wrapping around from an out-of-bounds flash address.
Thanks @negativekelvin - I do have dangerous writes blocked. I do believe it's likely that the writes are all 1->0, as there are no regions of 0xff at all. Additionally, there are no human-readable strings in the flash either, so I think it's unlikely that it's writing random chunks of RAM to flash - it really seems to be random. Given this, do you have any suggestions for what I can do about it? Is there a way to write-lock the flash until power is stable, or anything like that?
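The two observations above (no regions of 0xff, no human-readable strings) can be checked mechanically on a dump pulled from the board. The following is only a minimal sketch: the function names `ff_runs` and `ascii_strings` and the minimum-length thresholds are hypothetical choices for illustration, not part of esptool.py or ESP-IDF. Truly random data will still contain short printable runs by chance, so the string threshold should not be set too low.

```python
import re

def ff_runs(data: bytes, min_len: int = 16):
    """Return (offset, length) for each run of at least min_len 0xFF bytes."""
    pattern = re.compile(b"\xff{" + str(min_len).encode() + b",}")
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]

def ascii_strings(data: bytes, min_len: int = 8):
    """Return (offset, text) for each run of at least min_len printable ASCII bytes."""
    pattern = re.compile(b"[\x20-\x7e]{" + str(min_len).encode() + b",}")
    return [(m.start(), m.group()) for m in pattern.finditer(data)]

if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python scan_dump.py bad_dump.bin
    with open(sys.argv[1], "rb") as f:
        dump = f.read()
    print("erased (0xFF) runs:", len(ff_runs(dump)))
    print("printable strings:", len(ascii_strings(dump)))
```

A dump with neither erased runs nor long strings is consistent with the "really seems to be random" description, but it does not by itself identify the cause.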
There are some things you could try, such as going into deep sleep forever when you detect a low voltage, provided there is a way to hard reboot once someone charges the battery. But you might have to add a voltage supervisor or PMIC.
Hi @vonnieda, are you able to send me privately a full flash dump from a "bad" device and a flash dump from a "good" device (ideally one which has never had this issue)? You can email them to angus at espressif com. You say your device goes into a brownout state due to low battery. Can you please give some more details about how the unit is powered, what other voltage regulator or protection circuitry (if any) is present, etc.? What VCC range(s) is the ESP32 exposed to while "enabled" (i.e. EN not pulled low)?
Hi @projectgus, thank you for responding. Unfortunately, sending my images will not be possible due to the proprietary nature of our product. I do think I could probably send sections of the image, or perhaps images where I've erased every other 16 bytes, or something like that. And I could of course send small sections from specific offsets. Would that be enough of a sample to help? As for the power situation: we are single-cell lithium-ion powered with a buck. The board runs at 3.3 V. We have an on-board IO controller that handles power management and sequencing. We bring up power before EN, but this is where I think some devices might have had an issue. I believe the IO controller may have had a problem where the output of those signals was unknown or rapidly shifting. I haven't seen this in person - it's just a hunch. Would you say that a condition where power was unknown and EN was changing, floating, or unknown could cause a situation like this? Thanks,
Hi @vonnieda, Of course, no problem. How about a private email with a flash dump of the first 0x2000 bytes of a "good" and a "bad" board? This is mostly bootloader and the "random" data which appears not to be a valid app, so it shouldn't contain anything proprietary.
I can think of two possible scenarios here. First, the WROVER-B datasheet recommends using an external supply voltage supervisor if there is any possibility the voltage drops below 2.3V; in that case the CHIP_PU/EN pin should be driven low. This is because of the flash chip voltage range. If the chip tries to start up in this condition (i.e. VCC < 2.3V but CHIP_PU not driven low), then it's possible the flash chip became corrupted. Second, the recommended WROVER-B VCC range is 3.0V - 3.6V. If the brownout detector is not enabled (or its threshold is set too low) and the voltage falls below 3.0V while writing to the flash, it may have caused data to be written to an invalid address - similar to what was discussed above. The same applies if there is a lot of electrical noise for some reason. It may also have corrupted enough of the data signal that the values in the flash look random, although this does seem less likely. If the IO controller you mention is causing EN to fluctuate when a brownout occurs or under some other borderline conditions, then it could make either of these scenarios worse.
Thank you @projectgus - files have been sent.
Hi @vonnieda, got the dumps, thanks for that. I agree that the values written here look quite random. One thing I see is that starting from address 0x174c, there are bits set in the corrupt dump which are not set in the "good" dump. This is something that a rogue SPI flash write command cannot do - those sectors of the flash need to be erased first (to all 0xFF) with an erase command, and then the bits cleared by a separate write command. So my best guess is that the integrated flash chip has been operated out of spec at some point (probably voltage spec, i.e. operating while the VCC levels were outside the WROVER datasheet specs; or were there any extreme temperature events or other possible cases?) and this has caused physical corruption of the actual flash cells, rather than anything that the ESP32 software has told it to do. It does seem odd to me that the whole flash chip would fail in this way, rather than just a few corrupt bits flipping. And I can't explain why the setting of random bits starts at address 0x174c rather than 0x1000 (before 0x1000 the bytes are all 0xFF in the "good" dump, so there's no possibility to set any new bits). I've attached the little Python script that I used to check for set bits in case you'd like to take a look. It also dumps some other simple bitwise operations applied between new and old - I manually had a quick look at those hex dumps to see if any other patterns stood out, but I don't see any. Angus
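The attached script is not reproduced in this thread. As a rough illustration of the check described above, the sketch below compares a "good" and a "bad" dump byte by byte and reports bits that are 1 in the bad dump but 0 in the good one, which a write without a preceding erase cannot produce. The function names and file names here are hypothetical, and this is not the original script.

```python
def newly_set_bits(good: bytes, bad: bytes):
    """Yield (offset, mask) where bad has bits set that are clear in good."""
    for offset, (g, b) in enumerate(zip(good, bad)):
        mask = b & ~g & 0xFF  # bits that went 0 -> 1
        if mask:
            yield offset, mask

def only_clears_bits(good: bytes, bad: bytes) -> bool:
    """True if every difference is a 1 -> 0 change (possible without an erase)."""
    return all((b & ~g & 0xFF) == 0 for g, b in zip(good, bad))

if __name__ == "__main__":
    import sys
    # Usage (hypothetical file names): python bitdiff.py good.bin bad.bin
    with open(sys.argv[1], "rb") as f:
        good = f.read()
    with open(sys.argv[2], "rb") as f:
        bad = f.read()
    first = next(newly_set_bits(good, bad), None)
    if first is None:
        print("all changes are 1->0")
    else:
        print("first 0->1 change at offset 0x%x, mask 0x%02x" % first)
```

If `only_clears_bits` returns `True`, the corruption is at least compatible with stray write commands; if it returns `False`, as in the dumps discussed here, something other than a plain write has altered the cells.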
Thank you @projectgus for looking into this, and for your help and advice. Based on this we will try to reproduce the problem by forcing the condition and then work on a fix. Thanks,
@vonnieda Thanks for reporting. Could you please share any updates or further details on this issue? Thanks.
@Alvin1Zhang No further updates from my end. I believe the voltage issue is probably correct, but have not yet had time to try to duplicate it in the lab.
Thanks for the update, @vonnieda. I'm going to close this as it seems like the root cause is most likely not specific to ESP-IDF, but if you find further information to the contrary then please let us know or open a new issue if it looks like ESP-IDF is doing something wrong here.
@vonnieda can you please share any information about how you fixed this in your product? We are running into a similar (perhaps same?) issue where we are seeing random flash corruption on our products.
@shreyasbharath I was never able to reproduce this in the lab on a regular basis, but as a protective measure we started adding a voltage supervisor that ensures ESP_ENABLE stays low until the voltage is good, and that seems to have resolved it.
@vonnieda thanks for the response! It turns out that at least one of the issues we are experiencing is related to the change of flash chip (from GigaDevice to XMC), and an over-erase issue in the bootloader.
We went through this with ~3000 chips in the field :-( It cost us a ton of money. https://en.hoerbert.com/technology/esp32-critical-fatal-problem-source-in-some-wrover-e-modules/
Environment
Problem Description
Short version: A small percentage of boards in the field have had their flash corrupted so that they are unable to boot. It appears the first half or more of the flash has been overwritten with random data.
Long version: I started this as a forum thread at ( https://esp32.com/viewtopic.php?f=2&t=14719 ) and there are significant details there.
Customers report a boot loop. Upon receipt of the device the console shows a loop of:
I used esptool.py to pull the flash image off the board and found that the bootloader, partition table, ota_data, first application image and part of the second application image are overwritten with what appears to be random data. The only flash writing that my firmware does is via NVS and OTA APIs. I do not access the flash directly.
I have a hunch that when this has happened it is during rapid power cycles, perhaps due to low battery brownout. I have not been able to confirm that in person yet, but I see some evidence of it.
Reflashing the board via UART recovers it just fine and it operates normally.
A few findings that might be of significance:
This issue has affected a small but significant number of devices in the field. It results in a completely bricked device that requires return. I'd really appreciate some help or ideas on how this could be happening.
Thanks,
Jason