Silent file corruption on NRF devices #3623

Closed
kdb424 opened this issue Oct 31, 2020 · 17 comments

@kdb424

kdb424 commented Oct 31, 2020

On large disk writes, filesystem corruption seems to occur on the disk. The disk reports as working properly to the host OS, even after a reboot of the controller. This has been tested on several devices, but is most problematic on the nice_nano, which has almost identical specs to the itsybitsyNRF. Tested using several CircuitPython versions, including 6 rc.0.

Steps to reproduce, using https://github.com/KMKfw/kmk_firmware:

  1. Copy the KMK folder to the root of the drive
  2. Wait until the disk syncs, or run sync on Linux
  3. The code is corrupted when trying to run it

This issue is much more likely to occur when replacing files on disk, though even the initial copy sometimes fails and, about half of the time, crashes to safe mode if a disk sync is forced. rsync reports that the files are correct, but when the code is run, random files have errors that make no sense and are corrupted with garbage data. Copying the same files over them eventually works, and then all of the code runs properly with no changes.
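To find out which files on the drive differ from the originals, a byte-by-byte comparison of the two directory trees can help. The following is only a minimal host-side sketch: the function name `find_mismatched_files` and both path arguments are made up for illustration. Note that the host may answer reads from its own cache (which could be why rsync reports the files as correct), so it is probably best to run this after ejecting the drive and resetting the board.

```python
import filecmp
import os


def find_mismatched_files(source_dir, drive_dir):
    """Return relative paths that are missing on the drive or differ in content."""
    mismatched = []
    for root, _dirs, files in os.walk(source_dir):
        rel_root = os.path.relpath(root, source_dir)
        for name in files:
            rel_path = os.path.normpath(os.path.join(rel_root, name))
            src = os.path.join(source_dir, rel_path)
            dst = os.path.join(drive_dir, rel_path)
            # shallow=False forces a byte-by-byte comparison instead of
            # trusting file size and modification time.
            if not os.path.exists(dst) or not filecmp.cmp(src, dst, shallow=False):
                mismatched.append(rel_path)
    return sorted(mismatched)
```

For example, `find_mismatched_files("kmk_firmware/kmk", "/media/CIRCUITPY/kmk")` would list the files to re-copy; both paths here are placeholders and depend on where the drive is mounted.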

Tested on Linux, as well as a Windows 10 VM on Virtualbox

@dhalbert
Collaborator

The nice_nano does not have an external QSPI flash chip, and uses the nRF52840's internal flash for a filesystem. On the Particle Xenon, CIRCUITPY is on the external flash chip.

Do you see any difference between boards that use the internal flash and boards that use external flash when you see this problem?

@kdb424
Author

kdb424 commented Oct 31, 2020

I remembered incorrectly. Sorry for that mistake, the Xenon works perfectly.

@dhalbert
Collaborator

OK, so this is probably internal filesystem only. Thanks! That's very helpful. We have seen this in the past, and I have a vague memory of trying to fix it, but clearly it needs more investigation.

@dhalbert
Collaborator

dhalbert commented Oct 31, 2020

Some previous issues with similar problems that we thought we fixed:
#2318
#2338
#1654

@kdb424
Author

kdb424 commented Oct 31, 2020

Let me know the branch when you think you have a fix, and I'll give it a shot on my devices and report back. I don't have nearly a deep enough understanding of CircuitPython to contribute further.

@dhalbert dhalbert added this to the 6.x.0 - Features milestone Oct 31, 2020
@dhalbert dhalbert added the nrf52 label Oct 31, 2020
@tannewt
Member

tannewt commented Nov 3, 2020

@dhalbert isn't this a bug? I'd expect the bug label and either the 6.x.x milestone if we're going to debug soon or Long Term if we aren't.

@dhalbert
Collaborator

dhalbert commented Nov 3, 2020

It is a bug that I marked it with the wrong milestone :)

@dhalbert dhalbert added the bug label Nov 3, 2020
@kdb424
Author

kdb424 commented Nov 4, 2020

Wanted to follow this up with tests on Mac and Windows. Same results: success was reported when writing the file, and I safely ejected the disk. File corruption was detected when running the code, as there were errors in the code that was read back on the nice_nano itself. After rebooting the microcontroller (hard reset), the corruption could be seen in that file on the computer.

Edit: Looks like I may have spoken too soon. It may be writing correctly, and reporting to the OS that it's writing faster than it is. I left the board for 5 minutes after the copy, and eventually it seemed to settle if I didn't interrupt it. Running sync on the system claimed that it was done writing while the code still would not run. Doing nothing for 5 minutes as the code tried to run over and over eventually worked. Will report back any further findings.

@kdb424
Author

kdb424 commented Nov 4, 2020

Following up that this still is an issue.

Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
main.py output:
Traceback (most recent call last):
  File "main.py", line 8, in <module>
  File "0İ4|.df2�3��22�3��22[_init_coord_mappingkmk/kmk_keyboard.pyany
extensionsany
co", line 4, in <module>
RuntimeError: Corrupt .mpy file

Cleanly reformatting storage with storage.erase_filesystem() seems to help for a while. Eventually I have to run it again, or all files become corrupt and the device disconnects me from the serial connection (REPL) any time I copy a file.

@kdb424
Author

kdb424 commented Nov 9, 2020

I've managed to come up with a way to consistently get corruption. I've frozen in as many modules as I could to fill up the primary storage. My custom circuitpython image is at 1.1MB. After adding only a few small files to the CIRCUITPY drive, I see corruption very often. Hopefully this will help you reproduce this more quickly to see what is causing the issue. I can also send over the UF2 for the nice_nano that I used to trigger this, if it is at all helpful.

@dhalbert
Collaborator

dhalbert commented Nov 9, 2020

Do you see this even if you don't run any code, or do you have to run code and eventually it happens?

Is the code you're running using microcontroller.nvm or doing BLE pairing? Both of these write to internal flash, so I want to see if the regions are overlapping by accident or something like that. In the meantime, I will check the flash layout.
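The overlap check mentioned here can be sketched as a simple interval test. This is only an illustration: the region names and every address and size in `EXAMPLE_LAYOUT` are invented, not the real nRF52840 or CircuitPython flash layout, and `regions_overlap` and `find_overlaps` are hypothetical helper names.

```python
def regions_overlap(start_a, size_a, start_b, size_b):
    """True if the half-open ranges [start, start + size) share at least one byte."""
    return start_a < start_b + size_b and start_b < start_a + size_a


def find_overlaps(regions):
    """Return sorted (name, name) pairs of regions whose address ranges overlap.

    regions maps a region name to a (start_address, size_in_bytes) tuple.
    """
    names = sorted(regions)
    overlaps = []
    for index, first in enumerate(names):
        for second in names[index + 1:]:
            if regions_overlap(*regions[first], *regions[second]):
                overlaps.append((first, second))
    return overlaps


# Invented numbers for illustration only; NOT a real flash layout.
EXAMPLE_LAYOUT = {
    "filesystem": (0x10000, 0x8000),   # covers 0x10000..0x17FFF
    "nvm": (0x17000, 0x2000),          # covers 0x17000..0x18FFF, collides with filesystem
    "ble_bonding": (0x19000, 0x1000),  # covers 0x19000..0x19FFF
}

print(find_overlaps(EXAMPLE_LAYOUT))
```

With the invented layout above, the only pair reported is `("filesystem", "nvm")`; regions that merely touch at a boundary are not counted as overlapping.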

@dhalbert
Collaborator

dhalbert commented Nov 9, 2020

> My custom circuitpython image is at 1.1MB.

There is only 1MB of flash on the nRF52840! So I'm not sure why it's not catching that your image is too big. Even before this experiment of adding additional frozen modules, did you have some frozen modules?

(EDIT) I think you may be referring to the .uf2 size, which is considerably larger than the actual size of the firmware in flash. (There is empty space in the .uf2 file, and overhead).
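To illustrate the difference between the .uf2 size and the firmware size: a UF2 file is a sequence of 512-byte blocks, each with a 32-byte header, a 476-byte data area, and a trailing magic word, and most generators put only 256 bytes of payload in each block. The sketch below sums the payload sizes declared in the block headers. The function name `uf2_payload_bytes` is made up for this example; the magic numbers and field offsets are from the UF2 format.

```python
import struct

UF2_BLOCK_SIZE = 512
UF2_MAGIC_START0 = 0x0A324655
UF2_MAGIC_START1 = 0x9E5D5157
UF2_MAGIC_END = 0x0AB16F30
UF2_FLAG_NOT_MAIN_FLASH = 0x00000001


def uf2_payload_bytes(uf2_data):
    """Sum the payload sizes of all valid main-flash blocks in a UF2 image."""
    total = 0
    for offset in range(0, len(uf2_data) - UF2_BLOCK_SIZE + 1, UF2_BLOCK_SIZE):
        block = uf2_data[offset:offset + UF2_BLOCK_SIZE]
        # Header fields are little-endian 32-bit words:
        # magic0, magic1, flags, target address, payload size, ...
        magic0, magic1, flags, _address, payload_size = struct.unpack_from("<5I", block, 0)
        (magic_end,) = struct.unpack_from("<I", block, UF2_BLOCK_SIZE - 4)
        if (magic0, magic1, magic_end) != (UF2_MAGIC_START0, UF2_MAGIC_START1, UF2_MAGIC_END):
            continue  # not a UF2 block, skip it
        if flags & UF2_FLAG_NOT_MAIN_FLASH:
            continue  # block is not meant to be written to main flash
        total += payload_size
    return total
```

Usage would be something like `uf2_payload_bytes(open("firmware.uf2", "rb").read())`, where the file name is a placeholder. With 256-byte payloads, a 1.1MB .uf2 would carry roughly half that much data.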

@dhalbert
Collaborator

dhalbert commented Nov 9, 2020

I am on the trail of a possible issue.

@kdb424
Author

kdb424 commented Nov 9, 2020

> Is the code you're running using microcontroller.nvm or doing BLE pairing?

The code does do Bluetooth pairing, but does not use microcontroller.nvm.

> There is only 1MB of flash on the nRF52840! So I'm not sure why it's not catching that your image is too big. Even before this experiment of adding additional frozen modules, did you have some frozen modules?

I was referring to the UF2 size, my apologies. This issue was opened using only stock CircuitPython, downloaded from the official site.

@kdb424
Author

kdb424 commented Nov 15, 2020

Wanted to report back that I now have an unbootable microcontroller. Even flashing fresh (official) CircuitPython to the device does not seem to get it to a REPL. I assume that this may be related to this bug, as the device has only been used with CircuitPython.

@dhalbert
Collaborator

dhalbert commented Nov 15, 2020

I have reproduced bad filesystem writes by writing large files (~180k) on a PCA10059 (uses internal filesystem). They take a very long time, sometimes, to write. I am investigating a couple of hypotheses about what's wrong.

@kdb424 You could try erasing everything and re-flashing the bootloader. See the instructions in https://github.com/adafruit/Adafruit_nRF52_Bootloader/blob/master/README.md

@dhalbert dhalbert modified the milestones: 6.x.x - Bug Fixes, 7.0.0 Mar 31, 2021
@dhalbert dhalbert self-assigned this Jul 15, 2021
@dhalbert dhalbert removed their assignment Aug 25, 2021
@dhalbert
Collaborator

I re-tested this with 7.0.0-beta.0, and I can no longer reproduce the problem. I tried copying a 200k file several times, syncing, and then comparing, and it was not corrupted. I also copied the kmk directory tree several times, though it fills up the PCA10059 filesystem now before it completes. In addition, the slow writes I saw before are not appearing.
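A copy, sync, and compare loop of this kind could look roughly like the following host-side sketch. It is not the exact procedure used for the re-test; `sha256_of` and `copy_and_verify` are hypothetical names, and the number of rounds is arbitrary.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=4096):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination, rounds=3):
    """Copy source over destination several times; return how many rounds mismatched."""
    expected = sha256_of(source)
    failures = 0
    for _ in range(rounds):
        shutil.copyfile(source, destination)
        # Ask the OS to flush this file to the device before reading it back.
        with open(destination, "rb+") as handle:
            os.fsync(handle.fileno())
        if sha256_of(destination) != expected:
            failures += 1
    return failures
```

The read-back may still be served from the host's cache, so a stricter check would be to eject the drive, reset the board, and hash the file again after it remounts.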

We have fixed a number of things about background tasks, USB, and time-keeping since this report, though I'm not sure which fix may have finally fixed this problem, if indeed it is fixed.

I will close this for now, but please reopen if you see it again with 7.0.0-beta.0 or later. Thanks for your report.
