Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialize lfs #397

Merged
merged 16 commits into from
Jan 6, 2020
Merged

Serialize lfs #397

merged 16 commits into from
Jan 6, 2020

Conversation

henrygab
Copy link
Collaborator

@henrygab henrygab commented Dec 20, 2019

Fix #350
Fix #325
Fix #227
Fix #222
Fix #393

@hathach -- This PR supercedes #372 (Guard against race conditions in Flash FS cache), as #72 was not really providing a proper fix of the underlying problem:

LFS itself is not safe to call simultaneously from multiple contexts.

This PR solves this by wrapping the InternalFS API with a non-recursive mutex, as recommended.

I realize the PR uses newly declared wrapper functions. I tried to avoid this, but the changes got messy quite quickly, mostly due to some of the macros used to verify return codes exiting the function (which would leak the mutex). This is a reasonably clean solution, without the more drastic move to use of RAII objects such as std::lock_guard<>.

@mattratto
Copy link

This is great - thanks Henry and all. I have seen a file corruption issue remaining with 372 (much more infrequent than before but still there). Wonderful to have this wrapped up.

@henrygab
Copy link
Collaborator Author

@mattratto -- You're right! This will be wonderful to have wrapped up.

It'd be great to have some indication that this resolves a real-world reported instance, such as your remaining corruption problem. Would you be willing to reply confirming that you no longer see your remaining corruption after using this fix? (after you are fairly confident, of course)

Let's hope this closes this one for good!

Copy link
Member

@hathach hathach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR and all of your hard work, and sorry for the late response. I have been busy with usb things in the last few days. To be honest, I don't like the wrap function, it make code difficult to read and maintain especially adding more API. We should just drop the VERIFY_LFS() and use normal way to return error instead.

cores/nRF5/rtos.cpp Outdated Show resolved Hide resolved
libraries/Adafruit_LittleFS/src/Adafruit_LittleFS.cpp Outdated Show resolved Hide resolved
@hathach
Copy link
Member

hathach commented Dec 25, 2019

@henrygab do you think we also need mutex for the Adafruit_LittleFS_File as well ?
https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/master/libraries/Adafruit_LittleFS/src/Adafruit_LittleFS_File.cpp

@hathach
Copy link
Member

hathach commented Dec 25, 2019

I make PR to your branch to unwrap the VERIFY_LFS() here, since it it a bit too much work to ask you to change, please let's me know what you think.
henrygab#5

@hathach
Copy link
Member

hathach commented Dec 26, 2019

I still see the assert file with the stress test, we definitely need to wrap up the Adafruit_LittleFS_File as well

Task n3 writing ...
Task loop writing ...
Task loop writing ...
assertion "head >= 2 && head <= lfs->cfg->block_count" failed: file "/home/hathach/Arduino/hardware/adafruit/nrf52/libraries/Adafruit_LittleFS/src/littlefs/lfs.c", line 1235, function: lfs_ctz_extend
Task high writing ...
Task n3 writing ...
Task n1 writing ...
Task n3 writing ...

@henrygab
Copy link
Collaborator Author

@henrygab do you think we also need mutex for the Adafruit_LittleFS_File as well ?

Yes, I think we need to also use the mutex when lfs APIs are used from within Adafruit_LittleFS_File.
Let me see if I can create a small patch, likely by declaring a friend function (or two).

@henrygab
Copy link
Collaborator Author

@hathach -- the Adafruit_LittleFS_File public APIs appear to call into other public APIs, and thus would deadlock if simply ensuring the lock is acquired at each public API function entry / released at each exit.

I understand that you have indicated that you do not prefer the use of a wrapper function. I am happy to consider alternatives, but believe the safest change is to wrap each APIs that is called from another public API, so as to allow the serialization without major changes to the existing, tested, code flow.

What do you think?

@henrygab
Copy link
Collaborator Author

henrygab commented Dec 27, 2019

@hathach To maximize working time, I've gone ahead and put forward change that continues to work for the stress test.

It was not as simple as I had hoped. Three issues in particular arose:

  1. Because the Adafruit_LittleFS_File constructors were public, it was not possible to modify them without breaking the existing API. Although I preferred otherwise, I added some friendship between the classes.

  2. The implementation of various Adafruit_LittleFS_File functions called back into its own public API. This meant that either a recursive mutex would be required (not a great solution), or that some of those public functions would need convert to wrappers around private functions. I chose the latter solution of (limited) wrapper functions where needed.

  3. As before, the use of the VERIFY() macros had to be unwound. This raises the possibility of typos or logic inversion.

Accordingly, I strongly request your careful code review and testing for commit 2a2f771.

@henrygab
Copy link
Collaborator Author

Rebase changes to be based from 0.15.0 ...

@mattratto
Copy link

I haven’t tried the rebase, but use of the previous version still results in periodic lockups - only on sequential removal of files. This is happening after the saving of approximately 100 separate 50byte files. When I loop through and remove them, sometimes the code halts. I haven’t yet run this with debug on, but once I do, I will post more info.

I am wondering if I am not allowing enough of a delay between each remove, though from my reading of the new code this should not be an issue.

I should add that this is occurring in a production environment which includes Bluetooth transactions.

Matt

@mattratto
Copy link

Another note - this is occurring on brand new boards. I am not running a file system format before placing files on the flash, following initial install of software. Is it possible that reliability would be increased by adding an initial format prior to adding files?

@hathach
Copy link
Member

hathach commented Jan 1, 2020

Sorry for late response, is busy migrating from travis to github action, and bunch o f unnamed work at the end of year. Will look at this next.

@henrygab
Copy link
Collaborator Author

henrygab commented Jan 1, 2020

@mattratto --

Question: Are your files in the root directory, or are they all in a subdirectory that only you use?

Why:

As a test, if you they were in root, and you moved all your files out into a subdirectory, would you still see a problem?

  • If yes, then I'm intrigued... Could the softdevice also doing some access to the flash that fundamentally conflicts? Could it be that it cannot be "safe" to cache data at all when the softdevice is enabled?
  • If no, then while not definitive (cannot easily tell difference between move from 99% --> 99.999% vs. to 100%), using the subdirectory may be a quick way to reduce instances of the problem?

@mattratto
Copy link

They were all in root. I’m moving to subdirectory and continuing to test. Is there anything fancy I should be doing with this? I’ve just appended /data/ to the file names for the write, and open /data/ for the reads and deletes.

As always, my thanks Henry!

@henrygab
Copy link
Collaborator Author

henrygab commented Jan 1, 2020

First, I have very little understanding of what the SD does. This currently is limiting me to (hopefully educated) guesses on how this might occur. That said...

No, nothing fancy. A subdirectory is all that's needed to get the potential reduction in interactions.

@hathach -- do you know when, why, and how the SD is saving bluetooth pairing data to flash?

@mattratto
Copy link

So I’ve put my data files in a directory and now, instead of deleting them individually, I delete and recreate the directory. Either the use of the directory, or this new deletion process seems to have resolved remaining issues. I am of course continuing to test, but looking good so far!

@henrygab
Copy link
Collaborator Author

henrygab commented Jan 2, 2020

@mattratto That data supports my current suspicion that the softdevice is messing with (writing directly to) sectors that Adafruit_LittleFS is using.

(( I'm curious if there is any repro without sd enabled, but I can no longer repro the issue with this branch ))

Can you get a repro if you go back to the prior behavior (saving in root directory), and not enable the softdevice / bluetooth? Obviously, this presumes you have multiple tasks that each use the filesystem.

If you can repro even without the softdevice enabled, then something else is awry.

@henrygab
Copy link
Collaborator Author

henrygab commented Jan 3, 2020

@mattratto -- Matt, I want to be clear that, if the SD is the cause of corruption by writing to the LFS file system without being serialized, using a subdirectory will only make the problem much more rare, and not eliminate it entirely.

This is because there are still globally shared data structures on the media, so if/when certain conditions get hit, it would be more likely to cause conflict.

I think your prior method of deleting many files in sequence may have been a good test case (but also good to know the recursive directory deletion can also further reduce the likelihood of conflict with SD).

@hathach
Copy link
Member

hathach commented Jan 6, 2020

Sorry for another delay, was busy with other works lately.

@hathach -- do you know when, why, and how the SD is saving bluetooth pairing data to flash?

To be clear SD doesn't write to LFS, only the Bluefruit library save information to LFS when

  • Paired with an new device
  • CCCD of bonded device is changed (enabled or disabled)

CCCD changes occurs more often then pairing with new device, if there is conflict, it is probably it.

char const* name (void);

bool isDirectory (void);
File openNextFile (uint8_t mode = FILE_O_READ);
void rewindDirectory (void);

// Has to be public, in order to grant 'friend' permissions to Adafruit_LittleFS private member _mutex
Copy link
Member

@hathach hathach Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@henrygab as long as it works, but I think we should wrap up the LFS lock/unlock() as _lock()/_unlock() with the comment to tell use not to call it explicitly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

Because this PR fixes data corruption, I think it is important to accept even if you want to change some function names later.

Separately, if you want the design to better hide this, I would recommend following the PIMPL (Pointer-to-Implementation) solution. This would allow 100% hiding of all the non-public APIs, and the interactions between the FS / File abstractions, without exposing any of this internal shared locking. However, changing to PIMPL would be a bigger architectural change … my goal was to minimize changes while serializing access to LFS APIs.

Copy link
Member

@hathach hathach Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agrees, let's merge this first. We can make further modification and naming later on if needed.

No, I don't need any fancy way to absolute hide API from user. I would simply add underscore to _lock/_unlock public API with comment saying it is intended to be used internally and expect user not to call it :D
https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/master/libraries/Bluefruit52Lib/src/bluefruit.h#L239

Per @hathach's PR, this removes the wrapper functions.
This solution appears to work.  However, it does not
follow all best practices.  For example, the use of
friend declarations, and the need to make the lock/unlock
functions in File public (in order to get friend status).
However, it passes the stress test without errors.
@hathach
Copy link
Member

hathach commented Jan 6, 2020

wow, you are too quick, I made a modification and submit it as PR (as usual) to your branch, but look like github has difficulty to pick up the differences.

Copy link
Member

@hathach hathach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank for super quick fix for comments, I am kind of get lost when differences presented by github now. I agreed that this fixes the stress_test and could be merged. I can then manually compare with my modification and submit as separated PR for further discussion.

char const* name (void);

bool isDirectory (void);
File openNextFile (uint8_t mode = FILE_O_READ);
void rewindDirectory (void);

// Has to be public, in order to grant 'friend' permissions to Adafruit_LittleFS private member _mutex
Copy link
Member

@hathach hathach Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agrees, let's merge this first. We can make further modification and naming later on if needed.

No, I don't need any fancy way to absolute hide API from user. I would simply add underscore to _lock/_unlock public API with comment saying it is intended to be used internally and expect user not to call it :D
https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/master/libraries/Bluefruit52Lib/src/bluefruit.h#L239

@hathach hathach merged commit d643515 into adafruit:master Jan 6, 2020
@hathach
Copy link
Member

hathach commented Jan 6, 2020

Thank you very much @henrygab for spending lots of time fixing this issue. This has been draining a lot more time than expected ( and I am got pulled to work on other things meanwhile). I am very much appreciated for your time and effort. Looking forward to see more of your PRs in the future :)

@henrygab
Copy link
Collaborator Author

henrygab commented Jan 6, 2020

@hathach -- I am humbled by the many volunteers, without whose help (reporting, testing the many incomplete fixes, and final validation) this would not have been possible, including:

A special thanks also to @geky for helping me understand LFS!

@mattratto
Copy link

Amazing! Great to watch this work being done!

@hathach hathach mentioned this pull request Jan 7, 2020
@hathach
Copy link
Member

hathach commented Jan 7, 2020

@hathach -- I am humbled by the many volunteers, without whose help (reporting, testing the many incomplete fixes, and final validation) this would not have been possible, including:

A special thanks also to @geky for helping me understand LFS!

Indeed, thank you everyone to help fixing this annoying bugs and be PATIENT with me :D

@henrygab henrygab deleted the SerializeLFS branch March 11, 2020 22:30
@mattratto
Copy link

HI all, and specifically Henry, after months of no corruption issues, I have now seen 5 issues, exhibited with both Adafruit BSP 0.16.0 and with Adafruit BSP 0.20.5 (most recent release.) In each case, the error is the same LFS 494 Corrupted Directory blocks. It is possible that these devices had an older version of the BSP installed previously and thus the corruption may have occurred while using an older version of the Adafruit littlefs libraries. It is resolved by a format, and has not reoccurred in any of the above examples. However, I am wondering how I might insulate my code against corruption issues.

My main problem is that when my code runs into a corruption issue, it locks up. Does anyone have any suggestions how I could prevent this? I would like to show an error message and potentially kick off a format when the code encounters file corruption, but I have no idea how to do this since the lock up occurs while executing a file open.

thanks,

Matt

@mattratto
Copy link

mattratto commented Aug 28, 2020

In the interests of full disclosure, below is the code that is locking up when it encounters an error. If this is really a code rather than a library issue, I do apologize:

InternalFS.begin(); //start file system
  File item(InternalFS);
  File dir("data/", FILE_O_READ, InternalFS);
    while( (item = dir.openNextFile(FILE_O_READ)) )
  {  
    String filename = item.name(); 
    if(filename.indexOf("data") == 0) {  
    uint32_t readlen;
    char buffer[999] = { 0 };
    readlen = item.read(buffer, sizeof(buffer));
    item.close();
    for (int e=0; e<readlen; e++) { //now parse and separate 
    Serial.print(buffer[e]); 
 }
 Serial.println("---"); 
 }
 }
dir.close();
}

@mattratto
Copy link

after further thought, am thinking this is a code issue rather than a library issue per se. For anyone who wanders by, I realized that the 'Internal_ListFiles' example does not lock up, but my code above does. The only difference is that the example code only grabs the file names, but I get the names but also try to read the file contents. So I am wondering if there is a way to check the file integrity prior to attempting to read - maybe in fact, the item.name read is returning an error code?

@hathach
Copy link
Member

hathach commented Aug 29, 2020

@mattratto the file system should done the checking for us, once the FS is corrupted, it is too hard to check that, since you need to understand how the files is stored on the flash and understand how these inode chain work etc...

Are you still seeing corruption issue with latest library ?

@mattratto
Copy link

@hathach I am getting file corruption error messages when running code with latest library. However, I do not know if they were created by the library, or occurred when running an older library, prior to update to 0.20.5 BSP. These show up when I run Internal_ListFiles example code and seem to be happening around the use of openNextFile. I went and looked at this function in Adafruit_LittleFS_file.cpp and found on line 393:

(void)ret._open(filepath, mode); // return value is ignored ... caller is expected to check isOpened()

I couldn't find isOpened, but am now trying isOpen() on each item prior to proceeding:

while( (item = dir.openNextFile(FILE_O_READ)) )
{
if (item.isOpen()){

I will test this and report back.

@hathach
Copy link
Member

hathach commented Aug 30, 2020

@mattratto I see your problem now, since this PR is merged and closed. You should really re-test the library with newly format drive, if you still have issue with latest code then. Feel free to open an new issue, that make following it easier ( your could point a ref link to this if needed).

@mattratto
Copy link

mattratto commented Aug 30, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment