New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFS] check.c: do lfn_remove on auto_rename() #83
Conversation
We see sometimes corrupted uboot.env files created by fsck.vfat. On the fat filesystem they are indistinguishable because one has a fat16 name UBOOT.ENV (and not lfn (long-file-name)) but the other has a "uboot.env" lfn name and a FSCK0000.000 FAT16 name. The only known workaround is to remove all dupes and put one file back in place. Ideally we also fix fsck.vfat (possible fix in dosfstools/dosfstools#83) With this patch the sequence of events will be: 1. snapd refreshes 2. snapd writes new uboot.env with snap_mode=try 3. snapd reboots 4. uboot starts loads the uboot.env and writes it with snap_mode=trying 5. system boots, /boot/uboot is set to fs_passno=2 and systemd checks it before it mounts it 6. the file is corrupted, most likely by the uboot write 7. the fsck in (6) renames the file to fsck0000.000 but leaves the l-f-n entries intact and now we have two files. 8. snapd.core-fixup.sh runs, detects the two files and removes them and just keeps one. In the (manual) tests I ran this was the file with "snap_mode=try" 9. snapd will assume that the boot failed because snap_core is still at the previous core and reverts 10. uboot will on revert write snap_mode=trying to the right file now and our failover logic works again.
We see sometimes corrupted uboot.env files created by fsck.vfat. On the fat filesystem they are indistinguishable because one has a fat16 name UBOOT.ENV (and not lfn (long-file-name)) but the other has a "uboot.env" lfn name and a FSCK0000.000 FAT16 name. The only known workaround is to remove all dupes and put one file back in place. Ideally we also fix fsck.vfat (possible fix in dosfstools/dosfstools#83) With this patch the sequence of events will be: 1. snapd refreshes 2. snapd writes new uboot.env with snap_mode=try 3. snapd reboots 4. uboot starts loads the uboot.env and writes it with snap_mode=trying 5. system boots, /boot/uboot is set to fs_passno=2 and systemd checks it before it mounts it 6. the file is corrupted, most likely by the uboot write 7. the fsck in (6) renames the file to fsck0000.000 but leaves the l-f-n entries intact and now we have two files. 8. snapd.core-fixup.sh runs, detects the two files and removes them and just keeps one. In the (manual) tests I ran this was the file with "snap_mode=try" 9. snapd will assume that the boot failed because snap_core is still at the previous core and reverts 10. uboot will on revert write snap_mode=trying to the right file now and our failover logic works again.
@@ -272,6 +272,8 @@ static void auto_rename(DOS_FILE * file) | |||
sprintf(num, "%07lu", (unsigned long)number); | |||
memcpy(file->dir_ent.name, "FSCK", 4); | |||
memcpy(file->dir_ent.name + 4, num, 7); | |||
if (file->lfn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is at the point where names are speculatively generated and then searched for to see if they are a suitable unique replacement. Removing the long file name might happen over and over.
Replacing the lfn_fix_checksum()
would probably be the right place and also make sense, since the LFN shouldn't be fixed but removed.
@@ -272,6 +272,8 @@ static void auto_rename(DOS_FILE * file) | |||
sprintf(num, "%07lu", (unsigned long)number); | |||
memcpy(file->dir_ent.name, "FSCK", 4); | |||
memcpy(file->dir_ent.name + 4, num, 7); | |||
if (file->lfn) | |||
lfn_remove(file->lfn_offset, file->offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't clear the file->lfn
field. This is not a problem in the other case where it is called since there the whole file is dropped. Here it should be updated.
I have now fixed this by removing the LFN for both automatic and manual renames. |
Thanks a lot for your fix! |
I ran into the issues that I had a filesystem with two "uboot.env" files.
On closer inspection it become clear that the image contained a file called "FSCK0000.000" with the lfn (long-file-name) "uboot.env". The image also contains a "UBOOT.ENV" (as a regular FAT16 name). When this is mounted via the vfat kernel driver two "uboot.env" files appear. Which is confusing :) Mounting it as "msdos" helpered to solve the mystery.
After looking at fsck.vfat I suspect that what happens is there was a corrupted entry for the uboot.env file which fsck.vfat corrected via auto_rename() in check.c. However this auto_rename() seems to only set the short filename. Any lfn-names remain untouched AFAICT. Which I suspect leads to the confusing result above.
Attached is a RFC patch that cleans the lfn name in auto_rename. I will try to find a way to properly test this, so far I was not successful with creating the right kind of corruption on the image.
Please check this carefully, I'm not familar with the internals of dosfstools but I would love to avoid that people run into the same confusion that I got into :)