Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFS] check.c: do lfn_remove on auto_rename() #83

Closed
wants to merge 1 commit into from

Conversation

mvo5
Copy link

@mvo5 mvo5 commented May 8, 2018

I ran into the issues that I had a filesystem with two "uboot.env" files.

On closer inspection it become clear that the image contained a file called "FSCK0000.000" with the lfn (long-file-name) "uboot.env". The image also contains a "UBOOT.ENV" (as a regular FAT16 name). When this is mounted via the vfat kernel driver two "uboot.env" files appear. Which is confusing :) Mounting it as "msdos" helpered to solve the mystery.

After looking at fsck.vfat I suspect that what happens is there was a corrupted entry for the uboot.env file which fsck.vfat corrected via auto_rename() in check.c. However this auto_rename() seems to only set the short filename. Any lfn-names remain untouched AFAICT. Which I suspect leads to the confusing result above.

Attached is a RFC patch that cleans the lfn name in auto_rename. I will try to find a way to properly test this, so far I was not successful with creating the right kind of corruption on the image.

Please check this carefully, I'm not familar with the internals of dosfstools but I would love to avoid that people run into the same confusion that I got into :)

mvo5 added a commit to mvo5/snappy that referenced this pull request May 9, 2018
We see sometimes corrupted uboot.env files created by fsck.vfat.

On the fat filesystem they are indistinguishable because one
has a fat16 name UBOOT.ENV (and not lfn (long-file-name)) but
the other has a "uboot.env" lfn name and a FSCK0000.000 FAT16
name. The only known workaround is to remove all dupes and put
one file back in place.

Ideally we also fix fsck.vfat (possible fix in
dosfstools/dosfstools#83)

With this patch the sequence of events will be:
1. snapd refreshes
2. snapd writes new uboot.env with snap_mode=try
3. snapd reboots
4. uboot starts loads the uboot.env and writes it with snap_mode=trying
5. system boots, /boot/uboot is set to fs_passno=2 and systemd checks
   it before it mounts it
6. the file is corrupted, most likely by the uboot write
7. the fsck in (6) renames the file to fsck0000.000 but leaves the
   l-f-n entries intact and now we have two files.
8. snapd.core-fixup.sh runs, detects the two files and removes them
   and just keeps one. In the (manual) tests I ran this was the
   file with "snap_mode=try"
9. snapd will assume that the boot failed because snap_core is still
   at the previous core and reverts
10. uboot will on revert write snap_mode=trying to the right file
    now and our failover logic works again.
mvo5 added a commit to mvo5/snappy that referenced this pull request May 10, 2018
We see sometimes corrupted uboot.env files created by fsck.vfat.

On the fat filesystem they are indistinguishable because one
has a fat16 name UBOOT.ENV (and not lfn (long-file-name)) but
the other has a "uboot.env" lfn name and a FSCK0000.000 FAT16
name. The only known workaround is to remove all dupes and put
one file back in place.

Ideally we also fix fsck.vfat (possible fix in
dosfstools/dosfstools#83)

With this patch the sequence of events will be:
1. snapd refreshes
2. snapd writes new uboot.env with snap_mode=try
3. snapd reboots
4. uboot starts loads the uboot.env and writes it with snap_mode=trying
5. system boots, /boot/uboot is set to fs_passno=2 and systemd checks
   it before it mounts it
6. the file is corrupted, most likely by the uboot write
7. the fsck in (6) renames the file to fsck0000.000 but leaves the
   l-f-n entries intact and now we have two files.
8. snapd.core-fixup.sh runs, detects the two files and removes them
   and just keeps one. In the (manual) tests I ran this was the
   file with "snap_mode=try"
9. snapd will assume that the boot failed because snap_core is still
   at the previous core and reverts
10. uboot will on revert write snap_mode=trying to the right file
    now and our failover logic works again.
@@ -272,6 +272,8 @@ static void auto_rename(DOS_FILE * file)
sprintf(num, "%07lu", (unsigned long)number);
memcpy(file->dir_ent.name, "FSCK", 4);
memcpy(file->dir_ent.name + 4, num, 7);
if (file->lfn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is at the point where names are speculatively generated and then searched for to see if they are a suitable unique replacement. Removing the long file name might happen over and over.

Replacing the lfn_fix_checksum() would probably be the right place and also make sense, since the LFN shouldn't be fixed but removed.

@@ -272,6 +272,8 @@ static void auto_rename(DOS_FILE * file)
sprintf(num, "%07lu", (unsigned long)number);
memcpy(file->dir_ent.name, "FSCK", 4);
memcpy(file->dir_ent.name + 4, num, 7);
if (file->lfn)
lfn_remove(file->lfn_offset, file->offset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't clear the file->lfn field. This is not a problem in the other case where it is called since there the whole file is dropped. Here it should be updated.

@andreasbombe
Copy link
Contributor

I have now fixed this by removing the LFN for both automatic and manual renames.

@mvo5
Copy link
Author

mvo5 commented Jun 12, 2018

Thanks a lot for your fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants