Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamically sized bare-metal image can clobber data partition on reprovision #586

Open
bgilbert opened this issue Jul 29, 2020 · 23 comments

Comments

@bgilbert
Copy link
Contributor

Bug Report

Expected Behavior

Successful reprovisioning of a machine from a newer OS image, without loss of data on user-created data partitions.

Actual Behavior

Data loss or provisioning failure.

Reproduction Steps

  1. Install FCOS or RHCOS. Use an Ignition config to create a data partition immediately following the root filesystem.
  2. Reprovision FCOS or RHCOS from a newer image with a larger root filesystem.

Other Information

If existing data partitions are preserved using the functionality of coreos/coreos-installer#321, reinstall will fail because coreos-installer will prevent the new install image from clobbering preserved partitions. If instead they're preserved by rerunning the Ignition config to recreate the partitions, data loss will occur when the installer writes the larger image over the beginning of the data partition.

If the Ignition config dynamically sets the starting offset of the partition (via a zero or missing startMiB), data loss can also appear to occur if the new image is smaller than the previous one.

Currently we dynamically set the size of the bare-metal root partition, but that size is a contract with the user. We can't avoid such a contract entirely (unless we disallow adding partitions to the boot disk) but we could separate it conceptually from the image size. That would involve 1) documenting a minimum starting offset for user data partitions, and 2) enforcing it with FCCT. That approach seems brittle, though, especially since not everyone uses FCCT.

Instead, I'd be in favor of switching back to a fixed-size bare metal image. We'd need to make that fixed size equal to what we're using now, without additional room for future expansion, unless we're prepared to clobber user data partitions in the process.

@dustymabe dustymabe added the meeting topics for meetings label Jul 29, 2020
@bgilbert bgilbert removed the meeting topics for meetings label Aug 6, 2020
@bgilbert
Copy link
Contributor Author

Discussed at a community meeting. Notes:

Ignition runs too late to detect the problem. coreos-installer could detect an existing *COS root partition, as a special case, and fail if the install image tries to overwrite any subsequent partition. But that only detects the problem; it doesn't allow successfully reprovisioning nodes.

If we commit to a long-term cap on the root FS size, it can still be changed later, but that's a breaking change and would need an announcement and deprecation period. People with nontrivial data on separate partitions probably have large disks and wouldn't mind allocating e.g. 20G to the OS. In practice we need ~2x the OS size to support Fedora major version bumps; we don't need to factor in package layering and user data because the user can make the root filesystem larger than the minimum. Should we set the cap at 5 GiB? 10 GiB?

Possible approaches:

  1. Make the partition larger than the filesystem and larger than the disk image. Seems dangerous.
  2. Ship a root partition with a well-defined (larger) size and accept the additional I/O during install.
  3. Support partition resizing in Ignition, ship a small root partition, and have the base Ignition config automatically resize it to a well-defined minimum size before creating any user-specified partitions.

@dustymabe
Copy link
Member

Discussed at a community meeting. Notes:

Ignition runs too late to detect the problem. coreos-installer could detect an existing *COS root partition, as a special case, and fail if the install image tries to overwrite any subsequent partition. But that only detects the problem; it doesn't allow successfully reprovisioning nodes.

If we commit to a long-term cap on the root FS size, it can still be changed later, but that's a breaking change and would need an announcement and deprecation period. People with nontrivial data on separate partitions probably have large disks and wouldn't mind allocating e.g. 20G to the OS. In practice we need ~2x the OS size to support Fedora major version bumps; we don't need to factor in package layering and user data because the user can make the root filesystem larger than the minimum. Should we set the cap at 5 GiB? 10 GiB?

Between those two I would choose 10 😄

Possible approaches:

  1. Make the partition larger than the filesystem and larger than the disk image. Seems dangerous.

Is there a reason we'd need to make the partition larger than the disk image? Why not just make the disk image larger too and just make it all sparse so we're not actually transfering more bits. IIUC this would remove the "danger" from this approach. Maybe there is something I'm missing.

  1. Ship a root partition with a well-defined (larger) size and accept the additional I/O during install.

Safest, but least ideal probably.

  1. Support partition resizing in Ignition, ship a small root partition, and have the base Ignition config automatically resize it to a well-defined minimum size before creating any user-specified partitions.

I worry about this one a bit. I'm not worried about coreos/ignition#924 as that seems reasonable, but I do worry about doing this by default for the root filesystem on every ignition boot (first provisioning). Seems like we'd need to be extra careful to consider all of the ways people muck with their disk image during install today and make sure this doesn't introduce issues.

@bgilbert bgilbert added the jira for syncing to jira label Sep 10, 2020
@bgilbert
Copy link
Contributor Author

Is there a reason we'd need to make the partition larger than the disk image? Why not just make the disk image larger too and just make it all sparse so we're not actually transfering more bits.

That doesn't actually help. The image is compressed and the extra space is mostly zeroes, so we're not transferring much more data either way. But we still have to write all the zeroes to disk, because we can't risk leaving some non-zeroed bits in zeroed blocks; those blocks might be significant to the filesystem (e.g. a free bitmap). And coreos-installer doesn't know which part of the partition is past the end of the filesystem.

  1. Support partition resizing in Ignition, ship a small root partition, and have the base Ignition config automatically resize it to a well-defined minimum size before creating any user-specified partitions.

Seems like we'd need to be extra careful to consider all of the ways people muck with their disk image during install today and make sure this doesn't introduce issues.

Hmm. What sorts of failure cases are you concerned about?

@jlebon
Copy link
Member

jlebon commented Sep 22, 2020

IMO, option 3 is the cleanest approach.

Related: coreos/fedora-coreos-config#604

That PR will make coreos-growpart not do any partition resizing if the rootfs partition size is any different before and after the disks stage (implying the user has resized it already via Ignition). Obviously we'll need to rework that if we go this route here.

@bnordgren
Copy link

Seems like we'd need to be extra careful to consider all of the ways people muck with their disk image during install today and make sure this doesn't introduce issues.

Hmm. What sorts of failure cases are you concerned about?

So, for one, I think the defaults should result in a working system. I missed the "start_mib: 5000" warning on one of the documentation pages. I have a 16Tb Raid5 mounted on /var; one of the three members of this raid happens to be the boot disk. The only partitions specified in the fcc file were for the RAID, and all had start_mib:0 size_mib:0. Something else took care of creating the root partition, boot partition, etc. behind the scenes.

That "something else" created a 2.7Gb root partition, which of course means that updates are failing now for lack of room.

If the tooling is going to start creating partitions not specified in the ignition file, it should at least create something that meets known minimum requirements...

I think I may have been reading "other" documentation instead of the simple examples...for instance:

start_mib (integer, optional): Start position of the partition from the beginning of the disk, in mebibytes. If 0, the partition will be placed at the start of the largest block available.

I guess I kind of assumed that "largest block available" would have been calculated after a minimum root partition was allocated...

Whoops. Rescue CD time. I've never tried to shrink a software raid before. Much less tried to move the start block of one of the members. Probably the easiest thing to do is to fail the member that rides on the root disk and re-add it after editing the partition table.

@bgilbert
Copy link
Contributor Author

We discussed this during the community meeting today. Notes:

  • We've now had a case (Upgrade fails due to minimum-free-space-percent exceeding 3% #731) where someone was bitten by this
  • Adding FCCT warnings
    • In the general case, warnings are difficult, since FCCT doesn't know the device node for the boot device. /dev/disk/by-* symlinks don't work because the well-defined ones point to partitions, and storage.disks.[].device points to the whole-disk device.
    • ACTION: We'll try to modify FCCT to at least check the case where the root filesystem is recreated in the config
  • Bad-config detection in the initrd
    • The initrd could check for a bad config after Ignition runs (too-small root partition followed by another partition)
    • Could log a warning to the journal, add a line to the MOTD, and/or fail boot
    • ACTION: We'll add a warning and/or MOTD line first, notify coreos-status@, provide a migration period, and then switch to a hard failure
  • Re option 3, we now have resize support, but there are problems:
    • The base config doesn't know the device node for the boot disk without external help
    • Resize would fail if the target disk isn't large enough for the minimum root partition size, even if there's no partition created after it, and we might not want that case to fail
    • A default partition section could merge poorly with a user config that specifies a different layout for the disk (including via FCCT sugar), or specifies the disk via a different device node symlink
  • It'd be good to improve docs around the minimum root partition size, if we can find a good place to do it

jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Feb 11, 2021
We've recently had our first case of a "trapped" rootfs running out of
space for upgrades:

coreos/fedora-coreos-tracker#731

Until we actually implement stronger behaviour for this, let's
explicitly check for this case and emit a warning if we detect it. In
the future, we'll look at making this a hard error by default (with an
escape hatch).

For more information, see:

coreos/fedora-coreos-tracker#586 (comment)
@jlebon
Copy link
Member

jlebon commented Feb 11, 2021

ACTION: We'll add a warning and/or MOTD line first, notify coreos-status@, provide a migration period, and then switch to a hard failure

First part of this done in coreos/fedora-coreos-config#850. We should probably discuss migration period before merging that so we can include it in the warning message. Let's say... 3 months?

Also, I think we need an escape hatch to be safe. One idea is to require the Ignition config to explicitly specify the rootfs size if users want it less than 5G. (The issue right now is that specifying any additional partition doesn't make it obvious that it's trapping the rootfs, but requiring users to type it out will make them realize that's probably not what they want.) In that case, we assume that's what the user really wants. This also keys into fcct emitting a warning for that case.

jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Feb 11, 2021
We've recently had our first case of a "trapped" rootfs running out of
space for upgrades:

coreos/fedora-coreos-tracker#731

Until we actually implement stronger behaviour for this, let's
explicitly check for this case and emit a warning if we detect it. In
the future, we'll look at making this a hard error by default (with an
escape hatch).

For more information, see:

coreos/fedora-coreos-tracker#586 (comment)
@bgilbert
Copy link
Contributor Author

3 months SGTM.

We're implicitly committing to a 5 GiB minimum and a 5 GiB recommendation. Is that what we want? Should we set e.g. a 5 GiB minimum and a 10 GiB recommendation?

I'm okay with the escape hatch you described. However, echoing what I said in coreos/fedora-coreos-config#850 (comment), I don't think it's reasonable to ship a small disk image and then fail the boot by default if the user doesn't expand it. One option is to only fail the boot if the user has explicitly hemmed in the rootfs partition, but not if the user has retained the default partition layout on a small disk. Another option is to just ship a larger image. 😛

@bgilbert bgilbert added the meeting topics for meetings label Feb 17, 2021
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Feb 22, 2021
We've recently had our first case of a "trapped" rootfs running out of
space for upgrades:

coreos/fedora-coreos-tracker#731

Until we actually implement stronger behaviour for this, let's
explicitly check for this case and emit a warning if we detect it. In
the future, we'll look at making this a hard error by default (with an
escape hatch).

For more information, see:

coreos/fedora-coreos-tracker#586 (comment)
@jlebon
Copy link
Member

jlebon commented Feb 22, 2021

We're implicitly committing to a 5 GiB minimum and a 5 GiB recommendation. Is that what we want? Should we set e.g. a 5 GiB minimum and a 10 GiB recommendation?

Good point. It's tricky though because it really depends on what the workload and partitioning is going to be. E.g. 8 GiB should be really safe if all of /var is a separate partition. Maybe something like: "We recommend setting aside at least 8 GiB of the root partition for the OS itself. You should allocate more space above this based on the requirements of the intended workload." ?

@jlebon
Copy link
Member

jlebon commented Feb 24, 2021

This was discussed in the community meeting:

#agreed Set a single limit of 8 GiB: we error if the rootfs is trapped with less than that, but only warn if there are no following partition

The rationale for a single number is to keep it simple.

jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Feb 25, 2021
We've recently had our first case of a "trapped" rootfs running out of
space for upgrades:

coreos/fedora-coreos-tracker#731

Until we actually implement stronger behaviour for this, let's
explicitly check for this case and emit a warning if we detect it. In
the future, we'll look at making this a hard error by default (with an
escape hatch).

For more information, see:

coreos/fedora-coreos-tracker#586 (comment)
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Mar 2, 2021
We've recently had our first case of a "trapped" rootfs running out of
space for upgrades:

coreos/fedora-coreos-tracker#731

Until we actually implement stronger behaviour for this, let's
explicitly check for this case and emit a warning if we detect it. In
the future, we'll look at making this a hard error by default (with an
escape hatch).

For more information, see:

coreos/fedora-coreos-tracker#586 (comment)
cgwalters pushed a commit to coreos/fedora-coreos-config that referenced this issue Mar 5, 2021
We've recently had our first case of a "trapped" rootfs running out of
space for upgrades:

coreos/fedora-coreos-tracker#731

Until we actually implement stronger behaviour for this, let's
explicitly check for this case and emit a warning if we detect it. In
the future, we'll look at making this a hard error by default (with an
escape hatch).

For more information, see:

coreos/fedora-coreos-tracker#586 (comment)
@jlebon
Copy link
Member

jlebon commented Mar 5, 2021

Wanted to circle back to option 3.

  • Re option 3, we now have resize support, but there are problems:
    • The base config doesn't know the device node for the boot disk without external help

The boot disk symlink idea I think resolves this now.

  • Resize would fail if the target disk isn't large enough for the minimum root partition size, even if there's no partition created after it, and we might not want that case to fail

We wouldn't hardcode a specific size in the drop-in though, just resize: true, right?

  • A default partition section could merge poorly with a user config that specifies a different layout for the disk (including via FCCT sugar), or specifies the disk via a different device node symlink

Yeah, that's a tricky one. Basically, it's missing the "policy" bits from coreos-growpart which ensure we don't growpart if root reprovisioning is detected.

So one half-baked here is simply to put that logic in transposefs: if we detect that the rootfs is being reprovisioned (not in-place), we delete the drop-in.

@bgilbert
Copy link
Contributor Author

bgilbert commented Mar 6, 2021

The boot disk symlink idea I think resolves this now.

That's only true if we only allow the user to partition the boot disk via the symlink, which would be a compat break.

We wouldn't hardcode a specific size in the drop-in though, just resize: true, right?

The goal is to allow the user to specify a config like this, right?

storage:
  disks:
    - device: /dev/vda
      partitions:
        - label: data

So after merging with the base config, they'd get:

storage:
  disks:
    - device: /dev/vda
      partitions:
        - number: 4
          label: root
          resize: true
        - label: data

...which doesn't actually force a minimum size. Under the current implementation, this config produces a root partition with the original size.

So one half-baked here is simply to put that logic in transposefs: if we detect that the rootfs is being reprovisioned (not in-place), we delete the drop-in.

Any child config really needs to understand what its parent is doing (or, at least, that its actions don't conflict with its parent) to avoid unintended results. I don't think it's great to have a base config that changes depending on what the child config is doing.

@bgilbert
Copy link
Contributor Author

bgilbert commented Mar 6, 2021

The boot symlink idea is #759.

@jlebon
Copy link
Member

jlebon commented Mar 8, 2021

...which doesn't actually force a minimum size

Ack OK, sorry for the confusion here. I was strictly focused on the "fold growpart into Ignition" bit.

Any child config really needs to understand what its parent is doing (or, at least, that its actions don't conflict with its parent) to avoid unintended results.

We could flip it around: by default there's no base config drop-in, and we dynamically generate one if we detect that we want to grow the root partition.

I don't think it's great to have a base config that changes depending on what the child config is doing.

Honestly, I don't think it's so bad. What we're essentially doing is codifying shell code into Ignition config drop-ins so that Ignition does the work we would've done anyway. There's lots of advantages to this, including (1) keeping all the knowledge about block devices and filesystems in Ignition, (2) sizing LUKS containers and filesystems at the right size from the start, and (3) less shell code.

@jlebon
Copy link
Member

jlebon commented Mar 8, 2021

So concretely, it'd be something like:

  • fetch stage runs
  • transposefs runs:
    • if reprovisioning the rootfs, saves it to RAM
    • if not reprovisioning, or in-place reprovisioning, generate a base Ignition config drop-in which grows the rootfs either to the end, or to at least MIN_SIZE if another partition is specified
  • disks stage runs
  • growfs purely handles filesystem growing in the no-reprovisioning case (and maybe in the future, we teach Ignition to grow filesystems too, so then this would also be handled by the disks stage)

@bgilbert
Copy link
Contributor Author

bgilbert commented Mar 9, 2021

Okay, so your proposal is:

  1. If not reprovisioning the rootfs, transposefs walks through the Ignition config, resolves every storage.disks.device path back to a device node, and compares it to the device node for the root disk. If it finds a match, it dynamically generates an Ignition dropin for partition 4 of that device that sets the resize flag, and also a sizeMiB if the disk specifies an additional partition. If it doesn't find a match, it generates a dropin that only sets resize on partition 4 of the boot disk. We document that on FCOS, resize for the root partition will default to true under certain conditions, and size will default to MIN_SIZE under a subset of those conditions.
  2. growfs still exists, solely for filesystem resizing, and runs only in the no-reprovision case. Alternatively, we teach Ignition to resize filesystems, which is functionality which in practice will only be needed for the root filesystem, but which requires it to know how to resize all supported filesystem types.

There's lots of advantages to this, including (1) keeping all the knowledge about block devices and filesystems in Ignition, (2) sizing LUKS containers and filesystems at the right size from the start, and (3) less shell code.

(1) isn't really true, since transposefs and maybe growfs would still be doing a bunch of partition and filesystem manipulation. We're already planning to address (2) via #759. And re (3), I'm not sure the above pile of JSON manipulation in shell would be a net win.

@jlebon
Copy link
Member

jlebon commented Mar 9, 2021

Thanks for the sanity-check. Wanted to think through this idea.

@jlebon
Copy link
Member

jlebon commented Mar 10, 2021

Was talking to @bgilbert about this. The situation now is: we've decided to put the burden of sizing the rootfs correctly at provisioning time. With the plan above to have a specific size as a contract and additional bits to warn (and eventually error in some cases) to catch violations, this issue will be considered resolved.

@jlebon
Copy link
Member

jlebon commented Apr 12, 2021

I suggest leaving this issue open to track the move to erroring out.

@bgilbert
Copy link
Contributor Author

bgilbert commented Jun 8, 2021

The console warning from coreos/fedora-coreos-config#850 says:

############################################################################
WARNING: The root filesystem is too small. It is strongly recommended to
allocate at least 8 GiB of space to allow for upgrades. From June 2021, this
condition will trigger a failure in some cases. For more information, see:
https://docs.fedoraproject.org/en-US/fedora-coreos/storage/

You may delete this warning using:
sudo rm /etc/motd.d/60-coreos-rootfs-size.motd
############################################################################

It's June 2021, so I think we can start failing.

@dustymabe
Copy link
Member

It's June 2021, so I think we can start failing.

where do we make that change?

@dustymabe
Copy link
Member

dustymabe commented Oct 7, 2021

bump

also, what happens if I'm on an SD card on a RPi4 that is 8G - hard fail from now on?

@jlebon
Copy link
Member

jlebon commented Oct 28, 2021

bump

also, what happens if I'm on an SD card on a RPi4 that is 8G - hard fail from now on?

Only a warning. We hard fail if the rootfs is smaller than 8G and partitions were added after it (trapping the rootfs). Check out the box in https://docs.fedoraproject.org/en-US/fedora-coreos/storage/.

@bgilbert bgilbert added the status/pending-action Needs action label Aug 1, 2023
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
We've recently had our first case of a "trapped" rootfs running out of
space for upgrades:

coreos/fedora-coreos-tracker#731

Until we actually implement stronger behaviour for this, let's
explicitly check for this case and emit a warning if we detect it. In
the future, we'll look at making this a hard error by default (with an
escape hatch).

For more information, see:

coreos/fedora-coreos-tracker#586 (comment)
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
We've recently had our first case of a "trapped" rootfs running out of
space for upgrades:

coreos/fedora-coreos-tracker#731

Until we actually implement stronger behaviour for this, let's
explicitly check for this case and emit a warning if we detect it. In
the future, we'll look at making this a hard error by default (with an
escape hatch).

For more information, see:

coreos/fedora-coreos-tracker#586 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants