Determine how to handle automatic rollback #47
cc @LorbusChris
For more context on this, automatic rollback was one of the main objectives of the GSoC project in which @LorbusChris took part. Coincidentally, there was work at the same time in both grub2 and systemd to support boot counting (see systemd/systemd#9437 and rhboot/grub2#24). One of the outcomes was that we could standardize on …
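For readers unfamiliar with grubenv: it's a fixed-size (1024-byte) environment block that GRUB can read and rewrite in place, without allocating new filesystem blocks. A rough sketch of what such a block looks like on disk (normally you'd use `grub2-editenv`; the variable names below follow the boot-counting proposals and are illustrative):

```shell
# Build a GRUB environment block by hand: a signature line, key=value
# pairs, and '#' padding out to exactly 1024 bytes. Because the size is
# fixed, GRUB can overwrite it in place without touching fs metadata.
env_file=$(mktemp)

{
  printf '# GRUB Environment Block\n'
  printf 'boot_counter=2\n'   # tries remaining (illustrative variable)
  printf 'boot_success=0\n'   # set to 1 by userspace after a good boot
} > "$env_file"

# Pad with '#' so the file is exactly 1024 bytes.
size=$(wc -c < "$env_file")
head -c $((1024 - size)) /dev/zero | tr '\0' '#' >> "$env_file"

wc -c < "$env_file"
```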
Huh, didn't know about grubenv. Looks very useful! I'm fine with that approach; I'm still a fan of:
A lot of discussion about boot counting and success determination happened over in the greenboot GSoC project for fedora-iot: https://pagure.io/fedora-iot/issue/12. Specifically, I think we got it down to two variables for state that needed to be tracked. Here is a high-level state diagram we ended up with:
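As a sketch of how a two-variable scheme like greenboot's (`boot_counter` and `boot_success` in the grubenv) can drive the rollback decision, here's a simulation in plain shell; the function names and flow are illustrative, not greenboot's actual code:

```shell
# Simulate the two-variable boot-counting state machine with shell
# variables standing in for grubenv entries.
# boot_counter: tries remaining for the new deployment
# boot_success: whether the last boot passed its health checks
boot_counter=2
boot_success=0

select_entry() {
  # What the bootloader side would do on each boot.
  if [ "$boot_success" -eq 1 ]; then
    echo "default"                      # steady state: boot the default
  elif [ "$boot_counter" -gt 0 ]; then
    boot_counter=$((boot_counter - 1))  # burn one try on the new deployment
    echo "new"
  else
    echo "rollback"                     # tries exhausted: previous deployment
  fi
}

mark_success() {
  # What a userspace health-check service would do after a good boot.
  boot_success=1
  boot_counter=0
}

select_entry    # prints "new" (counter 2 -> 1)
select_entry    # prints "new" (counter 1 -> 0)
select_entry    # prints "rollback"
```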
+1
+1, but we can ship some defaults I think that will help
+1 we'll need to add some logic so that the updater knows an update failed and it shouldn't retry it
I think in greenboot we decided to always try to boot the last known successful option
+1, we already have this today
We do keep two deployments around today, but they aren't guaranteed to be "successful installs". I.e. if you attempt an upgrade, it fails, and you roll back, then you'll only have two deployments, but one of them will be "bad". I'd say that's the one exception.
Regarding that state diagram: LGTM (although I don't see much point in a boot counter starting at 2; if it fails once, that's probably a good reason not to try again), but we need to figure out what to do when picking between multiple successful entries (both to make sure the correct one is GC'd and to know which to boot). I also don't think we should ever be in a state where we have only one successful install, other than the first install; i.e. if an install fails, don't GC the successful one.
I'll just make a note that we discussed this during the community meeting, and we concluded that the OSTree model is less susceptible to issues like coreos/bugs#2457 since we don't actually GC until we successfully prepare the next root (before reboot). But of course, preparing the next root successfully != guaranteed successful boot in that root. There is still a risk that we have a deployment which successfully prepares the root (and cleans up the previous deployment), but borks it in a subtle enough way that it's not actually bootable.
One other thing to keep in mind here is that today, rpm-ostree hardcodes running /bin/true in the new root before staging/deployment. We could easily support generalizing this to running arbitrary code in the new root as a container before even trying to boot it for real. Of course, if you're getting your ostree commits from any OS vendor that doesn't suck, they should have been tested server side. And if you're using package layering, you're going to end up running scripts in the new root which do more than /bin/true anyway. But the capability is there for us to do something more sophisticated if we wanted to.
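As a very rough illustration of that idea, the sketch below stands a plain directory in for the new root and runs a (made-up) check script from it; a real implementation would run it as a container with proper isolation rather than executing it directly:

```shell
# Stand a plain directory in for the "new root" and run a vendor- or
# user-supplied check from it before deciding to stage the deployment.
# The path and script name are invented for illustration.
newroot=$(mktemp -d)
mkdir -p "$newroot/usr/lib"

cat > "$newroot/usr/lib/pre-deploy-check" <<'EOF'
#!/bin/sh
# Exit non-zero to veto the deployment.
exit 0
EOF
chmod +x "$newroot/usr/lib/pre-deploy-check"

if "$newroot/usr/lib/pre-deploy-check"; then
  echo "check passed: ok to stage deployment"
else
  echo "check failed: refusing to stage"
fi
```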
My opinion on this is that until we have a design that has automated tests and has been carefully audited for correctness, we shouldn't ship anything here. I have a short term PR to disable the current grub2 behavior that I think we should also apply to FAH29. |
There was a thread on the XFS list about the grubenv approach: https://marc.info/?l=linux-xfs&m=153740791327073&w=2. It migrated to include linux-fsdevel: https://marc.info/?l=linux-fsdevel&m=153741350128439&w=2. TL;DR: the filesystem developers are against it.
I'm inclined to agree.
That is... unfortunate. It looks like that's one of the only places grub can write to, and grub does need to write something if we want to handle failures where we can't get to (working) userspace. What really sucks is that worst case we only need 9 bits total (grub only needs to write the tries count, which is 0, 1, or 2, plus a priority bit for each install, of which there can be at most 3). Ugh. I'm not sure what exactly to do about that. We're left with three options:
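To make the 9-bit arithmetic concrete: each install needs 2 bits for the tries counter (values 0-2) plus 1 priority bit, and with at most 3 installs that's 3 x 3 = 9 bits. A toy packing scheme for one install's state:

```shell
# Pack per-install boot state into 3 bits: 2 for the tries counter
# (0..2) and 1 for priority. Three installs would need 9 bits total.
pack()  { echo $(( ($1 << 1) | $2 )); }   # args: tries (0..2), priority (0/1)
tries() { echo $(( $1 >> 1 )); }
prio()  { echo $(( $1 & 1 )); }

state=$(pack 2 1)   # fresh install: 2 tries left, priority set
echo "state=$state tries=$(tries "$state") prio=$(prio "$state")"
```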
Option 1 is a non-starter for me; it defeats the point. Option 2 isn't great but might work as a stopgap. If we started with 2 and planned to move to 3, we'd also need a migration plan. Option 3 is also not great, because writing good bootloader code is hard and error prone.
Proposal:
Next steps/problems to solve:
Misc notes:
Instead of using …
Can you provide more motivation for these? Thinking more on it, I think I understand where it's coming from, but without the motivation explicitly written out, it's hard to provide useful feedback/improvements. So IIUC, it essentially comes down to: (1) the only place we can write data to from GRUB is the env block, … Is that more or less correct?
Yeah, that's correct, plus a little extra. To summarize:
This makes sense, though I am a little worried about the complexity of teaching ostree to maintain a static number of deployments. Or alternatively, to dynamically expand the grub env block.
Note that this is somewhat orthogonal, since grub has learned to parse the BLS fragments; I'm not sure if there are any blockers to just turning that on.
Interesting. @ajeddeloh, @LorbusChris, could you include that in the investigation?
Ultimately we need to ensure that the ordering logic for grub and ostree is the same (so ostree overwrites the right deployments). I think we still want the grub env to be the source of truth, right? This is a shift away from ostree maintaining the source of truth, but it really should be something that both ostree and grub can access. I think it's critical we get this sorted out first, since it impacts everything else. BLS fragments could be useful for pinning deployments. I need to dig into how they work under the hood (i.e. is it some fancy grub script or is it baked into grub itself), but I can imagine having the 2/3 deployments that are managed by the grub env plus any number of pinned ones. Your boot menu could look like:
This assumes the BLS is implemented in a way where entries can be merged with a static config.
I'm generally +1 to a static (or mostly static) handwritten config. One caveat though: on CL, our kernel command line has changed over time, and we don't have any way to update old bootloader configs. This means that new OS releases have to work with old kernel command lines, forever. It'd be good to avoid that on FCOS. Maybe the command line could come from a GRUB fragment installed alongside each kernel? |
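One way this could look (file paths and variable names here are hypothetical, not an agreed design): a static, handwritten grub.cfg that sources a small per-kernel fragment carrying that kernel's command line, so old bootloader configs never pin old kargs:

```
# /boot/grub2/grub.cfg -- static and handwritten (sketch)
menuentry 'Fedora CoreOS' {
    # Pull in the command line shipped alongside this kernel, so new
    # OS releases aren't stuck with args written by old installers.
    source ($root)/fcos/cmdline.cfg
    linux ($root)/fcos/vmlinuz $kernel_cmdline
    initrd ($root)/fcos/initramfs.img
}

# /boot/fcos/cmdline.cfg -- shipped/updated with each kernel
set kernel_cmdline="mitigations=auto console=tty0"
```

The key property is that only the fragment, not the main config, changes across releases.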
+1 to updatable snippets, but I think it's important to note that these should be carefully chosen and not generated. Generated grub snippets tend to contain a lot of cruft that doesn't always apply and makes determining what is needed/not needed hard in addition to making it harder to read. |
I like the idea of ostree commits containing the defaults: ostreedev/ostree#479 (A grub fragment would mean the BLS configs are not truth) |
BLS configs are grub specific. Does ostree expose any sort of bootloader-agnostic source of truth with the same info that would be used to generate the BLS config / other bootloaders' configs? (i.e. deployment X has kernel …)
I'm confused - nothing in ostree by default for upgrades uses the current deployment's BLS config as the kernel arguments for the new deployment. However, one can add/remove args when making new deployments. (You can also set the kargs for an existing deployment, although I would tend to discourage this.) The idea with that ostree issue is that it'd be kind of like …
Arg, I'm mistaken. BLS configs are not grub specific. Correct me if I'm wrong but if ostree does not detect a bootloader then it doesn't write out the BLS configs, right? I'm looking for a way of querying ostree to say "what are the bits that would go into a BLS config" without actually creating one.
That's not contained in the ostree commit then, is it?
ostree always writes out the BLS configs - the BLS configs are the source of truth for the list of deployments. If you had no configs, … If ostree doesn't detect your bootloader, it won't e.g. regenerate the bootloader's own config. Try booting FCOS and do:
It'll barf because it can't find your booted deployment anymore.
Right, not today; the kernel args live in the BLS fragments. |
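For reference, a BLS fragment (a file under /boot/loader/entries/) carries exactly these bits; the values below are placeholders, not taken from a real deployment:

```
title   Fedora CoreOS (ostree:0)
version 29.0
linux   /ostree/fedora-coreos-<checksum>/vmlinuz-<version>
initrd  /ostree/fedora-coreos-<checksum>/initramfs-<version>.img
options root=UUID=<uuid> rw ostree=/ostree/boot.1/fedora-coreos/<checksum>/0
```

The `options` line is where the kernel args live, which is why "where do kargs come from" and "who owns the BLS fragments" are really the same question.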
I think this should come from the commits. I'm not sure how I feel about user supplied args and how they should be managed. In my ideal world they'd be completely separate from the BLS config and get pulled in by the static grub config. Whether they are part of a deployment or exist outside of it (like the static grub config) is another question. I'm not sure if grub's current BLS implementation allows adding on extra bits to the menuentries it generates though, which would make separating them impossible. |
yeah. I think colin referenced this RFE already: ostreedev/ostree#479 - i can maybe try to find someone to work on that.
we already manage user-supplied args with …
But then the args in the BLS config wouldn't be from the commit; they'd be from the commit plus user-specified ones. I suppose we could combine them at deploy time, but it'd be nice to have a clear separation of what is part of the ostree commit and what is not.
can we not have both? i.e. we can combine them at deploy time, but store them separately so that there is a clear separation (at least to someone investigating a problem). |
That's better than not storing them separately, but it'd be better if ostree didn't need to combine them at all. One less thing to go wrong or to confuse users. In general, the less merging/mangling/etc. of configs, the better (says the vocal supporter of …)
Since you're always creating an EFI system partition, and it's FAT, just use that for grubenv whether UEFI or BIOS. This way you're always using FAT for grubenv, and it's the kind of non-journaled, non-checksummed filesystem that grubenv was intended for. It is a slightly dirty hack, because why would a BIOS system use an ESP? Well, that's bullet 4 in the Boot Loader Specification - it says to use it as $BOOT. If you still think it's dirty, bullet 3 gives you a way out: change the partition type GUID of that partition from "EFI System" to "Extended Boot", which you can do during first boot on non-UEFI systems (as detected by a lack of efivars). Also, I'm pretty convinced you can make changes on FAT atomic:
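The usual userspace pattern for this is write-to-temp-then-rename; whether the rename step is genuinely atomic on FAT is exactly the open question here, so treat this as a sketch (paths are illustrative):

```shell
# Replace grubenv without ever exposing a partially written file:
# write the new content to a temp file, flush it to disk, then rename
# it over the old one.
boot=$(mktemp -d)   # stand-in for the ESP mount point
printf 'boot_success=1\n' > "$boot/grubenv"

printf 'boot_success=0\nboot_counter=2\n' > "$boot/grubenv.new"
sync                # ensure the new file's data hits disk first
mv "$boot/grubenv.new" "$boot/grubenv"

cat "$boot/grubenv"
```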
In discussion with the IoT folks we had some agreement that it was time to drive this functionality into ostree ostreedev/ostree#2725 |
We want to bring forward Container Linux's automatic rollback model and probably extend it even further. Automatic rollbacks can't solve every problem (since in some cases it may mean downgrading something like docker which is an unsupported operation) but it works well to protect against kernel issues and other such problems.
CL's model currently uses the GPT attribute bits to record whether a partition has been tried and whether it was successfully booted. On a successful boot, update_engine waits 45 seconds and then marks the boot as successful.
We're not using A/B partitions in FCOS, so we can't use the GPT priority bits (and I think we shouldn't regardless, but that's beside the point).
Ostree currently does not support automatic rollback (@cgwalters please correct me if I'm wrong), so we'll need to implement it.
note: I'm going to use the term "install" to mean an ostree/kernel combo for FCOS (this would be a kernel/usr-partition combo for CL).
Essentially there are four states an install can be in (in order of what should be chosen to boot)
Goals I think we ought to strive for:
My proposal:
So I think we should use flag files in `/boot` like we do for Ignition. When creating a new install, its kernel gets written to `/boot` along with two flag files: `untested` and `failed`. There should only ever be one install with both flags. Additionally, there should be a flag file `recent` which indicates which install to boot in the case of two successful installs. Here is a table of what combinations of flags mean what:
The grub config should select installs in this order:

1. Installs with both the `untested` and `failed` flags
2. The install with the `recent` flag
3. Installs without the `failed` flag
flag.When grub selects one it immediately removes the
untested
flag. On a successful boot a systemd unit (tbd: integrate this with greenboot?) adds the recent flag, removes the recent flag from the old entry, then removes the failed flag.This proposal does hinge on grub being able to delete files, which I haven't confirmed yet. It also means ostree wouldn't need to write out any grub configs at all, just empty files.
Edit: hrmmm. Grub doesn't seem to be able to write to or delete files. That makes the whole "recover from a bad kernel" bit hard.
Thoughts?
cc @cgwalters and @jlebon for the ostree bits and @bgilbert to keep me honest about how CL works.