Old bootloader versions don't boot new aarch64 6.2+ kernels #1441

Closed
dustymabe opened this issue Mar 15, 2023 · 38 comments
Labels
jira for syncing to jira

Comments

@dustymabe
Member

dustymabe commented Mar 15, 2023

I just proactively updated my t4g.medium AWS instance to 38.20230310.1.0 and it didn't come back. On the serial console I see:

error: ../../grub-core/loader/arm64/linux.c:58:invalid magic number.
error: ../../grub-core/loader/arm64/linux.c:278:you need to load the kernel
first.  
        
Press any key to continue...

Pressing a key and selecting the older boot entry (thankfully I had console access) allowed me to re-connect with my system.

This system was provisioned a long time ago with 34.20210904.2.0 (testing stream; later moved over to the next stream to allow for earlier testing).

The problem here is that the bootloader isn't updated by default, so each machine keeps the one it was first installed with. bootupd was created to solve this problem, but is still a work in progress so not widely used.

Here's what it shows on my system:

[core@dustymabe ~]$ sudo bootupctl status 
Component EFI
  Installed: grub2-efi-aa64-1:2.06-2.fc34.aarch64,shim-aa64-15.4-4.aarch64
  Update: Available: grub2-efi-aa64-1:2.06-88.fc37.aarch64,shim-aa64-15.6-2.aarch64
No components are adoptable.
CoreOS aleph image ID: fedora-coreos-34.20210904.2.0-qemu.aarch64.qcow2
Boot method: EFI
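For anyone who wants to check several machines for a stale bootloader, the status text above is simple to parse. This is a minimal Python sketch, assuming the output keeps the layout of this one sample; the function name `parse_bootupctl_status` and the dictionary keys are made up for illustration, and other bootupd versions may print something different.

```python
# Sample text copied from the `bootupctl status` output quoted above.
SAMPLE_STATUS = """\
Component EFI
  Installed: grub2-efi-aa64-1:2.06-2.fc34.aarch64,shim-aa64-15.4-4.aarch64
  Update: Available: grub2-efi-aa64-1:2.06-88.fc37.aarch64,shim-aa64-15.6-2.aarch64
No components are adoptable.
CoreOS aleph image ID: fedora-coreos-34.20210904.2.0-qemu.aarch64.qcow2
Boot method: EFI
"""


def parse_bootupctl_status(text):
    """Pull installed/available package lists and the boot method out of the text.

    Assumption: line prefixes match the sample above exactly.
    """
    result = {"installed": [], "available": [], "boot_method": None}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Installed:"):
            # Split only on the first colon; package names contain colons too.
            result["installed"] = line.split(":", 1)[1].strip().split(",")
        elif line.startswith("Update: Available:"):
            result["available"] = line[len("Update: Available:"):].strip().split(",")
        elif line.startswith("Boot method:"):
            result["boot_method"] = line.split(":", 1)[1].strip()
    return result


status = parse_bootupctl_status(SAMPLE_STATUS)
print(status["installed"])
print(status["available"])
```

On a live system the text would come from running `bootupctl status` (for example through Python's `subprocess.run` with `capture_output=True, text=True`) rather than from a hard-coded sample.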

After updating the bootloader...

[core@dustymabe ~]$ sudo bootupctl update
Updated EFI: grub2-efi-aa64-1:2.06-88.fc37.aarch64,shim-aa64-15.6-2.aarch64

I am able to boot the system:

[core@dustymabe ~]$ rpm-ostree status 
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; periodically polling for updates (last checked Wed 2023-03-15 19:38:43 UTC)
Deployments:
● fedora:fedora/aarch64/coreos/next
                  Version: 38.20230310.1.0 (2023-03-10T22:51:50Z)
                   Commit: b0fdf736cdbbd3971380d5549635e30155f07af6100925d987de623b4722637f
             GPGSignature: Valid signature by 6A51BBABBA3D5467B6171221809A8D7CEB10B464

  fedora:fedora/aarch64/coreos/next
                  Version: 37.20230303.1.0 (2023-03-06T18:55:26Z)
                   Commit: 0e785d34bddf7ff985fe49a4a9bdf2e88050c366f02b19f28df74b67fb3792ae
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A

This is most likely due to recent changes for aarch64 kernels around EFI_ZBOOT, which we also think is the root cause for #1430.

@dustymabe dustymabe changed the title from "next: aarch64: old bootloaders can't boot new kernels" to "next: aarch64: old bootloaders can't handle new kernels" on Mar 15, 2023
cgwalters added a commit to cgwalters/bootupd that referenced this issue Mar 15, 2023
@cgwalters
Member

> but is still a work in progress so not widely used.

➡️ coreos/bootupd#439

> After updating the bootloader...I am able to boot the system:

Yeah. One option here I guess is to add cross-checks where at least rpm-ostree (or zincati) know how to query bootupd and block updates if it's too old.

jlebon added a commit to jlebon/fedora-coreos-streams that referenced this issue Mar 15, 2023
We found a last-minute issue on updating aarch64 nodes:
coreos/fedora-coreos-tracker#1441

Let's cancel the rollout while we figure out how to address this.
@bgilbert
Contributor

> One option here I guess is to add cross-checks where at least rpm-ostree (or zincati) know how to query bootupd and block updates if it's too old.

I mean, ideally bootupd would update the bootloader automatically. How far are we from being able to do that?

@cgwalters
Member

@bgilbert
Contributor

Well, it says "perhaps in the future bootupd will use some of those". If we prioritized doing the work, do we have the ability to implement it today, and how much effort would it be?

@cgwalters
Member

Updating automatically is trivial, just a systemd unit that runs bootupctl update. Anyone who wants that has been able to do so for a long time. The intermediate model of "only run bootupctl update when it's truly necessary" is probably a week of work and testing. What's not trivial is updating transactionally.

dustymabe pushed a commit to coreos/fedora-coreos-streams that referenced this issue Mar 15, 2023
We found a last-minute issue on updating aarch64 nodes:
coreos/fedora-coreos-tracker#1441

Let's cancel the rollout while we figure out how to address this.
@jlebon
Member

jlebon commented Mar 15, 2023

> The intermediate model of "only run bootupctl update when it's truly necessary" is probably a week of work and testing.

To flesh this out, this would be something like: attach metadata to each release about the "minimum bootloader version" and then enhance rpm-ostree to update the bootloader using bootupd if it detects that it's too old before finalizing an update?
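One way to picture that proposal is the sketch below. It is only an illustration: the `min_bootloader` metadata field, the `ReleaseMetadata` type, and the step names are hypothetical and do not exist in rpm-ostree or in the release metadata today. The minimum of 2.06-60 used in the example is chosen only because it is the oldest GRUB build reported in this thread to boot a 6.2 kernel.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (major, minor, package release), e.g. (2, 6, 60) for GRUB 2.06-60.
BootloaderVersion = Tuple[int, int, int]


@dataclass(frozen=True)
class ReleaseMetadata:
    version: str
    # Hypothetical field: oldest bootloader able to boot this release.
    min_bootloader: BootloaderVersion


def plan_update(release: ReleaseMetadata, installed: BootloaderVersion) -> List[str]:
    """Return the ordered steps needed before the new deployment is finalized."""
    steps = []
    # Tuples compare element by element, so (2, 6, 52) < (2, 6, 60).
    if installed < release.min_bootloader:
        steps.append("update-bootloader")
    steps.append("finalize-deployment")
    return steps


target = ReleaseMetadata(version="38.20230310.1.0", min_bootloader=(2, 6, 60))
print(plan_update(target, installed=(2, 6, 2)))
print(plan_update(target, installed=(2, 6, 88)))
```

Keeping the decision as a pure function makes it easy to test separately from whatever component would actually call `bootupctl update`.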

This is not something we want to do in a rush, but we may be in a situation where we can pin on an older kernel for now to give us time to do something like this.

The risk I see is that we haven't been updating the bootloader at all so far, so now years of potential regressions could manifest all at once. Though that's a concern anyway whether users do it or we do it. And we know we want to eventually have automatic bootloader updates.

@dustymabe
Member Author

New investigation information:

This is confirmed to be specific to the 6.2 kernels and isn't tied to the Fedora 38 major release (i.e. when 6.2 hits F37 our testing and stable streams would also be affected if we don't pin).

@dustymabe
Member Author

dustymabe commented Mar 15, 2023

I was trying to gauge how far back you'd have to go to get a bootloader/EFI binary that wasn't compatible. In my random sampling here's what I found:

  • 36.20220906.1.0 ❌ (won't boot 6.2 kernel)
    • Installed: grub2-efi-aa64-1:2.06-52.fc36.aarch64,shim-aa64-15.6-2.aarch64
  • 37.20221111.1.0 ✔️ (boots 6.2 kernel)
    • Installed: grub2-efi-aa64-1:2.06-60.fc37.aarch64,shim-aa64-15.6-2.aarch64
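The two samples above bracket the breaking point without pinning it down. As a rough sketch, the package strings can be sorted into "known bad", "known good", and "untested" by their release number. This assumes that breakage is monotonic in the release number for GRUB 2.06 (older than the bad sample is bad, newer than the good sample is good), which matches the four builds mentioned in this thread but is otherwise unverified. The names `grub_release` and `classify` are made up for illustration.

```python
import re

# Matches the GRUB package strings quoted in this thread,
# e.g. "grub2-efi-aa64-1:2.06-52.fc36.aarch64".
GRUB_RE = re.compile(
    r"^grub2-efi-aa64-(?:\d+:)?(?P<version>[\d.]+)-(?P<release>\d+)\.fc\d+\.aarch64$"
)

# Data points from the sampling above; releases in between were not tested.
NEWEST_KNOWN_BAD = 52   # 2.06-52.fc36 did not boot a 6.2 kernel
OLDEST_KNOWN_GOOD = 60  # 2.06-60.fc37 booted a 6.2 kernel


def grub_release(package):
    """Return the integer package release of a GRUB 2.06 aarch64 EFI package."""
    match = GRUB_RE.match(package)
    if match is None or match.group("version") != "2.06":
        raise ValueError(f"unrecognized GRUB package string: {package!r}")
    return int(match.group("release"))


def classify(package):
    """Classify a package against the two sampled data points."""
    release = grub_release(package)
    if release <= NEWEST_KNOWN_BAD:
        return "known-bad"
    if release >= OLDEST_KNOWN_GOOD:
        return "known-good"
    return "untested"


for pkg in (
    "grub2-efi-aa64-1:2.06-52.fc36.aarch64",
    "grub2-efi-aa64-1:2.06-60.fc37.aarch64",
):
    print(pkg, classify(pkg))
```

Anything that lands in "untested" would need an actual boot test to settle.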

@travier
Member

travier commented Mar 16, 2023

F36 is supposed to be able to update to F38 per Fedora policies so we should file a bug for that. Likely for the kernel.

This also likely affects Silverblue/Kinoite/Sericea & IoT (and we don't have bootupd there yet unfortunately): fedora-silverblue/issue-tracker#120

@dustymabe
Member Author

dustymabe commented Mar 16, 2023

> F36 is supposed to be able to update to F38 per Fedora policies so we should file a bug for that. Likely for the kernel.

I think this might not affect non-OSTree EFI systems because (IIUC) the bits get updated by the RPM on upgrade so we need to be careful about how we approach this. Justin Forbes is aware of the problem.

It's also possible a fully updated F36 system can update to F38, but for FCOS we stopped building F36 when F37 came out.

> This also likely affects Silverblue/Kinoite/Sericea & IoT (and we don't have bootupd there yet unfortunately): fedora-silverblue/issue-tracker#120

Note that this is a limited failure scenario IIUC. The releases had to have been building and releasing for aarch64 since sometime in f36. That at least removes Sericea from the list. Not sure about Kinoite. Also, there are probably not that many people running aarch64 since there is limited laptop/desktop hardware available there.

For IoT I sent an FYI email to their mailing list.

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Mar 16, 2023
The new 6.2 kernel can cause aarch64 systems originally installed
on F36- to not boot. For now while we figure out the best path forward
we'll ship the newest 6.1 kernel we can find, which just happens to
be built against F37.

See coreos/fedora-coreos-tracker#1441
@dustymabe
Member Author

dustymabe commented Mar 16, 2023

Since this problem is introduced with 6.2 kernels, my short-term proposal for FCOS next is to pin on the latest F37 6.1 kernel and skip the 38.20230310.1.0 rollout completely while we come up with our next steps.

For our next steps, I discussed this briefly with @jlebon yesterday. Here are a few potential options for us:

  • run bootupctl update on every update
  • run bootupctl update on select updates (maybe once a fedora major or something)
  • use a systemd generator to detect when this specific problem (old bootloader on aarch64/EFI) would occur, then disable zincati and drop down CLHM warnings telling the user what to do
    • this puts the bootloader update (and the risk) in the user's hands

@bgilbert
Contributor

Note that we only caught this in advance because Dusty had an aarch64 system that he manually updated before the rollout started. Thanks, Dusty, for doing that! ...but we shouldn't rely on it. How can we improve our upgrade testing to catch this case?

@dustymabe
Member Author

> How can we improve our upgrade testing to catch this case?

I added myself as assignee for coreos/coreos-assembler#2519 to make it happen.

@travier
Member

travier commented Mar 17, 2023

Filed coreos/bootupd#440

@travier
Member

travier commented Mar 17, 2023

Filed fedora-silverblue/issue-tracker#434 for Silverblue/Kinoite

@travier
Member

travier commented Mar 17, 2023

Also filed: coreos/bootupd#441

@bgilbert
Contributor

Some background info on the situation:

The 6.2 aarch64 kernel is not compatible with the GRUB bootloader versions shipped with older releases of Fedora CoreOS. Because Fedora CoreOS does not routinely update GRUB, we must do so explicitly before switching to a 6.2 kernel.

To do this, we'll ship a "barrier release" on each stream with a temporary systemd service that updates the bootloader. Our rollout system will force each node to install the barrier release before updating further, ensuring that all aarch64 machines update the bootloader. To reduce risk, non-aarch64 systems will not update the bootloader, but must still install the barrier release.
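The architecture gate described above can be sketched in a few lines of Python. This is not the service that shipped: the real one is a systemd unit whose contents are not shown in this thread. The helper name `bootloader_update_command` is hypothetical; `bootupctl update` is the command used earlier in this issue.

```python
import platform
from typing import List, Optional


def bootloader_update_command(machine: Optional[str] = None) -> Optional[List[str]]:
    """Return the command to run on this machine, or None to leave the bootloader alone.

    `machine` defaults to the running system's architecture string,
    which is "aarch64" on 64-bit ARM Linux.
    """
    if machine is None:
        machine = platform.machine()
    if machine != "aarch64":
        # Non-aarch64 systems install the barrier release but skip the update.
        return None
    return ["bootupctl", "update"]


print(bootloader_update_command("aarch64"))
print(bootloader_update_command("x86_64"))
```

A caller would pass a non-None result to something like `subprocess.run(cmd, check=True)` with root privileges, so that a failed bootloader update is reported instead of silently ignored.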

However, there's a further complication for next. Fedora 38 requires at least a 6.2 kernel, so the next barrier release must be based on Fedora 37. The barrier would therefore require 38.20230310.1.0 systems to downgrade to a Fedora 37 release, which is not generally supported in Fedora. As a result, we are marking 38.20230310.1.0 as a "dead-end release" which will not receive further updates.

The next barrier release, 37.20230303.1.1, has now fully rolled out. It will be followed shortly by a next release based on Fedora 38 with a 6.2 kernel. On the testing stream, this week's regular release will serve as the barrier release, and will promote normally to serve as the stable barrier release in two weeks. The testing and stable streams will then receive the 6.2 kernel in due course.
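To make the barrier and dead-end behavior concrete, here is a small model of how a node on the next stream would pick its update target. It is a simplified sketch, not the actual rollout system: the `Release` type and `next_hop` function are made up for illustration, releases are listed in publication order, and `38.FOLLOWUP` is a placeholder for the Fedora 38 based release that had not yet been published.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Release:
    version: str
    barrier: bool = False   # nodes must install this before going further
    dead_end: bool = False  # nodes on this release receive no more updates


def next_hop(current: str, releases: List[Release]) -> Optional[str]:
    """Return the version a node on `current` should update to next, if any."""
    index = [r.version for r in releases].index(current)
    if releases[index].dead_end:
        return None
    # Dead-end releases are never offered as update targets.
    later = [r for r in releases[index + 1:] if not r.dead_end]
    if not later:
        return None
    # Stop at the first barrier; otherwise jump straight to the newest release.
    for release in later:
        if release.barrier:
            return release.version
    return later[-1].version


NEXT_STREAM = [
    Release("37.20230303.1.0"),
    Release("38.20230310.1.0", dead_end=True),
    Release("37.20230303.1.1", barrier=True),
    Release("38.FOLLOWUP"),  # placeholder, not a real version string
]

for version in ("37.20230303.1.0", "38.20230310.1.0", "37.20230303.1.1"):
    print(version, "->", next_hop(version, NEXT_STREAM))
```

In this model a node on 37.20230303.1.0 must stop at the barrier 37.20230303.1.1 before moving to a 6.2 kernel, while a node already on 38.20230310.1.0 has no update path.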

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Mar 22, 2023
Since the barrier release for [1] shipped in 37.20230322.2.0
we can now unpin and ship the 6.2 kernel.

[1] coreos/fedora-coreos-tracker#1441
dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue Mar 23, 2023
Since the barrier release for [1] shipped in 37.20230322.2.0
we can now unpin and ship the 6.2 kernel.

[1] coreos/fedora-coreos-tracker#1441
@bgilbert bgilbert added the status/pending-stable-release label (Fixed upstream and in testing. Waiting on stable release.) and removed the status/pending-testing-release label (Fixed upstream. Waiting on a testing release.) on Mar 24, 2023
@dustymabe
Member Author

The fix for this went into testing stream release 37.20230322.2.0. Please try out the new release and report issues.

@bgilbert
Contributor

bgilbert commented Apr 4, 2023

Checklist in #1441 (comment) complete.

@dustymabe
Member Author

The fix for this went into stable stream release 37.20230322.3.0.

@dustymabe dustymabe removed the status/pending-stable-release label (Fixed upstream and in testing. Waiting on stable release.) on Apr 6, 2023
@bgilbert
Contributor

bgilbert commented May 2, 2023

It turns out that the fix doesn't work for systems with mirrored disks: #1485

HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
On aarch64, kernel 6.2 won't boot with older versions of GRUB.  In
preparation for switching to the new kernel, add a systemd service that
uses bootupd to update the bootloader on aarch64 systems.

Revert this after the next barrier release.

For coreos/fedora-coreos-tracker#1441.
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
This reverts commit 8ce6fd6.

The testing-devel promotion with this has already happened in
coreos#2315
and we have already shipped a `next` with the unit, so we can
drop it now (before we execute the `next-devel`->`next` promotion)
to prevent it from shipping in more than one release per stream.

We can do this because we have barriers.

Full context in coreos/fedora-coreos-tracker#1441
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
Since the barrier release for [1] shipped in 37.20230322.2.0
we can now unpin and ship the 6.2 kernel.

[1] coreos/fedora-coreos-tracker#1441