Consider disabling emergency shell timeout and reboot if an error is hit in the initramfs on first boot #928

Closed
jlebon opened this issue Aug 16, 2021 · 7 comments
Assignees
Labels
jira (for syncing to jira), kind/enhancement

Comments

@jlebon
Member

jlebon commented Aug 16, 2021

Right now if there's an error in the initramfs, we get:

Press Enter for emergency shell or wait 5 minutes for reboot.

But rebooting may hide important error messages and then the next boot may fail in a different way due to firstboot assumptions being violated.

We should just disable that timeout and maybe even automatically enter the emergency shell.

@bgilbert
Contributor

bgilbert commented Aug 16, 2021

In general, Ignition is expected to be idempotent. The transposefs glue is very much not, though, and Ignition has gaps around file appending (coreos/ignition#642) and apparently also LUKS keyfiles.

Also, on the RHCOS side I have been told that at least one customer depends on the reboot semantics.

@bgilbert
Contributor

The automatic reboot was originally implemented in CL to give GRUB a chance to fall back to the known good OS release after an update failure. We don't have such code in our GRUB, and retrying provisioning at the OS level does seem likely to exercise under-tested code paths. That still leaves the issue of users that have taken a dependency on the current behavior.

One intermediate option is to lock out the automatic reboot if transposefs has been engaged.

@jlebon
Member Author

jlebon commented Aug 19, 2021

Right yeah, the initramfs is much more distro glue than it is Ignition at this point. And none of it really accounts for half-provisioned systems, and new code going forward likely won't either. So I don't think we should scope this to just transposefs.

If there are people who depend on the current behaviour, we should find out why and fix the underlying issue (e.g. by continuing the recent trend of just retrying operations forever on transient errors).
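To make the "retry forever on transient errors" idea concrete, here is a minimal sketch in Python. It is not the code used in the initramfs (that glue is shell and systemd units); `TransientError`, `retry_forever`, and the backoff parameters are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a network blip)."""


def retry_forever(
    operation: Callable[[], T],
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds, retrying only on TransientError.

    Any other exception propagates immediately, so a genuine provisioning
    failure still surfaces instead of being retried.
    """
    delay = initial_delay
    while True:
        try:
            return operation()
        except TransientError:
            sleep(delay)
            # Exponential backoff, capped so the wait between attempts
            # never grows beyond max_delay.
            delay = min(delay * 2, max_delay)
```

The point of this shape is that a transient failure never reaches emergency.target at all, so nobody needs the reboot as a crude retry mechanism, while non-transient failures still stop the boot where the error is visible.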

Re. automatic rollback, note I'm suggesting we do this only for the first boot, because it's special to us. So there'd be nothing to roll back to anyway.

@jlebon jlebon added the meeting topics for meetings label Aug 25, 2021
@cgwalters
Member

This is related to coreos/ignition-dracut#137

@jlebon
Member Author

jlebon commented Aug 25, 2021

This was discussed in today's community meeting:

13:32:46 < jlebon> #agreed we will disable the automatic reboot timeout upon hitting emergency.target
                   in the initramfs on first boot
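
As a rough illustration of what was agreed, the decision can be modelled as a small function. This is a sketch only, written in Python for readability: the real change lives in the initramfs glue, and `EmergencyPlan`, `plan_emergency`, and the first-boot prompt wording are hypothetical. Only the 5-minute timeout and the existing prompt text come from this issue.

```python
from dataclasses import dataclass
from typing import Optional

# The "wait 5 minutes for reboot" from the existing prompt, in seconds.
REBOOT_TIMEOUT_SECONDS = 5 * 60


@dataclass(frozen=True)
class EmergencyPlan:
    prompt: str
    reboot_after: Optional[int]  # None means wait indefinitely, never reboot


def plan_emergency(first_boot: bool) -> EmergencyPlan:
    """Decide what to do when emergency.target is hit in the initramfs."""
    if first_boot:
        # Agreed behavior: no automatic reboot on first boot, so the
        # error messages stay on screen. Prompt wording is an assumption.
        return EmergencyPlan(
            prompt="Press Enter for emergency shell.",
            reboot_after=None,
        )
    # Behavior on later boots as described at the top of this issue.
    return EmergencyPlan(
        prompt="Press Enter for emergency shell or wait 5 minutes for reboot.",
        reboot_after=REBOOT_TIMEOUT_SECONDS,
    )
```

Dropping the reboot on all boots, rather than only the first, would amount to always taking the first branch.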

@jlebon jlebon removed the meeting topics for meetings label Aug 25, 2021
@bgilbert
Contributor

> Also, on the RHCOS side I have been told that at least one customer depends on the reboot semantics.

I've talked to the person I originally heard this from, and was unable to track down the reference. So I don't have anything concrete to offer here.

At this point I think we should drop the automatic reboot on all boots, not just the first boot. The current behavior hides intermittent boot bugs, and the main reason to keep it is to avoid uncovering them. Let's just take the leap and fix the bugs.

@jlebon jlebon added the jira for syncing to jira label Mar 8, 2022
prestist added a commit to prestist/fedora-coreos-config that referenced this issue Apr 22, 2022
The reboot and consequently the timeout masked valuable debug
information. The reboot also caused some cascading errors due
to the fact that the system would try and run as if all required
dependencies were satisfied during the first boot. The issue can be
found at coreos/fedora-coreos-tracker#928
prestist added a commit to prestist/fedora-coreos-config that referenced this issue Apr 22, 2022
The reboot and consequently the timeout masked valuable debug
information. The reboot also caused some cascading errors due to the
fact that the system would try and run as if all required dependencies
were satisfied during the first boot.

Closes coreos/fedora-coreos-tracker#928.
@travier
Member

travier commented May 12, 2022

@dustymabe We missed labeling this one for releases. Will be looking at which one it went into.

HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
The reboot and consequently the timeout masked valuable debug
information. The reboot also caused some cascading errors due to the
fact that the system would try and run as if all required dependencies
were satisfied during the first boot.

Closes coreos/fedora-coreos-tracker#928.
6 participants