Add support for container-rebase #428

Closed
jdoss opened this issue Dec 29, 2022 · 12 comments

@jdoss

jdoss commented Dec 29, 2022

<walters> EDIT: Transferring this issue from rpm-ostree

Basically let's add something like:

variant: fcos
version: x
bootc:
  target: quay.io/example/customos:latest

One reason we should do this is that we need systemd unit ordering which correctly orders against ignition-firstboot-complete.target among others (see below).

Original issue follows:

Host system details

[root@appliance ~]# rpm-ostree status
State: idle
AutomaticUpdates: stage; rpm-ostreed-automatic.timer: inactive
Deployments:
● ostree-unverified-registry:registry.local:5000/appliance:devel
                   Digest: sha256:ce098ae1aeaff8663df6a8ae131f4ae7af70c810ae518f542fddc20ad20cbcad
                  Version: 37.20221211.3.0 (2022-12-29T18:37:01Z)

  fedora:fedora/x86_64/coreos/stable
                  Version: 37.20221211.3.0 (2022-12-26T13:53:28Z)
                   Commit: 93930f1bbe732751297fb7e5c4b7f3b79c563a803f3cf8c48115f84c541f86a7
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A

Expected vs actual behavior

When using the latest quay.io/fedora/fedora-coreos:stable image, based on 37.20221211.3.0, the symlinks for systemd units that are enabled within the layer are no longer present, so layered systemd units do not load on reboot.

Using an older version of FCOS, 36.20221001.3.0, works as expected.

Here are the steps in my container layer build process that show the symlinks being created:

STEP 20/25: WORKDIR /usr/src/appliance
--> 07fdcd5afef
STEP 21/25: RUN tar xf app.tar && ./install.sh
Created symlink /etc/systemd/system/default.target.wants/pod-appliance.service → /etc/systemd/system/pod-appliance.service.
--> dba68fb0482
STEP 22/25: WORKDIR /
--> a1fdc9a8b0a
STEP 23/25: COPY units/appliance-config.service /etc/systemd/system/appliance-config.service
--> 428764c8876
STEP 24/25: RUN systemctl enable appliance-config.service && touch /etc/appliance/env/appliance-config.env   && sed -i 's/#AutomaticUpdatePolicy.*/AutomaticUpdatePolicy=stage/' /etc/rpm-ostreed.conf
Created symlink /etc/systemd/system/default.target.wants/appliance-config.service → /etc/systemd/system/appliance-config.service.
--> 3907077e435
STEP 25/25: RUN ostree container commit

But when I reboot into this container layer

Fedora CoreOS 37.20221211.3.0
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/tag/coreos

[core@appliance ~]$ sudo su -
[root@appliance ~]# ls -lah /etc/systemd/system/default.target.wants/appliance-config.service
ls: cannot access '/etc/systemd/system/default.target.wants/appliance-config.service': No such file or directory

The symlink is not present. The unit file is however present on the file system from the layer:

[root@appliance ~]# ls -lah /etc/systemd/system/appliance-config.service 
-rw-r--r--. 1 root root 984 Dec 29 18:37 /etc/systemd/system/appliance-config.service

Expected:

Working systemd units after layering an image on FCOS and rebooting.

Steps to reproduce it

Use the latest quay.io/fedora/fedora-coreos:stable image based on 37.20221211.3.0. Add and enable a systemd unit in your Containerfile, layer that onto FCOS, reboot, and observe that the systemd unit is not enabled on boot.
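For anyone trying to reproduce this, the missing-symlink check can be scripted rather than done with a one-off ls. The following is a minimal sketch, not part of the original report: find_enablement_links is a hypothetical helper name, and it only looks under etc/systemd/system, which is where the build log above shows the links being created.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report).
# Usage: find_enablement_links ROOT UNIT
# Prints every *.wants/ or *.requires/ symlink for UNIT under
# ROOT/etc/systemd/system. Returns 0 if at least one link exists, 1 otherwise.
find_enablement_links() {
    root=${1%/}   # strip one trailing slash so "/" becomes an empty prefix
    unit=$2
    found=1
    for dir in "$root"/etc/systemd/system/*.wants "$root"/etc/systemd/system/*.requires; do
        # -L is true for a symlink even when its target does not exist.
        # An unmatched glob stays literal and simply fails this test.
        if [ -L "$dir/$unit" ]; then
            printf '%s\n' "$dir/$unit"
            found=0
        fi
    done
    return "$found"
}
```

On a booted host this would be called as find_enablement_links / appliance-config.service. Because it scans every .wants directory, it also shows whether the link landed under default.target.wants or multi-user.target.wants.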

@cgwalters
Member

I'm not immediately reproducing this problem when booting latest stable and rebasing to the tailscale example. This may relate to in-place upgrades? Or it might relate somehow to https://fedoraproject.org/wiki/Changes/Preset_All_Systemd_Units_on_First_Boot

Can you try to narrow down the reproduction scenario a bit more?

One thing that jumps out to me as a little odd is you're getting default.target.wants instead of multi-user.target.wants. Are you changing the default target?

@jdoss
Author

jdoss commented Jan 4, 2023

I pushed up an example here https://github.com/quickvm/fcos-layer-paperless-ngx that reproduces the problem. It layers quay.io/quickvm/paperless-ngx:broken which has the latest FCOS stable and it doesn't start the systemd units. quay.io/quickvm/paperless-ngx:stable works.

The default.target is from podman generated systemd units. I just used my paperless ngx script to set up the podman pod and service containers and then I dumped the systemd units with podman generate systemd.

I did try switching to multi-user.target on the units but the result was the same.

@cgwalters
Member

There's a full 4GB layer in there which makes testing things here a bit annoying 😄 Is it really necessary to reproduce? (I'm looking at it, just hoping it's not...)

@jdoss
Author

jdoss commented Jan 6, 2023

Yeah that can be annoying. You could edit the container file to not include the paperless tarballs and push that up to a registry and use that. You can still reproduce without that stuff because it won't have the units enabled via symlink.

I use a local registry to speed up my development workflows. Start a local registry on your workstation:

podman run --replace -d --rm --name local-registry -p 5000:5000 docker.io/library/registry:2

And push the container to the local registry:

podman push localhost/paperless-ngx:busted <your workstation ip>:5000/paperless-ngx:busted

Add this to the butane:

  - path: /etc/containers/registries.conf.d/local.conf
    mode: 0644
    overwrite: true
    contents:
      inline: |
        [[registry]]
        location = "<your workstation ip>:5000"
        insecure = true

Change the rebase exec start to use the local registry:

        ExecStart=rpm-ostree rebase --bypass-driver --experimental ostree-unverified-registry:<your workstation ip>:5000/paperless-ngx:busted

That will make local development pretty darn quick.
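The rebase target is a single token: the ostree-unverified-registry: prefix followed directly by registry/image:tag, with no whitespace in between. A small sketch of assembling it, assuming a hypothetical helper name (rebase_ref) and a documentation-range address in place of a real workstation IP:

```shell
#!/bin/sh
# Hypothetical helper (not from the thread).
# Usage: rebase_ref REGISTRY IMAGE [TAG]
# Prints the image reference to pass to `rpm-ostree rebase`.
rebase_ref() {
    registry=$1        # host:port, e.g. 192.0.2.10:5000 (placeholder address)
    image=$2
    tag=${3:-latest}   # default tag is an assumption made for this sketch
    # No space after the transport prefix: the whole reference is one argument.
    printf 'ostree-unverified-registry:%s/%s:%s\n' "$registry" "$image" "$tag"
}
```

For example, rebase_ref 192.0.2.10:5000 paperless-ngx busted yields a value that can be substituted into the ExecStart line.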

@cgwalters
Member

I rebased to the :broken image and I do see the systemd units started:

[root@cosa-devsh ~]# systemctl list-units|grep pngx
  pngx-gotenberg.service                                                                                                               loaded active running   Paperless Gotenberg Service
  pngx-pod.service                                                                                                                     loaded active running   Paperless pod service
  pngx-postgres.service                                                                                                                loaded active running   Paperless-ngx Postgresql Service
  pngx-redis.service                                                                                                                   loaded active running   Paperless-ngx Redis Service
  pngx-sftpgo.service                                                                                                                  loaded active running   Paperless-ngx SFTPgo Service
  pngx-tika.service                                                                                                                    loaded active running   Paperless Gotenberg Service
● pngx-webserver.service                                                                                                               loaded failed failed    Paperless-ngx Webserver Service
  machine-pngx-pod.slice                                                                                                               loaded active active    Slice /machine/pngx/pod
  machine-pngx.slice                                                                                                                   loaded active active    Slice /machine/pngx

I can think of three potential things that might be happening.

  • First, there's https://fedoraproject.org/wiki/Changes/Preset_All_Systemd_Units_on_First_Boot - this will wipe out units that don't have corresponding presets - but only on first boot. If you're rebasing from a "golden" FCOS image that shouldn't apply
  • Except, if you're managing to do the rebase and reboot before coreos-ignition-firstboot-complete.service happens to complete, then you can have Ignition run on the next boot again too, which would likely have this effect. (Hmm...when we make sugar for rebasing via a systemd unit we should definitely ensure it's ordered after that)
  • If you're rebasing to an image which has the units, and then trying to rebase back to an image which doesn't, those unit links will disappear. But, this is expected behavior.

@jdoss
Author

jdoss commented Jan 6, 2023

If you launch FCOS from that butane it will reproduce after applying the layer from the systemd unit.

That unit has Before=first-boot-complete.target. Do we need to add After=coreos-ignition-firstboot-complete.service?

@jdoss
Author

jdoss commented Mar 1, 2023

I just ran into this issue after not seeing it happen for a while.

Server without this issue:

[root@node1 ~]# journalctl -u coreos-ignition-firstboot-complete.service
Feb 07 04:41:03 node1 systemd[1]: Starting coreos-ignition-firstboot-complete.service - CoreOS Mark Ignition Boot Complete...
Feb 07 04:41:03 node1 systemd[1]: Finished coreos-ignition-firstboot-complete.service - CoreOS Mark Ignition Boot Complete.
-- Boot 13b6de94359247e79f0b99f0716fd282 --
Feb 07 05:25:13 node1 systemd[1]: coreos-ignition-firstboot-complete.service - CoreOS Mark Ignition Boot Complete was skipped because of a failed condition check>
-- Boot 907edf8815f94cd9af2f14390d10430b --

[root@node0 ~]# journalctl -u coreos-ignition-firstboot-complete.service
Feb 20 23:39:49 localhost systemd[1]: Starting coreos-ignition-firstboot-complete.service - CoreOS Mark Ignition Boot Complete...
Feb 20 23:39:49 localhost systemd[1]: Finished coreos-ignition-firstboot-complete.service - CoreOS Mark Ignition Boot Complete.
Feb 20 23:51:01 linux systemd[1]: coreos-ignition-firstboot-complete.service: Deactivated successfully.
Feb 20 23:51:01 linux systemd[1]: Stopped coreos-ignition-firstboot-complete.service - CoreOS Mark Ignition Boot Complete.
-- Boot 6cf393e50562452cab0aecba2a83129a --
Feb 20 23:52:21 node000.quickvm.com systemd[1]: coreos-ignition-firstboot-complete.service - CoreOS Mark Ignition Boot Complete was skipped because of a failed condition >
-- Boot 1e8054fc91344df2b2ffc50181310c92 --
Feb 22 07:06:58 node000.quickvm.com systemd[1]: coreos-ignition-firstboot-complete.service - CoreOS Mark Ignition Boot Complete was skipped because of a failed condition >
lines 1-8/8 (END)

Server with this issue:

# journalctl -u coreos-ignition-firstboot-complete.service
Feb 25 23:00:44 node000.quickvm.com systemd[1]: coreos-ignition-firstboot-complete.service - CoreOS Mark Ignition Boot Complete was skipped because of a failed condition >
lines 1-1/1 (END)
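The difference between the healthy and affected logs is whether a "Finished" line for the service was ever recorded. A rough sketch of that check, assuming the message wording shown in the logs above stays stable and that the journal persists across boots; firstboot_completed is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper (not from the thread).
# Reads journal text on stdin, for example the output of
#   journalctl -u coreos-ignition-firstboot-complete.service --no-pager
# Returns 0 if the service is recorded as having finished at least once.
firstboot_completed() {
    # Match the "Finished <unit>" line; "skipped because of a failed
    # condition" lines do not contain this phrase and are ignored.
    grep -q 'Finished coreos-ignition-firstboot-complete\.service'
}
```

A host where this returns non-zero matches the affected server above, where the rebase and reboot appear to have happened before the service finished.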

Per your suggestion, I noted that coreos-ignition-firstboot-complete.service had not been run before the system layered the update and rebooted. I adjusted my systemd unit that does the rebase so it runs after coreos-ignition-firstboot-complete.service and I am not seeing the issue anymore.

[Unit]
Description=Rebase FCOS to Container Image
ConditionPathExists=!/var/lib/fcos-rebase.stamp
ConditionFirstBoot=true
After=network-online.target coreos-ignition-firstboot-complete.service coreos-update-ca-trust.service
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
Restart=on-failure
RestartSec=10s
ExecStart=rpm-ostree rebase --bypass-driver --experimental ostree-unverified-registry:{{ container_registry }}/{{ container_registry_org }}/{{ container_name }}:{{ container_tag }}
ExecStartPost=/bin/touch /var/lib/fcos-rebase.stamp
ExecStartPost=systemctl reboot

[Install]
WantedBy=basic.target
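Since the fix hinges on one After= entry, it may be worth checking unit files for it before shipping them. This is a simplified sketch under stated assumptions: unit_ordered_after is a hypothetical helper, and it reads only plain After= lines in a single file, ignoring drop-ins and line continuations.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread).
# Usage: unit_ordered_after FILE UNIT
# Returns 0 if FILE has an After= line that lists UNIT exactly.
unit_ordered_after() {
    file=$1
    want=$2
    # Take the value of every After= line, put each unit name on its own
    # line, then require an exact whole-line match (-F fixed, -x full line).
    sed -n 's/^After=//p' "$file" | tr -s ' \t' '\n\n' | grep -Fxq -- "$want"
}
```

For example, unit_ordered_after fcos-rebase.service coreos-ignition-firstboot-complete.service, where fcos-rebase.service is a placeholder file name. On a running system, systemctl show -p After on the unit reports the effective ordering, including drop-ins.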

@cgwalters
Member

Thanks, I think we'll get some butane sugar for this at some point soon which should avoid that particular footgun.

@jdoss
Author

jdoss commented Mar 2, 2023

No problem Colin and thank you for your help on this issue. Do you want to close this or leave it open?

@cgwalters transferred this issue from coreos/rpm-ostree on Mar 2, 2023
@cgwalters changed the title from "Enabled systemd units not working in a layered 37.20221211.3.0 FCOS container" to "Add support for container-rebase" on Mar 2, 2023
@cgwalters
Member

I've transferred the issue to butane.

@bgilbert
Contributor

bgilbert commented Aug 1, 2023

Generally we try not to hardcode complex systemd units in Butane, but instead ship them in the OS and have Butane sugar configure them.

It looks like the workflow here is to write an Ignition config that arranges a pivot during the first boot. Is this a workflow we want to support and encourage? A major design goal of Ignition is that it makes its changes before the system boots, so we're not trying to rearrange the boot while in the middle of booting.

@cgwalters
Copy link
Member

Right, but in the end this is all part of moving stuff out of Ignition; it relates to coreos/fedora-coreos-docs#540

So we can just move forward with documenting there.
