Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zincati refuses to update after manually staging a deployment #1691

Open
arnegroskurth opened this issue Mar 5, 2024 · 13 comments · Fixed by coreos/rpm-ostree#4939
Open
Labels
kind/bug status/pending-upstream-release Fixed upstream. Waiting on an upstream component source code release. triaged

Comments

@arnegroskurth
Copy link

arnegroskurth commented Mar 5, 2024

Describe the bug

We've just CoreOS for a while now but always had to disable zincati to ensure uninterrupted availability of the container hosts. We are now trying to set up zincati with the lock-based update strategy but while the update does seem to get downloaded and "staged" (?), the system does never reboot to apply the update.

I'm not very familiar with the inner workings of rpm-ostree and zincati. So is there some documentation about how exactly "deployments", "stagings" and "locks" work together within ostree? Maybe I can then find more information related to the shown error Staged deployment already unlocked.

Reproduction steps

  • Set up host with CoreOS 39.20240128.3.0
  • Configure zincati with lock-based update finalization coordination:
# cat /etc/zincati/config.d/55-updates-strategy.toml
[updates]
strategy = "fleet_lock"

[updates.fleet_lock]
base_url = "https://.../"

[identity]
group = "65e5e79e2fa73"
  • Boot system and notice it never updating to the now most recent CoreOS version 39.20240210.3.0
  • Notice logs about an already unlocked deployment (I cannot interpret that information)

Expected behavior

System does reboot and apply the outstanding update.

Actual behavior

System does never reboot and continuously logs Txn FinalizeDeployment on /org/projectatomic/rpmostree1/fedora_coreos failed: Staged deployment already unlocked.

System details

# rpm-ostree --version
rpm-ostree:
 Version: '2023.11'
 Git: d15b73c58879a43456d5eeeee6605f7524e9ae65
 Features:
  - rust
  - compose
  - container
  - fedora-integration

Additional information

# journalctl -u zincati
Mar 04 16:31:41 update-test systemd[1]: Starting zincati.service - Zincati Update Agent...
Mar 04 16:31:41 update-test zincati[4076]: [INFO  zincati::cli::agent] starting update agent (zincati 0.0.27)
Mar 04 16:31:42 update-test zincati[4076]: [INFO  zincati::strategy::fleet_lock] remote fleet_lock reboot manager: https://.../
Mar 04 16:31:42 update-test zincati[4076]: [INFO  zincati::cincinnati] Cincinnati service: https://updates.coreos.fedoraproject.org
Mar 04 16:31:42 update-test zincati[4076]: [INFO  zincati::cli::agent] agent running on node '8d2af8f53c764f4f91667da87793a7f5', in update group '65e5e79e2fa73'
Mar 04 16:31:42 update-test zincati[4076]: [INFO  zincati::update_agent::actor] registering as the update driver for rpm-ostree
Mar 04 16:31:42 update-test zincati[4076]: [INFO  zincati::update_agent::actor] initialization complete, auto-updates logic enabled
Mar 04 16:31:42 update-test zincati[4076]: [INFO  zincati::strategy] update strategy: fleet_lock
Mar 04 16:31:42 update-test systemd[1]: Started zincati.service - Zincati Update Agent.
Mar 04 16:31:42 update-test zincati[4076]: [INFO  zincati::update_agent::actor] reached steady state, periodically polling for updates
Mar 04 16:31:44 update-test zincati[4076]: [INFO  zincati::cincinnati] current release detected as not a dead-end
Mar 04 16:31:44 update-test zincati[4076]: [INFO  zincati::update_agent::actor] target release '39.20240210.3.0' selected, proceeding to stage it
Mar 04 16:33:17 update-test zincati[4076]: [INFO  zincati::update_agent::actor] update staged: 39.20240210.3.0
Mar 04 16:33:17 update-test systemd[1]: zincati.service: Got notification message from PID 5211, but reception only permitted for main PID 4076
Mar 04 16:33:17 update-test zincati[4076]: [INFO  zincati::update_agent::actor] staged deployment '39.20240210.3.0' available, proceeding to finalize it
Mar 04 16:33:17 update-test zincati[4076]: [ERROR zincati::update_agent::actor] failed to finalize deployment: rpm-ostree finalize-deployment failed:
Mar 04 16:33:17 update-test zincati[4076]:     error: Staged deployment already unlocked
Mar 04 16:33:17 update-test zincati[4076]:     
Mar 04 16:38:23 update-test systemd[1]: zincati.service: Got notification message from PID 5446, but reception only permitted for main PID 4076
Mar 04 16:38:23 update-test zincati[4076]: [INFO  zincati::update_agent::actor] staged deployment '39.20240210.3.0' available, proceeding to finalize it
Mar 04 16:38:23 update-test zincati[4076]: [ERROR zincati::update_agent::actor] failed to finalize deployment: rpm-ostree finalize-deployment failed:
Mar 04 16:38:23 update-test zincati[4076]:     error: Staged deployment already unlocked
Mar 04 16:38:23 update-test zincati[4076]:     
...
# journalctl -u rpm-ostreed
Mar 04 16:26:45 update-test systemd[1]: Starting rpm-ostreed.service - rpm-ostree System Management Daemon...
Mar 04 16:26:46 update-test rpm-ostree[2844]: Reading config file '/etc/rpm-ostreed.conf'
Mar 04 16:26:49 update-test rpm-ostree[2844]: In idle state; will auto-exit in 62 seconds
Mar 04 16:26:49 update-test systemd[1]: Started rpm-ostreed.service - rpm-ostree System Management Daemon.
Mar 04 16:26:49 update-test rpm-ostree[2844]: client(id:cli dbus:1.10 unit:rpm-ostree-install-nrpe.service uid:0) added; new total=1
Mar 04 16:26:52 update-test rpm-ostree[2844]: Loaded sysroot
Mar 04 16:26:52 update-test rpm-ostree[2844]: Locked sysroot
Mar 04 16:26:52 update-test rpm-ostree[2844]: Initiated txn UpdateDeployment for client(id:cli dbus:1.10 unit:rpm-ostree-install-nrpe.service uid:0): /org/projectatomic/rpmostree1/fedora_coreos
Mar 04 16:26:52 update-test rpm-ostree[2844]: Process [pid: 2835 uid: 0 unit: rpm-ostree-install-nrpe.service] connected to transaction progress
Mar 04 16:26:53 update-test rpm-ostree[2844]: Librepo version: 1.17.0 with CURL_GLOBAL_ACK_EINTR support (libcurl/8.2.1 OpenSSL/3.1.1 zlib/1.2.13 libidn2/2.3.4 nghttp2/1.55.1)
Mar 04 16:27:31 update-test rpm-ostree[2844]: Downloading: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-39-x86_64
Mar 04 16:27:31 update-test rpm-ostree[2844]: Downloading: https://mirrors.fedoraproject.org/metalink?repo=fedora-cisco-openh264-39&arch=x86_64
Mar 04 16:27:32 update-test rpm-ostree[2844]: Downloading: https://codecs.fedoraproject.org/openh264/39/x86_64/os/repodata/repomd.xml
Mar 04 16:27:32 update-test rpm-ostree[2844]: Downloading: https://codecs.fedoraproject.org/openh264/39/x86_64/os/repodata/cd71f063bfdc46daabc89dfe212825bf4a3ee0593556a25057e9b38dc14f0475-primary.xml.gz
Mar 04 16:27:32 update-test rpm-ostree[2844]: Downloading: https://codecs.fedoraproject.org/openh264/39/x86_64/os/repodata/7d7e77ccb27cb810e3cc2a57001d1b15072928f68ef55fca86819b1467f7e1bd-filelists.xml.gz
Mar 04 16:27:32 update-test rpm-ostree[2844]: Downloading: https://codecs.fedoraproject.org/openh264/39/x86_64/os/repodata/2b13cd3f9d81647fd31aa16de1b16b582efd9566f8c4334e4561a030f3777c37-comps-Temporary.x86_64.xml
Mar 04 16:27:32 update-test rpm-ostree[2844]: Downloading: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-39-x86_64
Mar 04 16:27:32 update-test rpm-ostree[2844]: Downloading: https://mirrors.fedoraproject.org/metalink?repo=updates-released-f39&arch=x86_64
...
Mar 04 16:29:19 update-test rpm-ostree[2844]: Imported 179 pkgs
Mar 04 16:29:20 update-test rpm-ostree[2844]: Executed %prein for nagios-common in 504 ms
Mar 04 16:29:20 update-test rpm-ostree[2844]: Executed %prein for ntpsec in 223 ms
Mar 04 16:29:20 update-test rpm-ostree[2844]: Executed %prein for groff-base in 114 ms
Mar 04 16:29:21 update-test rpm-ostree[2844]: Executed %prein for nrpe in 344 ms
Mar 04 16:29:21 update-test rpm-ostree[2844]: Executed %post for samba-client in 130 ms
Mar 04 16:29:21 update-test rpm-ostree[2844]: Executed %post for esmtp in 118 ms
Mar 04 16:29:22 update-test rpm-ostree[2844]: Executed %post for ntpsec in 567 ms
Mar 04 16:29:22 update-test rpm-ostree[2844]: Executed %post for groff-base in 109 ms
Mar 04 16:29:22 update-test rpm-ostree(lm_sensors.post)[3713]: Created symlink /etc/systemd/system/multi-user.target.wants/lm_sensors.service → /usr/lib/systemd/system/lm_sensors.service.
Mar 04 16:29:22 update-test rpm-ostree[2844]: Executed %post for lm_sensors in 198 ms
Mar 04 16:29:22 update-test rpm-ostree[2844]: Executed %post for nrpe in 212 ms
Mar 04 16:29:22 update-test rpm-ostree[2844]: No files matched %transfiletriggerin(lib) for glibc-common
Mar 04 16:29:22 update-test rpm-ostree[2844]: No files matched %transfiletriggerin(lib64) for glibc-common
Mar 04 16:29:23 update-test rpm-ostree[2844]: Executed %transfiletriggerin(glibc-common) for lib, lib64, usr/lib, usr/lib64 in 508 ms; 16932 matched files
Mar 04 16:29:23 update-test rpm-ostree[2844]: No files matched %transfiletriggerin(usr/lib64/gio/modules) for glib2
Mar 04 16:29:23 update-test rpm-ostree[2844]: No files matched %transfiletriggerin(usr/share/glib-2.0/schemas) for glib2
Mar 04 16:29:23 update-test rpm-ostree[2844]: Executed %transfiletriggerin(shared-mime-info) for usr/share/mime in 181 ms; 864 matched files
Mar 04 16:29:24 update-test rpm-ostree[2844]: Executed %transfiletriggerin(systemd-udev) for usr/lib/udev/hwdb.d in 1095 ms; 31 matched files
Mar 04 16:29:24 update-test rpm-ostree[2844]: Executed %transfiletriggerin(systemd-udev) for usr/lib/udev/rules.d in 96 ms; 73 matched files
Mar 04 16:29:24 update-test rpm-ostree[2844]: sanitycheck(/usr/bin/true) successful
Mar 04 16:29:27 update-test rpm-ostree[2844]: Regenerating rpmdb for target
Mar 04 16:31:13 update-test rpm-ostree[2844]: Wrote commit: 6a2c48dd0be850527d298a4f83b46f95b84a4aaf7693f666cfd309c2d2b64879; New objects: meta:251 content:29 totaling 29.3 MB)
Mar 04 16:31:16 update-test rpm-ostree[2844]: note: Deploying commit 6a2c48dd0be850527d298a4f83b46f95b84a4aaf7693f666cfd309c2d2b64879 which contains content in /var/lib that should be in /usr/share/factory/var
Mar 04 16:31:16 update-test rpm-ostree[2844]: Created new deployment /ostree/deploy/fedora-coreos/deploy/6a2c48dd0be850527d298a4f83b46f95b84a4aaf7693f666cfd309c2d2b64879.0
Mar 04 16:31:21 update-test systemd[1]: rpm-ostreed.service: Got notification message from PID 3981, but reception only permitted for main PID 2844
Mar 04 16:31:21 update-test rpm-ostree[2844]: Computed /usr diff: files(added:457 removed:0 changed:6) dirs(added:66 removed:0 changed:0)
Mar 04 16:31:21 update-test rpm-ostree[2844]: Computed pkg diff: 179 added, 0 changed, 0 removed
Mar 04 16:31:25 update-test rpm-ostree[2844]: Computed /etc diff: files(added:56 removed:0 changed:13) dirs(added:6 removed:0 changed:0)
Mar 04 16:31:29 update-test systemd[1]: rpm-ostreed.service: Got notification message from PID 3995, but reception only permitted for main PID 2844
Mar 04 16:31:29 update-test rpm-ostree[2844]: Txn UpdateDeployment on /org/projectatomic/rpmostree1/fedora_coreos successful
Mar 04 16:31:29 update-test rpm-ostree[2844]: Unlocked sysroot
Mar 04 16:31:29 update-test rpm-ostree[2844]: Process [pid: 2835 uid: 0 unit: rpm-ostree-install-nrpe.service] disconnected from transaction progress
Mar 04 16:31:29 update-test rpm-ostree[2844]: client(id:cli dbus:1.10 unit:rpm-ostree-install-nrpe.service uid:0) vanished; remaining=0
Mar 04 16:31:29 update-test rpm-ostree[2844]: In idle state; will auto-exit in 63 seconds
Mar 04 16:31:42 update-test rpm-ostree[2844]: Loaded sysroot
Mar 04 16:31:42 update-test rpm-ostree[2844]: Locked sysroot
Mar 04 16:31:42 update-test rpm-ostree[2844]: Initiated txn Deploy for client(dbus:1.33 unit:zincati.service uid:981): /org/projectatomic/rpmostree1/fedora_coreos
Mar 04 16:31:42 update-test rpm-ostree[2844]: Process [pid: 4089 uid: 981 unit: zincati.service] connected to transaction progress
Mar 04 16:31:42 update-test rpm-ostree[2844]: Txn Deploy on /org/projectatomic/rpmostree1/fedora_coreos successful
Mar 04 16:31:42 update-test rpm-ostree[2844]: Unlocked sysroot
Mar 04 16:31:42 update-test rpm-ostree[2844]: Process [pid: 4089 uid: 981 unit: zincati.service] disconnected from transaction progress
Mar 04 16:31:42 update-test rpm-ostree[2844]: In idle state; will auto-exit in 63 seconds
Mar 04 16:31:44 update-test rpm-ostree[2844]: Loaded sysroot
Mar 04 16:31:44 update-test rpm-ostree[2844]: Locked sysroot
Mar 04 16:31:44 update-test rpm-ostree[2844]: Initiated txn Deploy for client(dbus:1.41 unit:zincati.service uid:981): /org/projectatomic/rpmostree1/fedora_coreos
Mar 04 16:31:44 update-test rpm-ostree[2844]: Process [pid: 4144 uid: 981 unit: zincati.service] connected to transaction progress
Mar 04 16:31:47 update-test rpm-ostree[2844]: Writing objects: 1
Mar 04 16:31:57 update-test rpm-ostree[2844]: libostree pull from 'fedora' for fedora/x86_64/coreos/stable complete
                                              security: GPG: commit 
                                              security: SIGN: disabled http: TLS
                                              non-delta: meta: 1326 content: 5072
                                              transfer: secs: 12 size: 263.7 MB
Mar 04 16:32:03 update-test rpm-ostree[2844]: Preparing pkg txn; enabled repos: ['fedora-cisco-openh264', 'updates', 'fedora', 'updates-archive'] solvables: 129020
Mar 04 16:32:10 update-test rpm-ostree[2844]: Relabeled 0/179 pkgs
Mar 04 16:32:12 update-test rpm-ostree[2844]: Executed %prein for nagios-common in 714 ms
Mar 04 16:32:12 update-test rpm-ostree[2844]: Executed %prein for ntpsec in 502 ms
Mar 04 16:32:12 update-test rpm-ostree[2844]: Executed %prein for groff-base in 131 ms
Mar 04 16:32:13 update-test rpm-ostree[2844]: Executed %prein for nrpe in 494 ms
Mar 04 16:32:13 update-test rpm-ostree[2844]: Executed %post for samba-client in 121 ms
Mar 04 16:32:13 update-test rpm-ostree[2844]: Executed %post for esmtp in 124 ms
Mar 04 16:32:13 update-test rpm-ostree[2844]: Executed %post for ntpsec in 248 ms
Mar 04 16:32:13 update-test rpm-ostree[2844]: Executed %post for groff-base in 123 ms
Mar 04 16:32:14 update-test rpm-ostree(lm_sensors.post)[4623]: Created symlink /etc/systemd/system/multi-user.target.wants/lm_sensors.service → /usr/lib/systemd/system/lm_sensors.service.
Mar 04 16:32:14 update-test rpm-ostree[2844]: Executed %post for lm_sensors in 209 ms
Mar 04 16:32:14 update-test rpm-ostree[2844]: Executed %post for nrpe in 227 ms
Mar 04 16:32:14 update-test rpm-ostree[2844]: No files matched %transfiletriggerin(lib) for glibc-common
Mar 04 16:32:14 update-test rpm-ostree[2844]: No files matched %transfiletriggerin(lib64) for glibc-common
Mar 04 16:32:15 update-test rpm-ostree[2844]: Executed %transfiletriggerin(glibc-common) for lib, lib64, usr/lib, usr/lib64 in 1162 ms; 16959 matched files
Mar 04 16:32:15 update-test rpm-ostree[2844]: No files matched %transfiletriggerin(usr/lib64/gio/modules) for glib2
Mar 04 16:32:15 update-test rpm-ostree[2844]: No files matched %transfiletriggerin(usr/share/glib-2.0/schemas) for glib2
Mar 04 16:32:15 update-test rpm-ostree[2844]: Executed %transfiletriggerin(shared-mime-info) for usr/share/mime in 124 ms; 864 matched files
Mar 04 16:32:17 update-test rpm-ostree[2844]: Executed %transfiletriggerin(systemd-udev) for usr/lib/udev/hwdb.d in 1386 ms; 31 matched files
Mar 04 16:32:17 update-test rpm-ostree[2844]: Executed %transfiletriggerin(systemd-udev) for usr/lib/udev/rules.d in 111 ms; 73 matched files
Mar 04 16:32:17 update-test rpm-ostree[2844]: sanitycheck(/usr/bin/true) successful
Mar 04 16:32:38 update-test rpm-ostree[2844]: Regenerating rpmdb for target
Mar 04 16:33:06 update-test rpm-ostree[2844]: Wrote commit: 2aee87e22fb539a13c6e74469f6435c9a91a22649a3e97b532d4314f6a96d37e; New objects: meta:167 content:2 totaling 29.2 MB)
Mar 04 16:33:10 update-test rpm-ostree[2844]: note: Deploying commit 2aee87e22fb539a13c6e74469f6435c9a91a22649a3e97b532d4314f6a96d37e which contains content in /var/lib that should be in /usr/share/factory/var
Mar 04 16:33:14 update-test rpm-ostree[2844]: Created new deployment /ostree/deploy/fedora-coreos/deploy/2aee87e22fb539a13c6e74469f6435c9a91a22649a3e97b532d4314f6a96d37e.0
Mar 04 16:33:17 update-test rpm-ostree[2844]: Txn Deploy on /org/projectatomic/rpmostree1/fedora_coreos successful
Mar 04 16:33:17 update-test rpm-ostree[2844]: Unlocked sysroot
Mar 04 16:33:17 update-test rpm-ostree[2844]: Process [pid: 4144 uid: 981 unit: zincati.service] disconnected from transaction progress
Mar 04 16:33:17 update-test rpm-ostree[2844]: In idle state; will auto-exit in 63 seconds
Mar 04 16:33:17 update-test rpm-ostree[2844]: Loaded sysroot
Mar 04 16:33:17 update-test rpm-ostree[2844]: Locked sysroot
Mar 04 16:33:17 update-test rpm-ostree[2844]: Initiated txn FinalizeDeployment for client(dbus:1.58 unit:zincati.service uid:981): /org/projectatomic/rpmostree1/fedora_coreos
Mar 04 16:33:17 update-test rpm-ostree[2844]: Process [pid: 5212 uid: 981 unit: zincati.service] connected to transaction progress
Mar 04 16:33:17 update-test rpm-ostree[2844]: Txn FinalizeDeployment on /org/projectatomic/rpmostree1/fedora_coreos failed: Staged deployment already unlocked
Mar 04 16:33:17 update-test rpm-ostree[2844]: Unlocked sysroot
Mar 04 16:33:17 update-test rpm-ostree[2844]: Process [pid: 5212 uid: 981 unit: zincati.service] disconnected from transaction progress
Mar 04 16:33:17 update-test rpm-ostree[2844]: In idle state; will auto-exit in 61 seconds
Mar 04 16:34:18 update-test systemd[1]: rpm-ostreed.service: Deactivated successfully.
Mar 04 16:34:18 update-test systemd[1]: rpm-ostreed.service: Consumed 3min 32.213s CPU time.
Mar 04 16:38:23 update-test systemd[1]: Starting rpm-ostreed.service - rpm-ostree System Management Daemon...
Mar 04 16:38:23 update-test rpm-ostree[5452]: Reading config file '/etc/rpm-ostreed.conf'
Mar 04 16:38:23 update-test rpm-ostree[5452]: In idle state; will auto-exit in 61 seconds
Mar 04 16:38:23 update-test systemd[1]: Started rpm-ostreed.service - rpm-ostree System Management Daemon.
Mar 04 16:38:23 update-test rpm-ostree[5452]: Loaded sysroot
Mar 04 16:38:23 update-test rpm-ostree[5452]: Locked sysroot
Mar 04 16:38:23 update-test rpm-ostree[5452]: Initiated txn FinalizeDeployment for client(dbus:1.71 unit:zincati.service uid:981): /org/projectatomic/rpmostree1/fedora_coreos
Mar 04 16:38:23 update-test rpm-ostree[5452]: Process [pid: 5447 uid: 981 unit: zincati.service] connected to transaction progress
Mar 04 16:38:23 update-test rpm-ostree[5452]: Txn FinalizeDeployment on /org/projectatomic/rpmostree1/fedora_coreos failed: Staged deployment already unlocked
Mar 04 16:38:23 update-test rpm-ostree[5452]: Unlocked sysroot
Mar 04 16:38:23 update-test rpm-ostree[5452]: Process [pid: 5447 uid: 981 unit: zincati.service] disconnected from transaction progress
Mar 04 16:38:23 update-test rpm-ostree[5452]: In idle state; will auto-exit in 62 seconds
Mar 04 16:39:26 update-test systemd[1]: rpm-ostreed.service: Deactivated successfully.
Mar 04 16:46:29 update-test systemd[1]: Starting rpm-ostreed.service - rpm-ostree System Management Daemon...
Mar 04 16:46:29 update-test rpm-ostree[5804]: Reading config file '/etc/rpm-ostreed.conf'
Mar 04 16:46:29 update-test rpm-ostree[5804]: In idle state; will auto-exit in 64 seconds
Mar 04 16:46:29 update-test systemd[1]: Started rpm-ostreed.service - rpm-ostree System Management Daemon.
Mar 04 16:46:29 update-test rpm-ostree[5804]: Loaded sysroot
Mar 04 16:46:30 update-test rpm-ostree[5804]: Locked sysroot
Mar 04 16:46:30 update-test rpm-ostree[5804]: Initiated txn FinalizeDeployment for client(dbus:1.99 unit:zincati.service uid:981): /org/projectatomic/rpmostree1/fedora_coreos
Mar 04 16:46:30 update-test rpm-ostree[5804]: Process [pid: 5799 uid: 981 unit: zincati.service] connected to transaction progress
Mar 04 16:46:30 update-test rpm-ostree[5804]: Txn FinalizeDeployment on /org/projectatomic/rpmostree1/fedora_coreos failed: Staged deployment already unlocked
Mar 04 16:46:30 update-test rpm-ostree[5804]: Unlocked sysroot
Mar 04 16:46:30 update-test rpm-ostree[5804]: Process [pid: 5799 uid: 981 unit: zincati.service] disconnected from transaction progress
Mar 04 16:46:30 update-test rpm-ostree[5804]: In idle state; will auto-exit in 62 seconds
Mar 04 16:47:30 update-test rpm-ostree[5804]: Loaded sysroot
Mar 04 16:47:30 update-test rpm-ostree[5804]: Locked sysroot
Mar 04 16:47:30 update-test rpm-ostree[5804]: Initiated txn FinalizeDeployment for client(dbus:1.102 unit:zincati.service uid:981): /org/projectatomic/rpmostree1/fedora_coreos
Mar 04 16:47:30 update-test rpm-ostree[5804]: Process [pid: 5819 uid: 981 unit: zincati.service] connected to transaction progress
Mar 04 16:47:30 update-test rpm-ostree[5804]: Txn FinalizeDeployment on /org/projectatomic/rpmostree1/fedora_coreos failed: Staged deployment already unlocked
Mar 04 16:47:30 update-test rpm-ostree[5804]: Unlocked sysroot
Mar 04 16:47:30 update-test rpm-ostree[5804]: Process [pid: 5819 uid: 981 unit: zincati.service] disconnected from transaction progress
Mar 04 16:47:30 update-test rpm-ostree[5804]: In idle state; will auto-exit in 61 seconds
...
@HaveFun83
Copy link

We run k8s on fedora Coreos with zincati and fleetlock and got a similar issue.
The node stuck in SchedulingDisabled without reboot zincati logs shows:

Mar 11 00:51:35 k8s-worker4 systemd[1]: zincati.service: Got notification message from PID 584554, but reception only permitted for main PID 1764679
Mar 11 00:51:35 k8s-worker4 zincati[1764679]: [INFO  zincati::update_agent::actor] staged deployment '39.20240210.3.0' available, proceeding to finalize it
Mar 11 00:51:36 k8s-worker4 zincati[1764679]: [ERROR zincati::update_agent::actor] failed to finalize deployment: rpm-ostree finalize-deployment failed:
Mar 11 00:51:36 k8s-worker4 zincati[1764679]:     error: Staged deployment already unlocked
Mar 11 00:51:36 k8s-worker4 zincati[1764679]:

@cgwalters
Copy link
Member

Can you also add the output of rpm-ostree status in this state?

@HaveFun83
Copy link

root@k8s-worker2:~# rpm-ostree status
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; update staged: 39.20240210.3.0; reboot pending due to update strategy
Deployments:
  fedora:fedora/x86_64/coreos/stable
                  Version: 39.20240210.3.0 (2024-02-26T11:14:39Z)
               BaseCommit: 9ee1cb0963e253a2864c69a9450c8e3dd91b50b47f97389be0393c1db567dd5e
             GPGSignature: Valid signature by E8F23996F23218640CB44CBE75CF5AC418B8E74C
                     Diff: 120 upgraded, 4 added
          LayeredPackages: conntrack-tools python3

● fedora:fedora/x86_64/coreos/stable
                  Version: 39.20240112.3.0 (2024-01-29T16:43:10Z)
               BaseCommit: 2102f5d8530fc440ba0ddccb58c8321dd541df3acc0656f8a8b7df95e68fa89e
             GPGSignature: Valid signature by E8F23996F23218640CB44CBE75CF5AC418B8E74C
          LayeredPackages: conntrack-tools python3

  fedora:fedora/x86_64/coreos/stable
                  Version: 39.20240104.3.0 (2024-01-15T18:55:25Z)
               BaseCommit: df21c44638e117843272f80da8c82b0ac4b527e774cd01dce566836d6b3b2367
             GPGSignature: Valid signature by E8F23996F23218640CB44CBE75CF5AC418B8E74C
          LayeredPackages: conntrack-tools python3

@HaveFun83
Copy link

JFI Couple of our nodes update from 39.20240210.3.0 to 39.20240225.3.0 without any manual intervention.

Looks like the problem exist only in 39.20240112.3.0

@jlebon jlebon transferred this issue from coreos/rpm-ostree Mar 15, 2024
@jlebon
Copy link
Member

jlebon commented Mar 15, 2024

Mar 11 00:51:36 k8s-worker4 zincati[1764679]:     error: Staged deployment already unlocked

One way this could happen is if you manually do e.g. rpm-ostree upgrade or rpm-ostree install but don't immediately reboot. Could that have happened here?

@HaveFun83
Copy link

Mar 11 00:51:36 k8s-worker4 zincati[1764679]:     error: Staged deployment already unlocked

One way this could happen is if you manually do e.g. rpm-ostree upgrade or rpm-ostree install but don't immediately reboot. Could that have happened here?

Maybe but normally the reboot is triggered by zincati immediately after rpm-ostree upgrade and fleetlock take care about the locking (one node at a time)

As i mentioned this only happens on 39.20240112.3.0 all nodes are now 39.20240225.3.0

@travier travier changed the title CoreOS+Zincati+OSTree: Txn FinalizeDeployment on /org/projectatomic/rpmostree1/fedora_coreos failed: Staged deployment already unlocked Txn FinalizeDeployment on /org/projectatomic/rpmostree1/fedora_coreos failed: Staged deployment already unlocked Mar 18, 2024
@travier travier changed the title Txn FinalizeDeployment on /org/projectatomic/rpmostree1/fedora_coreos failed: Staged deployment already unlocked Txn FinalizeDeployment on /org/projectatomic/rpmostree1/fedora_coreos failed: Staged deployment already unlocked Mar 18, 2024
@arnegroskurth
Copy link
Author

Mar 11 00:51:36 k8s-worker4 zincati[1764679]:     error: Staged deployment already unlocked

One way this could happen is if you manually do e.g. rpm-ostree upgrade or rpm-ostree install but don't immediately reboot. Could that have happened here?

We can confirm this - we installed nrpe on provisioned hosts with this systemd unit:

[Unit]
Description=nrpe install
Wants=network-online.target
After=network-online.target
# We run before `zincati.service` to avoid conflicting rpm-ostree transactions.
Before=zincati.service
ConditionPathExists=!/var/lib/%N.stamp

[Service]
Type=oneshot
RemainAfterExit=yes
# `--allow-inactive` ensures that rpm-ostree does not return an error
# if the package is already installed. This is useful if the package is
# added to the root image in a future Fedora CoreOS release as it will
# prevent the service from failing.
ExecStart=/usr/bin/bash -c "/usr/bin/rpm-ostree install --apply-live --allow-inactive nrpe nagios-plugins-all && /usr/bin/systemctl enable --now nrpe.service"
ExecStart=/usr/sbin/semodule -i /var/lib/nagios/check_mem_selinux_rule.pp
ExecStart=/bin/touch /var/lib/%N.stamp

[Install]
WantedBy=multi-user.target

Removing this unit results in a newly provisioned host to reboot as expected after applying the update to the latest CoreOS version.

Is this in some way a not (yet) supported usecase? Or is this indeed expected to work and might get fixed soon?

@dustymabe
Copy link
Member

dustymabe commented Mar 18, 2024

I'm guessing the --apply-live is what is making everything get confused. If you were to reboot after the package layer then you'd probably not end up with a system in this state.

Though, maybe not. Why would it only happen to fleetlock based zincati updates systems and not all systems with zincati enabled?

@arnegroskurth
Copy link
Author

Can you also add the output of rpm-ostree status in this state?

This is shown on our test node for that command:

$ rpm-ostree status
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; update staged: 39.20240210.3.0; reboot delayed due to active user sessions
Deployments:
  fedora:fedora/x86_64/coreos/stable
                  Version: 39.20240210.3.0 (2024-02-26T11:14:39Z)
               BaseCommit: 9ee1cb0963e253a2864c69a9450c8e3dd91b50b47f97389be0393c1db567dd5e
                   Commit: 2aee87e22fb539a13c6e74469f6435c9a91a22649a3e97b532d4314f6a96d37e
             GPGSignature: Valid signature by E8F23996F23218640CB44CBE75CF5AC418B8E74C
                     Diff: 52 upgraded, 179 added
          LayeredPackages: nagios-plugins-all nrpe

● fedora:fedora/x86_64/coreos/stable
                  Version: 39.20240128.3.0 (2024-02-12T18:15:45Z)
             BootedCommit: e69bac2c4fc5891572fe7cd612404f6b9c628f172427c6d8c6f692442a9bf2b7
               LiveCommit: 6a2c48dd0be850527d298a4f83b46f95b84a4aaf7693f666cfd309c2d2b64879
                 LiveDiff: 179 added
             GPGSignature: Valid signature by E8F23996F23218640CB44CBE75CF5AC418B8E74C
                 Unlocked: transient

I'm guessing the --apply-live is what is making everything get confused. If you were to reboot after the package layer then you'd probably not end up with a system in this state.

Though, maybe not. Why would it only happen to fleetlock based zincati updates systems and not all systems with zincati enabled?

We will, for now, just throw out the nrpe installation via rpm-ostree and run it in a container instead.

@jlebon
Copy link
Member

jlebon commented Mar 19, 2024

The issue is likely that the staged deployment created by rpm-ostree is unlocked and Zincati doesn't like that. Possibly the docs in https://docs.fedoraproject.org/en-US/fedora-coreos/os-extensions/ were written before this was tightened in Zincati.

One workaround is to add --lock-finalization to the rpm-ostree install invocation, though ideally that wouldn't be required (it's an undocumented switch). But yes, definitely if you can run it in a container instead, that's preferable.

@dustymabe
Copy link
Member

Possibly the docs in https://docs.fedoraproject.org/en-US/fedora-coreos/os-extensions/ were written before this was tightened in Zincati.

If the docs as written today put the system in a state that can't be upgraded that really isn't great. Can we get someone to confirm?

@samcday
Copy link

samcday commented Apr 26, 2024

I've just run into this myself. I'm following the advice from https://docs.fedoraproject.org/en-US/fedora-coreos/os-extensions/ to layer in some extra packages on the first Ignition boot: https://github.com/samcday/home-cluster/blob/main/control-plane/config.bu#L93

I was specifically testing Zincati with fleetlock so I deliberately re-provisioned a node with the previous stable version (39.20240322.3.1 instead of 39.20240407.3.0).

So my rpm-ostree status looks like this:

rpm-ostree status
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; update staged: 39.20240407.3.0; reboot delayed due to active user sessions
Deployments:
  fedora:fedora/x86_64/coreos/stable
                  Version: 39.20240407.3.0 (2024-04-19T18:34:05Z)
               BaseCommit: 4f5997a887d92f19f6ce564069511115138916a37d7b30e2bfd027c5e5158e63
                   Commit: 199a2c5222cc9919741c3496da9c12290c34835d6fa31f10107377805ad20f84
             GPGSignature: Valid signature by E8F23996F23218640CB44CBE75CF5AC418B8E74C
                     Diff: 28 upgraded, 31 added
          LayeredPackages: dmidecode haproxy kubeadm kubectl kubelet man-db man-pages tailscale tcpdump usbutils vim

● fedora:fedora/x86_64/coreos/stable
                  Version: 39.20240322.3.1 (2024-04-09T16:43:21Z)
             BootedCommit: 70c68b3d14b27e64bab4cb05a407e35472d172eda97be0f96b1e33cd65f6fd3b
               LiveCommit: 0a85fcf6fba11989041702b45e6e79fff1f3cf0167987d4bcf0ce3107d775eab
                 LiveDiff: 31 added
             GPGSignature: Valid signature by E8F23996F23218640CB44CBE75CF5AC418B8E74C
                 Unlocked: transient

And Zincati is complaining with:

$ journalctl -u zincati -n 10
Apr 26 07:23:59 m710q-1 zincati[4165]: [INFO  zincati::update_agent::actor] staged deployment '39.20240407.3.0' available, proceeding to finalize it
Apr 26 07:23:59 m710q-1 zincati[4165]: [ERROR zincati::update_agent::actor] failed to finalize deployment: rpm-ostree finalize-deployment failed:
Apr 26 07:23:59 m710q-1 zincati[4165]:     error: Staged deployment already unlocked
Apr 26 07:23:59 m710q-1 zincati[4165]:     
Apr 26 07:24:59 m710q-1 systemd[1]: zincati.service: Got notification message from PID 11289, but reception only permitted for main PID 4165
Apr 26 07:24:59 m710q-1 zincati[4165]: [WARN  zincati::update_agent] reached end of grace period while waiting for interactive sessions
Apr 26 07:24:59 m710q-1 zincati[4165]: [INFO  zincati::update_agent::actor] staged deployment '39.20240407.3.0' available, proceeding to finalize it
Apr 26 07:24:59 m710q-1 zincati[4165]: [ERROR zincati::update_agent::actor] failed to finalize deployment: rpm-ostree finalize-deployment failed:
Apr 26 07:24:59 m710q-1 zincati[4165]:     error: Staged deployment already unlocked
Apr 26 07:24:59 m710q-1 zincati[4165]:     

jlebon added a commit to jlebon/rpm-ostree that referenced this issue Apr 26, 2024
Now that we've made stabilized and made public the finalization APIs,
let's use the new bindings.

This also fixes an issue where when creating a locked deployment using
the legacy API (i.e. touching the `/run/ostree/staged/deployment-locked`
file before calling the staging API), if a staged deployment already
exists, libostree would just nuke the lockfile (this behaviour was
introduced in ostreedev/ostree#3077).

In theory the legacy API (via the lockfile) should keep working, but
the core issue is that there's no way for libostree to know if the
lockfile is carried-over state, or was freshly created for the current
invocation.

So let's not try to salvage the legacy API and just move over to the
new one.

We already have finalization tests; they will now test that the new API
functions correctly. But stop looking for the legacy lockfile. We could
instead inspect the staged deployment GVariant, but these checks were
redundant anyway since the tests verify the finalization by actually
rebooting and/or not use `finalize-deployment --allow-unlocked`.

Fixes: coreos/fedora-coreos-tracker#1691
jlebon added a commit to jlebon/rpm-ostree that referenced this issue Apr 26, 2024
Now that we've stabilized and made public deployment finalization APIs,
let's use them.

This also fixes an issue where when creating a locked deployment using
the legacy API (i.e. touching the `/run/ostree/staged/deployment-locked`
file before calling the staging API), if a staged deployment already
exists, libostree would just nuke the lockfile (this behaviour was
introduced in ostreedev/ostree#3077).

In theory the legacy API (via the lockfile) should keep working, but
the core issue is that there's no way for libostree to know if the
lockfile is carried-over state, or was freshly created for the current
invocation.

So let's not try to salvage the legacy API and just move over to the
new one.

We already have finalization tests; they will now test that the new API
functions correctly. But stop looking for the legacy lockfile. We could
instead inspect the staged deployment GVariant, but these checks were
redundant anyway since the tests verify the finalization by actually
rebooting and/or not use `finalize-deployment --allow-unlocked`.

Fixes: coreos/fedora-coreos-tracker#1691
jlebon added a commit to jlebon/rpm-ostree that referenced this issue Apr 26, 2024
Now that we've stabilized and made public deployment finalization APIs,
let's use them.

This also fixes an issue where when creating a locked deployment using
the legacy API (i.e. touching the `/run/ostree/staged-deployment-locked`
file before calling the staging API), if a staged deployment already
exists, libostree would just nuke the lockfile (this behaviour was
introduced in ostreedev/ostree#3077).

In theory the legacy API (via the lockfile) should keep working, but
the core issue is that there's no way for libostree to know if the
lockfile is carried-over state, or was freshly created for the current
invocation.

So let's not try to salvage the legacy API and just move over to the
new one.

We already have finalization tests; they will now test that the new API
functions correctly. But stop looking for the legacy lockfile. We could
instead inspect the staged deployment GVariant, but these checks were
redundant anyway since the tests verify the finalization by actually
rebooting and/or not use `finalize-deployment --allow-unlocked`.

Fixes: coreos/fedora-coreos-tracker#1691
@jlebon
Copy link
Member

jlebon commented Apr 26, 2024

This should be fixed by coreos/rpm-ostree#4939.

Short-term hacks around this:

  • Reboot after installing the package (i.e. remove --apply-live, add --reboot)
  • Manually touch /run/ostree/staged-deployment-locked when Zincati is continually retrying. You can't touch it from your unit. It has to be after Zincati has deployed an update.

dustymabe added a commit to dustymabe/fedora-coreos-docs that referenced this issue Apr 26, 2024
There is a bug that causes nodes started with --apply-live
to not be able to update via zincati [1]. We can add it back
once the bug is taken care of.

[1] coreos/fedora-coreos-tracker#1691
jlebon pushed a commit to coreos/fedora-coreos-docs that referenced this issue Apr 26, 2024
There is a bug that causes nodes started with --apply-live
to not be able to update via zincati [1]. We can add it back
once the bug is taken care of.

[1] coreos/fedora-coreos-tracker#1691
@jlebon jlebon changed the title Txn FinalizeDeployment on /org/projectatomic/rpmostree1/fedora_coreos failed: Staged deployment already unlocked Zincati refuses to update after manually staging a deployment May 3, 2024
@dustymabe dustymabe reopened this May 7, 2024
@dustymabe dustymabe added the status/pending-upstream-release Fixed upstream. Waiting on an upstream component source code release. label May 7, 2024
@dustymabe dustymabe reopened this May 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug status/pending-upstream-release Fixed upstream. Waiting on an upstream component source code release. triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants