kola testiso iso-live-login scenario failing #1233
We're trying to debug coreos/fedora-coreos-tracker#1233 but logs aren't helping much. We think adding `rd.debug` might help in case it's e.g. a dracut initqueue issue. Add a separate test for this on a separate ISO so that we don't affect the official artifacts.
Since we implemented coreos/fedora-coreos-pipeline#557 we haven't seen this in the main pipeline AFAICT, but we have seen it twice in the bump-lockfile job:
Saw this issue in: https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/blue/organizations/jenkins/build/detail/build/900/pipeline/346.
That one was also in the We saw this
While running releases for Fedora, I ran into this error here: https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/21/
Seems to have shown up in
I tried to track this down. It appears to me that I'm really at a loss for what's going on.
Background context: This issue is hard to reproduce but pops up often enough to bother us. I can't reproduce the issue locally, but I can if I run the test in a tight loop in our build infra.

OK. After all my testing I think this is some kind of systemd bug. I instrumented

For example:

What I found is that for the relevant units from live-generator on a bad run the

One thing to note is that in a bad run the

In order to gain some more clarity about what was blocking things from continuing, I decided to write a unit that would print out some information after a period of sleep in the initrd:

diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
index 6a91048d..988177ad 100644
--- a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
@@ -52,4 +52,7 @@ install() {
     install_and_enable_unit "coreos-livepxe-persist-osmet.service" \
         "default.target"
+
+    install_and_enable_unit "sleep-emergency.service" \
+        "default.target"
 }
diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service
new file mode 100644
index 00000000..06b50f11
--- /dev/null
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service
@@ -0,0 +1,10 @@
+[Unit]
+OnFailure=emergency.target
+OnFailureJobMode=isolate
+[Service]
+Type=simple
+StandardOutput=journal
+StandardError=journal
+ExecStart=bash -c "sleep 30; systemctl list-jobs --after --before; false"
+[Install]
+RequiredBy=basic.target

So this unit will wait for 30 seconds, print some information, and then exit with failure, triggering emergency.target. On a successful boot this unit will get killed during the sleep. On a stuck boot this unit will kick in.

I managed to get a failure to occur with this in place. The journal

The interesting piece is here:
You can see that the boot stopped progressing at 2 seconds into the boot, but then at 32s (when the sleep ended)

You can also see from the cleaned up

NOTE: I find the

Ultimately it looks like there is some internal state somewhere that knows

I can't reproduce this with systemd debug logging enabled, which makes sense because there is probably something in there that performs the necessary kick to get the system out of this hung state.
We've found the system can stall waiting for run-media-iso.mount and apparently any operation seems to be effective at reviving the system. Let's add a workaround that will kick in after 10 seconds and try to revive the boot. A lot more context over in coreos/fedora-coreos-tracker#1233 (comment)

Here's an example of this working:

```
[ 2.749303] systemd[1]: ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions was skipped because of a failed condition check (ConditionPathIsDirectory=/run/ignition-ostree-transposefs).
[ 2.750547] ignition[591]: disks: disks passed
[ 2.750882] ignition[591]: Ignition finished successfully
[ 12.645801] kauditd_printk_skb: 19 callbacks suppressed
[ 12.645804] audit: type=1131 audit(1662520856.021:30): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=workaround-stalled-media-iso-mount comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 12.647774] bash[564]: warn: tracker issue workaround engaged for coreos/fedora-coreos-tracker#1233
[ 12.648646] systemd[1]: workaround-stalled-media-iso-mount.service: Deactivated successfully.
[ 12.649281] systemd[1]: Mounting run-media-iso.mount - /run/media/iso...
```
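The stall shows up in the example log above as a jump in the console timestamps (from about 2.75s to about 12.65s). For anyone triaging similar console logs, here is a minimal Python sketch that finds the largest gap between consecutive timestamped lines. It is not part of kola or the pipeline; the function name `largest_gap` and the regular expression are illustrative assumptions, and it only assumes each line starts with a bracketed `seconds.microseconds` timestamp as in the log above.

```python
import re

# Assumed console prefix format, e.g. "[ 12.645801] " (as in the log above).
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def largest_gap(console_lines):
    """Return (gap_seconds, line_before, line_after) for the biggest jump
    between consecutive timestamped lines, or None if fewer than two
    timestamped lines are present."""
    best = None
    prev_ts, prev_line = None, None
    for line in console_lines:
        match = TIMESTAMP_RE.match(line)
        if not match:
            continue  # skip untimestamped or wrapped lines
        ts = float(match.group(1))
        if prev_ts is not None:
            gap = ts - prev_ts
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_ts, prev_line = ts, line
    return best


# Lines taken from the example log above.
sample = [
    "[ 2.750547] ignition[591]: disks: disks passed",
    "[ 2.750882] ignition[591]: Ignition finished successfully",
    "[ 12.645801] kauditd_printk_skb: 19 callbacks suppressed",
    "[ 12.649281] systemd[1]: Mounting run-media-iso.mount - /run/media/iso...",
]

gap, before, after = largest_gap(sample)
print(f"{gap:.1f}s stall after: {before}")
```

On the sample lines, the largest gap falls right after `Ignition finished successfully`, which matches the last message seen in the failing runs.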
I'm honestly not sure how to go about peeling back the complexity here to open a systemd issue upstream. Ultimately I think I'm going to drop in a workaround and call it a day. It appears that to get the system out of this state you simply need to do something (anything). So we'll just add another unit that waits 10 seconds and then prints a message. We'll also add some instrumentation in our pipeline to let us know when we hit this.
For now we want to notify ourselves when this workaround is observed. It won't fail the build, just give us information. See coreos/fedora-coreos-tracker#1233
And here's the code to notify us when this workaround pops up: coreos/fedora-coreos-pipeline#624. If it happens often enough and we get annoyed with it, we can just delete the warning in the future.
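The idea behind the notification is just to look for the workaround's message in the console output and warn without failing the build. The following is a minimal Python sketch of that idea, not the actual code in coreos/fedora-coreos-pipeline#624; the names `WORKAROUND_MARKER`, `workaround_hits`, and `warning_for` are hypothetical, and the only thing taken from this thread is the message text printed by the workaround unit.

```python
# Message printed by the workaround unit, as seen in the console log
# earlier in this thread.
WORKAROUND_MARKER = (
    "tracker issue workaround engaged for coreos/fedora-coreos-tracker#1233"
)


def workaround_hits(console_text):
    """Return every console line that contains the workaround message."""
    return [
        line for line in console_text.splitlines()
        if WORKAROUND_MARKER in line
    ]


def warning_for(console_text):
    """Return a warning string if the workaround fired, else None.

    The caller is expected to report the warning without failing the
    build, since a run where the workaround fired still passes.
    """
    hits = workaround_hits(console_text)
    if not hits:
        return None
    return f"warning: stalled ISO mount workaround fired {len(hits)} time(s)"


console = (
    "[ 2.750882] ignition[591]: Ignition finished successfully\n"
    "[ 12.647774] bash[564]: warn: tracker issue workaround engaged "
    "for coreos/fedora-coreos-tracker#1233\n"
)
print(warning_for(console))
```

Returning a string instead of raising keeps the check informational, so it can be deleted later without touching the pass/fail logic.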
And we see the warning is working. https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/235/ notified us that the workaround was detected. The test passed as it should have.
We haven't seen this warning in a good while so we don't think the workaround is needed any longer. Let's drop it and see if anything breaks: coreos/fedora-coreos-config#2230
Closing this since coreos/fedora-coreos-config#2230 removed the code for this. Will re-open if we re-encounter any issues.
iso-live-login times out in UEFI branch. The test has failed inconsistently recently:
From the logs, it seems like the console isn’t complete. Final message is:
[ 2.966523] ignition[594]: Ignition finished successfully
Console log: console.txt