kola testiso `iso-live-login` scenario failing #1233

gursewak1997 · 2022-06-16T18:01:45Z

iso-live-login times out in UEFI branch. The test has failed inconsistently recently:

From the logs, it seems like the console isn’t complete. Final message is:
[ 2.966523] ignition[594]: Ignition finished successfully
Console log:console.txt

The text was updated successfully, but these errors were encountered:

We're trying to debug coreos/fedora-coreos-tracker#1233 but logs aren't helping much. We think adding `rd.debug` might help in case it's e.g. a dracut initqueue issue. Add a separate test for this on a separate ISO so that we don't affect the official artifacts.

dustymabe · 2022-07-12T02:18:32Z

Since we implemented coreos/fedora-coreos-pipeline#557 we haven't seen this in the main pipeline AFAICT, but we have seen it twice in the bump-lockfile job:

saqibali-2k · 2022-07-18T17:52:28Z

Saw this issue in: https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/blue/organizations/jenkins/build/detail/build/900/pipeline/346.
In this error, the issue was not in the UEFI branch

dustymabe · 2022-07-20T17:31:06Z

Saw this issue in: https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/blue/organizations/jenkins/build/detail/build/900/pipeline/346. In this error, the issue was not in the UEFI branch

That one was also in the iso-offline-install step.

We saw this iso-offline-install failure again here today.

prestist · 2022-08-08T18:47:41Z

In running releases for fedora, I ran into this error here https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/21/

dustymabe · 2022-09-02T02:09:59Z

Seems to have showed up in rawhide: build#127

dustymabe · 2022-09-05T22:37:52Z

I tried to track this down. It appears to me that run-media-iso.mount just never gets started and thus everything else gets stalled and the boot just never continues. The tests then timeout after 10 minutes.

I'm really at a loss for what's going on.

dustymabe · 2022-09-07T01:42:38Z

Background context: This issue is hard to reproduce but pops up enough for it to bother us. I can't reproduce the issue locally but can if I run the test in a tight loop in our build infra.

OK. After all my testing I think this is some kind of systemd bug. I instrumented kola AND the build FCOS images so that I could get actual journal logs from the initrd (see dustymabe/coreos-assembler@fb38118 which is a random commit that explains how to do it). I then started comparing the journal from the good and bad runs.

For example:
- journal-good.txt
- journal-bad.txt

What I found is that for the relevant units from live-generator on a bad run the sysroot-xfs-ephemeral-mkfs.service and run-ephemeral.mount run, but none of the other ones do.

One thing to note is that in a bad run the run-ephemeral.mount seems to run a lot earlier than it does in the good run. I tried to force this behavior to see if I could reproduce the issue artificially and I couldn't, but maybe it is relevant in some way.

In order to gain some more clarity about what was blocking things from continuing I decided to write a unit that would print out some information after a period of sleep in the initrd:

diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
index 6a91048d..988177ad 100644
--- a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
@@ -52,4 +52,7 @@ install() {
 
     install_and_enable_unit "coreos-livepxe-persist-osmet.service" \
         "default.target"
+
+    install_and_enable_unit "sleep-emergency.service" \
+        "default.target"
 }
diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service
new file mode 100644
index 00000000..06b50f11
--- /dev/null
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service
@@ -0,0 +1,10 @@
+[Unit]
+OnFailure=emergency.target
+OnFailureJobMode=isolate
+[Service]
+Type=simple
+StandardOutput=journal
+StandardError=journal
+ExecStart=bash -c "sleep 30; systemctl list-jobs --after --before; false"
+[Install]
+RequiredBy=basic.target

So this unit will wait for 30 seconds and then print some information and then exit with failure, triggering emergency.target. On a successful boot this unit will get killed during the sleep. On a stuck boot this unit will kick in.

I managed to get a failure to occur with this in place. The journal
didn't have all of the printed output so I'm attaching the console-emergency.txt.

The interesting piece is here:

[    2.750638] ignition[587]: disks: disks passed
[    2.750946] systemd[1]: Reached target initrd-root-device.target - Initrd Root Device.
[    2.751498] ignition[587]: Ignition finished successfully
[    2.751884] systemd[1]: ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions was skipped because of a failed condition check (ConditionPathIsDirectory=/run/ignition-ostree-transposefs).
[   32.654964] systemd[1]: Mounting run-media-iso.mount - /run/media/iso...
         Mounting run-media-iso.mount - /run/media/iso...
[   32.659865] bash[591]: JOB UNIT                                                                  TYPE  STATE
[   32.660557] bash[591]: 68  ignition-ostree-growfs.service                                        start waiting
[   32.661221] bash[591]: └─  	waiting for job 52 (initrd-root-fs.target/start)                     -     -

You can see that the boot stopped progressing at 2 seconds into the boot, but then at 32s (when the sleep ended) run-media-iso.mount suddenly started to progress (at this point it was too late, though because we were heading to emergency.target.

You can also see from the cleaned up systemctl list-jobs output (systemctl-list-jobs.txt) that run-media-iso.mount appears to be the only unit without something listed that it's blocked by:

[   32.823915] bash[591]: 41  run-media-iso.mount                                                   start running
[   32.824586] bash[591]: └─  	waiting for job 40 (coreos-liveiso-persist-osmet.service/start)      -     -
[   32.825234] bash[591]: └─  	waiting for job 52 (initrd-root-fs.target/start)                     -     -
[   32.825894] bash[591]: └─  	waiting for job 56 (sysroot.mount/start)                             -     -

NOTE: I find the systemctl list-jobs --after --before output confusing. The items listed as waiting for are the units that are blocked by run-media-iso.mount.

Ultimately it looks like there is some internal state somewhere that knows run-media-iso.mount needs to run but nothing telling it to run. Without the unit I added (or some other trigger) the boot hangs and the tests timeout after 10 minutes.

I can't reproduce this with systemd debug logging enabled, which makes sense because there is probably something in there that performs the necessary kick to get the system out of this hung state.

We've found the system can stall waiting for run-media-iso.mount and apparently any operation seems to be effective at reviving the system. Let's add a workaround that will kick in after 10 seconds and try to revive the boot. A lot more context over in coreos/fedora-coreos-tracker#1233 (comment)

We've found the system can stall waiting for run-media-iso.mount and apparently any operation seems to be effective at reviving the system. Let's add a workaround that will kick in after 10 seconds and try to revive the boot. A lot more context over in coreos/fedora-coreos-tracker#1233 (comment) Here's an example of this working: ``` [ 2.749303] systemd[1]: ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions was skipped because of a failed condition check (ConditionPathIsDire ctory=/run/ignition-ostree-transposefs). [ 2.750547] ignition[591]: disks: disks passed [ 2.750882] ignition[591]: Ignition finished successfully [ 12.645801] kauditd_printk_skb: 19 callbacks suppressed [ 12.645804] audit: type=1131 audit(1662520856.021:30): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=workaround-stalled-media-iso-mount comm="systemd" exe= "/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 12.647774] bash[564]: warn: tracker issue workaround engaged for coreos/fedora-coreos-tracker#1233 [ 12.648646] systemd[1]: workaround-stalled-media-iso-mount.service: Deactivated successfully. [ 12.649281] systemd[1]: Mounting run-media-iso.mount - /run/media/iso... ```

dustymabe · 2022-09-07T03:34:51Z

I'm honestly not really sure how to go about peeling back the complexity here to open a systemd issue upstream. Ultimately I think I'm going to drop in a workaround and call it a day.

It appears in order to get the system out of this state you simply need to do something (anything). So we'll just add another unit that waits 10 seconds and then prints a message. We'll also add some instrumentation in our pipeline to let us know when we hit this.

dustymabe · 2022-09-07T03:35:01Z

coreos/fedora-coreos-config#1961

For now we want to notify ourselves when this workaround is observed. It won't fail the build, just give us information. See coreos/fedora-coreos-tracker#1233

dustymabe · 2022-09-07T03:44:05Z

and here's the code to notify us when this workaround pops up: coreos/fedora-coreos-pipeline#624

If it happens often enough and we get annoyed with it we can just delete the warning in the future.

We've found the system can stall waiting for run-media-iso.mount and apparently any operation seems to be effective at reviving the system. Let's add a workaround that will kick in after 10 seconds and try to revive the boot. A lot more context over in coreos/fedora-coreos-tracker#1233 (comment) Here's an example of this working: ``` [ 2.749303] systemd[1]: ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions was skipped because of a failed condition check (ConditionPathIsDire ctory=/run/ignition-ostree-transposefs). [ 2.750547] ignition[591]: disks: disks passed [ 2.750882] ignition[591]: Ignition finished successfully [ 12.645801] kauditd_printk_skb: 19 callbacks suppressed [ 12.645804] audit: type=1131 audit(1662520856.021:30): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=workaround-stalled-media-iso-mount comm="systemd" exe= "/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 12.647774] bash[564]: warn: tracker issue workaround engaged for coreos/fedora-coreos-tracker#1233 [ 12.648646] systemd[1]: workaround-stalled-media-iso-mount.service: Deactivated successfully. [ 12.649281] systemd[1]: Mounting run-media-iso.mount - /run/media/iso... ```

For now we want to notify ourselves when this workaround is observed. It won't fail the build, just give us information. See coreos/fedora-coreos-tracker#1233

dustymabe · 2022-09-16T14:38:22Z

And we see the warning is working. https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/235/ notified us that the workaround was detected. The test passed as it should have.

dustymabe · 2023-02-13T18:01:39Z

We haven't seen this warning in a good while so we don't think the workaround is needed any longer. Let's drop it and see if anything breaks: coreos/fedora-coreos-config#2230

dustymabe · 2023-02-13T19:12:00Z

Closing this since coreos/fedora-coreos-config#2230 removed the code for this. Will re-open if we re-encounter any issues.

We've found the system can stall waiting for run-media-iso.mount and apparently any operation seems to be effective at reviving the system. Let's add a workaround that will kick in after 10 seconds and try to revive the boot. A lot more context over in coreos/fedora-coreos-tracker#1233 (comment) Here's an example of this working: ``` [ 2.749303] systemd[1]: ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions was skipped because of a failed condition check (ConditionPathIsDire ctory=/run/ignition-ostree-transposefs). [ 2.750547] ignition[591]: disks: disks passed [ 2.750882] ignition[591]: Ignition finished successfully [ 12.645801] kauditd_printk_skb: 19 callbacks suppressed [ 12.645804] audit: type=1131 audit(1662520856.021:30): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=workaround-stalled-media-iso-mount comm="systemd" exe= "/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 12.647774] bash[564]: warn: tracker issue workaround engaged for coreos/fedora-coreos-tracker#1233 [ 12.648646] systemd[1]: workaround-stalled-media-iso-mount.service: Deactivated successfully. [ 12.649281] systemd[1]: Mounting run-media-iso.mount - /run/media/iso... ```

gursewak1997 added the kind/bug label Jun 16, 2022

jlebon mentioned this issue Jun 22, 2022

jobs/build: add temporary iso-live-login UEFI test with rd.debug coreos/fedora-coreos-pipeline#557

Merged

dustymabe mentioned this issue Sep 7, 2022

overlay.d/35coreos-live: add workaround for stalled iso-live-login boot coreos/fedora-coreos-config#1961

Merged

dustymabe mentioned this issue Sep 26, 2022

live-generator: Use a loop to work around potential races with udev coreos/fedora-coreos-config#1986

Merged

dustymabe closed this as completed Feb 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kola testiso `iso-live-login` scenario failing #1233

kola testiso `iso-live-login` scenario failing #1233

gursewak1997 commented Jun 16, 2022 •

edited

dustymabe commented Jul 12, 2022

saqibali-2k commented Jul 18, 2022 •

edited

dustymabe commented Jul 20, 2022

prestist commented Aug 8, 2022

dustymabe commented Sep 2, 2022

dustymabe commented Sep 5, 2022

dustymabe commented Sep 7, 2022

dustymabe commented Sep 7, 2022

dustymabe commented Sep 7, 2022

dustymabe commented Sep 7, 2022

dustymabe commented Sep 16, 2022

dustymabe commented Feb 13, 2023

dustymabe commented Feb 13, 2023

kola testiso iso-live-login scenario failing #1233

kola testiso iso-live-login scenario failing #1233

Comments

gursewak1997 commented Jun 16, 2022 • edited

dustymabe commented Jul 12, 2022

saqibali-2k commented Jul 18, 2022 • edited

dustymabe commented Jul 20, 2022

prestist commented Aug 8, 2022

dustymabe commented Sep 2, 2022

dustymabe commented Sep 5, 2022

dustymabe commented Sep 7, 2022

dustymabe commented Sep 7, 2022

dustymabe commented Sep 7, 2022

dustymabe commented Sep 7, 2022

dustymabe commented Sep 16, 2022

dustymabe commented Feb 13, 2023

dustymabe commented Feb 13, 2023

kola testiso `iso-live-login` scenario failing #1233

kola testiso `iso-live-login` scenario failing #1233

gursewak1997 commented Jun 16, 2022 •

edited

saqibali-2k commented Jul 18, 2022 •

edited