Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kola testiso iso-live-login scenario failing #1233

Closed
gursewak1997 opened this issue Jun 16, 2022 · 13 comments
Closed

kola testiso iso-live-login scenario failing #1233

gursewak1997 opened this issue Jun 16, 2022 · 13 comments
Labels

Comments

@gursewak1997
Copy link
Member

gursewak1997 commented Jun 16, 2022

iso-live-login times out in UEFI branch. The test has failed inconsistently recently:

From the logs, it seems like the console isn’t complete. Final message is:
[ 2.966523] ignition[594]: Ignition finished successfully
Console log:console.txt

jlebon added a commit to jlebon/fedora-coreos-pipeline that referenced this issue Jun 22, 2022
We're trying to debug
coreos/fedora-coreos-tracker#1233 but logs
aren't helping much. We think adding `rd.debug` might help in case it's
e.g. a dracut initqueue issue.

Add a separate test for this on a separate ISO so that we don't affect
the official artifacts.
jlebon added a commit to jlebon/fedora-coreos-pipeline that referenced this issue Jun 22, 2022
We're trying to debug
coreos/fedora-coreos-tracker#1233 but logs
aren't helping much. We think adding `rd.debug` might help in case it's
e.g. a dracut initqueue issue.

Add a separate test for this on a separate ISO so that we don't affect
the official artifacts.
jlebon added a commit to coreos/fedora-coreos-pipeline that referenced this issue Jun 22, 2022
We're trying to debug
coreos/fedora-coreos-tracker#1233 but logs
aren't helping much. We think adding `rd.debug` might help in case it's
e.g. a dracut initqueue issue.

Add a separate test for this on a separate ISO so that we don't affect
the official artifacts.
@dustymabe
Copy link
Member

Since we implemented coreos/fedora-coreos-pipeline#557 we haven't seen this in the main pipeline AFAICT, but we have seen it twice in the bump-lockfile job:

@saqibali-2k
Copy link
Member

saqibali-2k commented Jul 18, 2022

Saw this issue in: https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/blue/organizations/jenkins/build/detail/build/900/pipeline/346.
In this error, the issue was not in the UEFI branch

@dustymabe
Copy link
Member

Saw this issue in: https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/blue/organizations/jenkins/build/detail/build/900/pipeline/346. In this error, the issue was not in the UEFI branch

That one was also in the iso-offline-install step.

We saw this iso-offline-install failure again here today.

@prestist
Copy link
Contributor

prestist commented Aug 8, 2022

In running releases for fedora, I ran into this error here https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/21/

@dustymabe
Copy link
Member

Seems to have showed up in rawhide: build#127

@dustymabe
Copy link
Member

I tried to track this down. It appears to me that run-media-iso.mount just never gets started and thus everything else gets stalled and the boot just never continues. The tests then timeout after 10 minutes.

I'm really at a loss for what's going on.

@dustymabe
Copy link
Member

Background context: This issue is hard to reproduce but pops up enough for it to bother us. I can't reproduce the issue locally but can if I run the test in a tight loop in our build infra.

OK. After all my testing I think this is some kind of systemd bug. I instrumented kola AND the build FCOS images so that I could get actual journal logs from the initrd (see dustymabe/coreos-assembler@fb38118 which is a random commit that explains how to do it). I then started comparing the journal from the good and bad runs.

For example:
- journal-good.txt
- journal-bad.txt

What I found is that for the relevant units from live-generator on a bad run the sysroot-xfs-ephemeral-mkfs.service and run-ephemeral.mount run, but none of the other ones do.

One thing to note is that in a bad run the run-ephemeral.mount seems to run a lot earlier than it does in the good run. I tried to force this behavior to see if I could reproduce the issue artificially and I couldn't, but maybe it is relevant in some way.

In order to gain some more clarity about what was blocking things from continuing I decided to write a unit that would print out some information after a period of sleep in the initrd:

diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
index 6a91048d..988177ad 100644
--- a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
@@ -52,4 +52,7 @@ install() {
 
     install_and_enable_unit "coreos-livepxe-persist-osmet.service" \
         "default.target"
+
+    install_and_enable_unit "sleep-emergency.service" \
+        "default.target"
 }
diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service
new file mode 100644
index 00000000..06b50f11
--- /dev/null
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service
@@ -0,0 +1,10 @@
+[Unit]
+OnFailure=emergency.target
+OnFailureJobMode=isolate
+[Service]
+Type=simple
+StandardOutput=journal
+StandardError=journal
+ExecStart=bash -c "sleep 30; systemctl list-jobs --after --before; false"
+[Install]
+RequiredBy=basic.target

So this unit will wait for 30 seconds and then print some information and then exit with failure, triggering emergency.target. On a successful boot this unit will get killed during the sleep. On a stuck boot this unit will kick in.

I managed to get a failure to occur with this in place. The journal
didn't have all of the printed output so I'm attaching the console-emergency.txt.

The interesting piece is here:

[    2.750638] ignition[587]: disks: disks passed
[    2.750946] systemd[1]: Reached target initrd-root-device.target - Initrd Root Device.
[    2.751498] ignition[587]: Ignition finished successfully
[    2.751884] systemd[1]: ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions was skipped because of a failed condition check (ConditionPathIsDirectory=/run/ignition-ostree-transposefs).
[   32.654964] systemd[1]: Mounting run-media-iso.mount - /run/media/iso...
         Mounting run-media-iso.mount - /run/media/iso...
[   32.659865] bash[591]: JOB UNIT                                                                  TYPE  STATE
[   32.660557] bash[591]: 68  ignition-ostree-growfs.service                                        start waiting
[   32.661221] bash[591]: └─  	waiting for job 52 (initrd-root-fs.target/start)                     -     -

You can see that the boot stopped progressing at 2 seconds into the boot, but then at 32s (when the sleep ended) run-media-iso.mount suddenly started to progress (at this point it was too late, though because we were heading to emergency.target.

You can also see from the cleaned up systemctl list-jobs output (systemctl-list-jobs.txt) that run-media-iso.mount appears to be the only unit without something listed that it's blocked by:

[   32.823915] bash[591]: 41  run-media-iso.mount                                                   start running
[   32.824586] bash[591]: └─  	waiting for job 40 (coreos-liveiso-persist-osmet.service/start)      -     -
[   32.825234] bash[591]: └─  	waiting for job 52 (initrd-root-fs.target/start)                     -     -
[   32.825894] bash[591]: └─  	waiting for job 56 (sysroot.mount/start)                             -     -

NOTE: I find the systemctl list-jobs --after --before output confusing. The items listed as waiting for are the units that are blocked by run-media-iso.mount.

Ultimately it looks like there is some internal state somewhere that knows run-media-iso.mount needs to run but nothing telling it to run. Without the unit I added (or some other trigger) the boot hangs and the tests timeout after 10 minutes.

I can't reproduce this with systemd debug logging enabled, which makes sense because there is probably something in there that performs the necessary kick to get the system out of this hung state.

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Sep 7, 2022
We've found the system can stall waiting for run-media-iso.mount
and apparently any operation seems to be effective at reviving
the system. Let's add a workaround that will kick in after 10 seconds
and try to revive the boot.

A lot more context over in
coreos/fedora-coreos-tracker#1233 (comment)
dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Sep 7, 2022
We've found the system can stall waiting for run-media-iso.mount
and apparently any operation seems to be effective at reviving
the system. Let's add a workaround that will kick in after 10 seconds
and try to revive the boot.

A lot more context over in
coreos/fedora-coreos-tracker#1233 (comment)

Here's an example of this working:

```
[    2.749303] systemd[1]: ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions was skipped because of a failed condition check (ConditionPathIsDire
ctory=/run/ignition-ostree-transposefs).
[    2.750547] ignition[591]: disks: disks passed
[    2.750882] ignition[591]: Ignition finished successfully
[   12.645801] kauditd_printk_skb: 19 callbacks suppressed
[   12.645804] audit: type=1131 audit(1662520856.021:30): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=workaround-stalled-media-iso-mount comm="systemd" exe=
"/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   12.647774] bash[564]: warn: tracker issue workaround engaged for coreos/fedora-coreos-tracker#1233
[   12.648646] systemd[1]: workaround-stalled-media-iso-mount.service: Deactivated successfully.
[   12.649281] systemd[1]: Mounting run-media-iso.mount - /run/media/iso...
```
@dustymabe
Copy link
Member

I'm honestly not really sure how to go about peeling back the complexity here to open a systemd issue upstream. Ultimately I think I'm going to drop in a workaround and call it a day.

It appears in order to get the system out of this state you simply need to do something (anything). So we'll just add another unit that waits 10 seconds and then prints a message. We'll also add some instrumentation in our pipeline to let us know when we hit this.

@dustymabe
Copy link
Member

dustymabe added a commit to dustymabe/fedora-coreos-pipeline that referenced this issue Sep 7, 2022
For now we want to notify ourselves when this workaround is observed. It won't
fail the build, just give us information. See
coreos/fedora-coreos-tracker#1233
@dustymabe
Copy link
Member

and here's the code to notify us when this workaround pops up: coreos/fedora-coreos-pipeline#624

If it happens often enough and we get annoyed with it we can just delete the warning in the future.

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Sep 7, 2022
We've found the system can stall waiting for run-media-iso.mount
and apparently any operation seems to be effective at reviving
the system. Let's add a workaround that will kick in after 10 seconds
and try to revive the boot.

A lot more context over in
coreos/fedora-coreos-tracker#1233 (comment)

Here's an example of this working:

```
[    2.749303] systemd[1]: ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions was skipped because of a failed condition check (ConditionPathIsDire
ctory=/run/ignition-ostree-transposefs).
[    2.750547] ignition[591]: disks: disks passed
[    2.750882] ignition[591]: Ignition finished successfully
[   12.645801] kauditd_printk_skb: 19 callbacks suppressed
[   12.645804] audit: type=1131 audit(1662520856.021:30): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=workaround-stalled-media-iso-mount comm="systemd" exe=
"/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   12.647774] bash[564]: warn: tracker issue workaround engaged for coreos/fedora-coreos-tracker#1233
[   12.648646] systemd[1]: workaround-stalled-media-iso-mount.service: Deactivated successfully.
[   12.649281] systemd[1]: Mounting run-media-iso.mount - /run/media/iso...
```
dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue Sep 7, 2022
We've found the system can stall waiting for run-media-iso.mount
and apparently any operation seems to be effective at reviving
the system. Let's add a workaround that will kick in after 10 seconds
and try to revive the boot.

A lot more context over in
coreos/fedora-coreos-tracker#1233 (comment)

Here's an example of this working:

```
[    2.749303] systemd[1]: ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions was skipped because of a failed condition check (ConditionPathIsDire
ctory=/run/ignition-ostree-transposefs).
[    2.750547] ignition[591]: disks: disks passed
[    2.750882] ignition[591]: Ignition finished successfully
[   12.645801] kauditd_printk_skb: 19 callbacks suppressed
[   12.645804] audit: type=1131 audit(1662520856.021:30): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=workaround-stalled-media-iso-mount comm="systemd" exe=
"/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   12.647774] bash[564]: warn: tracker issue workaround engaged for coreos/fedora-coreos-tracker#1233
[   12.648646] systemd[1]: workaround-stalled-media-iso-mount.service: Deactivated successfully.
[   12.649281] systemd[1]: Mounting run-media-iso.mount - /run/media/iso...
```
dustymabe added a commit to coreos/fedora-coreos-pipeline that referenced this issue Sep 7, 2022
For now we want to notify ourselves when this workaround is observed. It won't
fail the build, just give us information. See
coreos/fedora-coreos-tracker#1233
@dustymabe
Copy link
Member

And we see the warning is working. https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/235/ notified us that the workaround was detected. The test passed as it should have.

@dustymabe
Copy link
Member

We haven't seen this warning in a good while so we don't think the workaround is needed any longer. Let's drop it and see if anything breaks: coreos/fedora-coreos-config#2230

@dustymabe
Copy link
Member

Closing this since coreos/fedora-coreos-config#2230 removed the code for this. Will re-open if we re-encounter any issues.

HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
We've found the system can stall waiting for run-media-iso.mount
and apparently any operation seems to be effective at reviving
the system. Let's add a workaround that will kick in after 10 seconds
and try to revive the boot.

A lot more context over in
coreos/fedora-coreos-tracker#1233 (comment)

Here's an example of this working:

```
[    2.749303] systemd[1]: ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions was skipped because of a failed condition check (ConditionPathIsDire
ctory=/run/ignition-ostree-transposefs).
[    2.750547] ignition[591]: disks: disks passed
[    2.750882] ignition[591]: Ignition finished successfully
[   12.645801] kauditd_printk_skb: 19 callbacks suppressed
[   12.645804] audit: type=1131 audit(1662520856.021:30): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=workaround-stalled-media-iso-mount comm="systemd" exe=
"/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   12.647774] bash[564]: warn: tracker issue workaround engaged for coreos/fedora-coreos-tracker#1233
[   12.648646] systemd[1]: workaround-stalled-media-iso-mount.service: Deactivated successfully.
[   12.649281] systemd[1]: Mounting run-media-iso.mount - /run/media/iso...
```
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
We've found the system can stall waiting for run-media-iso.mount
and apparently any operation seems to be effective at reviving
the system. Let's add a workaround that will kick in after 10 seconds
and try to revive the boot.

A lot more context over in
coreos/fedora-coreos-tracker#1233 (comment)

Here's an example of this working:

```
[    2.749303] systemd[1]: ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions was skipped because of a failed condition check (ConditionPathIsDire
ctory=/run/ignition-ostree-transposefs).
[    2.750547] ignition[591]: disks: disks passed
[    2.750882] ignition[591]: Ignition finished successfully
[   12.645801] kauditd_printk_skb: 19 callbacks suppressed
[   12.645804] audit: type=1131 audit(1662520856.021:30): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=workaround-stalled-media-iso-mount comm="systemd" exe=
"/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   12.647774] bash[564]: warn: tracker issue workaround engaged for coreos/fedora-coreos-tracker#1233
[   12.648646] systemd[1]: workaround-stalled-media-iso-mount.service: Deactivated successfully.
[   12.649281] systemd[1]: Mounting run-media-iso.mount - /run/media/iso...
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants