Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kola: rawhide: 20220823: hung reboot after simulated disk failure in boot mirror tests #1282

Closed
dustymabe opened this issue Aug 23, 2022 · 11 comments
Labels

Comments

@dustymabe
Copy link
Member

dustymabe commented Aug 23, 2022

The coreos.boot-mirror.luks and coreos.boot-mirror tests are intermittently failing in rawhide(see build#72). They appear to be getting hung up during reboot after detaching the primary disk (simulating a disk failure):

Kernel 6.0.0-0.rc2.19.fc38.x86_64 on an x86_64 (ttyS0)
SSH host key: SHA256:cLWgLYKmVuFgnQsGTQu1SqTYzm85d3mdtzFN9hISmzk (ECDSA)
SSH host key: SHA256:xhQ8cmpyrUg9NqHNHss1aCaISL7NjbikRyTn/BPXbfM (ED25519)
SSH host key: SHA256:5WrL3R0ThWe+dbXyhYNG9bNOWNjo0jQha9ADU/AnJJI (RSA)
ens5: 10.0.2.15 fec0::118e:7c2c:92c5:a1ce
Ignition: ran on 2022/08/23 12:42:42 UTC (at least 2 boots ago)
Ignition: user-provided config was applied
Ignition: wrote ssh authorized keys file for user: core
qemu0 login: [   17.165546] md/raid1:md126: Disk failure on vdb3, disabling device.
[   17.165546] md/raid1:md126: Operation continuing on 1 devices.
[   17.168987] md: super_written gets error=-5
[   17.169323] md/raid1:md127: Disk failure on vdb4, disabling device.
[   17.169323] md/raid1:md127: Operation continuing on 1 devices.
[[0;32m  OK  [0m] Stopped [0;1;39msession-3.scope[0m - Session 3 of User core.
[[0;32m  OK  [0m] Stopped [0;1;39msession-12.scope[0m - Session 12 of User core.
[[0;32m  OK  [0m] Removed slice [0;1;39msystem-after…- Slice /system/afterburn-sshkeys.
[[0;32m  OK  [0m] Removed slice [0;1;39msystem-modpr…lice[0m - Slice /system/modprobe.
[[0;32m  OK  [0m] Removed slice [0;1;39msystem-sshd\…e[0m - Slice /system/sshd-keygen.
[[0;32m  OK  [0m] Stopped target [0;1;39mmulti-user.target[0m - Multi-User System.
[[0;32m  OK  [0m] Stopped target [0;1;39mafterburn-s…hkeys@.service template instances.
[[0;32m  OK  [0m] Stopped target [0;1;39mgetty.target[0m - Login Prompts.
[[0;32m  OK  [0m] Stopped target [0;1;39mmachines.target[0m - Containers.
[[0;32m  OK  [0m] Stopped target [0;1;39mnss-lookup.…m - Host and Network Name Lookups.
[[0;32m  OK  [0m] Stopped target [0;1;39mremote-cryp…et[0m - Remote Encrypted Volumes.
[[0;32m  OK  [0m] Stopped target [0;1;39msound.target[0m - Sound Card.
[[0;32m  OK  [0m] Stopped target [0;1;39mtimers.target[0m - Timer Units.
[[0;32m  OK  [0m] Stopped [0;1;39mfstrim.timer[0m - Discard unused blocks once a week.
[[0;32m  OK  [0m] Stopped [0;1;39mlogrotate.timer[0m - Daily rotation of log files.
[[0;32m  OK  [0m] Stopped [0;1;39mraid-check.timer[… - Weekly RAID setup health check.
[[0;32m  OK  [0m] Stopped [0;1;39mrpm-ostree-countme… Weekly rpm-ostree Count Me timer.
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-tmpfiles-c… Cleanup of Temporary Directories.
[[0;32m  OK  [0m] Closed [0;1;39mlvm2-lvmpolld.socket[0m - LVM2 poll daemon socket.
[[0;32m  OK  [0m] Closed [0;1;39msystemd-coredump.so…et[0m - Process Core Dump Socket.
[[0;32m  OK  [0m] Closed [0;1;39msystemd-rfkill.sock…l Switch Status /dev/rfkill Watch.
         Unmounting [0;1;39metc.mount[0m - /etc...
         Unmounting [0;1;39musr.mount[0m - /usr...
         Stopping [0;1;39mNetworkManager-di…nager Script Dispatcher Service...
         Stopping [0;1;39mchronyd.service[0m - NTP client/server...
[[0;32m  OK  [0m] Stopped [0;1;39mconsole-login-help…via console-login-helper-messages.
         Stopping [0;1;39mdracut-shutdown.s…tore /run/initramfs on shutdown...
         Stopping [0;1;39mgetty@tty1.service[0m - Getty on tty1...
         Stopping [0;1;39mpolkit.service[0m - Authorization Manager...
         Stopping [0;1;39mrpm-ostreed.servi…ostree System Management Daemon...
         Stopping [0;1;39mserial-getty@ttyS…ice[0m - Serial Getty on ttyS0...
         Stopping [0;1;39msshd.service[0m - OpenSSH server daemon...
         Stopping [0;1;39msystemd-hostnamed.service[0m - Hostname Service...
         Stopping [0;1;39msystemd-logind.se…ice[0m - User Login Management...
         Stopping [0;1;39msystemd-random-se…ice[0m - Load/Save Random Seed...
         Stopping [0;1;39muser@1000.service[0m - User Manager for UID 1000...
         Stopping [0;1;39mzincati.service[0m - Zincati Update Agent...
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-logind.service[0m - User Login Management.
[[0;32m  OK  [0m] Stopped [0;1;39msshd.service[0m - OpenSSH server daemon.
[[0;32m  OK  [0m] Stopped [0;1;39mzincati.service[0m - Zincati Update Agent.
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-hostnamed.service[0m - Hostname Service.
[[0;32m  OK  [0m] Stopped [0;1;39mchronyd.service[0m - NTP client/server.
[[0;32m  OK  [0m] Stopped [0;1;39mrpm-ostreed.servic…m-ostree System Management Daemon.
[[0;32m  OK  [0m] Stopped [0;1;39mpolkit.service[0m - Authorization Manager.
[[0;32m  OK  [0m] Stopped [0;1;39mNetworkManager-dis…Manager Script Dispatcher Service.
[[0;32m  OK  [0m] Stopped [0;1;39mgetty@tty1.service[0m - Getty on tty1.
[[0;32m  OK  [0m] Stopped [0;1;39mserial-getty@ttyS0…rvice[0m - Serial Getty on ttyS0.
[[0;32m  OK  [0m] Stopped [0;1;39muser@1000.service[0m - User Manager for UID 1000.
[[0;1;31mFAILED[0m] Failed unmounting [0;1;39metc.mount[0m - /etc.
[[0;1;31mFAILED[0m] Failed unmounting [0;1;39musr.mount[0m - /usr.
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-random-see…rvice[0m - Load/Save Random Seed.
[[0;32m  OK  [0m] Stopped [0;1;39mdracut-shutdown.se…estore /run/initramfs on shutdown.
[[0;32m  OK  [0m] Removed slice [0;1;39msystem-getty.slice[0m - Slice /system/getty.
[[0;32m  OK  [0m] Removed slice [0;1;39msystem-seria…[0m - Slice /system/serial-getty.
[[0;32m  OK  [0m] Stopped target [0;1;39mboot-comple…arget[0m - Boot Completion Check.
         Starting [0;1;39mdracut-shutdown-o…down failure to perform cleanup...
         Stopping [0;1;39muser-runtime-dir@…untime Directory /run/user/1000...
[[0;32m  OK  [0m] Finished [0;1;39mdracut-shutdown-o…utdown failure to perform cleanup.
[[0;32m  OK  [0m] Unmounted [0;1;39mrun-user-1000.mount[0m - /run/user/1000.
[[0;32m  OK  [0m] Stopped [0;1;39muser-runtime-dir@1… Runtime Directory /run/user/1000.
[[0;32m  OK  [0m] Removed slice [0;1;39muser-1000.slice[0m - User Slice of UID 1000.
         Stopping [0;1;39msystemd-user-sess…vice[0m - Permit User Sessions...
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-user-sessi…ervice[0m - Permit User Sessions.
[[0;32m  OK  [0m] Stopped target [0;1;39mnetwork.target[0m - Network.
[[0;32m  OK  [0m] Stopped target [0;1;39mnss-user-lo…[0m - User and Group Name Lookups.
[[0;32m  OK  [0m] Stopped target [0;1;39mremote-fs.target[0m - Remote File Systems.
[[0;32m  OK  [0m] Stopped target [0;1;39mremote-fs-p…eparation for Remote File Systems.
         Stopping [0;1;39mNetworkManager.service[0m - Network Manager...
[[0;32m  OK  [0m] Stopped [0;1;39mconsole-login-help…via console-login-helper-messages.
[[0;32m  OK  [0m] Stopped target [0;1;39msshd-keygen.target[0m.
[[0;32m  OK  [0m] Stopped [0;1;39mcoreos-ignition-wr…reate Ignition Status Issue Files.
         Stopping [0;1;39msystemd-homed-act…vice[0m - Home Area Activation...
[[0;32m  OK  [0m] Stopped [0;1;39mNetworkManager.service[0m - Network Manager.
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-homed-acti…ervice[0m - Home Area Activation.
[[0;32m  OK  [0m] Stopped target [0;1;39mnetwork-pre…get[0m - Preparation for Network.
         Stopping [0;1;39msystemd-homed.service[0m - Home Area Manager...
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-homed.service[0m - Home Area Manager.
[[0;32m  OK  [0m] Stopped target [0;1;39mbasic.target[0m - Basic System.
[[0;32m  OK  [0m] Stopped target [0;1;39mpaths.target[0m - Path Units.
[[0;32m  OK  [0m] Stopped [0;1;39mostree-finalize-st… OSTree Monitor Staged Deployment.
[[0;32m  OK  [0m] Stopped target [0;1;39mslices.target[0m - Slice Units.
[[0;32m  OK  [0m] Removed slice [0;1;39muser.slice[0m - User and Session Slice.
[[0;32m  OK  [0m] Stopped target [0;1;39msockets.target[0m - Socket Units.
[[0;32m  OK  [0m] Closed [0;1;39mbootupd.socket[0m.
[[0;32m  OK  [0m] Closed [0;1;39mdocker.socket[0m - Docker Socket for the API.
[[0;32m  OK  [0m] Closed [0;1;39miscsid.socket[0m - Open-iSCSI iscsid Socket.
[[0;32m  OK  [0m] Closed [0;1;39miscsiuio.socket[0m - Open-iSCSI iscsiuio Socket.
         Stopping [0;1;39mdbus-broker.servi…[0m - D-Bus System Message Bus...
[[0;32m  OK  [0m] Stopped [0;1;39mdbus-broker.service[0m - D-Bus System Message Bus.
[[0;32m  OK  [0m] Closed [0;1;39mdbus.socket[0m - D-Bus System Message Bus Socket.
[[0;32m  OK  [0m] Stopped target [0;1;39msysinit.target[0m - System Initialization.
[[0;32m  OK  [0m] Unset automount [0;1;39mproc-sys-f…rmats File System Automount Point.
[[0;32m  OK  [0m] Stopped target [0;1;39mcryptsetup.…get[0m - Local Encrypted Volumes.
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-ask-passwo…quests to Console Directory Watch.
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-ask-passwo… Requests to Wall Directory Watch.
[[0;32m  OK  [0m] Stopped target [0;1;39mintegrityse…Local Integrity Protected Volumes.
[[0;32m  OK  [0m] Stopped target [0;1;39mveritysetup… - Local Verity Protected Volumes.
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-boot-updat…0m - Automatic Boot Loader Update.
         Stopping [0;1;39msystemd-resolved.…e[0m - Network Name Resolution...
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-sysctl.service[0m - Apply Kernel Variables.
[[0;32m  OK  [0m] Stopped [0;1;39mcoreos-printk-quie…eOS: Set printk To Level 4 (warn).
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-modules-lo…service[0m - Load Kernel Modules.
         Stopping [0;1;39msystemd-update-ut…rd System Boot/Shutdown in UTMP...
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-resolved.s…ice[0m - Network Name Resolution.
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-update-utm…cord System Boot/Shutdown in UTMP.
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-tmpfiles-s…te Volatile Files and Directories.
[[0;32m  OK  [0m] Stopped target [0;1;39mlocal-fs.target[0m - Local File Systems.
         Unmounting [0;1;39mboot.mount[0m - CoreOS Dynamic Mount for /boot...
         Unmounting [0;1;39mtmp.mount[0m - Temporary Directory /tmp...
         Stopping [0;1;39msystemd-journal-f…h Journal to Persistent Storage...
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-journal-fl…ush Journal to Persistent Storage.
[[0;32m  OK  [0m] Stopped [0;1;39mostree-remount.ser… - OSTree Remount OS/ Bind Mounts.
         Unmounting [0;1;39mvar.mount[0m - /var...
[[0;32m  OK  [0m] Unmounted [0;1;39mboot.mount[0m - CoreOS Dynamic Mount for /boot.
[[0;32m  OK  [0m] Unmounted [0;1;39mtmp.mount[0m - Temporary Directory /tmp.
[[0;32m  OK  [0m] Stopped target [0;1;39mswap.target[0m - Swaps.
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-fsck@dev-d…32080-d1b0-4d76-8ea2-c0fdd5f2737a.
[[0;32m  OK  [0m] Removed slice [0;1;39msystem-syste…[0m - Slice /system/systemd-fsck.
[[0;32m  OK  [0m] Unmounted [0;1;39mvar.mount[0m - /var.
         Unmounting [0;1;39msysroot-ostree-…ostree/deploy/fedora-coreos/var...
[[0;32m  OK  [0m] Unmounted [0;1;39msysroot-ostree-d…t/ostree/deploy/fedora-coreos/var.
         Unmounting [0;1;39msysroot.mount[0m - /sysroot...
[[0;32m  OK  [0m] Unmounted [0;1;39msysroot.mount[0m - /sysroot.
[[0;32m  OK  [0m] Stopped target [0;1;39mblockdev@de… Preparation for /dev/mapper/root.
[[0;32m  OK  [0m] Stopped target [0;1;39mlocal-fs-pr…reparation for Local File Systems.
         Stopping [0;1;39mlvm2-monitor.serv…ng dmeventd or progress polling...
         Stopping [0;1;39msystemd-cryptsetu…m - Cryptography Setup for root...
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-remount-fs…ount Root and Kernel File Systems.
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-tmpfiles-s…reate Static Device Nodes in /dev.
[[0;32m  OK  [0m] Stopped [0;1;39msystemd-cryptsetup…[0m - Cryptography Setup for root.
[[0;32m  OK  [0m] Stopped target [0;1;39mcryptsetup-…m - Local Encrypted Volumes (Pre).
[[0;32m  OK  [0m] Reached target [0;1;39mumount.target[0m - Unmount All Filesystems.
[[0;32m  OK  [0m] Stopped [0;1;39mlvm2-monitor.servi…sing dmeventd or progress polling.
[[0;32m  OK  [0m] Reached target [0;1;39mshutdown.target[0m - System Shutdown.
[[0;32m  OK  [0m] Reached target [0;1;39mfinal.target[0m - Late Shutdown Services.
[[0;32m  OK  [0m] Finished [0;1;39msystemd-reboot.service[0m - System Reboot.
[[0;32m  OK  [0m] Reached target [0;1;39mreboot.target[0m - System Reboot.
[   17.978854] block device autoloading is deprecated and will be removed.
[   17.982555] block device autoloading is deprecated and will be removed.
[   17.985537] block device autoloading is deprecated and will be removed.
[   17.987546] block device autoloading is deprecated and will be removed.
[   17.989540] block device autoloading is deprecated and will be removed.
[   17.991547] block device autoloading is deprecated and will be removed.
[   17.993555] block device autoloading is deprecated and will be removed.
[   17.995539] block device autoloading is deprecated and will be removed.
[   17.997577] block device autoloading is deprecated and will be removed.
[   17.999544] block device autoloading is deprecated and will be removed.
[   22.979465] blkdev_get_no_open: 1666 callbacks suppressed
[   22.979467] block device autoloading is deprecated and will be removed.
[   22.984459] block device autoloading is deprecated and will be removed.
[   22.986473] block device autoloading is deprecated and will be removed.
[   22.988470] block device autoloading is deprecated and will be removed.
[   22.990469] block device autoloading is deprecated and will be removed.
[   22.992471] block device autoloading is deprecated and will be removed.
[   22.994471] block device autoloading is deprecated and will be removed.
[   22.996469] block device autoloading is deprecated and will be removed.
[   22.998468] block device autoloading is deprecated and will be removed.
[   23.000470] block device autoloading is deprecated and will be removed.
...
...
...
[  618.221270] blkdev_get_no_open: 1664 callbacks suppressed
[  618.221273] block device autoloading is deprecated and will be removed.
[  618.224274] block device autoloading is deprecated and will be removed.
[  618.227267] block device autoloading is deprecated and will be removed.
[  618.229274] block device autoloading is deprecated and will be removed.
[  618.231277] block device autoloading is deprecated and will be removed.
[  618.233277] block device autoloading is deprecated and will be removed.
[  618.235282] block device autoloading is deprecated and will be removed.
[  618.237370] block device autoloading is deprecated and will be removed.
[  618.239356] block device autoloading is deprecated and will be removed.
[  618.241290] block device autoloading is deprecated and will be removed.

I think I have tracked this down to some issue in the 6.0.0-0.rc2.19.fc38 kernel. Advancing the kernel the the newer version that was just built today seems to resolve the issue.

Here are some logs for the failure:

console.txt
journal.txt

@jlebon
Copy link
Member

jlebon commented Aug 23, 2022

Nice work digging into this!

@dustymabe
Copy link
Member Author

dustymabe commented Aug 23, 2022

Added a kernel fast-track over in coreos/fedora-coreos-config#1926

@dustymabe
Copy link
Member Author

This one might be more complicated. I wasn't able to reproduce locally with kernel-5.19.3-300.fc37, but now there's a branched build (build#81) that is showing it with that kernel.

@dustymabe
Copy link
Member Author

dustymabe commented Aug 26, 2022

OK. More information.

I think this is a kernel regression that still isn't fixed in latest upstream kernel git tree. The reason I thought kernel-6.0.0-0.rc2.20220823git072e51356cd5.20.fc38 was good is because it passes the test and works fine. This is most likely because it is a debug kernel and has different perf characteristics than a nodebug kernel.

I did a kernel bisect and believe the offending kernel commit is a09b314. I've reported a Fedora bug and started a mailing list discussion upstream to see if we can get the experts to evaluate it.

On a related note I think we should stop testing debug kernels in rawhide. I've spent a decent amount of time in recent months either chasing down issues that only occur on debug kernels OR getting false positive or negative test results because of some behavior that happens on debug kernels and doesn't happen on nodebug kernels.

@cgwalters
Copy link
Member

Awesome work bisecting!

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Aug 26, 2022
They are failing because of a kernel regression and we are actively
working on getting a fix upstream and backported. Let's snooze these
tests for now.

See coreos/fedora-coreos-tracker#1282
dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue Aug 26, 2022
They are failing because of a kernel regression and we are actively
working on getting a fix upstream and backported. Let's snooze these
tests for now.

See coreos/fedora-coreos-tracker#1282
@dustymabe
Copy link
Member Author

dustymabe commented Aug 29, 2022

With the introduction of kernel-5.19.4-200.fc36 in Fedora 36 this is now affecting testing-devel. (see bump-lockfile#40).

Opened a pin request here: coreos/fedora-coreos-config#1936

@dustymabe
Copy link
Member Author

OK. Now that rawhide is on a nodebug kernel (the latest build is kernel-6.0.0-0.rc3.25.fc38 we are seeing this issue again (fun timing differences in debug versus nodebug kernels).

We'll snooze these tests for rawhide too while we wait for a fix for https://bugzilla.redhat.com/show_bug.cgi?id=2121791

@dustymabe
Copy link
Member Author

@dustymabe
Copy link
Member Author

The revert for the offending kernel commit landed upstream in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4c66a326b5ab784cddd72de07ac5b6210e9e1b06

@dustymabe
Copy link
Member Author

That commit made it into v6.0-rc7 and was backported to v5.19.12 in https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=48a12961e800aa404fefa845aa95098d18a30f19

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Sep 29, 2022
The revert of the problematic commit that caused the hung reboots
in our `coreos.boot-mirror` tests landed upstream in the 6.0 development
cycle and was backported to 5.19.12. This means the Fedora kernel
maintainer has dropped his manual revert of the commit and also that
rawhide should no longer see the issue (since it's already on 6.0 rc
kernels.

Closes coreos/fedora-coreos-tracker#1282
@dustymabe
Copy link
Member Author

This problem never made it to our production streams.

HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
They are failing because of a kernel regression and we are actively
working on getting a fix upstream and backported. Let's snooze these
tests for now.

See coreos/fedora-coreos-tracker#1282
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
The revert of the problematic commit that caused the hung reboots
in our `coreos.boot-mirror` tests landed upstream in the 6.0 development
cycle and was backported to 5.19.12. This means the Fedora kernel
maintainer has dropped his manual revert of the commit and also that
rawhide should no longer see the issue (since it's already on 6.0 rc
kernels.

Closes coreos/fedora-coreos-tracker#1282
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
They are failing because of a kernel regression and we are actively
working on getting a fix upstream and backported. Let's snooze these
tests for now.

See coreos/fedora-coreos-tracker#1282
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
The revert of the problematic commit that caused the hung reboots
in our `coreos.boot-mirror` tests landed upstream in the 6.0 development
cycle and was backported to 5.19.12. This means the Fedora kernel
maintainer has dropped his manual revert of the commit and also that
rawhide should no longer see the issue (since it's already on 6.0 rc
kernels.

Closes coreos/fedora-coreos-tracker#1282
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants