New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Ubuntu 22.04]cloud-init failed to complete after 10 minutes of waiting was shown during Installation via iDRAC Virtual Console #4009
Comments
Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-08-17T10:13:39.377607+00:00 Launchpad attachments: Attaching cloud-init Installation logs under /var/log |
Launchpad user James Falcon(falcojr) wrote on 2022-08-17T14:50:02.501990+00:00 I took a look at the logs, and this doesn't appear to be a problem with cloud-init specifically. Cloud-init runs multiple times during boot, and each run has its own set of systemd services that it depends on having run first. If we look at cloud-init.log, we can see that there's a time where one cloud-init instance exits, then a greater than 10 minute gap before cloud-init's final service runs: 2022-08-17 06:38:53,429 - util.py[DEBUG]: cloud-init mode 'modules' took 0.139 seconds (0.14) This can happen if one of cloud-init's dependent services has stalled or failed. To see if another service has stalled or failed, you could try the following commands: Does the entire installation fail, or just cloud-init? |
Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-08-22T07:21:17.471994+00:00 James Falcon, Does the entire installation fail, or just cloud-init? Only cloud-init will fail. Once issue occurs when hit 'cancel' we can proceed with the further Installation successfully. |
Launchpad user Chad Smith(chad.smith) wrote on 2022-08-24T16:26:46.172306+00:00 Thanks for the info, shubhakara. To debug further what we need from you on the failed installed system once it boots is the output of
This will help us determine what other systemd service ordering seems to be delaying cloud-init boot stages from running. Most of he data above gets collected by running Given that you mentioned clicking "cancel" and there is a /var/log/intaller subdirectory, this looks like you are using the ubuntu server live-install (subiquity) to install this system. I'm adding subiquity project too just to keep devs aware. Looking through /var/log/installer/subiquity-server-debug*log I can see timeout logs on cloud-init status --wait2022-08-17 06:39:32,689 DEBUG subiquitycore.utils:77 arun_command called: ['cloud-init', 'status', '--wait'] |
Launchpad user Chad Smith(chad.smith) wrote on 2022-08-24T16:28:00.754191+00:00 Please set the bug status back to "New" when further comments have been added to make sure our team gets alerted about the updated comments so we can respond more quickly. |
Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-08-25T11:49:18.959580+00:00 Chad Smith, Here is the attachment using 1G end to end connection without VPN. In the old attachment shared previously, virtual-media used over a VPN connection to install the OS. |
Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-08-25T13:58:26.496810+00:00 |
Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-08-25T14:03:04.191000+00:00 Chad Smith, File "cloud-init-install-fail.tar" in comment #6 contains the cloud init logs collected under /var/log. |
Launchpad user Chad Smith(chad.smith) wrote on 2022-08-26T21:30:59.368303+00:00 Thanks Shubhakar Gowda P S (shubhakara) for getting back to us.
I can see a snapd log in your journalctl -b 0 output that says snapd was timed out after an inordinate amount of time waiting for NTP: Aug 25 08:55:42.976933 ubuntu-server snapd[1642]: devicemgr.go:1927: no NTP sync after 10m0s, I also see a whole bunch of ntpd and network.timed The systemd sevice which performs cloud configuration cloud-config.service has a clause which makes it wait until snapd.seeded.service completes in the booting environment so cloud-init use of any snap tools/services installed: After=snapd.seeded.service So this 10 minute wait is due to something in either networking setup in the ephemeral/early install environment not being configured appropriately for snapd to come up. Once snapd unblocks, cloud-init can complete. You can confirm that the problem is snap seeding on your system with the commands we listed above in comments #2 and #4:
That should point to snap seeding being the problem in this early/ephemeral install environment. Also in your journalctl -b 0 output (in cloud-init collect-logs /var/log/journal.txt) A red flag I'm also seeing is repeated logs in your journal that look like thrashing of time-related services: Something is amiss with ntp or network configuration on this installation and it looks like it is causing issues for snapd which cause issue for cloud-init to make progress on the install of this system. In checking /var/log/cloud-init-output.log I see that the two NICs are listed as link down: But journalctl -b 0 showed them as "UP": Something seems to be turning off the NICs on this device maybe? Can you please provide:
|
Launchpad user Michael Hudson-Doyle(mwhudson) wrote on 2022-08-29T09:09:15.022307+00:00 What's going on here is I think that the md5 check casper is doing is using all the (minimal) amount of IO bandwidth to the virtual ISO and starving other processes from making progress. We should probably make the md5 check run under ionice or just abort itself if it's slow or something. |
Launchpad user Michael Hudson-Doyle(mwhudson) wrote on 2022-08-29T09:14:34.857104+00:00 Would someone be able to test a custom ISO that I can make with https://code.launchpad.net/~mwhudson/casper/+git/casper/+merge/429055 incorporated? |
Launchpad user Narendra K(knarendra) wrote on 2022-08-30T16:39:29.744108+00:00 Hi Michael, Thank you for taking a look. Please provide the custom ISO. We will test and share findings. |
Launchpad user Narendra K(knarendra) wrote on 2022-08-30T19:46:09.376088+00:00 Please find the details: systemd-analyze-blame-output 12min 33.112s casper-md5check.service "systemd-analyze-critical-chain-output" The time when unit became active or started is printed after the "@" character. graphical.target @13min 17.133s "cloud-init-analyze-blame-output" -- Boot Record 01 -- "cloud-init-analyze-show-output" -- Boot Record 01 -- Starting stage: init-local Starting stage: init-network Starting stage: modules-config Starting stage: modules-final Total Time: 6.25300 seconds 1 boot records analyzed |
Launchpad user Narendra K(knarendra) wrote on 2022-08-31T10:33:27.032168+00:00 Launchpad attachments: Installation showing failure |
Launchpad user Narendra K(knarendra) wrote on 2022-08-31T10:36:46.179183+00:00 Hi Chad/Michael, When issue occurs, the installation stops as shown in the screen attached in comment #14. If "Close" option is selected, installation completes successfully and OS completes successfully. Does choosing "Close" option and proceeding with installation have any side effects or could it be recommended as a work around till a solution is included in the installation ISO ? |
Launchpad user Michael Hudson-Doyle(mwhudson) wrote on 2022-09-01T02:32:20.085064+00:00 If you could try booting the ISO from https://cdimage.ubuntu.com/ubuntu-server/daily-live/pending/ I would be very interested to hear if it is better (and if it is we can backport to 22.04).
No, if the install completes successfully then there should be no deficiencies. It mostly impacts things about the installer environment, like autoinstall and ssh access to the installer. |
Launchpad user Narendra K(knarendra) wrote on 2022-09-01T08:08:44.374134+00:00
Thank you for clarifying.
Any thoughts on how users using autoinstall scenarios could work around this issue ?
Any thoughts on how users requiring ssh could work around this issue ? Or should it confirmed first if this scenario an issue by trying to ssh when issue occurs ? |
Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-09-01T12:02:16.567295+00:00 Hi Michael, As per your assist downloaded 64-bit PC (AMD64) server install image from "https://cdimage.ubuntu.com/ubuntu-server/daily-live/pending/" and tested it, Issue was still observed. Please find the details: "systemd-analyze-blame-output"
"systemd-analyze-critical-chain-output" The time when unit became active or started is printed after the "@" character. graphical.target @14min 7.154s "cloud-init-analyze-blame-output" -- Boot Record 01 -- 1 boot records analyzed "cloud-init-analyze-show-output" -- Boot Record 01 -- Starting stage: init-local Starting stage: init-network Starting stage: modules-config Starting stage: modules-final Total Time: 2.80900 seconds 1 boot records analyzed |
Launchpad user Narendra K(knarendra) wrote on 2022-09-05T17:34:50.823637+00:00 Hi Chad, 2022-08-17 06:49:32,690 DEBUG subiquity.server.server:519 waited 600.0012300014496s for cloud-init Do the logs provide any details on why cloud-init takes more than 10 minutes to complete ? Is cloud-init waiting for casper-md5check.service to complete ? casper-md5check.service seems to be taking > 10 minutes when issue occurs. Hi Michael, Could subiquity be made to continue without interrupting the installation and reschedule a check to verify if cloud-init completed ? Or does it need to wait for cloud-init complete before proceeding with installation ? Also, any thoughts on query from comment #17 ? |
Launchpad user Michael Hudson-Doyle(mwhudson) wrote on 2022-09-05T21:51:30.129036+00:00 Hmm it's a shame those casper changes didn't help but not completely surprising. I haven't verified this but my theory on what is going on is that the casper-md5 check is using all of the small amount of IO the virtual bmc feature provides and thus starving other services like cloud-init and snap seeding from making progress. Nothing is explicitly waiting for the casper-md5 check to complete -- if my theory is correct it's not something that can be fixed by tweaking unit dependencies. I should have mentioned this before, but one workaround for this is to pass 'tomem' on the kernel command line. This makes the initrd copy the entire ISO into RAM before pivoting to the real rootfs which makes the first part of the boot pretty slow but should make all other parts including the md5 check much faster. It obviously increases RAM usage too but most "real" servers probably have much more RAM than is needed. Other workarounds we are looking at is making a much smaller ISO that can be provided as a virtual ISO that will then download the main ISO from the network (as the network is probably much much faster than the virtual ISO). But it would be nice to fix this properly, probably by having the casper md5 check abort if it detects that it is only making very slow progress. |
Launchpad user Chad Smith(chad.smith) wrote on 2022-09-07T20:56:08.213790+00:00 Narendra K (knarendra), As Michael mentioned, this really looks more like a system that is under load and not responding well to all the systemd services that are being run during the install process. The time values I'm seeing in comment #18 from systemd-analyze are 10 times greater than what I typically see on laptops, containers, VMs and bare metal. From the systemd-analyze blame output from comment in #18, I can confirm the reason for cloud-config.service "waiting" is the sum of the systemd snap*services setup time:
... Those values account for the minute and a half of waiting... which aligns with the attached cloud-init analyze blame output as well: starting stage: modules-config Which tells us that cloud-init's modules-config stage started at 76 seconds after boot began (around the same amount as the combined snap services). When looking as cloud-init analyze output we see the cloud-final stage doesn't start until 8817 seconds (13 min) after boot which is greater than subiquity's 10 minute timeout. Starting stage: modules-final /lib/systemd/system/cloud-final.service config is only started once the following condition is met: After=multi-user.target And multi-user.target depends on that casper-md5check.service which is taking 13 minutes for one reason of another. |
Launchpad user Chad Smith(chad.smith) wrote on 2022-09-07T21:19:31.593901+00:00 Ultimately, each md5 file md5 comparison seems to take around 1/2 second on this system. If we are unable to determine what is slowing down the casper-md5check or an early exit in the event of painful md5checks, can we either provide a config option to disable the casper md5 check, or make that cloud-init timeout configurable to > 600 seconds[1]? References for cloud-init timeout: |
Launchpad user Michael Hudson-Doyle(mwhudson) wrote on 2022-09-08T00:41:41.695107+00:00 You can also put fsck.mode=skip on the kernel command line to get the md5check to skip itself, I should have mentioned that too! |
Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-09-08T10:29:32.574970+00:00 Hi Michael, Tried passing kernel work around parameter "tomem" as suggested in the comment #20, Issue was still observed in this case. And also Tried passing kernel work around parameter "fsck.mode=skip" as suggested in the comment #23, Issue was not seen in this scenario. Also, we tried reproducing the Issue in 'Ubuntu 20.04.4' in this case Issue was not seen. |
Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-09-08T10:54:26.492325+00:00 Chad Smith, Please find the details as requested in comment #21. "systemd-analyze critical-chain cloud-final.service output" The time when unit became active or started is printed after the "@" character. cloud-final.service +266ms |
Launchpad user Chad Smith(chad.smith) wrote on 2022-09-08T21:09:20.324471+00:00 Shubhakar Gowda P S (shubhakara)
I did notice in running the daily server live image for 20.04 today that in the console output it shows me the Thanks for the kernel cmdline tip Michael, I didn't know about Maybe no need to add a config setting for the cloud-init --wait timeout in subiquity as the other 'workaround' seems to work here. cloud-init squad should be touching the |
Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-09-09T11:32:23.310690+00:00 Chad Smith, Once After Ubuntu 20.04.4 CD Image Installation got completed, by switching over to a shell using 'Ctrl-Z + F2' keyboard shortcut key we have collected the output of following details. output "systemd-analyze critical-chain cloud-final.service" The time when unit became active or started is printed after the "@" character. cloud-final.service +475ms output "systemctl status casper-md5check.service" Unit casper-md5check.service could not be found output "/var/log/casper.log" stdin: Invalid argument |
Launchpad user Narendra K(knarendra) wrote on 2022-09-16T16:54:35.729545+00:00 Hi Chad/Michael, Any updates or feedback on findings in comment #27 ? It seems like in Ubuntu 20.04, a) cloud-init does not depend on multi-user.target Could Ubuntu 22.04 installation be modified to match with Ubuntu 20.04 ? Any thoughts ? Hi Michael, Could 'fsck.mode=skip' be documented in Ubuntu 22.04 release notes ? |
Launchpad user Brett Holman(holmanb) wrote on 2022-09-16T19:09:00.896674+00:00 Since the issue has been narrowed down to casper-md5check, which is out of scope of cloud-init, I'm marking the cloud-init side of this bug as "Invalid", since there is nothing actionable to do to in cloud-init to resolve this bug, and the plans to increase debugging transparency that Chad mentioned were already planned and scheduled. |
Launchpad user Narendra K(knarendra) wrote on 2022-09-21T09:51:56.971840+00:00 Hi Michael, Any updates on the queries in comment #28 ? It seems like in Ubuntu 20.04, a) cloud-init does not depend on multi-user.target Could Ubuntu 22.04 installation be modified to match with Ubuntu 20.04 ? Any thoughts ? |
Launchpad user Dan Bungert(dbungert) wrote on 2022-11-01T08:56:32.150866+00:00 We propose no longer having the md5check block multi-user. I'm opening a matching MP. |
Launchpad user Narendra K(knarendra) wrote on 2022-11-16T15:20:30.614960+00:00 Hi Dan, Would the patch get included in Ubuntu 22.04.2 ? |
Launchpad user Dan Bungert(dbungert) wrote on 2022-11-16T15:38:12.218127+00:00 Hi Narendra - yes, the plan is to have this done in time for 22.04.2. Thanks for specifically pointing this out, we need to make sure it's included in the Jammy Stable Release Updates. |
Launchpad user Launchpad Janitor(janitor) wrote on 2022-11-27T04:37:22.218023+00:00 This bug was fixed in the package casper - 1.476 casper (1.476) lunar; urgency=medium
-- Dan Bungert daniel.bungert@canonical.com Wed, 23 Nov 2022 16:10:57 +1300 |
Launchpad user Dan Bungert(dbungert) wrote on 2022-11-29T15:15:40.263866+00:00 Thanks for moving the SRU forward Mauricio. I see you nominated this fix for Kinetic, I think it won't help much as we probably won't be creating new Kinetic install media. Or do you have something else in mind? |
Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2022-11-29T16:22:55.481845+00:00 Hi Dan o/ |
Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-23T20:59:45.645722+00:00 Reproducer/Verification for SRU to Jammy, Uploading to jammy. ... Launch a VM with the Jammy daily live server:
Press e to edit, append "break=init console=ttyS0", press ctrl-x, close window. Open the serial console and chroot:
Test strace delay injection:
Modify the casper-md5check service:
Exit and wait; the issue happens:
... Similarly, repeat, this time with test packages from ppa:mfo/lp1986781: ... Open the serial console and chroot:
Install the test package:
Modify the casper-md5check service:
Exit and wait; the issue does not happen anymore:
...
Checking the casper-md5sum service is still running:
And it should not be a problem, as its start timeout is not limited.
Quit, cleanup the VM.
|
Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-23T21:01:19.241010+00:00 |
Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-23T21:03:22.617136+00:00 cosmetic glitch (see description's regression potential section). |
Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-23T21:07:35.250446+00:00 For documentation purposes, The critical chain of systemd units/time Before: root@ubuntu-server:/# systemd-analyze critical-chain cloud-final.service cloud-final.service +492ms After: root@ubuntu-server:/# systemd-analyze critical-chain cloud-final.service cloud-final.service +901ms |
Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-23T21:08:21.083037+00:00 Launchpad attachments: lp1986781-casper-jammy.debdiff |
Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-27T18:38:10.361167+00:00 Re-uploading to Jammy with changelog updates suggested by Timo. |
Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-27T18:38:33.809893+00:00 Launchpad attachments: lp1986781-casper-jammy-v2.debdiff |
Launchpad user Timo Aaltonen(tjaalton) wrote on 2023-01-28T10:54:36.959899+00:00 Hello Shubhakar, or anyone else affected, Accepted casper into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/casper/1.470.2 in a few hours, and then in the -proposed repository. Please help us by testing this new package. To properly test it you will need to obtain and boot a daily build of a Live CD for jammy. Your feedback will aid us getting this update out to other Ubuntu users. If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed. Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping! N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days. |
Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-30T14:00:25.633007+00:00 Verification done for jammy-proposed. ... Booting jammy-live-server-amd64.iso with (initramfs) chroot /root /bin/bash root@ubuntu-server:/# add-apt-repository -y -p proposed root@ubuntu-server:/# apt install -y casper/jammy-proposed root@ubuntu-server:/# apt policy casper ... root@ubuntu-server:/# sed -i '/^ExecStart=/ s,=,=/usr/bin/strace --inject read:delay_enter=60s ,' /usr/lib/systemd/system/casper-md5check.service root@ubuntu-server:/# cat /usr/lib/systemd/system/casper-md5check.service [Service] root@ubuntu-server:/# exit ... Ubuntu 22.04.1 LTS ubuntu-server ttyS0 connecting... ================================================================================
|
Launchpad user Mike Rushton(leftyfb) wrote on 2023-01-31T16:29:25.780022+00:00 "To properly test it you will need to obtain and boot a daily build of a Live CD for jammy" If that is all that is needed then it is failing for me. If, however, we need to stop the installer, enable the proposed repository, update casper, disable the proposed repository and then continue the installer, then it looks like that has been validated above by Mauricio. |
Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-31T17:13:15.565522+00:00 Hi Mike, Right; the package version in -proposed isn't in the daily build yet, @ https://cdimage.ubuntu.com/ubuntu-server/jammy/daily-live/20230131/jammy-live-server-amd64.manifest
cheers, |
Launchpad user Dan Bungert(dbungert) wrote on 2023-02-02T15:23:38.048915+00:00 Verification done for jammy-proposed. I have purposefully done a different test procedure, based on injecting the proposed casper version and using systemd-analyze. Above, in comments such as https://bugs.launchpad.net/cloud-init/+bug/1986781/comments/25, we can see that casper-md5check needs to complete before hitting multi-user.target. With casper 1.470.2 in place, it looks like: The time when unit became active or started is printed after the "@" character. casper-md5check.service +13.837s This is the expected behavior, so the SRU LGTM. |
Launchpad user Launchpad Janitor(janitor) wrote on 2023-02-07T11:05:44.814349+00:00 This bug was fixed in the package casper - 1.470.2 casper (1.470.2) jammy; urgency=medium [ Dan Bungert ]
-- Mauricio Faria de Oliveira mfo@canonical.com Fri, 27 Jan 2023 15:31:09 -0300 |
Launchpad user Łukasz Zemczak(sil2100) wrote on 2023-02-07T11:05:49.907084+00:00 The verification of the Stable Release Update for casper has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions. |
This bug was originally filed in Launchpad as LP: #1986781
Launchpad details
Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-08-17T09:55:27.968812+00:00
[Impact]
* Users that install Ubuntu Server through slow
media (eg, virtual optical drive over network,
which may be common on enterprise deployments)
might hit the following subiquity startup error:
'cloud-init failed to complete after 10 minutes of waiting'
* (That in addition to 10 minutes of waiting themselves.)
* This happens because casper-md5check.service is
(slowly) verifying the integrity of install media,
which blocks
multi-user.target
,which blocks
cloud-final.service
,which blocks
cloud-init status --wait
which is used in subiquity /
waiting on cloud-init
).[Fix]
* The adopted solution (merged on lunar) is simply
not to block
multi-user.target
, but rather runafter it.
[Test Steps]
For a synthetic reproducer of slowness of casper-md5check:
* boot with
break=init
to break into initramfs-toolsbefore exec() systemd.
* chroot /root /bin/bash
* edit /usr/lib/systemd/system/casper-md5check.service
* prepend
strace --inject read:delay_enter=5s
to thecommand in
ExecStart
, to introduce a 5 secs delayto every read() syscall performed by casper-md5check.
* exit twice (chroot, initramfs shell) to resume boot.
See comment 37 for examples.
[Regression Potential]
Functionality related to install media integrity check.
Users with corrupted install media might not realize
this until later on; but this is rarely the case and
even w/out the fix, there's a lot that runs before
we even get to casper-md5check, so they may (still)
see errors early anyway.
* There's a cosmetic glitch in the proposed solution:
the systemd line when casper-md5check finishes
shows up on top of subiquity's menu (screenshot):
"[ OK ] Finished casper-md5check Verify Live ISO checksums."
Dan Bungert mentioned this is known and should be
addressed in a future change to subiquity, and is
not supposed to block the SRU for Jammy / 22.04.2.
[Original Description]
Description:
On Dell EMC PowerEdge system when Install Ubuntu 22.04 via iDRAC Virtual Console, cloud-init failed to complete after 10 minutes of waiting.
Steps to Reproduce:
Expected Results :-
Installation should be successful.
The text was updated successfully, but these errors were encountered: