Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ubuntu 22.04]cloud-init failed to complete after 10 minutes of waiting was shown during Installation via iDRAC Virtual Console #4009

Closed
ubuntu-server-builder opened this issue May 12, 2023 · 50 comments
Labels
launchpad Migrated from Launchpad

Comments

@ubuntu-server-builder
Copy link
Collaborator

This bug was originally filed in Launchpad as LP: #1986781

Launchpad details
affected_projects = ['subiquity', 'casper (Ubuntu)', 'casper (Ubuntu Focal)', 'casper (Ubuntu Jammy)', 'casper (Ubuntu Kinetic)']
assignee = None
assignee_name = None
date_closed = 2022-09-16T19:09:09.209707+00:00
date_created = 2022-08-17T09:55:27.968812+00:00
date_fix_committed = None
date_fix_released = None
id = 1986781
importance = undecided
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1986781
milestone = None
owner = shubhakara
owner_name = Shubhakar Gowda P S
private = False
status = invalid
submitter = shubhakara
submitter_name = Shubhakar Gowda P S
tags = ['se-sponsor-mfo', 'sts', 'verification-done', 'verification-done-jammy']
duplicates = [1969919, 1971181, 2003152]

Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-08-17T09:55:27.968812+00:00

[Impact]

 * Users that install Ubuntu Server through slow
   media (eg, virtual optical drive over network,
   which may be common on enterprise deployments)
   might hit the following subiquity startup error:

  'cloud-init failed to complete after 10 minutes of waiting'

 * (That in addition to 10 minutes of waiting themselves.)

 * This happens because casper-md5check.service is
   (slowly) verifying the integrity of install media,
   which blocks multi-user.target,
   which blocks cloud-final.service,
   which blocks cloud-init status --wait
   which is used in subiquity / waiting on cloud-init).

[Fix]

 * The adopted solution (merged on lunar) is simply
   not to block multi-user.target, but rather run
   after it.

[Test Steps]

 For a synthetic reproducer of slowness of casper-md5check:

 * boot with break=init to break into initramfs-tools
   before exec() systemd.
 * chroot /root /bin/bash
 * edit /usr/lib/systemd/system/casper-md5check.service
 * prepend strace --inject read:delay_enter=5s to the
   command in ExecStart, to introduce a 5 secs delay
   to every read() syscall performed by casper-md5check.
 * exit twice (chroot, initramfs shell) to resume boot.

  See comment 37 for examples.

[Regression Potential]

  • Functionality related to install media integrity check.

  • Users with corrupted install media might not realize
    this until later on; but this is rarely the case and
    even w/out the fix, there's a lot that runs before
    we even get to casper-md5check, so they may (still)
    see errors early anyway.

 * There's a cosmetic glitch in the proposed solution:
   the systemd line when casper-md5check finishes
   shows up on top of subiquity's menu (screenshot):

   "[ OK ] Finished casper-md5check Verify Live ISO checksums."

   Dan Bungert mentioned this is known and should be
   addressed in a future change to subiquity, and is
   not supposed to block the SRU for Jammy / 22.04.2.

[Original Description]

Description:

On Dell EMC PowerEdge system when Install Ubuntu 22.04 via iDRAC Virtual Console, cloud-init failed to complete after 10 minutes of waiting.

Steps to Reproduce:

  1. Login to iDRAC and Launch Virtual Console.
  2. Connect to Virtual Media and Map ubuntu 22.04 iso file using Map CD/DVD option.
  3. Try Installing Ubuntu server.
  4. "cloud-init" failed to complete after 10 minutes of waiting was shown during Installation.

Expected Results :-

Installation should be successful.

@ubuntu-server-builder ubuntu-server-builder added the launchpad Migrated from Launchpad label May 12, 2023
@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-08-17T10:13:39.377607+00:00

Launchpad attachments: Attaching cloud-init Installation logs under /var/log

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user James Falcon(falcojr) wrote on 2022-08-17T14:50:02.501990+00:00

I took a look at the logs, and this doesn't appear to be a problem with cloud-init specifically. Cloud-init runs multiple times during boot, and each run has its own set of systemd services that it depends on having run first. If we look at cloud-init.log, we can see that there's a time where one cloud-init instance exits, then a greater than 10 minute gap before cloud-init's final service runs:

2022-08-17 06:38:53,429 - util.py[DEBUG]: cloud-init mode 'modules' took 0.139 seconds (0.14)
2022-08-17 06:38:53,429 - handlers.py[DEBUG]: finish: modules-config: SUCCESS: running modules for config
2022-08-17 06:52:29,424 - util.py[DEBUG]: Cloud-init v. 22.1-14-g2e17a0d6-0ubuntu1~22.04.5 running 'modules:final' at Wed, 17 Aug 2022 06:52:29 +0000. Up 999.23 seconds.

This can happen if one of cloud-init's dependent services has stalled or failed. To see if another service has stalled or failed, you could try the following commands:
systemd-analyze blame
systemctl --failed

Does the entire installation fail, or just cloud-init?

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-08-22T07:21:17.471994+00:00

James Falcon,

Does the entire installation fail, or just cloud-init?

Only cloud-init will fail. Once issue occurs when hit 'cancel' we can proceed with the further Installation successfully.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Chad Smith(chad.smith) wrote on 2022-08-24T16:26:46.172306+00:00

Thanks for the info, shubhakara.

To debug further what we need from you on the failed installed system once it boots is the output of

  • systemd-analyze blame
  • systemd-analyze show
  • systemd-analyze critical-chain
  • systemctl --failed
  • journalctl -b 0

This will help us determine what other systemd service ordering seems to be delaying cloud-init boot stages from running.

Most of he data above gets collected by running cloud-init collect-logs which emits a tar.gz file that can be attached to this bug if you don't want to run the commands individually.

Given that you mentioned clicking "cancel" and there is a /var/log/intaller subdirectory, this looks like you are using the ubuntu server live-install (subiquity) to install this system. I'm adding subiquity project too just to keep devs aware.

Looking through /var/log/installer/subiquity-server-debug*log I can see Starting Subiquity server revision 3359 is blocking on cloud-init status --wait which alludes to the 10 minute wait you referenced in this bug.

timeout logs on cloud-init status --wait

2022-08-17 06:39:32,689 DEBUG subiquitycore.utils:77 arun_command called: ['cloud-init', 'status', '--wait']
2022-08-17 06:39:32,861 DEBUG curtin.reporting.start.subiquity/Meta/status_GET:45 start: subiquity/Meta/status_GET:
2022-08-17 06:39:32,863 DEBUG curtin.reporting.finish.subiquity/Meta/status_GET:45 finish: subiquity/Meta/status_GET: SUCCESS: 200 {"state": "CLOUD_INIT_WAIT", "confirming_tty": "", "error": null, "cloud_init...
2022-08-17 06:39:32,863 INFO aiohttp.access:233 [17/Aug/2022:06:39:32 +0000] "GET /meta/status?cur=null HTTP/1.1" 200 419 "-" "Python/3.8 aiohttp/3.6.2"
2022-08-17 06:39:33,347 DEBUG curtin.reporting.start.subiquity/Meta/status_GET:45 start: subiquity/Meta/status_GET:
2022-08-17 06:49:32,690 DEBUG subiquity.server.server:519 waited 600.0012300014496s for cloud-init
2022-08-17 06:49:32,691 DEBUG subiquity.server.server:535 cloud-init status: '', assumed disabled

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Chad Smith(chad.smith) wrote on 2022-08-24T16:28:00.754191+00:00

Please set the bug status back to "New" when further comments have been added to make sure our team gets alerted about the updated comments so we can respond more quickly.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-08-25T11:49:18.959580+00:00

Chad Smith,

Here is the attachment using 1G end to end connection without VPN. In the old attachment shared previously, virtual-media used over a VPN connection to install the OS.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-08-25T13:58:26.496810+00:00

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-08-25T14:03:04.191000+00:00

Chad Smith,

File "cloud-init-install-fail.tar" in comment #6 contains the cloud init logs collected under /var/log.
File "cloud-init.tar.gz" in comment #7 contains the cloud init logs collected using the command "cloud-init collect-logs".

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Chad Smith(chad.smith) wrote on 2022-08-26T21:30:59.368303+00:00

Thanks Shubhakar Gowda P S (shubhakara) for getting back to us.

From journal.txt in your attached logs in comment #7 is looks like the 10 minute timeout is not strictly related to cloud-init, but cloud-init waiting on snaps being seeded.

I can see a snapd log in your journalctl -b 0 output that says snapd was timed out after an inordinate amount of time waiting for NTP:

Aug 25 08:55:42.976933 ubuntu-server snapd[1642]: devicemgr.go:1927: no NTP sync after 10m0s,

I also see a whole bunch of ntpd and network.timed

The systemd sevice which performs cloud configuration cloud-config.service has a clause which makes it wait until snapd.seeded.service completes in the booting environment so cloud-init use of any snap tools/services installed:

After=snapd.seeded.service

So this 10 minute wait is due to something in either networking setup in the ephemeral/early install environment not being configured appropriately for snapd to come up. Once snapd unblocks, cloud-init can complete.

You can confirm that the problem is snap seeding on your system with the commands we listed above in comments #2 and #4:

  • systemd-analyze blame
  • systemd-analyze show
  • systemd-analyze critical-chain
  • systemctl --failed

That should point to snap seeding being the problem in this early/ephemeral install environment.

Also in your journalctl -b 0 output (in cloud-init collect-logs /var/log/journal.txt)

A red flag I'm also seeing is repeated logs in your journal that look like thrashing of time-related services:
Aug 25 08:52:55.229686 ubuntu-server dbus-daemon[1630]: [system] Successfully activated service 'org.freedesktop.timedate1'
Aug 25 08:52:55.229931 ubuntu-server systemd[1]: Started Time & Date Service.
Aug 25 08:53:55.291021 ubuntu-server systemd[1]: systemd-timedated.service: Deactivated successfully.

Something is amiss with ntp or network configuration on this installation and it looks like it is causing issues for snapd which cause issue for cloud-init to make progress on the install of this system.

In checking /var/log/cloud-init-output.log I see that the two NICs are listed as link down:
ci-info: | eno1 | False | . | . | . | f4:ee:08:19:37:49 |
ci-info: | eno2 | False | . | . | . | f4:ee:08:19:37:4a |

But journalctl -b 0 showed them as "UP":
Aug 25 08:45:04.600250 ubuntu-server systemd-networkd[1594]: eno2: Link UP
Aug 25 08:45:04.635710 ubuntu-server systemd-networkd[1594]: eno1: Link UP

Something seems to be turning off the NICs on this device maybe?

Can you please provide:

  1. Was the a customized install Ubuntu 22.04 installer ISO?
  2. if public ISO can you provide the URL to the ISO used for install?
  3. Run the separate commands and attach that output
  • systemd-analyze blame
  • systemd-analyze show
  • systemd-analyze critical-chain
  • systemctl --failed
  • ls /etc/netplan
  • cat /etc/netplan/*
  • ip addr
  • ip route

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Michael Hudson-Doyle(mwhudson) wrote on 2022-08-29T09:09:15.022307+00:00

What's going on here is I think that the md5 check casper is doing is using all the (minimal) amount of IO bandwidth to the virtual ISO and starving other processes from making progress. We should probably make the md5 check run under ionice or just abort itself if it's slow or something.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Michael Hudson-Doyle(mwhudson) wrote on 2022-08-29T09:14:34.857104+00:00

Would someone be able to test a custom ISO that I can make with https://code.launchpad.net/~mwhudson/casper/+git/casper/+merge/429055 incorporated?

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Narendra K(knarendra) wrote on 2022-08-30T16:39:29.744108+00:00

Hi Michael,

Thank you for taking a look. Please provide the custom ISO. We will test and share findings.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Narendra K(knarendra) wrote on 2022-08-30T19:46:09.376088+00:00

Please find the details:

systemd-analyze-blame-output

12min 33.112s casper-md5check.service
48.873s snapd.service
48.004s snapd.seeded.service
35.847s pollinate.service
31.663s udisks2.service
28.538s cloud-init-local.service
23.592s e2scrub_reap.service
23.281s thermald.service
22.361s dev-sr0.device
17.649s polkit.service
16.971s dev-loop9.device
16.768s networkd-dispatcher.service
16.547s snapd.hold.service
15.221s dev-loop6.device
15.003s dev-loop8.device
14.037s snapd.apparmor.service
12.682s dev-loop4.device
12.680s dev-loop5.device
12.677s dev-loop0.device
12.093s ModemManager.service
10.873s snap.lxd.activate.service
10.661s dev-loop7.device
5.310s rsyslog.service
5.231s apport.service
5.168s systemd-logind.service
4.633s secureboot-db.service
4.523s cloud-init.service
4.297s ua-timer.service
2.867s console-setup.service
2.712s systemd-resolved.service
2.689s systemd-udev-trigger.service
2.523s systemd-udevd.service
2.365s multipathd.service
2.263s systemd-user-sessions.service
1.742s keyboard-setup.service
1.590s systemd-modules-load.service
1.483s systemd-sysusers.service
1.433s snap-core20-1405.mount
1.304s snap-lxd-22923.mount
1.211s snap-snapd-15534.mount
1.198s update-notifier-download.service
950ms systemd-journal-flush.service
840ms systemd-sysctl.service
837ms setvtrgb.service
664ms systemd-random-seed.service
634ms snap-subiquity-3359.mount
606ms systemd-timesyncd.service
558ms systemd-tmpfiles-setup.service
495ms systemd-networkd.service
461ms cloud-config.service
418ms systemd-tmpfiles-setup-dev.service
357ms systemd-update-utmp.service
340ms systemd-remount-fs.service
327ms cloud-final.service
317ms sys-kernel-config.mount
315ms sys-fs-fuse-connections.mount
306ms plymouth-read-write.service
224ms plymouth-quit.service
199ms lvm2-monitor.service
181ms systemd-journald.service
117ms finalrd.service
111ms ufw.service
89ms dev-hugepages.mount
88ms dev-mqueue.mount
87ms sys-kernel-debug.mount
87ms sys-kernel-tracing.mount
83ms lvm2-pvscan@8:3.service
78ms kmod-static-nodes.service
76ms modprobe@fuse.service
76ms modprobe@drm.service
75ms modprobe@configfs.service
26ms ssh.service
20ms systemd-tmpfiles-clean.service
17ms plymouth-quit-wait.service
11ms systemd-update-utmp-runlevel.service
8ms snapd.socket
8ms systemd-networkd-wait-online.service
3ms tmp.mount
51us blk-availability.service

"systemd-analyze-critical-chain-output"

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

graphical.target @13min 17.133s
-multi-user.target @13min 17.132s -casper-md5check.service @44.019s +12min 33.112s
-basic.target @44.015s -sockets.target @44.014s
-snap.lxd.daemon.unix.socket @2min 8.203s -sysinit.target @43.976s
-cloud-init.service @39.445s +4.523s -systemd-networkd-wait-online.service @39.432s +8ms
-systemd-networkd.service @38.932s +495ms -network-pre.target @38.925s
-cloud-init-local.service @10.374s +28.538s -systemd-remount-fs.service @5.507s +340ms
-systemd-journald.socket @5.413s --.mount @5.377s
`--.slice @5.377s

"cloud-init-analyze-blame-output"

-- Boot Record 01 --
01.47200s (init-network/config-ssh)
00.34700s (init-local/search-NoCloud)
00.09400s (modules-config/config-set-passwords)
00.04300s (init-network/config-users-groups)
00.01600s (init-network/check-cache)
00.00400s (modules-final/config-final-message)
00.00400s (init-network/consume-user-data)
00.00200s (modules-config/config-timezone)
00.00200s (init-network/config-write-files)
00.00200s (init-network/activate-datasource)
00.00100s (modules-final/config-ssh-authkey-fingerprints)
00.00100s (modules-final/config-scripts-per-once)
00.00100s (modules-final/config-keys-to-console)
00.00100s (modules-config/config-disable-ec2-metadata)
00.00100s (init-network/config-rsyslog)
00.00100s (init-network/config-ca-certs)
00.00100s (init-network/config-bootcmd)
00.00100s (init-local/check-cache)
00.00000s (modules-final/config-scripts-user)
00.00000s (modules-final/config-phone-home)
00.00000s (modules-config/config-ssh-import-id)
00.00000s (modules-config/config-runcmd)
00.00000s (init-network/setup-datasource)
00.00000s (init-network/consume-vendor-data2)
00.00000s (init-network/consume-vendor-data)

"cloud-init-analyze-show-output"

-- Boot Record 01 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.

Starting stage: init-local
|->no cache found @00.00300s +00.00100s |->found local data from DataSourceNoCloud @00.01200s +00.34700s
Finished stage: (init-local) 01.91800 seconds

Starting stage: init-network
|->restored from cache with run check: DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud][dsmode=net] @04.06200s +00.01600s |->setting up datasource @04.16500s +00.00000s
|->reading and applying user-data @04.35100s +00.00400s |->reading and applying vendor-data @04.35500s +00.00000s
|->reading and applying vendor-data2 @04.35500s +00.00000s |->activating datasource @05.43600s +00.00200s
|->config-bootcmd ran successfully @05.44200s +00.00100s |->config-write-files ran successfully @05.44300s +00.00200s
|->config-ca-certs ran successfully @05.44600s +00.00100s |->config-rsyslog ran successfully @05.44700s +00.00100s
|->config-users-groups ran successfully @05.44900s +00.04300s |->config-ssh ran successfully @05.49200s +01.47200s
Finished stage: (init-network) 04.19000 seconds

Starting stage: modules-config
|->config-ssh-import-id ran successfully @105.80900s +00.00000s |->config-set-passwords ran successfully @105.81000s +00.09400s
|->config-timezone ran successfully @105.90400s +00.00200s |->config-disable-ec2-metadata ran successfully @105.90600s +00.00100s
|`->config-runcmd ran successfully @105.90700s +00.00000s
Finished stage: (modules-config) 00.13400 seconds

Starting stage: modules-final
|->config-scripts-per-once ran successfully @761.00000s +00.00100s |->config-scripts-user ran successfully @761.00100s +00.00000s
|->config-ssh-authkey-fingerprints ran successfully @761.00100s +00.00100s |->config-keys-to-console ran successfully @761.00200s +00.00100s
|->config-phone-home ran successfully @761.00300s +00.00000s |->config-final-message ran successfully @761.00300s +00.00400s
Finished stage: (modules-final) 00.01100 seconds

Total Time: 6.25300 seconds

1 boot records analyzed

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Narendra K(knarendra) wrote on 2022-08-31T10:33:27.032168+00:00

Launchpad attachments: Installation showing failure

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Narendra K(knarendra) wrote on 2022-08-31T10:36:46.179183+00:00

Hi Chad/Michael,

When issue occurs, the installation stops as shown in the screen attached in comment #14. If "Close" option is selected, installation completes successfully and OS completes successfully. Does choosing "Close" option and proceeding with installation have any side effects or could it be recommended as a work around till a solution is included in the installation ISO ?

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Michael Hudson-Doyle(mwhudson) wrote on 2022-09-01T02:32:20.085064+00:00

If you could try booting the ISO from https://cdimage.ubuntu.com/ubuntu-server/daily-live/pending/ I would be very interested to hear if it is better (and if it is we can backport to 22.04).

When issue occurs, the installation stops as shown in the screen attached in comment #14. If
"Close" option is selected, installation completes successfully and OS completes successfully.
Does choosing "Close" option and proceeding with installation have any side effects or could it
be recommended as a work around till a solution is included in the installation ISO ?

No, if the install completes successfully then there should be no deficiencies. It mostly impacts things about the installer environment, like autoinstall and ssh access to the installer.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Narendra K(knarendra) wrote on 2022-09-01T08:08:44.374134+00:00

No, if the install completes successfully then there should be no deficiencies.

Thank you for clarifying.

It mostly impacts things about the installer environment, like autoinstall and ssh access to the installer.

Any thoughts on how users using autoinstall scenarios could work around this issue ?

ssh access to the installer

Any thoughts on how users requiring ssh could work around this issue ? Or should it confirmed first if this scenario an issue by trying to ssh when issue occurs ?

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-09-01T12:02:16.567295+00:00

Hi Michael,

As per your assist downloaded 64-bit PC (AMD64) server install image from "https://cdimage.ubuntu.com/ubuntu-server/daily-live/pending/" and tested it, Issue was still observed.

Please find the details:

"systemd-analyze-blame-output"

  13min 33.597s casper-md5check.service
  41.640s snapd.seeded.service
  29.360s snapd.service
  24.189s pollinate.service
  18.638s udisks2.service
  16.918s dev-sr0.device
  10.718s snapd.hold.service
  10.585s ModemManager.service
  10.085s snap.lxd.activate.service
   9.802s cloud-init-local.service
   9.375s snapd.apparmor.service
   9.059s thermald.service
   9.044s console-setup.service
   8.591s polkit.service
   7.693s e2scrub_reap.service
   7.348s systemd-logind.service
   7.220s rsyslog.service
   7.097s dev-loop8.device
   6.720s lvm2-monitor.service
   6.331s dev-loop7.device
   6.050s dev-loop6.device
   6.000s secureboot-db.service
   5.983s apport.service
   5.363s systemd-tmpfiles-setup.service
   4.766s dev-loop9.device
   4.074s dev-loop5.device
   4.073s dev-loop0.device
   4.072s dev-loop4.device
   3.766s snap-lxd-23537.mount
   3.432s multipathd.service
   3.372s snap-snapd-16292.mount
   2.970s systemd-resolved.service
   2.927s snap-core20-1611.mount
   2.376s plymouth-read-write.service
   2.234s keyboard-setup.service
   2.212s systemd-user-sessions.service
   2.040s systemd-udev-trigger.service
   1.672s finalrd.service
   1.483s systemd-sysusers.service
   1.460s cloud-init.service
   1.303s systemd-sysctl.service
   1.296s ufw.service
   1.261s systemd-journal-flush.service
   1.219s setvtrgb.service
   1.147s systemd-random-seed.service
   1.131s plymouth-quit.service
   1.103s systemd-modules-load.service
   1.032s dev-hugepages.mount
   1.030s dev-mqueue.mount
   1.029s sys-kernel-debug.mount
   1.028s sys-kernel-tracing.mount
   1.020s kmod-static-nodes.service
   1.019s systemd-journald.service
   1.014s modprobe@chromeos_pstore.service
   1.013s modprobe@configfs.service
   1.012s modprobe@drm.service
   1.010s modprobe@efi_pstore.service
   1.009s modprobe@fuse.service
   1.008s modprobe@pstore_blk.service
   1.006s modprobe@pstore_zone.service
   1.005s modprobe@ramoops.service
    801ms systemd-remount-fs.service
    712ms systemd-udevd.service
    700ms snap-subiquity-3798.mount
    558ms systemd-networkd.service
    471ms update-notifier-download.service
    378ms systemd-tmpfiles-setup-dev.service
    368ms cloud-config.service
    335ms cloud-final.service
    279ms sys-kernel-config.mount
    269ms sys-fs-fuse-connections.mount
    215ms systemd-timesyncd.service
    107ms systemd-update-utmp.service
     23ms systemd-tmpfiles-clean.service
     11ms systemd-update-utmp-runlevel.service
      9ms snapd.socket
      4ms tmp.mount
      4ms systemd-networkd-wait-online.service
      3ms plymouth-quit-wait.service
     17us blk-availability.service

"systemd-analyze-critical-chain-output"

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

graphical.target @14min 7.154s
-multi-user.target @14min 7.153s -casper-md5check.service @33.554s +13min 33.597s
-basic.target @33.551s -sockets.target @33.550s
-snap.lxd.user-daemon.unix.socket @1min 32.661s -sysinit.target @33.513s
-cloud-init.service @32.046s +1.460s -systemd-networkd-wait-online.service @32.038s +4ms
-systemd-networkd.service @31.473s +558ms -network-pre.target @31.470s
-cloud-init-local.service @21.655s +9.802s -systemd-remount-fs.service @4.284s +801ms
-systemd-journald.socket @4.031s --.mount @3.958s
`--.slice @3.958s

"cloud-init-analyze-blame-output"

-- Boot Record 01 --
00.54400s (init-network/config-ssh)
00.43600s (init-local/search-NoCloud)
00.04200s (init-network/config-users-groups)
00.02700s (modules-config/config-set-passwords)
00.00800s (modules-final/config-final-message)
00.00500s (init-network/check-cache)
00.00100s (modules-final/config-scripts-user)
00.00100s (modules-final/config-keys-to-console)
00.00100s (init-network/consume-user-data)
00.00100s (init-network/activate-datasource)
00.00000s (modules-final/config-ssh-authkey-fingerprints)
00.00000s (modules-final/config-scripts-per-once)
00.00000s (modules-config/config-ssh-import-id)
00.00000s (init-network/setup-datasource)
00.00000s (init-network/consume-vendor-data2)
00.00000s (init-network/consume-vendor-data)
00.00000s (init-local/check-cache)

1 boot records analyzed

"cloud-init-analyze-show-output"

-- Boot Record 01 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.

Starting stage: init-local
|->no cache found @00.00300s +00.00000s |->found local data from DataSourceNoCloud @00.01500s +00.43600s
Finished stage: (init-local) 01.57200 seconds

Starting stage: init-network
|->restored from cache with run check: DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud][dsmode=net] @02.94400s +00.00500s |->setting up datasource @02.98900s +00.00000s
|->reading and applying user-data @02.99800s +00.00100s |->reading and applying vendor-data @02.99900s +00.00000s
|->reading and applying vendor-data2 @02.99900s +00.00000s |->activating datasource @03.02400s +00.00100s
|->config-users-groups ran successfully @03.05100s +00.04200s |->config-ssh ran successfully @03.09300s +00.54400s
Finished stage: (init-network) 01.16100 seconds

Starting stage: modules-config
|->config-ssh-import-id ran successfully @76.25500s +00.00000s |->config-set-passwords ran successfully @76.25500s +00.02700s
Finished stage: (modules-config) 00.04800 seconds

Starting stage: modules-final
|->config-scripts-per-once ran successfully @817.63600s +00.00000s |->config-scripts-user ran successfully @817.63600s +00.00100s
|->config-ssh-authkey-fingerprints ran successfully @817.63700s +00.00000s |->config-keys-to-console ran successfully @817.63700s +00.00100s
|`->config-final-message ran successfully @817.63800s +00.00800s
Finished stage: (modules-final) 00.02800 seconds

Total Time: 2.80900 seconds

1 boot records analyzed

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Narendra K(knarendra) wrote on 2022-09-05T17:34:50.823637+00:00

Hi Chad,

2022-08-17 06:49:32,690 DEBUG subiquity.server.server:519 waited 600.0012300014496s for cloud-init
2022-08-17 06:49:32,691 DEBUG subiquity.server.server:535 cloud-init status: '', assumed disabled

Do the logs provide any details on why cloud-init takes more than 10 minutes to complete ? Is cloud-init waiting for casper-md5check.service to complete ? casper-md5check.service seems to be taking > 10 minutes when issue occurs.

Hi Michael,

Could subiquity be made to continue without interrupting the installation and reschedule a check to verify if cloud-init completed ? Or does it need to wait for cloud-init complete before proceeding with installation ?

Also, any thoughts on query from comment #17 ?

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Michael Hudson-Doyle(mwhudson) wrote on 2022-09-05T21:51:30.129036+00:00

Hmm it's a shame those casper changes didn't help but not completely surprising.

I haven't verified this but my theory on what is going on is that the casper-md5 check is using all of the small amount of IO the virtual bmc feature provides and thus starving other services like cloud-init and snap seeding from making progress. Nothing is explicitly waiting for the casper-md5 check to complete -- if my theory is correct it's not something that can be fixed by tweaking unit dependencies.

I should have mentioned this before, but one workaround for this is to pass 'tomem' on the kernel command line. This makes the initrd copy the entire ISO into RAM before pivoting to the real rootfs which makes the first part of the boot pretty slow but should make all other parts including the md5 check much faster. It obviously increases RAM usage too but most "real" servers probably have much more RAM than is needed.

Other workarounds we are looking at is making a much smaller ISO that can be provided as a virtual ISO that will then download the main ISO from the network (as the network is probably much much faster than the virtual ISO). But it would be nice to fix this properly, probably by having the casper md5 check abort if it detects that it is only making very slow progress.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Chad Smith(chad.smith) wrote on 2022-09-07T20:56:08.213790+00:00

Narendra K (knarendra),

As Michael mentioned, this really looks more like a system that is under load and not responding well to all the systemd services that are being run during the install process.

The time values I'm seeing in comment #18 from systemd-analyze are 10 times greater than what I typically see on laptops, containers, VMs and bare metal.

From the systemd-analyze blame output from comment in #18, I can confirm the reason for cloud-config.service "waiting" is the sum of the systemd snap*services setup time:

  41.640s snapd.seeded.service
  29.360s snapd.service

...
16.547s snapd.hold.service

Those values account for the minute and a half of waiting... which aligns with the attached cloud-init analyze blame output as well:

starting stage: modules-config
|`->config-ssh-import-id ran successfully @76.25500s +00.00000s

Which tells us that cloud-init's modules-config stage started at 76 seconds after boot began (around the same amount as the combined snap services).

When looking as cloud-init analyze output we see the cloud-final stage doesn't start until 8817 seconds (13 min) after boot which is greater than subiquity's 10 minute timeout.

Starting stage: modules-final
|`->config-scripts-per-once ran successfully @817.63600s +00.00000s

/lib/systemd/system/cloud-final.service config is only started once the following condition is met:

After=multi-user.target

And multi-user.target depends on that casper-md5check.service which is taking 13 minutes for one reason of another.
We can confirm what's slowing up that final stage of cloud-init with the following command on your system:
$ systemd-analyze critical-chain cloud-final.service

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Chad Smith(chad.smith) wrote on 2022-09-07T21:19:31.593901+00:00

Ultimately, each md5 file md5 comparison seems to take around 1/2 second on this system. If we are unable to determine what is slowing down the casper-md5check or an early exit in the event of painful md5checks, can we either provide a config option to disable the casper md5 check, or make that cloud-init timeout configurable to > 600 seconds[1]?

References for cloud-init timeout:
https://github.com/canonical/subiquity/blob/main/subiquity/server/server.py#L523

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Michael Hudson-Doyle(mwhudson) wrote on 2022-09-08T00:41:41.695107+00:00

You can also put fsck.mode=skip on the kernel command line to get the md5check to skip itself, I should have mentioned that too!

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-09-08T10:29:32.574970+00:00

Hi Michael,

Tried passing kernel work around parameter "tomem" as suggested in the comment #20, Issue was still observed in this case. And also Tried passing kernel work around parameter "fsck.mode=skip" as suggested in the comment #23, Issue was not seen in this scenario.

Also, we tried reproducing the Issue in 'Ubuntu 20.04.4' in this case Issue was not seen.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-09-08T10:54:26.492325+00:00

Chad Smith,

Please find the details as requested in comment #21.

"systemd-analyze critical-chain cloud-final.service output"

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +266ms
-multi-user.target @12min 38.156s -casper-md5check.service @39.106s +11min 59.049s
-basic.target @39.104s -sockets.target @39.103s
-snap.lxd.user-daemon.unix.socket @1min 38.160s -sysinit.target @39.068s
-cloud-init.service @35.645s +3.417s -systemd-networkd-wait-online.service @35.634s +8ms
-systemd-networkd.service @35.265s +365ms -network-pre.target @35.262s
-cloud-init-local.service @12.178s +23.078s -systemd-remount-fs.service @6.689s +483ms
-systemd-journald.socket @6.401s --.mount @6.360s
`--.slice @6.360s

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Chad Smith(chad.smith) wrote on 2022-09-08T21:09:20.324471+00:00

Shubhakar Gowda P S (shubhakara)

  1. thank you for the specific critical-chain of cloud-final.service. It does confirm the main thing blocking cloud-init from completing is the casper-md5check.service on this system. Without that service delaying the multi-user.target, cloud-final,service can complete in around 50
    seconds which beats that 10 minute timeout in the installer.

  2. Also, we tried reproducing the Issue in 'Ubuntu 20.04.4' in this case Issue was not seen.
    Do you mean that stock 20.04.4 images without any kernel command-line options did not reproduce this issue on the same system type?

I did notice in running the daily server live image for 20.04 today that in the console output it shows me the fsck.mode=skip So that is probably why this didn't hit you on the 20.04 install.

Thanks for the kernel cmdline tip Michael, I didn't know about fsck.mode=skip and seems like that would avoid this issue on platforms which are being hit with general IO/performance concerns leading to multi-user.target not being reached for an extended period of time.

Maybe no need to add a config setting for the cloud-init --wait timeout in subiquity as the other 'workaround' seems to work here.

cloud-init squad should be touching the cloud-init status --wait call https://github.com/canonical/subiquity/blob/main/subiquity/server/server.py#L521 to introspect cloud-init status a bit better once cloud-init releases support for cloud-init status --format json. In the event of a timeout, we can also make sure we touch this section to ensure a timeout results in a report of what stage cloud-init is 'stuck' and give suggested debug steps for bug filing.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Shubhakar Gowda P S(shubhakara) wrote on 2022-09-09T11:32:23.310690+00:00

Chad Smith,

Once After Ubuntu 20.04.4 CD Image Installation got completed, by switching over to a shell using 'Ctrl-Z + F2' keyboard shortcut key we have collected the output of following details.

output "systemd-analyze critical-chain cloud-final.service"

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +475ms
-cloud-config.service @22.760s +473ms -snapd.seeded.service @12.689s +10.067s
-basic.target @9.526s -sockets.target @9.526s
-snap.lxd.daemon.unix.socket @14.228s -sysinit.target @9.512s
-cloud-init.service @8.350s +1.161s -systemd-networkd-wait-online.service @8.341s +6ms
-systemd-journald.socket @1.304s -system.slice @1.300s
`--.slice @1.300s

output "systemctl status casper-md5check.service"

Unit casper-md5check.service could not be found

output "/var/log/casper.log"

stdin: Invalid argument
stdin: Invalid argument
stdin: Invalid argument
/init: line 49: can't open /dev/sda: No medium found
/init: line 49: can't open /dev/sda: No medium found
passwd: password expiry information changed.
Using CD-ROM mount point /cdrom/
Identifying... [bdb176fb941b5200237520bff1ab61c3-2]
Scanning disc for index files...
Found 2 package indexes, 0 source indexes, 0 translation indexes and 1 signatures
Found label 'Ubuntu-Server 20.04.4 LTS Focal Fossa - Release amd64 (20220223.1)'
This disc is called:
'Ubuntu-Server 20.04.4 LTS Focal Fossa - Release amd64 (20220223.1)'
Copying package lists...gpgv: Signature made Wed Feb 23 09:26:53 2022 UTC
gpgv: using RSA key 843938DF228D22F7B3742BC0D94AA3F0EFE21092
gpgv: Good signature from "Ubuntu CD Image Automatic Signing Key (2012) cdimage@ubuntu.com"
Reading Package Indexes... 0%
Reading Package Indexes... Done
Writing new source list
Source list entries for this disc are:
deb cdrom:[Ubuntu-Server 20.04.4 LTS Focal Fossa - Release amd64 (20220223.1)]/ focal main restricted
Repeat this process for the rest of the CDs in your set.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Narendra K(knarendra) wrote on 2022-09-16T16:54:35.729545+00:00

Hi Chad/Michael,

Any updates or feedback on findings in comment #27 ?

It seems like in Ubuntu 20.04,

a) cloud-init does not depend on multi-user.target
b) casper-md5check.service is not present

Could Ubuntu 22.04 installation be modified to match with Ubuntu 20.04 ? Any thoughts ?

Hi Michael,

Could 'fsck.mode=skip' be documented in Ubuntu 22.04 release notes ?

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Brett Holman(holmanb) wrote on 2022-09-16T19:09:00.896674+00:00

Since the issue has been narrowed down to casper-md5check, which is out of scope of cloud-init, I'm marking the cloud-init side of this bug as "Invalid", since there is nothing actionable to do to in cloud-init to resolve this bug, and the plans to increase debugging transparency that Chad mentioned were already planned and scheduled.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Narendra K(knarendra) wrote on 2022-09-21T09:51:56.971840+00:00

Hi Michael,

Any updates on the queries in comment #28 ?

It seems like in Ubuntu 20.04,

a) cloud-init does not depend on multi-user.target
b) casper-md5check.service is not present

Could Ubuntu 22.04 installation be modified to match with Ubuntu 20.04 ? Any thoughts ?

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Dan Bungert(dbungert) wrote on 2022-11-01T08:56:32.150866+00:00

We propose no longer having the md5check block multi-user. I'm opening a matching MP.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Narendra K(knarendra) wrote on 2022-11-16T15:20:30.614960+00:00

Hi Dan,

Would the patch get included in Ubuntu 22.04.2 ?

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Dan Bungert(dbungert) wrote on 2022-11-16T15:38:12.218127+00:00

Hi Narendra - yes, the plan is to have this done in time for 22.04.2. Thanks for specifically pointing this out, we need to make sure it's included in the Jammy Stable Release Updates.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Launchpad Janitor(janitor) wrote on 2022-11-27T04:37:22.218023+00:00

This bug was fixed in the package casper - 1.476


casper (1.476) lunar; urgency=medium

  • Change casper-md5check to Type=Simple to take advantage of the
    alternate interpretation of what started means relative to Type=oneshot.
    This should allow us to reach multi-user.target sooner, especially for
    users where the md5 check is slow. (LP: #1986781)

-- Dan Bungert daniel.bungert@canonical.com Wed, 23 Nov 2022 16:10:57 +1300

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Dan Bungert(dbungert) wrote on 2022-11-29T15:15:40.263866+00:00

Thanks for moving the SRU forward Mauricio.

I see you nominated this fix for Kinetic, I think it won't help much as we probably won't be creating new Kinetic install media. Or do you have something else in mind?

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2022-11-29T16:22:55.481845+00:00

Hi Dan o/
Yes, it'd be just for documentation purposes for the SRU team's review stage.
I'll mark it as Won't Fix with the rationale described, in the SRU template.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-23T20:59:45.645722+00:00

Reproducer/Verification for SRU to Jammy,
based on strace delay injection on read().

Uploading to jammy.
Attaching debdiff for reference.

...

Launch a VM with the Jammy daily live server:

$ wget https://cdimage.ubuntu.com/ubuntu-server/jammy/daily-live/current/jammy-live-server-amd64.iso

$ ISO=jammy-live-server-amd64.iso
$ VM=casper-jammy
$ virt-install  --name $VM --cdrom $ISO --vcpus 2 --memory 2048 --disk none --osinfo ubuntu-stable-latest 

Press e to edit, append "break=init console=ttyS0", press ctrl-x, close window.

Open the serial console and chroot:

$ virsh console $VM
...
(initramfs) chroot /root /bin/bash

Test strace delay injection:

# time strace --trace read cat /dev/null
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3206\2\0\0\0\0\0"..., 832) = 832
read(3, "", 131072)                     = 0
+++ exited with 0 +++

real	0m0.041s
user	0m0.009s
sys	0m0.030s

# time strace --trace read --inject read:delay_enter=5s cat /dev/null
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3206\2\0\0\0\0\0"..., 832) = 832 (DELAYED)
read(3, "", 131072)                     = 0 (DELAYED)
+++ exited with 0 +++

real	0m10.041s
user	0m0.010s
sys	0m0.033s

Modify the casper-md5check service:

# sed -i '/^ExecStart=/ s,=,=/usr/bin/strace --inject read:delay_enter=60s ,' /usr/lib/systemd/system/casper-md5check.service

# cat /usr/lib/systemd/system/casper-md5check.service
[Unit]
Description=casper-md5check Verify Live ISO checksums

[Service]
Type=oneshot
ExecStart=/usr/bin/strace --inject read:delay_enter=60s /usr/lib/casper/casper-md5check /cdrom /cdrom/md5sum.txt
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Exit and wait; the issue happens:

# exit
(initramfs) exit

...

Ubuntu 22.04.1 ubuntu-server ttyS0


connecting...
waiting for cloud-init... /

<10 minutes later>

================================================================================
  Serial                                                              [ Help ]
================================================================================

  As the installer is running on a serial console, it has started in basic
  mode, using only the ASCII character set and black and white colours.

   ┌────────────────────────────────────────────────────────────────────────┐
   │                                                                        │
   │  cloud-init failed to complete after 10 minutes of waiting. This       │
   │  suggests a bug, which we would appreciate help understanding.  If     │
   │  you could file a bug at                                               │
   │  https://bugs.launchpad.net/subiquity/+filebug and attach the          │
   │  contents of /var/log, it would be most appreciated.                   │
   │                                                                        │
   │                         [ Switch to a shell ]                          │
   │                         [ Close             ]                          │
   │                                                                        │
   └────────────────────────────────────────────────────────────────────────┘


	                  [ Continue in rich mode  > ]
	                  [ Continue in basic mode > ]

...

Similarly, repeat, this time with test packages from ppa:mfo/lp1986781:

...

Open the serial console and chroot:

$ virsh console $VM
...
(initramfs) chroot /root /bin/bash

Install the test package:

# dhclient enp1s0
# wget https://launchpad.net/~mfo/+archive/ubuntu/lp1986781/+build/25512466/+files/casper_1.470.2_amd64.deb
# dpkg -i casper_*.deb

Modify the casper-md5check service:

# sed -i '/^ExecStart=/ s,=,=/usr/bin/strace --inject read:delay_enter=60s ,' /usr/lib/systemd/system/casper-md5check.service

# cat /usr/lib/systemd/system/casper-md5check.service
Description=casper-md5check Verify Live ISO checksums
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/strace --inject read:delay_enter=60s /usr/lib/casper/casper-md5check /cdrom /cdrom/md5sum.txt
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Exit and wait; the issue does not happen anymore:

# exit
(initramfs) exit

...

Ubuntu 22.04.1 ubuntu-server ttyS0


connecting...
waiting for cloud-init... /

<some seconds later>



================================================================================
  Serial                                                              [ Help ]
================================================================================

  As the installer is running on a serial console, it has started in basic
  mode, using only the ASCII character set and black and white colours.

  If you are connecting from a terminal emulator such as gnome-terminal that
  supports unicode and rich colours you can switch to "rich mode" which uses
  unicode, colours and supports many languages.

  You can also connect to the installer over the network via SSH, which will
  allow use of rich mode.



	                  [ Continue in rich mode  > ]
	                  [ Continue in basic mode > ]
	                  [ View SSH instructions    ]

Checking the casper-md5sum service is still running:

Help > Enter shell.

# systemctl status casper-md5check.service --no-pager | grep 'Active:'
     Active: activating (start) since Mon 2023-01-23 18:56:50 UTC; 3min 26s ago

And it should not be a problem, as its start timeout is not limited.

# systemctl show casper-md5check.service | grep -i timeout
TimeoutStartUSec=infinity
TimeoutStopUSec=1min 30s
TimeoutAbortUSec=1min 30s
TimeoutStartFailureMode=terminate
TimeoutStopFailureMode=terminate
TimeoutCleanUSec=infinity
JobTimeoutUSec=infinity
JobRunningTimeoutUSec=infinity
JobTimeoutAction=none

Quit, cleanup the VM.

Press ctrl-]

$ virsh destroy $VM
$ virsh undefine $VM

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-23T21:01:19.241010+00:00

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-23T21:03:22.617136+00:00

cosmetic glitch (see description's regression potential section).
Launchpad attachments: Screenshot from 2023-01-23 13-08-33.png

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-23T21:07:35.250446+00:00

For documentation purposes,

The critical chain of systemd units/time
before/after the fix, with a fake 'sleep 30'
as the command in casper-md5check.service.

Before:

root@ubuntu-server:/# systemd-analyze critical-chain cloud-final.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +492ms
-multi-user.target @37.968s -casper-md5check.service @7.954s +30.010s
-basic.target @7.947s -sockets.target @7.946s
-snap.lxd.user-daemon.unix.socket @15.740s -sysinit.target @7.892s
-cloud-init.service @5.854s +2.032s -systemd-networkd-wait-online.service @5.835s +6ms
-systemd-networkd.service @5.713s +119ms -network-pre.target @5.709s
-cloud-init-local.service @1.943s +3.765s -systemd-remount-fs.service @1.172s +130ms
-systemd-journald.socket @988ms --.mount @942ms
`--.slice @941ms

After:

root@ubuntu-server:/# systemd-analyze critical-chain cloud-final.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +901ms
-multi-user.target @21.831s -ssh.service @21.712s +38ms
-basic.target @8.255s -sockets.target @8.254s
-snap.lxd.user-daemon.unix.socket @17.452s -sysinit.target @8.202s
-cloud-init.service @6.970s +1.225s -systemd-networkd-wait-online.service @6.958s +9ms
-systemd-networkd.service @6.842s +113ms -network-pre.target @6.838s
-cloud-init-local.service @1.938s +4.898s -systemd-remount-fs.service @1.160s +174ms
-systemd-journald.socket @941ms -system.slice @894ms
`--.slice @894ms

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-23T21:08:21.083037+00:00

Launchpad attachments: lp1986781-casper-jammy.debdiff

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-27T18:38:10.361167+00:00

Re-uploading to Jammy with changelog updates suggested by Timo.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-27T18:38:33.809893+00:00

Launchpad attachments: lp1986781-casper-jammy-v2.debdiff

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Timo Aaltonen(tjaalton) wrote on 2023-01-28T10:54:36.959899+00:00

Hello Shubhakar, or anyone else affected,

Accepted casper into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/casper/1.470.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To properly test it you will need to obtain and boot a daily build of a Live CD for jammy. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-30T14:00:25.633007+00:00

Verification done for jammy-proposed.

...

Booting jammy-live-server-amd64.iso with break=init
(more details in comment #37).

(initramfs) chroot /root /bin/bash
root@ubuntu-server:/# dhclient enp1s0

root@ubuntu-server:/# add-apt-repository -y -p proposed

root@ubuntu-server:/# apt install -y casper/jammy-proposed

root@ubuntu-server:/# apt policy casper
casper:
Installed: 1.470.2
Candidate: 1.470.2
Version table:
*** 1.470.2 500
500 http://security.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
500 http://archive.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
100 /var/lib/dpkg/status
...
root@ubuntu-server:/# add-apt-repository -y -p proposed -r

...

root@ubuntu-server:/# sed -i '/^ExecStart=/ s,=,=/usr/bin/strace --inject read:delay_enter=60s ,' /usr/lib/systemd/system/casper-md5check.service

root@ubuntu-server:/# cat /usr/lib/systemd/system/casper-md5check.service
[Unit]
Description=casper-md5check Verify Live ISO checksums
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/strace --inject read:delay_enter=60s /usr/lib/casper/casper-md5check /cdrom /cdrom/md5sum.txt
...

root@ubuntu-server:/# exit
exit
(initramfs) exit
...

...
[ OK ] Reached target Multi-User System.
Starting casper-md5check Verify Live ISO checksums...
Starting Execute cloud user/final scripts...
...
[ OK ] Finished Execute cloud user/final scripts.
[ OK ] Reached target Cloud-init target.

Ubuntu 22.04.1 LTS ubuntu-server ttyS0

connecting...

================================================================================
Serial [ Help ]

As the installer is running on a serial console, it has started in basic
mode, using only the ASCII character set and black and white colours.

If you are connecting from a terminal emulator such as gnome-terminal that
supports unicode and rich colours you can switch to "rich mode" which uses
unicode, colours and supports many languages.

You can also connect to the installer over the network via SSH, which will
allow use of rich mode.

                      [ Continue in rich mode  > ]
                      [ Continue in basic mode > ]
                      [ View SSH instructions    ]

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Mike Rushton(leftyfb) wrote on 2023-01-31T16:29:25.780022+00:00

"To properly test it you will need to obtain and boot a daily build of a Live CD for jammy"

If that is all that is needed then it is failing for me. If, however, we need to stop the installer, enable the proposed repository, update casper, disable the proposed repository and then continue the installer, then it looks like that has been validated above by Mauricio.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Mauricio Faria de Oliveira(mfo) wrote on 2023-01-31T17:13:15.565522+00:00

Hi Mike,

Right; the package version in -proposed isn't in the daily build yet,
so it needed more manual steps, but once this makes it to -updates
(or if a daily build is built w/ -proposed enabled), then it'll be in.

@ https://cdimage.ubuntu.com/ubuntu-server/jammy/daily-live/20230131/jammy-live-server-amd64.manifest

casper 1.470.1

cheers,
Mauricio

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Dan Bungert(dbungert) wrote on 2023-02-02T15:23:38.048915+00:00

Verification done for jammy-proposed.

I have purposefully done a different test procedure, based on injecting the proposed casper version and using systemd-analyze.

Above, in comments such as https://bugs.launchpad.net/cloud-init/+bug/1986781/comments/25, we can see that casper-md5check needs to complete before hitting multi-user.target.

With casper 1.470.2 in place, it looks like:

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

casper-md5check.service +13.837s
└─multi-user.target @18.459s
└─ssh.service @18.336s +30ms
└─basic.target @7.833s
└─sockets.target @7.831s
└─snap.lxd.user-daemon.unix.socket @15.746s
└─sysinit.target @7.778s
└─cloud-init.service @6.080s +1.695s
└─systemd-networkd-wait-online.service @6.061s +10ms
└─systemd-networkd.service @5.950s +109ms
└─network-pre.target @5.948s
└─cloud-init-local.service @1.615s +4.332s
└─systemd-remount-fs.service @884ms +135ms
└─systemd-journald.socket @705ms
└─system.slice @624ms
└─-.slice @624ms

This is the expected behavior, so the SRU LGTM.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Launchpad Janitor(janitor) wrote on 2023-02-07T11:05:44.814349+00:00

This bug was fixed in the package casper - 1.470.2


casper (1.470.2) jammy; urgency=medium

[ Dan Bungert ]

  • d/casper.casper-md5check.service: order it After=multi-user.target.
    This should allow us to reach multi-user.target sooner, especially for
    users where the md5 check is slow. (LP: #1986781)

-- Mauricio Faria de Oliveira mfo@canonical.com Fri, 27 Jan 2023 15:31:09 -0300

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Łukasz Zemczak(sil2100) wrote on 2023-02-07T11:05:49.907084+00:00

The verification of the Stable Release Update for casper has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

@ubuntu-server-builder ubuntu-server-builder closed this as not planned Won't fix, can't repro, duplicate, stale May 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
launchpad Migrated from Launchpad
Projects
None yet
Development

No branches or pull requests

1 participant