Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud-init may hang OS boot process due to grep for the entire ISO file when it is attached #3293

Closed
ubuntu-server-builder opened this issue May 11, 2023 · 14 comments
Labels
launchpad Migrated from Launchpad

Comments

@ubuntu-server-builder
Copy link
Collaborator

This bug was originally filed in Launchpad as LP: #1806701

Launchpad details
affected_projects = []
assignee = pengpengs
assignee_name = Pengpeng Sun
date_closed = 2019-07-17T17:12:07.462698+00:00
date_created = 2018-12-04T14:20:45.721645+00:00
date_fix_committed = 2019-07-16T14:31:22.239947+00:00
date_fix_released = 2019-07-17T17:12:07.462698+00:00
id = 1806701
importance = medium
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1806701
milestone = None
owner = pengzhencao
owner_name = Pengzhen(Peter) Cao
private = False
status = fix_released
submitter = pengzhencao
submitter_name = Pengzhen(Peter) Cao
tags = []
duplicates = []

Launchpad user Pengzhen(Peter) Cao(pengzhencao) wrote on 2018-12-04T14:20:45.721645+00:00

We have found in our test for SLES15 with cloud-init installed, if we attach a ISO file with the VM before VM is boot, it often takes more than 10 minutes to start the SLES OS. Sometimes it failed to start the SLES OS at all.

We've root caused it is due to the "is_cdrom_ovf()" func of "tools/ds-identify".

In this function, there is the following logic to detect if an ISO contains certain string:

local idstr="http://schemas.dmtf.org/ovf/environment/1"
grep --quiet --ignore-case "$idstr" "${PATH_ROOT}$dev"
ref: https://git.launchpad.net/cloud-init/tree/tools/ds-identify

It is trying to grep the who ISO file for a certain string, which causes intense IO pressure for the system.
What is worse is that sometimes the ISO file is large (e.g. >5GB for installer DVD) and it is mounted over NFS. The "grep" process often consume 99% CPU and seems hang. Then the systemd starts more and more "grep" process which smoke the CPU and consumes all the IO bandwidth for the ISO file. Then the system may hang for a long time and sometimes failed to start.

To fix this issue, I suggest that we should not grep for the entire ISO file. Rather then we should just check if the file/dir exists with os.path.exists().

-------------------------debug log snip------------------------
pek2-gosv-16-dhcp180:~ # ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 13:32 ? 00:00:04 /usr/lib/systemd/systemd --switched-root --system --deserialize 24

root 474 1 0 13:34 ? 00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 482 474 2 13:34 ? 00:00:15 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
root 1020 1 0 13:35 ? 00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 1039 1020 1 13:35 ? 00:00:07 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
polkitd 1049 1 0 13:37 ? 00:00:00 /usr/lib/polkit-1/polkitd --no-debug
root 1051 1 0 13:37 ? 00:00:00 /usr/sbin/wickedd --systemd --foreground
root 1052 1 0 13:37 ? 00:00:00 /usr/lib/systemd/systemd-logind
root 1054 1 0 13:37 ? 00:00:00 /usr/sbin/wickedd-nanny --systemd --foreground
root 1073 1 0 13:37 ? 00:00:00 /usr/bin/vmtoolsd
root 1097 1 0 13:37 ? 00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 1110 1097 1 13:37 ? 00:00:04 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
root 1304 1 0 13:38 ? 00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 1312 1304 1 13:38 ? 00:00:03 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
root 1537 1 0 13:40 ? 00:00:00 /usr/bin/plymouth --wait
root 1613 1 0 13:40 ? 00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 1645 1613 0 13:40 ? 00:00:02 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1

Grep use nearly 100% cpu, system very slow.

top - 13:46:37 up 26 min, 2 users, load average: 14.14, 15.03, 10.57
Tasks: 225 total, 6 running, 219 sleeping, 0 stopped, 0 zombie
%Cpu(s): 40.1 us, 49.3 sy, 0.0 ni, 0.0 id, 1.4 wa, 0.0 hi, 9.1 si, 0.0 st
KiB Mem : 1000916 total, 64600 free, 355880 used, 580436 buff/cache
KiB Swap: 1288168 total, 1285600 free, 2568 used. 492688 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4427 root 20 0 40100 3940 3084 R 99.90 0.394 0:27.41 top
1016 root 20 0 197796 4852 3400 R 99.90 0.485 1:26.44 vmtoolsd
1723 root 20 0 7256 1860 1556 D 99.90 0.186 0:28.44 grep
484 root 20 0 7256 1684 1396 D 99.90 0.168 1:51.22 grep
1278 root 20 0 7256 1856 1556 D 99.90 0.185 0:38.44 grep
1398 root 20 0 7256 1860 1556 R 99.90 0.186 0:28.53 grep
1061 root 20 0 7256 1856 1556 D 99.90 0.185 0:56.62 grep
-------------------------debug log snip------------------------

@ubuntu-server-builder ubuntu-server-builder added the launchpad Migrated from Launchpad label May 11, 2023
@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Scott Moser(smoser) wrote on 2018-12-04T15:47:39.142506+00:00

@peter,

You're correct that ds-identify both that ds-identify does grep through
and entire cdrom if attached with an ISO9660 filesystem. The reason for this hack
is that
a.) the OVF specification does not indicate the filesystem label for ISO transport.
If it did, we could require that. As it is, there is no easy way to positively
identify the OVF transport cdrom.
b.) when this code was developed I asked a vmware contact if I could rely on
any such label in practice and was told no.
c.) we did not want to attempt a mount in ds-identify and cannot rely on the
filesystem being mounted at that point in time.

Note that commit 530850f improves the situation on vmware to
not do the grep if the filesystem has a label 'OVF ENV'. In some recent
experience I've seen this label provided by vmware.

If we can rely on that label on VMWare OVF systems, then I think I would be
fine with dropping the grep.

Note that the grep is protected from executing in many ways. It will only
execute if:
a.) the ISO9660 filesystem is on a device named /dev/sr[0-9] or /dev/hd[a-z].
b.) the label is not one of case insensitive: OVF-Transport OVFENV, OVF ENV, config-2, rd_rdfe_stable*, cidata.

If we cannot drop the grep all together because VMware/OVF spec does not provide
us a filesystem label to match on, then we could potentially further limit the
grep to only occur if the device attached was very small (perhaps < 10M ?).
That way we would not grep through DVD or 700M install ISOs.

Thoughts?

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Pengzhen(Peter) Cao(pengzhencao) wrote on 2018-12-04T23:44:12.522023+00:00

In the case of OVF customization the protection could work. However, if a user choose to install an OS from the installation ISO image then it could not. In our example we were trying to install a SLES15 OS from its installer DVD attached by virtual CDROM(/dev/sr0).

It is the OS vender's choice for what the ISO's lable is, usually "XXX installer DVD". "XXX package DVD"...

If we can not just check if certain config file exists on the ISO image, then I think your suggestion for limiting the ISO file size before grep seems a solution. As the config iso file will defenitely be a small one.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Scott Moser(smoser) wrote on 2018-12-05T14:53:24.399928+00:00

@peter,

I think I'm happy to drop the 'grep' if you tell me that vmware installations will always label the OVF transport 'OVF ENV' which is my current experience with a vSphere Client 6.0.0 and vCenter Server 6.0.0.

if we cannot rely on that, then other heuristic checks is all we can do.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Scott Moser(smoser) wrote on 2018-12-05T14:54:28.417211+00:00

well, not all we can do, but without adding either a 'mount' or a dependency on something like 'isoinfo'.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Pengzhen(Peter) Cao(pengzhencao) wrote on 2018-12-06T08:47:30.784306+00:00

Let me check with OVF team internally and update you if we use fixed name for the iso lable.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Pengpeng Sun(pengpengs) wrote on 2019-02-22T08:18:32.885755+00:00

Hi Scott,

If only consider to speed up VMware guest customization process, could we move if ovf_vmware_guest_customization check to be in front of "is_cdrom_ovf" in dsCheckOVF()?
https://github.com/cloud-init/cloud-init/blob/12bc76cebf69a1c8cf9eba78431333842ed170cf/tools/ds-identify#L772

Not sure if there is side-effect to do this change.

Thanks,
Pengpeng

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Scott Moser(smoser) wrote on 2019-03-05T14:38:42+00:00

it does seem like that would improve things, pengpeng. Much better
would be for vmware to promise (at least with new versions) that the
attached OVF cdrom would have a specific label.

On Fri, Feb 22, 2019 at 3:30 AM Pengpeng Sun 1806701@bugs.launchpad.net wrote:

Hi Scott,

If only consider to speed up VMware guest customization process, could we move if ovf_vmware_guest_customization check to be in front of "is_cdrom_ovf" in dsCheckOVF()?
https://github.com/cloud-init/cloud-init/blob/12bc76cebf69a1c8cf9eba78431333842ed170cf/tools/ds-identify#L772

Not sure if there is side-effect to do this change.

Thanks,
Pengpeng

--
You received this bug notification because you are subscribed to cloud-
init.
https://bugs.launchpad.net/bugs/1806701

Title:
cloud-init may hang OS boot process due to grep for the entire ISO
file when it is attached

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1806701/+subscriptions

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Pengpeng Sun(pengpengs) wrote on 2019-03-06T06:47:48.971508+00:00

@scott, agree with you. While I checked the current versions, the Label contains no fixed string. If you attached a VMTools ISO, then LABEL=VMware Tools, TYPE=iso9660.
If you attached a RHEL7.6 installer ISO, then LABEL=RHEL-7.6 Server.x86_64, TYPE=iso9660.
so LABEL depends on what's ISO attached.
We will try to get an official answer from OVF team internally, but the answer could be 'no specific label'.

Thanks,
Pengpeng

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Pengpeng Sun(pengpengs) wrote on 2019-04-17T02:16:21.165227+00:00

Note that commit 530850f improves the situation on vmware to
not do the grep if the filesystem has a label 'OVF ENV'. In some recent
experience I've seen this label provided by vmware.

As Scott mentioned, got the label 'OVF ENV' when attached OVF cdrom which need enable vApp options on a VM.
While the vApp options are not enabled by default on a VMware VM, so can NOT reply on it to check if current is a VMware installation.
To solve this issue, checking size might be the best option, that grep entire cdrom size only if it's size is small (ex: <=10MB).
And to speed up VMware guest customization, could move if ovf_vmware_guest_customization check to be in front of "is_cdrom_ovf" in dsCheckOVF()? , see https://bugs.launchpad.net/cloud-init/+bug/1806701/comments/6

Copy Ryan and Dan on this.

*** An article on how to enable vApp options on a VM: https://www.virtuallyghetto.com/2012/06/ovf-runtime-environment.html

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Pengpeng Sun(pengpengs) wrote on 2019-06-18T09:56:48.367078+00:00

Post a merge request:
https://code.launchpad.net/~pengpengs/cloud-init/+git/cloud-init/+merge/368959

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Server Team CI bot(server-team-bot) wrote on 2019-07-16T14:31:20.836121+00:00

This bug is fixed with commit d9769c4 to cloud-init on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=d9769c47

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Ryan Harper(raharper) wrote on 2019-07-17T17:12:08.643739+00:00

This bug is believed to be fixed in cloud-init in version 19.2. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Nicklas(nicklaswallgren) wrote on 2019-12-16T11:27:33.283669+00:00

We got recurring problems with cloud-init/ds-identify process. It spawns sub-processes "blkid -c /dev/null -o export" which gets stuck in the "D" uninterruptible sleep state.

The processes cannot be killed, so the only solution is to reboot the affected server.

root 3839 0.0 0.0 4760 1840 ? S Dec05 0:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 3844 0.0 0.0 11212 2836 ? D Dec05 0:00 _ blkid -c /dev/null -o export
root 6943 0.0 0.0 4760 1880 ? S Dec05 0:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 6948 0.0 0.0 11212 2844 ? D Dec05 0:00 _ blkid -c /dev/null -o export
root 6111 0.0 0.0 4760 1916 ? S Dec12 0:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 6149 0.0 0.0 11212 2940 ? D Dec12 0:00 _ blkid -c /dev/null -o export
root 8765 0.0 0.3 926528 24968 ? Ssl Dec12 0:12 /usr/lib/snapd/snapd
root 9179 0.0 0.0 4760 1892 ? S Dec12 0:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 9185 0.0 0.0 11980 3552 ? D Dec12 0:00 _ blkid -c /dev/null -o export

Distributor ID: Ubuntu
Description: Ubuntu 18.04.3 LTS
Release: 18.04
Codename: bionic

5.0.0-36-generic #39~18.04.1-Ubuntu SMP Tue Nov 12 11:09:50 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Pengpeng Sun(pengpengs) wrote on 2019-12-17T03:21:51.729502+00:00

@Nicklas, I think the /var/run/cloud-init/ds-identify.log is helpful to investigate the problem.
And could you please file a new bug with above log file, it's a different issue with the one in this bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
launchpad Migrated from Launchpad
Projects
None yet
Development

No branches or pull requests

1 participant