18.04 minimal images on GCE intermittently fail to set up networking #3158

Closed
ubuntu-server-builder opened this issue May 11, 2023 · 18 comments
Labels
launchpad Migrated from Launchpad priority Fix soon

Comments

@ubuntu-server-builder
Collaborator

This bug was originally filed in Launchpad as LP: #1766287

Launchpad details
affected_projects = ['cloud-init (Ubuntu)']
assignee = raharper
assignee_name = Ryan Harper
date_closed = 2018-06-20T18:06:32.783909+00:00
date_created = 2018-04-23T15:51:41.007862+00:00
date_fix_committed = 2018-04-26T20:39:27.377342+00:00
date_fix_released = 2018-06-20T18:06:32.783909+00:00
id = 1766287
importance = high
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1766287
milestone = None
owner = philroche
owner_name = Philip Roche
private = False
status = fix_released
submitter = philroche
submitter_name = Philip Roche
tags = ['id-5d0a33dc7c02f24574ae04aa']
duplicates = []

Launchpad user Philip Roche(philroche) wrote on 2018-04-23T15:51:41.007862+00:00

When running tests on 18.04 Minimal daily images on GCE we are noticing intermittent failure to set up networking.

One test run launched 4 instances of image daily-ubuntu-minimal-1804-bionic-v20180420 in GCE project ubuntu-os-cloud-devel. Only one of the four successfully set up networking.

Only loopback devices appear to be set up.

cloud-init 18.2-14-g6d48d26 is installed on these images.

I have attached the console log from the failed launch.

I can provide access to the successfully launched instance if required.

@ubuntu-server-builder ubuntu-server-builder added launchpad Migrated from Launchpad priority Fix soon labels May 11, 2023
@ubuntu-server-builder
Collaborator Author

Launchpad user Philip Roche(philroche) wrote on 2018-04-23T15:51:41.007862+00:00

Launchpad attachments: GCE console log of failed launch

@ubuntu-server-builder
Collaborator Author

Launchpad user Philip Roche(philroche) wrote on 2018-04-23T15:52:25.700511+00:00

Also see attached GCE console of successfully launched instance for comparison
Launchpad attachments: GCE Console of successfully launched instance

@ubuntu-server-builder
Collaborator Author

Launchpad user Philip Roche(philroche) wrote on 2018-04-23T17:26:28.783672+00:00

To help with debugging, I will try to collect logs from the failed instance launch.

@ubuntu-server-builder
Collaborator Author

Launchpad user Philip Roche(philroche) wrote on 2018-04-23T20:01:44.996513+00:00

I have gathered as much as I can from attaching a snapshot of the boot disk from the instance that failed to boot.

I was unable to run cloud-init collect-logs from inside the chroot (See filed bug https://bugs.launchpad.net/cloud-init/+bug/1766335).

The attached archive contains:

  • /var/log/cloud-init*log
  • cloud-init analyze show output
  • cloud-init analyze dump output
  • cloud-init package version
  • journalctl output
  • /var/lib/cloud/instance/user-data.txt

I hope this helps.
Launchpad attachments: failedbootdebug.tar.gz

@ubuntu-server-builder
Collaborator Author

Launchpad user Ryan Harper(raharper) wrote on 2018-04-23T22:00:29.053641+00:00

Here's what I think is happening.

In the success case, the virtio nic is renamed by the kernel to a "stable" name prior to cloud-init local enumerating the system nics and picking a fallback device.

$ journalctl -o short-precise | egrep "(Cloud-init|rename)"
Apr 23 16:19:45.517627 ubuntu kernel: virtio_net virtio1 ens4: renamed from eth0
Apr 23 16:19:47.427137 ubuntu cloud-init[163]: Cloud-init v. 18.2 running 'init-local' at Mon, 23 Apr 2018 16:19:47 +0000. Up 6.12 seconds.

In the failing case, we see that the rename happens after cloud-init-local has started:

Apr 23 10:33:24 ubuntu kernel: [ 3.334493] virtio_net virtio1 ens4: renamed from eth0
Apr 23 10:33:24 ubuntu cloud-init[165]: Cloud-init v. 18.2 running 'init-local' at Mon, 23 Apr 2018 10:33:21 +0000. Up 3.19 seconds.

Note that cloud-init's uptime value, 3.19 seconds, is earlier than the kernel timestamp of the rename, 3.33 seconds, by about 140 milliseconds.
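
A minimal sketch of that comparison, assuming the two journal lines quoted above are the input (the variable names are only illustrative). It pulls the kernel timestamp and cloud-init's reported uptime out of the lines and prints the gap in milliseconds.

```shell
# The two journal lines from the failing boot, copied from above.
rename_line='Apr 23 10:33:24 ubuntu kernel: [ 3.334493] virtio_net virtio1 ens4: renamed from eth0'
init_line="Apr 23 10:33:24 ubuntu cloud-init[165]: Cloud-init v. 18.2 running 'init-local' at Mon, 23 Apr 2018 10:33:21 +0000. Up 3.19 seconds."

# Kernel timestamp inside the square brackets (seconds since boot).
rename_ts=$(printf '%s\n' "$rename_line" | sed -n 's/.*kernel: \[ *\([0-9.]*\)].*/\1/p')
# Uptime that cloud-init reports when init-local starts.
init_up=$(printf '%s\n' "$init_line" | sed -n 's/.*Up \([0-9.]*\) seconds.*/\1/p')

# Positive gap means the rename landed after init-local had already started.
gap_ms=$(awk -v r="$rename_ts" -v u="$init_up" 'BEGIN { printf "%.0f", (r - u) * 1000 }')
echo "rename landed ${gap_ms} ms after init-local started"
```

A positive gap indicates the racy ordering; in the successful boot shown earlier, the rename is logged roughly two seconds before init-local starts.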

When this race happens, cloud-init local reads /sys/class/net for interfaces and picks eth0, because the nic has not yet been renamed. It then generates a config for eth0, and the config rendered to netplan contains Name=eth0 in the match section, so networkd does not apply it: by then the interface is actually named ens4.

There is a possibility that systemd-networkd isn't doing the rename properly; that is, in the failure path, the files will look like:

% cd /run/systemd/network; cat 10-netplan-ens4.link
[Match]
MACAddress=42:01:0a:80:00:03

[Link]
Name=eth0
WakeOnLan=off

% cat 10-netplan-ens4.network
[Match]
MACAddress=42:01:0a:80:00:03
Name=eth0

[Network]
DHCP=ipv4

[DHCP]
UseMTU=true

The .link file should have forced ens4 back to eth0, and it looks like this was happening, judging by these log messages:

Apr 23 10:33:24 ubuntu systemd-networkd[359]: ens4: Interface name change detected, ens4 has been renamed to eth0.
Apr 23 10:33:24 ubuntu systemd-networkd[359]: eth0: Interface name change detected, eth0 has been renamed to ens4.

But somehow the interface is renamed back to ens4, which then means the .network config won't apply.
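
One way to spot this state is to compare the Name= in the rendered .network file against the interfaces that actually exist. A minimal sketch, assuming a scratch directory stands in for /sys/class/net; `match_name_exists` is a hypothetical helper, not a cloud-init or netplan tool:

```shell
# match_name_exists NETWORK_FILE NET_DIR
# Succeeds if the Name= from the [Match] section names an entry in NET_DIR
# (on a real system NET_DIR would be /sys/class/net).
match_name_exists() {
    name=$(sed -n '/^\[Match]/,/^\[/ s/^Name=//p' "$1")
    [ -n "$name" ] && [ -e "$2/$name" ]
}

# Recreate the failing layout in a scratch directory: the nic is called
# ens4, but the rendered config matches on eth0.
scratch=$(mktemp -d)
mkdir -p "$scratch/net/lo" "$scratch/net/ens4"
cat > "$scratch/10-netplan-ens4.network" <<'EOF'
[Match]
MACAddress=42:01:0a:80:00:03
Name=eth0

[Network]
DHCP=ipv4
EOF

if match_name_exists "$scratch/10-netplan-ens4.network" "$scratch/net"; then
    result="match"
else
    result="mismatch"
fi
echo "$result"
```

With the layout above the helper reports a mismatch, which is the condition under which networkd leaves the interface unconfigured.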

@ubuntu-server-builder
Collaborator Author

Launchpad user Philip Roche(philroche) wrote on 2018-04-24T07:45:04.316599+00:00

As requested in IRC, please find attached the collect-logs archive from a successful boot.

@ubuntu-server-builder
Collaborator Author

Launchpad user Philip Roche(philroche) wrote on 2018-04-24T07:51:49.873980+00:00

Launchpad attachments: successful-boot-cloud-init.tar.gz

@ubuntu-server-builder
Collaborator Author

Launchpad user Ryan Harper(raharper) wrote on 2018-04-24T23:02:21.319321+00:00

I'm able to recreate this by launching the specified image in the europe-west1 region.

The race is between udev coldplug, which triggers systemd persistent naming events, and the start of cloud-init-local.service. If the nic has not yet been renamed by the time cloud-init-local runs, it will render a config for eth0, which won't match the nic once it gets renamed to ens4.

One key point for cloud-init is that, by the time cloud-init-local runs, we expect udev renaming to have already completed. Systemd provides a 'systemd-udev-settle.service' which can be invoked after the 'systemd-udev-trigger.service', aka the coldplug.

Currently nothing in the cloud image (minimal or regular) provides a Wants=systemd-udev-settle.service, which means that nothing is waiting for udev events to have completed. There are a number of reasons not to wait; in some cases USB or other devices take quite a while to come up, and this blocks boot. Currently in Ubuntu at least LVM and zfs will ensure that systemd-udev-settle.service is wanted and run before sysinit.target is reached.

We would like cloud-init-local.service to both Want systemd-udev-settle.service and run After it has completed. This ensures that any persistent name rules will have fired (systemd-udev-trigger.service starts the events), and the settle blocks until the kernel uevent queue is empty. At this point no other entity is issuing network device renames, and cloud-init-local can rely on the names of the devices that are present.

To verify this, I've set up a reboot loop on an instance where we've just added:

% git diff
diff --git a/systemd/cloud-init-local.service.tmpl b/systemd/cloud-init-local.service.tmpl
index ff9c644..2babf05 100644
--- a/systemd/cloud-init-local.service.tmpl
+++ b/systemd/cloud-init-local.service.tmpl
@@ -3,6 +3,8 @@
Description=Initial cloud-init job (pre-networking)
{% if variant in ["ubuntu", "unknown", "debian"] %}
DefaultDependencies=no
+Wants=systemd-udev-settle.service
+After=systemd-udev-settle.service
{% endif %}
Wants=network-pre.target
After=systemd-remount-fs.service

I will run this overnight to see how successful this approach is. It requires further discussion w.r.t whether we can generally enable this service without impacting other use-cases.
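
A quick way to confirm that a unit file carries both directives is sketched below. It assumes a scratch copy of the unit file; `unit_waits_for_settle` is a hypothetical helper name, and the sample unit content is abbreviated from the diff above.

```shell
# unit_waits_for_settle UNIT_FILE
# Succeeds only if the unit file both Wants= and is ordered After=
# systemd-udev-settle.service.
unit_waits_for_settle() {
    grep -Eq '^Wants=.*systemd-udev-settle\.service' "$1" &&
        grep -Eq '^After=.*systemd-udev-settle\.service' "$1"
}

# Scratch unit file with the two added lines in place.
unit=$(mktemp)
cat > "$unit" <<'EOF'
[Unit]
Description=Initial cloud-init job (pre-networking)
DefaultDependencies=no
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service
Wants=network-pre.target
After=systemd-remount-fs.service
EOF

if unit_waits_for_settle "$unit"; then
    patched="yes"
else
    patched="no"
fi
echo "waits for udev settle: $patched"
```

On a running instance, `systemctl show -p Wants -p After cloud-init-local.service` should report the effective dependencies, including any drop-ins.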

@ubuntu-server-builder
Collaborator Author

Launchpad user Ryan Harper(raharper) wrote on 2018-04-24T23:17:27.846990+00:00

To recreate:

  • install the google-cloud-sdk and configure your gcloud command line
  • cat >user-data.yaml <<EOF
    #cloud-config
    ssh_import_id: lp:
    EOF

launch instance that can recreate

  • gcloud compute instances create recreate-lp1766287 --zone=europe-west1 --image daily-ubuntu-minimal-1804-bionic-v20180420 --metadata-from-file user-data=user-data.yaml --metadata=serial-port-enable=1

get the ip of the instance into variable

  • IP=$(gcloud compute instances list | awk '/recreate-lp1766287/ {print $5}')

connect to instance

  • ssh ubuntu@$IP;

on the instance, set root password for serial console login

  • sudo bash; passwd

update cloud-init-local.service config to Want/After systemd-udev-settle.service

  • sudo sed -i -e '/DefaultDependencies/i Wants=systemd-udev-settle.service\nAfter=systemd-udev-settle.service' /lib/systemd/system/cloud-init-local.service

reset cloud-init

  • sudo cloud-init clean --logs --reboot

in separate terminal, fire up serial console

  • gcloud compute connect-to-serial-port recreate-lp1766287

In the shell with the IP variable defined,

Run this loop to watch and trigger reboots if network comes up:

  • COUNT=0; while true; do echo "---"; echo "COUNT=$COUNT"; ssh -o ConnectTimeout=5s ubuntu@$IP -- "sudo cloud-init status --wait; sudo cloud-init clean --logs; sudo shutdown --reboot +1; exit" 2>/dev/null; if [ "$?" = "0" ]; then echo "Boot Success!"; COUNT=$(($COUNT + 1)); fi; echo "waiting 4s"; sleep 4; done

@ubuntu-server-builder
Collaborator Author

Launchpad user Philip Roche(philroche) wrote on 2018-04-25T11:06:44.730952+00:00

@raharper I hadn't realised you had been working on a reproducer.

Attached is a test script I wrote and have used to successfully reproduce the issue.

A summary of test results is provided at the end of the test run.

Usage

reproduce-failed-boot.sh

OR

reproduce-failed-boot.sh --image-serial 20180420

Update TOTAL_LAUNCHES for fewer tests (currently set to 20).
Update DELETE_FAILED_LAUNCHES=true to delete all instances including failed launches

@ubuntu-server-builder
Collaborator Author

Launchpad user Philip Roche(philroche) wrote on 2018-04-25T11:19:13.821018+00:00

Launchpad attachments: reproduce-failed-boot.sh

@ubuntu-server-builder
Collaborator Author

Launchpad user Ryan Harper(raharper) wrote on 2018-04-25T13:58:40+00:00

After applying the systemd-udev-settle.service changes to cloud-init
on my instance, I've got 484 successful reboots with no interruption
of networking.

On Wed, Apr 25, 2018 at 6:19 AM, Philip Roche phil.roche@canonical.com wrote:

** Attachment removed: "reproduce-failed-boot.sh"
https://bugs.launchpad.net/cloud-init/+bug/1766287/+attachment/5126998/+files/reproduce-failed-boot.sh

** Attachment added: "reproduce-failed-boot.sh"
https://bugs.launchpad.net/cloud-init/+bug/1766287/+attachment/5127008/+files/reproduce-failed-boot.sh


@ubuntu-server-builder
Collaborator Author

Launchpad user Scott Moser(smoser) wrote on 2018-04-26T20:39:24.461466+00:00

An upstream commit landed for this bug.

To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=4731c8da

@ubuntu-server-builder
Collaborator Author

Launchpad user Philip Roche(philroche) wrote on 2018-05-03T23:47:14.278629+00:00

I too have verified the GCE bionic images with cloud-init 18.2-27 (currently in bionic-proposed) in europe-west region. 20 of 20 launched successfully.

Nice work.

@ubuntu-server-builder
Collaborator Author

Launchpad user Chad Smith(chad.smith) wrote on 2018-05-25T20:07:02.207623+00:00

This bug is believed to be fixed in cloud-init in version 18.2-27-g6ef92c98-0ubuntu1~18.04.1. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

@ubuntu-server-builder
Collaborator Author

Launchpad user Scott Moser(smoser) wrote on 2018-06-20T18:06:33.687680+00:00

This bug is believed to be fixed in cloud-init in version 18.3. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

@ubuntu-server-builder
Collaborator Author

Launchpad user Philip Roche(philroche) wrote on 2019-03-29T17:20:08.404430+00:00

I'd like to reopen this, as Disco minimal images are failing to set up networking for reasons similar to this bug; the only difference is that no nic was found.

A workaround was found to set up cloud-init service config:

/etc/systemd/system/cloud-init-local.service.d/gcp.conf

[Unit]
After=systemd-udev-trigger.service

[Service]
ExecStartPre=/bin/udevadm settle

The goal of this workaround is to:

  1. ensure that cloud-init-local.service runs after
    systemd-udev-trigger.service starts (this is what triggers
    udev coldplug events, like plugging in the nic)
  2. Run udevadm settle before we start cloud-init local so that any
    nic processing is completed before cloud-init starts looking for
    a nic.
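
A minimal sketch of installing that drop-in, assuming a scratch directory is used as the root so it can be tried without touching a live system; `install_dropin` is a hypothetical helper, not part of cloud-init.

```shell
# install_dropin ROOT
# Writes the gcp.conf drop-in under ROOT/etc/systemd/system/.
# Pass an empty ROOT (as root) to install on the running system.
install_dropin() {
    dir="$1/etc/systemd/system/cloud-init-local.service.d"
    mkdir -p "$dir"
    cat > "$dir/gcp.conf" <<'EOF'
[Unit]
After=systemd-udev-trigger.service

[Service]
ExecStartPre=/bin/udevadm settle
EOF
}

# Try it against a scratch root and show the result.
scratch=$(mktemp -d)
install_dropin "$scratch"
cat "$scratch/etc/systemd/system/cloud-init-local.service.d/gcp.conf"
```

When installed on a real system, `systemctl daemon-reload` is needed afterwards so that systemd picks up the new drop-in.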

Currently this is only required on minimal images, but there is a
chance it could occur in base images too should they boot quickly
enough. Minimal Disco images do not have snap preseeding as base images
do; because preseeding runs before cloud-init, the race is extremely
unlikely to happen in base images.

I understand that cloud-init might not be the place to fix the issue for all images but I'd like to re-open this bug to start that discussion.

I have attached cloud-init logs, netplan yaml, image manifest and sosreports from an instance that failed to set up networking.
Launchpad attachments: Disco GCE Minimal Failed Networking Setup Logs

@ubuntu-server-builder
Collaborator Author

Launchpad user Philip Roche(philroche) wrote on 2019-03-29T17:27:47.126924+00:00

On guidance from raharper, I have opened a new bug for this at https://bugs.launchpad.net/cloud-init/+bug/1822353
