18.04 minimal images on GCE intermittently fail to set up networking #3158
Comments
Launchpad user Philip Roche(philroche) wrote on 2018-04-23T15:51:41.007862+00:00 Launchpad attachments: GCE console log of failed launch |
Launchpad user Philip Roche(philroche) wrote on 2018-04-23T15:52:25.700511+00:00 Also see attached GCE console of successfully launched instance for comparison |
Launchpad user Philip Roche(philroche) wrote on 2018-04-23T17:26:28.783672+00:00 To help with debugging I will try and collect logs from the failed instance launch. |
Launchpad user Philip Roche(philroche) wrote on 2018-04-23T20:01:44.996513+00:00 I have gathered as much as I can from attaching a snapshot of the boot disk from the instance that failed to boot. I was unable to run The attached archive contains:
I hope this helps. |
Launchpad user Ryan Harper(raharper) wrote on 2018-04-23T22:00:29.053641+00:00 Here's what I think is happening. In the success case, the kernel renames the virtio NIC to a "stable" name before cloud-init-local enumerates the system NICs and picks a fallback device.
$ journalctl -o short-precise | egrep "(Cloud-init|rename)"
In the failing case, we see that the rename happens after cloud-init-local has started:
Apr 23 10:33:24 ubuntu kernel: [ 3.334493] virtio_net virtio1 ens4: renamed from eth0
Note that cloud-init's uptime value, 3.19 seconds, is earlier than the kernel rename time of 3.33 seconds, by about 140 milliseconds. When this race happens, cloud-init-local reads /sys/class/net for interfaces and picks eth0, as it has not yet been renamed. It then generates a config for eth0, and the rendered netplan config contains Name=eth0 in the match section, so networkd does not apply the config because the interface is actually named ens4 by then. There is also a possibility that systemd-networkd isn't doing the rename properly; that is, in the failure path, the files look like:
% cat /run/systemd/network$ cat 10-netplan-ens4.link
[Link]
% cat 10-netplan-ens4.network
[Network]
[DHCP]
The .link file should have forced ens4 back to eth0, and it looks like this was happening, given this log message:
Apr 23 10:33:24 ubuntu systemd-networkd[359]: ens4: Interface name change detected, ens4 has been renamed to eth0.
But somehow it has moved back, which then means the .network config won't apply. |
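The ordering argument in this comment can be checked mechanically. The sketch below is not part of cloud-init; it only parses the kernel rename line quoted above and compares its timestamp with the cloud-init uptime of 3.19 seconds mentioned in the comment. The names `parse_rename` and `rename_lost_race` are made up for illustration.

```python
import re

# Matches a kernel log line such as:
#   kernel: [ 3.334493] virtio_net virtio1 ens4: renamed from eth0
RENAME_RE = re.compile(
    r"kernel: \[\s*(?P<ts>\d+\.\d+)\] .*? (?P<new>\S+): renamed from (?P<old>\S+)"
)


def parse_rename(line):
    """Return (kernel_time_seconds, old_name, new_name), or None if no match."""
    m = RENAME_RE.search(line)
    if m is None:
        return None
    return float(m.group("ts")), m.group("old"), m.group("new")


def rename_lost_race(cloud_init_uptime, rename_time):
    """True when cloud-init-local started before the kernel renamed the NIC."""
    return cloud_init_uptime < rename_time


line = ("Apr 23 10:33:24 ubuntu kernel: [ 3.334493] "
        "virtio_net virtio1 ens4: renamed from eth0")
rename_time, old_name, new_name = parse_rename(line)
cloud_init_uptime = 3.19  # uptime value taken from the comment above

gap = rename_time - cloud_init_uptime
print(f"{old_name} -> {new_name} at {rename_time:.3f}s, "
      f"cloud-init-local at {cloud_init_uptime:.2f}s, gap {gap:.3f}s")
print("race lost:", rename_lost_race(cloud_init_uptime, rename_time))
```

With the values from the failing boot, the rename lands roughly 0.14 seconds after cloud-init-local starts, which is the window in which eth0 is still visible under /sys/class/net.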
Launchpad user Philip Roche(philroche) wrote on 2018-04-24T07:45:04.316599+00:00 As requested in IRC, please find attached the collect-logs archive from a successful boot. |
Launchpad user Philip Roche(philroche) wrote on 2018-04-24T07:51:49.873980+00:00 Launchpad attachments: successful-boot-cloud-init.tar.gz |
Launchpad user Ryan Harper(raharper) wrote on 2018-04-24T23:02:21.319321+00:00 I'm able to recreate this by launching the specified image in the europe-west1 region. The race is between udev coldplug, which triggers systemd persistent naming events, and the start of cloud-init-local.service. If the NIC has not yet been renamed by the time cloud-init-local runs, it will render a config for eth0, which won't match the NIC once it gets renamed to ens4. One key point for cloud-init is that at local time, we expect udev renaming to have already completed. Systemd provides a 'systemd-udev-settle.service' which can be invoked after the 'systemd-udev-trigger.service', aka the coldplug. Currently nothing in the cloud image (minimal or regular) provides a Wants=systemd-udev-settle.service, which means that nothing is waiting for udev events to complete. There are a number of reasons not to wait for things; in some cases USB or other devices take quite a while to come up, and this blocks boot. Currently in Ubuntu, at least LVM and ZFS ensure that systemd-udev-settle.service is wanted and run before sysinit.target is reached. We would like cloud-init-local.service to both Want systemd-udev-settle.service and run After it has completed. This ensures that any persistent name rules will have fired (systemd-udev-trigger.service starts the events), and the settle blocks until the kernel uevent queue is empty. At that point no other entity is issuing network device renames, and cloud-init-local can rely on the names of whatever devices are present. To verify this, I've set up a reboot loop on an instance where we've just added:
% git diff
I will run this overnight to see how successful this approach is. It requires further discussion as to whether we can generally enable this service without impacting other use cases. |
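The diff itself is not preserved in this thread, so the following is only a sketch of one way to express the Want/After relationship described above, using a systemd drop-in rather than an edit to the unit file. `Wants=` and `After=` are standard systemd unit directives; the drop-in file name `10-udev-settle.conf` and the helper names are assumptions made for illustration. The sketch writes into a scratch directory rather than the real `/etc`.

```python
import tempfile
from pathlib import Path

UNIT = "cloud-init-local.service"
SETTLE = "systemd-udev-settle.service"


def render_dropin(dependency=SETTLE):
    """Return drop-in text that both pulls in and orders after `dependency`."""
    return (
        "[Unit]\n"
        f"Wants={dependency}\n"
        f"After={dependency}\n"
    )


def write_dropin(root, name="10-udev-settle.conf"):
    """Write the drop-in under <root>/etc/systemd/system/<UNIT>.d/ and return its path.

    The file name is hypothetical; any *.conf name in that directory would do.
    """
    dropin_dir = Path(root) / "etc/systemd/system" / f"{UNIT}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    path = dropin_dir / name
    path.write_text(render_dropin())
    return path


# Write into a scratch directory so that the sketch is safe to run anywhere.
with tempfile.TemporaryDirectory() as scratch:
    path = write_dropin(scratch)
    print(path.relative_to(scratch))
    print(path.read_text(), end="")
```

`Wants=` alone would only pull the settle service into the boot transaction; `After=` is what makes cloud-init-local wait for it to finish. On a real system the file would live under `/etc/systemd/system/cloud-init-local.service.d/` and would need a `systemctl daemon-reload` to take effect.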
Launchpad user Ryan Harper(raharper) wrote on 2018-04-24T23:17:27.846990+00:00 To recreate:
launch instance that can recreate
get the ip of the instance into variable
connect to instance
on the instance, set root password for serial console login
update cloud-init-local.service config to Want/After systemd-udev-settle.service
reset cloud-init
in separate terminal, fire up serial console
In the shell with the IP variable defined, run this loop to watch the instance and trigger a reboot whenever the network comes up:
|
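The original loop is not preserved in this thread. As a rough sketch of the idea (keep rebooting while networking comes up, stop on the first boot where it does not), something like the following could be used. Every name here is hypothetical, and the probe and reboot actions are passed in as callables so the logic can be dry-run without a real instance.

```python
import socket
import time


def ssh_port_open(ip, port=22, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within `timeout`."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_network(ip, probe, attempts, delay, sleep=time.sleep):
    """Poll `probe(ip)` up to `attempts` times; True as soon as it succeeds."""
    for _ in range(attempts):
        if probe(ip):
            return True
        sleep(delay)
    return False


def reboot_loop(ip, probe, reboot, max_boots, attempts=60, delay=5.0,
                sleep=time.sleep):
    """Reboot for as long as networking keeps coming up.

    Returns the number of boots that came up with networking. The loop stops
    on the first boot where the probe never succeeds, leaving the instance in
    the failed state for inspection over the serial console.
    """
    good_boots = 0
    for _ in range(max_boots):
        if not wait_for_network(ip, probe, attempts, delay, sleep):
            break
        good_boots += 1
        reboot(ip)
        sleep(delay)  # crude pause so the instance can go down before probing
    return good_boots


# Dry run with stand-ins: networking comes up on the first two boots only.
outcomes = iter([True, True, False])
boots = reboot_loop(
    "192.0.2.10",                      # documentation-only address (RFC 5737)
    probe=lambda ip: next(outcomes),
    reboot=lambda ip: None,
    max_boots=5,
    attempts=1,
    delay=0,
    sleep=lambda seconds: None,
)
print(boots)  # prints 2
```

Against a real instance, `probe` could be `ssh_port_open` and `reboot` a callable that issues the reboot over SSH; both choices are assumptions, not what was actually run here.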
Launchpad user Philip Roche(philroche) wrote on 2018-04-25T11:06:44.730952+00:00 @raharper I hadn't realised you had been working on a reproducer. Attached is a test script that I wrote and have used to successfully reproduce the issue. A summary of test results is provided at the end of the test run. Usage
OR
Update TOTAL_LAUNCHES for fewer tests (currently set to 20). |
Launchpad user Philip Roche(philroche) wrote on 2018-04-25T11:19:13.821018+00:00 Launchpad attachments: reproduce-failed-boot.sh |
Launchpad user Ryan Harper(raharper) wrote on 2018-04-25T13:58:40+00:00 After applying the systemd-udev-settle.service changes to cloud-init On Wed, Apr 25, 2018 at 6:19 AM, Philip Roche phil.roche@canonical.com wrote:
|
Launchpad user Scott Moser(smoser) wrote on 2018-04-26T20:39:24.461466+00:00 An upstream commit landed for this bug. To view that commit see the following URL: |
Launchpad user Philip Roche(philroche) wrote on 2018-05-03T23:47:14.278629+00:00 I too have verified the GCE bionic images with cloud-init 18.2-27 (currently in bionic-proposed) in europe-west region. 20 of 20 launched successfully. Nice work. |
Launchpad user Chad Smith(chad.smith) wrote on 2018-05-25T20:07:02.207623+00:00 This bug is believed to be fixed in cloud-init in version 18.2-27-g6ef92c98-0ubuntu1~18.04.1. If this is still a problem for you, please make a comment and set the state back to New. Thank you. |
Launchpad user Scott Moser(smoser) wrote on 2018-06-20T18:06:33.687680+00:00 This bug is believed to be fixed in cloud-init in version 18.3. If this is still a problem for you, please make a comment and set the state back to New. Thank you. |
Launchpad user Philip Roche(philroche) wrote on 2019-03-29T17:20:08.404430+00:00 I'd like to reopen this following Disco minimal images failing to set up networking due to similar reasons to this bug with the only difference being that no nic was found. A workaround was found to set up cloud-init service config: /etc/systemd/system/cloud-init-local.service.d/gcp.conf
The goal of this workaround is to:
Currently this is only required on minimal images but there is a I understand that cloud-init might not be the place to fix the issue for all images but I'd like to re-open this bug to start that discussion. I have attached cloud-init logs, netplan yaml, image manifest and sosreports from an instance that failed to set up networking. |
Launchpad user Philip Roche(philroche) wrote on 2019-03-29T17:27:47.126924+00:00 On guidance from raharper I have opened new bug for this @ https://bugs.launchpad.net/cloud-init/+bug/1822353 |
This bug was originally filed in Launchpad as LP: #1766287
Launchpad details
Launchpad user Philip Roche(philroche) wrote on 2018-04-23T15:51:41.007862+00:00
When running tests on 18.04 Minimal daily images on GCE we are noticing intermittent failure to set up networking.
One test run launched 4 instances of image daily-ubuntu-minimal-1804-bionic-v20180420 in GCE project ubuntu-os-cloud-devel. Only one of the four successfully set up networking.
There appear to be only loopback devices set up.
cloud-init 18.2-14-g6d48d26 is installed on these images.
I have attached the console log from the failed launch.
I can provide access to the successfully launched instance if required.