cloud-init 19.2.36 fails with python exception "Not all expected physical devices present ..." during bionic image deployment from MAAS #3459

Closed
ubuntu-server-builder opened this issue May 12, 2023 · 46 comments
Labels
launchpad Migrated from Launchpad priority Fix soon

Comments

@ubuntu-server-builder
Collaborator

This bug was originally filed in Launchpad as LP: #1846535

Launchpad details
affected_projects = ['cloud-init (Ubuntu)', 'cloud-init (Ubuntu Xenial)', 'cloud-init (Ubuntu Bionic)', 'cloud-init (Ubuntu Disco)', 'cloud-init (Ubuntu Eoan)']
assignee = oddbloke
assignee_name = Dan Watkins
date_closed = 2019-12-19T23:00:05.207528+00:00
date_created = 2019-10-03T17:19:42.795903+00:00
date_fix_committed = 2019-10-04T15:34:46.009163+00:00
date_fix_released = 2019-12-19T23:00:05.207528+00:00
id = 1846535
importance = critical
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1846535
milestone = None
owner = nikolay.vinogradov
owner_name = Nikolay Vinogradov
private = False
status = fix_released
submitter = nikolay.vinogradov
submitter_name = Nikolay Vinogradov
tags = ['cpe-onsite', 'regression-update', 'sts', 'verification-done', 'verification-done-bionic', 'verification-done-disco', 'verification-done-xenial']
duplicates = [1846661]

Launchpad user Nikolay Vinogradov(nikolay.vinogradov) wrote on 2019-10-03T17:19:42.795903+00:00

[Impact]

Any instances launched with bridges or bonds in their network configuration will fail to bring up networking.

[Test Case]

Juju bootstrap on maas of a machine sets up a network bridge that triggers a failure in cloud-init init stage.

This results in a maas machine deployment failure and the machine gets released

Procedure:

Alternate steps on a maas machine with a bridge already created

A1. confirm a bridge interface is configured for the target machine on interface eno1, name it broam, attach it to a subnet and select auto-assign for IP
A2. click deploy -> bionic
A3. Once manual deployment fails go to step 2 below

Alternative 2 juju bootstrap failure on maas

B1: juju bootstrap mymaas --no-gui
B2: Once bootstrap fails go to step 2

  1. After deployment failure and machine is powered off click on the failed/released node in the MAAS UI
  2. Click Rescue Mode from the 'Take Action' drop down in the MAAS UI
  3. Grab the IP from the interfaces tab
  4. ssh ubuntu@ -- cloud-init status --long

Expect failure message

  1. Click Exit Rescue Mode on the node in MAAS UI.

  2. ssh to the maas server and add the following lines to /etc/maas/preseeds/curtin_userdata to test official *-proposed packages:

system_upgrade: {enabled: True}
apt:
  sources:
    proposed.list:
       source: deb $MIRROR $RELEASE-proposed main universe # upstream -proposed

  1. Repeat step 1 and expect bootstrap success

expect to see MAASDatasource from bootstrapped machine and no errors

  1. juju ssh 0 -- cloud-init status --long

Additional verification checks to avoid regression
 - DONE oracle
 - DONE ec2
 - DONE openstack
 - DONE gce
 - DONE azure
 - DONE nocloud kvm
 - DONE nocloud lxd

[Regression Potential]

The change being SRU'd adds more conditions to an existing conditional. There is potential to regress the cases that the existing conditional was introduced to cover, so we will be testing those specifically. Other than that, there was some minor refactoring of the existing conditional statement (which did not change the logic it checks), which could cause issues for Oracle netfailover interfaces. We will also specifically test on Oracle.

[Original Report]

Symptoms

After deployment of Ubuntu Bionic image on MAAS provider (deploying to a bare metal server) juju cannot access any deployed machine due to missing SSH keys and machines are stuck in pending state:

$ juju ssh 0
ERROR retrieving SSH host keys for "0": keys not found

$ juju machines
Machine  State    DNS            Inst id   Series  AZ   Message
0        pending  172.20.10.125  block-3   bionic  AZ3  Deployed
1        pending  172.20.10.124  block-2   bionic  AZ2  Deployed
2        pending  172.20.10.126  block-1   bionic  AZ1  Deployed
3        pending  172.20.10.127  object-2  bionic  AZ1  Deployed
4        pending  172.20.10.128  object-1  bionic  AZ2  Deployed
5        pending  172.20.10.129  object-3  bionic  AZ3  Deployed

It is worth mentioning that pods can be successfully deployed with MAAS; only bare metal deployment fails.

We checked different bionic images: cloud-init 19.2.24 works, and cloud-init 19.2.36 doesn't.

@ubuntu-server-builder ubuntu-server-builder added launchpad Migrated from Launchpad priority Fix soon labels May 12, 2023
@ubuntu-server-builder

Launchpad user Michał Ajduk(majduk) wrote on 2019-10-03T17:55:46.025567+00:00

Issue was introduced in Cloud-init v. 19.2-36-g059d049c-0ubuntu1~18.04.1.
It was not present in Cloud-init v. 19.2-24-ge7881d5c-0ubuntu1~18.04.1.

Symptoms:
2019-10-03 13:10:59,100 - init.py[WARNING]: Not all expected physical devices present: {'3c:fd:fe:d5:7a:42', '3c:fd:fe:d5:70:d9', '3c:fd:fe:d5:7a:41', '3c:fd:fe:d5:70:d8', '3c:fd:fe:d5:7a:40', '3c:fd:fe:d5:70:da'}

It seems that the following change causes this:

  • New upstream snapshot. (LP: #1844334)
    • net: add is_master check for filtering device list

@ubuntu-server-builder

Launchpad user Nikolay Vinogradov(nikolay.vinogradov) wrote on 2019-10-03T18:00:56.483600+00:00

Deployments with bonds seem to be affected. I think that, since the slaves' MAC addresses change after a bond is created, cloud-init fails to find the interfaces with their original MAC addresses (they're present in netplan, though).

@ubuntu-server-builder

Launchpad user Nikolay Vinogradov(nikolay.vinogradov) wrote on 2019-10-03T18:05:17.152526+00:00

The cloud-init commit that introduced the change: b3a87fc

It affects branches: 19.2-36 and 19.2-25

@ubuntu-server-builder

Launchpad user Chad Smith(chad.smith) wrote on 2019-10-03T18:19:59.511335+00:00

Confirmed I can successfully install bionic on bare metal with 19.2-36 directly in MAAS without seeing this issue. I'll now try using Juju to deploy the bare metal machine to see if I can reproduce the failure.

Nikolay, do you have access to the cloud-init collect-logs output (or minimally /var/log/cloud-init.log) on the failed system?

@ubuntu-server-builder

Launchpad user David Coronel(davecore) wrote on 2019-10-03T18:31:10.178956+00:00

subscribed ~field-high

Issue seems to be affecting deploys with network bonds. Workaround is to use an older cloud image with an older cloud-init version.

@ubuntu-server-builder

Launchpad user Nikolay Vinogradov(nikolay.vinogradov) wrote on 2019-10-03T18:40:15.387335+00:00

cloud-init-output.log attached
Launchpad attachments: cloud-init-output.log

@ubuntu-server-builder

Launchpad user Nikolay Vinogradov(nikolay.vinogradov) wrote on 2019-10-03T18:41:29.250470+00:00

Note mac addresses of bond slaves
Launchpad attachments: configuration with bonds

@ubuntu-server-builder

Launchpad user Nikolay Vinogradov(nikolay.vinogradov) wrote on 2019-10-03T18:42:14.384169+00:00

Note mac addresses again. See also the error in attached cloud-init-output.log.
Launchpad attachments: after removal of the bonds

@ubuntu-server-builder

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2019-10-03T18:44:49.587069+00:00

sub'd to field critical. this is blocking all of our tests.

@ubuntu-server-builder

Launchpad user Dan Watkins(oddbloke) wrote on 2019-10-03T22:14:17.068098+00:00

EOD update from the cloud-init team:

We've identified what the problem is: in the problematic code path, we have started filtering out network devices that have a "master", which accidentally also includes physical interfaces that are members of a bridge or bond. As suggested in comment #1, it's the "net: add is_master check for filtering device list" commit that introduced the issue. Unfortunately, that commit was a fix for a critical issue (bug 1844191) from another source, so we cannot simply revert it and take our time to land a different fix.

We are currently pursuing three potential options for fixes: (a) completely circumvent the now-incorrect function in the code path that is raising the exception, (b) specifically avoid excluding bridge and bond member interfaces from the list, and (c) match more closely on the type of interface the 'master' check was intended to exclude. In order of preference, we would prefer (c) over (b) over (a). (Because (a) does not obviate the need for something along the lines of (b) or (c) for other code paths, and because we may have to play whack-a-mole with other cases that (b) needs to expand to include.)

We have initial implementations of (a) and (b) which we are testing (though they are not yet ready to land in trunk in their current form). We are investigating (c), but it's likely that we won't reach a conclusion on that investigation fast enough to warrant waiting for its completion before addressing this issue.
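
To illustrate the failure mode described above, here is a minimal Python sketch. It is not cloud-init's code: the `Iface` record, its `master_kind` field, the function names, and the interface names and MAC addresses are all hypothetical. It models option (b), where an interface that has a master is still skipped unless that master is a bridge or a bond.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Iface:
    """Hypothetical stand-in for what would be read from the system."""
    name: str
    mac: str
    # Kind of device this interface is enslaved to, or None if it has no master.
    master_kind: Optional[str] = None


def present_macs(ifaces: Iterable[Iface], keep_bridge_bond_members: bool) -> Set[str]:
    """Return the MACs that survive the 'has a master' filter."""
    macs = set()
    for iface in ifaces:
        if iface.master_kind is not None:
            is_member = iface.master_kind in ("bond", "bridge")
            # Regressed behaviour: anything with a master is dropped.
            # Option (b): bridge and bond members are kept.
            if not (keep_bridge_bond_members and is_member):
                continue
        macs.add(iface.mac)
    return macs


def missing_macs(expected: Iterable[str], ifaces: Iterable[Iface],
                 keep_bridge_bond_members: bool) -> Set[str]:
    """MACs named in the network config that the filter no longer reports."""
    return set(expected) - present_macs(ifaces, keep_bridge_bond_members)


# Made-up example data (MACs are from a documentation range, not from this bug).
ifaces = [
    Iface("eno1", "00:00:5e:00:53:01", master_kind="bond"),
    Iface("eno2", "00:00:5e:00:53:02", master_kind="bond"),
    Iface("eno3", "00:00:5e:00:53:03", master_kind="bridge"),
    Iface("ens4", "00:00:5e:00:53:04", master_kind="other"),  # still excluded
    Iface("ens5", "00:00:5e:00:53:05"),
]
expected = {i.mac for i in ifaces if i.name != "ens4"}

regressed = missing_macs(expected, ifaces, keep_bridge_bond_members=False)
fixed = missing_macs(expected, ifaces, keep_bridge_bond_members=True)
```

With the broader filter, the three enslaved NICs are reported missing, which has the same shape as the warning in the report. With the option (b) exception nothing is missing, while the interface with a different kind of master stays excluded.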

@ubuntu-server-builder

Launchpad user Nivedita Singhvi(niveditasinghvi) wrote on 2019-10-04T09:16:18.960810+00:00

Bumped up the importance due to more reports in production env
hitting this problem (apparently, not fully confirmed).

@ubuntu-server-builder

Launchpad user Launchpad Janitor(janitor) wrote on 2019-10-04T13:03:00.541354+00:00

Status changed to 'Confirmed' because the bug affects multiple users.

@ubuntu-server-builder

Launchpad user Server Team CI bot(server-team-bot) wrote on 2019-10-04T15:34:44.351035+00:00

This bug is fixed with commit a7d8d03 to cloud-init on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=a7d8d032

@ubuntu-server-builder

Launchpad user Steve Langasek(vorlon) wrote on 2019-10-04T17:36:19.478055+00:00

Hello Nikolay, or anyone else affected,

Accepted cloud-init into disco-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/19.2-36-g059d049c-0ubuntu2~19.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-disco to verification-done-disco. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-disco. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

@ubuntu-server-builder

Launchpad user Steve Langasek(vorlon) wrote on 2019-10-04T17:37:10.059953+00:00

Hello Nikolay, or anyone else affected,

Accepted cloud-init into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/19.2-36-g059d049c-0ubuntu2~18.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

@ubuntu-server-builder

Launchpad user David Coronel(davecore) wrote on 2019-10-04T17:37:38.687955+00:00

As an FYI, my workaround for this is to grab the squashfs file from https://images.maas.io/ephemeral-v3/daily/bionic/amd64/20190930/ and copy it over the file /var/lib/maas/boot-resources/current/ubuntu/amd64/ga-18.04/bionic/daily/squashfs on my MAAS nodes (assuming you deploy Ubuntu 18.04 with the GA kernel).

@ubuntu-server-builder

Launchpad user Steve Langasek(vorlon) wrote on 2019-10-04T17:46:44.834194+00:00

Hello Nikolay, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/19.2-36-g059d049c-0ubuntu2~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

@ubuntu-server-builder

Launchpad user Launchpad Janitor(janitor) wrote on 2019-10-04T18:42:07.865309+00:00

This bug was fixed in the package cloud-init - 19.2-36-g059d049c-0ubuntu2


cloud-init (19.2-36-g059d049c-0ubuntu2) eoan; urgency=medium

  • cherry-pick a7d8d03: get_interfaces: don't exclude bridge and bond
    members (LP: #1846535)

 -- Daniel Watkins <oddbloke@ubuntu.com>  Fri, 04 Oct 2019 11:42:12 -0400

@ubuntu-server-builder

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2019-10-04T21:34:45.489950+00:00

I successfully verified the bionic fix on MAAS. Here's what I did:

  1. deployed a machine with a bridge via maas
  2. machine went to deployed mode, couldn't ssh in
  3. switched to rescue mode, ssh'd in
  4. mounted /, captured cloud-init-output.log with error: http://paste.ubuntu.com/p/Gg53xf9wtZ/
  5. I then enabled proposed via curtin_userdata and repeated, and ssh worked and the error was gone: http://paste.ubuntu.com/p/gXVGmKvWHY/

@ubuntu-server-builder

Launchpad user Mark Darmadi(darmadoo) wrote on 2019-10-04T22:07:34.763182+00:00

How would I make sure that MAAS pulls the latest package of cloud-init?

running 'apt policy cloud-init' shows that I do not have cloud-init installed

@ubuntu-server-builder

Launchpad user David Coronel(davecore) wrote on 2019-10-04T22:25:38.811530+00:00

@darmadoo: The cloud-init package is baked into the cloud images that MAAS uses to deploy instances. You have to wait for the next cloud image that will contain this fixed cloud-init, or you can use the workaround in my comment #16 to use a previous image in the meantime.

@ubuntu-server-builder

Launchpad user Mark Darmadi(darmadoo) wrote on 2019-10-04T22:44:48.523702+00:00

I have also successfully verified the bionic fix on MAAS.

Enabling the *-proposed repo via curtin_userdata pulled the latest cloud-init package, which fixed the issue.

@ubuntu-server-builder

Launchpad user Chad Smith(chad.smith) wrote on 2019-10-04T23:03:56.832673+00:00

Oracle SRU verification logs
Launchpad attachments: oracle-sru-19.2.36.ubuntu2.txt

@ubuntu-server-builder

Launchpad user Chad Smith(chad.smith) wrote on 2019-10-04T23:04:25.775081+00:00

gce sru verification logs

Launchpad attachments: gce-sru-19.2.36.ubuntu2.txt

@ubuntu-server-builder

Launchpad user Chad Smith(chad.smith) wrote on 2019-10-04T23:06:46.741104+00:00

Attach file openstack-sru-19.2.36.ubuntu2.txt.
Launchpad attachments: openstack-sru-19.2.36.ubuntu2.txt

@ubuntu-server-builder

Launchpad user Chad Smith(chad.smith) wrote on 2019-10-04T23:07:10.642614+00:00

Attach file ec2-sru-19.2.36.ubuntu2.txt.
Launchpad attachments: ec2-sru-19.2.36.ubuntu2.txt

@ubuntu-server-builder

Launchpad user Chad Smith(chad.smith) wrote on 2019-10-04T23:13:13.477614+00:00

Attach file nocloud-lxd-sru-19.2.36.ubuntu2.txt.
Launchpad attachments: nocloud-lxd-sru-19.2.36.ubuntu2.txt

@ubuntu-server-builder

Launchpad user Chad Smith(chad.smith) wrote on 2019-10-04T23:13:53.407205+00:00

Attach file nocloud-kvm-sru-19.2.36.ubuntu2.txt.
Launchpad attachments: nocloud-kvm-sru-19.2.36.ubuntu2.txt

@ubuntu-server-builder

Launchpad user Chad Smith(chad.smith) wrote on 2019-10-04T23:18:28.061838+00:00

Attach file azure-sru-19.2.36.ubuntu2.txt.
Launchpad attachments: azure-sru-19.2.36.ubuntu2.txt

@ubuntu-server-builder

Launchpad user Lee Trager(ltrager) wrote on 2019-10-05T02:02:15.636785+00:00

I manually tested the new cloud-init using the MAAS CI as we don't have automated tests for bonds or bridges. Xenial, Bionic, and Disco can all commission and deploy fine using static IPs, bonds, and bridges.

@ubuntu-server-builder

Launchpad user Nobuto Murata(nobuto) wrote on 2019-10-06T15:37:39.278619+00:00

It looks like images.maas.io has an older image with cloud-init 19.2-24-ge7881d5c-0ubuntu1~18.04.1 (the version before the regression was introduced) as 20191004 (the latest as of right now).

$ curl -s https://images.maas.io/ephemeral-v3/daily/bionic/amd64/20191003/squashfs.manifest | grep -w cloud-init
cloud-init 19.2-36-g059d049c-0ubuntu1~18.04.1

$ curl -s https://images.maas.io/ephemeral-v3/daily/bionic/amd64/20191004/squashfs.manifest | grep -w cloud-init
cloud-init 19.2-24-ge7881d5c-0ubuntu1~18.04.1

diff --git a/squashfs.manifest.20191003 b/squashfs.manifest.20191004
index 524873c..9a64019 100644
--- a/squashfs.manifest.20191003
+++ b/squashfs.manifest.20191004
@@ -25,7 +25,7 @@ byobu 5.125-0ubuntu1
 bzip2 1.0.6-8.1ubuntu0.2
 ca-certificates 20180409
 cloud-guest-utils 0.30-0ubuntu5
-cloud-init 19.2-36-g059d049c-0ubuntu1~18.04.1
+cloud-init 19.2-24-ge7881d5c-0ubuntu1~18.04.1
 cloud-initramfs-copymods 0.40ubuntu1.1
 cloud-initramfs-dyn-netconf 0.40ubuntu1.1
 command-not-found 18.04.5
@@ -336,9 +336,9 @@ ncurses-term 6.1-1ubuntu1.18.04
 net-tools 1.60+git20161116.90da8a0-1ubuntu1
 netbase 5.4
 netcat-openbsd 1.187-1ubuntu0.1
-netplan.io 0.98-0ubuntu1~18.04.1
+netplan.io 0.97-0ubuntu1~18.04.1
 networkd-dispatcher 1.7-0ubuntu3.3
-nplan 0.98-0ubuntu1~18.04.1
+nplan 0.97-0ubuntu1~18.04.1
 ntfs-3g 1:2017.3.23-2ubuntu0.18.04.2
 open-iscsi 2.0.874-5ubuntu2.7
 open-vm-tools 2:10.3.10-1ubuntu0.18.04.1
@@ -447,7 +447,7 @@ tcpdump 4.9.2-3
 telnet 0.17-41
 time 1.7-25.1build1
 tmux 2.6-3ubuntu0.2
-tzdata 2019c-0ubuntu0.18.04
+tzdata 2019b-0ubuntu0.18.04
 ubuntu-advantage-tools 17
 ubuntu-keyring 2018.09.18.1~18.04.0
 ubuntu-minimal 1.417.3
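
For anyone checking which cloud-init build an image carries, a small Python sketch along these lines could parse a manifest like the ones above. It assumes each manifest line is a package name followed by whitespace and a version. `package_version` and `is_regressed_build` are hypothetical helper names, and the version check is only an assumption drawn from the versions named in this bug (the 0ubuntu1 uploads of 19.2-36 are affected, the 0ubuntu2 uploads carry the fix).

```python
from typing import Optional

# Upstream snapshot whose first Ubuntu upload (0ubuntu1) carried the regression,
# according to the versions quoted in this bug report.
REGRESSED_BASE = "19.2-36-g059d049c-0ubuntu1"


def package_version(manifest_text: str, package: str) -> Optional[str]:
    """Return the version listed for `package`, or None if it is not listed."""
    for line in manifest_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == package:
            return fields[1]
    return None


def is_regressed_build(version: Optional[str]) -> bool:
    """True for the 0ubuntu1 upload itself or one of its ~release backports."""
    if version is None:
        return False
    return version == REGRESSED_BASE or version.startswith(REGRESSED_BASE + "~")


# Trimmed-down manifests using the two cloud-init versions shown above.
manifest_20191003 = (
    "cloud-guest-utils 0.30-0ubuntu5\n"
    "cloud-init 19.2-36-g059d049c-0ubuntu1~18.04.1\n"
)
manifest_20191004 = (
    "cloud-guest-utils 0.30-0ubuntu5\n"
    "cloud-init 19.2-24-ge7881d5c-0ubuntu1~18.04.1\n"
)

for label, text in (("20191003", manifest_20191003), ("20191004", manifest_20191004)):
    version = package_version(text, "cloud-init")
    print(label, version, "regressed" if is_regressed_build(version) else "ok")
```

The exact-match-or-tilde check is deliberate, so that a later upload such as 0ubuntu2~18.04.1 is not flagged.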

@ubuntu-server-builder

Launchpad user Dan Watkins(oddbloke) wrote on 2019-10-07T13:32:15.991978+00:00

That's correct, that was done to work around the cloud-init issue while we land the fix in the archive. (That specific change was tracked in bug 1846845, but that's in a private project so odds are people won't be able to view it.)

@ubuntu-server-builder

Launchpad user Launchpad Janitor(janitor) wrote on 2019-10-07T20:20:51.792837+00:00

This bug was fixed in the package cloud-init - 19.2-36-g059d049c-0ubuntu2~19.04.1


cloud-init (19.2-36-g059d049c-0ubuntu2~19.04.1) disco; urgency=medium

  • cherry-pick a7d8d03: get_interfaces: don't exclude bridge and bond
    members (LP: #1846535)

 -- Daniel Watkins <oddbloke@ubuntu.com>  Fri, 04 Oct 2019 11:46:15 -0400

@ubuntu-server-builder

Launchpad user Brian Murray(brian-murray) wrote on 2019-10-07T20:20:57.819914+00:00

The verification of the Stable Release Update for cloud-init has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

@ubuntu-server-builder

Launchpad user Launchpad Janitor(janitor) wrote on 2019-10-07T20:22:19.560873+00:00

This bug was fixed in the package cloud-init - 19.2-36-g059d049c-0ubuntu2~18.04.1


cloud-init (19.2-36-g059d049c-0ubuntu2~18.04.1) bionic; urgency=medium

  • cherry-pick a7d8d03: get_interfaces: don't exclude bridge and bond
    members (LP: #1846535)

 -- Daniel Watkins <oddbloke@ubuntu.com>  Fri, 04 Oct 2019 11:35:54 -0400

@ubuntu-server-builder

Launchpad user Launchpad Janitor(janitor) wrote on 2019-10-07T20:22:35.519514+00:00

This bug was fixed in the package cloud-init - 19.2-36-g059d049c-0ubuntu2~16.04.1


cloud-init (19.2-36-g059d049c-0ubuntu2~16.04.1) xenial; urgency=medium

  • cherry-pick a7d8d03: get_interfaces: don't exclude bridge and bond
    members (LP: #1846535)

 -- Daniel Watkins <oddbloke@ubuntu.com>  Fri, 04 Oct 2019 12:01:19 -0400

@ubuntu-server-builder

Launchpad user Richard Maynard(richard-maynard) wrote on 2019-10-09T05:53:09.649636+00:00

Using 19.2-36-g059d049c-0ubuntu1~18.04.1 any of our new AWS instances can not get networking. A standard package update AMI build resulted in moving from 19.2.24 to 19.2.36 and now our instances come up inaccessible.

Our cloudconfig does not make any adjustments to the networking.

The specific environment where we have the issue is AWS, US-West-2, m5.Xlarge instances, on a private subnet, within a VPC.

Please let me know what information I can provide that may help to troubleshoot. I'm unable to access any instances running the new cloud init so information I can retrieve from them is limited.


[ 30.990869] cloud-init[729]: Cloud-init v. 19.2-36-g059d049c-0ubuntu1~18.04.1 running 'init' at Wed, 09 Oct 2019 05:09:30 +0000. Up 30.80 seconds.
[ 13.017666] cloud-init[736]: ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
[ 13.019271] cloud-init[736]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 13.020911] cloud-init[736]: ci-info: | Device |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
[ 13.022470] cloud-init[736]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 13.024021] cloud-init[736]: ci-info: |  ens5  | False |     .     |     .     |   .   | 06:57:5b:c1:24:52 |
[ 13.025576] cloud-init[736]: ci-info: |   lo   |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
[ 13.027182] cloud-init[736]: ci-info: |   lo   |  True |  ::1/128  |     .     |  host |         .         |
[ 13.028763] cloud-init[736]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+

@ubuntu-server-builder

Launchpad user Alexander Weber(deshke) wrote on 2019-10-09T12:15:11.722809+00:00

the issue Richard describes also applies to GCP and Azure instances.

The netplan configuration is not re-generated if the mac address changes and is hardcoded with a match: { mac: address}. So if an image is created, the newly spawned instances do not have network enabled on their main interface.

the only work around i've found is

# disabling default cloudInit network and enable dhcp based on known interfaces
echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg 
echo 'network:
    version: 2
    renderer: networkd
    ethernets:
        ens5:
            dhcp4: true
            dhcp6: true
            optional: true
        ens4:
            dhcp4: true
            dhcp6: true
            optional: true
' > /etc/netplan/50-cloud-init.yaml

@ubuntu-server-builder

Launchpad user Christian Ehrhardt (paelzer) wrote on 2019-10-09T12:36:31.904833+00:00

Hi Richard and Alexander,
Dan Watkins (driving the recent upload) will be around any minute and take a look.

Until then if you both could try to get to a failing instance (or its disk) somehow and get the log 1 which usually is at /var/log/cloud-init-output.log that would be great.
I know logging in is hard due to the issue, but you might see in 1 also options to e.g. send logs to a rsyslog service if you have one. Another alternative is to add a late_command that pushes the log somewhere else.

@ubuntu-server-builder

Launchpad user Dan Watkins(oddbloke) wrote on 2019-10-09T12:58:27.189811+00:00

Hi Richard, Alexander,

Richard: We've published new daily images to AWS containing the newer version of cloud-init. Can you try to reproduce using the most recent daily image (which is ami-0802dbc378772aca8 in us-west-2), please? If you can still reproduce, then please file a new bug (because we'll need to triage and track the fix separately).

Alexander: That sounds like a distinct issue, as that has been the behaviour of cloud-init for quite some time now. Could you file a separate bug so we can follow up on it?

Thanks!

Dan

@ubuntu-server-builder

Launchpad user Richard Maynard(richard-maynard) wrote on 2019-10-09T15:12:15.597350+00:00

Thanks! I'll work to capture logs and reproduce. We worked around for now by using apt to hold the package back (we use an older base image from july that then does an apt get update all while building our image).

@ubuntu-server-builder

Launchpad user Richard Maynard(richard-maynard) wrote on 2019-10-09T17:18:27.212778+00:00

I was able to capture the logs by making a snapshot of the volume and mounting it to a different host, however with the previous version of cloud-init it had the same problem, sorry for the false note but it looks like this issue is not related to the upgrade.

@ubuntu-server-builder

Launchpad user Dan Watkins(oddbloke) wrote on 2019-10-09T17:52:29+00:00

On Wed, Oct 09, 2019 at 05:18:27PM -0000, Richard Maynard wrote:

> I was able to capture the logs by making a snapshot of the volume and
> mounting it to a different host, however with the previous version of
> cloud-init it had the same problem, sorry for the false note but it
> looks like this issue is not related to the upgrade.

That's a relief! Thanks for digging into this to get confirmation. If
there's another cloud-init issue causing the problem, please do file a
separate bug!

@ubuntu-server-builder

Launchpad user Alexander Weber(deshke) wrote on 2019-10-10T09:41:12.492476+00:00

Hey Dan, see https://bugs.launchpad.net/cloud-init/+bug/1847583

@ubuntu-server-builder

Launchpad user Joshua Powers(powersj) wrote on 2019-10-29T19:27:27.400444+00:00

This is fixed in Ubuntu and as such I am unsubscribing field-critical

@ubuntu-server-builder

Launchpad user Chad Smith(chad.smith) wrote on 2019-12-19T23:00:10.153476+00:00

This bug is believed to be fixed in cloud-init in version 19.2-55. If this is still a problem for you, please make a comment and set the state back to New

Thank you.
