Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAAS Provider: LXC did not get DHCP address, stuck in "pending" #2472

Closed
ubuntu-server-builder opened this issue May 10, 2023 · 24 comments
Closed
Labels
launchpad Migrated from Launchpad priority Fix soon

Comments

@ubuntu-server-builder
Copy link
Collaborator

This bug was originally filed in Launchpad as LP: #1353008

Launchpad details
affected_projects = ['cloud-init (Ubuntu)', 'cloud-init (Ubuntu Trusty)']
assignee = None
assignee_name = None
date_closed = 2016-08-10T14:44:36.938478+00:00
date_created = 2014-08-05T16:58:43.927232+00:00
date_fix_committed = 2014-09-12T16:34:38.434036+00:00
date_fix_released = 2016-08-10T14:44:36.938478+00:00
id = 1353008
importance = high
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1353008
milestone = None
owner = dpb
owner_name = David Britton
private = False
status = fix_released
submitter = dpb
submitter_name = David Britton
tags = ['cloud-installer', 'landscape', 'lxc', 'maas-provider', 'network', 'verification-done']
duplicates = [1327078, 1366893]

Launchpad user David Britton(dpb) wrote on 2014-08-05T16:58:43.927232+00:00

=== Begin SRU Information ===
This bug causes lxc containers created by the ubuntu-cloud template (lxc-create -t ubuntu-cloud) to sometimes not obtain an IP address, and thus not correctly boot to completion.

The bug is in an assumption by cloud-init that /run is mounted before the cloud-init-local job is run. The fix is very simply to guarantee that it is via modification to its upstart 'start on'.

When booting with an initramfs /run will be mounted before /, so the race condition is not possible. Thus, the failure case is only either in non-initramfs boot (which is very unlikely) or in lxc boot. The lxc case seems only to occur very rarely, somewhere well under one percent of the time.

[Test Case]
A test case is written at [1] that launches many instances in an attempt brute force find the error. However, I've not been able to make it fail.

The original bug reporter has been running with the 'start on' change and has seen no errors since.

We will request the original bug reporter to apply the uploaded changes and run through their battery.

[Regression Potential]
The possibility for regression here is in the second boot of an instance. The following scenario is a change of behavior:
 * the user boots an instance with NoCloud or ConfigDrive in ds=local mode
 * user changes /etc/network/interfaces in a way that would cause
   static-networking to not be emitted on subsequent boot
 * user reboots
Now, instead of a quick boot, the user may see cloud-init-nonet blocking on network coming up.

This would be a uncommon scenario, and the broken-etc-network-interfaces scenario is already one that causes timeouts on boot.

--
[1] http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/cloud-init-test/view/head:/tests/lxc-test-new-instance

=== End SRU Information ===

Note, that after I went onto the system, it did have an IP address.

      0/lxc/3:
        agent-state: pending
        instance-id: juju-machine-0-lxc-3
        series: trusty
        hardware: arch=amd64

cloud-init-output.log snip:

Cloud-init v. 0.7.5 running 'init' at Mon, 04 Aug 2014 23:57:12 +0000. Up 572.29 seconds.
ci-info: +++++++++++++++++++++++Net device info+++++++++++++++++++++++
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: | Device | Up | Address | Mask | Hw-Address |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | . |
ci-info: | eth0 | True | . | . | 00:16:3e:34:aa:57 |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Route info failed!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Cloud-init v. 0.7.5 running 'modules:config' at Mon, 04 Aug 2014 23:57:12 +0000. Up 572.99 seconds.
Cloud-init v. 0.7.5 running 'modules:final' at Mon, 04 Aug 2014 23:57:14 +0000. Up 574.42 seconds.
Cloud-init v. 0.7.5 finished at Mon, 04 Aug 2014 23:57:14 +0000. Datasource DataSourceNoCloudNet [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]. Up 574.54 seconds

syslog on system, showing DHCPACK 1 second later:

root@juju-machine-0-lxc-3:/home/ubuntu# grep DHCP /var/log/syslog
Aug 4 23:57:13 juju-machine-0-lxc-3 dhclient: DHCPREQUEST of 10.96.3.173 on eth0 to 255.255.255.255 port 67 (xid=0x1687c544)
Aug 4 23:57:13 juju-machine-0-lxc-3 dhclient: DHCPOFFER of 10.96.3.173 from 10.96.0.10
Aug 4 23:57:13 juju-machine-0-lxc-3 dhclient: DHCPACK of 10.96.3.173 from 10.96.0.10
Aug 5 05:28:15 juju-machine-0-lxc-3 dhclient: DHCPREQUEST of 10.96.3.173 on eth0 to 10.96.0.10 port 67 (xid=0x1687c544)
Aug 5 05:28:15 juju-machine-0-lxc-3 dhclient: DHCPACK of 10.96.3.173 from 10.96.0.10
Aug 5 11:15:00 juju-machine-0-lxc-3 dhclient: DHCPREQUEST of 10.96.3.173 on eth0 to 10.96.0.10 port 67 (xid=0x1687c544)
Aug 5 11:15:00 juju-machine-0-lxc-3 dhclient: DHCPACK of 10.96.3.173 from 10.96.0.10

It appears in every case, cloud-init init-local has failed very early visible in juju logs /var/lib/juju/containers//console.log:

Traceback (most recent call last):
  File "/usr/bin/cloud-init", line 618, in
    sys.exit(main())
  File "/usr/bin/cloud-init", line 614, in main
    get_uptime=True, func=functor, args=(name, args))
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1875, in log_time
    ret = func(*args, **kwargs)
  File "/usr/bin/cloud-init", line 491, in status_wrapper
    force=True)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1402, in sym_link
    os.symlink(source, link)
OSError: [Errno 2] No such file or directory

@ubuntu-server-builder ubuntu-server-builder added launchpad Migrated from Launchpad priority Fix soon labels May 10, 2023
@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user David Britton(dpb) wrote on 2014-08-05T16:58:43.927232+00:00

Launchpad attachments: syslog

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user David Britton(dpb) wrote on 2014-08-05T16:58:57.706462+00:00

Launchpad attachments: dmesg

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user David Britton(dpb) wrote on 2014-08-05T16:59:09.576979+00:00

Launchpad attachments: cloud-init-output.log

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user David Britton(dpb) wrote on 2014-08-05T16:59:22.342586+00:00

Launchpad attachments: cloud-init.log

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Curtis Hovey(sinzui) wrote on 2014-08-06T07:19:36.923123+00:00

Looks like the machine was shutdown by something other than juju. juju wasn't installed in the machine before it was terminated. Since there wasn't a first call home, juju will wait indefinitely. Could maas have terminated the machine that juju requested?

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user David Britton(dpb) wrote on 2014-08-16T22:49:49.457537+00:00

I've reproduced this again.

cloud-init appears to fail fast if the ip address is not yet allocated by the upstart job "network-interface-eth0". As you can see in my paste, cloud-init doesn't appear to wait at all, like I'm used to, it just fails.

This might take someone with deeper expertise (smoser?) to look into.

I'll attach full logs from the lxc

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user David Britton(dpb) wrote on 2014-08-16T22:50:36.518579+00:00

Launchpad attachments: lxclogs.tgz

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user David Britton(dpb) wrote on 2014-08-19T08:56:55.133789+00:00

Deployer file I use to demonstrate the issue: http://paste.ubuntu.com/8087092/

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Scott Moser(smoser) wrote on 2014-08-19T14:45:39.815515+00:00

cloud-init doesn't "fail fast". Rather, the 'cloud-init' upstart job will not execute until either:
a.) all network interfaces defined in /etc/network/interfaces as 'auto' are up
or
b.) cloud-init-nonet has ended (which basically blocks boot for up to 120 seconds or until 'a' is accomplished).

one of the two scenarios must be reached before the 'mounted MOUNTPOINT=/' event occurs. That event then basically blocks the rest of boot. so cloud init forces a pinch point in boot on networking coming up.

something else must be blocking networking from coming up until afer mounted MOUNTPOINT=/ has occurred.

Possibly a dupe of https://bugs.launchpad.net/cloud-init/+bug/1345433

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Andreas Hasenack(ahasenack) wrote on 2014-08-20T12:24:28.975359+00:00

Let's keep in mind the way juju creates containers: first, it creates a template container. Then, it clones it.

From the cloud init logs, I can't see cloud-init from the cloned container waiting 120s for an IP. In fact, doesn't look like it waited even 60s.

  • template container ends: Cloud-init v. 0.7.5 finished at Mon, 04 Aug 2014 23:56:18
  • cloned container starts: Cloud-init v. 0.7.5 running 'init' at Mon, 04 Aug 2014 23:57:12
  • cloned container fails at, or before: Cloud-init v. 0.7.5 running 'modules:config' at Mon, 04 Aug 2014 23:57:12
  • dhcp server gets a lease request, offers it, and the container ACKs it: all within this second: 23:57:13

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user David Britton(dpb) wrote on 2014-08-25T16:49:11.221923+00:00

Removing duplicate status. After understanding more of the failure mode, it's certainly distinct from 1345433. The problem here is the initial log message printed by "cloud-init init-local" is failing with a symlink error.

Traceback (most recent call last):
File "/usr/bin/cloud-init", line 618, in
sys.exit(main())
File "/usr/bin/cloud-init", line 614, in main
get_uptime=True, func=functor, args=(name, args))
File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1875, in log_time
ret = func(*args, **kwargs)
File "/usr/bin/cloud-init", line 491, in status_wrapper
force=True)
File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1402, in sym_link
os.symlink(source, link)
OSError: [Errno 2] No such file or directory

Aug 22 13:55:52 juju-trusty-lxc-template [CLOUDINIT] util.py[DEBUG]: Creating symbolic link from '/run/cloud-init/result.json' => '../../var/lib/cloud/data/result.json'

Thus it's hard to see in the log what is going wrong.

  • "cloud-init init-local" does the waiting for network bit.
  • "cloud-init init" expects the network devices to be up

Two issues:

  1. Why is the symlink failing?

From the error message (No such file or directory), the only ways I've gotten it to repeat is if the directory where the link is requested to be placed does not exist, or if using relative paths, the directory cannot be calculated -- i.e., dirname(source) doesn't exist.

  1. Why doesn't cloud-init-init block on success of cloud-init, and not just completion of it?

Perhaps this is a limitation of the upstart job chain. I'm not familiar enough with that to speak to it. Or, perhaps it's by design.

I'll add more to the bug debugging wise as I find it.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user David Britton(dpb) wrote on 2014-08-25T16:58:37.843980+00:00

Launchpad attachments: zipped logs from repro

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user David Britton(dpb) wrote on 2014-08-25T19:24:00.267773+00:00

FYI -- I can hit this pretty often in my test environment. If there is some way to add instrumentation to cloud-init I would be happy to debug further.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Ian Booth(wallyworld) wrote on 2014-09-02T03:20:41.838863+00:00

I'm removing this from 1.20 series as any Juju related work (if any is required) will be done for the next release.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Scott Moser(smoser) wrote on 2014-09-10T15:15:43.977277+00:00

I pushed a test that creates and destroys lxc containers in an effort to reproduce my suspicion of the problem to http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/cloud-init-test/view/head:/tests/lxc-test-new-instance .

I've not been able to make it fail anywhere, though.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user David Britton(dpb) wrote on 2014-09-10T15:38:00+00:00

On Wed, Sep 10, 2014 at 03:15:43PM -0000, Scott Moser wrote:

I pushed a test that creates and destroys lxc containers in an effort to
reproduce my suspicion of the problem to http://bazaar.launchpad.net
/~cloud-init-dev/cloud-init/cloud-init-test/view/head:/tests/lxc-test-
new-instance .

I've not been able to make it fail anywhere, though.

FYI, with the workaround that you provided (of checking that /run is
mounted), I haven't seen the error since, but only have about 1500
containers created so far.

--
David Britton david.britton@canonical.com

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Scott Moser(smoser) wrote on 2014-09-12T16:34:45.883006+00:00

fixed in trunk at revno 1012

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Scott Moser(smoser) wrote on 2014-09-16T18:13:32.452036+00:00

This is fixed in
utopic: version 0.7.6~bzr1016-0ubuntu1 uploaded 2014-09-12
trusty: version 0.7.5-0ubuntu1.2 uploaded to -proposed queue 2014-09-12

currently in unapproved at https://launchpad.net/ubuntu/trusty/+queue?queue_state=1&queue_text=cloud-init

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Brian Murray(brian-murray) wrote on 2014-09-18T20:35:59.172752+00:00

Hello David, or anyone else affected,

Accepted cloud-init into trusty-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/cloud-init/0.7.5-0ubuntu1.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user David Britton(dpb) wrote on 2014-09-22T21:47:41.031406+00:00

With scott's help, I spun this about 2000x over the weekend and did not hit the issue. I would call it verified from my perspective.

Changing tag.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Launchpad Janitor(janitor) wrote on 2014-09-25T16:23:46.026041+00:00

This bug was fixed in the package cloud-init - 0.7.5-0ubuntu1.2


cloud-init (0.7.5-0ubuntu1.2) trusty-proposed; urgency=medium

  • d/patches/lp-1353008-cloud-init-local-needs-run.conf:
    backport change to cloud-init-local.conf to depend on /run being
    mounted (LP: #1353008)
    -- Scott Moser smoser@ubuntu.com Wed, 17 Sep 2014 09:15:54 -0400

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Brian Murray(brian-murray) wrote on 2014-09-25T16:23:53.486772+00:00

The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Ian Booth(wallyworld) wrote on 2014-10-21T13:30:55.717230+00:00

Removing this bug from juju-core as the issue required a cloud init fix.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Scott Moser(smoser) wrote on 2016-08-10T14:44:35.044165+00:00

This is fixed in cloud-init 0.7.7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
launchpad Migrated from Launchpad priority Fix soon
Projects
None yet
Development

No branches or pull requests

1 participant