-
Notifications
You must be signed in to change notification settings - Fork 828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAAS Provider: LXC did not get DHCP address, stuck in "pending" #2472
Comments
Launchpad user David Britton(dpb) wrote on 2014-08-05T16:58:43.927232+00:00 Launchpad attachments: syslog |
Launchpad user David Britton(dpb) wrote on 2014-08-05T16:58:57.706462+00:00 Launchpad attachments: dmesg |
Launchpad user David Britton(dpb) wrote on 2014-08-05T16:59:09.576979+00:00 Launchpad attachments: cloud-init-output.log |
Launchpad user David Britton(dpb) wrote on 2014-08-05T16:59:22.342586+00:00 Launchpad attachments: cloud-init.log |
Launchpad user Curtis Hovey(sinzui) wrote on 2014-08-06T07:19:36.923123+00:00 Looks like the machine was shutdown by something other than juju. juju wasn't installed in the machine before it was terminated. Since there wasn't a first call home, juju will wait indefinitely. Could maas have terminated the machine that juju requested? |
Launchpad user David Britton(dpb) wrote on 2014-08-16T22:49:49.457537+00:00 I've reproduced this again. cloud-init appears to fail fast if the ip address is not yet allocated by the upstart job "network-interface-eth0". As you can see in my paste, cloud-init doesn't appear to wait at all, like I'm used to, it just fails. This might take someone with deeper expertise (smoser?) to look into. I'll attach full logs from the lxc |
Launchpad user David Britton(dpb) wrote on 2014-08-16T22:50:36.518579+00:00 Launchpad attachments: lxclogs.tgz |
Launchpad user David Britton(dpb) wrote on 2014-08-19T08:56:55.133789+00:00 Deployer file I use to demonstrate the issue: http://paste.ubuntu.com/8087092/ |
Launchpad user Scott Moser(smoser) wrote on 2014-08-19T14:45:39.815515+00:00 cloud-init doesn't "fail fast". Rather, the 'cloud-init' upstart job will not execute until either: one of the two scenarios must be reached before the 'mounted MOUNTPOINT=/' event occurs. That event then basically blocks the rest of boot. so cloud init forces a pinch point in boot on networking coming up. something else must be blocking networking from coming up until afer mounted MOUNTPOINT=/ has occurred. Possibly a dupe of https://bugs.launchpad.net/cloud-init/+bug/1345433 |
Launchpad user Andreas Hasenack(ahasenack) wrote on 2014-08-20T12:24:28.975359+00:00 Let's keep in mind the way juju creates containers: first, it creates a template container. Then, it clones it. From the cloud init logs, I can't see cloud-init from the cloned container waiting 120s for an IP. In fact, doesn't look like it waited even 60s.
|
Launchpad user David Britton(dpb) wrote on 2014-08-25T16:49:11.221923+00:00 Removing duplicate status. After understanding more of the failure mode, it's certainly distinct from 1345433. The problem here is the initial log message printed by "cloud-init init-local" is failing with a symlink error. Traceback (most recent call last): Aug 22 13:55:52 juju-trusty-lxc-template [CLOUDINIT] util.py[DEBUG]: Creating symbolic link from '/run/cloud-init/result.json' => '../../var/lib/cloud/data/result.json' Thus it's hard to see in the log what is going wrong.
Two issues:
From the error message (No such file or directory), the only ways I've gotten it to repeat is if the directory where the link is requested to be placed does not exist, or if using relative paths, the directory cannot be calculated -- i.e., dirname(source) doesn't exist.
Perhaps this is a limitation of the upstart job chain. I'm not familiar enough with that to speak to it. Or, perhaps it's by design. I'll add more to the bug debugging wise as I find it. |
Launchpad user David Britton(dpb) wrote on 2014-08-25T16:58:37.843980+00:00 Launchpad attachments: zipped logs from repro |
Launchpad user David Britton(dpb) wrote on 2014-08-25T19:24:00.267773+00:00 FYI -- I can hit this pretty often in my test environment. If there is some way to add instrumentation to cloud-init I would be happy to debug further. |
Launchpad user Ian Booth(wallyworld) wrote on 2014-09-02T03:20:41.838863+00:00 I'm removing this from 1.20 series as any Juju related work (if any is required) will be done for the next release. |
Launchpad user Scott Moser(smoser) wrote on 2014-09-10T15:15:43.977277+00:00 I pushed a test that creates and destroys lxc containers in an effort to reproduce my suspicion of the problem to http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/cloud-init-test/view/head:/tests/lxc-test-new-instance . I've not been able to make it fail anywhere, though. |
Launchpad user David Britton(dpb) wrote on 2014-09-10T15:38:00+00:00 On Wed, Sep 10, 2014 at 03:15:43PM -0000, Scott Moser wrote:
FYI, with the workaround that you provided (of checking that /run is -- |
Launchpad user Scott Moser(smoser) wrote on 2014-09-12T16:34:45.883006+00:00 fixed in trunk at revno 1012 |
Launchpad user Scott Moser(smoser) wrote on 2014-09-16T18:13:32.452036+00:00 This is fixed in currently in unapproved at https://launchpad.net/ubuntu/trusty/+queue?queue_state=1&queue_text=cloud-init |
Launchpad user Brian Murray(brian-murray) wrote on 2014-09-18T20:35:59.172752+00:00 Hello David, or anyone else affected, Accepted cloud-init into trusty-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/cloud-init/0.7.5-0ubuntu1.2 in a few hours, and then in the -proposed repository. Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users. If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision. Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance! |
Launchpad user David Britton(dpb) wrote on 2014-09-22T21:47:41.031406+00:00 With scott's help, I spun this about 2000x over the weekend and did not hit the issue. I would call it verified from my perspective. Changing tag. |
Launchpad user Launchpad Janitor(janitor) wrote on 2014-09-25T16:23:46.026041+00:00 This bug was fixed in the package cloud-init - 0.7.5-0ubuntu1.2 cloud-init (0.7.5-0ubuntu1.2) trusty-proposed; urgency=medium
|
Launchpad user Brian Murray(brian-murray) wrote on 2014-09-25T16:23:53.486772+00:00 The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions. |
Launchpad user Ian Booth(wallyworld) wrote on 2014-10-21T13:30:55.717230+00:00 Removing this bug from juju-core as the issue required a cloud init fix. |
Launchpad user Scott Moser(smoser) wrote on 2016-08-10T14:44:35.044165+00:00 This is fixed in cloud-init 0.7.7 |
This bug was originally filed in Launchpad as LP: #1353008
Launchpad details
Launchpad user David Britton(dpb) wrote on 2014-08-05T16:58:43.927232+00:00
=== Begin SRU Information ===
This bug causes lxc containers created by the ubuntu-cloud template (lxc-create -t ubuntu-cloud) to sometimes not obtain an IP address, and thus not correctly boot to completion.
The bug is in an assumption by cloud-init that /run is mounted before the cloud-init-local job is run. The fix is very simply to guarantee that it is via modification to its upstart 'start on'.
When booting with an initramfs /run will be mounted before /, so the race condition is not possible. Thus, the failure case is only either in non-initramfs boot (which is very unlikely) or in lxc boot. The lxc case seems only to occur very rarely, somewhere well under one percent of the time.
[Test Case]
A test case is written at [1] that launches many instances in an attempt brute force find the error. However, I've not been able to make it fail.
The original bug reporter has been running with the 'start on' change and has seen no errors since.
We will request the original bug reporter to apply the uploaded changes and run through their battery.
[Regression Potential]
The possibility for regression here is in the second boot of an instance. The following scenario is a change of behavior:
* the user boots an instance with NoCloud or ConfigDrive in ds=local mode
* user changes /etc/network/interfaces in a way that would cause
static-networking to not be emitted on subsequent boot
* user reboots
Now, instead of a quick boot, the user may see cloud-init-nonet blocking on network coming up.
This would be a uncommon scenario, and the broken-etc-network-interfaces scenario is already one that causes timeouts on boot.
--
[1] http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/cloud-init-test/view/head:/tests/lxc-test-new-instance
=== End SRU Information ===
Note, that after I went onto the system, it did have an IP address.
0/lxc/3:
agent-state: pending
instance-id: juju-machine-0-lxc-3
series: trusty
hardware: arch=amd64
cloud-init-output.log snip:
Cloud-init v. 0.7.5 running 'init' at Mon, 04 Aug 2014 23:57:12 +0000. Up 572.29 seconds.
ci-info: +++++++++++++++++++++++Net device info+++++++++++++++++++++++
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: | Device | Up | Address | Mask | Hw-Address |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | . |
ci-info: | eth0 | True | . | . | 00:16:3e:34:aa:57 |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Route info failed!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Cloud-init v. 0.7.5 running 'modules:config' at Mon, 04 Aug 2014 23:57:12 +0000. Up 572.99 seconds.
Cloud-init v. 0.7.5 running 'modules:final' at Mon, 04 Aug 2014 23:57:14 +0000. Up 574.42 seconds.
Cloud-init v. 0.7.5 finished at Mon, 04 Aug 2014 23:57:14 +0000. Datasource DataSourceNoCloudNet [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]. Up 574.54 seconds
syslog on system, showing DHCPACK 1 second later:
root@juju-machine-0-lxc-3:/home/ubuntu# grep DHCP /var/log/syslog
Aug 4 23:57:13 juju-machine-0-lxc-3 dhclient: DHCPREQUEST of 10.96.3.173 on eth0 to 255.255.255.255 port 67 (xid=0x1687c544)
Aug 4 23:57:13 juju-machine-0-lxc-3 dhclient: DHCPOFFER of 10.96.3.173 from 10.96.0.10
Aug 4 23:57:13 juju-machine-0-lxc-3 dhclient: DHCPACK of 10.96.3.173 from 10.96.0.10
Aug 5 05:28:15 juju-machine-0-lxc-3 dhclient: DHCPREQUEST of 10.96.3.173 on eth0 to 10.96.0.10 port 67 (xid=0x1687c544)
Aug 5 05:28:15 juju-machine-0-lxc-3 dhclient: DHCPACK of 10.96.3.173 from 10.96.0.10
Aug 5 11:15:00 juju-machine-0-lxc-3 dhclient: DHCPREQUEST of 10.96.3.173 on eth0 to 10.96.0.10 port 67 (xid=0x1687c544)
Aug 5 11:15:00 juju-machine-0-lxc-3 dhclient: DHCPACK of 10.96.3.173 from 10.96.0.10
It appears in every case, cloud-init init-local has failed very early visible in juju logs /var/lib/juju/containers//console.log:
Traceback (most recent call last):
File "/usr/bin/cloud-init", line 618, in
sys.exit(main())
File "/usr/bin/cloud-init", line 614, in main
get_uptime=True, func=functor, args=(name, args))
File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1875, in log_time
ret = func(*args, **kwargs)
File "/usr/bin/cloud-init", line 491, in status_wrapper
force=True)
File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1402, in sym_link
os.symlink(source, link)
OSError: [Errno 2] No such file or directory
The text was updated successfully, but these errors were encountered: