
cloud-init doesn't retry metadata lookups and hangs forever if metadata is down #2812

Closed
ubuntu-server-builder opened this issue May 10, 2023 · 6 comments
Labels
launchpad Migrated from Launchpad

Comments

@ubuntu-server-builder

This bug was originally filed in Launchpad as LP: #1667735

Launchpad details
affected_projects = ['cloud-init (Ubuntu)', 'cloud-init (Ubuntu Precise)', 'cloud-init (Ubuntu Trusty)']
assignee = mgerdts
assignee_name = Mike Gerdts
date_closed = 2018-06-20T18:06:52.230612+00:00
date_created = 2017-02-24T16:15:11.633675+00:00
date_fix_committed = 2018-04-18T17:58:30.365556+00:00
date_fix_released = 2018-06-20T18:06:52.230612+00:00
id = 1667735
importance = medium
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1667735
milestone = None
owner = smoser
owner_name = Scott Moser
private = False
status = fix_released
submitter = chorrell
submitter_name = Christopher Horrell
tags = []
duplicates = []

Launchpad user Christopher Horrell(chorrell) wrote on 2017-02-24T16:15:11.633675+00:00

If a SmartOS host server is rebooted and the metadata service is not available, a KVM VM instance that uses cloud-init (via the SmartOS datasource) will fail to start.

If the metadata agent on the host server is not available, cloud-init's Python code blocks forever waiting for data it will never receive. This causes the instance's boot process to hang on cloud-init.

This is problematic whenever the metadata agent is unavailable, for any reason, while a SmartOS KVM VM that relies on cloud-init is booting.

From the engineer who worked on this (note: the svcadm command is run on the host SmartOS server):

You can reproduce this by disabling the metadata service on the SmartOS host:

svcadm disable metadata

and then boot a KVM VM running an Ubuntu Certified Cloud image such as:

c864f104-624c-43d2-835e-b49a39709b6b (ubuntu-certified-14.04 20150225.2)

When you do this, the VM's boot process will hang at cloud-init. If you then start the metadata service, cloud-init will not recover.

One of our engineers who looked at this was able to make forward progress by applying this patch:

--- /usr/lib/python2.7/dist-packages/cloudinit/sources/DataSourceSmartOS.py.ori 2017-02-23 01:28:28.405885775 +0000
+++ /usr/lib/python2.7/dist-packages/cloudinit/sources/DataSourceSmartOS.py 2017-02-23 01:35:51.281885775 +0000
@@ -286,7 +286,7 @@
     if not seed_device:
         raise AttributeError("seed_device value is not set")
 
-    ser = serial.Serial(seed_device, timeout=seed_timeout)
+    ser = serial.Serial(seed_device, timeout=10)
     if not ser.isOpen():
         raise SystemError("Unable to open %s" % seed_device)

which causes the following strace output:

[pid 2119] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
[pid 2119] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
[pid 2119] write(5, "GET user-script\n", 16) = 16
[pid 2119] select(6, [5], [], [], {10, 0}) = 0 (Timeout)
[pid 2119] close(5) = 0
[pid 2119] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
[pid 2119] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
[pid 2119] write(5, "GET iptables_disable\n", 21) = 21
[pid 2119] select(6, [5], [], [], {10, 0}) = 0 (Timeout)
[pid 2119] close(5) = 0

instead of:

[pid 1977] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
[pid 1977] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
[pid 1977] write(5, "GET base64_keys\n", 16) = 16
[pid 1977] select(6, [5], [], [], NULL

which you get without the patch (notice the NULL for the timeout parameter). The code that gets blocked in this version of cloud-init is:

    ser.write("GET %s\n" % noun.rstrip())
    status = str(ser.readline()).rstrip()

in cloudinit/sources/DataSourceSmartOS.py. The ser.readline() documentation (https://pyserial.readthedocs.io/en/latest/shortintro.html#readline) says:

Be careful when using readline(). Do specify a timeout when opening the serial port otherwise it could block forever if no newline character is received. Also note that readlines() only works with a timeout. readlines() depends on having a timeout and interprets that as EOF (end of file). It raises an exception if the port is not opened correctly.

which is exactly the situation we've hit here.
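
To illustrate the behaviour the pyserial documentation describes, here is a minimal, hypothetical sketch (not cloud-init code; the device path and key are examples only): when a timeout is given at open time, readline() returns whatever has arrived once the timeout expires instead of blocking forever.

    import serial  # pyserial

    # Hypothetical example; /dev/ttyS1 is the device seen in the strace output above.
    ser = serial.Serial("/dev/ttyS1", timeout=10)  # timeout in seconds
    ser.write(b"GET user-script\n")
    line = ser.readline()  # returns b"" after ~10s if no newline arrives,
                           # rather than blocking indefinitely
    ser.close()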

It might be better to have a timeout, but when the timeout is hit, the GET should be retried if there was no answer, rather than moving on to the next key. A negative answer (NOTFOUND, for example) should not be retried; only the case where there is no answer at all (because metadata is unavailable) should be.
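
As a rough sketch of that retry policy (a hypothetical helper, not the actual cloud-init implementation): a timed-out GET with no answer is retried against the same key, while any real reply, including a negative one such as NOTFOUND, is returned to the caller.

    import time
    import serial  # pyserial

    def query_metadata(device, noun, per_try_timeout=10, retry_delay=5):
        """Hypothetical helper: retry a GET only when there is no answer at all."""
        while True:
            ser = serial.Serial(device, timeout=per_try_timeout)
            try:
                ser.write(("GET %s\n" % noun).encode("ascii"))
                status = ser.readline().decode("ascii", "replace").rstrip()
            finally:
                ser.close()
            if status:
                # Any actual reply (including NOTFOUND) is passed back;
                # only silence (metadata unavailable) triggers a retry.
                return status
            time.sleep(retry_delay)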

Once this is resolved, it should be possible to start a VM with cloud-init and metadata disabled, and then enable metadata some time later and have the boot process complete at that time.

@ubuntu-server-builder ubuntu-server-builder added the launchpad Migrated from Launchpad label May 10, 2023
@ubuntu-server-builder

Launchpad user Scott Moser(smoser) wrote on 2017-02-24T18:24:45.108309+00:00

"Once this is resolved, it should be possible to start a VM with cloud-init and metadata disabled, and then enable metadata some time later and have the boot process complete at that time."

That doesn't really make sense. And timing out in 10 seconds and re-trying wouldn't really change anything.

Either you block the boot waiting for the other end of the read() or you don't. "have the boot process complete at that time" is, I think, what we have in precise and trusty right now.

Boot will hang until the read() comes back. I'm not sure if an upstart boot will actually wait forever for cloud-init-local to return; it may well do that. I suspect that systemd will kill cloud-init after about 90 seconds.

It's very arguable that the right thing to do is to wait forever on the metadata service. Cloud-init only knows what it should do next based on information from the metadata service. If this is the first boot of an instance, then some things will happen; if it is a reboot, then things such as ssh key generation or user creation will not occur.

So this is kind of working by design. Sure, we can change the setting to give up, but that will just result in cloud-init not finding a datasource. If that was an instance's first boot, then the user wouldn't be able to access the system. If it was any other boot, then things would be fine, but cloud-init has no way of knowing that this is a "second boot" other than via the metadata service.

@ubuntu-server-builder

Launchpad user Josh Wilsdon(jwilsdon) wrote on 2017-03-01T23:26:25.661809+00:00

"That doesn't really make sense. And re-trying timing out in 10 seconds and re-trying wouldn't really change anything."

Unfortunately, that's not the case for cloud-init in 14.04 (retrying there would fix the problem but miss keys), and it is only correct for cloud-init in 16.10 because that version has other problems.

With the Ubuntu 14.04 image's version of cloud-init (0.7.5-0ubuntu1.3), the read() blocks with no timeout. If there were a timeout and the request were correctly retried every 10 seconds, things would eventually proceed once the metadata service became available. As it stands now, these instances need to be externally rebooted when they hang, as they will never recover on their own.

I've just tested 16.10 as well, with cloud-init 0.7.8-68-gca3ae67-0ubuntu1~16.10.1. In that version the code has changed significantly, but the problem is the same. Even though a timeout has been added in the new version, its usefulness is negated because the new code gets stuck in an infinite loop when we hit this case. You can also see some more details in:

https://smartos.org/bugview/IMAGE-1014

In my testing on 16.10, systemd did not kill cloud-init in the 30 minutes I waited.

"Its very arguable that the right thing to do is wait forever on the metadata service."

I agree with this. However, that's not what cloud-init is doing, in part because cloud-init's implementation of the metadata protocol specification (https://eng.joyent.com/mdata/protocol.html) is incomplete. In particular:

  • It uses V2 without doing NEGOTIATE
  • It uses the KVM serial port without reading all the data from the buffer before writing
  • It does not write '\n' and wait for 'invalid command\n'
  • When a read() times out, it tries a read() again instead of starting over

What happens with cloud-init in 16.10 is that if metadata is unavailable when the instance boots, cloud-init will:

  1. write data into the socket (nobody's listening)
  2. do a select() on the socket looking for readable data (and timeout after 10 seconds)
  3. goto 2

The loop between 2 and 3 becomes infinite because even if metadata is enabled at this point, cloud-init never attempts to send any further commands to it.

If instead it were to:

  1. open the socket
  2. read on the socket (with a timeout) and discard any data
  3. write '\n'
  4. read on the socket for 'invalid command\n' (with a timeout; on timeout, close the socket and go to 1)
  5. NEGOTIATE V2

before making any queries, it would be able to recover when the metadata service became available, even if it was unavailable initially. If you disable the metadata service, run mdata-get under strace, and then enable metadata, you'll see that this is how it is able to recover in this case.
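
As a hypothetical sketch of that recovery handshake (based only on the steps above and the linked protocol document, not on the cloud-init code), the open/drain/probe loop might look something like this; the caller would then perform the NEGOTIATE V2 step before issuing any GET requests:

    import serial  # pyserial

    def wait_for_metadata(device, per_try_timeout=10):
        """Hypothetical helper: loop until the metadata service answers."""
        while True:
            ser = serial.Serial(device, timeout=per_try_timeout)  # 1. open
            ser.read(4096)          # 2. read (bounded by the timeout) and discard
            ser.write(b"\n")        # 3. write '\n'
            reply = ser.readline()  # 4. wait for 'invalid command\n'
            if reply.strip() == b"invalid command":
                return ser          # 5. caller can now NEGOTIATE V2 and query
            ser.close()             # no answer: close and start over from step 1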

So in summary: I think we're in agreement that cloud-init should wait forever for the metadata service, but 14.04 and 16.10 have different but related problems in their implementations that prevent cloud-init from ever actually knowing when metadata has become available. The consequence is that in both versions, if metadata is unavailable when cloud-init first runs, the VMs will hang until they are rebooted externally.

@ubuntu-server-builder

Launchpad user Scott Moser(smoser) wrote on 2017-03-02T13:07:34.573541+00:00

Thanks for clarifying.
Note there is also a socket client for your container service there; we'd ideally fix that too.

@ubuntu-server-builder

Launchpad user Scott Moser(smoser) wrote on 2018-04-18T17:58:26.981256+00:00

An upstream commit landed for this bug.

To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=4c573d0e

@ubuntu-server-builder

Launchpad user Launchpad Janitor(janitor) wrote on 2018-04-19T10:34:27.849376+00:00

This bug was fixed in the package cloud-init - 18.2-14-g6d48d265-0ubuntu1


cloud-init (18.2-14-g6d48d265-0ubuntu1) bionic; urgency=medium

  • New upstream snapshot.
    • net: Depend on iproute2's ip instead of net-tools ifconfig or route
    • DataSourceSmartOS: fix hang when metadata service is down
      [Mike Gerdts] (LP: #1667735)
    • DataSourceSmartOS: change default fs on ephemeral disk from ext3 to
      ext4. [Mike Gerdts] (LP: #1763511)
    • pycodestyle: Fix invalid escape sequences in string literals.
    • Implement bash completion script for cloud-init command line

-- Chad Smith chad.smith@canonical.com Wed, 18 Apr 2018 15:25:53 -0600

@ubuntu-server-builder

Launchpad user Scott Moser(smoser) wrote on 2018-06-20T18:06:54.043663+00:00

This bug is believed to be fixed in cloud-init version 18.3. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.
