-
Notifications
You must be signed in to change notification settings - Fork 816
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cloud-init fails when rebooting EC2 i3.metal instances #3435
Comments
Launchpad user Paride Legovini(paride) wrote on 2019-08-23T13:25:05.373507+00:00 Launchpad attachments: cloud-init.tar.gz |
Launchpad user Ryan Harper(raharper) wrote on 2019-08-28T19:03:08.116678+00:00 Thanks for submitting this. Is it possible to collect-logs from the initial boot of the instance before the reboot? In particular I'd like to see a cloud-init collect-logs and a tar of /var/lib/cloud before the reboot. |
Launchpad user Paride Legovini(paride) wrote on 2019-08-30T09:36:45+00:00 Ryan Harper wrote on 28/08/2019:
Here are the first-boot logs from an instance that when rebooted hit the Paride Launchpad attachments: _var_lib_cloud.tar.gz,cloud-init.tar.gz,post-reboot-console-log-fail.log |
Launchpad user Dan Watkins(oddbloke) wrote on 2019-09-03T21:42:54.779264+00:00 OK, it looks like /var/lib/cloud/data/set-hostname ends up empty after the reboot, which is what causes the issue. We should handle that case gracefully, assuming that it isn't cloud-init that's causing it to be removed. To help confirm what's going on, are you able to capture /var/lib/cloud and logs post-failure (perhaps by attaching the root disk elsewhere)? |
Launchpad user Paride Legovini(paride) wrote on 2019-09-05T13:58:07+00:00 Dan Watkins wrote on 03/09/2019:
Here are the first-boot and post-reboot logs collected from an instance Launchpad attachments: FIRSTBOOT_cloud-init.tar.gz,FIRSTBOOT_var_lib_cloud.tar.gz,REBOOT_cloud-init.tar.gz,REBOOT_var_lib_cloud.tar.gz,console-output-i-0946e77e872111556 |
Launchpad user Dan Watkins(oddbloke) wrote on 2019-09-09T14:18:30.045006+00:00 OK, to clarify my previous comment, the issue is not that /var/lib/cloud/data/set-hostname is missing, the issue is specifically that it is present but empty. Unfortunately, we don't have any logging from the code path used to write set-hostname (it uses cloudinit.atomic_helper.write_json which uses atomic_helper.write_file), so we can't be 100% sure that we aren't incorrectly writing out this empty file. So, I think this breaks down into three things:
(2) and (3) should probably be separate bugs, but without (1) it's not 100% clear that that's the case so I haven't filed anything separate yet. |
Launchpad user Dan Watkins(oddbloke) wrote on 2019-09-10T17:24:04.559069+00:00 Paride, are you able to test this again using the cloud-init from the daily PPA[0]? That has some additional logging added, which should make working out (2) easier. [0] https://launchpad.net/~cloud-init-dev/+archive/ubuntu/daily |
Launchpad user Launchpad Janitor(janitor) wrote on 2019-11-17T04:17:29.091340+00:00 [Expired for cloud-init because there has been no activity for 60 days.] |
Launchpad user Launchpad Janitor(janitor) wrote on 2020-01-18T04:17:25.766116+00:00 [Expired for cloud-init because there has been no activity for 60 days.] |
Launchpad user Paride Legovini(paride) wrote on 2020-01-21T11:17:14.377029+00:00 This didn't happen again in quite some time, not sure what fixed it. I'll reopen this bug report if I encounter the issue again. |
Launchpad user xiaoyi chen(xiachen-rh) wrote on 2022-11-08T08:14:04.440769+00:00 Hi Dan Watkins, Our problem details are here, |
Launchpad user James Falcon(falcojr) wrote on 2022-11-08T14:08:02.225639+00:00 Xiaoyi, that bugzilla bug isn't public, so we can't see the details. We'll need the resulting tarball from 'cloud-init collect-logs' (with --include-userdata if there's no sensitive userdata) in order to try to debug this issue. Also note that this bug was filed 3 years ago against a version of cloud-init that is 3 years old. Can the problem be reproduced on the current version of cloud-init? |
Launchpad user Emanuele Esposito(esposem) wrote on 2022-11-22T16:11:40.266309+00:00 The BZ is now public. However, there is only one suggestion that is private that might help solving the problem, so I am pasting it here:
|
This bug was originally filed in Launchpad as LP: #1841182
Launchpad details
Launchpad user Paride Legovini(paride) wrote on 2019-08-23T13:25:05.373507+00:00
In order to collect boot-speed metrics I deploy/reboot/terminate several EC2 instances per day. At a high level this is that the jobs do:
to finish using
cloud-init status --wait
This works in a fairly reliable way, but on i3.metal instances the instance often fails to survive the reboot step. After a failed reboot the instance state appears as "running", but it's unreachable via SSH.
By detaching the root volume and attaching it to another instance in the same availability zone I've been able to inspect the logs, and problem is a cloud-init failure. At a first glance of the logs it looks like cloud-init doesn't like /var/lib/cloud/data/set-hostname being empty:
2019-08-23 11:31:27,585 - util.py[DEBUG]: Reading from /var/lib/cloud/data/set-hostname (quiet=False)
2019-08-23 11:31:27,585 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/set-hostname
2019-08-23 11:31:27,585 - util.py[WARNING]: failed stage init-local
2019-08-23 11:31:27,586 - util.py[DEBUG]: failed stage init-local
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 653, in status_wrapper
ret = functor(name, args)
File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 361, in main_init
_maybe_set_hostname(init, stage='local', retry_stage='network')
File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 709, in _maybe_set_hostname
cc_set_hostname.handle('set-hostname', init.cfg, cloud, LOG, None)
File "/usr/lib/python3/dist-packages/cloudinit/config/cc_set_hostname.py", line 67, in handle
prev_hostname = util.load_json(util.load_file(prev_fn))
File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1586, in load_json
decoded = json.loads(decode_binary(text))
File "/usr/lib/python3.6/json/init.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I'm not sure on where the actual problem is here. Is set-hostname supposed to always contain something? Should cloud-init be able to handle an empty set-hostname? Could the fact that the instance is rebooted shortly after being deployed affect this?
The full logs are attached.
The text was updated successfully, but these errors were encountered: