Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud-init fails when rebooting EC2 i3.metal instances #3435

Closed
ubuntu-server-builder opened this issue May 11, 2023 · 13 comments
Closed

cloud-init fails when rebooting EC2 i3.metal instances #3435

ubuntu-server-builder opened this issue May 11, 2023 · 13 comments
Labels
launchpad Migrated from Launchpad priority Fix soon

Comments

@ubuntu-server-builder
Copy link
Collaborator

This bug was originally filed in Launchpad as LP: #1841182

Launchpad details
affected_projects = []
assignee = None
assignee_name = None
date_closed = 2020-01-18T04:17:25.904019+00:00
date_created = 2019-08-23T13:25:05.373507+00:00
date_fix_committed = None
date_fix_released = None
id = 1841182
importance = high
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1841182
milestone = None
owner = paride
owner_name = Paride Legovini
private = False
status = expired
submitter = paride
submitter_name = Paride Legovini
tags = []
duplicates = []

Launchpad user Paride Legovini(paride) wrote on 2019-08-23T13:25:05.373507+00:00

In order to collect boot-speed metrics I deploy/reboot/terminate several EC2 instances per day. At a high level this is that the jobs do:

  1. Deploy an instance and wait for cloud-init
    to finish using cloud-init status --wait
  2. Collect and retrieve some logs via SSH/SFTP
  3. Reboot the instance using boto3's reboot()
  4. Collect some more logs
  5. Terminate the instance

This works in a fairly reliable way, but on i3.metal instances the instance often fails to survive the reboot step. After a failed reboot the instance state appears as "running", but it's unreachable via SSH.

By detaching the root volume and attaching it to another instance in the same availability zone I've been able to inspect the logs, and problem is a cloud-init failure. At a first glance of the logs it looks like cloud-init doesn't like /var/lib/cloud/data/set-hostname being empty:

2019-08-23 11:31:27,585 - util.py[DEBUG]: Reading from /var/lib/cloud/data/set-hostname (quiet=False)
2019-08-23 11:31:27,585 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/set-hostname
2019-08-23 11:31:27,585 - util.py[WARNING]: failed stage init-local
2019-08-23 11:31:27,586 - util.py[DEBUG]: failed stage init-local
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 653, in status_wrapper
ret = functor(name, args)
File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 361, in main_init
_maybe_set_hostname(init, stage='local', retry_stage='network')
File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 709, in _maybe_set_hostname
cc_set_hostname.handle('set-hostname', init.cfg, cloud, LOG, None)
File "/usr/lib/python3/dist-packages/cloudinit/config/cc_set_hostname.py", line 67, in handle
prev_hostname = util.load_json(util.load_file(prev_fn))
File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1586, in load_json
decoded = json.loads(decode_binary(text))
File "/usr/lib/python3.6/json/init.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I'm not sure on where the actual problem is here. Is set-hostname supposed to always contain something? Should cloud-init be able to handle an empty set-hostname? Could the fact that the instance is rebooted shortly after being deployed affect this?

The full logs are attached.

@ubuntu-server-builder ubuntu-server-builder added launchpad Migrated from Launchpad priority Fix soon labels May 11, 2023
@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Paride Legovini(paride) wrote on 2019-08-23T13:25:05.373507+00:00

Launchpad attachments: cloud-init.tar.gz

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Ryan Harper(raharper) wrote on 2019-08-28T19:03:08.116678+00:00

Thanks for submitting this. Is it possible to collect-logs from the initial boot of the instance before the reboot? In particular I'd like to see a cloud-init collect-logs and a tar of /var/lib/cloud before the reboot.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Paride Legovini(paride) wrote on 2019-08-30T09:36:45+00:00

Ryan Harper wrote on 28/08/2019:

Thanks for submitting this. Is it possible to collect-logs from the
initial boot of the instance before the reboot? In particular I'd like
to see a cloud-init collect-logs and a tar of /var/lib/cloud before the
reboot.

Here are the first-boot logs from an instance that when rebooted hit the
problem. The post-reboot-console-log-fail.log file is the console log
collected after the reboot and that shows the cloud-init error message.
It has been collected using aws ec2 get-console-output.

Paride

Launchpad attachments: _var_lib_cloud.tar.gz,cloud-init.tar.gz,post-reboot-console-log-fail.log

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Dan Watkins(oddbloke) wrote on 2019-09-03T21:42:54.779264+00:00

OK, it looks like /var/lib/cloud/data/set-hostname ends up empty after the reboot, which is what causes the issue. We should handle that case gracefully, assuming that it isn't cloud-init that's causing it to be removed.

To help confirm what's going on, are you able to capture /var/lib/cloud and logs post-failure (perhaps by attaching the root disk elsewhere)?

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Paride Legovini(paride) wrote on 2019-09-05T13:58:07+00:00

Dan Watkins wrote on 03/09/2019:

OK, it looks like /var/lib/cloud/data/set-hostname ends up empty after
the reboot, which is what causes the issue. We should handle that case
gracefully, assuming that it isn't cloud-init that's causing it to be
removed.

To help confirm what's going on, are you able to capture /var/lib/cloud
and logs post-failure (perhaps by attaching the root disk elsewhere)?

Here are the first-boot and post-reboot logs collected from an instance
that encountered the problem.

Launchpad attachments: FIRSTBOOT_cloud-init.tar.gz,FIRSTBOOT_var_lib_cloud.tar.gz,REBOOT_cloud-init.tar.gz,REBOOT_var_lib_cloud.tar.gz,console-output-i-0946e77e872111556

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Dan Watkins(oddbloke) wrote on 2019-09-09T14:18:30.045006+00:00

OK, to clarify my previous comment, the issue is not that /var/lib/cloud/data/set-hostname is missing, the issue is specifically that it is present but empty. Unfortunately, we don't have any logging from the code path used to write set-hostname (it uses cloudinit.atomic_helper.write_json which uses atomic_helper.write_file), so we can't be 100% sure that we aren't incorrectly writing out this empty file.

So, I think this breaks down into three things:

  1. We should improve the logging situation there (just filed bug 1843276)
  2. We should work out why the file ends up empty in this case
  3. We should gracefully treat an empty file as an absent file (probably with a WARNING?), because I think that's the best we can do once we find ourselves in that situation

(2) and (3) should probably be separate bugs, but without (1) it's not 100% clear that that's the case so I haven't filed anything separate yet.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Dan Watkins(oddbloke) wrote on 2019-09-10T17:24:04.559069+00:00

Paride, are you able to test this again using the cloud-init from the daily PPA[0]? That has some additional logging added, which should make working out (2) easier.

[0] https://launchpad.net/~cloud-init-dev/+archive/ubuntu/daily

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Launchpad Janitor(janitor) wrote on 2019-11-17T04:17:29.091340+00:00

[Expired for cloud-init because there has been no activity for 60 days.]

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Launchpad Janitor(janitor) wrote on 2020-01-18T04:17:25.766116+00:00

[Expired for cloud-init because there has been no activity for 60 days.]

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Paride Legovini(paride) wrote on 2020-01-21T11:17:14.377029+00:00

This didn't happen again in quite some time, not sure what fixed it. I'll reopen this bug report if I encounter the issue again.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user xiaoyi chen(xiachen-rh) wrote on 2022-11-08T08:14:04.440769+00:00

Hi Dan Watkins,
we got a similar issue that cloud-init fails when /var/lib/cloud/data/set-hostname is present but empty during reboot. We are doing further research that why the file ends up empty in our case, meanwhile, will cloud-init handle the exception as you commented(https://bugs.launchpad.net/cloud-init/+bug/1841182/comments/6)?

Our problem details are here,
https://bugzilla.redhat.com/show_bug.cgi?id=2140893

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user James Falcon(falcojr) wrote on 2022-11-08T14:08:02.225639+00:00

Xiaoyi, that bugzilla bug isn't public, so we can't see the details. We'll need the resulting tarball from 'cloud-init collect-logs' (with --include-userdata if there's no sensitive userdata) in order to try to debug this issue. Also note that this bug was filed 3 years ago against a version of cloud-init that is 3 years old. Can the problem be reproduced on the current version of cloud-init?

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Emanuele Esposito(esposem) wrote on 2022-11-22T16:11:40.266309+00:00

The BZ is now public. However, there is only one suggestion that is private that might help solving the problem, so I am pasting it here:

Firstly, IMO, empty '/var/lib/cloud/data/set-hostname' is indeed an exception/case that cloud-init should handle properly than leave the service failed directly.

Back to this issue, hard reboot during boot or trigger sysrq b immediately when system boot up. The content of set-hostname lost after system boot up.
It seems that the content did not synced to disk when system reset.


Here is the instructions of sysrq b.

``b``       Will immediately reboot the system without syncing or unmounting
            your disks.

Below information shows that set-hostname already has content before reset, but it is 0 after boot again.
2022-11-08 01:49:17,759 - util.py[DEBUG]: Reading from /var/lib/cloud/data/set-hostname (quiet=False)
2022-11-08 01:49:17,759 - util.py[DEBUG]: Read 88 bytes from /var/lib/cloud/data/set-hostname
2022-11-08 01:49:17,759 - cc_set_hostname.py[DEBUG]: No hostname changes. Skipping set-hostname
2022-11-08 02:33:58,585 - util.py[DEBUG]: Reading from /var/lib/cloud/data/set-hostname (quiet=False)
2022-11-08 02:33:58,585 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/set-hostname

Not sure if turn off buffering is helpful to mitigate this, but I did not reproduce this issue after 10+ retry times.

# diff -u /usr/lib/python3.9/site-packages/cloudinit/atomic_helper.py /usr/lib/python3.9/site-packages/cloudinit/atomic_helper.py.orig
--- /usr/lib/python3.9/site-packages/cloudinit/atomic_helper.py	2022-11-08 08:17:56.262679073 +0000
+++ /usr/lib/python3.9/site-packages/cloudinit/atomic_helper.py.orig	2022-11-08 08:17:03.722890860 +0000
@@ -25,7 +25,7 @@
     tf = None
     try:
         tf = tempfile.NamedTemporaryFile(
-            dir=os.path.dirname(filename), delete=False, mode=omode, buffering=0
+            dir=os.path.dirname(filename), delete=False, mode=omode
         )

@ubuntu-server-builder ubuntu-server-builder closed this as not planned Won't fix, can't repro, duplicate, stale May 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
launchpad Migrated from Launchpad priority Fix soon
Projects
None yet
Development

No branches or pull requests

1 participant