
[2.8] Juju fails to connect to instance after juju-clean-shutdown.service timeout in cloud-init #3681

Closed
ubuntu-server-builder opened this issue May 12, 2023 · 14 comments
Labels
incomplete (Action required by submitter), launchpad (Migrated from Launchpad)

Comments

@ubuntu-server-builder
Collaborator

This bug was originally filed in Launchpad as LP: #1878639

Launchpad details
affected_projects = ['juju']
assignee = None
assignee_name = None
date_closed = None
date_created = 2020-05-14T15:47:43.214957+00:00
date_fix_committed = 2020-06-15T10:50:15.003668+00:00
date_fix_released = 2020-06-15T10:50:15.003668+00:00
id = 1878639
importance = undecided
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1878639
milestone = None
owner = jason-hobbs
owner_name = Jason Hobbs
private = False
status = incomplete
submitter = genet022
submitter_name = Joshua Genet
tags = ['cdo-qa', 'foundations-engine']
duplicates = []

Launchpad user Joshua Genet(genet022) wrote on 2020-05-14T15:47:43.214957+00:00

AWS does spin up an instance and assigns an IP, but Juju stays stuck in Pending.
There are a number of EC2RoleRequest/EC2Metadata errors in the controller logs.

Here's a link to the logs/artifacts:
https://oil-jenkins.canonical.com/artifacts/5e61db53-50f0-4b82-9bb1-957bd0085d46/index.html

@ubuntu-server-builder added the incomplete (Action required by submitter) and launchpad (Migrated from Launchpad) labels on May 12, 2023

Launchpad user Heather Lanigan(hmlanigan) wrote on 2020-05-14T17:06:22.269991+00:00

I spun up a happy aws controller and deployed a unit. Nothing like a k8s config, but I see the same EC2 errors in the /var/log/amazon/ssm files.


Launchpad user Heather Lanigan(hmlanigan) wrote on 2020-05-14T17:22:01.592041+00:00

Machine 18 is the one stuck in pending.


Launchpad user Pen Gale(pengale) wrote on 2020-05-14T17:51:43.028877+00:00

@genet022: can you get us logs from the machine that Juju was having a hard time talking to? Its logs didn't make it into the crash dump, and that's the most interesting machine, from a troubleshooting standpoint.


Launchpad user Joshua Genet(genet022) wrote on 2020-05-14T18:08:51.833369+00:00

@petevg Unfortunately, because this was an automated run in our CI, the crash dump is all we have. And like you said, the machine 18 logs are empty.


Launchpad user Tim Penhey(thumper) wrote on 2020-05-14T21:24:33.171312+00:00

Is this a one-off, or is it happening every time?


Launchpad user Tim Penhey(thumper) wrote on 2020-05-14T21:38:11.134968+00:00

Grabbed the logs from the crash dumps. As mentioned by @petevg, there is nothing we can use here for diagnosis.

The problem is on the machine that we have no information for. The controller logs show that machine-18 in the Kubernetes model never tried to connect. This normally indicates a networking or cloud-init issue on the started instance.

Without access to the instance that has had the problem, there is nothing we can do.


Launchpad user John George(jog) wrote on 2020-05-15T15:50:03.737691+00:00

We hit something similar on vsphere, and were able to get cloud-init-output.log.
It's available in the artifacts of this run:
https://solutions.qa.canonical.com/#/qa/testRun/0a3705fe-3357-486e-a61f-01abfffe3c58

There is a failure from juju-clean-shutdown.service:

  • /bin/systemctl enable /etc/systemd/system/juju-clean-shutdown.service
    Failed to enable unit: Connection timed out
    Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~18.04.1 running 'modules:final' at Thu, 14 May 2020 16:30:15 +0000. Up 226.09 seconds.
    2020-05-14 16:43:15,256 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/runcmd [1]
    2020-05-14 16:43:18,676 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
    2020-05-14 16:43:18,677 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
    Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~18.04.1 finished at Thu, 14 May 2020 16:43:18 +0000. Datasource DataSourceOVF [seed=iso]. Up 1009.50 seconds


Launchpad user Heather Lanigan(hmlanigan) wrote on 2020-05-15T17:28:37.136713+00:00

With the vsphere config machine-4:

May 14 16:43:18 juju-60990d-4 cloud-init[1562]: + /bin/systemctl enable /etc/systemd/system/juju-clean-shutdown.service
May 14 16:43:18 juju-60990d-4 cloud-init[1562]: Failed to enable unit: Connection timed out

The systemctl command failed, causing cloud-init to fail and exit before jujud-machine-4.service could be enabled on that machine.


Launchpad user Pen Gale(pengale) wrote on 2020-05-18T14:04:00.347873+00:00

Per conversation with the Juju team, this is likely a systemd bug. Juju can't cleanly do much about it: if a machine fails during cloud-init, it is never going to get to the point where it can talk to Juju. The "correct" next steps would involve further investigation on the failed machine, and a bug filed against systemd.

That said, there is some investigation that we might do from the Juju end of things. For example, we could queue up the service to be started, rather than blocking on start. This might cause other issues later on in the unit's life cycle, however.
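One shape such a mitigation could take (purely a sketch, not what Juju implemented) is to retry the transiently failing `systemctl` call a few times instead of letting a single D-Bus timeout abort the whole provisioning script; `retry` below is a hypothetical helper:

```shell
#!/bin/sh
# Hypothetical mitigation sketch: retry a transiently failing command
# rather than failing the whole script on the first timeout.
retry() {
  attempts=$1; shift
  i=1
  until "$@"; do
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep 1  # brief pause between attempts
  done
}

# Intended use on a real machine (not executed here):
#   retry 5 /bin/systemctl enable /etc/systemd/system/juju-clean-shutdown.service
```

Retrying only papers over the symptom, of course; the underlying timeout would still warrant the systemd-side investigation mentioned above.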


Launchpad user Tim Penhey(thumper) wrote on 2020-05-19T21:18:39.089024+00:00

Isn't this a cloud-init issue rather than a Juju issue?


Launchpad user Pen Gale(pengale) wrote on 2020-05-19T21:43:30.670938+00:00

This is not a regression, and isn't a bug with the Juju service being started. There might be some longer term work to make Juju behave better when a piece of the pipeline fails like this. But this doesn't make sense as a release blocker -- any fixes we did in the release window would be partial, and wouldn't address the underlying bug in cloud-init.


Launchpad user Paride Legovini(paride) wrote on 2020-06-15T10:50:09.173879+00:00

Hi,

I think this is unlikely to be a bug in cloud-init: as noted already, the cloud-init failure is a consequence of the failure to start the juju-clean-shutdown service. We could get a better understanding of what happens on the cloud-init side from the logs tarball generated by running

cloud-init collect-logs

on the failed machine. For the moment I'm marking the cloud-init task as Incomplete.


Launchpad user Joseph Phillips(manadart) wrote on 2020-06-17T11:42:42.107093+00:00

This service is no longer created on machines using systemd.
juju/juju#11717

@holmanb
Member

holmanb commented Apr 27, 2024

No longer relevant, and looks like this wasn't a cloud-init issue either. Closing.

@holmanb holmanb closed this as completed Apr 27, 2024