[2.8] Juju fails to connect to instance after juju-clean-shutdown.service timeout in cloud-init #3681
Comments
Launchpad user Heather Lanigan(hmlanigan) wrote on 2020-05-14T17:06:22.269991+00:00 I spun up a happy aws controller and deployed a unit. Nothing like a k8s config, but I see the same EC2 errors in the /var/log/amazon/ssm files.
Launchpad user Heather Lanigan(hmlanigan) wrote on 2020-05-14T17:22:01.592041+00:00 Machine 18 is the one stuck in pending.
Launchpad user Pen Gale(pengale) wrote on 2020-05-14T17:51:43.028877+00:00 @genet022: can you get us logs from the machine that Juju was having a hard time talking to? Its logs didn't make it into the crash dump, and that's the most interesting machine from a troubleshooting standpoint.
Launchpad user Joshua Genet(genet022) wrote on 2020-05-14T18:08:51.833369+00:00 @petevg Unfortunately, because this was an automated run in our CI, the crashdump is all we have. And as you said, the machine 18 logs are empty.
Launchpad user Tim Penhey(thumper) wrote on 2020-05-14T21:24:33.171312+00:00 Is this a one-off, or is it happening every time?
Launchpad user Tim Penhey(thumper) wrote on 2020-05-14T21:38:11.134968+00:00 Grabbed the logs from the crashdumps. As mentioned by @petevg, there is nothing we can use here for diagnosis. The problem is on the machine that we have no information for. The controller logs show that machine-18 in the kubernetes model never tried to connect. This normally indicates a networking or cloud-init issue on the started instance. Without access to the instance that had the problem, there is nothing we can do.
Launchpad user John George(jog) wrote on 2020-05-15T15:50:03.737691+00:00 We hit something similar on vsphere, and were able to get cloud-init-output.log. There is a failure from juju-clean-shutdown.service.
Launchpad user Heather Lanigan(hmlanigan) wrote on 2020-05-15T17:28:37.136713+00:00 With the vsphere config, machine-4 shows: May 14 16:43:18 juju-60990d-4 cloud-init[1562]: + /bin/systemctl enable /etc/systemd/system/juju-clean-shutdown.service The systemctl command failed, causing cloud-init to fail and exit before jujud-machine-4.service could be enabled on that machine.
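The failure mode described above can be reduced to a small sketch (a hypothetical shell script, not Juju's actual cloud-init payload; `enable_unit` is a stand-in for `systemctl enable` so the example runs without systemd). Because cloud-init runs the script with errexit, the first failing enable aborts everything that follows, so the jujud machine unit is never enabled:

```shell
# Hypothetical stand-in for `systemctl enable <unit>`; it simulates the
# observed failure for the clean-shutdown unit and succeeds otherwise.
enable_unit() {
  case "$1" in
    juju-clean-shutdown.service)
      echo "Failed to enable unit: $1" >&2
      return 1 ;;
    *)
      echo "enabled $1" ;;
  esac
}

# The subshell mimics the cloud-init script, which runs with `set -e`:
# the failing enable exits the script, so the second enable never runs.
( set -e
  enable_unit juju-clean-shutdown.service
  enable_unit jujud-machine-4.service )   # unreachable
status=$?
echo "cloud-init script exit status: $status"
```

This is why the instance never reaches the point of talking to the controller: the machine agent's unit simply was never enabled, so nothing on the instance attempts the connection.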
Launchpad user Pen Gale(pengale) wrote on 2020-05-18T14:04:00.347873+00:00 Per convo w/ Juju team, this is likely a systemd bug. Juju can't cleanly do much about it -- if a machine fails to cloud-init, it's never going to get to the point where it can talk to Juju. The "correct" next steps would involve further investigation on the failed machine, and a bug filed against systemd. That said, there is some investigation that we might do from the Juju end of things. For example, we could queue up the service to be started, rather than blocking on start. This might cause other issues later on in the unit's life cycle, however.
Launchpad user Tim Penhey(thumper) wrote on 2020-05-19T21:18:39.089024+00:00 Isn't this a cloud-init issue? Not a Juju issue?
Launchpad user Pen Gale(pengale) wrote on 2020-05-19T21:43:30.670938+00:00 This is not a regression, and isn't a bug with the Juju service being started. There might be some longer-term work to make Juju behave better when a piece of the pipeline fails like this. But this doesn't make sense as a release blocker -- any fixes we did in the release window would be partial, and wouldn't address the underlying bug in cloud-init.
Launchpad user Paride Legovini(paride) wrote on 2020-06-15T10:50:09.173879+00:00 Hi, I think this is unlikely to be a bug in cloud-init, as the cloud-init failure is a consequence of the failure starting the juju-clean-shutdown service, as noted already. We could get a better understanding of what happens on the cloud-init side from the logs tarball generated by running cloud-init collect-logs on the failed machine. For the moment I'm marking the cloud-init task as Incomplete.
Launchpad user Joseph Phillips(manadart) wrote on 2020-06-17T11:42:42.107093+00:00 This service is no longer created on machines using systemd.
No longer relevant, and it looks like this wasn't a cloud-init issue either. Closing.
This bug was originally filed in Launchpad as LP: #1878639
Launchpad details
Launchpad user Joshua Genet(genet022) wrote on 2020-05-14T15:47:43.214957+00:00
AWS does spin up an instance and assigns an IP, but Juju stays stuck in Pending.
There are a number of EC2RoleRequest and EC2Metadata errors in the controller logs.
Here's a link to the logs/artifacts:
https://oil-jenkins.canonical.com/artifacts/5e61db53-50f0-4b82-9bb1-957bd0085d46/index.html