After auto-update, EC2 instance inaccessible (network issues) #2543
We are using Container Linux as the base image for EKS worker nodes. I built a custom AMI (based off the community Container Linux AMI) which simply mirrors what Amazon does for its Amazon Linux EKS worker AMIs.
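For context, the base image is just the public Container Linux AMI; the lookup below is roughly what my Packer template's `source_ami_filter` boils down to (the owner ID is CoreOS's AWS account as I recall it — double-check before relying on it):

```sh
# Find the newest stable Container Linux HVM AMI in us-west-2.
aws ec2 describe-images --region us-west-2 --owners 595879546273 \
  --filters "Name=name,Values=CoreOS-stable-*" "Name=virtualization-type,Values=hvm" \
  --query 'sort_by(Images,&CreationDate)[-1].{Id:ImageId,Name:Name}'
```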
After deploying a cluster I was experiencing issues in which worker nodes would go offline. I was unable to SSH into the machines or ping them. Looking at the CloudWatch metrics and the EC2 system log that AWS provides, I did not see anything that stood out.
Previously I had a similar issue that was related to systemd getting hosed and socket activation not working, so you could not SSH in. In the custom AMI I built, I disabled socket activation for sshd and instead enabled the service. Yet the problem still exists.
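For reference, this is roughly what that change amounts to on a running box (unit names as they ship on stock Container Linux, going from memory; in the AMI I bake the equivalent in at build time):

```sh
# Switch sshd from socket activation to a plain long-lived daemon.
sudo systemctl disable --now sshd.socket    # stop and disable the socket unit
sudo systemctl enable --now sshd.service    # run sshd as a regular service
```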
From what I can see via the system logs (by snapshotting the root volume and attaching it to another EC2 instance), the OS performs an update and then reboots. After that reboot the box does not re-establish network connectivity. Even after multiple reboots the box never becomes accessible over the network.
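The forensic procedure, in case it is useful to anyone (all IDs below are placeholders):

```sh
# Snapshot the wedged instance's root volume, clone it, and attach the clone
# to a healthy helper instance in the same AZ.
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
  --description "root of inaccessible EKS worker"
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 \
  --availability-zone us-west-2a
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 \
  --instance-id i-0aaaabbbbccccdddd --device /dev/xvdf
# On the helper instance (device naming may differ on NVMe instance types):
sudo mount /dev/xvdf9 /mnt   # partition 9 is ROOT in the Container Linux disk layout
sudo journalctl --directory /mnt/var/log/journal --no-pager
```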
AWS support doesn't see anything from their end, so I am left to think that something is happening after the update and reboot. This doesn't happen to every machine; some reboot just fine and keep functioning afterwards.
We are moving towards immutable infrastructure, so for the time being (assuming that auto-update is my issue) I have disabled update-engine.
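Concretely, disabling it looks like this (masking rather than just disabling, so nothing re-enables it later; locksmithd only coordinates the update reboots, but I masked it as well):

```sh
sudo systemctl stop update-engine.service locksmithd.service
sudo systemctl mask update-engine.service locksmithd.service
```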
From what I can tell, after update-engine kicks off an update and reboots, the EC2 instance never re-establishes connectivity to allow access via SSH.
Container Linux Version
Running in AWS region us-west-2.
Not sure I have pinpointed the exact steps to reproduce.
Looking at the system logs for the machine in question, I did see the network starting and DHCP reporting that it received an IP (and this is the IP reported in the EC2 console).
But once services that require network access start, they begin to fail because they cannot reach anything over the network.
As the logs show, systemd-timesyncd times out waiting to connect to the NTP server. In addition, kubelet begins throwing a bunch of network errors.
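For anyone digging through their own journals, this is the sort of query I ran against the attached volume (kubelet runs as a systemd unit in my AMI):

```sh
# Pull the relevant units' logs from the mounted root volume's journal.
journalctl --directory /mnt/var/log/journal \
  -u systemd-networkd -u systemd-timesyncd -u kubelet --no-pager
```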
As mentioned, I am building my own AMI (based off the community Container Linux AMI) via Packer and staging the files needed to run EKS/k8s. My thought was that maybe I wasn't cleaning up some temporary file during the build process. For example, I immediately wondered whether I had failed to clean up the DHCP cache during the build, and the machine was trying to renew a lease for an IP it didn't own. But the box boots successfully the first time, and in the system logs I can see it obtaining the correct IP.
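If it helps anyone audit their own builds, below is the kind of build-time cleanup I have in mind — a sketch of my guesses, not a confirmed fix. Worth noting: as far as I can tell systemd-networkd keeps its DHCP leases under /run, which is a tmpfs, so a stale lease shouldn't survive into an AMI anyway; the machine-id is the one piece of persisted identity that does feed into networkd's default DHCP DUID.

```sh
# Final Packer shell-provisioner step (sketch; paths are standard systemd
# locations, and none of this is confirmed to fix the issue above).
sudo truncate -s 0 /etc/machine-id          # empty file => regenerated on first boot;
                                            # networkd derives its default DHCP DUID from it
sudo journalctl --rotate --vacuum-time=1s   # drop build-time journal entries
```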