After auto-update, EC2 instance inaccessible (network issues) #2543

cyrus-mc commented Jan 14, 2019

Issue Report

We are using Container Linux as the base image for EKS worker nodes. I built a custom AMI (based on the community Container Linux AMI) that simply mirrors what Amazon does for their Amazon Linux EKS worker AMIs.

After deploying a cluster I experienced issues where worker nodes would go offline: I was unable to SSH into the machines or ping them. Looking at the CloudWatch metrics and the EC2 system log that AWS provides, I did not see anything that stood out.
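For anyone retracing this, the system log I looked at is the console output AWS exposes; it can be pulled with the CLI (the instance ID below is a placeholder):

$ aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text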

I have hit a similar symptom before, when systemd got hosed and socket activation stopped working, so you could not SSH in. In the custom AMI I built I therefore disabled socket activation for sshd and enabled the regular service instead. Yet the problem still exists.
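The change in the AMI build is roughly the following (stock unit names on Container Linux):

# switch sshd from socket activation to a normally-running service
$ sudo systemctl disable sshd.socket
$ sudo systemctl stop sshd.socket
$ sudo systemctl enable sshd.service
$ sudo systemctl start sshd.service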

From what I can tell via the system logs (obtained by snapshotting the root volume and attaching it to another EC2 instance), the OS performs an update and then reboots. After that reboot the box does not re-establish network connectivity, and even after multiple reboots it never becomes accessible over the network.
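For the record, reading the journal from the snapshot goes roughly like this (the device name depends on how the helper instance attaches the volume; on Container Linux the root filesystem is partition 9):

# on the helper instance, after creating a volume from the snapshot and attaching it
$ sudo mount /dev/xvdf9 /mnt
$ sudo journalctl -D /mnt/var/log/journal --no-pager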

AWS support doesn't see anything on their end, so I am left to think that something is going wrong after the update and reboot. It doesn't happen to every machine; some reboot just fine and keep functioning afterwards.

We are moving towards immutable infrastructure, so for the time being (assuming auto-update is my issue) I have disabled update-engine.
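Disabling it amounts to masking the unit, which as far as I know is the usual approach on Container Linux:

$ sudo systemctl stop update-engine.service
$ sudo systemctl mask update-engine.service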

Bug

From what I can tell, after update-engine applies an update and the instance reboots, the EC2 instance never re-establishes network connectivity, so it cannot be reached via SSH.

Container Linux Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1911.5.0
VERSION_ID=1911.5.0
BUILD_ID=2018-12-15-2317
PRETTY_NAME="Container Linux by CoreOS 1911.5.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

Running in AWS region us-west-2.

Expected Behavior

After update-engine applies an update and the node reboots, the instance comes back with working networking and remains reachable via SSH.

Actual Behavior

After the post-update reboot the instance never re-establishes network connectivity and cannot be reached via SSH or ping, even after further reboots.
Reproduction Steps

I'm not sure I have pinpointed the exact steps to reproduce, but roughly:

  1. Boot machine
  2. Let update-engine update and reboot
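To trigger step 2 without waiting for the normal check interval, forcing an update check via update_engine_client should work, though I have not confirmed this reproduces the hang reliably:

$ update_engine_client -check_for_update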

Other Information

Looking at the system logs for the machine in question, I did see the network coming up and DHCP reporting that it received an IP (and this is the IP reported in the EC2 console).

But once services that require network access start, they begin to complain that they cannot reach anything over the network.

Jan 11 14:20:14 ip-10-36-0-151.us-west-2.compute.internal locksmithd[840]: LastCheckedTime=0 Progress=0 CurrentOperation="UPDATE_STATUS_REPORTING_ERROR_EVENT" NewVersion=0.0.0 NewSize=0
Jan 11 14:20:15 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: W0111 14:20:15.152854    1146 cni.go:243] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "650a205dec03b4f6540b8403930ea025bdb0298a24755852aead743f1c8
Jan 11 14:20:15 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:15.170272    1146 cni.go:280] Error deleting network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error
Jan 11 14:20:15 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:15.170713    1146 remote_runtime.go:115] StopPodSandbox "650a205dec03b4f6540b8403930ea025bdb0298a24755852aead743f1c8877fa" from runtime service failed: rpc error: code = Unknown desc = NetworkPlug
Jan 11 14:20:15 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:15.170747    1146 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "650a205dec03b4f6540b8403930ea025bdb0298a24755852aead743f1c8877fa"}
Jan 11 14:20:15 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:15.170804    1146 kuberuntime_manager.go:594] killPodWithSyncResult failed: failed to "KillPodSandbox" for "76f47acb-1514-11e9-9d50-025a40f42cc6" with KillPodSandboxError: "rpc error: code = Unkno
Jan 11 14:20:15 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:15.170827    1146 pod_workers.go:186] Error syncing pod 76f47acb-1514-11e9-9d50-025a40f42cc6 ("vault-0_default(76f47acb-1514-11e9-9d50-025a40f42cc6)"), skipping: failed to "KillPodSandbox" for "76
Jan 11 14:20:17 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: W0111 14:20:17.068134    1146 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s
Jan 11 14:20:20 ip-10-36-0-151.us-west-2.compute.internal systemd-timesyncd[725]: Timed out waiting for reply from 108.59.2.24:123 (0.coreos.pool.ntp.org).
Jan 11 14:20:22 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: W0111 14:20:22.176017    1146 cni.go:243] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "707906e7763a058196519713a239313cf9fca0524a54b689a43ce13aa22
Jan 11 14:20:22 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:22.194193    1146 cni.go:280] Error deleting network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error
Jan 11 14:20:22 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:22.194661    1146 remote_runtime.go:115] StopPodSandbox "707906e7763a058196519713a239313cf9fca0524a54b689a43ce13aa222e043" from runtime service failed: rpc error: code = Unknown desc = NetworkPlug
Jan 11 14:20:22 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:22.194690    1146 kuberuntime_gc.go:153] Failed to stop sandbox "707906e7763a058196519713a239313cf9fca0524a54b689a43ce13aa222e043" before removing: rpc error: code = Unknown desc = NetworkPlugin c
Jan 11 14:20:25 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: W0111 14:20:25.152957    1146 cni.go:243] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "3577199779da3543c584ea52aaed1ca464e249f654e1f39736d69fe74ff
Jan 11 14:20:25 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:25.170729    1146 cni.go:280] Error deleting network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error
Jan 11 14:20:25 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:25.171260    1146 remote_runtime.go:115] StopPodSandbox "3577199779da3543c584ea52aaed1ca464e249f654e1f39736d69fe74ff9817b" from runtime service failed: rpc error: code = Unknown desc = NetworkPlug
Jan 11 14:20:25 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:25.171298    1146 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "3577199779da3543c584ea52aaed1ca464e249f654e1f39736d69fe74ff9817b"}
Jan 11 14:20:25 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:25.171338    1146 kuberuntime_manager.go:594] killPodWithSyncResult failed: failed to "KillPodSandbox" for "f2172f37-15aa-11e9-9d50-025a40f42cc6" with KillPodSandboxError: "rpc error: code = Unkno
Jan 11 14:20:25 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:25.171361    1146 pod_workers.go:186] Error syncing pod f2172f37-15aa-11e9-9d50-025a40f42cc6 ("vault-etcd-4sjmsglsvr_default(f2172f37-15aa-11e9-9d50-025a40f42cc6)"), skipping: failed to "KillPodSa
Jan 11 14:20:27 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: W0111 14:20:27.078846    1146 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider that matches ip: 10.36.0.151
Jan 11 14:20:29 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: W0111 14:20:29.152946    1146 cni.go:243] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "650a205dec03b4f6540b8403930ea025bdb0298a24755852aead743f1c8
Jan 11 14:20:29 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:29.171370    1146 cni.go:280] Error deleting network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error
Jan 11 14:20:29 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:29.171837    1146 remote_runtime.go:115] StopPodSandbox "650a205dec03b4f6540b8403930ea025bdb0298a24755852aead743f1c8877fa" from runtime service failed: rpc error: code = Unknown desc = NetworkPlug
Jan 11 14:20:29 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:29.171873    1146 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "650a205dec03b4f6540b8403930ea025bdb0298a24755852aead743f1c8877fa"}
Jan 11 14:20:29 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:29.171910    1146 kuberuntime_manager.go:594] killPodWithSyncResult failed: failed to "KillPodSandbox" for "76f47acb-1514-11e9-9d50-025a40f42cc6" with KillPodSandboxError: "rpc error: code = Unkno
Jan 11 14:20:29 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:29.171932    1146 pod_workers.go:186] Error syncing pod 76f47acb-1514-11e9-9d50-025a40f42cc6 ("vault-0_default(76f47acb-1514-11e9-9d50-025a40f42cc6)"), skipping: failed to "KillPodSandbox" for "76
Jan 11 14:20:30 ip-10-36-0-151.us-west-2.compute.internal systemd-timesyncd[725]: Timed out waiting for reply from 208.75.88.4:123 (0.coreos.pool.ntp.org).
Jan 11 14:20:36 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: W0111 14:20:36.153090    1146 cni.go:243] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "3577199779da3543c584ea52aaed1ca464e249f654e1f39736d69fe74ff
Jan 11 14:20:36 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:36.171821    1146 cni.go:280] Error deleting network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error
Jan 11 14:20:36 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:36.172281    1146 remote_runtime.go:115] StopPodSandbox "3577199779da3543c584ea52aaed1ca464e249f654e1f39736d69fe74ff9817b" from runtime service failed: rpc error: code = Unknown desc = NetworkPlug
Jan 11 14:20:36 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:36.172317    1146 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "3577199779da3543c584ea52aaed1ca464e249f654e1f39736d69fe74ff9817b"}
Jan 11 14:20:36 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:36.172358    1146 kuberuntime_manager.go:594] killPodWithSyncResult failed: failed to "KillPodSandbox" for "f2172f37-15aa-11e9-9d50-025a40f42cc6" with KillPodSandboxError: "rpc error: code = Unkno
Jan 11 14:20:36 ip-10-36-0-151.us-west-2.compute.internal kubelet-wrapper[1146]: E0111 14:20:36.172380    1146 pod_workers.go:186] Error syncing pod f2172f37-15aa-11e9-9d50-025a40f42cc6 ("vault-etcd-4sjmsglsvr_default(f2172f37-15aa-11e9-9d50-025a40f42cc6)"), skipping: failed to "KillPodSa
Jan 11 14:20:37 ip-10-36-0-151.us-west-2.compute.internal update_engine[756]: E0111 14:20:37.900404   756 libcurl_http_fetcher.cc:243] Unable to get http response code: Failed to connect to public.update.core-os.net port 443: Connection timed out
Jan 11 14:20:37 ip-10-36-0-151.us-west-2.compute.internal update_engine[756]: I0111 14:20:37.900467   756 libcurl_http_fetcher.cc:274] Transfer resulted in an error (0), 0 bytes downloaded
Jan 11 14:20:37 ip-10-36-0-151.us-west-2.compute.internal update_engine[756]: I0111 14:20:37.900477   756 omaha_request_action.cc:592] Omaha request response:
Jan 11 14:20:37 ip-10-36-0-151.us-west-2.compute.internal update_engine[756]: I0111 14:20:37.900485   756 action_processor.cc:65] ActionProcessor::ActionComplete: finished last action of type OmahaRequestAction
Jan 11 14:20:37 ip-10-36-0-151.us-west-2.compute.internal update_engine[756]: I0111 14:20:37.900490   756 action_processor.cc:73] ActionProcessor::ActionComplete: finished last action of type OmahaRequestAction
Jan 11 14:20:37 ip-10-36-0-151.us-west-2.compute.internal update_engine[756]: I0111 14:20:37.900494   756 update_attempter.cc:290] Processing Done.
Jan 11 14:20:37 ip-10-36-0-151.us-west-2.compute.internal update_engine[756]: I0111 14:20:37.900501   756 update_attempter.cc:294] Error event sent.
Jan 11 14:20:37 ip-10-36-0-151.us-west-2.compute.internal update_engine[756]: I0111 14:20:37.900513   756 update_check_scheduler.cc:74] Next update check in 46m28s
Jan 11 14:20:37 ip-10-36-0-151.us-west-2.compute.internal locksmithd[840]: LastCheckedTime=0 Progress=0 CurrentOperation="UPDATE_STATUS_IDLE" NewVersion=0.0.0 NewSize=0
Jan 11 14:20:41 ip-10-36-0-151.us-west-2.compute.internal systemd-timesyncd[725]: Timed out waiting for reply from 64.62.206.99:123 (0.coreos.pool.ntp.org).

As you can see above, systemd-timesyncd times out waiting for a reply from the NTP servers, and the kubelet throws a stream of network errors.

As mentioned, I am building my own AMI (based on the community Container Linux AMI) via Packer and staging the files needed to run EKS/k8s. My first thought was that I wasn't cleaning up some temporary file during the build process, for example that I hadn't cleared the DHCP lease cache at build time and the machine was trying to renew a lease for an IP it didn't own. But the box boots successfully the first time, and the system logs show it obtaining the correct IP.
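For completeness, the way I checked the addressing on first boot was along these lines (the interface name is assumed to be eth0 on these instances):

$ networkctl status eth0
$ journalctl -b -u systemd-networkd --no-pager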
