
[09-12-2018] head K8s release fails on provisioning stage on VMware vSphere #75

Closed
akutz opened this issue Sep 12, 2018 · 1 comment
akutz commented Sep 12, 2018

Problem

The head K8s release fails at the provisioning stage on VMware vSphere.

Project(s) affected

Kubernetes head

Failing stage

Waiting for a worker node

Failed job

Cluster pd17855 - https://gitlab.cncf.ci/cncf/cross-cloud/-/jobs/91730

System information

@figo, please note this is cluster pd17855 with the following host information:

Host               IPv4 address
pd17855-master-1   192.168.1.126
pd17855-master-2   192.168.1.127
pd17855-master-3   192.168.1.121
pd17855-worker-1   192.168.1.123

The hosts are accessible from the VMC jump box via our user accounts on that host. To access a node, use ssh core@IPV4_ADDRESS.

Investigation

The Cross-Cloud provisioner for vSphere failed (deploy log) to deploy Kubernetes cluster pd17855. The reason appears to be the failure of the worker node to join the cluster:

$ KUBECTL_PATH=$(which kubectl) NUM_NODES="$TF_VAR_worker_node_count" KUBERNETES_PROVIDER=local ./validate-cluster/cluster/validate-cluster.sh || true
No resources found.
Waiting for 1 ready nodes. 0 ready nodes, 0 registered. Retrying.
No resources found.
Waiting for 1 ready nodes. 0 ready nodes, 0 registered. Retrying.
No resources found.

The above entries repeat for a while before the GitLab agent gives up and declares the deployment failed. See the link above to the deploy logs for a full dump.
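The same state can also be inspected by hand from one of the masters (assuming an admin kubeconfig is present there); this is just the manual equivalent of what the validation script polls for, and should show whether the worker ever registered and whether any CSRs were submitted:

$ kubectl get nodes -o wide
$ kubectl get csr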

This gist contains the kubelet service log showing that the kubelet service restarted 24 times.
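For anyone who wants to regenerate that log on the worker, something along these lines should work, assuming journal access as the core user:

$ journalctl -u kubelet --no-pager > kubelet.log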

$ grep "Failed with result 'exit-code'." kubelet.log  | wc -l | awk '{print $1}'
24

The reason for the first 15 failures appears to be related to the failure to generate a certificate:

$ grep "error: failed to run Kubelet" kubelet.log | wc -l | awk '{print $1}'
15

Here's one of the first 15 failures:

$ grep -A 2 "error: failed to run Kubelet" kubelet.log | head -n 3
Sep 12 07:32:48 pd17855-worker-1.pd17855.vsphere.local kubelet[1458]: error: failed to run Kubelet: cannot create certificate signing request: Post https://internal-master.pd17855.vsphere.local/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: dial tcp 192.168.1.121:443: getsockopt: connection refused
Sep 12 07:32:48 pd17855-worker-1.pd17855.vsphere.local systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 07:32:48 pd17855-worker-1.pd17855.vsphere.local systemd[1]: kubelet.service: Failed with result 'exit-code'.

For the failures after the first 15, the reason the service exited with a non-zero exit code is not explicitly noted in the log:

Sep 12 14:30:43 pd17855-worker-1.pd17855.vsphere.local systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 14:30:43 pd17855-worker-1.pd17855.vsphere.local systemd[1]: kubelet.service: Failed with result 'exit-code'.

However, it is probably safe to assume the cause is the same: the failure to generate the certificate.
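One way to confirm that assumption is to look at the lines immediately preceding each non-zero exit in the log:

$ grep -B 3 "Failed with result 'exit-code'." kubelet.log | less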

The error indicates the kubelet is unable to request a certificate from one of the masters. The address used, https://internal-master.pd17855.vsphere.local/apis/certificates.k8s.io/v1beta1/certificatesigningrequests, is resolvable from the worker node:

$ host internal-master.pd17855.vsphere.local
internal-master.pd17855.vsphere.local has address 192.168.1.127
internal-master.pd17855.vsphere.local has address 192.168.1.121
internal-master.pd17855.vsphere.local has address 192.168.1.126
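Given the earlier connection-refused error against 192.168.1.121:443, it is also worth checking whether each master currently has something listening on the API port. A minimal check, run from the worker (output not captured here), is to hit /healthz on every resolved address:

$ for ip in 192.168.1.126 192.168.1.127 192.168.1.121; do
    echo -n "$ip: "; curl -sk --connect-timeout 5 "https://$ip/healthz"; echo
  done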

Using the kubelet's certificate and key to curl the aforementioned URL results in an error:

$ sudo curl -w '\n' -k \
  --cert /etc/srv/kubernetes/pki/kubelet.crt \
  --key /etc/srv/kubernetes/pki/kubelet.key \
  https://internal-master.pd17855.vsphere.local/apis/certificates.k8s.io/v1beta1/certificatesigningrequests
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
    
  },
  "status": "Failure",
  "message": "certificatesigningrequests.certificates.k8s.io is forbidden: User \"kubernetes\" cannot list certificatesigningrequests.certificates.k8s.io at the cluster scope",
  "reason": "Forbidden",
  "details": {
    "group": "certificates.k8s.io",
    "kind": "certificatesigningrequests"
  },
  "code": 403
}
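At the time of this curl the API server is clearly reachable, and the request is being rejected by RBAC rather than at the network level. To dig into the RBAC side, something like the following on one of the masters (assuming an admin kubeconfig is available there) would show what the kubernetes user is actually permitted to do and which bindings cover kubelet bootstrapping; the exact binding names depend on how Cross-Cloud provisions the cluster:

$ kubectl auth can-i create certificatesigningrequests.certificates.k8s.io --as=kubernetes
$ kubectl auth can-i list certificatesigningrequests.certificates.k8s.io --as=kubernetes
$ kubectl get clusterrolebindings -o wide | grep -i -e bootstrap -e kubelet -e node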

cc @figo


akutz commented Sep 12, 2018

Hi @taylor / @denverwilliams,

Please send your SSH public keys, and I will provide you temporary access to our jump host so you can directly log into the affected nodes. Thank you.

@lixuna closed this as completed on Feb 26, 2019