
[09-12-2018] head K8s release fails on provisioning stage on VMware vSphere #75

Closed
akutz opened this issue Sep 12, 2018 · 1 comment
akutz commented Sep 12, 2018

Problem

The head K8s release fails at the provisioning stage on VMware vSphere.

Project(s) affected

Kubernetes head

Failing stage

Waiting for a worker node

Failed job

Cluster pd17855 - https://gitlab.cncf.ci/cncf/cross-cloud/-/jobs/91730

System information

@figo, please note this is cluster pd17855 with the following host information:

Host               IPv4 address
pd17855-master-1   192.168.1.126
pd17855-master-2   192.168.1.127
pd17855-master-3   192.168.1.121
pd17855-worker-1   192.168.1.123

The hosts are accessible from the VMC jump box via our user accounts on that host. To access a node, use ssh core@IPV4_ADDRESS.

Investigation

The Cross-Cloud provisioner for vSphere failed (deploy log) to deploy Kubernetes cluster pd17855. The reason appears to be the failure of the worker node to join the cluster:

$ KUBECTL_PATH=$(which kubectl) NUM_NODES="$TF_VAR_worker_node_count" KUBERNETES_PROVIDER=local ./validate-cluster/cluster/validate-cluster.sh || true
No resources found.
Waiting for 1 ready nodes. 0 ready nodes, 0 registered. Retrying.
No resources found.
Waiting for 1 ready nodes. 0 ready nodes, 0 registered. Retrying.
No resources found.

The above entries repeat for a while before the GitLab agent gives up and declares the deployment failed. See the link above to the deploy logs for a full dump.
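The same state can also be inspected by hand from one of the masters (assuming an admin kubeconfig is present there); this is just the manual equivalent of what the validation script polls for, and should show whether the worker ever registered and whether any CSRs were submitted:

$ kubectl get nodes -o wide
$ kubectl get csr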

This gist contains the kubelet service log showing that the kubelet service restarted 24 times.
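For anyone who wants to regenerate that log on the worker, something along these lines should work, assuming journal access as the core user:

$ journalctl -u kubelet --no-pager > kubelet.log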

$ grep "Failed with result 'exit-code'." kubelet.log  | wc -l | awk '{print $1}'
24

The reason for the first 15 failures appears to be related to the failure to generate a certificate:

$ grep "error: failed to run Kubelet" kubelet.log | wc -l | awk '{print $1}'
15

Here's one of the first 15 failures:

$ grep -A 2 "error: failed to run Kubelet" kubelet.log | head -n 3
Sep 12 07:32:48 pd17855-worker-1.pd17855.vsphere.local kubelet[1458]: error: failed to run Kubelet: cannot create certificate signing request: Post https://internal-master.pd17855.vsphere.local/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: dial tcp 192.168.1.121:443: getsockopt: connection refused
Sep 12 07:32:48 pd17855-worker-1.pd17855.vsphere.local systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 07:32:48 pd17855-worker-1.pd17855.vsphere.local systemd[1]: kubelet.service: Failed with result 'exit-code'.

For the failures after the first 15, the reason the service exited with a non-zero exit code is not explicitly noted in the log:

Sep 12 14:30:43 pd17855-worker-1.pd17855.vsphere.local systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 14:30:43 pd17855-worker-1.pd17855.vsphere.local systemd[1]: kubelet.service: Failed with result 'exit-code'.

However, it is probably safe to assume the cause is the same: the failure to generate the certificate.
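One way to confirm that assumption is to look at the lines immediately preceding each non-zero exit in the log:

$ grep -B 3 "Failed with result 'exit-code'." kubelet.log | less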

The error indicates the kubelet is unable to request a certificate from one of the masters. The address used, https://internal-master.pd17855.vsphere.local/apis/certificates.k8s.io/v1beta1/certificatesigningrequests, is resolvable from the worker node:

$ host internal-master.pd17855.vsphere.local
internal-master.pd17855.vsphere.local has address 192.168.1.127
internal-master.pd17855.vsphere.local has address 192.168.1.121
internal-master.pd17855.vsphere.local has address 192.168.1.126
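Given the earlier connection-refused error against 192.168.1.121:443, it is also worth checking whether each master currently has something listening on the API port. A minimal check, run from the worker (output not captured here), is to hit /healthz on every resolved address:

$ for ip in 192.168.1.126 192.168.1.127 192.168.1.121; do
    echo -n "$ip: "; curl -sk --connect-timeout 5 "https://$ip/healthz"; echo
  done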

Using the kubelet's certificate and key to curl the aforementioned URL results in an error:

$ sudo curl -w '\n' -k \
  --cert /etc/srv/kubernetes/pki/kubelet.crt \
  --key /etc/srv/kubernetes/pki/kubelet.key \
  https://internal-master.pd17855.vsphere.local/apis/certificates.k8s.io/v1beta1/certificatesigningrequests
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
    
  },
  "status": "Failure",
  "message": "certificatesigningrequests.certificates.k8s.io is forbidden: User \"kubernetes\" cannot list certificatesigningrequests.certificates.k8s.io at the cluster scope",
  "reason": "Forbidden",
  "details": {
    "group": "certificates.k8s.io",
    "kind": "certificatesigningrequests"
  },
  "code": 403
}
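At the time of this curl the API server is clearly reachable, and the request is being rejected by RBAC rather than at the network level. To dig into the RBAC side, something like the following on one of the masters (assuming an admin kubeconfig is available there) would show what the kubernetes user is actually permitted to do and which bindings cover kubelet bootstrapping; the exact binding names depend on how Cross-Cloud provisions the cluster:

$ kubectl auth can-i create certificatesigningrequests.certificates.k8s.io --as=kubernetes
$ kubectl auth can-i list certificatesigningrequests.certificates.k8s.io --as=kubernetes
$ kubectl get clusterrolebindings -o wide | grep -i -e bootstrap -e kubelet -e node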

cc @figo


akutz commented Sep 12, 2018

Hi @taylor / @denverwilliams,

Please send your SSH public keys, and I will provide you temporary access to our jump host so you can directly log into the affected nodes. Thank you.

@lixuna closed this as completed on Feb 26, 2019