@figo, please note this is cluster pd17855 with the following host information:

| Host | IPv4 Address |
| --- | --- |
| pd17855-master-1 | 192.168.1.126 |
| pd17855-master-2 | 192.168.1.127 |
| pd17855-master-3 | 192.168.1.121 |
| pd17855-worker-1 | 192.168.1.123 |

The hosts are reachable from the VMC jump box using our accounts on that host. To access a node, use `ssh core@IPV4_ADDRESS`.
Investigation
The Cross-Cloud provisioner for vSphere failed to deploy Kubernetes cluster pd17855 (deploy log). The reason appears to be the failure of the worker node to join the cluster:
$ KUBECTL_PATH=$(which kubectl) NUM_NODES="$TF_VAR_worker_node_count" KUBERNETES_PROVIDER=local ./validate-cluster/cluster/validate-cluster.sh ||true
No resources found.
Waiting for 1 ready nodes. 0 ready nodes, 0 registered. Retrying.
No resources found.
Waiting for 1 ready nodes. 0 ready nodes, 0 registered. Retrying.
No resources found.
The above entries repeat for a while before the GitLab agent gives up and declares the deployment failed. See the deploy log linked above for the full dump.
This gist contains the kubelet service log, which shows that the kubelet service restarted 24 times.
$ grep "Failed with result 'exit-code'." kubelet.log | wc -l | awk '{print $1}'
24
The first 15 failures appear to be caused by a failure to generate a certificate:
$ grep "error: failed to run Kubelet" kubelet.log | wc -l | awk '{print $1}'
15
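The two grep pipelines above can be combined into one helper. This is a sketch (the function name is mine, not part of the deploy tooling) that assumes the kubelet service log from the gist has been saved locally as `kubelet.log`:

```shell
# Tally total kubelet exits vs. explicit "failed to run Kubelet" certificate
# failures in a saved kubelet service log, and report how many exits carry no
# explicit reason.
count_kubelet_failures() {
  local log="$1"
  local total cert
  total=$(grep -c "Failed with result 'exit-code'." "$log")
  cert=$(grep -c "error: failed to run Kubelet" "$log")
  echo "total=${total} cert=${cert} unexplained=$((total - cert))"
}
```

Against the log above, this reports total=24 cert=15 unexplained=9, matching the counts from the two grep commands.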
Here's one of the first 15 failures:
$ grep -A 2 "error: failed to run Kubelet" kubelet.log | head -n 3
Sep 12 07:32:48 pd17855-worker-1.pd17855.vsphere.local kubelet[1458]: error: failed to run Kubelet: cannot create certificate signing request: Post https://internal-master.pd17855.vsphere.local/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: dial tcp 192.168.1.121:443: getsockopt: connection refused
Sep 12 07:32:48 pd17855-worker-1.pd17855.vsphere.local systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 07:32:48 pd17855-worker-1.pd17855.vsphere.local systemd[1]: kubelet.service: Failed with result 'exit-code'.
After the first 15 failures, the reason for the service exiting with a non-zero exit code is no longer explicitly noted in the log:
Sep 12 14:30:43 pd17855-worker-1.pd17855.vsphere.local systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 14:30:43 pd17855-worker-1.pd17855.vsphere.local systemd[1]: kubelet.service: Failed with result 'exit-code'.
However, it is probably safe to assume the reason is the same: the failure to generate the certificate. The error indicates the kubelet is unable to request a certificate from one of the masters. The address used, internal-master.pd17855.vsphere.local, is resolvable from the worker node:
$ host internal-master.pd17855.vsphere.local
internal-master.pd17855.vsphere.local has address 192.168.1.127
internal-master.pd17855.vsphere.local has address 192.168.1.121
internal-master.pd17855.vsphere.local has address 192.168.1.126
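Since the name round-robins across three masters, it is worth checking which of them actually accepts connections on port 443. A minimal sketch (the function name is mine; addresses are taken from the host output above), using bash's /dev/tcp pseudo-device:

```shell
# Probe port 443 on each address behind the round-robin DNS name.
# A "closed" result matches the kubelet's "connection refused" symptom.
probe_masters() {
  for ip in "$@"; do
    if timeout 3 bash -c "exec 3<>/dev/tcp/${ip}/443" 2>/dev/null; then
      echo "${ip} open"
    else
      echo "${ip} closed"
    fi
  done
}
probe_masters 192.168.1.126 192.168.1.127 192.168.1.121
```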
Using the kubelet's certificate and key file to curl the aforementioned URL results in an authorization error:
$ sudo curl -w '\n' -k \
--cert /etc/srv/kubernetes/pki/kubelet.crt \
--key /etc/srv/kubernetes/pki/kubelet.key \
https://internal-master.pd17855.vsphere.local/apis/certificates.k8s.io/v1beta1/certificatesigningrequests
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {
},
"status": "Failure",
"message": "certificatesigningrequests.certificates.k8s.io is forbidden: User \"kubernetes\" cannot list certificatesigningrequests.certificates.k8s.io at the cluster scope",
"reason": "Forbidden",
"details": {
"group": "certificates.k8s.io",
"kind": "certificatesigningrequests"
},
"code": 403
}
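The 403 here is notable: unlike the kubelet's earlier "connection refused", this request reached the API server and received an HTTP response; the failure is an RBAC denial for the user "kubernetes". When triaging, a small helper (hypothetical name; uses python3 for JSON parsing) can pull the code and reason out of such a Status body:

```shell
# Read a Kubernetes Status object from stdin and print its HTTP code and reason,
# to distinguish RBAC denials (403 Forbidden) from connectivity failures
# (no response body at all).
status_of() {
  python3 -c 'import json,sys; d=json.load(sys.stdin); print(d.get("code"), d.get("reason"))'
}
```

Piping the curl response above through `status_of` prints `403 Forbidden`.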
Problem

The head K8s release fails at the provisioning stage on VMware vSphere.

Project(s) affected

- Kubernetes head

Failing stage

- Waiting for a worker node

Failed job

- https://gitlab.cncf.ci/cncf/cross-cloud/-/jobs/91730

Cluster

- pd17855
cc @figo