
Kubectl top pods doesn't work with cri-o on K8s v1.30.0-rc.2 #8034

Open
uablrek opened this issue Apr 17, 2024 · 24 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


uablrek commented Apr 17, 2024

What happened?

When using K8s v1.30.0-rc.2 and cri-o, pod metrics never become available:

# kubectl top pods
error: Metrics not available for pod default/tserver-6854544b4b-fjmmg, age: 8m4.082260059s

Tested with crio 1.28.1 and 1.29.2 with the same result.

Metrics-server v0.7.1 is used, and seems to run without problems:

# kubectl get pod -n kube-system   metrics-server-d994c478f-sxbf4
NAME                             READY   STATUS    RESTARTS   AGE
metrics-server-d994c478f-sxbf4   1/1     Running   0          12m

No errors in the metrics-server logs.
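As an aside, the age in the error message is a Go-style duration string. If you need to script around it (e.g. to alert when pods stay metric-less past a threshold), a minimal parser sketch — a hypothetical helper, not part of kubectl or metrics-server:

```python
import re

# Seconds per unit; "ms" must be tried before "m" in the regex alternation.
_UNITS = {"h": 3600.0, "ms": 0.001, "m": 60.0, "s": 1.0}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(h|ms|m|s)")

def parse_go_duration(text: str) -> float:
    """Parse a Go-style duration such as '8m4.082260059s' into seconds."""
    return sum(float(value) * _UNITS[unit] for value, unit in _TOKEN.findall(text))

print(parse_go_duration("8m4.082260059s"))  # 484.082260059
```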

What did you expect to happen?

"kubectl top pods" should work. It works with K8s v1.29.x.

How can we reproduce it (as minimally and precisely as possible)?

  • Install K8s v1.30.0-rc.2 (or K8s built on master)
  • Install the metrics-server. I use the components.yaml method
  • Wait until the metrics-server is running 1/1, and then wait a while longer
  • Do "kubectl top pods" in some namespace with pods
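The steps above can be sketched as commands (assuming a v1.30.0 cluster with cri-o is already up; the sleep is arbitrary slack for the first scrape):

```shell
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl -n kube-system rollout status deployment/metrics-server
sleep 60    # leave time for at least one scrape interval
kubectl top pods
```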

Anything else we need to know?

It works with K8s v1.29.3 and cri-o.
It also works with K8s v1.30.0-rc.2 and containerd 1.7.11 and 2.0.0-beta.2.

CRI-O and Kubernetes version

See above

OS version

Linux 6.8.0

Additional environment details (AWS, VirtualBox, physical, etc.)

uablrek added the kind/bug label Apr 17, 2024

uablrek commented Apr 18, 2024

The problem is the same with K8s v1.30.0, released yesterday Apr 17.


saschagrunert commented Apr 18, 2024

@uablrek thank you for the report! Can I use the metrics server with hack/local-up-cluster.sh?

When using hack/local-up-cluster.sh and installing the metrics server like that:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Then I have to edit the deployment metrics-server to add --kubelet-insecure-tls to the container args.
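For reference, the flag can be added without hand-editing the deployment (a sketch; assumes the stock deployment name from components.yaml):

```shell
kubectl -n kube-system patch deployment metrics-server --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'
```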

But then I still see:

> kubectl logs -f metrics-server-d994c478f-9zs5k

E0418 08:09:07.743449       1 scraper.go:149] "Failed to scrape node" err="request failed, status: \"403 Forbidden\"" node="127.0.0.1"
I0418 08:09:12.047455       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0418 08:09:22.049166       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0418 08:09:22.738901       1 scraper.go:149] "Failed to scrape node" err="request failed, status: \"403 Forbidden\"" node="127.0.0.1"
I0418 08:09:32.047754       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"

Do I have to adjust anything else? 🤔


uablrek commented Apr 18, 2024

Not sure with hack/local-up-cluster.sh. But I had to add a number of options to the api-server. I copied from kubernetes-sigs/metrics-server#1014. Here is my commit Nordix/xcluster@c0ff147. It's for xcluster, but you can take it as a hint 😃
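For context, these are the aggregation-layer flags that kube-apiserver typically needs for an extension API server like metrics-server (a sketch; the certificate paths are illustrative, and the linked issue and commit have the exact details):

```shell
kube-apiserver \
  --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt \
  --requestheader-allowed-names=front-proxy-client \
  --requestheader-username-headers=X-Remote-User \
  --requestheader-group-headers=X-Remote-Group \
  --requestheader-extra-headers-prefix=X-Remote-Extra- \
  --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt \
  --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key \
  --enable-aggregator-routing=true \
  ...
```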


uablrek commented Apr 18, 2024

I tried it in KinD, and the metrics-server starts ok. But I used K8s v1.29.1, and KinD uses containerd so "kubectl top pods -n kube-system" worked


uablrek commented Apr 18, 2024

Naturally @aojea has made a gist on how to use cri-o in KinD 😄


uablrek commented Apr 18, 2024

Oh, you had already commented that. I have to go now, but if I get time I will check later whether I see the error with KinD.


uablrek commented Apr 18, 2024

I made a mistake testing crio v1.29.2: I used my old crio config, which uses crun instead of crio-crun. I corrected it, but it doesn't seem to matter.

@haircommander

What is your crio and kubelet config? I wonder if CRI stats was turned on (where cri-o doesn't currently report metrics, but will in 1.30.0).


uablrek commented Apr 19, 2024

No kubelet config, but started with:

kubelet --address=:: \
  --container-runtime-endpoint=unix:///var/run/crio/crio.sock \
  --image-service-endpoint=unix:///var/run/crio/crio.sock \
  --node-ip=192.168.1.2,fd00::192.168.1.2 --register-node=true \
  --kubeconfig /etc/kubernetes/kubeconfig.token \
  --cluster-dns=192.168.1.2 --cluster-domain=xcluster \
  --runtime-cgroups=/ --kubelet-cgroups=/

Crio config: crio.conf.txt

Please note that it works with the same start/config on K8s v1.29.3. And it also works with K8s v1.30.0 and containerd.

A difference is that I use crun for cri-o, and runc for containerd. I will see if I can check that


pycgo commented Apr 19, 2024

Try this:

kubectl -n kube-system get apiservices.apiregistration.k8s.io |grep metrics-server
v1beta1.metrics.k8s.io kube-system/metrics-server

Just check that the API service for metrics exists; if not, reinstall metrics-server.


uablrek commented Apr 19, 2024

# kubectl -n kube-system get apiservices.apiregistration.k8s.io |grep metrics-server
v1beta1.metrics.k8s.io                  kube-system/metrics-server   True        2m2s

Please note that since "kubectl top pods" works with K8s v1.30.0 and containerd, it is unlikely that something is wrong with my K8s config. The metrics-server is running and ready with K8s v1.30.0 and crio 1.29.2, but getting stats for pods fails. "kubectl top nodes" always works fine.

# kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0
# containerd -v
containerd github.com/containerd/containerd v1.7.11 64b8a811b07ba6288238eefc14d898ee0b5b99ba
# kubectl top pods
NAME                       CPU(cores)   MEMORY(bytes)   
tserver-6854544b4b-8lv6j   0m           3Mi             
tserver-6854544b4b-b8qv9   0m           3Mi             
tserver-6854544b4b-kcxcs   0m           3Mi             
tserver-6854544b4b-vfq9x   0m           3Mi             
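One way to narrow this down is to bypass metrics-server and read the kubelet's resource metrics endpoint directly, since that is what metrics-server scrapes (node name below is a placeholder):

```shell
# Pod-level samples (pod_cpu_usage_seconds_total, pod_memory_working_set_bytes)
# should appear here; if they are missing, the problem is in the
# kubelet/runtime stats pipeline rather than in metrics-server.
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics/resource | grep ^pod_
```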


uablrek commented Apr 19, 2024

# kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0
# crio version
INFO[2024-04-19 07:13:00.852114859Z] Starting CRI-O, version: 1.29.2, git: d317b5dc918bbfbc78481072a0d93e572aa8d0e8(clean) 
Version:        1.29.2
GitCommit:      d317b5dc918bbfbc78481072a0d93e572aa8d0e8
GitCommitDate:  2024-02-22T19:23:38Z
GitTreeState:   clean
BuildDate:      1970-01-01T00:00:00Z
GoVersion:      go1.21.1
Compiler:       gc
Platform:       linux/amd64
Linkmode:       static
BuildTags:      
  static
  netgo
  osusergo
  exclude_graphdriver_btrfs
  exclude_graphdriver_devicemapper
  seccomp
  apparmor
  selinux
LDFlags:          unknown
SeccompEnabled:   true
AppArmorEnabled:  false
# kubectl get pod -n kube-system  -l k8s-app=metrics-server
NAME                             READY   STATUS    RESTARTS   AGE
metrics-server-d994c478f-mwkjf   1/1     Running   0          2m59s
# kubectl top pods
error: metrics not available yet
# (for some minutes, then...)
# kubectl top pods
error: Metrics not available for pod default/tserver-6854544b4b-2nmgw, age: 2m36.007635117s
# (forever. well at least >10m)


pycgo commented Apr 19, 2024

(quoting the cri-o version and "kubectl top pods" output from the previous comment)

I don't see the API for metrics in this cri-o setup. What does the following show?

kubectl -n kube-system get apiservices.apiregistration.k8s.io |grep m


uablrek commented Apr 19, 2024

# kubectl -n kube-system get apiservices.apiregistration.k8s.io |grep m
v1.                                     Local                        True        9m14s
v1.admissionregistration.k8s.io         Local                        True        9m14s
v1.apiextensions.k8s.io                 Local                        True        9m14s
v1.apps                                 Local                        True        9m14s
v1.authentication.k8s.io                Local                        True        9m14s
v1.authorization.k8s.io                 Local                        True        9m14s
v1.autoscaling                          Local                        True        9m14s
v1.batch                                Local                        True        9m14s
v1.certificates.k8s.io                  Local                        True        9m14s
v1.coordination.k8s.io                  Local                        True        9m14s
v1.discovery.k8s.io                     Local                        True        9m14s
v1.events.k8s.io                        Local                        True        9m14s
v1.flowcontrol.apiserver.k8s.io         Local                        True        9m14s
v1.networking.k8s.io                    Local                        True        9m14s
v1.node.k8s.io                          Local                        True        9m14s
v1.policy                               Local                        True        9m14s
v1.rbac.authorization.k8s.io            Local                        True        9m14s
v1.scheduling.k8s.io                    Local                        True        9m14s
v1.storage.k8s.io                       Local                        True        9m14s
v1alpha1.admissionregistration.k8s.io   Local                        True        9m14s
v1alpha1.authentication.k8s.io          Local                        True        9m14s
v1alpha1.internal.apiserver.k8s.io      Local                        True        9m14s
v1alpha1.networking.k8s.io              Local                        True        9m14s
v1alpha1.storage.k8s.io                 Local                        True        9m14s
v1alpha2.resource.k8s.io                Local                        True        9m14s
v1beta1.admissionregistration.k8s.io    Local                        True        9m14s
v1beta1.authentication.k8s.io           Local                        True        9m14s
v1beta1.metrics.k8s.io                  kube-system/metrics-server   True        9m9s
v1beta3.flowcontrol.apiserver.k8s.io    Local                        True        9m14s
v2.autoscaling                          Local                        True        9m14s


pycgo commented Apr 19, 2024

I tried it and it works:

[root@master ~]# kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0
[root@master ~]# crio version
INFO[2024-04-19 15:45:34.034627675+08:00] Starting CRI-O, version: 1.28.4, git: c5fc2a463053cf988db2aebe9b762700484922e5(clean) 
Version:        1.28.4
GitCommit:      c5fc2a463053cf988db2aebe9b762700484922e5
GitCommitDate:  2024-02-22T19:17:55Z
GitTreeState:   clean
BuildDate:      1970-01-01T00:00:00Z
GoVersion:      go1.20.10
Compiler:       gc
Platform:       linux/amd64
Linkmode:       static
BuildTags:      
  static
  netgo
  osusergo
  exclude_graphdriver_btrfs
  exclude_graphdriver_devicemapper
  seccomp
  apparmor
  selinux
LDFlags:          unknown
SeccompEnabled:   true
AppArmorEnabled:  false

[root@master ~]# kubectl top pods -n kube-system 
NAME                              CPU(cores)   MEMORY(bytes)   
coredns-7db6d8ff4d-7ks25          3m           14Mi            
coredns-7db6d8ff4d-gf8s4          2m           12Mi            
etcd-master                       21m          50Mi            
kube-apiserver-master             41m          346Mi           
kube-controller-manager-master    16m          60Mi            
kube-proxy-7j8b2                  1m           17Mi            
kube-scheduler-master             2m           23Mi            
metrics-server-7bcfbbf584-z5fw2   3m           19Mi            
[root@master ~]# kubectl top nodes
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
master   267m         6%     1359Mi          19%       
[root@master ~]# 


uablrek commented Apr 19, 2024

I have now installed K8s v1.30.0 with kubeadm, and I still get this problem (containerd works).


uablrek commented Apr 23, 2024

I should mention that I am only testing, not running a production cluster, so there is no urgency on my part.


lbogdan commented Apr 25, 2024

Works fine here in a 1-node test cluster bootstrapped with kubeadm, with cri-o 1.29.3, kubernetes 1.30.0, and metrics-server v0.7.1 (chart version 3.12.1):

$ crio version
INFO[2024-04-25 07:34:27.402979362Z] Starting CRI-O, version: 1.29.3, git: 12c618780c42414d92d6a8dc8d09c16337668eb2(clean)
Version:        1.29.3
GitCommit:      12c618780c42414d92d6a8dc8d09c16337668eb2
GitCommitDate:  2024-04-19T14:33:22Z
GitTreeState:   clean
BuildDate:      1970-01-01T00:00:00Z
GoVersion:      go1.21.7
Compiler:       gc
Platform:       linux/amd64
Linkmode:       static
BuildTags:
  static
  netgo
  osusergo
  exclude_graphdriver_btrfs
  exclude_graphdriver_devicemapper
  seccomp
  apparmor
  selinux
LDFlags:          unknown
SeccompEnabled:   true
AppArmorEnabled:  false

$ kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0

$ helm list -n kube-system
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
metrics-server  kube-system     2               2024-04-25 07:32:03.844304125 +0000 UTC deployed        metrics-server-3.12.1   0.7.1

$ kubectl top node
NAME          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ubuntu-test   41m          2%     1497Mi          39%

$ kubectl top pod -A
NAMESPACE     NAME                                      CPU(cores)   MEMORY(bytes)
default       nginx                                     0m           3Mi
kube-system   calico-kube-controllers-ddf655445-6m4g9   1m           10Mi
kube-system   calico-node-tzwb5                         8m           114Mi
kube-system   coredns-7db6d8ff4d-9wc2f                  1m           12Mi
kube-system   coredns-7db6d8ff4d-zs9cb                  1m           12Mi
kube-system   etcd-ubuntu-test                          7m           36Mi
kube-system   kube-apiserver-ubuntu-test                17m          227Mi
kube-system   kube-controller-manager-ubuntu-test       4m           44Mi
kube-system   kube-proxy-jzb9c                          1m           11Mi
kube-system   kube-scheduler-ubuntu-test                1m           16Mi
kube-system   metrics-server-68cfccbdf6-8vh97           1m           15Mi


uablrek commented Apr 25, 2024

If you can't reproduce it, please close this issue as "not reproducible". It may be a problem with my environment, but it is very clear that something changed in K8s v1.30, so others may encounter this problem in the future.

My aim was to test the horizontal autoscaler, and I can do that with containerd. No problem.


uablrek commented Apr 28, 2024

There seems to be a general "get resources" problem with the combination K8s v1.30 + cri-o. Eviction test with:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod-evicted
spec:
  containers:
  - name: alpine
    image: alpine
    imagePullPolicy: IfNotPresent
    command: ["/bin/sh", "-c", "sleep 10; dd if=/dev/zero of=file bs=1M count=10; sleep 10000"]
    resources:
      limits:
        ephemeral-storage: 5Mi
      requests:
        ephemeral-storage: 5Mi

doesn't work. The pod runs forever.

However, the pod gets evicted with K8s v1.29.4 + cri-o, or with containerd and any K8s version (including master).
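To reproduce, apply the manifest and watch the pod; with a working stats pipeline it should leave Running once the 10M file exceeds the 5Mi ephemeral-storage limit (the manifest file name below is illustrative):

```shell
kubectl apply -f test-pod-evicted.yaml
kubectl get pod test-pod-evicted -w
```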

@saschagrunert

@uablrek I can't reproduce that with the mentioned pod using CRI-O d5666af and Kubernetes c6b6163e2e8:

> k get pods
NAME               READY   STATUS                   RESTARTS   AGE
test-pod-evicted   0/1     ContainerStatusUnknown   1          4m10s

And the kubelet logs:

I0429 11:13:43.162982  735336 status_manager.go:984] "Pod status is inconsistent with cached status for pod, a reconciliation should be triggered" pod="default/test-pod-evicted" statusDiff=<
	  &v1.PodStatus{
	- 	Phase:             "Running",
	+ 	Phase:             "Failed",
	  	Conditions:        {{Type: "PodReadyToStartContainers", Status: "True", LastTransitionTime: {Time: s"2024-04-29 11:12:22 +0200 CEST"}}, {Type: "Initialized", Status: "True", LastTransitionTime: {Time: s"2024-04-29 11:12:16 +0200 CEST"}}, {Type: "Ready", Status: "True", LastTransitionTime: {Time: s"2024-04-29 11:12:22 +0200 CEST"}}, {Type: "ContainersReady", Status: "True", LastTransitionTime: {Time: s"2024-04-29 11:12:22 +0200 CEST"}}, ...},
	- 	Message:           "",
	+ 	Message:           "Pod ephemeral local storage usage exceeds the total limit of containers 5Mi. ",
	- 	Reason:            "",
	+ 	Reason:            "Evicted",
	  	NominatedNodeName: "",
	  	HostIP:            "127.0.0.1",
	  	... // 2 identical fields
	  	PodIPs:                     {{IP: "10.88.0.3"}, {IP: "2001:db8:4860::3"}},
	  	StartTime:                  s"2024-04-29 11:12:16 +0200 CEST",
	- 	InitContainerStatuses:      nil,
	+ 	InitContainerStatuses:      []v1.ContainerStatus{},
	  	ContainerStatuses:          {{Name: "alpine", State: {Running: &{StartedAt: {Time: s"2024-04-29 11:12:21 +0200 CEST"}}}, Ready: true, Image: "docker.io/library/alpine:latest", ...}},
	  	QOSClass:                   "BestEffort",
	- 	EphemeralContainerStatuses: nil,
	+ 	EphemeralContainerStatuses: []v1.ContainerStatus{},
	  	Resize:                     "",
	  	ResourceClaimStatuses:      nil,
	  }
 


uablrek commented Apr 29, 2024

It must be something in my environment then, but I can't figure out what. Since other combinations work, it must be something that changed in K8s v1.30.0, though it may be unrelated to crio.

My run differs from yours:

# kubectl get pods
NAME               READY   STATUS   RESTARTS   AGE
test-pod-evicted   0/1     Error    0          3m36s

It gets evicted after a minute or so, goes to "Error" and never restarts.

BTW, I built crio from source (on master) and get the same problem. I also couldn't start containers at all with "crun" (my default), but "runc" worked. My guess is that the new crio needs a newer crun, but I didn't investigate further.


aojea commented Apr 29, 2024

Do you use a non-standard crio socket path? It seems this patch fixes this problem for cadvisor metrics: https://github.com/kubernetes/kubernetes/pull/118704/files


uablrek commented Apr 29, 2024

> Do you use a non-standard crio socket path?

I don't think so:

# In /etc/crio/crio.conf
listen = "/var/run/crio/crio.sock"
# And kubelet started with:
--container-runtime-endpoint=unix:///var/run/crio/crio.sock
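A quick way to confirm that the socket path is the one cri-o actually serves (crictl ships alongside the kubelet tooling):

```shell
ls -l /var/run/crio/crio.sock
crictl --runtime-endpoint unix:///var/run/crio/crio.sock version
```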
