minikube cannot detect the GPUs #2

Open
andakai opened this issue Sep 29, 2023 · 15 comments

andakai commented Sep 29, 2023

I used your image to create a container and installed Minikube inside it. When I run minikube start, the minikube node does not detect any GPU. I am wondering how to fix this. By the way, nvidia-smi works fine inside the container.

docker run --gpus 1 -it --privileged --name ElasticDL -d ghcr.io/ehfd/nvidia-dind:latest
docker exec -it ElasticDL /bin/bash
# install minikube
minikube start
alias kubectl="minikube kubectl --"
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

The result shows that the GPU count is <none>.

When I run a pod, the pod status is:

FailedScheduling
0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
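For anyone debugging the same symptom: the scheduler message only means the node advertises zero nvidia.com/gpu in its allocatable resources. Two checks that narrow down where it fails (a sketch, assuming the standard NVIDIA device plugin DaemonSet with its usual name=nvidia-device-plugin-ds label; it needs a running cluster):

```shell
# Does the node advertise any nvidia.com/gpu in allocatable resources?
kubectl get node minikube -o jsonpath='{.status.allocatable}'; echo

# If the device plugin DaemonSet is running, its logs usually explain
# why no GPUs were found (e.g. NVML/driver not visible in the node container).
kubectl -n kube-system logs -l name=nvidia-device-plugin-ds
```

If the plugin pod is Running but the allocatable map has no nvidia.com/gpu key, the problem is usually that the minikube node container itself was not started with the NVIDIA runtime.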
ehfd (Owner) commented Sep 29, 2023

Perhaps you need k8s-device-plugin?

https://github.com/NVIDIA/k8s-device-plugin

andakai (Author) commented Sep 29, 2023

Thanks, but I have already tried this. I started from the section "Enabling GPU Support in Kubernetes"; I believe this image already covers the steps before that section, but I am not sure if that is right.

root@440403c45a7b:/usr/src# kubectl get pod -A
NAMESPACE     NAME                                   READY   STATUS    RESTARTS        AGE
kube-system   coredns-5d78c9869d-wttcf               1/1     Running   0               2m36s
kube-system   etcd-minikube                          1/1     Running   0               5m27s
kube-system   kube-apiserver-minikube                1/1     Running   0               4m4s
kube-system   kube-controller-manager-minikube       1/1     Running   4 (4m1s ago)    5m23s
kube-system   kube-proxy-dv6gc                       1/1     Running   0               2m36s
kube-system   kube-scheduler-minikube                1/1     Running   0               4m15s
kube-system   nvidia-device-plugin-daemonset-lppbw   1/1     Running   0               2m27s
kube-system   storage-provisioner                    1/1     Running   1 (2m20s ago)   3m8s

Then I run the command: kubectl apply -f test-gpu.yaml
The content of the test-gpu.yaml is :

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

The pod status is Pending, the detail is :

root@440403c45a7b:/usr/src# kubectl describe pod gpu-pod
Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8d7kt (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-8d7kt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  24s   default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

I also tried starting from the "Preparing your GPU Nodes" section, but ran into trouble with systemctl: after installing a systemctl substitute and running systemctl restart docker, the container exits.

ehfd (Owner) commented Sep 29, 2023

andakai (Author) commented Sep 29, 2023

I am trying to implement these, but systemctl is not supported in this image, so I am confused about how to run systemctl restart docker. I tried several ways to install systemctl but still failed to restart Docker. Any suggestion is appreciated :>

ehfd (Owner) commented Sep 29, 2023

(sudo) supervisorctl restart dockerd

andakai (Author) commented Sep 29, 2023

Still does not work.

  1. Restarted a container: docker run --gpus 1 -it --privileged --name ElasticDL -d elasticdl:v1. The image elasticdl:v1 only adds Minikube on top of yours.
  2. Ran docker exec -it ElasticDL /bin/bash.
  3. Configured /etc/docker/daemon.json. There is no file named /etc/containerd/config.toml and no service named containerd, so I skipped that part.
root@c0ac3df639d6:/usr/bin# supervisorctl restart dockerd
dockerd: stopped
dockerd: started
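For reference, the /etc/docker/daemon.json change in the NVIDIA guide registers the NVIDIA runtime with Docker and makes it the default; a minimal sketch (assuming nvidia-container-runtime is installed at its default path):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

With nvidia as the default runtime, the minikube node container should inherit GPU access when it is recreated (minikube delete, then minikube start), not merely restarted.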
  4. Ran minikube start.
root@c0ac3df639d6:/usr/bin# minikube start --force
* minikube v1.31.2 on Ubuntu 22.04 (docker/amd64)
! minikube skips various validations when --force is supplied; this may lead to unexpected behavior
* Using the docker driver based on existing profile
* The "docker" driver should not be used with root privileges. If you wish to continue as root, use --force.
* If you are running minikube within a VM, consider using --driver=none:
*   https://minikube.sigs.k8s.io/docs/reference/drivers/none/
* Tip: To remove this root owned cluster, run: sudo minikube delete
* Starting control plane node minikube in cluster minikube
* Pulling base image ...
* Restarting existing docker container for "minikube" ...
* Preparing Kubernetes v1.27.4 on Docker 24.0.4 ...
* Configuring bridge CNI (Container Networking Interface) ...
  - Using image gcr.io/k8s-minikube/storage-provisioner:v5
* Verifying Kubernetes components...
* Enabled addons: default-storageclass, storage-provisioner
* kubectl not found. If you need it, try: 'minikube kubectl -- get pods -A'
* Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
  5. Ran kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"; the result is below. The GPU column is still <none>, which may be the root cause.
    NAME GPU
    minikube
  6. Ran kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
  7. Applied test-gpu.yaml.
  8. Ran kubectl get pod -A. The gpu-pod status is the same (Pending).
root@c0ac3df639d6:/usr/bin# kubectl get pod -A
NAMESPACE     NAME                                   READY   STATUS    RESTARTS        AGE
default       gpu-pod                                0/1     Pending   0               11m
kube-system   coredns-5d78c9869d-r8x22               1/1     Running   1 (8m7s ago)    17m
kube-system   etcd-minikube                          1/1     Running   1 (8m11s ago)   19m
kube-system   kube-apiserver-minikube                1/1     Running   1 (8m11s ago)   18m
kube-system   kube-controller-manager-minikube       1/1     Running   6 (5m10s ago)   20m
kube-system   kube-proxy-8tl8c                       1/1     Running   1 (8m12s ago)   17m
kube-system   kube-scheduler-minikube                1/1     Running   1 (8m12s ago)   19m
kube-system   nvidia-device-plugin-daemonset-rbppm   1/1     Running   1               13m
kube-system   storage-provisioner                    1/1     Running   3 (4m53s ago)   17m

andakai (Author) commented Sep 29, 2023

I also tried this document: https://github.com/intelligent-machine-learning/dlrover/blob/master/docs/tutorial/gpu_user_guide.md, which is similar to NVIDIA's document. I still get the same result.

root@c0ac3df639d6:/usr/src# kubectl describe pod nvidia-device-plugin-daemonset-r9spv -n kube-system
Name:                 nvidia-device-plugin-daemonset-r9spv
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 minikube/192.168.49.2
Start Time:           Fri, 29 Sep 2023 12:27:34 +0000
Labels:               controller-revision-hash=586d67c5
                      name=nvidia-device-plugin-ds
                      pod-template-generation=2
Annotations:          scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Running
IP:                   10.244.0.9
IPs:
  IP:           10.244.0.9
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:   docker://e3994f9d249dbf33089fced497bb52ba8233f84b53bf4b76c72fc33cc58df1f2
    Image:          nvidia/k8s-device-plugin:1.11
    Image ID:       docker-pullable://nvidia/k8s-device-plugin@sha256:41b3531d338477d26eb1151c15d0bea130d31e690752315a5205d8094439b0a6
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 29 Sep 2023 12:28:36 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      FAIL_ON_INIT_ERROR:  false
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcq6h (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  kube-api-access-tcq6h:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  6m43s  default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-r9spv to minikube
  Normal  Pulling    6m34s  kubelet            Pulling image "nvidia/k8s-device-plugin:1.11"
  Normal  Pulled     6m8s   kubelet            Successfully pulled image "nvidia/k8s-device-plugin:1.11" in 25.768299986s (25.768309412s including waiting)
  Normal  Created    5m43s  kubelet            Created container nvidia-device-plugin-ctr
  Normal  Started    5m40s  kubelet            Started container nvidia-device-plugin-ctr
root@c0ac3df639d6:/usr/src# kubectl describe pod gpu-pod
Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bbdkm (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-bbdkm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  20s (x2 over 5m20s)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

ehfd (Owner) commented Nov 25, 2023

I've updated the NVIDIA Container Toolkit. Please check whether this solves anything.
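With a recent toolkit, wiring the NVIDIA runtime into the inner Docker can be sketched as follows (assuming the nvidia-ctk helper from the NVIDIA Container Toolkit is on PATH; not verified inside this image):

```shell
# Write the nvidia runtime entries into /etc/docker/daemon.json
# and make nvidia the default runtime.
nvidia-ctk runtime configure --runtime=docker --set-as-default

# No systemd in this image, so restart dockerd through supervisor.
supervisorctl restart dockerd
```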

rajat709 commented Jan 1, 2024

Hello @darrenglow, CC @ehfd: how were you able to run Minikube inside a container? I have been trying for a long time, but I keep getting OCI and cgroup errors. Can you help me with this?

ehfd (Owner) commented Jan 5, 2024

I have no idea... Perhaps try KinD?

rajat709 commented Jan 5, 2024

Sure, I'll try it... Thanks!

ehfd (Owner) commented Jan 9, 2024

https://www.substratus.ai/blog/kind-with-gpus/
kubernetes-sigs/kind#3257 (comment)

Both actually look relevant/applicable here too.
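The approach in those links boils down to making nvidia the default Docker runtime, setting accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml, and then creating the KinD cluster with a config roughly like this (a sketch based on the Substratus post, not verified in this image):

```yaml
# kind-gpu.yaml: expose all host GPUs to the KinD node.
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  extraMounts:
    # Mounting /dev/null at this path tells the nvidia runtime
    # to inject "all" visible devices into the node container.
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all
```

Then kind create cluster --config kind-gpu.yaml, followed by installing the NVIDIA device plugin as above.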

rajat709 commented

@ehfd I have tried it, but it didn't work :((

ehfd (Owner) commented Feb 25, 2024

NVIDIA/k8s-device-plugin#332 (comment)
One more resource.

rajat709 commented
Thank you, @ehfd. I’ve already explored that resource, but unfortunately it didn’t work either. I’ve now switched to using virtual machines.
