minikube cannot detect the GPUs #2

Open
andakai opened this issue Sep 29, 2023 · 15 comments

andakai commented Sep 29, 2023

I used your image to create a container and installed Minikube inside it. When I run minikube start, the minikube node does not detect any GPU. I am wondering how to fix this. By the way, nvidia-smi works fine inside the container.

docker run --gpus 1 -it --privileged --name ElasticDL -d ghcr.io/ehfd/nvidia-dind:latest
docker exec -it ElasticDL /bin/bash
# install minikube
minikube start
alias kubectl="minikube kubectl --"
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

The result shows that the GPU count is <none>.

When I run a pod, the pod status is:

FailedScheduling
0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
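For anyone debugging the same symptom: the scheduler message only means the node advertises zero nvidia.com/gpu in its allocatable resources. Two checks that narrow down where it fails (a sketch, assuming the standard NVIDIA device plugin DaemonSet with its usual name=nvidia-device-plugin-ds label; it needs a running cluster):

```shell
# Does the node advertise any nvidia.com/gpu in allocatable resources?
kubectl get node minikube -o jsonpath='{.status.allocatable}'; echo

# If the device plugin DaemonSet is running, its logs usually explain
# why no GPUs were found (e.g. NVML/driver not visible in the node container).
kubectl -n kube-system logs -l name=nvidia-device-plugin-ds
```

If the plugin pod is Running but the allocatable map has no nvidia.com/gpu key, the problem is usually that the minikube node container itself was not started with the NVIDIA runtime.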
ehfd (Owner) commented Sep 29, 2023

Perhaps you need k8s-device-plugin?

https://github.com/NVIDIA/k8s-device-plugin

andakai (Author) commented Sep 29, 2023

Thanks, but I have already tried this. I started from the section "Enabling GPU Support in Kubernetes"; I believe this image already covers the steps before that section, but I am not sure if that is right.

root@440403c45a7b:/usr/src# kubectl get pod -A
NAMESPACE     NAME                                   READY   STATUS    RESTARTS        AGE
kube-system   coredns-5d78c9869d-wttcf               1/1     Running   0               2m36s
kube-system   etcd-minikube                          1/1     Running   0               5m27s
kube-system   kube-apiserver-minikube                1/1     Running   0               4m4s
kube-system   kube-controller-manager-minikube       1/1     Running   4 (4m1s ago)    5m23s
kube-system   kube-proxy-dv6gc                       1/1     Running   0               2m36s
kube-system   kube-scheduler-minikube                1/1     Running   0               4m15s
kube-system   nvidia-device-plugin-daemonset-lppbw   1/1     Running   0               2m27s
kube-system   storage-provisioner                    1/1     Running   1 (2m20s ago)   3m8s

Then I run the command: kubectl apply -f test-gpu.yaml
The content of the test-gpu.yaml is :

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

The pod status is Pending, the detail is :

root@440403c45a7b:/usr/src# kubectl describe pod gpu-pod
Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8d7kt (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-8d7kt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  24s   default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

I also tried starting from the "Preparing your GPU Nodes" section, but ran into trouble with systemctl: after installing a systemctl substitute and running systemctl restart docker, the container exits.

ehfd (Owner) commented Sep 29, 2023

andakai (Author) commented Sep 29, 2023

I am trying to implement these, but systemctl is not supported in this image, so I am confused about how to run systemctl restart docker. I tried several ways to install systemctl but still failed to restart Docker. Any suggestion is appreciated :>

ehfd (Owner) commented Sep 29, 2023

(sudo) supervisorctl restart dockerd

andakai (Author) commented Sep 29, 2023

Still does not work.

  1. Restarted a container: docker run --gpus 1 -it --privileged --name ElasticDL -d elasticdl:v1. The image elasticdl:v1 only adds Minikube on top of yours.
  2. Ran docker exec -it ElasticDL /bin/bash.
  3. Configured /etc/docker/daemon.json. There is no file named /etc/containerd/config.toml and no service named containerd, so I skipped that part.
root@c0ac3df639d6:/usr/bin# supervisorctl restart dockerd
dockerd: stopped
dockerd: started
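For reference, the /etc/docker/daemon.json change in the NVIDIA guide registers the NVIDIA runtime with Docker and makes it the default; a minimal sketch (assuming nvidia-container-runtime is installed at its default path):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

With nvidia as the default runtime, the minikube node container should inherit GPU access when it is recreated (minikube delete, then minikube start), not merely restarted.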
  4. Ran minikube start.
root@c0ac3df639d6:/usr/bin# minikube start --force
* minikube v1.31.2 on Ubuntu 22.04 (docker/amd64)
! minikube skips various validations when --force is supplied; this may lead to unexpected behavior
* Using the docker driver based on existing profile
* The "docker" driver should not be used with root privileges. If you wish to continue as root, use --force.
* If you are running minikube within a VM, consider using --driver=none:
*   https://minikube.sigs.k8s.io/docs/reference/drivers/none/
* Tip: To remove this root owned cluster, run: sudo minikube delete
* Starting control plane node minikube in cluster minikube
* Pulling base image ...
* Restarting existing docker container for "minikube" ...
* Preparing Kubernetes v1.27.4 on Docker 24.0.4 ...
* Configuring bridge CNI (Container Networking Interface) ...
  - Using image gcr.io/k8s-minikube/storage-provisioner:v5
* Verifying Kubernetes components...
* Enabled addons: default-storageclass, storage-provisioner
* kubectl not found. If you need it, try: 'minikube kubectl -- get pods -A'
* Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
  5. Ran kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"; the result is below. The GPU column is still <none>, which may be the root cause.
    NAME GPU
    minikube
  6. Ran kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
  7. Applied test-gpu.yaml.
  8. Ran kubectl get pod -A. The gpu-pod status is the same (Pending).
root@c0ac3df639d6:/usr/bin# kubectl get pod -A
NAMESPACE     NAME                                   READY   STATUS    RESTARTS        AGE
default       gpu-pod                                0/1     Pending   0               11m
kube-system   coredns-5d78c9869d-r8x22               1/1     Running   1 (8m7s ago)    17m
kube-system   etcd-minikube                          1/1     Running   1 (8m11s ago)   19m
kube-system   kube-apiserver-minikube                1/1     Running   1 (8m11s ago)   18m
kube-system   kube-controller-manager-minikube       1/1     Running   6 (5m10s ago)   20m
kube-system   kube-proxy-8tl8c                       1/1     Running   1 (8m12s ago)   17m
kube-system   kube-scheduler-minikube                1/1     Running   1 (8m12s ago)   19m
kube-system   nvidia-device-plugin-daemonset-rbppm   1/1     Running   1               13m
kube-system   storage-provisioner                    1/1     Running   3 (4m53s ago)   17m

andakai (Author) commented Sep 29, 2023

I also tried this document: https://github.com/intelligent-machine-learning/dlrover/blob/master/docs/tutorial/gpu_user_guide.md, which is similar to NVIDIA's document. I still get the same result.

root@c0ac3df639d6:/usr/src# kubectl describe pod nvidia-device-plugin-daemonset-r9spv -n kube-system
Name:                 nvidia-device-plugin-daemonset-r9spv
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 minikube/192.168.49.2
Start Time:           Fri, 29 Sep 2023 12:27:34 +0000
Labels:               controller-revision-hash=586d67c5
                      name=nvidia-device-plugin-ds
                      pod-template-generation=2
Annotations:          scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Running
IP:                   10.244.0.9
IPs:
  IP:           10.244.0.9
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:   docker://e3994f9d249dbf33089fced497bb52ba8233f84b53bf4b76c72fc33cc58df1f2
    Image:          nvidia/k8s-device-plugin:1.11
    Image ID:       docker-pullable://nvidia/k8s-device-plugin@sha256:41b3531d338477d26eb1151c15d0bea130d31e690752315a5205d8094439b0a6
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 29 Sep 2023 12:28:36 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      FAIL_ON_INIT_ERROR:  false
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcq6h (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  kube-api-access-tcq6h:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  6m43s  default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-r9spv to minikube
  Normal  Pulling    6m34s  kubelet            Pulling image "nvidia/k8s-device-plugin:1.11"
  Normal  Pulled     6m8s   kubelet            Successfully pulled image "nvidia/k8s-device-plugin:1.11" in 25.768299986s (25.768309412s including waiting)
  Normal  Created    5m43s  kubelet            Created container nvidia-device-plugin-ctr
  Normal  Started    5m40s  kubelet            Started container nvidia-device-plugin-ctr
root@c0ac3df639d6:/usr/src# kubectl describe pod gpu-pod
Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bbdkm (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-bbdkm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  20s (x2 over 5m20s)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

ehfd (Owner) commented Nov 25, 2023

I've updated the NVIDIA Container Toolkit. Please check whether this solves anything.
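With a recent toolkit, wiring the NVIDIA runtime into the inner Docker can be sketched as follows (assuming the nvidia-ctk helper from the NVIDIA Container Toolkit is on PATH; not verified inside this image):

```shell
# Write the nvidia runtime entries into /etc/docker/daemon.json
# and make nvidia the default runtime.
nvidia-ctk runtime configure --runtime=docker --set-as-default

# No systemd in this image, so restart dockerd through supervisor.
supervisorctl restart dockerd
```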

rajat709 commented Jan 1, 2024

Hello @darrenglow, CC @ehfd: how were you able to run Minikube inside a container? I have been trying for a long time, but I keep getting OCI and cgroup errors. Can you help me with this?

ehfd (Owner) commented Jan 5, 2024

I have no idea... Perhaps try KinD?

rajat709 commented Jan 5, 2024

Sure, I'll try it... Thanks!

ehfd (Owner) commented Jan 9, 2024

https://www.substratus.ai/blog/kind-with-gpus/
kubernetes-sigs/kind#3257 (comment)

Both actually look relevant/applicable here too.
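The approach in those links boils down to making nvidia the default Docker runtime, setting accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml, and then creating the KinD cluster with a config roughly like this (a sketch based on the Substratus post, not verified in this image):

```yaml
# kind-gpu.yaml: expose all host GPUs to the KinD node.
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  extraMounts:
    # Mounting /dev/null at this path tells the nvidia runtime
    # to inject "all" visible devices into the node container.
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all
```

Then kind create cluster --config kind-gpu.yaml, followed by installing the NVIDIA device plugin as above.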

rajat709 commented

@ehfd I have tried it, but it didn't work :((

ehfd (Owner) commented Feb 25, 2024

NVIDIA/k8s-device-plugin#332 (comment)
One more resource.

rajat709 commented
Thank you, @ehfd. I’ve already explored that resource, but unfortunately it didn’t work either. I’ve now switched to using virtual machines.
