NVIDIA GPU in microk8s in WSL #3024
Comments
@maximehyh |
I hit a similar issue. I tried the following:

sudo snap install microk8s --classic
sudo usermod -a -G microk8s ${USER}
sudo chown -R ${USER} ~/.kube
newgrp microk8s
vim /var/snap/microk8s/current/args/kubelet
# Replace --container-runtime=remote with --container-runtime=docker
microk8s stop
microk8s start
cat /var/snap/microk8s/current/args/kubelet
# Make sure container-runtime is docker
microk8s enable gpu
microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "k8s.gcr.io/cuda-vector-add:v0.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

i.e., I switched the runtime to Docker, since my NVIDIA Container Runtime works well in WSL2, and hoped the cluster would be able to pick up the GPU from there. However, my pod remains in Pending status.
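To narrow down whether the failure is on the Docker/WSL2 side or on the Kubernetes side, two quick checks may help (a minimal sketch; the CUDA image tag is an assumption, use whichever CUDA base image is available locally):

# Confirm the NVIDIA Container Runtime really exposes the GPU to Docker in WSL2
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
# Check whether the microk8s node actually advertises the GPU resource;
# if nvidia.com/gpu is absent from Capacity/Allocatable, the pod stays Pending
microk8s kubectl describe node | grep -A 10 "Capacity"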
I think this would be a really nice use case to enable - many people develop on Windows via WSL2 and Ubuntu. If microk8s with GPU acceleration were a two-liner, that would be very convenient and could attract many new users. It is also entirely possible that we are just missing a minor detail which prevents this from working.

Optional side observation: the same happens with minikube:

curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube_latest_amd64.deb
sudo dpkg -i minikube_latest_amd64.deb
minikube start --driver=docker --container-runtime=docker --addons=ingress --memory=8192
kubectl version
kubectl get pod
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "k8s.gcr.io/cuda-vector-add:v0.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

The pod also gets stuck in Pending status:

$ kubectl get pod
NAME READY STATUS RESTARTS AGE
cuda-vector-add 0/1 Pending 0 90s |
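To see why the scheduler keeps the pod Pending, the usual kubectl checks apply (a sketch, not part of the original report):

# The Events section should show the scheduling reason,
# e.g. "0/1 nodes are available: 1 Insufficient nvidia.com/gpu"
kubectl describe pod cuda-vector-add
# Verify whether the device plugin has registered the GPU with the node
kubectl describe node | grep -i "nvidia.com/gpu"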
Before getting into Microk8s, what does your nvidia-smi in WSL say? |
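For context, GPU visibility inside WSL2 can be checked directly; the paths below are the usual WSL2 GPU passthrough locations, not something taken from this thread:

# The Windows driver exposes the GPU to WSL2 as /dev/dxg and ships the
# user-space libraries (libcuda, nvidia-smi) under /usr/lib/wsl/lib
ls -l /dev/dxg
ls /usr/lib/wsl/lib | grep -i -E "cuda|nvidia"
nvidia-smi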
Have the same issue. I observe that before enabling the gpu add-on |
Hi @valxv Can you please share the output of Also, could you try enabling GPU as |
Hi @neoaggelos Thanks for the prompt response. The I install MicroK8s v1.27.2 following the Get Started instructions. Everything goes smoothly and I get a working MicroK8s instance.
And enabling GPU without
But the operator's pod logs look the same. inspection-report-20230707_090001.tar.gz Please let me know in case you need any additional information. |
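A way to narrow down where the GPU operator gets stuck (standard kubectl usage; the gpu-operator-resources namespace is the one the add-on uses, as also shown in a later comment):

# List the operator components; validator pods stuck in Init usually mean
# the driver or container toolkit was not detected on the node
microk8s kubectl get pods -n gpu-operator-resources
# Inspect the events of whichever pod is not Running (pod name is a placeholder)
microk8s kubectl describe pod -n gpu-operator-resources <pod-name>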
Pardon my ignorance, but does your GPU support CUDA? I don't see it in https://developer.nvidia.com/cuda-gpus#compute (unless I missed it) If the GPU should be supported, then I believe the right place to open an issue would be https://github.com/nvidia/gpu-operator Hope this helps! |
Yes, it is supported. This NVIDIA page you're referring to is a little bit outdated. I don't have any problems running CUDA apps locally and inside Docker containers. The only issue I have is with Kubernetes. :) |
OK, I am not too familiar with the details. Thanks @valxv, happy to keep this issue around so that we are in the loop and can see if there's any way to support this from the microk8s side. |
Has this issue been resolved? I'm actually running into the same use case. Even k3s can't detect the GPU on the WSL2 node. |
This error can be resolved by building the Ubuntu kernel, but the GPU operator still says that it cannot find the GPU node. |
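For anyone trying the custom-kernel route: WSL2 can be pointed at a locally built kernel via .wslconfig on the Windows side; the path below is a placeholder, and this only addresses the kernel side, not the operator's node detection:

# %UserProfile%\.wslconfig (Windows side); backslashes must be escaped
[wsl2]
kernel=C:\\path\\to\\bzImage

After editing, run wsl --shutdown from Windows so the new kernel is used on the next start.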
Hey, is there any update on this? I am facing the same issue on microk8s 1.24. |
Same issue - any ideas? |
same issue :( |
Same issue for me; see the following output. The failure comes from trying to read a host sysfs path that doesn't exist in WSL2:

failed to detect NUMA nodes: failed to list numa nodes: open /host-sys/bus/node/devices: no such file or directory

(base) eval@rtx-4090-1:$ microk8s enable gpu --driver=host
(base) eval@rtx-4090-1:$ microk8s.kubectl get pods -n gpu-operator-resources |
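The missing path can be confirmed directly inside WSL2; the operator reads the host's /sys mounted as /host-sys, so the check below mirrors the error above (a minimal sketch):

# WSL2's kernel usually does not expose NUMA node entries in sysfs,
# which is exactly what the operator fails to list above
ls /sys/bus/node/devices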
Met the same issue, and I'm here too. |
Same issue, any ideas? WSL with microk8s installed. |
+1
|
Hello,
I am looking for a way to run some machine learning inference within a Kubernetes cluster on Windows. MicroK8s seemed to be a good solution, as I saw that there was a GPU add-on. I did some experiments with the documented way of running microk8s on Windows using multipass, but quickly realized that enabling GPU usage within multipass could bring some difficulties (canonical/multipass#2503 (comment)). Knowing that CUDA can be enabled within WSL, and seeing some threads about installing microk8s on WSL (meaning overcoming the lack of snap), I started trying to build a Kubernetes cluster within WSL using microk8s.

Everything went smoothly until I actually deployed the pod that needed access to the GPU. The pod cannot be deployed; describe deployment gives the following error:

0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

(I did microk8s enable gpu and added limits: nvidia.com/gpu: 1 in the deployment.yaml as described in the docs.)

I have not put much effort yet into trying to make it work, as I figured that my assumption - that having the GPU available in WSL would allow the GPU to be enabled within Kubernetes - could be wrong.

Do you know if such a setup could work?
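For illustration, a minimal sketch of where such a limit sits in a manifest, in the same heredoc style as the pod examples earlier in this thread (the name and image follow that example rather than the actual deployment.yaml):

# Request one GPU via resources.limits; without a schedulable GPU on the
# node this stays Pending with the same "Insufficient nvidia.com/gpu" event
microk8s kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-vector-add
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cuda-vector-add
  template:
    metadata:
      labels:
        app: cuda-vector-add
    spec:
      containers:
      - name: cuda-vector-add
        image: "k8s.gcr.io/cuda-vector-add:v0.1"
        resources:
          limits:
            nvidia.com/gpu: 1
EOF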
Below the requested logs.
Thanks!
inspection-report-20220401_223428.tar.gz