GPU Not Available using PyTorch in azureml_py38_PT_and_TF kernel on DSVM #41

riow1983 · 2022-04-15T05:06:16Z

Is it only me?

$ conda activate azureml_py38_PT_and_TF
(azureml_py38_PT_and_TF) $ python
Python 3.8.5 (default, Sep  4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False

$ nvidia-smi
Fri Apr 15 04:43:40 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000001:00:00.0 Off |                  Off |
| N/A   29C    P0    29W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Looks like the installed PyTorch is a CPU-only build?

~$ conda list | grep pytorch
pytorch                   1.10.0              py3.8_cpu_0    pytorch

After leaving azureml_py38_PT_and_TF and installing PyTorch with conda into a conda env newly created from scratch, torch.cuda.is_available() returned True. So the problem should not lie with CUDA itself.
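The same check can be made from inside Python. A minimal sketch, assuming only that `torch.version.cuda` is `None` on a CPU-only build (the helper name `describe_torch_build` is made up for illustration):

```python
import importlib.util


def describe_torch_build():
    """Summarize how the PyTorch in the active environment was built."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this environment at all.
        return {"installed": False}

    import torch

    return {
        "installed": True,
        "version": torch.__version__,
        # CUDA version the wheel/conda package was compiled against;
        # None means a CPU-only build.
        "built_with_cuda": torch.version.cuda,
        # True only if the build has CUDA support AND a usable driver/GPU is found.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch_build())
```

If `built_with_cuda` is `None` while `nvidia-smi` shows the GPU, the package rather than the driver is the likely culprit, which matches what is described above.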

My VM's image details:

"imageReference": {
      "communityGalleryImageId": null,
      "exactVersion": "22.04.01",
      "id": null,
      "offer": "ubuntu-1804",
      "publisher": "microsoft-dsvm",
      "sharedGalleryImageId": null,
      "sku": "1804-gen2",
      "version": "latest"
    },

This worked when I was using an older version of the VM image and the previous conda env named azureml_py36_pytorch, which has recently been removed according to the documentation.
BTW, in that env, PyTorch is a CUDA build.

$ conda list | grep pytorch
# packages in environment at /anaconda/envs/azureml_py36_pytorch:
pytorch                   1.10.0          py3.6_cuda11.1_cudnn8.0.5_0    pytorch
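The difference between the two environments is visible in the build string (third column) of the `conda list` output. As a rough sketch, the following hypothetical helper `classify_pytorch_build` classifies such a line; it assumes the build string contains `cpu` or `cuda<version>` as in the two lines quoted in this issue, which may not hold for every channel or package:

```python
import re


def classify_pytorch_build(conda_list_line):
    """Classify a `conda list` line for pytorch as 'cpu', 'cuda <version>', or 'unknown'.

    Returns None if the line is not a pytorch package line.
    """
    fields = conda_list_line.split()
    if len(fields) < 3 or fields[0] != "pytorch":
        return None
    build = fields[2]  # e.g. py3.8_cpu_0 or py3.6_cuda11.1_cudnn8.0.5_0
    match = re.search(r"cuda(\d+(?:\.\d+)*)", build)
    if match:
        return "cuda " + match.group(1)
    if "cpu" in build:
        return "cpu"
    return "unknown"


if __name__ == "__main__":
    print(classify_pytorch_build("pytorch  1.10.0  py3.8_cpu_0  pytorch"))
    print(classify_pytorch_build("pytorch  1.10.0  py3.6_cuda11.1_cudnn8.0.5_0  pytorch"))
```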

leestott · 2022-04-25T10:30:31Z

What are you trying to do

I'm trying to install pytorch with GPU acceleration.

What command are you using / What DSVM version and OS are you running
Ubuntu 20.04 DSVM on a Standard NV32as v4 (East US 2)

Error encountered
Typing in python :

import torch
torch.cuda.is_available() # evaluates to False

Expected behaviour
Should evaluate to True (it did on the NVIDIA machines of the NS family,
but they are all unavailable). So I started troubleshooting:

$ lspci
0299:00:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
[AMD/ATI] Vega 10 [Radeon Instinct MI25 MxGPU]

There is no /dev/kfd, and lsmod does not show any AMD-related modules:
the amdgpu drivers are not loaded. In recent kernel versions they
should be bundled by default, but if they are not, one can install them
using these instructions:
https://amdgpu-install.readthedocs.io/en/latest/install-prereq.html#installing-the-installer-package

which boils down to downloading the right package and using the
following commands :

sudo apt --fix-broken install ./amdgpu-install_21.50.2.50002-1_all.deb
sudo apt update
sudo amdgpu-install --no-32
sudo modprobe amdgpu

The last command then hangs, and the following error message repeats
indefinitely:

...
[drm:amdgpu_virt_read_pf2vf_data [amdgpu]] ERROR invalid pf2vf message
[drm:amdgpu_virt_read_pf2vf_data [amdgpu]] ERROR invalid pf2vf message
[drm:amdgpu_virt_read_pf2vf_data [amdgpu]] ERROR invalid pf2vf message
[drm:amdgpu_virt_read_pf2vf_data [amdgpu]] ERROR invalid pf2vf message
[drm:amdgpu_virt_read_pf2vf_data [amdgpu]] ERROR invalid pf2vf message
[drm:amdgpu_virt_read_pf2vf_data [amdgpu]] ERROR invalid pf2vf message
[drm:amdgpu_virt_read_pf2vf_data [amdgpu]] ERROR invalid pf2vf message
...
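The driver check described in this comment (no /dev/kfd, no amdgpu entry in lsmod) can be scripted. A minimal sketch, assuming a Linux host where `/proc/modules` is the list that `lsmod` prints; the function name `amdgpu_driver_status` and its parameters are hypothetical:

```python
import os


def amdgpu_driver_status(kfd_path="/dev/kfd", modules_path="/proc/modules"):
    """Report whether the /dev/kfd device node exists and the amdgpu module is loaded."""
    loaded = False
    try:
        with open(modules_path) as modules:
            # Each line of /proc/modules starts with the module name.
            loaded = any(line.split()[:1] == ["amdgpu"] for line in modules)
    except OSError:
        # The modules file is missing or unreadable (e.g. not a Linux host).
        pass
    return {"kfd_present": os.path.exists(kfd_path), "amdgpu_loaded": loaded}


if __name__ == "__main__":
    print(amdgpu_driver_status())
```

Both values being False corresponds to the state reported before the driver install was attempted.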

riow1983 · 2022-06-28T00:12:57Z

Hi,
I tried again, creating my VM from the latest version of the DSVM image (details below), and found that the issue above has been resolved. I'll close this issue. Thanks.

"imageReference": {
      "communityGalleryImageId": null,
      "exactVersion": "22.05.11",
      "id": null,
      "offer": "ubuntu-2004",
      "publisher": "microsoft-dsvm",
      "sharedGalleryImageId": null,
      "sku": "2004-gen2",
      "version": "latest"
    },

riow1983 closed this as completed Jun 28, 2022
