GPU Not Available using PyTorch in azureml_py38_PT_and_TF kernel on DSVM #41

Closed
riow1983 opened this issue Apr 15, 2022 · 2 comments

Comments

@riow1983

Is it only me?

$ conda activate azureml_py38_PT_and_TF
(azureml_py38_PT_and_TF) $ python
Python 3.8.5 (default, Sep  4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
$ nvidia-smi
Fri Apr 15 04:43:40 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000001:00:00.0 Off |                  Off |
| N/A   29C    P0    29W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

It looks like the installed PyTorch is a CPU-only build:

~$ conda list | grep pytorch
pytorch                   1.10.0              py3.8_cpu_0    pytorch

After I left azureml_py38_PT_and_TF, created a new conda env from scratch, and installed PyTorch into it with conda, torch.cuda.is_available() returned True. So the problem does not appear to lie with CUDA or the driver itself.
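
The same diagnosis can be made from inside Python instead of from `conda list`. The following is a minimal sketch, not an official check: `describe_torch_build` is a hypothetical helper name, and it relies on `torch.version.cuda` being `None` when PyTorch was built without CUDA support.

```python
def describe_torch_build(torch_module):
    """Summarize whether a torch-like module was built with CUDA support.

    Hypothetical helper: takes the imported module as an argument so it
    can also be exercised without PyTorch installed.
    """
    version = torch_module.__version__
    cuda_version = torch_module.version.cuda  # None when built without CUDA
    if cuda_version is None:
        return f"PyTorch {version}: built without CUDA support"
    if not torch_module.cuda.is_available():
        return (f"PyTorch {version}: built for CUDA {cuda_version}, "
                "but no usable GPU or driver was found")
    return f"PyTorch {version}: CUDA {cuda_version} build, GPU available"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print(describe_torch_build(torch))
```

If the first message is printed, the environment holds a build without CUDA support, and reinstalling a CUDA build (or using a fresh env, as above) is the likely fix rather than touching the driver.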

My VM's image details:

"imageReference": {
      "communityGalleryImageId": null,
      "exactVersion": "22.04.01",
      "id": null,
      "offer": "ubuntu-1804",
      "publisher": "microsoft-dsvm",
      "sharedGalleryImageId": null,
      "sku": "1804-gen2",
      "version": "latest"
    },

This worked when I was using an older version of the VM image with the previous conda env named azureml_py36_pytorch, which has recently been removed according to the documentation.
BTW, in that env, PyTorch is a CUDA build:

$ conda list | grep pytorch
# packages in environment at /anaconda/envs/azureml_py36_pytorch:
pytorch                   1.10.0          py3.6_cuda11.1_cudnn8.0.5_0    pytorch
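
The difference between the two envs is visible in the build column of `conda list` (`py3.8_cpu_0` versus `py3.6_cuda11.1_cudnn8.0.5_0`). As a rough sketch, that column can be classified automatically; the function names below are hypothetical, and the parsing assumes build strings shaped like the two shown in this issue.

```python
import re


def classify_build(build_string):
    """Return 'cpu', a CUDA tag such as 'cuda11.1', or 'unknown'."""
    match = re.search(r"cuda(\d+(?:\.\d+)*)", build_string)
    if match:
        return "cuda" + match.group(1)
    if "cpu" in build_string.split("_"):
        return "cpu"
    return "unknown"


def pytorch_build_from_conda_list(text):
    """Find the pytorch row in `conda list` output and classify its build."""
    for line in text.splitlines():
        fields = line.split()
        # Package rows look like: name  version  build  channel
        if len(fields) >= 3 and fields[0] == "pytorch":
            return classify_build(fields[2])
    return None  # no pytorch row found


if __name__ == "__main__":
    new_env = "pytorch                   1.10.0              py3.8_cpu_0    pytorch"
    old_env = "pytorch                   1.10.0          py3.6_cuda11.1_cudnn8.0.5_0    pytorch"
    print(pytorch_build_from_conda_list(new_env))
    print(pytorch_build_from_conda_list(old_env))
```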
@leestott
Member

What are you trying to do

I'm trying to install pytorch with GPU acceleration.

What command are you using? What DSVM version and OS are you running?
Ubuntu 20.04 DSVM on a Standard NV32as v4 (East US 2)

Error encountered
Typing in Python:

import torch
torch.cuda.is_available() # evaluates to False

Expected behaviour
It should evaluate to True (it did on the NVIDIA machines of the NS family,
but they are all unavailable). So I started troubleshooting:

$ lspci
0299:00:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
[AMD/ATI] Vega 10 [Radeon Instinct MI25 MxGPU]

There is no /dev/kfd, and lsmod does not show any AMD-related modules,
so the amdgpu drivers are not loaded. Recent kernel versions should
bundle them by default, but if they do not, one can install them
using these instructions:
https://amdgpu-install.readthedocs.io/en/latest/install-prereq.html#installing-the-installer-package

This boils down to downloading the right package and running the
following commands:

sudo apt --fix-broken install ./amdgpu-install_21.50.2.50002-1_all.deb
sudo apt update
sudo amdgpu-install --no-32
sudo modprobe amdgpu

The last command then hangs, and the following error message repeats
indefinitely:

...
[drm:amdgpu_virt_read_pf2vf_data [amdgpu]] ERROR invalid pf2vf message
[drm:amdgpu_virt_read_pf2vf_data [amdgpu]] ERROR invalid pf2vf message
[drm:amdgpu_virt_read_pf2vf_data [amdgpu]] ERROR invalid pf2vf message
[drm:amdgpu_virt_read_pf2vf_data [amdgpu]] ERROR invalid pf2vf message
[drm:amdgpu_virt_read_pf2vf_data [amdgpu]] ERROR invalid pf2vf message
[drm:amdgpu_virt_read_pf2vf_data [amdgpu]] ERROR invalid pf2vf message
[drm:amdgpu_virt_read_pf2vf_data [amdgpu]] ERROR invalid pf2vf message
...
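
The two driver checks described in this comment (whether /dev/kfd exists and whether lsmod lists amdgpu) can be sketched in Python as follows. This is an illustration under assumptions: the helper names are hypothetical, and it reads /proc/modules, which is the file lsmod itself formats, rather than calling lsmod.

```python
import os


def amdgpu_status(modules_text, kfd_exists):
    """Report whether amdgpu appears in /proc/modules-style text.

    Each non-empty line starts with the module name, e.g.
    'amdgpu 5000000 0 - Live 0x0000000000000000'.
    """
    loaded = any(
        line.split()[0] == "amdgpu"
        for line in modules_text.splitlines()
        if line.strip()
    )
    return {"amdgpu_loaded": loaded, "kfd_present": kfd_exists}


def read_host_status(modules_path="/proc/modules", kfd_path="/dev/kfd"):
    """Collect the same two facts from the running Linux host."""
    try:
        with open(modules_path) as handle:
            text = handle.read()
    except OSError:
        text = ""  # not Linux, or the file is unreadable
    return amdgpu_status(text, os.path.exists(kfd_path))


if __name__ == "__main__":
    print(read_host_status())
```

On the VM described above, both values would be expected to be False before the driver install.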

@riow1983
Author

Hi,
I recreated my VM from the latest version of the DSVM image (detailed below) and found that the issue above has been resolved. I'll close this issue. Thanks.

"imageReference": {
      "communityGalleryImageId": null,
      "exactVersion": "22.05.11",
      "id": null,
      "offer": "ubuntu-2004",
      "publisher": "microsoft-dsvm",
      "sharedGalleryImageId": null,
      "sku": "2004-gen2",
      "version": "latest"
    },
