
[BUG] nvitop.Device.from_cuda_visible_devices() not detecting GPU #99

Closed
juan-barajas-p opened this issue Oct 4, 2023 · 4 comments · Fixed by #100
Labels: api (Something related to the core APIs) · bug (Something isn't working)

Comments

@juan-barajas-p

Required prerequisites

  • I have read the documentation https://nvitop.readthedocs.io.
  • I have searched the Issue Tracker to confirm that this hasn't already been reported. (Comment there if it has.)
  • I have tried the latest version of nvitop in a new isolated virtual environment.

What version of nvitop are you using?

1.3.0

Operating system and version

Pop!_OS 22.04 LTS

NVIDIA driver version

535.113.01

NVIDIA-SMI

Wed Oct  4 08:57:35 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070 ...    Off | 00000000:01:00.0  On |                  N/A |
| N/A   51C    P8              15W / 125W |     59MiB /  8192MiB |     13%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3081      G   /usr/lib/xorg/Xorg                           53MiB |
+---------------------------------------------------------------------------------------+

Python environment

Virtualenv created with micromamba v1.5.1 via micromamba create --name testing python=3.11, then installed nvitop with pip install nvitop.

Command output:

3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] linux
nvidia-ml-py==12.535.108
nvitop==1.3.0

Problem description

Using the following code snippet results in an empty list:

import nvitop; nvitop.Device.from_cuda_visible_devices()

This happens regardless of whether CUDA_VISIBLE_DEVICES is set or not.

Steps to Reproduce

Command lines:

python -c "import nvitop; print(nvitop.Device.from_cuda_visible_devices())"

Traceback

N/A

Logs

N/A

Expected behavior

I would expect to see the same number of devices given by nvitop.Device.all() when calling nvitop.Device.from_cuda_visible_devices() if CUDA_VISIBLE_DEVICES is not set or if CUDA_VISIBLE_DEVICES is set to all GPUs in the system.

Additional context

This has never happened on any previous machine with the same nvitop version and OS, which at first led me to believe it was a problem with this particular machine's setup, but after some more testing I'm not so sure. I'm providing the following information in case nvitop can be improved to handle this situation.

I looked into it in more detail, and it turns out that visible_device_indices is empty on this machine, whereas on other machines it does find the correct GPU UUID.

# file: api.device.py, method: from_cuda_visible_devices

visible_device_indices = Device.parse_cuda_visible_devices()  # value: []

Looking closer at _parse_cuda_visible_devices, the complete uuid is correctly detected by _get_all_physical_device_attrs():

# file: api.device.py, function: _parse_cuda_visible_devices

physical_device_attrs = _get_all_physical_device_attrs()  # value: _PhysicalDeviceAttrs(index=0, name='NVIDIA GeForce RTX 3070 Ti Laptop GPU', uuid='GPU-13096139-7ada-8313-ee08-000dd8540fe1', support_mig_mode=False))])

But the subprocess that parses visible devices into UUIDs appears to be missing the last part of the UUID. This causes the downstream logic to assume this UUID belongs to a MIG device (as it is not found in physical_device_attrs), and, among other things, the GPU ends up not being reported as a valid device by nvitop.

# file: api.device.py, function: _parse_cuda_visible_devices

raw_uuids = subprocess.check_output(...)  # value: ['13096139-7ada-8313-ee08-']
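A hypothetical sketch (variable names and lookup logic assumed, not nvitop's actual code) of why the truncated UUID falls through to the MIG branch:

```python
# Hypothetical sketch: a truncated UUID fails the physical-device lookup,
# so the parser falls back to treating it as a MIG device.
physical_device_uuids = {'GPU-13096139-7ada-8313-ee08-000dd8540fe1'}

def classify(raw_uuid: str) -> str:
    uuid = f'GPU-{raw_uuid}'
    if uuid in physical_device_uuids:
        return 'physical GPU'
    # Not a known physical device, so it is assumed to be a MIG device
    return 'assumed MIG device'

print(classify('13096139-7ada-8313-ee08-000dd8540fe1'))  # physical GPU
print(classify('13096139-7ada-8313-ee08-'))              # assumed MIG device
```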

I kept tracing the incorrect UUID back to cuDeviceGetUuid, and it appears this is the point where the UUID becomes incomplete.

# file: api.libcuda.py, function: cuDeviceGetUuid

uuid = ''.join(map('{:02x}'.format, uuid.value))  # value: "130961397ada8313ee08"

As I understand it, this is just a wrapper around the CUDA driver API function cuDeviceGetUuid_v2, so I tried NVIDIA's cuda-python to see if I could replicate the problem, but oddly enough that does return the full UUID of the GPU.

micromamba create --name testing_2 python=3.11
micromamba activate testing_2
pip install cuda-python  # v12.2.0
python -c "from cuda import cuda; cuda.cuInit(0); print(cuda.cuDeviceGetUuid_v2(0)[1])"
# prints: bytes : 130961397ada8313ee08000dd8540fe1

Since the Python wrappers of the driver API return the expected value, I wonder if there's something nvitop's implementation could do to mitigate this issue.

@juan-barajas-p juan-barajas-p added the bug Something isn't working label Oct 4, 2023
@XuehaiPan XuehaiPan added the api Something related to the core APIs label Oct 4, 2023
@XuehaiPan (Owner)

Hi @juan-barajas-p, thanks for raising this! The detailed investigation is much appreciated.

The cause is that the UUID contains the null character \x00, which terminates the string buffer early.

Your UUID:

uuid = '130961397ada8313ee08000dd8540fe1'

stripped uuid:

uuid = '130961397ada8313ee08'

As we can see, there is a 00 byte after ...ee08, and the C string buffer terminates early at the null character.
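The truncation can be reproduced with a plain ctypes buffer, independent of the CUDA driver (a minimal sketch; the UUID bytes are taken from the report above):

```python
import ctypes

# 16-byte UUID from the report, containing a 0x00 byte after ...ee08
raw = bytes.fromhex('130961397ada8313ee08000dd8540fe1')
buf = (ctypes.c_char * 16)(*raw)

# `.value` treats the buffer as a NUL-terminated C string, so every
# byte after the first 0x00 is silently dropped; `.raw` keeps all 16.
print(buf.value.hex())  # 130961397ada8313ee08
print(buf.raw.hex())    # 130961397ada8313ee08000dd8540fe1
```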

I will submit a quick fix for this.

@XuehaiPan XuehaiPan changed the title nvitop.Device.from_cuda_visible_devices not detecting GPU [BUG] [BUG] nvitop.Device.from_cuda_visible_devices not detecting GPU Oct 4, 2023
@XuehaiPan XuehaiPan changed the title [BUG] nvitop.Device.from_cuda_visible_devices not detecting GPU [BUG] nvitop.Device.from_cuda_visible_devices() not detecting GPU Oct 4, 2023
@XuehaiPan (Owner)

Hi @juan-barajas-p, I created a fix to resolve this issue.

You can try it via:

python3 -m pip install git+https://github.com/XuehaiPan/nvitop.git@fix-cuDeviceGetUuid

BTW, you can use Device.cuda.all() or CudaDevice.all() to get all CUDA-visible devices.

from nvitop import Device, CudaDevice

all_cuda_devices = Device.from_cuda_visible_devices()             # parses the `CUDA_VISIBLE_DEVICES` environment variable
other_cuda_devices = Device.from_cuda_visible_devices('4,3,0,1')  # ignores the environment variable and uses the given value instead

# Alternatives if you only need the devices from the `CUDA_VISIBLE_DEVICES` environment variable
all_cuda_devices = Device.cuda.all()  # needs only `from nvitop import Device`
all_cuda_devices = CudaDevice.all()

@juan-barajas-p (Author)

Hi! Thank you for the very quick response. Great job with this library; it's the easiest way of interacting with GPU metrics that I've used.

Ohh, of course that's the problem, haha. Also, thank you for the tip! I didn't know you could do it that way.

It almost works. I think you meant to apply the fix to api.libcuda.cuDeviceGetUuid instead of api.libcuda.cuDeviceGetUuid_v2, since that is the entry point used in api.device._cuda_visible_devices_parser. But if I use cuDeviceGetUuid_v2 directly, it does solve the issue!

@XuehaiPan
Copy link
Owner

It almost works. I think you meant to apply the fix to api.libcuda.cuDeviceGetUuid instead of api.libcuda.cuDeviceGetUuid_v2, since that is the entry point used in api.device._cuda_visible_devices_parser?

Thanks for the notes. I have updated the fix accordingly.
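A minimal sketch of the shape of such a fix (field and helper names assumed; the actual nvitop patch may differ): format the UUID from the raw 16 bytes of the CUuuid struct instead of reading it as a NUL-terminated string.

```python
import ctypes

# CUuuid as defined by the CUDA driver API: 16 opaque bytes
class CUuuid(ctypes.Structure):
    _fields_ = [('bytes', ctypes.c_byte * 16)]

def format_uuid(uuid: CUuuid) -> str:
    # c_byte is signed, so mask to 8 bits before hex-formatting;
    # iterating the array never stops at a 0x00 byte
    return ''.join(f'{b & 0xFF:02x}' for b in uuid.bytes)

uuid = CUuuid()
ctypes.memmove(uuid.bytes, bytes.fromhex('130961397ada8313ee08000dd8540fe1'), 16)
print(format_uuid(uuid))  # 130961397ada8313ee08000dd8540fe1
```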
