[BUG] nvitop.Device.from_cuda_visible_devices() not detecting GPU #99
Comments
Hi @juan-barajas-p, thanks for raising this! I much appreciate the detailed context for the investigation. The cause is that the UUID contains a null character.
Your UUID: uuid = '130961397ada8313ee08000dd8540fe1'
Stripped UUID: uuid = '130961397ada8313ee08'
As we can see, the stripped UUID ends just before the `00` byte that follows `ee08`. I will submit a quick fix for this.
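The truncation can be reproduced without a GPU. The following is a minimal sketch, assuming the driver call fills a 16-byte `ctypes` char buffer; that mechanism is an assumption for illustration, not a quote of nvitop's code. It shows that reading such a buffer with C-string semantics stops at the first null byte, while reading the raw memory returns all 16 bytes.

```python
import ctypes
import uuid

# UUID bytes from this issue; note that the 11th byte is 0x00.
UUID_HEX = '130961397ada8313ee08000dd8540fe1'

# Stand-in for the 16-byte buffer that a driver call would fill in.
buf = (ctypes.c_char * 16).from_buffer_copy(bytes.fromhex(UUID_HEX))

truncated = buf.value  # C-string semantics: stops at the first null byte
complete = buf.raw     # raw memory: always all 16 bytes

print(truncated.hex())  # 130961397ada8313ee08
print(complete.hex())   # 130961397ada8313ee08000dd8540fe1

# Format the complete bytes in the usual 8-4-4-4-12 layout.
print('GPU-' + str(uuid.UUID(bytes=complete)))
```

Any UUID whose 16 bytes happen to include a zero byte would be affected in the same way, which would explain why the problem shows up on one machine and not on others.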
Hi @juan-barajas-p, I created a fix to resolve this issue. You can try it via:
python3 -m pip install git+https://github.com/XuehaiPan/nvitop.git@fix-cuDeviceGetUuid
BTW, you can use:
from nvitop import Device, CudaDevice
# Use this only when you don't want to use the `CUDA_VISIBLE_DEVICES` from the environment variable
all_cuda_devices = Device.from_cuda_visible_devices() # from the environment variable
other_cuda_devices = Device.from_cuda_visible_devices('4,3,0,1') # do not use the environment variable
# alternatives if you only read `CUDA_VISIBLE_DEVICES` from the environment variable
all_cuda_devices = Device.cuda.all() # you can have only `from nvitop import Device`
all_cuda_devices = CudaDevice.all()
Hi! Thank you for the very quick response. Good job with this library, as it's the easiest method of interacting with GPU metrics that I've used. Oh, of course that's the problem, haha. Also, thank you for the tip! I didn't know you could do it that way. It almost works. I think you meant to apply the fix to
Thanks for the notes. I have updated the fix accordingly.
Required prerequisites
What version of nvitop are you using?
1.3.0
Operating system and version
Pop!_OS 22.04 LTS
NVIDIA driver version
535.113.01
NVIDIA-SMI
Python environment
Virtualenv created with micromamba v1.5.1 with `mm create --name testing python=3.11`, then installed nvitop with `pip install nvitop`.
Command output:
Problem description
Using the following code snippet results in an empty list:
Regardless of whether `CUDA_VISIBLE_DEVICES` is set or not.
Steps to Reproduce
Command lines:
python -c "import nvitop; print(nvitop.Device.from_cuda_visible_devices())"
Traceback
Logs
Expected behavior
I would expect to see the same number of devices given by `nvitop.Device.all()` when calling `nvitop.Device.from_cuda_visible_devices()` if `CUDA_VISIBLE_DEVICES` is not set or if `CUDA_VISIBLE_DEVICES` is set to all GPUs in the system.
Additional context
This has never happened before on any previous machine using the same `nvitop` version and OS, which at first led me to believe it was a problem with this particular machine's setup, but after some more testing I'm not so sure. I'm giving the following information to see if `nvitop` can be improved to deal with this situation accordingly.
I looked into it in more detail, and it turns out that `visible_device_indices` is empty on this machine, whereas on other machines it does find the correct GPU UUID.
Looking closer at `_parse_cuda_visible_devices`, the complete UUID is correctly detected by `_get_all_physical_device_attrs()`:
But the subprocess that parses visible devices to UUIDs appears to be missing the last part of the UUID. This causes further logic to assume this UUID is for a MIG device (as it doesn't find it in `physical_device_attrs`), and among other things, it ends up not showing up as a valid GPU detected by `nvitop`.
I kept on tracking the incorrect UUID to `cuDeviceGetUuid` and it appears that this is the point where the UUID is incomplete.
As I understand, this is just a wrapper for using the CUDA driver API, directly using the function `cuDeviceGetUuid_v2`, so I tried to use NVIDIA's cuda-python to see if I could replicate it, but oddly enough this does return the full UUID of the GPU.
As using the Python wrappers of the API returns the expected value, I wonder if there's something `nvitop`'s implementation could do to mitigate this issue.
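The effect of the truncated UUID on device lookup, as described in the additional context, can be modelled in a few lines. This is a simplified, hypothetical sketch and not nvitop's actual implementation: `resolve_visible_indices` and its arguments are made-up names, and the real parser handles many more cases (MIG devices, abbreviated UUIDs, device ordering).

```python
def resolve_visible_indices(reported_uuids, physical_uuids):
    """Map UUIDs reported by the CUDA driver to physical device indices.

    Hypothetical helper: UUIDs without an exact match are skipped,
    standing in for the "must be a MIG device" branch from the issue.
    """
    index_of = {uuid: i for i, uuid in enumerate(physical_uuids)}
    return [index_of[u] for u in reported_uuids if u in index_of]


full = '130961397ada8313ee08000dd8540fe1'  # UUID from the physical device list
truncated = '130961397ada8313ee08'         # UUID cut off at the null byte
physical = [full]

print(resolve_visible_indices([full], physical))       # [0]
print(resolve_visible_indices([truncated], physical))  # []
```

With the complete UUID the device resolves to index 0; with the truncated one nothing matches, which is consistent with the empty list reported above.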