Mixing up GPU names on slowest first GPU bus ID #18

Open
eliorc opened this issue Nov 13, 2018 · 4 comments

eliorc commented Nov 13, 2018

I understand that GPUtil infers the GPUs' attributes so that its output matches the nvidia-smi output.

The thing is that GPUtil is commonly used with TensorFlow and other GPU-utilizing frameworks, and these frameworks usually assign GPU IDs sorted by the GPUs' quality (fastest first).
For example, in TensorFlow, if you set CUDA_VISIBLE_DEVICES = '0' in your environment variables, only the fastest GPU will be exposed to the library.
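Something like this minimal sketch (TensorFlow is just the example framework here; the environment variable has to be set before the framework initializes CUDA):

```python
import os

# With CUDA's default device ordering (FASTEST_FIRST), device 0 is the GPU
# CUDA guesses to be fastest, not necessarily the one on the first PCI bus.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf  # TensorFlow will now only see that single GPU
```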

In my setup I have two different GPUs in the same machine. At runtime I use GPUtil to figure out which GPU has the most memory available, and I use that GPU's ID to designate which GPU to use. But since my slowest GPU is installed on the first bus, it shows up in GPUtil as 0 and the faster one as 1.
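Roughly, what I do looks like this (a simplified sketch; getGPUs() and the id/memoryFree attributes are part of GPUtil's GPU objects):

```python
import os
import GPUtil

# Pick the GPU with the most free memory, as reported by GPUtil / nvidia-smi.
gpus = GPUtil.getGPUs()
best_gpu = max(gpus, key=lambda gpu: gpu.memoryFree)

# Problem: best_gpu.id follows the nvidia-smi (PCI bus) order, while CUDA's
# default ordering is fastest-first, so the two numberings can disagree.
os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu.id)
```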

I would suggest adding a parameter to GPUtil.getGPUs() that helps sort this out, so that any downstream frameworks relying on CUDA_VISIBLE_DEVICES can get the IDs right.


anderskm commented Nov 13, 2018

I am not sure what your main concern is. Is it 1) that the IDs do not match between TensorFlow (CUDA_VISIBLE_DEVICES) and GPUtil (nvidia-smi), or 2) that the GPUs returned by GPUtil (nvidia-smi) are not ordered according to their processing speed?

In case of 1), that can be solved by setting the CUDA environment variable CUDA_DEVICE_ORDER = "PCI_BUS_ID".
See the example "Occupy only 1 GPU in TensorFlow" in the GPUtil readme.
See also NVIDIA's description of the CUDA environment variables.
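Along the lines of the readme example, a sketch (both environment variables must be set before TensorFlow initializes CUDA):

```python
import os
import GPUtil

# Make CUDA enumerate GPUs in the same (PCI bus) order as nvidia-smi/GPUtil.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Pick the first available GPU according to GPUtil and restrict CUDA to it.
device_id = GPUtil.getFirstAvailable()[0]
os.environ["CUDA_VISIBLE_DEVICES"] = str(device_id)

import tensorflow as tf  # sees only the selected GPU, with matching IDs
```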

In case of 2), NVIDIA only guarantees that the first GPU is the fastest. The rest of the GPUs are returned in unspecified order.
From the CUDA environment variables:

FASTEST_FIRST causes CUDA to guess which device is fastest using a simple heuristic, and make that device 0, leaving the order of the rest of the devices unspecified.

As such, there is no guarantee that GPU#2 is faster than GPU#3. And if GPU#1 is already occupied, you are back to the original problem.
Unfortunately, I do not see a solution to case 2) at the moment. However, you or anyone else are very welcome to suggest a solution :-)

  • Edit: Fixed some spelling.


eliorc commented Nov 13, 2018

Thanks for the quick response. I am talking about issue 1).

Yeah, this is how I deal with it now, by setting CUDA_DEVICE_ORDER. My suggestion was that, since GPUtil is a standard choice when working with CUDA-backed frameworks, it would be helpful if there were support on the library side (GPUtil's side) for that default behavior, since using the CUDA defaults is such a common use case.
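To make the suggestion concrete, a purely hypothetical sketch; the order="fastest_first" argument does not exist in GPUtil, it only illustrates the kind of library-side support I mean:

```python
import GPUtil

# Hypothetical call: ask GPUtil to number the GPUs the way CUDA does by
# default (fastest first), so the returned IDs could be used directly in
# CUDA_VISIBLE_DEVICES without also setting CUDA_DEVICE_ORDER.
gpus = GPUtil.getGPUs(order="fastest_first")  # not a real GPUtil parameter
```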

@anderskm

@eliorc I'm sorry for not getting back to you sooner.

As far as I can tell, there is no way of sorting the GPUs from fastest to slowest in nvidia-smi. Likewise, CUDA's heuristic for ordering the GPUs by speed is proprietary, which means there is no way of replicating that order. Secondly, NVIDIA does not guarantee the order of the remaining GPUs.
In short, I do not see a reliable way for GPUtil to deal with the default behavior of CUDA's GPU ordering (fastest first).

I will keep the issue open.

@tashrifbillah

Here are my two cents: the GPUtil.get*() functions should respect the environment variable CUDA_VISIBLE_DEVICES. Let's say I have 4 GPUs but make only 2 of them visible; then these methods should return results as if only 2 GPUs exist. Currently, GPUtil looks at the nvidia-smi output and returns whatever that reports.
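A rough sketch of the behavior I mean (today this filtering has to be done on the user side; the sketch only handles plain integer IDs in CUDA_VISIBLE_DEVICES):

```python
import os
import GPUtil

gpus = GPUtil.getGPUs()

visible = os.environ.get("CUDA_VISIBLE_DEVICES")
if visible is not None:
    # Keep only the devices listed in CUDA_VISIBLE_DEVICES (comma-separated IDs).
    visible_ids = {int(i) for i in visible.split(",") if i.strip().isdigit()}
    gpus = [gpu for gpu in gpus if gpu.id in visible_ids]

print([gpu.id for gpu in gpus])  # e.g. only 2 IDs when 2 of 4 GPUs are visible
```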
