Issue when there is a discrepancy between available CUDA devices at build time / runtime #39

Closed
timlacroix opened this issue Dec 9, 2019 · 8 comments


@timlacroix

Hey, first off, thanks for the library!

I ran into some weird issues today when trying to use a kernel on 'cuda:1' when the kernel had been built on a machine with only 2 GPUs. I hit this because I use a shared home filesystem (and hence a shared .cache folder) on a cluster where I have access to machines with various numbers of GPUs.

Here is how to reproduce, on a machine with 2 GPUs:

test.py:

import torch
from pykeops.torch import LazyTensor

def test(data):
	neigh_state = LazyTensor(data[None, :, :])
	state = LazyTensor(data[:, None, :])
	all_distances = ((neigh_state - state) ** 2).sum(dim=2)
	return (- all_distances).logsumexp(dim=1)

tensor = torch.randn(10,128).to('cuda:0')
print(torch.cuda.device_count())
test(tensor)

Run CUDA_VISIBLE_DEVICES=0 python test.py. This should build a kernel.
Then change 'cuda:0' to 'cuda:1' in test.py.
Run python test.py.

This fails with the error:
invalid Gpu device number. If the number of available Gpus is > 12, add required lines at the end of function SetGpuProps and recompile.

Recompiling is not a great option for me, as I might run different experiments using the same kernel but on machines with different numbers of available GPUs.

@bcharlier
Member

Hi @timlacroix,

when you set CUDA_VISIBLE_DEVICES=0, the nvidia driver exposes only the GPU with id=0. So at compilation time, keops only detects the GPU with id=0. This is thus the expected behavior.

Can you try calling python test.py without setting the env variable CUDA_VISIBLE_DEVICES? It should work as you expect, since you already ask keops to run on GPU 0 through the .to('cuda:0')...

For instance:

import torch
from pykeops.torch import LazyTensor

def test(data):
	neigh_state = LazyTensor(data[None, :, :])
	state = LazyTensor(data[:, None, :])
	all_distances = ((neigh_state - state) ** 2).sum(dim=2)
	return (- all_distances).logsumexp(dim=1)

tensor = torch.randn(10,128).to('cuda:0')
test(tensor) # should run on gpu 0

tensor1 = torch.randn(10,128).to('cuda:1')
print(torch.cuda.device_count())
test(tensor1) # should run on gpu 1... without recompiling

@timlacroix
Author

timlacroix commented Dec 12, 2019

Hi, I used CUDA_VISIBLE_DEVICES here just to make the problem reproducible.

My question is about a set-up where development (and thus compilation) happens on a machine with N GPUs and testing happens on a machine with M GPUs, with both sharing the same compilation cache.

Couldn't the number of GPUs available at compile time be included in the compiled code hash? This way, changing the number of GPUs would just force a rebuild rather than raise an error.
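
To make the idea concrete, here is a rough, purely illustrative sketch (this is not how pykeops actually computes its hashes, and the formula string and function name are made up): the visible GPU count simply becomes part of the key that names the compiled shared lib, so changing the GPU count forces a rebuild instead of raising an error.

# Illustrative sketch only -- not pykeops' actual hashing code.
import hashlib

import torch

def kernel_cache_key(formula: str) -> str:
	# Fold the number of GPUs visible right now into the cache key,
	# so a different GPU count maps to a different compiled kernel.
	n_gpus = torch.cuda.device_count()
	payload = "{}|ngpus={}".format(formula, n_gpus).encode()
	return hashlib.sha256(payload).hexdigest()

print(kernel_cache_key("Sum_Reduction(Exp(-SqDist(x, y)), 0)"))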

@bcharlier
Member

Maybe @joanglaunes knows this better than me, but I think it will not be possible to make the same shared lib work on 2 different systems. Why don't you define 2 separate cache folders?
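
Something like this is what I have in mind (a rough sketch: the helper name set_bin_folder is what I recall from the pykeops 1.x docs, so double-check it against the installed version's API): derive a per-node cache folder from the hostname before any formula is compiled.

import os
import socket

import pykeops

# One cache folder per node, e.g. ~/.cache/pykeops-node042, so machines with
# different GPU setups never share compiled kernels.
cache_dir = os.path.expanduser("~/.cache/pykeops-" + socket.gethostname())
os.makedirs(cache_dir, exist_ok=True)
pykeops.set_bin_folder(cache_dir)  # helper name to verify against the installed pykeops version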

@bcharlier
Member

> Hi, I used CUDA_VISIBLE_DEVICES here just to make the problem reproducible.
>
> My question is about a set-up where development (and thus compilation) happens on a machine with N GPUs and testing happens on a machine with M GPUs, with both sharing the same compilation cache.
>
> Couldn't the number of GPUs available at compile time be included in the compiled code hash? This way, changing the number of GPUs would just force a rebuild rather than raise an error.

OK, a quick solution could be: include the number of GPUs and their respective arch in the name of the cache folder. So when you call your code from a different node, it will get the shared lib from the right cache dir.
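
A rough sketch of what that naming could be based on (illustrative only: pykeops currently gathers this info in its CMake scripts, so the real implementation would differ, and gpu_config_suffix is just a made-up name): query the visible devices through torch.cuda and build a suffix from their count and compute capabilities.

import torch

def gpu_config_suffix():
	# e.g. "2gpus_sm70-sm70" on a node with two V100s, "cpu" if no GPU is visible
	caps = ["sm{}{}".format(*torch.cuda.get_device_capability(i))
	        for i in range(torch.cuda.device_count())]
	return "{}gpus_{}".format(len(caps), "-".join(caps)) if caps else "cpu"

print(gpu_config_suffix())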

@timlacroix
Author

@bcharlier yes, if that's possible, that would be great :)

@bcharlier
Member

bcharlier commented Dec 12, 2019

Is the hostname unique in your case? I mean, is one of these outputs different on the various nodes:

import platform
print(platform.node())

import socket
print(socket.gethostname())

@timlacroix
Author

Both are different on the various nodes.
(However, I might want to vary the number of GPUs available at runtime on the same machine: for example, while developing, I might have two things running on 1 GPU each, then at some point want to try 1 thing on 2 GPUs...)

I don't know if including the hostname is a good idea. It would mean using a separate cache folder per machine, which the user can already do if necessary by just using a random cache folder at runtime. In my case, I would be happy to re-use the same cache across the various nodes of the cluster.

@joanglaunes
Contributor

Hello @timlacroix ,
In fact, the technical problem for us is that the detection of GPUs and their properties is currently done at compilation time, in the CMake scripts that are launched after the Python code has detected that a compilation is needed.
So, as @bcharlier suggests, the easiest solution for us is to include the hostname and node (+ maybe the content of CUDA_VISIBLE_DEVICES) in the hash code, because this is easy to do in Python.
That said, including the GPU properties in the hash code is maybe not so difficult either; I guess it can be done with GPUtil...

@gdurif closed this as completed in 726eb51 on Jan 7, 2020