Kernel Panic using Cuda 10.2 #43

Closed
vade opened this issue Jan 29, 2020 · 5 comments

Comments

vade commented Jan 29, 2020

Hello - firstly, thanks for this and your great documentation. Much appreciated.

I'm using Ubuntu 18.04 LTS, CUDA 10.2, NVIDIA 440 drivers, and a single Titan X.

I've followed the readme, installed the dependencies in a virtual environment, and compiled the extensions, and I am able to run the demo. However, after a few seconds the demo crashes and kernel panics the entire system.

I've attempted to edit both extensions' NVCC flags, as per the helpful note in the documentation, but to no avail.

    '-gencode', 'arch=compute_52,code=sm_52',
    '-gencode', 'arch=compute_60,code=sm_60',
    '-gencode', 'arch=compute_61,code=sm_61',
    '-gencode', 'arch=compute_70,code=sm_70',
    '-gencode', 'arch=compute_75,code=sm_75',
    '-gencode', 'arch=compute_75,code=compute_75',

However, that also kernel panics the machine.
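For anyone narrowing down the flag list above, a minimal sketch of how the `-gencode` pair for the installed GPU could be derived from PyTorch. The helper names `gencode_flags` and `detect_capability` are hypothetical and not part of this repository; `torch.cuda.get_device_capability` is the only library call relied on.

```python
def gencode_flags(major, minor):
    """Return the nvcc -gencode pair for one compute capability, e.g. (6, 1)."""
    arch = f"{major}{minor}"
    return ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]


def detect_capability():
    """Return (major, minor) of GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    capability = detect_capability()
    if capability is None:
        print("No CUDA device visible to PyTorch")
    else:
        print(capability, gencode_flags(*capability))
```

The printed pair can then be compared against the entries in the extensions' build scripts. This only confirms which architecture the card reports; it does not by itself explain a kernel panic.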

I can monitor GPU memory usage right up to the crash and see PyTorch allocating GPU memory. Usage appears to climb to the maximum, and then the system dies.
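Since the machine goes down before anything can be inspected, one option is to log memory samples to disk while the demo runs. The following is only a sketch: it assumes `nvidia-smi` is on the `PATH`, and the function names, the log format, and the sampling interval are made up for illustration.

```python
import os
import subprocess
import time

# Ask nvidia-smi for used and total memory in MiB, one CSV line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_line(line):
    """Parse one 'used, total' CSV line from nvidia-smi into two ints (MiB)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def log_gpu_memory(path, interval=0.5, samples=None):
    """Append memory samples for the first GPU to `path`, syncing each to disk."""
    count = 0
    with open(path, "a") as log:
        while samples is None or count < samples:
            out = subprocess.check_output(QUERY, universal_newlines=True)
            used, total = parse_memory_line(out.splitlines()[0])
            log.write(f"{time.time():.1f} {used} {total}\n")
            log.flush()
            os.fsync(log.fileno())  # push the sample to disk before any crash
            count += 1
            time.sleep(interval)
```

Calling `log_gpu_memory("gpu_mem.log")` in a second terminal before starting the demo should leave the last few samples on disk after a reboot, which shows whether memory really was exhausted at the moment of the panic.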

Are there other specific hardware requirements for this code base?

vade commented Jan 29, 2020

Also, I'm aware I can't expect you to help resolve a remote kernel panic; I'm just looking for any other info to guide my debugging.

NexonSU commented Jan 29, 2020

It seems like the problem is with your particular combination of kernel, nvidia-driver-440, and cuda-10.2.
I have a similar system:
Ubuntu 18.04.3 (kernel 5.3.0-28)
cuda 10.2.89-1
nvidia-driver-440 440.33.01-0ubuntu1
And everything works fine.
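To compare a failing machine against the versions listed above, a small script can gather the same three facts. This is a sketch under assumptions: it expects `nvidia-smi` and `nvcc` to be on the `PATH` and reports `None` for any tool that is missing; `run` and `environment_report` are hypothetical names.

```python
import platform
import subprocess


def run(cmd):
    """Return the stripped stdout of `cmd`, or None if it cannot be run."""
    try:
        return subprocess.check_output(
            cmd, universal_newlines=True, stderr=subprocess.DEVNULL
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def environment_report():
    """Collect the kernel release, driver version, and CUDA toolkit banner."""
    return {
        "kernel": platform.release(),
        "driver": run(["nvidia-smi", "--query-gpu=driver_version",
                       "--format=csv,noheader"]),
        "nvcc": run(["nvcc", "--version"]),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```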

vade commented Jan 29, 2020

Thanks, that's great to know. I'll see if I can find any other issues.

attashe commented Jan 29, 2020

Hi, I have a similar problem:

error in correlation_forward_cuda_kernel: no kernel image is available for execution on the device
Traceback (most recent call last):
  File "demo_MiddleBury_slowmotion.py", line 126, in <module>
    y_s,offset,filter = model(torch.stack((X0, X1),dim = 0))

pytorch = 1.3.1
NVIDIA GPU = Tesla V100
CUDA Version: 10.2
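"No kernel image is available" generally means the compiled extension contains neither a binary nor PTX that the device can use. The Tesla V100 reports compute capability 7.0, so the build needs an entry that covers it. A rough sketch of that check follows, assuming the flags are written in the `arch=compute_XX,code=...` form used earlier in this thread; the `covers` helper is hypothetical.

```python
import re

# Matches one -gencode value and captures the code kind and its capability.
GENCODE = re.compile(r"arch=compute_\d+,code=(sm|compute)_(\d+)(\d)")


def covers(flags, capability):
    """Return True if any -gencode value in `flags` can run on `capability`."""
    major, minor = capability
    for flag in flags:
        match = GENCODE.fullmatch(flag)
        if match is None:
            continue  # skip "-gencode" itself and anything unrecognised
        kind = match.group(1)
        target = (int(match.group(2)), int(match.group(3)))
        if kind == "sm" and target[0] == major and target[1] <= minor:
            return True  # native binary: same major, minor not newer than device
        if kind == "compute" and target <= (major, minor):
            return True  # embedded PTX can be JIT-compiled for newer devices
    return False
```

For example, `covers(["-gencode", "arch=compute_61,code=sm_61"], (7, 0))` is false, which is the situation that produces this error on a V100. Run the check against the flags actually used in your own build, then recompile the extensions after changing them.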

MarlNox commented Feb 21, 2020

#44 (comment)

Here you go: just follow the Colab notebook posted in the comments there and modify it according to my comment.
