This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

HELP ! RuntimeError: CUDA error: device-side assert triggered #2

Open
wanngweiwei opened this issue Apr 9, 2020 · 10 comments

wanngweiwei commented Apr 9, 2020

I downloaded this repository, which contains the code, datasets, and trained models, and tried to run the commands in the IPython notebook provided by Dr. Lample. But I hit an error that I cannot solve.
The first 10 inputs in the notebook run fine, but In [11] ("Decode with beam search") throws this error:

_File "", line 109, in
_, _, beam = decoder.generate_beam(encoded, len1, beam_size=beam_size, length_penalty=1.0, early_stopping=1, max_len=200)

File "D:\LampleCharton2019\SymbolicMathematics-master\src\model\transformer.py", line 544, in generate_beam
cache[k] = (cache[k][0][beam_idx], cache[k][1][beam_idx])

RuntimeError: CUDA error: device-side assert triggered_

My environment: Windows 10, Anaconda 3, Python 3.7.5, PyTorch (GPU build) with torch.cuda.is_available() == True, and two NVIDIA Quadro P4000 GPUs that work fine in other programs.

@wanngweiwei changed the title from "CUDA" to "HELP ! RuntimeError: CUDA error: device-side assert triggered" on Apr 9, 2020
@wanngweiwei
Author

I emailed this problem to Dr. Lample, and he was so kind as to reply quickly.
I will paste his answer here:

Hi Weiwei,

I'm not sure what is happening, but this is the kind of issue that usually happens when one indexes an array with a value higher than what is available (for instance, the lookup table has 100 embeddings, but you query word 105 or something). The problem with CUDA is that it is not clear where the issue happens, because it runs asynchronously.

Did you modify the code? What is the command you ran?
Can you try the same command with the CUDA_LAUNCH_BLOCKING prefix, i.e. "CUDA_LAUNCH_BLOCKING=1 python ...", and see what happens?
This should give a better error message about where exactly the issue is happening.

Also, would you mind posting the issue on GitHub, in case someone else faces the same problem?

Thank you,
Guillaume
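
A minimal sketch of the failure mode Guillaume describes (assuming a CUDA device is available; the sizes here are made up for illustration):

```python
import torch

# A lookup table with 100 embeddings, queried with word id 105.
emb = torch.nn.Embedding(100, 16).cuda()
idx = torch.tensor([105], device='cuda')

out = emb(idx)  # on GPU: "CUDA error: device-side assert triggered"
                # (the same lookup on CPU raises a plain IndexError)
```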


wanngweiwei commented Apr 9, 2020

I was so excited to get his reply. Thank you very, very much!

I tried adding the prefix CUDA_LAUNCH_BLOCKING=1; the error then becomes:

File "", line 110, in
_, _, beam = decoder.generate_beam(encoded, len1, beam_size=beam_size, length_penalty=1.0, early_stopping=1, max_len=200)

File "D:\LampleCharton2019\SymbolicMathematics-master\src\model\transformer.py", line 540, in generate_beam
generated = generated[:, beam_idx]

RuntimeError: CUDA error: device-side assert triggered

When the prefix is changed to CUDA_LAUNCH_BLOCKING=0, the error is the same as with no CUDA_LAUNCH_BLOCKING prefix at all:

File "", line 110, in
_, _, beam = decoder.generate_beam(encoded, len1, beam_size=beam_size, length_penalty=1.0, early_stopping=1, max_len=200)

File "D:\LampleCharton2019\SymbolicMathematics-master\src\model\transformer.py", line 544, in generate_beam
cache[k] = (cache[k][0][beam_idx], cache[k][1][beam_idx])

RuntimeError: CUDA error: device-side assert triggered
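
A side note on the prefix: the Unix-style `CUDA_LAUNCH_BLOCKING=1 python ...` syntax is not recognized by cmd.exe on Windows. A sketch of an equivalent way to set it from inside Python (it must run before the first CUDA operation):

```python
import os

# Make CUDA kernel launches synchronous, so the assert surfaces at the
# line that actually caused it; must be set before CUDA is initialized.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # safe: CUDA is only initialized on the first GPU call
```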

@wanngweiwei
Author

Can anybody help me?


glample commented Apr 18, 2020

Hi @wanngweiwei, sorry for the delay. The CUDA_LAUNCH_BLOCKING=1 output is helpful; the error seems to come from this line: generated = generated[:, beam_idx]

I don't understand how this error can happen, though. Do you have the full command that you used to get this error, so I can try to reproduce it?

Also, did you make any modifications to the code? And could you print the shape of generated and the value of beam_idx, with print(generated.shape, beam_idx) just before the line that fails?

Best,
Guillaume
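
For readers following along, a sketch of where that print would go in src/model/transformer.py (context simplified, not verbatim repository code):

```python
# Inside generate_beam(), just above the line reported in the traceback:
print(generated.shape, beam_idx)    # inspect both right before the assert fires
generated = generated[:, beam_idx]  # reorder the generated tokens by beam index
```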

@wanngweiwei
Author

[screenshot: print output, 2020-04-19 150031]

Thank you, Dr. Lample. I tried to print the values as you advised. But please forgive me: I am new to seq2seq and beam search, and not yet familiar with Python. Can you give me more guidance here? Thank you so much.


glample commented Apr 19, 2020

Okay, so generated has the right shape. I'm not sure what is going on with beam_idx, though; huge values like 794946954264578 look like a bug. What version of PyTorch are you using?

Could you try to print:
print(sent_id, beam_size, beam_id)
just before
next_sent_beam.append((value, word_id, sent_id * beam_size + beam_id))
and see the output? That expression is what gets converted into something weird.

Again, it would be helpful if you could provide the command you used to hit this issue; I could then try to debug and fix it on my side.
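
A sketch of the surrounding loop in generate_beam(), simplified from the repository code, showing where that print lands:

```python
# For each sentence, topk returns 2 * beam_size flat indices into a
# (beam_size * n_words) score matrix; each one encodes a (beam, word) pair.
for idx, value in zip(next_words[sent_id], next_scores[sent_id]):
    beam_id = idx // n_words  # which beam the candidate came from
    word_id = idx % n_words   # which vocabulary word it proposes
    print(sent_id, beam_size, beam_id)  # the values Guillaume asks about
    next_sent_beam.append((value, word_id, sent_id * beam_size + beam_id))
```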


wanngweiwei commented Apr 19, 2020

Thank you, Dr. Lample.
My PyTorch version is 1.3.0; the print output is shown below.

[screenshot: print output, 2020-04-19 233230]

The commands I used are just the cells of the IPython notebook provided with this code:

In [1]:

import os
import numpy as np
import sympy as sp
import torch
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

from src.utils import AttrDict
from src.envs import build_env
from src.model import build_modules

from src.utils import to_cuda
from src.envs.sympy_utils import simplify

In [2]:

model_path = 'fwd_bwd.pth'  # path to the downloaded pretrained model
assert os.path.isfile(model_path)

In [3]:

params = AttrDict({
    'env_name': 'char_sp',
    'int_base': 10,
    'balanced': False,
    'positive': True,
    'precision': 10,
    'n_variables': 1,
    'n_coefficients': 0,
    'leaf_probs': '0.75,0,0.25,0',
    'max_len': 512,
    'max_int': 5,
    'max_ops': 15,
    'max_ops_G': 15,
    'clean_prefix_expr': True,
    'rewrite_functions': '',
    'tasks': 'prim_fwd',
    'operators': 'add:10,sub:3,mul:10,div:5,sqrt:4,pow2:4,pow3:2,pow4:1,pow5:1,ln:4,exp:4,sin:4,cos:4,tan:4,asin:1,acos:1,atan:1,sinh:1,cosh:1,tanh:1,asinh:1,acosh:1,atanh:1',
    'cpu': False,
    'emb_dim': 1024,
    'n_enc_layers': 6,
    'n_dec_layers': 6,
    'n_heads': 8,
    'dropout': 0,
    'attention_dropout': 0,
    'sinusoidal_embeddings': False,
    'share_inout_emb': True,
    'reload_model': model_path,
})

In [4]:

env = build_env(params)
x = env.local_dict['x']

In [5]:

modules = build_modules(env, params)
encoder = modules['encoder']
decoder = modules['decoder']

In [6]:

F_infix = 'x * tan(exp(x)/x)'
F_infix = 'x * cos(x**2) * tan(x)'
F_infix = 'cos(x**2 * exp(x * cos(x)))'
F_infix = 'ln(cos(x + exp(x)) * sin(x**2 + 2) * exp(x) / x)'

In [7]:

F = sp.S(F_infix, locals=env.local_dict)
F

In [8]:

f = F.diff(x)
f

In [9]:

F_prefix = env.sympy_to_prefix(F)
f_prefix = env.sympy_to_prefix(f)
print(f"F prefix: {F_prefix}")
print(f"f prefix: {f_prefix}")

In [10]:

x1_prefix = env.clean_prefix(['sub', 'derivative', 'f', 'x', 'x'] + f_prefix)
x1 = torch.LongTensor(
    [env.eos_index] +
    [env.word2id[w] for w in x1_prefix] +
    [env.eos_index]
).view(-1, 1)
len1 = torch.LongTensor([len(x1)])
x1, len1 = to_cuda(x1, len1)
with torch.no_grad():
    encoded = encoder('fwd', x=x1, lengths=len1, causal=False).transpose(0, 1)

In [11]:

beam_size = 10
with torch.no_grad():
    _, _, beam = decoder.generate_beam(encoded, len1, beam_size=beam_size, length_penalty=1.0, early_stopping=1, max_len=200)
assert len(beam) == 1
hypotheses = beam[0].hyp
assert len(hypotheses) == beam_size

Then the error comes in In [11]...


glample commented Apr 20, 2020

Can you try print(idx, n_words) just before the beam_id = idx // n_words line?
Basically, I want to find the first line where a gigantic value appears.
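
A sketch of why that print is informative (the variable names come from generate_beam(); the sanity bound is the editorializing part):

```python
# idx is a flat index into a (beam_size * n_words) score matrix, so any
# sane value satisfies 0 <= idx < beam_size * n_words. Anything larger
# means torch.topk upstream already returned corrupt indices.
print(idx, n_words)
beam_id = idx // n_words
word_id = idx % n_words
```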

@wanngweiwei
Author

Dear Dr. Lample, the printing shows this:

[screenshots: print output, 2020-04-21 104753 (parts 1 and 2)]

I hope this gives you some useful information.


glample commented Apr 21, 2020

I see. So it is the next_words variable that contains the huge values. The problem must come from this line:
next_scores, next_words = torch.topk(_scores, 2 * beam_size, dim=1, largest=True, sorted=True)

Can you check whether there is anything wrong with the _scores variable? Maybe print it, or a slice of it if the full matrix is too large.
I suspect there are some NaNs in _scores, cf. allenai/allennlp#2028

It's very difficult for me to help like this; I really need to investigate on my computer. Can you tell me the command you ran / how I can reproduce this error?
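
A sketch of a check for that hypothesis, placed just before the topk call (the error message string is made up; the other names come from generate_beam()):

```python
# Fail fast if any score is NaN, instead of letting topk return garbage.
if torch.isnan(_scores).any():
    raise RuntimeError("NaN detected in _scores before topk")

next_scores, next_words = torch.topk(_scores, 2 * beam_size, dim=1, largest=True, sorted=True)
```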
