This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

HELP ! RuntimeError: CUDA error: device-side assert triggered #2

Open
wanngweiwei opened this issue Apr 9, 2020 · 10 comments

wanngweiwei commented Apr 9, 2020

I downloaded this repository, which contains the code, datasets, and trained models, and tried to run the commands in the IPython notebook provided by Dr. Lample. But I hit an error that I cannot solve.
The first 10 inputs in the notebook run fine, but In [11] ("Decode with beam search") throws this error:

_File "", line 109, in
_, _, beam = decoder.generate_beam(encoded, len1, beam_size=beam_size, length_penalty=1.0, early_stopping=1, max_len=200)

File "D:\LampleCharton2019\SymbolicMathematics-master\src\model\transformer.py", line 544, in generate_beam
cache[k] = (cache[k][0][beam_idx], cache[k][1][beam_idx])

RuntimeError: CUDA error: device-side assert triggered_

My environment: Windows 10, Anaconda 3, Python 3.7.5, PyTorch (GPU build) with torch.cuda.is_available() == True, and two NVIDIA Quadro P4000 GPUs that work fine in other programs.

@wanngweiwei changed the title from "CUDA" to "HELP ! RuntimeError: CUDA error: device-side assert triggered" on Apr 9, 2020
@wanngweiwei
Author

I emailed this problem to Dr. Lample, and he was so kind as to reply quickly.
I will paste his answer here:

Hi Weiwei,

I'm not sure what is happening, but this is the kind of issue that usually happens when one indexes an array with a value higher than what is available (for instance, the lookup table has 100 embeddings, but you query word 105 or something). The problem with CUDA is that it is not clear where the issue happens, because it runs asynchronously.

Did you modify the code? What is the command you ran?
Can you try the same command with the CUDA_LAUNCH_BLOCKING prefix, i.e. "CUDA_LAUNCH_BLOCKING=1 python ...", and see what happens?
This should give a better error message about where exactly the issue is happening.

Also, would you mind posting the issue on GitHub, in case someone else faces the same problem?

Thank you,
Guillaume
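
A minimal sketch of the failure mode Guillaume describes (assuming a CUDA device is available; the sizes here are made up for illustration):

```python
import torch

# A lookup table with 100 embeddings, queried with word id 105.
emb = torch.nn.Embedding(100, 16).cuda()
idx = torch.tensor([105], device='cuda')

out = emb(idx)  # on GPU: "CUDA error: device-side assert triggered"
                # (the same lookup on CPU raises a plain IndexError)
```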


wanngweiwei commented Apr 9, 2020

I was so excited to get his reply. Thank you very, very much!

I tried adding the prefix CUDA_LAUNCH_BLOCKING=1; the error then becomes:

File "", line 110, in
_, _, beam = decoder.generate_beam(encoded, len1, beam_size=beam_size, length_penalty=1.0, early_stopping=1, max_len=200)

File "D:\LampleCharton2019\SymbolicMathematics-master\src\model\transformer.py", line 540, in generate_beam
generated = generated[:, beam_idx]

RuntimeError: CUDA error: device-side assert triggered

When the prefix is changed to CUDA_LAUNCH_BLOCKING=0, the error is the same as with no CUDA_LAUNCH_BLOCKING prefix at all:

File "", line 110, in
_, _, beam = decoder.generate_beam(encoded, len1, beam_size=beam_size, length_penalty=1.0, early_stopping=1, max_len=200)

File "D:\LampleCharton2019\SymbolicMathematics-master\src\model\transformer.py", line 544, in generate_beam
cache[k] = (cache[k][0][beam_idx], cache[k][1][beam_idx])

RuntimeError: CUDA error: device-side assert triggered
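
A side note on the prefix: the Unix-style `CUDA_LAUNCH_BLOCKING=1 python ...` syntax is not recognized by cmd.exe on Windows. A sketch of an equivalent way to set it from inside Python (it must run before the first CUDA operation):

```python
import os

# Make CUDA kernel launches synchronous, so the assert surfaces at the
# line that actually caused it; must be set before CUDA is initialized.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # safe: CUDA is only initialized on the first GPU call
```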

@wanngweiwei
Author

Can anybody help me?


glample commented Apr 18, 2020

Hi @wanngweiwei, sorry for the delay. The CUDA_LAUNCH_BLOCKING=1 output is helpful; the error seems to come from this line: generated = generated[:, beam_idx]

I don't understand how this error can happen, though. Do you have the full command that you used to get this error, so I can try to reproduce it?

Also, did you make any modifications to the code? And could you print the shape of generated and the value of beam_idx, with print(generated.shape, beam_idx) just before the line that fails?

Best,
Guillaume
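
For readers following along, a sketch of where that print would go in src/model/transformer.py (context simplified, not verbatim repository code):

```python
# Inside generate_beam(), just above the line reported in the traceback:
print(generated.shape, beam_idx)    # inspect both right before the assert fires
generated = generated[:, beam_idx]  # reorder the generated tokens by beam index
```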

@wanngweiwei
Author

[screenshot: print output, 2020-04-19 150031]

Thank you, Dr. Lample. I tried to print the values as you advised. But please forgive me: I am new to seq2seq and beam search, and not yet familiar with Python. Can you give me more guidance here? Thank you so much.


glample commented Apr 19, 2020

Okay, so generated has the right shape. I'm not sure what is going on with beam_idx, though; huge values like 794946954264578 look like a bug. What version of PyTorch are you using?

Could you try to print:
print(sent_id, beam_size, beam_id)
just before
next_sent_beam.append((value, word_id, sent_id * beam_size + beam_id))
and see the output? That expression is what gets converted into something weird.

Again, it would be helpful if you could provide the command you used to hit this issue; I could then try to debug and fix it on my side.
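
A sketch of the surrounding loop in generate_beam(), simplified from the repository code, showing where that print lands:

```python
# For each sentence, topk returns 2 * beam_size flat indices into a
# (beam_size * n_words) score matrix; each one encodes a (beam, word) pair.
for idx, value in zip(next_words[sent_id], next_scores[sent_id]):
    beam_id = idx // n_words  # which beam the candidate came from
    word_id = idx % n_words   # which vocabulary word it proposes
    print(sent_id, beam_size, beam_id)  # the values Guillaume asks about
    next_sent_beam.append((value, word_id, sent_id * beam_size + beam_id))
```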


wanngweiwei commented Apr 19, 2020

Thank you, Dr. Lample.
My PyTorch version is 1.3.0; the print output is shown below.

[screenshot: print output, 2020-04-19 233230]

The commands I used are just the cells of the IPython notebook provided with this code:

In [1]:

import os
import numpy as np
import sympy as sp
import torch
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

from src.utils import AttrDict
from src.envs import build_env
from src.model import build_modules

from src.utils import to_cuda
from src.envs.sympy_utils import simplify

In [2]:

model_path = 'fwd_bwd.pth'  # path to the downloaded pretrained model
assert os.path.isfile(model_path)

In [3]:

params = AttrDict({
    'env_name': 'char_sp',
    'int_base': 10,
    'balanced': False,
    'positive': True,
    'precision': 10,
    'n_variables': 1,
    'n_coefficients': 0,
    'leaf_probs': '0.75,0,0.25,0',
    'max_len': 512,
    'max_int': 5,
    'max_ops': 15,
    'max_ops_G': 15,
    'clean_prefix_expr': True,
    'rewrite_functions': '',
    'tasks': 'prim_fwd',
    'operators': 'add:10,sub:3,mul:10,div:5,sqrt:4,pow2:4,pow3:2,pow4:1,pow5:1,ln:4,exp:4,sin:4,cos:4,tan:4,asin:1,acos:1,atan:1,sinh:1,cosh:1,tanh:1,asinh:1,acosh:1,atanh:1',
    'cpu': False,
    'emb_dim': 1024,
    'n_enc_layers': 6,
    'n_dec_layers': 6,
    'n_heads': 8,
    'dropout': 0,
    'attention_dropout': 0,
    'sinusoidal_embeddings': False,
    'share_inout_emb': True,
    'reload_model': model_path,
})

In [4]:

env = build_env(params)
x = env.local_dict['x']

In [5]:

modules = build_modules(env, params)
encoder = modules['encoder']
decoder = modules['decoder']

In [6]:

F_infix = 'x * tan(exp(x)/x)'
F_infix = 'x * cos(x**2) * tan(x)'
F_infix = 'cos(x**2 * exp(x * cos(x)))'
F_infix = 'ln(cos(x + exp(x)) * sin(x**2 + 2) * exp(x) / x)'

In [7]:

F = sp.S(F_infix, locals=env.local_dict)
F

In [8]:

f = F.diff(x)
f

In [9]:

F_prefix = env.sympy_to_prefix(F)
f_prefix = env.sympy_to_prefix(f)
print(f"F prefix: {F_prefix}")
print(f"f prefix: {f_prefix}")

In [10]:

x1_prefix = env.clean_prefix(['sub', 'derivative', 'f', 'x', 'x'] + f_prefix)
x1 = torch.LongTensor(
    [env.eos_index] +
    [env.word2id[w] for w in x1_prefix] +
    [env.eos_index]
).view(-1, 1)
len1 = torch.LongTensor([len(x1)])
x1, len1 = to_cuda(x1, len1)
with torch.no_grad():
    encoded = encoder('fwd', x=x1, lengths=len1, causal=False).transpose(0, 1)

In [11]:

beam_size = 10
with torch.no_grad():
    _, _, beam = decoder.generate_beam(encoded, len1, beam_size=beam_size, length_penalty=1.0, early_stopping=1, max_len=200)
assert len(beam) == 1
hypotheses = beam[0].hyp
assert len(hypotheses) == beam_size

Then the error comes in In [11]...


glample commented Apr 20, 2020

Can you try print(idx, n_words) just before the beam_id = idx // n_words line?
Basically, I want to find the first line where a gigantic value appears.
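
A sketch of why that print is informative (the variable names come from generate_beam(); the sanity bound is the editorializing part):

```python
# idx is a flat index into a (beam_size * n_words) score matrix, so any
# sane value satisfies 0 <= idx < beam_size * n_words. Anything larger
# means torch.topk upstream already returned corrupt indices.
print(idx, n_words)
beam_id = idx // n_words
word_id = idx % n_words
```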

@wanngweiwei
Author

Dear Dr. Lample, the printing shows this:

[screenshots: print output, 2020-04-21 104753 (parts 1 and 2)]

I hope this gives you some useful information.


glample commented Apr 21, 2020

I see. So it is the next_words variable that contains the huge values. The problem must come from this line:
next_scores, next_words = torch.topk(_scores, 2 * beam_size, dim=1, largest=True, sorted=True)

Can you check whether there is anything wrong with the _scores variable? Maybe print it, or a slice of it if the full matrix is too large.
I suspect there are some NaNs in _scores, cf. allenai/allennlp#2028

It's very difficult for me to help like this; I really need to investigate on my computer. Can you tell me the command you ran / how I can reproduce this error?
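
A sketch of a check for that hypothesis, placed just before the topk call (the error message string is made up; the other names come from generate_beam()):

```python
# Fail fast if any score is NaN, instead of letting topk return garbage.
if torch.isnan(_scores).any():
    raise RuntimeError("NaN detected in _scores before topk")

next_scores, next_words = torch.topk(_scores, 2 * beam_size, dim=1, largest=True, sorted=True)
```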
