Errors when training to the third epoch, every time. #20

Dootmaan · 2021-11-23T10:02:11Z

THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=29 error=1 : invalid argument
Traceback (most recent call last):
  File "train_pointunet.py", line 211, in <module>
    loss_seg = lossfunc_seg(outputs_seg, labels)+lossfunc_dice(outputs_seg,labels)
  File "/home/why/miniconda3/envs/3.6.8/lib/python3.6/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/why/miniconda3/envs/3.6.8/lib/python3.6/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (1) : invalid argument at /pytorch/aten/src/THC/generic/THCTensorMath.cu:29

I'm very confused, because it works fine for the first several epochs.

TimDettmers · 2021-11-23T14:54:12Z

This looks to me like an error where I forgot to check whether the inputs are on the GPU for optimizer calls. When such an error occurs, it is often the next CUDA call that throws. Are you doing something in particular with the optimizer after each epoch (saving it, casting it in some way, or anything else)?

Would you be able to post the code or isolate the problem and post that snippet? This would help immensely to understand what is going wrong and where the error is.
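One way to test the "inputs not on the GPU" hypothesis is to scan the optimizer right before the failing step. The sketch below is only an illustration: `find_off_gpu_tensors` is a hypothetical helper, and it assumes the optimizer exposes `param_groups` and `state` the way `torch.optim.Optimizer` does (whether `Adam8bit` keeps all of its state there is an assumption).

```python
def find_off_gpu_tensors(optimizer):
    """List parameters and optimizer-state tensors that are not on a CUDA device.

    Hypothetical helper: relies only on the `param_groups` / `state` layout
    of torch.optim.Optimizer and on the `is_cuda` attribute of tensors.
    """
    problems = []
    for group_idx, group in enumerate(optimizer.param_groups):
        for param_idx, param in enumerate(group["params"]):
            if not param.is_cuda:
                problems.append(f"param_groups[{group_idx}]['params'][{param_idx}]")
            # Per-parameter state (e.g. moment estimates) is keyed by the parameter.
            for name, value in optimizer.state.get(param, {}).items():
                # Non-tensor entries such as step counters have no `is_cuda`.
                if hasattr(value, "is_cuda") and not value.is_cuda:
                    problems.append(f"state[{group_idx}][{param_idx}]['{name}']")
    return problems
```

Calling `print(find_off_gpu_tensors(optimizer))` before `optimizer.step()` should print an empty list if everything lives on the GPU. Because CUDA calls are asynchronous, running the script with the environment variable `CUDA_LAUNCH_BLOCKING=1` can also help the traceback point at the call that actually failed.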

Dootmaan · 2021-11-25T08:02:15Z

Thank you for your help. I didn't do anything special with Adam8bit after each epoch, and the error was actually thrown halfway through the third epoch. In the end, however, I found that my cudatoolkit version was inconsistent with the CUDA version. I set up the experiment environment again, and this time the problem simply disappeared.
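For anyone hitting the same error, a quick way to spot this kind of mismatch is to print the CUDA version PyTorch was built against next to the one reported by the local toolkit. This is a rough sketch, not an official check: `parse_nvcc_release` and `collect_cuda_versions` are hypothetical helper names, and it assumes `nvcc` is on the `PATH` and prints its usual `release X.Y` line.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract the 'X.Y' release number from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", text)
    return match.group(1) if match else None


def collect_cuda_versions():
    """Gather the CUDA version of the PyTorch build and of the local nvcc."""
    versions = {"torch_build": None, "nvcc": None}
    try:
        import torch
        # CUDA version this PyTorch binary was compiled with (None for CPU builds).
        versions["torch_build"] = torch.version.cuda
    except ImportError:
        pass
    nvcc = shutil.which("nvcc")
    if nvcc:
        result = subprocess.run(
            [nvcc, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,  # Python 3.6 compatible text mode
        )
        versions["nvcc"] = parse_nvcc_release(result.stdout)
    return versions
```

Running `print(collect_cuda_versions())` inside the training environment shows both values side by side. If they differ, recreating the environment with matching versions, as described above, is a reasonable first step.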

Dootmaan closed this as completed Nov 25, 2021
