This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Errors when training reaches the third epoch, every time. #20

Closed
Dootmaan opened this issue Nov 23, 2021 · 2 comments

Comments

@Dootmaan

THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=29 error=1 : invalid argument
Traceback (most recent call last):
  File "train_pointunet.py", line 211, in <module>
    loss_seg = lossfunc_seg(outputs_seg, labels)+lossfunc_dice(outputs_seg,labels)
  File "/home/why/miniconda3/envs/3.6.8/lib/python3.6/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/why/miniconda3/envs/3.6.8/lib/python3.6/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (1) : invalid argument at /pytorch/aten/src/THC/generic/THCTensorMath.cu:29

I'm very confused, because it works fine for the first several epochs.

@TimDettmers
Contributor

This looks to me like an error where I forgot to check whether the inputs are on the GPU for optimizer calls. When an error like that occurs, it is often the next CUDA call that throws. Are you doing anything in particular with the optimizer after each epoch (saving it, casting it in some way, or anything else)?

Would you be able to post the code or isolate the problem and post that snippet? This would help immensely to understand what is going wrong and where the error is.
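
One way to narrow this down is to check, just before each optimizer step, that every parameter and gradient the optimizer will touch actually lives on the GPU. The sketch below is only an illustration: the helper name `find_non_cuda_tensors` is hypothetical and not part of bitsandbytes or PyTorch. It assumes the standard PyTorch `optimizer.param_groups` layout (a list of dicts, each with a "params" list) and reads only the `is_cuda` and `grad` attributes of each parameter.

```python
def find_non_cuda_tensors(param_groups):
    """Return (group_index, param_index, kind) for each tensor not on the GPU.

    `param_groups` is expected to look like `optimizer.param_groups`:
    a list of dicts, each holding a "params" list. This is a hypothetical
    helper for debugging, not an API of any library.
    """
    problems = []
    for gi, group in enumerate(param_groups):
        for pi, p in enumerate(group["params"]):
            # A parameter without `is_cuda` is treated as not on the GPU.
            if not getattr(p, "is_cuda", False):
                problems.append((gi, pi, "param"))
            # Gradients are only checked once they exist.
            grad = getattr(p, "grad", None)
            if grad is not None and not getattr(grad, "is_cuda", False):
                problems.append((gi, pi, "grad"))
    return problems


# Possible use inside a training loop (names assumed):
#     loss.backward()
#     bad = find_non_cuda_tensors(optimizer.param_groups)
#     if bad:
#         raise RuntimeError(f"tensors not on GPU: {bad}")
#     optimizer.step()
```

Running the script with the environment variable `CUDA_LAUNCH_BLOCKING=1` can also help, since it makes kernel launches synchronous and the error is then more likely to be reported at the call that caused it.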

@Dootmaan
Author

Thank you for your help. I didn't do anything special with Adam8bit after each epoch, and the error was actually thrown halfway through the third epoch. However, I eventually found that my cudatoolkit version was inconsistent with the CUDA version. I set up the experiment environment again, and this time the problem just magically disappeared.
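
For anyone who hits the same error, a version mismatch of this kind can be spotted by comparing the CUDA version PyTorch was built against (`torch.version.cuda`) with the toolkit version reported by `nvcc --version`. The sketch below is an assumption about how to run that comparison, not a description of how the mismatch was found in this thread. The helper names are hypothetical, and comparing only major and minor components is a simplification.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract 'major.minor' from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return "{}.{}".format(match.group(1), match.group(2)) if match else None


def same_major_minor(version_a, version_b):
    """Compare two version strings on their first two components only."""
    if version_a is None or version_b is None:
        return False
    return version_a.split(".")[:2] == version_b.split(".")[:2]


def check_environment():
    """Print both versions and whether they agree (requires torch and nvcc)."""
    import torch  # imported here so the helpers above work without it

    built_with = torch.version.cuda  # None for CPU-only builds
    toolkit = None
    if shutil.which("nvcc"):
        out = subprocess.check_output(["nvcc", "--version"],
                                      universal_newlines=True)
        toolkit = parse_nvcc_release(out)
    print("PyTorch built with CUDA:", built_with)
    print("nvcc toolkit release:   ", toolkit)
    print("major.minor match:      ", same_major_minor(built_with, toolkit))


# Call check_environment() in the same environment used for training.
```

If the two versions disagree, recreating the environment so that they line up, as described above, is a reasonable first thing to try.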
