Observations
As you can see, on the first iteration the model output is a valid tensor with real values (i.e. before optimizer.step()).
When iteration 1 begins (i.e. after optimizer.step()), the output becomes nan.
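For reference, this is roughly how I am checking the output each iteration (a minimal, self-contained sketch; the model, data, and names here are stand-ins, not my actual script):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real YOLO model and data, just to show the per-step check.
model = nn.Linear(8, 4)
loss_func = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(3):
    x, y = torch.randn(16, 8), torch.randn(16, 4)
    out = model(x)
    # In the real script this is False at step 0 (before the first optimizer.step())
    # and True from step 1 onwards (after the first optimizer.step()).
    if torch.isnan(out).any():
        print(f"nan in model output at step {step}")
    loss = loss_func(out, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```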
Debug method 0
After setting torch.autograd.set_detect_anomaly(True) globally, I found the result below:
[W python_anomaly_mode.cpp:104] Warning: Error detected in MseLossBackward. Traceback of forward call that caused the error:
File "test.py", line 86, in <module>
loss = loss_func(out,y)
File "/home/buckaroo/miniconda3/envs/dev/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/e/workspace/@training/@datasets/cnns/yolo/yolo-v1-pytorch/loss.py", line 120, in forward
torch.flatten(exists_box * target[..., :20], end_dim=-2,),
File "/home/buckaroo/miniconda3/envs/dev/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/buckaroo/miniconda3/envs/dev/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 528, in forward
return F.mse_loss(input, target, reduction=self.reduction)
File "/home/buckaroo/miniconda3/envs/dev/lib/python3.7/site-packages/torch/nn/functional.py", line 2929, in mse_loss
return torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
(function _print_stack)
Traceback (most recent call last):
File "test.py", line 89, in <module>
loss.backward()
File "/home/buckaroo/miniconda3/envs/dev/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/buckaroo/miniconda3/envs/dev/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: Function 'MseLossBackward' returned nan values in its 0th output.
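For clarity, "globally" here just means the flag is turned on once near the top of the script, before the training loop runs:

```python
import torch

# Enable autograd anomaly detection for the whole script; every backward pass
# will then report the forward-pass stack trace of the op that produced nan.
torch.autograd.set_detect_anomaly(True)
```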
So I have tried:
- clamping the loss tensors with torch.clamp(value, min=0.0, max=1.0) in loss.py
- adding an epsilon (1e-6) inside torch.sqrt(), i.e. torch.sqrt(val + epsilon), in loss.py
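Roughly, the two attempted changes look like this (a sketch only — value/val stand in for the terms that feed the loss inside loss.py; the real variable names differ):

```python
import torch

eps = 1e-6

# Stand-in for a tensor that feeds the MSE loss (illustrative values only).
value = torch.tensor([-0.2, 0.0, 0.5, 1.3], requires_grad=True)

# Attempt 1: clamp the loss inputs into [0, 1].
clamped = torch.clamp(value, min=0.0, max=1.0)

# Attempt 2: add a small epsilon inside the square root, since the gradient of
# sqrt(x) is 1 / (2 * sqrt(x)), which becomes inf at x = 0.
rooted = torch.sqrt(clamped + eps)
```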
For background: I encountered this error while trying to train the model on my local GPU, using the code here: Machine-Learning-Collection/ML/Pytorch/object_detection/YOLO/
This is the test script that I have used to exercise the YOLO v1 model, and the behaviour described at the top (a valid output before the first optimizer.step(), nan afterwards) is the output I get while running it.
Note: I am using half() because of the CUDA error RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED that I hit otherwise.
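The half() change amounts to something like this (a sketch; it assumes a CUDA device and uses a placeholder module instead of the actual YOLO v1 model from the repo):

```python
import torch
import torch.nn as nn

# Placeholder for the YOLO v1 model from the repo; assumes CUDA is available.
device = torch.device("cuda")
model = nn.Linear(8, 4).to(device).half()       # cast the weights to fp16
x = torch.randn(16, 8, device=device).half()    # inputs must be fp16 as well
out = model(x)
print(out.dtype)  # torch.float16
```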
But neither the clamping nor the epsilon change fixed my issue.
References
Getting NaN values in backward pass
Output of Model is nan every time
Nan Loss coming after some time
Getting Nan after first iteration with custom loss
Weights become NaN values after first batch step
Why nan after backward pass?
NaN values popping up during loss.backward()
Debugging neural networks
So kindly help me debug this issue. Thanks in advance!