
NAN appear during training #10

Closed
loocy3 opened this issue Nov 2, 2021 · 14 comments

@loocy3

loocy3 commented Nov 2, 2021

After 21850 training iterations, I got NaN in the U-Net extracted features.
Could you give any advice on which part of the source code I should look into?

@sarlinpe
Member

sarlinpe commented Nov 2, 2021

  1. What dataset are you training with?
  2. Could you try to enable anomaly detection by uncommenting this line? Please then report the entire stack traceback. (A standalone example of what anomaly mode reports is sketched below.)
    # torch.autograd.set_detect_anomaly(True)
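
For reference, here is a minimal, self-contained sketch of what anomaly mode gives you (just an illustration, not pixloc code): the backward pass raises as soon as an op returns NaN and prints the traceback of the corresponding forward op, which is exactly the information needed to locate the source.

```python
import torch

# Anomaly mode makes every backward pass check its outputs for NaN and, when
# one is found, raise with the traceback of the forward op that produced it.
torch.autograd.set_detect_anomaly(True)

# Toy reproduction: d/dx sqrt(x) is infinite at x = 0, and the chain rule then
# multiplies that inf by the incoming gradient 0, giving 0 * inf = NaN.
x = torch.tensor(0.0, requires_grad=True)
z = torch.sqrt(x) * 0.0   # forward value is finite (0.0)
try:
    z.backward()          # backward: 0 * inf = NaN inside SqrtBackward
except RuntimeError as err:
    print(err)  # e.g. "Function 'SqrtBackward...' returned nan values in its 0th output."
```

Note that anomaly detection slows training down noticeably, so it is best enabled only while reproducing the bug.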

@sarlinpe
Member

sarlinpe commented Nov 4, 2021

  1. NaNs can appear in the solver step if the optimization is too difficult, but this should already be handled by the code (a sketch of the kind of guard I mean is below).
  2. Did you try to train with a different random seed? Does the NaN always appear at the same training iteration?
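
Regarding point 1, such a guard looks roughly like the following sketch. This is a simplified illustration with hypothetical names (guarded_optimizer_step, the 1e-6 clamp) and it assumes torch.linalg.solve from a recent PyTorch; it is not the exact pixloc implementation:

```python
import torch

def guarded_optimizer_step(g, H, lambda_):
    """Damped Gauss-Newton / Levenberg-Marquardt step with a basic NaN guard.

    Simplified sketch: batch elements whose linear solve yields non-finite
    values get a zero update instead of propagating NaNs into the pose.
    """
    # Levenberg-Marquardt damping of the Hessian diagonal.
    diag = H.diagonal(dim1=-2, dim2=-1) * lambda_
    H_damped = H + diag.clamp(min=1e-6).diag_embed()
    try:
        delta = torch.linalg.solve(H_damped, -g.unsqueeze(-1)).squeeze(-1)
    except RuntimeError:  # exactly singular system in the batch
        delta = torch.zeros_like(g)
    # Mask out updates that still contain NaN/Inf (ill-conditioned systems).
    failed = ~torch.isfinite(delta).all(dim=-1, keepdim=True)
    return torch.where(failed, torch.zeros_like(delta), delta)
```

The important part is the last two lines: a failed solve should produce a zero update and ideally also be masked out of the loss, rather than letting NaNs flow back into the feature gradients.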

@loocy3
Author

loocy3 commented Nov 5, 2021

  1. NaNs can appear in the solver step if the optimization is too difficult, but this should already be handled by the code.
    -> May I know how you handle that case? With the 'too few match points' check?
  2. Did you try to train with a different random seed? Does the NaN always appear at the same training iteration?
    -> I loaded the pretrained CMU model and fine-tuned on KITTI data. I did not change the random seed. The NaN does not always appear at the same training iteration, but when it recurs it appears around iteration 29000~34000.

@jmorlana
Contributor

jmorlana commented Nov 6, 2021

I'm also having this kind of issue. I'm training on the same MegaDepth dataset with different configurations of the U-Net (encoder pretrained on other data, frozen encoder, deleted decoder, etc.). All of them lead to NaN at some point during the optimization. I haven't yet determined whether the NaNs come from the optimization or directly from the features.

Edit: I did not change the random seed either, and the error does not repeat at the same iteration. It seems to appear randomly in the middle of training.

@sarlinpe
Member

sarlinpe commented Nov 6, 2021

  • Does the anomaly detection show that NaNs consistently appear at the same ops?
  • Any spike in the loss function in the preceding iterations?
  • What versions of numpy & pytorch are you using?
  • Does reducing the learning rate help?

This is concerning; let me dig into it (this will likely take me a few days).

@jmorlana
Contributor

jmorlana commented Nov 6, 2021

  • The output of the anomaly detection always points to the power operation in the loss estimation, but the NaN trace comes from the pose optimization; I'm not sure whether it originates in the features or in the pose itself. I'm running another training, so I hope to give more information soon (one way to check this is sketched at the end of this comment).

[11/02/2021 07:16:05 pixloc INFO] [E 7 | it 2450] loss {total 3.257E+00, reprojection_error/0 9.695E+00, reprojection_error/1 8.376E+00, reprojection_error/2 8.366E+00, reprojection_error 8.366E+00, reprojection_error/init 3.127E+01}
[11/02/2021 07:16:06 pixloc.pixlib.models.two_view_refiner WARNING] NaN detected ['error', tensor([ nan, 5.0000e+01, 1.4714e-01, 2.6252e-03, 2.5921e-02, 3.2593e-02],
device='cuda:0', grad_fn=), 'loss', tensor([ nan, 0.0000, 0.0490, 0.0009, 0.0086, 0.0109], device='cuda:0',
grad_fn=)]
[W python_anomaly_mode.cpp:104] Warning: Error detected in PowBackward1. Traceback of forward call that caused the error:
File "/home/jmorlana/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/jmorlana/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/jmorlana/pixloc/pixloc/pixlib/train.py", line 391, in
main_worker(0, conf, output_dir, args)
File "/home/jmorlana/pixloc/pixloc/pixlib/train.py", line 358, in main_worker
training(rank, conf, output_dir, args)
File "/home/jmorlana/pixloc/pixloc/pixlib/train.py", line 259, in training
losses = loss_fn(pred, data)
File "/home/jmorlana/pixloc/pixloc/pixlib/models/two_view_refiner.py", line 151, in loss
err = reprojection_error(T_opt).clamp(max=self.conf.clamp_error)
File "/home/jmorlana/pixloc/pixloc/pixlib/models/two_view_refiner.py", line 133, in reprojection_error
err = scaled_barron(1., 2.)(err)[0]/4
File "/home/jmorlana/pixloc/pixloc/pixlib/geometry/losses.py", line 81, in
return lambda x: scaled_loss(
File "/home/jmorlana/pixloc/pixloc/pixlib/geometry/losses.py", line 18, in scaled_loss
loss, loss_d1, loss_d2 = fn(x/a2)
File "/home/jmorlana/pixloc/pixloc/pixlib/geometry/losses.py", line 82, in
x, lambda y: barron_loss(y, y.new_tensor(a)), c)
File "/home/jmorlana/pixloc/pixloc/pixlib/geometry/losses.py", line 59, in barron_loss
torch.pow(x / beta_safe + 1., 0.5 * alpha) - 1.)
(function _print_stack)

  • Validation total loss jumps to 15 (it was previously 3) after the first NaN appears. All the training losses that come after become NaN too.
  • My torch version is 1.7.1 and my numpy version is 1.19.5.
  • I haven't checked a different learning rate yet, I will give it a try.
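
One way to check whether the NaNs first show up in the U-Net features or only later in the pose optimization is to register forward hooks that flag the first module with a non-finite output. A rough debugging sketch (add_nan_hooks is a made-up helper, not part of pixloc):

```python
import torch

def add_nan_hooks(model):
    """Report the first module whose forward output contains NaN/Inf,
    to tell feature NaNs apart from NaNs born in the pose optimization."""
    def make_hook(name):
        def hook(module, inputs, output):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outputs:
                if torch.is_tensor(out) and not torch.isfinite(out).all():
                    print(f"non-finite output in module '{name}' "
                          f"({module.__class__.__name__})")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))
```

If the U-Net layers never trigger the hook but the loss still becomes NaN, the problem lies in the optimization/loss rather than in the features themselves.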

Thank you!

@loocy3
Author

loocy3 commented Nov 8, 2021

Thank you for the analysis. I have reproduced the issue:


[W python_anomaly_mode.cpp:104] Warning: Error detected in MulBackward0. Traceback of forward call that caused the error:
File "pixloc/pixlib/train.py", line 417, in
main_worker(0, conf, output_dir, args)
File "pixloc/pixlib/train.py", line 383, in main_worker
training(rank, conf, output_dir, args)
File "pixloc/pixlib/train.py", line 281, in training
pred = model(data)
File ".local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "pixloc/pixloc/pixlib/models/base_model.py", line 106, in forward
return self._forward(data)
File "pixloc/pixloc/pixlib/models/two_view_refiner.py", line 117, in _forward
mask=mask, W_ref_q=W_ref_q))
File ".local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/pixloc/pixloc/pixlib/models/base_model.py", line 106, in forward
return self._forward(data)
File "pixloc/pixloc/pixlib/models/base_optimizer.py", line 97, in forward
data['cam_q'], data['mask'], data.get('W_ref_q'))
File "pixloc/pixloc/pixlib/models/learned_optimizer.py", line 78, in run
delta = optimizer_step(g, H, lambda_, mask=~failed)
File "pixloc/pixloc/pixlib/geometry/optimization.py", line 18, in optimizer_step
diag = H.diagonal(dim1=-2, dim2=-1) * lambda_

(function _print_stack)
Traceback (most recent call last):
File "pixloc/pixlib/train.py", line 417, in
main_worker(0, conf, output_dir, args)
File "pixloc/pixlib/train.py", line 383, in main_worker
training(rank, conf, output_dir, args)
File "pixloc/pixlib/train.py", line 292, in training
loss.backward()
File ".local/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File ".local/lib/python3.7/site-packages/torch/autograd/init.py", line 156, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: Function 'MulBackward0' returned nan values in its 1th output.
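
Until the root cause is fixed, one pragmatic mitigation is to skip the parameter update whenever the batch loss is non-finite, so a single bad batch does not poison the weights. A rough sketch (safe_update, the clipping threshold, and the scalar-loss assumption are mine, not pixloc code):

```python
import torch

def safe_update(loss, optimizer, model, max_grad_norm=10.0):
    """Skip the update when the scalar loss is non-finite; otherwise clip
    gradients and step. Hypothetical helper, not the actual pixloc fix."""
    optimizer.zero_grad()
    if not torch.isfinite(loss):
        print("non-finite loss, skipping this batch")
        return False
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
    optimizer.step()
    return True
```

This does not address why the solver produces NaNs in the first place, but it keeps a run alive long enough to collect more anomaly-detection traces.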

@angiend

angiend commented Nov 18, 2021

RuntimeError: Function 'PowBackward1' returned nan values in its 0th output #16

@sarlinpe
Member

I believe that the issue has been addressed by 8937e29 and 0ab0e79. Can you please confirm that this helps? I will continue to investigate other sources of instabilities.

@angiend

angiend commented Nov 25, 2021

I tested the changed code, but I get the same error.

@sarlinpe
Member

@angiend What dataset are you training with? At which iteration does it crash? With what version of PyTorch?

@angiend

angiend commented Nov 25, 2021

@skydes I retrained on the CMU dataset; it crashes at "E 65 | it 800" (3000 iterations per epoch), and my PyTorch version is 1.9.1.

@sarlinpe
Member

The training has usually fully converged by epoch 20, so this should not prevent reproducing the results. Could you give PyTorch 1.7.1 a try? I have tried both 1.7.1 and 1.10.0 and both work fine.

@sarlinpe sarlinpe changed the title from "NAN in Unet extracted features" to "NAN appear during training" on Nov 27, 2021
@loocy3
Author

loocy3 commented Dec 11, 2021

Thanks, I have tested 3 epochs and I think this issue has been fixed.

@loocy3 loocy3 closed this as completed Dec 11, 2021