NAN appear during training #10
Comments
I'm also having this kind of issue. I am training on the same MegaDepth dataset with different configurations of the U-Net (encoder pretrained on other data, frozen encoder, decoder removed, etc.). All of them lead to NaN at some point during the optimization. I haven't yet determined whether the NaNs come from the optimization or directly from the features. Edit: I did not change the random seed either, and the error does not recur at the same iteration. It seems to appear randomly in the middle of training.
This is concerning; let me dig into it (this will likely take me a few days).
Thank you!
Thank you for the analysis. I have reproduced the issue: [W python_anomaly_mode.cpp:104] Warning: Error detected in MulBackward0. Traceback of forward call that caused the error:
RuntimeError:Function 'PowBackward1' returned nan values in its 0th output #16
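For readers unfamiliar with the warning quoted above: it is produced by PyTorch's autograd anomaly detection, which raises as soon as a backward function returns NaN. The snippet below is a minimal sketch of that mechanism only, not the code path of this repository. The function name `pow_grad_at_zero` and the toy tensors are made up for illustration. It shows one common way a pow node produces NaN: the derivative of `x ** 0.5` at `x = 0` is infinite, and multiplying it by a zero upstream gradient gives NaN.

```python
import torch


def pow_grad_at_zero(detect_anomaly: bool) -> torch.Tensor:
    """Backpropagate through x ** 0.5 at x = 0 (hypothetical toy example)."""
    x = torch.tensor([0.0, 4.0], requires_grad=True)
    # A zero weight on the first element gives a zero upstream gradient there.
    weights = torch.tensor([0.0, 1.0])
    # set_detect_anomaly can be used as a context manager; when enabled,
    # backward() raises a RuntimeError if a backward function returns NaN.
    with torch.autograd.set_detect_anomaly(detect_anomaly):
        loss = (x.pow(0.5) * weights).sum()
        loss.backward()
    return x.grad


# Without anomaly detection the NaN silently ends up in the gradient:
# d/dx x**0.5 = 0.5 * x**-0.5 is inf at 0, and 0 * inf = NaN.
print(pow_grad_at_zero(False))

# With anomaly detection the same backward pass fails loudly instead.
try:
    pow_grad_at_zero(True)
except RuntimeError as err:
    print("anomaly detected:", err)
```

Anomaly detection slows training noticeably, so it is usually enabled only while hunting for the failing operation.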
I tested the changed code, but I get the same error.
@angiend What dataset are you training with? At which iteration does it crash? With what version of PyTorch?
@skydes I retrained on the CMU dataset. It crashes at "E 65| it 800" (3000 iterations per epoch), and my PyTorch version is 1.9.1.
Training has usually fully converged by epoch 20, so this should not prevent you from reproducing the results. Could you give PyTorch 1.7.1 a try? I have tried both 1.7.1 and 1.10.0, and both work fine.
Thanks, I have tested 3 epochs and I think this issue has been fixed.
After 21850 training iterations, I got NaN in the features extracted by the UNet.
Could you advise which part of the source code I should look into?
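One generic way to narrow down where NaNs first appear in the forward pass is to attach forward hooks to every submodule and stop at the first non-finite output. The sketch below assumes a standard `torch.nn.Module`. The helper name `add_nan_hooks` and the tiny `nn.Sequential` model are hypothetical and are not part of this repository.

```python
import torch
from torch import nn


def add_nan_hooks(model: nn.Module):
    """Register forward hooks that raise on the first non-finite output.

    Hypothetical debugging helper; returns the hook handles so the
    hooks can be removed again once the culprit has been found.
    """
    handles = []
    for name, module in model.named_modules():
        # Bind `name` as a default argument so each hook keeps its own name.
        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(f"non-finite output in module '{name}'")

        handles.append(module.register_forward_hook(hook))
    return handles


# Toy stand-in for the real network, for illustration only.
model = nn.Sequential(nn.Linear(2, 2), nn.ReLU())
handles = add_nan_hooks(model)

try:
    model(torch.tensor([[float("nan"), 1.0]]))
except RuntimeError as err:
    print(err)  # names the first submodule whose output is not finite

# Remove the hooks when done so they do not slow down normal training.
for handle in handles:
    handle.remove()
```

The hooks only inspect outputs that are plain tensors, so modules returning tuples or dictionaries would need an extra unpacking step.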