The loss becomes NaN after some training epochs #17
Comments
Hi, we also had this problem during early Efficient Conformer experiments, after 30/50 training epochs. Network activations were growing too large for mixed precision training. Applying a weight decay of 1e-6 fixed the problem. Since the loss only becomes NaN after some epochs, I assume it is not related to some target sequences being longer than the model outputs for the CTC computation. Best,
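For reference, in a PyTorch setup like this repo's, weight decay is just an optimizer argument. A minimal sketch; `model` and the learning rate below are placeholders, not values from the Efficient Conformer code:

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 128)  # stand-in for the real network

# The small L2 penalty described above; 1e-6 is the value that fixed the NaNs here
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)
```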
Thanks for your response.
One thing I was wondering: could the weights be becoming small enough that some log functions return NaN (or -inf) as a result? I will investigate that, but do you think that might be a possibility as well?
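If that hunch is right, the failure mode would look like a log applied to a zero or negative value. A toy illustration (not code from this repo):

```python
import torch

# A probability that underflows to 0 gives -inf; a negative value gives NaN
probs = torch.tensor([0.5, 0.0, -1e-6])
print(torch.log(probs))  # tensor([-0.6931, -inf, nan])
```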
Yes, you could try increasing the weight decay to see if that solves the problem, or use AdamW with its default decay of 0.01. But I advise you to first find where the NaNs come from by logging or printing the network's hidden activations / norms during training, starting from the last healthy checkpoint. You could then try these methods and compare.
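A minimal sketch of the logging idea, assuming PyTorch: forward hooks that print each module's output norm and flag non-finite values. `log_activation_norms` is a hypothetical helper, not part of this repo:

```python
import torch
import torch.nn as nn

def log_activation_norms(model):
    """Register forward hooks that report each module's output norm."""
    handles = []
    for name, module in model.named_modules():
        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor):
                print(f"{name or 'root'}: norm={output.norm().item():.4g} "
                      f"finite={torch.isfinite(output).all().item()}")
        handles.append(module.register_forward_hook(hook))
    return handles  # call handle.remove() on each when done

# Usage with a stand-in model
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
handles = log_activation_norms(model)
model(torch.randn(2, 8))
```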
I checked the weights, and for some specific batches all the activations were becoming NaN.
Thanks for your feedback! I suppose this was due to some weight parameters causing the activations to grow too large for float16.
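For context, float16 tops out at 65504, so a few multiplications by moderately large activations are enough to overflow to inf, after which inf - inf produces the NaNs that then propagate through the whole batch. A toy demonstration:

```python
import torch

x = torch.tensor([200.0], dtype=torch.float16)
y = x * x      # 40000: still representable (float16 max is 65504)
z = y * x      # 8e6: overflows to inf
print(y, z, z - z)  # inf - inf = nan, and it spreads from there
```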
@harisgulzar1 @burchim I am still getting NaN during the CTC training phase in spite of using AdamW and a learning rate of 0.01. What else can I try?
@kafan1986 |
When I train the ConformerCTC with the same code, the loss becomes NaN after some training epochs.
After looking into it, I found that the loss becomes NaN only for a few batches, but that causes the accumulated loss to become NaN too (see the sketch after the list below).
So far, I have tried the following:
None of it has worked; can you please advise on this? Thanks!
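One stopgap worth trying while the root cause is unclear: skip any batch whose loss is non-finite, so a single bad batch never reaches `backward()` or the accumulated loss. A minimal sketch, assuming a standard PyTorch loop; `criterion` here would be something like `nn.CTCLoss`, and all names are placeholders rather than code from this repo:

```python
import torch

def train_step(criterion, optimizer, log_probs, targets, in_lens, tgt_lens):
    """One CTC training step that drops batches whose loss is non-finite."""
    optimizer.zero_grad()
    loss = criterion(log_probs, targets, in_lens, tgt_lens)
    if not torch.isfinite(loss):
        return None  # skip: keep one bad batch from poisoning the epoch loss
    loss.backward()
    optimizer.step()
    return loss.item()
```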