The loss becomes NaN after some training epochs #17

Closed
harisgulzar1 opened this issue Dec 8, 2022 · 7 comments

@harisgulzar1

When I train ConformerCTC with the same code, the loss becomes NaN after some training epochs.
After looking into it, I found that the loss becomes NaN only for a few batches, but that causes the accumulated loss to become NaN too.
So far, I have tried the following:

  1. Applied gradient clipping (both by norm and by value; see the sketch below)
  2. Decreased the learning rate.
  3. Applied weight clipping, to check whether any weights are becoming too large.

None of these has worked. Can you please advise on this? Thanks!
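For reference, a minimal sketch of the clipping in items 1 and 3, assuming a plain PyTorch training loop; the function name and threshold values are illustrative, not the repository's actual code.

import torch

# Minimal sketch, not the repository's exact trainer: gradient clipping
# (by norm or by value) and weight clipping after the optimizer step.
def clip_step(model, loss, optimizer, max_norm=5.0, max_value=0.1, weight_cap=10.0):
    optimizer.zero_grad()
    loss.backward()
    # 1. Gradient clipping, either by global norm or by value:
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    # torch.nn.utils.clip_grad_value_(model.parameters(), max_value)
    optimizer.step()
    # 3. Weight clipping, to check whether any weights grow too large:
    with torch.no_grad():
        for p in model.parameters():
            p.clamp_(-weight_cap, weight_cap)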

@burchim (Owner) commented Dec 8, 2022

Hi,

We also had this problem during early Efficient Conformer experiments, after 30 to 50 training epochs: network activations were growing too large for mixed precision training. Applying a weight decay of 1e-6 fixed the problem.
What exact config are you using? Which tokenizer and dataset? Applying a larger weight decay, or using AdamW instead of Adam, could fix the problem. Setting mixed precision to false may also fix it, at the cost of slower training and larger GPU memory use.
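A minimal sketch of this suggestion in plain PyTorch (the stand-in model and the hyperparameter values are illustrative, not the repository's exact trainer):

import torch

# Sketch: switch from Adam to AdamW and use a larger (decoupled) weight decay.
model = torch.nn.Linear(256, 256)  # placeholder for the Conformer encoder

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,            # keep the schedule's base LR; do not raise it
    betas=(0.9, 0.98),
    eps=1e-9,
    weight_decay=0.01,  # AdamW default, much larger than the 1e-6 used with Adam
)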

Since the loss becomes NaN only after some epochs, I assume it is not related to some target sequences being longer than model outputs for CTC computation.

Best,
Maxime

@harisgulzar1 (Author) commented Dec 9, 2022

Thanks for your response.
Interestingly, for ConformerCTC-Small the problem occurred around the 200th epoch, but for ConformerCTC-Medium and ConformerCTC-Large it occurred at earlier epochs. So yes, I think the problem is not with the sequence lengths in the dataset.

  • For training, a weight decay of 1e-6 is already applied; do you think I should increase the value?
  • Currently I am using Adam; I will try AdamW and see if the problem is solved.
  • I will also try turning off mixed precision training.

One thing I was wondering: is there a possibility that some weights are becoming very small and some log functions are returning NaN (or -inf) because of that? I will investigate, but do you think that might be a possibility as well?
I am using the LibriSpeech dataset with the default BPE tokenizer provided with your code.
Below are the current configuration values.
"training_params": { "epochs": 450, "batch_size": 16, "accumulated_steps": 4, "mixed_precision": true, "optimizer": "Adam", "beta1": 0.9, "beta2": 0.98, "eps": 1e-9, "weight_decay": 1e-6, "lr_schedule": "Transformer", "schedule_dim": 256, "warmup_steps": 10000, "K": 2 }

@burchim (Owner) commented Dec 9, 2022

Yes, you could try to see if increasing the weight decay solves the problem, or use AdamW with its default decay of 0.01. But I advise you to first find where the NaNs come from, by logging or printing the network's hidden activations / norms during training, starting from the last healthy checkpoint. You could then try these methods and compare.
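One way to do this kind of logging is with forward hooks; below is a minimal sketch assuming plain PyTorch (the helper name is illustrative, not the repository's API).

import torch

# Sketch: register forward hooks that log each module's output norm and flag
# the first NaN/Inf, starting from the last healthy checkpoint.
def register_norm_hooks(model, log_fn=print):
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                norm = output.detach().float().norm()
                if torch.isnan(norm) or torch.isinf(norm):
                    log_fn(f"[NaN/Inf] first detected at module: {name}")
                else:
                    log_fn(f"{name}: output norm = {norm.item():.2f}")
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules()]
    return handles  # call h.remove() on each handle to stop logging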

@harisgulzar1 (Author)

I checked the weights, and for some specific batches all the activations were becoming NaN.
I couldn't pinpoint the exact activation that was causing the problem.
But using AdamW with the default decay value has solved the problem.
Thank you very much for your suggestions.

@burchim (Owner) commented Dec 16, 2022

Thanks for your feedback! I suppose this was due to some weight parameters causing the activations to grow too large for float16.

@kafan1986

@harisgulzar1 @burchim I am still getting NaN during the CTC training phase in spite of using AdamW and a learning rate of 0.01. What else can I try?

@harisgulzar1 (Author)

@kafan1986
0.01 is a very large value for the learning rate. Try not to change the learning rate from the default.
Instead, set weight_decay to 0.01 (the default value for AdamW).
If this doesn't work, set mixed_precision to false.
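For illustration, a minimal sketch of what turning mixed precision off typically means in a PyTorch AMP training step; this assumes the config's mixed_precision flag maps to autocast/GradScaler, which is an assumption, not the repository's exact trainer code.

import torch

# Sketch: with enabled=False everything runs in float32, avoiding float16
# overflow at the cost of slower training and higher GPU memory use.
mixed_precision = False
scaler = torch.cuda.amp.GradScaler(enabled=mixed_precision)

def train_step(model, batch, targets, criterion, optimizer):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=mixed_precision):
        outputs = model(batch)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()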
