The loss becomes NaN after some training epochs #17

Closed
harisgulzar1 opened this issue Dec 8, 2022 · 7 comments

@harisgulzar1

When I train ConformerCTC with the same code, the loss becomes NaN after some training epochs.
After looking into it, I found that the loss becomes NaN only for a few batches, but that causes the accumulated loss to become NaN too.
So far, I have tried the following:

  1. Applied gradient clipping (both by norm and by value; see the sketch below)
  2. Decreased the learning rate.
  3. Applied weight clipping, to check whether any weights are becoming too large.

None of these has worked. Can you please advise on this? Thanks!
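For reference, a minimal sketch of the clipping in items 1 and 3, assuming a plain PyTorch training loop; the function name and threshold values are illustrative, not the repository's actual code.

import torch

# Minimal sketch, not the repository's exact trainer: gradient clipping
# (by norm or by value) and weight clipping after the optimizer step.
def clip_step(model, loss, optimizer, max_norm=5.0, max_value=0.1, weight_cap=10.0):
    optimizer.zero_grad()
    loss.backward()
    # 1. Gradient clipping, either by global norm or by value:
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    # torch.nn.utils.clip_grad_value_(model.parameters(), max_value)
    optimizer.step()
    # 3. Weight clipping, to check whether any weights grow too large:
    with torch.no_grad():
        for p in model.parameters():
            p.clamp_(-weight_cap, weight_cap)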

@burchim (Owner) commented Dec 8, 2022

Hi,

We also had this problem during early Efficient Conformer experiments, after 30 to 50 training epochs: network activations were growing too large for mixed precision training. Applying a weight decay of 1e-6 fixed the problem.
What exact config are you using? Which tokenizer and dataset? Applying a larger weight decay, or using AdamW instead of Adam, could fix the problem. Setting mixed precision to false may also fix it, at the cost of slower training and larger GPU memory use.
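A minimal sketch of this suggestion in plain PyTorch (the stand-in model and the hyperparameter values are illustrative, not the repository's exact trainer):

import torch

# Sketch: switch from Adam to AdamW and use a larger (decoupled) weight decay.
model = torch.nn.Linear(256, 256)  # placeholder for the Conformer encoder

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,            # keep the schedule's base LR; do not raise it
    betas=(0.9, 0.98),
    eps=1e-9,
    weight_decay=0.01,  # AdamW default, much larger than the 1e-6 used with Adam
)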

Since the loss becomes NaN only after some epochs, I assume it is not related to some target sequences being longer than model outputs for CTC computation.

Best,
Maxime

@harisgulzar1 (Author) commented Dec 9, 2022

Thanks for your response.
Interestingly, for ConformerCTC-Small the problem occurred around the 200th epoch, but for ConformerCTC-Medium and ConformerCTC-Large it occurred at earlier epochs. So yes, I think the problem is not with the sequence lengths in the dataset.

  • For training, a weight decay of 1e-6 is already applied; do you think I should increase the value?
  • Currently I am using Adam; I will try AdamW and see if the problem is solved.
  • I will also try turning off mixed precision training.

One thing I was wondering: is there a possibility that some weights are becoming very small and some log functions are returning NaN (or -inf) because of that? I will investigate, but do you think that might be a possibility as well?
I am using the LibriSpeech dataset with the default BPE tokenizer provided with your code.
Below are the current configuration values.
"training_params": { "epochs": 450, "batch_size": 16, "accumulated_steps": 4, "mixed_precision": true, "optimizer": "Adam", "beta1": 0.9, "beta2": 0.98, "eps": 1e-9, "weight_decay": 1e-6, "lr_schedule": "Transformer", "schedule_dim": 256, "warmup_steps": 10000, "K": 2 }

@burchim (Owner) commented Dec 9, 2022

Yes, you could try to see if increasing the weight decay solves the problem, or use AdamW with its default decay of 0.01. But I advise you to first find where the NaNs come from, by logging or printing the network's hidden activations / norms during training, starting from the last healthy checkpoint. You could then try these methods and compare.
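One way to do this kind of logging is with forward hooks; below is a minimal sketch assuming plain PyTorch (the helper name is illustrative, not the repository's API).

import torch

# Sketch: register forward hooks that log each module's output norm and flag
# the first NaN/Inf, starting from the last healthy checkpoint.
def register_norm_hooks(model, log_fn=print):
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                norm = output.detach().float().norm()
                if torch.isnan(norm) or torch.isinf(norm):
                    log_fn(f"[NaN/Inf] first detected at module: {name}")
                else:
                    log_fn(f"{name}: output norm = {norm.item():.2f}")
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules()]
    return handles  # call h.remove() on each handle to stop logging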

@harisgulzar1 (Author)

I checked the weights, and for some specific batches all the activations were becoming NaN.
I couldn't pinpoint the exact activation that was causing the problem.
But using AdamW with the default decay value has solved the problem.
Thank you very much for your suggestions.

@burchim (Owner) commented Dec 16, 2022

Thanks for your feedback! I suppose this was due to some weight parameters causing the activations to grow too large for float16.

@kafan1986

@harisgulzar1 @burchim I am still getting NaN during the CTC training phase in spite of using AdamW and a learning rate of 0.01. What else can I try?

@harisgulzar1 (Author)

@kafan1986
0.01 is a very large value for the learning rate. Try not to change the learning rate from the default.
Instead, set weight_decay to 0.01 (the default value for AdamW).
If this doesn't work, set mixed_precision to false.
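For illustration, a minimal sketch of what turning mixed precision off typically means in a PyTorch AMP training step; this assumes the config's mixed_precision flag maps to autocast/GradScaler, which is an assumption, not the repository's exact trainer code.

import torch

# Sketch: with enabled=False everything runs in float32, avoiding float16
# overflow at the cost of slower training and higher GPU memory use.
mixed_precision = False
scaler = torch.cuda.amp.GradScaler(enabled=mixed_precision)

def train_step(model, batch, targets, criterion, optimizer):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=mixed_precision):
        outputs = model(batch)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()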
