Loss increases during pretraining #35

Closed
mmaaz60 opened this issue Sep 12, 2021 · 4 comments

mmaaz60 commented Sep 12, 2021

Hi @alcinos, @ashkamath, @nguyeho7,

I hope you are doing well.

I was trying to pretrain MDETR using the provided instructions, and I noticed that the loss started increasing during the 20th epoch. It had decreased steadily to around 39 by the 19th epoch, then jumped to around 77 in the 20th epoch. What could be the reason for this? Note that I am using the EfficientNet-B5 backbone. The log.txt is attached.

Thanks

log.txt
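
For reference, a minimal sketch for spotting the jump in the attached log.txt, assuming the DETR-style log format in which each line is a JSON dict containing "epoch" and "train_loss" keys (key names are an assumption; adjust if they differ):

```python
# Minimal sketch: scan a DETR-style log.txt (one JSON dict per line, assumed
# to contain "epoch" and "train_loss") and print the per-epoch training loss,
# flagging epochs where the loss went up compared to the previous one.
import json

with open("log.txt") as f:
    stats = [json.loads(line) for line in f if line.strip()]

prev = None
for s in stats:
    marker = "  <-- increase" if prev is not None and s["train_loss"] > prev else ""
    print(f"epoch {s['epoch']:3d}  train_loss {s['train_loss']:.2f}{marker}")
    prev = s["train_loss"]
```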

alcinos (Collaborator) commented Sep 17, 2021

Hi @mmaaz60,
Thank you for your interest in MDETR.
It looks like your training diverged. Can I ask how many GPUs you used?

mmaaz60 (Author) commented Sep 17, 2021

> Hi @mmaaz60,
> Thank you for your interest in MDETR.
> It looks like your training diverged. Can I ask how many GPUs you used?

Thank you, @alcinos.

I used 32 GPUs with a batch_size of 2 per GPU.
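
For context, a small sketch of the effective batch size this setup gives, together with the common linear-scaling rule of thumb for adjusting the learning rate when the effective batch size differs from the one the base learning rate was tuned for. This is a general heuristic, not MDETR-specific, and the reference values below are placeholders rather than the repo's actual defaults:

```python
# Effective batch size with this setup: 32 GPUs x 2 samples per GPU = 64.
n_gpus = 32
batch_per_gpu = 2
effective_batch = n_gpus * batch_per_gpu  # 64

# Linear-scaling rule of thumb (general heuristic, not MDETR-specific):
# scale the base learning rate by the ratio of effective batch sizes.
reference_batch = 64   # placeholder: batch size the base lr was tuned for
base_lr = 1e-4         # placeholder: base lr from the training config
scaled_lr = base_lr * effective_batch / reference_batch
print(effective_batch, scaled_lr)
```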

alcinos (Collaborator) commented Sep 17, 2021

Hmm, that's quite surprising then. Did anything fishy happen, like the job getting preempted and then restarted?
Are you sure you have the correct transformers version?
Otherwise, maybe try a slightly smaller learning rate?
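
A quick way to confirm which transformers version is actually active in the training environment (a minimal check; compare the output against whatever version the MDETR requirements specify):

```python
# Print the transformers version imported in the current environment.
import transformers
print(transformers.__version__)
```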

mmaaz60 (Author) commented Sep 17, 2021

Thank you.

> Hmm, that's quite surprising then. Did anything fishy happen, like the job getting preempted and then restarted?

Nothing like that happened during training.

> Are you sure you have the correct transformers version?

I am using transformers version 4.5.1.

> Otherwise, maybe try a slightly smaller learning rate?

I actually stopped and then resumed the training from the 19th epoch, and it has now reached the 25th epoch and seems to be converging. I'm not sure what went wrong previously, as I didn't change anything when resuming.
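
For anyone hitting something similar, a minimal sketch for sanity-checking the checkpoint that training resumes from, assuming a DETR-style checkpoint layout with "model", "optimizer", "lr_scheduler", and "epoch" entries (the key names are an assumption; adjust if they differ):

```python
# Inspect a saved checkpoint before resuming, assuming DETR-style keys
# ("model", "optimizer", "lr_scheduler", "epoch").
import torch

ckpt = torch.load("checkpoint.pth", map_location="cpu")
print("keys:", list(ckpt.keys()))
print("last completed epoch:", ckpt.get("epoch"))

# The learning rates the optimizer will restart with live in its param groups.
if "optimizer" in ckpt:
    for i, group in enumerate(ckpt["optimizer"]["param_groups"]):
        print(f"param group {i}: lr = {group['lr']}")
```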

mmaaz60 closed this as completed Sep 17, 2021