
NaN values for loss and accuracy in training and testing #23

Closed

adityagupta-9900 opened this issue Sep 4, 2021 · 5 comments
@adityagupta-9900 commented Sep 4, 2021

Given our available capacity, we reduced training to 4 GPUs and kept the learning rate at the default of 0.02. After 40-60 iterations we started getting NaN loss values.
We then reduced the learning rate to 0.015 and retrained. Even so, after >200 iterations the loss sometimes becomes NaN and sometimes the run finishes fine.
When we tested a model whose training produced NaN losses, all of the accuracy values in the output table also came out as NaN.

[Screenshot: training log showing NaN loss values]
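(For anyone trying to narrow down where the losses blow up, a minimal debugging sketch like the one below can be dropped into the training loop. The `loss_dict` and `iteration` names mirror common detection codebases and are assumptions here, not this repository's exact variables.)

```python
import math

def assert_finite_losses(loss_dict: dict, iteration: int) -> None:
    """Raise as soon as any loss term becomes NaN or infinite, reporting
    which term it was and at which iteration it first appeared."""
    for name, value in loss_dict.items():
        value = float(value)  # works for plain floats and 0-dim torch tensors
        if not math.isfinite(value):
            raise RuntimeError(
                f"Loss '{name}' became {value} at iteration {iteration}; "
                "lowering the learning rate is the usual first thing to try."
            )

# Hypothetical usage inside the training loop, right after the losses are computed:
# assert_finite_losses(loss_dict, iteration)
```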

@mondrasovic

I am currently working on a solution to this. So far, I have identified that the ground-truth bounding boxes disappear during the filtering phase (code here).

As you can see in the report, everything is labeled as FP (false positive). Why, I still have no idea, but it is one of the tasks I have to solve before moving forward.

There is another issue reporting the very same problem.
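(A minimal sanity check for this, assuming the ground-truth boxes are available as (N, 4) tensors in (x1, y1, x2, y2) format; the `filter_boxes` call in the usage comment is a stand-in for the project's actual filtering step, not its real API.)

```python
import torch

def report_box_filtering(boxes_before: torch.Tensor,
                         boxes_after: torch.Tensor,
                         tag: str = "gt-filter") -> None:
    """Print how many ground-truth boxes survive a filtering step and how
    many of the inputs were degenerate (non-positive width or height)."""
    widths = boxes_before[:, 2] - boxes_before[:, 0]
    heights = boxes_before[:, 3] - boxes_before[:, 1]
    num_degenerate = int(((widths <= 0) | (heights <= 0)).sum())
    print(f"[{tag}] before: {boxes_before.shape[0]}, "
          f"after: {boxes_after.shape[0]}, degenerate: {num_degenerate}")

# Hypothetical usage around the filtering call linked above:
# filtered = filter_boxes(gt_boxes)  # stand-in for the project's actual filter
# report_box_filtering(gt_boxes, filtered)
```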

@mondrasovic

For the solution, see my answer in this related issue in this project.

@adityagupta-9900 (Author)

@mondrasovic Thank you so much for your help.
There is another thing I need to clarify, though. I am getting NaN values even during training: after 40-60 iterations, I start getting NaN (NaN) loss values when training on the MOT dataset.
Could you suggest why this happens?

@mondrasovic commented Sep 10, 2021

> @mondrasovic Thank you so much for your help.
> There is another thing I need to clarify, though. I am getting NaN values even during training: after 40-60 iterations, I start getting NaN (NaN) loss values when training on the MOT dataset.
> Could you suggest why this happens?

I am sorry that I only answered one part of the question; I was stuck on that problem, which narrowed my focus. Nevertheless, here are my suggestions, because I have experienced the same issues myself.

Since I am a mere mortal, I do not have 8 GPUs on my local machine, so I had to reduce the batch size significantly to make training feasible. Once you do that, the learning rate should be scaled down accordingly as well. The only time I experienced what you describe with this architecture was when the learning rate was too high, which produced exploding gradients.

I would bet more than just two cents on this, because you mentioned that you use 4 GPUs yet kept the default learning rate. The way you describe the progression of your training is also highly indicative of this cause.

Currently, I am using

BASE_LR: 0.002

and it works.
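(For context, a common heuristic here is the linear scaling rule: scale the learning rate in proportion to the effective batch size. A minimal sketch, assuming the default BASE_LR of 0.02 was tuned for an 8-GPU setup with a fixed per-GPU batch size:)

```python
def scaled_lr(base_lr: float, base_num_gpus: int, num_gpus: int) -> float:
    """Linear scaling rule: keep the per-GPU batch size fixed and scale the
    learning rate with the effective batch size (i.e. the number of GPUs)."""
    return base_lr * num_gpus / base_num_gpus

# The default BASE_LR of 0.02 presumably targets 8 GPUs; with 4 GPUs the
# rule suggests 0.01, and in practice an even smaller value (such as the
# 0.002 above) may be needed to keep the gradients from exploding.
print(scaled_lr(0.02, base_num_gpus=8, num_gpus=4))  # 0.01
```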

Hope this helps!

@adityagupta-9900 (Author)

@mondrasovic
Thank you so much. It helped a lot.
