
NaN values for loss and accuracy in training and testing #23

Closed

adityagupta-9900 opened this issue Sep 4, 2021 · 5 comments
@adityagupta-9900 commented Sep 4, 2021

Given our available capacity, we reduced training to 4 GPUs and kept the learning rate at the default of 0.02. After 40-60 iterations we started getting NaN loss values.
We then reduced the learning rate to 0.015 and retrained. Even so, after >200 iterations the loss sometimes becomes NaN and sometimes the run finishes fine.
When we tested a model whose training produced NaN losses, all of the accuracy values in the output table also came out as NaN.

[Screenshot: training log showing NaN loss values]
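(For anyone trying to narrow down where the losses blow up, a minimal debugging sketch like the one below can be dropped into the training loop. The `loss_dict` and `iteration` names mirror common detection codebases and are assumptions here, not this repository's exact variables.)

```python
import math

def assert_finite_losses(loss_dict: dict, iteration: int) -> None:
    """Raise as soon as any loss term becomes NaN or infinite, reporting
    which term it was and at which iteration it first appeared."""
    for name, value in loss_dict.items():
        value = float(value)  # works for plain floats and 0-dim torch tensors
        if not math.isfinite(value):
            raise RuntimeError(
                f"Loss '{name}' became {value} at iteration {iteration}; "
                "lowering the learning rate is the usual first thing to try."
            )

# Hypothetical usage inside the training loop, right after the losses are computed:
# assert_finite_losses(loss_dict, iteration)
```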

@mondrasovic

I am currently working on a solution to this. So far, I have identified that the ground-truth bounding boxes disappear during the filtering phase (code here).

As you can see in the report, everything is labeled as FP (false positive). Why, I still have no idea, but it is one of the tasks I have to solve before moving forward.

There is another issue reporting the very same problem.
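(A minimal sanity check for this, assuming the ground-truth boxes are available as (N, 4) tensors in (x1, y1, x2, y2) format; the `filter_boxes` call in the usage comment is a stand-in for the project's actual filtering step, not its real API.)

```python
import torch

def report_box_filtering(boxes_before: torch.Tensor,
                         boxes_after: torch.Tensor,
                         tag: str = "gt-filter") -> None:
    """Print how many ground-truth boxes survive a filtering step and how
    many of the inputs were degenerate (non-positive width or height)."""
    widths = boxes_before[:, 2] - boxes_before[:, 0]
    heights = boxes_before[:, 3] - boxes_before[:, 1]
    num_degenerate = int(((widths <= 0) | (heights <= 0)).sum())
    print(f"[{tag}] before: {boxes_before.shape[0]}, "
          f"after: {boxes_after.shape[0]}, degenerate: {num_degenerate}")

# Hypothetical usage around the filtering call linked above:
# filtered = filter_boxes(gt_boxes)  # stand-in for the project's actual filter
# report_box_filtering(gt_boxes, filtered)
```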

@mondrasovic

For the solution, see my answer in this related issue in this project.

@adityagupta-9900 (Author)

@mondrasovic Thank you so much for your help.
There is another thing I need to clarify, though. I am getting NaN values even during training: after 40-60 iterations, I start getting NaN (NaN) loss values when training on the MOT dataset.
Could you suggest why this happens?

@mondrasovic commented Sep 10, 2021

> @mondrasovic Thank you so much for your help.
> There is another thing I need to clarify, though. I am getting NaN values even during training: after 40-60 iterations, I start getting NaN (NaN) loss values when training on the MOT dataset.
> Could you suggest why this happens?

I am sorry that I only answered one part of the question; I was stuck on that problem, which narrowed my focus. Nevertheless, here are my suggestions, because I have experienced the same issues myself.

Since I am a mere mortal, I do not have 8 GPUs on my local machine, so I had to reduce the batch size significantly to make training feasible. Once you do that, the learning rate should be scaled down accordingly as well. The only time I experienced what you describe with this architecture was when the learning rate was too high, which produced exploding gradients.

I would bet more than just two cents on this, because you mentioned that you use 4 GPUs yet kept the default learning rate. The way you describe the progression of your training is also highly indicative of this cause.

Currently, I am using

BASE_LR: 0.002

and it works.
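(For context, a common heuristic here is the linear scaling rule: scale the learning rate in proportion to the effective batch size. A minimal sketch, assuming the default BASE_LR of 0.02 was tuned for an 8-GPU setup with a fixed per-GPU batch size:)

```python
def scaled_lr(base_lr: float, base_num_gpus: int, num_gpus: int) -> float:
    """Linear scaling rule: keep the per-GPU batch size fixed and scale the
    learning rate with the effective batch size (i.e. the number of GPUs)."""
    return base_lr * num_gpus / base_num_gpus

# The default BASE_LR of 0.02 presumably targets 8 GPUs; with 4 GPUs the
# rule suggests 0.01, and in practice an even smaller value (such as the
# 0.002 above) may be needed to keep the gradients from exploding.
print(scaled_lr(0.02, base_num_gpus=8, num_gpus=4))  # 0.01
```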

Hope this helps!

@adityagupta-9900 (Author)

@mondrasovic
Thank you so much. It helped a lot.
