
Loss goes to NaN at 150K Iterations #8

Closed
killawhale2 opened this issue Jul 27, 2020 · 7 comments

Comments

@killawhale2

Following this issue, I've fixed the problem with the prior/reg loss weights as per the author's response (adding 1e-6 to avoid division by zero).
However, I then noticed that my loc_loss and reg_loss became NaN.
I retried with gradient clipping by setting the --clip_grad option to True, but loc_loss and reg_loss still became NaN at 150K iterations and the training failed.
The exact command I ran was the following:
python ssd/train_bidet_ssd.py --dataset VOC --data_root ./data/VOCdevkit/ --basenet ./ssd/pretrain/vgg16.pth --clip_grad true
Any help would be appreciated.
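
For context, a minimal sketch of the two changes involved here: the 1e-6 epsilon added to the loss-weight denominator and the gradient clipping that --clip_grad is meant to enable. The function names, the shape of the weight computation, and the max_norm value are illustrative assumptions rather than the repository's actual code; only the 1e-6 epsilon and the idea of clipping come from this thread.

```python
import torch

EPS = 1e-6  # the epsilon suggested by the author to avoid division by zero

def weighted_prior_reg_loss(prior_loss, reg_loss, weight_num, weight_den):
    # Hypothetical shape of the loss-weight computation: the denominator can
    # reach zero, so add EPS before dividing to keep the weight finite.
    weight = weight_num / (weight_den + EPS)
    return weight * prior_loss + reg_loss

def clipped_step(loss, model, optimizer, max_norm=5.0):
    # What --clip_grad is expected to do: clip the global gradient norm
    # before the optimizer update (max_norm=5.0 is an assumed value).
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```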

@Wuziyi616
Contributor

Hmm, that's quite strange; I will look into it. BTW, you could try evaluating the model weights from before the loss goes to NaN and report the mAP here; that would help me figure out what the problem is.

@killawhale2
Author

Thank you again for your quick replies! I ran the eval code using the model weights from iteration 145K (just before the loss goes to NaN), and the mAP I got was 56.06. The APs for the individual categories are as follows:

AP for aeroplane = 0.6938
AP for bicycle = 0.6756
AP for bird = 0.4631
AP for boat = 0.4548
AP for bottle = 0.3000
AP for bus = 0.6628
AP for car = 0.7311
AP for cat = 0.6751
AP for chair = 0.3491
AP for cow = 0.4354
AP for diningtable = 0.5509
AP for dog = 0.5649
AP for horse = 0.7091
AP for motorbike = 0.6937
AP for person = 0.5767
AP for pottedplant = 0.2871
AP for sheep = 0.4872
AP for sofa = 0.6301
AP for train = 0.7178
AP for tvmonitor = 0.5533

Hope this information will be useful!

@Wuziyi616
Contributor

Wuziyi616 commented Jul 27, 2020

According to my learning rate decay schedule, the lr at iteration 150K should be 1e-5. That's a small value, and I don't think training should break at this point. Also, from my experience training BiDet, the network should have converged by 150K iterations, so I'd expect the mAP to be around 66.0 before the loss goes to NaN.

BTW, I have to say that training binary neural networks, and especially binary detectors, is very unstable. In my experiments I have to watch the loss curve and sometimes manually adjust the learning rate when training "breaks". The training of binary SSD often breaks, while binary Faster R-CNN is much more stable. One indicator that binary-SSD training has broken is the cls loss (termed 'conf' in the saved weight files) suddenly dropping sharply over a few iterations (e.g. 3.55 --> 3.54 --> 3.52 --> 3.40); in that case you should kill the program, manually decay the lr by 0.1, and continue training.

Besides, the lr decay schedule in config.py is just an empirical one; I ran the code multiple times, and sometimes you need to decay earlier to prevent training from breaking. Also, with different PyTorch versions you may get different results. For example, I set up several conda virtual environments on one Ubuntu server and tried running the code. For BiDet-SSD on Pascal VOC, I got 66.6% mAP with PyTorch 1.5 (2020.5), 65.4% mAP with PyTorch 1.2 (2020.3), and the 66.0% mAP reported in the paper was obtained with PyTorch 1.0 (2019.11). I really don't know why; maybe the training of binary neural networks is just too unstable and full of uncertainty. Different lr schedules, or even different weight initializations, can lead to different results.

So my suggestion is that you try training again and monitor the loss of BiDet-SSD. Manually decaying the learning rate should give you more stable results, I think.
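
To make the "watch the conf loss and decay the lr by 0.1" recipe concrete, here is a small sketch of such a heuristic, assuming the conf loss is appended to a list as training is logged. The window size and relative-drop threshold are tuning assumptions, not values from the repository.

```python
def training_broke(conf_history, window=5, rel_drop=0.03):
    # Heuristic with assumed values: treat training as "broken" if the conf
    # loss fell by more than rel_drop of its value over the last `window`
    # logged points, e.g. 3.55 --> 3.40 as in the example above.
    if len(conf_history) <= window:
        return False
    prev, curr = conf_history[-window - 1], conf_history[-1]
    return (prev - curr) / max(prev, 1e-12) > rel_drop

def decay_lr(optimizer, factor=0.1):
    # Multiply every parameter group's learning rate by `factor`
    # (0.1, as suggested above).
    for group in optimizer.param_groups:
        group['lr'] *= factor
```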

@Wuziyi616
Contributor

Ah, I'm sorry, I didn't see the response you posted before my last comment. 56 is much lower than 66, so there must be something wrong in the training procedure. Perhaps the training "broke" earlier, before iteration 145K? Does the conf loss decrease abnormally, as I described in my last comment (a large drop within ~5K iterations)? If so, the weights at iteration 145K would certainly be affected and perform badly.

@killawhale2
Author

I checked, and the conf_loss is relatively stable but jumps from 0.6536 to 2.2935 between iterations 145K and 150K. I'll try different learning rate schedules as suggested.

@Wuziyi616
Contributor

Indeed. To get good performance, I'd recommend monitoring the loss and decaying the lr only when training breaks at the current lr (i.e., the conf loss decreases rapidly). The best approach is to kill the program when training breaks and restart with a decayed lr from the weights saved before the break. I guess this is because binary neural networks easily get stuck in local minima, so the more iterations you train with a large lr, the less likely you are to end up in a local minimum and the better the performance you will get (at least for binary detectors).
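
A minimal sketch of that restart step, assuming the checkpoint is a plain PyTorch state_dict and that SGD is used; the base lr, momentum, and weight decay here are placeholder assumptions, not values from the repository's config.

```python
import torch
import torch.optim as optim

def resume_with_decayed_lr(model, checkpoint_path, base_lr=1e-4, decay=0.1):
    # Load the weights saved just before training broke, then rebuild the
    # optimizer with the lr decayed by `decay` (0.1 as suggested above).
    state_dict = torch.load(checkpoint_path, map_location='cpu')
    model.load_state_dict(state_dict)
    optimizer = optim.SGD(model.parameters(), lr=base_lr * decay,
                          momentum=0.9, weight_decay=5e-4)
    return optimizer
```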

@matthai

matthai commented Sep 15, 2021

> I checked, and the conf_loss is relatively stable but jumps from 0.6536 to 2.2935 between iterations 145K and 150K. I'll try different learning rate schedules as suggested.

@killawhale2 were you able to get to ~65% accuracy in the end? It would be great to hear from folks who have managed to replicate it (so we can try to make a robust recipe).
