
Loss goes to NaN at 150K Iterations #8

Closed
killawhale2 opened this issue Jul 27, 2020 · 7 comments

Comments

@killawhale2

Following this issue, I've fixed the problem with the prior/reg loss weights as per the author's response (adding 1e-6 to avoid division by zero).
However, I then noticed that my loc_loss and reg_loss became NaN.
I retried with gradient clipping by setting the --clip_grad option to True, but loc_loss and reg_loss still became NaN at 150K iterations and the training failed.
The exact command I ran was the following:
python ssd/train_bidet_ssd.py --dataset VOC --data_root ./data/VOCdevkit/ --basenet ./ssd/pretrain/vgg16.pth --clip_grad true
Any help would be appreciated.
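
For context, a minimal sketch of the two changes involved here: the 1e-6 epsilon added to the loss-weight denominator and the gradient clipping that --clip_grad is meant to enable. The function names, the shape of the weight computation, and the max_norm value are illustrative assumptions rather than the repository's actual code; only the 1e-6 epsilon and the idea of clipping come from this thread.

```python
import torch

EPS = 1e-6  # the epsilon suggested by the author to avoid division by zero

def weighted_prior_reg_loss(prior_loss, reg_loss, weight_num, weight_den):
    # Hypothetical shape of the loss-weight computation: the denominator can
    # reach zero, so add EPS before dividing to keep the weight finite.
    weight = weight_num / (weight_den + EPS)
    return weight * prior_loss + reg_loss

def clipped_step(loss, model, optimizer, max_norm=5.0):
    # What --clip_grad is expected to do: clip the global gradient norm
    # before the optimizer update (max_norm=5.0 is an assumed value).
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```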

@Wuziyi616
Contributor

Hmm, that's quite strange; I will look into it. BTW, you could try evaluating the model weights from before the loss goes to NaN and report the mAP here; that would help me figure out what the problem is.

@killawhale2
Author

Thank you again for your quick replies! I ran the eval code using the model weights from iteration 145K (just before the loss goes to NaN), and the mAP I got was 56.06. The APs for the individual categories are as follows:

AP for aeroplane = 0.6938
AP for bicycle = 0.6756
AP for bird = 0.4631
AP for boat = 0.4548
AP for bottle = 0.3000
AP for bus = 0.6628
AP for car = 0.7311
AP for cat = 0.6751
AP for chair = 0.3491
AP for cow = 0.4354
AP for diningtable = 0.5509
AP for dog = 0.5649
AP for horse = 0.7091
AP for motorbike = 0.6937
AP for person = 0.5767
AP for pottedplant = 0.2871
AP for sheep = 0.4872
AP for sofa = 0.6301
AP for train = 0.7178
AP for tvmonitor = 0.5533

Hope this information will be useful!

@Wuziyi616
Contributor

Wuziyi616 commented Jul 27, 2020

According to my learning rate decay schedule, the lr at iteration 150K should be 1e-5. That's a small value, and I don't think training should break at this point. Also, from my experience training BiDet, the network should have converged by 150K iterations, so I'd expect the mAP to be around 66.0 before the loss goes to NaN.

BTW, I have to say that training binary neural networks, and especially binary detectors, is very unstable. In my experiments I have to watch the loss curve and sometimes manually adjust the learning rate when training "breaks". The training of binary SSD often breaks, while binary Faster R-CNN is much more stable. One indicator that binary-SSD training has broken is the cls loss (termed 'conf' in the saved weight files) suddenly dropping sharply over a few iterations (e.g. 3.55 --> 3.54 --> 3.52 --> 3.40); in that case you should kill the program, manually decay the lr by 0.1, and continue training.

Besides, the lr decay schedule in config.py is just an empirical one; I ran the code multiple times, and sometimes you need to decay earlier to prevent training from breaking. Also, with different PyTorch versions you may get different results. For example, I set up several conda virtual environments on one Ubuntu server and tried running the code. For BiDet-SSD on Pascal VOC, I got 66.6% mAP with PyTorch 1.5 (2020.5), 65.4% mAP with PyTorch 1.2 (2020.3), and the 66.0% mAP reported in the paper was obtained with PyTorch 1.0 (2019.11). I really don't know why; maybe the training of binary neural networks is just too unstable and full of uncertainty. Different lr schedules, or even different weight initializations, can lead to different results.

So my suggestion is that you try training again and monitor the loss of BiDet-SSD. Manually decaying the learning rate should give you more stable results, I think.
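
To make the "watch the conf loss and decay the lr by 0.1" recipe concrete, here is a small sketch of such a heuristic, assuming the conf loss is appended to a list as training is logged. The window size and relative-drop threshold are tuning assumptions, not values from the repository.

```python
def training_broke(conf_history, window=5, rel_drop=0.03):
    # Heuristic with assumed values: treat training as "broken" if the conf
    # loss fell by more than rel_drop of its value over the last `window`
    # logged points, e.g. 3.55 --> 3.40 as in the example above.
    if len(conf_history) <= window:
        return False
    prev, curr = conf_history[-window - 1], conf_history[-1]
    return (prev - curr) / max(prev, 1e-12) > rel_drop

def decay_lr(optimizer, factor=0.1):
    # Multiply every parameter group's learning rate by `factor`
    # (0.1, as suggested above).
    for group in optimizer.param_groups:
        group['lr'] *= factor
```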

@Wuziyi616
Contributor

Ah, I'm sorry, I didn't see the response you posted before my last comment. 56 is much lower than 66, so there must be something wrong in the training procedure. Perhaps the training "broke" earlier, before iteration 145K? Does the conf loss decrease abnormally, as I described in my last comment (a large drop within ~5K iterations)? If so, the weights at iteration 145K would certainly be affected and perform badly.

@killawhale2
Author

I checked, and the conf_loss is relatively stable but jumps from 0.6536 to 2.2935 between iterations 145K and 150K. I'll try different learning rate schedules as suggested.

@Wuziyi616
Contributor

Indeed. To get good performance, I'd recommend monitoring the loss and decaying the lr only when training breaks at the current lr (i.e., the conf loss decreases rapidly). The best approach is to kill the program when training breaks and restart with a decayed lr from the weights saved before the break. I guess this is because binary neural networks easily get stuck in local minima, so the more iterations you train with a large lr, the less likely you are to end up in a local minimum and the better the performance you will get (at least for binary detectors).
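
A minimal sketch of that restart step, assuming the checkpoint is a plain PyTorch state_dict and that SGD is used; the base lr, momentum, and weight decay here are placeholder assumptions, not values from the repository's config.

```python
import torch
import torch.optim as optim

def resume_with_decayed_lr(model, checkpoint_path, base_lr=1e-4, decay=0.1):
    # Load the weights saved just before training broke, then rebuild the
    # optimizer with the lr decayed by `decay` (0.1 as suggested above).
    state_dict = torch.load(checkpoint_path, map_location='cpu')
    model.load_state_dict(state_dict)
    optimizer = optim.SGD(model.parameters(), lr=base_lr * decay,
                          momentum=0.9, weight_decay=5e-4)
    return optimizer
```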

@matthai

matthai commented Sep 15, 2021

> I checked, and the conf_loss is relatively stable but jumps from 0.6536 to 2.2935 between iterations 145K and 150K. I'll try different learning rate schedules as suggested.

@killawhale2 were you able to get to ~65% accuracy in the end? It would be great to hear from folks who have managed to replicate it (so we can try to make a robust recipe).
