Nan for each epoch #5767
Description
Hello
I am having problems training a convolutional neural network.
The loss is NaN at the end of every epoch. I have also tried changing the loss function, but I get the same error. I do not see any NaNs inside the network in any batch, yet the final loss is NaN.
The training log is below:
2017-04-10 19:58:31,948 Host Batch: 154 conv1a_backward_weight 0.000210175
2017-04-10 19:58:31,948 Host Batch: 154 conv1a_weight 0.0382393
2017-04-10 19:58:31,948 Host Batch: 154 linear2_backward_weight 0.201007
2017-04-10 19:58:31,949 Host Batch: 154 linear2_weight 0.0153211
2017-04-10 19:58:31,949 Host Batch: 154 res2a_branch1_backward_weight 0.000367878
2017-04-10 19:58:31,949 Host Batch: 154 res2a_branch1_weight 0.246445
2017-04-10 19:58:31,949 Host Batch: 154 res2a_branch2a_backward_weight 0.000476692
2017-04-10 19:58:31,949 Host Batch: 154 res2a_branch2a_weight 0.0853167
2017-04-10 19:58:31,949 Host Batch: 154 res2a_branch2b1_backward_weight 0.000416497
2017-04-10 19:58:31,949 Host Batch: 154 res2a_branch2b1_weight 0.0608358
2017-04-10 19:58:31,949 Host Batch: 154 res2b1_branch2a_backward_weight 0.000309442
2017-04-10 19:58:31,949 Host Batch: 154 res2b1_branch2a_weight 0.0601303
2017-04-10 19:58:31,949 Host Batch: 154 res2b1_branch2b1_backward_weight 0.000337989
2017-04-10 19:58:31,949 Host Batch: 154 res2b1_branch2b1_weight 0.060109
2017-04-10 19:58:31,949 Host Batch: 154 res2b2_branch2a_backward_weight 0.000268277
2017-04-10 19:58:31,949 Host Batch: 154 res2b2_branch2a_weight 0.0594585
2017-04-10 19:58:31,950 Host Batch: 154 res2b2_branch2b1_backward_weight 0.000302767
2017-04-10 19:58:31,950 Host Batch: 154 res2b2_branch2b1_weight 0.0588229
2017-04-10 19:58:31,950 Host Batch: 154 res3a_branch1_backward_weight 0.000458686
2017-04-10 19:58:31,950 Host Batch: 154 res3a_branch1_weight 0.174965
2017-04-10 19:58:31,950 Host Batch: 154 res3a_branch2a_backward_weight 0.000435945
2017-04-10 19:58:31,950 Host Batch: 154 res3a_branch2a_weight 0.0593551
2017-04-10 19:58:31,950 Host Batch: 154 res3a_branch2b1_backward_weight 0.000483657
2017-04-10 19:58:31,950 Host Batch: 154 res3a_branch2b1_weight 0.0420545
2017-04-10 19:58:31,950 Host Batch: 154 res3b1_branch2a_backward_weight 0.000396346
2017-04-10 19:58:31,950 Host Batch: 154 res3b1_branch2a_weight 0.0419271
2017-04-10 19:58:31,950 Host Batch: 154 res3b1_branch2b1_backward_weight 0.00039641
2017-04-10 19:58:31,950 Host Batch: 154 res3b1_branch2b1_weight 0.0419139
2017-04-10 19:58:31,950 Host Batch: 154 res3b2_branch2a_backward_weight 0.000321227
2017-04-10 19:58:31,950 Host Batch: 154 res3b2_branch2a_weight 0.0417158
2017-04-10 19:58:31,951 Host Batch: 154 res3b2_branch2b1_backward_weight 0.000357124
2017-04-10 19:58:31,951 Host Batch: 154 res3b2_branch2b1_weight 0.0417599
2017-04-10 19:58:31,951 Host Batch: 154 res4a_branch1_backward_weight 0.000450693
2017-04-10 19:58:31,951 Host Batch: 154 res4a_branch1_weight 0.125856
2017-04-10 19:58:31,951 Host Batch: 154 res4a_branch2a_backward_weight 0.000492684
2017-04-10 19:58:31,951 Host Batch: 154 res4a_branch2a_weight 0.0417395
2017-04-10 19:58:31,951 Host Batch: 154 res4a_branch2b1_backward_weight 0.000484657
2017-04-10 19:58:31,951 Host Batch: 154 res4a_branch2b1_weight 0.0296349
2017-04-10 19:58:31,951 Host Batch: 154 res4b1_branch2a_backward_weight 0.000380861
2017-04-10 19:58:31,951 Host Batch: 154 res4b1_branch2a_weight 0.0295497
2017-04-10 19:58:31,951 Host Batch: 154 res4b1_branch2b1_backward_weight 0.000409661
2017-04-10 19:58:31,951 Host Batch: 154 res4b1_branch2b1_weight 0.0295103
2017-04-10 19:58:31,951 Host Batch: 154 res4b2_branch2a_backward_weight 0.000353395
2017-04-10 19:58:31,951 Host Batch: 154 res4b2_branch2a_weight 0.0295437
2017-04-10 19:58:31,951 Host Batch: 154 res4b2_branch2b1_backward_weight 0.000381005
2017-04-10 19:58:31,952 Host Batch: 154 res4b2_branch2b1_weight 0.0295163
2017-04-10 19:58:31,952 Host Batch: 154 res4b3_branch2a_backward_weight 0.000388173
2017-04-10 19:58:31,952 Host Batch: 154 res4b3_branch2a_weight 0.0295094
2017-04-10 19:58:31,952 Host Batch: 154 res4b3_branch2b1_backward_weight 0.000353592
2017-04-10 19:58:31,952 Host Batch: 154 res4b3_branch2b1_weight 0.0294803
2017-04-10 19:58:31,952 Host Batch: 154 res4b4_branch2a_backward_weight 0.000375338
2017-04-10 19:58:31,952 Host Batch: 154 res4b4_branch2a_weight 0.0295069
2017-04-10 19:58:31,952 Host Batch: 154 res4b4_branch2b1_backward_weight 0.000341182
2017-04-10 19:58:31,952 Host Batch: 154 res4b4_branch2b1_weight 0.0294631
2017-04-10 19:58:31,952 Host Batch: 154 res4b5_branch2a_backward_weight 0.000368075
2017-04-10 19:58:31,952 Host Batch: 154 res4b5_branch2a_weight 0.0294748
2017-04-10 19:58:31,952 Host Batch: 154 res4b5_branch2b1_backward_weight 0.000323147
2017-04-10 19:58:31,952 Host Batch: 154 res4b5_branch2b1_weight 0.0294281
2017-04-10 19:58:31,953 Host Batch: 154 res5a_branch1_backward_weight 0.000713526
2017-04-10 19:58:31,953 Host Batch: 154 res5a_branch1_weight 0.0883342
2017-04-10 19:58:31,953 Host Batch: 154 res5a_branch2a_backward_weight 0.000745788
2017-04-10 19:58:31,953 Host Batch: 154 res5a_branch2a_weight 0.0295571
2017-04-10 19:58:31,953 Host Batch: 154 res5a_branch2b1_backward_weight 0.000752134
2017-04-10 19:58:31,953 Host Batch: 154 res5a_branch2b1_weight 0.0294675
2017-04-10 19:58:31,953 Host Batch: 154 res5b1_branch2a_backward_weight 0.000697756
2017-04-10 19:58:31,953 Host Batch: 154 res5b1_branch2a_weight 0.0208629
2017-04-10 19:58:31,953 Host Batch: 154 res5b1_branch2b1_backward_weight 0.000691178
2017-04-10 19:58:31,953 Host Batch: 154 res5b1_branch2b1_weight 0.029506
2017-04-10 19:58:31,953 Host Batch: 154 res5b2_branch2a_backward_weight 0.000745268
2017-04-10 19:58:31,953 Host Batch: 154 res5b2_branch2a_weight 0.0208363
2017-04-10 19:58:31,954 Host Batch: 154 res5b2_branch2b1_backward_weight 0.000626755
2017-04-10 19:58:31,954 Host Batch: 154 res5b2_branch2b1_weight 0.0294602
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch1_backward_weight 0.000802658
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch1_weight 0.06239
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch2a_backward_weight 0.00172652
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch2a_weight 0.0626998
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch2b1_backward_weight 0.00119287
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch2b1_weight 0.0294494
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch2b2_backward_weight 0.000777648
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch2b2_weight 0.0624873
2017-04-10 19:58:31,954 Host Batch: 154 res7a_branch1_backward_weight 0.00046971
2017-04-10 19:58:31,954 Host Batch: 154 res7a_branch1_weight 0.0441438
2017-04-10 19:58:31,955 Host Batch: 154 res7a_branch2a_backward_weight 0.00100658
2017-04-10 19:58:31,955 Host Batch: 154 res7a_branch2a_weight 0.0442069
2017-04-10 19:58:31,955 Host Batch: 154 res7a_branch2b1_backward_weight 0.000699126
2017-04-10 19:58:31,955 Host Batch: 154 res7a_branch2b1_weight 0.0208296
2017-04-10 19:58:31,955 Host Batch: 154 res7a_branch2b2_backward_weight 0.000475449
2017-04-10 19:58:31,955 Host Batch: 154 res7a_branch2b2_weight 0.0442136
2017-04-10 19:58:31,955 Host Epoch[1] Batch [76] Speed: 1.39 samples/sec Train-cross-entropy=0.856569
2017-04-10 19:58:31,955 Host Epoch[1] Train-cross-entropy=nan
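
One possible way to narrow this down, given that the batch-level value above is finite (0.856569) while the epoch-level value is nan: if the epoch metric is a sum over batches divided by a count, a single non-finite batch loss is enough to make the whole epoch value NaN, even when the batches that happen to be logged look fine. The sketch below only illustrates that effect in plain Python. It assumes this accumulate-then-divide behaviour and is not taken from the library's code; the function names and the loss values are hypothetical.

```python
import math


def epoch_mean(batch_losses):
    """Accumulate-then-divide mean (assumed metric behaviour, not the library's code)."""
    total = 0.0
    count = 0
    for loss in batch_losses:
        total += loss  # a single NaN here makes the total NaN from then on
        count += 1
    return total / count if count else float("nan")


def first_non_finite(batch_losses):
    """Return the index of the first NaN/inf batch loss, or None if all are finite."""
    for index, loss in enumerate(batch_losses):
        if not math.isfinite(loss):
            return index
    return None


# Hypothetical per-batch losses: every value is finite except batch 2.
losses = [0.91, 0.86, float("nan"), 0.84]

epoch_value = epoch_mean(losses)      # NaN, although most batches are finite
bad_batch = first_non_finite(losses)  # index of the batch to inspect
print("epoch mean:", epoch_value, "first bad batch:", bad_batch)
```

Logging the loss for every batch (rather than at an interval) and running a check like `first_non_finite` on it would show whether one specific batch, and therefore one specific set of inputs or labels, is responsible.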