This repository was archived by the owner on Nov 17, 2023. It is now read-only.

NaN for each epoch #5767

@ChristianEschen


Hello

I am having a problem training a convolutional neural network.
The loss reported at the end of each epoch is NaN. I have also tried a different loss function, with the same result. I do not see any NaNs in the internal network statistics for any batch, yet the final epoch loss is NaN.
The training log is below:

2017-04-10 19:58:31,948 Host Batch: 154 conv1a_backward_weight 0.000210175
2017-04-10 19:58:31,948 Host Batch: 154 conv1a_weight 0.0382393
2017-04-10 19:58:31,948 Host Batch: 154 linear2_backward_weight 0.201007
2017-04-10 19:58:31,949 Host Batch: 154 linear2_weight 0.0153211
2017-04-10 19:58:31,949 Host Batch: 154 res2a_branch1_backward_weight 0.000367878
2017-04-10 19:58:31,949 Host Batch: 154 res2a_branch1_weight 0.246445
2017-04-10 19:58:31,949 Host Batch: 154 res2a_branch2a_backward_weight 0.000476692
2017-04-10 19:58:31,949 Host Batch: 154 res2a_branch2a_weight 0.0853167
2017-04-10 19:58:31,949 Host Batch: 154 res2a_branch2b1_backward_weight 0.000416497
2017-04-10 19:58:31,949 Host Batch: 154 res2a_branch2b1_weight 0.0608358
2017-04-10 19:58:31,949 Host Batch: 154 res2b1_branch2a_backward_weight 0.000309442
2017-04-10 19:58:31,949 Host Batch: 154 res2b1_branch2a_weight 0.0601303
2017-04-10 19:58:31,949 Host Batch: 154 res2b1_branch2b1_backward_weight 0.000337989
2017-04-10 19:58:31,949 Host Batch: 154 res2b1_branch2b1_weight 0.060109
2017-04-10 19:58:31,949 Host Batch: 154 res2b2_branch2a_backward_weight 0.000268277
2017-04-10 19:58:31,949 Host Batch: 154 res2b2_branch2a_weight 0.0594585
2017-04-10 19:58:31,950 Host Batch: 154 res2b2_branch2b1_backward_weight 0.000302767
2017-04-10 19:58:31,950 Host Batch: 154 res2b2_branch2b1_weight 0.0588229
2017-04-10 19:58:31,950 Host Batch: 154 res3a_branch1_backward_weight 0.000458686
2017-04-10 19:58:31,950 Host Batch: 154 res3a_branch1_weight 0.174965
2017-04-10 19:58:31,950 Host Batch: 154 res3a_branch2a_backward_weight 0.000435945
2017-04-10 19:58:31,950 Host Batch: 154 res3a_branch2a_weight 0.0593551
2017-04-10 19:58:31,950 Host Batch: 154 res3a_branch2b1_backward_weight 0.000483657
2017-04-10 19:58:31,950 Host Batch: 154 res3a_branch2b1_weight 0.0420545
2017-04-10 19:58:31,950 Host Batch: 154 res3b1_branch2a_backward_weight 0.000396346
2017-04-10 19:58:31,950 Host Batch: 154 res3b1_branch2a_weight 0.0419271
2017-04-10 19:58:31,950 Host Batch: 154 res3b1_branch2b1_backward_weight 0.00039641
2017-04-10 19:58:31,950 Host Batch: 154 res3b1_branch2b1_weight 0.0419139
2017-04-10 19:58:31,950 Host Batch: 154 res3b2_branch2a_backward_weight 0.000321227
2017-04-10 19:58:31,950 Host Batch: 154 res3b2_branch2a_weight 0.0417158
2017-04-10 19:58:31,951 Host Batch: 154 res3b2_branch2b1_backward_weight 0.000357124
2017-04-10 19:58:31,951 Host Batch: 154 res3b2_branch2b1_weight 0.0417599
2017-04-10 19:58:31,951 Host Batch: 154 res4a_branch1_backward_weight 0.000450693
2017-04-10 19:58:31,951 Host Batch: 154 res4a_branch1_weight 0.125856
2017-04-10 19:58:31,951 Host Batch: 154 res4a_branch2a_backward_weight 0.000492684
2017-04-10 19:58:31,951 Host Batch: 154 res4a_branch2a_weight 0.0417395
2017-04-10 19:58:31,951 Host Batch: 154 res4a_branch2b1_backward_weight 0.000484657
2017-04-10 19:58:31,951 Host Batch: 154 res4a_branch2b1_weight 0.0296349
2017-04-10 19:58:31,951 Host Batch: 154 res4b1_branch2a_backward_weight 0.000380861
2017-04-10 19:58:31,951 Host Batch: 154 res4b1_branch2a_weight 0.0295497
2017-04-10 19:58:31,951 Host Batch: 154 res4b1_branch2b1_backward_weight 0.000409661
2017-04-10 19:58:31,951 Host Batch: 154 res4b1_branch2b1_weight 0.0295103
2017-04-10 19:58:31,951 Host Batch: 154 res4b2_branch2a_backward_weight 0.000353395
2017-04-10 19:58:31,951 Host Batch: 154 res4b2_branch2a_weight 0.0295437
2017-04-10 19:58:31,951 Host Batch: 154 res4b2_branch2b1_backward_weight 0.000381005
2017-04-10 19:58:31,952 Host Batch: 154 res4b2_branch2b1_weight 0.0295163
2017-04-10 19:58:31,952 Host Batch: 154 res4b3_branch2a_backward_weight 0.000388173
2017-04-10 19:58:31,952 Host Batch: 154 res4b3_branch2a_weight 0.0295094
2017-04-10 19:58:31,952 Host Batch: 154 res4b3_branch2b1_backward_weight 0.000353592
2017-04-10 19:58:31,952 Host Batch: 154 res4b3_branch2b1_weight 0.0294803
2017-04-10 19:58:31,952 Host Batch: 154 res4b4_branch2a_backward_weight 0.000375338
2017-04-10 19:58:31,952 Host Batch: 154 res4b4_branch2a_weight 0.0295069
2017-04-10 19:58:31,952 Host Batch: 154 res4b4_branch2b1_backward_weight 0.000341182
2017-04-10 19:58:31,952 Host Batch: 154 res4b4_branch2b1_weight 0.0294631
2017-04-10 19:58:31,952 Host Batch: 154 res4b5_branch2a_backward_weight 0.000368075
2017-04-10 19:58:31,952 Host Batch: 154 res4b5_branch2a_weight 0.0294748
2017-04-10 19:58:31,952 Host Batch: 154 res4b5_branch2b1_backward_weight 0.000323147
2017-04-10 19:58:31,952 Host Batch: 154 res4b5_branch2b1_weight 0.0294281
2017-04-10 19:58:31,953 Host Batch: 154 res5a_branch1_backward_weight 0.000713526
2017-04-10 19:58:31,953 Host Batch: 154 res5a_branch1_weight 0.0883342
2017-04-10 19:58:31,953 Host Batch: 154 res5a_branch2a_backward_weight 0.000745788
2017-04-10 19:58:31,953 Host Batch: 154 res5a_branch2a_weight 0.0295571
2017-04-10 19:58:31,953 Host Batch: 154 res5a_branch2b1_backward_weight 0.000752134
2017-04-10 19:58:31,953 Host Batch: 154 res5a_branch2b1_weight 0.0294675
2017-04-10 19:58:31,953 Host Batch: 154 res5b1_branch2a_backward_weight 0.000697756
2017-04-10 19:58:31,953 Host Batch: 154 res5b1_branch2a_weight 0.0208629
2017-04-10 19:58:31,953 Host Batch: 154 res5b1_branch2b1_backward_weight 0.000691178
2017-04-10 19:58:31,953 Host Batch: 154 res5b1_branch2b1_weight 0.029506
2017-04-10 19:58:31,953 Host Batch: 154 res5b2_branch2a_backward_weight 0.000745268
2017-04-10 19:58:31,953 Host Batch: 154 res5b2_branch2a_weight 0.0208363
2017-04-10 19:58:31,954 Host Batch: 154 res5b2_branch2b1_backward_weight 0.000626755
2017-04-10 19:58:31,954 Host Batch: 154 res5b2_branch2b1_weight 0.0294602
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch1_backward_weight 0.000802658
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch1_weight 0.06239
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch2a_backward_weight 0.00172652
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch2a_weight 0.0626998
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch2b1_backward_weight 0.00119287
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch2b1_weight 0.0294494
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch2b2_backward_weight 0.000777648
2017-04-10 19:58:31,954 Host Batch: 154 res6a_branch2b2_weight 0.0624873
2017-04-10 19:58:31,954 Host Batch: 154 res7a_branch1_backward_weight 0.00046971
2017-04-10 19:58:31,954 Host Batch: 154 res7a_branch1_weight 0.0441438
2017-04-10 19:58:31,955 Host Batch: 154 res7a_branch2a_backward_weight 0.00100658
2017-04-10 19:58:31,955 Host Batch: 154 res7a_branch2a_weight 0.0442069
2017-04-10 19:58:31,955 Host Batch: 154 res7a_branch2b1_backward_weight 0.000699126
2017-04-10 19:58:31,955 Host Batch: 154 res7a_branch2b1_weight 0.0208296
2017-04-10 19:58:31,955 Host Batch: 154 res7a_branch2b2_backward_weight 0.000475449
2017-04-10 19:58:31,955 Host Batch: 154 res7a_branch2b2_weight 0.0442136
2017-04-10 19:58:31,955 Host Epoch[1] Batch [76] Speed: 1.39 samples/sec Train-cross-entropy=0.856569
2017-04-10 19:58:31,955 Host Epoch[1] Train-cross-entropy=nan
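
One way to narrow this down is to look at the per-batch loss values directly rather than only the epoch summary. The following is a minimal sketch in plain Python, assuming the per-batch losses can be collected into a list. The helper names `first_non_finite` and `epoch_mean` are hypothetical and not part of any library, and all values in `losses` except 0.856569 are made up for illustration. It shows that a single NaN batch is enough to turn the averaged loss into NaN, and that an average over zero collected batches is also undefined.

```python
import math

def first_non_finite(batch_losses):
    """Return (index, value) of the first NaN/inf loss, or None if all are finite."""
    for i, loss in enumerate(batch_losses):
        if not math.isfinite(loss):
            return i, loss
    return None

def epoch_mean(batch_losses):
    """Mean of the per-batch losses; NaN if the list is empty or contains a NaN."""
    if not batch_losses:
        return float("nan")  # nothing accumulated, so the mean is undefined
    return sum(batch_losses) / len(batch_losses)

# Hypothetical per-batch losses; only 0.856569 is taken from the log above.
losses = [0.91, 0.856569, float("nan"), 0.80]

hit = first_non_finite(losses)
if hit is not None:
    print("first non-finite loss at batch", hit[0])
print("epoch mean is NaN:", math.isnan(epoch_mean(losses)))
```

If every per-batch value turns out to be finite, the NaN is more likely introduced where the epoch value is aggregated than inside the network itself.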
