This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

NaN loss when using pretrained weights from my custom classification dataset #868

Open
FishWoWater opened this issue Apr 15, 2019 · 11 comments


@FishWoWater

FishWoWater commented Apr 15, 2019

I am training with the e2e_faster_rcnn_R-101-FPN_2x config, initialized from my own pretrained model. The loss becomes NaN after the second forward pass. When I use the official ImageNet-pretrained model instead, nothing goes wrong.

Expected results

Training should proceed as it does with the official pretrained model.

Actual results

Here is part of the log. Comparing it with the output from training with the official weights file, I noticed that the number of RoIs below is abnormally small. I wonder whether something is going wrong there.

INFO net.py: 254: res5_2_branch2c_bn          : (1, 2048, 25, 34)    => res5_2_sum                  : (1, 2048, 25, 34)    ------- (op: Sum)
INFO net.py: 254: res5_1_branch2c_bn          : (1, 2048, 25, 34)    => res5_2_sum                  : (1, 2048, 25, 34)    ------|
INFO net.py: 254: res5_2_sum                  : (1, 2048, 25, 34)    => res5_2_sum                  : (1, 2048, 25, 34)    ------- (op: Relu)
INFO net.py: 254: res5_2_sum                  : (1, 2048, 25, 34)    => fpn_inner_res5_2_sum        : (1, 256, 25, 34)     ------- (op: Conv)
INFO net.py: 254: res4_22_sum                 : (1, 1024, 50, 68)    => fpn_inner_res4_22_sum_lateral: (1, 256, 50, 68)     ------- (op: Conv)
INFO net.py: 254: fpn_inner_res5_2_sum        : (1, 256, 25, 34)     => fpn_inner_res4_22_sum_topdown: (1, 256, 50, 68)     ------- (op: UpsampleNearest)
INFO net.py: 254: fpn_inner_res4_22_sum_lateral: (1, 256, 50, 68)     => fpn_inner_res4_22_sum       : (1, 256, 50, 68)     ------- (op: Sum)
INFO net.py: 254: fpn_inner_res4_22_sum_topdown: (1, 256, 50, 68)     => fpn_inner_res4_22_sum       : (1, 256, 50, 68)     ------|
INFO net.py: 254: res3_3_sum                  : (1, 512, 100, 136)   => fpn_inner_res3_3_sum_lateral: (1, 256, 100, 136)   ------- (op: Conv)
INFO net.py: 254: fpn_inner_res4_22_sum       : (1, 256, 50, 68)     => fpn_inner_res3_3_sum_topdown: (1, 256, 100, 136)   ------- (op: UpsampleNearest)
INFO net.py: 254: fpn_inner_res3_3_sum_lateral: (1, 256, 100, 136)   => fpn_inner_res3_3_sum        : (1, 256, 100, 136)   ------- (op: Sum)
INFO net.py: 254: fpn_inner_res3_3_sum_topdown: (1, 256, 100, 136)   => fpn_inner_res3_3_sum        : (1, 256, 100, 136)   ------|
INFO net.py: 254: res2_2_sum                  : (1, 256, 200, 272)   => fpn_inner_res2_2_sum_lateral: (1, 256, 200, 272)   ------- (op: Conv)
INFO net.py: 254: fpn_inner_res3_3_sum        : (1, 256, 100, 136)   => fpn_inner_res2_2_sum_topdown: (1, 256, 200, 272)   ------- (op: UpsampleNearest)
INFO net.py: 254: fpn_inner_res2_2_sum_lateral: (1, 256, 200, 272)   => fpn_inner_res2_2_sum        : (1, 256, 200, 272)   ------- (op: Sum)
INFO net.py: 254: fpn_inner_res2_2_sum_topdown: (1, 256, 200, 272)   => fpn_inner_res2_2_sum        : (1, 256, 200, 272)   ------|
INFO net.py: 254: fpn_inner_res5_2_sum        : (1, 256, 25, 34)     => fpn_res5_2_sum              : (1, 256, 25, 34)     ------- (op: Conv)
INFO net.py: 254: fpn_inner_res4_22_sum       : (1, 256, 50, 68)     => fpn_res4_22_sum             : (1, 256, 50, 68)     ------- (op: Conv)
INFO net.py: 254: fpn_inner_res3_3_sum        : (1, 256, 100, 136)   => fpn_res3_3_sum              : (1, 256, 100, 136)   ------- (op: Conv)
INFO net.py: 254: fpn_inner_res2_2_sum        : (1, 256, 200, 272)   => fpn_res2_2_sum              : (1, 256, 200, 272)   ------- (op: Conv)
INFO net.py: 254: fpn_res5_2_sum              : (1, 256, 25, 34)     => fpn_res5_2_sum_subsampled_2x: (1, 256, 13, 17)     ------- (op: MaxPool)
INFO net.py: 254: fpn_res2_2_sum              : (1, 256, 200, 272)   => conv_rpn_fpn2               : (1, 256, 200, 272)   ------- (op: Conv)
INFO net.py: 254: conv_rpn_fpn2               : (1, 256, 200, 272)   => conv_rpn_fpn2               : (1, 256, 200, 272)   ------- (op: Relu)
INFO net.py: 254: conv_rpn_fpn2               : (1, 256, 200, 272)   => rpn_cls_logits_fpn2         : (1, 3, 200, 272)     ------- (op: Conv)
INFO net.py: 254: conv_rpn_fpn2               : (1, 256, 200, 272)   => rpn_bbox_pred_fpn2          : (1, 12, 200, 272)    ------- (op: Conv)
INFO net.py: 254: rpn_cls_logits_fpn2         : (1, 3, 200, 272)     => rpn_cls_probs_fpn2          : (1, 3, 200, 272)     ------- (op: Sigmoid)
INFO net.py: 254: rpn_cls_probs_fpn2          : (1, 3, 200, 272)     => rpn_rois_fpn2               : (4, 5)               ------- (op: Python:GenerateProposalsOp:gpu_0/rpn_cls_probs_fpn2,gpu_0/rpn_bbox_pred_fpn2,im_info)
INFO net.py: 254: rpn_bbox_pred_fpn2          : (1, 12, 200, 272)    => rpn_rois_fpn2               : (4, 5)               ------|
INFO net.py: 254: im_info                     : (1, 3)               => rpn_rois_fpn2               : (4, 5)               ------|
INFO net.py: 254: fpn_res3_3_sum              : (1, 256, 100, 136)   => conv_rpn_fpn3               : (1, 256, 100, 136)   ------- (op: Conv)
INFO net.py: 254: conv_rpn_fpn3               : (1, 256, 100, 136)   => conv_rpn_fpn3               : (1, 256, 100, 136)   ------- (op: Relu)
INFO net.py: 254: conv_rpn_fpn3               : (1, 256, 100, 136)   => rpn_cls_logits_fpn3         : (1, 3, 100, 136)     ------- (op: Conv)
INFO net.py: 254: conv_rpn_fpn3               : (1, 256, 100, 136)   => rpn_bbox_pred_fpn3          : (1, 12, 100, 136)    ------- (op: Conv)
INFO net.py: 254: rpn_cls_logits_fpn3         : (1, 3, 100, 136)     => rpn_cls_probs_fpn3          : (1, 3, 100, 136)     ------- (op: Sigmoid)
INFO net.py: 254: rpn_cls_probs_fpn3          : (1, 3, 100, 136)     => rpn_rois_fpn3               : (4, 5)               ------- (op: Python:GenerateProposalsOp:gpu_0/rpn_cls_probs_fpn3,gpu_0/rpn_bbox_pred_fpn3,im_info)
INFO net.py: 254: rpn_bbox_pred_fpn3          : (1, 12, 100, 136)    => rpn_rois_fpn3               : (4, 5)               ------|
INFO net.py: 254: im_info                     : (1, 3)               => rpn_rois_fpn3               : (4, 5)               ------|
INFO net.py: 254: fpn_res4_22_sum             : (1, 256, 50, 68)     => conv_rpn_fpn4               : (1, 256, 50, 68)     ------- (op: Conv)
INFO net.py: 254: conv_rpn_fpn4               : (1, 256, 50, 68)     => conv_rpn_fpn4               : (1, 256, 50, 68)     ------- (op: Relu)
INFO net.py: 254: conv_rpn_fpn4               : (1, 256, 50, 68)     => rpn_cls_logits_fpn4         : (1, 3, 50, 68)       ------- (op: Conv)
INFO net.py: 254: conv_rpn_fpn4               : (1, 256, 50, 68)     => rpn_bbox_pred_fpn4          : (1, 12, 50, 68)      ------- (op: Conv)
INFO net.py: 254: rpn_cls_logits_fpn4         : (1, 3, 50, 68)       => rpn_cls_probs_fpn4          : (1, 3, 50, 68)       ------- (op: Sigmoid)
INFO net.py: 254: rpn_cls_probs_fpn4          : (1, 3, 50, 68)       => rpn_rois_fpn4               : (4, 5)               ------- (op: Python:GenerateProposalsOp:gpu_0/rpn_cls_probs_fpn4,gpu_0/rpn_bbox_pred_fpn4,im_info)
INFO net.py: 254: rpn_bbox_pred_fpn4          : (1, 12, 50, 68)      => rpn_rois_fpn4               : (4, 5)               ------|
INFO net.py: 254: im_info                     : (1, 3)               => rpn_rois_fpn4               : (4, 5)               ------|
INFO net.py: 254: fpn_res5_2_sum              : (1, 256, 25, 34)     => conv_rpn_fpn5               : (1, 256, 25, 34)     ------- (op: Conv)
INFO net.py: 254: conv_rpn_fpn5               : (1, 256, 25, 34)     => conv_rpn_fpn5               : (1, 256, 25, 34)     ------- (op: Relu)
INFO net.py: 254: conv_rpn_fpn5               : (1, 256, 25, 34)     => rpn_cls_logits_fpn5         : (1, 3, 25, 34)       ------- (op: Conv)
INFO net.py: 254: conv_rpn_fpn5               : (1, 256, 25, 34)     => rpn_bbox_pred_fpn5          : (1, 12, 25, 34)      ------- (op: Conv)
INFO net.py: 254: rpn_cls_logits_fpn5         : (1, 3, 25, 34)       => rpn_cls_probs_fpn5          : (1, 3, 25, 34)       ------- (op: Sigmoid)
INFO net.py: 254: rpn_cls_probs_fpn5          : (1, 3, 25, 34)       => rpn_rois_fpn5               : (4, 5)               ------- (op: Python:GenerateProposalsOp:gpu_0/rpn_cls_probs_fpn5,gpu_0/rpn_bbox_pred_fpn5,im_info)
INFO net.py: 254: rpn_bbox_pred_fpn5          : (1, 12, 25, 34)      => rpn_rois_fpn5               : (4, 5)               ------|
INFO net.py: 254: im_info                     : (1, 3)               => rpn_rois_fpn5               : (4, 5)               ------|
INFO net.py: 254: fpn_res5_2_sum_subsampled_2x: (1, 256, 13, 17)     => conv_rpn_fpn6               : (1, 256, 13, 17)     ------- (op: Conv)
INFO net.py: 254: conv_rpn_fpn6               : (1, 256, 13, 17)     => conv_rpn_fpn6               : (1, 256, 13, 17)     ------- (op: Relu)
INFO net.py: 254: conv_rpn_fpn6               : (1, 256, 13, 17)     => rpn_cls_logits_fpn6         : (1, 3, 13, 17)       ------- (op: Conv)
INFO net.py: 254: conv_rpn_fpn6               : (1, 256, 13, 17)     => rpn_bbox_pred_fpn6          : (1, 12, 13, 17)      ------- (op: Conv)
INFO net.py: 254: rpn_cls_logits_fpn6         : (1, 3, 13, 17)       => rpn_cls_probs_fpn6          : (1, 3, 13, 17)       ------- (op: Sigmoid)
INFO net.py: 254: rpn_cls_probs_fpn6          : (1, 3, 13, 17)       => rpn_rois_fpn6               : (4, 5)               ------- (op: Python:GenerateProposalsOp:gpu_0/rpn_cls_probs_fpn6,gpu_0/rpn_bbox_pred_fpn6,im_info)
INFO net.py: 254: rpn_bbox_pred_fpn6          : (1, 12, 13, 17)      => rpn_rois_fpn6               : (4, 5)               ------|
INFO net.py: 254: im_info                     : (1, 3)               => rpn_rois_fpn6               : (4, 5)               ------|
INFO net.py: 254: rpn_rois_fpn2               : (4, 5)               => rois                        : (23, 5)              ------- (op: Python:CollectAndDistributeFpnRpnProposalsOp:gpu_0/rpn_rois_fpn2,gpu_0/rpn_rois_fpn3,gpu_0/rpn_rois_fpn4,gpu_0/rpn_rois_fpn5,gpu_0/rpn_rois_fpn6,gpu_0/rpn_roi_probs_fpn2,gpu_0/rpn_roi_probs_fpn3,gpu_0/rpn_roi_probs_fpn4,gpu_0/rpn_roi_probs_fpn5,gpu_0/rpn_roi_probs_fpn6,gpu_0/roidb,gpu_0/im_info)
INFO net.py: 254: rpn_rois_fpn3               : (4, 5)               => rois                        : (23, 5)              ------|
INFO net.py: 254: rpn_rois_fpn4               : (4, 5)               => rois                        : (23, 5)              ------|
INFO net.py: 254: rpn_rois_fpn5               : (4, 5)               => rois                        : (23, 5)              ------|
INFO net.py: 254: rpn_rois_fpn6               : (4, 5)               => rois                        : (23, 5)              ------|
INFO net.py: 254: rpn_roi_probs_fpn2          : (4, 1)               => rois                        : (23, 5)              ------|
INFO net.py: 254: rpn_roi_probs_fpn3          : (4, 1)               => rois                        : (23, 5)              ------|
INFO net.py: 254: rpn_roi_probs_fpn4          : (4, 1)               => rois                        : (23, 5)              ------|
INFO net.py: 254: rpn_roi_probs_fpn5          : (4, 1)               => rois                        : (23, 5)              ------|
INFO net.py: 254: rpn_roi_probs_fpn6          : (4, 1)               => rois                        : (23, 5)              ------|
INFO net.py: 254: roidb                       : (910,)               => rois                        : (23, 5)              ------|
INFO net.py: 254: im_info                     : (1, 3)               => rois                        : (23, 5)              ------|
INFO net.py: 254: rpn_labels_int32_wide_fpn2  : (1, 3, 336, 336)     => rpn_labels_int32_fpn2       : (1, 3, 200, 272)     ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_cls_logits_fpn2         : (1, 3, 200, 272)     => rpn_labels_int32_fpn2       : (1, 3, 200, 272)     ------|
INFO net.py: 254: rpn_bbox_targets_wide_fpn2  : (1, 12, 336, 336)    => rpn_bbox_targets_fpn2       : (1, 12, 200, 272)    ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn2          : (1, 12, 200, 272)    => rpn_bbox_targets_fpn2       : (1, 12, 200, 272)    ------|
INFO net.py: 254: rpn_bbox_inside_weights_wide_fpn2: (1, 12, 336, 336)    => rpn_bbox_inside_weights_fpn2: (1, 12, 200, 272)    ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn2          : (1, 12, 200, 272)    => rpn_bbox_inside_weights_fpn2: (1, 12, 200, 272)    ------|
INFO net.py: 254: rpn_bbox_outside_weights_wide_fpn2: (1, 12, 336, 336)    => rpn_bbox_outside_weights_fpn2: (1, 12, 200, 272)    ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn2          : (1, 12, 200, 272)    => rpn_bbox_outside_weights_fpn2: (1, 12, 200, 272)    ------|
INFO net.py: 254: rpn_cls_logits_fpn2         : (1, 3, 200, 272)     => loss_rpn_cls_fpn2           : ()                   ------- (op: SigmoidCrossEntropyLoss)
INFO net.py: 254: rpn_labels_int32_fpn2       : (1, 3, 200, 272)     => loss_rpn_cls_fpn2           : ()                   ------|
INFO net.py: 254: rpn_bbox_pred_fpn2          : (1, 12, 200, 272)    => loss_rpn_bbox_fpn2          : ()                   ------- (op: SmoothL1Loss)
INFO net.py: 254: rpn_bbox_targets_fpn2       : (1, 12, 200, 272)    => loss_rpn_bbox_fpn2          : ()                   ------|
INFO net.py: 254: rpn_bbox_inside_weights_fpn2: (1, 12, 200, 272)    => loss_rpn_bbox_fpn2          : ()                   ------|
INFO net.py: 254: rpn_bbox_outside_weights_fpn2: (1, 12, 200, 272)    => loss_rpn_bbox_fpn2          : ()                   ------|
INFO net.py: 254: rpn_labels_int32_wide_fpn3  : (1, 3, 168, 168)     => rpn_labels_int32_fpn3       : (1, 3, 100, 136)     ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_cls_logits_fpn3         : (1, 3, 100, 136)     => rpn_labels_int32_fpn3       : (1, 3, 100, 136)     ------|
INFO net.py: 254: rpn_bbox_targets_wide_fpn3  : (1, 12, 168, 168)    => rpn_bbox_targets_fpn3       : (1, 12, 100, 136)    ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn3          : (1, 12, 100, 136)    => rpn_bbox_targets_fpn3       : (1, 12, 100, 136)    ------|
INFO net.py: 254: rpn_bbox_inside_weights_wide_fpn3: (1, 12, 168, 168)    => rpn_bbox_inside_weights_fpn3: (1, 12, 100, 136)    ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn3          : (1, 12, 100, 136)    => rpn_bbox_inside_weights_fpn3: (1, 12, 100, 136)    ------|
INFO net.py: 254: rpn_bbox_outside_weights_wide_fpn3: (1, 12, 168, 168)    => rpn_bbox_outside_weights_fpn3: (1, 12, 100, 136)    ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn3          : (1, 12, 100, 136)    => rpn_bbox_outside_weights_fpn3: (1, 12, 100, 136)    ------|
INFO net.py: 254: rpn_cls_logits_fpn3         : (1, 3, 100, 136)     => loss_rpn_cls_fpn3           : ()                   ------- (op: SigmoidCrossEntropyLoss)
INFO net.py: 254: rpn_labels_int32_fpn3       : (1, 3, 100, 136)     => loss_rpn_cls_fpn3           : ()                   ------|
INFO net.py: 254: rpn_bbox_pred_fpn3          : (1, 12, 100, 136)    => loss_rpn_bbox_fpn3          : ()                   ------- (op: SmoothL1Loss)
INFO net.py: 254: rpn_bbox_targets_fpn3       : (1, 12, 100, 136)    => loss_rpn_bbox_fpn3          : ()                   ------|
INFO net.py: 254: rpn_bbox_inside_weights_fpn3: (1, 12, 100, 136)    => loss_rpn_bbox_fpn3          : ()                   ------|
INFO net.py: 254: rpn_bbox_outside_weights_fpn3: (1, 12, 100, 136)    => loss_rpn_bbox_fpn3          : ()                   ------|
INFO net.py: 254: rpn_labels_int32_wide_fpn4  : (1, 3, 84, 84)       => rpn_labels_int32_fpn4       : (1, 3, 50, 68)       ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_cls_logits_fpn4         : (1, 3, 50, 68)       => rpn_labels_int32_fpn4       : (1, 3, 50, 68)       ------|
INFO net.py: 254: rpn_bbox_targets_wide_fpn4  : (1, 12, 84, 84)      => rpn_bbox_targets_fpn4       : (1, 12, 50, 68)      ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn4          : (1, 12, 50, 68)      => rpn_bbox_targets_fpn4       : (1, 12, 50, 68)      ------|
INFO net.py: 254: rpn_bbox_inside_weights_wide_fpn4: (1, 12, 84, 84)      => rpn_bbox_inside_weights_fpn4: (1, 12, 50, 68)      ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn4          : (1, 12, 50, 68)      => rpn_bbox_inside_weights_fpn4: (1, 12, 50, 68)      ------|
INFO net.py: 254: rpn_bbox_outside_weights_wide_fpn4: (1, 12, 84, 84)      => rpn_bbox_outside_weights_fpn4: (1, 12, 50, 68)      ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn4          : (1, 12, 50, 68)      => rpn_bbox_outside_weights_fpn4: (1, 12, 50, 68)      ------|
INFO net.py: 254: rpn_cls_logits_fpn4         : (1, 3, 50, 68)       => loss_rpn_cls_fpn4           : ()                   ------- (op: SigmoidCrossEntropyLoss)
INFO net.py: 254: rpn_labels_int32_fpn4       : (1, 3, 50, 68)       => loss_rpn_cls_fpn4           : ()                   ------|
INFO net.py: 254: rpn_bbox_pred_fpn4          : (1, 12, 50, 68)      => loss_rpn_bbox_fpn4          : ()                   ------- (op: SmoothL1Loss)
INFO net.py: 254: rpn_bbox_targets_fpn4       : (1, 12, 50, 68)      => loss_rpn_bbox_fpn4          : ()                   ------|
INFO net.py: 254: rpn_bbox_inside_weights_fpn4: (1, 12, 50, 68)      => loss_rpn_bbox_fpn4          : ()                   ------|
INFO net.py: 254: rpn_bbox_outside_weights_fpn4: (1, 12, 50, 68)      => loss_rpn_bbox_fpn4          : ()                   ------|
INFO net.py: 254: rpn_labels_int32_wide_fpn5  : (1, 3, 42, 42)       => rpn_labels_int32_fpn5       : (1, 3, 25, 34)       ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_cls_logits_fpn5         : (1, 3, 25, 34)       => rpn_labels_int32_fpn5       : (1, 3, 25, 34)       ------|
INFO net.py: 254: rpn_bbox_targets_wide_fpn5  : (1, 12, 42, 42)      => rpn_bbox_targets_fpn5       : (1, 12, 25, 34)      ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn5          : (1, 12, 25, 34)      => rpn_bbox_targets_fpn5       : (1, 12, 25, 34)      ------|
INFO net.py: 254: rpn_bbox_inside_weights_wide_fpn5: (1, 12, 42, 42)      => rpn_bbox_inside_weights_fpn5: (1, 12, 25, 34)      ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn5          : (1, 12, 25, 34)      => rpn_bbox_inside_weights_fpn5: (1, 12, 25, 34)      ------|
INFO net.py: 254: rpn_bbox_outside_weights_wide_fpn5: (1, 12, 42, 42)      => rpn_bbox_outside_weights_fpn5: (1, 12, 25, 34)      ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn5          : (1, 12, 25, 34)      => rpn_bbox_outside_weights_fpn5: (1, 12, 25, 34)      ------|
INFO net.py: 254: rpn_cls_logits_fpn5         : (1, 3, 25, 34)       => loss_rpn_cls_fpn5           : ()                   ------- (op: SigmoidCrossEntropyLoss)
INFO net.py: 254: rpn_labels_int32_fpn5       : (1, 3, 25, 34)       => loss_rpn_cls_fpn5           : ()                   ------|
INFO net.py: 254: rpn_bbox_pred_fpn5          : (1, 12, 25, 34)      => loss_rpn_bbox_fpn5          : ()                   ------- (op: SmoothL1Loss)
INFO net.py: 254: rpn_bbox_targets_fpn5       : (1, 12, 25, 34)      => loss_rpn_bbox_fpn5          : ()                   ------|
INFO net.py: 254: rpn_bbox_inside_weights_fpn5: (1, 12, 25, 34)      => loss_rpn_bbox_fpn5          : ()                   ------|
INFO net.py: 254: rpn_bbox_outside_weights_fpn5: (1, 12, 25, 34)      => loss_rpn_bbox_fpn5          : ()                   ------|
INFO net.py: 254: rpn_labels_int32_wide_fpn6  : (1, 3, 21, 21)       => rpn_labels_int32_fpn6       : (1, 3, 13, 17)       ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_cls_logits_fpn6         : (1, 3, 13, 17)       => rpn_labels_int32_fpn6       : (1, 3, 13, 17)       ------|
INFO net.py: 254: rpn_bbox_targets_wide_fpn6  : (1, 12, 21, 21)      => rpn_bbox_targets_fpn6       : (1, 12, 13, 17)      ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn6          : (1, 12, 13, 17)      => rpn_bbox_targets_fpn6       : (1, 12, 13, 17)      ------|
INFO net.py: 254: rpn_bbox_inside_weights_wide_fpn6: (1, 12, 21, 21)      => rpn_bbox_inside_weights_fpn6: (1, 12, 13, 17)      ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn6          : (1, 12, 13, 17)      => rpn_bbox_inside_weights_fpn6: (1, 12, 13, 17)      ------|
INFO net.py: 254: rpn_bbox_outside_weights_wide_fpn6: (1, 12, 21, 21)      => rpn_bbox_outside_weights_fpn6: (1, 12, 13, 17)      ------- (op: SpatialNarrowAs)
INFO net.py: 254: rpn_bbox_pred_fpn6          : (1, 12, 13, 17)      => rpn_bbox_outside_weights_fpn6: (1, 12, 13, 17)      ------|
INFO net.py: 254: rpn_cls_logits_fpn6         : (1, 3, 13, 17)       => loss_rpn_cls_fpn6           : ()                   ------- (op: SigmoidCrossEntropyLoss)
INFO net.py: 254: rpn_labels_int32_fpn6       : (1, 3, 13, 17)       => loss_rpn_cls_fpn6           : ()                   ------|
INFO net.py: 254: rpn_bbox_pred_fpn6          : (1, 12, 13, 17)      => loss_rpn_bbox_fpn6          : ()                   ------- (op: SmoothL1Loss)
INFO net.py: 254: rpn_bbox_targets_fpn6       : (1, 12, 13, 17)      => loss_rpn_bbox_fpn6          : ()                   ------|
INFO net.py: 254: rpn_bbox_inside_weights_fpn6: (1, 12, 13, 17)      => loss_rpn_bbox_fpn6          : ()                   ------|
INFO net.py: 254: rpn_bbox_outside_weights_fpn6: (1, 12, 13, 17)      => loss_rpn_bbox_fpn6          : ()                   ------|
INFO net.py: 254: fpn_res2_2_sum              : (1, 256, 200, 272)   => roi_feat_fpn2               : (20, 256, 7, 7)      ------- (op: RoIAlign)
INFO net.py: 254: rois_fpn2                   : (20, 5)              => roi_feat_fpn2               : (20, 256, 7, 7)      ------|
INFO net.py: 254: fpn_res3_3_sum              : (1, 256, 100, 136)   => roi_feat_fpn3               : (3, 256, 7, 7)       ------- (op: RoIAlign)
INFO net.py: 254: rois_fpn3                   : (3, 5)               => roi_feat_fpn3               : (3, 256, 7, 7)       ------|
INFO net.py: 254: fpn_res4_22_sum             : (1, 256, 50, 68)     => roi_feat_fpn4               : (0, 256, 7, 7)       ------- (op: RoIAlign)
INFO net.py: 254: rois_fpn4                   : (0, 5)               => roi_feat_fpn4               : (0, 256, 7, 7)       ------|
INFO net.py: 254: fpn_res5_2_sum              : (1, 256, 25, 34)     => roi_feat_fpn5               : (0, 256, 7, 7)       ------- (op: RoIAlign)
INFO net.py: 254: rois_fpn5                   : (0, 5)               => roi_feat_fpn5               : (0, 256, 7, 7)       ------|
INFO net.py: 254: roi_feat_fpn2               : (20, 256, 7, 7)      => roi_feat_shuffled           : (23, 256, 7, 7)      ------- (op: Concat)
INFO net.py: 254: roi_feat_fpn3               : (3, 256, 7, 7)       => roi_feat_shuffled           : (23, 256, 7, 7)      ------|
INFO net.py: 254: roi_feat_fpn4               : (0, 256, 7, 7)       => roi_feat_shuffled           : (23, 256, 7, 7)      ------|
INFO net.py: 254: roi_feat_fpn5               : (0, 256, 7, 7)       => roi_feat_shuffled           : (23, 256, 7, 7)      ------|
INFO net.py: 254: roi_feat_shuffled           : (23, 256, 7, 7)      => roi_feat                    : (23, 256, 7, 7)      ------- (op: BatchPermutation)
INFO net.py: 254: rois_idx_restore_int32      : (23,)                => roi_feat                    : (23, 256, 7, 7)      ------|
INFO net.py: 254: roi_feat                    : (23, 256, 7, 7)      => fc6                         : (23, 1024)           ------- (op: FC)
INFO net.py: 254: fc6                         : (23, 1024)           => fc6                         : (23, 1024)           ------- (op: Relu)
INFO net.py: 254: fc6                         : (23, 1024)           => fc7                         : (23, 1024)           ------- (op: FC)
INFO net.py: 254: fc7                         : (23, 1024)           => fc7                         : (23, 1024)           ------- (op: Relu)
INFO net.py: 254: fc7                         : (23, 1024)           => cls_score                   : (23, 19)             ------- (op: FC)
INFO net.py: 254: fc7                         : (23, 1024)           => bbox_pred                   : (23, 76)             ------- (op: FC)
INFO net.py: 254: cls_score                   : (23, 19)             => cls_prob                    : (23, 19)             ------- (op: SoftmaxWithLoss)
INFO net.py: 254: labels_int32                : (23,)                => cls_prob                    : (23, 19)             ------|
INFO net.py: 254: bbox_pred                   : (23, 76)             => loss_bbox                   : ()                   ------- (op: SmoothL1Loss)
INFO net.py: 254: bbox_targets                : (23, 76)             => loss_bbox                   : ()                   ------|
INFO net.py: 254: bbox_inside_weights         : (23, 76)             => loss_bbox                   : ()                   ------|
INFO net.py: 254: bbox_outside_weights        : (23, 76)             => loss_bbox                   : ()                   ------|
INFO net.py: 254: cls_prob                    : (23, 19)             => accuracy_cls                : ()                   ------- (op: Accuracy)
INFO net.py: 254: labels_int32                : (23,)                => accuracy_cls                : ()                   ------|
INFO net.py: 258: End of model: generalized_rcnn
json_stats: {"accuracy_cls": "0.013333", "eta": "18:41:28", "iter": 0, "loss": "17451542731264.000000", "loss_bbox": "958357979136.000000", "loss_cls": "15573960359936.000000", "loss_rpn_bbox_fpn2": "0.000000", "loss_rpn_bbox_fpn3": "0.000000", "loss_rpn_bbox_fpn4": "400505454592.000000", "loss_rpn_bbox_fpn5": "87838390272.000000", "loss_rpn_bbox_fpn6": "0.000000", "loss_rpn_cls_fpn2": "13083249664.000000", "loss_rpn_cls_fpn3": "145479442432.000000", "loss_rpn_cls_fpn4": "250312511488.000000", "loss_rpn_cls_fpn5": "22005343744.000000", "loss_rpn_cls_fpn6": "0.000000", "lr": "0.125000", "mb_qsize": 64, "mem": 6056, "time": "6.728863"}
CRITICAL train.py: 101: Loss is NaN

Detailed steps to reproduce

I wrote the script that converts my classification model (trained from scratch and successfully converged) into the Detectron format myself:

"""
Convert the model in pth format into pkl format used by detectron
e.g. conv1.weight -> conv1_w
e.g. layer1.0.bn2.weight -> res2_0_branch2b_bn_s
e.g. fc.weight -> pred_w
"""

import pickle
import torch
import numpy as np
import resnext
import re
import argparse
import sys

BASE_WIDTH = 8
CARDINALITY = 32

PTH_MODEL_PATH = 'model_best.pth.tar'
REF_MODEL_PATH = '/home/slashgns/detect/detectron/detectron-download-cache/ImageNetPretrained/20171220/X-101-32x8d.pkl'
PKL_OUTPUT_PATH = '/home/slashgns/detect/detectron/detectron-download-cache/ImageNetPretrained/20171220/Cells_pretrained.pkl'
NAIVE_PAIRS = {'conv1.weight': 'conv1_w', 'bn1.weight': 'res_conv1_bn_s', 'bn1.bias': 'res_conv1_bn_b', 'fc.weight': 'pred_w', 'fc.bias': 'pred_b'}
# Raw string so that \d and \. are regex escapes, not string escapes
PATTERN = re.compile(r'layer(\d+)\.(\d+)\.(.*)(\d+)\.(.*)')

parser = argparse.ArgumentParser(description="Convert the torch model (.pth) into the Detectron format (.pkl)")

parser.add_argument('--input-model', dest="input_model", help="pth model", default=None)
parser.add_argument('--ref-model', dest='ref_model', help="the reference model (optional)", default=None)
parser.add_argument('--output-model', dest='output_model', help="output model", default=None)

if len(sys.argv) < 2:
    parser.print_help()
    sys.exit(1)

args = parser.parse_args()
# Command-line arguments override the default paths defined above
if args.input_model:
    PTH_MODEL_PATH = args.input_model
if args.ref_model:
    REF_MODEL_PATH = args.ref_model
if args.output_model:
    PKL_OUTPUT_PATH = args.output_model

model = resnext.resnext101(baseWidth=BASE_WIDTH, cardinality=CARDINALITY)
model = torch.nn.DataParallel(model).cuda()
model.load_state_dict(torch.load(PTH_MODEL_PATH)['state_dict'])

model = model.cpu()

model_clean = {}
for k, v in model.module.state_dict().items():
    if k in NAIVE_PAIRS.keys():
        new_key = NAIVE_PAIRS[k]
    elif 'running_mean' in k or 'running_var' in k:
        continue
    else:
        res = PATTERN.search(k)
        assert res is not None, "Error! The key {} is invalid!".format(k)
        stage, block = int(res.group(1)), int(res.group(2))
        # 'kind' instead of 'type' to avoid shadowing the built-in
        kind, branch, wb = res.group(3), int(res.group(4)), res.group(5)

        # PyTorch layerN corresponds to Detectron stage res(N+1)
        stage += 1
        if 'bn' in kind:
            # bn1/bn2/bn3 -> branch2a/2b/2c; the BN weight is the scale 's'
            middle = "branch2" + chr(96 + branch) + "_" + "bn"
            suffix = 's' if wb == 'weight' else 'b'
        elif 'conv' in kind:
            middle = "branch2" + chr(96 + branch)
            suffix = 'w'
        elif 'downsample' in kind:
            # downsample.0 is the shortcut conv, downsample.1 its BN
            if branch == 0:
                middle = "branch1"
                suffix = 'w'
            elif branch == 1:
                middle = "branch1" + "_" + 'bn'
                suffix = 's' if wb == 'weight' else 'b'
            else:
                raise ValueError("Unexpected downsample index in key {}".format(k))
        else:
            raise ValueError("Unexpected layer type in key {}".format(k))
        new_key = "res" + str(stage) + '_' + str(block) + '_' + middle + '_' + suffix

    new_val = np.array(v)
    model_clean[new_key] = new_val

with open(REF_MODEL_PATH, 'rb') as f:
    ref = pickle.load(f, encoding='latin1')

# check that every blob of the reference model has a converted counterpart
diff = [key for key in ref['blobs'].keys() if key not in model_clean.keys()]

assert len(diff) == 0, 'Error! The keys are not the same! Missing: {}'.format(diff)

model_out = {"blobs": model_clean}

with open(PKL_OUTPUT_PATH, 'wb') as f:
    pickle.dump(model_out, f)
print("pkl model saved => {}".format(PKL_OUTPUT_PATH))

python tools/train_net.py --cfg configs/12_2017_baselines/e2e_faster_rcnn_X-101-32x8d-FPN_2x.yaml OUTPUT_DIR output
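
The renaming rules in the script above can be pulled out into a pure function so that they can be checked without loading any weights. This is only a sketch that mirrors the rules of the script as posted; the function name `to_detectron_name` is hypothetical and not part of Detectron or PyTorch.

```python
import re

# Same fixed pairs and pattern as in the conversion script above.
NAIVE_PAIRS = {
    'conv1.weight': 'conv1_w',
    'bn1.weight': 'res_conv1_bn_s',
    'bn1.bias': 'res_conv1_bn_b',
    'fc.weight': 'pred_w',
    'fc.bias': 'pred_b',
}
PATTERN = re.compile(r'layer(\d+)\.(\d+)\.(.*)(\d+)\.(.*)')


def to_detectron_name(key):
    """Map a PyTorch state_dict key to a Detectron-style blob name.

    Hypothetical helper: it only restates the rules of the script above.
    """
    if key in NAIVE_PAIRS:
        return NAIVE_PAIRS[key]
    match = PATTERN.search(key)
    if match is None:
        raise ValueError("The key {} is invalid".format(key))
    stage = int(match.group(1)) + 1   # layerN -> res(N+1)
    block = int(match.group(2))
    kind = match.group(3)
    index = int(match.group(4))
    param = match.group(5)
    if 'bn' in kind:
        middle = 'branch2' + chr(96 + index) + '_bn'   # 1 -> a, 2 -> b, 3 -> c
        suffix = 's' if param == 'weight' else 'b'
    elif 'conv' in kind:
        middle = 'branch2' + chr(96 + index)
        suffix = 'w'
    elif 'downsample' in kind:
        if index == 0:
            middle, suffix = 'branch1', 'w'
        elif index == 1:
            middle = 'branch1_bn'
            suffix = 's' if param == 'weight' else 'b'
        else:
            raise ValueError("Unexpected downsample index in {}".format(key))
    else:
        raise ValueError("Unexpected layer type in {}".format(key))
    return 'res{}_{}_{}_{}'.format(stage, block, middle, suffix)


for example in ['conv1.weight', 'layer1.0.bn2.weight', 'layer3.22.conv3.weight']:
    print(example, '->', to_detectron_name(example))
```

Checking a handful of keys this way makes naming mistakes visible before the converted file is handed to training.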

I have tried setting a much smaller learning rate, but it did not help. Does anybody have an idea? Thanks in advance.
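
The conversion script only checks that the blob names match the reference model. A possible additional sanity check, sketched below, compares shapes and rough magnitudes of the converted blobs against the reference blobs. The helper name `compare_blobs` and the `max_ratio` threshold are assumptions made for illustration, and the toy dictionaries stand in for the `blobs` entries of the two pickle files.

```python
import numpy as np


def compare_blobs(ref_blobs, new_blobs, max_ratio=100.0):
    """Return a list of (name, problem) pairs for suspicious blobs.

    Hypothetical helper; max_ratio is an arbitrary illustrative threshold.
    """
    problems = []
    for name, ref in ref_blobs.items():
        if name not in new_blobs:
            problems.append((name, 'missing'))
            continue
        ref = np.asarray(ref, dtype=np.float64)
        new = np.asarray(new_blobs[name], dtype=np.float64)
        if new.shape != ref.shape:
            problems.append((name, 'shape {} != {}'.format(new.shape, ref.shape)))
            continue
        if not np.all(np.isfinite(new)):
            problems.append((name, 'non-finite values'))
            continue
        # Compare mean absolute values to spot weights on a very different scale.
        ref_scale = np.abs(ref).mean()
        new_scale = np.abs(new).mean()
        if ref_scale > 0 and new_scale / ref_scale > max_ratio:
            problems.append((name, 'magnitude {:.2e} vs {:.2e}'.format(new_scale, ref_scale)))
    return problems


# Toy stand-ins for ref['blobs'] and model_clean from the script above.
ref_blobs = {'conv1_w': np.ones((2, 3)), 'pred_b': np.ones(4)}
new_blobs = {'conv1_w': np.ones((3, 2))}
for name, problem in compare_blobs(ref_blobs, new_blobs):
    print(name, ':', problem)
```

A mismatch reported here does not prove that a blob causes the NaN loss; it only narrows down where to look.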

System information

  • Operating system: Ubuntu 14.04
  • Compiler version:
  • CUDA version: 9.0
  • cuDNN version: 7.4.1
  • NVIDIA driver version:
  • GPU models (for all devices if they are not all the same):
  • PYTHONPATH environment variable:
  • python --version output: 3.6.8, anaconda
  • Anything else that seems relevant:
@satyajithj

The OP was able to fix a Loss is NaN issue (#182) by lowering the learning rate. Try and check if this works for you as well.

@FishWoWater
Author

> The OP was able to fix a Loss is NaN issue (#182) by lowering the learning rate. Try and check if this works for you as well.

Thanks a lot, but I have already tried lowering my learning rate to 1e-8 and it still does not work.

@satyajithj

In your post, you have "loss": "17451542731264.000000", and all other losses are high as well.
Is it the same when you set the learning rate to 1e-8?

@FishWoWater
Author

> In your post, you have "loss": "17451542731264.000000", and all other losses are high as well.
> Is it the same when you set the learning rate to 1e-8?

The output when I lower the lr to 1e-8 is as follows:

json_stats: {"accuracy_cls": "0.000000", "eta": "19:23:53", "iter": 0, "loss": "47156856092672.000000", "loss_bbox": "1855492718592.000000", "loss_cls": "40212032913408.000000", "loss_rpn_bbox_fpn2": "0.000000", "loss_rpn_bbox_fpn3": "0.000000", "loss_rpn_bbox_fpn4": "1970523504640.000000", "loss_rpn_bbox_fpn5": "225773981696.000000", "loss_rpn_bbox_fpn6": "0.000000", "loss_rpn_cls_fpn2": "2342287245312.000000", "loss_rpn_cls_fpn3": "31649947648.000000", "loss_rpn_cls_fpn4": "469987680256.000000", "loss_rpn_cls_fpn5": "49108101120.000000", "loss_rpn_cls_fpn6": "0.000000", "lr": "0.000000", "mb_qsize": 64, "mem": 6056, "time": "6.983366"}
/home/slashgns/detect/detectron/detectron/utils/boxes.py:176: RuntimeWarning: overflow encountered in multiply
  pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
/home/slashgns/detect/detectron/detectron/utils/boxes.py:177: RuntimeWarning: overflow encountered in multiply
  pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
CRITICAL train.py: 101: Loss is NaN
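
To compare runs like the two above without reading the raw lines, the `json_stats:` entries can be parsed and the loss terms filtered. The following is a minimal sketch; the helper names and the threshold of 1e3 are assumptions chosen for illustration, not part of Detectron.

```python
import json
import math

PREFIX = 'json_stats: '


def parse_json_stats(line):
    """Return the stats dict of a 'json_stats: {...}' log line, or None."""
    if not line.startswith(PREFIX):
        return None
    return json.loads(line[len(PREFIX):])


def suspicious_losses(stats, threshold=1e3):
    """Return loss entries that are non-finite or larger than threshold.

    The threshold is an arbitrary illustrative value.
    """
    flagged = {}
    for key, value in stats.items():
        if not key.startswith('loss'):
            continue
        number = float(value)  # the log stores numbers as strings
        if not math.isfinite(number) or abs(number) > threshold:
            flagged[key] = number
    return flagged


# Shortened version of the first log line in this issue.
line = ('json_stats: {"iter": 0, "loss": "17451542731264.000000", '
        '"loss_rpn_bbox_fpn2": "0.000000", "lr": "0.125000"}')
print(suspicious_losses(parse_json_stats(line)))
```

Applied to both logs, this shows that the losses are already huge at iteration 0, before any weight update has taken place.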

@FishWoWater
Author

Hmm, when I updated Detectron, it printed extra information saying that dx overflows. I found that the bbox_deltas are very large (1e+12 or more). I think the problem is in the RPN, but I cannot track down the bug.
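
The overflow warning can be reproduced in isolation. The sketch below re-implements only the center-x part of the box decoding shown in the warning (`dx * widths + ctr_x`); it is a simplified stand-in, not the code from detectron/utils/boxes.py. The delta of 1e37 is chosen purely so that the float32 product exceeds the largest representable value (about 3.4e38), and the box is a made-up example.

```python
import numpy as np


def decode_center_x(boxes, deltas):
    """Apply the x-center part of a box delta, as in the warning above."""
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    dx = deltas[:, 0::4]
    return dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]


# One 100-pixel-wide box in float32, the dtype assumed for this sketch.
boxes = np.array([[0, 0, 99, 99]], dtype=np.float32)
sane_deltas = np.array([[0.1, 0.1, 0.0, 0.0]], dtype=np.float32)
huge_deltas = np.array([[1e37, 0.0, 0.0, 0.0]], dtype=np.float32)

# Suppress the RuntimeWarning here; the result is what matters.
with np.errstate(over='ignore'):
    sane = decode_center_x(boxes, sane_deltas)
    huge = decode_center_x(boxes, huge_deltas)

print('sane finite:', np.isfinite(sane).all())
print('huge finite:', np.isfinite(huge).all())
```

Once a decoded coordinate is infinite, later arithmetic on it can produce NaN, so checking `np.isfinite` on the deltas themselves is a cheap way to find the first bad blob.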

@JeromeMutgeert

Possibly it has something to do with the input normalisation that might be missing? I am facing problems as well, and it doesn't seem like inputs are rescaled to [-1,1], but are in range [-128,128] instead.

@FishWoWater
Author

> Possibly it has something to do with the input normalisation that might be missing? I am facing problems as well, and it doesn't seem like inputs are rescaled to [-1,1], but are in range [-128,128] instead.

Oh, yeah. The normalization of my pretrained model ([-1, 1]) is inconsistent with the normalization of Detectron ([-128, 128]). I will retrain the model and see whether that solves the problem. Thank you!

@JeromeMutgeert

Great to hear that this has indeed been the reason! Note that I do not understand why FAIR uses [-128,128] inputs at all, because this leads to activations that are 128x larger on average. Because of that, the biases train very slowly, since they are weighted by a factor of only 1. I would recommend adjusting prep_im_for_blob() in detectron/utils/blob.py to divide the image by 128 before returning (or adding a field in core/config.py to toggle it). Your model then doesn't have to be retrained; it should just work.

@FishWoWater
Author

Hmm, I still cannot solve the problem. I have tried the following two approaches, but both failed:

  • divide cfg.PIXEL_MEANS by 255 and modify prep_im_for_blob (I use 255 as the scale factor; why 128?)
  • retrain my classification model so that it matches Detectron (same normalization)

@JeromeMutgeert

Aah yes that is also fine. I was thinking of:
im -= means
im /= 128
which is roughly similar to:
im /= 255
im -= means/255
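
The relation between the two variants above can be checked numerically. This is a sketch under assumptions: the function names are hypothetical, the mean values are the per-channel BGR means commonly quoted as Detectron's default PIXEL_MEANS (treat them as an assumption and read the real ones from your config), and the image is random test data.

```python
import numpy as np

# Assumed per-channel means (BGR); take the real ones from cfg.PIXEL_MEANS.
PIXEL_MEANS = np.array([[[102.9801, 115.9465, 122.7717]]])


def normalize_128(im, means=PIXEL_MEANS):
    """Subtract the means first, then divide by 128."""
    out = im.astype(np.float64)
    out -= means
    out /= 128.0
    return out


def normalize_255(im, means=PIXEL_MEANS):
    """Divide by 255 first, then subtract the rescaled means."""
    out = im.astype(np.float64)
    out /= 255.0
    out -= means / 255.0
    return out


rng = np.random.RandomState(0)
im = rng.randint(0, 256, size=(4, 4, 3)).astype(np.uint8)

a = normalize_128(im)
b = normalize_255(im)
print('range with /128:', a.min(), a.max())
print('range with /255:', b.min(), b.max())
print('same up to a factor of 255/128:', np.allclose(a, b * 255.0 / 128.0))
```

The two variants agree only up to the constant factor 255/128, so whichever one is used in Detectron should be the same one the classification model was trained with.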

Hmmm, not sure I can help you any further. What is the loss at iteration 0 when normalizing? Maybe you can get insight by printing out some blobs. I'm using the following blob summary function for this:

import numpy as np
from caffe2.python import workspace

def blob_summary(blobs):
    """Print shape, basic statistics and percentiles for each named blob."""
    # blobs = workspace.Blobs()
    print()
    for blob in blobs:
        b = workspace.FetchBlob('gpu_0/' + blob)
        shape = b.shape
        b = np.array(b.astype(float)).reshape(-1)
        if len(b) == 0:
            # Empty blobs (e.g. an FPN level without RoIs) have no statistics
            print(" {} {} (0): empty".format(blob, shape))
            continue
        order = np.argsort(b)
        # Pick about ten evenly spaced values from the sorted blob
        step = max(1, len(b) // 10)
        idxs = np.arange(step // 2, len(b), step)
        percentiles = b[order[idxs]]
        hi = b[order[-1]]
        lo = b[order[0]]
        abs_mean, mean, std, zeros = [
            np.format_float_scientific(v, precision=2)
            for v in [np.abs(b).mean(), b.mean(), b.std(), np.mean(b == 0.0)]]
        print(" {} {} ({}): abs mean:{} mean:{} std:{} zeros:{} \nmin-5-15-...-85-95-max percentiles: {} ".format(
            blob, shape, len(b), abs_mean, mean, std, zeros,
            ' '.join([np.format_float_scientific(p, precision=2) for p in [lo] + list(percentiles) + [hi]])))
        print()

@FishWoWater
Author

OK, I will try to print some blobs later, actually I am pretty new to caffe2 :-)
You are so kind and thanks for all your help!
