This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Loss is NaN while training on both VOC data and a custom dataset #182

Closed
soumenms2015 opened this issue Feb 22, 2018 · 3 comments

Comments

@soumenms2015

I am getting a "Loss is NaN" error when training on my custom dataset as well as on the Pascal VOC benchmark dataset.
Here is the log:

INFO net.py: 271: labels_int32 : (512,) => cls_prob : (512, 10) ------|
INFO net.py: 271: bbox_pred : (512, 40) => loss_bbox : () ------- (op: SmoothL1Loss)
INFO net.py: 271: bbox_targets : (512, 40) => loss_bbox : () ------|
INFO net.py: 271: bbox_inside_weights : (512, 40) => loss_bbox : () ------|
INFO net.py: 271: bbox_outside_weights : (512, 40) => loss_bbox : () ------|
INFO net.py: 271: cls_prob : (512, 10) => accuracy_cls : () ------- (op: Accuracy)
INFO net.py: 271: labels_int32 : (512,) => accuracy_cls : () ------|
INFO net.py: 271: fpn_res2_2_sum : (1, 256, 336, 152) => _[mask]_roi_feat_fpn2 : (8, 256, 14, 14) ------- (op: RoIAlign)
INFO net.py: 271: mask_rois_fpn2 : (8, 5) => _[mask]_roi_feat_fpn2 : (8, 256, 14, 14) ------|
INFO net.py: 271: fpn_res3_7_sum : (1, 256, 168, 76) => _[mask]_roi_feat_fpn3 : (12, 256, 14, 14) ------- (op: RoIAlign)
INFO net.py: 271: mask_rois_fpn3 : (12, 5) => _[mask]_roi_feat_fpn3 : (12, 256, 14, 14) ------|
INFO net.py: 271: fpn_res4_35_sum : (1, 256, 84, 38) => _[mask]_roi_feat_fpn4 : (9, 256, 14, 14) ------- (op: RoIAlign)
INFO net.py: 271: mask_rois_fpn4 : (9, 5) => _[mask]_roi_feat_fpn4 : (9, 256, 14, 14) ------|
INFO net.py: 271: fpn_res5_2_sum : (1, 256, 42, 19) => _[mask]_roi_feat_fpn5 : (23, 256, 14, 14) ------- (op: RoIAlign)
INFO net.py: 271: mask_rois_fpn5 : (23, 5) => _[mask]_roi_feat_fpn5 : (23, 256, 14, 14) ------|
INFO net.py: 271: _[mask]_roi_feat_fpn2 : (8, 256, 14, 14) => _[mask]_roi_feat_shuffled : (52, 256, 14, 14) ------- (op: Concat)
INFO net.py: 271: _[mask]_roi_feat_fpn3 : (12, 256, 14, 14) => _[mask]_roi_feat_shuffled : (52, 256, 14, 14) ------|
INFO net.py: 271: _[mask]_roi_feat_fpn4 : (9, 256, 14, 14) => _[mask]_roi_feat_shuffled : (52, 256, 14, 14) ------|
INFO net.py: 271: _[mask]_roi_feat_fpn5 : (23, 256, 14, 14) => _[mask]_roi_feat_shuffled : (52, 256, 14, 14) ------|
INFO net.py: 271: _[mask]_roi_feat_shuffled : (52, 256, 14, 14) => _[mask]_roi_feat : (52, 256, 14, 14) ------- (op: BatchPermutation)
INFO net.py: 271: mask_rois_idx_restore_int32 : (52,) => _[mask]_roi_feat : (52, 256, 14, 14) ------|
INFO net.py: 271: _[mask]_roi_feat : (52, 256, 14, 14) => _[mask]_fcn1 : (52, 256, 14, 14) ------- (op: Conv)
INFO net.py: 271: _[mask]_fcn1 : (52, 256, 14, 14) => _[mask]_fcn1 : (52, 256, 14, 14) ------- (op: Relu)
INFO net.py: 271: _[mask]_fcn1 : (52, 256, 14, 14) => _[mask]_fcn2 : (52, 256, 14, 14) ------- (op: Conv)
INFO net.py: 271: _[mask]_fcn2 : (52, 256, 14, 14) => _[mask]_fcn2 : (52, 256, 14, 14) ------- (op: Relu)
INFO net.py: 271: _[mask]_fcn2 : (52, 256, 14, 14) => _[mask]_fcn3 : (52, 256, 14, 14) ------- (op: Conv)
INFO net.py: 271: _[mask]_fcn3 : (52, 256, 14, 14) => _[mask]_fcn3 : (52, 256, 14, 14) ------- (op: Relu)
INFO net.py: 271: _[mask]_fcn3 : (52, 256, 14, 14) => _[mask]_fcn4 : (52, 256, 14, 14) ------- (op: Conv)
INFO net.py: 271: _[mask]_fcn4 : (52, 256, 14, 14) => _[mask]_fcn4 : (52, 256, 14, 14) ------- (op: Relu)
INFO net.py: 271: _[mask]_fcn4 : (52, 256, 14, 14) => conv5_mask : (52, 256, 28, 28) ------- (op: ConvTranspose)
INFO net.py: 271: conv5_mask : (52, 256, 28, 28) => conv5_mask : (52, 256, 28, 28) ------- (op: Relu)
INFO net.py: 271: conv5_mask : (52, 256, 28, 28) => mask_fcn_logits : (52, 10, 28, 28) ------- (op: Conv)
INFO net.py: 271: mask_fcn_logits : (52, 10, 28, 28) => loss_mask : () ------- (op: SigmoidCrossEntropyLoss)
INFO net.py: 271: masks_int32 : (52, 7840) => loss_mask : () ------|
INFO net.py: 275: End of model: generalized_rcnn
../anaconda2/lib/python2.7/site-packages/numpy/lib/function_base.py:4033: RuntimeWarning: Invalid value encountered in median
r = func(a, **kwargs)
json_stats: {"accuracy_cls": 0.898438, "eta": "21 days, 12:25:49", "iter": 0, "loss": NaN, "loss_bbox": -0.071702, "loss_cls": 2.302585, "loss_mask": NaN, "loss_rpn_bbox_fpn2": 0.000000, "loss_rpn_bbox_fpn3": 0.000000, "loss_rpn_bbox_fpn4": NaN, "loss_rpn_bbox_fpn5": 0.000000, "loss_rpn_bbox_fpn6": 0.000000, "loss_rpn_cls_fpn2": 0.000000, "loss_rpn_cls_fpn3": NaN, "loss_rpn_cls_fpn4": 0.000000, "loss_rpn_cls_fpn5": 0.000000, "loss_rpn_cls_fpn6": 0.000000, "lr": 0.000333, "mb_qsize": 64, "mem": 7174, "time": 7.150576}
CRITICAL train_net.py: 159: Loss is NaN, exiting...
I tried lowering the base learning rate.
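
The json_stats line above already shows which loss terms went bad: loss_mask, loss_rpn_bbox_fpn4 and loss_rpn_cls_fpn3 are NaN, and so is the total. A minimal sketch for pulling that out of a log line automatically is given below. The helper name `nan_fields` is hypothetical, and the sample line is a shortened copy of the one in the log, not the full record.

```python
import json
import math

def nan_fields(log_line):
    """Return the sorted names of json_stats entries whose value is NaN.

    Hypothetical helper for illustration; it only assumes the
    'json_stats: {...}' layout seen in the log above.
    """
    prefix = "json_stats: "
    start = log_line.index(prefix) + len(prefix)
    # Python's json module accepts the bare token NaN by default.
    stats = json.loads(log_line[start:])
    return sorted(
        key for key, value in stats.items()
        if isinstance(value, float) and math.isnan(value)
    )

# Shortened copy of the json_stats line from the log (some fields omitted).
line = (
    'json_stats: {"iter": 0, "loss": NaN, "loss_bbox": -0.071702, '
    '"loss_cls": 2.302585, "loss_mask": NaN, "loss_rpn_bbox_fpn4": NaN, '
    '"loss_rpn_cls_fpn3": NaN, "lr": 0.000333}'
)
print(nan_fields(line))
```

Knowing which terms are NaN narrows the search: here the mask head and two RPN levels are affected, while loss_cls is finite.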

@telwell

telwell commented Mar 10, 2018

I see that you closed this, did you ever figure out what was going on?

@RafaRuiz

RafaRuiz commented Apr 9, 2018

Hello @soumenms2015, any hint of how to proceed?

@soumenms2015
Author

@RafaRuiz Lowering the learning rate would alleviate the problem.
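
When lowering the learning rate, note that the "lr" value printed at iteration 0 (0.000333 in the log above) may not be the base learning rate itself if a warm-up schedule is active. The sketch below assumes a linear warm-up that starts at one third of the base rate and reaches it after 500 iterations. Those two numbers and the function name `warmup_lr` are assumptions for illustration, not values taken from this thread.

```python
def warmup_lr(base_lr, cur_iter, warmup_iters=500, warmup_factor=1.0 / 3.0):
    """Learning rate under an assumed linear warm-up schedule.

    warmup_iters and warmup_factor are illustrative assumptions;
    check the solver settings of the config actually in use.
    """
    if cur_iter >= warmup_iters:
        return base_lr
    alpha = cur_iter / float(warmup_iters)
    # Interpolate from warmup_factor * base_lr up to base_lr.
    return base_lr * (warmup_factor * (1.0 - alpha) + alpha)

# Under these assumptions, a base rate of 0.001 gives about 0.000333 at iteration 0.
print(round(warmup_lr(0.001, 0), 6))
```

Under these assumptions, the logged 0.000333 would correspond to a base learning rate of about 0.001, so any further reduction should be applied to the base value rather than to the logged one.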
