
RuntimeWarning: invalid value encountered in log targets_dw = np.log(gt_widths / ex_widths) Command terminated by signal 11 #107

Closed
xzy295461445 opened this issue May 26, 2017 · 16 comments


@xzy295461445

I used my own dataset in place of VOC2007 and ran into an issue. Can you please suggest solutions?

Here is the log.

+ echo Logging output to experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-26_14-23-40
Logging output to experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-26_14-23-40

+ set +x
+ '[' '!' -f output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt.index ']'
+ [[ ! -z '' ]]
+ CUDA_VISIBLE_DEVICES=0
+ time python ./tools/trainval_net.py --weight data/imagenet_weights/vgg16.ckpt --imdb voc_2007_trainval --imdbval voc_2007_test --iters 70000 --cfg experiments/cfgs/vgg16.yml --net vgg16 --set ANCHOR_SCALES '[8,16,32]' ANCHOR_RATIOS '[0.5,1,2]' TRAIN.STEPSIZE 50000
    Called with args:
    Namespace(cfg_file='experiments/cfgs/vgg16.yml', imdb_name='voc_2007_trainval', imdbval_name='voc_2007_test', max_iters=70000, net='vgg16', set_cfgs=['ANCHOR_SCALES', '[8,16,32]', 'ANCHOR_RATIOS', '[0.5,1,2]', 'TRAIN.STEPSIZE', '50000'], tag=None, weight='data/imagenet_weights/vgg16.ckpt')
    Using config:
    {'ANCHOR_RATIOS': [0.5, 1, 2],
    'ANCHOR_SCALES': [8, 16, 32],
    'DATA_DIR': '/media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/data',
    'DEDUP_BOXES': 0.0625,
    'EPS': 1e-14,
    'EXP_DIR': 'vgg16',
    'GPU_ID': 0,
    'MATLAB': 'matlab',
    'PIXEL_MEANS': array([[[ 102.9801, 115.9465, 122.7717]]]),
    'POOLING_MODE': 'crop',
    'POOLING_SIZE': 7,
    'RESNET': {'BN_TRAIN': False, 'FIXED_BLOCKS': 1, 'MAX_POOL': False},
    'RNG_SEED': 3,
    'ROOT_DIR': '/media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn',
    'TEST': {'BBOX_REG': True,
    'HAS_RPN': True,
    'MAX_SIZE': 1000,
    'MODE': 'nms',
    'NMS': 0.3,
    'PROPOSAL_METHOD': 'gt',
    'RPN_NMS_THRESH': 0.7,
    'RPN_POST_NMS_TOP_N': 300,
    'RPN_PRE_NMS_TOP_N': 6000,
    'RPN_TOP_N': 5000,
    'SCALES': [600],
    'SVM': False},
    'TRAIN': {'ASPECT_GROUPING': False,
    'BATCH_SIZE': 256,
    'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
    'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
    'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
    'BBOX_NORMALIZE_TARGETS': True,
    'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
    'BBOX_REG': True,
    'BBOX_THRESH': 0.5,
    'BG_THRESH_HI': 0.5,
    'BG_THRESH_LO': 0.0,
    'BIAS_DECAY': False,
    'DISPLAY': 20,
    'DOUBLE_BIAS': True,
    'FG_FRACTION': 0.25,
    'FG_THRESH': 0.5,
    'GAMMA': 0.1,
    'HAS_RPN': True,
    'IMS_PER_BATCH': 1,
    'LEARNING_RATE': 0.001,
    'MAX_SIZE': 1000,
    'MOMENTUM': 0.9,
    'PROPOSAL_METHOD': 'gt',
    'RPN_BATCHSIZE': 256,
    'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
    'RPN_CLOBBER_POSITIVES': False,
    'RPN_FG_FRACTION': 0.5,
    'RPN_NEGATIVE_OVERLAP': 0.3,
    'RPN_NMS_THRESH': 0.7,
    'RPN_POSITIVE_OVERLAP': 0.7,
    'RPN_POSITIVE_WEIGHT': -1.0,
    'RPN_POST_NMS_TOP_N': 2000,
    'RPN_PRE_NMS_TOP_N': 12000,
    'SCALES': [600],
    'SNAPSHOT_ITERS': 5000,
    'SNAPSHOT_KEPT': 3,
    'SNAPSHOT_PREFIX': 'vgg16_faster_rcnn',
    'STEPSIZE': 50000,
    'SUMMARY_INTERVAL': 180,
    'TRUNCATED': False,
    'USE_ALL_GT': True,
    'USE_FLIPPED': True,
    'USE_GT': False,
    'WEIGHT_DECAY': 0.0005},
    'USE_GPU_NMS': False}
    Loaded dataset voc_2007_trainval for training
    Set proposal method: gt
    Appending horizontally-flipped training examples...
    voc_2007_trainval gt roidb loaded from /media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/data/cache/voc_2007_trainval_gt_roidb.pkl
    done
    Preparing training data...
    done
    1528 roidb entries
    Output will be saved to /media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/output/vgg16/voc_2007_trainval/default
    TensorFlow summaries will be saved to /media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/tensorboard/vgg16/voc_2007_trainval/default
    Loaded dataset voc_2007_test for training
    Set proposal method: gt
    Preparing training data...
    voc_2007_test gt roidb loaded from /media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/data/cache/voc_2007_test_gt_roidb.pkl
    done
    328 validation roidb entries
    Filtered 0 roidb entries: 1528 -> 1528
    Filtered 0 roidb entries: 328 -> 328
    2017-05-26 14:24:11.316553: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
    2017-05-26 14:24:11.316569: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
    2017-05-26 14:24:11.316572: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
    2017-05-26 14:24:11.316575: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
    2017-05-26 14:24:11.316577: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
    Solving...
    /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
    "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
    Loading initial model weights from data/imagenet_weights/vgg16.ckpt
    Varibles restored: vgg_16/conv1/conv1_1/biases:0
    Varibles restored: vgg_16/conv1/conv1_2/weights:0
    Varibles restored: vgg_16/conv1/conv1_2/biases:0
    Varibles restored: vgg_16/conv2/conv2_1/weights:0
    Varibles restored: vgg_16/conv2/conv2_1/biases:0
    Varibles restored: vgg_16/conv2/conv2_2/weights:0
    Varibles restored: vgg_16/conv2/conv2_2/biases:0
    Varibles restored: vgg_16/conv3/conv3_1/weights:0
    Varibles restored: vgg_16/conv3/conv3_1/biases:0
    Varibles restored: vgg_16/conv3/conv3_2/weights:0
    Varibles restored: vgg_16/conv3/conv3_2/biases:0
    Varibles restored: vgg_16/conv3/conv3_3/weights:0
    Varibles restored: vgg_16/conv3/conv3_3/biases:0
    Varibles restored: vgg_16/conv4/conv4_1/weights:0
    Varibles restored: vgg_16/conv4/conv4_1/biases:0
    Varibles restored: vgg_16/conv4/conv4_2/weights:0
    Varibles restored: vgg_16/conv4/conv4_2/biases:0
    Varibles restored: vgg_16/conv4/conv4_3/weights:0
    Varibles restored: vgg_16/conv4/conv4_3/biases:0
    Varibles restored: vgg_16/conv5/conv5_1/weights:0
    Varibles restored: vgg_16/conv5/conv5_1/biases:0
    Varibles restored: vgg_16/conv5/conv5_2/weights:0
    Varibles restored: vgg_16/conv5/conv5_2/biases:0
    Varibles restored: vgg_16/conv5/conv5_3/weights:0
    Varibles restored: vgg_16/conv5/conv5_3/biases:0
    Varibles restored: vgg_16/fc6/biases:0
    Varibles restored: vgg_16/fc7/biases:0
    Loaded.
    Fix VGG16 layers..
    /media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:26: RuntimeWarning: invalid value encountered in log
    targets_dw = np.log(gt_widths / ex_widths)
    Command terminated by signal 11
    62.03user 5.27system 0:57.96elapsed 116%CPU (0avgtext+0avgdata 3723648maxresident)k
    382896inputs+16outputs (296major+3462186minor)pagefaults 0swaps
@HTLife

HTLife commented May 28, 2017

I also encountered the same error.

Before iteration 55, everything goes fine.

Fix VGG16 layers..
Fixed.
iter: 20 / 7000, total loss: 0.315415
 >>> rpn_loss_cls: 0.120561
 >>> rpn_loss_box: 0.016272
 >>> loss_cls: 0.137040
 >>> loss_box: 0.041542
 >>> lr: 0.001000
speed: 1.417s / iter
iter: 40 / 7000, total loss: 0.740965
 >>> rpn_loss_cls: 0.077266
 >>> rpn_loss_box: 0.005625
 >>> loss_cls: 0.416012
 >>> loss_box: 0.242062
 >>> lr: 0.001000
speed: 1.133s / iter
/notebooks/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:31: RuntimeWarning: invalid value encountered in log
  targets_dh = np.log(gt_heights / ex_heights)
iter: 60 / 7000, total loss: nan
 >>> rpn_loss_cls: 0.681259
 >>> rpn_loss_box: nan
 >>> loss_cls: 2.784790
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 0.925s / iter

After iter=55, rpn_loss_box becomes nan, caused by a wrong value in bbox_transform.

lib/model/bbox_transform.py, line 20:
gt_heights = gt_rois[:, 3] - gt_rois[:, 1] + 1.0
gt_rois[:, 1] becomes greater than gt_rois[:, 3], which makes gt_heights negative:

gt_rois[:, 3]  188.75
gt_rois[:, 1]  81918.8
gt_heights  -81729.0
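
A minimal reproduction of the warning (with made-up values matching the numbers above):

import numpy as np

gt_heights = np.array([-81729.0])  # negative because ymin > ymax in the annotation
ex_heights = np.array([188.0])     # a typical anchor height
# np.log of a negative ratio emits "RuntimeWarning: invalid value
# encountered in log" and yields nan
targets_dh = np.log(gt_heights / ex_heights)
print(targets_dh)  # [nan], which then propagates into rpn_loss_box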

@HTLife

HTLife commented May 28, 2017

My temporary workaround is to ignore boxes with incorrect values (ymin > ymax).

Check the gt_boxes values as follows:

lib/model/train_val.py
https://github.com/endernewton/tf-faster-rcnn/blob/master/lib/model/train_val.py#L219
Line 219:

blobs = self.data_layer.forward()
# Skip this image if the first ground-truth box has ymin > ymax
if blobs['gt_boxes'][0][1] > blobs['gt_boxes'][0][3]:
    iter += 1
    continue

This modification lets the program run without producing nan.
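
A broader variant of the same check (an untested sketch) skips the image inside the training loop if any ground-truth box is degenerate on either axis, not just the first one:

import numpy as np

blobs = self.data_layer.forward()
gt = blobs['gt_boxes']  # each row is (x1, y1, x2, y2, class)
# Skip the image if any box has x2 < x1 or y2 < y1
if np.any(gt[:, 2] < gt[:, 0]) or np.any(gt[:, 3] < gt[:, 1]):
    iter += 1
    continue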


Besides this temporary workaround, I started looking into the root cause of this error.
One could guess that the ymin and ymax values of the bounding-box ground truth are wrong.
However, after examining my bounding-box data, ymin is always smaller than ymax.

@endernewton What would you suggest to track down the source of the nan problem?

@endernewton
Owner

I am not sure about your setup or your application, so it is hard to help. Sorry.

@xzy295461445
Author

I ran it again and the error changed. Could you please tell me what happened?

Fix VGG16 layers..
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/script_ops.py", line 82, in call
ret = func(*args)
File "/media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/tools/../lib/layer_utils/anchor_target_layer.py", line 90, in anchor_target_layer
bbox_targets = _compute_targets(anchors, gt_boxes[argmax_overlaps, :])
File "/media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/tools/../lib/layer_utils/anchor_target_layer.py", line 163, in _compute_targets
return bbox_transform(ex_rois, gt_rois[:, :4]).astype(np.float32, copy=False)
File "/media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/tools/../lib/model/bbox_transform.py", line 26, in bbox_transform
targets_dw = np.log(gt_widths / ex_widths) if ex_widths != 0 else 0
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
2017-05-30 20:17:20.974332: W tensorflow/core/framework/op_kernel.cc:1152] Internal: Failed to run py callback pyfunc_2: see error log.
Traceback (most recent call last):
File "./tools/trainval_net.py", line 136, in
max_iters=args.max_iters)
File "/media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/tools/../lib/model/train_val.py", line 386, in train_net
sw.train_model(sess, max_iters)
File "/media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/tools/../lib/model/train_val.py", line 285, in train_model
self.net.train_step(sess, blobs, train_op)
File "/media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/tools/../lib/nets/network.py", line 374, in train_step
feed_dict=feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 778, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 982, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1032, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1052, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Failed to run py callback pyfunc_2: see error log.
[[Node: vgg_16/anchor/PyFunc = PyFunc[Tin=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32], Tout=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], token="pyfunc_2", _device="/job:localhost/replica:0/task:0/cpu:0"](vgg_16/rpn_cls_score/BiasAdd, _recv_Placeholder_2_0, _recv_Placeholder_1_0, vgg_16/anchor/PyFunc/input_3, vgg_16/ANCHOR_default/generate_anchors, vgg_16/anchor/PyFunc/input_5)]]

Caused by op u'vgg_16/anchor/PyFunc', defined at:
File "./tools/trainval_net.py", line 136, in
max_iters=args.max_iters)
File "/media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/tools/../lib/model/train_val.py", line 386, in train_net
sw.train_model(sess, max_iters)
File "/media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/tools/../lib/model/train_val.py", line 105, in train_model
anchor_ratios=cfg.ANCHOR_RATIOS)
File "/media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/tools/../lib/nets/network.py", line 305, in create_architecture
rois, cls_prob, bbox_pred = self.build_network(sess, training)
File "/media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/tools/../lib/nets/vgg16.py", line 68, in build_network
rpn_labels = self._anchor_target_layer(rpn_cls_score, "anchor")
File "/media/y/B0AAA15CAAA11FB8/linux/tf-faster-rcnn/tools/../lib/nets/network.py", line 149, in _anchor_target_layer
[tf.float32, tf.float32, tf.float32, tf.float32])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/script_ops.py", line 189, in py_func
input=inp, token=token, Tout=Tout, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 40, in _py_func
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in init
self._traceback = _extract_stack()

InternalError (see above for traceback): Failed to run py callback pyfunc_2: see error log.
[[Node: vgg_16/anchor/PyFunc = PyFunc[Tin=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32], Tout=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], token="pyfunc_2", _device="/job:localhost/replica:0/task:0/cpu:0"](vgg_16/rpn_cls_score/BiasAdd, _recv_Placeholder_2_0, _recv_Placeholder_1_0, vgg_16/anchor/PyFunc/input_3, vgg_16/ANCHOR_default/generate_anchors, vgg_16/anchor/PyFunc/input_5)]]

Command exited with non-zero status 1
18.59user 3.10system 0:24.67elapsed 87%CPU (0avgtext+0avgdata 3754036maxresident)k
659096inputs+16outputs (377major+1686078minor)pagefaults 0swaps
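
Note: the ValueError above is NumPy semantics: ex_widths is an array, so "if ex_widths != 0" has an ambiguous truth value. An elementwise guard (a sketch, not the repository's code) avoids both the ambiguity and the log of a non-positive ratio:

import numpy as np

ex_widths = np.array([16.0, 0.0, 32.0])   # example anchor widths (one zero)
gt_widths = np.array([20.0, 12.0, -5.0])  # example gt widths (one invalid)

# Clamp elementwise to a small positive epsilon instead of testing the
# whole array in an `if` statement
eps = 1e-14
targets_dw = np.log(np.maximum(gt_widths, eps) / np.maximum(ex_widths, eps))
print(targets_dw)  # finite everywhere, no RuntimeWarning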

@xzy295461445
Author

In bbox_transform, the gt_widths value was odd, so I altered it. I can't be sure it's right, but it works.
Closing.

@abhiML

abhiML commented Jun 15, 2017

I am getting a similar error. My ex_widths becomes nan after the 100th iteration. It gives a runtime warning and then exits after a few more iterations. Any clues?
@HTLife

@lonlonago

@xzy295461445, how did you alter it? Did you solve the problem?

@lonlonago

@xzy295461445 , @HTLife , @abhiML , I got the same problem training on my own data: rpn_loss_box becomes nan. After some research, I found it's because in the file pascal_voc.py the function _load_pascal_annotation makes pixel indexes 0-based. The code is:
x1 = float(bbox.find('xmin').text) - 1
y1 = float(bbox.find('ymin').text) - 1
x2 = float(bbox.find('xmax').text) - 1
y2 = float(bbox.find('ymax').text) - 1
But if your data is not 1-based (mine is 0-based), this produces -1 in the data. You can try deleting the -1 operation. Hope this helps!
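
An alternative to deleting the subtraction entirely (an untested sketch) is to clamp at zero in _load_pascal_annotation, so both 1-based and 0-based annotations stay non-negative:

# Shift 1-based coordinates to 0-based, but never below zero
x1 = max(float(bbox.find('xmin').text) - 1, 0)
y1 = max(float(bbox.find('ymin').text) - 1, 0)
x2 = max(float(bbox.find('xmax').text) - 1, 0)
y2 = max(float(bbox.find('ymax').text) - 1, 0)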

@VisintZJ

@xzy295461445 how did you alter it? Did you solve this problem?

@xzy295461445
Author

@VisintZJ Can you train with the VOC datasets?

@VisintZJ

@xzy295461445 Yes, there is no problem when I train with the VOC datasets.

@xzy295461445
Author

When I made the XML files for my own dataset, the width and height were swapped.

@VisintZJ

@xzy295461445 Thank you! I solved my problem after checking my training data set and found the reason: there are some wrong entries in my data. :(

@Site1997

Site1997 commented Apr 1, 2018

It is perhaps due to errors in the bbox coordinates (x < 0 or x > img_width) in your Annotations. (At least that was my case.)
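
A quick way to scan for such boxes (a sketch; the annotation path and field names assume the standard VOC2007 layout):

import glob
import xml.etree.ElementTree as ET

# Flag annotations whose boxes fall outside the image or are degenerate
for xml_path in glob.glob('data/VOCdevkit2007/VOC2007/Annotations/*.xml'):
    root = ET.parse(xml_path).getroot()
    w = float(root.find('size/width').text)
    h = float(root.find('size/height').text)
    for obj in root.findall('object'):
        b = obj.find('bndbox')
        x1 = float(b.find('xmin').text)
        y1 = float(b.find('ymin').text)
        x2 = float(b.find('xmax').text)
        y2 = float(b.find('ymax').text)
        if x1 < 0 or y1 < 0 or x2 > w or y2 > h or x1 >= x2 or y1 >= y2:
            print(xml_path, (x1, y1, x2, y2), 'image size:', (w, h))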

@liangxiaotian

liangxiaotian commented Apr 25, 2018

If your dataset's bboxes have xmin = 0 or ymin = 0, you should change this code in pascal_voc.py:

x1 = float(bbox.find('xmin').text) - 1
y1 = float(bbox.find('ymin').text) - 1
x2 = float(bbox.find('xmax').text) - 1
y2 = float(bbox.find('ymax').text) - 1

to

x1 = float(bbox.find('xmin').text)
y1 = float(bbox.find('ymin').text)
x2 = float(bbox.find('xmax').text)
y2 = float(bbox.find('ymax').text)

If your dataset's bboxes have xmax = width or ymax = height, you should change this code in imdb.py:

boxes[:, 0] = widths[i] - oldx2 - 1
boxes[:, 2] = widths[i] - oldx1 - 1

to

boxes[:, 0] = widths[i] - oldx2
boxes[:, 2] = widths[i] - oldx1

If your dataset's bboxes have xmax > width or ymax > height, you should delete or relabel them.
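
As an extra safety net (an untested sketch), you can assert right after the flip in imdb.py that the boxes are still valid, so bad annotations fail fast instead of surfacing later as nan:

boxes[:, 0] = widths[i] - oldx2
boxes[:, 2] = widths[i] - oldx1
# Fail fast on degenerate boxes instead of letting nan appear during training
assert (boxes[:, 2] >= boxes[:, 0]).all(), 'flipped box has x2 < x1'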

@TianChenone

If you have checked xmin, ymin, xmax, and ymax and ensured that xmin > 0, xmax < width, ymin > 0, and ymax < height, but the problem is still there, you can try deleting the files in data/cache and rerunning the code; otherwise a stale cached roidb from an earlier dataset may be reloaded, as in the "gt roidb loaded from ... cache" lines in the log above.
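
For example (a sketch; the cache location follows the DATA_DIR layout shown in the config above):

import glob
import os

# Remove cached roidb pickles so they are rebuilt from the new annotations
for f in glob.glob('data/cache/*.pkl'):
    print('removing', f)
    os.remove(f)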
