coco training problem #85

2017hack · 2017-09-18T09:53:31Z

When I train with the following command:
./experiments/scripts/rfcn_end2end_ohem.sh 1 ResNet-101 coco
it prints the following:
I0918 14:51:27.386899 22785 net.cpp:775] Ignoring source layer prob
Solving...
I0918 14:51:28.109838 22785 solver.cpp:228] Iteration 0, loss = 5.13607
I0918 14:51:28.109882 22785 solver.cpp:244] Train net output #0: accuarcy = 0
I0918 14:51:28.109892 22785 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0918 14:51:28.109900 22785 solver.cpp:244] Train net output #2: loss_cls = 4.4268 (* 1 = 4.4268 loss)
I0918 14:51:28.109906 22785 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.698125 (* 1 = 0.698125 loss)
I0918 14:51:28.109913 22785 solver.cpp:244] Train net output #4: rpn_loss_bbox = 0.0111453 (* 1 = 0.0111453 loss)
I0918 14:51:28.109922 22785 sgd_solver.cpp:106] Iteration 0, lr = 0.0005
I0918 14:52:03.514792 22785 solver.cpp:228] Iteration 100, loss = 2.51846
I0918 14:52:03.514839 22785 solver.cpp:244] Train net output #0: accuarcy = 1
I0918 14:52:03.514849 22785 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0918 14:52:03.514855 22785 solver.cpp:244] Train net output #2: loss_cls = 0 (* 1 = 0 loss)
I0918 14:52:03.514861 22785 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.559937 (* 1 = 0.559937 loss)
I0918 14:52:03.514868 22785 solver.cpp:244] Train net output #4: rpn_loss_bbox = 1.95852 (* 1 = 1.95852 loss)
I0918 14:52:03.514873 22785 sgd_solver.cpp:106] Iteration 100, lr = 0.0005
I0918 14:52:39.039602 22785 solver.cpp:228] Iteration 200, loss = 0.823245
I0918 14:52:39.039641 22785 solver.cpp:244] Train net output #0: accuarcy = 1
I0918 14:52:39.039651 22785 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0918 14:52:39.039657 22785 solver.cpp:244] Train net output #2: loss_cls = 0 (* 1 = 0 loss)
I0918 14:52:39.039664 22785 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.218862 (* 1 = 0.218862 loss)
I0918 14:52:39.039669 22785 solver.cpp:244] Train net output #4: rpn_loss_bbox = 0.604383 (* 1 = 0.604383 loss)
I0918 14:52:39.039674 22785 sgd_solver.cpp:106] Iteration 200, lr = 0.0005
I0918 14:53:14.707566 22785 solver.cpp:228] Iteration 300, loss = 0.940254
I0918 14:53:14.707613 22785 solver.cpp:244] Train net output #0: accuarcy = 1
I0918 14:53:14.707623 22785 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0918 14:53:14.707629 22785 solver.cpp:244] Train net output #2: loss_cls = 0 (* 1 = 0 loss)
I0918 14:53:14.707635 22785 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.338957 (* 1 = 0.338957 loss)
I0918 14:53:14.707641 22785 solver.cpp:244] Train net output #4: rpn_loss_bbox = 0.601298 (* 1 = 0.601298 loss)

I don't know why this happens. Please help!

YuwenXiong · 2017-09-18T19:01:29Z

Please list your environment clearly, including the CUDA version and the Caffe version (please make sure you have read the README and are using the Caffe version we suggest), and say whether this situation is reproducible. Also make sure your images are not corrupted.
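
For the Python side of such a report, a small script can gather the basics in one go. This is only a sketch: `environment_report` is a hypothetical helper name, and the CUDA and Caffe versions still have to be collected separately (for example with `nvcc --version` and the Caffe commit you built).

```python
import platform
import sys


def environment_report():
    """Collect basic version info worth pasting into a bug report."""
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    try:
        import numpy
        report["numpy"] = numpy.__version__
    except ImportError:
        report["numpy"] = "not installed"
    return report


for key, value in sorted(environment_report().items()):
    print("%s: %s" % (key, value))
```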

2017hack · 2017-09-19T03:26:36Z

My Linux version is Linux version 3.10.0-327.x86_64 (admin@rs3g12027.et2sqa) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Tue Dec 29 19:54:05 CST 2015. I use cuda-7.5 and the Caffe version the README suggests. When I run ./experiments/scripts/rfcn_end2end_ohem.sh 1 ResNet-101 coco, the output looks like this:

echo Logging output to experiments/logs/rfcn_end2end_ResNet-101_.txt.2017-09-19_10-58-54
Logging output to experiments/logs/rfcn_end2end_ResNet-101_.txt.2017-09-19_10-58-54
./tools/train_net.py --gpu 0 --solver models/coco/ResNet-101/rfcn_end2end/solver_ohem.prototxt --weights data/imagenet_models/ResNet-101-model.caffemodel --imdb coco_2014_train --iters 350000 --cfg experiments/cfgs/rfcn_end2end_ohem.yml
Called with args:
Namespace(cfg_file='experiments/cfgs/rfcn_end2end_ohem.yml', gpu_id=0, imdb_name='coco_2014_train', max_iters=350000, pretrained_model='data/imagenet_models/ResNet-101-model.caffemodel', randomize=False, set_cfgs=None, solver='models/coco/ResNet-101/rfcn_end2end/solver_ohem.prototxt')
Using config:
{'DATA_DIR': '/home/yiwei.yw/R-FCN/py-R-FCN/data',
'DEDUP_BOXES': 0.0625,
'EPS': 1e-14,
'EXP_DIR': 'rfcn_end2end_ohem',
'GPU_ID': 0,
'MATLAB': 'matlab',
'MODELS_DIR': '/home/yiwei.yw/R-FCN/py-R-FCN/models/coco',
'PIXEL_MEANS': array([[[ 102.9801, 115.9465, 122.7717]]]),
'RNG_SEED': 3,
'ROOT_DIR': '/home/yiwei.yw/R-FCN/py-R-FCN',
'TEST': {'AGNOSTIC': True,
'BBOX_REG': True,
'HAS_RPN': True,
'MAX_SIZE': 1000,
'NMS': 0.3,
'PROPOSAL_METHOD': 'selective_search',
'RPN_MIN_SIZE': 16,
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000,
'SCALES': [600],
'SVM': False},
'TRAIN': {'AGNOSTIC': True,
'ASPECT_GROUPING': True,
'BATCH_SIZE': -1,
'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
'BBOX_REG': True,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'HAS_RPN': True,
'IMS_PER_BATCH': 1,
'MAX_SIZE': 1000,
'PROPOSAL_METHOD': 'gt',
'RPN_BATCHSIZE': 256,
'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': 16,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
'RPN_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
'RPN_NORMALIZE_TARGETS': True,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000,
'SCALES': [600],
'SNAPSHOT_INFIX': '',
'SNAPSHOT_ITERS': 10000,
'USE_FLIPPED': True,
'USE_PREFETCH': False},
'USE_GPU_NMS': True}
loading annotations into memory...
Done (t=12.61s)
creating index...
index created!
Loaded dataset coco_2014_train for training
Set proposal method: gt
Appending horizontally-flipped training examples...
wrote gt roidb to /home/yiwei.yw/R-FCN/py-R-FCN/data/cache/coco_2014_train_gt_roidb.pkl
done
Preparing training data...
done
loading annotations into memory...
Done (t=17.12s)
creating index...
index created!
165566 roidb entries
Output will be saved to /home/yiwei.yw/R-FCN/py-R-FCN/output/rfcn_end2end_ohem/coco_2014_train
Filtered 28290 roidb entries: 165566 -> 137276
Computing bounding-box regression targets...
bbox target means:
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
[ 0. 0. 0. 0.]
bbox target stdevs:
[[ 0.1 0.1 0.2 0.2]
[ 0.1 0.1 0.2 0.2]]
[ 0.1 0.1 0.2 0.2]
Normalizing targets
done
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0919 11:01:31.636265 29315 solver.cpp:48] Initializing solver from parameters:
train_net: "models/coco/ResNet-101/rfcn_end2end/train_agnostic_ohem.prototxt"
base_lr: 0.0005
display: 100
lr_policy: "step"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
stepsize: 1280000
snapshot: 0
snapshot_prefix: "resnet101_rfcn_ohem"
average_loss: 100
I0919 11:01:31.636314 29315 solver.cpp:81] Creating training net from train_net file: models/coco/ResNet-101/rfcn_end2end/train_agnostic_ohem.prototxt
I0919 11:01:31.648705 29315 net.cpp:58] Initializing net from parameters:
name: "ResNet-101"
state {
phase: TRAIN
}
layer {
name: "input-data"
...
Solving...
I0919 11:01:33.715899 29315 solver.cpp:228] Iteration 0, loss = 5.67758
I0919 11:01:33.715940 29315 solver.cpp:244] Train net output #0: accuarcy = 0
I0919 11:01:33.715951 29315 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0919 11:01:33.715958 29315 solver.cpp:244] Train net output #2: loss_cls = 3.81131 (* 1 = 3.81131 loss)
I0919 11:01:33.715965 29315 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.708636 (* 1 = 0.708636 loss)
I0919 11:01:33.715970 29315 solver.cpp:244] Train net output #4: rpn_loss_bbox = 1.15763 (* 1 = 1.15763 loss)
I0919 11:01:33.715978 29315 sgd_solver.cpp:106] Iteration 0, lr = 0.0005
I0919 11:02:08.437101 29315 solver.cpp:228] Iteration 100, loss = 1.20817
I0919 11:02:08.437142 29315 solver.cpp:244] Train net output #0: accuarcy = 1
I0919 11:02:08.437152 29315 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0919 11:02:08.437160 29315 solver.cpp:244] Train net output #2: loss_cls = 0 (* 1 = 0 loss)
I0919 11:02:08.437166 29315 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.568281 (* 1 = 0.568281 loss)
I0919 11:02:08.437172 29315 solver.cpp:244] Train net output #4: rpn_loss_bbox = 0.639885 (* 1 = 0.639885 loss)
I0919 11:02:08.437178 29315 sgd_solver.cpp:106] Iteration 100, lr = 0.0005
I0919 11:02:42.693392 29315 solver.cpp:228] Iteration 200, loss = 0.17464
I0919 11:02:42.693434 29315 solver.cpp:244] Train net output #0: accuarcy = 1
I0919 11:02:42.693444 29315 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0919 11:02:42.693451 29315 solver.cpp:244] Train net output #2: loss_cls = 0 (* 1 = 0 loss)
I0919 11:02:42.693459 29315 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.125884 (* 1 = 0.125884 loss)
I0919 11:02:42.693464 29315 solver.cpp:244] Train net output #4: rpn_loss_bbox = 0.0487557 (* 1 = 0.0487557 loss)
I0919 11:02:42.693470 29315 sgd_solver.cpp:106] Iteration 200, lr = 0.0005
...

The problem shows up quickly, as above. Can you please help fix this? Thanks!

2017hack · 2017-09-19T03:29:06Z

This situation is reproducible. I just followed the README: downloaded the code, compiled it, downloaded the COCO 2014 train and val data, and ran it without changing anything, and the problem happens. I don't know what is wrong.

2017hack · 2017-09-25T08:22:31Z

Can you help me? I have been stuck on this for many days!

xxxxxxxx-dl · 2017-10-26T02:35:31Z

It may be the numpy version. In minibatch sampling, if you use a numpy version higher than 1.10.1 and use astype to convert float to int, you may get an illegal value like -NaN. This leads to zero positive samples and may cause the problem you encountered.
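
A quick way to see this kind of bad cast in isolation is the sketch below. It assumes the float being converted is infinite (the value of `rois_per_image` reported for OHEM training later in this thread) and uses the `FG_FRACTION` of 0.25 from the config dump above. The integer that comes out of the cast is not well defined and depends on platform and NumPy version, so no particular value is shown.

```python
import numpy as np

fg_fraction = 0.25        # TRAIN.FG_FRACTION from the config dump earlier in the thread
rois_per_image = np.inf   # assumed value in OHEM mode, as reported later in this thread

fg_float = np.round(fg_fraction * rois_per_image)  # 0.25 * inf is still inf

# Casting inf to an integer type has no defined result; newer NumPy versions
# also emit a RuntimeWarning here, which is silenced for this demo.
with np.errstate(invalid="ignore"):
    fg_int = np.asarray(fg_float).astype(np.int64)

print("float count:", fg_float)
print("after astype:", int(fg_int))
print("usable as a count:", bool(np.isfinite(fg_float)))
```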

sherryxie1 · 2018-01-25T03:29:11Z

@2017hack
Hi, I don't know whether you have solved the problem or not, but I think I have found the reason.
The problem is caused by the numpy package.
In lib/rpn/proposal_target_layer.py, line 60, you may have changed the code to

fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image).astype(np.int)

It works well for ordinary numbers, but when training with OHEM, rois_per_image is np.inf, so after this cast fg_rois_per_image becomes a negative number.

What you should do is use an if statement, like this:

if rois_per_image == np.inf:
    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image)
else:
    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image).astype(np.int)

By the way, lib/roi_data_layer/minibatch.py, line 26, has the same problem, but it only matters if you use Selective Search.
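
The same guard can be written as a standalone function, which also makes it easier to see why leaving the value as inf is harmless. This is a sketch, not the repository's code: `fg_rois_quota` and `fg_to_sample` are hypothetical names, and it assumes the quota is later capped with `min(...)` against the number of available foreground candidates. It uses the builtin `int` rather than `np.int`, since newer NumPy releases no longer provide that alias.

```python
import numpy as np


def fg_rois_quota(fg_fraction, rois_per_image):
    """Hypothetical helper: per-image foreground quota, left as inf when unbounded."""
    quota = np.round(fg_fraction * rois_per_image)
    if np.isinf(quota):
        return quota          # keep inf; casting it to int is what breaks sampling
    return int(quota)


def fg_to_sample(quota, num_fg_candidates):
    """Cap the quota by the number of available foreground candidates (assumed step)."""
    return int(min(quota, num_fg_candidates))


print(fg_to_sample(fg_rois_quota(0.25, 128), 50))     # finite batch: prints 32
print(fg_to_sample(fg_rois_quota(0.25, np.inf), 50))  # unbounded batch: prints 50
```

With a quota that has been cast to a large negative integer, the same `min(...)` would return that negative number instead of 50, so no foreground RoIs would be sampled. That is consistent with the loss_bbox = 0 lines in the logs above.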

starxhong · 2018-03-10T02:53:17Z

@sherryxie1 Thank you very much! It solved my problem perfectly. My numpy version is 1.14.1, higher than 1.10.1, so when I trained Faster R-CNN I hit the "TypeError: 'numpy.float64' object cannot be interpreted as an index" problem. Following the solution given in py-faster-rcnn/issues/481, I used astype in several places to convert float to int and successfully trained Faster R-CNN, as well as rfcn_end2end without OHEM. However, with rfcn_end2end_ohem I got the bbox_loss=0 problem. Thank you for your solution!
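
For reference, the TypeError mentioned here can be reproduced without the detection code. The sketch below uses `np.zeros` as a stand-in for any call that expects an integer count; the exact wording of the error message varies between NumPy versions.

```python
import numpy as np

count = np.round(0.25 * 128)   # a numpy.float64, not an integer

try:
    np.zeros(count)            # newer NumPy rejects float sizes
except TypeError as err:
    print("TypeError:", err)

buf = np.zeros(int(count))     # explicit conversion is fine for finite values
print(buf.shape)
```

Converting with `int(...)` or `astype` is safe only while the value is finite, which is why the OHEM case needs the extra check described above.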

Huangswust182 · 2018-03-14T02:54:25Z

@VersionHX
Hello!
Excuse me, did you change it to this, and did it work?
if rois_per_image == np.inf:
    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image)
else:
    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image).astype(np.int)

starxhong · 2018-03-14T08:27:42Z

@Huangswust182 Yes, I did! @sherryxie1's solution works well on my machine.

Huangswust182 · 2018-03-14T08:29:26Z

@VersionHX
Thank you very much

Cazforshort · 2022-02-17T15:14:56Z

There doesn't seem to be a minibatch.py file in the TensorFlow package to make these changes in. Where do I change it to int()?

This was referenced Mar 10, 2018

bbox_loss is always zero when I train py-R-FCN with OHEM rbgirshick/py-faster-rcnn#791 (Open)

bbox_loss is always zero when I train R-FCN with OHEM #108 (Open)

starxhong mentioned this issue Mar 26, 2018

Very low test accuracy using OHEM. #91 (Open)

starxhong mentioned this issue Jul 28, 2018

At training the loss bbox_loss is always zero rbgirshick/py-faster-rcnn#266 (Open)
