coco training problem #85

Open
2017hack opened this issue Sep 18, 2017 · 11 comments

Comments

@2017hack

When I train with the following command,
./experiments/scripts/rfcn_end2end_ohem.sh 1 ResNet-101 coco
it prints the output below:
I0918 14:51:27.386899 22785 net.cpp:775] Ignoring source layer prob
Solving...
I0918 14:51:28.109838 22785 solver.cpp:228] Iteration 0, loss = 5.13607
I0918 14:51:28.109882 22785 solver.cpp:244] Train net output #0: accuarcy = 0
I0918 14:51:28.109892 22785 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0918 14:51:28.109900 22785 solver.cpp:244] Train net output #2: loss_cls = 4.4268 (* 1 = 4.4268 loss)
I0918 14:51:28.109906 22785 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.698125 (* 1 = 0.698125 loss)
I0918 14:51:28.109913 22785 solver.cpp:244] Train net output #4: rpn_loss_bbox = 0.0111453 (* 1 = 0.0111453 loss)
I0918 14:51:28.109922 22785 sgd_solver.cpp:106] Iteration 0, lr = 0.0005
I0918 14:52:03.514792 22785 solver.cpp:228] Iteration 100, loss = 2.51846
I0918 14:52:03.514839 22785 solver.cpp:244] Train net output #0: accuarcy = 1
I0918 14:52:03.514849 22785 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0918 14:52:03.514855 22785 solver.cpp:244] Train net output #2: loss_cls = 0 (* 1 = 0 loss)
I0918 14:52:03.514861 22785 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.559937 (* 1 = 0.559937 loss)
I0918 14:52:03.514868 22785 solver.cpp:244] Train net output #4: rpn_loss_bbox = 1.95852 (* 1 = 1.95852 loss)
I0918 14:52:03.514873 22785 sgd_solver.cpp:106] Iteration 100, lr = 0.0005
I0918 14:52:39.039602 22785 solver.cpp:228] Iteration 200, loss = 0.823245
I0918 14:52:39.039641 22785 solver.cpp:244] Train net output #0: accuarcy = 1
I0918 14:52:39.039651 22785 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0918 14:52:39.039657 22785 solver.cpp:244] Train net output #2: loss_cls = 0 (* 1 = 0 loss)
I0918 14:52:39.039664 22785 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.218862 (* 1 = 0.218862 loss)
I0918 14:52:39.039669 22785 solver.cpp:244] Train net output #4: rpn_loss_bbox = 0.604383 (* 1 = 0.604383 loss)
I0918 14:52:39.039674 22785 sgd_solver.cpp:106] Iteration 200, lr = 0.0005
I0918 14:53:14.707566 22785 solver.cpp:228] Iteration 300, loss = 0.940254
I0918 14:53:14.707613 22785 solver.cpp:244] Train net output #0: accuarcy = 1
I0918 14:53:14.707623 22785 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0918 14:53:14.707629 22785 solver.cpp:244] Train net output #2: loss_cls = 0 (* 1 = 0 loss)
I0918 14:53:14.707635 22785 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.338957 (* 1 = 0.338957 loss)
I0918 14:53:14.707641 22785 solver.cpp:244] Train net output #4: rpn_loss_bbox = 0.601298 (* 1 = 0.601298 loss)

I don't know why this happens. Please help!

@YuwenXiong
Owner

YuwenXiong commented Sep 18, 2017

Please list your environment clearly, including the CUDA version and the Caffe version (please make sure you have read the README and are using the Caffe version we suggest), and say whether this situation is reproducible. Also make sure your images are not corrupted.

@2017hack
Author

My Linux version is Linux version 3.10.0-327.x86_64 (admin@rs3g12027.et2sqa) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Tue Dec 29 19:54:05 CST 2015. I use CUDA 7.5 and the Caffe version that the README suggests. When I run ./experiments/scripts/rfcn_end2end_ohem.sh 1 ResNet-101 coco, the output looks like this:

  • echo Logging output to experiments/logs/rfcn_end2end_ResNet-101_.txt.2017-09-19_10-58-54
    Logging output to experiments/logs/rfcn_end2end_ResNet-101_.txt.2017-09-19_10-58-54
  • ./tools/train_net.py --gpu 0 --solver models/coco/ResNet-101/rfcn_end2end/solver_ohem.prototxt --weights data/imagenet_models/ResNet-101-model.caffemodel --imdb coco_2014_train --iters 350000 --cfg experiments/cfgs/rfcn_end2end_ohem.yml
    Called with args:
    Namespace(cfg_file='experiments/cfgs/rfcn_end2end_ohem.yml', gpu_id=0, imdb_name='coco_2014_train', max_iters=350000, pretrained_model='data/imagenet_models/ResNet-101-model.caffemodel', randomize=False, set_cfgs=None, solver='models/coco/ResNet-101/rfcn_end2end/solver_ohem.prototxt')
    Using config:
    {'DATA_DIR': '/home/yiwei.yw/R-FCN/py-R-FCN/data',
    'DEDUP_BOXES': 0.0625,
    'EPS': 1e-14,
    'EXP_DIR': 'rfcn_end2end_ohem',
    'GPU_ID': 0,
    'MATLAB': 'matlab',
    'MODELS_DIR': '/home/yiwei.yw/R-FCN/py-R-FCN/models/coco',
    'PIXEL_MEANS': array([[[ 102.9801, 115.9465, 122.7717]]]),
    'RNG_SEED': 3,
    'ROOT_DIR': '/home/yiwei.yw/R-FCN/py-R-FCN',
    'TEST': {'AGNOSTIC': True,
    'BBOX_REG': True,
    'HAS_RPN': True,
    'MAX_SIZE': 1000,
    'NMS': 0.3,
    'PROPOSAL_METHOD': 'selective_search',
    'RPN_MIN_SIZE': 16,
    'RPN_NMS_THRESH': 0.7,
    'RPN_POST_NMS_TOP_N': 300,
    'RPN_PRE_NMS_TOP_N': 6000,
    'SCALES': [600],
    'SVM': False},
    'TRAIN': {'AGNOSTIC': True,
    'ASPECT_GROUPING': True,
    'BATCH_SIZE': -1,
    'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
    'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
    'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
    'BBOX_NORMALIZE_TARGETS': True,
    'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
    'BBOX_REG': True,
    'BBOX_THRESH': 0.5,
    'BG_THRESH_HI': 0.5,
    'BG_THRESH_LO': 0.0,
    'FG_FRACTION': 0.25,
    'FG_THRESH': 0.5,
    'HAS_RPN': True,
    'IMS_PER_BATCH': 1,
    'MAX_SIZE': 1000,
    'PROPOSAL_METHOD': 'gt',
    'RPN_BATCHSIZE': 256,
    'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
    'RPN_CLOBBER_POSITIVES': False,
    'RPN_FG_FRACTION': 0.5,
    'RPN_MIN_SIZE': 16,
    'RPN_NEGATIVE_OVERLAP': 0.3,
    'RPN_NMS_THRESH': 0.7,
    'RPN_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
    'RPN_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
    'RPN_NORMALIZE_TARGETS': True,
    'RPN_POSITIVE_OVERLAP': 0.7,
    'RPN_POSITIVE_WEIGHT': -1.0,
    'RPN_POST_NMS_TOP_N': 300,
    'RPN_PRE_NMS_TOP_N': 6000,
    'SCALES': [600],
    'SNAPSHOT_INFIX': '',
    'SNAPSHOT_ITERS': 10000,
    'USE_FLIPPED': True,
    'USE_PREFETCH': False},
    'USE_GPU_NMS': True}
    loading annotations into memory...
    Done (t=12.61s)
    creating index...
    index created!
    Loaded dataset coco_2014_train for training
    Set proposal method: gt
    Appending horizontally-flipped training examples...
    wrote gt roidb to /home/yiwei.yw/R-FCN/py-R-FCN/data/cache/coco_2014_train_gt_roidb.pkl
    done
    Preparing training data...
    done
    loading annotations into memory...
    Done (t=17.12s)
    creating index...
    index created!
    165566 roidb entries
    Output will be saved to /home/yiwei.yw/R-FCN/py-R-FCN/output/rfcn_end2end_ohem/coco_2014_train
    Filtered 28290 roidb entries: 165566 -> 137276
    Computing bounding-box regression targets...
    bbox target means:
    [[ 0. 0. 0. 0.]
    [ 0. 0. 0. 0.]]
    [ 0. 0. 0. 0.]
    bbox target stdevs:
    [[ 0.1 0.1 0.2 0.2]
    [ 0.1 0.1 0.2 0.2]]
    [ 0.1 0.1 0.2 0.2]
    Normalizing targets
    done
    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I0919 11:01:31.636265 29315 solver.cpp:48] Initializing solver from parameters:
    train_net: "models/coco/ResNet-101/rfcn_end2end/train_agnostic_ohem.prototxt"
    base_lr: 0.0005
    display: 100
    lr_policy: "step"
    gamma: 0.1
    momentum: 0.9
    weight_decay: 0.0005
    stepsize: 1280000
    snapshot: 0
    snapshot_prefix: "resnet101_rfcn_ohem"
    average_loss: 100
    I0919 11:01:31.636314 29315 solver.cpp:81] Creating training net from train_net file: models/coco/ResNet-101/rfcn_end2end/train_agnostic_ohem.prototxt
    I0919 11:01:31.648705 29315 net.cpp:58] Initializing net from parameters:
    name: "ResNet-101"
    state {
    phase: TRAIN
    }
    layer {
    name: "input-data"
    ...
    Solving...
    I0919 11:01:33.715899 29315 solver.cpp:228] Iteration 0, loss = 5.67758
    I0919 11:01:33.715940 29315 solver.cpp:244] Train net output #0: accuarcy = 0
    I0919 11:01:33.715951 29315 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
    I0919 11:01:33.715958 29315 solver.cpp:244] Train net output #2: loss_cls = 3.81131 (* 1 = 3.81131 loss)
    I0919 11:01:33.715965 29315 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.708636 (* 1 = 0.708636 loss)
    I0919 11:01:33.715970 29315 solver.cpp:244] Train net output #4: rpn_loss_bbox = 1.15763 (* 1 = 1.15763 loss)
    I0919 11:01:33.715978 29315 sgd_solver.cpp:106] Iteration 0, lr = 0.0005
    I0919 11:02:08.437101 29315 solver.cpp:228] Iteration 100, loss = 1.20817
    I0919 11:02:08.437142 29315 solver.cpp:244] Train net output #0: accuarcy = 1
    I0919 11:02:08.437152 29315 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
    I0919 11:02:08.437160 29315 solver.cpp:244] Train net output #2: loss_cls = 0 (* 1 = 0 loss)
    I0919 11:02:08.437166 29315 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.568281 (* 1 = 0.568281 loss)
    I0919 11:02:08.437172 29315 solver.cpp:244] Train net output #4: rpn_loss_bbox = 0.639885 (* 1 = 0.639885 loss)
    I0919 11:02:08.437178 29315 sgd_solver.cpp:106] Iteration 100, lr = 0.0005
    I0919 11:02:42.693392 29315 solver.cpp:228] Iteration 200, loss = 0.17464
    I0919 11:02:42.693434 29315 solver.cpp:244] Train net output #0: accuarcy = 1
    I0919 11:02:42.693444 29315 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
    I0919 11:02:42.693451 29315 solver.cpp:244] Train net output #2: loss_cls = 0 (* 1 = 0 loss)
    I0919 11:02:42.693459 29315 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.125884 (* 1 = 0.125884 loss)
    I0919 11:02:42.693464 29315 solver.cpp:244] Train net output #4: rpn_loss_bbox = 0.0487557 (* 1 = 0.0487557 loss)
    I0919 11:02:42.693470 29315 sgd_solver.cpp:106] Iteration 200, lr = 0.0005
    ...

loss_cls and loss_bbox quickly drop to 0, as in the log above. Can you please help me fix this? Thanks!

@2017hack
Author

This situation is reproducible. I just followed the README: I downloaded the code, compiled it, and downloaded the COCO 2014 train and val data. I did not change anything; I just ran the script and the problem appeared. I don't know what is wrong.

@2017hack
Author

Can you help me? I have been stuck on this for many days!

@xxxxxxxx-dl

Maybe it's the numpy version. In minibatch sampling, if you use a numpy version higher than 1.10.1 and use astype to convert a float to an int, you may get an illegal value such as -NaN. This leads to 0 positive samples and may cause the problem you encountered.
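
One way to check this on a given machine is to run the cast in isolation. The sketch below assumes only NumPy; `cast_count` is a hypothetical helper written for diagnosis and is not part of py-R-FCN. It uses `np.inf` as the unbounded input because that is the value reported later in this thread for OHEM training. The result of casting a non-finite float to an integer is platform-dependent, so the script prints it rather than assuming a particular value.

```python
import warnings

import numpy as np


def cast_count(fraction, total):
    """Reproduce the cast under discussion: round(fraction * total) -> integer.

    Hypothetical diagnostic helper; it is not part of py-R-FCN.
    """
    with warnings.catch_warnings():
        # Some NumPy releases warn about an invalid cast; silence the warning
        # here so the raw result can be inspected.
        warnings.simplefilter("ignore", RuntimeWarning)
        return np.round(fraction * total).astype(int)


finite = cast_count(0.25, 128)        # an ordinary, finite batch size
unbounded = cast_count(0.25, np.inf)  # an unbounded batch size

print("numpy", np.__version__)
print("finite case:", finite)
# No specific value is assumed for the inf case: it depends on the platform.
print("inf case:", unbounded, "(a usable sample count must be >= 0)")
```

If the inf case prints a negative number, the sampler would be asked for a negative number of foreground RoIs, which is consistent with the zero losses in the logs.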

@sherryxie1

sherryxie1 commented Jan 25, 2018

@2017hack
Hi, I don't know whether you have solved the problem yet, but I think I have found the cause.
The problem is caused by the numpy package.
In lib/rpn/proposal_target_layer.py line 60, you may have changed the code to

fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image).astype(np.int)

This works for ordinary finite numbers, but when training with OHEM, rois_per_image is np.inf, so after this cast fg_rois_per_image becomes a negative number.

What you should do is use an if statement, like this:

if rois_per_image == np.inf:
    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image)
else:
    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image).astype(np.int)

By the way, lib/roi_data_layer/minibatch.py line 26 has the same problem, but it only matters if you use Selective Search.
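
The guard described above can also be written as a small standalone helper, which makes it easy to test outside of training. This is a minimal sketch, not the repository's code: the function name `fg_rois_cap` and its arguments are hypothetical, and it uses the built-in `int` instead of `np.int`, since the `np.int` alias was removed in NumPy 1.24. It also assumes the cap is later combined with the number of available foreground RoIs through `min`, which is why an infinite cap is left as a float.

```python
import numpy as np


def fg_rois_cap(fg_fraction, rois_per_image):
    """Return the maximum number of foreground RoIs to sample per image.

    Hypothetical helper: a finite batch size yields a Python int, while an
    unbounded batch size (np.inf, as reported for OHEM) stays infinite so it
    is never cast to an integer.
    """
    cap = np.round(fg_fraction * rois_per_image)
    if np.isinf(cap):
        return cap  # no cap: keep the float infinity
    return int(cap)


# Finite batch size: a plain integer count.
print(fg_rois_cap(0.25, 128))

# Unbounded batch size: taking min() with the number of available
# foreground RoIs (17 here, an arbitrary example) gives that number.
available_fg = 17
print(min(fg_rois_cap(0.25, np.inf), available_fg))
```

Checking with `np.isinf` rather than `== np.inf` is a design choice in this sketch; both behave the same for positive infinity.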

@starxhong

@sherryxie1 Thank you very much! It solved my problem perfectly. My numpy version is 1.14.1, higher than 1.10.1, so when I trained Faster R-CNN I hit the "TypeError: 'numpy.float64' object cannot be interpreted as an index" problem. Following the solution given in py-faster-rcnn/issues/481, I added some astype calls to convert floats to ints and successfully trained Faster R-CNN, as well as rfcn_end2end without OHEM. However, with rfcn_end2end_ohem I got the loss_bbox = 0 problem. Thank you for your solution!
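
For context, the TypeError mentioned above comes from using a float where an integer index or count is expected. The following is a small illustration under the assumption of a recent NumPy release; the array `rois` and the helper `take_first` are made up for the example and do not appear in py-faster-rcnn.

```python
import numpy as np


def take_first(items, count):
    """Return the first `count` items, casting a float count to int first."""
    return items[:int(count)]


rois = np.arange(10)
count = np.float64(3.0)  # e.g. the result of np.round(...) on a float

try:
    rois[:count]  # slicing with a float count
    float_slice_ok = True
except TypeError:
    # Recent NumPy releases reject float slice bounds.
    float_slice_ok = False

print("float slice accepted:", float_slice_ok)
print(take_first(rois, count))
```

The explicit `int(...)` cast is safe here only because `count` is finite. An infinite value needs the separate check discussed earlier in the thread.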

@Huangswust182

@VersionHX
Hello!
Excuse me, did you change it to this, and did it work?
if rois_per_image == np.inf:
    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image)
else:
    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image).astype(np.int)

@starxhong

@Huangswust182 Yes, I did! @sherryxie1's solution works well on my machine.

@Huangswust182

@VersionHX
Thank you very much

@Cazforshort

There doesn't seem to be a minibatch.py file in the TensorFlow package to make these changes in. Where do I change it to int()?
