This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Trouble training custom dataset #169

Closed
francoto opened this issue Feb 20, 2018 · 30 comments

Comments

@francoto

francoto commented Feb 20, 2018

Training Detectron on custom dataset

I'm trying to train Mask R-CNN on my custom dataset to perform segmentation on new classes that neither COCO nor ImageNet has ever seen.

  • I first converted my dataset to COCO format so it can be loaded by pycocotools.
  • I added my dataset path to dataset_catalog.py and created the correct links to the images directory and the annotations path.
    The config file I used is based on configs/getting_started/tutorial_1gpu_e2e_faster_rcnn_R-50-FPN.yaml. My dataset contains only 4 classes without background, so I set NUM_CLASSES to 5 (4 does not work either). When I try to train using the command below:
    python2 tools/train_net.py --cfg configs/encov/copy_maskrcnn_R-101-FPN.yaml OUTPUT_DIR /tmp/detectron-output/

ERROR 1:

I get the following error (complete log file is here output.txt)

    At:
      /home/encov/Softwares/Detectron/lib/roi_data/fast_rcnn.py(269): _expand_bbox_targets
      /home/encov/Softwares/Detectron/lib/roi_data/fast_rcnn.py(181): _sample_rois
      /home/encov/Softwares/Detectron/lib/roi_data/fast_rcnn.py(112): add_fast_rcnn_blobs
      /home/encov/Softwares/Detectron/lib/ops/collect_and_distribute_fpn_rpn_proposals.py(62): forward
    terminate called after throwing an instance of 'caffe2::EnforceNotMet'
      what(): [enforce fail at pybind_state.h:423] . Exception encountered running PythonOp function: ValueError: could not broadcast input array from shape (4) into shape (0)

This error comes from the box-expansion procedure (_expand_bbox_targets in roi_data/fast_rcnn.py), which widens the bounding box targets and weights to 4 columns per class. For each row it takes the first element, which is the class label, checks that it is not 0 (the background), and copies the target values into the 4 columns starting at class_index x 4. The error happens because the class index is greater than or equal to the NUM_CLASSES parameter that was used to size the output array, so the destination slice is empty.
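The mechanism described above can be reproduced outside Detectron. The following is a minimal sketch, assuming only NumPy; `expand_bbox_targets` and `label_triggers_error` are simplified stand-ins written for illustration, not Detectron's actual functions.

```python
import numpy as np


def expand_bbox_targets(bbox_target_data, num_classes):
    """Simplified stand-in for the expansion step described above.

    bbox_target_data has one row per RoI: [class, tx, ty, tw, th].
    """
    clss = bbox_target_data[:, 0]
    bbox_targets = np.zeros((clss.size, 4 * num_classes), dtype=np.float32)
    for ind in np.where(clss > 0)[0]:
        cls = int(clss[ind])
        start = 4 * cls
        end = start + 4
        # If cls >= num_classes, this slice is empty (shape (0,)) and the
        # assignment raises the broadcast ValueError seen in the log.
        bbox_targets[ind, start:end] = bbox_target_data[ind, 1:]
    return bbox_targets


def label_triggers_error(label, num_classes):
    """Return True if a single RoI with this label raises ValueError."""
    data = np.array([[label, 0.1, 0.2, 0.3, 0.4]], dtype=np.float32)
    try:
        expand_bbox_targets(data, num_classes)
    except ValueError:
        return True
    return False
```

With NUM_CLASSES set to 5, labels 1 to 4 fit, while a label of 5 or more produces the empty slice. A label that large usually points at the annotation file rather than at the config.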


ERROR 2

I tried the same training, except that I set NUM_CLASSES to 81, the number of classes used for COCO training (which works on my setup, by the way).
The error described above does not appear, but very early in the iterations the bounding box areas are zero, which causes divisions by zero.
output2.txt

Has anyone experienced the same issue when training Fast R-CNN or Mask R-CNN on a custom dataset?
I strongly suspect an error in my COCO-like JSON file, because training on the COCO dataset works correctly.
Thank you for your help,

System information

  • Operating system: Ubuntu 16.04
  • Compiler version: GCC 5.4.0
  • CUDA version: 8.0
  • cuDNN version: 7.0
  • NVIDIA driver version: 384
  • GPU model: GeForce GTX 1080 (x1)
  • python --version output: Python 2.7.12
@realwecan

How many classes do you have in your custom dataset? If you have N classes, then you should set NUM_CLASSES: N+1 in your yaml config file. For example, for six classes you should set NUM_CLASSES: 7. For 80 classes COCO you should set it to 81.
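The N+1 rule can be derived from the annotation file itself. This is a small sketch, assuming a COCO-style dictionary with a `categories` list; `num_classes_for_config` is a hypothetical helper and the example content is made up.

```python
import json


def num_classes_for_config(coco_dict):
    """Return N + 1, where N is the number of distinct category ids."""
    category_ids = {cat["id"] for cat in coco_dict["categories"]}
    return len(category_ids) + 1


# Hypothetical annotation content, for illustration only.
example = json.loads("""
{"images": [], "annotations": [],
 "categories": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"},
                {"id": 3, "name": "c"}, {"id": 4, "name": "d"}]}
""")
print(num_classes_for_config(example))  # 5
```

Counting the categories actually present in the file, rather than the number you expect, also catches accidental extra entries.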

@francoto
Author

Thank you 👍. I have 4 classes, so I should set NUM_CLASSES to 5.
Now I know I must use this value, but I had already tried it and got ERROR 1 described above.

The error (from what I understood in lib/roi_data/fast_rcnn.py) comes from the fact that _expand_bbox_targets creates an array whose size is defined by the NUM_CLASSES parameter. When this array is filled in the for loop, the first element of each box row is used as the class index, and the error occurs when this class index is greater than or equal to NUM_CLASSES. It is strange that I can get a class index that large.


For the record, here are the lines of code I am talking about (in lib/roi_data/fast_rcnn.py):

l.251 num_bbox_reg_classes = cfg.MODEL.NUM_CLASSES

l.256 bbox_targets = blob_utils.zeros((clss.size, 4 * num_bbox_reg_classes))

ll.260-270

    inds = np.where(clss > 0)[0]
    # print("DEBUG: inds value is {}".format(inds))
    for ind in inds:
        cls = int(clss[ind])
        start = 4 * cls
        end = start + 4
        bbox_targets[ind, start:end] = bbox_target_data[ind, 1:]
        bbox_inside_weights[ind, start:end] = (1.0, 1.0, 1.0, 1.0)

The error occurs when cls is greater than or equal to cfg.MODEL.NUM_CLASSES.

@raninbowlalala

@francoto I have a question: how did you convert your dataset to COCO format?
Thanks in advance.

@francoto
Author

@raninbowlalala
Starting from my initial dataset (not a COCO-like dataset), I wrote a Python script to fill every field of the COCO dataset dict.
You can find the COCO dataset format here.
I also installed pycocotools and copied coco.py as mycustomdataset.py.
Then you "just" have to redefine the constructor method so that it creates a dataset in the same format.
Make sure it works by loading your final .json file with the COCO API.

Hope it will help you
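As a rough illustration of the conversion described above, the sketch below assembles the three top-level lists of a COCO-style annotation file using only the standard library. `make_coco_dict`, the file name, and the category names are hypothetical, and real datasets may need further fields.

```python
import json


def make_coco_dict(images, annotations, category_names):
    """Assemble the three top-level lists of a COCO-style file.

    Category ids start at 1 in this sketch.
    """
    categories = [
        {"id": i, "name": name, "supercategory": "none"}
        for i, name in enumerate(category_names, start=1)
    ]
    return {"images": images, "annotations": annotations,
            "categories": categories}


# Hypothetical single-image dataset.
images = [{"id": 1, "file_name": "img_0001.png", "width": 640, "height": 480}]
annotations = [{
    "id": 1, "image_id": 1, "category_id": 1, "iscrowd": 0,
    "bbox": [10.0, 20.0, 30.0, 40.0],  # [x, y, width, height]
    "area": 1200.0,
    "segmentation": [[10.0, 20.0, 40.0, 20.0, 40.0, 60.0, 10.0, 60.0]],
}]
coco = make_coco_dict(images, annotations, ["tool", "organ"])
coco_json = json.dumps(coco)
```

Writing `coco_json` to a file and opening that file with `pycocotools.coco.COCO` is then the loading check mentioned above.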

@raninbowlalala

@francoto Thanks for your help, I converted my dataset to coco format successfully.

@francoto
Author

francoto commented Mar 7, 2018

I finally made it:

  • First, the bounding box coordinates in my dataset were wrong. I realized my mistake when I tried to visualize them using the pycocotools API (which, by the way, has no dedicated method for showing them by default).
  • Finally, I misunderstood the part about needing a 'background' class (for labelling every pixel that is not in another class), so I added one to my dataset, but json_dataset.py actually creates its own. Deleting my 'background' label from the dataset finally allowed me to start the training.
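Both mistakes listed above can be checked before training. The following is a sketch under stated assumptions: `find_annotation_problems` is a hypothetical helper, the 1-pixel threshold `min_size` and the names treated as "background" are choices made for this illustration.

```python
def find_annotation_problems(coco_dict, min_size=1.0):
    """Return a list of human-readable warnings for a COCO-style dict."""
    problems = []
    known_ids = set()
    for cat in coco_dict["categories"]:
        known_ids.add(cat["id"])
        # A hand-made background category is not needed in the file.
        if cat["name"].strip().lower() in ("background", "__background__"):
            problems.append(
                "category %r looks like a manual background class" % cat["name"])
    for ann in coco_dict["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        if w <= min_size or h <= min_size:
            problems.append(
                "annotation %s has a degenerate bbox %s" % (ann["id"], ann["bbox"]))
        if ann["category_id"] not in known_ids:
            problems.append(
                "annotation %s uses unknown category_id %s"
                % (ann["id"], ann["category_id"]))
    return problems
```

An empty result does not prove the file is correct, but any warning here is worth fixing before starting a training run.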

@francoto francoto closed this as completed Mar 7, 2018
@YanWang2014

Hi francoto,
I am also training Mask R-CNN on my own data, but I have a problem: the bbox precision is satisfactory (mAP 0.5+, mAR 0.6+), but the segmentation (mask) accuracy is poor (mAP 0.2, mAR 0.2). Did you achieve good performance on instance segmentation?

@francoto
Author

Hello @YanWang2014,
In my case, I got similar performances for bbox and mask (AP ~ 0.8).
My current dataset is quite small (~350 images for test and 40 images for validation) so I don't know if the number I gave is relevant.
Good luck for your task.

@mattifrind

I'm sorry, but I'm still struggling with training on a different number of classes. I have 2 classes in my annotation file, so I set the number of classes in my config file to 3. I added some lines to net.py to prevent the class-related layers from being loaded (after this line):

    if keyname in ('cls_score_w', 'cls_score_b', 'bbox_pred_w', 'bbox_pred_b'):
        logger.info('ignore: ' + keyname)
        continue

That way Detectron should not load the weights from these layers and leave them in the dimensions as configured in the .yaml file.
That's the only code I've changed but I still get the error: could not broadcast input array from shape (4) into shape (0)
@francoto How did you solve this problem or did you train from scratch?

I'm happy for any help.

@francoto
Author

francoto commented Apr 5, 2018

Hello @mattifrind !
From my perspective, you should let Detectron handle the configuration you describe in your .yaml file. I reused the weights of the models used in the getting_started/*.yaml examples.

I would say that you should not 'force' Detectron to forget about weights.
The only issue I had was that the class names displayed in the PDF results remained the 'old' ones: 'person', 'bicycle', etc.

@gabriellapizzuto

gabriellapizzuto commented Apr 5, 2018

@francoto Are you using inference to produce your PDF results? I was initially doing that, and infer_simple.py uses a dummy dataset, dummy_coco_dataset = dummy_datasets.get_coco_dataset(), which carries the COCO dataset labels. Also, when you get your bounding boxes, do they make sense? I get decent masks, but the bounding boxes are not around these masks.

@mattifrind

Hey @francoto! Thanks for your help.
I tried this because of a tip from Kaiming He in this issue. I tried to understand the code and found that the model structure defined in the .yaml file is overridden by the weights in the .pkl file. So if I configure 3 classes, the cls_score layer, for example, which should have 3 outputs, is replaced by the layer from the .pkl file with a dimension of 81. Am I wrong?
Unfortunately, I get errors with or without my code change in net.py.

@francoto
Author

francoto commented Apr 5, 2018

Hey @GabriellaP,
the commands I use are:
to train:

$ python2 tools/train_net.py \
--cfg configs/<custom_config>.yaml \
OUTPUT_DIR /tmp/detectron-output

to test:

$ python2 tools/infer_simple.py --cfg configs/<custom_config> \
--output-dir /tmp/detection-visualizations \
--image-ext png \
--wts /tmp/detectron-output/<output_train_directory>/generalized_rcnn/model_final.pkl \
demo # location of the images

I can't share my results publicly, but my bounding box locations and masks are quite good (I obviously have some errors, but considering my dataset is only ~350 images, I think it's pretty amazing). As I said, though, I still get the COCO dataset labels. I need to check the infer_simple.py file.

@francoto
Author

francoto commented Apr 6, 2018

Hey @mattifrind, from what I remember, the error could not broadcast input array from shape (4) into shape (0) happened in my case when the parameter cfg.MODEL.NUM_CLASSES did not match the values in clss in lib/roi_data/fast_rcnn.py. My guess is that even when you apply your fix to skip the weights of the classes you don't use, your data may still contain a class index greater than or equal to your cfg.MODEL.NUM_CLASSES.

For the record, here are the lines of code I am talking about (in lib/roi_data/fast_rcnn.py):

l.251 num_bbox_reg_classes = cfg.MODEL.NUM_CLASSES

l.256 bbox_targets = blob_utils.zeros((clss.size, 4 * num_bbox_reg_classes))

ll.260-270

inds = np.where(clss > 0)[0]
# print("DEBUG: inds value is {}".format(inds))
for ind in inds:
    cls = int(clss[ind])
    start = 4 * cls
    end = start + 4
    bbox_targets[ind, start:end] = bbox_target_data[ind, 1:]
    bbox_inside_weights[ind, start:end] = (1.0, 1.0, 1.0, 1.0)

The error occurs when cls is greater than or equal to cfg.MODEL.NUM_CLASSES.


Have you tried training without changing the weight-loading code?
Have you added a 'background' label to your dataset? In my case, I tried to add one manually and that was messing everything up.

Hope that may help you out,

@mattifrind

Hey @francoto, thanks for your help!
Without changing the code I get this error:

Traceback (most recent call last):
  File "tools/train_net.py", line 128, in <module>
    main()
  File "tools/train_net.py", line 110, in main
    checkpoints = utils.train.train_model()
  File "/home/ubuntu/detectron/lib/utils/train.py", line 58, in train_model
    setup_model_for_training(model, weights_file, output_dir)
  File "/home/ubuntu/detectron/lib/utils/train.py", line 161, in setup_model_for_training
    nu.initialize_gpu_from_weights_file(model, weights_file, gpu_id=0)
  File "/home/ubuntu/detectron/lib/utils/net.py", line 119, in initialize_gpu_from_weights_file
    src_blobs[src_name].shape)
AssertionError: Workspace blob cls_score_w with shape (3, 1024) does not match weights file shape (81, 1024)

Didn't you have this problem too when you changed the number of classes and used a pre-trained model?

With my change, I get the broadcast error. My dataset has no background class and my 2 categories have the indices 1 and 2 (I also tried 0 and 1, with the same result).
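The assertion above is a plain shape comparison between the blob the model allocates and the blob stored in the weights file. The sketch below only illustrates that comparison; it is not Detectron's loading code. `split_by_shape` is a hypothetical helper, and the `fc6_w` shape is invented for the example.

```python
import numpy as np


def split_by_shape(model_shapes, pretrained_blobs):
    """Separate pre-trained blobs into loadable and mismatched ones."""
    loadable, mismatched = {}, {}
    for name, array in pretrained_blobs.items():
        expected = model_shapes.get(name)
        if expected is None:
            continue  # the model has no blob with this name
        if tuple(array.shape) == tuple(expected):
            loadable[name] = array
        else:
            # Record (shape expected by the model, shape in the file).
            mismatched[name] = (tuple(expected), tuple(array.shape))
    return loadable, mismatched


# Hypothetical shapes modelled on the assertion message above.
model_shapes = {"cls_score_w": (3, 1024), "fc6_w": (1024, 256)}
pretrained = {
    "cls_score_w": np.zeros((81, 1024), dtype=np.float32),
    "fc6_w": np.zeros((1024, 256), dtype=np.float32),
}
loadable, mismatched = split_by_shape(model_shapes, pretrained)
print(mismatched)  # {'cls_score_w': ((3, 1024), (81, 1024))}
```

Listing the mismatched blobs this way shows which layers depend on the number of classes when a model trained on 81 classes is reused with a different NUM_CLASSES.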

@francoto
Author

francoto commented Apr 9, 2018

Hello @mattifrind, I haven't seen this kind of error, so I can't really help you with it.
Good luck 🤞

@ambigus9

ambigus9 commented Apr 17, 2018

@mattifrind and @francoto I got that error because I tried a pre-trained model with 81 classes. To fix it, I just used the ImageNet pre-trained model from the MODEL_ZOO.
Did you find any way to train without WEIGHTS? I tried WEIGHTS: '' (empty) and got AssertionError: Negative areas founds. Any ideas?

@ZSSNIKE

ZSSNIKE commented May 17, 2018

Did you solve the problem? I have run into the same one. @mattifrind Thanks in advance.

@mattifrind

@ZSSNIKE because I need to get my task done I stopped trying to fix that. It works for me with 81 classes as a workaround. Good luck!

@chenweisomebody126

@mattifrind How do you set 81 classes? I mean, only changing NUM_CLASSES to 81 is not enough, right? Do you also need to convert the annotations so that they contain 81 categories?

@mattifrind

@chenweisomebody126 Yes, the pre-trained models from Detectron have 81 classes, and so do the configuration files (.yaml). I wrote a Java program to convert my dataset to the COCO format. After the conversion, the program deletes 2 classes of the original COCO dataset and adds my two. That's how I train.

@vsd550

vsd550 commented Jun 18, 2018

@francoto I am getting exactly the same error as you:
ValueError: could not broadcast input array from shape (4) into shape (0)
My custom dataset has 4 classes and I have set NUM_CLASSES to 5. I have added the dataset to dataset_catalog.py and generated the JSON for the dataset. A sample annotation in the JSON file looks like the following:

{'id': 6, 'image_id': 1, 'category_id': 1, 'iscrowd': 0, 'area': 4674, 'bbox': [630.0, 482.0, 82.0, 59.0], 'segmentation': [[650.0, 540.5, 629.5, 540.0, 630.0, 483.5, 711.5, 482.0, 711.0, 538.5, 650.0, 540.5]], 'width': 1599, 'height': 1903}

You have written up the steps, but I can't understand them clearly. Can you please elaborate on the steps you took, i.e.:
"bounding box coordinates in my dataset were wrong": how were they wrong, and how did you correct them?
"Finally, I misunderstood the part where I need a 'background' class": how did you correct this part?

Thanks in advance

@francoto
Author

Hello @vsd550, it has been a while since I posted this, and I haven't used Detectron since I got my first results, but I will try to explain.

  • "bounding box coordinates in my dataset were wrong": as I said, I converted my custom dataset into COCO-like form on my own, and I was not using the correct parameters to compute the bounding box from the segmentation polygon (if I remember correctly, my bounding boxes were only 1 pixel high and wide).

  • "background": previously I was manually adding a 'background' class to my COCO dataset with id=0, but without any occurrence in the dataset. My problem was solved when I removed this 'background' class from the dataset I built. I think Detectron actually creates this background class at the very beginning of training, when it loads your dataset.

I hope this makes my steps clear (or clearer) for you.
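For the first point, one straightforward way to derive a box from a COCO-style polygon is to take the extremes of its vertices. This is a minimal sketch; `bbox_from_segmentation` is a hypothetical helper, and the polygon is the one from the sample annotation quoted earlier in this thread.

```python
def bbox_from_segmentation(segmentation):
    """Return [x, y, width, height] enclosing all polygons of one object.

    Each polygon is a flat list [x0, y0, x1, y1, ...] as in COCO.
    """
    xs = [v for poly in segmentation for v in poly[0::2]]
    ys = [v for poly in segmentation for v in poly[1::2]]
    x_min, y_min = min(xs), min(ys)
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]


# Polygon taken from the sample annotation quoted earlier in this thread.
polygon = [[650.0, 540.5, 629.5, 540.0, 630.0, 483.5, 711.5, 482.0,
            711.0, 538.5, 650.0, 540.5]]
print(bbox_from_segmentation(polygon))  # [629.5, 482.0, 82.0, 58.5]
```

The result is close to the `bbox` stored in that annotation, which suggests the boxes in that file are at least of plausible size.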

@JaosonMa

JaosonMa commented Aug 6, 2018

I ran into ERROR 1. After checking my data, I confirmed that my number of classes was 150, so NUM_CLASSES in the yaml file was 151. I know this is the right setting because I had trained successfully on other data this way before, yet I still got ERROR 1. I finally found that one of the 150 class names, "Bao_yan_sheng_chou_1750ml", was written in two ways, one of them with an extra leading space ' ', so the data actually contained 151 classes. After deleting the extra space, everything works.

(Sorry, my English is poor.)
So I think that in 80% of cases ERROR 1 is caused by a problem in your own data.
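A stray space of this kind is easy to detect automatically. The sketch below groups category names after stripping surrounding whitespace; `find_whitespace_duplicates` is a hypothetical helper and the third name is made up.

```python
from collections import defaultdict


def find_whitespace_duplicates(category_names):
    """Map each stripped name to its variants when more than one exists."""
    groups = defaultdict(list)
    for name in category_names:
        groups[name.strip()].append(name)
    return {key: names for key, names in groups.items()
            if len(set(names)) > 1}


names = ["Bao_yan_sheng_chou_1750ml", " Bao_yan_sheng_chou_1750ml", "other"]
duplicates = find_whitespace_duplicates(names)
print(duplicates)
```

Any non-empty result means the dataset contains more distinct labels than intended, which in turn makes NUM_CLASSES too small.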

@lssily

lssily commented Oct 7, 2018

Hello @YanWang2014,
In my case, I got similar performances for bbox and mask (AP ~ 0.8).
My current dataset is quite small (~350 images for test and 40 images for validation) so I don't know if the number I gave is relevant.
Good luck for your task.

Hi! I ran into the same problem when I trained on my custom dataset. The box AP is ~0.6, while the mask AP is ~0.5. Did you find the cause of this? I look forward to your reply!

@maiff

maiff commented Jun 15, 2019

@francoto

The only issue I got was that the name of the classes detected displayed in the pdf results remains the 'old' ones: 'person', 'bicycle', etc.

I have the same problem. Did you fix it? If so, how? Could you please tell me? Thanks.

@francoto
Author

Hello @maiff,
I actually found out that the category names were written directly in the file detectron/datasets/dummy_datasets.py, in the method get_coco_dataset(), so I just created my own get_custom_dataset() method with the category names I wanted. Then you update the file tools/infer_simple.py to use your new method. It did the trick for me.

(I'm still using an old version from January 2019)
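A custom method of that kind might look like the sketch below. It assumes the visualization code only reads a `classes` mapping from index to name, with index 0 reserved for the background; `SimpleNamespace` stands in for whatever container Detectron uses, and the class names are placeholders.

```python
from types import SimpleNamespace


def get_custom_dataset():
    """Dummy dataset exposing only a `classes` mapping (index -> name)."""
    # Hypothetical class names; index 0 is reserved for the background.
    classes = ["__background__", "class_a", "class_b", "class_c", "class_d"]
    return SimpleNamespace(
        classes={i: name for i, name in enumerate(classes)})


dummy_dataset = get_custom_dataset()
print(dummy_dataset.classes[1])  # class_a
```

The order of the names should follow the category order used during training, otherwise the labels drawn on the output will be shifted.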

Good luck :)

@maiff

maiff commented Jun 17, 2019

@francoto Thank you very much, I have solved it

@vaibhavkumar049

vaibhavkumar049 commented Sep 16, 2019

@francoto Which cloud service did you use, or do you have a GPU in your personal computer?

@francoto
Author

@vaibhavkumar049 I use my local GPU, which is a GeForce GTX 1080.
