
Failing to converge on small datasets (Getting zeros on small custom data) #125

Open
eslambakr opened this issue Jun 30, 2020 · 16 comments

@eslambakr

❓ How to do something using DETR

Hello All,

I am using DETR on custom data, which contains 2k images for training. I followed the suggestion proposed in #9 to fine-tune instead of training from scratch, to avoid getting zeros, and I succeeded in achieving comparable accuracy.

But when I tried to train from scratch using the default configuration in main.py, I got zeros for the first 300 epochs (and counting). Should I wait for more epochs? It seems very strange to me.
So what do you think I should do to get good accuracy from scratch?
Is this a limitation of DETR, due to the fact that transformers need more data to converge? I think we should have some tricks to overcome this :D
Another question: if there is no hope of training on data this small, what is the minimum dataset size at which DETR is proven to work properly?

Final note: I am posting a new issue because #9 contains other questions that are irrelevant here; I opened this one after reading the whole thread there.

Thanks for sharing your amazing work with the community. I hope to be able to give back and contribute to it.

@alcinos
Contributor

alcinos commented Jun 30, 2020

Hi @eslambakr
Thanks for your interest in DETR.

2k does sound too small to me; we had success with 10-15k but never tried smaller than that.
It's a bit difficult to know what's going on. You could check the predictions to see whether the model is doing anything at all (on both test and train images). I'd also look at the train/test losses for signs of divergence (the most likely explanation here). I wouldn't rule out the possibility of a bug either, especially if your mAP is exactly 0.
Finally, I'd like to point out that the important metric is not really the number of epochs but rather the number of updates. Since your dataset is about 50x smaller than COCO, one COCO epoch corresponds to 50 epochs on your dataset. In other words, it's as if you had trained for 6 epochs so far.

Hope this helps.
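For reference, a rough sketch of that equivalence (the COCO train2017 size here is an assumption, and your exact image count may differ):

```python
# Rough "COCO-equivalent epochs" calculation (illustrative only).
coco_train_images = 118_287  # COCO train2017 size (assumed)
custom_images = 2_000
epochs_on_custom = 300

# With the same batch size, the number of updates scales with dataset size.
equivalent = epochs_on_custom * custom_images / coco_train_images
print(f"~{equivalent:.1f} COCO-equivalent epochs")  # ~5.1
```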

@eslambakr
Author

Ah, I understand your point. Thanks for the clarification.
I will do more analysis and update you with my observations, to benefit others who may be stuck on the same issue.

@m-klasen

@eslambakr
Below are some of my experiments with a ~2k-image dataset featuring only 4 classes; my best result exceeded detectron2's Mask R-CNN ResNet-50 FPN by ~5% mAP. If you have further questions, please feel free to ask.
[image: training results]

@eslambakr
Author

Thanks for sharing your results, but I am wondering:
1- Is the x-axis in epochs? :D Do you mean you trained the model for only 50 epochs?
2- If yes, I wonder how you achieved that while training from scratch, without loading any weights. Did you change the default arguments?
3- The class error is stuck at 100. Did you face this too, or, from your experience, do you have an explanation for it?

I trained for almost 600 epochs and I am still getting zeros, which is weird to me: I have trained other models on the same data in the same format, so I don't think there is an error in my dataset.
Unfortunately, I didn't know I had to set output_dir to get output logs and weights (I thought it was on by default), so I couldn't draw training curves or test the model on images to debug this behavior. I will rerun the experiment with it set and update you. I will also change the number-of-classes variable to 2, since my data has only one class, and change num_queries to 30 to make it easier for the model (see the num_queries check sketched below). But I am asking you because your results are impressive.

Thanks in advance.
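A side note on lowering num_queries (a hedged observation from the model's design, not from these logs): each query can emit at most one box, so num_queries needs to comfortably exceed the maximum number of objects in any single image. A quick check against a COCO-format annotation file (the path here is hypothetical):

```python
import json
from collections import Counter

with open("annotations/train.json") as f:  # hypothetical path
    coco = json.load(f)

# Count ground-truth boxes per image; num_queries must exceed the maximum.
boxes_per_image = Counter(ann["image_id"] for ann in coco["annotations"])
print("max objects in one image:", max(boxes_per_image.values()))
```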

@fmassa
Contributor

fmassa commented Jul 2, 2020

@eslambakr I believe @mlk1337 is fine-tuning his model from a model pre-trained on COCO.

@m-klasen

m-klasen commented Jul 2, 2020

Hi @eslambakr,
I wrote a small gist on how I trained my model starting from the COCO weights. Here.
Hope this helps.
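For readers who cannot open the gist, a minimal sketch of the usual recipe from #9 (not necessarily identical to the gist): load the public COCO checkpoint, drop the class-dependent head, and fine-tune the rest. The num_classes value here is hypothetical, for a 1-class dataset.

```python
import torch

# Build DETR-R50 for a custom class count via the repo's hubconf
# (num_classes = max class id + 1; 2 is hypothetical for a 1-class dataset).
model = torch.hub.load("facebookresearch/detr", "detr_resnet50",
                       pretrained=False, num_classes=2)

# Public COCO-pretrained checkpoint released with the repo.
url = "https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth"
checkpoint = torch.hub.load_state_dict_from_url(url, map_location="cpu")

# The classification head's shape depends on num_classes, so drop it and
# load everything else; strict=False keeps the freshly initialized head.
state_dict = checkpoint["model"]
for key in ("class_embed.weight", "class_embed.bias"):
    state_dict.pop(key)
model.load_state_dict(state_dict, strict=False)
```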

@unanan

unanan commented Jul 3, 2020

> 3- The class error is stuck at 100. Did you face this too, or, from your experience, do you have an explanation for it?

If you mean the abnormally high class_error, you can check this reply in #41.
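For context, the class_error reported in DETR's logs is 100 minus the top-1 classification accuracy over the queries matched to ground-truth boxes, so a value stuck at 100 means every matched query is misclassified (typically everything is predicted as "no object"). A minimal sketch of the computation (paraphrasing SetCriterion.loss_labels, not a verbatim copy):

```python
import torch

def class_error(matched_logits: torch.Tensor, target_classes: torch.Tensor) -> float:
    """100 - top-1 accuracy on the queries matched to ground-truth boxes."""
    pred = matched_logits.argmax(dim=-1)
    accuracy = (pred == target_classes).float().mean().item() * 100
    return 100.0 - accuracy

# Example: every matched query predicts the "no object" class -> error of 100.
logits = torch.zeros(4, 3)
logits[:, 2] = 5.0                    # index 2 plays "no object" here
targets = torch.tensor([0, 0, 1, 1])  # real class ids
print(class_error(logits, targets))   # 100.0
```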

@eslambakr
Author

Hello @alcinos @fmassa,
Here are my results using fine-tuning on my custom dataset (2k images):
[image: fine-tuning training curves]
For me this is a very good result :D

And this is the result of training from scratch. I think it is quite bad, so do you have any ideas for making DETR converge on small datasets? Or, from the graphs, do you think I have to tune any hyper-parameters?
Note: I think I made a mistake in this experiment by keeping args.lr_drop=200; I will rerun after setting it to 700 (see the scheduler sketch below).
[image: from-scratch training curves]
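For context, lr_drop in main.py is the step size of a StepLR schedule, which multiplies the learning rate by 0.1 every lr_drop epochs, so with lr_drop=200 the drop landed early in a 600-epoch run. A rough sketch following the repo's main.py (the stand-in model is hypothetical):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the DETR model

# As in DETR's main.py: StepLR scales the LR by the default gamma of 0.1
# every `lr_drop` epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200)

for epoch in range(300):
    # train_one_epoch(...) would run here
    lr_scheduler.step()
    if epoch in (198, 199):
        print(epoch, lr_scheduler.get_last_lr())  # 1e-4, then 1e-5 after 200 steps
```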

@sompjang

sompjang commented Jul 10, 2020

I got similar results on small datasets. I tried several different configurations, but with no success.

I trained on a dataset of 560 images.

params:

lr_backbone = 1e-5
lr = 1e-2
weight_decay = 1e-4
epochs = 1200
lr_drop = 400
num_queries = 20
num_classes = 1
batch_size = 2

[image: training curves]

@fmassa
Contributor

fmassa commented Jul 10, 2020

@sompjang please try fine-tuning instead of training from scratch; I'm afraid training on 560 images from scratch might suffer from severe overfitting.

@sompjang

> @sompjang please try fine-tuning instead of training from scratch; I'm afraid training on 560 images from scratch might suffer from severe overfitting.

@fmassa Thanks for your answer. After fine-tuning, the results look much better. Are there any recommendations on dataset size?

@guysoft

guysoft commented Aug 10, 2020

How are you plotting the loss functions?
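For anyone else wondering: when --output_dir is set, DETR appends one JSON object per epoch to log.txt, and the repo also ships util/plot_utils.py for plotting such files. A hand-rolled minimal version (the path is hypothetical, and the field names are assumed to match what main.py writes):

```python
import json
from pathlib import Path

import matplotlib.pyplot as plt

# DETR appends one JSON dict per epoch to <output_dir>/log.txt.
log_file = Path("outputs/log.txt")  # hypothetical output_dir
records = [json.loads(line) for line in log_file.read_text().splitlines()]

epochs = [r["epoch"] for r in records]
for field in ("train_loss", "test_loss"):  # assumed key names
    plt.plot(epochs, [r[field] for r in records], label=field)

plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```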

@fmassa
Contributor

fmassa commented Aug 13, 2020

@cyy21

cyy21 commented Nov 26, 2020

@sompjang Hi, I got the same problem: do you also see the network producing identical outputs for every input? After you followed the fine-tuning advice, did the outputs improve? How many classes are there in your dataset?

@azamshoaib

azamshoaib commented Dec 16, 2020

> Hi @eslambakr,
> I wrote a small gist on how I trained my model starting from the COCO weights. Here.
> Hope this helps.

@m-klasen The link to your gist is not working. Can you please provide the link? I am training my network from scratch and it is not converging; any insights into my problem would be very helpful. Thank you.

@eslambakr did you solve your issue with training the network from scratch?

@Flyooofly

> @sompjang please try fine-tuning instead of training from scratch; I'm afraid training on 560 images from scratch might suffer from severe overfitting.

Hello, I used the pre-trained model of the DETR architecture provided in DETReg (https://github.com/amirbar/DETReg) for fine-tuning. I fine-tuned on about 1000 images for 50 epochs (I modified num_classes to be the max class id + 1), but all metrics are still 0. Can you help me find out why? Thanks.
[image: evaluation metrics]
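One thing worth double-checking in this situation (a hedged note, since the logs are not visible): DETR builds its classification head as Linear(hidden_dim, num_classes + 1), where the extra slot is the "no object" class, so num_classes must be at least max class id + 1. A quick sanity check against a COCO-format annotation file (hypothetical path):

```python
import json

with open("annotations/train.json") as f:  # hypothetical path
    coco = json.load(f)

# DETR expects num_classes >= max category id + 1; the model itself adds
# one extra logit for the "no object" class on top of that.
max_id = max(cat["id"] for cat in coco["categories"])
print(f"max category id = {max_id} -> num_classes should be {max_id + 1}")
```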
