Questions about reproducing results on COCO #36

Closed
GhostWnd opened this issue Jan 31, 2021 · 24 comments

Comments

@GhostWnd

GhostWnd commented Jan 31, 2021

Hello, I tried to reproduce the results on COCO. I implemented my own framework; most of my files are the same as yours, and I only wrote a new train.py.
As described in your paper, I have implemented EMA with decay 0.999, the 1-cycle policy with a max learning rate of 2e-4, the Adam optimizer with weight_decay 1e-4, img_size = 448*448, and batch size 16.

But when I train my model, the loss decreases from 120 to around 90 and then just stops decreasing, and the performance on the validation data is very bad: the mAP is around 10. At first I guessed it was because I hadn't spent much time training (I only trained for an hour), but when I train it longer, the loss still doesn't decrease. Could you please tell me what I have done wrong?

My code is available at https://github.com/GhostWnd/reproducingASL. Thank you for your help.

@mrT23
Contributor

mrT23 commented Jan 31, 2021

Three observations to start with:

For testing convergence, use a smaller resolution (224) and a larger batch size (128).

I don't see any augmentations in your training code.
Use RandAugment or AutoAugment at least, plus cutout.
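For illustration, a minimal torchvision pipeline along those lines could look like this (a sketch only; the op count and magnitude are assumptions, not the authors' settings, and args.image_size is a placeholder):

import torchvision.transforms as T

# assumed pipeline: RandAugment plus a cutout-style RandomErasing;
# the magnitudes here are illustrative, not tuned values
train_transform = T.Compose([
    T.Resize((args.image_size, args.image_size)),
    T.RandAugment(num_ops=2, magnitude=9),   # or T.AutoAugment()
    T.ToTensor(),
    T.RandomErasing(p=0.5),                  # cutout-style occlusion
])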

Also, something is weird with your scheduler:
scheduler = lr_scheduler.OneCycleLR(optimizer, max_lr = 0.0002, total_steps = total_step, epochs = 25)
It's hardcoded to 25 epochs, yet you loop over only 5 epochs.
Add epochs as a hyperparameter to the arg list, and use it everywhere instead of hard-coded numbers. Search for other hyper-parameters that should belong in the arg list as well.
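A sketch of what that could look like (argument names and defaults are assumptions; optimizer and train_loader are assumed to already exist):

import argparse
from torch.optim import lr_scheduler

parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=25)
parser.add_argument('--lr', type=float, default=2e-4)
args = parser.parse_args()

# one schedule length, derived from the same args the training loop uses
scheduler = lr_scheduler.OneCycleLR(optimizer, max_lr=args.lr,
                                    steps_per_epoch=len(train_loader),
                                    epochs=args.epochs)

for epoch in range(args.epochs):   # the same epoch count everywhere
    ...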

@mrT23
Contributor

mrT23 commented Jan 31, 2021

p.s. 1
Also, for testing and prototyping, use tresnet_m.

p.s. 2
You also need to implement "true weight-decay" (not applying weight decay to bias and batch-norm parameters).

p.s. 3
I will probably notice other problems in the future, but we need to start somewhere :-)

@GhostWnd
Author

GhostWnd commented Jan 31, 2021

Thanks for your comment. I have tried to follow your instructions; the new train.py is available at https://github.com/GhostWnd/reproducingASL, and the newest one is train_ver3.py. I will try to run it and report back later.

I have run train_ver3.py for around 600 iterations, and it appears that training is much slower than at the beginning: at first it took 3 seconds per iteration, but after 600 iterations it takes 9 seconds per iteration. This puzzles me a lot; I doubt whether I have implemented the code correctly.

@mrT23
Contributor

mrT23 commented Feb 1, 2021

I will take a look at the code and try to run it when I have the time.

Good work so far. I think with joint forces we are on our way to finally having a modern multi-label codebase for the community to use; the vast majority of repos out there are way outdated.

Several more corrections and suggestions:

args.do_bottleneck_head = False (not True)

One more correction: you are using the 2017 split. While this is not a "mistake" (and your results will be a little higher), in articles people use the 2014 split.

What about mixed precision? With modern PyTorch it is a few lines of code ("with autocast():"...).
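A minimal sketch of that pattern with torch.cuda.amp (the surrounding train step is assumed):

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, target in train_loader:
    optimizer.zero_grad()
    with autocast():                       # forward pass in mixed precision
        output = model(inputs.cuda())
        loss = criterion(output, target.cuda())
    scaler.scale(loss).backward()          # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()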

To improve speed, you don't have to update EMA every iteration. You can update it every ~5 iterations with a slightly higher decay rate and still get similar results.
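For example, a sparse EMA update might look like this (a hypothetical sketch; ema_model is assumed to be a separate copy of the model, and the adjusted factor 0.999 ** N only roughly matches a per-step decay of 0.999):

import torch

ema_every = 5
decay = 0.999 ** ema_every                 # ~0.995 per update
if step % ema_every == 0:
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1 - decay)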

Load a pretrained model, run only inference, and make sure you reproduce the article results (after switching to the 2014 split).

Make sure, especially in validation, that you are not building enormous vectors over the course of training that clog RAM. Sometimes it's better to pre-allocate memory if you need to store large vectors.

You have not implemented true WD correctly; this is not AdamW.
See an example of true WD in:
https://github.com/rwightman/pytorch-image-models/blob/198f6ea0f3dae13f041f3ea5880dd79089b60d61/timm/optim/optim_factory.py
(def add_weight_decay...)
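The pattern in that function is roughly the following (a condensed sketch, not the exact timm code; model and args are assumed):

import torch

def add_weight_decay(model, weight_decay=1e-4, skip_list=()):
    # split the parameters: no decay for biases and 1-D params (batch norm)
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if len(param.shape) == 1 or name.endswith('.bias') or name in skip_list:
            no_decay.append(param)
        else:
            decay.append(param)
    return [{'params': no_decay, 'weight_decay': 0.0},
            {'params': decay, 'weight_decay': weight_decay}]

# weight decay then comes only from the param groups, not an optimizer default
optimizer = torch.optim.Adam(add_weight_decay(model, 1e-4), lr=args.lr)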

@GhostWnd
Author

GhostWnd commented Feb 1, 2021

Thank you for your comment, I will try to correct it.
And if it's possible, could you please release the loss record from running my code? Raw data would be best.
Thank you very much.

I have tried to correct the true WD; it's now train_ver4.py, available at https://github.com/GhostWnd/reproducingASL

@mrT23
Contributor

mrT23 commented Feb 1, 2021

Hi GhostWnd

I took a deeper look at the code. There are several major problems there.
Make sure you understand what the problem is in each and every one of them, and apply the proper corrections. Don't skip a single one.
Most of these problems are "deal-breakers".
After correcting all of them, repeat your runs, and we can compare results.
I hope I will have some results to compare by then (if I don't find more bugs).

Don't get discouraged, we are making progress, and sometimes the journey is more educational than the destination.

Problems:

  • currently not using RandAugment (commented out in train_loader)

  • using an uninitialized model (for training and comparison to the article, you should initialize the model from the relevant ImageNet model: https://github.com/Alibaba-MIIL/TResNet/blob/master/MODEL_ZOO.md)

  • using the 2017 COCO split is wrong (use the 2014 COCO split instead; only the json files differ)

  • Cutout(n_holes = 1, length = 16) -> Cutout(n_holes = 1, length = args.image_size/2)

  • validation should be done once per epoch, no more and no less

  • preds.append(output.cpu())
    targets.append(target.cpu())
    ->
    preds.append(output.cpu().detach())
    targets.append(target.cpu().detach())

  • mAP_score = validate_multi(val_loader, model, args, ema)
    ->
    model.eval()
    mAP_score = validate_multi(val_loader, model, args, ema)
    model.train()

  • calculate only the mAP metric; remove the other metrics, they are only confusing during training
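Combining the last three items, the validation step might look roughly like this (a sketch only; the mAP helper name follows the repo's helper_functions, everything else here is assumed):

import torch

model.eval()                          # eval mode, once per epoch
preds, targs = [], []
with torch.no_grad():                 # no graph kept around to clog memory
    for inputs, target in val_loader:
        output = torch.sigmoid(model(inputs.cuda()))
        preds.append(output.cpu().detach())
        targs.append(target.cpu().detach())
mAP_score = mAP(torch.cat(targs).numpy(), torch.cat(preds).numpy())
model.train()                         # back to train mode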

@mrT23 mrT23 closed this as completed Feb 1, 2021
@mrT23 mrT23 reopened this Feb 1, 2021
@mrT23
Contributor

mrT23 commented Feb 2, 2021

Just to give you some motivation: I got a good score last night when running a corrected version of the code...

@GhostWnd
Author

GhostWnd commented Feb 2, 2021

Thank you for your comment and effort, I will try to correct the code and run it.
Thank you very much.

I have tried to fix the problems you mentioned; my code is train_ver5.py, available at https://github.com/GhostWnd/reproducingASL

Other than train_ver5.py, I also edited helper_functions.py to allow me to use the 2014 json to train on the 2017 data.
Here is the change:

path = coco.loadImgs(img_id)[0]['file_name']
img = Image.open(os.path.join(self.root, path)).convert('RGB')
->
path = coco.loadImgs(img_id)[0]['file_name']
path = path.split('_')[-1]  # strip the 'COCO_val2014_'-style prefix so names match the 2017 files
img = Image.open(os.path.join(self.root, path)).convert('RGB')

When I try to use the 2014 json to train on the 2017 data, it seems that during validation there are some images in the 2014 validation set that are not in the 2017 validation set. I would like to know: does the difference between 2014 and 2017 affect the results much?
Thank you very much.

@GhostWnd
Author

GhostWnd commented Feb 3, 2021

Sorry to bother you again. I know that due to commercial issues you can't release your training code, but could you release the code you corrected based on my train.py?

If that is not possible, could you please release the loss record of your corrected code based on my training code, so that I can compare the results myself?

Thank you very much.

@mrT23
Contributor

mrT23 commented Feb 3, 2021

Hi GhostWnd

There were other problems in the code.
The two major ones:

  • sigmoid was done twice (!) - once in the direct prediction, and a second time inside the loss (see the sketch below)
  • EMA was not performed correctly (it's a separate model with a separate validation)
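Schematically, the double-sigmoid bug and its fix (a sketch; criterion stands for the repo's AsymmetricLoss, which applies sigmoid internally, and the surrounding train step is assumed):

# buggy: sigmoid applied twice -- once here, once inside the loss
output = torch.sigmoid(model(inputs))
loss = criterion(output, target)      # ASL already applies torch.sigmoid internally

# fixed: feed raw logits to the loss; apply sigmoid only for metrics
logits = model(inputs)
loss = criterion(logits, target)
probs = torch.sigmoid(logits)         # used for mAP only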

Anyway, this code fully reproduces the article results (I think it even surpasses them):
train_asl_reproduce.zip

I will attach logs for the 224 and 448 trainings later.

You are welcome to test it yourself and give me feedback.

Thanks for the collaboration; together we will release the first publicly available modern multi-label codebase
:-)

@GhostWnd
Author

GhostWnd commented Feb 3, 2021

Thank you so much! :-)
I will upload the train file to make it publicly available, test it myself, and give you feedback as soon as possible.

@mrT23
Contributor

mrT23 commented Feb 3, 2021

This is an example log file (note: resolution 224, mtresnet):
mtresnet_224.txt

@mrT23
Contributor

mrT23 commented Feb 3, 2021

Thank you so much! :-)
I will upload the train file to make it publicly available, test it myself, and give you feedback as soon as possible.

Do you have any objection to me adding the code to
https://github.com/Alibaba-MIIL/ASL ?
I think it will help it gain more traction. There are very few (zero) modern multi-label code-bases like this with top results.

I will of course share credit with you. I made a lot of changes and enhancements to the code, but you provided the base implementation.

@GhostWnd
Author

GhostWnd commented Feb 3, 2021

No objection, it's my pleasure, thank you very much.

@GhostWnd
Author

GhostWnd commented Feb 3, 2021

And I wonder whether you could add the model based on tresnet_m with input size 224 to the pretrained models in https://github.com/Alibaba-MIIL/ASL/blob/main/MODEL_ZOO.md?

I would like to adjust some hyperparameters to test their influence,
and apply it to other datasets as a pretrained model.
Thank you very much.

@mrT23
Contributor

mrT23 commented Feb 3, 2021

I am not sure I fully understand your question.

The models in
https://github.com/Alibaba-MIIL/ASL/blob/main/MODEL_ZOO.md
are standard ImageNet models for downstream tasks. Those are the models you should use to initialize training on COCO.

@GhostWnd
Author

GhostWnd commented Feb 3, 2021

Well, if I am not mistaken,
the models in ASL/blob/main/MODEL_ZOO.md are models trained on MS-COCO (the link is https://github.com/Alibaba-MIIL/ASL/blob/main/MODEL_ZOO.md),
while the models in TResNet/blob/master/MODEL_ZOO.md are standard ImageNet models (the link is https://github.com/Alibaba-MIIL/TResNet/blob/master/MODEL_ZOO.md), right?

I just wonder whether you could upload the model you trained with tresnet_m and input size 224 to https://github.com/Alibaba-MIIL/ASL/blob/main/MODEL_ZOO.md, the ASL/blob/main/MODEL_ZOO.md one.

@GhostWnd
Author

GhostWnd commented Feb 3, 2021

Or could you please share the model you trained with tresnet_m and input size 224 with me?

I would like to adjust some hyperparameters to test their influence,
and apply it to other datasets as a pretrained model.
Thank you very much.

@mrT23
Contributor

mrT23 commented Feb 3, 2021

Just to be clear:
a tresnet_m 224 model trained on MS-COCO?

@GhostWnd
Author

GhostWnd commented Feb 3, 2021

Yes, the one that produces the log file mtresnet_224.txt.

@mrT23
Contributor

mrT23 commented Feb 4, 2021

@GhostWnd
Author

GhostWnd commented Feb 4, 2021

Thank you very much.

@LOOKCC

LOOKCC commented Apr 8, 2021

This is an example log file (note: resolution 224, mtresnet):
mtresnet_224.txt

Can you attach logs for 448 resolution with tresnet_l using this training code? I found it hard to reproduce the 86.8 mAP result from the paper.

@mrT23
Contributor

mrT23 commented Apr 8, 2021
