Questions about reproducing results on COCO #36
Three observations to start with:
- For testing convergence, use a smaller resolution (224) and a larger batch size (128).
- I don't see any augmentations in your training code.
- Also, something looks wrong with your scheduler:
Thanks for your comment. I have tried to follow your instructions, and the new train.py is available at https://github.com/GhostWnd/reproducingASL; the newest version is train_ver3.py. I will try to run it and report later.

I have run train_ver3.py for around 600 iterations, and training appears to be much slower than at the start: at the beginning each iteration takes 3 seconds, but after 600 iterations each one takes 9 seconds. This puzzles me a lot, and I doubt whether I have implemented the code correctly.
I will take a look at the code and try to run it when I have the time. Good work so far; I think with joint forces we are on our way to finally having a modern multi-label codebase for the community to use. The vast majority of repos that exist are way outdated.

Several more corrections and suggestions:
- args.do_bottleneck_head = False (not True).
- You are using the 2017 split. While this is not a "mistake" (and your results will be a little higher), in articles people use the 2014 split.
- What about mixed precision? With modern PyTorch it is a few lines of code ("with autocast(): ...") to improve speed.
- You don't have to update the EMA every iteration. You can update it every ~5 iterations with a slightly higher decay rate and still get similar results.
- Load a pretrained model, run only inference, and make sure you reproduce the article results (after switching to the 2014 split).
- Make sure, especially in validation, that you are not building enormous vectors along the training that clog RAM. Sometimes it's better to pre-allocate memory if you need to store large vectors.
- You have not implemented true WD correctly. This is not AdamW.
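The EMA suggestion above can be sketched roughly as follows. This is a hypothetical illustration (the class and names are mine, not the actual repo code): when updating every `k` steps instead of every step, compounding the per-step decay as `decay**k` keeps a similar effective smoothing.

```python
def adjusted_decay(per_step_decay, update_every):
    # k skipped per-step EMA updates compound into one update with decay**k
    return per_step_decay ** update_every

class SimpleEMA:
    """Toy scalar-parameter EMA, updated only every `update_every` steps."""

    def __init__(self, params, decay=0.999, update_every=5):
        self.shadow = dict(params)          # shadow copy of the parameters
        self.decay = adjusted_decay(decay, update_every)
        self.update_every = update_every
        self.step = 0

    def update(self, params):
        self.step += 1
        if self.step % self.update_every:   # skip most iterations (cheap)
            return
        d = self.decay
        for name, value in params.items():
            # standard EMA rule, applied once per update_every steps
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value
```

With real PyTorch tensors the update loop would iterate over `model.state_dict()` instead of a plain dict, but the decay arithmetic is the same.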
Thank you for your comment, I will try to correct it.

I have tried to correct true WD; it's now train_ver4.py, available at https://github.com/GhostWnd/reproducingASL.
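For reference, the "true WD" point boils down to decoupled weight decay: AdamW subtracts `lr * wd * param` directly from the weight instead of folding an L2 term into the gradient the way plain Adam + L2 does. A minimal scalar sketch (my own illustration, not the repo's optimizer code):

```python
def adamw_step(param, grad, state, lr=2e-4, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=1e-4):
    # state = (first moment m, second moment v, step count t)
    m, v, t = state
    t += 1
    m = betas[0] * m + (1 - betas[0]) * grad
    v = betas[1] * v + (1 - betas[1]) * grad * grad
    m_hat = m / (1 - betas[0] ** t)          # bias correction
    v_hat = v / (1 - betas[1] ** t)
    # decoupled weight decay: applied to the weight itself, NOT to the gradient
    param = param - lr * weight_decay * param
    param = param - lr * m_hat / (v_hat ** 0.5 + eps)
    return param, (m, v, t)
```

In practice one would simply use torch.optim.AdamW rather than hand-rolling this.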
Hi GhostWnd, I took a deeper look at the code. There are several major problems there. Don't get discouraged, we are making progress, and sometimes the journey is more educational than the destination. Problems:
mAP_score = validate_multi(val_loader, model, args, ema)
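On the earlier point about validation clogging RAM: instead of appending every batch's outputs to an ever-growing Python list, one can pre-allocate a single buffer and fill it in place. A minimal sketch with assumed names (with PyTorch tensors one would also call `.detach().cpu()` before storing, so no autograd graph is kept alive):

```python
import numpy as np

def collect_scores(batches, num_samples, num_classes):
    # Pre-allocate the full output buffer up front instead of growing a list
    # batch by batch, which fragments memory over a long validation run.
    scores = np.empty((num_samples, num_classes), dtype=np.float32)
    pos = 0
    for batch in batches:                    # each batch: (B, num_classes)
        b = np.asarray(batch, dtype=np.float32)
        scores[pos:pos + len(b)] = b         # write into the buffer in place
        pos += len(b)
    return scores[:pos]                      # trim in case the last batch is short
```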
Just to give you motivation: I got a good score last night when running a corrected code...
Thank you for your comment and effort, I will try to correct the code and run it.

I have tried to fix the problems you mentioned; my code is train_ver5.py, available at https://github.com/GhostWnd/reproducingASL. Other than train_ver5.py, I also edited helper_functions.py to allow me to use the 2014 JSON to train on 2017 data:

path = coco.loadImgs(img_id)[0]['file_name']

When I try to use the 2014 JSON to train on 2017 data, it seems that at validation time some images are in the 2014 validation set but not in the 2017 validation set. I would like to know: does the difference between 2014 and 2017 affect the result much?
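The split mismatch comes from the fact that 2014-style file names carry a `COCO_val2014_` prefix while 2017-style names are just the zero-padded image id, and the 2017 release re-split the same images, so a 2014-val image may live under train2017. One hedged way to bridge this (a hypothetical helper, not the actual helper_functions.py code) is to strip the prefix and probe each split directory:

```python
import os

def resolve_image_path(file_name, roots=("train2017", "val2017"), base="."):
    # 2014 names look like COCO_val2014_000000000139.jpg; 2017 names are just
    # 000000000139.jpg. Strip any prefix, then try each split directory in turn.
    stem = file_name.split("_")[-1]
    for root in roots:
        path = os.path.join(base, root, stem)
        if os.path.exists(path):
            return path
    return None  # image not present under any of the given roots
```

As for the result: since 2014 and 2017 contain the same underlying images, only re-partitioned, numbers trained/evaluated on the two splits are close but not directly comparable, which is why the papers stick to 2014.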
Sorry to bother you again. I know that due to commercial issues you can't release your training code, but could you release the code you corrected based on my train.py? If that is not possible, could you please release the loss record of your corrected run based on my training code, so that I can compare the results myself? Thank you very much.
Hi GhostWnd, there were other problems in the code.
Anyway, this code fully reproduces the article results (I think it even surpasses them). I will attach logs for the 224 and 448 trainings later; you are welcome to test it yourself and give me feedback. Thanks for the collaboration; together we will release the first publicly available modern multi-label code.
Thank you so much! :-)
This is an example log file (notice: resolution 224, mtresnet).
Do you have any objection if I add the code to this repo as well? I will of course share credit with you; I made a lot of changes and enhancements to the code, but you provided the base implementation.
No objection, it's my pleasure, thank you very much. |
And I wonder whether you could add the model based on tresnet_m with input size 224 to your pretrained models in https://github.com/Alibaba-MIIL/ASL/blob/main/MODEL_ZOO.md? I would like to adjust some hyperparameters to test their influence.
I am not sure I fully understand your question. Models in
Well, if I'm not mistaken, I just wonder whether you could upload the model you trained with tresnet_m and input size 224 to https://github.com/Alibaba-MIIL/ASL/blob/main/MODEL_ZOO.md.
Or could you please share the model you trained on tresnet_m with input size 224 with me? I would like to adjust some hyperparameters to test their influence.
Just to be clear:
Yes, the one that produces the log file mtresnet_224.txt.
Thank you very much.
Can you attach logs for 448 resolution with tresnet_l using this training code? I found it hard to reproduce the 86.8 mAP result from the paper.
Hello, I tried to reproduce the results on COCO. I implemented my own framework; most of my files are the same as yours, and I only wrote my own new train.py.
As introduced in your paper, I have implemented EMA with decay 0.999, the 1cycle policy with max learning rate 2e-4, the Adam optimizer with weight_decay 1e-4, img_size = 448*448 and batch size = 16.
But when I train my model, the loss decreases from 120 to around 90 and then just stops decreasing, and the performance on the validation data is very bad, with an mAP around 10. At first I guessed it was because I hadn't spent much time training (I only trained for an hour), but when I try to train longer the loss still doesn't decrease. Could you please tell me what I have done wrong?
My code is available at https://github.com/GhostWnd/reproducingASL; thank you for your help.
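For concreteness, the 1cycle policy mentioned above follows a warm-up-then-anneal learning-rate curve. The sketch below is my own illustration of that schedule; the default factors are assumed to roughly mirror `torch.optim.lr_scheduler.OneCycleLR` and are not taken from the repo:

```python
import math

def one_cycle_lr(step, total_steps, max_lr=2e-4,
                 pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    # Cosine-annealed 1cycle: warm up from max_lr/div_factor to max_lr over the
    # first pct_start of training, then anneal down to start_lr/final_div_factor.
    warm = int(pct_start * total_steps)
    start_lr = max_lr / div_factor
    end_lr = start_lr / final_div_factor
    if step < warm:
        t = step / max(1, warm)
        return start_lr + (max_lr - start_lr) * (1 - math.cos(math.pi * t)) / 2
    t = (step - warm) / max(1, total_steps - warm)
    return end_lr + (max_lr - end_lr) * (1 + math.cos(math.pi * t)) / 2
```

In real training code one would construct `OneCycleLR(optimizer, max_lr=2e-4, steps_per_epoch=..., epochs=...)` and call `scheduler.step()` once per batch; stepping it once per epoch instead is a common bug that leaves the learning rate stuck near its initial value.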