
About GOT-10k train and test #14

Open · wjc0602 opened this issue Nov 9, 2021 · 6 comments
wjc0602 commented Nov 9, 2021

Thanks for your excellent work. I ran into a problem while trying to reproduce the results in the paper. When I trained with GOT-10k alone, the IoU stayed at about 0.38 and did not improve. Could something be wrong with my configuration?

fzh0917 (Owner) commented Nov 10, 2021

The hyper-parameters batch_size and start_lr are important for training. What are your batch_size and start_lr? 32 and 1e-6 are recommended.
If you cannot set batch_size to 32 due to hardware limitations or something else, you can try increasing start_lr to 1e-2.

Good luck!
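(For reference, a minimal sketch of overriding these two values programmatically with a yacs-style config, which is what the video_analyst framework underlying STMTrack uses. The key paths and the YAML path below are illustrative assumptions, not the repository's actual schema; check the experiment file for the real names.)

```python
# A minimal sketch, assuming a yacs-style config as in video_analyst.
# The key paths below are hypothetical placeholders -- look up the real
# ones in the stmtrack-googlenet-trn YAML before relying on this.
from yacs.config import CfgNode as CN

cfg = CN(new_allowed=True)
cfg.merge_from_file("experiments/stmtrack/train/got10k/stmtrack-googlenet-trn.yaml")

cfg.train.batch_size = 32        # recommended batch size
cfg.train.optim.start_lr = 1e-6  # recommended start_lr; try 1e-2 if batch_size must be smaller
cfg.freeze()
```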

wjc0602 (Author) commented Nov 10, 2021

> The hyper-parameters batch_size and start_lr are important for training. What are your batch_size and start_lr? 32 and 1e-6 are recommended. If you cannot set batch_size to 32 due to hardware limitations or something else, you can try increasing start_lr to 1e-2.
>
> Good luck!

Thanks for your reply. I set batch_size to 32 and start_lr to 1e-2, and I didn't change any other settings in 'stmtrack-googlenet-trn' in the got10k folder. I trained on 3 Tesla V100s. Could the problem be related to my hardware setup?

fzh0917 (Owner) commented Nov 10, 2021

Can the model converge if you use two GPUs?

hekaijie123 commented
I ran into a similar problem. The first time, I trained with one RTX 3090 and only set "amp" to "True", "num_processes" to "1", and "num_workers" to "16", keeping the defaults for everything else. Training on GOT-10k gave the results below:
"ao": 0.8214040485044509,
"sr50": 0.9221410022121765,
"sr75": 0.8248533230739636,
"speed_fps": 30.02842858137559

The second time, I used two RTX 3090s and set "amp" to "False", "num_processes" to "2", and "num_workers" to "16", again keeping the defaults for everything else. I only wanted to see the influence of "amp", but the model did not converge this time:
"ao": 0.18841102443887056,
"sr50": 0.08507261710108685,
"sr75": 0.01986149850918534,
"speed_fps": 31.019533580571974
During training, only "cls" and "ctr" decrease, while "reg" and "iou" barely change.
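(Side note: the two runs above change both "amp" and the number of processes at once, so on their own they do not isolate the effect of "amp". Summarized as plain Python for comparison:)

```python
# The two runs above, side by side. Note that both "amp" and
# "num_processes" changed, so the comparison is confounded.
run_one_gpu  = {"amp": True,  "num_processes": 1, "num_workers": 16}  # ao ~ 0.82 (converged)
run_two_gpus = {"amp": False, "num_processes": 2, "num_workers": 16}  # ao ~ 0.19 (diverged)
```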

luhannan commented
I met the same problem: the model trained with multiple GPUs didn't converge, with or without synchronized BN, though it seems to converge when trained with one GPU.
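(For anyone trying the synchronized-BN variant mentioned above, this is the stock PyTorch conversion, not STMTrack-specific code; `build_model` is a hypothetical stand-in for however the repository constructs its network, and a torch.distributed process group must already be initialized:)

```python
import torch

# Stock PyTorch SyncBatchNorm usage: replace every BatchNorm layer with a
# synchronized counterpart so statistics are computed across all GPUs.
# Assumes torch.distributed.init_process_group(...) has already run.
model = build_model()  # hypothetical stand-in for the repo's model builder
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(
    model.cuda(), device_ids=[torch.cuda.current_device()]
)
```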

Kevoen commented Apr 13, 2022

> I ran into a similar problem. The first time, I trained with one RTX 3090 … and the model converged ("ao" ≈ 0.82). The second time, I used two RTX 3090s … but the model did not converge ("ao" ≈ 0.19); only "cls" and "ctr" decrease, while "reg" and "iou" barely change.

> I met the same problem: the model trained with multiple GPUs didn't converge, with or without synchronized BN, though it seems to converge when trained with one GPU.

The main reason is that training on multiple GPUs requires rewriting the training code, and the author only provides single-GPU training code. Since the author's code framework is video_analyst, I referred to the distributed training code in video_analyst (main/dist-train.py) for training.
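(For readers adapting this themselves, the sketch below shows the general shape of a spawn-based PyTorch DDP entry point, the pattern a dist-train script like video_analyst's typically follows. It is not the actual main/dist-train.py code, and `build_model`, `build_dataloader`, and `train_one_epoch` are hypothetical placeholders:)

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

NUM_EPOCHS = 20  # placeholder

def worker(rank: int, world_size: int):
    # Each spawned process joins the same process group.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = build_model().cuda()  # hypothetical helper
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    loader = build_dataloader(rank, world_size)  # must use a DistributedSampler
    for epoch in range(NUM_EPOCHS):
        loader.sampler.set_epoch(epoch)  # reshuffles data across processes
        train_one_epoch(model, loader)   # hypothetical helper
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```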
