Training speed #43
Hmm, weird. I just booted up a p3.2xlarge instance to test with the same training command and I can't reproduce this. Are you doing anything special that would slow it down? If not, what are your PyTorch and CUDA versions? The Timer column is the time one training iteration took, while the ETA averages over a large number of iterations. To view training performance over time, log your console output to a file, pull my latest commit, and run the loss-plotting script from the project's root directory; there is a separate script for plotting validation mAP over time.
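The logging step mentioned above can be sketched with `tee`, which writes console output to a file while still showing it live. The echoed line below is a stand-in; the commented-out training command and config name are illustrative, not verbatim from this thread:

```shell
# Duplicate a command's stdout and stderr to the terminal and to train.log,
# so the plotting scripts have a log to read later. Replace the echo with
# the real training command, e.g.:
#   python train.py --config=yolact_base_config 2>&1 | tee train.log
echo "[  0]  100 || T: 20.8 || ETA: 5 days || timer: 0.45" 2>&1 | tee train.log
```

`2>&1` folds stderr into stdout so warnings and stack traces end up in the same log.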
Thanks for the quick reply! I am using my own dataset; does that matter? (It should just resize the images and proceed the same, right?) I generated COCO-style annotations and only added a new config, nothing else, and the ETA exploded.
For debugging purposes, can you try training on COCO to see what the ETA is there? There might be an issue with how you set up your dataset or config.
@dbolya do you have a script that plots both the training loss and mAP AFTER the training has finished (i.e. perhaps during evaluation)? It would be of great help, as it is urgently needed. Thank you nonetheless for the great work on YOLACT.
@harrybolingot Those scripts do plot those things after training, but you need the log of stdout from during training, or else there's nowhere to get the info from. Or are you asking for just the final training loss?
Yeah, my problem is that I wasn't able to save the stdout during training. I've interrupted training from time to time, and training from 0 to 100k iterations took 36 hours. During that time I didn't realize I should have been logging stdout in order to plot the training loss. I want to be able to plot the training loss from start to end now that training has finished :(
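For future runs, once stdout is saved to a file, the loss curve can be recovered with a small parser. This is only a sketch: the line shape assumed by the regex below is a guess at the console format and should be adjusted to match your actual log:

```python
import re

# Assumed line shape: "[epoch]  iter || B: .. | C: .. | M: .. | T: total || ..."
# This format is an assumption; tweak the pattern to match your saved log.
LOSS_RE = re.compile(r"\[\s*\d+\]\s+(?P<it>\d+)\s+\|\|.*?T:\s*(?P<total>\d+(?:\.\d+)?)")

def parse_losses(lines):
    """Return (iteration, total_loss) pairs found in saved console output."""
    points = []
    for line in lines:
        m = LOSS_RE.search(line)
        if m:
            points.append((int(m.group("it")), float(m.group("total"))))
    return points
```

The resulting `(iteration, loss)` pairs can then be plotted with any tool, e.g. matplotlib's `plt.plot(*zip(*points))`.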
Yeah, sorry, that's not possible because that data isn't saved anywhere. And oof, that training time is really bad. Is that expected given your hardware, or do you think something's wrong?
Our lab has four GTX 1080 Ti GPUs, and I used all of them during training. I'm not sure if this is the standard training speed for this algorithm on this GPU setup. I also have no idea if something's wrong, as the ETA was fairly accurate during my first run of your algorithm on my custom dataset.
Does your custom dataset have huge images or something? On COCO with one 1080 Ti, the expected training time is ~5 days, so 100k iterations in 36 hours works out to 3 days for 200k iterations (800k iters / 4 GPUs), which is not even a 2x speedup. If you haven't already, check out #8 for some tips on multiple GPUs. Though I must confess that, since I didn't use multiple GPUs while developing YOLACT, my code doesn't support them very well. Have you tried training on a single GPU and comparing the ETA (after the first 1k iterations, though, because it's not accurate until then)? Btw, the ETA is based on the current speed of training, so that's why it's accurate (as long as the training speed is consistent). It being accurate doesn't necessarily mean everything's working properly.
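The arithmetic in that comment can be checked directly; the numbers below are the ones quoted in this thread:

```python
# Numbers from the thread: ~5 days for the 800k-iteration schedule on one
# 1080 Ti, vs 36 hours for 100k iterations observed on four 1080 Tis.
total_iters = 800_000
num_gpus = 4
iters_per_gpu = total_iters // num_gpus        # 200_000 iterations per GPU

hours_per_100k = 36.0
days_on_4_gpus = hours_per_100k * (iters_per_gpu / 100_000) / 24.0

single_gpu_days = 5.0
speedup = single_gpu_days / days_on_4_gpus     # well under the 2x mentioned above

print(f"{days_on_4_gpus:.1f} days, {speedup:.2f}x speedup")  # prints "3.0 days, 1.67x speedup"
```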
Since this has been open so long, I'm going to close it. Feel free to reopen if you have any updates.
@dbolya I am training using my own set of 20 images for training and 10 for validation; I purposefully kept the set small in the beginning to test out the system. However, my output says an ETA of 58 days! My config is as follows (table not captured). The Timer varies between 8.5 and 11.6 seconds. TensorFlow: 2.0.0. Am I doing something wrong?
I am not getting consistent GPU utilization, and it says 18 days on one V100 GPU (p3.2xlarge) with a batch size of 12 and num_workers 8. Does this make sense?
Is there any explanation of the Timer column, and is there a TensorBoard equivalent for viewing performance over time?
Thank you very much!
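On the Timer question: as noted earlier in the thread, the Timer column is the wall-clock time of one training iteration, and the ETA is essentially remaining iterations times the recent per-iteration time. A quick sanity check of a reported ETA (the 1.9 s/iter figure below is an illustrative assumption, not a value measured in this thread):

```python
def eta_days(remaining_iters, seconds_per_iter):
    """Days left if every remaining iteration takes seconds_per_iter."""
    return remaining_iters * seconds_per_iter / 86_400.0

# An 800k-iteration schedule at an assumed ~1.9 s/iter would read as roughly
# 18 days, in the same ballpark as the ETA reported above for a single V100.
print(round(eta_days(800_000, 1.9), 1))  # prints 17.6
```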