
Training speed #43

Closed
jodusan opened this issue May 14, 2019 · 11 comments

Comments

jodusan commented May 14, 2019

I am not getting consistent GPU utilization, and the ETA says 18 days on one V100 GPU (p3.2xlarge) with a batch size of 12 and num_workers of 8. Does this make sense?

Is there any explanation of the timer column, and is there a TensorBoard equivalent for viewing performance over time?

Thank you very much!

dbolya (Owner) commented May 15, 2019

Hmm weird, I just booted up a p3.2xlarge instance to test and I can't reproduce this.

Running the command
python train.py --config=yolact_base_config
(i.e., with the default batch size of 8) I get an ETA of 4 days.

Running the command
python train.py --config=yolact_base_config --batch_size=12 --num_workers=8
results in an ETA of 5 days (also note that the training parameters are optimized for a batch size of 8, so you should use the first command).

Are you doing anything special that would slow it down? If not, what are your PyTorch and CUDA versions?

The timer column is the time it took for one training iteration, while the ETA averages out over a large number of iterations.

And to view training performance over time, log your console output to a file, pull my latest commit, and from the project's root directory run
python -m "scripts.plot_loss" <my_log_file>
to plot the loss over time.

To plot validation mAP over time, use
python -m "scripts.plot_loss" <my_log_file> val

jodusan (Author) commented May 15, 2019

Thanks for the quick reply! I am using my own dataset; does that matter? (It should just resize the images and proceed the same way, right?) I generated COCO-style annotations and only added a new config, nothing else. The error exploded.

dbolya (Owner) commented May 15, 2019

For debugging purposes, can you try training on COCO to see what the ETA is? There might be an issue with how you set up your dataset / config.
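
For reference, here's a minimal sketch of the kind of dataset + config pair I mean in data/config.py (all names, paths, and classes below are placeholders, and the exact fields may vary with the version you're on):

# Sketch only -- adapt paths, names, and the class list to your data.
my_custom_dataset = dataset_base.copy({
    'name': 'My Dataset',
    # COCO-style annotation JSONs plus their image folders
    'train_images': './data/my_dataset/train/',
    'train_info':   './data/my_dataset/annotations/train.json',
    'valid_images': './data/my_dataset/valid/',
    'valid_info':   './data/my_dataset/annotations/valid.json',
    'has_gt': True,
    'class_names': ('class_a', 'class_b', 'class_c'),
})

yolact_my_dataset_config = yolact_base_config.copy({
    'name': 'yolact_my_dataset',
    'dataset': my_custom_dataset,
    # +1 for the background class
    'num_classes': len(my_custom_dataset.class_names) + 1,
})

and then train with

python train.py --config=yolact_my_dataset_config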

@harryb-kyutech

@dbolya do you have a script that plots both the training loss and mAP AFTER the training has finished (i.e., perhaps during evaluation)? It would be of great help, as it is urgently needed. Thank you nonetheless for the great work on YOLACT.

dbolya (Owner) commented Aug 5, 2019

@harrybolingot Those scripts plot those things after training, but you need the log of stdout during training or else there's nowhere to get the info from. Or are you asking for the final training loss?

@harryb-kyutech

Yeah, my problem is that I wasn't able to save the stdout during training. I stopped and restarted the training from time to time, since going from 0 to 100k iterations took 36 hours. During that time I didn't realize I should have been logging the stdout to plot the training loss. Now that training has finished, I want to be able to plot the training loss from start to end :(

dbolya (Owner) commented Aug 5, 2019

Yeah, sorry, that's not possible because that data isn't saved anywhere. And oof, that training time is really bad. Is that expected given your hardware, or do you think something's wrong?

@harryb-kyutech

Our lab has four GTX 1080 Ti GPUs, and I used all of them during training. I'm not sure whether this is the standard training speed for this algorithm on that GPU setup. I also have no idea whether something's wrong, as the ETA was fairly accurate during my first run of your algorithm on my custom dataset.

dbolya (Owner) commented Aug 5, 2019

Does your custom dataset have huge images or something? On COCO with one 1080 Ti, the expected training time is ~5 days, so 100k iterations in 36 hours works out to 3 days for 200k iterations (800k iters / 4 GPUs), which is not even a 2x speedup. If you haven't already, check out #8 for some tips on using multiple GPUs.
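
To spell out that arithmetic, plugging in your numbers (and assuming the default 800k-iteration schedule splits evenly across GPUs), a rough sanity check looks like:

# Back-of-the-envelope check of the 4-GPU numbers above -- assumptions, not measurements.
total_iters   = 800_000        # default yolact_base schedule
num_gpus      = 4
iters_done    = 100_000
hours_elapsed = 36

hours_per_iter = hours_elapsed / iters_done
eta_days = (total_iters / num_gpus) * hours_per_iter / 24
speedup  = 5.0 / eta_days      # vs. ~5 days on a single 1080 Ti
print(f"ETA: {eta_days:.1f} days, speedup over 1 GPU: {speedup:.1f}x")
# -> ETA: 3.0 days, speedup over 1 GPU: 1.7x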

Though I must confess, since I didn't use multiple GPUs while developing YOLACT, my code doesn't support them very well. Have you tried training on a single GPU and comparing the ETA (after the first 1k iterations, though, since it's not accurate before then)?

Btw the ETA is based on the current speed of training, so that's why it's accurate (as long as the training speed is consistent). It being accurate doesn't necessarily mean everything's working properly.

dbolya (Owner) commented Sep 19, 2019

Since this has been open so long, I'm going to close it. Feel free to reopen if you have any updates.

dbolya closed this as completed Sep 19, 2019
@maskrcnnuser

@dbolya - I am training with a set of 20 images for training and 10 for validation; I purposely kept the set small at the beginning to test out the system. However, my output says an ETA of 58 days!! My setup is as follows.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 426.23       Driver Version: 426.23       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           TCC  | 00000001:00:00.0 Off |                    0 |
| N/A   80C    P0   143W / 149W |  10404MiB / 11448MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+

The timer value varies between 8.5 and 11.6 seconds per iteration.

TensorFlow: 2.0.0
PyTorch: 1.6.0

Am I doing something wrong?
