
Training speed #43

Closed
jodusan opened this issue May 14, 2019 · 11 comments

Comments

jodusan commented May 14, 2019

I am not getting consistent GPU utilization, and the ETA says 18 days on one V100 GPU (p3.2xlarge) with a batch size of 12 and num_workers of 8. Does this make sense?

Is there any explanation of the timer column, and is there a TensorBoard equivalent for viewing performance over time?

Thank you very much!

dbolya (Owner) commented May 15, 2019

Hmm weird, I just booted up a p3.2xlarge instance to test and I can't reproduce this.

Running the command
python train.py --config=yolact_base_config
(i.e., with the default batch size of 8) I get an ETA of 4 days.

Running the command
python train.py --config=yolact_base_config --batch_size=12 --num_workers=8
results in an ETA of 5 days (also note that the training parameters are optimized for a batch size of 8, so you should use the first command).

Are you doing anything special that would slow it down? If not, what are your PyTorch and CUDA versions?

The timer column is the time it took for one training iteration, while the ETA averages out over a large number of iterations.

And to view training performance over time, log your console output to a file, pull my latest commit, and from the project's root directory run
python -m "scripts.plot_loss" <my_log_file>
to plot the loss over time.

To plot validation mAP over time, use
python -m "scripts.plot_loss" <my_log_file> val

jodusan (Author) commented May 15, 2019

Thanks for the quick reply! I am using my own dataset; does that matter? (It should just resize the images and proceed the same way, right?) I generated COCO-style annotations and only added a new config, nothing else. The error exploded.

dbolya (Owner) commented May 15, 2019

For debugging purposes, can you try training on COCO to see what the ETA is? There might be an issue with how you set up your dataset / config.
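
For reference, here's a minimal sketch of the kind of dataset + config pair I mean in data/config.py (all names, paths, and classes below are placeholders, and the exact fields may vary with the version you're on):

# Sketch only -- adapt paths, names, and the class list to your data.
my_custom_dataset = dataset_base.copy({
    'name': 'My Dataset',
    # COCO-style annotation JSONs plus their image folders
    'train_images': './data/my_dataset/train/',
    'train_info':   './data/my_dataset/annotations/train.json',
    'valid_images': './data/my_dataset/valid/',
    'valid_info':   './data/my_dataset/annotations/valid.json',
    'has_gt': True,
    'class_names': ('class_a', 'class_b', 'class_c'),
})

yolact_my_dataset_config = yolact_base_config.copy({
    'name': 'yolact_my_dataset',
    'dataset': my_custom_dataset,
    # +1 for the background class
    'num_classes': len(my_custom_dataset.class_names) + 1,
})

and then train with

python train.py --config=yolact_my_dataset_config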

@harryb-kyutech

@dbolya do you have a script that plots both the training loss and mAP AFTER the training has finished (i.e., perhaps during evaluation)? It would be of great help, as it is urgently needed. Thank you nonetheless for the great work on YOLACT.

dbolya (Owner) commented Aug 5, 2019

@harrybolingot Those scripts plot those things after training, but you need the log of stdout during training or else there's nowhere to get the info from. Or are you asking for the final training loss?

@harryb-kyutech

Yeah, my problem is that I wasn't able to save the stdout during training. I stopped and restarted the training from time to time, since going from 0 to 100k iterations took 36 hours. During that time I didn't realize I should have been logging the stdout to plot the training loss. Now that training has finished, I want to be able to plot the training loss from start to end :(

dbolya (Owner) commented Aug 5, 2019

Yeah, sorry, that's not possible because that data isn't saved anywhere. And oof, that training time is really bad. Is that expected given your hardware, or do you think something's wrong?

@harryb-kyutech

Our lab has four GTX 1080 Ti GPUs, and I used all of them during training. I'm not sure whether this is the standard training speed for this algorithm on that GPU setup. I also have no idea whether something's wrong, as the ETA was fairly accurate during my first run of your algorithm on my custom dataset.

dbolya (Owner) commented Aug 5, 2019

Does your custom dataset have huge images or something? On COCO with one 1080 Ti, the expected training time is ~5 days, so 100k iterations in 36 hours works out to 3 days for 200k iterations (800k iters / 4 GPUs), which is not even a 2x speedup. If you haven't already, check out #8 for some tips on using multiple GPUs.
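
To spell out that arithmetic, plugging in your numbers (and assuming the default 800k-iteration schedule splits evenly across GPUs), a rough sanity check looks like:

# Back-of-the-envelope check of the 4-GPU numbers above -- assumptions, not measurements.
total_iters   = 800_000        # default yolact_base schedule
num_gpus      = 4
iters_done    = 100_000
hours_elapsed = 36

hours_per_iter = hours_elapsed / iters_done
eta_days = (total_iters / num_gpus) * hours_per_iter / 24
speedup  = 5.0 / eta_days      # vs. ~5 days on a single 1080 Ti
print(f"ETA: {eta_days:.1f} days, speedup over 1 GPU: {speedup:.1f}x")
# -> ETA: 3.0 days, speedup over 1 GPU: 1.7x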

Though I must confess, since I didn't use multiple GPUs while developing YOLACT, my code doesn't support them very well. Have you tried training on a single GPU and comparing the ETA (after the first 1k iterations, though, since it's not accurate before then)?

Btw the ETA is based on the current speed of training, so that's why it's accurate (as long as the training speed is consistent). It being accurate doesn't necessarily mean everything's working properly.

dbolya (Owner) commented Sep 19, 2019

Since this has been open so long, I'm going to close it. Feel free to reopen if you have any updates.

dbolya closed this as completed Sep 19, 2019
@maskrcnnuser

@dbolya - I am training with a set of 20 images for training and 10 for validation; I purposely kept the set small at the beginning to test out the system. However, my output says an ETA of 58 days!! My setup is as follows.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 426.23       Driver Version: 426.23       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           TCC  | 00000001:00:00.0 Off |                    0 |
| N/A   80C    P0   143W / 149W |  10404MiB / 11448MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+

The timer value varies between 8.5 and 11.6 seconds per iteration.

TensorFlow: 2.0.0
PyTorch: 1.6.0

Am I doing something wrong?
