
Inference speed may decrease slightly when batch-size increase #6396

Closed
nicklhy opened this issue May 23, 2017 · 19 comments

Comments

@nicklhy
Contributor

nicklhy commented May 23, 2017

Recently, I ran some speed benchmarks of CNN inference across a few popular deep learning frameworks (including MXNet). Although MXNet performs excellently on most modern CNN architectures (e.g. ResNet, Inception), I also found that MXNet's inference speed curve often drops slightly when the batch size is 16, 32 or 64, as shown below. Is this common?

[chart: resnet50 inference speed vs. batch size]

[chart: inception-v3 inference speed vs. batch size]

Some other benchmark results can be found here (Titan X) and here (GTX 1080).

@piiswrong
Contributor

@ptrendx @ap-hynninen The drop looks weird. Do you observe this? Is it due to cudnn algo selection?

@nicklhy
Contributor Author

nicklhy commented May 23, 2017

By the way, MXNET_CUDNN_AUTOTUNE_DEFAULT is 1 in my experiments, so the program runs performance tests to find the best convolution algorithm each time it starts.
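
For reference, autotuning can be switched off for a run by setting this variable up front. A minimal sketch (exactly when MXNet reads the variable is my assumption; setting it before the first convolution is built should be enough):

import os
os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'  # 0 = off, 1 = limited workspace, 2 = fastest
import mxnet as mx  # imported after setting the variable so every conv sees it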

@ghost

ghost commented May 23, 2017

Similarly, I ran some performance tests of cuDNN autotuning with the three modes (off = 0, limited = 1, fastest = 2), measuring the total time of 1000 convolutions.
Surprisingly, the fastest mode takes more time, while turning autotuning off gives the best speed.
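
Roughly, the comparison looked like the sketch below (not my exact script; I'm assuming the cudnn_tune attribute of mx.sym.Convolution with values 'off' / 'limited_workspace' / 'fastest' corresponds to the three modes):

import time
import mxnet as mx

ctx = mx.gpu()
data = mx.nd.ones((32, 64, 56, 56), ctx=ctx)
weight = mx.nd.ones((64, 64, 3, 3), ctx=ctx)

for mode in ('off', 'limited_workspace', 'fastest'):
    x = mx.sym.Variable('x')
    conv = mx.sym.Convolution(data=x, num_filter=64, kernel=(3, 3), pad=(1, 1),
                              no_bias=True, cudnn_tune=mode, name='conv')
    tic = time.time()
    for _ in range(1000):
        out = conv.eval(ctx=ctx, x=data, conv_weight=weight)
    mx.nd.waitall()  # execution is asynchronous; wait before stopping the clock
    print('%s: %.3f s for 1000 convolutions' % (mode, time.time() - tic))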

@ptrendx
Member

ptrendx commented May 23, 2017

@xioryu Could you share the repro for that?

@nicklhy, @piiswrong We did not do any inference tests, but that is definitely weird behavior - cudnn should not slow down with the increasing batch size. Maybe IO? Do you have a script we can use to repro that?

@nicklhy
Contributor Author

nicklhy commented May 23, 2017

@ptrendx , you can find my test script here. Note that all samples are generated randomly before the inference loop, so I don't think IO is a problem at all.

@piiswrong
Contributor

Had a quick look at the script. I don't see a burn-in section. Usually you should let the script run for a few iterations to warm up before you start timing.

@nicklhy
Contributor Author

nicklhy commented May 24, 2017

@piiswrong , yes, I didn't do a warm-up run in this script. But I remove the smallest and largest values over the 10 epochs and use the mean of the remaining values as the final result.

Moreover, judging from the program's log, the missing warm-up run doesn't change the speed measured over an entire epoch (n_sample is set to 1000) very much. Although the per-epoch speeds vary quite a bit, the slowest epoch is not necessarily the first one. Here is the log from a test of the Inception-v3 network.

# python inference_mxnet.py --network inception-v3 --n-sample 1000 --n-epoch 10 --gpu 0 --batch-size 128
===================== benchmark for mxnet inception-v3 =====================
n_sample=1000, batch_size=128, n_epoch=10
[09:20:40] src/operator/././cudnn_algoreg-inl.h:65: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Init parameters randomly
Finish loading model in 9.9899s
Generate 1000 random images in 6.3187s
Epoch 0, finish 1000 images in 2.9740s, speed = 336.2460 image/s
Epoch 1, finish 1000 images in 2.8387s, speed = 352.2794 image/s
Epoch 2, finish 1000 images in 2.8898s, speed = 346.0436 image/s
Epoch 3, finish 1000 images in 2.8196s, speed = 354.6634 image/s
Epoch 4, finish 1000 images in 2.8310s, speed = 353.2313 image/s
Epoch 5, finish 1000 images in 3.0288s, speed = 330.1593 image/s
Epoch 6, finish 1000 images in 2.8374s, speed = 352.4338 image/s
Epoch 7, finish 1000 images in 2.9013s, speed = 344.6708 image/s
Epoch 8, finish 1000 images in 3.0996s, speed = 322.6236 image/s
Epoch 9, finish 1000 images in 2.9490s, speed = 339.0955 image/s
Finish 1000 images for 10 times in 29.1700s, speed = 344.0849 image/s (2.9063 ms/image)
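
To be concrete, the aggregation over these 10 epochs is just the following (a sketch using the rounded speeds from the log above):

# per-epoch speeds (image/s) from the log above
speeds = [336.2, 352.3, 346.0, 354.7, 353.2, 330.2, 352.4, 344.7, 322.6, 339.1]
trimmed = sorted(speeds)[1:-1]      # drop the smallest and largest values
print(sum(trimmed) / len(trimmed))  # mean of the remaining 8 epochs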

But if necessary, I will add a warm-up run to my script. Are 10 iterations enough for that?

@piiswrong
Contributor

piiswrong commented May 24, 2017 via email

@nicklhy
Contributor Author

nicklhy commented May 24, 2017

100 iterations? With a batch size of 128, 100 iterations means processing 12,800 images before the real benchmark loop even starts. I never expected MXNet to need that much data for warming up ...

BTW, the example benchmark script (example/image-classification/benchmark_score.py) only runs 5 warm-up iterations.

@piiswrong
Contributor

piiswrong commented May 24, 2017

You need to burn your GPU to working temp (usually around 80C) to get stable numbers.
This should be done for all frameworks. Otherwise you get volatile speed at the beginning.
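
Something like this, for example (a sketch with a toy one-layer network standing in for the real model):

import time
import mxnet as mx

ctx = mx.gpu()
batch_size = 128
batch = mx.io.DataBatch([mx.nd.ones((batch_size, 3, 224, 224), ctx=ctx)], label=None)

data = mx.sym.Variable('data')
net = mx.sym.Convolution(data=data, num_filter=64, kernel=(3, 3), pad=(1, 1), name='conv1')

mod = mx.mod.Module(symbol=net, label_names=None, context=ctx)
mod.bind(data_shapes=[('data', (batch_size, 3, 224, 224))], for_training=False)
mod.init_params()  # random weights are fine for a pure speed test

# burn-in: untimed iterations to bring the GPU up to its working temperature
for _ in range(100):
    mod.forward(batch, is_train=False)
mx.nd.waitall()

# timed loop
tic = time.time()
for _ in range(100):
    mod.forward(batch, is_train=False)
mx.nd.waitall()  # execution is asynchronous; wait before stopping the clock
print('%.1f images/s' % (100 * batch_size / (time.time() - tic)))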

@nicklhy
Contributor Author

nicklhy commented May 24, 2017

Just ran the test following your advice (100 warm-up iterations); the AlexNet results are:

bs=1, speed=468.662964
bs=2, speed=738.736337
bs=4, speed=1030.423765
bs=8, speed=1511.132060
bs=16, speed=1656.789473
bs=32, speed=1733.751776
bs=64, speed=1443.326092
bs=128, speed=1342.492282

The GPU temperature was very stable during the test, and the speed drop still exists.

@ghost

ghost commented May 24, 2017 via email

@nicklhy
Contributor Author

nicklhy commented May 24, 2017

@xioryu , I think you should not count the first eval time. Timing it like this is better:

import time
import mxnet as mx

# `conv` and its input arrays are assumed to be set up as in your test
ex = conv.eval(ctx=mx.gpu())   # the first eval pays the autotuning cost

tic = time.time()
for i in range(100):
    ex = conv.eval(ctx=mx.gpu())
toc = time.time()
print(toc - tic)

cuDNN autotune needs to run performance tests to find the best convolution algorithm, so it is quite normal for the first eval to take longer than it does when cudnn_tune is set to "off".

@ghost

ghost commented May 24, 2017 via email

@ptrendx
Member

ptrendx commented May 26, 2017

@nicklhy As I mentioned before, we did not do any inference tests ourselves, so I don't have a good intuition whether those results look fine, but FWIW, I tested your script on P100 and got this for ResNet50:

Batch size   Speed (image/s)
1            147
2            248
4            388
8            506
16           589
32           661
64           696
128          709

So I don't really see the problem there... Is your Titan X result from the old Titan X or the new one (Pascal)? Also, which version of MXNet are you testing on?

@nicklhy
Contributor Author

nicklhy commented May 27, 2017

@ptrendx , Thanks for your help. My testing environment is:

  • GPU: Titan X (Pascal)
  • CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
  • OS: Ubuntu 16.04 LTS
  • Nvidia Driver: 375.26
  • CUDA: 8.0.61
  • CUDNN: 5.1.5
  • MXNet GitHub hash: 5a7aa20

And my ResNet50's results are:

Batch size   Speed (image/s)
1            183.54
2            273.00
4            386.20
8            452.96
16           506.05
32           513.13
64           482.73
128          481.10

Your speed numbers increase much more smoothly than mine, so I now suspect that the problem is caused by my testing environment.

Besides the slight speed drop mentioned above, I also found that the speed for a specific batch size (e.g. 32, 64 or 128) can occasionally drop by a large margin during my benchmark runs, and if I run the test again, it recovers to a normal level.

In short, it seems my GPU does not run as stably as it should. I will check the hardware and try my scripts on some other machines later.

@nicklhy
Contributor Author

nicklhy commented May 27, 2017

@ptrendx , just ran a test on a K80 machine, and the speeds are all stable, like your results. I think my problem is caused by the new Titan X GPU or its power supply.

Thank you for your help.

@yajiedesign
Contributor

This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!

@Ram-Godavarthi

Hi,
Has anybody worked on Faster R-CNN in MXNet with a batch size greater than 1?
What is the performance? Is it possible to train the network with a batch size greater than 1 in mxnet-rcnn?
