Inference speed may decrease slightly when batch size increases #6396
Comments
@ptrendx @ap-hynninen The drop looks weird. Do you observe this? Is it due to cudnn algo selection? |
By the way, MXNET_CUDNN_AUTOTUNE_DEFAULT is 1 in my experiments. The program needs to run performance tests to find the best conv algo each time you run it. |
Similarly, I did some performance tests for cudnn autotuning with the three modes (0 = off, 1 = limited_workspace, 2 = fastest), by evaluating the time cost of 1000 convolutions. |
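As a side note, here is a minimal sketch of selecting the autotune mode programmatically; the numeric values follow the off/limited/fastest mapping above, and setting the variable in the shell before launching the script works equally well:

import os
# 0 = off, 1 = limited_workspace, 2 = fastest. The variable must be in
# place before the first convolution operator is created, so set it
# before building any MXNet graph.
os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'

import mxnet as mx  # safe to import after the variable is set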
@xioryu Could you share the repro for that? @nicklhy, @piiswrong We did not do any inference tests, but that is definitely weird behavior - cudnn should not slow down with the increasing batch size. Maybe IO? Do you have a script we can use to repro that? |
Had a quick look at the script. I don't see a burn-in section. Usually you should let the script run for a few iterations to warm up before you start timing it. |
@piiswrong , Yes, I didn't do a warm-up run in this script. But I do remove the smallest and largest values over 10 epochs and then use the mean of the remaining values as the final result. Moreover, judging from the program's log, the missing warm-up run doesn't affect the speed measurement of an entire epoch (n_samples is set to 1000) very much: although the speed varies a lot from epoch to epoch, the slowest epoch is not necessarily the first one. Here is a log print from testing the inception-v3 network:
# python inference_mxnet.py --network inception-v3 --n-sample 1000 --n-epoch 10 --gpu 0 --batch-size 128
===================== benchmark for mxnet inception-v3 =====================
n_sample=1000, batch_size=128, n_epoch=10
[09:20:40] src/operator/././cudnn_algoreg-inl.h:65: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Init parameters randomly
Finish loading model in 9.9899s
Generate 1000 random images in 6.3187s
Epoch 0, finish 1000 images in 2.9740s, speed = 336.2460 image/s
Epoch 1, finish 1000 images in 2.8387s, speed = 352.2794 image/s
Epoch 2, finish 1000 images in 2.8898s, speed = 346.0436 image/s
Epoch 3, finish 1000 images in 2.8196s, speed = 354.6634 image/s
Epoch 4, finish 1000 images in 2.8310s, speed = 353.2313 image/s
Epoch 5, finish 1000 images in 3.0288s, speed = 330.1593 image/s
Epoch 6, finish 1000 images in 2.8374s, speed = 352.4338 image/s
Epoch 7, finish 1000 images in 2.9013s, speed = 344.6708 image/s
Epoch 8, finish 1000 images in 3.0996s, speed = 322.6236 image/s
Epoch 9, finish 1000 images in 2.9490s, speed = 339.0955 image/s
Finish 1000 images for 10 times in 29.1700s, speed = 344.0849 image/s (2.9063 ms/image)
But if necessary, I will add a warm-up run to my script. Are 10 iterations enough for that? |
Try doing 100 iterations of warm-up first.
|
100 iterations? If the batch size is set to 128, 100 iterations means processing 12800 images before the real benchmark loop. I never expected mxnet could require such a large amount of data for warming up ... BTW, the example benchmark code (example/image-classification/benchmark_score.py) only runs 5 iterations for warming up. |
You need to burn your GPU to working temp (usually around 80C) to get stable numbers. |
Just did the test following your advice (100 iterations of warm-up); the results for alexnet are:
The GPU temperature is very stable during the test and the speed drop still exists. |
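For readers following along, here is a minimal, self-contained sketch of the warm-up-then-time pattern being discussed; the layer, shapes, and iteration counts are illustrative, not the exact script from this thread:

import time
import mxnet as mx

batch_size = 128
shape = (batch_size, 3, 224, 224)

# A toy network: a single convolution stands in for the real model.
data = mx.sym.Variable('data')
net = mx.sym.Convolution(data=data, kernel=(3, 3), pad=(1, 1), num_filter=64)
mod = mx.mod.Module(net, data_names=('data',), label_names=None,
                    context=mx.gpu(0))
mod.bind(data_shapes=[('data', shape)], for_training=False)
mod.init_params(initializer=mx.init.Xavier())

batch = mx.io.DataBatch([mx.nd.ones(shape, ctx=mx.gpu(0))], None)

# Warm-up: keeps cudnn autotune, memory allocation and GPU clock
# ramp-up out of the timed region.
for _ in range(100):
    mod.forward(batch, is_train=False)
mx.nd.waitall()  # MXNet executes asynchronously; flush queued work first

tic = time.time()
for _ in range(100):
    mod.forward(batch, is_train=False)
mx.nd.waitall()  # synchronize again before reading the clock
print('%.3f ms/batch' % ((time.time() - tic) / 100 * 1000))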
I pasted my evaluation code below. You may try it on your own machine.
=======================================================================
import mxnet as mx
import time

data = mx.sym.random_uniform(low=-1, high=1, shape=(32, 32, 56, 56))
weight = mx.sym.random_normal(loc=0, scale=0.1, shape=(128, 32, 3, 3))
bias = mx.sym.zeros((128,))
conv = mx.sym.Convolution(data=data, weight=weight, bias=bias,
                          kernel=(3, 3), pad=(1, 1),
                          num_filter=128, cudnn_tune=None)

tic = time.time()
for i in xrange(100):
    ex = conv.eval(ctx=mx.gpu())
toc = time.time()
print toc - tic
=======================================================================
When a cudnn tune mode is switched on, the first evaluation takes longer (e.g. 150 ms).
Hardware: i7-6800K, Titan X (Pascal)
------------------------------------------
cudnn tune mode        time (ms)
None                   46
off                    56
limited_workspace      48
fastest                44
|
@xioryu , I think you should not count the first eval time. Testing the time like this is better:
ex = conv.eval(ctx=mx.gpu())   # first eval excluded from the timing
tic = time.time()
for i in xrange(100):
    ex = conv.eval(ctx=mx.gpu())
toc = time.time()
print toc - tic
CUDNN autotune needs to run performance tests to find the best convolution algorithm. It is pretty common for the first eval to take longer than in the case when cudnn_tune is set to "off". |
@nicklhy, yeah, the reported results did not include the first-time count. They were the average times over several evals.
|
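One caveat the thread does not spell out: MXNet's execution engine is asynchronous, so `conv.eval()` can return before the GPU has actually finished the work, and a loop like the one above may under-measure. A hedged sketch of the same timing with an explicit synchronization via `mx.nd.waitall()` (reusing the convolution from the script above; this is a suggested variant, not code from the thread):

import time
import mxnet as mx

# Same style of convolution as in the script above (bias dropped for brevity).
data = mx.sym.random_uniform(low=-1, high=1, shape=(32, 32, 56, 56))
weight = mx.sym.random_normal(loc=0, scale=0.1, shape=(128, 32, 3, 3))
conv = mx.sym.Convolution(data=data, weight=weight, kernel=(3, 3),
                          pad=(1, 1), num_filter=128, no_bias=True)

ex = conv.eval(ctx=mx.gpu())    # untimed first eval, absorbs autotune cost
mx.nd.waitall()                 # block until all queued GPU work is done

tic = time.time()
for i in xrange(100):
    ex = conv.eval(ctx=mx.gpu())
mx.nd.waitall()                 # synchronize before reading the clock
toc = time.time()
print (toc - tic) * 1000 / 100, 'ms per eval'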
@nicklhy As I mentioned before, we did not do any inference tests ourselves, so I don't have a good intuition for whether those results look fine, but FWIW, I tested your script on a P100 and got this for ResNet50:
So I don't really see the problem there... Is the Titan X result from the old (Maxwell) Titan X or the new Pascal one? Also, which version of MXNet are you testing on? |
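For anyone answering the version question, a quick way to check the installed package:

import mxnet as mx
print(mx.__version__)  # prints the installed MXNet version string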
@ptrendx , Thanks for your help. My testing environment is:
And my ResNet50's results are:
Your speed results are much more stable than mine as batch size increases. Now I guess the problem I met may be caused by my testing environment. Besides the slight speed drop mentioned above, I also found that the speed for a specific batch size (i.e. 32, 64 or 128) may occasionally drop by a large gap during my benchmark experiments, and if I run it again the speed recovers to a normal level. In a word, it seems my GPU does not work as stably as it should. I will check the hardware and try my scripts on some other machines later. |
@ptrendx , Just did a test on a K80 machine. The speed results are all stable, like yours. I think my problem is caused by the new Titan X GPU or its power supply. Thank you for your help. |
This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks! |
Hi,
Recently, I did some speed benchmark experiments for the CNN inference task across a few popular deep learning frameworks (including mxnet). Though mxnet performs excellently on most modern CNN structures (i.e. resnet, inception), I also found that mxnet's inference speed curve often drops slightly when the batch size is 16, 32 or 64, as shown below. Is this pretty common?
Some other benchmark results can be found here (Titan X) and here (GTX1080).