Inference speed may decrease slightly when batch size increases #6396
Comments
@ptrendx @ap-hynninen The drop looks weird. Do you observe this? Is it due to cudnn algo selection? |
By the way, MXNET_CUDNN_AUTOTUNE_DEFAULT is 1 in my experiments. The program needs to run performance tests to find the best conv algo each time you run it. |
Similarly, I did some performance tests for cudnn autotuning with the three modes (0 = off, 1 = limited_workspace, 2 = fastest), by evaluating the time cost of 1000 convolutions. |
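As a side note, here is a minimal sketch of selecting the autotune mode programmatically; the numeric values follow the off/limited/fastest mapping above, and setting the variable in the shell before launching the script works equally well:

import os
# 0 = off, 1 = limited_workspace, 2 = fastest. The variable must be in
# place before the first convolution operator is created, so set it
# before building any MXNet graph.
os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'

import mxnet as mx  # safe to import after the variable is set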
@xioryu Could you share the repro for that? @nicklhy, @piiswrong We did not do any inference tests, but that is definitely weird behavior - cudnn should not slow down with the increasing batch size. Maybe IO? Do you have a script we can use to repro that? |
Had a quick look at the script. I don't see a burn-in section. Usually you should let the script run for a few iterations to warm up before you start timing it. |
@piiswrong , Yes, I didn't do a warm-up run in this script. But I do remove the smallest and largest values over 10 epochs and then use the mean of the remaining values as the final result. Moreover, judging from the program's log, the missing warm-up run doesn't affect the speed measurement of an entire epoch (n_samples is set to 1000) very much: although the speed varies a lot from epoch to epoch, the slowest epoch is not necessarily the first one. Here is a log print from testing the inception-v3 network:
# python inference_mxnet.py --network inception-v3 --n-sample 1000 --n-epoch 10 --gpu 0 --batch-size 128
===================== benchmark for mxnet inception-v3 =====================
n_sample=1000, batch_size=128, n_epoch=10
[09:20:40] src/operator/././cudnn_algoreg-inl.h:65: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Init parameters randomly
Finish loading model in 9.9899s
Generate 1000 random images in 6.3187s
Epoch 0, finish 1000 images in 2.9740s, speed = 336.2460 image/s
Epoch 1, finish 1000 images in 2.8387s, speed = 352.2794 image/s
Epoch 2, finish 1000 images in 2.8898s, speed = 346.0436 image/s
Epoch 3, finish 1000 images in 2.8196s, speed = 354.6634 image/s
Epoch 4, finish 1000 images in 2.8310s, speed = 353.2313 image/s
Epoch 5, finish 1000 images in 3.0288s, speed = 330.1593 image/s
Epoch 6, finish 1000 images in 2.8374s, speed = 352.4338 image/s
Epoch 7, finish 1000 images in 2.9013s, speed = 344.6708 image/s
Epoch 8, finish 1000 images in 3.0996s, speed = 322.6236 image/s
Epoch 9, finish 1000 images in 2.9490s, speed = 339.0955 image/s
Finish 1000 images for 10 times in 29.1700s, speed = 344.0849 image/s (2.9063 ms/image)
But if necessary, I will add a warm-up run to my script. Are 10 iterations enough for that? |
Try doing 100 iterations of warm-up first.
|
100 iterations? If the batch size is set to 128, 100 iterations means processing 12800 images before the real benchmark loop. I never expected mxnet could require such a large amount of data for warming up ... BTW, the example benchmark code (example/image-classification/benchmark_score.py) only runs 5 iterations for warming up. |
You need to burn your GPU to working temp (usually around 80C) to get stable numbers. |
Just did the test following your advice (100 iterations of warm-up); the results for alexnet are:
The GPU temperature is very stable during the test and the speed drop still exists. |
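For readers following along, here is a minimal, self-contained sketch of the warm-up-then-time pattern being discussed; the layer, shapes, and iteration counts are illustrative, not the exact script from this thread:

import time
import mxnet as mx

batch_size = 128
shape = (batch_size, 3, 224, 224)

# A toy network: a single convolution stands in for the real model.
data = mx.sym.Variable('data')
net = mx.sym.Convolution(data=data, kernel=(3, 3), pad=(1, 1), num_filter=64)
mod = mx.mod.Module(net, data_names=('data',), label_names=None,
                    context=mx.gpu(0))
mod.bind(data_shapes=[('data', shape)], for_training=False)
mod.init_params(initializer=mx.init.Xavier())

batch = mx.io.DataBatch([mx.nd.ones(shape, ctx=mx.gpu(0))], None)

# Warm-up: keeps cudnn autotune, memory allocation and GPU clock
# ramp-up out of the timed region.
for _ in range(100):
    mod.forward(batch, is_train=False)
mx.nd.waitall()  # MXNet executes asynchronously; flush queued work first

tic = time.time()
for _ in range(100):
    mod.forward(batch, is_train=False)
mx.nd.waitall()  # synchronize again before reading the clock
print('%.3f ms/batch' % ((time.time() - tic) / 100 * 1000))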
I pasted my evaluation code below. You may try it on your own machine.
=======================================================================
import mxnet as mx
import time

data = mx.sym.random_uniform(low=-1, high=1, shape=(32, 32, 56, 56))
weight = mx.sym.random_normal(loc=0, scale=0.1, shape=(128, 32, 3, 3))
bias = mx.sym.zeros((128,))
conv = mx.sym.Convolution(data=data, weight=weight, bias=bias,
                          kernel=(3, 3), pad=(1, 1),
                          num_filter=128, cudnn_tune=None)

tic = time.time()
for i in xrange(100):
    ex = conv.eval(ctx=mx.gpu())
toc = time.time()
print toc - tic
=======================================================================
When a cudnn tune mode is switched on, the first evaluation takes longer (e.g. 150 ms).
Hardware: i7-6800K, Titan X (Pascal)
------------------------------------------
cudnn tune mode        time (ms)
None                   46
off                    56
limited_workspace      48
fastest                44
|
@xioryu , I think you should not count the first eval time. Testing the time like this is better:
ex = conv.eval(ctx=mx.gpu())   # first eval excluded from the timing
tic = time.time()
for i in xrange(100):
    ex = conv.eval(ctx=mx.gpu())
toc = time.time()
print toc - tic
CUDNN autotune needs to run performance tests to find the best convolution algorithm. It is pretty common for the first eval to take longer than in the case when cudnn_tune is set to "off". |
@nicklhy, yeah, the reported results did not include the first-time count. They were the average times over several evals.
|
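One caveat the thread does not spell out: MXNet's execution engine is asynchronous, so `conv.eval()` can return before the GPU has actually finished the work, and a loop like the one above may under-measure. A hedged sketch of the same timing with an explicit synchronization via `mx.nd.waitall()` (reusing the convolution from the script above; this is a suggested variant, not code from the thread):

import time
import mxnet as mx

# Same style of convolution as in the script above (bias dropped for brevity).
data = mx.sym.random_uniform(low=-1, high=1, shape=(32, 32, 56, 56))
weight = mx.sym.random_normal(loc=0, scale=0.1, shape=(128, 32, 3, 3))
conv = mx.sym.Convolution(data=data, weight=weight, kernel=(3, 3),
                          pad=(1, 1), num_filter=128, no_bias=True)

ex = conv.eval(ctx=mx.gpu())    # untimed first eval, absorbs autotune cost
mx.nd.waitall()                 # block until all queued GPU work is done

tic = time.time()
for i in xrange(100):
    ex = conv.eval(ctx=mx.gpu())
mx.nd.waitall()                 # synchronize before reading the clock
toc = time.time()
print (toc - tic) * 1000 / 100, 'ms per eval'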
@nicklhy As I mentioned before, we did not do any inference tests ourselves, so I don't have a good intuition for whether those results look fine, but FWIW, I tested your script on a P100 and got this for ResNet50:
So I don't really see the problem there... Is the Titan X result from the old (Maxwell) Titan X or the new Pascal one? Also, which version of MXNet are you testing on? |
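For anyone answering the version question, a quick way to check the installed package:

import mxnet as mx
print(mx.__version__)  # prints the installed MXNet version string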
@ptrendx , Thanks for your help. My testing environment is:
And my ResNet50's results are:
Your speed results are much more stable than mine as batch size increases. Now I guess the problem I met may be caused by my testing environment. Besides the slight speed drop mentioned above, I also found that the speed for a specific batch size (i.e. 32, 64 or 128) may occasionally drop by a large gap during my benchmark experiments, and if I run it again the speed recovers to a normal level. In a word, it seems my GPU does not work as stably as it should. I will check the hardware and try my scripts on some other machines later. |
@ptrendx , Just did a test on a K80 machine. The speed results are all stable, like yours. I think my problem is caused by the new Titan X GPU or its power supply. Thank you for your help. |
This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks! |
Hi,
Recently, I did some speed benchmark experiments for the CNN inference task across a few popular deep learning frameworks (including mxnet). Though mxnet performs excellently on most modern CNN structures (i.e. resnet, inception), I also found that mxnet's inference speed curve often drops slightly when the batch size is 16, 32 or 64, as shown below. Is this pretty common?
Some other benchmark results can be found here (Titan X) and here (GTX1080).