This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

How to speed up mxnet prediction? Copying gpu->cpu takes a long time #9884

johnbroughton2017 opened this issue Feb 25, 2018 · 3 comments



johnbroughton2017 commented Feb 25, 2018

Hi all,

Prediction with mxnet has two major parts: the forward pass, and copying the results from GPU to CPU memory, as in

mod.forward(Batch([mx.nd.array(data)]))    # forward pass
prob = mod.get_outputs(0)[0][0].asnumpy()  # copy result from GPU to CPU

I did a quick timing based on batch size (see below). It seems like the second operation above takes a lot of time when batch size increases.

  batch size    mod.forward() (ms)    mod.get_outputs...asnumpy() (ms)
------------  --------------------  ----------------------------------
          16                   5.8                                30.1
          32                  10.5                                51.1
          48                  14                                  78.7
          64                  17.8                                95.6
          80                  33.2                               121.3
          96                  36.2                               147.5
         112                  41.3                               174.3
         128                  46.4                               245.5
         144                  52                                 219
         160                  56.9                               241.2
         176                  64.9                               267.4
         192                  69.5                               329.1
         208                  73.4                               317.1
         224                  80.7                               337.4
         240                  83.4                               446.7
         256                  93.4                               380.7

I don't understand this because copying data from gpu to cpu should be really fast. For example, the following code takes only 0.2ms to run.

# speed test
import time
import mxnet as mx
a = mx.nd.random_uniform(shape=(256, 1000), ctx=mx.cpu())
b = mx.nd.random_uniform(shape=(256, 1000), ctx=mx.gpu())

t0 = time.time()
b.copyto(a)
print(time.time() - t0)

Am I doing this the wrong way? Any help is highly appreciated. Thanks.

-- John

PS: The architecture is resnet50.


johnbroughton2017 commented Feb 25, 2018

A follow-up, which I find even more interesting: using caffenet instead of resnet50, the timings look like this:

  batch size    mod.forward() (ms)    mod.get_outputs...asnumpy() (ms)
------------  --------------------  ----------------------------------
          16                 156.6                                61.3
          32                 183.4                                28.9
          48                 166.4                                25.3
          64                 166.7                                32.1
          80                 171.3                                38.6
          96                 181.8                                33.4
         112                 181.4                                41.6
         128                 188.2                                46.8
         144                 236.5                                61.2
         160                 193.1                                54.4
         176                 195.8                                61.8
         192                 198.9                                65.9
         208                 196.7                                70.3
         224                 199.5                                75.3
         240                 203.5                                77.4
         256                 206                                  81.9

The output dimension should be the same, but for some reason the data-copying time drops a lot. I cannot figure out why.

-- John

@johnbroughton2017 johnbroughton2017 changed the title How to speed up prediction run time? Copying gpu->cpu takes a long time How to speed up mxnet prediction? Copying gpu->cpu takes a long time Feb 25, 2018
@reminisce (Contributor) commented:

`mod.forward` and `NDArray.copyto` are async functions. It's not accurate to simply time the line of Python code. You need to insert a `waitall()` call between `forward` and `get_outputs` and measure the time difference between the beginning and the end of each code block. More precisely, you can use the MXNet built-in profiler to get the accurate execution time of each operation.

https://mxnet.incubator.apache.org/how_to/perf.html#profiler

johnbroughton2017 (Author) commented:

Thanks @reminisce! Will try it out.
