How to make the GPU to CPU memory copy faster? #979

Closed
siju-samuel opened this Issue Mar 9, 2018 · 15 comments

@siju-samuel
Member

siju-samuel commented Mar 9, 2018

I'm running the YOLO model, and the GPU-to-CPU copy (get_output) takes a considerable amount of time. Is there any method to optimize this copy operation?

@cnuernber


Contributor

cnuernber commented Mar 9, 2018

https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/

This may give you some ideas to start with. In general you want to use pinned memory, and you want to interleave computation with copying: upload the next batch while you are computing the current one and downloading the previous one.

I recommend nvprof.
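The interleaving idea above can be sketched in pure Python with a small thread pool. This is only a simulation of the pipeline structure: `upload`, `compute`, and `download` are invented stand-ins for the real async copy and kernel-launch calls, not CUDA API functions.

```python
from concurrent.futures import ThreadPoolExecutor

def upload(x):      # stand-in for an async host-to-device copy
    return x

def compute(x):     # stand-in for the kernel launch
    return x * 2

def download(x):    # stand-in for the device-to-host copy
    return x

def pipelined(batches):
    """Overlap stages: upload batch i+1 and download batch i-1
    while batch i is being computed."""
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        up = pool.submit(upload, batches[0])
        prev_down = None
        for i in range(len(batches)):
            cur = up.result()
            if i + 1 < len(batches):
                up = pool.submit(upload, batches[i + 1])  # upload next batch
            out = compute(cur)                            # compute current batch
            if prev_down is not None:
                results.append(prev_down.result())        # collect previous download
            prev_down = pool.submit(download, out)        # download current batch async
        results.append(prev_down.result())
    return results

print(pipelined([1, 2, 3]))  # [2, 4, 6]
```

With real pinned-memory copies the same structure lets the copy engine and the compute engine run concurrently instead of serially.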

@tqchen


Member

tqchen commented Mar 9, 2018

Please confirm whether the copy itself is really slow, or whether it only appears slow because of previously issued asynchronous CUDA functions. To check, do a tvm.gpu(0).sync() before you run get_output.
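The timing pitfall being described can be reproduced with a pure-Python analogue of an in-order command queue (a single-worker thread pool). This is a simulation, not TVM's API; `run_kernel` and `copy_to_host` are invented stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor

queue = ThreadPoolExecutor(max_workers=1)  # in-order, like a CUDA stream

def run_kernel():
    time.sleep(0.2)    # stand-in for an asynchronously launched kernel

def copy_to_host():
    time.sleep(0.01)   # the copy itself is cheap

# Naive timing: the copy is enqueued behind the kernel, so its
# measured time includes the kernel's execution time.
queue.submit(run_kernel)
t0 = time.perf_counter()
queue.submit(copy_to_host).result()
naive = time.perf_counter() - t0

# Sync first (the tvm.gpu(0).sync() analogue), then time the copy alone.
queue.submit(run_kernel).result()  # blocks until pending work is drained
t0 = time.perf_counter()
queue.submit(copy_to_host).result()
synced = time.perf_counter() - t0

print(f"naive: {naive:.2f}s  synced: {synced:.2f}s")
queue.shutdown()
```

If the synced measurement is small while the naive one is large, the copy is not the bottleneck; the earlier asynchronous work is.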

@dayanandasiet


Contributor

dayanandasiet commented Mar 11, 2018

@tqchen

We are using OpenCL (Mali GPU on a Huawei Mate 9 phone). Even with tvm.cl(0).sync(), get_output (copying from GPU to CPU) still consumes comparatively more time (~2.7 seconds). Where can we check to troubleshoot?

@janboeye


Contributor

janboeye commented Mar 12, 2018

@dayanandasiet
clEnqueueReadBuffer will take 3 seconds, right?

@dayanandasiet


Contributor

dayanandasiet commented Mar 12, 2018

@janboeye

Yes, clEnqueueReadBuffer (TVM calls clEnqueueReadBuffer asynchronously) consumes 2.7 seconds.

@tqchen


Member

tqchen commented Mar 12, 2018

If you are running this through RPC, then there could be RPC overhead from networking.

The RPC overhead won't be the real problem when you deploy to the phone (where no RPC is used). If you time clEnqueueReadBuffer alone and look at its performance on the Mali, you might get an idea of what the overhead is. But there is little we can do here, because it is overhead of the OpenCL runtime.

@siju-samuel


Member

siju-samuel commented Mar 13, 2018

We tried both methods, RPC and a standalone app. The RPC overhead is negligible compared to the OpenCL overhead. clEnqueueReadBuffer, in both synchronous and asynchronous mode, takes 2.7 seconds for each get_output (GPU-to-CPU copy), and the whole advantage of using the GPU is lost with this delay. My output size is only 425*19*19 = 153425 elements. Any tricks to reduce this overhead?
Is 2.7 s normal for this scenario?

@tqchen


Member

tqchen commented Mar 13, 2018

I am not sure how the cost compares in other cases. This definitely sounds super slow compared to the memory bandwidth you could get; on mobile the CPU and GPU even share memory. I would recommend doing some standalone testing of just the copy API in OpenCL, and possibly reporting that to the relevant OpenCL driver writer.

@janboeye


Contributor

janboeye commented Mar 14, 2018

@siju-samuel

Could you add clFinish before clEnqueueReadBuffer and then measure the time consumed by clEnqueueReadBuffer? Maybe the time is not actually spent in clEnqueueReadBuffer itself.

@dayanandasiet


Contributor

dayanandasiet commented Mar 14, 2018

@janboeye
Please find the performance benchmark for the API invocations below.

Iteration 1: no changes in opencl_device_api.cc

set_input->(consumes 186ms)
    ->CopyDataFromTo
        ->kDLCPU to kDLOpenCL
            ->clEnqueueWriteBuffer
            ->clFinish
get_output->(consumes 2891ms)
    ->CopyDataFromTo
        ->kDLOpenCL to kDLCPU
            ->clEnqueueReadBuffer
            ->clFinish

Iteration 2: clFinish commented out after clEnqueueReadBuffer in opencl_device_api.cc

set_input->(consumes 3185ms)
    ->CopyDataFromTo
        ->kDLCPU to kDLOpenCL
            ->clEnqueueWriteBuffer
            ->clFinish
get_output->(consumes 174ms)
    ->CopyDataFromTo
        ->kDLOpenCL to kDLCPU
            ->clEnqueueReadBuffer
            ->//clFinish

Iteration 3: clFinish commented out after both clEnqueueWriteBuffer and clEnqueueReadBuffer in opencl_device_api.cc

set_input->(consumes 257ms)
    ->CopyDataFromTo
        ->kDLCPU to kDLOpenCL
            ->clEnqueueWriteBuffer
            ->//clFinish
get_output->(consumes 54ms)
    ->CopyDataFromTo
        ->kDLOpenCL to kDLCPU
            ->clEnqueueReadBuffer
            ->//clFinish

Note: in iterations 1 and 2 I get correct output, but in iteration 3 I do not get output from the target.
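Why removing both clFinish calls breaks the output can be seen with a toy in-order queue in pure Python (a simulation, not the OpenCL API): without the final wait, the read returns before the enqueued kernel has written the buffer.

```python
import threading
import time

class FakeQueue:
    """Toy command queue: enqueue_kernel is asynchronous, finish() drains it."""
    def __init__(self):
        self.buffer = 0
        self._done = threading.Event()

    def enqueue_kernel(self):
        def work():
            time.sleep(0.1)          # stand-in for kernel execution
            self.buffer = 42         # the kernel writes the output buffer
            self._done.set()
        threading.Thread(target=work).start()

    def finish(self):                # clFinish analogue
        self._done.wait()

    def read_buffer(self):           # a read with the wait removed
        return self.buffer

q = FakeQueue()
q.enqueue_kernel()
stale = q.read_buffer()              # returns fast, but the kernel hasn't run

q2 = FakeQueue()
q2.enqueue_kernel()
q2.finish()                          # wait for the kernel first
fresh = q2.read_buffer()             # now the buffer holds the result
print(stale, fresh)                  # 0 42
```

This also suggests why the cost moves around in the three iterations: the full kernel time is paid at whichever call synchronizes first, so dropping a clFinish only shifts the wait, it doesn't remove it.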

@janboeye


Contributor

janboeye commented Mar 14, 2018

Please add clFinish before clEnqueueReadBuffer in get_output; this will make sure all work is finished by the GPU before clEnqueueReadBuffer runs.

@dayanandasiet


Contributor

dayanandasiet commented Mar 14, 2018

@janboeye

After adding clFinish before clEnqueueReadBuffer in get_output, the prediction is correct, but the operation consumes 2855 ms and the slow memory copy still exists.

@tqchen


Member

tqchen commented Mar 17, 2018

Closing this for now, as the problem may not be on the TVM side but in the OpenCL driver.

@tqchen tqchen closed this Mar 17, 2018

@ehsanmok


Contributor

ehsanmok commented May 17, 2018

I have a similar issue with OpenCL 1.2 and 2.0. It doesn't seem right, though!

@zhiics


Contributor

zhiics commented Aug 27, 2018

Did anyone look into this? I am also having the same problem. I split the graph into an OpenCL part and a GPU part, and three data copy nodes were inserted. It turned out that the first data copy takes about 2.7 s, the second around 40 ms, and the third only ~0.5 ms. Any idea why it gets faster?
