very slow on hd graphics 5000 #1

Open
anguoyang opened this issue Feb 5, 2018 · 8 comments

Comments

@anguoyang

Hi @ganyc717,
Thank you for your project. I tested it on my Intel HD Graphics 5000 and it is very slow: about 8 seconds for a single image, e.g. dog.jpg.
I have not modified your source code.

@ganyc717
Owner

ganyc717 commented Feb 5, 2018

Hi @anguoyang,
As far as I can tell, performance depends heavily on the hardware. I tested this project on a GTX 970, where detecting dog.jpg with yolo.cfg takes about 0.18 seconds. That is about double the time of the original CUDA project on the same hardware (GTX 970).
Thank you for opening this ticket.

@AndrewSivrit

Hi @anguoyang,
Which build configuration did you use, Debug or Release?
Make sure you are building in Release mode in Visual Studio.

@anguoyang
Author

Hi @AndrewSivrit, I used Release mode in Visual Studio, thank you.

@anguoyang
Author

Hi @ganyc717, yes, maybe, but I want to use an Intel GPU instead of NVIDIA because it is more cost-efficient for production. Thank you for your quick reply.

@anguoyang
Author

Double the time of CUDA is acceptable and reasonable. However, my test result is really slow, almost hundreds of times slower than CUDA (on similar hardware), so I suppose there may be something wrong with my program?
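
For scale, the timings quoted so far in this thread can be put side by side. This is only a rough sketch: it uses the "about 8 seconds" and 0.18 s figures reported above, and it assumes that "about double" means the CUDA build needs roughly half of 0.18 s, which is an estimate and not a measurement.

```python
# Timings quoted in this thread (seconds per image, yolo.cfg on dog.jpg).
hd5000_opencl = 8.0    # Intel HD Graphics 5000, OpenCL port ("about 8 seconds")
gtx970_opencl = 0.18   # GTX 970, OpenCL port (reported by the maintainer)

# Assumption: "about double" the CUDA time means CUDA takes roughly half of 0.18 s.
gtx970_cuda_estimate = gtx970_opencl / 2


def slowdown(slow_seconds, fast_seconds):
    """Return how many times longer the slow run takes than the fast one."""
    return slow_seconds / fast_seconds


print(f"vs GTX 970 (OpenCL): {slowdown(hd5000_opencl, gtx970_opencl):.0f}x")
print(f"vs GTX 970 (CUDA, estimated): {slowdown(hd5000_opencl, gtx970_cuda_estimate):.0f}x")
```

On these figures the gap comes out at roughly 44x against the OpenCL run and roughly 89x against the estimated CUDA time.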

@anguoyang
Author

D:\Darknet-On-OpenCL\x64\Release>darknet_cl detect cfg/yolo.cfg yolo.weights data/dog.jpg
layer filters size input output
0 conv 32 3 x 3 / 1 608 x 608 x 3 -> 608 x 608 x 32
1 blas_kernels_1.cl build log:
1:82:37: warning: double precision constant requires cl_khr_fp64, casting to single precision
1:82:58: warning: double precision constant requires cl_khr_fp64, casting to single precision
fcl build 1 succeeded.
fcl build 2 succeeded.
bcl build succeeded.

max 2 x 2 / 2 608 x 608 x 32 -> 304 x 304 x 32
2 conv 64 3 x 3 / 1 304 x 304 x 32 -> 304 x 304 x 64
3 max 2 x 2 / 2 304 x 304 x 64 -> 152 x 152 x 64
4 conv 128 3 x 3 / 1 152 x 152 x 64 -> 152 x 152 x 128
5 conv 64 1 x 1 / 1 152 x 152 x 128 -> 152 x 152 x 64
6 conv 128 3 x 3 / 1 152 x 152 x 64 -> 152 x 152 x 128
7 max 2 x 2 / 2 152 x 152 x 128 -> 76 x 76 x 128
8 conv 256 3 x 3 / 1 76 x 76 x 128 -> 76 x 76 x 256
9 conv 128 1 x 1 / 1 76 x 76 x 256 -> 76 x 76 x 128
10 conv 256 3 x 3 / 1 76 x 76 x 128 -> 76 x 76 x 256
11 max 2 x 2 / 2 76 x 76 x 256 -> 38 x 38 x 256
12 conv 512 3 x 3 / 1 38 x 38 x 256 -> 38 x 38 x 512
13 conv 256 1 x 1 / 1 38 x 38 x 512 -> 38 x 38 x 256
14 conv 512 3 x 3 / 1 38 x 38 x 256 -> 38 x 38 x 512
15 conv 256 1 x 1 / 1 38 x 38 x 512 -> 38 x 38 x 256
16 conv 512 3 x 3 / 1 38 x 38 x 256 -> 38 x 38 x 512
17 max 2 x 2 / 2 38 x 38 x 512 -> 19 x 19 x 512
18 conv 1024 3 x 3 / 1 19 x 19 x 512 -> 19 x 19 x1024
19 conv 512 1 x 1 / 1 19 x 19 x1024 -> 19 x 19 x 512
20 conv 1024 3 x 3 / 1 19 x 19 x 512 -> 19 x 19 x1024
21 conv 512 1 x 1 / 1 19 x 19 x1024 -> 19 x 19 x 512
22 conv 1024 3 x 3 / 1 19 x 19 x 512 -> 19 x 19 x1024
23 conv 1024 3 x 3 / 1 19 x 19 x1024 -> 19 x 19 x1024
24 conv 1024 3 x 3 / 1 19 x 19 x1024 -> 19 x 19 x1024
25 route 16
26 conv 64 1 x 1 / 1 38 x 38 x 512 -> 38 x 38 x 64
27 reorg / 2 38 x 38 x 64 -> 19 x 19 x 256
28 route 27 24
29 conv 1024 3 x 3 / 1 19 x 19 x1280 -> 19 x 19 x1024
30 conv 425 1 x 1 / 1 19 x 19 x1024 -> 19 x 19 x 425
31 detection
mask_scale: Using default '1.000000'
Loading weights from yolo.weights...Done!
im2col_kernels.cl build log:
2:36:18: warning: '/*' within block comment
fcl build 1 succeeded.
fcl build 2 succeeded.
bcl build succeeded.

activation_kernels.cl build log:
4:21:12: warning: double precision constant requires cl_khr_fp64, casting to single precision
fcl build 1 succeeded.
fcl build 2 succeeded.
bcl build succeeded.

maxpool_layer_kernels.cl build log:
fcl build 1 succeeded.
fcl build 2 succeeded.
bcl build succeeded.

blas_kernels_2.cl build log:
fcl build 1 succeeded.
fcl build 2 succeeded.
bcl build succeeded.

data/dog.jpg: Predicted in 7.806060 seconds.
dog: 82%
car: 28%
truck: 64%
bicycle: 85%
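
To relate the 7.8 s prediction time to the amount of work involved, the layer table in this log can be turned into an operation count. The sketch below is an estimate under stated assumptions: it counts only the convolutional layers, treats one multiply-accumulate as two floating-point operations, and ignores pooling, route, reorg, bias and activation work. The helper names are made up for illustration.

```python
# Convolutional layers copied from the layer table in the log above:
# (filters, kernel_size, out_height, out_width, in_channels), all stride 1.
CONV_LAYERS = [
    (32, 3, 608, 608, 3),
    (64, 3, 304, 304, 32),
    (128, 3, 152, 152, 64),
    (64, 1, 152, 152, 128),
    (128, 3, 152, 152, 64),
    (256, 3, 76, 76, 128),
    (128, 1, 76, 76, 256),
    (256, 3, 76, 76, 128),
    (512, 3, 38, 38, 256),
    (256, 1, 38, 38, 512),
    (512, 3, 38, 38, 256),
    (256, 1, 38, 38, 512),
    (512, 3, 38, 38, 256),
    (1024, 3, 19, 19, 512),
    (512, 1, 19, 19, 1024),
    (1024, 3, 19, 19, 512),
    (512, 1, 19, 19, 1024),
    (1024, 3, 19, 19, 512),
    (1024, 3, 19, 19, 1024),
    (1024, 3, 19, 19, 1024),
    (64, 1, 38, 38, 512),
    (1024, 3, 19, 19, 1280),
    (425, 1, 19, 19, 1024),
]


def conv_macs(filters, size, out_h, out_w, in_c):
    """Multiply-accumulates for one convolutional layer."""
    return out_h * out_w * filters * size * size * in_c


def effective_gflops(total_flops, seconds):
    """Achieved throughput in GFLOP/s for a run that took `seconds`."""
    return total_flops / seconds / 1e9


total_macs = sum(conv_macs(*layer) for layer in CONV_LAYERS)
total_flops = 2 * total_macs  # assumption: 1 MAC = 1 multiply + 1 add

predicted_seconds = 7.806060  # "Predicted in 7.806060 seconds." from the log
print(f"conv work per image: {total_flops / 1e9:.2f} GFLOP")
print(f"effective throughput: {effective_gflops(total_flops, predicted_seconds):.2f} GFLOP/s")
```

Under these assumptions the network needs roughly 63 GFLOP per image, so the 7.8 s run corresponds to around 8 GFLOP/s of achieved throughput.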

@ganyc717
Owner

ganyc717 commented Feb 6, 2018

Hi @anguoyang,
I have tested on my laptop with an Intel HD 4600. It seems the majority of kernel time is spent in the sgemm function. This is a BLAS function, and I suggest not modifying it. However, I noticed that clBLAS has special optimizations for AMD GPUs, which are not included in this repo, so you may want to switch to another GPU and try again. Or just choose a smaller network such as tiny-yolo.
Best regards!
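
As background on why sgemm dominates: upstream Darknet computes each convolutional layer by unrolling the input with im2col and then multiplying it by the filter matrix. Assuming this port does the same (the im2col_kernels.cl build log earlier in the thread suggests so), the matrix sizes handed to sgemm can be sketched from the layer table. The function name `gemm_shape` is hypothetical.

```python
def gemm_shape(filters, size, in_channels, out_h, out_w):
    """GEMM dimensions (M, N, K) for one conv layer under im2col + GEMM.

    The filter matrix is M x K, the unrolled input is K x N,
    and the layer output is M x N.
    """
    m = filters                    # one row per filter
    k = size * size * in_channels  # one unrolled receptive field
    n = out_h * out_w              # one column per output pixel
    return m, n, k


# Two layers from the log above: layer 0 and layer 29.
for name, args in [
    ("layer 0", (32, 3, 3, 608, 608)),
    ("layer 29", (1024, 3, 1280, 19, 19)),
]:
    m, n, k = gemm_shape(*args)
    print(f"{name}: M={m} N={n} K={k}")
```

The two layers produce very differently shaped products (a tiny K with a huge N for layer 0, the reverse for layer 29), which is one reason a single sgemm kernel tuned for one GPU may behave poorly on another.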

@victorv

victorv commented Mar 8, 2018

OpenCL performance is not portable across platforms, so you would need to tune any CL code for the target platform to avoid register spilling, local memory overflow, etc.
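
One small illustration of such tuning is picking a tile size for a tiled matrix multiply so that it fits the device. The sketch below is not taken from this repo: it assumes a kernel that keeps two square tiles of floats in local memory and uses one work-item per tile element. The device limits are made-up example values; in a real program they would be queried with `clGetDeviceInfo` (`CL_DEVICE_LOCAL_MEM_SIZE`, `CL_DEVICE_MAX_WORK_GROUP_SIZE`). Register pressure is not modeled here.

```python
def largest_square_tile(local_mem_bytes, max_work_group_size, bytes_per_element=4):
    """Largest power-of-two T such that two T x T tiles fit in local memory
    and a T x T work-group stays within the device's work-group limit.

    Returns 1 if no larger tile fits.
    """
    tile = 1
    while True:
        candidate = tile * 2
        fits_memory = 2 * candidate * candidate * bytes_per_element <= local_mem_bytes
        fits_group = candidate * candidate <= max_work_group_size
        if not (fits_memory and fits_group):
            return tile
        tile = candidate


# Hypothetical device limits, for illustration only.
example_devices = {
    "device A": (64 * 1024, 512),
    "device B": (48 * 1024, 1024),
    "device C": (1024, 256),
}

for name, (local_mem, max_group) in example_devices.items():
    print(name, "tile size:", largest_square_tile(local_mem, max_group))
```

A tile size that is ideal for one of these example devices either does not fit or leaves resources unused on another, which is the kind of per-platform adjustment described above.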
