
[WIP] more powerful of NNPACK #4373

Closed
wants to merge 10 commits into from

Conversation

tornadomeet
Contributor

@tornadomeet tornadomeet commented Dec 26, 2016

Currently NNPACK only supports the convolution operator with batch-size=1, which does not fully utilize NNPACK's performance during inference, so we want to make it more powerful.

  • Set the number of NNPACK threads via the MXNET_CPU_NNPACK_NTHREADS environment variable (see the sketch below).
  • Update the convolution operator to use NNPACK.
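
A minimal sketch of how the thread count could be set before the benchmark runs, assuming MXNet reads MXNET_CPU_NNPACK_NTHREADS from the process environment; exporting it in the shell before launching the script is the safer equivalent:

```python
import os

# Set the NNPACK thread count before MXNet is imported, so the variable is
# already in the environment when the library reads it. Equivalent to running:
#   MXNET_CPU_NNPACK_NTHREADS=4 python example/image-classification/benchmark_score.py
os.environ["MXNET_CPU_NNPACK_NTHREADS"] = "4"

import mxnet as mx  # imported after the variable is set
```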

With MXNET_CPU_NNPACK_NTHREADS=4 set, and before this PR's changes (i.e., with the existing NNPACK support), running example/image-classification/benchmark_score.py gives the following log:

INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 6.733318
INFO:root:batch size  2, image/sec: 7.980731
INFO:root:batch size  4, image/sec: 9.090355
INFO:root:batch size  8, image/sec: 9.589279
INFO:root:batch size 16, image/sec: 9.836241
INFO:root:batch size 32, image/sec: 9.975417
INFO:root:batch size 64, image/sec: 10.075369
INFO:root:batch size 128, image/sec: 10.053556
INFO:root:batch size 256, image/sec: 9.972228
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 2.360555
INFO:root:batch size  2, image/sec: 1.320285
INFO:root:batch size  4, image/sec: 1.381983
INFO:root:batch size  8, image/sec: 1.406876
INFO:root:batch size 16, image/sec: 1.415913
INFO:root:batch size 32, image/sec: 1.428377
INFO:root:batch size 64, image/sec: 1.431983
INFO:root:batch size 128, image/sec: 1.428631
INFO:root:batch size 256, image/sec: 1.433979

After updating convolution based on this PR, the log is:

INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 8.720897
INFO:root:batch size  2, image/sec: 9.831821
INFO:root:batch size  4, image/sec: 14.671470
INFO:root:batch size  8, image/sec: 18.792820
INFO:root:batch size 16, image/sec: 21.152899
INFO:root:batch size 32, image/sec: 23.229446
INFO:root:batch size 64, image/sec: 25.290079
INFO:root:batch size 128, image/sec: 26.793525
INFO:root:batch size 256, image/sec: 27.801579
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 2.259832
INFO:root:batch size  2, image/sec: 2.058064
INFO:root:batch size  4, image/sec: 3.224434
INFO:root:batch size  8, image/sec: 4.630889
INFO:root:batch size 16, image/sec: 6.084865
INFO:root:batch size 32, image/sec: 7.371423
INFO:root:batch size 64, image/sec: 8.461219
INFO:root:batch size 128, image/sec: 9.123711
INFO:root:batch size 256, image/sec: 9.689774

From the log we can see that NNPACK is also very useful when batch-size > 1, giving roughly a 2x~7x speed-up (e.g., at batch size 256, alexnet improves from 9.97 to 27.80 images/sec, about 2.8x, and vgg from 1.43 to 9.69 images/sec, about 6.8x).

  • Support the max-pooling op

For now, NNPACK only supports max-pooling with kernel=2, stride=2, and pooling_convention=kFull, so the speed log is the same as the post-convolution numbers above, because the benchmarked symbols use pooling_convention=kValid.
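
An illustrative sketch (assuming the standard mx.sym.Pooling interface) contrasting a pooling configuration that meets the current NNPACK constraints with the one used by the benchmarked networks:

```python
import mxnet as mx

data = mx.sym.Variable("data")

# Matches the current NNPACK constraints: max pooling, kernel=2, stride=2,
# pooling_convention='full' (kFull in the C++ enum).
nnpack_pool = mx.sym.Pooling(data=data, pool_type="max", kernel=(2, 2),
                             stride=(2, 2), pooling_convention="full")

# The benchmarked networks use the default pooling_convention='valid' (kValid),
# so their pooling layers fall back to the default CPU implementation.
default_pool = mx.sym.Pooling(data=data, pool_type="max", kernel=(2, 2),
                              stride=(2, 2), pooling_convention="valid")
```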

  • Support the fully-connected op

After adding NNPACK support for fully-connected layers, the speed is a little higher:

INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 19.027215
INFO:root:batch size  2, image/sec: 12.879975
INFO:root:batch size  4, image/sec: 17.424076
INFO:root:batch size  8, image/sec: 21.283966
INFO:root:batch size 16, image/sec: 24.469325
INFO:root:batch size 32, image/sec: 25.910348
INFO:root:batch size 64, image/sec: 27.441672
INFO:root:batch size 128, image/sec: 28.009156
INFO:root:batch size 256, image/sec: 28.918950
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 3.980907
INFO:root:batch size  2, image/sec: 2.392069
INFO:root:batch size  4, image/sec: 3.610553
INFO:root:batch size  8, image/sec: 4.994450
INFO:root:batch size 16, image/sec: 6.396612
INFO:root:batch size 32, image/sec: 7.614288
INFO:root:batch size 64, image/sec: 8.826084
INFO:root:batch size 128, image/sec: 9.193653
INFO:root:batch size 256, image/sec: 9.991472

  • Documentation on using NNPACK

@tornadomeet
Contributor Author

tornadomeet commented Dec 27, 2016

@mli @clcarwin

NNPACK now supports conv/max-pool/fc; the speed-up is about 2x~7x.
The inference results should also be correct: I tested caffenet on the ImageNet val dataset (50k images) with batch-size = 256, and the results are:

lib      top-1     top-5
cuDNN    54.257%   78.199%
NNPACK   54.229%   78.125%

@mli
Member

mli commented Dec 27, 2016

Can you please add a link in how_to/perf.md mentioning that NNPACK can also accelerate CPU performance?

Also, do you have EC2 c4.8xlarge numbers? And have you tried ARM, e.g. Android or a Raspberry Pi?

@tornadomeet
Contributor Author

@mli I found that the newest how_to/perf.md is in the master branch, so shall I change this PR to target the master branch instead of the nnvm branch?
I have no EC2 c4.8xlarge; I tested on my Linux server with 256G memory. cat /proc/cpuinfo:

processor   : 0  
vendor_id   : GenuineIntel
cpu family  : 6  
model       : 63 
model name  : Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
stepping    : 2  
microcode   : 0x31 
cpu MHz     : 1200.312
cache size  : 30720 KB
physical id : 0
siblings    : 24 
core id     : 0  
cpu cores   : 12 
apicid      : 0  
initial apicid  : 0  
fpu     : yes
fpu_exception   : yes
cpuid level : 15 
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc
bogomips    : 4599.79
clflush size    : 64 
cache_alignment : 64 
address sizes   : 46 bits physical, 48 bits virtual
power management:

I'm not familiar with ARM or Android. @clcarwin, would you help test NNPACK on Android? Thanks.

@clcarwin
Contributor

I'm busy these days. Maybe I can test it next week on Android.

@tornadomeet
Contributor Author

@clcarwin thanks. It's better to test with batch-size >= 1.

@xlvector
Contributor

xlvector commented Jan 5, 2017

Why does a larger batch_size give better performance?

@piiswrong
Contributor

cache efficiency mostly

@tornadomeet tornadomeet deleted the nnpack branch January 9, 2017 02:10