
[WIP] more powerful of NNPACK #4373

Closed
wants to merge 10 commits into from

Conversation

tornadomeet
Contributor

@tornadomeet tornadomeet commented Dec 26, 2016

Currently NNPACK only supports the convolution operator with batch-size=1, which does not fully utilize NNPACK's performance during inference, so we want to make it more powerful.

  • Set the number of NNPACK threads via the MXNET_CPU_NNPACK_NTHREADS environment variable (see the sketch below).
  • Update the convolution operator to use NNPACK.
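
A minimal sketch of how the thread count could be set before the benchmark runs, assuming MXNet reads MXNET_CPU_NNPACK_NTHREADS from the process environment; exporting it in the shell before launching the script is the safer equivalent:

```python
import os

# Set the NNPACK thread count before MXNet is imported, so the variable is
# already in the environment when the library reads it. Equivalent to running:
#   MXNET_CPU_NNPACK_NTHREADS=4 python example/image-classification/benchmark_score.py
os.environ["MXNET_CPU_NNPACK_NTHREADS"] = "4"

import mxnet as mx  # imported after the variable is set
```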

With MXNET_CPU_NNPACK_NTHREADS=4 set, and before this PR's changes (i.e., with the existing NNPACK support), running example/image-classification/benchmark_score.py gives the following log:

INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 6.733318
INFO:root:batch size  2, image/sec: 7.980731
INFO:root:batch size  4, image/sec: 9.090355
INFO:root:batch size  8, image/sec: 9.589279
INFO:root:batch size 16, image/sec: 9.836241
INFO:root:batch size 32, image/sec: 9.975417
INFO:root:batch size 64, image/sec: 10.075369
INFO:root:batch size 128, image/sec: 10.053556
INFO:root:batch size 256, image/sec: 9.972228
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 2.360555
INFO:root:batch size  2, image/sec: 1.320285
INFO:root:batch size  4, image/sec: 1.381983
INFO:root:batch size  8, image/sec: 1.406876
INFO:root:batch size 16, image/sec: 1.415913
INFO:root:batch size 32, image/sec: 1.428377
INFO:root:batch size 64, image/sec: 1.431983
INFO:root:batch size 128, image/sec: 1.428631
INFO:root:batch size 256, image/sec: 1.433979

After updating convolution based on this PR, the log is:

INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 8.720897
INFO:root:batch size  2, image/sec: 9.831821
INFO:root:batch size  4, image/sec: 14.671470
INFO:root:batch size  8, image/sec: 18.792820
INFO:root:batch size 16, image/sec: 21.152899
INFO:root:batch size 32, image/sec: 23.229446
INFO:root:batch size 64, image/sec: 25.290079
INFO:root:batch size 128, image/sec: 26.793525
INFO:root:batch size 256, image/sec: 27.801579
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 2.259832
INFO:root:batch size  2, image/sec: 2.058064
INFO:root:batch size  4, image/sec: 3.224434
INFO:root:batch size  8, image/sec: 4.630889
INFO:root:batch size 16, image/sec: 6.084865
INFO:root:batch size 32, image/sec: 7.371423
INFO:root:batch size 64, image/sec: 8.461219
INFO:root:batch size 128, image/sec: 9.123711
INFO:root:batch size 256, image/sec: 9.689774

From the log we can see that NNPACK is also very useful when batch-size > 1, giving roughly a 2x~7x speed-up (e.g., at batch size 256, alexnet improves from 9.97 to 27.80 images/sec, about 2.8x, and vgg from 1.43 to 9.69 images/sec, about 6.8x).

  • Support the max-pooling op

For now, NNPACK only supports max-pooling with kernel=2, stride=2, and pooling_convention=kFull, so the speed log is the same as the post-convolution numbers above, because the benchmarked symbols use pooling_convention=kValid.
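
An illustrative sketch (assuming the standard mx.sym.Pooling interface) contrasting a pooling configuration that meets the current NNPACK constraints with the one used by the benchmarked networks:

```python
import mxnet as mx

data = mx.sym.Variable("data")

# Matches the current NNPACK constraints: max pooling, kernel=2, stride=2,
# pooling_convention='full' (kFull in the C++ enum).
nnpack_pool = mx.sym.Pooling(data=data, pool_type="max", kernel=(2, 2),
                             stride=(2, 2), pooling_convention="full")

# The benchmarked networks use the default pooling_convention='valid' (kValid),
# so their pooling layers fall back to the default CPU implementation.
default_pool = mx.sym.Pooling(data=data, pool_type="max", kernel=(2, 2),
                              stride=(2, 2), pooling_convention="valid")
```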

  • Support the fully-connected op

After adding NNPACK support for fully-connected layers, the speed is a little higher:

INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 19.027215
INFO:root:batch size  2, image/sec: 12.879975
INFO:root:batch size  4, image/sec: 17.424076
INFO:root:batch size  8, image/sec: 21.283966
INFO:root:batch size 16, image/sec: 24.469325
INFO:root:batch size 32, image/sec: 25.910348
INFO:root:batch size 64, image/sec: 27.441672
INFO:root:batch size 128, image/sec: 28.009156
INFO:root:batch size 256, image/sec: 28.918950
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 3.980907
INFO:root:batch size  2, image/sec: 2.392069
INFO:root:batch size  4, image/sec: 3.610553
INFO:root:batch size  8, image/sec: 4.994450
INFO:root:batch size 16, image/sec: 6.396612
INFO:root:batch size 32, image/sec: 7.614288
INFO:root:batch size 64, image/sec: 8.826084
INFO:root:batch size 128, image/sec: 9.193653
INFO:root:batch size 256, image/sec: 9.991472

  • Documentation on using NNPACK

@tornadomeet
Contributor Author

tornadomeet commented Dec 27, 2016

@mli @clcarwin

NNPACK now supports conv/max-pool/fc; the speed-up is about 2x~7x.
The inference results should also be correct: I tested caffenet on the ImageNet val dataset (50k images) with batch-size = 256, and the results are:

lib      top-1     top-5
cuDNN    54.257%   78.199%
NNPACK   54.229%   78.125%

@mli
Member

mli commented Dec 27, 2016

Can you please add a link in how_to/perf.md mentioning that NNPACK can also accelerate CPU performance?

Also, do you have EC2 c4.8xlarge numbers? And have you tried ARM, e.g. Android or a Raspberry Pi?

@tornadomeet
Contributor Author

@mli I found that the newest how_to/perf.md is in the master branch, so shall I change this PR to target the master branch instead of the nnvm branch?
I have no EC2 c4.8xlarge; I tested on my Linux server with 256G memory. cat /proc/cpuinfo:

processor   : 0  
vendor_id   : GenuineIntel
cpu family  : 6  
model       : 63 
model name  : Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
stepping    : 2  
microcode   : 0x31 
cpu MHz     : 1200.312
cache size  : 30720 KB
physical id : 0
siblings    : 24 
core id     : 0  
cpu cores   : 12 
apicid      : 0  
initial apicid  : 0  
fpu     : yes
fpu_exception   : yes
cpuid level : 15 
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc
bogomips    : 4599.79
clflush size    : 64 
cache_alignment : 64 
address sizes   : 46 bits physical, 48 bits virtual
power management:

I'm not familiar with ARM or Android. @clcarwin, would you help test NNPACK on Android? Thanks.

@clcarwin
Contributor

I'm busy these days. Maybe I can test it next week on Android.

@tornadomeet
Contributor Author

@clcarwin thanks. It's better to test with batch-size >= 1.

@xlvector
Contributor

xlvector commented Jan 5, 2017

Why does a larger batch_size give better performance?

@piiswrong
Contributor

cache efficiency mostly

@tornadomeet tornadomeet deleted the nnpack branch January 9, 2017 02:10