
[nnpack] update && support more op (#4519)

* [nnpack]docs and makefile

* add missing files

* update with recent docs changes

* move use_nnpack to when creating op

* change fully_connected createop interface to get batch-size
1 parent 614fd94 · commit 29307c29027263654b0d21b3ce8e12bf1f4f9606 · @tornadomeet committed with piiswrong, Jan 8, 2017
@@ -73,7 +73,6 @@ endif
ifeq ($(USE_NNPACK), 1)
CFLAGS += -DMXNET_USE_NNPACK=1
- CFLAGS += -DMXNET_USE_NNPACK_NUM_THREADS=$(USE_NNPACK_NUM_THREADS)
LDFLAGS += -lnnpack
endif
@@ -12,7 +12,9 @@ Typically, you wouldn't need to change these settings, but they are listed here
* MXNET_CPU_WORKER_NTHREADS (default=1)
- The maximum number of threads that do the CPU computation job.
* MXNET_CPU_PRIORITY_NTHREADS (default=4)
- - The number of threads given to prioritized CPU jobs.
+ - The number of threads given to prioritized CPU jobs.
+* MXNET_CPU_NNPACK_NTHREADS (default=4)
+ - The number of threads used for NNPACK.
## Memory Options
@@ -0,0 +1,91 @@
+### Descriptions
+
+[NNPACK](https://github.com/Maratyszcza/NNPACK) is an acceleration package for neural network computations that can run on x86-64, ARMv7, or ARM64 architecture CPUs. It is especially useful for speeding up inference when deploying a trained model on mobile devices.
+
+MXNet (nnvm branch) has integrated NNPACK for forward propagation (inference only) in convolution, max-pooling, and fully-connected layers, so you may consider using NNPACK now.
+
+
+### Conditions
+The underlying implementation of NNPACK uses several acceleration methods, such as [FFT](https://arxiv.org/abs/1312.5851) and [Winograd](https://arxiv.org/abs/1509.09308). These algorithms work best for particular values of `batch size`, `kernel size`, `stride`, etc., so not every convolution, max-pooling, or fully-connected layer can be powered by NNPACK. If the conditions are not met, MXNet falls back to its default implementation automatically.
+
+The following table shows the conditions under which NNPACK is used; a short Python sketch of qualifying operators follows the table.
+
+| operation | conditions |
+|:--------- |:---------- |
+|convolution |2d convolution `and` no_bias=False `and` dilate=(1,1) `and` num_group=1 `and` (batch-size = 1 `or` (batch-size > 1 `and` stride = (1,1))) |
+|pooling | max-pooling `and` kernel=(2,2) `and` stride=(2,2) `and` pooling_convention=full |
+|fully-connected| batch-size = 2^n |
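+
+As a minimal, hypothetical sketch (assuming the standard `mxnet.symbol` Python API; the layer sizes and names are illustrative, not part of this change), the following operators would satisfy the conditions above at inference time:
+
+```python
+import mxnet as mx
+
+data = mx.sym.Variable("data")  # e.g. a (1, 3, 224, 224) input, so batch-size = 1
+# convolution: 2-D kernel, bias enabled, dilate=(1,1), num_group=1, stride=(1,1)
+conv = mx.sym.Convolution(data=data, kernel=(3, 3), stride=(1, 1), pad=(1, 1),
+                          num_filter=64, no_bias=False, num_group=1, dilate=(1, 1))
+# pooling: max-pooling with kernel=(2,2), stride=(2,2), pooling_convention="full"
+pool = mx.sym.Pooling(data=conv, pool_type="max", kernel=(2, 2), stride=(2, 2),
+                      pooling_convention="full")
+# fully-connected: eligible when the batch size is a power of two (1, 2, 4, ...)
+fc = mx.sym.FullyConnected(data=mx.sym.Flatten(data=pool), num_hidden=10)
+```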
+
+### Build/Install NNPACK with MXNet
+
+Now, if your trained model meets the conditions above, you can build MXNet with NNPACK support. Here are the steps (a consolidated sketch follows the list):
+* Install NNPACK following this [tutorial](https://github.com/Maratyszcza/NNPACK#building); note that NNPACK is built with ninja. Make sure to add `--enable-shared` when running configure.py (i.e. `python configure.py --enable-shared`), because MXNet links NNPACK dynamically.
+* Add the NNPACK library path to the environment, e.g. `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$YOUR_NNPACK_INSTALL_PATH/lib`.
+* Add the include paths of NNPACK and its third-party pthreadpool to `ADD_CFLAGS` in config.mk, e.g. `ADD_CFLAGS = -I$(YOUR_NNPACK_INSTALL_PATH)/include -I$(YOUR_NNPACK_INSTALL_PATH)/pthreadpool/include`.
+* Set `USE_NNPACK = 1` in config.mk.
+* [Build MXNet](http://mxnet.io/get_started/setup.html#overview).
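+
+The list above can be condensed into the following sketch. The paths are placeholders for illustration; check where your NNPACK build actually places `lib/` and `include/`, and see the NNPACK tutorial for its full prerequisites:
+
+```bash
+# build NNPACK as a shared library (uses ninja)
+git clone --recursive https://github.com/Maratyszcza/NNPACK.git
+cd NNPACK
+python ./configure.py --enable-shared
+ninja
+export YOUR_NNPACK_INSTALL_PATH=$(pwd)
+
+# make the shared library visible at runtime
+export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$YOUR_NNPACK_INSTALL_PATH/lib
+
+# then, in MXNet's config.mk:
+#   ADD_CFLAGS = -I$(YOUR_NNPACK_INSTALL_PATH)/include -I$(YOUR_NNPACK_INSTALL_PATH)/pthreadpool/include
+#   USE_NNPACK = 1
+# and rebuild MXNet as usual
+```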
+
+### NNPACK Performance
+
+Though not every conv/pool/fc layer can make full use of NNPACK, it can indeed speed up some popular deep learning models such as AlexNet, VGG, and Inception-BN.
+
+Here we use `example/image-classification/benchmark_score.py` (modified to cover a wider range of batch sizes) to benchmark; the CPU is an E5-2670 and MXNET_CPU_NNPACK_NTHREADS=4.
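+
+The invocation is roughly the following sketch (the widened batch-size range inside the script is a local modification and is not shown here):
+
+```bash
+export MXNET_CPU_NNPACK_NTHREADS=4
+python example/image-classification/benchmark_score.py
+```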
+
+Building MXNet without NNPACK, the log is:
+```
+INFO:root:network: alexnet
+INFO:root:device: cpu(0)
+INFO:root:batch size 1, image/sec: 6.389429
+INFO:root:batch size 2, image/sec: 7.961457
+INFO:root:batch size 4, image/sec: 8.950112
+INFO:root:batch size 8, image/sec: 9.578176
+INFO:root:batch size 16, image/sec: 9.701248
+INFO:root:batch size 32, image/sec: 9.839940
+INFO:root:batch size 64, image/sec: 10.075369
+INFO:root:batch size 128, image/sec: 10.053556
+INFO:root:batch size 256, image/sec: 9.972228
+INFO:root:network: vgg
+INFO:root:device: cpu(0)
+INFO:root:batch size 1, image/sec: 1.223822
+INFO:root:batch size 2, image/sec: 1.322814
+INFO:root:batch size 4, image/sec: 1.383586
+INFO:root:batch size 8, image/sec: 1.402376
+INFO:root:batch size 16, image/sec: 1.415972
+INFO:root:batch size 32, image/sec: 1.428377
+INFO:root:batch size 64, image/sec: 1.443987
+INFO:root:batch size 128, image/sec: 1.427531
+INFO:root:batch size 256, image/sec: 1.435279
+```
+
+Building MXNet with NNPACK, the log is:
+
+```
+INFO:root:network: alexnet
+INFO:root:device: cpu(0)
+INFO:root:batch size 1, image/sec: 19.027215
+INFO:root:batch size 2, image/sec: 12.879975
+INFO:root:batch size 4, image/sec: 17.424076
+INFO:root:batch size 8, image/sec: 21.283966
+INFO:root:batch size 16, image/sec: 24.469325
+INFO:root:batch size 32, image/sec: 25.910348
+INFO:root:batch size 64, image/sec: 27.441672
+INFO:root:batch size 128, image/sec: 28.009156
+INFO:root:batch size 256, image/sec: 28.918950
+INFO:root:network: vgg
+INFO:root:device: cpu(0)
+INFO:root:batch size 1, image/sec: 3.980907
+INFO:root:batch size 2, image/sec: 2.392069
+INFO:root:batch size 4, image/sec: 3.610553
+INFO:root:batch size 8, image/sec: 4.994450
+INFO:root:batch size 16, image/sec: 6.396612
+INFO:root:batch size 32, image/sec: 7.614288
+INFO:root:batch size 64, image/sec: 8.826084
+INFO:root:batch size 128, image/sec: 9.193653
+INFO:root:batch size 256, image/sec: 9.991472
+```
+
+This shows that NNPACK gives roughly a 2x~7x speedup over the original MXNet CPU implementation.
+
+### Tips
+
+NNPACK aims to provide high-performance implementations of some layers for multi-core CPUs, so you can easily set the number of threads through the `MXNET_CPU_NNPACK_NTHREADS` environment variable. However, we found that performance is not proportional to the number of threads; we suggest using 4~8 threads when using NNPACK.
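+
+For example, a quick sketch to compare a few thread counts (the values are illustrative):
+
+```bash
+for n in 1 2 4 8; do
+    MXNET_CPU_NNPACK_NTHREADS=$n python example/image-classification/benchmark_score.py
+done
+```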
@@ -44,6 +44,10 @@ with MXNet commit `0a03417`
| 16 | 280.82 | 40.00 | 20.85 | 11.77 | 55.00 | 16.93 |
| 32 | 285.41 | 44.40 | 31.03 | 12.45 | 55.70 | 17.02 |
+## Other CPU
+
+When running on other CPUs (not only Intel CPUs, but also ARM), NNPACK can likewise improve inference performance by about 2x~7x; please check [nnpack.md](./nnpack.md) for details.
+
## Nvidia GPU
`cuDNN` often greatly accelerate performance on Nvidia GPUs, especially for
@@ -90,7 +90,6 @@ USE_MKL2017_EXPERIMENTAL = 0
# whether use NNPACK library
USE_NNPACK = 0
-USE_NNPACK_NUM_THREADS = 4
# choose the version of blas you want to use
# can be: mkl, blas, atlas, openblas
@@ -37,13 +37,15 @@ Operator* CreateOp<cpu>(ConvolutionParam param, int dtype,
break;
}
}
- if (enableMKLWarnGenerated())
- LOG(INFO) << MKLConvolutionOp<cpu, float>::getName() << " Skip MKL optimization";
+ LOG(INFO) << MKLConvolutionOp<cpu, float>::getName() << " Skip MKL optimization";
#endif
#if MXNET_USE_NNPACK == 1
+ const size_t batch_size = (*in_shape)[0][0];
if ((param.dilate[0] == 1 && param.dilate[1] == 1)
&& param.kernel.ndim() == 2 && (!param.no_bias)
- && param.num_group == 1) {
+ && param.num_group == 1 && (batch_size == 1 ||
+ ((batch_size > 1) && (param.stride[0] == 1) &&
+ (param.stride[1] == 1)))) {
switch (dtype) {
case mshadow::kFloat32:
return new NNPACKConvolutionOp<cpu, float>(param);
@@ -78,4 +80,3 @@ MXNET_REGISTER_OP_PROPERTY(Convolution, ConvolutionProp)
} // namespace op
} // namespace mxnet
-
@@ -135,7 +135,10 @@ class FullyConnectedOp : public Operator {
// Decalre Factory function, used for dispatch specialization
template<typename xpu>
-Operator* CreateOp(FullyConnectedParam param, int dtype);
+Operator* CreateOp(FullyConnectedParam param, int dtype,
+ std::vector<TShape> *in_shape,
+ std::vector<TShape> *out_shape,
+ Context ctx);
#if DMLC_USE_CXX11
class FullyConnectedProp : public OperatorProperty {
@@ -9,11 +9,17 @@
#include "./mkl/mkl_memory-inl.h"
#include "./mkl/mkl_fully_connected-inl.h"
#endif // MXNET_USE_MKL2017
+#if MXNET_USE_NNPACK == 1
+#include "./nnpack/nnpack_fully_connected-inl.h"
+#endif // MXNET_USE_NNPACK
namespace mxnet {
namespace op {
template<>
-Operator* CreateOp<cpu>(FullyConnectedParam param, int dtype) {
+Operator* CreateOp<cpu>(FullyConnectedParam param, int dtype,
+ std::vector<TShape> *in_shape,
+ std::vector<TShape> *out_shape,
+ Context ctx) {
Operator *op = NULL;
#if MXNET_USE_MKL2017 == 1
switch (dtype) {
@@ -22,11 +28,25 @@ Operator* CreateOp<cpu>(FullyConnectedParam param, int dtype) {
case mshadow::kFloat64:
return new MKLFullyConnectedOp<cpu, double>(param);
default:
- if (enableMKLWarnGenerated())
- LOG(INFO) << MKLFullyConnectedOp<cpu, float>::getName() << " Skip MKL optimization";
+ LOG(INFO) << MKLFullyConnectedOp<cpu, float>::getName() << " Skip MKL optimization";
break;
}
#endif
+#if MXNET_USE_NNPACK == 1
+ const size_t batch_size = (*in_shape)[0][0];
+  // nnp_fully_connected_inference is optimized for batch-size = 1
+  // nnp_fully_connected_output is optimized for batch-size > 1,
+  // but the NNPACK FullyConnected result was found to be wrong when batch_size != 2^n,
+  // so NNPACK is only used here when batch_size = 2^n.
+ if ((batch_size == 1) || ((batch_size > 1) && (!(batch_size & (batch_size - 1))))) {
+ switch (dtype) {
+ case mshadow::kFloat32:
+ return new NNPACKFullyConnectedOp<cpu, float>(param);
+ default:
+ break;
+ }
+ }
+#endif
switch (dtype) {
case mshadow::kFloat32:
op = new FullyConnectedOp<cpu, float>(param);
@@ -52,7 +72,7 @@ Operator *FullyConnectedProp::CreateOperatorEx(Context ctx, std::vector<TShape>
std::vector<int> out_type, aux_type;
CHECK(InferType(in_type, &out_type, &aux_type));
CHECK(InferShape(in_shape, &out_shape, &aux_shape));
- DO_BIND_DISPATCH(CreateOp, param_, (*in_type)[0]);
+ DO_BIND_DISPATCH(CreateOp, param_, (*in_type)[0], in_shape, &out_shape, ctx);
}
DMLC_REGISTER_PARAMETER(FullyConnectedParam);
@@ -7,7 +7,10 @@
namespace mxnet {
namespace op {
template<>
-Operator* CreateOp<gpu>(FullyConnectedParam param, int dtype) {
+Operator* CreateOp<gpu>(FullyConnectedParam param, int dtype,
+ std::vector<TShape> *in_shape,
+ std::vector<TShape> *out_shape,
+ Context ctx) {
Operator *op = NULL;
MSHADOW_REAL_TYPE_SWITCH(dtype, DType, {
op = new FullyConnectedOp<gpu, DType>(param);
@@ -17,34 +17,11 @@
#include <utility>
#include "../convolution-inl.h"
#include "nnpack.h"
+#include "nnpack_util.h"
namespace mxnet {
namespace op {
-class NNPACKInitialize {
- public:
- pthreadpool_t threadpool;
-
- public:
- NNPACKInitialize() {
- nnp_status status = nnp_initialize();
- if (nnp_status_success != status) {
- LOG(FATAL) << "nnp_initialize failed status=" << status;
- }
- int num_threads = MXNET_USE_NNPACK_NUM_THREADS;
- this->threadpool = pthreadpool_create(num_threads);
- }
- virtual ~NNPACKInitialize() {
- nnp_status status = nnp_deinitialize();
- if (nnp_status_success != status) {
- LOG(FATAL) << "nnp_deinitialize failed status=" << status;
- }
- pthreadpool_destroy(threadpool);
- }
-};
-
-static NNPACKInitialize nnpackinitialize;
-
template <typename xpu, typename DType>
class NNPACKConvolutionOp : public ConvolutionOp<xpu, DType> {
private:
@@ -65,50 +42,61 @@ class NNPACKConvolutionOp : public ConvolutionOp<xpu, DType> {
using namespace mshadow::expr;
Stream<xpu> *s = ctx.get_stream<xpu>();
Tensor<xpu, 4, DType> data = in_data[conv::kData].get<xpu, 4, DType>(s);
+ const size_t batch_size = data.shape_[0];
+ const size_t input_c = data.shape_[1];
+ const size_t input_h = data.shape_[2];
+ const size_t input_w = data.shape_[3];
Shape<3> wmat_shape =
Shape3(param_.num_group, param_.num_filter / param_.num_group,
- data.shape_[1] / param_.num_group * param_.kernel[0] *
+ input_c / param_.num_group * param_.kernel[0] *
param_.kernel[1]);
Tensor<xpu, 3, DType> wmat =
in_data[conv::kWeight].get_with_shape<xpu, 3, DType>(wmat_shape, s);
Tensor<xpu, 4, DType> out = out_data[conv::kOut].get<xpu, 4, DType>(s);
+ nnp_size input_size = {input_w, input_h};
+ nnp_padding input_padding = {param_.pad[0], param_.pad[1], param_.pad[0],
+ param_.pad[1]};
+ nnp_size kernel_size = {param_.kernel[1], param_.kernel[0]};
+ nnp_size output_subsampling = {param_.stride[1], param_.stride[0]};
+ Tensor<xpu, 1, DType> bias = in_data[conv::kBias].get<xpu, 1, DType>(s);
- // nnp_convolution_inference optimize for batch_size==1
- // when W or H less than 16, ConvolutionOp fast than nnpack's convolution
- if ((data.shape_[0] != 1) || (data.shape_[2] < 16) ||
- (data.shape_[3] < 16)) {
- ConvolutionOp<xpu, DType>::Forward(ctx, in_data, req, out_data, aux_args);
+ nnp_convolution_algorithm algorithm = nnp_convolution_algorithm_auto;
+ nnp_convolution_transform_strategy kts = nnp_convolution_transform_strategy_tuple_based;
+ nnp_status status = nnp_status_success;
+ if (batch_size == 1) {
+ status = nnp_convolution_inference(
+ algorithm, // enum nnp_convolution_algorithm,
+ kts, // enum nnp_convolution_transform_strategy,
+ input_c, // size_t input_channels,
+ param_.num_filter, // size_t output_channels,
+ input_size, // struct nnp_size input_size,
+ input_padding, // struct nnp_padding input_padding,
+ kernel_size, // struct nnp_size kernel_size,
+ output_subsampling, // struct nnp_size output_subsampling,
+ data.dptr_, // const float input[],
+ wmat.dptr_, // const float kernel[],
+ bias.dptr_, // const float bias[],
+ out.dptr_, // float output[],
+ nnpackinitialize.threadpool, // pthreadpool_t threadpool,
+ nullptr);
} else {
- nnp_size input_size = {data.shape_[3], data.shape_[2]};
- nnp_padding input_padding = {param_.pad[0], param_.pad[1], param_.pad[0],
- param_.pad[1]};
- nnp_size kernel_size = {param_.kernel[1], param_.kernel[0]};
- nnp_size output_subsampling = {param_.stride[1], param_.stride[0]};
- Tensor<xpu, 1, DType> bias = in_data[conv::kBias].get<xpu, 1, DType>(s);
-
- nnp_convolution_algorithm algorithm = nnp_convolution_algorithm_auto;
- if ((data.shape_[2] < 32) || (data.shape_[3] < 32)) {
- algorithm = nnp_convolution_algorithm_implicit_gemm;
- }
-
- nnp_status status = nnp_convolution_inference(
- algorithm, // enum nnp_convolution_algorithm algorithm,
- nnp_convolution_transform_strategy_tuple_based,
- data.shape_[1], // size_t input_channels,
- param_.num_filter, // size_t output_channels,
- input_size, // struct nnp_size input_size,
- input_padding, // struct nnp_padding input_padding,
- kernel_size, // struct nnp_size kernel_size,
- output_subsampling, // struct nnp_size output_subsampling,
- data.dptr_, // const float input[],
- wmat.dptr_, // const float kernel[],
- bias.dptr_, // const float bias[],
- out.dptr_, // float output[],
- nnpackinitialize.threadpool, // pthreadpool_t threadpool,
- nullptr);
- if (nnp_status_success != status) {
- LOG(FATAL) << "nnp_convolution_inference failed status=" << status;
- }
+ status = nnp_convolution_output(
+ algorithm, // enum nnp_convolution_algorithm algorithm,
+ batch_size, // size_t batch size of input tensor
+ input_c, // size_t input_channels,
+ param_.num_filter, // size_t output_channels,
+ input_size, // struct nnp_size input_size,
+ input_padding, // struct nnp_padding input_padding,
+ kernel_size, // struct nnp_size kernel_size,
+ data.dptr_, // const float input[],
+ wmat.dptr_, // const float kernel[],
+ bias.dptr_, // const float bias[],
+ out.dptr_, // float output[],
+ nnpackinitialize.threadpool, // pthreadpool_t threadpool,
+ nullptr);
+ }
+ if (nnp_status_success != status) {
+ LOG(FATAL) << "nnpack convolution feedforward failed status=" << status;
}
}
}; // class NNPACKConvolutionOp