[call for contribution] Improving CPU performance #2986
Can you first confirm that the bottleneck is in computation? As far as I know, Caffe uses the same computational engine as ours, yet it is much faster.
@winstywang @hjk41
@Darwin2011 can you try making some changes to see if that is really the case? Thanks!
@hjk41 After uncommenting OpenMP in mshadow/mshadow/tensor_cpu-inl.h, MXNet is still slower than Caffe. The detailed performance data on an Intel(R) Xeon(R) CPU E5-4657L v2 is shown below:
Unlike MKL, NNPACK is not x86-only: it now includes kernels implemented with Clang vector extensions. The implementation was originally developed for the PNaCl port of NNPACK, but it can be compiled for any Clang-supported architecture; LLVM will automatically lower the vector operations to the SIMD ISA of the target. To build NNPACK with the portable SIMD implementation instead of x86-64 assembly, configure with
@xmchen1987 Could you help run a CPU profile? Let's figure out the bottleneck.
@hjk41 @winstywang @Darwin2011 I added a time-measurement function in src/operator/convolution-inl.h and dumped the running time for AlexNet (a minimal timing sketch follows this comment).
Now there are two options to fix this problem:
What do you guys think?
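A minimal sketch of this kind of timing, assuming we simply wrap the operator's forward call with std::chrono; the `MeasureMilliseconds` helper and the usage names below are illustrative, not the actual MXNet code:

```cpp
#include <chrono>
#include <iostream>

// Illustrative wrapper: times a single forward call of some operator.
// The callable passed in stands for the real operator invocation being profiled.
template <typename Fn>
double MeasureMilliseconds(Fn&& forward) {
  auto start = std::chrono::high_resolution_clock::now();
  forward();  // e.g. the convolution forward pass
  auto end = std::chrono::high_resolution_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count();
}

// Usage sketch (names are hypothetical):
//   double ms = MeasureMilliseconds([&] { conv_op.Forward(ctx, in_data, req, out_data, aux); });
//   std::cout << "convolution forward: " << ms << " ms" << std::endl;
```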
@xmchen1987 Great finding. Integrating MKL DNN is also a good idea, but I think if we decide to do that, we should do it systematically and replace every operator that MKL DNN provides. @xmchen1987 are you interested in making this contribution?
@ALL I just updated the issue. As some of you suggested, we should do some profiling first, and we should make sure we have performance comparable with Caffe on the most critical applications. I have listed MNIST-CNN and CIFAR as two of the candidate benchmarks. What else can you think of?
@hjk41 Sure, I can do that. Caffe uses memory copies to do patch2col, which is more efficient; there is no problem using that method (see the sketch below).
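A minimal sketch of that memory-copy style patch2col (im2col), under the simplifying assumptions of stride 1 and no padding; with stride 1, the samples read along one output row are contiguous in the input, so each row segment can be copied with a single memcpy. The function and argument names are illustrative, not Caffe's or MXNet's actual code:

```cpp
#include <cstring>

// Simplified im2col for stride 1 and no padding. The col buffer has shape
// (channels * kernel_h * kernel_w) x (output_h * output_w), and each
// (channel, kernel_row, kernel_col, output_row) segment is one memcpy.
void im2col_stride1_nopad(const float* data_im, int channels,
                          int height, int width,
                          int kernel_h, int kernel_w,
                          float* data_col) {
  const int output_h = height - kernel_h + 1;
  const int output_w = width - kernel_w + 1;
  float* col = data_col;
  for (int c = 0; c < channels; ++c) {
    const float* im_c = data_im + c * height * width;
    for (int kh = 0; kh < kernel_h; ++kh) {
      for (int kw = 0; kw < kernel_w; ++kw) {
        for (int oh = 0; oh < output_h; ++oh) {
          // The output_w samples needed for this output row are contiguous
          // in the input row, so copy them in one shot.
          std::memcpy(col, im_c + (oh + kh) * width + kw,
                      output_w * sizeof(float));
          col += output_w;
        }
      }
    }
  }
}
```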
@hjk41 Can we add AlexNet, since most DL framework benchmarks include AlexNet?
@xmchen1987 Does MKL DNN only support Intel CPUs? Many users will run MXNet on mobile devices using ARM, so it would be better to also keep a general implementation.
Good idea. AlexNet would show us the performance of the CNN implementation.
@tornadomeet MKL supports only x86 CPUs. I think we should add a compile option like we do for OpenBLAS/MKL. I believe MKL DNN will become the de facto choice for DNN on x86, just like cuDNN for GPU.
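A rough sketch of what such a compile option could look like at the operator level; the `MXNET_USE_MKLDNN` flag and the function names below are purely illustrative assumptions, not actual MXNet build flags or APIs:

```cpp
#include <cstdio>

// Minimal stand-in for a tensor type; real code would use mshadow::Tensor.
struct Tensor { /* data, shape, ... */ };

// Generic fallback path (im2col + GEMM), always available.
static void GenericConvolutionForward(const Tensor&, const Tensor&, Tensor*) {
  std::puts("generic im2col + GEMM convolution");
}

#if MXNET_USE_MKLDNN  // illustrative flag, would be set by the build system
static void MKLDNNConvolutionForward(const Tensor&, const Tensor&, Tensor*) {
  std::puts("MKL DNN convolution");
}
#endif

// Compile-time dispatch in the same spirit as the OpenBLAS/MKL switch.
void ConvolutionForwardCPU(const Tensor& data, const Tensor& weight, Tensor* out) {
#if MXNET_USE_MKLDNN
  MKLDNNConvolutionForward(data, weight, out);   // vendor-optimized, x86 only
#else
  GenericConvolutionForward(data, weight, out);  // portable default
#endif
}
```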
@tornadomeet For ARM, I think Eigen is a good choice. If we want better performance, we can't use a general implementation, right? A general implementation usually means only so-so performance. @hjk41 So for now I will integrate MKL DNN as a start to improve performance on x86 CPUs.
@xmchen1987 Have you considered integrating NNPACK? It has some benefits vs. MKL DNN:
@Maratyszcza I personally prefer NNPACK.
I think both NNPACK and MKL DNN are good libraries; we can support both if we have enough resources, just like we support both OpenBLAS and MKL. As to which library to support first, I am fine with either way. I will leave it to @xmchen1987 to decide which library to use first. Whichever he chooses to integrate first, it is a good contribution to the MXNet community. If anyone else would like to join the effort and integrate NNPACK, they are more than welcome.
@Maratyszcza I agree. NNPACK is a promising library; it implements the latest algorithms for convolution and, compared with MKL DNN, it may be faster. @hjk41 I think we can support both of them, and I will choose NNPACK as a start.
@Maratyszcza I find that it only implements the forward pass, not the backward pass, in the Caffe integration.
@xmchen1987 The backward pass for convolution is implemented in NNPACK. The Caffe bindings are
@xmchen1987 @tornadomeet: An argument for Intel DAAL over NNPACK:
Agreed that it would be nice to have both in the long run. Correction: the implementations of the DNN layers in Intel DAAL are not open source, as they come from MKL. The relevant pieces of MKL are included in the DAAL binaries, which are very permissively licensed (Apache License 2.0).
Also, this issue is relevant to this discussion: #2435
@sbodenstein Intel DAAL is irrelevant. It is a high-level library, similar to MXNet. For the actual implementation of compute-intensive operations, it leverages Intel MKL DNN functions.
@Maratyszcza: you are correct, DAAL does indeed call MKL (I didn't know this). But:
So DAAL is similar to cuDNN: the implementation is not open source, but it comes with a permissive license to use. I will correct this above. Also, you are right, we should probably use MKL directly (unless perhaps the license for DAAL is much more permissive?). Also, DAAL is not similar to MXNet: it was designed to be usable from other frameworks, for example:
As a first step, I added both NNPACK and MKL DNN to the forward convolution function (a rough NNPACK call sketch follows this comment). I have tested the prediction accuracy, and it is the same as the original implementation. As NNPACK cannot support strides when the batch size is larger than 1, I compared the performance on the VGG-19 model. The forward performance (batch size 128) on an Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz is shown below: the MKL DNN implementation achieves 2.6x performance and has no problem with core-count scalability.
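A rough sketch of the kind of NNPACK call a forward convolution path makes, using NNPACK's batch API `nnp_convolution_output` (which, as noted above, has no stride support). The argument order is written from memory, so treat it as an assumption and verify against nnpack.h; all sizes below are made up for illustration:

```cpp
#include <nnpack.h>
#include <vector>

// Illustrative forward convolution via NNPACK's batch API (stride-1 only).
bool NNPACKConvolutionForward() {
  if (nnp_initialize() != nnp_status_success) return false;  // e.g. unsupported CPU

  const size_t batch = 128, in_c = 3, out_c = 64;
  struct nnp_size input_size;   input_size.width = 224;  input_size.height = 224;
  struct nnp_size kernel_size;  kernel_size.width = 3;   kernel_size.height = 3;
  struct nnp_padding pad;       pad.top = pad.bottom = pad.left = pad.right = 1;

  std::vector<float> input(batch * in_c * 224 * 224);
  std::vector<float> kernel(out_c * in_c * 3 * 3);
  std::vector<float> bias(out_c, 0.0f);
  std::vector<float> output(batch * out_c * 224 * 224);  // 3x3 kernel, pad 1 keeps 224x224

  // Argument order as I recall it from nnpack.h -- verify against the header.
  nnp_status status = nnp_convolution_output(
      nnp_convolution_algorithm_auto,
      batch, in_c, out_c,
      input_size, pad, kernel_size,
      input.data(), kernel.data(), bias.data(), output.data(),
      nullptr /* threadpool: run on the calling thread */,
      nullptr /* profile */);
  return status == nnp_status_success;
}
```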
After implementing MKL DNN in MXNet, I achieve much better performance with almost the same accuracy. One problem we still need to figure out: when training CIFAR-10, no matter whether we use the base MKL implementation or the MKL DNN implementation, the training speed is only 20+ images/sec. I find that the CPU utilization is very low. @hjk41 @tqchen @mli do you have any hints for this problem?
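One quick diagnostic for the low CPU utilization, assuming the OpenMP runtime is the thread pool in question; this is purely an illustrative check, not an MXNet API:

```cpp
#include <cstdio>
#include <omp.h>

// Prints how many threads OpenMP will actually use. If this reports 1 on a
// many-core machine, the OpenMP-parallelized paths are effectively serial.
int main() {
  std::printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
  #pragma omp parallel
  {
    #pragma omp single
    std::printf("threads in parallel region = %d\n", omp_get_num_threads());
  }
  return 0;
}
```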
@xmchen1987: MKL 2017 has just been released: https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2017-release-notes |
@xmchen1987 I guess it is the OpenMP problem. MShadow uses OpenMP to support multi-threading, but it is turned off by default. If you use MShadow operators a lot, that could hurt performance. Could you do some profiling and tell us whether this is the problem? If it is, then turning on OpenMP would help. Un-comment this line to enable OpenMP:
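For illustration, the effect of enabling that pragma is roughly the following; this is a simplified stand-in for an elementwise CPU kernel, not mshadow's actual MapPlan code:

```cpp
#include <cstddef>

// Simplified elementwise kernel: dst[i] = a[i] + b[i].
// The single pragma below is the kind of line that is commented out by
// default; uncommenting it (and building with -fopenmp or equivalent)
// spreads the loop across OpenMP threads.
void ElementwiseAdd(const float* a, const float* b, float* dst, std::size_t n) {
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i) {
    dst[i] = a[i] + b[i];
  }
}
```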
@sbodenstein Thanks for the reminder; I will check whether there are any API changes compared with the beta version. @hjk41 I have un-commented https://github.com/dmlc/mshadow/blob/478e5fdf13372121421250a987b77083c186f6fd/mshadow/tensor_cpu-inl.h#L148.
@antinucleon @tqchen any suggestions?
FYI, NNPACK now supports Android.
@xmchen1987 You are more than welcome to propose a PR for the BatchNorm. Normally, OpenMP won't provide much improvement for simple elementwise ops; we can enable it manually in the performance-critical ops by hand-crafting some of the loops (see the sketch below).
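As an example of hand-crafting such a loop, here is a sketch of an inference-style batch-norm forward over NCHW data where the outer loops carry the OpenMP pragma explicitly; the shapes and function name are illustrative, not the actual MXNet BatchNorm code:

```cpp
#include <cmath>
#include <cstddef>

// Hand-parallelized inference batch norm over NCHW data:
// out = (x - mean[c]) / sqrt(var[c] + eps) * gamma[c] + beta[c].
// Only the outer (sample, channel) loops carry the pragma, so each thread
// works on a large contiguous block of hw elements.
void BatchNormForward(const float* x, float* out,
                      const float* mean, const float* var,
                      const float* gamma, const float* beta,
                      std::ptrdiff_t n, std::ptrdiff_t c, std::ptrdiff_t hw,
                      float eps) {
  #pragma omp parallel for collapse(2)
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    for (std::ptrdiff_t j = 0; j < c; ++j) {
      const float scale = gamma[j] / std::sqrt(var[j] + eps);
      const float shift = beta[j] - mean[j] * scale;
      const float* xp = x + (i * c + j) * hw;
      float* op = out + (i * c + j) * hw;
      for (std::ptrdiff_t k = 0; k < hw; ++k) {
        op[k] = xp[k] * scale + shift;
      }
    }
  }
}
```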
@tqchen Recently I have been busy tuning MKL performance and forgot to add the PR.
@xmchen1987 @hjk41 @tqchen I looked at the NNPACK bindings in MXNet, and they have room for improvement:
@Maratyszcza Thanks for pointing it out! I created #9719 for this.
Currently we are still slow on CPU (#1222).
There are several things we can do:
Here are some candidates: