This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Use depthwise convolution (group convolution) via cuDNN v7 if available #10804

Closed
kice wants to merge 2 commits into apache:master from kice:master

Conversation

@kice (Contributor) commented May 4, 2018

Description

Use grouped convolution via cuDNN v7 to improve GPU memory usage. This commit is based on terrychenism@90cc3d5

Related PR: #7393

@nihui (Contributor) commented May 4, 2018

How about the speed improvement over the MXNet native implementation?

@szha (Member) commented May 4, 2018

@kice would you list some comparisons of speed and memory usage? Also, does the cuDNN API have any additional constraints?

@snflake commented May 4, 2018

Great work! This seems to explain MXNet's current low performance compared to TensorFlow when a dilation rate > 1 is used together with depthwise convolution. PR #7393 only addresses dilation rate = 1. TensorFlow's custom CUDA implementation also only covers dilation rate 1; they use cuDNN otherwise. The reason is that MXNet did not use the group feature of cuDNN v7, which this PR implements.
Would you fix the merge failure? I would like to test this.

@snflake commented May 4, 2018

The CI failure seems unrelated to this PR.

unknown file: Failure
C++ exception with description "[04:51:17] /work/mxnet/tests/cpp/operator/mkldnn.cc:85: Check failed: mkldnn_format_last == 56 (67 vs. 56)

@snflake commented May 4, 2018

About the speed: I used TensorRT with cuDNN 7 for inference, and depthwise convolution is very fast regardless of dilation rate. There is no need for a custom depthwise convolution implementation if cuDNN 7 grouping is used.

@IFeelBloated

I have been working on something recently that makes heavy use of ResNeXt building blocks; it would be nice to have grouped convolutions directly backed by cuDNN 7.

@piiswrong (Contributor)

How does the cuDNN implementation compare to the custom kernels from TF? Should we always use cuDNN?

@snflake commented May 4, 2018

I got similar runtimes with MobileNet v2 on a laptop (NVIDIA Quadro M1000M) using the custom kernel and grouped convolution via cuDNN 7. IMO, we should always use cuDNN if cuDNN 7 is available.

DType *out_ptr = GetNdPtr(out_data[conv::kOut], param_.kernel.ndim() + 2, s);

#if CUDNN_MAJOR >= 7
typename DataType<DType>::ScaleType alpha = 1.0f;
Contributor

Don't indent #if statements.

CUDNN_CALL(cudnnSetConvolutionGroupCount(back_conv_desc_w_, param_.num_group));
#endif

#if CUDNN_MAJOR <= 6
Contributor

How about creating a new variable effective_num_group and setting it to 1 for cuDNN 7 and num_group otherwise, instead of always using #if tests?

}
}
#if CUDNN_MAJOR >= 7
typename DataType<DType>::ScaleType alpha = 1.0f;
Contributor

This code looks the same as the old version if you change the for loop to 0 -> effective_num_group?

out_ptr));
}
#else
for (uint32_t g = 0; g < param_.num_group; ++g) {
Contributor

Change to for (uint32_t g = 0; g < effective_num_group_; ++g) and you don't need #if tests anymore?

@piiswrong (Contributor)

@kice could you change the algorithm selection test to always prefer cuDNN over the custom kernel when cuDNN >= 7?

@nihui (Contributor) commented May 16, 2018

Hello, some feedback on the speed.

Hardware: Tesla M40 24G x 2
System: CentOS 7
NVIDIA driver: 387.26
CUDA: 9.1
cuDNN: v7.1

Model: MobileNet-v2
Batch size: 256 (128 per GPU)

MXNet implementation: 68 s / 10 iter
cuDNN v7 implementation: 9.5 s / 10 iter (roughly a 7x speedup)

PS: you need to comment out the MXNet DepthwiseConvolutionOp path in src/operator/nn/convolution.cu to enable the cuDNN one.

@HaoLiuHust (Contributor)

@nihui could you explain how you got that improvement?

@piiswrong (Contributor)

@kice Any updates?

@austingg

cuDNN has optimized some special paths for grouped convolution, mostly in cuDNN 7.0.3 and 7.0.4: "Performance improvements for grouped convolutions when input channels and output channels per group are 1, 2, or 4" (cuDNN release notes).

We may reference NVIDIA Caffe's verification:

#if CUDNN_VERSION_MIN(7, 0, 2)
  #define CUDNN_GROUPING
#endif
#if CUDNN_VERSION_MIN(7, 0, 3)
  #define CUDNN_GROUPING2
#endif

bool use_v7grouping() const {
#if defined(CUDNN_GROUPING2)
    return (this->channels_ == this->group_
         || this->channels_ == this->group_ * 2
         || this->channels_ == this->group_ * 4)
        && (this->num_output_ == this->group_
         || this->num_output_ == this->group_ * 2
         || this->num_output_ == this->group_ * 4);
#elif defined(CUDNN_GROUPING)
    return this->channels_ == this->num_output_ && this->channels_ == this->group_;
#else
    return false;
#endif
}

For the old path, it still uses a for loop.

@austingg

I have tested this PR on two 1080 Tis, batch size 96 each, and the speed is about 480 images/sec. However, the original depthwise convolution path gets 650 images/sec, so the speedup may be architecture dependent. Besides, I also tested NVIDIA Caffe on MobileNet-v2-1.0 with cuDNN 7.0.5; it gets 620 images/sec. I believe further fine-tuning is needed.

@piiswrong (Contributor)

Please move to #11076

@piiswrong closed this May 29, 2018