add depthwise convolution's gpu version optimization #7393
piiswrong merged 8 commits into apache:master
Conversation
src/operator/convolution-inl.h
Outdated
.describe("Set layout for input, output and weight. Empty for\n "
          "default layout: NCW for 1d, NCHW for 2d and NCDHW for 3d.");
DMLC_DECLARE_FIELD(depthwise_conv_off).set_default(false)
.describe("whether to turn off depthwise convolution for this layer");
any reason we would want to turn this off?
Just to provide a choice, like `cudnn_off`.
Let's remove this if there are no important reasons. Convolution has too many switches.
param.num_filter == (*in_shape)[conv::kData][1] &&
param.kernel.ndim() == 2 &&
param.dilate == mshadow::Shape2(1, 1) &&
dtype == mshadow::kFloat32) {
any reason for limiting to float32?
We do no processing in the CUDA kernel when dtype == mshadow::kFloat16.
Is there a plan to support dilation with the depthwise kernel? It is used in MobileNet v2 + DeepLabv3 for segmentation. TensorFlow has an efficient implementation; MXNet is much slower in this case.
@crazy-cat Will you please implement support for dilation rate > 1?
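For reference, the dilated depthwise convolution being requested can be written down directly in NumPy. This is only an illustrative sketch (naive loops, "valid" padding, names are mine, not MXNet API), useful as a correctness reference:

```python
import numpy as np

def depthwise_conv2d(x, w, stride=1, dilation=1):
    """Naive depthwise 2-D convolution with 'valid' padding.

    x: input,  shape (C, H, W)
    w: filter, shape (C, KH, KW) -- one filter per input channel
    """
    C, H, W = x.shape
    _, KH, KW = w.shape
    # the effective kernel extent grows with the dilation rate
    eKH = (KH - 1) * dilation + 1
    eKW = (KW - 1) * dilation + 1
    OH = (H - eKH) // stride + 1
    OW = (W - eKW) // stride + 1
    y = np.zeros((C, OH, OW), dtype=x.dtype)
    for c in range(C):  # each channel is convolved independently
        for i in range(OH):
            for j in range(OW):
                patch = x[c,
                          i * stride : i * stride + eKH : dilation,
                          j * stride : j * stride + eKW : dilation]
                y[c, i, j] = np.sum(patch * w[c])
    return y
```

With dilation=1 this is the ordinary depthwise conv; a larger dilation only changes the sampling stride inside each patch, which is why the CUDA kernel's indexing, not its structure, would need to change.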
for arr1, arr2 in zip(exe1.outputs + exe1.grad_arrays, exe2.outputs + exe2.grad_arrays):
    np.testing.assert_allclose(arr1.asnumpy(), arr2.asnumpy(), rtol=1e-3, atol=1e-4)

def test_depthwise_convolution():
Please move tests to tests/python/gpu/test_operator_gpu.py and use the standard consistency and numerical gradient tests.
Actually, this is fine. Please reduce 224 to something like 32 or 64 and test 2 or 3 more configs, like different num_base and kernel_size/pad/stride.
kernel = (3, 3)
stride = (1, 1)
pad = (1, 1)
shape = (2, num_base, 224, 224)
input doesn't need to be this big. Use smaller number to make tests run faster.
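The consistency check the test performs, comparing the depthwise operator against a concatenation of per-group convolutions, can be sketched in NumPy at a small size. The function names (`conv2d`, `grouped_conv2d`) are illustrative, not MXNet API; with num_group equal to the channel count, the grouped path reduces to a depthwise convolution:

```python
import numpy as np

def conv2d(x, w):
    """Naive 2-D 'valid' convolution: x (C, H, W), w (O, C, KH, KW) -> (O, OH, OW)."""
    C, H, W = x.shape
    O, _, KH, KW = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    y = np.zeros((O, OH, OW))
    for o in range(O):
        for i in range(OH):
            for j in range(OW):
                y[o, i, j] = np.sum(x[:, i:i + KH, j:j + KW] * w[o])
    return y

def grouped_conv2d(x, w, num_group):
    """Split channels into groups, convolve each group separately, concatenate."""
    C, O = x.shape[0], w.shape[0]
    cg, og = C // num_group, O // num_group  # channels / filters per group
    outs = [conv2d(x[g * cg:(g + 1) * cg], w[g * og:(g + 1) * og])
            for g in range(num_group)]
    return np.concatenate(outs, axis=0)
```

At a shape like (4, 6, 6) this already exercises the same equivalence as the PR's test, without needing 224x224 inputs.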
num_filter=num_filter/num_group, kernel=kernel, stride=stride, pad=pad)
for i in range(num_group)])

dev = mx.gpu(0)
Use `dev = default_context()`. `mx.gpu(0)` will fail on GPU-less test servers.
src/common/cuda_utils.h
Outdated
<< "cuRAND: " << common::cuda::CurandGetErrorString(e); \
}

#define CUDA_1D_KERNEL_LOOP(i, n) \
The macro `CUDA_KERNEL_LOOP` is already defined in `mxnet_op.h`; it is the same as `CUDA_1D_KERNEL_LOOP`.
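Both macros expand to the standard CUDA grid-stride loop, where thread `tid` of a grid with `nthreads` total threads visits indices `tid, tid + nthreads, tid + 2*nthreads, ...`. A host-side sketch of that index pattern (the simulation is mine, not MXNet code):

```python
def grid_stride_indices(n, tid, nthreads):
    """Indices visited by simulated thread `tid` under the grid-stride loop
    that CUDA_KERNEL_LOOP / CUDA_1D_KERNEL_LOOP expand to:

        for (i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
    """
    return list(range(tid, n, nthreads))
```

The point of the pattern is that any problem size `n` is covered exactly once regardless of how many threads are launched, which is why a single shared macro suffices.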
namespace cuda {
template<typename DType, int kFilterWidth, int kFilterHeight>
__global__ void __launch_bounds__(1024, 2)
DepthwiseConv2dBackwardFilterKernel(const DepthwiseArgs args,
Why not put this GPU function into the file depthwise_convolution_tf.cuh?
In depthwise_convolution_tf.cuh, the CUDA kernels' main logic comes from TensorFlow, but this kernel was written by us alone, so we keep it in the mxnet namespace.
Is this logic faster than the logic in TensorFlow? There are 4 loops in total here.
Also, why is there no boundary check during the filter * input computation in this logic, like
`if (in_r_start >= 0 && in_c_start >= 0 && in_r_end < in_rows && in_c_end < in_cols)`?
Yes, we have tested that. The TF version just does atomicAdd from all threads.
The boundary check is in lines 135-138.
Also, the filter size is very small compared with the input or output shape, so contention becomes serious when threads call atomicAdd frequently.
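The contention argument can be made concrete with a quick count. In the filter-gradient pass, if each output element is handled by one thread that does a plain atomicAdd into the filter gradient, every one of the few filter taps receives an atomic update from every output position. This sketch (function name and numbers are illustrative, not measurements from this PR) just counts those conflicting writers:

```python
def filter_grad_atomic_conflicts(out_h, out_w, k_h, k_w):
    """With one thread per output element and a naive atomicAdd into the
    filter gradient, each of the k_h*k_w filter taps (per channel)
    receives one atomic update from every output position."""
    updates_per_tap = out_h * out_w          # conflicting writers per tap
    total_updates = updates_per_tap * k_h * k_w
    return updates_per_tap, total_updates
```

For example, a 112x112 output with a 3x3 filter funnels 12544 atomic adds into each of only 9 accumulators per channel, which is why reducing within a block before the atomicAdd pays off.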
@piiswrong When will MXNet support cudnn v7? Grouped convolutions were included in that version.
@solin319, it is said that grouped conv in cudnn v7 is not efficient when used for depthwise conv, where the group num == input channels.
Nice work, it's a good feature for MobileNet!
@crazy-cat Still not compiling on Windows: https://builds.apache.org/blue/organizations/jenkins/incubator-mxnet/detail/PR-7393/2/pipeline
@piiswrong All checks have passed.
Could you rebase to master and push again? Somehow the test is failing.
Thanks!
@austingg do you have any benchmarks of grouped conv in cudnn v7?
I have tested cudnn 7 for MobileNet on a single Titan X: terrychenism@90cc3d5. cudnn v5: 66 samples/sec
@terrychenism Thanks for your code. I'll test it again.
@BiranLi Thanks for your test data. Now there are two opposite results: cudnn v7 is faster in your test, and depthwise is faster in @terrychenism's test. BTW, in your test, is "cudnn v7 with dw -- 1200 samples" per second, including forward and backward?
@7oud Yes, an entire training pass. Actually, I think I made a mistake in my code; I'll update my test data after retesting. OK, I now get results similar to @terrychenism's.
I think we also need a CPU version of depthwise conv, as this operator is expected to be usable without CUDA in practice.
@terrychenism How about the convergence? I used the symbol in
I ran mobilenet.py with num_group=num_filter and with num_group=1 on a GTX 1080 Ti, on 1024 640x480 images with batch size 8 (128 batches in total), and computed the average run time. Using num_group=num_filter achieves 7.5 ms/frame vs. 10.9 ms/frame for num_group=1. Great!
Is there something specific about MobileNet that allows a speed-up when using depthwise factorized convolutions, which does not hold for ResNet-18?
@leonid-pishchulin Thanks for your experimental data! Now there are three implementations of depthwise conv,
What is the best way of measuring GFLOPS?
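One common convention, not something specified in this thread: count a fused multiply-add as 2 FLOPs, so one conv forward pass costs 2 * out_h * out_w * k_h * k_w * (c_in / groups) * c_out FLOPs, and GFLOPS is that count divided by the measured time in seconds times 1e9. A hedged sketch (function names are mine):

```python
def conv_flops(out_h, out_w, k_h, k_w, c_in, c_out, num_group=1):
    """FLOPs of one conv forward pass, counting a multiply-add as 2 FLOPs."""
    return 2 * out_h * out_w * k_h * k_w * (c_in // num_group) * c_out

def gflops(flops, seconds):
    """Achieved throughput given a measured wall-clock time."""
    return flops / seconds / 1e9
```

By this count, a depthwise layer (num_group == c_in == c_out) does c_in times fewer FLOPs than the dense conv of the same output shape, which is why the run-time comparison alone can be misleading without the FLOP normalization.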
This PR (#7393) is just the optimized conv without cudnn.
I double-checked: DepthwiseConvolutionOp is called when num_group=num_filter for both MobileNet and ResNet. Have you ever measured the speed-up when running ResNet with depth-separable convolutions? I get no speed-up; the performance is even a bit slower when setting num_group=num_filter.
Found a bug in my code. ResNet-18 with num_group=num_filter for the 3x3 conv layers is ~2x faster compared to num_group=1. Thanks for the great feature!
* add depthwise convolution's gpu version optimization
* add more config for test_depthwise_convolution
* remove CUDA_1D_KERNEL_LOOP
* fix windows compiling error
* add support for kAddTo when cal input's backward
* remove depthwise_conv_off params
* Update convolution.cu
* Update test_operator.py
Did someone run tests of the depthwise conv in cudnn v7 on a Pascal GPU? I think we can get some performance improvement on the latest architecture.
As cudnn is not optimized for depthwise convolution, we optimized the GPU version of depthwise 2D convolution.
The training results are as follows:
cudnn version: mobilenet training on imagenet
our version: mobilenet training on imagenet
By default the depthwise conv goes through the optimized version; you can change depthwise_conv_off to True in symbols/mobilenet.py if you want to use the cudnn version.
Hardware:
TITAN X (Pascal) + Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz * 16 + 128G Mem
Software:
cuda 8.0 + cudnn 5.1
As described above, we get about 3-4x the speed compared with the cudnn version. For correctness testing, we compared the result of every depthwise layer with the grouped-conv version.