
Fast and generic implementation using OpenMP and CUDA #45

Open · wants to merge 14 commits into base: main

Conversation

shikishima-TasakiLab (Author)

close #44

d-li14 (Owner) commented Jul 1, 2021

@shikishima-TasakiLab My compilation fails with the following error:

src/pytorch_wrapper.cpp:12:65:   required from here
/usr/include/c++/8/bits/move.h:87:21: error: static assertion failed: template argument substituting _Tp is an lvalue reference type
       static_assert(!std::is_lvalue_reference<_Tp>::value, "template argument"
                     ^~~~
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

shikishima-TasakiLab (Author)

@d-li14
That error message alone does not tell us the cause. To me it looks like the C++ source code was compiled with C compilation settings.

Are you compiling it by running one of the following commands in the environment where PyTorch is installed?

python3 setup.py build

or

python3 setup.py install
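(For context, building such an extension assumes a setup.py roughly along these lines. This is a minimal sketch with placeholder module and source names, not this repository's actual setup.py; only src/pytorch_wrapper.cpp is taken from the error message above.)

# Minimal sketch of a PyTorch C++/CUDA extension build; names are placeholders.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='involution',
    ext_modules=[
        CUDAExtension(
            name='involution_cuda',                 # placeholder extension name
            sources=[
                'src/pytorch_wrapper.cpp',          # C++ wrapper named in the error above
                'src/involution2d_cuda.cu',         # placeholder CUDA source
            ],
            extra_compile_args={'cxx': ['-O3'], 'nvcc': ['-O3']},
        )
    ],
    # BuildExtension compiles .cpp files with the host C++ compiler and .cu files with nvcc.
    cmdclass={'build_ext': BuildExtension},
)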

d-li14 (Owner) commented Jul 1, 2021

@shikishima-TasakiLab Yes, I am running python3 setup.py build
ENV: CUDA 11.0, gcc/g++ 8.3.0, pytorch 1.7.1+cu110

shikishima-TasakiLab (Author) commented Jul 1, 2021

@d-li14
After trying it out in various environments, it seems that my implementation only works with the latest PyTorch 1.9.0.

In the following Docker environment, I was able to build.

d-li14 (Owner) commented Jul 1, 2021

@shikishima-TasakiLab
I see. Since PyTorch 1.9.0 is quite new, would it be possible to modify your implementation to be backward compatible? It would be helpful for people with more common environments.

shikishima-TasakiLab (Author)

@d-li14
I'll try.

shikishima-TasakiLab (Author)

@d-li14
By modifying some parts of the code, I was able to get my implementation to work with PyTorch 1.7.0 and later.
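(For illustration only: one common way to support several PyTorch versions is to detect the installed version in setup.py and pass a preprocessor flag that the C++ sources can check. The flag name below is hypothetical and is not necessarily how this PR achieves 1.7.0 compatibility; EXTRA_COMPILE_ARGS is reused from setup.py only to show where such a flag would go.)

# Sketch only; INVOLUTION_TORCH_PRE_1_9 is a hypothetical flag, not this PR's code.
import torch

TORCH_MAJOR, TORCH_MINOR = (int(v) for v in torch.__version__.split('.')[:2])

EXTRA_COMPILE_ARGS = ['-O3']
if (TORCH_MAJOR, TORCH_MINOR) < (1, 9):
    # The C++ sources could then use `#ifdef INVOLUTION_TORCH_PRE_1_9` to select
    # the API paths available in PyTorch 1.7/1.8.
    EXTRA_COMPILE_ARGS.append('-DINVOLUTION_TORCH_PRE_1_9')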

d-li14 (Owner) commented Jul 4, 2021

@shikishima-TasakiLab
Good job. I will retry soon.

csvance commented Jul 8, 2021

Hi, thank you very much for implementing this; it seems to work very well in full-precision mode. However, I run into numerical-stability issues when using automatic mixed precision (AMP) training: the loss goes to NaN within a few steps. My guess is that the CUDA implementation expects full-precision input, but AMP feeds it half precision.

As a quick workaround, I patched _involution2d so that I could at least use the rest of my network in mixed precision while still using this op.

from typing import Optional, Tuple, Union

import torch
from torch.nn.modules.utils import _pair


def _involution2d(
        input: torch.Tensor,
        weight: torch.Tensor,
        kernel_size: Union[int, Tuple[int, int]] = 7,
        stride: Union[int, Tuple[int, int]] = 1,
        padding: Union[int, Tuple[int, int]] = 0,
        dilation: Union[int, Tuple[int, int]] = 1,
        groups: int = 1,
        bias: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
    # Normalize scalar arguments to (height, width) pairs.
    kernel_size_ = _pair(kernel_size)
    stride_ = _pair(stride)
    padding_ = _pair(padding)
    dilation_ = _pair(dilation)

    # AMP workaround: the CUDA kernel expects full precision, so cast
    # half-precision input back to float32 before dispatching.
    if input.dtype == torch.half:
        input = input.float()

    # `ops.involution.involution2d` is the compiled extension op from this PR.
    output: torch.Tensor = ops.involution.involution2d(
        input, weight, kernel_size_, stride_, padding_, dilation_, groups)

    if bias is not None:
        output += bias.view(1, -1, 1, 1)

    return output

d-li14 (Owner) commented Jul 9, 2021

@shikishima-TasakiLab
When I test inference speed with RedNet-101 on a single V100 GPU, your CUDA implementation seems to be slower: the throughput is 523 images/s, while our official implementation reaches 668 images/s (batch size 256). I wonder why this differs from the single-involution-op comparison on a 2080 Ti that you reported.
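(For reference, throughput figures like these are typically measured along the following lines. This is a generic sketch that assumes a `model` object, e.g. RedNet-101, has already been constructed; it is not the exact script behind the numbers above.)

# Generic inference-throughput sketch; `model` is assumed to be an existing network.
import time
import torch

model = model.cuda().eval()
x = torch.randn(256, 3, 224, 224, device='cuda')   # batch size 256, as in the comment above

with torch.no_grad():
    for _ in range(10):            # warm-up iterations so CUDA kernels and caches are primed
        model(x)
    torch.cuda.synchronize()       # wait for queued GPU work before starting the timer
    start = time.time()
    iters = 50
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f'{iters * x.shape[0] / elapsed:.1f} images/s')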

Comment on lines +212 to +229
at::Tensor involution2d_autocast(
    const torch::autograd::Variable& input,
    const torch::autograd::Variable& weight,
    const std::vector<int64_t>& kernel_size,
    const std::vector<int64_t>& stride,
    const std::vector<int64_t>& padding,
    const std::vector<int64_t>& dilation,
    const int64_t groups
) {
    // Exclude the Autocast dispatch key so the casts below are not intercepted again.
    c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
    // Promote both tensors to at least float32 before dispatching to the kernel.
    auto exec_type = at::autocast::promote_type(at::kFloat, input, weight);
    return involution2d_autograd(
        at::autocast::cached_cast(exec_type, input),
        at::autocast::cached_cast(exec_type, weight),
        kernel_size, stride, padding, dilation, groups
    );
}


@csvance
Fixed the CUDA implementation's input to be full precision using Autocast.
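(In Python terms, the wrapper above does roughly the following. This is an explanatory sketch only, assuming the op is exposed as ops.involution.involution2d as in the earlier Python wrapper; it is not code from this PR.)

# Rough Python equivalent of involution2d_autocast above (sketch only).
import torch

def involution2d_amp_safe(input, weight, kernel_size, stride, padding, dilation, groups):
    # Promote both tensors to at least float32, mirroring
    # at::autocast::promote_type(at::kFloat, input, weight).
    dtype = torch.promote_types(torch.float32,
                                torch.promote_types(input.dtype, weight.dtype))
    # `ops` refers to the compiled extension, as in the _involution2d wrapper above.
    return ops.involution.involution2d(input.to(dtype), weight.to(dtype),
                                       kernel_size, stride, padding, dilation, groups)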

Comment on lines 10 to 11
- #define CUDA_MAX_THREADS 512u
+ #define CUDA_MAX_THREADS 1024u

shikishima-TasakiLab (Author)

@d-li14
In your CuPy implementation, the maximum number of CUDA threads is set to 1024, but my CuPy reimplementation did not work with 1024 in my experiments, so I lowered it to 512.

My CUDA implementation does work with 1024; I had set it to 512 while experimenting and forgot to change it back.

d-li14 (Owner)

@shikishima-TasakiLab
Thanks, but I have tried changing the maximum number of CUDA threads, and the result still seems about the same.

@@ -27,7 +35,7 @@
         ],
         extra_compile_args={
             'cxx': EXTRA_COMPILE_ARGS,
-            'nvcc': ['-O3'],
+            'nvcc': ['-O3'] + GENERATE_CODES,
         }
     )
 )


@d-li14
I changed the NVCC arguments so that code is generated for the different GPU architectures.
Hopefully this will improve the speed.
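(For illustration, GENERATE_CODES is presumably a list of nvcc -gencode flags defined earlier in setup.py, along the lines below. The specific architectures are examples only, not necessarily what this PR targets.)

# Example of what a GENERATE_CODES list can look like (architectures are illustrative).
GENERATE_CODES = [
    '-gencode=arch=compute_70,code=sm_70',   # e.g. V100
    '-gencode=arch=compute_75,code=sm_75',   # e.g. 2080 Ti
    '-gencode=arch=compute_80,code=sm_80',   # e.g. A100
]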

d-li14 (Owner)

@shikishima-TasakiLab
Thanks for your efforts. However, the new code still does not produce the expected speedup on my side.
