
Fast and generic implementation using OpenMP and CUDA #45

Open · wants to merge 14 commits into base: main

Conversation

shikishima-TasakiLab (Author)

close #44

d-li14 (Owner) commented Jul 1, 2021

@shikishima-TasakiLab My compilation fails with the following error:

src/pytorch_wrapper.cpp:12:65:   required from here
/usr/include/c++/8/bits/move.h:87:21: error: static assertion failed: template argument substituting _Tp is an lvalue reference type
       static_assert(!std::is_lvalue_reference<_Tp>::value, "template argument"
                     ^~~~
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

shikishima-TasakiLab (Author)

@d-li14
That error message alone does not tell us the cause. To me it looks like the C++ source code was compiled with C compilation settings.

Are you compiling it by running one of the following commands in the environment where PyTorch is installed?

python3 setup.py build

or

python3 setup.py install
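(For context, building such an extension assumes a setup.py roughly along these lines. This is a minimal sketch with placeholder module and source names, not this repository's actual setup.py; only src/pytorch_wrapper.cpp is taken from the error message above.)

# Minimal sketch of a PyTorch C++/CUDA extension build; names are placeholders.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='involution',
    ext_modules=[
        CUDAExtension(
            name='involution_cuda',                 # placeholder extension name
            sources=[
                'src/pytorch_wrapper.cpp',          # C++ wrapper named in the error above
                'src/involution2d_cuda.cu',         # placeholder CUDA source
            ],
            extra_compile_args={'cxx': ['-O3'], 'nvcc': ['-O3']},
        )
    ],
    # BuildExtension compiles .cpp files with the host C++ compiler and .cu files with nvcc.
    cmdclass={'build_ext': BuildExtension},
)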

d-li14 (Owner) commented Jul 1, 2021

@shikishima-TasakiLab Yes, I am running python3 setup.py build
ENV: CUDA 11.0, gcc/g++ 8.3.0, pytorch 1.7.1+cu110

shikishima-TasakiLab (Author) commented Jul 1, 2021

@d-li14
After trying it out in various environments, it seems that my implementation only works with the latest PyTorch 1.9.0.

In the following Docker environment, I was able to build.

d-li14 (Owner) commented Jul 1, 2021

@shikishima-TasakiLab
I see. Since PyTorch 1.9.0 is quite new, would it be possible to modify your implementation to be backward compatible? It would be helpful for people with more common environments.

shikishima-TasakiLab (Author)

@d-li14
I'll try.

shikishima-TasakiLab (Author)

@d-li14
By modifying some parts of the code, I was able to get my implementation to work with PyTorch 1.7.0 and later.
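(For illustration only: one common way to support several PyTorch versions is to detect the installed version in setup.py and pass a preprocessor flag that the C++ sources can check. The flag name below is hypothetical and is not necessarily how this PR achieves 1.7.0 compatibility; EXTRA_COMPILE_ARGS is reused from setup.py only to show where such a flag would go.)

# Sketch only; INVOLUTION_TORCH_PRE_1_9 is a hypothetical flag, not this PR's code.
import torch

TORCH_MAJOR, TORCH_MINOR = (int(v) for v in torch.__version__.split('.')[:2])

EXTRA_COMPILE_ARGS = ['-O3']
if (TORCH_MAJOR, TORCH_MINOR) < (1, 9):
    # The C++ sources could then use `#ifdef INVOLUTION_TORCH_PRE_1_9` to select
    # the API paths available in PyTorch 1.7/1.8.
    EXTRA_COMPILE_ARGS.append('-DINVOLUTION_TORCH_PRE_1_9')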

d-li14 (Owner) commented Jul 4, 2021

@shikishima-TasakiLab
Good job. I will retry soon.

csvance commented Jul 8, 2021

Hi, thank you very much for implementing this; it seems to work very well in full-precision mode. However, I run into numerical-stability issues when using automatic mixed precision (AMP) training: the loss goes to NaN within a few steps. My guess is that the CUDA implementation expects full-precision input, but AMP feeds it half precision.

As a quick workaround, I patched _involution2d so that I could at least use the rest of my network in mixed precision while still using this op.

from typing import Optional, Tuple, Union

import torch
from torch.nn.modules.utils import _pair


def _involution2d(
        input: torch.Tensor,
        weight: torch.Tensor,
        kernel_size: Union[int, Tuple[int, int]] = 7,
        stride: Union[int, Tuple[int, int]] = 1,
        padding: Union[int, Tuple[int, int]] = 0,
        dilation: Union[int, Tuple[int, int]] = 1,
        groups: int = 1,
        bias: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
    # Normalize scalar arguments to (height, width) pairs.
    kernel_size_ = _pair(kernel_size)
    stride_ = _pair(stride)
    padding_ = _pair(padding)
    dilation_ = _pair(dilation)

    # AMP workaround: the CUDA kernel expects full precision, so cast
    # half-precision input back to float32 before dispatching.
    if input.dtype == torch.half:
        input = input.float()

    # `ops.involution.involution2d` is the compiled extension op from this PR.
    output: torch.Tensor = ops.involution.involution2d(
        input, weight, kernel_size_, stride_, padding_, dilation_, groups)

    if bias is not None:
        output += bias.view(1, -1, 1, 1)

    return output

d-li14 (Owner) commented Jul 9, 2021

@shikishima-TasakiLab
When I test inference speed with RedNet-101 on a single V100 GPU, your CUDA implementation seems to be slower: the throughput is 523 images/s, while our official implementation reaches 668 images/s (batch size 256). I wonder why this differs from the single-involution-op comparison on a 2080 Ti that you reported.
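(For reference, throughput figures like these are typically measured along the following lines. This is a generic sketch that assumes a `model` object, e.g. RedNet-101, has already been constructed; it is not the exact script behind the numbers above.)

# Generic inference-throughput sketch; `model` is assumed to be an existing network.
import time
import torch

model = model.cuda().eval()
x = torch.randn(256, 3, 224, 224, device='cuda')   # batch size 256, as in the comment above

with torch.no_grad():
    for _ in range(10):            # warm-up iterations so CUDA kernels and caches are primed
        model(x)
    torch.cuda.synchronize()       # wait for queued GPU work before starting the timer
    start = time.time()
    iters = 50
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f'{iters * x.shape[0] / elapsed:.1f} images/s')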

Comment on lines +212 to +229
at::Tensor involution2d_autocast(
    const torch::autograd::Variable& input,
    const torch::autograd::Variable& weight,
    const std::vector<int64_t>& kernel_size,
    const std::vector<int64_t>& stride,
    const std::vector<int64_t>& padding,
    const std::vector<int64_t>& dilation,
    const int64_t groups
) {
    // Exclude the Autocast dispatch key so the casts below are not intercepted again.
    c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
    // Promote both tensors to at least float32 before dispatching to the kernel.
    auto exec_type = at::autocast::promote_type(at::kFloat, input, weight);
    return involution2d_autograd(
        at::autocast::cached_cast(exec_type, input),
        at::autocast::cached_cast(exec_type, weight),
        kernel_size, stride, padding, dilation, groups
    );
}


@csvance
Fixed the CUDA implementation's input to be full precision using Autocast.
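(In Python terms, the wrapper above does roughly the following. This is an explanatory sketch only, assuming the op is exposed as ops.involution.involution2d as in the earlier Python wrapper; it is not code from this PR.)

# Rough Python equivalent of involution2d_autocast above (sketch only).
import torch

def involution2d_amp_safe(input, weight, kernel_size, stride, padding, dilation, groups):
    # Promote both tensors to at least float32, mirroring
    # at::autocast::promote_type(at::kFloat, input, weight).
    dtype = torch.promote_types(torch.float32,
                                torch.promote_types(input.dtype, weight.dtype))
    # `ops` refers to the compiled extension, as in the _involution2d wrapper above.
    return ops.involution.involution2d(input.to(dtype), weight.to(dtype),
                                       kernel_size, stride, padding, dilation, groups)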

Comment on lines 10 to 11
- #define CUDA_MAX_THREADS 512u
+ #define CUDA_MAX_THREADS 1024u

shikishima-TasakiLab (Author)

@d-li14
In your CuPy implementation, the maximum number of CUDA threads is set to 1024, but my CuPy reimplementation did not work with 1024 in my experiments, so I lowered it to 512.

My CUDA implementation does work with 1024; I had set it to 512 while experimenting and forgot to change it back.

d-li14 (Owner)

@shikishima-TasakiLab
Thanks, but I have tried changing the maximum number of CUDA threads, and the result still seems about the same.

@@ -27,7 +35,7 @@
         ],
         extra_compile_args={
             'cxx': EXTRA_COMPILE_ARGS,
-            'nvcc': ['-O3'],
+            'nvcc': ['-O3'] + GENERATE_CODES,
         }
     )
 )


@d-li14
I changed the NVCC arguments so that code is generated for the different GPU architectures.
Hopefully this will improve the speed.
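(For illustration, GENERATE_CODES is presumably a list of nvcc -gencode flags defined earlier in setup.py, along the lines below. The specific architectures are examples only, not necessarily what this PR targets.)

# Example of what a GENERATE_CODES list can look like (architectures are illustrative).
GENERATE_CODES = [
    '-gencode=arch=compute_70,code=sm_70',   # e.g. V100
    '-gencode=arch=compute_75,code=sm_75',   # e.g. 2080 Ti
    '-gencode=arch=compute_80,code=sm_80',   # e.g. A100
]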

d-li14 (Owner)

@shikishima-TasakiLab
Thanks for your efforts. However, the new code still does not produce the expected speedup on my side.
