Use cublasGemmEx in tensordot_core when CUDA11 #3719
Conversation
Did a very quick 1st pass and left a few comments/questions.
cupy/core/core.pyx
Outdated
cdef struct cuComplex:
    float x, y

cdef struct cuDoubleComplex:
    double x, y
Could you do this at the top instead, for consistency?
cupy/cupy_backends/cuda/libs/cusparse.pyx
Lines 8 to 13 in 0cbfedd
cdef extern from '../cupy_cuComplex.h':
    ctypedef struct cuComplex 'cuComplex':
        float x, y
    ctypedef struct cuDoubleComplex 'cuDoubleComplex':
        double x, y
        double x, y

cpdef ndarray tensordot_core_v11(
I feel a lot of boilerplate code in this new function overlaps with its predecessor tensordot_core(), at least for input/output preparation. Can we defer the code splitting point to later?
Yes, code duplication is a concern of mine as well 😓
Since cublasGemmEx allows you to select a different data type for the output matrix C than the data type of the input matrices A and B, I was thinking of using this to reduce the amount of copying after the GEMM (this was not implemented yet). That's why I branched out early, but there aren't that many opportunities for copy reduction, so I'm going to prioritize reducing code duplication first.
cupy/core/core.pyx
Outdated
if m == 1 and n == 1:
    _tensordot_core_mul_sum(
        a.ravel(), b.ravel(), _manipulation._reshape(out, ()))
    if out is not ret:
        elementwise_copy(out, ret)
    return ret
As an example of the code duplication mentioned above, note that #3678 is fixing this part; if the duplication isn't avoided as much as possible, we'd need to fix it twice 😅
cupy/core/core.pyx
Outdated
    return ret

cdef int _get_cuda_dtype(ndarray a):
- Need to propagate exception if it's raised
- Compare char directly
- cdef int _get_cuda_dtype(ndarray a):
+ cdef int _get_cuda_dtype(ndarray a) except -1:
+     cdef str a_type = a.dtype.char
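(For context, a minimal PR-independent illustration of why except -1 matters: a cdef function returning int has no out-of-band way to signal a Python exception, so without a declared error return value Cython prints and swallows it. The helper below is hypothetical, a sketch of the pattern only.)

# Hypothetical helper: `except -1` marks -1 as the error sentinel,
# so the KeyError below propagates to the caller instead of being
# printed and ignored.
cdef int _lookup_code(dict table, key) except -1:
    if key not in table:
        raise KeyError(key)
    return table[key]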
I'm wondering if this function should go to cupy/core/_dtype.pyx instead...?
Yeah I think a little refactoring would be great: note that the very same function is also needed in cuSPARSE and cuTENSOR, for example:
Lines 70 to 80 in ca79633
def _dtype_to_DataType(dtype):
    if dtype == 'f':
        return runtime.CUDA_R_32F
    elif dtype == 'd':
        return runtime.CUDA_R_64F
    elif dtype == 'F':
        return runtime.CUDA_C_32F
    elif dtype == 'D':
        return runtime.CUDA_C_64F
    else:
        raise TypeError
Lines 44 to 56 in 8299e83
def get_cuda_dtype(numpy_dtype):
    if numpy_dtype == numpy.float16:
        return runtime.CUDA_R_16F
    elif numpy_dtype == numpy.float32:
        return runtime.CUDA_R_32F
    elif numpy_dtype == numpy.float64:
        return runtime.CUDA_R_64F
    elif numpy_dtype == numpy.complex64:
        return runtime.CUDA_C_32F
    elif numpy_dtype == numpy.complex128:
        return runtime.CUDA_C_64F
    else:
        raise TypeError('Dtype {} is not supported'.format(numpy_dtype))
How about modifying the signature like this:
cdef int _dtype_to_cuda_type(dtype, bint is_half_allowed=False) except -1
and reusing it everywhere in the codebase?
Agreed. I also think it's better to reuse a function that converts from NumPy data types to CUDA data types. I'd like to propose the following implementation; what do you think?
cpdef int dtype_to_cuda_dtype(dtype_char, available_dtype_char=None) except -1:
    if available_dtype_char is None:
        available_dtype_char = 'fdFD'
    if dtype_char not in available_dtype_char:
        raise TypeError('dtype is not available: %s' % str(dtype_char))
    if dtype_char == 'e':
        return runtime.CUDA_R_16F
    elif dtype_char == 'f':
        return runtime.CUDA_R_32F
    elif dtype_char == 'd':
        return runtime.CUDA_R_64F
    elif dtype_char == 'F':
        return runtime.CUDA_C_32F
    elif dtype_char == 'D':
        return runtime.CUDA_C_64F
    else:
        raise TypeError('dtype is not supported: %s' % str(dtype_char))
Hi @anaruse My preference is to keep the NumPy dtype as input, because when raising an error it offers a better description than a single char. Also, we can avoid the double comparison (your first not in check and then the ifs). Last, available_dtype_char is useless because the if branches are limited.
I think this could be simpler:
cpdef int dtype_to_cuda_dtype(dtype, bint is_half_allowed=False) except -1:
    cdef str dtype_char = dtype.char
    if dtype_char == 'e' and is_half_allowed:
        return runtime.CUDA_R_16F
    elif dtype_char == 'f':
        return runtime.CUDA_R_32F
    elif dtype_char == 'd':
        return runtime.CUDA_R_64F
    elif dtype_char == 'F':
        return runtime.CUDA_C_32F
    elif dtype_char == 'D':
        return runtime.CUDA_C_64F
    else:
        raise TypeError('dtype is not supported: {}'.format(dtype))
Thank you for your comment, @leofang! I've updated the branch accordingly. Could you take a look when you have time?
cupy/core/_dtype.pxd
Outdated
@@ -1,2 +1,3 @@
  cpdef get_dtype(t)
  cpdef tuple get_dtype_with_itemsize(t)
+ cpdef int dtype_to_cuda_dtype(dtype, bint is_half_allowed=?) except -1
I thought this was the correct syntax? (See the Cython docs.)
- cpdef int dtype_to_cuda_dtype(dtype, bint is_half_allowed=?) except -1
+ cpdef int dtype_to_cuda_dtype(dtype, bint is_half_allowed=*) except -1
(In a .pxd declaration, Cython spells an omitted default argument value as =*.)
@@ -2856,14 +2866,15 @@ cpdef ndarray tensordot_core(
      b.data.ptr, runtime.CUDA_R_16F, <int>ldb,
      a.data.ptr, runtime.CUDA_R_16F, <int>lda,
      <size_t>&zero_fp32,
-     c.data.ptr, Ctype, <int>m,
+     c.data.ptr, runtime.CUDA_R_16F, <int>m,
Do we know for sure c is of type float16 at this stage?
Yes, I checked the original code: if the dtype of matrices a and b is float16, then the dtype of matrix c will always be float16.
Thanks!
-     runtime.CUDA_R_16F, <int>lda, 0, c.data.ptr, Ctype, <int>m)
+     b.data.ptr, runtime.CUDA_R_16F, <int>ldb,
+     a.data.ptr, runtime.CUDA_R_16F, <int>lda, 0,
+     c.data.ptr, runtime.CUDA_R_16F, <int>m)
ditto
cupy/core/core.pyx
Outdated
compute_capability = int(device.get_compute_capability())
algo = cublas.CUBLAS_GEMM_DEFAULT
- compute_capability = int(device.get_compute_capability())
- algo = cublas.CUBLAS_GEMM_DEFAULT
+ cdef int compute_capability = int(device.get_compute_capability())
+ cdef int algo = cublas.CUBLAS_GEMM_DEFAULT
cdef double one_d, zero_d
cdef cuComplex one_F, zero_F
cdef cuDoubleComplex one_D, zero_D
cdef int compute_type
cupy/core/core.pyx
Outdated
a_cuda_dtype = dtype_to_cuda_dtype(a.dtype, is_half_allowed=True)
b_cuda_dtype = dtype_to_cuda_dtype(b.dtype, is_half_allowed=True)
c_cuda_dtype = dtype_to_cuda_dtype(c.dtype, is_half_allowed=True)
- a_cuda_dtype = dtype_to_cuda_dtype(a.dtype, is_half_allowed=True)
- b_cuda_dtype = dtype_to_cuda_dtype(b.dtype, is_half_allowed=True)
- c_cuda_dtype = dtype_to_cuda_dtype(c.dtype, is_half_allowed=True)
+ cdef int a_cuda_dtype = dtype_to_cuda_dtype(a.dtype, is_half_allowed=True)
+ cdef int b_cuda_dtype = dtype_to_cuda_dtype(b.dtype, is_half_allowed=True)
+ cdef int c_cuda_dtype = dtype_to_cuda_dtype(c.dtype, is_half_allowed=True)
cdef cuDoubleComplex one_D, zero_D

if c.dtype.char in 'efF':
    compute_type = cublas.CUBLAS_COMPUTE_32F
Don't we wanna use CUBLAS_COMPUTE_16F for half precision? Does it not work?
You can use CUBLAS_COMPUTE_16F, but I didn't use it here for a few reasons.
Performance: On a GPU with Tensor Cores, if the data types of matrices a, b and c are all half precision, there is little difference in matrix-multiply performance between CUBLAS_COMPUTE_16F and CUBLAS_COMPUTE_32F as the compute type.
Accuracy: If CUBLAS_COMPUTE_32F is used as the compute type, the accumulation in the matrix multiply is performed in float precision, which reduces rounding-error accumulation compared to CUBLAS_COMPUTE_16F, resulting in more accurate results.
Code maintenance: If you specify CUBLAS_COMPUTE_16F as the compute type, the parameters alpha and beta of cublasGemmEx must be pointers to half values. However, half is not a first-class citizen in Cython, which would require somewhat complicated code. I prefer to keep the source code simple.
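(To illustrate that last point: since Cython has no native half type, alpha and beta would have to be smuggled through a 16-bit integer carrying the float16 bit pattern. A hypothetical sketch, not code from this PR, assuming NumPy is available to produce the bits:)

import numpy

# Hypothetical: reinterpret float16 scalars as uint16 so their bit
# patterns can be passed where cublasGemmEx expects const __half*.
cdef unsigned short one_h = numpy.float16(1).view(numpy.uint16)
cdef unsigned short zero_h = numpy.float16(0).view(numpy.uint16)
cublas.gemmEx(
    handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
    <size_t>&one_h,
    a.data.ptr, a_cuda_dtype, <int>lda,
    b.data.ptr, b_cuda_dtype, <int>ldb,
    <size_t>&zero_h, c.data.ptr, c_cuda_dtype, <int>ldc,
    cublas.CUBLAS_COMPUTE_16F, algo)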
Thanks, @anaruse! It makes perfect sense 👍
cupy/core/core.pyx
Outdated
if compute_type == cublas.CUBLAS_COMPUTE_32F:
    one_f = 1
    zero_f = 0
    cublas.gemmEx(
        handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
        <size_t>&one_f,
        a.data.ptr, a_cuda_dtype, <int>lda,
        b.data.ptr, b_cuda_dtype, <int>ldb,
        <size_t>&zero_f, c.data.ptr, c_cuda_dtype, <int>ldc,
        compute_type, algo)
elif compute_type == cublas.CUBLAS_COMPUTE_64F:
    one_d = 1
    zero_d = 0
    cublas.gemmEx(
        handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
        <size_t>&one_d,
        a.data.ptr, a_cuda_dtype, <int>lda,
        b.data.ptr, b_cuda_dtype, <int>ldb,
        <size_t>&zero_d, c.data.ptr, c_cuda_dtype, <int>ldc,
        compute_type, algo)
Looks like they can be combined?!
- if compute_type == cublas.CUBLAS_COMPUTE_32F:
-     one_f = 1
-     zero_f = 0
-     cublas.gemmEx(
-         handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
-         <size_t>&one_f,
-         a.data.ptr, a_cuda_dtype, <int>lda,
-         b.data.ptr, b_cuda_dtype, <int>ldb,
-         <size_t>&zero_f, c.data.ptr, c_cuda_dtype, <int>ldc,
-         compute_type, algo)
- elif compute_type == cublas.CUBLAS_COMPUTE_64F:
-     one_d = 1
-     zero_d = 0
-     cublas.gemmEx(
-         handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
-         <size_t>&one_d,
-         a.data.ptr, a_cuda_dtype, <int>lda,
-         b.data.ptr, b_cuda_dtype, <int>ldb,
-         <size_t>&zero_d, c.data.ptr, c_cuda_dtype, <int>ldc,
-         compute_type, algo)
+ if compute_type in (cublas.CUBLAS_COMPUTE_32F, cublas.CUBLAS_COMPUTE_64F):
+     one = 1
+     zero = 0
+     cublas.gemmEx(
+         handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
+         <size_t>&one,
+         a.data.ptr, a_cuda_dtype, <int>lda,
+         b.data.ptr, b_cuda_dtype, <int>ldb,
+         <size_t>&zero, c.data.ptr, c_cuda_dtype, <int>ldc,
+         compute_type, algo)
It would be nice to be able to do so, but the dtypes of the alpha and beta parameters of cublasGemmEx (in this case, one and zero) have to be a float pointer when the compute type is COMPUTE_32F and a double pointer when it is COMPUTE_64F.
https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx
How about the following implementation?
if compute_type == cublas.CUBLAS_COMPUTE_32F:
    one_f = 1
    zero_f = 0
    one_ptr = <size_t>&one_f
    zero_ptr = <size_t>&zero_f
elif compute_type == cublas.CUBLAS_COMPUTE_64F:
    ...
else:
    ...
cublas.gemmEx(
    handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
    one_ptr, a.data.ptr, a_cuda_dtype, <int>lda,
    b.data.ptr, b_cuda_dtype, <int>ldb,
    zero_ptr, c.data.ptr, c_cuda_dtype, <int>ldc,
    compute_type, algo)
Ah, certainly, that's one way to do it. Thanks @asi1024!
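(For reference, a hedged completion of the branches elided in the sketch above, following the discussion; the cdef declarations are assumed rather than taken from the PR.)

cdef float one_f, zero_f
cdef double one_d, zero_d
cdef size_t one_ptr, zero_ptr
if compute_type == cublas.CUBLAS_COMPUTE_32F:
    one_f = 1
    zero_f = 0
    one_ptr = <size_t>&one_f
    zero_ptr = <size_t>&zero_f
elif compute_type == cublas.CUBLAS_COMPUTE_64F:
    one_d = 1
    zero_d = 0
    one_ptr = <size_t>&one_d
    zero_ptr = <size_t>&zero_d
else:
    raise ValueError('Invalid compute type: {}'.format(compute_type))
cublas.gemmEx(
    handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
    one_ptr, a.data.ptr, a_cuda_dtype, <int>lda,
    b.data.ptr, b_cuda_dtype, <int>ldb,
    zero_ptr, c.data.ptr, c_cuda_dtype, <int>ldc,
    compute_type, algo)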
cupy/core/core.pyx
Outdated
if compute_type == cublas.CUBLAS_COMPUTE_32F:
    one_F = cuComplex(1, 0)
    zero_F = cuComplex(0, 0)
    cublas.gemmEx(
        handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
        <size_t>&one_F,
        a.data.ptr, a_cuda_dtype, <int>lda,
        b.data.ptr, b_cuda_dtype, <int>ldb,
        <size_t>&zero_F, c.data.ptr, c_cuda_dtype, <int>ldc,
        compute_type, algo)
elif compute_type == cublas.CUBLAS_COMPUTE_64F:
    one_D = cuDoubleComplex(1, 0)
    zero_D = cuDoubleComplex(0, 0)
    cublas.gemmEx(
        handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
        <size_t>&one_D,
        a.data.ptr, a_cuda_dtype, <int>lda,
        b.data.ptr, b_cuda_dtype, <int>ldb,
        <size_t>&zero_D, c.data.ptr, c_cuda_dtype, <int>ldc,
        compute_type, algo)
else:
    raise ValueError('Invalid compute type: {}'.format(compute_type))
ditto, something like
- if compute_type == cublas.CUBLAS_COMPUTE_32F:
-     one_F = cuComplex(1, 0)
-     zero_F = cuComplex(0, 0)
-     cublas.gemmEx(
-         handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
-         <size_t>&one_F,
-         a.data.ptr, a_cuda_dtype, <int>lda,
-         b.data.ptr, b_cuda_dtype, <int>ldb,
-         <size_t>&zero_F, c.data.ptr, c_cuda_dtype, <int>ldc,
-         compute_type, algo)
- elif compute_type == cublas.CUBLAS_COMPUTE_64F:
-     one_D = cuDoubleComplex(1, 0)
-     zero_D = cuDoubleComplex(0, 0)
-     cublas.gemmEx(
-         handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
-         <size_t>&one_D,
-         a.data.ptr, a_cuda_dtype, <int>lda,
-         b.data.ptr, b_cuda_dtype, <int>ldb,
-         <size_t>&zero_D, c.data.ptr, c_cuda_dtype, <int>ldc,
-         compute_type, algo)
- else:
-     raise ValueError('Invalid compute type: {}'.format(compute_type))
+ if compute_type == cublas.CUBLAS_COMPUTE_32F:
+     one = cuComplex(1, 0)
+     zero = cuComplex(0, 0)
+ elif compute_type == cublas.CUBLAS_COMPUTE_64F:
+     one = cuDoubleComplex(1, 0)
+     zero = cuDoubleComplex(0, 0)
+ else:
+     raise ValueError('Invalid compute type: {}'.format(compute_type))
+ cublas.gemmEx(
+     handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
+     <size_t>&one,
+     a.data.ptr, a_cuda_dtype, <int>lda,
+     b.data.ptr, b_cuda_dtype, <int>ldb,
+     <size_t>&zero, c.data.ptr, c_cuda_dtype, <int>ldc,
+     compute_type, algo)
For the same reasons as above, I'm afraid we cannot do this either.
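(That said, the pointer-based approach suggested above for the real case should extend to the complex one as well — a hedged sketch, with the declarations assumed rather than taken from the PR:)

cdef cuComplex one_F, zero_F
cdef cuDoubleComplex one_D, zero_D
cdef size_t one_ptr, zero_ptr
if compute_type == cublas.CUBLAS_COMPUTE_32F:
    one_F = cuComplex(1, 0)
    zero_F = cuComplex(0, 0)
    one_ptr = <size_t>&one_F
    zero_ptr = <size_t>&zero_F
elif compute_type == cublas.CUBLAS_COMPUTE_64F:
    one_D = cuDoubleComplex(1, 0)
    zero_D = cuDoubleComplex(0, 0)
    one_ptr = <size_t>&one_D
    zero_ptr = <size_t>&zero_D
else:
    raise ValueError('Invalid compute type: {}'.format(compute_type))
cublas.gemmEx(
    handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
    one_ptr, a.data.ptr, a_cuda_dtype, <int>lda,
    b.data.ptr, b_cuda_dtype, <int>ldb,
    zero_ptr, c.data.ptr, c_cuda_dtype, <int>ldc,
    compute_type, algo)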
        <void*>C, <runtime.DataType>Ctype, ldc,
        <runtime.DataType>computeType, <GemmAlgo>algo)
if computeType >= CUBLAS_COMPUTE_16F:
    status = cublasGemmEx_v11(
Question: It seems there's a C++ overloaded version of cublasGemmEx that supports the old cudaDataType? https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx
I wonder if using that could help, or is it to be deprecated soon?
That's right: even with cublasGemmEx in CUDA 11, you can still specify the compute type with cudaDataType, as long as you're in C++. However, the old way cannot express, for example, TF32 (TensorFloat32) as a compute type, so you need to specify the compute type with cublasComputeType, which was added in CUDA 11.
Ah I see, so this is why we need the new interface...
We will split tensordot_core and tensordot_core_v11 into another file, _routines_linalg.pyx, after the merge of this PR.
Jenkins, test this please.
Successfully created a job for commit 752de1b:
@asi1024 @takagi @kmaehashi I think Jenkins has been dead since yesterday.
LGTM!
if use_sgemmEx:
    Ctype = runtime.CUDA_R_16F if c.dtype == 'e' else runtime.CUDA_R_32F

global _cuda_runtime_version
Note: I think we no longer need to check this, as we've required CUDA 9.0+ since CuPy v8! I will send a PR to remove it from a few places, but for ease of backporting let's keep it here.
Jenkins, test this please.
Successfully created a job for commit 752de1b:
Jenkins CI test (for commit 752de1b, target branch master) succeeded!
I will retrigger CI after chainer/chainer-test#593 is merged.
Jenkins, test this please.
Jenkins CI test (for commit 752de1b, target branch master) succeeded!
LGTM!
This PR modifies CuPy to use cublasGemmEx, an extension of cublas<t>gemm, as the matrix multiply backend called in tensordot_core in the case of CUDA 11. cublasGemmEx is flexible, allowing users to specify the data types of each of the matrices A, B and C, the precision of the computation, and the matrix multiply algorithm to be used.
https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx
This is a kind of preparatory PR; a follow-up PR will allow the use of TF32 (TensorFloat32) as the compute precision of the matrix multiply.
This is related to #3602
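(Purely to illustrate that follow-up direction — a hedged sketch, assuming a hypothetical allow_tf32 switch and that the CUDA 11 enum CUBLAS_COMPUTE_32F_FAST_TF32 is exposed on the cublas module; neither is part of this PR:)

# Hypothetical: opt into TF32 for float32 compute on Ampere-class GPUs.
if (compute_type == cublas.CUBLAS_COMPUTE_32F
        and allow_tf32 and compute_capability >= 80):
    compute_type = cublas.CUBLAS_COMPUTE_32F_FAST_TF32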