Use cublasGemmEx in tensordot_core when CUDA11 #3719
Conversation
Did a very quick 1st pass and left a few comments/questions.
cupy/core/core.pyx
Outdated
cdef struct cuComplex:
    float x, y

cdef struct cuDoubleComplex:
    double x, y
Could you do this at the top instead, for consistency?
cupy/cupy_backends/cuda/libs/cusparse.pyx
Lines 8 to 13 in 0cbfedd
cdef extern from '../cupy_cuComplex.h':
    ctypedef struct cuComplex 'cuComplex':
        float x, y
    ctypedef struct cuDoubleComplex 'cuDoubleComplex':
        double x, y
        double x, y

cpdef ndarray tensordot_core_v11(
I feel a lot of boilerplate code in this new function overlaps with its predecessor tensordot_core(), at least for input/output preparation. Can we defer the code splitting point to later?
Yes, code duplication is a concern of mine as well 😓
Since cublasGemmEx allows you to select a different data type for the output matrix C than the data type of the input matrices A and B, I was thinking of using this to reduce the amount of copying after the GEMM (this was not implemented yet). That's why I branched out early, but there aren't that many opportunities for copy reduction, so I'm going to prioritize reducing code duplication first.
cupy/core/core.pyx
Outdated
if m == 1 and n == 1:
    _tensordot_core_mul_sum(
        a.ravel(), b.ravel(), _manipulation._reshape(out, ()))
    if out is not ret:
        elementwise_copy(out, ret)
    return ret
As an example of the code duplication mentioned above, note that #3678 is fixing this part; if the duplication isn't avoided as much as possible, we'd need to fix it twice 😅
cupy/core/core.pyx
Outdated
    return ret

cdef int _get_cuda_dtype(ndarray a):
- Need to propagate exception if it's raised
- Compare char directly
- cdef int _get_cuda_dtype(ndarray a):
+ cdef int _get_cuda_dtype(ndarray a) except -1:
+     cdef str a_type = a.dtype.char
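(For context, a minimal PR-independent illustration of why except -1 matters: a cdef function returning int has no out-of-band way to signal a Python exception, so without a declared error return value Cython prints and swallows it. The helper below is hypothetical, a sketch of the pattern only.)

# Hypothetical helper: `except -1` marks -1 as the error sentinel,
# so the KeyError below propagates to the caller instead of being
# printed and ignored.
cdef int _lookup_code(dict table, key) except -1:
    if key not in table:
        raise KeyError(key)
    return table[key]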
I'm wondering if this function should go to cupy/core/_dtype.pyx instead...?
Yeah I think a little refactoring would be great: note that the very same function is also needed in cuSPARSE and cuTENSOR, for example:
Lines 70 to 80 in ca79633
def _dtype_to_DataType(dtype):
    if dtype == 'f':
        return runtime.CUDA_R_32F
    elif dtype == 'd':
        return runtime.CUDA_R_64F
    elif dtype == 'F':
        return runtime.CUDA_C_32F
    elif dtype == 'D':
        return runtime.CUDA_C_64F
    else:
        raise TypeError
Lines 44 to 56 in 8299e83
def get_cuda_dtype(numpy_dtype):
    if numpy_dtype == numpy.float16:
        return runtime.CUDA_R_16F
    elif numpy_dtype == numpy.float32:
        return runtime.CUDA_R_32F
    elif numpy_dtype == numpy.float64:
        return runtime.CUDA_R_64F
    elif numpy_dtype == numpy.complex64:
        return runtime.CUDA_C_32F
    elif numpy_dtype == numpy.complex128:
        return runtime.CUDA_C_64F
    else:
        raise TypeError('Dtype {} is not supported'.format(numpy_dtype))
How about modifying the signature like this:
cdef int _dtype_to_cuda_type(dtype, bint is_half_allowed=False) except -1
and reusing it everywhere in the codebase?
Agreed. I also think it's better to reuse a function that converts from NumPy data types to CUDA data types. I'd like to propose the following implementation; what do you think?
cpdef int dtype_to_cuda_dtype(dtype_char, available_dtype_char=None) except -1:
    if available_dtype_char is None:
        available_dtype_char = 'fdFD'
    if dtype_char not in available_dtype_char:
        raise TypeError('dtype is not available: %s' % str(dtype_char))
    if dtype_char == 'e':
        return runtime.CUDA_R_16F
    elif dtype_char == 'f':
        return runtime.CUDA_R_32F
    elif dtype_char == 'd':
        return runtime.CUDA_R_64F
    elif dtype_char == 'F':
        return runtime.CUDA_C_32F
    elif dtype_char == 'D':
        return runtime.CUDA_C_64F
    else:
        raise TypeError('dtype is not supported: %s' % str(dtype_char))
Hi @anaruse My preference is to keep the NumPy dtype as input, because when raising an error it offers a better description than a single char. Also, we can avoid the double comparison (your first not in check and then the ifs). Last, available_dtype_char is useless because the if branches are limited.
I think this could be simpler:
cpdef int dtype_to_cuda_dtype(dtype, bint is_half_allowed=False) except -1:
    cdef str dtype_char = dtype.char
    if dtype_char == 'e' and is_half_allowed:
        return runtime.CUDA_R_16F
    elif dtype_char == 'f':
        return runtime.CUDA_R_32F
    elif dtype_char == 'd':
        return runtime.CUDA_R_64F
    elif dtype_char == 'F':
        return runtime.CUDA_C_32F
    elif dtype_char == 'D':
        return runtime.CUDA_C_64F
    else:
        raise TypeError('dtype is not supported: {}'.format(dtype))
Thank you for your comment, @leofang! I've updated the branch accordingly. Could you take a look when you have time?
cupy/core/_dtype.pxd
Outdated
@@ -1,2 +1,3 @@
  cpdef get_dtype(t)
  cpdef tuple get_dtype_with_itemsize(t)
+ cpdef int dtype_to_cuda_dtype(dtype, bint is_half_allowed=?) except -1
I thought this was the correct syntax? (See the Cython docs.)
- cpdef int dtype_to_cuda_dtype(dtype, bint is_half_allowed=?) except -1
+ cpdef int dtype_to_cuda_dtype(dtype, bint is_half_allowed=*) except -1
(In a .pxd declaration, Cython spells an omitted default argument value as =*.)
@@ -2856,14 +2866,15 @@ cpdef ndarray tensordot_core(
      b.data.ptr, runtime.CUDA_R_16F, <int>ldb,
      a.data.ptr, runtime.CUDA_R_16F, <int>lda,
      <size_t>&zero_fp32,
-     c.data.ptr, Ctype, <int>m,
+     c.data.ptr, runtime.CUDA_R_16F, <int>m,
Do we know for sure c is of type float16 at this stage?
Yes, I checked the original code: if the dtype of matrices a and b is float16, then the dtype of matrix c will always be float16.
Thanks!
-     runtime.CUDA_R_16F, <int>lda, 0, c.data.ptr, Ctype, <int>m)
+     b.data.ptr, runtime.CUDA_R_16F, <int>ldb,
+     a.data.ptr, runtime.CUDA_R_16F, <int>lda, 0,
+     c.data.ptr, runtime.CUDA_R_16F, <int>m)
ditto
cupy/core/core.pyx
Outdated
compute_capability = int(device.get_compute_capability())
algo = cublas.CUBLAS_GEMM_DEFAULT
- compute_capability = int(device.get_compute_capability())
- algo = cublas.CUBLAS_GEMM_DEFAULT
+ cdef int compute_capability = int(device.get_compute_capability())
+ cdef int algo = cublas.CUBLAS_GEMM_DEFAULT
cdef double one_d, zero_d
cdef cuComplex one_F, zero_F
cdef cuDoubleComplex one_D, zero_D
cdef int compute_type
cupy/core/core.pyx
Outdated
a_cuda_dtype = dtype_to_cuda_dtype(a.dtype, is_half_allowed=True)
b_cuda_dtype = dtype_to_cuda_dtype(b.dtype, is_half_allowed=True)
c_cuda_dtype = dtype_to_cuda_dtype(c.dtype, is_half_allowed=True)
- a_cuda_dtype = dtype_to_cuda_dtype(a.dtype, is_half_allowed=True)
- b_cuda_dtype = dtype_to_cuda_dtype(b.dtype, is_half_allowed=True)
- c_cuda_dtype = dtype_to_cuda_dtype(c.dtype, is_half_allowed=True)
+ cdef int a_cuda_dtype = dtype_to_cuda_dtype(a.dtype, is_half_allowed=True)
+ cdef int b_cuda_dtype = dtype_to_cuda_dtype(b.dtype, is_half_allowed=True)
+ cdef int c_cuda_dtype = dtype_to_cuda_dtype(c.dtype, is_half_allowed=True)
cdef cuDoubleComplex one_D, zero_D

if c.dtype.char in 'efF':
    compute_type = cublas.CUBLAS_COMPUTE_32F
Don't we wanna use CUBLAS_COMPUTE_16F for half precision? Does it not work?
You can use CUBLAS_COMPUTE_16F, but I didn't use it here for a few reasons.
Performance: On a GPU with Tensor Cores, if the data types of matrices a, b and c are all half precision, there is little difference in matrix-multiply performance between CUBLAS_COMPUTE_16F and CUBLAS_COMPUTE_32F as the compute type.
Accuracy: If CUBLAS_COMPUTE_32F is used as the compute type, the accumulation in the matrix multiply is performed in float precision, which reduces rounding-error accumulation compared to CUBLAS_COMPUTE_16F, resulting in more accurate results.
Code maintenance: If you specify CUBLAS_COMPUTE_16F as the compute type, the parameters alpha and beta of cublasGemmEx must be pointers to half values. However, half is not a first-class citizen in Cython, which would require somewhat complicated code. I prefer to keep the source code simple.
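(To illustrate that last point: since Cython has no native half type, alpha and beta would have to be smuggled through a 16-bit integer carrying the float16 bit pattern. A hypothetical sketch, not code from this PR, assuming NumPy is available to produce the bits:)

import numpy

# Hypothetical: reinterpret float16 scalars as uint16 so their bit
# patterns can be passed where cublasGemmEx expects const __half*.
cdef unsigned short one_h = numpy.float16(1).view(numpy.uint16)
cdef unsigned short zero_h = numpy.float16(0).view(numpy.uint16)
cublas.gemmEx(
    handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
    <size_t>&one_h,
    a.data.ptr, a_cuda_dtype, <int>lda,
    b.data.ptr, b_cuda_dtype, <int>ldb,
    <size_t>&zero_h, c.data.ptr, c_cuda_dtype, <int>ldc,
    cublas.CUBLAS_COMPUTE_16F, algo)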
Thanks, @anaruse! It makes perfect sense 👍
cupy/core/core.pyx
Outdated
if compute_type == cublas.CUBLAS_COMPUTE_32F:
    one_f = 1
    zero_f = 0
    cublas.gemmEx(
        handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
        <size_t>&one_f,
        a.data.ptr, a_cuda_dtype, <int>lda,
        b.data.ptr, b_cuda_dtype, <int>ldb,
        <size_t>&zero_f, c.data.ptr, c_cuda_dtype, <int>ldc,
        compute_type, algo)
elif compute_type == cublas.CUBLAS_COMPUTE_64F:
    one_d = 1
    zero_d = 0
    cublas.gemmEx(
        handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
        <size_t>&one_d,
        a.data.ptr, a_cuda_dtype, <int>lda,
        b.data.ptr, b_cuda_dtype, <int>ldb,
        <size_t>&zero_d, c.data.ptr, c_cuda_dtype, <int>ldc,
        compute_type, algo)
Looks like they can be combined?!
- if compute_type == cublas.CUBLAS_COMPUTE_32F:
-     one_f = 1
-     zero_f = 0
-     cublas.gemmEx(
-         handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
-         <size_t>&one_f,
-         a.data.ptr, a_cuda_dtype, <int>lda,
-         b.data.ptr, b_cuda_dtype, <int>ldb,
-         <size_t>&zero_f, c.data.ptr, c_cuda_dtype, <int>ldc,
-         compute_type, algo)
- elif compute_type == cublas.CUBLAS_COMPUTE_64F:
-     one_d = 1
-     zero_d = 0
-     cublas.gemmEx(
-         handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
-         <size_t>&one_d,
-         a.data.ptr, a_cuda_dtype, <int>lda,
-         b.data.ptr, b_cuda_dtype, <int>ldb,
-         <size_t>&zero_d, c.data.ptr, c_cuda_dtype, <int>ldc,
-         compute_type, algo)
+ if compute_type in (cublas.CUBLAS_COMPUTE_32F, cublas.CUBLAS_COMPUTE_64F):
+     one = 1
+     zero = 0
+     cublas.gemmEx(
+         handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
+         <size_t>&one,
+         a.data.ptr, a_cuda_dtype, <int>lda,
+         b.data.ptr, b_cuda_dtype, <int>ldb,
+         <size_t>&zero, c.data.ptr, c_cuda_dtype, <int>ldc,
+         compute_type, algo)
It would be nice to be able to do so, but the dtypes of the alpha and beta parameters of cublasGemmEx (in this case, one and zero) have to be a float pointer when the compute type is COMPUTE_32F and a double pointer when it is COMPUTE_64F.
https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx
How about the following implementation?
if compute_type == cublas.CUBLAS_COMPUTE_32F:
    one_f = 1
    zero_f = 0
    one_ptr = <size_t>&one_f
    zero_ptr = <size_t>&zero_f
elif compute_type == cublas.CUBLAS_COMPUTE_64F:
    ...
else:
    ...
cublas.gemmEx(
    handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
    one_ptr, a.data.ptr, a_cuda_dtype, <int>lda,
    b.data.ptr, b_cuda_dtype, <int>ldb,
    zero_ptr, c.data.ptr, c_cuda_dtype, <int>ldc,
    compute_type, algo)
Ah, certainly, that's one way to do it. Thanks @asi1024!
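(For reference, a hedged completion of the branches elided in the sketch above, following the discussion; the cdef declarations are assumed rather than taken from the PR.)

cdef float one_f, zero_f
cdef double one_d, zero_d
cdef size_t one_ptr, zero_ptr
if compute_type == cublas.CUBLAS_COMPUTE_32F:
    one_f = 1
    zero_f = 0
    one_ptr = <size_t>&one_f
    zero_ptr = <size_t>&zero_f
elif compute_type == cublas.CUBLAS_COMPUTE_64F:
    one_d = 1
    zero_d = 0
    one_ptr = <size_t>&one_d
    zero_ptr = <size_t>&zero_d
else:
    raise ValueError('Invalid compute type: {}'.format(compute_type))
cublas.gemmEx(
    handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
    one_ptr, a.data.ptr, a_cuda_dtype, <int>lda,
    b.data.ptr, b_cuda_dtype, <int>ldb,
    zero_ptr, c.data.ptr, c_cuda_dtype, <int>ldc,
    compute_type, algo)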
cupy/core/core.pyx
Outdated
if compute_type == cublas.CUBLAS_COMPUTE_32F:
    one_F = cuComplex(1, 0)
    zero_F = cuComplex(0, 0)
    cublas.gemmEx(
        handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
        <size_t>&one_F,
        a.data.ptr, a_cuda_dtype, <int>lda,
        b.data.ptr, b_cuda_dtype, <int>ldb,
        <size_t>&zero_F, c.data.ptr, c_cuda_dtype, <int>ldc,
        compute_type, algo)
elif compute_type == cublas.CUBLAS_COMPUTE_64F:
    one_D = cuDoubleComplex(1, 0)
    zero_D = cuDoubleComplex(0, 0)
    cublas.gemmEx(
        handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
        <size_t>&one_D,
        a.data.ptr, a_cuda_dtype, <int>lda,
        b.data.ptr, b_cuda_dtype, <int>ldb,
        <size_t>&zero_D, c.data.ptr, c_cuda_dtype, <int>ldc,
        compute_type, algo)
else:
    raise ValueError('Invalid compute type: {}'.format(compute_type))
ditto, something like
- if compute_type == cublas.CUBLAS_COMPUTE_32F:
-     one_F = cuComplex(1, 0)
-     zero_F = cuComplex(0, 0)
-     cublas.gemmEx(
-         handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
-         <size_t>&one_F,
-         a.data.ptr, a_cuda_dtype, <int>lda,
-         b.data.ptr, b_cuda_dtype, <int>ldb,
-         <size_t>&zero_F, c.data.ptr, c_cuda_dtype, <int>ldc,
-         compute_type, algo)
- elif compute_type == cublas.CUBLAS_COMPUTE_64F:
-     one_D = cuDoubleComplex(1, 0)
-     zero_D = cuDoubleComplex(0, 0)
-     cublas.gemmEx(
-         handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
-         <size_t>&one_D,
-         a.data.ptr, a_cuda_dtype, <int>lda,
-         b.data.ptr, b_cuda_dtype, <int>ldb,
-         <size_t>&zero_D, c.data.ptr, c_cuda_dtype, <int>ldc,
-         compute_type, algo)
- else:
-     raise ValueError('Invalid compute type: {}'.format(compute_type))
+ if compute_type == cublas.CUBLAS_COMPUTE_32F:
+     one = cuComplex(1, 0)
+     zero = cuComplex(0, 0)
+ elif compute_type == cublas.CUBLAS_COMPUTE_64F:
+     one = cuDoubleComplex(1, 0)
+     zero = cuDoubleComplex(0, 0)
+ else:
+     raise ValueError('Invalid compute type: {}'.format(compute_type))
+ cublas.gemmEx(
+     handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
+     <size_t>&one,
+     a.data.ptr, a_cuda_dtype, <int>lda,
+     b.data.ptr, b_cuda_dtype, <int>ldb,
+     <size_t>&zero, c.data.ptr, c_cuda_dtype, <int>ldc,
+     compute_type, algo)
For the same reasons as above, I'm afraid we cannot do this either.
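(That said, the pointer-based approach suggested above for the real case should extend to the complex one as well — a hedged sketch, with the declarations assumed rather than taken from the PR:)

cdef cuComplex one_F, zero_F
cdef cuDoubleComplex one_D, zero_D
cdef size_t one_ptr, zero_ptr
if compute_type == cublas.CUBLAS_COMPUTE_32F:
    one_F = cuComplex(1, 0)
    zero_F = cuComplex(0, 0)
    one_ptr = <size_t>&one_F
    zero_ptr = <size_t>&zero_F
elif compute_type == cublas.CUBLAS_COMPUTE_64F:
    one_D = cuDoubleComplex(1, 0)
    zero_D = cuDoubleComplex(0, 0)
    one_ptr = <size_t>&one_D
    zero_ptr = <size_t>&zero_D
else:
    raise ValueError('Invalid compute type: {}'.format(compute_type))
cublas.gemmEx(
    handle, <int>transa, <int>transb, <int>m, <int>n, <int>k,
    one_ptr, a.data.ptr, a_cuda_dtype, <int>lda,
    b.data.ptr, b_cuda_dtype, <int>ldb,
    zero_ptr, c.data.ptr, c_cuda_dtype, <int>ldc,
    compute_type, algo)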
        <void*>C, <runtime.DataType>Ctype, ldc,
        <runtime.DataType>computeType, <GemmAlgo>algo)
if computeType >= CUBLAS_COMPUTE_16F:
    status = cublasGemmEx_v11(
Question: It seems there's a C++ overloaded version of cublasGemmEx that supports the old cudaDataType? https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx
I wonder if using that could help, or is it to be deprecated soon?
That's right: even with cublasGemmEx in CUDA 11, you can still specify the compute type with cudaDataType, as long as you're in C++. However, the old way cannot express, for example, TF32 (TensorFloat32) as a compute type, so you need to specify the compute type with cublasComputeType, which was added in CUDA 11.
Ah I see, so this is why we need the new interface...
We will split tensordot_core and tensordot_core_v11 into another file, _routines_linalg.pyx, after the merge of this PR.
Jenkins, test this please.
Successfully created a job for commit 752de1b:
@asi1024 @takagi @kmaehashi I think Jenkins has been dead since yesterday.
LGTM!
if use_sgemmEx:
    Ctype = runtime.CUDA_R_16F if c.dtype == 'e' else runtime.CUDA_R_32F

global _cuda_runtime_version
Note: I think we no longer need to check this, as we've required CUDA 9.0+ since CuPy v8! I will send a PR to remove it from a few places, but for ease of backporting let's keep it here.
Jenkins, test this please.
Successfully created a job for commit 752de1b:
Jenkins CI test (for commit 752de1b, target branch master) succeeded!
I will retrigger CI after chainer/chainer-test#593 is merged.
Jenkins, test this please.
Jenkins CI test (for commit 752de1b, target branch master) succeeded!
LGTM!
This PR modifies CuPy to use cublasGemmEx, an extension of cublas<t>gemm, as the matrix multiply backend called in tensordot_core in the case of CUDA 11. cublasGemmEx is flexible, allowing users to specify the data types of each of the matrices A, B and C, the precision of the computation, and the matrix multiply algorithm to be used.
https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx
This is a kind of preparatory PR; a follow-up PR will allow the use of TF32 (TensorFloat32) as the compute precision of the matrix multiply.
This is related to #3602
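(Purely to illustrate that follow-up direction — a hedged sketch, assuming a hypothetical allow_tf32 switch and that the CUDA 11 enum CUBLAS_COMPUTE_32F_FAST_TF32 is exposed on the cublas module; neither is part of this PR:)

# Hypothetical: opt into TF32 for float32 compute on Ampere-class GPUs.
if (compute_type == cublas.CUBLAS_COMPUTE_32F
        and allow_tf32 and compute_capability >= 80):
    compute_type = cublas.CUBLAS_COMPUTE_32F_FAST_TF32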