COO GPU spmv kernel #57

Merged: 11 commits merged into ginkgo-project:develop on Jun 6, 2018
Conversation

@yhmtsai (Member) commented on May 19, 2018:

COO GPU spmv kernel

@hartwiganzt (Collaborator) commented:

Does the code run for you? The tester gives me:

Running main() from gtest_main.cc
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Coo
[ RUN      ] Coo.SimpleApplyIsEquivalentToRef
unknown file: Failure
C++ exception with description "/Users/hanzt0114cl306/work/public_ginkgo/ginkgo/gpu/base/executor.cpp:118: raw_copy_to: cudaErrorLaunchFailure: unspecified launch failure" thrown in the test body.
/Users/hanzt0114cl306/work/public_ginkgo/ginkgo/gpu/test/matrix/coo_kernels.cpp:70: Failure
Expected: gpu->synchronize() doesn't throw an exception.
  Actual: it throws.
Unrecoverable CUDA error on device 0 in free: cudaErrorLaunchFailure: unspecified launch failure
Exiting program

@hartwiganzt (Collaborator) commented:

Works now - looks good to me!

@hartwiganzt (Collaborator) commented:
@gflegar no conflicts here.

@gflegar gflegar added the is:new-feature A request or implementation of a feature that does not exist yet. label May 24, 2018
@hartwiganzt hartwiganzt assigned pratikvn, yanjen and tcojean and unassigned pratikvn and tcojean May 29, 2018
__forceinline__ __device__ static cuDoubleComplex atomic_add(
cuDoubleComplex *address, cuDoubleComplex val)
{
// Seperate to real part and imag part
Review comment (Member):

nit: s/seperate/separate, here and a few places below.



template <typename ValueType, typename IndexType>
__global__ __launch_bounds__(128) void spmv_kernel(
Review comment (Member):

As this is the actual kernel, some documentation explaining how it works would be nice.
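For illustration, the requested documentation could summarize the algorithm along these lines (a sketch based on the code visible in this thread, not the comment that was eventually merged):

```cuda
/**
 * COO SpMV kernel (documentation sketch).
 *
 * Each warp walks over a contiguous chunk of nonzeros in steps of
 * warp_size, with every thread accumulating val[ind] * b[col[ind]].
 * When the row index changes within a warp, a warp-level segmented
 * scan combines the partial sums that belong to the same row, and the
 * first thread of each segment writes its result to c via atomic_add.
 */
```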

const matrix::Coo<ValueType, IndexType> *a,
const matrix::Dense<ValueType> *b, matrix::Dense<ValueType> *c)
{
int multiple = 8;
Review comment (Member):

Hard-coding values like this is always difficult to warrant. I guess here it is probably okay, but it should be accompanied by some metrics/reasoning as to why this value was chosen.

Reply (Member):

I agree with you that it shouldn't look like this.

However, AFAIK, these are tuned parameters obtained from running experiments (and they might differ between architectures). I have a similar problem in #49, where they are also hardcoded.
The real issue is that we haven't solved this in Ginkgo yet - the idea is to be able to say something like: "this is a parameter that should be tuned, valid values are from the set S, and use x by default". The user should then be able to do an autotuning run of Ginkgo, which sets the parameters to proper values for their system.

Whichever way we do it right now, there's a good chance we'll have to change it when we implement autotuning, so I'm fine with keeping it like this for now, plus a "TODO" that this should be changed once we support autotuning. See the sketch below.
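For instance, the interim version could look like this (the value 8 is from the PR; the TODO wording is illustrative):

```cuda
// TODO: `multiple` is an experimentally tuned parameter and may differ
// across GPU architectures; replace with a proper mechanism once Ginkgo
// supports autotuning.
int multiple = 8;
```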

@gflegar (Member) left a review:
Looks good. However, there is some code duplication, see my comments for suggestions on how to resolve it.

#include "gpu/components/shuffle.cuh"
#include "gpu/components/synchronization.cuh"

namespace gko {
Review comment (Member):
nit: another empty line above this

cuDoubleComplex *cuval = reinterpret_cast<cuDoubleComplex *>(&val);
atomic_add(cuaddr, *cuval);
return *address;
}
Review comment (Member):
Let's do the following for all overloads of atomic_add:

  1. All of them could be useful in other algorithms, so let's move them into a separate header, say: gpu/components/atomic.cuh, and nest them inside the gko::kernels::gpu namespace.
  2. There is no need to have them in anonymous namespace, static, and __forceinline__. Any one of them will prevent symbol ambiguities. In this case, we can forget about anonymous namespace and static qualifiers and just use __forceinline__.
  3. There's no need for cuComplex and cuDoubleComplex overloads - we always use thrust::complex, the other types are only here for cuBLAS / cuSPARSE interoperability.
  4. The complex versions are incorrect, as they return the new value instead of the old one. I don't think it's possible to implement this correctly without significant performance penalties (basically, we would have to implement a mutex to do it properly), so I suggest we just remove the return value and have all our atomic_add functions return void (a sketch follows this list).
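A minimal sketch of what points 3 and 4 could look like, assuming real-valued atomic_add overloads already exist and that thrust::complex<T> is layout-compatible with two consecutive Ts (illustrative, not the merged code):

```cuda
#include <thrust/complex.h>

// Complex atomic_add as two independent real atomic adds. Returns void,
// since the combined old value cannot be read atomically without a
// mutex (see point 4 above).
template <typename T>
__forceinline__ __device__ void atomic_add(thrust::complex<T> *address,
                                           thrust::complex<T> val)
{
    // Treat the complex number as a pair (real, imag) of reals.
    auto parts = reinterpret_cast<T *>(address);
    atomic_add(&parts[0], val.real());
    atomic_add(&parts[1], val.imag());
}
```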

{
ValueType add_val;
#pragma unroll
for (int i = 1; i < 32; i <<= 1) {
Review comment (Member):
s/32/cuda_config::warp_size

Reply (Member):

Also do this everywhere in the code - a good rule of thumb is to never rely on "magic values" like 32 or 128. Instead, define these values as constants somewhere and use the constants. This way the code is easier to understand (you see something like warp_size, or warps_in_block * warp_size, not something cryptic like 32 or 128), and easier to maintain if you ever need to replace the value with something else.
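For example (names illustrative; Ginkgo's actual cuda_config may differ):

```cuda
// Named constants instead of magic values:
constexpr int warp_size = 32;
constexpr int warps_in_block = 4;
constexpr int block_size = warps_in_block * warp_size;  // rather than "128"
```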



template <typename ValueType, typename IndexType>
__device__ __forceinline__ void segment_scan(IndexType *ind, ValueType *val,
Review comment (Member):
nit: ind is not modified by this function so it should be just a plain variable

#pragma unroll
for (int i = 1; i < 32; i <<= 1) {
const IndexType add_ind = warp::shuffle_up(*ind, i);
add_val = zero<ValueType>();
Review comment (Member):

nit: In contrast to C and Fortran, C++ allows you to declare variables wherever you want, and the guidelines advise avoiding uninitialized variables and defining them as late as possible.
Here, you should remove ValueType add_val; from the beginning of the function, and in this line use:

auto add_val = zero<ValueType>();

IndexType next_row;
for (; ind < ind_end; ind += 32) {
temp_val += (ind < nnz) ? val[ind] * b[col[ind]] : zero<ValueType>();
next_row = (ind + 32 < nnz) ? row[ind + 32] : row[nnz - 1];
Review comment (Member):
nit: same as before, auto next_row = ..., and remove declaration of next_row before the loop

next_row = (ind + 32 < nnz) ? row[ind + 32] : row[nnz - 1];
// segmented scan
const bool is_scan = temp_row != next_row;
if (warp::any(is_scan)) {
Review comment (Member):

nit: why not just if (warp::any(temp_row != next_row))? It reads just fine: "if any thread has a different temp_row than next_row". The version with a temporary bool is more confusing.

if (warp::any(is_scan)) {
atomichead = true;
segment_scan(&temp_row, &temp_val, &atomichead);
if (atomichead) {
Review comment (Member):
nit: maybe s/atomichead/is_first_in_segment would make it clearer

int nwarps = config * multiple;
if (nwarps > ceildiv(nnz, 32)) {
nwarps = ceildiv(nnz, 32);
}
Review comment (Member):

This part is repeated in both spmv kernels. We should extract it into a separate function to avoid code duplication; see the sketch below.
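A sketch of such a helper, assuming the duplicated snippet is the launch-configuration clamp shown above (the function name and signature are hypothetical):

```cuda
// Clamp the number of warps so we never launch more warps than there
// are warp-sized chunks of nonzeros.
template <typename IndexType>
int calculate_nwarps(int config, int multiple, IndexType nnz)
{
    const auto requested = static_cast<IndexType>(config) * multiple;
    const auto needed = ceildiv(nnz, cuda_config::warp_size);
    return static_cast<int>(requested < needed ? requested : needed);
}
```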

atomic_add(&(c[temp_row]), alpha_val * temp_val);
}
}
}
Review comment (Member):
Is the only difference between this kernel and the other one in how atomic_add is called (this one with alpha, the other one without it)? If so, I propose we do the following to remove code duplication (a fleshed-out sketch follows the list):

  1. extract this entire kernel into a device function,
  2. add a functor template parameter to it:
    template </* other template parameters */, typename Closure>
    __device__ /* specifiers */ spmv(/* parameters */, Closure scale) {  /* body */ }
  3. replace
    atomic_add(/* destination */, alpha_val * temp_val)
    with
    atomic_add(/* destination */, scale(temp_val));
    in device function body.
  4. implement the "simple" apply kernel as:
    spmv(/* parameters */, [](const ValueType &x) { return x; });
  5. implement the "advanced" apply kernel as:
    ValueType scale_factor = alpha[0];
    spmv(/* parameters */, [&scale_factor](const ValueType &x) { return scale_factor * x; });
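A fleshed-out sketch of this suggestion (names like spmv_impl are illustrative; for brevity it maps one thread per nonzero and uses the built-in atomicAdd, whereas the real kernel keeps its warp-chunked loop, segmented scan, and custom atomic_add overloads):

```cuda
#include <cuda_runtime.h>

// Generic COO SpMV body; the closure decides how results are scaled.
template <typename ValueType, typename IndexType, typename Closure>
__device__ void spmv_impl(IndexType nnz, const IndexType *row,
                          const IndexType *col, const ValueType *val,
                          const ValueType *b, ValueType *c, Closure scale)
{
    const auto i =
        static_cast<IndexType>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < nnz) {
        atomicAdd(&c[row[i]], scale(val[i] * b[col[i]]));
    }
}

// "Simple" apply: identity closure.
template <typename ValueType, typename IndexType>
__global__ void spmv(IndexType nnz, const IndexType *row,
                     const IndexType *col, const ValueType *val,
                     const ValueType *b, ValueType *c)
{
    spmv_impl(nnz, row, col, val, b, c,
              [](const ValueType &x) { return x; });
}

// "Advanced" apply: scale every contribution by alpha[0].
template <typename ValueType, typename IndexType>
__global__ void advanced_spmv(IndexType nnz, const IndexType *row,
                              const IndexType *col, const ValueType *val,
                              const ValueType *alpha, const ValueType *b,
                              ValueType *c)
{
    const auto scale_factor = alpha[0];
    spmv_impl(nnz, row, col, val, b, c,
              [scale_factor](const ValueType &x) { return scale_factor * x; });
}
```

Since device lambdas defined inside kernels are inlined, the closure adds no runtime overhead over the two hand-written kernels.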

@pratikvn pratikvn mentioned this pull request May 29, 2018
@yhmtsai (Member, Author) commented on May 30, 2018:

Thanks for your help.
I have fixed them.
@gflegar the complex version of atomic_add is not really atomic, but I think it is correct in COO spmv, right?

(defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 600))


__forceinline__ __device__ static double atomic_add(double *addr, double val)
Review comment (Member):
this should also return void

Reply (Member):
and no need for static

Reply (Member, Author):

Fixed it.
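For reference, a void-returning pre-sm_60 double atomic_add along the discussed lines could look like this (a sketch following the standard atomicCAS loop from the CUDA programming guide, not necessarily the merged code):

```cuda
__forceinline__ __device__ void atomic_add(double *addr, double val)
{
    auto address_as_ull = reinterpret_cast<unsigned long long int *>(addr);
    unsigned long long int old = *address_as_ull;
    unsigned long long int assumed;
    do {
        assumed = old;
        // Add on top of the value we last observed; retry if another
        // thread changed it in the meantime.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(
                            val + __longlong_as_double(assumed)));
    } while (assumed != old);
    // Deliberately no return value, matching the other overloads.
}
```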

@gflegar (Member) commented on Jun 6, 2018:

@yhmtsai yes, as it is written now (with functions returning void), the complex versions of atomic add are "atomic enough" for our use-case (atomic enough = if we have multiple calls to atomic add, they will all accumulate their values, even though it won't exactly be done atomically).

Everything looks good now. There's just this minor thing with the wrong return value in the Kepler implementation of the double atomic add. We can merge as soon as it is fixed.

@gflegar (Member) left a review:
LGTM

@gflegar gflegar merged commit adde1c1 into ginkgo-project:develop Jun 6, 2018
gflegar added a commit that referenced this pull request Jun 7, 2018
PR #57 introduced atomic operations for standard types, but not for custom value types, which causes compilation to fail for those (e.g. FloatX numbers).

This PR adds a dummy implementation of the general atomic_add (which just causes an assertion failure), so Ginkgo at least compiles with custom types, even though parts which use it do not work properly.
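A hedged sketch of such a catch-all fallback (not necessarily the commit's exact code): an unconstrained template lets any value type compile, concrete overloads still win overload resolution for supported types, and executing the fallback trips a device-side assertion.

```cuda
#include <cassert>

// Fallback for unsupported value types: compiles everywhere, but
// asserts at runtime if it is ever actually executed.
template <typename ValueType>
__forceinline__ __device__ void atomic_add(ValueType *, ValueType)
{
    assert(false && "atomic_add: not implemented for this value type");
}
```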
Labels
is:new-feature A request or implementation of a feature that does not exist yet.

6 participants