# Understanding the memory performance of the GPU

In this section we investigate a crucial aspect of memory locality on GPUs, which has to be viewed slightly differently than on CPUs. To demonstrate this, we will implement the BLAS matrix-vector multiplication kernel (GEMV), $y \leftarrow \alpha A x + \beta y$, using all the tricks we have learned so far. The GPU threads operate row-wise on the input matrix, each one taking care of a single row to compute one element of the result:

In [24]:
import numba
import numba.cuda as cuda
import numpy as np
import time


class time_region:
    def __init__(self, time_offset=0):
        self._time_off = time_offset

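    def __enter__(self):
        # perf_counter() is a monotonic, high-resolution clock meant for
        # measuring short intervals
        self._t_start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._t_end = time.perf_counter()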

    def elapsed_time(self):
        return self._time_off + (self._t_end - self._t_start)


class time_region_cuda:
    def __init__(self, time_offset=0, cuda_stream=0):
        self._t_start = cuda.event(timing=True)
        self._t_end = cuda.event(timing=True)
        self._time_off = time_offset
        self._cuda_stream = cuda_stream

    def __enter__(self):
        self._t_start.record(self._cuda_stream)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._t_end.record(self._cuda_stream)
        self._t_end.synchronize()

    def elapsed_time(self):
        # cuda.event_elapsed_time() reports milliseconds; convert to seconds
        return self._time_off + 1.e-3*cuda.event_elapsed_time(self._t_start,
                                                              self._t_end)


@cuda.jit('void(float64, Array(float64, 2, "C"), Array(float64, 1, "C"), '
          'float64, Array(float64, 1, "C"))')
def _gemv_cuda(alpha, A, x, beta, y):
    i = cuda.grid(1)
    N, M = A.shape
    if i >= N:
        return

    prod = 0.0
    for j in range(M):
        prod += A[i, j]*x[j]

    y[i] = alpha*prod + beta*y[i]


def gemv_gpu(alpha, A, x, beta, y):
    # Assumes a square N x N matrix: y has N entries and the bandwidth
    # estimate printed below counts N*(N+2) elements
    N = A.shape[0]
    with time_region_cuda() as t_xfer:
        d_A = cuda.to_device(A)
        d_x = cuda.to_device(x)
        d_y = cuda.to_device(y)
        y_ret = cuda.pinned_array(N)
        
    block_size = 128
    num_blocks = N // block_size
    if N % block_size:
        num_blocks += 1

    with time_region_cuda() as t_kernel:
        _gemv_cuda[num_blocks, block_size](alpha, d_A, d_x, beta, d_y)

    with time_region_cuda(t_xfer.elapsed_time()) as t_xfer:
        d_y.copy_to_host(y_ret)

    print(f'  CUDA transfer overheads: {t_xfer.elapsed_time()}')
    print(f'  CUDA kernel time: {t_kernel.elapsed_time()}')
    print(f'  Consumed memory bandwidth: {1e-9*8*N*(N+2)/t_kernel.elapsed_time()} GB/s')
    return y_ret

N = 1024*16
A = np.random.rand(N, N)
x = np.random.rand(N)
y_orig = np.ones(N)
alpha = 0.2
beta = 1

with time_region() as t_gpu:
    y = gemv_gpu(alpha, A, x, beta, y_orig)

with time_region() as t_ref:
    y_ref = alpha*(A @ x) + beta*y_orig
    
    
cuda.profile_stop()

print(f'Total time (GPU): {t_gpu.elapsed_time()} s')
print(f'Total time (CPU): {t_ref.elapsed_time()} s')

assert np.allclose(y, y_ref)

  CUDA transfer overheads: 0.2141221150904894
  CUDA kernel time: 0.015257023811340331
  Consumed memory bandwidth: 140.77095366421406 GB/s
Total time (GPU): 0.23199772834777832 s
Total time (CPU): 0.04511618614196777 s
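
The mapping of threads to rows can also be emulated on the CPU, which may help when checking the kernel logic without a GPU. The following is a minimal sketch, not part of the original notebook: the function name `gemv_rowwise_reference` and the small 64x64 test problem are assumptions chosen for illustration, and the outer loop index `i` stands in for the value that `cuda.grid(1)` returns in each thread.

```python
import numpy as np


def gemv_rowwise_reference(alpha, A, x, beta, y):
    """Emulate the one-thread-per-row GEMV mapping with plain Python loops."""
    n_rows, n_cols = A.shape
    y_out = np.empty(n_rows, dtype=np.float64)
    for i in range(n_rows):        # `i` plays the role of cuda.grid(1)
        prod = 0.0
        for j in range(n_cols):    # each "thread" walks along its own row
            prod += A[i, j] * x[j]
        y_out[i] = alpha * prod + beta * y[i]
    return y_out


# Small hypothetical problem, kept tiny because the loops are pure Python
rng = np.random.default_rng(0)
A_small = rng.random((64, 64))
x_small = rng.random(64)
y_small = np.ones(64)

y_emulated = gemv_rowwise_reference(0.2, A_small, x_small, 1.0, y_small)
print(np.allclose(y_emulated, 0.2 * (A_small @ x_small) + y_small))
```

Note that in this emulation consecutive values of `j` are visited by the same loop iteration `i`, exactly as each GPU thread walks along its own row in the kernel.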


### Exercise

> Increase the array size and record the CUDA kernel execution time. How much faster is it compared to the CPU version?

As you might have noticed already, the GPU kernel is only about 2-3x faster than the CPU version. This is not as good as what we saw with the vector addition kernel. Is this expected? Is this how it should be? Let's look into the kernel in more detail.

The kernel needs to read the whole matrix $A$ and the vectors $x$ and $y$, i.e., $8(N^2 + 2N)$ bytes need to be transferred to/from main memory in total. At the same time, the kernel performs $2N^2 + 3N$ floating point operations. This leads to an arithmetic intensity, or flop:byte ratio, of $\frac{2N(N+\frac{3}{2})}{8N(N+2)} \approx 0.25$. This ratio is much higher than that of the vector addition kernel, but it is still far too low to make the kernel compute bound.
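
The flop and byte counts above can be checked numerically. This is a small sketch under the same counting assumptions as the text (double precision, $N^2 + 2N$ elements moved, $2N^2 + 3N$ operations); the helper name `gemv_arithmetic_intensity` is hypothetical.

```python
def gemv_arithmetic_intensity(n, bytes_per_element=8):
    """Return the flop:byte ratio of GEMV for a square n x n matrix."""
    flops = 2 * n**2 + 3 * n                           # 2N per row plus 3 for the alpha/beta update
    bytes_moved = bytes_per_element * (n**2 + 2 * n)   # A, x and y, as counted in the text
    return flops / bytes_moved


n = 1024 * 16
intensity = gemv_arithmetic_intensity(n)
print(f'flop:byte ratio for N = {n}: {intensity:.4f}')
```

For large $N$ the lower-order terms become negligible and the ratio tends to $2/8 = 0.25$, independent of the matrix size.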

> Given the nominal peak double precision performance (5.3 Tflop/s) and the nominal peak memory bandwidth of the P100 GPUs (732 GB/s), a kernel would need a flop:byte ratio of at least 7.24, so as to be compute bound.
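
The threshold quoted above follows from a simple roofline-style estimate: attainable performance is capped either by the peak compute rate or by the arithmetic intensity times the peak bandwidth, whichever is lower. The sketch below uses only the nominal P100 figures from the text; the constant and function names are assumptions for illustration, and real devices typically sustain less than the nominal bandwidth.

```python
PEAK_FLOPS = 5.3e12       # nominal double precision peak in flop/s (from the text)
PEAK_BANDWIDTH = 732e9    # nominal peak memory bandwidth in byte/s (from the text)


def attainable_flops(intensity):
    """Roofline-style upper bound on performance for a given flop:byte ratio."""
    return min(PEAK_FLOPS, intensity * PEAK_BANDWIDTH)


# Intensity at which the memory roof meets the compute roof
ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH

# Upper bound for GEMV, using the flop:byte ratio of about 0.25 derived earlier
gemv_limit = attainable_flops(0.25)

print(f'Ridge point: {ridge_point:.2f} flop/byte')
print(f'GEMV upper bound: {1e-9 * gemv_limit:.0f} Gflop/s')
```

Under these nominal figures, a kernel with a flop:byte ratio of 0.25 sits far to the left of the ridge point, so its performance is governed by how well it uses the memory bandwidth.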

So, theoretically, we should be approaching the effective memory bandwidth limit of the device, but we only achieve about 1/4 of it. The CPU kernel, on the other hand, seems to be optimal:

In [23]:
print(f'CPU memory bandwidth consumed: {1e-9*8*N*(N+2)/t_ref.elapsed_time()} GB/s')

CPU memory bandwidth consumed: 47.884390070318666 GB/s


What is going on?