# Introduction to GPU Programming Summer School
D. Quigley, University of Warwick
## Tutorial 1. Using CUDA accelerated libraries

### Step 1 : Sanity check of Python environment

All being well, you're reading this text inside a Jupyter notebook running on an SCRTP machine equipped with a CUDA-compatible GPU card, and you launched the notebook from within an environment configured to support GPU computing. If in doubt, check the [connecting instructions](https://warwick.ac.uk/fac/sci/maths/research/events/2017-18/nonsymposium/gpu/connecting).

Let's run some simple checks to make sure you can execute code on a GPU from within this notebook. Today we'll be working with the Python interface to CUDA provided by the numba package.

Some terminology we need already:

**Host**        : The traditional computer in which our code is running on a CPU with access to host RAM.

**CUDA Device** : The GPU card consisting of its own RAM and computing cores (lots of them). 

In [21]:
import platform          # So we can figure out where we're running
from numba import cuda   # Import python interface to CUDA

# Report where we're running
print("========================================================")
print("This notebook is running on ", platform.node())
print("========================================================")

# Test if CUDA is available. If so report on the devices present 
if cuda.is_available():  
    
    # List of CUDA capable devices in this system
    for device in cuda.list_devices():       
        print("Device ID : ", device.id, " : ", device.name.decode())         
    
else:
    print("There doesn't appear to be a CUDA capable device in this system")

This notebook is running on  brigitte.csc.warwick.ac.uk
Device ID :  0  :  Tesla K20c
Device ID :  1  :  NVS 310


Let's select the most appropriate device and query its [compute capability](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities). CUDA devices have varying compute capability depending on the product range and when they were manufactured. Typically CUDA software will be written to support a minimum compute capability.

[Numba requires a compute capability of 2.0 or higher](https://numba.pydata.org/numba-doc/dev/cuda/overview.html), so we should check this.

In [33]:
my_instance = cuda.select_device(0) # Create a device instance to work with based on device 0 

# The compute capability is stored as a tuple (major, minor) so we're good to go if...
if my_instance.compute_capability[0] >= 2:
    print("The selected device (",my_instance.name.decode(),") has a sufficient compute capability")
else:
    print("The selected device does not have a sufficient compute capability")

The selected device ( Tesla K20c ) has a sufficient compute capability
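The check above looks only at the major version number. If a requirement also specifies a minor version, Python's tuple comparison handles both parts at once. The sketch below uses a hypothetical helper, `meets_minimum`, which is not part of numba, and the tuples passed to it are illustrative values rather than queries of real hardware.

```python
def meets_minimum(capability, minimum=(2, 0)):
    """Return True if a (major, minor) compute capability is at least `minimum`.

    Hypothetical helper for illustration; not part of numba.
    """
    # Tuples compare element by element: major first, then minor
    return tuple(capability) >= tuple(minimum)


print(meets_minimum((3, 5)))           # the major version alone decides this one
print(meets_minimum((1, 3)))           # too old for a 2.0 minimum
print(meets_minimum((3, 0), (3, 5)))   # the minor version decides this one
```

In this notebook the helper could be called as `meets_minimum(my_instance.compute_capability)`, since the compute capability is stored as a `(major, minor)` tuple.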


### Step 2 : "Drop in" replacements for standard numerical libraries via Pyculib 

The computationally intensive part of many scientific codes reduces to standard numerical operations, e.g. manipulation of large matrices/vectors, performing Fourier transforms, dealing with sparse linear algebra etc.

In the traditional High Performance Computing (HPC) realm of compiled C/C++/Fortran code, there exists a suite of standard optimised libraries for such things. The CUDA toolkit provides GPU-enabled versions of these, and numba provides Python interfaces to these through the pyculib package which we'll experiment with below.

The advantage of this approach is that no real knowledge of GPU programming is required; we simply replace calls to standard CPU functions with GPU-accelerated equivalents. The disadvantage is that we only accelerate part of the code, and may suffer performance overheads associated with transferring data between host and device every time we use one of the functions.


##### 2(a) cuBLAS

BLAS is the suite of [Basic Linear Algebra Subprograms](http://www.netlib.org/blas/). These come in three levels, for both real and complex data types. 

**Level 1** : Vector-vector operations 

**Level 2** : Matrix-vector operations

**Level 3** : Matrix-matrix operations
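To make the three levels concrete, here is a minimal numpy sketch with one representative operation from each. The operations are chosen to mirror the BLAS routines `axpy`, `gemv` and `gemm`; whether numpy actually dispatches each line to a BLAS routine depends on how it was built, so treat this as an illustration of the kind of operation at each level rather than a statement about which routine runs. For vectors of length $n$ and $n \times n$ matrices, the work grows as roughly $n$, $n^2$ and $n^3$ operations respectively.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the example is repeatable
n = 4
alpha = 2.0
x = rng.random(n)
y = rng.random(n)
A = rng.random((n, n))
B = rng.random((n, n))

# Level 1 (vector-vector): scaled vector addition, in the style of axpy
level1 = alpha * x + y

# Level 2 (matrix-vector): matrix-vector product, in the style of gemv
level2 = A @ x

# Level 3 (matrix-matrix): matrix-matrix product, in the style of gemm
level3 = A @ B

print(level1.shape, level2.shape, level3.shape)
```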

On any well-managed HPC system, the local installations of numpy and scipy packages will be built on top of BLAS routines (written in C or Fortran) that have been optimised for the particular hardware in use. Optimised BLAS implementations for CPUs include [OpenBLAS](https://www.openblas.net/), [Atlas](http://math-atlas.sourceforge.net/), [Intel MKL](https://software.intel.com/en-us/mkl) and [AMD ACML](https://developer.amd.com/building-with-acml/).

The CUDA toolkit includes [cuBLAS](https://developer.nvidia.com/cublas), a GPU-accelerated BLAS implementation. Let's see how it performs compared with numpy. If you're interested, the numpy installation on the SCRTP desktops has been built using OpenBLAS, but optimised only for the most common Intel CPU features to ensure compatibility, so CPU performance will not be great.

[Documentation for the pyculib interface to cuBLAS](http://pyculib.readthedocs.io/en/latest/cublas.html)
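A performance comparison needs a way to time a call. The sketch below is one simple option, assuming a hypothetical helper `time_call` that reports the best wall-clock time over a few repeats. It times only the numpy (CPU) matrix product, and the matrix size is deliberately small so that it runs quickly.

```python
import time
import numpy as np


def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of func().

    Hypothetical helper for illustration.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


size = 200   # small on purpose; increase to see more realistic timings
A = np.random.rand(size, size)
B = np.random.rand(size, size)

cpu_time = time_call(lambda: np.matmul(A, B))
print("numpy matmul with size", size, "took", cpu_time, "seconds")
```

The same helper could wrap a GPU library call. Taking the best of several repeats is a sensible choice there because the first GPU call typically includes one-off initialisation costs.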

Let's illustrate this with a simple matrix-matrix multiplication example. The BLAS routine `dgemm` (double precision, general matrix-matrix multiply) performs the following operation.

$$ C = \alpha AB + \beta C $$

where $A$, $B$ and $C$ are matrices and $\alpha$ and $\beta$ are scalars. Other specialised routines are available for matrices with particular structure (e.g. banded, tri-diagonal, symmetric) but we won't worry about that today.
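Written out element by element, the operation is $C_{ij} \leftarrow \alpha \sum_k A_{ik} B_{kj} + \beta C_{ij}$, where $A$ is $m \times k$, $B$ is $k \times n$ and $C$ is $m \times n$. The sketch below spells this out with explicit loops and checks it against the vectorised numpy expression. The function `gemm_loops` is a hypothetical reference implementation for illustration only; it is far slower than any real BLAS routine.

```python
import numpy as np


def gemm_loops(alpha, A, B, beta, C):
    """Element-by-element reference for alpha * A B + beta * C.

    Hypothetical helper for illustration. Returns a new array and leaves C unchanged.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2 and C.shape == (m, n), "incompatible matrix shapes"
    result = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i, p] * B[p, j]     # inner product of row i and column j
            result[i, j] = alpha * acc + beta * C[i, j]
    return result


rng = np.random.default_rng(1)   # seeded so the example is repeatable
A = rng.random((2, 3))
B = rng.random((3, 4))
C = rng.random((2, 4))

looped = gemm_loops(0.5, A, B, 2.0, C)
vectorised = 0.5 * np.matmul(A, B) + 2.0 * C
print(np.allclose(looped, vectorised))
```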

In [49]:
import numpy as np
import pyculib.blas as cublas      # Python interface to cuBLAS

# Set size of matrix to work with
size = 3

# Create some square matrices and fill them with random numbers
A = np.random.rand(size,size)
B = np.random.rand(size,size)
C = np.random.rand(size,size)

# Alpha and beta
alpha = 0.5
beta = 2.0

# Perform the operation described above using standard numpy
C_np = alpha * np.matmul(A,B) + beta * C

# The equivalent cuBLAS call needs additional arguments to specify if we want to use the
# matrices as supplied ('N' - no operation) or their transpose ('T'). 
C_cu = cublas.gemm('N', 'N', alpha, A, B, beta, C)

A lot just happened behind the scenes there. The numpy arrays were copied from host RAM into device memory, the matrix operation was performed using the CUDA cores on the device, and the result was copied back into the numpy array `C_cu` in host memory.
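The two copies can also be made explicitly. The following is a minimal sketch, assuming numba's `cuda.to_device` and `copy_to_host` calls, that sends one array to the device and brings it straight back. It is not how pyculib is implemented internally, and the matrix operation itself is left out; the point is only to make the data movement visible. The guard and the CPU stand-in are there so the example still runs on a machine without numba or a GPU.

```python
import numpy as np

# numba may be missing, or no usable GPU may be present, so guard both cases
try:
    from numba import cuda
    have_gpu = cuda.is_available()
except Exception:
    have_gpu = False

A = np.random.rand(3, 3)

if have_gpu:
    d_A = cuda.to_device(A)        # copy the host array into device memory
    A_back = d_A.copy_to_host()    # copy the device array back into host memory
else:
    A_back = A.copy()              # stand-in so the example runs without a GPU

# The round trip should return exactly the data we sent
print(np.array_equal(A, A_back))
```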

Let's be proper programmers and check that both computations gave the same result.

In [50]:
# Subtract the two results
C_cu - C_np

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
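The difference is exactly zero here, but that should not be relied on in general. Floating point addition is not associative, so two implementations that accumulate the same sums in a different order can disagree in the last few bits. A more robust check compares within a tolerance, for example with `np.allclose`. In the sketch below, `C_first` and `C_second` are stand-in arrays built to differ only by rounding; they are not real outputs from numpy or cuBLAS.

```python
import numpy as np

# The same three numbers summed in two different orders
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)        # False: the results differ by rounding

# Stand-ins for two results that agree up to rounding error
C_first = np.full((3, 3), left_to_right)
C_second = np.full((3, 3), right_to_left)

print(np.array_equal(C_first, C_second))     # False: exact comparison fails
print(np.allclose(C_first, C_second))        # True: equal within default tolerances
print(np.max(np.abs(C_first - C_second)))    # size of the largest discrepancy
```

For the results in this notebook, the equivalent check would be `np.allclose(C_cu, C_np)`.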