# Introduction to GPU Programming with Python
## Numba + CUDA


Questions
* How do you port CPU code to the GPU?

Objectives
* Learn how to apply @cuda.jit in Numba CUDA
* Learn how to create a CUDA grid in Numba
* Understand GPU memory allocation and data transfer (implicit or explicit)

### Importing Numba CUDA

In [None]:
from numba import cuda
import numpy as np

### Numba GPU Device Management

First, check whether any GPUs are available:

In [None]:
cuda.gpus

If you have multiple GPUs, then you may need to select one:

In [None]:
cuda.select_device(0)

You can also get some valuable information about the GPU:

In [None]:
cc_cores_per_SM_dict = {
    (2,0) : 32,
    (2,1) : 48,
    (3,0) : 192,
    (3,5) : 192,
    (3,7) : 192,
    (5,0) : 128,
    (5,2) : 128,
    (6,0) : 64,
    (6,1) : 128,
    (7,0) : 64,
    (7,5) : 64,
    (8,0) : 64,
    (8,6) : 128,
    (8,9) : 128,
    (9,0) : 128
    }
device = cuda.get_current_device()
my_sms = device.MULTIPROCESSOR_COUNT
my_cc = device.compute_capability
# .get() returns None when the compute capability is missing from the table above
cores_per_sm = cc_cores_per_SM_dict.get(my_cc)
total_cores = cores_per_sm * my_sms if cores_per_sm is not None else None
print("GPU compute capability: ", my_cc)
print("GPU total number of SMs: ", my_sms)
print("GPU cores per SM: ", cores_per_sm)
print("GPU total number of cores: ", total_cores)

### CUDA kernel declaration in Numba
A CUDA kernel is declared with the @cuda.jit decorator.
A CUDA kernel is a function that is called from the Host but executed on the Device.

In [None]:
@cuda.jit
def matmul(A,B,C):
    """
    Code for kernel.
    """
    # code here

### How to create a CUDA grid for matrix multiplication
Similarly to the parallelization we did in a previous chapter, here we need to distribute the computational load among the available CUDA threads. This time, however, we do not loop; instead we create enough threads so that each thread computes exactly one matrix element. There are several ways of doing so:

#### 1. Using a single thread block
Given that a matrix is a 2-dimensional object, it makes sense to create a 2-dimensional CUDA block.

You can think of CUDA threads as workers.
Here we request one single CUDA block of (NumThreads x NumThreads) workers. 
The matrix elements are distributed among (or assigned to) those workers. Each matrix element is computed independently by a single thread-worker.

In [None]:
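# Choose the number of threads per block dimension (rule of thumb for the block total: 32-512).
# Here the block holds 32 x 32 = 1024 threads, the per-block maximum noted below.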
NumThreads = 32
NumBlocks = 1
griddim = (NumBlocks,NumBlocks)
blockdim = (NumThreads,NumThreads)
print(griddim)
print(blockdim)

Here our CUDA grid consists of one block. Inside that block there are NumThreads x NumThreads threads.

Limitation: a thread block can hold at most 1024 threads, so a single block can only handle a matrix of up to 32 x 32 elements.

#### 2. Using multiple thread blocks
If the matrix is larger than 32 x 32, we need multiple CUDA blocks along both the row and the column axes.
In other words, we request a 2-dimensional CUDA grid of blocks, each with (NumThreads x NumThreads) workers, large enough to cover the whole matrix. The number of blocks depends on the size of the matrix.

In [None]:
A=np.random.rand(128,128).astype(np.float32)
B=np.random.rand(128,128).astype(np.float32)
C=np.zeros(shape=(128,128)).astype(np.float32)

In [None]:
# Assumes a square matrix

threads = 32
# Round up so the grid still covers the matrix when its size is not a multiple of `threads`
blocks = (C.shape[0] + threads - 1) // threads
griddim = (blocks,blocks)
blockdim = (threads,threads)
print(griddim)
print(blockdim)
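
When the matrix size is not a multiple of the block size, plain floor division would leave the last rows and columns without threads, so the block count has to be rounded up. A small sketch of that calculation for a general 2-D shape; the helper name `grid_for` is hypothetical:

```python
def grid_for(shape, threads):
    """Return (griddim, blockdim) that covers a 2-D array of the given shape."""
    # Ceiling division: add (threads - 1) before dividing so partial blocks count.
    blocks_x = (shape[0] + threads - 1) // threads
    blocks_y = (shape[1] + threads - 1) // threads
    return (blocks_x, blocks_y), (threads, threads)

print(grid_for((128, 128), 32))   # ((4, 4), (32, 32))
print(grid_for((100, 70), 32))    # ((4, 3), (32, 32))
```

With a rounded-up grid some threads fall outside the matrix, so the kernel needs a bounds check before it reads or writes.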

![](images/multiplication_multiple_blocks.png)

Each thread computes one element of C:
* Loads a row of matrix A
* Loads a column of matrix B
* Computes the dot product of the two

For N x N matrices, every value of A and B is loaded N times from global memory.
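
The amount of redundant memory traffic can be worked out with a few lines of arithmetic. The sketch below assumes square N x N matrices with N = 128, as in the arrays created above; the variable names are only illustrative.

```python
N = 128  # matrix dimension

threads_total = N * N          # one thread per element of C
loads_per_thread = 2 * N       # one row of A plus one column of B
total_loads = threads_total * loads_per_thread
unique_values = 2 * N * N      # distinct elements of A and B together

print(total_loads)                   # 4194304
print(total_loads // unique_values)  # 128, i.e. each value is loaded N times
```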

### How to call a kernel

In [None]:
matmul[griddim,blockdim](A,B,C)

### Main example: Matrix multiplication using Numba CUDA (in global memory only)
The task is to rewrite the function below as a CUDA kernel that operates in global memory only. The idea is to parallelize the problem by distributing the computational load across multiple CUDA threads. Try both the single-block and the multiple-block approach.

In [None]:
import numpy as np
from numba import ...

In [None]:
def matmul(A,B,C):
    # iterating by row of A
    for i in range(len(A)):
  
        # iterating by column of B
        for j in range(len(B[0])):
  
            # iterating by rows of B
            for k in range(len(B)):
                C[i][j] += A[i][k] * B[k][j]
  

In [None]:
#Part 1: Create matrices A,B,C as numpy arrays. Fill A and B with random numbers.


In [None]:
#Part 2: Calculate number of blocks and threads

In [None]:
#Part 3: Create a CUDA kernel with @cuda.jit decorator

In [None]:
#Part 4: Call the kernel function and time it to get the execution time

In [None]:
#Part 5: Create A,B,C manually on the GPU and copy data to the GPU arrays

In [None]:
#Part 6: Call the kernel function and time it to get the execution time. Compare the execution times.

### Explicit data management

Numba has been automatically transferring the NumPy arrays to the device when you invoke the kernel. However, it can only do so conservatively by always transferring the device memory back to the host when a kernel finishes. To avoid the unnecessary transfer for read-only arrays, it is possible to manually control the transfer.


In [None]:
device_array = cuda.device_array( shape ) #Allocates an empty device ndarray. Similar to numpy.empty().

In [None]:
device_array = cuda.to_device( array ) #Copy data from CPU array to GPU array

In [None]:
array = device_array.copy_to_host() #Copy data back to CPU

Now go back to the matrix multiplication exercise and modify the code to use explicit data management.

### Exercise: Incrementation of array elements
In the following exercise, each element of an array is incremented: array[i] = array[i] + 1
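
In a 1-dimensional grid, the global index of a thread is the block index times the block size plus the thread index within the block, which is the value `cuda.grid(1)` returns inside a kernel. The CPU-only emulation below is a sketch of that mapping, not a kernel; the array length `n = 100` is a hypothetical value chosen so that it is not a multiple of the block size.

```python
n = 100                                  # hypothetical array length
threads = 32
blocks = (n + threads - 1) // threads    # 4 blocks, i.e. 128 threads in total

touched = []
for block_idx in range(blocks):
    for thread_idx in range(threads):
        i = block_idx * threads + thread_idx   # global thread index
        if i < n:                              # surplus threads are skipped
            touched.append(i)

print(blocks, len(touched))   # 4 100
```

Every index from 0 to n-1 is visited exactly once, and the 28 surplus threads are filtered out by the bounds check.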

In [None]:
# Import all required libs
import ...
from numba import ...

In [None]:
# Write a GPU code (Kernel)
def increment(array):
    #define thread index i here ...
    if i<array.size:
        array[i] += 1

In [None]:
# Define CUDA grid: provide with number of blocks and threads per block
data=numpy.ones(12800)
NumThreads=32
NumBlocks =

In [None]:
# Run the kernel and measure execution time:
increment[NumBlocks,NumThreads](data)

In [None]:
# Take advantage of explicit data management and copy the array to the GPU before kernel execution.
# Then measure the execution time again

### Exercise: Reversal of array elements
Here an integer array is sent to the GPU, where the order of its elements is reversed, i.e. the new array[0] is the old array[N-1], the new array[1] is the old array[N-2], etc.
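
A CPU reference makes the index mapping concrete and can later be used to verify the GPU result. The sketch below is plain NumPy, not a kernel, and uses a small hypothetical size `n = 8`; each loop iteration plays the role of one thread.

```python
import numpy as np

n = 8                                   # small hypothetical size for illustration
src = np.arange(n, dtype=np.int32)
dst = np.empty_like(src)

# Each "thread" i reads one input element and writes it to the mirrored slot.
for i in range(n):
    dst[n - 1 - i] = src[i]

print(dst)   # [7 6 5 4 3 2 1 0]
```

Writing into a separate output array avoids the race that would occur if parallel threads overwrote elements of the input that other threads still have to read.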

In [None]:
# Import required libs
import numpy as np
from numba import cuda

In [None]:
#Part 3: Create a CUDA kernel with @cuda.jit decorator
# Kernel: reverse the array content using appropriate indices. 
# To do so you may need input and output indices. Implement kernel with possibility of multiple thread blocks.

In [None]:
# Define CUDA grid
dim=256*1000
NumThreads=
NumBlocks = 

In [None]:
#Part 1: Create arrays on CPU and GPU (if you want to)

In [None]:
#Part 2: Initialize host array

In [None]:
#Part 4: Call the kernel function

In [None]:
#Part 5: Verify the result

## Key points
* **Numba @cuda.jit decorator** 
    * The Device (GPU) cannot work without a Host (CPU)
    * Both Host and Device have their own memory
* **Kernel and Device functions**
    * A kernel is declared with @cuda.jit and is called from the Host.
    * A device function is declared with @cuda.jit(device=True) and is called from the Device.
* **Explicit data transfers between CPU and GPU**
    * Data arrays can be allocated on GPU
    * Data can be copied manually to GPU/CPU