# Introduction to GPU Programming with Python
## Numba + CUDA

Universal functions are great for element-wise operations.
However, not all operations are element-wise. To compile a function for the GPU that is not element-wise, we must use `numba.cuda.jit`.

#### CUDA terminology
Before we jump into CUDA with Python, let's go over the CUDA terminology and the main execution concept:

![](images/host_device.png)

#### CUDA kernel
We have been talking about CUDA kernels, but what is a CUDA kernel?
![](images/cuda_kernel.png)

In CUDA we divide a program into a grid of threads, and a kernel is a program executed on each of those threads independently.

This differs from how we write a CPU program, where we have to spell out every operation, every loop, etc. explicitly.

Let's look at the matrix addition example.
In the CPU implementation we would loop over all the elements of matrix A:

![](images/matrix_cpu.png)

![](images/matrix_gpu2.png)
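
The contrast between the two pictures can be sketched in plain Python. This is only a CPU emulation, not Numba code: `add_cpu`, `add_kernel` and `launch_emulated` are hypothetical names introduced for illustration, and the emulated launch runs the "threads" one after another, whereas a real GPU runs them in parallel and in no guaranteed order.

```python
def add_cpu(A, B):
    """CPU style: explicit loops visit every element in turn."""
    rows, cols = len(A), len(A[0])
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            C[i][j] = A[i][j] + B[i][j]
    return C


def add_kernel(i, j, A, B, C):
    """GPU style: the body handles exactly one element, chosen by the thread's (i, j)."""
    C[i][j] = A[i][j] + B[i][j]


def launch_emulated(kernel, rows, cols, *args):
    """Hypothetical stand-in for a kernel launch: one kernel call per thread."""
    for i in range(rows):
        for j in range(cols):
            kernel(i, j, *args)


A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]
C = [[0, 0], [0, 0]]
launch_emulated(add_kernel, 2, 2, A, B, C)
print(C)
```

Note that the loops have not disappeared; they have moved out of the function body and into the launch, which is the part the GPU hardware takes over.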

Unfortunately, there is another layer of complexity: threads are grouped into blocks, and the blocks form the grid:

![](images/cuda_block_grid2.png)
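
Because threads are numbered only within their own block, each thread has to combine its block index and thread index to find its global position. The following is a minimal CPU-only sketch of that arithmetic for a 1D grid; `global_index` and `emulate_launch_1d` are hypothetical helpers, and in a Numba kernel the three inputs would come from `cuda.blockIdx.x`, `cuda.blockDim.x` and `cuda.threadIdx.x`.

```python
def global_index(block_idx, block_dim, thread_idx):
    """Global position of a thread in a 1D grid."""
    return block_idx * block_dim + thread_idx


def emulate_launch_1d(blocks_per_grid, threads_per_block):
    """List the global index of every thread, block by block (emulation only)."""
    return [
        global_index(block, threads_per_block, thread)
        for block in range(blocks_per_grid)
        for thread in range(threads_per_block)
    ]


# 3 blocks of 4 threads cover the indices 0..11 exactly once
indices = emulate_launch_1d(3, 4)
print(indices)
```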

### CUDA kernel declaration
Once again, here is how to declare a kernel:

In [None]:
import numpy
from numba import cuda

@cuda.jit
def my_kernel(io_array):
    """
    Code for kernel.
    """
    # code here

In [None]:
# Create the data array
data = numpy.ones(12800)

# Set the number of threads in a block
threadsperblock = 32 

# Calculate the number of thread blocks in the grid
blockspergrid = (data.size + (threadsperblock - 1)) // threadsperblock

# Now finally start the kernel
my_kernel[blockspergrid, threadsperblock](data)
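
The `blockspergrid` expression above is an integer ceiling division: it rounds up so that every element gets a thread even when the array size is not a multiple of the block size. A small sketch in plain Python, where `blocks_per_grid` and `surplus_threads` are hypothetical helper names, shows the effect:

```python
def blocks_per_grid(n, threads_per_block):
    """Smallest number of blocks whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def surplus_threads(n, threads_per_block):
    """Threads launched beyond the end of the data."""
    return blocks_per_grid(n, threads_per_block) * threads_per_block - n


print(blocks_per_grid(12800, 32), surplus_threads(12800, 32))
print(blocks_per_grid(12801, 32), surplus_threads(12801, 32))
```

Whenever the surplus is non-zero, some threads have an index past the end of the array, which is why a kernel normally guards its work with a bounds check such as `if i < array.size`.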


### Choosing the block size

* On the software side, the block size determines how many threads share a given area of shared memory.
* On the hardware side, the block size must be large enough to keep the execution units fully occupied.

The block size you choose depends on:
* The size of the data array
* The size of the shared memory per block (e.g. 64KB)
* The maximum number of threads per block supported by the hardware (e.g. 512 or 1024)
* The maximum number of threads per multiprocessor (MP) (e.g. 2048)
* The maximum number of blocks per MP (e.g. 32)
* The number of threads that execute together in lockstep (a “warp”, i.e. 32)

Rules of thumb for threads per block:

* It should be a multiple of the warp size (32).
* A good place to start is 128-512, but benchmarking is required to determine the optimal value.
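
To see why very small blocks can leave the hardware under-used, here is a rough back-of-the-envelope sketch. It uses only the example limits quoted above (2048 threads and 32 blocks per MP) and deliberately ignores registers and shared memory, so it is a simplified model rather than a real occupancy calculation; `resident_threads` is a hypothetical helper name.

```python
def resident_threads(threads_per_block, max_threads_per_mp=2048, max_blocks_per_mp=32):
    """Threads that can be resident on one MP under the two example limits."""
    blocks_by_threads = max_threads_per_mp // threads_per_block
    resident_blocks = min(blocks_by_threads, max_blocks_per_mp)
    return resident_blocks * threads_per_block


for tpb in (32, 64, 128, 256, 512):
    print(tpb, resident_threads(tpb))
```

Under these assumed limits, blocks of 32 threads hit the blocks-per-MP cap first and fill only half of the multiprocessor, while 64 threads or more can fill it completely. Real devices have different limits, so benchmarking remains the deciding factor.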


### Exercise 1
Let's do the following exercise, where each element of an array is incremented: array[i] = array[i] + 1

In [None]:
# Import all required libs
import ...
from numba import ...

In [None]:
# Write the GPU code (kernel)
def kernel1(array):
    # Define the thread index i here ...
    if i<array.size:
        array[i] += 1

In [None]:
# Define the CUDA grid: provide the number of blocks and the number of threads per block
data=numpy.ones(12800)
threads=32
blocks =

In [None]:
# Run the kernel and measure execution time:
kernel1[blocks,threads](data)

In [None]:
# Take advantage of explicit data management and copy the array to the GPU before kernel execution.
# Then measure the execution time again

### Explicit data management

Numba has been automatically transferring the NumPy arrays to the device when you invoke the kernel. However, it can only do so conservatively by always transferring the device memory back to the host when a kernel finishes. To avoid the unnecessary transfer for read-only arrays, it is possible to manually control the transfer.


In [None]:
device_array = cuda.device_array( shape ) #Allocates an empty device ndarray. Similar to numpy.empty().

In [None]:
device_array = cuda.to_device( array ) #Copy data from CPU array to GPU array

In [None]:
array = device_array.copy_to_host() #Copy data back to CPU

Now go back to Exercise 1 and modify the code to use explicit data management.

### Exercise 2
Here an integer array is sent to the GPU, where the order of its elements is reversed, i.e. out[0] = in[N-1], out[1] = in[N-2], etc.

In [None]:
# Import required libs
import numpy as np
from numba import cuda

In [None]:
#Part 3: Create a CUDA kernel with the @cuda.jit decorator
# Kernel: reverse the array content using appropriate indices.
# To do so you may need separate input and output indices. Implement the kernel so that it works with multiple thread blocks.

In [None]:
# Define CUDA grid
dim=256*1000
NumThreadsPerBlock=
NumBlocks = 

In [None]:
#Part 1: Create arrays on CPU and GPU (if you want to)

In [None]:
#Part 2: Initialize host array

In [None]:
#Part 4: Call the kernel function

In [None]:
#Part 5: Verify the result

### Hands-on: Matrix multiplication on GPU (with global memory) 

In [None]:
def matmul(A, B, C):
    # Iterate over the rows of A
    for i in range(len(A)):
        # Iterate over the columns of B
        for j in range(len(B[0])):
            # Iterate over the rows of B (the shared inner dimension)
            for k in range(len(B)):
                C[i][j] += A[i][k] * B[k][j]

In [None]:
#Part 1: Create matrices A,B,C as numpy arrays. Fill A and B with random numbers.


In [None]:
#Part 2: Calculate number of blocks and threads

In [None]:
#Part 3: Create a CUDA kernel with @cuda.jit decorator

In [None]:
#Part 4: Call the kernel function and time it to get the execution time

In [None]:
#Part 5: Create A,B,C manually on the GPU and copy data to the GPU arrays

In [None]:
#Part 6: Call the kernel function and time it to get the execution time. Compare the execution times.

### Shared memory
A limited amount of shared memory can be allocated on the device to speed up access to data. That memory is shared amongst all threads in a given block. It is much faster than regular (global) device memory. It also allows the threads of a block to cooperate on a given solution.

In [None]:
numba.cuda.shared.array(shape, dtype)

This function is called on the device, i.e. from a kernel or a device function. A common pattern is to have each thread populate one element of the shared array and then wait for all threads in the block to finish by calling `syncthreads`:

In [None]:
 numba.cuda.syncthreads()
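
The populate-then-synchronize pattern can be illustrated with a CPU-only emulation. Everything here is a hypothetical example rather than Numba code: `rotate_within_blocks` makes each emulated thread output the value loaded by its right-hand neighbour in the same block, a Python list stands in for the shared array, and the boundary between the two loops plays the role of `cuda.syncthreads()`. For simplicity it assumes the data length is a multiple of the block size.

```python
def rotate_within_blocks(data, threads_per_block):
    """Each thread outputs the element loaded by its right-hand neighbour in the block."""
    n = len(data)
    assert n % threads_per_block == 0, "sketch assumes full blocks only"
    out = [None] * n
    for block in range(n // threads_per_block):
        base = block * threads_per_block
        shared = [0] * threads_per_block  # one buffer per block, like a shared array

        # Phase 1: every thread stores its own element in shared memory
        for t in range(threads_per_block):
            shared[t] = data[base + t]

        # Barrier: on the GPU, cuda.syncthreads() guarantees phase 1 is complete
        # for the whole block before any thread starts phase 2.

        # Phase 2: every thread reads an element written by another thread
        for t in range(threads_per_block):
            out[base + t] = shared[(t + 1) % threads_per_block]
    return out


result = rotate_within_blocks(list(range(8)), 4)
print(result)
```

Without the barrier, a thread in phase 2 could read a slot that its neighbour has not written yet, which is exactly the race that synchronization prevents.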

### Exercise 3

Here we re-use the code from Exercise 2 and bring shared memory into play.

In [None]:
# Take this code and re-write it in the next cell by using a shared memory 
@cuda.jit
def reverseArrayBlock(d_out,d_in):
    ind_in = cuda.blockDim.x * cuda.blockIdx.x + cuda.threadIdx.x  # Global index of the current thread
    ind_out = d_in.size - ind_in - 1  # Mirrored position in the output array
    if ind_in<d_in.size:
        d_out[ind_out] = d_in[ind_in]

In [None]:
# Part 2: Here is the code with shared memory
@cuda.jit
def reverseArrayBlock_shared(d_out,d_in):
    # Declare/allocate array s in shared memory
    ....
    # Create input index
    ....
    # Populate array s from array d_in
    ....
    # Synchronize threads in each block
    ....
    # Create output index
    ....
    if ind_in<d_in.size:
        # Populate output array d_out from shared array s
        ....

In [None]:
dim=256*1000
NumThreadsPerBlock=128
NumBlocks = (dim + (NumThreadsPerBlock - 1)) // NumThreadsPerBlock

In [None]:
#Part 1: Create arrays on CPU and GPU (if you want to)
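a = np.arange(0, dim, dtype=np.int32)
b = np.zeros(dim, dtype=np.int32)
# Dynamic shared memory per block in bytes: one int32 per thread (assumed from the launch below)
memSize = NumThreadsPerBlock * a.itemsize
print(memSize)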

In [None]:
#Part 3: Call the kernel
reverseArrayBlock_shared[NumBlocks,NumThreadsPerBlock,0,memSize](b,a)

In [None]:
#Part 4: Modify the kernel as well as the call from the host by changing static shared memory declaration to dynamic

### Hands-on: Matrix multiplication with shared memory

![](images/05-matmulshared.png)

In [None]:
import numpy as np
from numba import cuda

In [None]:
#Part 3: Create a CUDA kernel with @cuda.jit decorator

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16

def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    
    # Define global and thread indices
    
    # Define number of blocks per grid
    
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        #####
        
        # Wait until all threads finish preloading
        
        # Computes partial product on the shared memory
        for j in range(TPB):
            #####
            
        # Wait until all threads finish computing
        
    # Put tmp into C matrix

In [None]:
#Part 1: Create matrices A,B,C as numpy arrays (size 128x128). Fill A and B with random numbers.

In [None]:
#Part 2: Calculate number of blocks and threads

In [None]:
#Part 4: Call the kernel function and time it to get the execution time