# Introduction to GPU Programming with Python
## Numba + CUDA

Universal functions are great for element-wise operations. However, not all operations are element-wise. To compile a function for the GPU that is not element-wise, we must use `numba.cuda.jit`.


Some important terms in CUDA programming:

- host: the CPU

- device: the GPU

- host memory: the system main memory

- device memory: onboard memory on a GPU card

- kernel: a GPU function launched by the host and executed on the device

- device function: a GPU function executed on the device which can only be called from the device (i.e. from a kernel or another device function)


### CPU vs. GPU

CPUs are optimized for latency.

A CPU tries to execute a given instruction as quickly as possible, i.e., it tries to keep the latency (the time between issuing and executing an instruction) as short as possible. CPUs use caches and a lot of control logic to achieve this goal.

GPUs are optimized for throughput.

GPUs were (and are) made to display graphics on your screen. It doesn't matter how quickly a GPU can update a single pixel. It's important how quickly it can update all of the pixels on the screen (more than 2 million on an HD display). In addition it often must perform the same operation on a lot of vertices or pixels.


### Short intro to CUDA

CUDA - Compute Unified Device Architecture

Provides access to instructions and memory of massively parallel elements in CUDA-enabled GPUs

### GPU Architecture
![](images/GPU-architecture.png)

#### <font color='blue'>CUDA sees GPU as: </font>
- lots of streaming multiprocessors (SMs)
- separate GPU memory
- scheduler
- thousands of compute threads (similar to CPU compute threads)

#### <font color='blue'>Thousands of threads attack the same code </font>
![](images/threads_attack.png)

#### <font color='blue'>To manage this many threads we need: </font>
- to ID each thread
- organize these threads
- schedule them for execution

#### <font color='blue'>How threads are organized: </font>
- threads within a block cooperate (exchange data via shared memory)
- threads in different blocks cannot cooperate

#### <font color='blue'>Threads layout: </font>
- threads are organized into blocks
- blocks are organized into a grid
- each block is executed on a single SM (an SM can hold several blocks at once)

#### <font color='blue'>Simple CUDA program run on GPU threads is called </font><font color='red'>KERNEL </font>
![](images/block_grid.png)

### CUDA Programming Recipe
- copy input data from CPU memory to GPU memory
- load GPU program (KERNEL) and execute
- copy results from the GPU memory back to the CPU memory

### GPU execution model

- GPUs use many lightweight threads.
- GPUs hide latency instead of avoiding it.
- GPUs work best if the problem can be mapped onto a grid.

### CUDA Block-Threading model
#### Thread positioning:
To help deal with multi-dimensional arrays, CUDA allows you to specify multi-dimensional blocks and grids. When a kernel is launched, the number of blocks per grid and the number of threads per block (`blockspergrid` and `threadsperblock` in the examples of this notebook) can each be a tuple of one, two or three integers. Compared to 1-dimensional declarations of equivalent sizes, this does not change the efficiency or behaviour of the generated code, but it can help you write your algorithms in a more natural way.
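
As a small illustration of a multi-dimensional launch configuration, the sketch below computes a 2D grid for a matrix. It is plain Python and needs no GPU; the matrix shape, the 16x16 block and the helper name `grid_2d` are assumptions made for this example.

```python
import math

def grid_2d(shape, threadsperblock):
    """Return the number of blocks per grid dimension needed to cover `shape`."""
    rows, cols = shape
    tx, ty = threadsperblock
    # Round up so that partially filled blocks at the edges are included
    return (math.ceil(rows / tx), math.ceil(cols / ty))

shape = (1000, 600)            # hypothetical matrix shape
threadsperblock = (16, 16)     # 256 threads per block
blockspergrid = grid_2d(shape, threadsperblock)
print(blockspergrid)           # (63, 38)
```

With Numba, tuples like these would be placed in the square brackets of a kernel launch, in the order blocks per grid, then threads per block.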

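In CUDA the following objects are defined: `threadIdx.x`, `blockIdx.x`, `blockDim.x`, `gridDim.x`. They tell each thread where it sits in the thread hierarchy: its index within the block, the index of its block within the grid, the number of threads per block, and the number of blocks in the grid.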
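
How these objects combine into a unique index can be modelled in plain Python. The sketch below loops over a tiny, made-up 1D grid (3 blocks of 4 threads) in place of real GPU threads; the variable names mirror the CUDA ones but are ordinary Python integers here.

```python
# Plain-Python model of a 1D CUDA grid (no GPU involved)
gridDim_x = 3    # number of blocks in the grid (assumed for the example)
blockDim_x = 4   # number of threads per block (assumed for the example)

global_ids = []
for blockIdx_x in range(gridDim_x):          # every block in the grid
    for threadIdx_x in range(blockDim_x):    # every thread in that block
        # Unique position of this thread in the whole grid
        i = blockIdx_x * blockDim_x + threadIdx_x
        global_ids.append(i)

print(global_ids)  # the integers 0 to 11, each exactly once
```

On a real GPU the two loops do not exist: every thread runs the body once, with its own values of the index objects.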


#### Absolute positions:
- `cuda.grid(ndim)`: returns the absolute position of the current thread in the entire grid
- `cuda.gridsize(ndim)`: returns the absolute size of the entire grid, in threads

In the previous examples the array operations were sent to the GPU and parallelized automatically, because the `@vectorize` decorator distributed the calculation among many GPU threads. This is not always possible, and sometimes we need explicit control over the threads.

### Device Management
Before doing any computing we need to find available GPUs and choose one.
It is possible to obtain a list of all the GPUs in the system using the following commands:


In [None]:
from numba import cuda
print(cuda.gpus)

If your machine has multiple GPUs, you might want to select which one to use. By default the CUDA driver selects the fastest GPU as device 0, which is the default device used by Numba.

In [None]:
device_id=0
cuda.select_device( device_id )

This creates a new CUDA context for the selected device_id. device_id should be the number of the device (starting from 0; the device order is determined by the CUDA libraries). The context is associated with the current thread. Numba currently allows only one context per thread.

### Writing CUDA kernels
In CUDA, the code you write will be executed by multiple threads at once (often hundreds or thousands). Your solution will be modeled by defining a thread hierarchy of grid, blocks, and threads.

Numba also exposes three kinds of GPU memory:

- global device memory
- shared memory
- local memory

NVIDIA recommends:

- Find ways to parallelize sequential code
- Minimize data transfers between the host and the device
- Adjust kernel launch configuration to maximize device utilization
- Ensure global memory accesses are coalesced
- Minimize redundant accesses to global memory whenever possible
- Avoid different execution paths within the same warp

#### Kernel declaration
A kernel function is a GPU function that is meant to be called from CPU code. It has two fundamental characteristics:

- kernels cannot explicitly return a value; all result data must be written to an array passed to the function (if computing a scalar, you will probably pass a one-element array);
- kernels explicitly declare their thread hierarchy when called, i.e. the number of thread blocks and the number of threads per block (note that while a kernel is compiled once, it can be called multiple times with different block sizes or grid sizes).

In [None]:
import numpy
from numba import cuda

@cuda.jit
def my_kernel(io_array):
    """
    Code for kernel.
    """
    # code here

In [None]:
# Create the data array 
data=numpy.ones(12800)

# Set the number of threads in a block
threadsperblock = 32 

# Calculate the number of thread blocks in the grid
blockspergrid = (data.size + (threadsperblock - 1)) // threadsperblock

# Now finally start the kernel
my_kernel[blockspergrid, threadsperblock](data)
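
The block-count formula above is an integer ceiling division. A quick check in plain Python (the second array size of 1000 is a made-up example) shows why a kernel usually needs a bounds check: the grid can contain more threads than there are elements.

```python
def blocks_needed(n, threadsperblock):
    """Smallest number of blocks whose threads cover n elements."""
    return (n + (threadsperblock - 1)) // threadsperblock

threadsperblock = 32

# Array size from the cell above: divides evenly
print(blocks_needed(12800, threadsperblock))        # 400

# Hypothetical size that is not a multiple of 32
n = 1000
blocks = blocks_needed(n, threadsperblock)          # 32 blocks
idle_threads = blocks * threadsperblock - n         # 24 threads without an element
print(blocks, idle_threads)
```

In the second case 1024 threads are launched for 1000 elements, so the last 24 threads must skip their work, for example with a test such as `if i < array.size`.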


### Choosing the block size

* On the software side, the block size determines how many threads share a given area of shared memory.
* On the hardware side, the block size must be large enough to keep the execution units fully occupied.

The block size you choose depends on:

* The size of the data array
* The size of the shared memory per block (e.g. 64KB)
* The maximum number of threads per block supported by the hardware (e.g. 512 or 1024)
* The maximum number of threads per multiprocessor (MP) (e.g. 2048)
* The maximum number of blocks per MP (e.g. 32)
* The number of threads that are executed together as a “warp” (i.e. 32)

Rules of thumb for threads per block:

* Should be a multiple of the warp size (32)
* A good place to start is 128-512, but benchmarking is required to determine the optimal value.
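
To see how these limits interact, here is a rough back-of-the-envelope estimate in plain Python. It uses the example limits quoted above (2048 threads and 32 blocks per multiprocessor), which vary between GPU models, and it ignores shared memory and register usage, so treat it as an illustration and not as an occupancy calculator. The function name `resident_threads` is made up for this sketch.

```python
MAX_THREADS_PER_MP = 2048   # example value from the list above
MAX_BLOCKS_PER_MP = 32      # example value from the list above

def resident_threads(threadsperblock):
    """Estimate how many threads can be resident on one multiprocessor."""
    # Number of blocks that fit if only the thread limit mattered
    blocks_by_threads = MAX_THREADS_PER_MP // threadsperblock
    # The block limit may be reached first when blocks are small
    blocks = min(MAX_BLOCKS_PER_MP, blocks_by_threads)
    return blocks * threadsperblock

for tpb in (32, 64, 128, 256, 512):
    frac = resident_threads(tpb) / MAX_THREADS_PER_MP
    print(tpb, resident_threads(tpb), f"{frac:.0%}")
```

Under these assumed limits, a block size of 32 can fill at most half of a multiprocessor, because the block limit is reached first, while the larger sizes in the loop can fill it completely.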


### Thread positioning in Numba
Numba uses a syntax similar to CUDA's to address thread positions:

- `cuda.threadIdx.x`, `cuda.blockIdx.x`, `cuda.blockDim.x`, `cuda.gridDim.x` are special objects provided by CUDA that let a thread know its position in the thread hierarchy.

These objects can be 1-, 2- or 3-dimensional, depending on how the kernel was invoked. To access the value at each dimension, use the `x`, `y` and `z` attributes of these objects, respectively.

### Absolute positions

Simple algorithms tend to always use thread indices in the same way: in one dimension, the absolute index of a thread is `cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x`. Numba provides additional facilities to automate such calculations:

* `numba.cuda.grid(ndim)` - Return the absolute position of the current thread in the entire grid of blocks. `ndim` should correspond to the number of dimensions declared when instantiating the kernel. If `ndim` is 1, a single integer is returned. If `ndim` is 2 or 3, a tuple of the given number of integers is returned.
* `numba.cuda.gridsize(ndim)` - Return the absolute size (or shape) in threads of the entire grid of blocks. `ndim` has the same meaning as in `grid()` above.

Each thread needs this position in order to know which array element(s) it is responsible for.
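
One use of the grid size is a loop in which each thread handles several elements, which is handy when the array is larger than the grid. The following plain-Python model imitates that pattern without a GPU: `start` plays the role of `cuda.grid(1)` and `stride` the role of `cuda.gridsize(1)`. The sizes are made up for illustration.

```python
n = 10                      # number of array elements (assumed)
blocks, threads = 2, 2      # deliberately small grid: only 4 threads
stride = blocks * threads   # what cuda.gridsize(1) would return

data = [1] * n
for block in range(blocks):
    for thread in range(threads):
        start = block * threads + thread   # what cuda.grid(1) would return
        # Each simulated thread handles start, start + stride, start + 2*stride, ...
        for i in range(start, n, stride):
            data[i] += 1

print(data)  # every element was incremented exactly once
```

Because the inner `range` stops at `n`, no separate bounds check is needed in this variant.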

#### Exercise 1
Let's do the following exercise, where each element of an array is incremented: `array[i] = array[i] + 1`

In [None]:
# Import all required libs
import ...
from numba import ...

In [None]:
# Write a GPU code (Kernel)
def kernel1(array):
    #define thread index i here ...
    if i<array.size:
        array[i] += 1

In [None]:
# Define CUDA grid: provide with number of blocks and threads per block
data=numpy.ones(12800)
threads=32
blocks =

In [None]:
# Run the kernel and measure execution time:
kernel1[blocks,threads](data)

In [None]:
# Take advantage of explicit data management and copy an array to the GPU before kernel execution.
# Then measure the execution time again

#### Exercise 2
An integer array is sent to the GPU, where the order of its elements is reversed, i.e. array[0] receives the old array[N-1], array[1] the old array[N-2], etc.

In [None]:
# Import required libs
import numpy as np
from numba import cuda, float32

In [None]:
#Part 3: Create a CUDA kernel with @cuda.jit decorator
# Kernel: reverse the array content using appropriate indices. 
# To do so you may need input and output indices. Implement kernel with possibility of multiple thread blocks.

In [None]:
# Define CUDA grid
dim=256
NumBlocks=1
NumThreadsPerBlock=dim

In [None]:
#Part 1: Create arrays on CPU and GPU (if you want to)

In [None]:
#Part 2: Initialize host array

In [None]:
#Part 4: Call the kernel function

In [None]:
#Part 5: Verify the result

#### Exercise 3
Repeat Exercise 2 with multiple blocks per CUDA grid (NumBlocks > 1)

### Explicit data management

Numba has been automatically transferring the NumPy arrays to the device when you invoke the kernel. However, it can only do so conservatively by always transferring the device memory back to the host when a kernel finishes. To avoid the unnecessary transfer for read-only arrays, it is possible to manually control the transfer.


In [None]:
device_array = cuda.device_array( shape ) #Allocates an empty device ndarray. Similar to numpy.empty().

In [None]:
device_array = cuda.to_device( array ) #Copy data from CPU array to GPU array

In [None]:
array = device_array.copy_to_host() #Copy data back to CPU

In the example below we try to avoid automatic data transfer:

In [None]:
from numba import vectorize
import numpy as np

In [None]:
@vectorize(['int32(int32, int32)'], target='cuda')
def add(x, y):
    return x + y

In [None]:
from numba import cuda
n = 10
x=np.arange(n).astype(np.int32)
y=np.ones_like(x)

d_x=cuda.to_device(x)
d_y=cuda.to_device(y)

In [None]:
# Run and measure execution time:
add(d_x,d_y)

Here a new array is allocated for the result on every call. Sometimes you want the result to go into an array that already lives on the GPU (e.g. for further computing on the GPU). This can be done by creating an array directly on the GPU:

In [None]:
d_res = cuda.device_array(shape=(n,), dtype=np.int32)

In [None]:
# Run again and measure execution time:
add(d_x, d_y, out=d_res)

### Calling device functions
All the functions we have created so far run on the GPU but are called from the CPU. Sometimes a function needs to be callable from the GPU only. This can be done by passing the extra argument `device=True` to the `@cuda.jit` decorator:

In [None]:
from numba import vectorize, cuda

@cuda.jit('float32(float32, float32, float32)', device=True, inline=True)
def cu_device_fn(x, y, z):
    return x ** y / z

Then we create a function (callable from the CPU) which calls the above GPU function:

In [None]:
@vectorize(['float32(float32, float32, float32)'], target='cuda')
def cu_ufunc(x, y, z):
    return cu_device_fn(x, y, z)

In [None]:
from numba import cuda

@cuda.jit(device=True)
def device_add(a, b):
    return a + b

@vectorize(['float32(float32, float32)'], target='cuda')
def do_sum_pow(v1, v2):
    s1 = device_add(v1, v2)
    return s1 ** 2

In [None]:
n = 1000000
v1 = np.random.uniform(0.5, 1.5, size=n).astype(np.float32)
v2 = np.random.uniform(2.5, 5.5, size=n).astype(np.float32)

In [None]:
do_sum_pow(v1, v2)

#### Exercise 4
Polynomial evaluation on both GPU and CPU

In [None]:
import numpy as np
from numba import cuda

In [None]:
#Part 3: Modify polynomial function to make it work with numba.cuda
def host_polyval(result, array, coeffs):
    for i in range(len(array)):
        val = coeffs[0]
        for coeff in coeffs[1:]:
            val = val * array[i] + coeff
        result[i] = val
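
The nested loop above is Horner's scheme: the coefficients are ordered from the highest power down to the constant term, the same convention `np.polyval` uses. A tiny worked example in plain Python (lists instead of NumPy arrays, coefficients chosen for illustration) makes the ordering explicit:

```python
def host_polyval(result, array, coeffs):
    # Same algorithm as in the cell above
    for i in range(len(array)):
        val = coeffs[0]
        for coeff in coeffs[1:]:
            val = val * array[i] + coeff
        result[i] = val

coeffs = [1, 2, 3]          # represents 1*x**2 + 2*x + 3
array = [0, 1, 2]           # points at which the polynomial is evaluated
result = [0] * len(array)
host_polyval(result, array, coeffs)
print(result)               # [3, 6, 11]
```

Each element of `result` depends only on the matching element of `array`, which is what makes this computation a natural fit for one GPU thread per element.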

In [None]:
#Part 1: Allocate integer array (int32), size of 2048 * 1024. Also make an empty array for result, same size
array = 
coeffs = np.float32(range(1, 10))
result = 

In [None]:
#Part 2: Prepare grid
blocks=
threads=

In [None]:
#Part 4: Call the kernel and measure execution time

In [None]:
#Part 5: Call the built-in NumPy polynomial function  np.polyval(coeffs, array) and compare results

In [None]:
#Part 6: Go back to the kernel (Part 3) and modify it to make it work on CPU with @jit

#### Exercise 5
Matrix multiplication WITH GLOBAL MEMORY 

In [None]:
def matmul(A,B,C):
    # iterate over the rows of A
    for i in range(len(A)):
        # iterate over the columns of B
        for j in range(len(B[0])):
            # iterate over the rows of B
            for k in range(len(B)):
                C[i][j] += A[i][k] * B[k][j]

In [None]:
#Part 1: Create matrices A,B,C as numpy arrays. Fill A and B with random numbers.


In [None]:
#Part 2: Calculate number of blocks and threads

In [None]:
#Part 3: Create a CUDA kernel with @cuda.jit decorator

In [None]:
#Part 4: Call the kernel function and time it to get the execution time

In [None]:
#Part 5: Create A,B,C manually on the GPU and copy data to the GPU arrays

In [None]:
#Part 6: Call the kernel function and time it to get the execution time. Compare the execution times.

### Shared memory
A limited amount of shared memory can be allocated on the device to speed up access to data. That memory is shared amongst all threads in a given block. It is much faster than regular device memory. It also allows threads to cooperate on a given solution.
The memory is allocated once for the duration of the kernel, unlike traditional dynamic memory management.

In [None]:
 numba.cuda.shared.array(shape, type)

This function is called on the device, i.e. from the kernel or device function. A common pattern is to have each thread populate one element in the shared array, then wait for all threads to finish using `syncthreads`:

In [None]:
 numba.cuda.syncthreads()
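
The populate-then-synchronize pattern can be imitated on the CPU with Python's `threading` module, where a `threading.Barrier` plays the role of `cuda.syncthreads()` and an ordinary list stands in for the shared array. This is only an analogy meant to show why the barrier is needed; the block size and the "read your neighbour's value" task are assumptions chosen for the example.

```python
import threading

BLOCK_SIZE = 8                               # threads in the simulated block
data = list(range(BLOCK_SIZE))               # stands in for global memory
shared = [None] * BLOCK_SIZE                 # stands in for the shared array
barrier = threading.Barrier(BLOCK_SIZE)      # stands in for cuda.syncthreads()

def block_thread(tid):
    # Step 1: each thread populates one element of the shared array
    shared[tid] = data[tid]
    # Step 2: wait until every thread in the block has written its element
    barrier.wait()
    # Step 3: now it is safe to read an element written by another thread
    data[tid] = shared[(tid + 1) % BLOCK_SIZE]

workers = [threading.Thread(target=block_thread, args=(t,)) for t in range(BLOCK_SIZE)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(data)  # [1, 2, 3, 4, 5, 6, 7, 0]
```

Without the barrier, a thread could reach step 3 before its neighbour has finished step 1 and read a slot that has not been filled yet.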