# Introduction to GPU Programming with Python
## Numba + CUDA


Questions
* How do we port CPU code to the GPU?

Objectives
* Learn how to apply @cuda.jit in Numba CUDA
* Learn how to create a CUDA grid in Numba
* Understand GPU memory allocation and data transfer (implicit or explicit)

### Importing Numba CUDA

In [None]:
from numba import cuda

### Numba GPU Device Management

In [None]:
#Check GPU devices available:
cuda.gpus

# Select device
cuda.select_device(0)

#Get some info on the GPU
cc_cores_per_SM_dict = {
    (2,0) : 32,
    (2,1) : 48,
    (3,0) : 192,
    (3,5) : 192,
    (3,7) : 192,
    (5,0) : 128,
    (5,2) : 128,
    (6,0) : 64,
    (6,1) : 128,
    (7,0) : 64,
    (7,5) : 64,
    (8,0) : 64,
    (8,6) : 128,
    (8,9) : 128,
    (9,0) : 128
    }
device = cuda.get_current_device()
my_sms = device.MULTIPROCESSOR_COUNT
my_cc = device.compute_capability
# .get() returns None for compute capabilities missing from the table above
cores_per_sm = cc_cores_per_SM_dict.get(my_cc)
print("GPU compute capability:", my_cc)
print("GPU total number of SMs:", my_sms)
if cores_per_sm is None:
    print("GPU cores per SM: unknown for this compute capability")
else:
    print("GPU cores per SM:", cores_per_sm)
    print("GPU total number of cores:", cores_per_sm * my_sms)

### CUDA kernel declaration in Numba
A CUDA kernel is declared with the @cuda.jit decorator.
A CUDA kernel is a function that is called from the Host but executed on the Device.

In [None]:
@cuda.jit
def my_kernel(io_array):
    """
    Code for kernel.
    """
    # code here

### CUDA Device function declaration in Numba
A CUDA Device function is a function that is called from the Device and executed on the Device.
Here is how to declare a Device function with @cuda.jit:

In [None]:
@cuda.jit(device=True)
def my_device_function(io_array):
    """
    Code for Device function.
    """
    # code here

### Creating a CUDA grid in Numba

In [None]:
# Create the data array 
import numpy as np
data=np.ones(12800)

# Set the number of threads in a block
threadsperblock = 32 

# Calculate the number of thread blocks in the grid
blockspergrid = (data.size + (threadsperblock - 1)) // threadsperblock

# Now finally start the kernel
my_kernel[blockspergrid, threadsperblock](data)
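
The block count above is a ceiling division: it rounds up so that every element gets a thread even when the array size is not a multiple of the block size. The arithmetic can be checked on the CPU with a small helper; `blocks_for` is a hypothetical name used only in this sketch.

```python
import numpy as np

def blocks_for(n_elements, threads_per_block):
    """Smallest number of blocks whose threads cover n_elements."""
    return (n_elements + threads_per_block - 1) // threads_per_block

data = np.ones(12800)
print(blocks_for(data.size, 32))           # 12800 / 32 divides evenly: 400
print(blocks_for(12801, 32))               # one extra element needs one extra block: 401
print(blocks_for(12801, 32) * 32 - 12801)  # threads left without an element: 31
```

Because the last block can contain threads with no element to work on, a kernel normally checks its index against the array size before touching memory.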


What if we need 2-dimensional blocks and a 2-dimensional grid?

In [None]:
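# Create a 2-dimensional data array
data2d = np.ones((128, 100))

# Set the number of threads per block in each dimension (32 x 32 = 1024 threads)
threadsperblock = 32
block = (threadsperblock, threadsperblock)

# Calculate the number of thread blocks needed along each dimension
blocks_x = (data2d.shape[0] + (threadsperblock - 1)) // threadsperblock
blocks_y = (data2d.shape[1] + (threadsperblock - 1)) // threadsperblock
grid = (blocks_x, blocks_y)

# Now finally start the kernel
my_kernel[grid, block](data2d)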
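
For a 2-dimensional problem, the number of blocks is usually computed separately for each axis from the shape of the data. A minimal CPU-only sketch of that calculation follows; `grid_2d` and the default 16 x 16 block are assumptions made for illustration, not part of Numba.

```python
import math

def grid_2d(shape, block=(16, 16)):
    """Blocks needed along each axis so that grid * block covers `shape`."""
    return (math.ceil(shape[0] / block[0]), math.ceil(shape[1] / block[1]))

print(grid_2d((1000, 500)))            # (63, 32)
print(grid_2d((1000, 500), (32, 32)))  # (32, 16)
```

Inside a kernel launched with such a grid, `cuda.grid(2)` returns the pair of absolute thread indices, and both should be checked against the array shape.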

### Main example: Matrix multiplication using Numba CUDA (in global memory only)
The task is to rewrite the matmul function below as a CUDA kernel that operates in global memory. The idea is to parallelize the problem by distributing the computational load across multiple CUDA threads.

In [None]:
import numpy as np
from numba import ...

In [None]:
def matmul(A,B,C):
    # Iterate over the rows of A
    for i in range(len(A)):
        # Iterate over the columns of B
        for j in range(len(B[0])):
            # Iterate over the rows of B (the shared inner dimension)
            for k in range(len(B)):
                C[i][j] += A[i][k] * B[k][j]
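
Before porting the function, it can help to confirm that the CPU version gives the expected result, since it will later serve as the reference for the GPU output. The sketch below repeats the function and compares it with NumPy's `@` operator on small matrices; the sizes and the random seed are arbitrary choices for this check.

```python
import numpy as np

def matmul(A, B, C):
    for i in range(len(A)):
        for j in range(len(B[0])):
            for k in range(len(B)):
                C[i][j] += A[i][k] * B[k][j]

rng = np.random.default_rng(0)
A = rng.random((4, 3))
B = rng.random((3, 5))
C = np.zeros((4, 5))    # must start at zero because matmul accumulates with +=

matmul(A, B, C)
print(np.allclose(C, A @ B))
```

Note that C has as many rows as A and as many columns as B, and that it has to be zero-initialized because the loop only ever adds to it.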
  

In [None]:
#Part 1: Create matrices A,B,C as numpy arrays. Fill A and B with random numbers.


In [None]:
#Part 2: Calculate number of blocks and threads

In [None]:
#Part 3: Create a CUDA kernel with @cuda.jit decorator

In [None]:
#Part 4: Call the kernel function and time it to get the execution time

In [None]:
#Part 5: Create A,B,C manually on the GPU and copy data to the GPU arrays

In [None]:
#Part 6: Call the kernel function and time it to get the execution time. Compare the execution times.

### Exercise: Array elements incrementation
In this exercise, each element of an array is incremented by one: array[i] = array[i] + 1

In [None]:
# Import all required libs
import ...
from numba import ...

In [None]:
# Write a GPU code (Kernel)
def kernel1(array):
    #define thread index i here ...
    if i<array.size:
        array[i] += 1

In [None]:
# Define CUDA grid: provide with number of blocks and threads per block
data=numpy.ones(12800)
threads=32
blocks =

In [None]:
# Run the kernel and measure execution time:
kernel1[blocks,threads](data)

In [None]:
# Take advantage of explicit data management and copy the array to the GPU before kernel execution.
# Then measure the execution time again

### Explicit data management

So far, Numba has automatically transferred the NumPy arrays to the device each time a kernel was invoked. However, it can only do so conservatively: it always copies the device memory back to the host when the kernel finishes. To avoid unnecessary transfers, for example for read-only arrays, you can control the transfers manually.


In [None]:
device_array = cuda.device_array( shape ) #Allocates an empty device ndarray. Similar to numpy.empty().

In [None]:
device_array = cuda.to_device( array ) #Copy data from CPU array to GPU array

In [None]:
array = device_array.copy_to_host() #Copy data back to CPU

Now go back to the array incrementation exercise and modify the code to use explicit data management.

### Exercise: Array reversal
Here an integer array is sent to the GPU, where the order of its elements is reversed, i.e. element 0 ends up at position N-1, element 1 at position N-2, etc.

In [None]:
# Import required libs
import numpy as np
from numba import cuda

In [None]:
#Part 3: Create a CUDA kernel with @cuda.jit decorator
# Kernel: reverse the array content using appropriate indices.
# To do so you may need input and output indices. Implement the kernel so that it works with multiple thread blocks.

In [None]:
# Define CUDA grid
dim=256*1000
NumThreadsPerBlock=
NumBlocks = 

In [None]:
#Part 1: Create arrays on CPU and GPU (if you want to)

In [None]:
#Part 2: Initialize host array

In [None]:
#Part 4: Call the kernel function

In [None]:
#Part 5: Verify the result

## Key points
* **Numba @cuda.jit decorator** 
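    * The Device (GPU) won't work without a Host (CPU)
    * Both Host and Device have their own memory
* **Kernel and Device functions**
    * A kernel is declared with @cuda.jit and is called from the Host.
    * A Device function is declared with @cuda.jit(device=True) and is called from the Device.
* **Explicit data transfers between CPU and GPU**
    * By default, Numba transfers NumPy arrays to the device and back on every kernel call; cuda.to_device(), cuda.device_array() and copy_to_host() give explicit control over these transfers.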