# Introduction to GPU Programming with Python
## Numba on CPU: just-in-time library
Questions
* What is the main principle behind Numba's just-in-time compilation?
* How can you speed up your code with the @jit decorator?

Objectives
* Install and import Numba
* Write matrix multiplication code for the CPU without Numba
* Apply the Numba decorator to speed up matrix multiplication
* Parallelize the code across multiple CPU threads

### A bit of hardware info
The cluster we are using today is a virtual cluster built with Magic Castle on the [Arbutus cloud](https://docs.alliancecan.ca/wiki/Cloud_resources#Arbutus_cloud). It consists of 30 nodes, each with the following resources:
* RAM: 22GB
* vCPUs: 4
* VGPUs: 1
* Disk: 80GB

#### How to check the number of available cores in Python?

In [None]:
import multiprocessing
multiprocessing.cpu_count()
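
On a shared cluster, the number of cores the node reports and the number your process is allowed to use can differ. The sketch below prints both. It assumes a Linux node for the `os.sched_getaffinity` call and falls back to the total count elsewhere; the names `total_cores` and `usable_cores` are only illustrative.

```python
import multiprocessing
import os

# Total number of logical cores reported by the operating system
total_cores = multiprocessing.cpu_count()

# Cores this process may actually run on (available on Linux only);
# on other systems we simply fall back to the total count
if hasattr(os, "sched_getaffinity"):
    usable_cores = len(os.sched_getaffinity(0))
else:
    usable_cores = total_cores

print("total cores:", total_cores)
print("usable cores:", usable_cores)
```

If the two numbers differ, the smaller one is the better guide to how many threads are worth using.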

### Main example: Matrix multiplication (on CPU without Numba)
![](images/matrix_formula.png)

In [None]:
import numpy as np

In [None]:
# Write a matrix multiplication code (2 external loops over i,j 
# and one internal for multiplication and reduction)
def matmul(A,B,C):
    # iterating by row of A
    .....
  
        # iterating by column of B 
        .....
  
            # iterating by rows of B
            ....
                C[i][j] += A[i][k] * B[k][j]

In [None]:
#Create matrices A,B,C as numpy arrays (128,128). Fill A and B with random numbers.
A = 
B =
C = 

In [None]:
# Execute matmul without optimization for reference
%timeit matmul(A,B,C)

Let us see if we can speed up the calculation by using Numba.

### What is Numba?
Numba is a library that compiles Python code at runtime to native machine instructions.
It is an on-the-fly (just-in-time) compiler: it generates versions of a function specialized for the argument types it is called with, targeting either the CPU or the GPU.
Important: you don't need to change your Python code dramatically.

### Installing and importing Numba
Numba is already installed on this cluster, so we can skip this step. Elsewhere it can be installed as follows:

In [None]:
!pip install numba

In [None]:
from numba import jit

### Getting started with Numba: @jit decorator
Numba's central feature is the numba.jit decorator, which replaces a function with a compiled version of it. You can think of a decorator as a function that takes a function as input and produces a function as output:

In [None]:
@jit(nopython=True)
def function():
    # some code here
    pass


Equivalently, the decorator can be applied as an ordinary function call:

In [None]:
def function():
    # some code here
    pass

jitted_function = jit(nopython=True)(function)

In this form the original function is left unmodified, and jitted_function is its compiled counterpart.

Now let us modify the matrix multiplication example by applying the Numba @jit decorator:

### Main example: Matrix multiplication (on CPU with Numba, optimization only)

In [None]:
# Add Numba decorator here to optimize the code
def matmul(A,B,C):
    # iterating by row of A
    for i in range(len(A)):
  
        # iterating by column of B
        for j in range(len(B[0])):
  
            # iterating by rows of B
            for k in range(len(B)):
                C[i][j] += A[i][k] * B[k][j]

In [None]:
A=np.random.rand(128,128).astype(np.float32)
B=np.random.rand(128,128).astype(np.float32)
C=np.zeros(shape=(128,128)).astype(np.float32)

In [None]:
%timeit matmul(A,B,C)

It's faster than the non-optimized (non-jitted) code, but it still runs on a single CPU core. Can we make it run on multiple cores? In other words, can we parallelize the code?
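
One detail to keep in mind when timing: matmul adds into C, so every repetition performed by %timeit accumulates on top of the previous result. This does not affect the timing, but C has to be reset to zero before checking the result. Below is a minimal sketch of such a check. It uses small 8x8 matrices to keep the pure-Python loops fast and NumPy's `@` operator as the reference; the sizes, the seed and the tolerance are arbitrary choices for illustration.

```python
import numpy as np

def matmul(A, B, C):
    for i in range(len(A)):
        for j in range(len(B[0])):
            for k in range(len(B)):
                C[i][j] += A[i][k] * B[k][j]

rng = np.random.default_rng(0)
A = rng.random((8, 8)).astype(np.float32)
B = rng.random((8, 8)).astype(np.float32)
C = np.zeros((8, 8), dtype=np.float32)

# After one call on a zeroed C, the result should match NumPy's product
matmul(A, B, C)
matches_once = np.allclose(C, A @ B, rtol=1e-4)

# A second call without resetting C accumulates: C now holds twice the product
matmul(A, B, C)
matches_twice = np.allclose(C, 2 * (A @ B), rtol=1e-4)

print(matches_once, matches_twice)
```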

### What is parallelization?
Serial computing
* A problem is broken into a discrete series of instructions
* Instructions are executed sequentially one after another
* Executed on a single processor
* Only one instruction may execute at any moment in time

![](images/serialProblem.gif)

Parallel computing
* A problem is broken into discrete parts that can be solved concurrently
* Each part is further broken down to a series of instructions
* Instructions from each part execute simultaneously on different processors
* An overall control/coordination mechanism is employed

![](images/parallelProblem.gif)
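
The structure described above (split into parts, process the parts concurrently, coordinate the results) can be sketched with Python's standard library. This is only an illustration of the pattern, not of a speed-up: ordinary Python threads share the interpreter lock, so CPU-bound work like this does not run faster this way. The names `parts` and `partial_sums` are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

numbers = list(range(1_000_000))

# Break the problem into discrete parts of equal size
n_parts = 4
chunk = len(numbers) // n_parts
parts = [numbers[i * chunk:(i + 1) * chunk] for i in range(n_parts)]

# Hand each part to a worker; the pool is the coordination mechanism
with ThreadPoolExecutor(max_workers=n_parts) as pool:
    partial_sums = list(pool.map(sum, parts))

# Combine the partial results into the final answer
total = sum(partial_sums)
print(total == sum(numbers))
```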

### Numba automatic parallelization

Numba supports multi-threaded execution. To parallelize the code, add the parallel option to the decorator: @jit(nopython=True, parallel=True)

### Main example: Matrix multiplication (on CPU with Numba, optimization + parallelization)

In [None]:
# Add Numba decorator with parallel option to parallelize the code
def matmul(A,B,C):
    # iterating by row of A
    for i in range(len(A)):
  
        # iterating by column of B
        for j in range(len(B[0])):
  
            # iterating by rows of B
            for k in range(len(B)):
                C[i][j] += A[i][k] * B[k][j]

In [None]:
A=np.random.rand(128,128).astype(np.float32)
B=np.random.rand(128,128).astype(np.float32)
C=np.zeros(shape=(128,128)).astype(np.float32)

In [None]:
%timeit matmul(A,B,C)

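Numba finds nothing to parallelize here: with parallel=True it automatically parallelizes only operations it recognizes, such as whole-array expressions, and plain range loops stay serial. It usually reports this with a performance warning. We need to tell Numba which loop to parallelize.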

### Numba explicit parallelization with prange

One can use Numba's prange instead of range to mark a loop as parallel. Simply replace range with prange in that loop; its iterations must be independent of each other, since they may run in any order on different threads.

In [None]:
# Add Numba decorator with parallel option and replace range with prange to parallelize the code
def matmul(A,B,C):
    # iterating by row of A
    for i in range(len(A)):
  
        # iterating by column of B
        for j in range(len(B[0])):
  
            # iterating by rows of B
            for k in range(len(B)):
                C[i][j] += A[i][k] * B[k][j]

In [None]:
A=np.random.rand(128,128).astype(np.float32)
B=np.random.rand(128,128).astype(np.float32)
C=np.zeros(shape=(128,128)).astype(np.float32)

In [None]:
%timeit matmul(A,B,C)

### Exercise: Incrementation of array elements
Here each element of an array is incremented by 1. Do this in parallel on the CPU using the @jit decorator. Hint: replace range with prange in the loop (don't forget to import prange).

In [None]:
# Import all required libs
import numpy
from numba import ...

In [None]:
# Write a CPU parallel code (with the use of @jit decorator and prange)
def incrementation(array):
    for i in range(array.size):
        array[i] += 1

In [None]:
# Call the function and measure the execution time. Compare the results.
data=numpy.ones(12800)
%timeit incrementation(data)

## Key points
* **Numba decorator** 
    * several are available: @jit, @njit (shorthand for @jit(nopython=True)), etc.
    * several decorators can be applied to the same function (in nested fashion)
* **nopython vs object compilation mode**
    * nopython mode: compiles the function to fast machine code (via LLVM) that does not go through the Python interpreter
    * object mode: keeps values as Python objects, which adds extra overhead
* **Numba parallelization**
    * Implicit/automatic with @jit(parallel=True)
    * Explicit/manual with @jit(parallel=True) and prange