# Introduction to GPU Programming with Python
## Intro to Numba: just-in-time library
Numba is a library that compiles Python code at runtime to native machine instructions.
It is an on-the-fly (just-in-time) compiler that generates type-specialized functions for the CPU or GPU.
Important: you don't need to dramatically change your Python code.

Numba's central feature is the numba.jit decorator.
Decorator: modifies a function in a particular way. You can think of decorators as functions that take a function as input and produce a function as output:
- a function may be wrapped by one or more decorator expressions
- a decorator expression is evaluated when the function is defined
- multiple decorators are applied in nested fashion
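
The three points about decorators can be illustrated with a minimal pure-Python sketch. The names `count_calls`, `double_result`, and `add` are hypothetical and chosen only for illustration; they are not part of Numba.

```python
import functools


def count_calls(func):
    """Hypothetical decorator: counts how many times func is called."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


def double_result(func):
    """Hypothetical decorator: doubles whatever func returns."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return 2 * func(*args, **kwargs)
    return wrapper


# Both decorator expressions are evaluated here, at definition time.
# Nesting order: add = count_calls(double_result(add))
@count_calls
@double_result
def add(a, b):
    return a + b


value = add(1, 2)
print(value, add.calls)
```

Numba's `jit` is used the same way: it takes the Python function and returns a compiled replacement under the same name.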

### Getting started with @jit

In [None]:
from numba import jit, njit

Now let's do an example with an error.

In [None]:
def original_function(input_list):
    output_list = []
    for item in input_list:
        if item % 2 == 0:
            output_list.append(2)
        else:
            output_list.append('1')
    return output_list

test_list = list(range(100000))

In [None]:
%time original_function(test_list)[1:10]

Now let's use Numba, that is, let's jit the code:

In [None]:
jitted_function = jit()(original_function)

In [None]:
%time jitted_function(test_list)[1:10]

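In fact, with jit it's slower. Why?
The output list mixes integers and strings, so Numba cannot infer a single element type and cannot compile the function to pure native code. Older Numba releases then fall back to object mode with a warning, which brings no speedup; recent releases compile in nopython mode by default and raise an error instead.
Avoid jitting a function or using @jit AS IS.
Use @jit(nopython=True) or its shorthand @njit, so that compilation problems surface as errors.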

Numba has two compilation modes:
- nopython mode (nopython=True or njit): the Numba compiler generates code that does not access the Python C API
- object mode (nopython=False): the Numba compiler generates code that handles all values as Python objects and uses the Python C API
  

In [None]:
njitted_function = njit()(original_function)

In [None]:
njitted_function(test_list)[0:10]

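This time we got an error instead of a warning. Why?
In nopython mode there is no fallback: Numba must infer one type for every variable, and a list holding both integers and strings has no such type, so compilation fails.
Notice: the compilation happens at call time. Because no types are specified, the compiler needs to see an example of the input data before it can generate code.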

Now, let's correct our error and write a somewhat sane function this time:

In [None]:
def sane_function(input_list):
    output_list = []
    for item in input_list:
        if item % 2 == 0:
            output_list.append(2)
        else:
            output_list.append(1)
    return output_list

test_list = list(range(100000))

In [None]:
%timeit sane_function(test_list)[0:10]

In [None]:
njitted_sane_function = njit()(sane_function)

In [None]:
%time njitted_sane_function(test_list)[0:10]

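Heh, it's slow. Where is the speedup?
It's not a good idea to pass a regular Python list to Numba: it has to check and convert every element before the compiled code can run, which takes a long time. Note also that the first call includes the compilation time. For now, use NumPy arrays instead.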

In [None]:
import numpy as np
test_list = np.arange(100000)

In [None]:
%time njitted_sane_function(test_list)[0:10]

Finally we have some speedup.

### Automatic parallelization

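Numba can run code on multiple threads. Passing parallel=True to @njit asks Numba to parallelize supported operations automatically, such as array expressions; a plain range loop like the one below is not parallelized on its own.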

In [None]:
def reduction(x,result):
    for i in range(x.size):  
        result += x[i]

In [None]:
x = np.arange(1000000,dtype=np.int64)
result = np.zeros(1,dtype=np.int64)

In [None]:
%timeit reduction(x,result)

In [None]:
@njit
def njitted_reduction(x,result):
    for i in range(x.size):  
        result += x[i]

In [None]:
%timeit njitted_reduction(x,result)

In [None]:
@njit(parallel=True)
def njitted_parallel_reduction(x,result):
    for i in range(x.size):  
        result += x[i]

In [None]:
%timeit njitted_parallel_reduction(x,result)

### Analysis & Diagnostics

The parallel option for jit() can produce diagnostic information about the transforms undertaken in automatically parallelizing the decorated code. This information can be accessed in two ways:
* by setting the environment variable NUMBA_PARALLEL_DIAGNOSTICS
* by calling parallel_diagnostics()

In [None]:
njitted_parallel_reduction.parallel_diagnostics(level=4)

### Explicit parallelization with prange

One can use Numba’s prange instead of range to specify that a loop can be parallelized. 

In [None]:
from numba import prange 

In [None]:
@njit(parallel=True)
def njitted_parallel_prange_reduction(x,result):
    for i in prange(x.size):  
        result += x[i]

In [None]:
%timeit njitted_parallel_prange_reduction(x,result)

### Hands-on session: Matrix multiplication using @jit (on CPU)

![](images/Matrix_multiplication_diagram_2.svg.png)

![](images/matrix_formula.png)

In [None]:
# Part 0: Write a matrix multiplication code (2 external loops over i,j 
# and one internal for multiplication and reduction)
def matmul(A,B,C):
    # iterating by row of A
    .....
  
        # iterating over columns of B
        .....
  
            # iterating by rows of B
            ....
                C[i][j] += A[i][k] * B[k][j]

In [None]:
# Part 1: Create matrices A, B, C as NumPy arrays of shape (128, 128). Fill A and B with random numbers.
A = 
B =
C = 

In [None]:
# Execute matmul without optimization for reference
%timeit matmul(A,B,C)

In [None]:
# Part 2: Copy the code from Part 0 and apply the jit decorator to optimize it
from numba import ...

In [None]:
# Part 3: Execute matmul and measure execution time
%timeit matmul(A,B,C)

In [None]:
# Part 4: Copy the code from Part 2 and parallelize it
from numba import ...


In [None]:
%timeit matmul(A,B,C)

### Vectorization and ufuncs

Before we switch to computing on GPUs, let's briefly discuss another decorator: `@vectorize`.

A universal function (or ufunc for short) is a function that operates on NumPy arrays (ndarrays) in an element-by-element fashion. A ufunc is a “vectorized” wrapper for a function that takes a fixed number of scalar inputs and produces a fixed number of scalar outputs.
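
As a quick illustration of element-by-element behaviour, here is a minimal sketch using two ufuncs that ship with NumPy, `np.sqrt` and `np.add`. The array `values` is made up for the example.

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0])

# np.sqrt is a ufunc: the scalar operation is applied to every element.
roots = np.sqrt(values)

# np.add takes two inputs; the scalar 10.0 is broadcast against the array.
sums = np.add(values, 10.0)

is_ufunc = isinstance(np.sqrt, np.ufunc)
print(roots, sums, is_ufunc)
```

`@vectorize` lets you write the scalar operation yourself in Python and get the same calling convention.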

In [None]:
from numba import vectorize

In [None]:
# Apply @vectorize to make it ufunc
@vectorize
def scalar_computation(num):
    if num % 2 == 0:
        return 2
    else:
        return 1
test_list = np.arange(100000)

Here we write a function that operates on a single element, but then call it on a whole array!

In [None]:
%time scalar_computation(test_list)

In [None]:
%time scalar_computation(test_list)

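The second execution is faster because the first call includes compilation: no signature was given, so @vectorize compiles the function the first time it is called. The ufunc is also faster than the earlier list-based function because Numba pre-allocates a properly sized output array, whereas the earlier function grew a list of unknown size. That earlier function can be fixed in the same way by allocating an output array first.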

In [None]:
@njit
def allocated_func(input_list):
    output_list = np.zeros_like(input_list)
    for ii, item in enumerate(input_list):
        if item % 2 == 0:
            output_list[ii] = 2
        else:
            output_list[ii] = 1
    return output_list

In [None]:
%time allocated_func(test_list)

In [None]:
%time allocated_func(test_list)

### Signatures, ufuncs, and GPUs

Numba with CUDA can produce ufunc-like objects. Such an object is a close analog of a regular NumPy ufunc, but is not fully compatible with it.

In [None]:
import numpy as np
from numba import vectorize
import math

In [None]:
@vectorize(['float32(float32, float32, float32)'])
def cpu_discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)

In [None]:
A = np.array(np.random.sample(10000), dtype=np.float32)
B = np.array(np.random.sample(10000)+10, dtype=np.float32)
C = np.array(np.random.sample(10000), dtype=np.float32)

In [None]:
%time cpu_discriminant(A,B,C)

In [None]:
@vectorize(['float32(float32, float32, float32)'], target='cuda')
def cu_discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)

In [None]:
%time cu_discriminant(A,B,C)

It's very slow compared to the CPU ufunc: each call copies A, B, and C to the GPU and copies the result back, and for such a small computation that transfer cost dominates. However, the advantage of a CUDA ufunc is that it also accepts device arrays (arrays that already live on the GPU), which reduces traffic over the PCI Express bus.

In other words: instead of A, B, and C, which are CPU variables, we can pass GPU variables.