# Introduction to GPU Programming with Python
## Numba: just-in-time library
Numba is a library that compiles Python code at runtime to native machine instructions.
It works on the fly, generating code specialized for the argument types each function is called with, for either the CPU or the GPU.
Important: you don't need to dramatically change your Python code.

Numba's central feature is the numba.jit decorator.
A decorator modifies a function in a particular way. You can think of it as a function that takes a function as input and produces a function as output:
- a function may be wrapped by one or more decorator expressions
- a decorator expression is evaluated when the function is defined
- multiple decorators are applied in nested fashion, starting with the one closest to the function
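
The three points above can be tried in plain Python, without Numba. The sketch below uses two made-up decorators, announce and double, which exist only for this illustration: one records when it runs, the other doubles the result of the wrapped function.

```python
import functools

events = []  # records when the decorator and the wrapped function run


def announce(func):
    """Hypothetical decorator: logs at definition time, then wraps func."""
    events.append(f"decorating {func.__name__}")

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        events.append(f"calling {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def double(func):
    """Hypothetical decorator: doubles the wrapped function's result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return 2 * func(*args, **kwargs)

    return wrapper


@announce   # applied second (outermost)
@double     # applied first (innermost)
def add(a, b):
    return a + b


# The stacked form is equivalent to: add = announce(double(add))
print(events)     # the decorator has already run, before any call
print(add(1, 2))  # the sum is computed, then doubled
```

The first print shows that announce ran when add was defined, not when it was called. numba.jit follows the same pattern: it takes your function and returns an object that compiles and dispatches to native code.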

### CPU

In [None]:
from numba import jit
from numpy import arange

@jit
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

a = arange(9).reshape(3,3)
result=sum2d(a)
print(result)

Will Numba work for any code?
Limitation: inside jitted functions you can only use NumPy and the standard Python features that Numba supports.

In [None]:
from numba import jit
import pandas as pd
x = {'a': [1, 2, 3], 'b': [20, 30, 40]}

@jit
def use_pandas(a):
    df = pd.DataFrame.from_dict(a)
    df += 1
    return df.cov()

print(use_pandas(x))

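Numba doesn't know about pd.DataFrame.
Result: in Numba versions before 0.59, @jit falls back to object mode and simply runs this code via the interpreter, with the added cost of Numba's internal overheads. Since version 0.59, @jit defaults to nopython mode and raises a typing error instead.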

Numba does a good job optimizing loops:

In [None]:
import numpy as np
from numba import jit

@jit
def bubblesort(X):
    N = len(X)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = X[i]
            if cur > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp

In [None]:
data = np.arange(0, 10, 0.01, dtype='f4')
bubblesort(data)

### Compilation modes
Numba has two compilation modes:
- nopython mode (default): the Numba compiler generates code that does not access the Python C API
- object mode: the Numba compiler generates code that handles all values as Python objects and uses the Python C API

### Automatic parallelization with JIT

In [None]:
import numpy as np

N = 100000000
dim=(10000,10000)
x = np.arange(N).reshape(dim)

In [None]:
@jit(nopython=True)
def doround(v):
    s = 0
    for i in range(v.shape[0]):  
        s += np.round(v[i, i])
    return v + s           

In [None]:
# Now let's execute it and measure the timing with and without the Numba decorator
doround(x)
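
One possible way to take the measurement is a small timing helper built on time.perf_counter. The names best_time and slow_sum below are made up for this sketch, and slow_sum is only a stand-in workload so that the snippet runs on its own.

```python
import time


def best_time(func, *args, repeats=3):
    """Return the shortest wall-clock time (seconds) over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


def slow_sum(n):
    # Stand-in workload so the sketch runs without Numba installed.
    total = 0
    for i in range(n):
        total += i
    return total


print(f"best of 3: {best_time(slow_sum, 100_000):.6f} s")
```

For a jitted function, call it once before timing so that compilation is not counted. The original, uncompiled function stays available as doround.py_func, so best_time(doround, x) and best_time(doround.py_func, x) give the two numbers to compare.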

To enable auto-parallelization with Numba, pass parallel=True to the @jit decorator. Numba will then try to find the regions it can parallelize.

In [None]:
@jit(nopython=True, parallel=True)
def doround_par(v):
    s = 0
    for i in range(1000):  
        s += np.round(v[i, i])
    return v + s             

In [None]:
# Again, execute and measure the timing with and without the parallel option
doround_par(x)

### Explicit Parallel Loops
One can use Numba's prange instead of range to specify that a loop can be parallelized. The user must make sure that the loop has no cross-iteration dependencies, except for supported reductions.

In [None]:
from numba import jit, prange

@jit(nopython=True, parallel=True)
def doround_par(v):
    s = 0
    # prange marks the loop as parallel; s is a supported sum reduction
    for i in prange(1000):
        s += np.round(v[i, i])
    return v + s

### Exercise 0
Matrix multiplication using the jit decorator

In [None]:
# Part 2: Run it as it is, then perform various optimizations and parallelizations
def matmul(A,B,C):
    # iterate over the rows of A
    for i in range(len(A)):

        # iterate over the columns of B
        for j in range(len(B[0])):

            # iterate over the rows of B
            for k in range(len(B)):
                C[i][j] += A[i][k] * B[k][j]

In [None]:
# Part 1: Create matrices A, B, C as NumPy arrays. Fill A and B with random numbers.


In [None]:
# Part 3: Execute matmul and measure the execution time. Compare the execution time of the
# non-optimized version with the optimized and parallelized versions

### Diagnostics
The parallel option for jit() can produce diagnostic information about the transforms undertaken in automatically parallelizing the decorated code. This information can be accessed in two ways:
* by setting the environment variable NUMBA_PARALLEL_DIAGNOSTICS
* by calling parallel_diagnostics()

In [None]:
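# matmul must be decorated with @jit(nopython=True, parallel=True) and called
# at least once, so that there is a compiled version to report on
matmul.parallel_diagnostics(level=4)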

* Loop fusion is a technique whereby loops with equivalent bounds may be combined under certain conditions to produce a loop with a larger body (aiming to improve data locality).
* Loop serialization occurs when any number of prange-driven loops are present inside another prange-driven loop. In this case the outermost prange loop executes in parallel and any inner prange loops are treated as standard range loops.
* Loop invariant code motion is an optimization technique that analyses a loop to look for statements that can be moved outside the loop body without changing the result of executing the loop. These statements are then “hoisted” out of the loop to save repeated computation.
* Allocation hoisting is a specialized case of loop invariant code motion that is possible due to the design of some common NumPy allocation methods. 
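
To illustrate what fusion and hoisting mean, here is a hand-written plain Python sketch. It is not Numba's output; the functions unfused and fused_and_hoisted are made up, and the second one shows by hand the kind of rewrite these transforms perform.

```python
import math


def unfused(xs, scale):
    # Two loops over the same range; math.sqrt(scale) is recomputed
    # on every iteration of the first loop.
    ys = [0.0] * len(xs)
    for i in range(len(xs)):
        ys[i] = xs[i] * math.sqrt(scale)
    zs = [0.0] * len(xs)
    for i in range(len(xs)):
        zs[i] = ys[i] + 1.0
    return zs


def fused_and_hoisted(xs, scale):
    # Hoisting: the loop-invariant sqrt is computed once, outside the loop.
    root = math.sqrt(scale)
    # Fusion: both loop bodies now share a single pass over the data.
    zs = [0.0] * len(xs)
    for i in range(len(xs)):
        y = xs[i] * root
        zs[i] = y + 1.0
    return zs


xs = [1.0, 2.0, 3.0]
print(unfused(xs, 4.0) == fused_and_hoisted(xs, 4.0))
```

Both versions return the same list. The fused version touches each element once and no longer needs the intermediate list ys, which is the data-locality gain mentioned above.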