![NASA](http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg)

<center>
<h1><font size="+3">GSFC Python Bootcamp</font></h1>
</center>

---

<CENTER>
<H1 style="color:red">
Introduction to Numba
</H1>
</CENTER>

In [None]:
from __future__ import print_function

> I’m becoming more and more convinced that Numba is the future of fast scientific computing in Python. 
>
> – Jake Vanderplas, 2013-06-15
>
> http://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/


![fig_numba](https://thedatafrog.com/static/blog/images/2019/07/python_fast.0d88afcb4f8a.png)
Image Source: Lison Bernet 2019

## <font color='red'>What will be Covered?</font>

* What is Numba?
* How Does Numba Work?
* Numpy and Numba
* How to Use Numba?
* Parallelization with Numba
* Numba and Pandas

## <font color='red'>Reference Documents</font>
- <a href="http://numba.pydata.org/">Numba: A High Performance Python Compiler</a>
- <a href="https://examples.dask.org/applications/stencils-with-numba.html">Stencil Computations with Numba</a>
- <a href="http://deepdata.com.pl/numba.html">Python on steroids - speeding up calculations with numba</a>
- <a href="https://colab.research.google.com/github/evaneschneider/parallel-programming/blob/master/COMPASS_gpu_intro.ipynb">Introduction to GPU programming with Numba</a>
- <a href="https://www.deeplearningwizard.com/deep_learning/production_pytorch/speed_optimization_basics_numba/">Speed Optimization Basics: Numba</a>
- <a href="https://murillogroupmsu.com/numba-versus-c/">High-Performance Python: Why?</a>
- <a href="https://flothesof.github.io/optimizing-python-code-numpy-cython-pythran-numba.html">Optimizing your code with NumPy, Cython, pythran and numba </a>
- <a href="https://www.polymorphe.org/index.php/looping-over-pandas-data-mkd">Looping over Pandas data</a>

## <font color='red'>What is Numba?</font>

> Numba is an open-source JIT compiler that translates a subset of Python and NumPy into fast machine code using `LLVM`, via the llvmlite Python package. It offers a range of options for parallelising Python code for CPUs and GPUs, often with only minor code changes. 
>
>Wikipedia

- Numba is a Python open source package that was originally developed by Continuum Analytics.
- Its core application area is math-heavy, array-oriented functions, which are quite slow in native Python.
- It accelerates Python code (numerical functions) for both CPU and GPU:
   - **Function Compiler**: Numba compiles individual Python functions, not whole applications or arbitrary parts of them. It is a Python module meant to improve the performance of functions, with the goal of achieving a speed comparable to `C`.
   - **Just-in-time**: (Dynamic translation) Numba translates the bytecode (intermediate code more abstract than the machine code) to machine code immediately before its execution to improve the execution speed.
   - **Numerically-focused**: Numba is focused on numerical data, such as int, float, complex. 

## <font color='red'>How Does Numba Work?</font>

- Assume that you have a function `do_math` that is decorated with the Numba `@jit` decorator. 
- Compilation will be deferred until the first function execution. 
- Numba will infer the argument types at call time, and generate optimized code based on this information. 
- Numba will also be able to compile separate specializations depending on the input types. 
- The diagram below shows all the steps carried out by Numba to execute `do_math`. 

![fig_numba](https://miro.medium.com/max/1400/1*S0S4QUjR-BsdTICtT9797Q.png)
Image Source: Continuum Analytics

- **IR**: Intermediate Representations
- **Bytecode Analysis**: Intermediate code more abstract than machine code
- **LLVM**: Low Level Virtual Machine, infrastructure to develop compilers
- **NVVM**: A compiler IR based on LLVM IR, designed to represent GPU kernels

## <font color='red'>Numpy and Numba</font>
- One objective of Numba is having a seamless integration with NumPy. 
- Numba excels at generating code that executes on top of NumPy arrays.
- NumPy support in Numba comes in many forms:
    1. Numba understands calls to NumPy ufuncs (universal functions: there are over 60 of them) and is able to generate equivalent native code for many of them.
    2. NumPy arrays are directly supported in Numba.
    3. Numba is able to generate ufuncs and gufuncs (generalized universal functions). This means that it is possible to implement ufuncs and gufuncs within Python, getting speeds comparable to that of ufuncs/gufuncs implemented in C extension modules using the NumPy C API.

## <font color='red'>Usage</font>
- Numba provides several utilities for code generation.
- Its central feature is the `numba.jit()` decorator. 
- Using this decorator, you can mark a function for optimization by Numba’s JIT compiler.
- Various invocation modes trigger differing compilation options and behaviours.

In [None]:
import time
import numpy as np
from numba import jit
from numba import njit
from numba import prange

**Example**

Consider the function that multiplies two `nxn` matrices.

In [None]:

def matrix_multiplication(A, B):
    """
        Multiply matrices A and B using a loop
    """
    n = len(A[0])
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k]*B[k, j]
    # Return the product so that the result is not lost
    return C

In [None]:
N = 200
A = np.random.rand(N, N)
B = np.random.rand(N, N)
D = np.random.rand(N)

In [None]:
%timeit matrix_multiplication(A, B)

We can now decorate the above multiplication with `jit`:

In [None]:
@jit
def matrix_multiplication_numba(A, B):
    """
        Multiply matrices A and B using a loop
    """
    n = len(A[0])
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k]*B[k, j]
    # Return the product so that the result is not lost
    return C

In [None]:
%timeit matrix_multiplication_numba(A, B)

**Measuring the Performance of Numba**

- Once the compilation has taken place, Numba runs the machine code version of your function. 
- If it is called again with same argument types, it can reuse the cached version instead of having to compile again.
- A common mistake when measuring performance is to ignore this behaviour and time the code only once with a simple timer, so that the compilation time is included in the measured execution time.

DO NOT REPORT THIS... COMPILATION TIME IS INCLUDED IN THE EXECUTION TIME!

In [None]:
start_1 = time.time()
matrix_multiplication_numba(A, B)
end_1 = time.time()
print("Elapsed (with compilation) = %s" % (end_1 - start_1))

NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE

In [None]:
start_2 = time.time()
matrix_multiplication_numba(A, B)
end_2 = time.time()
print("Elapsed (after compilation) = %s" % (end_2 - start_2))

### Compilation Options
A number of keyword-only arguments can be passed to the `@jit` decorator:
1. `nopython`: Numba has two compilation modes:
     - nopython mode (`nopython=True`): Compile the decorated function so that it will run entirely without the involvement of the Python interpreter. It produces much faster code, but it has limitations: if the function uses features that Numba cannot compile, compilation fails with an error instead of falling back to object mode. Note that <font color="red">**`@njit`**</font> is an alias for <font color="red">**`@jit(nopython=True)`**</font>.
     - object mode: In this mode Numba identifies the loops that it can compile into machine code and runs the rest of the code in the interpreter. For best performance, avoid using this mode!
2. `nogil`: 
     - Whenever Numba optimizes Python code to native code that only works on native types and variables (rather than Python objects), it is not necessary anymore to hold Python’s global interpreter lock (GIL). 
     - Numba will release the GIL when entering such a compiled function if you passed `nogil=True`.
     - When using `nogil=True`, you need to be wary of the usual pitfalls of multi-threaded programming (consistency, synchronization, race conditions, etc.).
3. `cache`:
     - To avoid compilation times each time you invoke a Python program, you can instruct Numba to write the result of function compilation into a file-based cache. 
     - This is done by passing `cache=True`.
4. `parallel`: 
     - Enables automatic parallelization (and related optimizations) for operations in the function known to have parallel semantics.
     - This feature is enabled by passing `parallel=True` and must be used in conjunction with `nopython=True`.

### Fastmath
- In certain classes of applications strict IEEE 754 compliance is less important. 
- It is possible to relax some numerical rigour with a view to gaining additional performance. 
- The way to achieve this behaviour in Numba is through the use of the `fastmath` keyword argument.

In [None]:
@njit(fastmath=False)
def do_sum(A):
    acc = 0.
    # without fastmath, this loop must accumulate in strict order
    for x in A:
        acc += np.sqrt(x)
    return acc

@njit(fastmath=True)
def do_sum_fast(A):
    acc = 0.
    # with fastmath, the reduction can be vectorized as floating point
    # reassociation is permitted.
    for x in A:
        acc += np.sqrt(x)
    return acc

In [None]:
time_do_sum = %timeit -o acc1 = do_sum(D)

In [None]:
time_do_sum_fast = %timeit  -o acc2 = do_sum_fast(D)
print(time_do_sum.best / time_do_sum_fast.best)

## <font color="red"> Parallelization </font>

- The setting `parallel=True` in `jit()` enables a Numba transformation pass that attempts to automatically parallelize and perform other optimizations on (part of) a function.
- A user program may contain operations (for instance adding a scalar value to an array) that are known to have parallel semantics.
- Each operation could be parallelized individually, but that might lead to poor performance due to poor cache behavior.
- Instead, Numba uses auto-parallelization: it identifies all operations with parallel semantics and fuses adjacent ones together to form one or more kernels that are automatically run in parallel.
- The process is fully automated without modifications to the user program.

### Explicit Parallel Loops

- Numba parallel execution also has support for explicit parallel loop declaration similar to that in OpenMP. 
- To indicate that a loop should be executed in parallel the `numba.prange` function should be used.
- This function behaves like Python `range` and if `parallel=True` is not set it acts simply as an alias of `range`. 
- Loops induced with `prange` can be used for embarrassingly parallel computation and also reductions.

In [None]:
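@njit(parallel=True)
def matrix_multiplication_numba2(A, B):
    """
        Multiply matrices A and B using a loop
    """
    n = len(A[0])
    C = np.zeros((n, n))
    # Parallelize the outer loop only: each thread owns whole rows of C,
    # and the inner loops run serially within each thread.
    for i in prange(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k]*B[k, j]
    return C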

In [None]:
%timeit matrix_multiplication_numba2(A, B)

### Diagnostics
- We can produce diagnostic information about the transforms undertaken in automatically parallelizing the decorated code. 
- This information can be accessed in two ways:
     1. Setting the environment variable: `NUMBA_PARALLEL_DIAGNOSTICS`
     2. Calling the function `parallel_diagnostics()`
- The level of verbosity in the diagnostic information is controlled by an integer argument of value between 1 and 4 inclusive, 1 being the least verbose and 4 the most.

For additional information, consult the webpage: <a href="http://numba.pydata.org/numba-doc/latest/user/parallel.html"> http://numba.pydata.org/numba-doc/latest/user/parallel.html</a>.

In [None]:
@njit(parallel=True)
def test(x):
    n = x.shape[0]
    a = np.sin(x)
    b = np.cos(a * a)
    acc = 0
    for i in prange(n - 2):
        for j in prange(n - 1):
            acc += b[i] + b[j + 1]
    return acc

test(np.arange(10))

test.parallel_diagnostics(level=4)

## <font color="red">Things to Consider when Using Numba</font>

- Numba allows its behaviour to be changed through the use of <a href="http://numba.pydata.org/numba-doc/latest/reference/envvars.html">environment variables</a>. Unless otherwise mentioned, those variables have integer values and default to zero.
- Not all <a href="http://numba.pydata.org/numba-doc/latest/reference/pysupported.html">Python features</a> are supported by Numba.
- While Python has arbitrary-sized integers, integers in Numba-compiled functions get a fixed size through type inference (usually, the size of a machine integer). This means that arithmetic operations can wrap around, overflow, or produce undefined results.
- Numba may or may not copy global variables referenced inside a compiled function. Small global arrays are copied for potential compiler optimization with immutability assumption. However, large global arrays are not copied to conserve memory. The definition of “small” and “large” may change.
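- Numba supports recursive functions only under some restrictions (for example, there must be a control-flow path that returns without recursing).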
- For some operations, Numba may use a different algorithm than Python or Numpy. The results may not be bit-by-bit compatible. The difference should generally be small and within reasonable expectations. However, small accumulated differences might produce large differences at the end, especially if a divergent function is involved.

## <font color="red">Numba and Pandas</font>

- Pandas is built on top of Numpy.
- Pandas offers flexibility in manipulating data, but not necessarily speed.
- This flexibility comes with a large set of built-in functions.
- Crude looping (over DataFrame rows for instance) in Pandas does not take advantage of any built-in optimizations, making it extremely inefficient.
- Using vectorized Pandas built-in functions (acting on Pandas Series) is almost always preferable to accomplishing similar ends with custom-written looping.

**Example**

An exponential moving average (EMA) is a first-order infinite impulse response filter that applies weighting factors which decrease exponentially. The EMA for a series Y may be calculated recursively:


$$S_{t}=\begin{cases}Y_{1},&t=1\\\alpha \cdot Y_{t}+(1-\alpha )\cdot S_{t-1},&t>1\end{cases}$$
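
To make the recursion concrete, here is a plain-Python sketch of the formula applied to a three-element series. The helper name `ema_reference` and the sample values are illustrative only.

```python
def ema_reference(y, alpha):
    """Apply the EMA recursion to a list of floats."""
    s = [y[0]]                                  # S_1 = Y_1
    for t in range(1, len(y)):
        # S_t = alpha * Y_t + (1 - alpha) * S_{t-1}
        s.append(alpha * y[t] + (1 - alpha) * s[-1])
    return s

smoothed = ema_reference([1.0, 2.0, 3.0], 0.5)
print(smoothed)   # [1.0, 1.5, 2.25]
```

Each value depends on the previous one, so the loop is inherently sequential.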

In [None]:
import pandas as pd

@njit(fastmath=True)
def ewm(arr, alpha):
    """
    Calculate the EMA of an array arr
    :param arr: numpy array of floats
    :param alpha: float between 0 and 1
    :return: numpy array of floats
    """
    # initialise ewm_arr
    ewm_arr = np.zeros_like(arr)
    ewm_arr[0] = arr[0]
    # Each step depends on the previous one, so use a serial range loop
    for t in range(1, arr.shape[0]):
        ewm_arr[t] = alpha*arr[t] + (1 - alpha)*ewm_arr[t-1]

    return ewm_arr

In [None]:
N = 10000
a = np.random.random(N)
df = pd.DataFrame(a)

In [None]:
%timeit df.ewm(alpha=0.5, adjust=False).mean()

In [None]:
%timeit ewm(a, 0.5)

**Could we claim that Numpy/Numba is faster than Pandas?**

- Not necessarily!
- Over time, Pandas has come to rely more and more on Cython operations.
- Since Pandas 1.0, some Pandas methods, such as `apply()` on rolling windows, can use Numba (if installed) instead of Cython through the `engine='numba'` argument, which can be faster. 