## What is Numba?

Numba is a JIT (just-in-time) compiler for Python that:



- generates optimized machine code using LLVM (Low Level Virtual Machine) compiler infrastructure


- provides a toolbox for different targets and execution models:
    - single-threaded CPU, multi-threaded CPU, GPU
    - regular functions, "universal functions (ufuncs)" (array functions), etc.


- integrates well with the Scientific Python stack


- with a few annotations, gives array-oriented and math-heavy Python code:
    - speedup: 2x (compared to basic NumPy code) to 200x (compared to pure Python)
    - performance similar to C, C++, or Fortran, without having to switch languages or Python interpreters


- is **totally awesome!**

## Basic Example

### Lazy Compilation

- Use `@jit` decorator
- Let Numba decide when and how to optimize

In [None]:
import numpy as np
import numba
from numba import jit

In [None]:
@jit
def do_math(x, y):
    return x + y

In this mode:

- The compilation will be deferred until the first execution
- Numba will:
    - infer the argument types at call time
    - generate optimized code based on this information
- Numba will also be able to compile separate specializations depending on the input types. For instance, calling `do_math()` with integer or complex numbers will generate different code paths:

In [None]:
do_math.inspect_types()

In [None]:
%time do_math(1, 2)

In [None]:
%time do_math(1, 2)


**What is Numba doing to make code run quickly?**

Numba examines Python bytecode and then translates it into an 'intermediate representation' (IR). To view this IR, first run (and thereby compile) `do_math()`, then call its `inspect_types()` method.

In [None]:
do_math.inspect_types()
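For comparison, the bytecode that Numba starts from can be inspected with the standard-library `dis` module. This is a minimal sketch on a plain, undecorated copy of the function; `do_math_plain` is a name introduced here for illustration, and the exact opcode names vary between Python versions.

```python
import dis

def do_math_plain(x, y):  # hypothetical undecorated copy of do_math
    return x + y

# Collect the opcode names of the function's bytecode instructions.
opnames = [ins.opname for ins in dis.Bytecode(do_math_plain)]
print(opnames)
```

For a function that is already decorated with `@jit`, the original Python function should be reachable through its `py_func` attribute.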

In [None]:
%time do_math(1j, 2)

In [None]:
%time do_math(1j, 2)

In [None]:
do_math.inspect_types()

### Eager compilation

- Tell Numba the function signature you are expecting

In [None]:
from numba import int32

In [None]:
@jit(int32(int32, int32))
def eager_do_math(x, y):
    return x + y

In [None]:
%time eager_do_math(1, 2)

In [None]:
%time eager_do_math(1.0, 2.0)

In [None]:
%time eager_do_math(1j, 2)

## How does Numba work?

![](./images/how-does-numba-work.png)

Source: [Scaling Python Up and Out with Numba and Dask — Travis Oliphant](https://speakerdeck.com/teoliphant/scaling-python-up-and-out-with-numba-and-dask?slide=37)



### What about the actual LLVM code?
You can see the actual LLVM code generated by Numba using the `inspect_llvm()` method. 

In [None]:
for key, value in do_math.inspect_llvm().items():
    print(key, value)

**But there's a caveat....**

### Compilation Options

Numba has two compilation modes:

- **nopython mode (recommended and best-practice way)**: produces much faster code by running the code without the involvement of the Python interpreter.

- **object mode (should be avoided)**: Numba falls back to this mode when `nopython` mode fails.

To illustrate the above, let's watch what happens when we try to do something that is natural in Python (concatenating strings), but not particularly mathematically sound:

In [None]:
%time do_math('Hello', 'World')

In [None]:
do_math.inspect_types()

`do_math (unicode_type, unicode_type)` means that it has been compiled in `object` mode.

To prevent Numba from falling back, and make it raise an error instead, we need to pass `nopython=True` to the `@jit` decorator:

In [None]:
@jit
def f(x, y): # Function will not benefit from Numba jit
    a = str(x) * 10 # Numba doesn't know about str
    b = str(y)
    return a + b 

In [None]:
%timeit f(1, 2)

In [None]:
@jit(nopython=True) # Force nopython mode
def f(x, y): # Function will not benefit from Numba jit
    a = str(x) * 10 # Numba doesn't know about str
    b = str(y)
    return a + b 

In [None]:
%timeit f(1, 2)

## Benchmarks using the all pairwise distance function

### Pure Python Version

In [None]:
def allpairs_distances_python(X,Y):
    result = np.zeros( (X.shape[0], Y.shape[0]), X.dtype)
    for i in range(X.shape[0]):
        for j in range(Y.shape[0]):
            result[i,j] = np.sum( (X[i,:] - Y[j,:]) ** 2)
    return result 

In [None]:
N = 1000 
X, Y = np.random.randn(200, N), np.random.randn(400, N)
X.shape, Y.shape 

In [None]:
pure_python = %timeit -o allpairs_distances_python(X, Y)

### Where is the bottleneck?

In [None]:
%load_ext line_profiler

In [None]:
%lprun -f allpairs_distances_python allpairs_distances_python(X,Y)

### Numba Version

In [None]:
from numba import jit

@jit(nopython=True)
def allpairs_distances_numba(X,Y):
    result = np.zeros((X.shape[0], Y.shape[0]), X.dtype)
    for i in range(X.shape[0]):
        for j in range(Y.shape[0]):
            result[i,j] = np.sum( (X[i,:] - Y[j,:]) ** 2)
    return result 

I should emphasize that this is the exact same code, except for Numba's `jit` decorator. The results are pretty astonishing:

In [None]:
numba_version = %timeit -o allpairs_distances_numba(X,Y)

In [None]:
speedup = pure_python.best / numba_version.best 

In [None]:
print(f"This is a {round(speedup, 0)}x speedup, simply by adding a Numba decorator!")

## Loops


While NumPy has developed a strong idiom around the use of **vectorized** operations, Numba is perfectly happy with loops too.

For C and Fortran users, writing Python in this style will work fine in Numba (thanks to LLVM!).

In [None]:
from numba import njit # njit is an alias for @jit(nopython=True)

In [None]:
# Pure NumPy
def ident_numpy(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

# Jitted NumPy 
@njit
def ident_numpy_jit(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

# NumPy with loops
def ident_numpy_loops(x):
    r = np.empty_like(x)
    n = len(x)
    for i in range(n):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
        
    return r 

# Jitted NumPy with loops 
@njit
def ident_numpy_loops_jit(x):
    r = np.empty_like(x)
    n = len(x)
    for i in range(n):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
        
    return r 


In [None]:
x = np.arange(1.e6)

In [None]:
%timeit ident_numpy(x)

In [None]:
%timeit ident_numpy_jit(x)

In [None]:
%timeit ident_numpy_loops(x)

In [None]:
%timeit ident_numpy_loops_jit(x)
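Since $\cos^2 x + \sin^2 x = 1$, every variant should return an array of ones, which makes for a cheap correctness check alongside the timings. The sketch below does this in plain NumPy on a smaller input; `ident_loops_check` and `x_small` are names introduced here for illustration.

```python
import numpy as np

def ident_loops_check(x):  # hypothetical loop-based variant
    r = np.empty_like(x)
    for i in range(len(x)):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
    return r

x_small = np.arange(1000.0)
vectorized = np.cos(x_small) ** 2 + np.sin(x_small) ** 2
looped = ident_loops_check(x_small)

# Both formulations should be all ones, up to floating-point rounding.
print(np.allclose(vectorized, 1.0), np.allclose(looped, 1.0))
```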

## Creating NumPy Universal Functions (Ufuncs)

- Ufuncs are a core concept in NumPy for array-oriented computing.

  - A function with scalar inputs is broadcast across the elements of the input arrays:
      

In [None]:
np.add([1, 2, 3], 3)

In [None]:
np.add([1, 2, 3], [10, 20, 30])

- Before Numba, creating fast ufuncs required writing C. **This is no longer the case!**


The NumPy [documentation](http://docs.scipy.org/doc/numpy/user/c-info.ufunc-tutorial.html) has a tutorial on how to write ufuncs in C. The example it uses is a ufunc that computes

$$f(a) = \log \left(\frac{a}{1-a}\right)$$

It looks like this:

```c
static void double_logit(char **args, npy_intp *dimensions,
                            npy_intp* steps, void* data)
{
    npy_intp i;
    npy_intp n = dimensions[0];
    char *in = args[0], *out = args[1];
    npy_intp in_step = steps[0], out_step = steps[1];

    double tmp;

    for (i = 0; i < n; i++) {
        /*BEGIN main ufunc computation*/
        tmp = *(double *)in;
        tmp /= 1-tmp;
        *((double *)out) = log(tmp);
        /*END main ufunc computation*/

        in += in_step;
        out += out_step;
    }
}
```

**NOTE:** That's just for a `double`.  If you want `floats`, `long doubles`, etc... you have to write all of those, too.  And then create a `setup.py` file to install it, etc.

### The `@vectorize` decorator

In [None]:
import math

In [None]:
def trig_func(x, y):
    return ((math.sin(x) ** 2) + (math.cos(y) ** 2))

In [None]:
trig_func(1, 1.5)

Seems reasonable.  However, the `math` library only works on scalars.  If we try to pass in arrays, we'll get an error.

In [None]:
trig_func([1, 2], [1, 2])

Using the `@vectorize` decorator, we can write our function as operating on scalar inputs rather than arrays. Numba will generate the surrounding loop (or kernel), allowing efficient iteration over the actual inputs.

In [None]:
from numba import vectorize, float64, float32, int32, int64

In [None]:
# Define ufunc with multiple signatures
@vectorize(['int32(int32, int32)',
            'int64(int64, int64)',
            'float32(float32, float32)',
            'float64(float64, float64)'])
def trig_func(x, y):
    return ((math.sin(x) ** 2) + (math.cos(y) ** 2))

And just like that, the scalar function `trig_func` is now a NumPy `ufunc` called `trig_func`.

In [None]:
a = np.random.random((1000, 1000))
b = np.random.random((1000, 1000))

In [None]:
%timeit trig_func(a, b)

In [None]:
def trig_func_numpy(x, y):
    return ((np.sin(x) ** 2) + (np.cos(y) ** 2))

How does it compare to just using NumPy?  Let's check

In [None]:
%timeit trig_func_numpy(a, b)

**NOTE**: NumPy ufuncs automatically get other features such as:

- reduction
- accumulation
- broadcasting

By defining our ufunc using Numba, we get these additional features for **free**.

In [None]:
a = np.arange(12).reshape(3, 4)
a

In [None]:
trig_func.reduce(a, axis=0)

In [None]:
trig_func.accumulate(a)
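To see what these methods do on a ufunc whose results are easy to verify by hand, the same calls can be tried on NumPy's built-in `np.add`. The array name `a_demo` is introduced here for illustration.

```python
import numpy as np

a_demo = np.arange(12).reshape(3, 4)

# reduce: fold the ufunc along an axis (here, column sums).
column_sums = np.add.reduce(a_demo, axis=0)

# accumulate: keep every intermediate result of that fold (running sums).
running = np.add.accumulate(a_demo, axis=0)

# broadcasting: the 1-D array is stretched across each row.
shifted = np.add(a_demo, np.array([10, 20, 30, 40]))

print(column_sums)
print(running)
print(shifted)
```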

In [None]:
%load_ext watermark

In [None]:
%watermark -u -n -t -iv -g -m


**To Be Continued...................**

<h1 align="center">The Sheer Joy of Accelerating Your Existing Python Code with Numba!</h1>

<h2 align="center">Part II : Numba for Cuda GPUs</h2>
