# Part 2: Numba

Numba is another Python package designed to speed up Python applications that use a lot of `numpy` code for computation. Numba is a 'just-in-time' (JIT) compiler for numpy code - it works by compiling your numpy code at runtime into optimised machine code.

Numba works best on numpy code that can be encapsulated neatly into separate, minimal functions, such as functions that perform operations on arrays or contain loops. Numba is most commonly used through Python decorators, which are placed above the function you wish to optimise through just-in-time compilation.

You will likely see performance gains in most numpy code by using numba with its default options and the commonly applied `@jit` decorator. However, this decorator does not by itself cause numba to run parallel computation (the initial speed-up comes from having "jit-ted" code).

To exploit the parallel features of numba, we have to go beyond the basics, but first let's look at some simple examples of jitted code with numba:



In [1]:
from numba import jit
import numpy as np
import time

x = np.arange(100000000).reshape(10000, 10000)

@jit(nopython=True) # Set "nopython" mode for best performance
def go_fast(a): # Function is compiled to machine code when called the first time
    trace = 0
    for i in range(a.shape[0]):   # Numba likes loops
        trace += np.tanh(a[i, i]) # Numba likes NumPy functions
    return a + trace              # Numba likes NumPy broadcasting

# You can optionally run the function the first time to compile it
result = go_fast(x)

If you are not using a jupyter notebook, you can uncomment the timing measurements below, or use a profiling tool of your choice. (I am using `%%timeit`, the built-in feature of jupyter notebooks that lets us time the execution of a notebook cell.)

In [9]:
%%timeit
# If you are not using a jupyter notebook, you can uncomment the timing measurements below:
#t1 = time.time()
result = go_fast(x)
#t2 = time.time()
#delta_t = t2 - t1

#print("Time taken: {}".format(delta_t))

304 ms ± 8.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
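
Outside a notebook, the same kind of measurement can be sketched with `time.perf_counter` from the standard library. The helper name `best_of` and the stand-in function `add_diag_tanh` are made up for this illustration; this is a minimal sketch rather than a replacement for `%%timeit`. The function is called once before timing so that, for a jitted function, compilation is not included in the measurement.

```python
import time
import numpy as np

def best_of(func, *args, repeats=5):
    """Hypothetical helper: fastest wall-clock time (seconds) over `repeats` calls."""
    func(*args)  # warm-up call; for a jitted function this triggers compilation
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)

def add_diag_tanh(a):
    # Stand-in with the same body as go_slow, so the sketch runs without numba
    trace = 0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

x_small = np.arange(1_000_000).reshape(1000, 1000)
elapsed = best_of(add_diag_tanh, x_small)
print("Time taken: {:.3f} ms".format(elapsed * 1000))
```

Taking the minimum over several repeats reduces the influence of other processes running on the machine.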


And for comparison, the **non-numba** version:

In [10]:
import numpy as np
import time

x = np.arange(100000000).reshape(10000, 10000)

def go_slow(a): # Function is run as standard Python/Numpy code
    trace = 0
    for i in range(a.shape[0]):   
        trace += np.tanh(a[i, i]) 
    return a + trace              

In [11]:
%%timeit
#t1 = time.time()
result = go_slow(x)
#t2 = time.time()
#delta_t = t2 - t1

#print("Time taken: {}".format(delta_t))

373 ms ± 6.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### Summary of results so far:

Adding the numba decorator `@jit` gives us only a fairly minor speed-up over the plain numpy code:

 - Plain numpy: 373ms
 - Numba numpy: 304ms

Let's see if we can now get some better performance using the parallel options in numba.

### Parallel numba

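To make use of parallel methods with numba, we use the `nogil` and `parallel` features. You will also see `nogil` referred to in Cython when talking about parallelisation techniques. `nogil` allows us to release the Python GIL under certain conditions: namely when we are using numba-optimised code that only operates on native types and variables (rather than Python objects). Releasing the GIL lets the function run concurrently with other Python threads, while `parallel` asks numba to parallelise the body of the function itself where it can.

When using the `@jit` decorator, we pass these as additional keyword arguments: `@jit(nogil=True)` to release the GIL, and `parallel=True` to enable automatic parallelisation. The example below uses both.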

In [12]:
from numba import jit
import numpy as np
import time

x = np.arange(100000000).reshape(10000, 10000)

@jit(nopython=True, nogil=True, parallel=True) # Release the GIL and enable automatic parallelisation
def go_fast_nogil(a): # Function is compiled to machine code when called the first time
    trace = 0
    for i in range(a.shape[0]):   # Numba likes loops
        trace += np.tanh(a[i, i]) # Numba likes NumPy functions
    return a + trace              # Numba likes NumPy broadcasting

# You can optionally run the function the first time to compile it
result = go_fast_nogil(x)

In [13]:
%%timeit
result = go_fast_nogil(x)

170 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


#### Summary of results so far:

Adding the arguments `nogil=True` and `parallel=True` to the numba decorator `@jit` gives us a much better speed-up. The results from my (4 core) laptop were:

 - Plain numpy: 373ms
 - Numba numpy: 304ms
 - __Numba + `nogil` + `parallel`: 170ms__

Try it on your own machine and see if you get comparable results.

## More details on how numba works

We won't go into the details of how numba works in this mini-tutorial, but basically it analyses the function to find operations that can safely be run in parallel, such as array expressions and loops. Explicit loops are only parallelised when they are written with `numba.prange` instead of `range`, so in the example above it is the array expression `a + trace`, not the loop, that can be run in parallel. This means you may not always get a speed-up using the `parallel=True` argument, as numba may find nothing in the function that it can parallelise. The aim of the numba module is to make parallelisation easy for the end user (by just adding a decorator with a few keyword arguments) but at the expense of hiding a lot of the details of what is going on 'under the hood'.
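
To illustrate one way of telling numba explicitly that a loop may be parallelised, here is a minimal sketch using `prange`. The function name `diag_tanh_shift` and the small test array are made up for this illustration, and the `try`/`except` fallback is only an assumption added so that the sketch still runs (serially) on a machine without numba installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback (assumption for this sketch): run as plain Python if numba is missing
    prange = range
    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap

@njit(parallel=True)
def diag_tanh_shift(a):
    trace = 0.0
    # prange declares the iterations independent; "+=" on a scalar
    # is handled by numba as a reduction across threads
    for i in prange(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

x_small = np.arange(16.0).reshape(4, 4)
shifted = diag_tanh_shift(x_small)
expected = x_small + np.tanh(np.diag(x_small)).sum()
print(np.allclose(shifted, expected))
```

With a loop this short the threading overhead will usually outweigh any gain, so a benefit should only be expected when each iteration does a meaningful amount of work.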




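A more advanced example, in which a `nogil` function is run by hand in several Python threads, each working on its own chunk of the arrays: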


In [15]:
import math
import threading
from timeit import repeat

import numpy as np
from numba import jit

nthreads = 4
size = 10**6

def func_np(a, b):
    """
    Control function using Numpy.
    """
    return np.exp(2.1 * a + 3.2 * b)

@jit('void(double[:], double[:], double[:])', nopython=True, nogil=True)
def inner_func_nb(result, a, b):
    """
    Function under test.
    """
    for i in range(len(result)):
        result[i] = math.exp(2.1 * a[i] + 3.2 * b[i])

def timefunc(correct, s, func, *args, **kwargs):
    """
    Benchmark *func* and print out its runtime.
    """
    print(s.ljust(20), end=" ")
    # Make sure the function is compiled before we start the benchmark
    res = func(*args, **kwargs)
    if correct is not None:
        assert np.allclose(res, correct), (res, correct)
    # time it
    print('{:>5.0f} ms'.format(min(repeat(lambda: func(*args, **kwargs),
                                          number=5, repeat=2)) * 1000))
    return res

def make_singlethread(inner_func):
    """
    Run the given function inside a single thread.
    """
    def func(*args):
        length = len(args[0])
        result = np.empty(length, dtype=np.float64)
        inner_func(result, *args)
        return result
    return func

def make_multithread(inner_func, numthreads):
    """
    Run the given function inside *numthreads* threads, splitting its
    arguments into equal-sized chunks.
    """
    def func_mt(*args):
        length = len(args[0])
        result = np.empty(length, dtype=np.float64)
        args = (result,) + args
        chunklen = (length + numthreads - 1) // numthreads
        # Create argument tuples for each input chunk
        chunks = [[arg[i * chunklen:(i + 1) * chunklen] for arg in args]
                  for i in range(numthreads)]
        # Spawn one thread per chunk
        threads = [threading.Thread(target=inner_func, args=chunk)
                   for chunk in chunks]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        return result
    return func_mt


func_nb = make_singlethread(inner_func_nb)
func_nb_mt = make_multithread(inner_func_nb, nthreads)

a = np.random.rand(size)
b = np.random.rand(size)

correct = timefunc(None, "numpy (1 thread)", func_np, a, b)
timefunc(correct, "numba (1 thread)", func_nb, a, b)
timefunc(correct, "numba (%d threads)" % nthreads, func_nb_mt, a, b)

numpy (1 thread)        34 ms
numba (1 thread)       119 ms
numba (4 threads)       46 ms


array([ 3.00112562,  4.68331271, 38.09321246, ...,  7.84046845,
       13.55508374,  4.41382201])
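
The chunking step inside `make_multithread` can be looked at on its own. The sketch below uses a deliberately small `length` of 10 (an assumption for illustration, not the size used above) to show that the ceiling division gives a shorter last chunk, and that each chunk is a view into `result`, which is why the threads can write their output in place.

```python
import numpy as np

numthreads = 4
length = 10
result = np.zeros(length)

# Ceiling division: 10 elements over 4 threads gives chunks of at most 3
chunklen = (length + numthreads - 1) // numthreads
chunks = [result[i * chunklen:(i + 1) * chunklen] for i in range(numthreads)]
print([len(c) for c in chunks])  # [3, 3, 3, 1]

# Basic slices are views, so writing to a chunk writes into result
for i, chunk in enumerate(chunks):
    chunk[:] = i

print(result)  # [0. 0. 0. 1. 1. 1. 2. 2. 2. 3.]
```

Because the chunks do not overlap, no two threads ever write to the same element, so no locking is needed.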