# Part 2: Numba

Numba is another Python package designed to speed up Python applications that use a lot of `numpy` code for computation. Numba is a 'just-in-time' (JIT) compiler for numpy code: it works by compiling your numpy code at runtime into optimised machine code.

Numba works best on numpy code that can be encapsulated neatly into small, self-contained functions, such as functions that perform operations on arrays or that contain loops. Numba is most commonly used through Python decorators, which are placed above the function you wish to optimise through just-in-time compilation.

You will likely see performance gains in most numpy code by using numba with its default options and the commonly applied `@jit` decorator. However, this decorator does not by itself cause numba to run computations in parallel. (The initial speed-up comes from having "jit-ted" code.)

To exploit the parallel features of numba, we have to go beyond the basics, but first let's look at some simple examples of jitted code with numba:



In [19]:
from numba import jit
import numpy as np
import time

x = np.arange(100000000).reshape(10000, 10000)

@jit(nopython=True) # Set "nopython" mode for best performance
def go_fast(a): # Function is compiled to machine code when called the first time
    trace = 0
    for i in range(a.shape[0]):   # Numba likes loops
        trace += np.tanh(a[i, i]) # Numba likes NumPy functions
    return a + trace              # Numba likes NumPy broadcasting

# You can optionally run the function the first time to compile it
result = go_fast(x)

I am using the built-in `%%timeit` feature of Jupyter notebooks, which times the execution of a notebook cell. If you are not using a Jupyter notebook, remove the `%%timeit` line and uncomment the timing measurements below, or use a profiling tool of your choice.

In [20]:
%%timeit
# If you are not using a jupyter notebook, you can uncomment the timing measurements below:
#t1 = time.time()
result = go_fast(x)
#t2 = time.time()
#delta_t = t2 - t1

#print("Time taken: {}".format(delta_t))

127 ms ± 2.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


And for comparison, the **non-numba** version:

In [21]:
import numpy as np
import time

x = np.arange(100000000).reshape(10000, 10000)

def go_slow(a): # Function is run as standard Python/Numpy code
    trace = 0
    for i in range(a.shape[0]):   
        trace += np.tanh(a[i, i]) 
    return a + trace              

In [22]:
%%timeit
#t1 = time.time()
result = go_slow(x)
#t2 = time.time()
#delta_t = t2 - t1

#print("Time taken: {}".format(delta_t))

157 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


#### Summary of results so far:

Adding the numba `@jit` decorator gives only a fairly minor speed-up (roughly 1.2x) over the plain numpy code:

 - Plain numpy: 157 ms
 - Numba numpy: 127 ms

Let's see if we can now get better performance using the parallel options in numba.

### Parallel numba

To make use of parallel methods with numba, we start with the `nogil` feature. You will also see `nogil` referred to in Cython when talking about parallelisation techniques. `nogil` allows us to release the Python GIL (global interpreter lock) under certain conditions: namely, when we are running numba-optimised code that only operates on native types and variables (rather than Python objects). Releasing the GIL does not run anything in parallel by itself; it only allows other Python threads to run while the compiled function executes.

To use it, we pass an additional keyword argument to the `@jit` decorator: `@jit(nogil=True)`.

In [23]:
from numba import jit
import numpy as np
import time

x = np.arange(100000000).reshape(10000, 10000)

@jit(nopython=True, nogil=True) # "nopython" mode, with the GIL released while the compiled code runs
def go_fast_nogil(a): # Function is compiled to machine code when called the first time
    trace = 0
    for i in range(a.shape[0]):   # Numba likes loops
        trace += np.tanh(a[i, i]) # Numba likes NumPy functions
    return a + trace              # Numba likes NumPy broadcasting

# You can optionally run the function the first time to compile it
result = go_fast_nogil(x)

In [24]:
%%timeit
result = go_fast_nogil(x)

126 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
