# Numba

Numba provides a just-in-time (JIT) compiler, a `vectorize` decorator for defining fast and flexible `ufunc`s, and an interface to CUDA-capable GPUs that lets us write CUDA kernels in Python. In this notebook, we'll focus on the JIT compiler.

In [1]:
from numba import jit
import numpy as np

## A simple example
Let's start with a simple sum. This example is discussed in more detail in [Accelerating Python Libraries with Numba (Part 2)](http://continuum.io/blog/numba_performance), which also adds C and Cython versions.

In Python we may define the sum like this:

In [2]:
def python_sum(a):
    res = a[0]
    for x in a[1:]:
        res += x
    return res
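Because `python_sum` relies only on indexing, slicing and `+=`, it is not tied to arrays. The following minimal sketch repeats the definition and calls it on a few built-in types; the inputs are illustrative choices and not part of the benchmark.

```python
from fractions import Fraction

def python_sum(a):
    res = a[0]
    for x in a[1:]:
        res += x
    return res

# Any indexable, sliceable sequence whose elements support + will do.
int_total = python_sum([1, 2, 3])                         # integers
frac_total = python_sum([Fraction(1, 2), Fraction(1, 3)])  # exact rationals
str_total = python_sum(["nu", "m", "ba"])                  # + concatenates strings

print(int_total, frac_total, str_total)
```

Note that an empty sequence raises an `IndexError`, since the function starts from `a[0]`.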

The only requirement for `a` is that its elements support the `+` operator. For the following little benchmark, we'll use an `ndarray` of random numbers.

In [7]:
a = np.random.random(10000)

In [8]:
%timeit python_sum(a)

1000 loops, best of 3: 1.23 ms per loop


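Calculate the floating-point operations per second for `python_sum`, in GFLOP/s. For comparison, the peak performance of a single core of the workstation is about 27 GFLOP/s. Note that the cell below divides by 1024³ rather than 10⁹; with the decimal definition of giga, the result is about 0.0081 GFLOP/s.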

In [19]:
10000 / (1.23 / 1000) / 1024 / 1024 / 1024

0.007571728248906329
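The same estimate can be wrapped in a small helper. This is only a sketch: the function name `gflops` is made up for illustration, it counts one addition per array element, and it uses the decimal definition of giga (10⁹).

```python
def gflops(n_ops, seconds):
    """Return the rate in GFLOP/s for n_ops operations taking `seconds`."""
    return n_ops / seconds / 1e9

n_ops = 10000        # roughly one addition per element of a
t_python = 1.23e-3   # seconds, taken from the %timeit result above
peak = 27.0          # GFLOP/s, single-core peak quoted in the text

rate = gflops(n_ops, t_python)
fraction_of_peak = rate / peak

print(f"{rate:.4f} GFLOP/s, {fraction_of_peak:.2%} of peak")
```

The pure Python loop reaches only a tiny fraction of what a single core can deliver.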

In [14]:
numba_sum = jit(python_sum)

In [15]:
%timeit -n 1 -r 1 numba_sum(a)

1 loop, best of 1: 50.2 ms per loop


The first time a jitted function is called with a specific argument type, Numba compiles the code, which takes fairly long (about 50 ms here). Subsequent calls with the same argument type are much faster:

In [16]:
%timeit numba_sum(a)

100000 loops, best of 3: 8.23 µs per loop


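This is quite an impressive speedup: roughly 150 times faster than the pure Python loop. Calculate the performance again. The cell below again divides by 1024³; with 10⁹ the result is about 1.2 GFLOP/s.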

In [20]:
10000 / (8.23 / 1000 / 1000) / 1024 / 1024 / 1024

1.1316191672120028
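The compile-on-first-call behaviour described above can also be observed without `%timeit`. The sketch below times two consecutive calls with `time.perf_counter`. The `try`/`except` fallback is an assumption added only so the snippet still runs where Numba is not installed; in that case both calls are plain Python and the timings will not differ much.

```python
import time
import numpy as np

try:
    from numba import jit
except ImportError:
    # Assumption for this sketch only: a no-op stand-in when Numba is missing.
    def jit(func):
        return func

def python_sum(a):
    res = a[0]
    for x in a[1:]:
        res += x
    return res

numba_sum = jit(python_sum)
a = np.random.random(10000)

start = time.perf_counter()
first = numba_sum(a)     # with Numba: compiles for float64 arrays, then runs
t_first = time.perf_counter() - start

start = time.perf_counter()
second = numba_sum(a)    # reuses the machine code compiled by the first call
t_second = time.perf_counter() - start

print(f"first call:  {t_first * 1e3:.3f} ms (includes compilation)")
print(f"second call: {t_second * 1e3:.3f} ms")
```

Calling `numba_sum` with a different argument type, for example an integer array, would trigger another compilation for that type.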

Let's compare the performance with NumPy's `sum`:

In [21]:
%timeit np.sum(a)

The slowest run took 17.13 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.6 µs per loop


The two are very similar: 8.23 µs for the jitted loop versus 7.6 µs for `np.sum`.

## A more complex example and nopython
Numba likes simple expressions with simple loops:

In [22]:
def mm(a,b):
    res = np.zeros((a.shape[0], b.shape[1]))
    for row in range(a.shape[0]):
        for col in range(b.shape[1]):
            for k in range(a.shape[1]):
                res[row, col] += a[row, k] * b[k,col]
    return res
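Before timing a hand-written kernel, it is worth checking that it computes the right thing. The sketch below repeats the `mm` definition and compares it with `dot` on small matrices. The shapes and the seed are arbitrary choices for illustration; non-square shapes help catch swapped indices.

```python
import numpy as np

def mm(a, b):
    res = np.zeros((a.shape[0], b.shape[1]))
    for row in range(a.shape[0]):
        for col in range(b.shape[1]):
            for k in range(a.shape[1]):
                res[row, col] += a[row, k] * b[k, col]
    return res

rng = np.random.default_rng(0)   # fixed seed so the check is reproducible
a_small = rng.random((20, 30))
b_small = rng.random((30, 10))

loop_result = mm(a_small, b_small)
dot_result = a_small.dot(b_small)

# Compare with a tolerance: the summation order differs between the two.
print(np.allclose(loop_result, dot_result))
```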

In [23]:
a = np.random.random((100, 100))
b = np.random.random((100, 100))

In [24]:
%timeit mm(a,b)

1 loop, best of 3: 637 ms per loop


In [25]:
%timeit a.dot(b)

The slowest run took 247.50 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 33.7 µs per loop


OK, the Python loop is nearly 20000 times slower than NumPy's `dot` method (637 ms versus 33.7 µs). Let's see if we can make this faster using Numba. This time, we'll use the `@jit` decorator.

In [26]:
@jit
def numba_mm(a,b):
    res = np.zeros((a.shape[0], b.shape[1]))
    for row in range(a.shape[0]):
        for col in range(b.shape[1]):
            for k in range(a.shape[1]):
                res[row, col] += a[row, k] * b[k,col]
    return res

In [27]:
%timeit -n 1 -r 1 numba_mm(a,b) # Warmup

1 loop, best of 1: 128 ms per loop


In [28]:
%timeit numba_mm(a,b)

1000 loops, best of 3: 854 µs per loop


In [29]:
a = np.random.random((1000, 1000))
b = np.random.random((1000, 1000))

In [30]:
%timeit numba_mm(a,b)

1 loop, best of 3: 6.49 s per loop


In [31]:
%timeit a.dot(b)

10 loops, best of 3: 22 ms per loop


The version of NumPy we used has been compiled against the MKL, and its `dot` is therefore about 300 times faster than our Numba routine. A NumPy build that has not been compiled against the MKL would take about the same time as the Numba routine.
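To put the two timings on the same scale as the earlier sum, we can convert them to GFLOP/s. This sketch assumes the usual operation count for a dense n × n product, n³ multiplications plus n³ additions; the helper name `matmul_gflops` is made up for illustration.

```python
def matmul_gflops(n, seconds):
    """GFLOP/s for an n x n matrix product, counting 2 * n**3 operations."""
    return 2 * n**3 / seconds / 1e9

n = 1000
numba_rate = matmul_gflops(n, 6.49)    # numba_mm timing from above
dot_rate = matmul_gflops(n, 22e-3)     # a.dot(b) timing from above
ratio = dot_rate / numba_rate

print(f"numba_mm: {numba_rate:.2f} GFLOP/s")
print(f"a.dot(b): {dot_rate:.1f} GFLOP/s")
print(f"ratio:    {ratio:.0f}x")
```

The rate for `dot` comes out well above the roughly 27 GFLOP/s single-core peak quoted earlier, which suggests (but does not prove) that the MKL was using more than one core.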


In [32]:
@jit
def numba_mm2(a,b):
    res = np.zeros((a.shape[0], b.shape[1]))
    for row in range(a.shape[0]):
        for col in range(b.shape[1]):
            res[row, col] = a[row].dot(b[:,col])
    return res

In [33]:
%timeit -r 1 -n 1 numba_mm2(a,b)

1 loop, best of 1: 9.37 s per loop


In [34]:
%timeit numba_mm2(a,b)

1 loop, best of 3: 9.19 s per loop


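Most of the computation should now be done by NumPy, yet this version is slower than the explicit loop. A dot product of two vectors is much harder to optimize than a full matrix multiplication: it performs only about one floating-point operation per element it reads, so memory access tends to dominate, and here we also pay the call overhead one million times. Think back to our discussion about bottlenecks.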
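One way to make the bottleneck argument concrete is to compare how many floating-point operations each formulation performs per array element it has to touch. The sketch below is a back-of-the-envelope model, not a measurement: it assumes an idealized matrix multiplication that touches each element of `a`, `b` and `res` only once, and both function names are made up for illustration.

```python
def dot_intensity(n):
    """FLOPs per element read for a dot product of two length-n vectors."""
    flops = 2 * n            # n multiplications + n additions
    elements_read = 2 * n    # one row of a and one column of b
    return flops / elements_read

def matmul_intensity(n):
    """FLOPs per element touched for an idealized n x n matrix product."""
    flops = 2 * n**3
    elements_touched = 3 * n**2   # a, b and res, each touched once
    return flops / elements_touched

n = 1000
print(dot_intensity(n))      # stays constant as n grows
print(matmul_intensity(n))   # grows linearly with n
```

Under these assumptions, a matrix multiplication leaves far more room to reuse data that is already in cache, which is the kind of reuse an optimized library can exploit and a sequence of independent dot products cannot.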

In [35]:
@jit(nopython=True)
def numba_mm3(a, b, res):
    for row in range(a.shape[0]):
        for col in range(b.shape[1]):
            res[row, col] = a[row].dot(b[:,col])
    return res

In [36]:
res = np.zeros((a.shape[0], b.shape[1]))  # np.zeros is also a NumPy API call that cannot be translated yet, so we allocate outside the function

In [37]:
%timeit -r 1 -n 1 numba_mm3(a, b, res)

UntypedAttributeError: Failed at nopython (nopython frontend)
Unknown attribute "dot" of type array(float64, 1d, C)
File "<ipython-input-35-f2f2172ed241>", line 5

Our original triple-loop version works in nopython mode once the call to `np.zeros` is moved out of the function.

In [38]:
@jit(nopython=True)
def numba_mm4(a, b, res):
    # res must be zero-initialized by the caller: the loop accumulates into it
    for row in range(a.shape[0]):
        for col in range(b.shape[1]):
            for k in range(a.shape[1]):
                res[row, col] += a[row, k] * b[k,col]
    return res

In [39]:
numba_mm4(a, b, res)

array([[ 253.70586327,  243.13261786,  238.88908283, ...,  251.17895901,
         255.39878115,  245.48340762],
       [ 252.26852677,  247.21048583,  240.48872024, ...,  253.35050734,
         258.70238139,  253.32279088],
       [ 240.25783996,  241.47048803,  228.67841585, ...,  244.18030088,
         244.67413753,  246.70035997],
       ..., 
       [ 247.2916786 ,  241.07777965,  241.32837765, ...,  247.65836867,
         255.28292981,  251.21304133],
       [ 254.79880954,  244.09251286,  241.641837  , ...,  252.0830717 ,
         253.6424546 ,  254.73494   ],
       [ 250.24895516,  246.21808641,  236.3362129 , ...,  251.81318711,
         251.50339171,  247.72675731]])

In [40]:
%timeit numba_mm4(a, b, res)

1 loop, best of 3: 6.52 s per loop


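Why wasn't our first version (`numba_mm`, compiled with plain `@jit`) slowed down by object mode, even though it calls `np.zeros`? Numba detects that the *loop* doesn't contain any calls to the Python API, so it compiles and optimizes the loop on its own (loop lifting), and only the surrounding code runs in object mode.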

## Exercise: Mandelbrot with Numba
Now it's your turn. Use Numba to write a program that calculates the Mandelbrot set. I recommend starting a new notebook for this.