In [None]:
%%html
<style>
table {float:left}
</style>

# Code Coffee: Speeding up Python with Numba
**Sebastian Stammler (stammler@usm.lmu.de)**  
3 July 2018

## Sources

* [**Website** (latest version: 0.38.0)](https://numba.pydata.org/)
* [**Documentation**](http://numba.pydata.org/numba-doc/latest/index.html)

## Introduction & Tutorial on YouTube

* [**Github repository**](https://github.com/gforsyth/numba_tutorial_scipy2017)  
* [**YouTube video**](https://www.youtube.com/watch?v=1AwG0T4gaO0)

[![Numba Tutorial](img/numba_tutorial.jpg)](https://www.youtube.com/watch?v=1AwG0T4gaO0 "Numba Tutorial")

## Requirements

To follow the examples in this notebook you need to install a few modules.  
I recommend installing the requirements of the tutorial above, which are listed in the file `environment.yml`.

In [None]:
%%file environment.yml
name: numbatutorial
dependencies:
    - python=3.6
    - numpy
    - matplotlib
    - numba
    - jupyter
    - ipython
    - line_profiler
    - cython
    - scipy

### Loading the environment

```
conda env create -f environment.yml
source activate numbatutorial
```

## Why Python?

### Pros:
* easy to read / understand / use
* dynamic typing
* easy to install new packages
* ...

### Cons:
* interpreted language
* dynamic typing
* ...

# Example 1: Matrix-vector multiplication (basics)

As an example of what `Numba` actually does, we look at a simple matrix-vector multiplication.

$
\begin{align*}
 \vec{x} = \mathbf{A} \cdot \vec{b} \qquad \qquad \mathbf{A} &\in \mathbb{R}^{M \times N} \\ \vec{b} &\in \mathbb{R}^N
\end{align*}
$

### Initialization

In [None]:
import numpy as np

# Dimensions
M = 1000
N = 2000

# Initialization
A = np.random.random((M, N))
b = np.random.random(N)

### Simple Python function

The elements $x_i$ of vector $\vec{x}$ are sums, such that a simple Python implementation could be realized by two `for` loops.

$
\begin{align}
x_i = \sum\limits_{j=1}^N A_{ij} b_j
\end{align}
$
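As a quick check of the sum above, here is a small case that can be verified by hand. The names `A_small` and `b_small` and their values are chosen only for illustration and are not used elsewhere in this notebook.

```python
import numpy as np

# Small hand-checkable example (values chosen for illustration only)
A_small = np.array([[1., 2., 3.],
                    [4., 5., 6.]])   # 2 rows, 3 columns
b_small = np.array([1., 0., -1.])    # 3 entries

# x_i = sum over j of A_ij * b_j, written out with explicit loops
n_rows, n_cols = A_small.shape
x_small = np.zeros(n_rows)
for i in range(n_rows):
    for j in range(n_cols):
        x_small[i] += A_small[i, j] * b_small[j]

print(x_small)                   # [-2. -2.]
print(np.dot(A_small, b_small))  # same result from NumPy
```

The inner loop runs over the columns of the matrix, so its length is the number of entries of the vector.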

In [None]:
def mat_vec_mult(A, b):
    M, N = A.shape
    x = np.zeros(M)
    for i in range(M):
        for j in range(N):
            x[i] += A[i, j] * b[j]
    return x

t_mat_vec_python = %timeit -o mat_vec_mult(A, b)

### Simple Python function with Numba

There are two methods for using the just-in-time compiler `jit` from the `numba` package. One is by using it as a decorator in front of the function that should be compiled.

Functions decorated with `jit` will be compiled when they are called for the first time. In order to not measure the compilation time, we call the functions once before using `timeit`.

In [None]:
from numba import jit

@jit
def mat_vec_mult_numba(A, b):
    M, N = A.shape
    x = np.zeros(M)
    for i in range(M):
        for j in range(N):
            x[i] += A[i, j] * b[j]
    return x

mat_vec_mult_numba(A, b)
t_mat_vec_numba = %timeit -o mat_vec_mult_numba(A, b)

### Numpy function

A simple matrix-vector multiplication is a standard math operation that is already implemented in `NumPy`, which is written in `C` and `Fortran`. How does this compare to `Numba`?

In [None]:
t_mat_vec_numpy = %timeit -o np.dot(A, b)

### Check for consistency

Just to be sure, we check that all three approaches shown above actually return the same solution.

In [None]:
x_mat_vec_python                = mat_vec_mult(A, b)
x_mat_vec_python_numba          = mat_vec_mult_numba(A, b)
x_mat_vec_numpy                 = np.dot(A, b)

mat_vec_allclose =                                  np.allclose(x_mat_vec_python, x_mat_vec_python_numba)
mat_vec_allclose = np.logical_and(mat_vec_allclose, np.allclose(x_mat_vec_python, x_mat_vec_numpy))

if(mat_vec_allclose):
    from IPython.display import Image,display
    im = Image('img/thumbsup.jpg')
    display(im)

### Comparison of all methods

| Method | Time | Speedup |
|:--|--:|--:|
| Python |{{"{:.2e} ms".format(1000.*t_mat_vec_python.average)}}||
| Numba |{{"{:.2e} ms".format(1000.*t_mat_vec_numba.average)}}|x {{int(np.round(t_mat_vec_python.average/t_mat_vec_numba.average))}}|
| NumPy |{{"{:.2e} ms".format(1000.*t_mat_vec_numpy.average)}}|x {{int(np.round(t_mat_vec_python.average/t_mat_vec_numpy.average))}}|

## How does Numba work?

To better understand how `Numba` works, we look at a simple function that adds two variables `a` and `b`.

In [None]:
from numba import jit

@jit
def add(a, b):
    return a + b

To compile this function we call it with two integers as arguments.

In [None]:
add(1, 1)

With the method `inspect_types()` we can have a look at the types `numba` inferred for `a` and `b`.

In [None]:
add.inspect_types()

`Numba` correctly inferred integers as types of the arguments.  
Let's call the same function now with floating point numbers.

In [None]:
add(1.0, 1.0)

Now check the types again.

In [None]:
add.inspect_types()

`Numba` compiled the function again, this time with floats. The function is now overloaded.

Let's do it again, but this time in reverse order.

In [None]:
from numba import jit

@jit
def add2(a, b):
    return a + b

First call with floats.

In [None]:
add2(1., 1.)

And now with integers.

In [None]:
add2(1, 1)

The function returned a floating point number, even though it was called with integers.

Let's have a look at the types.

In [None]:
add2.inspect_types()

`Numba` compiled the function only once, with floats.  
The reason is that the integers in the second call could be converted to floats, so the function did not need to be compiled again. If you want your function to be specialized for every type, you have to call it in the right order: `int` -> `float` -> `complex`.

What happens, if we add strings, which is a valid operation (concatenation) in Python?  
It's also a demonstration of the second method for using `Numba`. The reason for the additional `()` will become clear later.

In [None]:
from numba import jit

def add_strings(a, b):
    return a + b

add_strings_jit = jit()(add_strings)

First, the standard Python function.

In [None]:
add_strings('a', 'b')

Just as we were expecting, it's concatenating the strings.

What does the `jit`ted function do?

In [None]:
add_strings_jit('a', 'b')

Surprise, surprise! It's just doing what it's supposed to do.

Let's check the types.

In [None]:
add_strings_jit.inspect_types()

The arguments here are `pyobject`s.  
`Numba` has two modes: `object` mode and `nopython` mode. Every time `Numba` fails to compile an expression it falls back to `object` mode, which is essentially Python. So in the worst case you are just left with Python code without any speed up.

We can also check which parts of a function `Numba` could compile and where it failed. For this, we have to write the code into a file. We also added a nonsensical `for` loop.

In [None]:
%%file nopython_failure.py
from numba import jit

@jit
def add_strings(a, b):
    for i in range(100):
        c = i
        f = i + 7
        l = c + f
    return a + b

add_strings('a', 'b')

To investigate this code, we have to use the command line, which creates an `html` file.

In [None]:
!numba --annotate-html fail.html nopython_failure.py

[fail.html](fail.html)

`Numba` failed in the last expression that concatenated the strings. This is highlighted in red.  
But also note the green color of the `for` loop. This indicates that `Numba` could speed up that part. Even though `Numba` failed in compiling the whole function, it still could provide some speed up.

We can also force `Numba` to compile a function in `nopython` mode. In that case `Numba` will raise an error, if it cannot compile any part of a function. The `nopython` mode can be enforced by using `nopython=True`.

In [None]:
from numba import jit

@jit(nopython=True)
def add_strings_nopython(a, b):
    return a + b

Now it becomes clear why we needed the `()` in the example above: we can pass arguments to `jit`.

Let's call the function.

In [None]:
#add_strings_nopython('a', 'b')

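`Numba` fails to compile this function and raises an error (with the version used here; newer `Numba` releases support strings in `nopython` mode, in which case the call succeeds).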

Instead of using `nopython=True`, we could also use the `njit` function of `Numba`. Both methods are equivalent.

In [None]:
from numba import njit

@njit
def add_strings_nopython2(a, b):
    return a + b

#add_strings_nopython2('a', 'b')

Other useful compilation flags are
```Python
cache=True
```
which will cache the compiled function on disk (as index and data files in the `__pycache__` directory next to the source file), and
```Python
nogil=True
```
which releases the global interpreter lock.

We can also look directly at the `LLVM` code `numba` created.

In [None]:
for k, v in add.inspect_llvm().items():
    print(k, v)

As you could see already, `Numba` is aware of certain Python and `NumPy` features.

A full list of supported features, functions, and data types can be found here:

* [Python features](https://numba.pydata.org/numba-doc/dev/reference/pysupported.html)
* [NumPy features](https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html)
* [Data types](https://numba.pydata.org/numba-doc/dev/reference/types.html)

## Precompiling Numba modules

It's possible to compile Numba modules ahead of time and make them available as Python modules. These can then be run by other users even if they do not have `Numba` installed. Only `NumPy` is required.

In [None]:
import numpy as np
from numba.pycc import CC

In [None]:
cc = CC('precompiled_module')  # Name of the module
cc.verbose = True

Since the compilation is done "ahead-of-time", we have to give the signatures of the arguments and the return value. Function overloading is not allowed. Every function compiled with a signature requires a unique name.

In [None]:
@cc.export('add_int', 
           'i8(i8, i8)')
@cc.export('add_single', 
           'f4(f4, f4)')
@cc.export('add_double', 
           'f8(f8, f8)')
def add_precompiled(a, b):
    return a + b

In [None]:
cc.compile()

# Example 2: Two-dimensional Poisson equation (loop unrolling)

Let's consider the two-dimensional Poisson equation with a vanishing source term (i.e. the Laplace equation)

$
\begin{align}
\frac{\partial^2}{\partial x^2} u + \frac{\partial^2}{\partial y^2} u = 0
\end{align}
$

which can be discretized with a second order central differences scheme to

$
\begin{align}
\frac{u_{i+1, j}^{n} - 2 u_{i, j}^n + u_{i-1, j}^n}{\Delta x^2} + \frac{u_{i, j+1}^{n} - 2 u_{i, j}^n + u_{i, j-1}^n}{\Delta y^2} = 0
\end{align}
$

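Solving this for $u_{i, j}$ on an equidistant grid with $\Delta x = \Delta y$ and using the result as an iteration rule from step $n$ to step $n+1$ yields: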

$
\begin{align}
u_{i, j}^{n+1} = \frac{1}{4} \left( u_{i+1, j}^n + u_{i-1, j}^n + u_{i, j+1}^n + u_{i, j-1}^n \right)
\end{align}
$
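The update rule above can be applied to all interior points at once with shifted array slices. The following sketch compares the sliced form with explicit loops on a small grid; the grid size and values are arbitrary and chosen only for illustration.

```python
import numpy as np

# Small illustrative grid (4 x 4) with a linear field u[i, j] = 4*i + j
un = np.arange(16, dtype=float).reshape(4, 4)

# Update of all interior points using shifted slices
u_slices = un.copy()
u_slices[1:-1, 1:-1] = 0.25 * (un[2:, 1:-1] + un[:-2, 1:-1] +
                               un[1:-1, 2:] + un[1:-1, :-2])

# The same update written with explicit loops over the interior points
u_loops = un.copy()
for i in range(1, un.shape[0] - 1):
    for j in range(1, un.shape[1] - 1):
        u_loops[i, j] = 0.25 * (un[i+1, j] + un[i-1, j] +
                                un[i, j+1] + un[i, j-1])

print(np.allclose(u_slices, u_loops))  # True
print(np.allclose(u_slices, un))       # True: a linear field is already a solution
```

Each interior value is replaced by the mean of its four neighbours, so a field that is linear in both indices is left unchanged.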

In [None]:
import numpy as np

# Grid parameters
Nx = 101
Ny = 11
# Tolerance level
tol = 1.e-3

# Boundary value
u_bound_top    = 0.
u_bound_right  = 1.
u_bound_bottom = 0.
u_bound_left   = 0.

# Initial conditions
u_ini = np.zeros((Nx, Ny))

# Setting the boundaries
u_ini[ :, -1] = u_bound_top
u_ini[-1,  :] = u_bound_right
u_ini[ :,  0] = u_bound_bottom
u_ini[ 0,  :] = u_bound_left

In [None]:
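def solve_poisson(u, tol):

    # Work on a copy so that the initial condition passed in stays unchanged
    u = u.copy()

    iter_diff = tol + 1
    n = 0
    while iter_diff > tol and n <= 500:

        un = u.copy()
        u[1:-1, 1:-1] = 0.25 * (un[2:  , 1:-1] +
                                un[ :-2, 1:-1] +
                                un[1:-1, 2:  ] +
                                un[1:-1,  :-2])

        # Relative change between two successive iterations
        iter_diff = np.sqrt(np.sum((u - un)**2)/np.sum(un**2))
        n += 1
    return u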

In [None]:
u_poisson = solve_poisson(u_ini, tol)
t_poisson = %timeit -o solve_poisson(u_ini, tol)

## What does the result look like?

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

f = 10. / Nx
width  = f * Nx
height = f * Ny

fig = plt.figure(figsize=(width, height))
ax1 = fig.add_subplot(111)
plot = ax1.contourf(u_poisson.T)
ax1.set_xlabel("$x$", fontsize=16)
ax1.set_ylabel("$y$", fontsize=16)
ax1.set_xticks([])
ax1.set_yticks([])
plt.show()

## Jit it!

In [None]:
from numba import njit

solve_poisson_numba = njit()(solve_poisson)

## Timing

In [None]:
u_poisson_numba = solve_poisson_numba(u_ini, tol)
t_poisson_numba = %timeit -o solve_poisson_numba(u_ini, tol)

## Try loop unrolling

In [None]:
def solve_poisson_unroll(u, tol):

    # Work on a copy so that the initial condition passed in stays unchanged
    u = u.copy()

    Nx, Ny = u.shape

    iter_diff = tol + 1
    n = 0
    while iter_diff > tol and n <= 500:

        un = u.copy()

        for ix in range(1, Nx-1):
            for iy in range(1, Ny-1):
                u[ix, iy] = 0.25 * (un[ix+1, iy  ] +
                                    un[ix-1, iy  ] +
                                    un[ix  , iy+1] +
                                    un[ix  , iy-1])

        # Relative change between two successive iterations
        iter_diff = np.sqrt(np.sum((u - un)**2)/np.sum(un**2))
        n += 1
    return u

In [None]:
u_poisson_unroll = solve_poisson_unroll(u_ini, tol)
t_poisson_unroll = %timeit -o solve_poisson_unroll(u_ini, tol)

In [None]:
from numba import njit

solve_poisson_unroll_numba = njit()(solve_poisson_unroll)

In [None]:
u_poisson_unroll_numba = solve_poisson_unroll_numba(u_ini, tol)
t_poisson_unroll_numba = %timeit -o solve_poisson_unroll_numba(u_ini, tol)

## Check results

In [None]:
poisson_allclose =                                  np.allclose(u_poisson, u_poisson_numba)
poisson_allclose = np.logical_and(poisson_allclose, np.allclose(u_poisson, u_poisson_unroll))
poisson_allclose = np.logical_and(poisson_allclose, np.allclose(u_poisson, u_poisson_unroll_numba))

if(poisson_allclose):
    from IPython.display import Image,display
    im = Image('img/thumbsup2.jpg')
    display(im)

### Comparison

| Method | Time | Speedup |
|:--|--:|--:|
| Python |{{"{:.2e} ms".format(1000.*t_poisson.average)}}||
| Numba |{{"{:.2e} ms".format(1000.*t_poisson_numba.average)}}|x {{"{:.2f}".format(np.round(t_poisson.average/t_poisson_numba.average, 2))}}|
| Unrolled Python |{{"{:.2e} ms".format(1000.*t_poisson_unroll.average)}}|x {{"{:.2f}".format(np.round(t_poisson.average/t_poisson_unroll.average, 2))}}|
| Unrolled Numba |{{"{:.2e} ms".format(1000.*t_poisson_unroll_numba.average)}}|x {{"{:.2f}".format(np.round(t_poisson.average/t_poisson_unroll_numba.average, 2))}}|

# Example 3: n-body problem (classes)

Let's consider an n-body problem where we have to calculate the potential of every particle by summing over all other particles.

In [None]:
import numpy as np

class Point():
    
    def __init__(self, domain=1.0):
        self.x = domain * np.random.random()
        self.y = domain * np.random.random()
        self.z = domain * np.random.random()
            
    def distance(self, other):
        return np.sqrt((self.x - other.x)**2 + 
                       (self.y - other.y)**2 + 
                       (self.z - other.z)**2)
    

class Particle(Point):
    
    def __init__(self, domain=1.0, m=1.0):
        Point.__init__(self, domain)
        self.m = m
        self.phi = 0.

In [None]:
n = 1000
particles = [Particle(m=1.) for i in range(n)]

In [None]:
def direct_sum(particles):
    for i, target in enumerate(particles):
        target.phi = 0.
        for source in (particles[:i] + particles[i+1:]):
            r = target.distance(source)
            target.phi += source.m / r

In [None]:
t_nbody = %timeit -o direct_sum(particles)

In [None]:
from numba import njit

direct_sum_numba = njit()(direct_sum)

#direct_sum_numba(particles)

Numba does not know how to treat classes.

## Option 1: jitclass

In [None]:
# Note: newer Numba releases provide jitclass in numba.experimental instead
from numba import jitclass, float64
import numpy as np

# Specification of the data types
spec_particle = {}
spec_particle['x']   = float64
spec_particle['y']   = float64
spec_particle['z']   = float64
spec_particle['m']   = float64
spec_particle['phi'] = float64
    

@jitclass(spec_particle)
class Particle_jitclass():
    
    def __init__(self, domain=1.0, m=1.0):
        self.x   = domain * np.random.random()
        self.y   = domain * np.random.random()
        self.z   = domain * np.random.random()
        self.m   = m
        self.phi = 0.
        
    def distance(self, other):
        return np.sqrt((self.x - other.x)**2 + 
                       (self.y - other.y)**2 + 
                       (self.z - other.z)**2)

In [None]:
n = 1000
particles_jitclass = [Particle_jitclass(domain=1., m=1.) for i in range(n)]

In [None]:
#direct_sum_numba(particles_jitclass)

Nested memory-managed objects (like lists of objects) are not yet supported by `Numba`.

## Option 2: NumPy's structured arrays
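The data type used in this option is what `NumPy` calls a structured array: one contiguous array in which every element is a record with named fields. A minimal sketch with a reduced, illustrative record type (`point_dtype` is not used elsewhere in this notebook):

```python
import numpy as np

# Illustrative dtype with two named float fields
point_dtype = np.dtype([('x', np.float64), ('m', np.float64)])
points = np.zeros(3, dtype=point_dtype)

points['x'] = [0.0, 1.0, 2.0]   # set one field for all records at once
points[1]['m'] = 5.0            # set one field of a single record

print(points.dtype.names)   # ('x', 'm')
print(points['m'])          # [0. 5. 0.]
print(points.itemsize)      # 16 bytes per record
```

Because all records live in one typed memory block, `Numba` can work on such an array without touching Python objects.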

In [None]:
import numpy as np

particle_dtype = np.dtype({'names':['x','y','z','m','phi'], 
                           'formats':[np.double, 
                                      np.double, 
                                      np.double, 
                                      np.double, 
                                      np.double]})

n = 1000
particles_numpy = np.zeros(n, dtype=particle_dtype)

for i in range(n):
    particles_numpy[i]['x'] = np.random.random()
    particles_numpy[i]['y'] = np.random.random()
    particles_numpy[i]['z'] = np.random.random()
    particles_numpy[i]['m'] = np.random.random()

In [None]:
from numba import njit

@njit
def create_random_particles(n, m, domain):
    particles = np.zeros((n), dtype=particle_dtype)
    for p in particles:
        p['x'] = domain * np.random.random()
        p['y'] = domain * np.random.random()
        p.z    = domain * np.random.random()
        p.m    = m
    return particles


@njit
def distance_numpy(p1, p2):
    return np.sqrt((p1.x - p2.x)**2 + 
                   (p1.y - p2.y)**2 + 
                   (p1.z - p2.z)**2)

@njit
def direct_sum_numpy(particles):
    for i, target in enumerate(particles):
        # Reset the potential, as in the pure Python version
        target.phi = 0.
        for j, source in enumerate(particles):
            if i != j:
                r = distance_numpy(target, source)
                target.phi += source.m / r

    return particles

In [None]:
n = 1000
particles_numpy = create_random_particles(n, 1., 1.)

In [None]:
i = 120
msg = "Particle "+repr(i)+"\n\nx   = {:.2f}\ny   = {:.2f}\nz   = {:.2f}\nm   = {:.2f}\nphi = {:.2f}".format(particles_numpy[i]['x'], particles_numpy[i]['y'], particles_numpy[i]['z'], particles_numpy[i]['m'], particles_numpy[i]['phi'])
print(msg)

In [None]:
particles_numpy = direct_sum_numpy(particles_numpy)
t_nbody_numpy = %timeit -o direct_sum_numpy(particles_numpy)

In [None]:
i = 120
msg = "Particle "+repr(i)+"\n\nx   = {:.2f}\ny   = {:.2f}\nz   = {:.2f}\nm   = {:.2f}\nphi = {:.2f}".format(particles_numpy[i]['x'], particles_numpy[i]['y'], particles_numpy[i]['z'], particles_numpy[i]['m'], particles_numpy[i]['phi'])
print(msg)

### Comparison

| Method | Time | Speedup |
|:--|--:|--:|
| Python |{{"{:.2e} ms".format(1000.*t_nbody.average)}}||
| Numba jitclass |–|–|
| Numba + structured arrays |{{"{:.2e} ms".format(1000.*t_nbody_numpy.average)}}|x {{int(np.round(t_nbody.average/t_nbody_numpy.average))}}|

# Example 4: Logit (ufuncs, vectorization)

Universal functions (ufuncs) are functions that operate on arrays element by element. Basically all `NumPy` math functions are ufuncs. With `numpy.vectorize()` it is possible to vectorize functions. However, from the `NumPy` documentation:

> *"The `vectorize` function is provided primarily for convenience, not for performance."*

`Numba` has its own method to vectorize functions efficiently.

Consider the logit-function

$
\begin{align}
L(p) = \ln \left( \frac{p}{1-p} \right)
\end{align}
$
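Two properties of this function make handy sanity checks: it vanishes at $p = 0.5$ and it is antisymmetric around that point, $L(1-p) = -L(p)$. A small sketch with scalar values chosen for illustration:

```python
import math

def logit(p):
    # Only defined for 0 < p < 1
    return math.log(p / (1. - p))

print(logit(0.5))                 # 0.0
print(logit(0.8), -logit(0.2))    # equal up to rounding
```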

In [None]:
import math
import numpy as np

In [None]:
def logit(p):
    return math.log(p/(1.-p))

We use the `math` module here, since the `NumPy` functions are already vectorized.

In [None]:
n = 100000
v = np.random.random(n)

Our `logit()` function works on scalars

In [None]:
logit(v[0])

but not on vectors.

In [None]:
#logit(v)

Let's vectorize the function with `Numba`.

In [None]:
from numba import vectorize

logit_vectorized_numba = vectorize()(logit)

Now the function works on arrays, too.

In [None]:
logit_vectorized_numba(v)

The argument types are then inferred during the call.  
It is also possible to give a signature explicitly. This is necessary if we want to use keyword arguments such as `target`.

In [None]:
logit_vectorized_numba_parallel = vectorize(["float64(float64)"], target="parallel")(logit)

In [None]:
logit_vectorized_numba_parallel(v)

Possible targets are:

single-threaded CPU
```python
target="cpu"
```

multi-threaded CPU
```python
target="parallel"
```

CUDA GPU
```python
target="cuda"
```

## Let's do the timing!

### Pure NumPy

In [None]:
ufunc_numpy = np.log(v/(1.-v))
t_ufunc_numpy = %timeit -o np.log(v/(1.-v))

### NumPy vectorization

In [None]:
logit_vectorized_numpy = np.vectorize(logit)

ufunc_vectorized_numpy = logit_vectorized_numpy(v)
t_ufunc_vectorized_numpy = %timeit -o logit_vectorized_numpy(v)

### Numba vectorization

In [None]:
ufunc_vectorized_numba = logit_vectorized_numba(v)
t_ufunc_vectorized_numba = %timeit -o logit_vectorized_numba(v)

### Numba vectorization parallel

In [None]:
ufunc_vectorized_numba_parallel = logit_vectorized_numba_parallel(v)
t_ufunc_vectorized_numba_parallel = %timeit -o logit_vectorized_numba_parallel(v)

## Check results

In [None]:
ufunc_allclose =                                np.allclose(ufunc_numpy, ufunc_vectorized_numpy)
ufunc_allclose = np.logical_and(ufunc_allclose, np.allclose(ufunc_numpy, ufunc_vectorized_numba))
ufunc_allclose = np.logical_and(ufunc_allclose, np.allclose(ufunc_numpy, ufunc_vectorized_numba_parallel))

if(ufunc_allclose):
    from IPython.display import Image,display
    im = Image('img/thumbsup4.gif')
    display(im)

### Comparison of all methods

| Method | Time | Speedup |
|:--|--:|--:|
| NumPy vectorization |{{"{:.2e} ms".format(1000.*t_ufunc_vectorized_numpy.average)}}||
| Numba vectorization |{{"{:.2e} ms".format(1000.*t_ufunc_vectorized_numba.average)}}|x {{int(np.round(t_ufunc_vectorized_numpy.average/t_ufunc_vectorized_numba.average))}}|
| Numba vectorization parallel |{{"{:.2e} ms".format(1000.*t_ufunc_vectorized_numba_parallel.average)}}|x {{int(np.round(t_ufunc_vectorized_numpy.average/t_ufunc_vectorized_numba_parallel.average))}}|
| pure NumPy |{{"{:.2e} ms".format(1000.*t_ufunc_numpy.average)}}|x {{int(np.round(t_ufunc_vectorized_numpy.average/t_ufunc_numpy.average))}}|

# Example 5: Fourier transformation (autoparallelization)

$
\begin{align}
F \left( k \right) &= \frac{1}{\left(2\pi\right)^\frac{n}{2}} \int_{\mathbb{R}^n} f\left(x\right)\cdot e^{-ikx} \mathrm{d} x
\end{align}
$
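For the one-dimensional case ($n = 1$) on an equidistant grid $x_j$ with spacing $\Delta x$, the integral can be approximated by a simple Riemann sum. This is a sketch of the discretization that the loops in `f1d` evaluate, with $k_m$ denoting the sampled wave numbers; it says nothing about the accuracy of the approximation.

$
\begin{align}
F \left( k_m \right) &\approx \frac{\Delta x}{\sqrt{2\pi}} \sum\limits_{j=1}^{N_x} f\left(x_j\right)\cdot e^{-ik_m x_j}
\end{align}
$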

In [None]:
import math
import numpy as np

pi    = np.pi
twopi = 2. * np.pi

def sinc(x):
    return math.sin(pi * x) / (pi * x)

In [None]:
Nx     = 1000
xrange = 100.
Nk     = 1000
krange = 20.

x      = np.linspace(-xrange/2., xrange/2., Nx)
k      = np.linspace(-krange/2., krange/2., Nk)

In [None]:
from numba import vectorize

sinc_vec = vectorize(["float64(float64)"], target="parallel")(sinc)

In [None]:
def f1d(x, f, k):
    Nx, = x.shape
    Nk, = k.shape
    dx  = np.mean(x[1:] - x[:-1])
    F   = np.empty(Nk, dtype=np.complex128)
    for ik in range(Nk):
        s = 0
        for ix in range(Nx):
            s += f[ix] * np.exp(-1.j*k[ik]*x[ix])
        F[ik] = s * dx / np.sqrt(twopi)
    return F

In [None]:
f = sinc_vec(x)
F_python = f1d(x, f, k)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

width = 8.

fig = plt.figure(figsize=(2.*width, 2.*width/1.6))
ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax4 = fig.add_subplot(224)

ax1.plot(x, f)
ax1.set_xlabel("$x$", fontsize=16)
ax1.set_ylabel("$f$", fontsize=16)
ax1.set_xlim(-xrange/2., xrange/2.)
ax1.grid(False)
ax2.plot(k, F_python.real)
ax2.set_xlabel("$k$", fontsize=16)
ax2.set_ylabel(r"$\Re \left( F \right)$", fontsize=16)
ax2.set_xlim(-krange/2., krange/2.)
ax2.grid(False)
ax4.plot(k, F_python.imag)
ax4.set_xlabel("$k$", fontsize=16)
ax4.set_ylabel(r"$\Im \left( F \right)$", fontsize=16)
ax4.set_xlim(-krange/2., krange/2.)
ax4.set_ylim(ax2.get_ylim())
ax4.grid(False)

fig.tight_layout()
plt.show()

## Timing

### Python

In [None]:
F_python = f1d(x, f, k)
t_f1d_python = %timeit -o f1d(x, f, k)

### Numba jit

In [None]:
from numba import njit

f1d_numba_jit = njit()(f1d)
F_numba_jit = f1d_numba_jit(x, f, k)
t_f1d_numba_jit = %timeit -o f1d_numba_jit(x, f, k)

### Numba jit parallel

In [None]:
f1d_numba_jit_parallel = njit(parallel=True)(f1d)
F_numba_jit_parallel = f1d_numba_jit_parallel(x, f, k)
t_f1d_numba_jit_parallel = %timeit -o f1d_numba_jit_parallel(x, f, k)

## Optimization #1

In [None]:
def f1d_vol2(x, f, k):
    Nk, = k.shape
    dx  = np.mean(x[1:] - x[:-1])
    F   = np.empty(Nk, dtype=np.complex128)
    for ik in range(Nk):
        s = np.sum( f * np.exp(-1.j*k[ik]*x) )
        F[ik] = s * dx / np.sqrt(twopi)
    return F

### Python

In [None]:
F_vol2_python = f1d_vol2(x, f, k)
t_f1d_vol2_python = %timeit -o f1d_vol2(x, f, k)

### Numba jit

In [None]:
f1d_vol2_numba_jit = njit()(f1d_vol2)
F_vol2_numba_jit = f1d_vol2_numba_jit(x, f, k)
t_f1d_vol2_numba_jit = %timeit -o f1d_vol2_numba_jit(x, f, k)

### Numba jit parallel

In [None]:
f1d_vol2_numba_jit_parallel = njit(parallel=True)(f1d_vol2)
F_vol2_numba_jit_parallel = f1d_vol2_numba_jit_parallel(x, f, k)
t_f1d_vol2_numba_jit_parallel = %timeit -o f1d_vol2_numba_jit_parallel(x, f, k)

## Optimization #2
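This version relies on broadcasting: when `k` is passed as a column vector (`k[:, None]`), the product with `x` becomes a matrix with one row per wave number, and a single matrix-vector product replaces both the loop over `k` and the sum over `x`. A minimal sketch on small illustrative arrays (the `_demo` names and sizes are assumptions made for readability):

```python
import numpy as np

# Small illustrative grids
x_demo = np.linspace(-1., 1., 5)   # shape (5,)
k_demo = np.linspace(-2., 2., 3)   # shape (3,)
f_demo = np.ones(5)                # constant test function

# k as a column vector broadcasts against x to a (3, 5) matrix of products
phase = k_demo[:, None] * x_demo
print(phase.shape)                 # (3, 5)

# One matrix-vector product replaces the loop over k and the sum over x
F_dot = np.dot(np.exp(-1.j * phase), f_demo)

# Reference: explicit loop over k with a sum over x
F_loop = np.array([np.sum(f_demo * np.exp(-1.j * kk * x_demo)) for kk in k_demo])
print(np.allclose(F_dot, F_loop))  # True
```

The price is memory: the intermediate matrix has as many entries as there are combinations of wave numbers and grid points.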

In [None]:
def f1d_vol3(x, f, k):
    dx = np.mean(x[1:] - x[:-1])
    F  = np.dot(np.exp(-1.j*k*x), f)
    F *= dx / np.sqrt(twopi)
    return F

### Python

In [None]:
F_vol3_python = f1d_vol3(x, f, k[:, None])
t_f1d_vol3_python = %timeit -o f1d_vol3(x, f, k[:, None])

### Numba jit

In [None]:
f1d_vol3_numba_jit = njit()(f1d_vol3)
F_vol3_numba_jit = f1d_vol3_numba_jit(x, f+0.j, k[:, None])
t_f1d_vol3_numba_jit = %timeit -o f1d_vol3_numba_jit(x, f+0.j, k[:, None])

### Numba jit parallel

In [None]:
f1d_vol3_numba_jit_parallel = njit(parallel=True)(f1d_vol3)
#F_vol3_numba_jit_parallel = f1d_vol3_numba_jit_parallel(x, f+0.j, k[:, None])

## Check results

In [None]:
f1d_allclose =                              np.allclose(F_python, F_numba_jit)
f1d_allclose = np.logical_and(f1d_allclose, np.allclose(F_python, F_numba_jit_parallel))
f1d_allclose = np.logical_and(f1d_allclose, np.allclose(F_python, F_vol2_python))
f1d_allclose = np.logical_and(f1d_allclose, np.allclose(F_python, F_vol2_numba_jit))
f1d_allclose = np.logical_and(f1d_allclose, np.allclose(F_python, F_vol2_numba_jit_parallel))
f1d_allclose = np.logical_and(f1d_allclose, np.allclose(F_python, F_vol3_python))
f1d_allclose = np.logical_and(f1d_allclose, np.allclose(F_python, F_vol3_numba_jit))

if(f1d_allclose):
    from IPython.display import Image,display
    im = Image('img/thumbsup5.gif')
    display(im)

### Comparison of all methods

| Method | Time | Speedup |
|:--|--:|--:|
| Python |{{"{:.2e} s".format(t_f1d_python.average)}}||
| Numba jit |{{"{:.2e} s".format(t_f1d_numba_jit.average)}}|x {{int(np.round(t_f1d_python.average/t_f1d_numba_jit.average))}}|
| Numba jit parallel |{{"{:.2e} s".format(t_f1d_numba_jit_parallel.average)}}|x {{int(np.round(t_f1d_python.average/t_f1d_numba_jit_parallel.average))}}|
| Python vol. 2 |{{"{:.2e} s".format(t_f1d_vol2_python.average)}}|x {{int(np.round(t_f1d_python.average/t_f1d_vol2_python.average))}}|
| Numba jit vol. 2 |{{"{:.2e} s".format(t_f1d_vol2_numba_jit.average)}}|x {{int(np.round(t_f1d_python.average/t_f1d_vol2_numba_jit.average))}}|
| Numba jit parallel vol. 2 |{{"{:.2e} s".format(t_f1d_vol2_numba_jit_parallel.average)}}|x {{int(np.round(t_f1d_python.average/t_f1d_vol2_numba_jit_parallel.average))}}|
| Python vol. 3 |{{"{:.2e} s".format(t_f1d_vol3_python.average)}}|x {{int(np.round(t_f1d_python.average/t_f1d_vol3_python.average))}}|
| Numba jit vol. 3 |{{"{:.2e} s".format(t_f1d_vol3_numba_jit.average)}}|x {{int(np.round(t_f1d_python.average/t_f1d_vol3_numba_jit.average))}}|

# Example 6: Non-uniform Fast Fourier Transform (NUFFT)

For another nice example of code optimization with `Numba` and comparisons to Fortran code check out:

[Optimizing Python in the Real World: NumPy, Numba, and the NUFFT](http://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/)