Just-in-time compilation (JIT)
====

For programmer productivity, it often makes sense to code the majority of your application in a high-level language such as Python and only optimize the bottlenecks identified by profiling. One way to speed up these bottlenecks is to compile the code to machine executables, often via an intermediate C or C-like stage. There are two common approaches to compiling Python code: using a just-in-time (JIT) compiler, and using Cython for ahead-of-time (AOT) compilation.

This notebook mostly illustrates the JIT approach.

**References**

- [Numba](http://numba.pydata.org)
- [The need for speed without bothering too much: An introduction to numba](http://nbviewer.jupyter.org/github/akittas/presentations/blob/master/pythess/numba/numba.ipynb?utm_source=newsletter_mailer&utm_medium=email&utm_campaign=weekly)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

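**Utility functions for timing**

We write two helper functions for timing as an alternative to `timeit`: `timer` measures a single call, and `report` prints the speed-up of several functions relative to the first one.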

In [None]:
import time
from numpy.testing import assert_almost_equal

In [None]:
def timer(f, *args, **kwargs):
    # Return the result of f together with its wall-clock run time in seconds.
    start = time.perf_counter()
    ans = f(*args, **kwargs)
    return ans, time.perf_counter() - start

In [None]:
def report(fs, *args, **kwargs):
    # Time the first function as the baseline, then print each speed-up relative to it.
    ans, t = timer(fs[0], *args, **kwargs)
    print('%s: %.1f' % (fs[0].__name__, 1.0))
    for f in fs[1:]:
        ans_, t_ = timer(f, *args, **kwargs)
        print('%s: %.1f' % (f.__name__, t/t_))
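
A single wall-clock measurement can be noisy, so one possible refinement is to repeat the call and keep the fastest run. The sketch below is an assumption, not part of the helpers above; the name `best_time` and its `repeats` parameter are hypothetical.

```python
import time

def best_time(f, *args, repeats=3, **kwargs):
    """Return the result of f and the fastest of `repeats` wall-clock timings in seconds."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        ans = f(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return ans, best

# Example: time the built-in sum over a small range.
ans, t = best_time(sum, range(1000), repeats=5)
print(ans)  # 499500
```

Taking the minimum rather than the mean reduces the influence of unrelated background activity on the measurement.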

Using `numexpr`
----

One of the simplest approaches is to use [`numexpr`](https://github.com/pydata/numexpr), which takes a `numpy` expression written as a string and compiles it into a more efficient version. If a single simple expression is taking too long, this is a good choice because it requires almost no changes to your code. However, it is quite limited, since it only handles element-wise array expressions.

In [None]:
import numpy as np
a = np.random.random(int(1e6))
b = np.random.random(int(1e6))
c = np.random.random(int(1e6))

In [None]:
%timeit -r3 -n3 b**2 - 4*a*c

In [None]:
import numexpr as ne

In [None]:
%timeit -r3 -n3 ne.evaluate('b**2 - 4*a*c')
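
Before comparing timings, it can be worth confirming that the string expression gives the same values as the plain `numpy` one. The following is a minimal sketch on small arrays; passing the operands explicitly through `local_dict` is a choice made here for clarity, whereas the cells above let `numexpr` look the names up in the calling scope.

```python
import numpy as np
import numexpr as ne

rng = np.random.default_rng(0)
a, b, c = rng.random((3, 1000))   # three arrays of 1000 values each

expected = b**2 - 4*a*c
# Pass the operands explicitly instead of relying on the calling scope.
result = ne.evaluate('b**2 - 4*a*c', local_dict={'a': a, 'b': b, 'c': c})

print(np.allclose(result, expected))
```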

Using `numba`
----

When it works, the JIT compiler `numba` can speed up Python code tremendously with minimal effort.

[Documentation for `numba`](http://numba.pydata.org/numba-doc/0.12.2/index.html)

### Example 1

#### Plain Python version

In [None]:
def matrix_multiply(A, B):
    m, n = A.shape
    n, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i,j] += A[i,k] * B[k, j]
    return C

In [None]:
A = np.random.random((30, 50))
B = np.random.random((50, 40))

#### Numba jit version

In [None]:
import numba
from numba import jit

In [None]:
@jit
def matrix_multiply_numba(A, B):
    m, n = A.shape
    n, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i,j] += A[i,k] * B[k, j]
    return C

We can remove the cost of repeatedly indexing into `C` in the inner loop by accumulating the sum in a local scalar and writing it to the matrix once per entry.

In [None]:
@jit
def matrix_multiply_numba2(A, B):
    m, n = A.shape
    n, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            d = 0.0
            for k in range(n):
                d += A[i,k] * B[k, j]
            C[i,j] = d
    return C

In [None]:
%timeit -r3 -n3 matrix_multiply(A, B)
%timeit -r3 -n3 matrix_multiply_numba(A, B)
%timeit -r3 -n3 matrix_multiply_numba2(A, B)

#### Numpy version

In [None]:
def matrix_multiply_numpy(A, B):
    return A.dot(B)

#### Check that outputs are the same

In [None]:
assert_almost_equal(matrix_multiply(A, B), matrix_multiply_numba(A, B))
assert_almost_equal(matrix_multiply(A, B), matrix_multiply_numpy(A, B))

In [None]:
%timeit -r3 -n3 matrix_multiply_numba(A, B)

In [None]:
report([matrix_multiply, matrix_multiply_numba, matrix_multiply_numba2, matrix_multiply_numpy], A, B)

### Pre-compilation by giving a specific signature

In [None]:
@jit('double[:,:](double[:,:], double[:,:])')
def matrix_multiply_numba_1(A, B):
    m, n = A.shape
    n, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            d = 0.0
            for k in range(n):
                d += A[i,k] * B[k, j]
            C[i,j] = d
    return C

In [None]:
%timeit -r3 -n3 matrix_multiply_numba2(A, B)
%timeit -r3 -n3 matrix_multiply_numba_1(A, B)

### Example 2: Using nopython

#### Vectorized Python version

In [None]:
def mc_pi(n):
    x = np.random.uniform(-1, 1, (n,2))
    return 4*np.sum((x**2).sum(1) < 1)/n

In [None]:
n = int(1e6)

In [None]:
mc_pi(n)

In [None]:
%timeit mc_pi(n)

#### Numba on vectorized version

In [None]:
@jit
def mc_pi_numba(n):
    x = np.random.uniform(-1, 1, (n,2))
    return 4*np.sum((x**2).sum(1) < 1)/n

In [None]:
%timeit mc_pi_numba(n)

#### Using nopython

Using nopython mode, either with the `@njit` decorator or with `@jit(nopython=True)`, tells `numba` to compile the whole function to native code that works only on native machine types, without falling back to Python objects. If `numba` cannot do this, it raises an error. Enabling it is usually worthwhile, because the error points to the parts of your code that would otherwise run at interpreter speed.

In [None]:
@jit(nopython=True)
def mc_pi_numba_njit(n):
    x = np.random.uniform(-1, 1, (n,2))
    return 4*np.sum((x**2).sum(1) < 1)/n

In [None]:
%timeit mc_pi_numba_njit(n)

#### Numba on unrolled version

In [None]:
@jit(nopython=True)
def mc_pi_numba_unrolled(n):
    s = 0
    for i in range(n):
        x = np.random.uniform(-1, 1)
        y = np.random.uniform(-1, 1)
        if (x*x + y*y) < 1:
            s += 1
    return 4*s/n

In [None]:
mc_pi_numba_unrolled(n)

In [None]:
%timeit -r3 -n3 mc_pi_numba_unrolled(n)

### Using `cache=True`

This stores the compiled function in a file and avoids re-compilation on re-running a Python program.

In [None]:
@jit(nopython=True, cache=True)
def mc_pi_numba_unrolled_cache(n):
    s = 0
    for i in range(n):
        x = np.random.uniform(-1, 1)
        y = np.random.uniform(-1, 1)
        if (x*x + y*y) < 1:
            s += 1
    return 4*s/n

In [None]:
%timeit -r3 -n3 mc_pi_numba_unrolled_cache(n)

### Simple parallel loops with `numba`

In [None]:
from numba import njit, prange

In [None]:
@njit()
def sum_rows_range(A):
    s = 0
    for i in range(A.shape[0]):
        s += np.sum(np.exp(np.log(np.sqrt(A[i]**2.0))))
    return s

In [None]:
@njit(parallel=True)
def sum_rows_prange(A):
    s = 0
    for i in prange(A.shape[0]):
        s += np.sum(np.exp(np.log(np.sqrt(A[i]**2.0))))
    return s

In [None]:
A = np.random.randint(0, 10, (800, 100000))

In [None]:
A.shape

Run each function once so that compilation time is excluded from the benchmark.

In [None]:
sum_rows_range(A), sum_rows_prange(A)

In [None]:
%%time

sum_rows_range(A)

In [None]:
%%time

sum_rows_prange(A)

Using numba vectorize and guvectorize
----

Sometimes it is convenient to use `numba` to convert functions to vectorized functions for use in `numpy`. See [documentation](http://numba.pydata.org/numba-doc/dev/user/vectorize.html) for details.

In [None]:
from numba import int32, int64, float32, float64

### Using `vectorize`

In [None]:
@numba.vectorize()
def f(x, y):
    return np.sqrt(x**2 + y**2)

In [None]:
xs = np.random.random(10)
ys = np.random.random(10)

In [None]:
np.array([np.sqrt(x**2 + y**2) for (x, y) in zip(xs, ys)])

In [None]:
f(xs, ys)

### Adding function signatures

In [None]:
@numba.vectorize([float64(float64, float64),
                  float32(float32, float32),
                  float64(int64, int64),
                  float32(int32, int32)])
def f_sig(x, y):
    return np.sqrt(x**2 + y**2)

In [None]:
f_sig(xs, ys)

### Using `guvectorize` 

**Create our own version of inner1d**

Suppose we have two matrices, each with `m` rows. We may want to calculate a "row-wise" inner product, that is, generate a scalar for each pair of row vectors. We cannot use `@vectorize` because the elements are not scalars.

The *layout* `(n),(n)->()` says the function to be vectorized takes two `n`-element one-dimensional arrays `(n)` and returns a scalar `()`. The type *signature* is a list that matches the order of the *layout*. Note that the output is passed in as the last argument (`res`) and filled in place, rather than returned.

In [None]:
@numba.guvectorize([(float64[:], float64[:], float64[:])], '(n),(n)->()')
def nb_inner1d(u, v, res):
    res[0] = 0
    for i in range(len(u)):
        res[0] += u[i]*v[i]

In [None]:
xs = np.random.random((3,4))

In [None]:
nb_inner1d(xs, xs)

**Check**

In [None]:
from numpy.core.umath_tests import inner1d

In [None]:
inner1d(xs,xs)

#### Alternative to deprecated `inner1d` using Einstein summation notation

For more on how to use Einstein notation, see the `np.einsum` help documentation and [this tutorial](https://rockt.github.io/2018/04/30/einsum).

In [None]:
np.einsum('ij,ij->i', xs, xs)

In [None]:
%timeit -r3 -n3 nb_inner1d(xs, xs)

In [None]:
%timeit -r3 -n3 inner1d(xs, xs)

**Create our own version of matrix_multiply**

In [None]:
@numba.guvectorize([(int64[:,:], int64[:,:], int64[:,:])], 
                    '(m,n),(n,p)->(m,p)')
def nb_matrix_multiply(u, v, res):
    m, n = u.shape
    n, p = v.shape
    for i in range(m):
        for j in range(p):
            res[i,j] = 0
            for k in range(n):
                res[i,j] += u[i,k] * v[k,j]

In [None]:
xs = np.random.randint(0, 10, (5, 2, 3))
ys = np.random.randint(0, 10, (5, 3, 2))

In [None]:
nb_matrix_multiply(xs, ys)

**Check**

In [None]:
from numpy.core.umath_tests import matrix_multiply

In [None]:
matrix_multiply(xs, ys)

In [None]:
%timeit -r3 -n3 nb_matrix_multiply(xs, ys)

In [None]:
%timeit -r3 -n3 matrix_multiply(xs, ys)
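
As with `inner1d`, the stacked matrix product can be checked without importing from `numpy.core.umath_tests`. A minimal sketch, assuming only standard `numpy` functions: `np.matmul` multiplies over the last two axes of each stack, and `np.einsum` expresses the same contraction explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.integers(0, 10, (5, 2, 3))   # stack of five 2x3 matrices
ys = rng.integers(0, 10, (5, 3, 2))   # stack of five 3x2 matrices

via_matmul = np.matmul(xs, ys)                   # matrix product over the last two axes
via_einsum = np.einsum('bij,bjk->bik', xs, ys)   # same contraction, written explicitly

print(via_matmul.shape, np.array_equal(via_matmul, via_einsum))  # (5, 2, 2) True
```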

## Parallelization with vectorize and guvectorize

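Passing `target='parallel'` to `vectorize` or `guvectorize` runs the compiled function across multiple CPU cores. If you have an NVIDIA graphics card and CUDA drivers installed, you can also use `target='cuda'`.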

In [None]:
@numba.vectorize([float64(float64, float64),
                  float32(float32, float32),
                  float64(int64, int64),
                  float32(int32, int32)],
                 target='parallel')
def f_parallel(x, y):
    return np.sqrt(x**2 + y**2)

In [None]:
xs = np.random.random(int(1e8))
ys = np.random.random(int(1e8))

In [None]:
%timeit -r3 -n3 f(xs, ys)

In [None]:
%timeit -r3 -n3 f_parallel(xs, ys)

### Mandelbrot example with `numba`

**Pure Python**

In [None]:
# color function for point at (x, y)
def mandel(x, y, max_iters):
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z*z + c
        if z.real*z.real + z.imag*z.imag >= 4:
            return i
    return max_iters

In [None]:
def create_fractal(xmin, xmax, ymin, ymax, image, iters):
    height, width = image.shape
    
    pixel_size_x = (xmax - xmin)/width
    pixel_size_y = (ymax - ymin)/height
        
    for x in range(width):
        real = xmin + x*pixel_size_x
        for y in range(height):
            imag = ymin + y*pixel_size_y
            color = mandel(real, imag, iters)
            image[y, x]  = color    

In [None]:
gimage = np.zeros((1024, 1536), dtype=np.uint8)
xmin, xmax, ymin, ymax = np.array([-2.0, 1.0, -1.0, 1.0]).astype('float32')
iters = 50

start = time.time()
create_fractal(xmin, xmax, ymin, ymax, gimage, iters)
dt = time.time() - start

print("Mandelbrot created on CPU in %f s" % dt)
plt.grid(False)
plt.imshow(gimage, cmap='jet')
pass

**Numba**

In [None]:
from numba import uint32, float32

**The jit decorator can also be called as a regular function**

In [None]:
mandel_numba = jit(uint32(float32, float32, uint32))(mandel)

In [None]:
@jit
def create_fractal_numba(xmin, xmax, ymin, ymax, image, iters):
    height, width = image.shape
    
    pixel_size_x = (xmax - xmin)/width
    pixel_size_y = (ymax - ymin)/height
        
    for x in range(width):
        real = xmin + x*pixel_size_x
        for y in range(height):
            imag = ymin + y*pixel_size_y
            color = mandel_numba(real, imag, iters)
            image[y, x]  = color  

In [None]:
gimage = np.zeros((1024, 1536), dtype=np.uint8)
xmin, xmax, ymin, ymax = np.array([-2.0, 1.0, -1.0, 1.0]).astype('float32')
iters = 50

start = time.time()
create_fractal_numba(xmin, xmax, ymin, ymax, gimage, iters)
dt = time.time() - start

print("Mandelbrot created with Numba in %f s" % dt)
plt.grid(False)
plt.imshow(gimage, cmap='jet')
pass

#### Using `numba` with `ipyparallel`

Using `numba.jit` with `ipyparallel` is straightforward. See this [example](https://github.com/barbagroup/numba_tutorial_scipy2016/blob/master/notebooks/10.optional.Numba.and.ipyparallel.ipynb).