![Py4Eng](img/logo.png)

# Cython and Numba
## Yoav Ram

# Cython at a glance

[Cython](http://docs.cython.org/src/userguide/numpy_tutorial.html#cython-at-a-glance) is a compiler that compiles Python-like code files to C code. Still, *"Cython is not a Python to C translator"*. That is, it doesn't take your full program and "turn it into C"; rather, the result makes full use of the Python runtime environment. One way of looking at it is that your code is still Python in that it runs within the Python runtime environment, but rather than compiling to interpreted Python bytecode, one compiles to native machine code (with the addition of extra syntax for easy embedding of faster C-like code).

This has two important consequences:

- **Speed.** How much faster depends very much on the program involved, though. Typical Python numerical programs tend to gain very little, as most of the time is already spent in lower-level C code that is used in a high-level fashion. However, for-loop-style programs can gain many orders of magnitude when typing information is added, which makes such loops a realistic alternative.
- **Easy calling into C code.** One of Cython’s purposes is to allow easy wrapping of C libraries. When writing code in Cython you can call into C code as easily as into Python code.

Some Python constructs were not supported when this overview was written (among the more important omissions were inner functions and generator functions), though making Cython compile all Python code is a stated goal; recent Cython releases support both of these.

# Hello world!

Let's start with a simple *Hello World!* to check that everything is working.

We load the `Cython` magic, which allows us to quickly use Cython inside the notebook. The magic is installed with the `cython` package (`conda install cython` or `pip install cython`). Cython requires that a C compiler is installed and can be found. On Windows that may be tricky; here are some resources:

- [Install Cython on Windows](https://github.com/cython/cython/wiki/InstallingOnWindows)
- [Install C compiler on Windows](https://github.com/cython/cython/wiki/CythonExtensionsOnWindows)
- [Compiling Python extensions on Windows](https://blog.ionelmc.ro/2014/12/21/compiling-python-extensions-on-windows/)

On Linux/OSX you probably already have a compiler; check with `!gcc --version`. If `gcc` is not found, install it:

- On [Ubuntu](https://help.ubuntu.com/community/InstallingCompilers) you should run `sudo apt-get install build-essential`.
- On CentOS you just need to install development tools: `sudo yum group install "Development Tools"`.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import sys, os

import Cython
print("Cython", Cython.__version__)

%load_ext Cython

In [None]:
%%cython 
print("Hello World!")

# First Cython example

Let's see a quick example of what Cython can do for us. 

Consider the following Python [function that returns the first k prime numbers](http://docs.cython.org/en/latest/src/tutorial/cython_tutorial.html) (but not more than 1000):
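
A minimal pure-Python sketch of such a function is given here. The name `find_k_primes` matches the comparison cells in this notebook, but the body is an assumption modeled on the Cython version `cfind_k_primes`, not necessarily the original listing.

```python
def find_k_primes(k):
    """Return a list of the first k primes (capped at 1000 primes)."""
    primes = []
    candidate = 2
    while len(primes) < min(k, 1000):
        # trial division by every prime found so far
        for p in primes:
            if candidate % p == 0:
                break  # candidate has a prime factor, so it is not prime
        else:  # the loop ended without a break: candidate is prime
            primes.append(candidate)
        candidate += 1
    return primes

print(find_k_primes(10))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```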

# Example - Mandelbrot fractal

In [None]:
import numpy as np

def mandelbrot(m, size, iterations):    
    for i in range(size):
        for j in range(size):
            c = -2 + 3.0 / size * j + 1j * (1.5 - 3.0 / size * i)
            z = 0
            for n in range(iterations):
                if np.abs(z) <= 10:
                    z = z * z + c
                    m[i, j] = n
                else:
                    break

In [None]:
%%cython
def cmandelbrot(int[:,:] m,
                int size,
                int iterations):
    cdef int i, j, n
    cdef complex z, c
    for i in range(size):
        for j in range(size):
            c = -2 + 3.0 / size * j + 1j * (1.5 - 3.0 / size * i)
            z = 0
            for n in range(iterations):
                if z.real**2 + z.imag**2 <= 100: # note - no use of np.abs
                    z = z * z + c
                    m[i, j] = n
                else:
                    break

In [None]:
size = 200
iterations = 100
m = np.zeros((size, size), dtype=np.int32)

In [None]:
%timeit mandelbrot(m, size, iterations)
%timeit cmandelbrot(m, size, iterations)

In [None]:
mandelbrot(m, size, iterations)
plt.imshow(m, cmap='viridis')
plt.xticks([])
plt.yticks([]);

# Exercise - difference matrix

Write a Cython function that calculates the difference matrix for a given array.
Compare it to the NumPy implementation.

In [None]:
def diff_mat(x):
    return x.reshape(-1, 1) - x

In [None]:
x = np.random.random(10000)
assert np.allclose(diff_mat(x), cdiff_mat(x))
%timeit diff_mat(x)
%timeit cdiff_mat(x)

You can explore other [compiler directives](https://github.com/cython/cython/wiki/enhancements-compilerdirectives) such as `wraparound` and `nonecheck`.

# Numba

[Numba](http://numba.pydata.org) speeds up functions written directly in Python. 
With a few annotations, array-oriented and math-heavy Python code can be *just-in-time* (JIT) compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.

Roughly, [JIT](https://en.wikipedia.org/wiki/Just-in-time_compilation) compilation combines the speed of compiled code with the flexibility of interpretation, at the cost of the overhead of an interpreter plus the additional overhead of compiling (not just interpreting) at run time.

Numba can also release the GIL, which allows multithreading in CPU-bound applications, and it can automatically parallelize code; see the details in a [blog post](https://www.anaconda.com/blog/developer-blog/parallel-python-with-numba-and-parallelaccelerator/) by Anaconda.

In [None]:
import numba
print('Numba', numba.__version__)

In [None]:
@numba.jit()
def numandelbrot(m, size, iterations):    
    for i in range(size):
        for j in range(size):
            c = -2 + 3.0 / size * j + 1j * (1.5 - 3.0 / size * i)
            z = 0
            for n in range(iterations):
                if np.abs(z) <= 10:
                    z = z * z + c
                    m[i, j] = n
                else:
                    break
numandelbrot(m, size, iterations); # run once for jit to work

In [None]:
size = 1000
iterations = 1000
m = np.zeros((size, size), dtype=np.int32)

# %timeit mandelbrot(m, size, iterations)
%timeit cmandelbrot(m, size, iterations)
%timeit numandelbrot(m, size, iterations)

# Exercise - Numba

Consider the following NumPy function, which calculates the mean squared error of two arrays:

In [None]:
def mean_squared_error_np(yhat, y):
    return ((yhat - y)**2).mean()

Write a pure-Python version of this function, then JIT it with Numba.

In [None]:
def mean_squared_error_py(yhat, y):
    pass

def mean_squared_error_nm(yhat, y):
    pass

In [None]:
n = 100000
y = np.random.random(n)
yhat = np.random.random(n)
%timeit mean_squared_error_np(yhat, y)
%timeit mean_squared_error_py(yhat, y)
%timeit mean_squared_error_nm(yhat, y)

# Releasing the GIL

Cython can be directed to [release the GIL](http://docs.cython.org/src/userguide/external_C_code.html#acquiring-and-releasing-the-gil), thus enabling multiple threads to run in parallel (if the OS allows it). This can give a performance gain even compared to multi-processing, as threads are cheaper than processes and communication between threads is faster than communication between processes.
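
The effect of releasing the GIL can be sketched with the standard library alone, because CPython's `hashlib` releases the GIL while hashing large buffers. The helper name `digest`, the buffer sizes and the number of buffers below are illustrative assumptions; whether the threaded run is actually faster depends on how many cores are available.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

# eight 4 MB buffers; hashing each one is CPU-bound work done in C
buffers = [bytes([i]) * (4 * 1024 * 1024) for i in range(8)]

def digest(buf):
    h = hashlib.sha256()
    h.update(buf)  # the GIL is dropped during this call for large inputs
    return h.hexdigest()

start = time.perf_counter()
serial = [digest(buf) for buf in buffers]
serial_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor() as executor:
    threaded = list(executor.map(digest, buffers))
threaded_time = time.perf_counter() - start

print("same results:", serial == threaded)
print("serial: {:.3f}s, threaded: {:.3f}s".format(serial_time, threaded_time))
```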

In [None]:
def display_image(im):
    plt.imshow(im, cmap='gray')
    plt.xticks([])
    plt.yticks([])

In [None]:
%pwd

Let's do a segmentation demonstration.

In [None]:
# original image from https://upload.wikimedia.org/wikipedia/commons/5/56/Kobe_Bryant_2014.jpg
import imageio
image = imageio.imread('../data/Kobe_Bryant_2014.jpg')
image = image.mean(axis=2) # greyscale
display_image(image)
image.dtype, image.shape, image.min(), image.max()

In [None]:
def segment(image, threshold):
    output = np.zeros_like(image)
    output[image > threshold] = 255
    return output

display_image(segment(image, 100))

## Cython no-gil

In [None]:
%%cython
import numpy as np
import cython 

# if you comment this out, cython will warn you to add it for faster access
@cython.boundscheck(False) 
cdef void _segment(double[:,:] image, int n, int m, 
                   double threshold, double[:,:] output) nogil: # note the "nogil" directive
    cdef int i, j
    for i in range(n):
        for j in range(m):
            if image[i, j] > threshold:
                output[i, j] = 255
            else:
                output[i, j] = 0

def csegment(image, threshold):
    output = np.zeros_like(image)
    n, m = image.shape
    _segment(image, n , m, threshold, output)
    return output

In [None]:
display_image(csegment(image, 100))

In [None]:
%timeit segment(image, 100)
%timeit csegment(image, 100)

So performance on a single image is similar. What about using multi-threading to segment a bunch of images?

Let's download the 30 example images from the [PASCAL VOC 2012 dataset](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/segexamples/index.html).

![example](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/segexamples/images/05.jpg)

In [None]:
image_urls = [
    'http://host.robots.ox.ac.uk/pascal/VOC/voc2012/segexamples/images/{:02d}.jpg'.format(i)
    for i in range(1,31)
]

In [None]:
display_image(imageio.imread(image_urls[4]))

Let's load and flatten the images:

And the actual C file:

In [None]:
%less primes.c

You can now import `primes` as if it were a regular Python module (if you aren't sure which version you are importing, restart your kernel, but don't forget to change directory to `../scripts/cython`).

In [None]:
import primes

In [None]:
primes.cfind_k_primes(10)

# Example - integration

This is from Robert Bradshaw's [SciPy 2008 slides](http://wiki.sagemath.org/scipy08?action=AttachFile&do=get&target=scipy-cython.tgz).

We will write a simple integration code to integrate $f(x) = x^3 - 3x$:

In [None]:
def f(x):
    return x**3 - 3 * x

def integrate_f(a , b , N):
    s = 0
    dx = (b - a )/ N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

In [None]:
%timeit integrate_f(0, 1, 100000)
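
As a sanity check, note that `integrate_f` computes a left Riemann sum, $\sum_{i=0}^{N-1} f(a + i\,\Delta x)\,\Delta x$ with $\Delta x = (b - a)/N$, and that for this $f$ the exact integral over $[0, 1]$ is $\frac{1}{4} - \frac{3}{2} = -1.25$. The sketch below repeats the two definitions so that it runs on its own; the tolerance is an arbitrary choice.

```python
def f(x):
    return x**3 - 3 * x

def integrate_f(a, b, N):
    # left Riemann sum with N equal-width rectangles
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

approx = integrate_f(0, 1, 100000)
exact = 1 / 4 - 3 / 2  # antiderivative x**4/4 - 3*x**2/2 evaluated from 0 to 1
print(approx, exact)
assert abs(approx - exact) < 1e-4
```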

Now, the same with Cython. Note that when we define a function with `cdef` we can declare its return type, but we can only call it from within Cython. When we define a function with `def` we can import it and call it from Python.

Note that we change `x**3` to `x * x * x`.

In [None]:
%%cython
cdef double g(double x):
    return x * x * x - 3* x

def integrate_g(double a , double  b , int N):
    cdef double s = 0
    cdef double dx = (b - a )/ N
    cdef int i
    for i in range(N):
        s += g( a + i * dx )
    return s * dx

In [None]:
%timeit integrate_g(0, 1, 100000)

Now let's integrate 
$$
\int_a^b \frac{\sin{x}}{x} \, dx
$$

Without Cython, we should use either `math.sin` or `numpy.sin`:

In [None]:
def f(x):
    return np.sin(x) / x

integrate_f(1, 2, 100000)

With Cython, we can import some C functions using an `extern` block:

In [None]:
%%cython
cdef extern from "math.h":
    double sin(double)
    double cos(double)

cdef double g(double x):
    return sin(x)/x

def integrate_g(double a , double  b , int N):
    cdef double s = 0
    cdef double dx = (b - a) /  N
    cdef int i
    for i in range(N):
        s += g(a + i * dx)
    return s * dx

In [None]:
%timeit integrate_f(1, 3, 100000)
%timeit integrate_g(1, 3, 100000)

# Cython + NumPy

Cython works well with NumPy.

Let's loop over a NumPy array:

In [None]:
def summ(x):
    s = 0
    for i in range(x.shape[0]):
        s += x[i]
    return s

In [None]:
%%cython
cimport numpy as np

def csumm(long[:] x): # input type is a buffer
    cdef np.ndarray[long, ndim=1] arr = x # put buffer in a cython numpy array
    cdef int i = 0
    cdef long s = 0
    for i in range(arr.shape[0]):
        s += arr[i]
    return s

In [None]:
x = np.random.randint(0, 9, 100000)
%timeit summ(x)
%timeit csumm(x)

Note that if you typed `s` and `arr` as C `int`, you would get an error, because the integers in the NumPy array are C `long` values here, not C `int`. The error would be a `ValueError`, though, not a segmentation fault or anything like that. The same happens for any array of the wrong type; here's an example:

In [None]:
y = np.array([0.1, 0.2, 0.3])
csumm(y)

In [None]:
images = [imageio.imread(url) for url in image_urls]
images = [im.mean(axis=2) for im in images]

Let's also resize the images so that the segmentation task is harder.

In [None]:
from skimage.transform import resize

In [None]:
shapes = [im.shape for im in images]
images = [resize(im, (w*12, h*12), mode='reflect') 
          for im, (w, h) in zip(images, shapes)]

First compare the NumPy and Cython versions:

In [None]:
%timeit [segment(im, 100) for im in images]
%timeit [csegment(im, 100) for im in images]

No real difference, of course.
Now let's do it with multi-threading, leveraging the `nogil` directive.

In [None]:
from concurrent.futures import ThreadPoolExecutor

def segment_parallel(images, threshold):
    def _segment(im):
        return csegment(im, threshold)
    
    with ThreadPoolExecutor() as executor:
        results = executor.map(_segment, images)
        return list(results)

In [None]:
%timeit [csegment(im, 100) for im in images]
%timeit segment_parallel(images, 100)

You can look at the process monitor (or `top`/`htop` on Linux/Mac) while the next two cells run to see that the first uses fewer cores than the second.

In [None]:
for _ in range(10): 
    [csegment(im, 100) for im in images];

In [None]:
for _ in range(10): 
    segment_parallel(images, 100);

## Numba no-gil

In [None]:
@numba.jit(nopython=True, nogil=True)
def _nmsegment(image, threshold, output):
    n, m = image.shape
    for i in range(n):
        for j in range(m):
            if image[i, j] > threshold:
                output[i, j] = 255
            else:
                output[i, j] = 0
    return output

def nmsegment(image, threshold):
    output = np.empty_like(image)
    _nmsegment(image, threshold, output)
    return output

display_image(nmsegment(image, 100))

In [None]:
%timeit segment(images[0], 100)
%timeit csegment(images[0], 100)
%timeit nmsegment(images[0], 100)

In [None]:
from concurrent.futures import ThreadPoolExecutor

def segment_parallel(images, threshold):  
    def func(image):
        return nmsegment(image, threshold)
    
    with ThreadPoolExecutor() as executor:        
        results = executor.map(func, images)
        return list(results)

In [None]:
%timeit [nmsegment(im, 100) for im in images]
%timeit segment_parallel(images, 100)

# Numba automatic parallelization

Numba can automatically parallelize for loops across multiple threads; this feature grew out of Intel's ParallelAccelerator project.

Consider the function for calculation of MSE:

In [None]:
@numba.jit
def mean_squared_error_nm(yhat, y):
    mse = 0
    n = len(y)
    for i in range(n):
        mse += (yhat[i] - y[i])**2
    mse /= n
    return mse

We can parallelize it by adding the `parallel=True` directive (which requires that we can set `nopython=True` without getting an error) and using `numba.prange` instead of `range`:

In [None]:
%%cython
def cfind_k_primes(int k):
    cdef int n_primes, candidate, p
    cdef int[1000] primes
    n_primes = 0  # the current number of elements in primes
    candidate = 2
    while n_primes < min(k, 1000):
        # is candidate prime?
        for p in primes[:n_primes]:
            if candidate % p == 0:
                break # not a prime        
        else: # if no break occurred in the loop, we have a prime
            primes[n_primes] = candidate
            n_primes += 1
        candidate += 1

    # convert primes from a cython type to a python list
    return [p for p in primes[:n_primes]] 

In [None]:
find_k_primes(100) == cfind_k_primes(100)

In [None]:
n = 1000
%timeit find_k_primes(n)
%timeit cfind_k_primes(n)

Note that the inner loop refers only to variables with type definitions, which are therefore C objects; as a result, the loop is translated to a C loop and runs very fast.

# Creating a Cython module

We can now put that Cython code into a separate file with extension `pyx`. 

In [None]:
!mkdir ../scripts/cython
%cd ../scripts/cython

In [None]:
%%file primes.pyx
def cfind_k_primes(int k):
    cdef int n_primes, candidate, p
    cdef int[1000] primes
    n_primes = 0  # the current number of elements in primes
    candidate = 2
    while n_primes < min(k, 1000):
        # is candidate prime?
        for p in primes[:n_primes]:
            if candidate % p == 0:
                break # not a prime        
        else: # if no break occurred in the loop, we have a prime
            primes[n_primes] = candidate
            n_primes += 1
        candidate += 1

    # convert primes from a cython type to a python list
    return [p for p in primes[:n_primes]] 

Now we compile it and import it in a single stroke using the [`pyximport` module](http://docs.cython.org/en/latest/src/reference/compilation.html#pyximport).

> Cython code, unlike Python, must be compiled.
> This happens in two stages:
> A .pyx file is compiled by Cython to a .c file.
> The .c file is compiled by a C compiler to a .so file (or a .pyd file on Windows).

In [None]:
import pyximport
pyximport.install()

In [None]:
from primes import cfind_k_primes

In [None]:
len(cfind_k_primes(100)) == 100

You can also do this on your own, without `pyximport`. This is useful when shipping or if you just want to see the C file - for example, to see how much work Cython saved you from doing!

You start by writing a `setup.py` file which imports `cythonize` from the Cython package and tells `setup` (the standard way to set up Python packages, using `distutils` or `setuptools`) to build an extension module using `cythonize`.
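
A minimal sketch of such a `setup.py`, assuming the `primes.pyx` file written above sits in the same directory and that `setuptools` is installed:

```python
# setup.py
from setuptools import setup
from Cython.Build import cythonize

# cythonize translates primes.pyx to primes.c and returns the extension
# modules that setup then compiles with the C compiler
setup(ext_modules=cythonize("primes.pyx"))
```

Running `python setup.py build_ext --inplace` in that directory should then produce `primes.c` and the compiled extension module next to the source file.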

In [None]:
@numba.jit(parallel=True, nopython=True)
def mean_squared_error_pr(yhat, y):
    mse = 0
    n = len(y)
    for i in numba.prange(n):
        mse += (yhat[i] - y[i])**2
    mse /= n
    return mse

In [None]:
n = 1000000
y = np.random.random(n)
yhat = np.random.random(n)
%timeit mean_squared_error_nm(yhat, y)
%timeit mean_squared_error_pr(yhat, y)

Roughly two-fold faster - which makes sense on my 2-CPU machine.
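
Conceptually, a parallel reduction like this one splits the index range into chunks, accumulates a private partial sum per chunk, and combines the partial sums at the end. The plain-Python sketch below mimics that idea with a thread pool. The names `chunked_mse` and `n_chunks` are hypothetical, and the code illustrates the pattern only, not Numba's actual implementation; pure-Python threads still hold the GIL, so no speedup is expected here.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def chunked_mse(yhat, y, n_chunks=4):
    n = len(y)
    # chunk boundaries that cover range(n) exactly once
    bounds = [(c * n // n_chunks, (c + 1) * n // n_chunks)
              for c in range(n_chunks)]

    def partial_sum(bound):
        start, stop = bound
        s = 0.0  # private accumulator for this chunk
        for i in range(start, stop):
            s += (yhat[i] - y[i]) ** 2
        return s

    with ThreadPoolExecutor(max_workers=n_chunks) as executor:
        partials = list(executor.map(partial_sum, bounds))
    return sum(partials) / n  # combine the partial sums, then average

random.seed(0)
y_demo = [random.random() for _ in range(10000)]
yhat_demo = [random.random() for _ in range(10000)]
serial = sum((a - b) ** 2 for a, b in zip(yhat_demo, y_demo)) / len(y_demo)
print(chunked_mse(yhat_demo, y_demo), serial)
```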

# Numba stencils (local filters)

Numba's `stencil` decorator works similarly to SciPy's `generic_filter`, but uses the JIT capabilities of Numba.
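
To make "local filter" concrete, here is a plain-Python sketch of the kind of computation a stencil expresses: every interior output pixel is computed from a fixed neighbourhood of the input. The function `mean_filter_5pt` is a hypothetical illustration written with nested lists; it does not use Numba's API, and leaving the border at zero is a choice made for this sketch.

```python
def mean_filter_5pt(image):
    """Average each interior pixel with its four direct neighbours."""
    n, m = len(image), len(image[0])
    output = [[0.0] * m for _ in range(n)]  # border pixels stay 0
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            output[i][j] = (image[i][j] + image[i - 1][j] + image[i + 1][j]
                            + image[i][j - 1] + image[i][j + 1]) / 5
    return output

image_demo = [[0, 0, 0],
              [0, 5, 0],
              [0, 0, 0]]
print(mean_filter_5pt(image_demo))  # the centre pixel becomes 1.0
```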

Let's start by adding some noise to the images: