Cython
====

Cython is an "optimizing static compiler" that combines Python with C to generate optimized code. Since Cython is a superset of Python, almost all valid Python programs are also valid Cython programs. However, by providing hints and static typing, we can get much faster programs. Note that while `numba` often provides similar speedups with less work, an advantage of Cython is that optimized Cython modules are easy to distribute, since they can be built with the standard Python `setup.py` script.

We have already seen how to use Cython to wrap C and C++ functions from existing libraries. Here we will see how to use Cython to speed up Python functions. 

**Utility function for timing functions**

In [1]:
import time

In [2]:
def timer(f, *args, **kwargs):
    # time.clock() was removed in Python 3.8; time.perf_counter() replaces it
    start = time.perf_counter()
    ans = f(*args, **kwargs)
    return ans, time.perf_counter() - start

In [3]:
def report(fs, *args, **kwargs):
    # Print the speedup of each function in fs[1:] relative to the baseline fs[0]
    ans, t = timer(fs[0], *args, **kwargs)
    for f in fs[1:]:
        ans_, t_ = timer(f, *args, **kwargs)
        print('%s: %.1f' % (f.__name__, t/t_))
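
As a quick illustration of how these helpers might be used, the sketch below times two interchangeable implementations of the same computation and prints the same kind of ratio that `report` prints. The functions `sum_loop` and `sum_builtin` are hypothetical stand-ins that do not appear elsewhere in this notebook, and `time.perf_counter` is assumed as the clock.

```python
import time

def timer(f, *args, **kwargs):
    # Return the function's result together with the elapsed wall-clock time
    start = time.perf_counter()
    ans = f(*args, **kwargs)
    return ans, time.perf_counter() - start

def sum_loop(n):
    # Hypothetical baseline: explicit Python loop
    total = 0
    for i in range(n):
        total += i
    return total

def sum_builtin(n):
    # Hypothetical alternative: built-in sum over a range
    return sum(range(n))

ans_loop, t_loop = timer(sum_loop, 100000)
ans_builtin, t_builtin = timer(sum_builtin, 100000)
print('same answer:', ans_loop == ans_builtin)
# Ratio > 1 means sum_builtin was faster than the baseline
print('sum_builtin: %.1f' % (t_loop / t_builtin))
```

Checking that the answers agree before comparing times is a cheap guard against "optimizing" a function into computing something different.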

Incremental improvements
----

Generally, we start with a pure Python function, run it through Cython with the annotate flag `-a`, and incrementally modify the code until the yellow parts of the annotated output, which mark lines that interact with the Python interpreter, are minimized.

### Fibonacci example

In [4]:
def fib(n):
    a, b = 0, 1
    for i in range(n):
        a, b = a+b, a
    return a

In [5]:
%timeit -r2 -n3 fib(30)

3 loops, best of 2: 4.45 µs per loop


How to build Cython modules
----

From the [official docs](http://docs.cython.org/index.html):

Using Cython consists of these steps:

- Write a .pyx source file
- Run the Cython compiler to generate a C file
- Run a C compiler to generate a compiled library
- Run the Python interpreter and ask it to import the module
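
Outside the notebook, the second and third steps are commonly driven by a small `setup.py`. The following is a minimal sketch, assuming the Cython source lives in a hypothetical file named `fib.pyx` in the same directory; the module name and file name are placeholders, not something defined in this notebook.

```python
# setup.py: a minimal build script for a single Cython module.
# Assumption: a file named fib.pyx (hypothetical) sits next to this script.
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="fib",
    # cythonize() translates the .pyx file to C and returns Extension objects;
    # annotate=True also writes the HTML report that the -a flag produces.
    ext_modules=cythonize("fib.pyx", annotate=True),
)
```

Running `python setup.py build_ext --inplace` should then compile the extension next to the source, after which `import fib` works from that directory.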

In the Jupyter notebook, we can use the `%%cython` cell magic to automate these steps.

In [6]:
%load_ext cython

In [7]:
%%cython

def fib1(n):
    a, b = 0, 1
    for i in range(n):
        a, b = a+b, a
    return a

In [8]:
%timeit -r2 -n3 fib1(30)

3 loops, best of 2: 2.53 µs per loop


### Using Cython annotations to identify bottlenecks

In [9]:
%%cython -a

def fib2(n):
    a, b = 0, 1
    for i in range(n):
        a, b = a+b, a
    return a

In [10]:
%%cython -a

def fib2(int n):
    cdef long a, b
    cdef int i  # type the loop counter as well
    a, b = 0, 1
    for i in range(n):
        a, b = a+b, a
    return a

In [11]:
%timeit -r2 -n3 fib2(30)

3 loops, best of 2: 231 ns per loop
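
One thing the typed version trades away is Python's arbitrary-precision integers: a C `long` holds a fixed number of bits, so `fib2` will overflow (typically wrapping around silently) once the result no longer fits. The pure-Python sketch below estimates where that happens, assuming a 64-bit signed `long`; on some platforms, such as 64-bit Windows, `long` is only 32 bits. The helper name `largest_safe_n` is made up for this illustration.

```python
def fib(n):
    a, b = 0, 1
    for i in range(n):
        a, b = a + b, a
    return a

INT64_MAX = 2**63 - 1  # assumption: C long is a 64-bit signed integer

def largest_safe_n(limit=INT64_MAX):
    # Largest n such that fib(n) still fits below the given limit
    n = 0
    while fib(n + 1) <= limit:
        n += 1
    return n

print(largest_safe_n())           # 64-bit long
print(largest_safe_n(2**31 - 1))  # 32-bit long
```

For the `fib(30)` benchmark used here the result is far below either limit, so the typed and untyped versions return the same value.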


### pi_sum example

In [12]:
def pisum():
    sum = 0.0
    for j in range(1, 501):
        sum = 0.0
        for k in range(1, 10001):
            sum += 1.0/(k*k)
    return sum

In [13]:
%timeit -r2 -n3 pisum()

3 loops, best of 2: 1.37 s per loop
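
For reference, the inner loop is a partial sum of the series 1/k² (k = 1, 2, ...), which converges to π²/6; the outer loop only repeats the same work to make the benchmark longer. A small sketch checking this, with hypothetical `repeats` and `terms` parameters added so that the check runs quickly:

```python
import math

def pisum(repeats=500, terms=10000):
    # Same computation as above, with the loop bounds exposed as parameters
    total = 0.0
    for j in range(repeats):
        total = 0.0
        for k in range(1, terms + 1):
            total += 1.0 / (k * k)
    return total

approx = pisum(repeats=1)
exact = math.pi ** 2 / 6
# The truncation error after N terms is roughly 1/N
print(approx, exact, exact - approx)
```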


In [14]:
%%cython -a

def pisum1():
    sum = 0.0
    for j in range(1, 501):
        sum = 0.0
        for k in range(1, 10001):
            sum += 1.0/(k*k)
    return sum

In [15]:
%timeit -r2 -n3 pisum1()

3 loops, best of 2: 623 ms per loop


In [16]:
%%cython -a

def pisum1():
    cdef double sum = 0.0
    cdef int j, k
    for j in range(1, 501):
        sum = 0.0
        for k in range(1, 10001):
            sum += 1.0/(k*k)
    return sum

In [17]:
%timeit -r2 -n3 pisum1()

3 loops, best of 2: 35.9 ms per loop


### Using Cython compiler directives

In [18]:
%%cython -a
import cython

@cython.cdivision(True) 
def pisum2():
    cdef double sum = 0.0
    cdef int j, k
    for j in range(1, 501):
        sum = 0.0
        for k in range(1, 10001):
            sum += 1.0/(k*k)
    return sum

In [19]:
%timeit -r2 -n3 pisum2()

3 loops, best of 2: 35.9 ms per loop
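
The timing is essentially unchanged here, likely because the division in `pisum2` is floating-point and the check being removed is cheap compared with the rest of the loop. What `cdivision(True)` does is switch division and modulo to C semantics: the test that raises `ZeroDivisionError` is dropped, and integer division truncates toward zero instead of flooring. The pure-Python sketch below imitates the integer behaviour; `c_style_div` and `c_style_mod` are hypothetical helpers written for this comparison and are not part of Cython.

```python
def c_style_div(a, b):
    # Truncate toward zero, as C integer division does
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

def c_style_mod(a, b):
    # The C remainder takes the sign of the dividend
    return a - b * c_style_div(a, b)

print(-7 // 2, -7 % 2)                         # Python semantics: -4 1
print(c_style_div(-7, 2), c_style_mod(-7, 2))  # C semantics: -3 -1
```

The two conventions agree whenever both operands are non-negative, which is the case for the loop indices used in this notebook.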


### Mandel example

In [20]:
import numpy as np

def mandel(z):
    maxiter = 80
    c = z
    for n in range(maxiter):
        if abs(z) > 2:
            return n
        z = z*z + c
    return maxiter

def mandelperf():
    r1 = np.linspace(-2.0, 0.5, 26)
    r2 = np.linspace(-1.0, 1.0, 21)
    return [mandel(complex(r, i)) for r in r1 for i in r2]

In [21]:
%timeit -r2 -n3 mandelperf()

3 loops, best of 2: 7.66 ms per loop


In [22]:
%%cython -a

import numpy as np

def mandel1(z):
    maxiter = 80
    c = z
    for n in range(maxiter):
        if abs(z) > 2:
            return n
        z = z*z + c
    return maxiter

def mandelperf1():
    r1 = np.linspace(-2.0, 0.5, 26)
    r2 = np.linspace(-1.0, 1.0, 21)
    return [mandel1(complex(r, i)) for r in r1 for i in r2]

In [23]:
%%cython -a

cimport numpy as np
import numpy as np

cdef int mandel1(double complex z):
    cdef int maxiter, n
    cdef double complex c

    maxiter = 80
    
    c = z
    for n in range(maxiter):
        if z.imag**2 + z.real**2 > 4:
            return n
        z = z*z + c
    return maxiter

def mandelperf1():
    r1 = np.linspace(-2.0, 0.5, 26)
    r2 = np.linspace(-1.0, 1.0, 21)
    return [mandel1(complex(r, i)) for r in r1 for i in r2]

In [24]:
%timeit -r2 -n3 mandelperf1()

3 loops, best of 2: 392 µs per loop
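
The typed version replaces the test `abs(z) > 2` with `z.imag**2 + z.real**2 > 4`. The two are mathematically equivalent, since |z| > 2 exactly when |z|² > 4, and the squared form avoids a square root and a call to Python's `abs`. The sketch below compares both tests in pure Python on the same grid; the `linspace` helper is a dependency-free stand-in for `np.linspace`, and the names `mandel_abs` and `mandel_sq` are introduced only for this comparison.

```python
def linspace(start, stop, num):
    # Pure-Python stand-in for np.linspace, to keep the sketch dependency-free
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

def mandel_abs(z, maxiter=80):
    c = z
    for n in range(maxiter):
        if abs(z) > 2:
            return n
        z = z * z + c
    return maxiter

def mandel_sq(z, maxiter=80):
    c = z
    for n in range(maxiter):
        if z.imag ** 2 + z.real ** 2 > 4:
            return n
        z = z * z + c
    return maxiter

points = [complex(r, i) for r in linspace(-2.0, 0.5, 26)
          for i in linspace(-1.0, 1.0, 21)]
mismatches = [z for z in points if mandel_abs(z) != mandel_sq(z)]
print(len(points), 'points,', len(mismatches), 'mismatches')
```

Floating-point rounding could in principle make the two tests disagree for a point lying almost exactly on the circle |z| = 2, so a comparison like this is a sanity check rather than a proof.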


### Matrix multiplication example

In [25]:
def matrix_multiply(u, v, res):
    m, n = u.shape
    n, p = v.shape
    for i in range(m):
        for j in range(p):
            res[i,j] = 0
            for k in range(n):
                res[i,j] += u[i,k] * v[k,j]
    return res

In [26]:
u = np.random.random((10,20))
v = np.random.random((20,5))

In [27]:
res = np.zeros((u.shape[0], v.shape[1]))
matrix_multiply(u, v, res)

array([[ 6.23206064,  5.02617378,  5.24195758,  2.99169761,  3.9401661 ],
       [ 5.44535991,  4.27051957,  5.09974356,  3.19339314,  4.4845852 ],
       [ 6.52695763,  4.9649044 ,  5.66479652,  3.73440473,  5.18981681],
       [ 5.47233217,  4.18485264,  4.34374265,  3.13801525,  4.27412098],
       [ 6.35742705,  5.02912642,  5.92758324,  3.56743263,  4.73902924],
       [ 5.57344818,  4.22251167,  4.12379554,  2.30636947,  4.08916241],
       [ 3.61588766,  2.28101131,  3.1266391 ,  2.32380826,  3.02340269],
       [ 4.96325909,  3.24325894,  4.12918062,  2.34622065,  3.23494718],
       [ 5.561466  ,  4.25910586,  4.87880398,  3.26787812,  4.11423797],
       [ 4.88445295,  4.11701615,  4.73502154,  3.06760703,  4.06496649]])

In [28]:
res = np.zeros((u.shape[0], v.shape[1]))
%timeit -r3 -n3 matrix_multiply(u, v, res)

3 loops, best of 3: 1.02 ms per loop
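
Before optimizing, it is worth confirming that the triple loop really computes a matrix product. A minimal check against NumPy's own `@` operator, using fresh arrays (`a`, `b`, `out`, with an arbitrarily chosen seed) so that the notebook's `u`, `v` and `res` are left alone:

```python
import numpy as np

def matrix_multiply(u, v, res):
    m, n = u.shape
    n, p = v.shape
    for i in range(m):
        for j in range(p):
            res[i, j] = 0
            for k in range(n):
                res[i, j] += u[i, k] * v[k, j]
    return res

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for this sketch
a = rng.random((10, 20))
b = rng.random((20, 5))
out = np.zeros((a.shape[0], b.shape[1]))

# Compare the hand-written loops with NumPy's built-in matrix product
print(np.allclose(matrix_multiply(a, b, out), a @ b))
```

The same comparison can be repeated after each Cython rewrite to make sure the speedup has not changed the result.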


In [29]:
%%cython -a

import numpy as np

def matrix_multiply1(u, v, res):
    m, n = u.shape
    n, p = v.shape
    for i in range(m):
        for j in range(p):
            res[i,j] = 0
            for k in range(n):
                res[i,j] += u[i,k] * v[k,j]
    return res

In [30]:
%%cython -a

import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def matrix_multiply1(double[:,:] u, double[:, :] v, double[:, :] res):
    cdef int i, j, k
    cdef int m, n, p

    m = u.shape[0]
    n = u.shape[1]
    p = v.shape[1]

    with cython.nogil:
        for i in range(m):
            for j in range(p):
                res[i,j] = 0
                for k in range(n):
                    res[i,j] += u[i,k] * v[k,j]

In [31]:
res = np.zeros((u.shape[0], v.shape[1]))
%timeit -r3 -n3 matrix_multiply1(u, v, res)

3 loops, best of 3: 13.4 µs per loop
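
Two practical consequences of the typed signature are worth noting. The memoryview arguments (`double[:, :]`) accept only two-dimensional buffers of C doubles, so an integer or `float32` array is rejected with a buffer dtype mismatch error, and the function now writes into `res` in place without returning it. A small sketch of preparing inputs accordingly; `as_double_matrix` is a hypothetical helper, not part of Cython or NumPy:

```python
import numpy as np

def as_double_matrix(a):
    # Hypothetical helper: coerce input to a 2-D float64 array
    arr = np.asarray(a, dtype=np.float64)
    if arr.ndim != 2:
        raise ValueError('expected a 2-D array, got %d-D' % arr.ndim)
    return arr

ints = np.arange(6).reshape(2, 3)  # integer dtype, not accepted as double[:, :]
doubles = as_double_matrix(ints)
print(ints.dtype, '->', doubles.dtype)
```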


### Parallel execution with Cython

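The cell below requires a C compiler with OpenMP support (the `-fopenmp` flag is passed to both the compile and link steps); without it, the cell will fail to build.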

In [32]:
%%cython --compile-args=-fopenmp --link-args=-fopenmp --force

import cython
from cython.parallel import parallel, prange

@cython.boundscheck(False)
@cython.wraparound(False)
def matrix_multiply2(double[:,:] u, double[:, :] v, double[:, :] res):
    cdef int i, j, k
    cdef int m, n, p

    m = u.shape[0]
    n = u.shape[1]
    p = v.shape[1]

    with cython.nogil, parallel():
        for i in prange(m):
            for j in prange(p):
                res[i,j] = 0
                for k in range(n):
                    res[i,j] += u[i,k] * v[k,j]

In [33]:
res = np.zeros((u.shape[0], v.shape[1]))
%timeit -r3 -n3 matrix_multiply2(u, v, res)

The slowest run took 12.73 times longer than the fastest. This could mean that an intermediate result is being cached.
3 loops, best of 3: 12.7 µs per loop
