# Cython: C made painless

* Python: fast development, slow execution

* C/C++/Fortran: slow development, fast execution

## Why is Python execution slow?

It's all too dynamic.

* Runtime interprets the bytecode.

* Everything is an object (boxing/unboxing)

* Function calls are expensive

* Global interpreter lock (GIL)

Python has a well-defined C API, which can be used to move computationally expensive parts into compiled code: *C extensions*.

Examples: numpy, scipy, scikit-learn, lxml, Sage, ZeroMQ, ...

### Human user $\Longleftrightarrow$ Python runtime $\Longleftrightarrow$ C extensions.

The idea is to keep the user interface, database, web, visualization and so on in Python.

Writing C extensions manually can be daunting: all the pleasures of manual memory management, *plus* reference counting, parsing Python arguments, etc.

https://docs.python.org/3.8/c-api/

E.g. Paul Ross's http://pythonextensionpatterns.readthedocs.io/en/latest/index.html

## Enter Cython

Cython (http://cython.org) is a static compiler from a superset of Python to C (or C++).

### Human user $\Longleftrightarrow$ Python runtime $\Longleftrightarrow$ Cython $\Longleftrightarrow$ a C extension.


If you already have C/C++/Fortran code, expose it to Python by *wrapping* it with Cython.

Otherwise, 

1. build a prototype in pure python, 
2. profile to identify hotspots,
3. move hotspots to Cython,
4. Profit!
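
Step 2 can be done with the standard-library profiler. The following is a minimal sketch, not part of the original example: `slow_part` and `prototype` are hypothetical stand-ins for real code.

```python
import cProfile
import io
import pstats


def slow_part(n):
    """Hypothetical hotspot: a pure-Python loop."""
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total


def prototype():
    """Hypothetical prototype that spends most of its time in slow_part."""
    return sum(slow_part(20_000) for _ in range(20))


# Collect timing data while the prototype runs.
profiler = cProfile.Profile()
profiler.enable()
result = prototype()
profiler.disable()

# Sort by cumulative time and print the five most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

In a notebook, the `%prun` magic gives a similar report for a single statement.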

Perks:

* First-class NumPy support
* Can use the C++ standard library
* Parallelism: can release the GIL

## A worked example of Cythonizing a computation

Shamelessly stolen from Pauli Virtanen, *Cython tutorial*,
https://python.g-node.org/python-summerschool-2011/_media/materials/cython/cython-slides.pdf

Consider a planet orbiting a star.

We need to solve a second-order ODE, written here as a first-order system:


$$
\begin{align}
\frac{d\mathbf{x}}{dt} &= \mathbf{v} \;,\\
\frac{d\mathbf{v}}{dt} &= \frac{\mathbf{F}(\mathbf{x})}{m} \;.
\end{align}
$$

Note that the time stepping cannot be vectorized: each step depends on the result of the previous one, so NumPy is of little help here.

For the sake of example, only use the Euler method.
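
Written out, one explicit Euler step of size $\Delta t$ reads as follows. The subscript $n$ for the step number is notation introduced here. The form of the force assumes the star sits at the origin and that the units make the gravitational prefactor equal to 1, which matches the code below.

$$
\begin{align}
\mathbf{x}_{n+1} &= \mathbf{x}_n + \mathbf{v}_n \,\Delta t \;,\\
\mathbf{v}_{n+1} &= \mathbf{v}_n + \frac{\mathbf{F}(\mathbf{x}_n)}{m} \,\Delta t \;,
\qquad \mathbf{F}(\mathbf{x}) = -\frac{\mathbf{x}}{|\mathbf{x}|^3} \;.
\end{align}
$$

Both updates use the values from step $n$: the force is evaluated at the old position, and the position is advanced with the old velocity.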

In [2]:
import numpy as np

In [3]:
%load_ext cython

In [4]:
from math import sqrt

class Planet(object):
    """A class to store a planet's position and velocity."""
    def __init__(self):
        self.x = 1.0
        self.y = 0
        self.z = 0
        self.vx = 0
        self.vy = 0
        self.vz = 1.0
        
        self.m = 1.0
        

def single_step(planet, dt):
    """Make a single step in time, t -> t+dt."""
    
    # Gravitational force pulls towards origin
    r = sqrt(planet.x**2 + planet.y**2 + planet.z**2)
    r3 = r**3
    
    Fx = -planet.x / r3
    Fy = -planet.y / r3
    Fz = -planet.z / r3
    
    # update position
    planet.x += planet.vx * dt
    planet.y += planet.vy * dt
    planet.z += planet.vz * dt
    
    # update velocity
    m = planet.m
    planet.vx += Fx * dt / m
    planet.vy += Fy * dt / m
    planet.vz += Fz * dt / m


def propagate(planet, time_span, num_steps):
    """Make a number of time steps."""
    dt = time_span / num_steps
    
    for _ in range(num_steps):
        single_step(planet, dt)

In [5]:
planet = Planet()
%timeit propagate(planet, 1, 1000)

948 µs ± 4.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
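
Before optimizing, it helps to have a correctness check to rerun after every change. With the initial conditions in `Planet.__init__` (unit radius, unit speed perpendicular to the radius, `m = 1`), the exact orbit is a circle, so the position at time $t$ should be $(\cos t, 0, \sin t)$. The sketch below is self-contained: `euler_orbit` is a hypothetical helper that repeats the same Euler update in the x-z plane, not a function from this notebook.

```python
from math import cos, sin, sqrt


def euler_orbit(time_span, num_steps):
    """Euler-integrate the orbit in the x-z plane; return the final (x, z)."""
    x, z = 1.0, 0.0      # same initial position as Planet()
    vx, vz = 0.0, 1.0    # same initial velocity as Planet()
    dt = time_span / num_steps
    for _ in range(num_steps):
        r3 = sqrt(x * x + z * z) ** 3
        fx, fz = -x / r3, -z / r3              # force at the old position (m = 1)
        x, z = x + vx * dt, z + vz * dt        # update position with the old velocity
        vx, vz = vx + fx * dt, vz + fz * dt    # update velocity with the old force
    return x, z


x, z = euler_orbit(1.0, 1000)
err = sqrt((x - cos(1.0)) ** 2 + (z - sin(1.0)) ** 2)
print(f"numerical: ({x:.5f}, {z:.5f})   exact: ({cos(1.0):.5f}, {sin(1.0):.5f})")
print(f"error: {err:.2e}")
```

The Euler method is only first-order accurate, so the error shrinks roughly in proportion to the step size.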


## Compile the program with Cython

Almost every Python program is also a valid Cython program.

In [6]:
%%cython -a
# -a is for "annotate"

from math import sqrt

class Planet(object):
    def __init__(self):
        self.x = 1.0
        self.y = 0
        self.z = 0
        self.vx = 0
        self.vy = 0
        self.vz = 1.0
        
        self.m = 1.0
        

def single_step(planet, dt):
    """Make a single step in time, t -> t+dt."""
    
    # Gravitational force pulls towards origin
    r = sqrt(planet.x**2 + planet.y**2 + planet.z**2)
    r3 = r**3
    
    Fx = -planet.x / r3
    Fy = -planet.y / r3
    Fz = -planet.z / r3
    
    # update position
    planet.x += planet.vx * dt
    planet.y += planet.vy * dt
    planet.z += planet.vz * dt
    
    # update velocity
    m = planet.m
    planet.vx += Fx * dt / m
    planet.vy += Fy * dt / m
    planet.vz += Fz * dt / m


def propagate(planet, time_span, num_steps):
    """Make a number of time steps."""
    dt = time_span / num_steps
    
    for _ in range(num_steps):
        single_step(planet, dt)

In [7]:
planet = Planet()
%timeit propagate(planet, 1, 1000)

729 µs ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Simply compiling the unchanged program only saves about 23% of the run time (948 µs down to 729 µs). That is usually not worth it.

### Throw in some static typing

In [8]:
%%cython -a

from __main__ import single_step

def propagate(planet,
              double time_span,              # NB: C double, C int
              int num_steps):
    """Make a number of time steps."""

    cdef double dt = time_span / num_steps    # NB: cdef ctype variable
    cdef int j
    
    for j in range(num_steps):
        single_step(planet, dt)

Notice that the loop has been compiled to a plain C loop.

The division is still guarded against division by zero. We can switch this check off and request C semantics for the division.
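
The guard exists because Python and C disagree about division. Here is a small pure-Python illustration of the difference. `c_style_div` and `c_style_mod` are hypothetical helpers that mimic C's truncating integer arithmetic; they are not Cython APIs.

```python
import math


def c_style_div(a, b):
    """Integer quotient truncated toward zero, as C does for ints."""
    return math.trunc(a / b)


def c_style_mod(a, b):
    """Remainder that takes the sign of the dividend, as C's % operator does."""
    return a - b * c_style_div(a, b)


# Python floors the quotient; C truncates it toward zero.
print(-7 // 2, c_style_div(-7, 2))    # -4 -3
print(-7 % 2, c_style_mod(-7, 2))     # 1 -1

# Python raises on division by zero; C performs no such check.
try:
    1.0 / 0
except ZeroDivisionError as exc:
    print("Python raises:", exc)
```

In `propagate` the operands are a `double` and an `int`, so only the zero check matters. With the check switched off, `num_steps=0` no longer raises `ZeroDivisionError`; on IEEE-754 platforms `dt` typically becomes `inf` or `nan` instead.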

In [9]:
%%cython -a

from __main__ import single_step

cimport cython

@cython.cdivision(True)                  # NB: decorator. Other useful decorators:
def propagate(planet,                    # wraparound, boundscheck
              double time_span,
              int num_steps):
    """Make a number of time steps."""

    cdef double dt = time_span / num_steps
    cdef int j
    
    for j in range(num_steps):
        single_step(planet, dt)

#### Now `single_step` is the bottleneck

`single_step` has a lot of Python overhead because it looks up attributes on a Python object, `planet`. Move the `Planet` class to Cython.

In [10]:
%%cython -a

from math import sqrt

cimport cython


cdef class Planet(object):                       # NB: cdef class
    cdef public double x, y, z, vx, vy, vz, m    # NB: cdef public double
                                                 #     here "public" means they are accessible from python
    def __init__(self):
        self.x = 1.0
        self.y = 0
        self.z = 0
        self.vx = 0
        self.vy = 0
        self.vz = 1.0
        
        self.m = 1.0


@cython.cdivision(True)
def single_step(Planet planet not None,          # NB type the planet parameter
                double dt):
    """Make a single step in time, t -> t+dt."""
    # Gravitational force pulls towards origin
    cdef double r, r3                            # NB statically type scalars
    r = sqrt(planet.x**2 + planet.y**2 + planet.z**2)
    r3 = r**3
    
    Fx = -planet.x / r3
    Fy = -planet.y / r3
    Fz = -planet.z / r3
    
    # update position
    planet.x += planet.vx * dt
    planet.y += planet.vy * dt
    planet.z += planet.vz * dt
    
    # update velocity
    m = planet.m
    planet.vx += Fx * dt / m
    planet.vy += Fy * dt / m
    planet.vz += Fz * dt / m


@cython.cdivision(True)
def propagate(planet,
              double time_span,
              int num_steps):
    """Make a number of time steps."""

    cdef double dt = time_span / num_steps
    cdef int j
    
    for j in range(num_steps):
        single_step(planet, dt)

In [11]:
planet = Planet()
%timeit propagate(planet, 1, 1000)

80.7 µs ± 618 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Now, 

1. Use `sqrt` from C library `math.h`
2. Make `single_step` a `cdef` function (i.e. C only, not available from Python space)


* `def foo(x):` is a python function
* `cdef double foo(double x):` is a C only function, not available from Python space
* `cpdef double foo(double x):` is both a fast C function and a slow Python function.

If no type is specified, the default is `object`, i.e. an untyped Python object.

In [12]:
%%cython -a

from libc.math cimport sqrt                         # use the Cython wrapper over math.h

cimport cython


cdef class Planet(object):
    cdef public double x, y, z, vx, vy, vz, m
    def __init__(self):
        self.x = 1.0
        self.y = 0
        self.z = 0
        self.vx = 0
        self.vy = 0
        self.vz = 1.0
        
        self.m = 1.0


@cython.cdivision(True)
cdef void single_step(Planet planet,          # NB: cdef void, also Planet planet
                      double dt):
    """Make a single step in time, t -> t+dt."""
    # Gravitational force pulls towards origin
    cdef double r, r3
    r = sqrt(planet.x**2 + planet.y**2 + planet.z**2)
    r3 = r**3                             # XXX: check the generated C code
    
    Fx = -planet.x / r3
    Fy = -planet.y / r3
    Fz = -planet.z / r3
    
    # update position
    planet.x += planet.vx * dt
    planet.y += planet.vy * dt
    planet.z += planet.vz * dt
    
    # update velocity
    m = planet.m
    planet.vx += Fx * dt / m
    planet.vy += Fy * dt / m
    planet.vz += Fz * dt / m


@cython.cdivision(True)
def propagate(planet,
              double time_span,
              int num_steps):
    """Make a number of time steps."""

    cdef double dt = time_span / num_steps
    cdef int j
    
    for j in range(num_steps):
        single_step(planet, dt)

In [16]:
planet = Planet()
%timeit propagate(planet, 1, 1000)

79.5 µs ± 926 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


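Now check the generated C code for `r3 = r**3`. The C `pow` function can be *slow*, so let's replace `r**3` with `r*r*r`. (The cell below also compiles in C++ mode via `--cplus` and marks `single_step` as `nogil`.)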

In [13]:
%%cython -a --cplus

from libc.math cimport sqrt

cimport cython

cdef class Planet(object):
    cdef public double x, y, z, vx, vy, vz, m
    def __init__(self):
        self.x = 1.0
        self.y = 0
        self.z = 0
        self.vx = 0
        self.vy = 0
        self.vz = 1.0
        
        self.m = 1.0


@cython.cdivision(True)
cdef void single_step(Planet planet,
                      double dt) nogil:
    """Make a single step in time, t -> t+dt."""
    # Gravitational force pulls towards origin
    cdef double r, r3
    r = sqrt(planet.x**2 + planet.y**2 + planet.z**2)
    r3 = r*r*r                             # XXX: check generated C code
    
    Fx = -planet.x / r3
    Fy = -planet.y / r3
    Fz = -planet.z / r3
    
    # update position
    planet.x += planet.vx * dt
    planet.y += planet.vy * dt
    planet.z += planet.vz * dt
    
    # update velocity
    m = planet.m
    planet.vx += Fx * dt / m
    planet.vy += Fy * dt / m
    planet.vz += Fz * dt / m


@cython.cdivision(True)
def propagate(planet,
              double time_span,
              int num_steps):
    """Make a number of time steps."""

    cdef double dt = time_span / num_steps
    cdef int j
    
    for j in range(num_steps):
        single_step(planet, dt)

In [16]:
planet = Planet()
%timeit propagate(planet, 1, 1000)

12.9 µs ± 138 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [17]:
948 / 13

72.92307692307692

Alternatively, we can ask the compiler to perform such transformations via compiler flags; for gcc, e.g., `-O3 -ffast-math`.
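
Flags like `-ffast-math` are opt-in because floating-point arithmetic is not associative: reordering operations can change the last digits of a result. A quick pure-Python illustration:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c     # add a and b first
right = a + (b + c)    # add b and c first

print(left, right)
print(left == right)   # False
```

For this orbit example the difference is far below the error of the Euler method itself, but it is worth keeping in mind when results must be reproducible bit for bit.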


## Dense output

Store the trajectory

In [19]:
%%cython -a --cplus

import numpy as np      # new

from libc.math cimport sqrt
cimport cython

cdef class Planet(object):
    cdef public double x, y, z, vx, vy, vz, m
    def __init__(self):
        self.x = 1.0
        self.y = 0
        self.z = 0
        self.vx = 0
        self.vy = 0
        self.vz = 1.0
        
        self.m = 1.0


@cython.cdivision(True)
cdef void single_step(Planet planet,
                      double dt) nogil:
    """Make a single step in time, t -> t+dt."""
    # Gravitational force pulls towards origin
    cdef double r, r3
    r = sqrt(planet.x**2 + planet.y**2 + planet.z**2)
    r3 = r*r*r                             # XXX: check generated C code
    
    Fx = -planet.x / r3
    Fy = -planet.y / r3
    Fz = -planet.z / r3
    
    # update position
    planet.x += planet.vx * dt
    planet.y += planet.vy * dt
    planet.z += planet.vz * dt
    
    # update velocity
    m = planet.m
    planet.vx += Fx * dt / m
    planet.vy += Fy * dt / m
    planet.vz += Fz * dt / m


@cython.cdivision(True)
def propagate(planet,
              double time_span,
              int num_steps):
    """Make a number of time steps."""

    cdef double dt = time_span / num_steps
    cdef int j
    
    traj = np.empty((num_steps, 3), dtype=float)          # dense output array
    
    for j in range(num_steps):
        single_step(planet, dt)
        traj[j, :] = [planet.x, planet.y, planet.z]

    return traj                                           # hand the trajectory back to the caller

### Typed memoryviews: data pointer + shape + strides

In [23]:
%%cython -a --cplus

import numpy as np      # new

from libc.math cimport sqrt
cimport cython

cdef class Planet(object):
    cdef public double x, y, z, vx, vy, vz, m
    def __init__(self):
        self.x = 1.0
        self.y = 0
        self.z = 0
        self.vx = 0
        self.vy = 0
        self.vz = 1.0
        
        self.m = 1.0


@cython.cdivision(True)
cdef void single_step(Planet planet,
                      double dt) nogil:
    """Make a single step in time, t -> t+dt."""
    # Gravitational force pulls towards origin
    cdef double r, r3
    r = sqrt(planet.x**2 + planet.y**2 + planet.z**2)
    r3 = r*r*r
    
    Fx = -planet.x / r3
    Fy = -planet.y / r3
    Fz = -planet.z / r3
    
    # update position
    planet.x += planet.vx * dt
    planet.y += planet.vy * dt
    planet.z += planet.vz * dt
    
    # update velocity
    m = planet.m
    planet.vx += Fx * dt / m
    planet.vy += Fy * dt / m
    planet.vz += Fz * dt / m


@cython.cdivision(True)
@cython.wraparound(False)                   # try removing, check the C code
@cython.boundscheck(False)
def propagate(Planet planet not None,       # type the Planet variable
              double time_span,
              int num_steps):
    """Make a number of time steps."""

    cdef double dt = time_span / num_steps
    cdef int j
    
    cdef double[:, ::1] traj = np.empty((num_steps, 3), dtype=float)          # dense output array
    
    for j in range(num_steps):
        single_step(planet, dt)
        traj[j, 0] = planet.x
        traj[j, 1] = planet.y
        traj[j, 2] = planet.z

    return np.asarray(traj)                 # convert the memoryview back to a NumPy array
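
The "data pointer + shape + strides" description can be inspected from plain NumPy. The sketch below only uses standard `ndarray` attributes and shows why the `(num_steps, 3)` array matches the `double[:, ::1]` declaration, where `::1` marks the last axis as contiguous.

```python
import numpy as np

traj = np.empty((1000, 3), dtype=float)

print(traj.shape)                      # (1000, 3)
print(traj.strides)                    # (24, 8): bytes to step along each axis
print(traj.flags["C_CONTIGUOUS"])      # True, so it fits double[:, ::1]

column = traj[:, 0]                    # a view into the same buffer, no copy
print(column.strides)                  # (24,)
print(column.flags["C_CONTIGUOUS"])    # False, so it fits double[:] but not double[::1]
print(np.shares_memory(traj, column))  # True
```

Indexing a typed memoryview compiles down to arithmetic on the data pointer and strides, which is why `traj[j, 0] = planet.x` carries no Python overhead in the annotated output.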

## Additional language features

* Typed memoryview syntax: access data in NumPy arrays, C arrays or `std::vector`s.
* Basic templating: fused types.
* Release the GIL: `cdef ... nogil` functions, `with nogil` blocks.
* Basic parallelism with OpenMP: `prange` loops.
* Translating C++ exceptions into Python exceptions automatically.

## Further reading

* Documentation: http://docs.cython.org/en/latest/
* Pauli Virtanen, *Cython tutorial*, 2011, https://python.g-node.org/python-summerschool-2011/_media/materials/cython/cython-slides.pdf
* Stefan van der Walt, *Speeding up scientific Python code using Cython*, https://github.com/stefanv/teaching/blob/master/2013_assp_zurich_cython/slides/zurich2012_cython.pdf
* Paul Ross, *Musings on Cython*, http://notes-on-cython.readthedocs.io/en/latest/index.html
* Kurt W Smith, *Cython: A Guide for Python Programmers*, O'Reilly 2015
* Stefan Behnel, *Get Native with Cython*, EuroPython 2014, https://www.youtube.com/watch?v=DXmblsdcsHw (50 mins);
  EuroSciPy 2015, https://www.youtube.com/watch?v=GmxZfZjEjZo (3 hrs).
  
* https://youtu.be/-lMiAKKyLFI (Cython is discussed from roughly the 50-minute mark, very briefly.) --- hat tip V.G.
  
There is also a cython-users mailing list.

### A pain point for Windows users

Trying to compile a C extension fails with *"... unable to find vcvarsall.bat"*.

See https://blogs.msdn.microsoft.com/pythonengineering/2016/04/11/unable-to-find-vcvarsall-bat/

Long story short: for Python >= 3.5 (recommended!), you need Visual Studio 2015 or the VS 2015 SDK.

# Numba

In [2]:

# https://numba.pydata.org/numba-doc/latest/user/jitclass.html

import numba
from numba import float64
from numba.experimental import jitclass    # newer Numba releases moved jitclass here

spec = [
    ('x', float64),
    ('y', float64),
    ('z', float64),
    ('vx', float64),
    ('vy', float64),
    ('vz', float64),
    ('m', float64),
]

@jitclass(spec)
class Planet(object):
    """A class to store a planet's position and velocity."""
    def __init__(self):
        self.x = 1.0
        self.y = 0
        self.z = 0
        self.vx = 0
        self.vy = 0
        self.vz = 1.0
        
        self.m = 1.0



In [6]:
from math import sqrt

@numba.jit(nopython=True)
def numba_single_step(planet, dt):
    """Make a single step in time, t -> t+dt."""
    
    # Gravitational force pulls towards origin
    r = sqrt(planet.x**2 + planet.y**2 + planet.z**2)
    r3 = r**3
    
    Fx = -planet.x / r3
    Fy = -planet.y / r3
    Fz = -planet.z / r3
    
    # update position
    planet.x += planet.vx * dt
    planet.y += planet.vy * dt
    planet.z += planet.vz * dt
    
    # update velocity
    m = planet.m
    planet.vx += Fx * dt / m
    planet.vy += Fy * dt / m
    planet.vz += Fz * dt / m


@numba.jit(nopython=True)
def numba_propagate(planet, time_span, num_steps):
    """Make a number of time steps."""
    dt = time_span / num_steps
    
    for _ in range(num_steps):
        numba_single_step(planet, dt)

In [7]:
planet1 = Planet()
%timeit numba_propagate(planet1, 1, 1000)

36.6 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [10]:
%%cython?