This is a very brief introduction to ```cython``` and ```ctypes```. For more information, see:

https://pythonprogramming.net/introduction-and-basics-cython-tutorial/

http://www.southampton.ac.uk/~feeg6002/ipythonnotebooks/Introduction-ctypes.html

https://docs.python.org/2/library/ctypes.html

<h1>```cython```</h1>

In [1]:
import numpy as np

#for in-notebook cython magic
%load_ext Cython

When Python code is slow, the cause is usually a large number of interpreted operations, such as function calls or loop iterations. Because Python is interpreted rather than compiled, each of them carries considerable overhead.

In [2]:
def sumtest_python(N):
    a = 0
    for i in range(N):
        for j in range(N):
            a += j
    return(a)

In [3]:
%timeit sumtest_python(100)

1000 loops, best of 3: 1.53 ms per loop
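
To get a feeling for where this time goes, the standard ```dis``` module can list the bytecode instructions that the interpreter steps through for the loop above. The following is a minimal sketch, assuming Python 3.4 or newer (where ```dis.get_instructions``` exists); the exact instruction names differ between Python versions.

```python
import dis

def sumtest_python(N):
    a = 0
    for i in range(N):
        for j in range(N):
            a += j
    return a

# Collect the names of the bytecode instructions of the function.
# Every pass through the inner loop executes several of them, and each
# one is dispatched by the interpreter on generic Python objects.
opnames = [ins.opname for ins in dis.get_instructions(sumtest_python)]

print(len(opnames), "bytecode instructions")
print(sorted(set(opnames)))
```

None of these instructions knows that ```a``` and ```j``` are plain integers, so the addition has to look up the types and allocate a new integer object on every iteration.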


Cython translates the code to C and compiles it, which speeds up loops considerably:

In [4]:
%%cython

def sumtest_cython_v1(N):
    a = 0
    for i in range(N):
        for j in range(N):
            a += j
    return(a)

In [5]:
%timeit sumtest_cython_v1(100)

1000 loops, best of 3: 243 µs per loop


That is already a factor of 6, but Cython can do much better. The ```--annotate``` option shows where the generated code still calls into the Python interpreter, which is what slows it down:

In [6]:
%%cython --annotate

def sumtest_cython_v1(N):
    a = 0
    for i in range(N):
        for j in range(N):
            a += j
    return(a)

The problem is that our variables are not typed, so every operation on them still works on generic Python objects. Let's try again with C type declarations:

In [7]:
%%cython --annotate

def sumtest_cython_v2(int N):
    cdef int a = 0
    cdef int i, j
    for i in range(N):
        for j in range(N):
            a += j
    return(a)

Our code is now almost pure C and therefore very fast. Let's compare all implementations:

In [8]:
%timeit sumtest_python(100)
%timeit sumtest_cython_v1(100)
%timeit sumtest_cython_v2(100)

1000 loops, best of 3: 1.51 ms per loop
1000 loops, best of 3: 242 µs per loop
The slowest run took 12.40 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 154 ns per loop


This is a very simple example, but it shows the potential for cases where tight loops or function calls are the bottleneck.
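
One caveat of the typed version is that a C ```int``` has a fixed width, unlike a Python integer. The sketch below uses the closed form of the double loop to estimate how large ```N``` may become before ```a``` overflows. It assumes a 32-bit signed ```int```, which is typical but platform dependent; the helper names are hypothetical and not part of the notebook.

```python
INT_MAX = 2**31 - 1  # largest 32-bit signed C int (assumed width)

def sumtest_closed_form(N):
    """Value of the double loop: N times the sum 0 + 1 + ... + (N-1)."""
    return N * (N * (N - 1) // 2)

def largest_safe_N():
    """Largest N for which the result still fits into INT_MAX."""
    N = 0
    while sumtest_closed_form(N + 1) <= INT_MAX:
        N += 1
    return N

print(sumtest_closed_form(100))
print(largest_safe_N())
```

For ```N = 100``` the result is 495000, far below the limit. For inputs beyond the reported bound, declaring ```a``` as ```long long``` would be the safer choice.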

<h1>Integrations</h1>

A very common case where lots of function calls cannot be avoided is numerical integration. Especially once we move to higher dimensionality, the innermost integrand is called very often. As an example, let's have a look at a very basic cosmological angular diameter distance calculation:

$$ d_A(z) = \frac{c}{1+z} \int_0^z \frac{\mathrm{d} z'}{H(z')}$$

In many codes a function like this can be called very frequently, often within other integrals over other cosmological quantities.
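
With the speed of light $c$ written explicitly, the expression can be brought into the form that the implementation below evaluates. This is a sketch read off from the code, assuming a flat $\Lambda$CDM background without radiation and $H_0 = 100\,h\,\mathrm{km\,s^{-1}\,Mpc^{-1}}$:

$$ H(z) = H_0 \, E(z) \: , \qquad E(z) = \sqrt{\Omega_m (1+z)^3 + \Omega_\Lambda} \: , \qquad \Omega_\Lambda = 1 - \Omega_m $$

$$ d_A(z) = \frac{c}{H_0} \, \frac{1}{1+z} \int_0^z \frac{\mathrm{d} z'}{E(z')} \: , \qquad \frac{c}{H_0} = \frac{3 \times 10^5 \, \mathrm{km/s}}{100 \, h \, \mathrm{km/s/Mpc}} = 3000 \, h^{-1} \mathrm{Mpc} $$

The prefactor $c/H_0$ is the factor ```ckms*1e-2``` in the code, and the integrand function returns $1/E(z)$.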

In [9]:
from scipy import integrate

class Cosmology:
    """
    Defines a flat LCDM background cosmology.
    """
    Omega_m = 0.27
    Omega_L = 1. - Omega_m #assumes flatness

class Constants:
    """
    Defines fundamental physical constants.
    """
    ckms = 3.0e5 # speed of light in km/s 
    rho_crit0 = 2.7751973751261264e11 # rho_crit(z=0) in units of h^-1 Msun/ h^-3 Mpc^3
    G = 4.3e-9 # G in units of Mpc/Msun /(km/s)^2

def angular_distance(z):
    """
    Calculates angular diameter distances. Assumes a fixed LCDM cosmology
    defined in the Cosmology class.

    Parameters:
    ----------
    z : redshift

    Returns :
    ----------
    d_A(z) : angular diameter distance in Mpc/h
    """
    integral, error = integrate.quad(integrand_angular_distance,0.,z,epsrel=1e-4)
    d_A = integral * Constants.ckms*1e-2/(1.+z)
    return d_A

def integrand_angular_distance(z):
    """
    Integrand for the angular diameter distance calculation.

    Parameters:
    ----------
    z : redshift

    Returns:
    ----------
    1./E(z) : Inverse dimensionless Hubble function 1./E(z)=H0/H(z)
    """
    return 1./np.sqrt(Cosmology.Omega_m*(1.+z)**3 + Cosmology.Omega_L)

%timeit angular_distance(1.0)

10000 loops, best of 3: 81.4 µs per loop
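
A quick sanity check for a routine like this is the limit $\Omega_m = 1$, where the integral has the closed form $2\,(1 - 1/\sqrt{1+z})$. The sketch below is not the notebook's implementation: it swaps ```integrate.quad``` for a hand-written composite Simpson rule so that it runs with the standard library only, and all function names are hypothetical.

```python
import math

C_KMS = 3.0e5  # speed of light in km/s, same rounded value as Constants.ckms

def inv_E(z, omega_m):
    """1/E(z) for a flat universe with matter density omega_m."""
    return 1.0 / math.sqrt(omega_m * (1.0 + z) ** 3 + (1.0 - omega_m))

def simpson(f, a, b, n=200):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

def angular_distance_simpson(z, omega_m):
    """Angular diameter distance in Mpc/h via Simpson integration."""
    integral = simpson(lambda x: inv_E(x, omega_m), 0.0, z)
    return integral * C_KMS * 1e-2 / (1.0 + z)

def angular_distance_closed_form(z):
    """Closed form for omega_m = 1 in Mpc/h."""
    integral = 2.0 * (1.0 - 1.0 / math.sqrt(1.0 + z))
    return integral * C_KMS * 1e-2 / (1.0 + z)

z = 1.0
print("numerical  :", angular_distance_simpson(z, 1.0))
print("closed form:", angular_distance_closed_form(z))
```

If the two numbers agree, the same integration routine can be trusted for $\Omega_m = 0.27$, where no simple closed form is available.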


Let's try a simple Cython implementation:

In [10]:
%%cython --annotate
# re-import needed modules for compilation
import numpy as np
from scipy import integrate

cdef double Omega_m = 0.27
cdef double c = 3e5  # speed of light in km/s

def angular_distance_cython(double z):
    
    integral, error = integrate.quad(integrand_angular_distance_cython,0.,z,epsrel=1e-4)
    d_A = integral * c * 1e-2/(1.+z)
    return d_A

def integrand_angular_distance_cython(double z):
    cdef double result
    result = 1./np.sqrt(Omega_m*(1.+z)**3 + (1-Omega_m))
    return result

In [11]:
%timeit angular_distance(1.0)
%timeit angular_distance_cython(1.0)

10000 loops, best of 3: 85.6 µs per loop
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 54 µs per loop


It's already faster, but calling ```np.sqrt``` on a scalar still goes through Python and slows down the code. We can instead use the fast C version from ```libc.math```:

In [12]:
%%cython --annotate
# re-import needed modules for compilation
import numpy as np
from scipy import integrate

# Note: fast C implementation of standard functions in libc
from libc.math cimport sqrt

cdef double Omega_m = 0.27
cdef double c = 3e5

def angular_distance_cython_libc(double z):
    
    integral, error = integrate.quad(integrand_angular_distance_cython_libc,0.,z,epsrel=1e-4)
    d_A = integral * c * 1e-2/(1.+z)
    return d_A

cdef double integrand_angular_distance_cython_libc(double z):
    cdef double result
    result = 1./sqrt(Omega_m*(1.+z)**3 + (1-Omega_m))
    return result


In [13]:
%timeit angular_distance(1.0)
%timeit angular_distance_cython(1.0)
%timeit angular_distance_cython_libc(1.0)

10000 loops, best of 3: 86 µs per loop
10000 loops, best of 3: 55.4 µs per loop
100000 loops, best of 3: 16.9 µs per loop


That is a speed-up of about a factor of 5 over pure Python. To get much faster, we have to restructure the code considerably or move the computation to C completely. The latter can be done via ```ctypes```.

<h1>```ctypes```</h1>

```ctypes``` allows calling C functions directly from shared libraries. Once the library is loaded, its functions can be used like any Python function.
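
The mechanism can be tried out without compiling anything by calling a function from the C math library that ships with the system. This is a minimal sketch, assuming a POSIX system (Linux or macOS); on other platforms the library lookup will differ.

```python
import ctypes as ct
import ctypes.util

# Locate the C math library by name; fall back to the symbols already
# loaded into the running process if the lookup fails (POSIX only).
path = ctypes.util.find_library("m")
libm = ct.CDLL(path) if path else ct.CDLL(None)

# Declare the C signature: double sqrt(double).
# Without argtypes, ctypes refuses to convert a Python float, and
# without restype the return value would be interpreted as a C int.
libm.sqrt.restype = ct.c_double
libm.sqrt.argtypes = (ct.c_double,)

print(libm.sqrt(2.25))
```

The two declarations are the important part: ```ctypes``` cannot read the C header, so the caller is responsible for stating the argument and return types correctly.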

In [14]:
import os
from sys import platform as _platform
import ctypes as ct

""" import external C library """
root = os.getcwd()

# Linux: compile clib.c as a shared object (clib.so)
# macOS: compile clib.c as a dynamic library (clib.dylib)
if _platform.startswith("linux"):
    OSstring = 'so'
elif _platform == "darwin":
    OSstring = 'dylib'
else:
    raise OSError("unsupported platform: " + _platform)
clib = ct.CDLL(os.path.join(root, 'clib.' + OSstring))

# declare argument and return types of angular_distance_c
clib.angular_distance_c.restype = ct.c_double
clib.angular_distance_c.argtypes = (ct.c_double,)

In [15]:
%timeit clib.angular_distance_c(1.0)

The slowest run took 308.58 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.79 µs per loop


The difference becomes even more pronounced for $n$-dimensional integrals. Let's compare scipy's n-dimensional Gaussian quadrature (```integrate.nquad```) with GNU's Monte Carlo integrator (from the GNU Scientific Library, GSL), called through our C library. We want to solve:

$$ \int_0^\pi \frac{\mathrm d x}{\pi} \int_0^\pi \frac{\mathrm d y}{\pi} \int_0^\pi \frac{\mathrm d z}{\pi} \frac{1}{1 - \cos(x) \cos(y) \cos(z)} \: ,$$

and the analytic solution is given by

$$ \frac{1}{4 \pi^3} \Gamma \left( \frac{1}{4} \right)^4 = 1.39320392 \dots $$

In [20]:
def randomwalk_integrand(x, y, z):
    A = 1./np.pi**3
    result = A/(1. - np.cos(x) * np.cos(y) * np.cos(z))
    return result




In [21]:
# avoid singularities at the boundary
eps = 1e-10

%timeit integrate.nquad(randomwalk_integrand, [[0+eps, np.pi-eps], [0+eps, np.pi-eps], [0+eps, np.pi-eps]])

1 loop, best of 3: 1min 9s per loop


In [22]:
"""import mc_integral function from external C library"""

clib.mc_integral.restype = ct.c_double
clib.mc_integral.argtypes = None

%timeit clib.mc_integral()

10 loops, best of 3: 59.2 ms per loop


In [24]:
from __future__ import print_function  # print() then works in Python 2 and 3
from scipy.special import gamma

eps = 1e-10
print('C: ', clib.mc_integral())
print('Python: ', integrate.nquad(randomwalk_integrand, [[0+eps, np.pi-eps], [0+eps, np.pi-eps], [0+eps, np.pi-eps]]))
print('analytic: ', gamma(1./4)**4 / (4. * np.pi**3))

C:  1.39322713862
Python:  (1.3932039279547017, 4.660446003751373e-06)
analytic:  1.39320392969
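
For intuition about what a Monte Carlo integrator does, the same integral can be estimated in a few lines of pure Python. This is only a sketch of plain Monte Carlo sampling with a fixed seed; it is not the algorithm inside ```clib.mc_integral```, whose source is not shown here, and the function name ```mc_integral_plain``` is hypothetical.

```python
import math
import random

def randomwalk_integrand(x, y, z):
    A = 1.0 / math.pi ** 3
    return A / (1.0 - math.cos(x) * math.cos(y) * math.cos(z))

def mc_integral_plain(n_samples, seed=0):
    """Plain Monte Carlo estimate over the cube [0, pi]^3."""
    rng = random.Random(seed)
    volume = math.pi ** 3
    total = 0.0
    for _ in range(n_samples):
        # draw one uniformly distributed point in the cube
        x = rng.uniform(0.0, math.pi)
        y = rng.uniform(0.0, math.pi)
        z = rng.uniform(0.0, math.pi)
        total += randomwalk_integrand(x, y, z)
    # integral = volume * average value of the integrand
    return volume * total / n_samples

exact = math.gamma(0.25) ** 4 / (4.0 * math.pi ** 3)
estimate = mc_integral_plain(200000)
print("plain MC:", estimate)
print("analytic:", exact)
```

Because the integrand is singular at some corners of the cube, plain sampling converges slowly and its error is hard to estimate. Adaptive Monte Carlo schemes that concentrate samples where the integrand is large usually do considerably better for the same number of function calls.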


<h2>Passing a ```numpy``` array to C</h2>

The following example shows how to pass a numpy array to your external C library via ```ctypes``` to perform some operation on it. Our example function takes a numpy array and its size as input and sums up all elements.

In [25]:
import numpy as np
from numpy.ctypeslib import ndpointer

# input: array, len(array)
clib.datain.restype = ct.c_double
clib.datain.argtypes = [ndpointer(ct.c_double, flags="C_CONTIGUOUS"),ct.c_int]

array = np.ones(1000)

clib.datain(array,len(array))

1000.0
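
The (pointer, length) calling convention used by ```datain``` can also be tried without numpy or a custom library. The following minimal sketch assumes a POSIX system and uses ```qsort``` from the C standard library, which takes a pointer to the data, the number of elements, the element size and a comparison callback, and sorts the array in place.

```python
import ctypes as ct
import ctypes.util

# Locate the C standard library; fall back to the symbols already
# loaded into the running process if the lookup fails (POSIX only).
path = ctypes.util.find_library("c")
libc = ct.CDLL(path) if path else ct.CDLL(None)

# Comparator type: int (*)(const double *, const double *)
CMPFUNC = ct.CFUNCTYPE(ct.c_int, ct.POINTER(ct.c_double), ct.POINTER(ct.c_double))

def compare_doubles(a, b):
    # a and b point to single array elements
    return (a[0] > b[0]) - (a[0] < b[0])

# void qsort(void *base, size_t nmemb, size_t size, comparator)
libc.qsort.restype = None
libc.qsort.argtypes = [ct.c_void_p, ct.c_size_t, ct.c_size_t, CMPFUNC]

values = [3.0, 1.0, 2.5, 0.5]
array = (ct.c_double * len(values))(*values)  # contiguous C array of doubles

libc.qsort(array, len(array), ct.sizeof(ct.c_double), CMPFUNC(compare_doubles))
print(list(array))
```

As with the numpy version, the C side only receives a raw address and trusts the caller about the element type and count, which is why the ```argtypes``` declaration matters.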