# Speeding up Python Programs

<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#NumPy-and-Vectorization" data-toc-modified-id="NumPy-and-Vectorization-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>NumPy and Vectorization</a></span></li><li><span><a href="#Switch-to-PyPy" data-toc-modified-id="Switch-to-PyPy-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Switch to PyPy</a></span></li><li><span><a href="#Testing" data-toc-modified-id="Testing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Testing</a></span></li><li><span><a href="#Compile-parts-of-the-Python-code" data-toc-modified-id="Compile-parts-of-the-Python-code-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Compile parts of the Python code</a></span><ul class="toc-item"><li><span><a href="#Cython" data-toc-modified-id="Cython-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Cython</a></span></li><li><span><a href="#Numba" data-toc-modified-id="Numba-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Numba</a></span></li></ul></li><li><span><a href="#Parallel-Processing" data-toc-modified-id="Parallel-Processing-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Parallel Processing</a></span><ul class="toc-item"><li><span><a href="#concurrent.futures-module" data-toc-modified-id="concurrent.futures-module-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>concurrent.futures module</a></span></li><li><span><a href="#multiprocessing-module" data-toc-modified-id="multiprocessing-module-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>multiprocessing module</a></span></li><li><span><a href="#Numba-(again)" data-toc-modified-id="Numba-(again)-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Numba (again)</a></span></li><li><span><a href="#IPython-parallel" data-toc-modified-id="IPython-parallel-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>IPython parallel</a></span></li><li><span><a href="#Big-data-systems" data-toc-modified-id="Big-data-systems-5.5"><span 
class="toc-item-num">5.5&nbsp;&nbsp;</span>Big data systems</a></span></li></ul></li><li><span><a href="#Use-the-GPU" data-toc-modified-id="Use-the-GPU-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Use the GPU</a></span><ul class="toc-item"><li><span><a href="#Numba-(yet-again)" data-toc-modified-id="Numba-(yet-again)-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Numba (yet again)</a></span></li><li><span><a href="#PyCUDA-and-PyOpenCL" data-toc-modified-id="PyCUDA-and-PyOpenCL-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>PyCUDA and PyOpenCL</a></span></li><li><span><a href="#Other-packages" data-toc-modified-id="Other-packages-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Other packages</a></span></li></ul></li><li><span><a href="#Interface-to-C/C++/Fortran" data-toc-modified-id="Interface-to-C/C++/Fortran-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Interface to C/C++/Fortran</a></span><ul class="toc-item"><li><span><a href="#C-extension-interface" data-toc-modified-id="C-extension-interface-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>C extension interface</a></span></li><li><span><a href="#ctypes" data-toc-modified-id="ctypes-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>ctypes</a></span></li><li><span><a href="#cffi" data-toc-modified-id="cffi-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>cffi</a></span></li><li><span><a href="#SWIG" data-toc-modified-id="SWIG-7.4"><span class="toc-item-num">7.4&nbsp;&nbsp;</span>SWIG</a></span></li><li><span><a href="#cppyy" data-toc-modified-id="cppyy-7.5"><span class="toc-item-num">7.5&nbsp;&nbsp;</span>cppyy</a></span></li><li><span><a href="#Boost.Python" data-toc-modified-id="Boost.Python-7.6"><span class="toc-item-num">7.6&nbsp;&nbsp;</span>Boost.Python</a></span></li><li><span><a href="#pybind11" data-toc-modified-id="pybind11-7.7"><span class="toc-item-num">7.7&nbsp;&nbsp;</span>pybind11</a></span></li><li><span><a href="#F2PY" 
data-toc-modified-id="F2PY-7.8"><span class="toc-item-num">7.8&nbsp;&nbsp;</span>F2PY</a></span></li></ul></li><li><span><a href="#Consider-other-languages" data-toc-modified-id="Consider-other-languages-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Consider other languages</a></span><ul class="toc-item"><li><span><a href="#Julia" data-toc-modified-id="Julia-8.1"><span class="toc-item-num">8.1&nbsp;&nbsp;</span>Julia</a></span></li><li><span><a href="#C/C++" data-toc-modified-id="C/C++-8.2"><span class="toc-item-num">8.2&nbsp;&nbsp;</span>C/C++</a></span></li><li><span><a href="#Fortran" data-toc-modified-id="Fortran-8.3"><span class="toc-item-num">8.3&nbsp;&nbsp;</span>Fortran</a></span></li></ul></li></ul></div>

## NumPy and Vectorization

What's so good about NumPy? For starters:
- Most of it is written in C for speed
- It adds strongly-typed arrays to Python. These are less flexible than lists but potentially much more efficient
- Many NumPy functions (the universal functions, or ufuncs) accept arrays as arguments and automatically operate on all the elements (vectorization). Don't write explicit loops unless you REALLY have to!
- AstroPy is built on top of NumPy, making a good thing even better

Jake VanderPlas is better than most of us at this sort of thing, and he helpfully wrote about it in the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) (chapter 2 is specifically about NumPy, but all of it is well worth reading).
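To make the vectorization point concrete, here is a minimal sketch comparing an explicit loop with the equivalent array expression. The function names and the array size are made up for illustration; timing both with `%timeit` on your own machine is the honest way to see the difference.

```python
import numpy as np

values = np.arange(100_000, dtype=np.float64)

def sum_of_squares_loop(arr):
    """Explicit loop: the interpreter handles one element at a time."""
    total = 0.0
    for x in arr:
        total += x * x
    return total

def sum_of_squares_vectorized(arr):
    """Vectorized: the multiplication and the sum both run in compiled code."""
    return float(np.sum(arr * arr))

# Both approaches should agree on the answer
print(sum_of_squares_loop(values[:1000]), sum_of_squares_vectorized(values[:1000]))
```

The two functions compute the same quantity; only the second lets NumPy do the looping internally.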

## Switch to PyPy

Anyone reading this notebook is very probably using a CPython kernel. PyPy is a replacement implementation of Python, optimized for speed. It integrates cffi and a JIT compiler. Details at https://www.pypy.org/index.html
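If you're not sure which implementation a kernel or script is running, the standard library can tell you. A minimal sketch (the variable names are just for illustration):

```python
import platform
import sys

# Name of the running implementation, e.g. 'CPython' or 'PyPy'
implementation = platform.python_implementation()

# Version of the Python language that this interpreter implements
language_version = platform.python_version()

print(f"{implementation} implementing Python {language_version}")
print(sys.version)  # full build string for the interpreter
```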

The project aims to be code-compatible with a large subset of a not-quite-current version of CPython. However, the extension mechanism is very different and not all third-party packages are currently compatible with PyPy. A compatibility list is at http://packages.pypy.org. At present, numpy and astropy are shown as working, but scipy, matplotlib, pandas and several others are not.

There are mixed reports about getting PyPy to work with Jupyter notebooks. If anyone wants to try it, I wish you luck.

## Testing

Define some functions to play with later, one trivial and one working the CPU harder:

In [1]:
import math

def sq(x):
    return x*x

PRIMES = [
    112272535095293,
    112582705942171,
    112272535095293,
    115280095190773,
    115797848077099,
    1099726899285419]

def is_prime(n):
    if n % 2 == 0:
        return False

    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

Time this in unenhanced mode to get a baseline. Because `map()` does lazy evaluation, it's important to do the list comprehension to force a full calculation.

In [2]:
%%timeit -n2 -r4
results = [x for x in map(is_prime, PRIMES)]

2.5 s ± 67.1 ms per loop (mean ± std. dev. of 4 runs, 2 loops each)
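Outside a notebook there is no `%%timeit` magic, but the standard `timeit` module does much the same job. The sketch below also shows the laziness of `map()` mentioned above. The `SAMPLE` list is a made-up set of small numbers so that the example runs quickly; it is not the `PRIMES` list timed in this notebook.

```python
import math
import timeit

def is_prime(n):
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

SAMPLE = [10007, 10009, 10011]  # hypothetical small inputs

lazy = map(is_prime, SAMPLE)  # returns at once: nothing has been computed yet
results = list(lazy)          # consuming the iterator does the actual work

# Rough equivalent of `%%timeit -n2 -r4`: 4 repeats of 2 loops each
timings = timeit.repeat(lambda: list(map(is_prime, SAMPLE)), number=2, repeat=4)
per_loop = [t / 2 for t in timings]
print(f"best of 4: {min(per_loop):.6f} s per loop")
```

Note that `timeit.repeat()` returns the total time for each repeat, so it needs dividing by `number` to get a per-loop figure.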


In [3]:
for number, prime in zip(PRIMES, map(is_prime, PRIMES)):
    print('%d is prime: %s' % (number, prime))

112272535095293 is prime: True
112582705942171 is prime: True
112272535095293 is prime: True
115280095190773 is prime: True
115797848077099 is prime: True
1099726899285419 is prime: False


To get a comparison with compiled code run from the command line, the `./C/` directory contains:
- a version of `is_prime()` written in C
- a test program in C++
- a shared object library containing the `is_prime()` function

Running the test program gives:
```
$ ./testprime
112272535095293 is prime: False
112582705942171 is prime: False
112272535095293 is prime: False
115280095190773 is prime: False
115797848077099 is prime: False
1099726899285419 is prime: True
10 loops, average time taken: 251 milliseconds
```

This compares with about 2.5 seconds (the `%%timeit` baseline above) for the unoptimized Python version run on the same machine.

## Compile parts of the Python code

Isolate the runtime bottleneck in a small block of Python code, auto-generate C code from it and compile to native binary. This can be surprisingly easy but requires some extra code (unlike Julia, which does this automatically - see later section).

However the results compare with a carefully hand-optimized C/C++ program, this can be a substantial improvement on simple Python. Jake VanderPlas wrote some blog articles back in 2012/2013 that are still interesting to read: http://jakevdp.github.io/blog/2012/08/24/numba-vs-cython/ and http://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/

### Cython

This basically adds two extensions to Python syntax:
- with `cdef`, variables can be declared with an explicit (C-compatible) type
- functions can be declared with `def`, `cdef` or `cpdef` for Python-only, C-only or Python+C use.

There are also some function decorators which can be used instead of cdef and to control various checks.

Start by loading the Cython extension into the notebook:

In [4]:
%load_ext Cython

A simple example to show the syntax. The `-a` flag on the cython magic annotates the output so you can see the C code generated.

In [5]:
%%cython -a

# factorials
cdef int a = 1
for i in range(1,10):
    a *= i
print(a)

362880


Now run the primes-check example. This is a bit fiddly, because the Cython environment can't see Python globals and other cells can't see the Cython function. Hence the extra Python function `run_primes_cython()` to give `%%timeit` something to work with.

In [6]:
%%cython
import math
cimport cython # to get the decorators

PRIMES = [
    112272535095293,
    112582705942171,
    112272535095293,
    115280095190773,
    115797848077099,
    1099726899285419]

@cython.boundscheck(False)
@cython.wraparound(False)
def is_prime_cython(n):
    if n % 2 == 0:
        return False

    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

def run_primes_cython():
    return [x for x in map(is_prime_cython, PRIMES)]

In [7]:
%%timeit -n2 -r4
run_primes_cython()

1.97 s ± 45.2 ms per loop (mean ± std. dev. of 4 runs, 2 loops each)


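In this example, on my system, the speedup is unspectacular (about 21% less time than the 2.5 s baseline) and nowhere near worth the effort. The function declares no C types, so the loop still works on generic Python objects, which is where most of the time goes.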

### Numba

From [their website](http://numba.pydata.org): Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code. 

Numba has many capabilities but at its simplest it can be astonishingly easy to use: just add an `@jit` decorator to a standard Python function.

In [8]:
from numba import jit  # numba.decorators is gone from newer Numba releases

@jit
def is_prime_numba(n):
    if n % 2 == 0:
        return False

    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

In [9]:
%%timeit -n2 -r4
result = [x for x in map(is_prime_numba, PRIMES)]

336 ms ± 38.6 ms per loop (mean ± std. dev. of 4 runs, 2 loops each)


On my system, this is 6-fold faster than Cython for far less effort. It's virtually as fast as my C++ program run from the command line (296 ms for Numba vs 285 ms for C++ compiled with `gcc -g -O0` or 251 ms with `gcc -O3` optimization).

I'm embarrassed that I'd been using Python for many years before I discovered Numba. 

There are limitations, meaning Numba can fail to compile a function. In particular, keep data structures simple: lists and NumPy arrays are good, dictionaries and pandas DataFrames are a problem. It's generally best to split out the slow, computationally intensive parts of the program for compilation, leaving most of the logic and complex data handling in Python.

## Parallel Processing

How many processor cores are there in your laptop computer? For an affordable Core i5 CPU, this might be 4 physical cores, and with hyper-threading the OS sees 8 logical cores.
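To answer that question from within Python, a minimal sketch using only the standard library is shown here. It reports logical cores, not physical ones, and the variable names are just for illustration.

```python
import os

# Logical cores seen by the OS (may be None if it cannot be determined)
logical_cores = os.cpu_count()

# On Linux, the cores this process may actually use can be fewer,
# for example inside a container or under taskset
if hasattr(os, "sched_getaffinity"):
    usable_cores = len(os.sched_getaffinity(0))
else:
    usable_cores = logical_cores

print(f"logical cores: {logical_cores}, usable by this process: {usable_cores}")
```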

By default, Python (more specifically CPython, which is what we're most likely using at present) is not thread-safe and the global interpreter lock (GIL) prevents multiple threads accessing Python objects simultaneously. So most of the time we have a single thread in a single process running on a single core. Spreading the load to the other cores takes a bit more work, but of course clever people have already written packages to help with this. 

### concurrent.futures module

Describes itself as "a high-level interface for asynchronously executing callables". A simple and useful feature is a parallel version of Python's `map(function, iterable)`.

In [10]:
import concurrent.futures as cf

In [11]:
with cf.ProcessPoolExecutor() as executor:
    print([x for x in executor.map(sq, range(100))])

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 9216, 9409, 9604, 9801]


In [12]:
%%timeit -n2 -r4
with cf.ProcessPoolExecutor() as executor:
    results = [x for x in executor.map(is_prime, PRIMES)]

1.63 s ± 127 ms per loop (mean ± std. dev. of 4 runs, 2 loops each)
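`executor.map()` hands back results in input order. When it's more useful to handle each result as soon as it is ready, `submit()` plus `as_completed()` is the usual alternative. A minimal sketch follows; `check_numbers` and its `executor_cls` parameter are hypothetical names introduced here, and the function is repeated so the block is self-contained.

```python
import concurrent.futures as cf
import math

def is_prime(n):
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

def check_numbers(numbers, executor_cls=cf.ProcessPoolExecutor, max_workers=None):
    """Return {number: is_prime(number)}, collecting results as they complete."""
    results = {}
    with executor_cls(max_workers=max_workers) as executor:
        # Remember which input each future belongs to
        future_to_number = {executor.submit(is_prime, n): n for n in numbers}
        for future in cf.as_completed(future_to_number):
            results[future_to_number[future]] = future.result()
    return results
```

In a notebook, `check_numbers(PRIMES)` would use worker processes. In a standalone script, put the call under `if __name__ == "__main__":`, because on platforms that spawn workers the main module is re-imported in each one. Passing `cf.ThreadPoolExecutor` instead makes an interesting comparison: for a CPU-bound function like this, threads are held back by the GIL.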


### multiprocessing module

Part of the Python standard library, this supports spawning processes for both local and remote concurrency. Like `concurrent.futures` it provides a parallel map function, but also quite a lot more. See [the documentation](https://docs.python.org/3/library/multiprocessing.html) for more advanced usage.

In [13]:
from multiprocessing import Pool, cpu_count

nProc = cpu_count()
print(f"Number of cores: {nProc}\n")
with Pool(nProc) as p:
    print(p.map(sq, range(100)))    

Number of cores: 4

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 9216, 9409, 9604, 9801]


In [14]:
%%timeit -n2 -r4 
with Pool(nProc) as p:
    results = [x for x in p.map(is_prime, PRIMES)]

1.63 s ± 10.8 ms per loop (mean ± std. dev. of 4 runs, 2 loops each)


### Numba (again)

The `@jit` decorator accepts `parallel=True` (and `@vectorize` accepts `target='parallel'`), and Numba will then try to generate multithreaded code. This is easy in principle but needs some thought from the programmer, as not all code can sensibly run in parallel and avoiding race conditions is your responsibility.

In [15]:
from numba import jit  # numba.decorators is gone from newer Numba releases

@jit(nopython=True, parallel=True)
def prime_parallel_numba(PRIMES):
    def is_prime_numba(n):
        if n % 2 == 0:
            return False

        sqrt_n = int(math.floor(math.sqrt(n)))
        for i in range(3, sqrt_n + 1, 2):
            if n % i == 0:
                return False
        return True
    
    nTodo = len(PRIMES)
    result = [False] * nTodo 
    for i in range(nTodo):
        result[i] = is_prime_numba(PRIMES[i])
    return result

This was more work than the non-parallel @jit example, because the compiler objected to `map()` and various other things I tried. At least it gives the right answer:

In [29]:
prime_parallel_numba(PRIMES)

[True, True, True, True, True, False]

In [17]:
%%timeit -n2 -r4
result = prime_parallel_numba(PRIMES)

291 ms ± 1.3 ms per loop (mean ± std. dev. of 4 runs, 2 loops each)


Essentially the same speed as the previous (and much easier) @jit example, and on taking a closer look Numba reported that it couldn't make this code parallel. You win some, you lose some...
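A likely reason for the failure to parallelize: with `parallel=True`, Numba only spreads an explicit loop across threads when it is written with `prange`, and a plain `range` loop filling a Python list gives it nothing to work with. The sketch below is an untimed variant using `prange` and a NumPy output array. The function names and the small sample are made up, and the `except ImportError` fallback is only there so the block still runs (serially, as plain Python) where Numba isn't installed.

```python
import math
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback (an assumption for portability): no compilation, no parallelism
    prange = range
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]        # used as bare @njit
        return lambda func: func  # used as @njit(...)

@njit
def is_prime_nb(n):
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

@njit(parallel=True)
def primes_parallel(numbers):
    n_todo = numbers.shape[0]
    result = np.zeros(n_todo, dtype=np.bool_)
    for i in prange(n_todo):  # prange marks the iterations as independent
        result[i] = is_prime_nb(numbers[i])
    return result

sample = np.array([10007, 10009, 10011], dtype=np.int64)  # hypothetical inputs
flags = primes_parallel(sample)
print(flags)
```

For the real test, pass `np.array(PRIMES, dtype=np.int64)`. With only six inputs of uneven difficulty, the best possible gain is limited by the slowest single check, so don't expect a speedup equal to the number of cores.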

### IPython parallel

Part of IPython/Jupyter rather than the Python language, this supports many types of parallelism, on a single machine or a cluster. It's a big, serious system, not a simple software drop-in like the previous examples.

Docs: https://ipyparallel.readthedocs.io/en/stable/intro.html

There's no demo here, because you need to start a controller and one or (preferably) more engines from the command line before starting Jupyter notebook.

### Big data systems

Imagine data collections so big that they don't fit in memory and aren't all in one place, but you want to do calculations and machine learning on them. Several software billionaires built their businesses on precisely this scenario, so we can be sure that it's a well-funded area of development. The latest, coolest code may be hidden inside Google, but some very powerful systems are available as free software. Some examples:
- Apache Spark
- TensorFlow
- PyTorch
- Pythran

## Use the GPU

We discussed above how to work with CPU cores.  But there's probably also a graphics processor (GPU) in your machine, where "cores" (defined _very_ differently by each manufacturer) are simpler but much more numerous: hundreds, maybe thousands. Originally these just did graphics processing (obviously) but to make this computing power more widely useful, in 2007 Nvidia released the CUDA software layer to allow general purpose computing on the GPU. By an odd coincidence, they sold a lot more hardware and [made a great deal of money](https://en.wikipedia.org/wiki/Nvidia#Finances) in the years since. 

To prevent things getting boringly simple, AMD are now also big players in the GPU computing market, using different technology and very different terminology to describe things. In particular, AMD strongly support OpenCL as an open-standard competitor to the proprietary CUDA technology. Intel CPUs with integrated graphics can also support OpenCL, and even Nvidia offer OpenCL support, though never with the same performance as CUDA on the same hardware.

Recent supercomputers are stuffed with thousands of graphics cards which never do graphics. More affordably, CUDA ___may___ also work on your laptop, depending on what hardware you have. Mine doesn't (it uses Intel graphics integrated on the CPU). If uncertain, on Linux you might try `lspci | grep -i nvidia`; a blank response means no Nvidia GPU was detected. OpenCL is very likely to work, though (depending on hardware) there may not be much performance gain; at least an OpenCL program is less likely to crash on startup than CUDA.

There's no point including sample code in this overview notebook, as it needs particular hardware and drivers to run. A few packages are mentioned briefly below, and I aim to make a separate notebook dedicated to CUDA and OpenCL.

### Numba (yet again)

Sponsors of the Numba project include Intel, Nvidia and AMD, so it's no surprise that it has pretty good CUDA support. 

There is an easy way and a hard-but-flexible way to use CUDA in Numba:
- Add `target='cuda'` to the `@vectorize` decorator (which then also needs explicit type signatures)
- Write your own CUDA kernels; docs here: https://numba.pydata.org/numba-doc/latest/cuda/index.html

### PyCUDA and PyOpenCL

Closely related packages from the same authors, these are for writing relatively low-level GPU code within Python: http://homepages.math.uic.edu/~jan/mcs572/mcs572notes/lec29.html

On my low-power fanless system with no graphics card, PyCUDA has no chance of running but PyOpenCL was willing to work with what it could find, in this case an i5 CPU:

```
In [1]: import pyopencl
In [2]: from pyopencl.tools import get_test_platforms_and_devices
In [3]: get_test_platforms_and_devices()
Out[3]: [(<pyopencl.Platform 'Portable Computing Language' at 0x7f637bf7e020>,
        [<pyopencl.Device 'pthread-Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz' 
        on 'Portable Computing Language' at 0x55d53170fd10>])]
```

### Other packages

There are lots.

Anaconda provides an introduction to [working with GPU packages](https://docs.anaconda.com/anaconda/user-guide/tasks/gpu-packages/) in Python. The emphasis is on machine learning applications, reflecting a major driver for CUDA development.

___scikit-cuda___ provides Python interfaces to Nvidia's CUDA libraries such as cuBLAS and cuFFT (despite the name, it is not a GPU port of scikit-learn): https://scikit-cuda.readthedocs.io/en/latest/

___Anaconda accelerate___ is only in the paid-subscription versions of Anaconda so (of course) I'm unlikely to use it.

## Interface to C/C++/Fortran

There are lots of ways to do this, which gives flexibility plus a pretty strong hint that none of the current methods is perfect.  The typical pattern is to take a shared library/DLL and put a "foreign function interface" (a software wrapper) around it so you can call its functions.

These differ in which languages they support and whether the foreign code is embedded locally or pre-compiled in an existing library. Also, these projects come and go, so the list below is limited to projects that still seem to be active as of 2019.

### C extension interface

The original approach built into CPython. For details see https://docs.python.org/3/extending/extending.html. Preferably look at some of the other options first, as they can make life easier.

### ctypes

Part of the Python standard library, this is available without needing installation. It allows you to wrap existing libraries, including third-party objects that you have no control over and no source code for. Docs: https://docs.python.org/3/library/ctypes.html

In [18]:
import ctypes

The following example shows the syntax, importing a shared library containing the `is_prime()` function.

In [19]:
testlib = ctypes.CDLL('C/libisprime.so')

bools = ("True", "False")
for number, prime in zip(PRIMES, map(testlib.is_prime, PRIMES)):
    print('%d is prime: %s' % (number, bools[prime]))

112272535095293 is prime: True
112582705942171 is prime: False
112272535095293 is prime: True
115280095190773 is prime: False
115797848077099 is prime: True
1099726899285419 is prime: True


Disaster! The program ran without an error message, but gives the wrong answers (should be 5 True's then a False). A test program written in C suggests the shared library is basically working:

```
$ ./testlib
112272535095293 is prime: true
112582705942171 is prime: true
112272535095293 is prime: true
115280095190773 is prime: true
115797848077099 is prime: true
1099726899285419 is prime: false
```

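The likely culprit is the call itself rather than the library. No argument or return types were declared, so ctypes falls back to C `int`, which cannot hold these values, and the `bools` tuple is indexed the wrong way round (a true result, 1, selects `"False"`). ctypes reports none of this, which is why it has a reputation for being a nightmare to debug. Test your code!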
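A minimal sketch of declaring the signature explicitly, assuming the prototype `bool is_prime(long n);` used in the cffi section below and a 64-bit C `long`. The first few lines show how a C `int` silently masks a value of this size. `load_is_prime` is a hypothetical helper and has not been run against `libisprime.so`.

```python
import ctypes

n = 112272535095293  # first entry of PRIMES

# ctypes integer types do no overflow checking: the value is simply masked
as_c_int = ctypes.c_int(n).value
as_c_longlong = ctypes.c_longlong(n).value
print(as_c_int == n, as_c_longlong == n)

def load_is_prime(path="C/libisprime.so"):
    """Load is_prime() with its C signature declared explicitly.

    Assumes the prototype `bool is_prime(long n);` and a 64-bit C long
    (true on 64-bit Linux and macOS, not on Windows).
    """
    lib = ctypes.CDLL(path)
    lib.is_prime.argtypes = [ctypes.c_long]  # pass n as a C long, not an int
    lib.is_prime.restype = ctypes.c_bool     # convert the result to a Python bool
    return lib.is_prime
```

With the return type declared as `c_bool`, the result is an ordinary Python `bool` and can be printed directly, so no lookup tuple is needed.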

### cffi

As [the docs](https://cffi.readthedocs.io/en/latest/index.html) say: this is a "C Foreign Function Interface for Python. Interact with almost any C code from Python, based on C-like declarations that you can often copy-paste from header files or documentation".

Though cffi provides a binary (ABI) mode, they [recommend](https://cffi.readthedocs.io/en/latest/overview.html#abi-versus-api) that non-Windows users avoid it. All the examples here use API mode and assume we can access a C compiler.

More recent than ctypes, cffi needs an import:

In [20]:
from cffi import FFI

One way to use cffi is as a wrapper round existing library code - like ctypes, but hoping for better results.

The first step is to generate new files in a format that Python can use. At its simplest, give the calling signature, the header file and a path to the library file. The verbose output is gcc-style cryptic (turn it off by setting `verbose=False`), but we're mainly hoping for an absence of error messages. A previous version of this code got the path to the library wrong and produced a LOT of error messages, all totally unhelpful.

In [21]:
ffibuilder = FFI()

# cdef() expects a single string declaring the C types, functions and
# globals needed to use the shared object. It must be in valid C syntax.
ffibuilder.cdef("""
    bool is_prime(long n);
""")

# set_source() gives the name of the python extension module to
# produce, and some C source code as a string.  This C code needs
# to make the declared functions, types and globals available,
# so it is often just the "#include".
ffibuilder.set_source("_primes_cffi",
"""
     #include "C/isprime.h"   // the C header of the library
""",
     libraries=['C/isprime'])   # library name, for the linker

ffibuilder.compile(verbose=True);

generating ./_primes_cffi.c
the current directory is '/home/colin/zcode/astro-Jupyter/performance'
running build_ext
building '_primes_cffi' extension
/home/colin/anaconda3/envs/ml2/bin/x86_64-conda_cos6-linux-gnu-cc -DNDEBUG -fwrapv -O2 -Wall -Wstrict-prototypes -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -fPIC -I/home/colin/anaconda3/envs/ml2/include/python3.6m -c _primes_cffi.c -o ./_primes_cffi.o
x86_64-conda_cos6-linux-gnu-gcc -pthread -shared -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,-rpath,/home/colin/anaconda3/envs/ml2/lib -L/home/colin/anaconda3/envs/ml2/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,-rpath,/home/colin/anaconda3/envs/ml2/lib -L/home/colin/anaconda3/envs/ml2/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -march=nocona -mtune=haswell -ftree-

That generates new C code, compiles it to an object module and links it to a shared library, all in the current directory.

```
$ ls _primes*
_primes_cffi.c  _primes_cffi.cpython-36m-x86_64-linux-gnu.so  _primes_cffi.o
```

Next we can get a `lib` object with callable Python functions, in this case just `is_prime()`:

In [22]:
from _primes_cffi import ffi, lib

for number, prime in zip(PRIMES, map(lib.is_prime, PRIMES)):
    print('%d is prime: %s' % (number, prime))

112272535095293 is prime: True
112582705942171 is prime: True
112272535095293 is prime: True
115280095190773 is prime: True
115797848077099 is prime: True
1099726899285419 is prime: False


The correct results! This supports the view that the ctypes problems were not caused by the `libisprime.so` file.

What about performance?

In [23]:
%%timeit -n2 -r4 
results = [x for x in map(lib.is_prime, PRIMES)]

252 ms ± 1.15 ms per loop (mean ± std. dev. of 4 runs, 2 loops each)


As good as running the C++ version from the command line!

We have the source code in this case, and cffi can work with this directly instead of needing the `.so` library. That's good, as generating this library was a (quite messy) extra step.

In [24]:
ffibuilder2 = FFI()

# cdef() expects a single string declaring the C types, functions and
# globals needed to use the shared object. It must be in valid C syntax.
ffibuilder2.cdef("""
    bool is_prime(long n);
""")

# set_source() gives the name of the python extension module to
# produce, and some C source code as a string.  This C code needs
# to make the declared functions, types and globals available,
# so it is often just the "#include".
ffibuilder2.set_source("_primes2_cffi",
"""
     #include "C/isprime.h"   // the C header of the library
""",
     sources=['C/isprime.c'],
     libraries=['m'])   # we need to link with the math library

ffibuilder2.compile(verbose=True);

generating ./_primes2_cffi.c
the current directory is '/home/colin/zcode/astro-Jupyter/performance'
running build_ext
building '_primes2_cffi' extension
/home/colin/anaconda3/envs/ml2/bin/x86_64-conda_cos6-linux-gnu-cc -DNDEBUG -fwrapv -O2 -Wall -Wstrict-prototypes -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -fPIC -I/home/colin/anaconda3/envs/ml2/include/python3.6m -c _primes2_cffi.c -o ./_primes2_cffi.o
/home/colin/anaconda3/envs/ml2/bin/x86_64-conda_cos6-linux-gnu-cc -DNDEBUG -fwrapv -O2 -Wall -Wstrict-prototypes -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -fPIC -I/home/colin/anaconda3/envs/ml2/include/python3.6m -c C/isprime.c -o ./C/isprime.o
x86_64-conda_cos6-linux-gnu-gcc -pthread -shared -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,-rpath,/home/col

This version of the library works the same way as before, giving the same results and at least as good performance:

In [25]:
import _primes2_cffi 

for number, prime in zip(PRIMES, map(_primes2_cffi.lib.is_prime, PRIMES)):
    print('%d is prime: %s' % (number, prime))

112272535095293 is prime: True
112582705942171 is prime: True
112272535095293 is prime: True
115280095190773 is prime: True
115797848077099 is prime: True
1099726899285419 is prime: False


In [26]:
%%timeit -n2 -r4 
results = [x for x in map(_primes2_cffi.lib.is_prime, PRIMES)]

245 ms ± 1.46 ms per loop (mean ± std. dev. of 4 runs, 2 loops each)


My first impressions of cffi are very positive. It does more than ctypes, fairly easily, and (for whatever reason) it gave the correct answer when ctypes didn't.

### SWIG

This involves roughly the same steps as cffi, but they are done at the command prompt and not within Python. All the files are in the `./swig/` directory.

Start by writing an interface file:

```
/* File : isprime.i */
%module isprime
%{
#include "isprime.h"
%}
%include "isprime.h"
```

Because SWIG supports many scripting languages, not just python, we need to tell it which bindings to generate: 

```
$ swig -python isprime.i
```

This gives us two new files, `isprime.py` and `isprime_wrap.c`.

Now we need to compile to get object files. This is highly system dependent, but after some trial and error this worked for me (on Linux Mint 19.2):

```
gcc -c isprime.c isprime_wrap.c -I/usr/include/python3.6 -fPIC
```
The include path needs to be a directory containing `Python.h`.

Finally use these object modules to create a shared library:
```
ld -shared -fPIC isprime.o isprime_wrap.o -o _isprime.so
```

At last we have the two files we need: `isprime.py` is our interface, and `_isprime.so` the shared library.

Import the necessary function and use it:

In [27]:
from swig.isprime import is_prime as is_prime_swig

for number, prime in zip(PRIMES, map(is_prime_swig, PRIMES)):
    print('%d is prime: %s' % (number, prime))

112272535095293 is prime: True
112582705942171 is prime: True
112272535095293 is prime: True
115280095190773 is prime: True
115797848077099 is prime: True
1099726899285419 is prime: False


In [28]:
%%timeit -n2 -r4 
results = [x for x in map(is_prime_swig, PRIMES)]

278 ms ± 1.68 ms per loop (mean ± std. dev. of 4 runs, 2 loops each)


Not bad: correct results, and performance only slightly worse than cffi. However, getting to this point was a fairly ugly and (in my novice hands) error-prone process.

Why use SWIG? It's good if you have a lot of code to wrap, with regular updates needed: that can be automated. Also if you want to support multiple scripting languages from this list: C#, D, Go,  Guile, Java, Javascript, Lua, MzScheme/Racket, OCaml, Octave, Perl, PHP, Python, R, Ruby, Scilab, Tcl.

I still think cffi is easier to get started with.

### cppyy

This lets you write arbitrary C++ code within Python, which is compiled on the fly by [Cling](https://root.cern.ch/cling) (which has CERN behind it). It supports modern code up to at least C++14 standards.

TODO: working example

### Boost.Python

This is something different: a C++ library that can be used in your code to expose functions and classes to Python. It's part of a much bigger and more complex Boost package, now quite old. Unfortunately, I think it's fair to say that the documentation is a mess.

Adding the C++ code is simple enough. The challenge is figuring out how to compile it, and after reading various web pages on this topic I still have no idea.

Conclusion: Boost.Python is only for serious C++ programmers who already use the Boost libraries for other reasons. It has nothing to offer someone like me.

### pybind11

Conceptually similar to Boost.Python, but newer and much more lightweight. Targeted at C++11 or later, which makes this sort of thing much simpler than in older language versions.

Various GitHub repos are at https://github.com/pybind; they are worth looking at to get the examples and tests. The documentation is at https://pybind11.readthedocs.io/en/stable/index.html: better than Boost.Python, but still a bit quirky.

These sites carefully avoid saying anything at all about installing pybind11 itself, though there are some clues to other requirements. The main thing to know is that, although this is mainly a C++ library, what you need to install is a Python package: `pybind11` in pip or conda, `python3-pybind11` on Debian-based systems.

As with Boost.Python, creating the C++ code is simple, but building it is at best confusing. Whatever your usual C++ build workflow, for pybind11 it is [strongly recommended to use CMake](https://stackoverflow.com/questions/54908007/how-to-properly-compile-c-code-with-pybind11). Alternatively, some people [prefer to use a `setup.py`](http://people.duke.edu/~ccc14/sta-663-2018/notebooks/S13C_pybind11.html) to control the build.

TODO: working example

### F2PY

An interface to Fortran 77/90/95. See https://www.numfys.net/howto/F2PY/ for an overview. That contains a broken link to the main F2PY website, which isn't encouraging.

The SciPy pages may be more useful: https://docs.scipy.org/doc/numpy/f2py/. Apparently F2PY is now part of NumPy.

## Consider other languages

Python is relatively quick and easy to write but slow(ish) to run. We've looked at ways to speed it up, successful enough to make this the most widely-used language in modern astronomy. But there's a limit to speedup and sometimes you hit it.

Time to at least consider the alternatives.

### Julia

A [fairly new open-source language](https://julialang.org/), under rapid development and growing in popularity among scientists and engineers. Development started in 2009 and version 1.0, with a more definitive and stable API, was released in August 2018.

The slogan is "walk like Python, run like C". The syntax is simple and familiar, largely a modernized cross between Python and Matlab, so the learning curve is fairly shallow. 

Most SciPy/NumPy equivalent functionality is built in as standard because this is what the language is designed for. There's a growing subset of AstroPy functionality available (and they'd love you to help expand this).

Making mixed-language programming easy is a core objective, so several Python packages are simply imported and used as-is. Plotting can use Matplotlib (or Plotly or several others), and symbolic math uses SymPy.

What about "run like C"? The language implementation means that non-trivial Julia code can run many-fold faster than the equivalent Python, but this may need a [change of programming style](https://docs.julialang.org/en/v1/manual/performance-tips/). In particular:
- Speed depends on putting performance-critical code into functions that an optimizing compiler can work on the first time they are called
- Avoid global variables, avoid changing the type of variables
- As a modern scientific language, multi-threading and distributed processing are core features

Will Julia become common in astronomy? It deserves to, and I'll try to help. However, after several years of astronomers world-wide mostly converging on Python as their standard language and investing a lot of effort in its development, the timing of Julia's stable release is unfortunate.

### C/C++

We've talked about interfacing these languages to Python. Sometimes that's not worth the complications and it's better to use them directly. At least you get access to modern graphics (Qt, OpenGL, etc) in a way that's alien to Fortran.

### Fortran

Hard core! Fortran is outdated, ugly, hard to write, harder to debug and an all-round pain. So why is it still widely used?
- For the biggest simulations, especially highly parallel OpenMP code, Fortran programs still run fastest. Even C/C++ can't quite match it, and time on a big supercomputing cluster is a limited resource you may need to optimize.
- Lots of scientists spent the last 60 years writing, debugging, optimizing and validating Fortran code. Those software libraries still exist and you probably use them regularly without realizing (hidden behind glue code for your favorite language). Sometimes only using them in the raw will get the job done.

This was the first programming language I ever learned (in 1974): FORTRAN IV punched on to cards and fed to an IBM 370/165 mainframe. Then we hung around by the line printer waiting for the operator to tear off our pages of 14-inch fanfold paper, only to see that we had missed a comma on line 15. It seemed wonderful at the time.

The language has evolved since then (Hollerith strings were always dumb and no sane person misses them), but not as much as you'd think. Remember that in this world, machine time is precious but programmer time (and nervous energy) is expendable.