Accelerating Applications
==============

You may have heard that Python is rather slow (compared to 'compiled' languages like *C*).
It can also be quite difficult to write fast applications in Python, because doing so often means replacing simple loops with array operations and their odd indexing syntax.
There *is* another way! There are a number of ways that you can accelerate your slow applications almost automatically...

The first thing that we will look at (apart from the obvious step of using ```numpy``` rather than loops) is using multiple CPU cores.

In [None]:
import numpy as np

The ```multiprocessing``` module
------------------------------

Multi-core processing in Python is a little convoluted because of the [Global Interpreter Lock (GIL)](https://wiki.python.org/moin/GlobalInterpreterLock), which allows only one thread at a time to execute Python bytecode. It is still possible; it just means that we need to launch a separate Python process (in effect, a separate 'kernel') for each worker. This carries some overhead, particularly in memory.

We'll only cover using the ```Pool``` here, but there is much more information in the [documentation](https://docs.python.org/3/library/multiprocessing.html). A pool works best when you have very 'heavy' functions, such as ones that load data and then do some expensive processing on it.

First we will need to introduce you to a functional programming tool called ```map```. ```map``` lets you call a function on each element of, for example, a list of numbers. Surely you would just use ```numpy``` arrays though, if you wanted to do that, right? Consider the following function:

In [None]:
def difficult_function(x):
    if x < 2:
        return 0
    else:
        return x

If we need to call this function with a numpy array, we will see that we have a bit of an issue!

In [None]:
try:
    difficult_function(np.arange(10))
except ValueError:
    print("Oh dear! You can't do this because you can't test element by element")

Instead, we need to *map* our input across a function. ```map()``` takes two arguments, the first your callable (i.e. your function) and second a list or array which should be fed to it *one-at-a-time*. ```map()``` returns a [*lazy* object](https://stackoverflow.com/questions/37417210/lazy-evaluation-of-map), meaning that it won't automatically generate all of your results. To get everything, you need to convert it to a ```list``` or a numpy ```array```.

In [None]:
output = map(difficult_function, np.arange(10))
print(list(output))
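
To make the *lazy* behaviour a little more concrete, here is a minimal sketch using only built-in Python. The function name ```clip_below_two``` is made up for this illustration; it follows the same logic as ```difficult_function``` above.

```python
def clip_below_two(x):
    """Return 0 for inputs below 2, otherwise return the input unchanged."""
    return 0 if x < 2 else x


lazy = map(clip_below_two, range(5))

# Nothing has been computed yet; next() asks for exactly one result.
first = next(lazy)

# list() consumes whatever is left in the map object.
rest = list(lazy)

# A map object can only be consumed once, so it is now empty.
leftover = list(lazy)

print(first, rest, leftover)
```

Because a ```map``` object can only be consumed once, store the ```list``` if you need the results more than once.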

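It is important to write functions for parallelisation that have no *side effects*. This means that they do not alter any of the objects that they take in, nor any state shared with the rest of the program.

Why is this the case? Well, if common objects are being altered then you will need to deal with communication between workers. If worker 1 alters an integer x and then worker 2 also alters x to a different value, which is the 'correct' value for x? (With ```multiprocessing```, each worker process operates on its own copy of the inputs, so such changes would not even make it back to the main process.)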
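
As a small illustration of the difference, the sketch below contrasts a function with a side effect against one without. Both function names are hypothetical and exist only for this example.

```python
def scale_in_place(values, factor):
    """Has a side effect: modifies the list that was passed in."""
    for i in range(len(values)):
        values[i] *= factor
    return values


def scale_pure(values, factor):
    """No side effects: builds a new list and leaves the input alone."""
    return [v * factor for v in values]


original = [1, 2, 3]

scaled = scale_pure(original, 10)
snapshot = list(original)      # copy of the input after the pure call

mutated = scale_in_place(original, 10)

print("after pure call:    ", snapshot)
print("after in-place call:", original)
print("same object returned:", mutated is original)
```

The second style is the one to avoid in functions that you intend to hand to a pool of workers.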

In summary: we need a *heavy* (i.e. takes at least a few seconds to compute) function that has no side effects. Idea: generating and then plotting some data!

In [None]:
from multiprocessing import Pool
import matplotlib.pyplot as plt
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas

In [None]:
def difficult_function(seed=10, size=1024, bins=1024):
    np.random.seed(seed)
    data = np.random.rand(size**2)
    fig, ax = plt.subplots()
    plot = ax.hist(data, bins=bins)
    
    canvas = FigureCanvas(fig)

    canvas.draw()       # draw the canvas, cache the renderer

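    # tostring_rgb() and binary-mode np.fromstring() are deprecated, so read
    # the RGBA buffer instead; the result has shape (height, width, 3)
    image = np.asarray(canvas.buffer_rgba())[..., :3].copy()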
    
    plt.close(fig)
    
    return image   # Get the pixel data from a matplotlib image!

In [None]:
seeds = range(8)

Let's test this as a *serial* function:

In [None]:
%timeit list(map(difficult_function, seeds))

Now let's take a look at the ```multiprocessing``` pool. This will enable you to split your ```map``` into multiple pieces, and then give each of them to a different worker process on your computer to execute **in parallel**.

To spawn the pool with n worker processes, call ```Pool(n)```.

In [None]:
pool = Pool(2)

Then, to run your analysis across those workers, call ```map``` in much the same way as above, but as a *method* of the pool!

In [None]:
%timeit pool.map(difficult_function, seeds)

Note that this is not simply twice as fast. This is because of the overhead of starting the extra Python processes and of the communication between the main 'host' and the workers that go away and call the function, which includes sending every result back to the host.
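
As a rough sketch of the same pattern outside a notebook, the pool can be wrapped in a ```with``` block so that the workers are cleaned up once the work is done. The names ```sum_of_squares``` and ```parallel_map``` are made up for illustration and are not part of ```multiprocessing```.

```python
from multiprocessing import Pool


def sum_of_squares(n):
    """A stand-in for a 'heavy' function with no side effects."""
    return sum(i * i for i in range(n))


def parallel_map(func, items, workers=2):
    """Apply func to every item using a pool of worker processes."""
    # The with block shuts the pool down once map() has returned.
    with Pool(workers) as pool:
        return pool.map(func, items)
```

In a script, call ```parallel_map(sum_of_squares, [10**5] * 8)``` from under an ```if __name__ == "__main__":``` guard, so that worker processes which re-import the script do not try to start pools of their own.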

Just-in-Time Compilation
=============

So, you have heard that python is not a compiled language, eh? Well, let's challenge that.

If you have heard of LLVM, I am sure you can see where this is going. Essentially, we *can* compile python to machine code, but it's only helpful if we have functions that take a bit of time.

This is where [```numba```](https://numba.pydata.org/) comes in. It uses LLVM to compile your functions to machine code (sort of) and then uses them in your scripts. They can even be faster than their ```numpy``` equivalents! It's also very easy to use.

Let's start with a few standard ways of summing up an array:

In [None]:
def standard(x):
    y = 0
    for item in x:
        y += item
        
    return y

In [None]:
def numpy_way(x):
    return x.sum()

In [None]:
def python_way(x):
    return sum(x)

Now, let's profile these and see which is fastest (my bet is, of course, on the ```numpy``` implementation!)

In [None]:
input_data = np.arange(10000000)

In [None]:
%timeit standard(input_data)

In [None]:
%timeit numpy_way(input_data)

In [None]:
%timeit python_way(input_data)

So, the ```numpy``` implementation should be the fastest *by far*. Here is where the magic comes in. We can speed up our terrible standard way of a for loop over the elements in the array by 'decorating' the function with ```@numba.jit```. This marks the function for compilation by ```numba```; the compilation itself happens lazily, the first time that the function is called.

In [None]:
import numba

In [None]:
@numba.jit
def numba_way(x):
    y = 0
    for item in x:
        y += item
        
    return y

In [None]:
%timeit numba_way(input_data)

Cool! Depending on your system, this should be on par with *or* faster than the ```numpy``` implementation! Note, though, that ```%timeit``` may warn that the slowest run took much longer than the fastest; this is to be expected, because the function had to be compiled on its first call!
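
One way to see the compilation cost directly is to time the first and second calls separately. The following is only a sketch: ```loop_sum``` is a made-up name, the array size is arbitrary, and the ```try```/```except``` fallback is an addition so that the snippet still runs (without any speed-up) when ```numba``` is not installed.

```python
import time

import numpy as np

try:
    from numba import njit          # shorthand for jit(nopython=True)
except ImportError:
    def njit(func):                 # fallback: leave the function uncompiled
        return func


@njit
def loop_sum(x):
    total = 0
    for item in x:
        total += item
    return total


data = np.arange(100_000, dtype=np.int64)

start = time.perf_counter()
first = loop_sum(data)              # with numba, this call includes compilation
first_call = time.perf_counter() - start

start = time.perf_counter()
second = loop_sum(data)             # reuses the already-compiled function
second_call = time.perf_counter() - start

print(f"first call:  {first_call:.6f} s")
print(f"second call: {second_call:.6f} s")
```

With ```numba``` installed, the first call should be noticeably slower than the second; the exact numbers depend on your system.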

This is a very simple example, but it is much easier to write your complex logic as a series of ```for``` loops rather than using ```numpy``` operations, especially whilst you are new. Don't use ```numba``` as a crutch, but it can be very helpful if you are in a pinch!

Be careful where you use ```numba```, as the cost of compilation may slow you down if you are using it on 'quick' functions that are only called a handful of times.

Further reading:
  - Look at using 'type annotations' in ```numba``` to speed you up even more
  - If you need even more parallelism, check out http://mpi4py.scipy.org/docs/.

# End of Notebook