# 3. Performance

Python is typically slower than compiled languages, largely because it is **dynamically typed**.

For example, let's say you wanted to add two integers together. In C this might look like:

    int a = 1;
    int b = 2;
    int c = a + b;
    
The C compiler knows from the start that `a` and `b` are integers; they cannot be anything else. It can therefore emit the appropriate machine instructions for integer addition directly, producing another integer value. Conceptually, this looks something like:

    1. Assign 1 to a
    2. Assign 2 to b
    3. add<int,int>(a, b)
    4. Assign result to c
    
Compare this with the equivalent Python:
    
    a = 1
    b = 2
    c = a + b
    
Here the interpreter knows only that `a`, `b` and `c` are *Python objects*; their types are discovered at run time. The interpreter must inspect the PyObject_HEAD of each object to find the type information, then call the appropriate addition routine for the two types:

    
    Assign 1 to a
        Set a->PyObject_HEAD->typecode to integer
        Set a->val = 1
    Assign 2 to b
        Set b->PyObject_HEAD->typecode to integer
        Set b->val = 2
    call binary_add(a, b)
        find typecode in a->PyObject_HEAD
        a is an integer; value is a->val
        find typecode in b->PyObject_HEAD
        b is an integer; value is b->val
        call binary_add<int, int>(a->val, b->val)
        result of this is result, and is an integer.
    Create a Python object c
        set c->PyObject_HEAD->typecode to integer
        set c->val to result
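
The run-time type lookup described above can be observed from Python itself. The following is a minimal sketch using only the standard library; the helper name `describe_add` is made up for illustration, and the exact bytecode printed by `dis` varies between Python versions.

```python
import dis


def add(a, b):
    return a + b


def describe_add(a, b):
    """Return a + b together with the run-time type name of the result."""
    result = add(a, b)
    return result, type(result).__name__


# The same source line dispatches to different routines depending on the
# types the interpreter finds at run time.
print(describe_add(1, 2))        # integer addition
print(describe_add(1.5, 2))      # float addition
print(describe_add("a", "b"))    # string concatenation

# The bytecode holds one generic binary-operation instruction and no type
# information; the types are only resolved when the instruction executes.
dis.dis(add)
```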

### Python is interpreted, not compiled

A compiler can look ahead and optimize away repeated or unneeded operations, which can result in significant speedups. Interpreters, on the other hand, do not have this luxury.

### Python's object model leads to inefficient memory access

When applying batch operations to many integers (in an array, for instance), C is much more efficient because a C array is a contiguous block of raw values with very little overhead. A *Python list*, by contrast, is a contiguous buffer of *pointers*, each of which could potentially point to an arbitrary area of memory, so iterating over it makes poor use of the CPU cache, whereas small C arrays are likely to be cached. NumPy arrays get around this by wrapping a C array in a *single Python object*.
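
The difference in memory layout can be made visible with a rough size comparison. This is only a sketch: the byte counts reported by `sys.getsizeof` are specific to CPython and to the platform, and the variable names are chosen for illustration.

```python
import sys

import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The list itself stores only pointers; every integer is a separate object
# with its own header, living somewhere else in memory.
list_container_bytes = sys.getsizeof(py_list)
list_element_bytes = sum(sys.getsizeof(v) for v in py_list)

# The NumPy array stores the raw 8-byte values in one contiguous buffer.
array_data_bytes = np_array.nbytes

print("list container:", list_container_bytes)
print("list elements :", list_element_bytes)
print("array data    :", array_data_bytes)
```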

## Time

Python has a *time* module whose `time()` function returns the number of seconds since the Epoch (01/01/1970).

In [None]:
import time
time.time()

Naive profiling can take place by differencing the times before and after running some code of interest:

In [None]:
import numpy as np
t0 = time.time()
# some function
np.dot(np.random.randn(1000,3),np.ones((3,2))*10)

print(time.time() - t0)
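
For interval timing, the standard library also offers `time.perf_counter()`, a monotonic high-resolution clock that is not affected by system clock adjustments. The sketch below wraps the before/after differencing in a small helper; `time_call` is a hypothetical name, not part of any library.

```python
import time


def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()          # monotonic, high-resolution clock
        func(*args)
        best = min(best, time.perf_counter() - t0)
    return best


elapsed = time_call(sum, range(100_000))
print(f"best of 5: {elapsed:.6f} s")
```

Taking the best of several runs reduces the influence of other processes on the measurement.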

Some of the most powerful tools for performance analysis are **IPython magics**: %timeit, %run and %prun.

Let's illustrate using the trapezoidal rule:

### Trapezoidal Rule

This is a method from numerical integration for approximating a definite integral:

$$
\int_a^b f(x)dx \approx (b-a)\frac{f(a)+f(b)}{2}
$$

Rather than using a single interval for this estimate, we break the interval down into $N$ subintervals to obtain a more accurate approximation:

$$
\int_a^b f(x)dx \approx \sum_{k=1}^N \frac{f(x_{k-1})+f(x_k)}{2}\Delta x_k 
$$

which for a uniform grid of equally-spaced panels becomes:

$$
I = \frac{\Delta x}{2}(f(x_0) + 2f(x_1) + 2f(x_2) + \dots + 2f(x_{N-1}) + f(x_N))
$$

In [None]:
def f(x):
    return 2*x*x + 3*x + 1

def trapz(f, a, b, N):
    h = (b-a)/float(N)
    sum_y = 0
    x = a
    for i in range(N - 1):  # interior points only; the endpoints are added below
        x += h
        sum_y += f(x)
    sum_y += .5 * (f(a) + f(b))
    return sum_y*h

In [None]:
trapz(f, 1, 5, 10000)

Confirm this using *sympy*:

In [None]:
import sympy

xs = sympy.symbols("xs")
fx = 2*xs*xs + 3*xs + 1
ifx = sympy.integrate(fx, (xs, 1, 5))
ifx.evalf()
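
As a further cross-check, the uniform-grid formula can be written directly with NumPy. This is a minimal sketch: `f_poly` and `trapz_uniform` are illustrative names, and the exact value $368/3$ comes from integrating the polynomial by hand.

```python
import numpy as np


def f_poly(x):
    # same polynomial as f(x) above
    return 2 * x * x + 3 * x + 1


def trapz_uniform(func, a, b, N):
    """Composite trapezoidal rule on N equal panels."""
    x = np.linspace(a, b, N + 1)      # N panels need N + 1 grid points
    y = func(x)
    h = (b - a) / N
    # endpoints are weighted by 1/2, interior points by 1
    return h * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1])


approx = trapz_uniform(f_poly, 1, 5, 10000)
exact = 368 / 3                       # analytic value of the integral on [1, 5]
print(approx, exact, abs(approx - exact))
```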

In [None]:
%timeit trapz(f, 1, 5, 10000)

In [None]:
%prun trapz(f, 1, 5, 100000)
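
Outside IPython, roughly the same measurements can be made with the standard-library `timeit` and `cProfile` modules. The sketch below uses a toy function, `work`, as a stand-in for the code being profiled.

```python
import cProfile
import io
import pstats
import timeit


def work(n):
    """Toy workload standing in for the function being profiled."""
    return sum(i * i for i in range(n))


# Rough equivalent of %timeit: total seconds taken by `number` calls.
total = timeit.timeit(lambda: work(10_000), number=20)
print(f"{total / 20:.6f} s per call")

# Rough equivalent of %prun: profile one call and show the top entries.
profiler = cProfile.Profile()
profiler.enable()
work(100_000)
profiler.disable()

buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(4)
report = buffer.getvalue()
print(report)
```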

## Speeding up Python/Pandas

When you have cratered under the weight of slow code, and have profiled it to find the bottleneck, there are a number of easy ways to speed it up.

One of these methods is making effective use of **list comprehensions**.

In [None]:
def calculate_y(x):
    return 3*x**3 + 4*x**2 + 10*x - x**4

In [None]:
%%timeit
calcs = []
for x in range(10000):
    calcs.append(calculate_y(x))

In [None]:
%timeit calcs = [calculate_y(x) for x in range(10000)]

A moderate speed-up for very little work, and it's slightly easier to read and write.

## Eval

In addition, Pandas provides access to *fast array expression evaluation* with eval():

In [None]:
import pandas as pd
x1, x2, x3 = [pd.DataFrame(np.random.normal(3.0, 2.0, size=(100000,500))) for i in range(3)]
x1.head(1)

In [None]:
%timeit x1 + x2 + x3

In [None]:
%timeit pd.eval("x1 + x2 + x3")

In [None]:
%timeit (x1 < x2) & (x2 > x3)

In [None]:
%timeit pd.eval("(x1 < x2) & (x2 > x3)")

In [None]:
dfx = pd.DataFrame(np.random.randn(1000000,2), columns=['x','y'])

In [None]:
%timeit dfx.x*1.5 + .25*dfx.x**2 - 3.4*dfx.y + .75*dfx.y**2 - 10

In [None]:
%timeit dfx.eval("1.5*x + 0.25*x**2 - 3.4*y + 0.75*y**2 - 10")

The result of an expression can be assigned to a new column in place by passing `inplace=True`:

In [None]:
dfx.eval("z = x<0.5", inplace=True)
dfx.head()

Local Python variables can be referenced inside the expression using the *@* prefix:

In [None]:
cs = 0.007
dfx.eval("x * @cs").head()
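
A quick way to build confidence in an `eval()` expression is to compare it with the same arithmetic written out directly. The sketch below does this on a small frame; the names `small`, `scale` and `eval_demo` are made up for illustration.

```python
import numpy as np
import pandas as pd


def eval_demo():
    rng = np.random.default_rng(0)
    small = pd.DataFrame(rng.standard_normal((1000, 2)), columns=["x", "y"])
    scale = 0.007  # local Python variable, referenced as @scale in the string

    via_eval = small.eval("1.5*x + @scale*y")
    direct = 1.5 * small.x + scale * small.y
    return via_eval, direct


via_eval, direct = eval_demo()
print(np.allclose(via_eval, direct))
```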

## Cython

Python developers typically work around performance constraints by building Python extensions that wrap code written in compiled languages (such as C/C++). However, the Python/C API is hideously complex for all but the most veteran developers.

Cython is a language that allows programmers to write fast code without having to write C/C++/Fortran directly. It looks like Python code but with type declarations. Cython code is translated to C, which is then compiled to create a Python extension that we can import and use.

Cython can deliver large speedups over pure Python, sometimes several orders of magnitude, and well-typed Cython code can approach the speed of hand-written C, but it can take a long time to get right.

Recall our trapz() function:

In [None]:
# Pure Python

def trapz(f, a, b, N):
    h = (b-a)/float(N)
    sum_y = 0
    x = a
    for i in range(N - 1):  # interior points only; the endpoints are added below
        x += h
        sum_y += f(x)
    sum_y += .5 * (f(a) + f(b))
    return sum_y*h

In [None]:
df = pd.DataFrame({'a': np.random.randn(10000),
                   'b': np.random.randn(10000),
                   'N': np.random.randint(100, 1000, (10000)),
                   'x': 'x'})

In [None]:
%timeit df.apply(lambda x: trapz(f, x.a, x.b, x.N), axis=1)

Let's profile to see why it's slow:

In [None]:
%prun -l 4 df.apply(lambda x: trapz(f, x.a, x.b, x.N), axis=1)

Most of the time is spent in our functions, so we should convert them to Cython.

In [None]:
%load_ext Cython

The easiest thing we can do is use the **IPython magic** to compile a Jupyter notebook cell with Cython for us:

In [None]:
%%cython

def f2(x):
    return 2*x*x + 3*x + 1

def trapz2(f, a, b, N):
    h = (b-a)/float(N)
    sum_y = 0
    x = a
    for i in range(N - 1):  # interior points only; the endpoints are added below
        x += h
        sum_y += f(x)
    sum_y += .5 * (f(a) + f(b))
    return sum_y*h

In [None]:
%timeit df.apply(lambda x: trapz2(f2, x.a, x.b, x.N), axis=1)

A fair speed-up just from compiling the cell with Cython. Now let's see where the remaining overhead lies by passing the --annotate flag to the magic:

In [None]:
%%cython --annotate

def f2(x):
    return 2*x*x + 3*x + 1

def trapz2(f, a, b, N):
    h = (b-a)/float(N)
    sum_y = 0
    x = a
    for i in range(N - 1):  # interior points only; the endpoints are added below
        x += h
        sum_y += f(x)
    sum_y += .5 * (f(a) + f(b))
    return sum_y*h

In the above, the colour indicates the 'typedness' of the extension, where yellower lines are closer to Python, and therefore require more calls to the Python C API, while whiter lines indicate code that is closer to pure C, hence requiring few, if any, Python API calls.

Clicking on a line reveals the C code underneath the call to Cython.

The goal in speeding up code with Cython is to turn as many lines to white as possible. The easiest way to do this is with type declarations:

In [None]:
%%cython --annotate

def f3(double x):
    return 2*x*x + 3*x + 1

def trapz3(f, double a, double b, int N):
    # declare types
    cdef double h, x, sum_y
    cdef int i
    # continue
    h = (b-a)/float(N)
    sum_y = 0
    x = a
    for i in range(N - 1):  # interior points only; the endpoints are added below
        x += h
        sum_y += f(x)
    sum_y += .5 * (f(a) + f(b))
    return sum_y*h

In [None]:
%timeit df.apply(lambda x: trapz3(f3, x.a, x.b, x.N), axis=1)

As we can see, the run time is roughly halved again just by adding type declarations. The next thing we could do is inline the polynomial function: we ask the compiler to paste the function body wherever it is called rather than making an expensive function call. This is particularly useful here because $f(x)$ is called many, many times while calculating the integral. This involves:
* changing the *Python* def to cdef
* adding a return type
* adding the *inline* keyword

In [None]:
%%cython --annotate

import cython

cdef inline double f4(double x):
    return 2*x*x + 3*x + 1

@cython.cdivision(True)
cpdef double trapz4(double a, double b, int N):
    # declare types
    cdef double h, x, sum_y
    cdef int i
    # continue
    h = (b-a)/float(N)
    sum_y = 0
    x = a
    for i in range(N - 1):  # interior points only; the endpoints are added below
        x += h
        sum_y += f4(x)
    sum_y += .5 * (f4(a) + f4(b))
    return sum_y*h

In [None]:
%timeit df.apply(lambda x: trapz4(x.a, x.b, x.N), axis=1)

The cdef keyword declares a C object. Everything that follows it is therefore specified in terms of C; we are essentially writing C, but using a subset of Python's syntax rules. So when we create function cdef f4, it is a C function, and not available to you in Python. This is worth considering to ensure it is not called in Python by accident.

The cpdef keyword, however, is a hybrid declaration that creates both a C interface and a Python interface to the function.

### Using Numpy Arrays

If we profile the function now, we see that our functions are no longer near the top:

In [None]:
%prun -l 4 df.apply(lambda x: trapz4(x.a, x.b, x.N), axis=1)

However, the *Series* constructor is being called a lot. This is because `apply` turns each row into a *Series*. We can avoid this by passing the underlying NumPy arrays to Cython directly:

In [None]:
%%cython --annotate

cimport numpy as np
import numpy as np

import cython

cdef inline double f4(double x):
    return 2*x*x + 3*x + 1

@cython.cdivision(True)
cpdef double trapz4(double a, double b, int N):
    # declare types
    cdef double h, x, sum_y
    cdef int i
    # continue
    h = (b-a)/float(N)
    sum_y = 0
    x = a
    for i in range(N - 1):  # interior points only; the endpoints are added below
        x += h
        sum_y += f4(x)
    sum_y += .5 * (f4(a) + f4(b))
    return sum_y*h

cpdef np.ndarray[double] apply_trapz(np.ndarray col_a, np.ndarray col_b, np.ndarray col_n):
    # np.float and np.int were removed in NumPy 1.24; compare against concrete dtypes
    assert(col_a.dtype == np.float64 and col_b.dtype == np.float64 and col_n.dtype == np.dtype(int))
    
    cdef Py_ssize_t i, n = len(col_n)
    assert(len(col_a) == len(col_b) == n)
    cdef np.ndarray[double] res = np.empty(n)
    
    for i in range(len(col_a)):
        res[i] = trapz4(col_a[i], col_b[i], col_n[i])
    return res

In [None]:
%timeit apply_trapz(df.a.values, df.b.values, df.N.values)

Our work appears to be finished in terms of optimizations here.

### Compiler Directives

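Compiler directives switch off some of the safety checks that Cython inherits from Python. As an example, consider calculating the Euclidean distance between two arrays: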

In [None]:
def euclidean(x, y):
    return np.sqrt(((x - y)**2).sum())

In [None]:
%timeit euclidean(np.random.randn(1000), np.random.randn(1000))

In order to get a speed up under Cython, we need to iterate over the elements manually to aggregate them:

In [None]:
%%cython --annotate

import cython
cimport numpy as np
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def euclidean2(np.ndarray[np.float64_t, ndim=1] x, np.ndarray[np.float64_t, ndim=1] y):
    cdef:
        double diff
        int i
    diff = 0
    for i in range(x.shape[0]):
        diff += (x[i] - y[i])**2
    return sqrt(diff)

In [None]:
%timeit euclidean2(np.random.randn(1000), np.random.randn(1000))

Setting *boundscheck* to False removes bounds checking for indexing operations, so it becomes our responsibility to ensure that we never index an array out of bounds. Setting *wraparound* to False means Cython no longer supports Python-style negative indexes, which count from the end of the array. Using directives is powerful, but dangerous: if we index incorrectly or make some other error, it can cause segmentation faults and/or data corruption.
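
The two behaviours that these directives switch off are easy to see in ordinary NumPy code. This short sketch shows what Python and NumPy normally do for us; it does not involve Cython itself.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0])

# Wraparound: a negative index counts from the end of the array.
last = values[-1]

# Bounds checking: an out-of-range index raises an exception instead of
# silently reading whatever memory lies past the end of the buffer.
try:
    values[3]
    out_of_range_caught = False
except IndexError:
    out_of_range_caught = True

print(last, out_of_range_caught)
```

With both directives disabled in Cython, neither of these safeguards applies inside the decorated function.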

## Task

Gradient descent minimises an objective function by repeatedly stepping in the direction of the negative gradient. Here the objective is the squared error $\mathbf{e}$ of a linear-regression problem, and the parameters are the weights $\mathbf{w}$. The algorithm works as follows:
1. Initialise $\mathbf{w}^{(0)}$ uniformly at random and set $k = 0$
1. While $k$ < maximum iterations and $\mathbf{w}$ has not converged:
    1. Calculate $\nabla_w \mathbf{e}$
    2. Update $w^{(k+1)}=w^{(k)} - \gamma \nabla_w \mathbf{e}$ and increment $k$

where $\gamma$ is the learning rate. 
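
To see the update rule converge on a small problem, here is a minimal sketch on synthetic, noise-free data. It makes assumptions that are not in the task code: the gradient is scaled by $1/n$ (the mean rather than the summed squared error) so that a learning rate of 0.1 is stable, and the names `gradient_descent_sketch`, `X_demo`, `y_demo` and `w_true` are chosen for illustration.

```python
import numpy as np


def gradient_descent_sketch(X, y, gamma=0.1, n_iter=500, seed=0):
    """Least-squares gradient descent; gradient scaled by 1/n (an assumption)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    X1 = np.column_stack((np.ones(n), X))        # prepend an intercept column
    w = rng.random(p + 1)                         # uniform random start
    for _ in range(n_iter):
        grad = (2.0 / n) * X1.T @ (X1 @ w - y)    # gradient of the mean squared error
        w -= gamma * grad
    return w


rng = np.random.default_rng(1)
X_demo = rng.standard_normal((200, 3))
w_true = np.array([0.5, 1.0, -2.0, 3.0])          # intercept followed by slopes
y_demo = w_true[0] + X_demo @ w_true[1:]          # noise-free targets

w_hat = gradient_descent_sketch(X_demo, y_demo)
print(np.round(w_hat, 4))
```

Because the targets contain no noise, the recovered weights should match `w_true` closely once the iteration has converged.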

Run the code below in normal Python to see how fast it is with %timeit, then try to Cythonize it and see who gets the best performance.

In [None]:
def gradient_descent(X, y, gamma = 1., n_iter = 10**3):
    n, P = X.shape
    nX = np.column_stack(((np.ones(n,)), X))
    w = np.random.rand(P+1)
    for i in range(1,n_iter):
        dE = np.dot((2*nX.T),(np.dot(nX,w) - y))
        w -= gamma*dE
    return w

In [None]:
X = np.random.normal(3.0, 1.0, size=(10000,500))
y = np.random.randn(10000)
%timeit gradient_descent(X, y, n_iter=300)

In [None]:
# your code here

## Numba

Cython compiles parts of your Python code ahead of time, before the program runs. Another approach is **Just-in-Time (JIT)** compilation. Numba is a JIT compiler that uses LLVM to translate Python functions into optimized machine code at the moment they are first called. Numba doesn't need a C/C++ compiler on your machine.

The *@jit* decorator runs the decorated function through bytecode analysis using a type inference engine. 

In [None]:
from numba import jit

In [None]:
def pairwise_py(X):
    M, N = X.shape
    D = np.empty((M,M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i,k] - X[j,k]
                d += tmp*tmp
            D[i,j] = np.sqrt(d)
    return D

In [None]:
%timeit pairwise_py(np.random.rand(1000, 3))

In [None]:
@jit
def jit_pairwise_py(X):
    M, N = X.shape
    D = np.empty((M,M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i,k] - X[j,k]
                d += tmp*tmp
            D[i,j] = np.sqrt(d)
    return D

In [None]:
%timeit jit_pairwise_py(np.random.rand(1000, 3))

As you can see, for specific functions jit makes a huge improvement in performance. One caveat is that it works best on code built around NumPy arrays and numerical loops. When your code relies on arbitrary Python objects such as general lists, strings or dictionaries, older versions of Numba fall back to *object* mode, which provides no appreciable speedup, while recent versions compile in *nopython* mode by default and raise an error instead.
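
One practical point when timing jitted functions is that the first call includes the compilation itself. The sketch below separates the first call from a later one; `sum_of_squares` is an illustrative function, `njit` is Numba's shorthand for `jit(nopython=True)`, and the `ImportError` fallback is an assumption added so that the snippet still runs, without any speedup, when Numba is not installed.

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:          # assumption: fall back to plain Python if Numba is absent
    def njit(func):
        return func


@njit
def sum_of_squares(values):
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total


data = np.arange(100_000, dtype=np.float64)

t0 = time.perf_counter()
first = sum_of_squares(data)      # under Numba, the first call includes compilation
t1 = time.perf_counter()
second = sum_of_squares(data)     # later calls reuse the compiled code
t2 = time.perf_counter()

print(f"first call : {t1 - t0:.4f} s")
print(f"second call: {t2 - t1:.4f} s")
```

When benchmarking with %timeit, calling the function once beforehand keeps the compilation cost out of the measurement.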

## Task

Use Numba to just-in-time compile the `gradient_descent()` function we used earlier.

In [None]:
# your code here