## Preamble

This blog post gives an introduction to some techniques for benchmarking, profiling and optimising Python code. If you would like to try the code examples for yourself, you can [download the Jupyter notebook](https://github.com/alimanfoo/alimanfoo.github.io/blob/master/_posts/2017-01-23-go-faster-python.ipynb) (right click the "Raw" button, save link as...) that this blog post was generated from. To run the notebook, you will need a working Python 3 installation, and will also need to install a couple of Python packages. The way I did that was to first [download](http://conda.pydata.org/miniconda.html) and [install](http://conda.pydata.org/docs/install/quick.html) miniconda into my home directory. I then ran the following from the command line:

<pre>
user@host:~$ export PATH=~/miniconda3/bin/:$PATH
user@host:~$ conda create -n go_faster_python python=3.5
user@host:~$ source activate go_faster_python
(go_faster_python) user@host:~$ conda config --add channels conda-forge
(go_faster_python) user@host:~$ conda install cython numpy jupyter line_profiler
(go_faster_python) user@host:~$ jupyter notebook &
</pre>

## Introduction

I use Python both for writing software libraries and for interactive data analysis. Python is an **interpreted**, **dynamically-typed** language, with lots of convenient **high-level data structures** like lists, sets, dicts, etc. It's great for getting things done, because no compile step means no delays when developing and testing code or when exploring data. No type declarations means lots of flexibility and less typing (on the keyboard I mean - sorry, bad joke). The high-level data structures mean you can focus on solving the problem rather than low-level nuts and bolts.

But the down-side is that Python can be slow. If you have a Python program that's running slowly, what are your options?

## Benchmarking and profiling

My intuitions for why a piece of code is running slowly are very often wrong, and many other programmers more experienced than me [say the same thing](http://cython.readthedocs.io/en/latest/src/tutorial/profiling_tutorial.html#profiling-tutorial). If you're going to try and optimise something, you need to do some benchmarking and profiling first, to find out (1) exactly how slow it goes, and (2) where the bottleneck is.

To introduce some basic Python benchmarking and profiling tools, let's look at a toy example: computing the sum of a 2-dimensional array of integers. Here's some example data:

In [1]:
data = [[90, 62, 33, 78, 82],
        [37, 31, 0, 72, 32],
        [7, 71, 79, 81, 100],
        [33, 50, 66, 81, 71],
        [87, 26, 54, 78, 81],
        [37, 22, 96, 79, 41],
        [88, 75, 100, 19, 88],
        [24, 72, 59, 33, 92],
        [71, 6, 59, 8, 11],
        [89, 76, 65, 12, 13]]

Strictly speaking this isn't an array, it's a list of lists. But using lists is a common and natural way to store data in Python. 

Here is a naive implementation of a function called `sum2d` to compute the overall sum of a 2-dimensional data structure:

In [2]:
def sum1d(l):
    """Compute the sum of a list of numbers."""
    s = 0
    for x in l:
        s += x
    return s


def sum2d(ll):
    """Compute the sum of a list of lists of numbers."""
    s = 0
    for l in ll:
        s += sum1d(l)
    return s


Run the implementation to check it works:

In [3]:
sum2d(data)

2817
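
Before optimising anything, it can help to pin down the expected answer independently, so that later rewrites can be checked against it. Here is a minimal sketch of that idea, using the built-in `sum()` as a reference. The `sample` and `expected` names are made up for this illustration:

```python
def sum1d(l):
    """Compute the sum of a list of numbers."""
    s = 0
    for x in l:
        s += x
    return s


def sum2d(ll):
    """Compute the sum of a list of lists of numbers."""
    s = 0
    for l in ll:
        s += sum1d(l)
    return s


# a small hypothetical sample, just for this check
sample = [[90, 62, 33], [37, 31, 0], [7, 71, 79]]

# independent reference answer using only built-ins
expected = sum(sum(row) for row in sample)

assert sum2d(sample) == expected
print(sum2d(sample))  # 410
```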

We need a bigger dataset to illustrate slow performance. To make a bigger test dataset I'm going to use the multiplication operator (`*`), which when applied to a Python list creates a new list by repeating the elements of the original list. E.g., here's the original list repeated twice:

In [4]:
data * 2

[[90, 62, 33, 78, 82],
 [37, 31, 0, 72, 32],
 [7, 71, 79, 81, 100],
 [33, 50, 66, 81, 71],
 [87, 26, 54, 78, 81],
 [37, 22, 96, 79, 41],
 [88, 75, 100, 19, 88],
 [24, 72, 59, 33, 92],
 [71, 6, 59, 8, 11],
 [89, 76, 65, 12, 13],
 [90, 62, 33, 78, 82],
 [37, 31, 0, 72, 32],
 [7, 71, 79, 81, 100],
 [33, 50, 66, 81, 71],
 [87, 26, 54, 78, 81],
 [37, 22, 96, 79, 41],
 [88, 75, 100, 19, 88],
 [24, 72, 59, 33, 92],
 [71, 6, 59, 8, 11],
 [89, 76, 65, 12, 13]]
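
One detail worth knowing: list repetition does not copy the inner lists, so the repeated rows are references to the same list objects. That makes building a large test dataset cheap, but it would matter if the rows were modified later. A small sketch with a made-up two-row list (`small` and `repeated` are illustrative names):

```python
small = [[1, 2], [3, 4]]
repeated = small * 2

# the outer list is new and twice as long...
print(len(repeated))               # 4

# ...but rows 0 and 2 are the very same list object
print(repeated[0] is repeated[2])  # True

# so modifying one row shows up in its repeat too
repeated[0][0] = 99
print(repeated[2])                 # [99, 2]
```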

Make a bigger dataset by repeating the original data a million times:

In [5]:
big_data = data * 1000000

Now we have a dataset that is 10,000,000 rows by 5 columns:

In [6]:
len(big_data)

10000000

In [7]:
len(big_data[0])

5

Try running the function on these data:

In [8]:
sum2d(big_data)

2817000000

On my laptop this takes a few seconds to run.

### Benchmarking:  `%time`, `%timeit`, `timeit`

Before you start optimising, you need a good estimate of performance as a place to start from, so you know when you've improved something. If you're working in a Jupyter notebook there are a couple of magic commands available which are very helpful for benchmarking: [`%time`](http://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=%25time#magic-time) and [`%timeit`](http://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=%25timeit#magic-timeit). If you're writing a Python script to do the benchmarking, you can use the [`timeit`](https://docs.python.org/3/library/timeit.html) module from the Python standard library.

Let's look at the output from [`%time`](http://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=%25time#magic-time):

In [9]:
%time sum2d(big_data)

CPU times: user 2.89 s, sys: 0 ns, total: 2.89 s
Wall time: 2.89 s


2817000000

The first line of the output gives the amount of CPU time, broken down into 'user' (your code) and 'sys' (operating system code). The second line gives the wall time, which is the actual amount of time elapsed. For single-threaded, CPU-bound code like this, the total CPU time and the wall time will generally be about the same, but not always. E.g., if you are benchmarking a multi-threaded program, then wall time may be less than CPU time, because CPU time counts time spent by each CPU separately and adds them together, but the CPUs may actually be working in parallel. Conversely, if the code spends time waiting (e.g., for disk or network), wall time will be greater than CPU time.
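
The difference between CPU time and wall time can also be seen outside a notebook, using `time.process_time()` (CPU time of the current process) and `time.perf_counter()` (elapsed time) from the standard library. In this sketch, `time.sleep()` stands in for any period spent waiting rather than computing, which is the case where wall time exceeds CPU time. The helper name `measure` is my own invention:

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) for a single call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


# sleeping uses wall time but almost no CPU time
wall, cpu = measure(lambda: time.sleep(0.2))
print('sleep: wall %.3f s, cpu %.3f s' % (wall, cpu))

# a busy loop uses roughly as much CPU time as wall time
wall_busy, cpu_busy = measure(lambda: sum(range(2000000)))
print('busy:  wall %.3f s, cpu %.3f s' % (wall_busy, cpu_busy))
```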

One thing to watch out for when benchmarking is that performance can be variable, and may be affected by other processes running on your computer. To see this happen, try running the cell above again, but while it's running, give your computer something else to do at the same time, e.g., play some music, or a video, or just scroll the page up and down a bit.

To control for this variation, it's a good idea to benchmark several runs (and avoid the temptation to check your email while it's running). The [`%timeit`](http://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=%25timeit#magic-timeit) magic will automatically benchmark a piece of code several times: 

In [10]:
%timeit sum2d(big_data)

1 loop, best of 3: 2.84 s per loop


Alternatively, using the [`timeit`](https://docs.python.org/3/library/timeit.html) module:

In [11]:
import timeit
timeit.repeat(stmt='sum2d(big_data)', repeat=3, number=1, globals=globals())

[2.892044251999323, 2.8372464299973217, 2.8505056829999376]
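
In a standalone script, a common pattern is to repeat the measurement a few times and report the minimum, which is roughly what "best of 3" in the `%timeit` output refers to. Here is a minimal sketch, using a smaller dataset so it finishes quickly. It passes a callable to `timeit.repeat()` instead of a statement string, which avoids the need for `globals=globals()`. The names `small_data`, `timings` and `best` are illustrative:

```python
import timeit


def sum2d(ll):
    """Compute the sum of a list of lists of numbers."""
    s = 0
    for l in ll:
        for x in l:
            s += x
    return s


# a smaller dataset than in the post, so the script runs quickly
small_data = [[1, 2, 3, 4, 5]] * 100000

# three separate measurements, each calling the function once
timings = timeit.repeat(lambda: sum2d(small_data), repeat=3, number=1)

# the minimum is the run least disturbed by other processes
best = min(timings)
print('best of %d: %.4f s per loop' % (len(timings), best))
```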

### Function profiling: `%prun`, `cProfile`

The next thing to do is investigate which part of the code is taking the most time. If you're working in a Jupyter notebook, the [`%prun`](http://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=%25prun#magic-prun) command is a very convenient way to profile some code. Use it like this:

In [12]:
%prun sum2d(big_data)

 

The output from [`%prun`](http://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=%25prun#magic-prun) pops up in a separate panel, but for this blog post I need to get the output inline, so I'm going to use the [`cProfile`](https://docs.python.org/3/library/profile.html?highlight=cprofile) module from the Python standard library directly, which does the same thing:

In [13]:
import cProfile
cProfile.run('sum2d(big_data)', sort='time')

         10000004 function calls in 3.883 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 10000000    2.378    0.000    2.378    0.000 <ipython-input-2-12b138a62a96>:1(sum1d)
        1    1.505    1.505    3.883    3.883 <ipython-input-2-12b138a62a96>:9(sum2d)
        1    0.000    0.000    3.883    3.883 {built-in method builtins.exec}
        1    0.000    0.000    3.883    3.883 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




There are a couple of things to notice here. 

First, the time taken to execute the profiling run is quite a bit longer than the time we got when benchmarking earlier. This is because profiling adds some overhead. Generally this doesn't affect the conclusions you would draw about which functions take the most time, but it's something to be aware of.

Second, most of the time is being taken up in the `sum1d` function, although a decent amount of time is also being spent within the `sum2d` function. You can see this from the 'tottime' column, which shows the total amount of time spent within a function, **not** including any calls made to other functions. The 'cumtime' column shows the total amount of time spent in a function, including any function calls.
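
If you want the report sorted by cumulative time instead, or want to capture it from a script, the standard library's `pstats` module can be combined with `cProfile`. This is a rough sketch on a smaller dataset. The names `profiler`, `stream`, `report` and `small_data` are mine, not from the notebook:

```python
import cProfile
import io
import pstats


def sum1d(l):
    """Compute the sum of a list of numbers."""
    s = 0
    for x in l:
        s += x
    return s


def sum2d(ll):
    """Compute the sum of a list of lists of numbers."""
    s = 0
    for l in ll:
        s += sum1d(l)
    return s


small_data = [[1, 2, 3, 4, 5]] * 100000

# collect profiling data for a single call
profiler = cProfile.Profile()
profiler.enable()
result = sum2d(small_data)
profiler.disable()

# sort by cumulative time ('cumtime') rather than internal time ('tottime')
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
print(report)
```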

Also, you'll notice that there were 10,000,004 function calls. Calling a Python function has some overhead. Maybe the code would go faster if we reduced the number of function calls? Here's a new implementation, combining everything into a single function:

In [14]:
def sum2d_v2(ll):
    """Compute the sum of a list of lists of numbers."""
    s = 0
    for l in ll:
        for x in l:
            s += x
    return s


In [15]:
%timeit sum2d_v2(big_data)

1 loop, best of 3: 2.22 s per loop


This is a bit faster. What does the profiler tell us?

In [16]:
cProfile.run('sum2d_v2(big_data)', sort='time')

         4 function calls in 2.223 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.223    2.223    2.223    2.223 <ipython-input-14-81d66843d00e>:1(sum2d_v2)
        1    0.000    0.000    2.223    2.223 {built-in method builtins.exec}
        1    0.000    0.000    2.223    2.223 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




In fact we've hit a dead end here, because there is only a single function to profile, and function profiling cannot tell us which lines of code within a function are taking up most time. To get further we need to do some...

### Line profiling: `%lprun`, `line_profiler`

You can do line profiling with a Python module called [`line_profiler`](https://github.com/rkern/line_profiler). This is not part of the Python standard library, so you have to install it separately, e.g., via pip or conda.

For convenience, the `line_profiler` module provides a `%lprun` magic command for use in a Jupyter notebook, which can be used as follows: 

In [17]:
%load_ext line_profiler

In [None]:
%lprun -f sum2d_v2 sum2d_v2(big_data)

You can also do the same thing via regular Python code:

In [19]:
import line_profiler
l = line_profiler.LineProfiler()
l.add_function(sum2d_v2)
l.run('sum2d_v2(big_data)')

<line_profiler.LineProfiler at 0x7fcf947c4d68>

In [20]:
l.print_stats()

Timer unit: 1e-06 s

Total time: 31.6946 s
File: <ipython-input-14-81d66843d00e>
Function: sum2d_v2 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def sum2d_v2(ll):
     2                                               """Compute the sum of a list of lists of numbers."""
     3         1            2      2.0      0.0      s = 0
     4  10000001      2338393      0.2      7.4      for l in ll:
     5  60000000     15348131      0.3     48.4          for x in l:
     6  50000000     14008119      0.3     44.2              s += x
     7         1            0      0.0      0.0      return s



Notice that this takes *a lot* longer than with just function profiling or without any profiling. Line profiling adds *a lot* more overhead, and this can skew benchmarking results sometimes, so it's a good idea after each optimisation you make to benchmark without any profiling at all, as well as running function and line profiling.

If you are getting bored waiting for the line profiler to finish, you can interrupt it and it will still output some useful statistics.

Note that you have to explicitly tell `line_profiler` which functions to do line profiling within. When using the `%lprun` magic this is done via the `-f` option. 

Most of the time is spent in the inner for loop, iterating over the inner lists and performing the addition. Python has a built-in `sum()` function; maybe we could try that? ...

In [21]:
def sum2d_v3(ll):
    """Compute the sum of a list of lists of numbers."""
    s = 0
    for l in ll:
        x = sum(l)
        s += x
    return s


In [22]:
%timeit sum2d_v3(big_data)

1 loop, best of 3: 1.91 s per loop


We've shaved off a bit more time. What do the profiling results tell us?

In [23]:
cProfile.run('sum2d_v3(big_data)', sort='time')

         10000004 function calls in 2.714 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 10000000    1.358    0.000    1.358    0.000 {built-in method builtins.sum}
        1    1.356    1.356    2.714    2.714 <ipython-input-21-baa7cce51590>:1(sum2d_v3)
        1    0.000    0.000    2.714    2.714 {built-in method builtins.exec}
        1    0.000    0.000    2.714    2.714 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




In [24]:
import line_profiler
l = line_profiler.LineProfiler()
l.add_function(sum2d_v3)
l.run('sum2d_v3(big_data)')
l.print_stats()

Timer unit: 1e-06 s

Total time: 9.65201 s
File: <ipython-input-21-baa7cce51590>
Function: sum2d_v3 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def sum2d_v3(ll):
     2                                               """Compute the sum of a list of lists of numbers."""
     3         1            1      1.0      0.0      s = 0
     4  10000001      2536840      0.3     26.3      for l in ll:
     5  10000000      4125789      0.4     42.7          x = sum(l)
     6  10000000      2989380      0.3     31.0          s += x
     7         1            1      1.0      0.0      return s



Now a decent amount of time is being spent inside the built-in `sum()` function, and there's not much we can do about that. But there's also still time being spent in the for loop, and in arithmetic. 

You've probably realised by now that this is a contrived example. We've actually been going around in circles a bit, and our attempts at optimisation haven't got us very far. To optimise further, we need to try something different...

## NumPy

For numerical problems, the first port of call is [NumPy](http://www.numpy.org/). Let's use it to solve the sum2d problem. First, let's create a new test dataset, of the same size (10,000,000 rows, 5 columns), but this time using the `np.random.randint()` function:

In [25]:
import numpy as np
big_array = np.random.randint(0, 100, size=(10000000, 5))
big_array

array([[34, 55, 45, 16, 83],
       [20, 34, 77, 55, 64],
       [45, 61, 65,  8, 37],
       ..., 
       [73, 86, 73,  5, 88],
       [70, 57, 92, 40, 21],
       [50, 46, 93, 38, 43]])

The `big_array` variable is a NumPy array. Here are a few useful properties:

In [26]:
# number of dimensions
big_array.ndim

2

In [27]:
# size of each dimension
big_array.shape

(10000000, 5)

In [28]:
# data type of each array element
big_array.dtype

dtype('int64')

In [29]:
# number of bytes of memory used to store the data
big_array.nbytes

400000000

In [30]:
# some other features of the array
big_array.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
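
The `nbytes` value shown above follows directly from the shape and the data type: each `int64` element occupies 8 bytes, so the array needs rows × columns × 8 bytes. A quick check of the arithmetic in plain Python:

```python
rows, cols = 10000000, 5
bytes_per_element = 8  # an int64 is 64 bits = 8 bytes

nbytes = rows * cols * bytes_per_element
print(nbytes)              # 400000000
print(nbytes / 1e6, 'MB')  # 400.0 MB
```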

NumPy also has its own `np.sum()` function which can operate on N-dimensional arrays. Let's try it:

In [31]:
%timeit np.sum(big_array)

10 loops, best of 3: 30.5 ms per loop


So using NumPy is almost two orders of magnitude faster than our own Python sum2d implementation. Where does the speed come from? NumPy's functions are implemented in C, so all of the looping and arithmetic is done in native C code. Also, a NumPy array stores its data in a single, contiguous block of memory, which can be accessed quickly and efficiently.
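
To connect this back to the earlier example, a list of lists can be converted to a NumPy array with `np.array()`, after which `np.sum()` gives the same answer as the pure-Python functions. Here is a small sketch using the original 10-row `data` (the name `arr` is just for illustration):

```python
import numpy as np

data = [[90, 62, 33, 78, 82],
        [37, 31, 0, 72, 32],
        [7, 71, 79, 81, 100],
        [33, 50, 66, 81, 71],
        [87, 26, 54, 78, 81],
        [37, 22, 96, 79, 41],
        [88, 75, 100, 19, 88],
        [24, 72, 59, 33, 92],
        [71, 6, 59, 8, 11],
        [89, 76, 65, 12, 13]]

# copy the nested lists into a single block of memory
arr = np.array(data)

print(arr.shape)                  # (10, 5)
print(np.sum(arr))                # 2817
print(arr.flags['C_CONTIGUOUS'])  # True
```

Note that the conversion itself has to visit every Python object in the nested lists, so it is not free. The benefit comes when the data stay in array form across many operations.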

### Aside: array arithmetic

There are lots of things you can do with NumPy arrays, without ever having to write a for loop. E.g.:

In [32]:
# column sum
np.sum(big_array, axis=0)

array([494856187, 495018159, 495000327, 494824068, 494760222])

In [33]:
# row sum
np.sum(big_array, axis=1)

array([233, 250, 216, ..., 325, 280, 270])

In [34]:
# add 2 to every array element
big_array + 2

array([[36, 57, 47, 18, 85],
       [22, 36, 79, 57, 66],
       [47, 63, 67, 10, 39],
       ..., 
       [75, 88, 75,  7, 90],
       [72, 59, 94, 42, 23],
       [52, 48, 95, 40, 45]])

In [35]:
# multiply every array element by 2
big_array * 2

array([[ 68, 110,  90,  32, 166],
       [ 40,  68, 154, 110, 128],
       [ 90, 122, 130,  16,  74],
       ..., 
       [146, 172, 146,  10, 176],
       [140, 114, 184,  80,  42],
       [100,  92, 186,  76,  86]])

In [36]:
# add two arrays element-by-element
big_array + big_array

array([[ 68, 110,  90,  32, 166],
       [ 40,  68, 154, 110, 128],
       [ 90, 122, 130,  16,  74],
       ..., 
       [146, 172, 146,  10, 176],
       [140, 114, 184,  80,  42],
       [100,  92, 186,  76,  86]])

In [37]:
# more complicated operations
t = (big_array * 2) == (big_array + big_array)
t

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       ..., 
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)

In [38]:
np.all(t)

True

## Cython

For some problems, there is no convenient way to solve them with NumPy alone. Or in some cases, the problem can be solved with NumPy, but the solution requires too much memory. If this is the case, another option is to try optimising with [Cython](http://docs.cython.org/en/latest/). Cython takes your Python code and transforms it into C code, which can then be compiled just like any other C code. A key feature is that Cython allows you to add static type declarations to certain variables, which then enables it to generate highly efficient native C code for certain critical sections of your code.

To illustrate the use of Cython I'm going to walk through an example from the Cython profiling tutorial. The task is to compute an approximate value for pi, using the formula below:

<img src='/assets/approx_pi.png'>
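
In case the image does not display: judging from the code that follows, the formula is the Basel series truncated after $n$ terms,

$$
\pi \approx \sqrt{6 \sum_{k=1}^{n} \frac{1}{k^2}}
$$

which comes from the identity $\sum_{k=1}^{\infty} 1/k^2 = \pi^2/6$.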

In fact there is a way to solve this problem using NumPy:

In [39]:
def approx_pi_numpy(n):
    pi = (6 * np.sum(1 / (np.arange(1, n+1)**2)))**.5
    return pi


In [40]:
approx_pi_numpy(10000000)

3.1415925580968325

In [41]:
%timeit approx_pi_numpy(10000000)

10 loops, best of 3: 84.6 ms per loop


The NumPy solution is pretty quick, but it does require creating several reasonably large arrays in memory. If we wanted a higher precision estimate, we might run out of memory. Also, allocating memory has some overhead, and so we should be able to find a Cython solution that is even faster.
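
One way to stay within NumPy while bounding memory use is to process the series in fixed-size chunks, so that only one chunk's worth of temporary arrays exists at a time. This is a sketch of that idea, not something from the original notebook. The function name `approx_pi_chunked` and the `chunk_size` parameter are hypothetical:

```python
import numpy as np


def approx_pi_chunked(n, chunk_size=1000000):
    """Approximate pi from the first n terms, one chunk at a time."""
    total = 0.0
    for start in range(1, n + 1, chunk_size):
        stop = min(start + chunk_size, n + 1)
        # float64 avoids integer overflow when squaring large k
        k = np.arange(start, stop, dtype=np.float64)
        total += np.sum(1.0 / (k * k))
    return (6 * total) ** 0.5


print(approx_pi_chunked(10000000))
```

The Python-level loop here runs only about `n / chunk_size` times, so its overhead should be small compared with the array operations inside it.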

Let's start from a pure Python solution:

In [42]:
def recip_square(i):
    x = i**2
    s = 1 / x
    return s


def approx_pi(n):
    """Compute an approximate value of pi."""
    val = 0
    for k in range(1, n+1):
        x = recip_square(k)
        val += x
    pi = (6 * val)**.5
    return pi


In [43]:
approx_pi(10000000)

3.1415925580959025

Benchmark and profile:

In [44]:
%timeit approx_pi(10000000)

1 loop, best of 3: 3.79 s per loop


In [45]:
cProfile.run('approx_pi(10000000)', sort='time')

         10000004 function calls in 4.660 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 10000000    3.092    0.000    3.092    0.000 <ipython-input-42-2f54265e4169>:1(recip_square)
        1    1.568    1.568    4.660    4.660 <ipython-input-42-2f54265e4169>:7(approx_pi)
        1    0.000    0.000    4.660    4.660 {built-in method builtins.exec}
        1    0.000    0.000    4.660    4.660 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




In [46]:
l = line_profiler.LineProfiler()
# l.add_function(recip_square)
l.add_function(approx_pi)
# use a smaller value of n, otherwise line profiling takes ages
l.run('approx_pi(2000000)')
l.print_stats()

Timer unit: 1e-06 s

Total time: 3.36972 s
File: <ipython-input-42-2f54265e4169>
Function: approx_pi at line 7

Line #      Hits         Time  Per Hit   % Time  Line Contents
     7                                           def approx_pi(n):
     8                                               """Compute an approximate value of pi."""
     9         1            2      2.0      0.0      val = 0
    10   2000001       676782      0.3     20.1      for k in range(1, n+1):
    11   2000000      2007925      1.0     59.6          x = recip_square(k)
    12   2000000       685011      0.3     20.3          val += x
    13         1            5      5.0      0.0      pi = (6 * val)**.5
    14         1            0      0.0      0.0      return pi



Notice that I commented out line profiling for the `recip_square()` function. If you're running this yourself, I recommend running line profiling with and without including `recip_square()`, and each time take a look at the results for the outer `approx_pi()` function. This is a good example of how the overhead of line profiling can skew timings, so again, something to beware of. 

Now let's start constructing a Cython implementation. If you're working in a Jupyter notebook, there is a very convenient `%%cython` magic which enables you to write a Cython module within the notebook. When you execute a `%%cython` code cell, behind the scenes Cython will generate and compile your module, and import the functions into the current session so they can be called from other code cells. To make the `%%cython` magic available, we need to load the Cython notebook extension:

In [47]:
%load_ext cython

To begin with, all I'll do is copy-paste the pure Python implementation into the Cython module, then change the function names so we can benchmark and profile separately:

In [48]:
%%cython


def recip_square_cy1(i):
    x = i**2
    s = 1 / x
    return s


def approx_pi_cy1(n):
    """Compute an approximate value of pi."""
    val = 0
    for k in range(1, n+1):
        x = recip_square_cy1(k)
        val += x
    pi = (6 * val)**.5
    return pi


In [49]:
approx_pi_cy1(10000000)

3.1415925580959025

In [50]:
%timeit approx_pi_cy1(10000000)

1 loop, best of 3: 2.74 s per loop


Notice that the Cython function is a bit faster than the pure-Python function, even though the code is identical. However, we're still a long way short of the NumPy implementation.

What we're going to do next is make a number of modifications to the Cython implementation. If you're working through this in a notebook, I recommend copy-pasting the Cython code cell from above into a new empty code cell below, and changing the function names to `approx_pi_cy2()` and `recip_square_cy2()`. Then, work through the optimisation steps below, one-by-one. After each step, run benchmarking and profiling to see where and how much speed has been gained, and examine the HTML diagnostics generated by Cython to see how much yellow (Python interaction) there is left. The goal is to remove all yellow from critical sections of the code. 

* Copy-paste Cython code cell from above into a new cell below; rename `approx_pi_cy1` to `approx_pi_cy2` and `recip_square_cy1` to `recip_square_cy2`. Be careful to rename all occurrences of the function names.
* Change `%%cython` to `%%cython -a` at the top of the cell. Adding the `-a` flag causes Cython to generate an HTML document with some diagnostics. This includes colouring of every line of code according to how much interaction is happening with Python. The more you can remove interaction with Python, the more Cython can optimise and generate efficient native C code.
* Add support for function profiling by adding the following comment (special comments like this are interpreted by Cython as compiler directives): 

<pre># cython: profile=True
</pre>

* Add support for line profiling by adding the following comments:

<pre>
# cython: linetrace=True
# cython: binding=True
# distutils: define_macros=CYTHON_TRACE_NOGIL=1
</pre>

The following steps optimise the `recip_square` function:

* Add a static type declaration for the `i` argument to the `recip_square` function, by changing the function signature from `recip_square(i)` to `recip_square(int i)`.
* Add static type declarations for the `x` and `s` variables within the `recip_square` function, via a `cdef` section at the start of the function.
* Tell Cython to use C division instead of Python division within the `recip_square` function by adding the `@cython.cdivision(True)` decorator immediately above the function definition. This also requires adding the import statement `cimport cython`.

Now the `recip_square` function should be fully optimised. The following steps optimise the outer `approx_pi_cy2` function:

* Optimise the `for` loop by adding static type declarations for the `n` argument and the `k` variable.
* Make `recip_square` a `cdef` function and add a return type.
* Add a static type declaration for the `val` variable.
* Remove line profiling support.
* Remove function profiling support.
* Make `recip_square` an `inline` function.

Here is the end result...

In [51]:
%%cython -a


cimport cython


@cython.cdivision(True)
cdef inline double recip_square_cy2(int i):
    cdef:
        double x, s
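    # caution: i is a C int, so i**2 may be computed in C integer arithmetic
    # and could overflow for large i; (<double>i)**2 would avoid that
    x = i**2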
    s = 1 / x
    return s


def approx_pi_cy2(int n):
    """Compute an approximate value of pi."""
    cdef:
        int k
        double val
    val = 0
    for k in range(1, n+1):
        x = recip_square_cy2(k)
        val += x
    pi = (6 * val)**.5
    return pi


In [52]:
%timeit approx_pi_cy2(10000000)

100 loops, best of 3: 11.6 ms per loop


In [None]:
cProfile.run('approx_pi_cy2(10000000)', sort='time')

In [None]:
l = line_profiler.LineProfiler()
# l.add_function(recip_square_cy2)
l.add_function(approx_pi_cy2)
# use a smaller value of n, otherwise line profiling takes ages
l.run('approx_pi_cy2(2000000)')
l.print_stats()

So the optimised Cython implementation is roughly 7 times faster than the NumPy implementation (11.6 ms versus 84.6 ms). You may have noticed that a couple of the optimisation steps didn't actually make much difference to performance, and so weren't really necessary.
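
For reference, the speed-up factors can be worked out from the `%timeit` figures reported above. This is just arithmetic on those numbers. The labels and the `timings_ms` and `speedups` names are my own, and your timings will differ:

```python
# best-of-3 timings reported above, in milliseconds
timings_ms = {
    'approx_pi (pure Python)': 3790.0,
    'approx_pi_cy1 (Cython, untyped)': 2740.0,
    'approx_pi_numpy': 84.6,
    'approx_pi_cy2 (Cython, typed)': 11.6,
}

# express every timing relative to the fastest implementation
baseline = timings_ms['approx_pi_cy2 (Cython, typed)']
speedups = {name: t / baseline for name, t in timings_ms.items()}

for name, factor in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print('%-32s %6.1f x the time of approx_pi_cy2' % (name, factor))
```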

## Cython and NumPy

Finally, here's a Cython implementation of the sum2d function, just to give an example of how to use Cython and NumPy together. As above, you may want to introduce the optimisations one at a time, and benchmark and profile after each step, to get a sense of which optimisations really make a difference.

In [54]:
%%cython -a


cimport numpy as np
cimport cython


@cython.wraparound(False)
@cython.boundscheck(False)
def sum2d_cy(np.int64_t[:, :] ll):
    """Compute the sum of a 2-dimensional array of integers."""
    cdef:
        int i, j
        np.int64_t s
    s = 0
    for i in range(ll.shape[0]):
        for j in range(ll.shape[1]):
            s += ll[i, j]
    return s


In [55]:
%timeit sum2d_cy(big_array)

10 loops, best of 3: 34.3 ms per loop


In [56]:
%timeit np.sum(big_array)

10 loops, best of 3: 30.4 ms per loop


In this case the performance of the Cython and NumPy implementations is nearly identical, so there is no benefit in using Cython here.

## Further reading

There are loads of good resources on the Web on the topics covered here. Here are just a few:

* [The Python profilers](https://docs.python.org/3.6/library/profile.html)
* [A guide to analyzing Python performance](https://www.huyng.com/posts/python-performance-analysis) by Huy Nguyen
* [SnakeViz](https://jiffyclub.github.io/snakeviz/)
* [NumPy tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)
* [Cython tutorials](http://cython.readthedocs.io/en/latest/src/tutorial/)
* [The fallacy of premature optimization](http://ubiquity.acm.org/article.cfm?id=1513451) by Randall Hyde