# Code Performance

In this section we will look at ways of measuring the "efficiency" of our code. In this context, efficiency refers to the number of cpu operations that are required to complete a calculation. The fewer operations a calculation requires, the faster it will run, and the more efficient the code.


## Interpretation vs Compilation

To start with, it's useful to understand a key difference between two types of computer language: _compiled_ and _interpreted_.

With _compiled_ languages like C, C++ and Fortran, going from code (text) to a program is a two-step process. First, a dedicated _compiler_ program parses (reads) the high-level code and converts it into instructions that can be understood by the cpu. The compiler outputs an executable binary file, which you then have to run in a second step. Compilation can potentially take a long time, but this allows the executable to be highly optimised and efficient. On the other hand, a highly optimised executable bears little resemblance to the original code, and debugging can become more difficult.

With _interpreted_ languages, like Python, there is no separate compiler, and no executable. The code you write is parsed, compiled, and executed by an _interpreter_ at the time you run the program. The separate compilation step is omitted, simplifying the process. However, the interpreter's ability to optimise the code is reduced and, in general, interpreted languages are not nearly as fast as compiled languages.
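
As a small illustration of the point above, the standard library's `dis` module can display the bytecode that the interpreter generates from a function at run time. This is a sketch assuming CPython (the reference interpreter); the function `add_one` is a hypothetical example, and the exact instructions listed vary between Python versions, so no output is shown.

```python
import dis

def add_one(x):
    # A trivial function, used only for illustration
    return x + 1

# Every Python function carries a code object holding its compiled bytecode
bytecode = add_one.__code__.co_code
print(type(bytecode), len(bytecode))

# Print a human-readable listing of the bytecode instructions
dis.dis(add_one)
```

Note that this bytecode is executed by the interpreter itself, rather than directly by the cpu as a compiled executable would be.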



## The `time` library

We can investigate code efficiency by measuring how long a function takes to run. Almost all computers have an internal clock, which can provide the date and time to some precision, depending on the machine and operating system. Most computers will provide the time in seconds since `January 1, 1970, 00:00:00` UTC (aka "the epoch").

Python's `time` library provides a range of functions. `time.time()` will return the time in seconds since the epoch; however, its precision varies from one platform to another. `time.perf_counter()` is more useful here, since it will provide the best precision available on the platform. However, it can only be used for measuring relative time (i.e. the difference between repeated calls to `time.perf_counter()`). We can use this to measure the time elapsed during execution of a function, as in the example below.

In [2]:
# first define the function we want to time
def simple_sum(arr_in):
    result = 0
    for i in range(len(arr_in)):
        result = result + arr_in[i]
    return result

a = [1., 2., 3.]
print(simple_sum(a))

6.0


In [3]:
import time

start = time.perf_counter()
simple_sum(a)
#time.sleep(5)
end = time.perf_counter()

print(end-start)

7.660000028408831e-05


Try running the cell above several times. You should see some variation. If you are able to do this while your computer is also doing something computationally intensive, you might see some really big variation. If you don't see much variation in time, uncomment the `time.sleep` line above - you should see that sending the process to sleep between the time measurements affects the result.
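
To see what precision each clock offers on a particular machine, the standard `time.get_clock_info()` function can be queried. The following is a minimal sketch; the values printed depend on the platform, so none are shown here.

```python
import time

# Query the properties of the two clocks discussed above
for name in ("time", "perf_counter"):
    info = time.get_clock_info(name)
    # resolution is given in seconds; a monotonic clock never goes backwards
    print(f"{name:>12}: resolution = {info.resolution:.1e} s, "
          f"monotonic = {info.monotonic}")
```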

## Measuring Execution Time

A key issue to note here is that modern operating systems run many programs at once, using 'multi-tasking' techniques to switch rapidly between processes and make efficient use of the cpu. Our function above might be held in a queue before it can run, or it might be interrupted in the middle of its operation. Using the absolute time (known as the "wall clock" time) can therefore give misleading results. Really we want an estimate of the "cpu time" (the time spent by the cpu executing this function) rather than the "wall clock time" (the absolute time difference between the start and end of the execution).

An improvement on `time.perf_counter()` is therefore to use `time.process_time()`. This returns the cpu time used by the current process (the sum of its system and user cpu time), and does not include time the process spends asleep.

In [4]:
start = time.process_time()
simple_sum(a)
#time.sleep(5)
end = time.process_time()

print(end-start)

0.0


Now, you should see less variation.  If you try uncommenting the sleep line, you should see the process will go to sleep for 5 seconds, but since the cpu does not spend this time executing the process, it will not impact the measured time interval.
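
The difference between the two clocks can be made explicit by reading both around the same piece of code. This sketch uses a short `time.sleep(0.2)` as a stand-in for time the process spends waiting rather than computing; the variable names are illustrative only.

```python
import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(0.2)          # process is idle: wall clock advances, cpu time barely does
sum(range(100_000))      # a little real work, which does consume cpu time

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"wall clock time: {wall_elapsed:.3f} s")
print(f"cpu time:        {cpu_elapsed:.3f} s")
```

The wall clock time should come out a little above 0.2s, while the cpu time should be much smaller.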

The final point to note here is that the time values measured here are very small. In general, fast functions will execute in a time comparable to the resolution of the time measurement, which is why the cell above can report `0.0`. To avoid this, make sure the time interval you measure is much larger than the resolution: at least 0.1s, and ideally around 1s. You can achieve this by running the function many times, and dividing the measured interval by the number of function calls.

In [5]:
n=int(1e5)
start = time.process_time()
for i in range(n):
    simple_sum(a)
end = time.process_time()

print((end-start)/n)

9.375e-07


It is worth exploring what happens as you reduce the number of calls to `simple_sum()` in the cell above. You should see the mean value start to increase, as the total time gets close to the resolution of `time.process_time()`.

Note that you will still see variation in the estimated time when running the cell above repeatedly. This results from imperfections in the ability of `time.process_time()` to measure the true time spent by the cpu in the process (which is non-trivial). We can account for this by repeating the measurement several times. Note that when repeating this measurement we want to take the _smallest_ value! Unlike typical measurements, here we know that the variation results from processes that are external to the one we are trying to measure, and all such variation increases the measured value.

In [6]:
times = []

for trial in range(10):
    # time n calls together, then store the mean time per call
    start = time.process_time()
    for _ in range(n):
        simple_sum(a)
    end = time.process_time()
    times.append((end-start)/n)

print(min(times))

1.09375e-06


On my laptop, I obtain values for the execution time of `simple_sum()` that are all around 1.3$\mu$s. Note that this will vary from one machine to another, depending on its specification.

In summary, to obtain accurate estimates of function execution time:
1. measure cpu time, not wall clock time
2. time the execution of many calls to the function, with total time > 0.1s
3. repeat the above several times and take the smallest value
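
The three steps above can be collected into a small helper. This is a sketch only: the name `best_time` and its default arguments are hypothetical, chosen for illustration. It does not check that the total time exceeds 0.1s, so `number` should be chosen accordingly.

```python
import time

def best_time(func, number=100_000, repeat=5):
    """Return the smallest mean cpu time per call of func(), in seconds."""
    means = []
    for trial in range(repeat):             # step 3: repeat the measurement
        start = time.process_time()         # step 1: cpu time, not wall clock
        for _ in range(number):             # step 2: time many calls together
            func()
        end = time.process_time()
        means.append((end - start) / number)
    return min(means)                       # external effects only ever add time

data = [1.0, 2.0, 3.0]
print(best_time(lambda: sum(data)))
```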

## The `timeit` library

Fortunately, the `timeit` library includes pre-defined functions explicitly for measuring function execution time. Full details are available at: https://docs.python.org/3/library/timeit.html

`timeit.timeit` will measure a specified number of calls to a function, and return the total measured time. `timeit.repeat` will do the same, but repeat this a specified number of times and return a list. In both cases, the function to be timed must be passed as a 'callable' (a function without brackets, and hence without arguments), which we can achieve with a lambda function (see $\S$2 below). Also note the timer function can be specified - by default it is `time.perf_counter()`.

The example below does the same thing as the previous code cell, using `timeit`.

In [7]:
import timeit

times = timeit.repeat(lambda: simple_sum(a), number=n, repeat=10, timer=time.process_time)
print(min(times)/n)

1.25e-06
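
Choosing `number` by hand can be avoided with `Timer.autorange()`, which increases the number of calls until the total time is at least 0.2s. The following is a minimal sketch, using a hypothetical stand-in function `work` in place of `simple_sum` and the default timer; the output depends on the machine, so none is shown.

```python
import timeit

def work():
    # Stand-in for the function being timed (illustrative only)
    return sum([1.0, 2.0, 3.0])

timer = timeit.Timer(work)

# autorange() returns (number of calls, total time taken for those calls)
number, total = timer.autorange()
print(f"{number} calls took {total:.3f} s, i.e. {total / number:.2e} s per call")
```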


## Example - Vectorisation

Vectorisation was introduced in the 2nd year Python tutorial ($\S$5.1). Essentially, vectorisation allows us to perform an operation on every element of a numpy array at once, without writing an explicit loop. Numpy provides a wide range of functions that support vectorisation.

In this section, we'll look at the performance increase that vectorisation can offer. As an example, we'll look at finding the square of each element in a large numpy array. First, we'll implement this using a for loop, and time it with the `timeit` library.

In [11]:
import numpy as np

old = np.arange(1000)
new = np.empty(1000)

def forloop():
    for i in range(1000):
        new[i] = old[i]**2
    return new

from timeit import Timer

print(min(Timer(forloop).repeat(number=1000, repeat=10)))

0.4556897199999952


And now using numpy vectorised routines instead.  Note that we wrap the numpy call up in a function to make the comparison as fair as possible.

In [10]:

def nploop():
    return old**2

print(min(Timer(nploop).repeat(number=1000, repeat=10)))

0.0014118909999893958


As you can see, the performance increase is significant! This is achieved because numpy, like many Python libraries, uses _compiled_ code (usually C, sometimes Fortran) behind the scenes. By using a vectorised operation, the 'for loop' is implemented in that compiled code, so Python only has to interpret a handful of lines. In the first example, Python will interpret every iteration of the for loop individually.
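
The same comparison can be applied to the `simple_sum()` function from earlier in this section. The sketch below redefines it so that the cell is self-contained, checks that `np.sum` gives the same answer, and then times both. The array size of 10000 and the `number` and `repeat` values are arbitrary choices, and timings are not shown since they depend on the machine.

```python
import timeit
import numpy as np

def simple_sum(arr_in):
    # Pure Python loop: every iteration is interpreted individually
    result = 0
    for i in range(len(arr_in)):
        result = result + arr_in[i]
    return result

arr = np.arange(10000, dtype=float)

# Both approaches should agree on the answer
assert np.isclose(simple_sum(arr), np.sum(arr))

# Smallest total time for 100 calls, over 5 repeats
t_loop = min(timeit.repeat(lambda: simple_sum(arr), number=100, repeat=5))
t_numpy = min(timeit.repeat(lambda: np.sum(arr), number=100, repeat=5))

print(f"for loop: {t_loop / 100:.2e} s per call")
print(f"np.sum:   {t_numpy / 100:.2e} s per call")
```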