# Performance

**ENGSCI233: Computational Techniques and Computer Systems** 

*Department of Engineering Science, University of Auckland*

# 0 What's in this Notebook?

Having mostly bug-free code is a **necessary but not sufficient** condition for a good computer program. This notebook is all about understanding code performance, how algorithms scale with larger and larger problems, how to identify bottlenecks, and strategies to speed up execution.

Things you need to know:
- Big O notation measures how algorithm runtime grows with problem size. You can estimate it by timing the algorithm at several problem sizes and reading the scaling off a graph.
- Profiling tells you how long your code spends running different functions, so you can figure out which parts are too slow.
- When calculations are independent of each other, they can be run in parallel. The gain is quantified by the parallel speed-up and efficiency.

In [None]:
# imports and environment: this cell must be executed before any other in the notebook
%matplotlib inline
from performance233 import *

## 1 Measuring algorithms

<mark>***How should we decide if one algorithm is better than another?***</mark>

One way to assess an algorithm is to count how many operations it takes to solve a problem of size $N$, and to compare this against another implementation. For instance, suppose we wish to sort an array of $N$ random numbers in ascending order; we could use:

1. Heapsort, which takes $k_1N\log_2 N+k_2$ operations to complete,
2. or insertion sort, which takes $k_3 N^2 + k_4$ operations to complete.

Depending on the values of $[k_1, k_2, k_3, k_4]$, we may assess one algorithm as superior to the other for a given problem size. 

A second approach is to compare the asymptotic scaling of each algorithm: how does it perform as $N$ gets **really really large**? We denote this scaling using $\mathcal{O}()$, called ["Big O notation"](https://www.interviewcake.com/article/java/big-o-notation-time-and-space-complexity) or order-of-complexity. This notation ignores **constant multiplicative factors** and lower-order terms, and focuses on the functional form.

From a scaling perspective, heapsort with order-of-complexity $\mathcal{O}(N\log_2N)$ is superior to insertion sort with $\mathcal{O}(N^2)$.
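
To make the two viewpoints concrete, here is a minimal sketch that evaluates both operation-count models and finds the problem size at which heapsort starts to win. The values of `K1` to `K4` are hypothetical, chosen only for illustration, and the function names are not part of `performance233`.

```python
import math

# Hypothetical constants, chosen only for illustration (not measured values).
K1, K2 = 10.0, 100.0   # heapsort model:       K1*N*log2(N) + K2
K3, K4 = 1.0, 5.0      # insertion sort model: K3*N**2 + K4

def heapsort_ops(n):
    """Operation-count model for heapsort."""
    return K1 * n * math.log2(n) + K2

def insertion_ops(n):
    """Operation-count model for insertion sort."""
    return K3 * n**2 + K4

def crossover(max_n=10**6):
    """Smallest N >= 2 at which the heapsort model needs fewer operations."""
    for n in range(2, max_n):
        if heapsort_ops(n) < insertion_ops(n):
            return n
    return None

# Compare the two models over a few problem sizes.
for n in [4, 16, 64, 256, 1024]:
    print(n, round(heapsort_ops(n)), round(insertion_ops(n)))
print('heapsort model wins from N =', crossover())
```

With these constants, insertion sort needs fewer operations for small $N$, but the $N^2$ term eventually dominates. Different constants move the crossover point, not the eventual winner.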

### 1.1 Graphing $\mathcal{O}()$

How do we decide if an algorithm is $\mathcal{O}(N^2)$ or $\mathcal{O}(N\log_2N)$? Both have graphs that are concave up...

***Execute the cell below.***

In [None]:
compare_scaling()

A useful diagnostic is to plot execution time on log-log axes, for a few doublings of the problem size, i.e., $N$, $2N$, $4N$, $8N$, etc. If the problem has scaling $\mathcal{O}(N^\alpha)$, then the plot will be linear on log-log axes, and $\alpha$ can be read off as the slope.

***Execute the cell below.***

In [None]:
scaling_loglog()

***Read the slope off the middle plot and verify that it is 2, i.e., $\mathcal{O}(N^2)$.***

***There are two ways to construct a log-log plot: (middle) by explicitly taking the log of the $x$ and $y$ quantities, and (right) by calling the Matplotlib commands `ax.set_xscale('log')` and `ax.set_yscale('log')`.***
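
The slope can also be estimated numerically rather than read off by eye. The sketch below fits a straight line to log-time against log-size using least squares. The "timings" are synthetic, generated from $t = cN^2$ with a made-up constant, and `loglog_slope` is a hypothetical helper, not a function from `performance233`.

```python
import math

def loglog_slope(sizes, times):
    """Least-squares slope of log(times) against log(sizes)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

# Synthetic "timings" for a few doublings of N, generated from t = c*N**2.
sizes = [100, 200, 400, 800, 1600]
times = [3e-7 * n**2 for n in sizes]

alpha = loglog_slope(sizes, times)
print('estimated exponent: {:.2f}'.format(alpha))
```

With NumPy available, `np.polyfit(np.log(sizes), np.log(times), 1)` returns the same slope as its first coefficient. Real timings are noisy, so expect the fitted exponent to be close to, rather than exactly, an integer.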

### 1.2 Other metrics

**Scaling of the execution time** is just one consideration when deciding on an algorithm implementation. Depending on the application or hardware available, consideration may also have to be given to memory use, stability, or preconditions of the algorithm input (some search algorithms are *very* fast if the input is already sorted).

## 2 Profiling for optimisation

<mark>***How should we choose which parts of a computer program should be improved?***</mark>


Large computer programs might comprise hundreds of different functions or methods calling each other in a complex sequence. Just one poorly written implementation can **slow the entire program**. How can we locate these bottlenecks?

Profilers are used to automatically analyse the performance of our code. This could include how efficiently it uses memory or how quickly it performs tasks. Here, we will focus on the latter, introducing the idea of code execution time and looking at how a profiler can be used to generate **execution time** statistics for a computer program.

### 2.1 Measuring time

Ultimately, profiling relies on measuring **how long it takes** to run different parts of the code. So we need a measure of [`time`](https://www.programiz.com/python-programming/time#time).

***Run the cell below and answer the questions.***

In [None]:
# import the time module
from time import time

# get the current time IN SECONDS from the system clock
t0 = time()              
print(t0)

***Run the cell above over and over (Ctrl+Enter ***\*wait\* *** Ctrl+Enter ***\*wait\* *** Ctrl+Enter). How does the output change?***

> <mark>*~ your answer here ~*</mark>

***Divide `t0` by 3600, to print the number of hours. Convert this to a number of days and a number of years.***

> <mark>*~ your answer here ~*</mark>

***Subtract the current time IN YEARS from today's date. When did the clock start?***

> <mark>*~ your answer here ~*</mark>

Once we can measure time, we can measure **time differences**.

***Run the cell below and answer the questions.***

In [None]:
# start the clock
tstart = time()

# do something you want to time
    # e.g., find the location of the largest element in an array of random numbers
i = np.argmax(np.random.rand(1000))
    
# stop the clock
tend = time()
print(tend-tstart, 'seconds')

***How long does it take to find the max value? Do you believe the answer?***

> <mark>*~ your answer here ~*</mark>

***Run the cell above over and over (Ctrl+Enter ***\*wait\* *** Ctrl+Enter ***\*wait\* *** Ctrl+Enter). How does the output change?***

> <mark>*~ your answer here ~*</mark>

Sometimes an operation executes **too quickly** to be timed accurately using `time()`. In that case, we can repeat the operation `N` times and divide the total execution time by `N`.

***Run the cell below and answer the questions.***

In [None]:
# start the clock
tstart = time()

# do something you want to time
    # e.g., find the location of the largest element in an array of random numbers
N = 100
for j in range(N):
    i = np.argmax(np.random.rand(1000))
    
# stop the clock
tend = time()
print((tend-tstart)/N, 'seconds')

***How long does it take to find the max value? Does this agree with the previous estimate?***

> <mark>*~ your answer here ~*</mark>

***Increase the value of `N`. Does the estimate of the execution time change (or at least stabilise?)***

> <mark>*~ your answer here ~*</mark>
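
The repeat-and-average pattern above is common enough that the Python standard library packages it up in the `timeit` module. The sketch below is one possible way to use it, not the approach required in this course. It uses a plain Python list instead of a NumPy array, and `find_max_index` is a hypothetical stand-in for the operation being timed.

```python
import random
from timeit import repeat

# An array of random numbers, as in the cells above (here a plain list).
data = [random.random() for _ in range(1000)]

def find_max_index():
    """Return the index of the largest element in data."""
    return max(range(len(data)), key=data.__getitem__)

# Run the operation 1000 times per measurement, and take 5 measurements.
number = 1000
totals = repeat(find_max_index, number=number, repeat=5)

# The minimum is usually the measurement least disturbed by other
# activity on the machine.
best = min(totals) / number
print(best, 'seconds per call')
```

Each entry of `totals` is the total time for `number` calls, so dividing by `number` gives the per-call estimate, just like `(tend-tstart)/N` above.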

### 2.2 Profilers

So, one way to get a sense of where our code is slow is to write in a whole bunch of `time()` and `print()` calls. This takes forever when your code is complicated... and then you have to go back in and pull them all out when you're finished optimising.

It's much better to use a ***PROFILER***. In the lab, you will use the [`cProfile`](https://docs.python.org/3.2/library/profile.html) module to study the efficiency of an LU factorisation implementation. There is not much more to say here except to study the typical output of such a profiler.

<img src="img/profiler.png" alt="Drawing" style="width: 900px;"/>

There's a lot of useful information to unpack here. After some general header information (e.g., total runtime), the profiler goes on to rank different function and method calls by the total time the code has spent in them. The columns give:

1. The **total number** of times the function or method was called.
2. The **total time** spent in that function or method (**excludes other function or method calls**).
3. Total time **per function or method call** (excludes time spent calling other functions or methods).
4. Cumulative time spent in that function or method - `tend-tstart` in the example above (**includes other functions or methods**).
5. Cumulative time per function or method call - `(tend-tstart)/N` in the example above (includes time spent calling other functions or methods).
6. The **name** of the function or method.

From the print out above, we can identify that the large majority of time is spent in the `row_reduction` function, and this is a consequence of ***both*** the large number of function calls (199) ***and*** the relatively slow function execution (0.043 seconds, compared to the next slowest `lu_factor` at 0.005 seconds per call). 

Perhaps we should take a look at improving `row_reduction`? You'll do this in the lab.
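
If you'd like to see the same kind of table for yourself before the lab, the sketch below profiles a small toy program with `cProfile` and prints the rows ranked by total time. The functions `slow_inner` and `outer` are hypothetical stand-ins, not the lab's LU factorisation code.

```python
import cProfile
import io
import pstats

def slow_inner(n):
    """Deliberately wasteful stand-in for a bottleneck function."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def outer(calls, n):
    """Calls slow_inner many times, like a routine calling a helper."""
    return sum(slow_inner(n) for _ in range(calls))

# Collect statistics only while the code of interest is running.
profiler = cProfile.Profile()
profiler.enable()
result = outer(200, 5000)
profiler.disable()

# Sort by 'tottime' (time inside the function itself) and show the top 5 rows.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('tottime').print_stats(5)
report = stream.getvalue()
print(report)
```

In the printed table, `slow_inner` should appear near the top with 200 calls, while `outer` has a large cumulative time but a small total time, because nearly all of its work is done inside `slow_inner`.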

## 3 Concurrency and Parallelisation

<mark>***How can we make optimised code even faster?***</mark>

Throughout the 80's and 90's, clever engineers devised new methods to squeeze more and more transistors onto microchips. The result was that transistor counts, and with them computing speeds, roughly **doubled every 2 years**: the so-called [Moore's Law](https://en.wikipedia.org/wiki/Moore%27s_law). While research into transistor miniaturisation [continues to this day](https://www.technologyreview.com/s/602592/worlds-smallest-transistor-is-cool-but-wont-save-moores-law/), in practice, gains in computing power are now achieved through **multi-core processing**. Many desktops now come with 8-core chips (8 independent processors) as standard, although if you're rereading these notes in 5 years' time that will probably [sound primitive](https://i.pinimg.com/originals/93/44/66/9344663cd0094039d4bacd47f67d48fe.jpg).

**Concurrency** is the idea that you can do two or more things at the same time<sup>2</sup>. It is ubiquitous in computing: multiple apps ***concurrently*** running on your phone, 30 or so ENGSCI students ***concurrently*** working through some contrived lab problem on a networked desktop machine each Wednesday morning. In each case, we can think of a **shared resource** (your phone's memory, a pool of Desktops) being accessed by **independent processes** (phone apps, ENGSCI nerds). 

The same concepts can be applied to **parallelise** your code.

<sup>2</sup><sub>[Scary concept if you're a male.](https://vignette.wikia.nocookie.net/tehmeme/images/5/5d/Y0UJC.png/revision/latest?cb=20120505151500)</sub>

### 3.1 An example

***- - - - CLASS CODING EXERCISE - - - -***

How long does it take Python to factorise a 3000$\times$3000 matrix? How about 10 of them?

In [None]:
# import some packages
from scipy.linalg import lu
import numpy as np
from time import time
    
N = 3000
n = 10

# create some matrices
As = []
for i in range(n):
    As.append(np.random.rand(N,N))

# factorise one matrix using lu()
t0 = time()
P,L,U = lu(As[0])
t1 = time()
print('factorising 1 matrix: ',t1-t0, 'seconds')

# factorise ten matrices using lu()
# *** your code here ***

print('factorising {:d} matrices: '.format(n),t1-t0, 'seconds')

# free up the memory 
del As

***Does the time for factorising 1 matrix vs. 10 matrices scale as you expect?***

> <mark>*~ your answer here ~*</mark>

***Explain whether we need to have FINISHED factorising the FIRST matrix before starting on the SECOND.***

> <mark>*~ your answer here ~*</mark>


### 3.2 Thinking about parallelisation

A multi-core microprocessor contains several independent processing units. We can think of these as forming a **pool** of workers. At any given time, some of the workers may be **idle** while others will be busy completing an **assigned task**. When a new request comes along, it will be either assigned to an available worker or, in the event everyone is busy, stored in a **queue**. When a worker finishes a task, they are **returned to the pool** ready to receive the next queued assignment.

***If the input to one assignment does not depend on the outcome of another, then two workers can complete their tasks simultaneously***.


This case is relatively easy to treat in Python using the [`multiprocessing`](https://docs.python.org/3.4/library/multiprocessing.html?highlight=process) library, the [`Pool`](https://docs.python.org/3.4/library/multiprocessing.html?highlight=process#multiprocessing.pool.Pool) class, and the [`map()`](https://docs.python.org/3.4/library/multiprocessing.html?highlight=process#multiprocessing.pool.Pool.map) method. Although we cannot do multiprocessing inside this Jupyter Notebook, I have included a supplementary script `parallel_example.py` that demonstrates how to parallelise a loop and what kind of speed-ups can be achieved.

<img src="img/parallel_example.png" alt="Drawing" style="width: 500px;"/>

***Visual Studio Code printout for `parallel_example.py`, a parallelised version of Example 3.1.***
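
For a feel of what the `Pool` and `map()` pattern looks like, here is a minimal sketch. It is not the contents of `parallel_example.py`: the task is a hypothetical CPU-bound stand-in (`count_primes`) rather than an LU factorisation, and the pool size of 4 is an arbitrary choice. Save it as a script and run it from a terminal rather than from inside the notebook.

```python
from multiprocessing import Pool
from time import time

def count_primes(limit):
    """CPU-bound stand-in task: count the primes below limit by trial division."""
    count = 0
    for k in range(2, limit):
        if all(k % d for d in range(2, int(k**0.5) + 1)):
            count += 1
    return count

def run_serial(limits):
    """Complete the tasks one after another."""
    return [count_primes(m) for m in limits]

def run_parallel(limits, n_workers):
    """Hand the tasks to a pool of workers."""
    # Each task is independent, so idle workers can pick them up in any order.
    # map() still returns the results in the same order as the inputs.
    with Pool(n_workers) as pool:
        return pool.map(count_primes, limits)

if __name__ == '__main__':
    limits = [20000] * 8     # eight independent tasks

    t0 = time()
    serial = run_serial(limits)
    t1 = time()
    parallel = run_parallel(limits, 4)
    t2 = time()

    print('serial:  ', t1 - t0, 'seconds')
    print('parallel:', t2 - t1, 'seconds')
    print('same results:', serial == parallel)
```

The `if __name__ == '__main__':` guard matters: on some platforms each worker re-imports the script, and without the guard every worker would try to create its own pool. For very small tasks, the cost of starting the workers can outweigh the gain.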

Finally, if the two assignments are **related to one another**, two workers may still be able to complete them simultaneously, although they may need to **communicate with each other** from time to time. To treat this case, we require a [message passing protocol](https://en.wikipedia.org/wiki/Message_passing). You won't need to do that in this course.

### 3.3 Speed-up

The purpose of parallelisation is to reduce the execution time of a program. By measuring execution time for different sized pools, we can develop a sense of relative gains and diminishing returns. We define the parallel speedup, $S$, and parallel efficiency, $E$,

\begin{equation}
S=\frac{T_s}{T_p},\quad\quad E = \frac{S}{n_p}
\end{equation}

where $T_s$ is the **serial execution time** and $T_p$ is the parallel execution time for a pool of size $n_p$.
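
As a worked example of these definitions, the sketch below computes $S$ and $E$ for a few pool sizes. The timings are hypothetical numbers chosen for illustration, not measurements from `parallel_example.py`.

```python
def speedup(t_serial, t_parallel):
    """Parallel speed-up S = T_s / T_p."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_p):
    """Parallel efficiency E = S / n_p."""
    return speedup(t_serial, t_parallel) / n_p

# Hypothetical timings in seconds (not measured): pool size -> T_p.
t_serial = 12.0
t_parallel = {1: 12.0, 2: 6.4, 4: 3.75, 8: 3.0}

for n_p, t_p in t_parallel.items():
    S = speedup(t_serial, t_p)
    E = efficiency(t_serial, t_p, n_p)
    print('n_p = {:d}: S = {:.2f}, E = {:.2f}'.format(n_p, S, E))
```

In this made-up data, doubling the pool from 4 to 8 workers only raises the speed-up from 3.2 to 4, so the efficiency drops from 0.8 to 0.5: each extra worker contributes less than the one before.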

***Run the cell below to plot parallel speed-up and efficiency for `parallel_example.py`.***

In [None]:
parallel_analysis()

***What does the phrase "linear speedup" imply?***

> <mark>*~ your answer here ~*</mark>

***Which sections of the left-hand plot would you consider "sublinear" and "supralinear"?***

> <mark>*~ your answer here ~*</mark>

***Explain how the parallel speedup plot shows diminishing returns.***

> <mark>*~ your answer here ~*</mark>

***Why might the speedup get WORSE for a very large pool?***

> <mark>*~ your answer here ~*</mark>

In [None]:
# solution code for 3.1
#t0 = time()
#for A in As:
#    P,L,U = lu(A)
#t1 = time()