# Serial Optimization
How do you optimize serial code? Two techniques are easy go-tos for serial optimization in Python:
1. Vectorization
2. Memoization

## Vectorization
Because Python is an interpreted language, i.e. its code is executed step by step by an interpreter program at run time rather than compiled ahead of time to machine code, there can be a lot of overhead compared to compiled programs. Variable types are not declared, so the interpreter has to check the type of every operand and look up the matching function each time an operation runs. As a result, when you're using Python for large datasets or for long loops, it can pay to implement vectorization.
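
To make the type-checking overhead concrete, here is a minimal sketch (the function name `double` is made up for illustration). The same one-line function accepts several input types, which means the interpreter has to decide what `*` means on every single call:

```python
def double(x):
    # Nothing here declares the type of x, so the interpreter inspects x
    # at run time and dispatches to the matching multiplication routine.
    return x * 2

print(double(21))      # integer multiplication
print(double(1.5))     # floating-point multiplication
print(double("ab"))    # string repetition
print(double([1, 2]))  # list repetition
```

Inside a loop of a million iterations, this lookup is repeated a million times, even when every element has the same type.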

Vectorization replaces an explicit Python loop with a call into a precompiled version of the loop or function, such as a `numpy` array operation. The type checks then happen once per array rather than once per element, so there is much less overhead per operation, providing large speedups. As an added benefit, the compiled code can use Single Instruction, Multiple Data (SIMD) instructions: these allow the same instruction to be performed on multiple cells of data simultaneously, providing an even greater speedup.

Consider the following examples. The first uses base Python to build a list of doubled integers using a `for` loop. The second does the same, but uses `numpy` to vectorize the calculation. The `%%timeit` readouts show a large speed increase in the second example.

In [10]:
%%timeit
result = []
for x in range(1_000_000):
    result.append(x * 2)

35.4 ms ± 232 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [6]:
import numpy as np

In [None]:
%%timeit
ints = np.arange(1_000_000)
result_np = ints * 2

1.68 ms ± 67.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [None]:
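# Names assigned inside a %%timeit cell are discarded, so rebuild both results here
assert [x * 2 for x in range(1_000_000)] == (np.arange(1_000_000) * 2).tolist(), "Results do not match!"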

In [13]:
%timeit total = sum(range(10))

85.4 ns ± 0.302 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


In [14]:
%%timeit 
import random

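# Note: func is only defined in this cell, never called, so the timing below excludes the loop body
def func():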
    total = 0
    for _ in range(10):
        total += random.randint(1, 100)

114 ns ± 0.737 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


Comparing the first two timings above, the `numpy` version takes 1.68 ms against 35.4 ms for the `for` loop, a speedup of roughly 20x.

Now, try this yourself. The following code generates an array of 1000 random 2D points, and calculates the Euclidean distance between every pair of points.

In [12]:
from math import sqrt
from line_profiler import profile

points = np.random.rand(1000, 2)

@profile
def pairwise_distances(points):
  n = len(points)
  distances = np.zeros((n, n))
  for i in range(n):
      for j in range(n):
          distances[i, j] = sqrt((points[i, 0] - points[j, 0]) ** 2 + (points[i, 1] - points[j, 1]) ** 2)
  return distances

#### Q1: Profiling
How long does it take to calculate the Euclidean distance using the provided code?

In [10]:
%timeit pairwise_distances(points)

503 ms ± 1.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### Are there specific parts of the code that are slow?

In [14]:
%prun pairwise_distances(points)

 

         1000671 function calls (1000661 primitive calls) in 0.635 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.566    0.566    0.626    0.626 2478890263.py:6(pairwise_distances)
  1000000    0.061    0.000    0.061    0.000 {built-in method math.sqrt}
        2    0.007    0.003    0.007    0.004 {method '__exit__' of 'sqlite3.Connection' objects}
        1    0.000    0.000    0.626    0.626 <string>:1(<module>)
        2    0.000    0.000    0.001    0.000 iostream.py:276(<lambda>)
        5    0.000    0.000    0.000    0.000 attrsettr.py:66(_get_attr_opt)
       14    0.000    0.000    0.000    0.000 socket.py:623(send)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       66    0.000    0.000    0.000    0.000 enum.py:1585(_get_value)
        1    0.000    0.000    0.008    0.008 history.py:1024(writeout_cache)
        2    0.000    0.000    0.000    0.

In [16]:
%load_ext line_profiler

In [17]:
%lprun -f pairwise_distances pairwise_distances(points)

Timer unit: 1e-09 s

Total time: 0.703022 s
File: /var/folders/_g/qmpy73s53td4zg6jqd4y__yr0000gq/T/ipykernel_33512/2478890263.py
Function: pairwise_distances at line 6

Line #      Hits         Time  Per Hit   % Time  Line Contents
     6                                           @profile
     7                                           def pairwise_distances(points):
     8         1       5000.0   5000.0      0.0    n = len(points)
     9         1    1863000.0    2e+06      0.3    distances = np.zeros((n, n))
    10      1001     108000.0    107.9      0.0    for i in range(n):
    11   1001000  102769000.0    102.7     14.6        for j in range(n):
    12   1000000  598276000.0    598.3     85.1            distances[i, j] = sqrt((points[i, 0] - points[j, 0]) ** 2 + (points[i, 1] - points[j, 1]) ** 2)
    13         1       1000.0   1000.0      0.0    return distances

In [20]:
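# Vectorized operations with NumPy
def vectorized_pairwise_distances(points):
    # Broadcast (n, 1, 2) against (1, n, 2) to get all coordinate differences, shape (n, n, 2)
    diffs = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    # Sum the squared x and y differences over the last axis, then take the square root
    return np.sqrt(np.sum(diffs**2, axis=-1))

# Compare with a tolerance, since floating-point results may differ in the last bits
if np.allclose(vectorized_pairwise_distances(points), pairwise_distances(points)):
    print("Both functions yield the same result!")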

In [None]:
%timeit vectorized_pairwise_distances(points)
%lprun -f vectorized_pairwise_distances vectorized_pairwise_distances(points)
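
When checking a vectorized distance function, a tiny input with known answers is useful. The following is a minimal sketch, not the only way to do it: it keeps the x and y coordinates separate, broadcasts each against itself, and combines them with `np.hypot`. The function name `pairwise_distances_hypot` and the three test points are made up for illustration.

```python
import numpy as np

def pairwise_distances_hypot(points):
    x = points[:, 0]
    y = points[:, 1]
    # x[:, np.newaxis] has shape (n, 1) and x[np.newaxis, :] has shape (1, n),
    # so the subtraction broadcasts to an (n, n) grid of differences.
    dx = x[:, np.newaxis] - x[np.newaxis, :]
    dy = y[:, np.newaxis] - y[np.newaxis, :]
    # np.hypot computes sqrt(dx**2 + dy**2) element-wise
    return np.hypot(dx, dy)

# Three points on a line, spaced so that neighbours are a 3-4-5 triangle apart
pts = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dist = pairwise_distances_hypot(pts)
print(dist.shape)
print(dist)
```

The result should be a symmetric 3 x 3 matrix with zeros on the diagonal, 5 between neighbouring points, and 10 between the first and last point.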

#### Vectorize the code, and verify it produces the same output. Is it faster? How much faster?

The following code estimates $\pi$ using Monte Carlo integration. Answer the following questions:
1. Is it correct?
2. How long does it take to run?
3. Are there any bottlenecks?

Then vectorize the code, verify the outputs, and measure any speedup.

In [1]:
import random

def monte_carlo_pi(num_samples):
    inside_circle = 0
    for _ in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x**2 + y**2 <= 1:
            inside_circle += 1
    return (inside_circle / num_samples) * 4

In [2]:
%timeit monte_carlo_pi(1_000_000)

306 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
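def monte_carlo_pi_numpy(num_samples):
    # Draw all x and y coordinates at once, uniformly in [-1, 1)
    x, y = np.random.uniform(-1, 1, (2, num_samples))
    # The mean of the boolean mask is the fraction of points inside the unit circle
    pi = 4 * np.mean(x**2 + y**2 <= 1)
    return pi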


In [9]:
%timeit monte_carlo_pi_numpy(1_000_000)

10.7 ms ± 125 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
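
Because both versions draw random numbers, their outputs cannot be compared for exact equality. One possible way to verify a Monte Carlo estimate is to check that it lands within a few standard errors of `math.pi`. The sketch below assumes a seeded generator for reproducibility; the name `monte_carlo_pi_seeded`, the seed value, and the 10-standard-error tolerance are choices made here for illustration.

```python
import math
import numpy as np

def monte_carlo_pi_seeded(num_samples, seed=0):
    # A seeded generator makes the estimate reproducible between runs
    rng = np.random.default_rng(seed)
    x, y = rng.uniform(-1, 1, (2, num_samples))
    return 4 * np.mean(x**2 + y**2 <= 1)

num_samples = 1_000_000
estimate = monte_carlo_pi_seeded(num_samples)

# Each sample is a Bernoulli trial with success probability p = pi / 4,
# so the standard error of the estimate is 4 * sqrt(p * (1 - p) / N).
p = math.pi / 4
std_error = 4 * math.sqrt(p * (1 - p) / num_samples)

print(estimate, abs(estimate - math.pi), std_error)
assert abs(estimate - math.pi) < 10 * std_error, "Estimate is implausibly far from pi"
```

For one million samples the standard error is about 0.0016, so an estimate that is off in the first decimal place would point to a bug rather than bad luck.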


## Memoization
Memoization is a technique that caches a function's results as they are calculated, so they can be quickly retrieved when the function is called again with the same input. It avoids expensive computation of the same results again and again.

We can implement a cache ourselves in basic Python with a simple dictionary.

In [None]:
cache = {}
def fib_cache(n):
    if n in cache:
        return cache[n]
    if n < 2:
        result = n
    else:
        result = fib_cache(n-1) + fib_cache(n-2)
    cache[n] = result
    return result
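
To see how much work a dictionary cache saves, the sketch below counts function calls for an uncached and a cached Fibonacci function. The names `fib_naive`, `fib_memo`, `memo`, and `calls` are made up for this illustration.

```python
from collections import Counter

calls = Counter()  # tallies how often each version is called

def fib_naive(n):
    calls["naive"] += 1
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

memo = {}

def fib_memo(n):
    calls["memo"] += 1
    if n in memo:
        return memo[n]  # cache hit: no recursion needed
    result = n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)
    memo[n] = result
    return result

assert fib_naive(20) == fib_memo(20) == 6765
print(calls["naive"], calls["memo"])
```

The uncached version recomputes the same subproblems over and over, so its call count grows exponentially with `n`, while the cached version computes each value once and its call count grows linearly.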


We can also use the `functools` module, which is part of the Python standard library and provides the function decorator `@lru_cache`. LRU stands for Least Recently Used: when the cache hits its size limit, it evicts the least recently used entry to make room for new data. With `maxsize=None`, as used here, the cache is unbounded and nothing is ever evicted.

In [None]:
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_lru(n):
    if n < 2:
        return n
    return fib_lru(n-1) + fib_lru(n-2)
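
Functions wrapped with `lru_cache` expose a `cache_info()` method that reports hits, misses, and the current cache size, which is handy for checking that the cache is doing what you expect. The sketch below first inspects an unbounded cache, then uses a deliberately tiny `maxsize=2` cache to show LRU eviction. The function names `fib_unbounded` and `square` are made up for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_unbounded(n):
    return n if n < 2 else fib_unbounded(n - 1) + fib_unbounded(n - 2)

print(fib_unbounded(30))
print(fib_unbounded.cache_info())  # one miss per distinct n from 0 to 30

@lru_cache(maxsize=2)
def square(x):
    return x * x

square(1)  # miss, cache holds {1}
square(2)  # miss, cache holds {1, 2}
square(1)  # hit, which makes 2 the least recently used entry
square(3)  # miss, cache is full so 2 is evicted
square(2)  # miss again, because 2 was evicted
print(square.cache_info())
```

A bounded cache trades some repeated computation for a fixed memory footprint, which matters when the range of possible inputs is large.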