# Parallelism

In [1]:
import numba
import numpy as np

## Introduction

The heat generated on a CPU can largely be attributed to transistors.

At least two things determine the heat generated by a transistor:

1. How large the transistor is -- More surface area implies more heat
2. How quickly the transistor switches on and off (the clock speed)

There have been periods of time when computers got faster largely because their transistors could switch more quickly. However, we have been pushing the physical limits of how small a transistor can go... The first transistors were ~1 cm across (roughly the size of a ladybug), while the newest chips are built on processes labeled 3 nm, a length of roughly 12 atoms.

You used to be able to "wait" for computers to get faster and then your code would "automagically" be faster a few years later because transistors were smaller and could thus switch faster (which allowed your calculations to get faster).

Computers now can't do serial computations much more quickly than those of 2-3 years ago, but they have substantially more processing power because there are more processors/transistors on a single chip. This means that in order to take advantage of the advances in computing, we must evolve to write non-serial (i.e. parallel) code.

## Parallelism with numba

Recall our example from the "Interpreted vs Compiled" notebook:

In [2]:
@numba.jit(nopython=True)
def calculate_pi(n=1_000_000):
    """
    Approximates pi by drawing two random numbers and
    determining whether the sum of their squares is
    less than one (which tells us whether the point is
    inside the upper-right quadrant of the unit circle).
    The fraction of draws inside that quadrant approximates
    its area, which we can then multiply by 4 to get the
    area of the circle (which is pi since r=1)
    """
    in_circ = 0

    # Iterate for many samples
    for i in range(n):
        # Draw random numbers
        x = np.random.random()
        y = np.random.random()

        if (x**2 + y**2) < 1:
            in_circ += 1

    return 4 * (in_circ / n)


In [11]:
%%timeit

calculate_pi(10_000_000)

69.5 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
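
The reasoning in the docstring can be written out as a short derivation. This is the standard Monte Carlo argument rather than anything specific to this notebook; the symbol $k$ is introduced here for the value the code stores in `in_circ`. For $x$ and $y$ drawn independently and uniformly from $[0, 1)$:

$$
P\left(x^2 + y^2 < 1\right)
= \frac{\text{area of quarter circle}}{\text{area of unit square}}
= \frac{\pi / 4}{1}
\quad\Longrightarrow\quad
\pi \approx 4 \cdot \frac{k}{n}
$$

Each draw is a Bernoulli trial with $p = \pi/4$, so the standard error of the estimate is $4\sqrt{p(1-p)/n} \approx 1.64/\sqrt{n}$. More samples buy accuracy only slowly, which is part of why being able to run many draws cheaply matters.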


How could we parallelize this?

Note that each iteration of the for-loop is independent of the others. If we had a way to run each iteration on its own, then we could parallelize the loop into as many as `n` jobs.
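
To make the idea concrete, here is a minimal sketch of that decomposition using only the standard library. The names `count_in_circle` and `calculate_pi_chunked` are hypothetical and not part of the original notebook. It splits the draws into a few independent chunks, counts hits in each chunk separately, and sums the partial counts at the end. Threads are used here only because they are simple to set up; in pure Python the GIL means this will not actually run faster, so treat it as an illustration of the structure rather than a speedup.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_in_circle(n_draws, seed):
    """Count how many of n_draws random points land inside the quarter circle."""
    rng = random.Random(seed)  # private generator, so chunks share no state
    hits = 0
    for _ in range(n_draws):
        x = rng.random()
        y = rng.random()
        if x**2 + y**2 < 1:
            hits += 1
    return hits


def calculate_pi_chunked(n=200_000, n_chunks=4):
    """Split n draws into n_chunks independent jobs and combine the counts."""
    base, extra = divmod(n, n_chunks)
    # Chunk sizes differ by at most one and always sum to n
    sizes = [base + (1 if k < extra else 0) for k in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # Each job gets its own chunk size and its own seed
        partial_counts = list(pool.map(count_in_circle, sizes, range(n_chunks)))
    # The only step that combines results happens after every job is done
    return 4 * sum(partial_counts) / n


pi_estimate = calculate_pi_chunked()
print(pi_estimate)
```

The important design point is that no job touches another job's counter: each returns its own partial count, and the sum happens once at the end.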

Look at numba's [parallel computation documentation](https://numba.pydata.org/numba-doc/latest/user/parallel.html)

In [5]:
@numba.jit(nopython=True, parallel=True)
def calculate_pi_parallel(n=1_000_000):
    """
    Approximates pi by drawing two random numbers and
    determining whether the sum of their squares is
    less than one (which tells us whether the point is
    inside the upper-right quadrant of the unit circle).
    The fraction of draws inside that quadrant approximates
    its area, which we can then multiply by 4 to get the
    area of the circle (which is pi since r=1)
    """
    in_circ = 0

    # Iterate for many samples; prange marks the loop as safe to run in parallel
    for i in numba.prange(n):
        # Draw random numbers
        x = np.random.random()
        y = np.random.random()

        if (x**2 + y**2) < 1:
            in_circ += 1

    return 4 * (in_circ / n)


In [9]:
calculate_pi_parallel(1_000_000)

3.140968

In [13]:
%%timeit

calculate_pi(5)

168 ns ± 22.4 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


In [119]:
%%time

calculate_pi_parallel(100_000_000)

CPU times: user 1.51 s, sys: 2.97 ms, total: 1.51 s
Wall time: 160 ms


3.14145812

### Beware of race conditions

Any time that you write parallel code, you must understand how each computation can affect another. If you don't consider this carefully, your output could depend on the order in which each computation finishes! This is known as a "race condition" and is _very very bad_ because it introduces non-determinism into your code.

Why is it so bad? It's possible that your code returns the right answer sometimes and the wrong answer at other times -- this non-determinism makes it difficult to debug.
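
As a small illustration of "output depends on the order in which each computation finishes", here is a sketch using standard-library threads instead of numba. The function `racy_fill` is hypothetical, and the random sleep is an assumption standing in for work that takes a varying amount of time. Every task writes to its own slot and also to the shared slot `x[0]`, so whichever task finishes last decides what ends up there.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def racy_fill(n=5):
    """Every task writes its own slot x[i] and also the shared slot x[0]."""
    x = [0.0] * n

    def task(i):
        time.sleep(random.random() / 100)  # simulate work of varying length
        x[0] = float(i)  # shared slot: the last task to finish wins
        x[i] = float(i)  # private slot: no other task with i > 0 writes here

    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(task, range(n)))  # wait for every task to finish
    return x


# Run the same function several times and compare the results
results = [racy_fill() for _ in range(5)]
for r in results:
    print(r)
```

Slots 1 through `n-1` come out the same on every run, but `x[0]` can differ from run to run. Nothing crashes and no error is raised, which is exactly what makes this kind of bug hard to track down.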

In [14]:
def dumb_parallel_function(n=5):
    x = np.zeros(n)
    for i in range(n):
        x[0] = i
        x[i] = i

    return x

In [41]:
dumb_parallel_function(5)

array([4., 1., 2., 3., 4.])

In [96]:
@numba.jit(parallel=True)
def dumb_parallel_function(n=500_000):
    x = np.zeros(n)
    for i in numba.prange(n):
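        if i == n-1:
            # Only the last index writes to the shared slot x[0]
            x[0] = i
            x[i] = i
        elif i != 0:
            # Every other index (except 0) writes only to its own slot
            x[i] = i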

    return x

In [104]:
dumb_parallel_function(5)

array([4., 1., 2., 3., 4.])