# Notebook 10

# Parallel Algorithms

This is a very quick introduction to parallel algorithms. Computers are becoming increasingly parallel, and 2-30 cores are now quite typical in consumer machines. GPUs on many modern computers can also run thousands of threads at the same time, but they have a particular architecture that makes programming them challenging (and that varies with brand).

Here we will discuss algorithm design for parallel computers from a high-level perspective. Actual parallel coding varies quite a bit among languages, and Python (which we have been using) is not the best example, as most Python implementations do not support multithreading well. In real parallel programming, you also need to worry about issues relating to locks and synchronization, which can be quite complicated to get right. Think of making our B-tree parallel. What if one thread is doing an insertion while another is searching? What if both are inserting at the same time?

The model we will be discussing is the PRAM (the *parallel RAM*) which has $p$ processors that run at the same time in lock-step and that share memory. The main parameters we will be using are:

- $T_p$: The running time with $p$ processors
- $T_1$: The *sequential* runtime, what you would get with only one processor
- $T_\infty$: The runtime with infinite processors. Often called the *span*
- $W$: The work, the sum of the amount of time each processor spends. For a given $p$, this is at most $pT_p$. With only one processor $T_1=W$

Brent's law says that if you know $T_1$ and $T_\infty$, you can get the runtime for any number of processors:

$$ T_p = O\left(T_\infty + \frac{T_1}{p} \right) $$

This makes sense, as you cannot go faster than $T_\infty$ and you cannot get more than a factor $p$ speedup over $T_1$.
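
To get a feel for the bound, here is a minimal sketch that evaluates $T_\infty + T_1/p$ for a few values of $p$. The function name `brent_bound` and the sample numbers are our own illustration, and the constant hidden by the $O(\cdot)$ is ignored.

```python
import math


def brent_bound(t1: float, t_inf: float, p: int) -> float:
    """Evaluate T_inf + T_1 / p, ignoring the constant hidden by the O()."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return t_inf + t1 / p


# Hypothetical algorithm with T_1 = n log n and T_inf = log^2 n.
n = 1_000_000
t1 = n * math.log2(n)
t_inf = math.log2(n) ** 2

for p in (1, 8, 64, 1024):
    print(f"p = {p:5d}: bound = {brent_bound(t1, t_inf, p):,.0f}")
```

For small $p$ the $T_1/p$ term dominates, so doubling the processors roughly halves the bound. Once $p$ approaches $T_1/T_\infty$, the span term takes over and adding processors stops helping.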

In an ideal world, we want algorithms that have the same work as the sequential algorithms, but run much faster as $p$ grows.

### Example: Sorting, take 0

One good place to start is to ask which algorithm gives the smallest value of $T_\infty$. That is, given an unrestricted number of processors, what is the fastest you can solve the problem?

One very simple way to sort is to permute the values and check whether they are in order. With $n!$ processors, each can handle a different permutation and check whether it is in sorted order in $O(n)$ time. Only one processor will get the numbers in sorted order, and that one can output it.

This algorithm has $T_\infty=O(n)$ and work $O(n\cdot n!)$. So the parallel runtime is faster than the $O(n \log n)$ of mergesort, but the work is much, much worse!
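
Here is a minimal sequential simulation of this idea, where each loop iteration plays the role of one of the $n!$ processors. The names `is_sorted` and `permutation_sort` are our own, and this is only a sketch of the idea, not real parallel code.

```python
from itertools import permutations


def is_sorted(seq):
    """O(n) check that one simulated processor performs on its permutation."""
    return all(seq[i] <= seq[i + 1] for i in range(len(seq) - 1))


def permutation_sort(values):
    """Try every permutation; on a PRAM each one would go to its own processor."""
    for candidate in permutations(values):
        if is_sorted(candidate):
            # On a PRAM, this is the processor that outputs its result.
            return list(candidate)


print(permutation_sort([3, 1, 4, 2]))
```

Run sequentially, the loop performs all of the $O(n \cdot n!)$ work itself, which is why this is only usable for very small inputs.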

### Example: Sorting, take 1

Recall that mergesort is a sequential algorithm that runs in time $T_1=O(n \log n)$ and work $O(n \log n)$.

We assume the input is an array $A$ of size $n$ of distinct numbers, and $B$ is the destination array, both indexed from 1. Using $p=n$ processors, processor $i$ counts how many items in $A$ are at most $A[i]$; call this $\ell_i$. It then sets $B[\ell_i]=A[i]$. The array $B$ is now sorted.

This takes time $T_n=O(n)$ with $n$ processors, which is faster than mergesort. However, the work is $O(n^2)$, which is significantly worse.
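
The counting step can be sketched as follows. This is a sequential simulation in which each iteration of the outer loop stands in for one processor; the name `rank_sort` is our own, and Python's 0-based indexing means we store at position $\ell_i - 1$.

```python
def rank_sort(a):
    """Sort distinct values by computing each item's rank independently."""
    n = len(a)
    b = [None] * n
    for i in range(n):  # iteration i simulates processor i
        # ell_i: how many items are at most a[i] (counts a[i] itself).
        rank = sum(1 for x in a if x <= a[i])
        b[rank - 1] = a[i]  # shift by one for 0-based indexing
    return b


print(rank_sort([5, 2, 9, 1]))
```

No two processors write to the same cell because the values are distinct, but all of them read every cell of `a`, so concurrent reads are needed.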

### Example: Merging

Suppose you have two sorted lists $A$ and $B$ of size $n/2$ and you want to merge them into a sorted list $C$. With $n$ processors, we can do something similar to the sorting example: assign a single processor to each data item and count how many items in $A$ and $B$ combined are less than or equal to it, which gives its position in the sorted list. However, as the lists are sorted, each processor can use binary search rather than sequential search. This takes time $T_n=O(\log n)$, and $O(n \log n)$ work. With $p\leq n$ processors this thus takes time:

$$ T_p = O \left( \log n + \frac{n \log n}{p} \right) $$

This is faster than the normal sequential merge, which takes $O(n)$ time, when $p > \log n$, but the work is a factor of $\log n$ worse.
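
Here is a minimal sketch of the merge, again simulated sequentially with one loop iteration per processor. The name `rank_merge` is our own. Using `bisect_left` for items of one list and `bisect_right` for the other is our choice for breaking ties, so that equal items in the two lists do not land in the same cell.

```python
from bisect import bisect_left, bisect_right


def rank_merge(a, b):
    """Merge two sorted lists by computing every item's final position."""
    c = [None] * (len(a) + len(b))
    for i, x in enumerate(a):  # one simulated processor per item of a
        # Position = items before x in a (that is i) + items of b smaller than x.
        c[i + bisect_left(b, x)] = x
    for j, y in enumerate(b):  # one simulated processor per item of b
        # Items of a equal to y are placed first, hence bisect_right.
        c[j + bisect_right(a, y)] = y
    return c


print(rank_merge([1, 3, 5], [2, 3, 6]))
```

Each binary search takes $O(\log n)$ steps, which is where the $O(\log n)$ time and $O(n \log n)$ work come from.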

### Example: Sorting, take 2

Run mergesort, but with the parallel merge just described. This gives

$$ T_p = O \left( \log^2 n + \frac{n \log^2 n}{p} \right) $$

This is also faster than normal mergesort when $p > \log n$, and the work is again a factor of $\log n$ worse. There are parallel sorting algorithms that have optimal $T_1 = O(n \log n)$ and $T_\infty = O(\log n)$, but they are more complicated.
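
One way to see where the bound comes from is to write $S(n)$ for the span and $W(n)$ for the work of this mergesort (names we introduce here for the sketch). The two recursive calls can run at the same time, and the parallel merge adds $O(\log n)$ span and $O(n \log n)$ work, so

$$ S(n) = S(n/2) + O(\log n) = O(\log^2 n) $$

$$ W(n) = 2W(n/2) + O(n \log n) = O(n \log^2 n) $$

Plugging $T_\infty = S(n)$ and $T_1 = W(n)$ into Brent's law gives the expression for $T_p$.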




# PRAM: Variations based on reading and writing

There are multiple variations of whether each processor can read or write the same memory location at the same time.

- EREW: Exclusive-read, exclusive-write. Each processor must read/write from different locations at each time step
- CREW: Concurrent-read, exclusive-write. Multiple processors can read from the same address, but they must write to different addresses
- CRCW: Concurrent-read, concurrent-write. There are variations of this depending on what happens when multiple processors write to the same address at the same time
  - CRCW-P: Priority. The lowest-numbered processor that writes has its write go through
  - CRCW-A: Arbitrary. An arbitrary write succeeds
  - CRCW-C: Common. Processors can write to the same address, but only if they write the same data
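
To make the three CRCW rules concrete, here is a small sequential simulation of a single time step in which several processors write at once. The function `resolve_writes`, its `rule` strings, and the `(processor, address, value)` tuple layout are all our own invention for illustration.

```python
import random


def resolve_writes(writes, rule, rng=random):
    """Return {address: stored value} for writes issued in the same time step.

    writes is a list of (processor_id, address, value) tuples.
    rule is "priority", "arbitrary", or "common".
    """
    by_address = {}
    for pid, addr, value in writes:
        by_address.setdefault(addr, []).append((pid, value))

    memory = {}
    for addr, attempts in by_address.items():
        if rule == "priority":
            # The lowest-numbered processor wins.
            memory[addr] = min(attempts, key=lambda t: t[0])[1]
        elif rule == "arbitrary":
            # Any one of the writes may succeed.
            memory[addr] = rng.choice(attempts)[1]
        elif rule == "common":
            # Legal only if every processor writes the same value.
            first = attempts[0][1]
            if any(value != first for _, value in attempts):
                raise ValueError(f"conflicting writes to address {addr}")
            memory[addr] = first
        else:
            raise ValueError(f"unknown rule: {rule}")
    return memory


step = [(2, 0, "b"), (1, 0, "a"), (3, 1, "c")]
print(resolve_writes(step, "priority"))
```

In this example processors 1 and 2 both write to address 0. Under the priority rule processor 1 wins, under the arbitrary rule either value may be stored, and under the common rule the step is illegal because the two values differ.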
 
If you look at the algorithms we discussed for sorting, they were in the CREW model. 

Here is an example of how we can use the additional power of a CRCW to get a faster span. Consider the problem of the convex hull in 2D: given a set of points in 2D, compute the extreme points, that is those you would get if you stretch a rubber band around the points. 

Here is a simple CRCW-C algorithm. Create an array of size $n$ containing $n$ true values, each corresponding to a single point. Then, use $n^4$ processors to look at each 4-tuple of points $abcd$, and if $d$ is in $\triangle abc$, set the value corresponding to $d$ to false. This only takes constant time per processor, and afterwards only the points on the hull will still be true. Another constant-span CRCW-C algorithm with $n^2$ processors could then place the convex hull points consecutively in memory.
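
The first step can be sketched as a sequential simulation, with one loop iteration per 4-tuple standing in for one of the $n^4$ processors. The helper names `cross`, `in_triangle`, and `extreme_points` are our own. We also assume the points are distinct and not all on one line, and we skip degenerate (collinear) triangles.

```python
from itertools import product


def cross(o, a, b):
    """Twice the signed area of triangle o, a, b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_triangle(a, b, c, d):
    """True if d lies inside or on the boundary of non-degenerate triangle abc."""
    if cross(a, b, c) == 0:
        return False  # assumption: collinear triples are ignored
    signs = (cross(a, b, d), cross(b, c, d), cross(c, a, d))
    has_negative = any(s < 0 for s in signs)
    has_positive = any(s > 0 for s in signs)
    return not (has_negative and has_positive)


def extreme_points(points):
    """Mark every point that falls in a triangle formed by three other points."""
    n = len(points)
    on_hull = [True] * n
    for ia, ib, ic, i_d in product(range(n), repeat=4):  # n^4 simulated processors
        if len({ia, ib, ic, i_d}) < 4:
            continue  # the four points must be different
        if in_triangle(points[ia], points[ib], points[ic], points[i_d]):
            # Every processor that writes here writes False, as CRCW-C requires.
            on_hull[i_d] = False
    return [p for p, keep in zip(points, on_hull) if keep]


pts = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
print(extreme_points(pts))
```

Many simulated processors may set the same entry of `on_hull`, but they all write the same value, so the concurrent writes are legal under the common rule.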

Do you see how a similar approach could be used to sort with $O(1)$ span on a CRCW-C?