# Ch. 3. Parallel Algorithms

## I. Trivially Parallelizable Algorithms
To make proper use of an HPC system or any other parallel computing environment, your code has to be able to run in parallel. As we saw with our Monte Carlo pi example, running the same algorithm in parallel can change performance dramatically. If a job on an HPC system has access to hundreds of cores but is written to run in serial, it cannot use the computational resources available to it. In theory, if your task is completely parallelizable, scaling it up to _n_ cores speeds it up by a factor of _n_. That is to say, a completely parallelizable job run on 20 cores will run up to 20 times faster than on one core. A graph of this is below:

![Theoretical maximum speedup](https://cdn.comsol.com/wordpress/2014/03/Graph-depicting-how-the-size-of-the-job-increases-with-the-number-of-available-processes.png)
This graphic shows the theoretical maximum speedup of various types of HPC jobs. The x-axis is the number of cores the job runs on, the y-axis is how much faster the job runs (i.e. a y-value of 20 means the job is 20 times faster than on a single core), and _phi_ is the fraction of the job that can be parallelized. The mathematical principle behind it is called [the Gustafson-Barsis Law](https://en.wikipedia.org/wiki/Gustafson%27s_law).
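
To make the relationship between cores, _phi_, and speedup concrete, here is a minimal sketch of the scaled speedup as it is usually written for the Gustafson-Barsis Law, `(1 - phi) + phi * n`. The function name `scaled_speedup` and the sample values of _phi_ and core counts are illustrative assumptions, not taken from the graph.

```python
def scaled_speedup(n_cores, phi):
    """Scaled speedup under the Gustafson-Barsis Law.

    n_cores: number of cores the job runs on
    phi:     fraction of the job that can be parallelized (0 to 1)
    """
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must be between 0 and 1")
    # The serial fraction (1 - phi) is not sped up; the parallel
    # fraction phi scales with the number of cores.
    return (1.0 - phi) + phi * n_cores


if __name__ == "__main__":
    # Illustrative values only
    for phi in (0.5, 0.9, 1.0):
        speedups = [round(scaled_speedup(n, phi), 2) for n in (1, 20, 100)]
        print(f"phi={phi}: {speedups}")
```

With `phi = 1` the function returns exactly the number of cores, which is the perfectly scaling case discussed next.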

The class of algorithms for which _phi_ is 1, that is to say the class of algorithms that scale perfectly, is known as the "trivially parallelizable" class. For these you can expect performance to scale linearly with the number of cores. These algorithms are, rather unsurprisingly, very easy to make parallel. Because of this, people using HPC systems often try to break their workloads down into "building blocks" made up of trivially parallelizable algorithms. A trivially parallelizable algorithm is defined by task independence: if you break the algorithm up into tasks, no task may depend on the output of any other task. The image below represents a trivially parallelizable algorithm:
![trivially parallel](http://matthewrocklin.com/slides/images/embarrassing-process.png)
Each set of input data goes into a process and comes out changed. No process depends on what is happening in the other simultaneous processes.
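
As a rough illustration of task independence, the sketch below contrasts a task that needs only its own input with a computation where each step needs the previous step's output. The functions `square` and `running_total` and the pool size of 4 are hypothetical choices made for this example.

```python
from multiprocessing import Pool


def square(x):
    """Independent task: the result depends only on this task's input."""
    return x * x


def running_total(values):
    """Dependent steps: each total needs the total before it,
    so the steps cannot simply be handed out to separate processes."""
    totals = []
    current = 0
    for v in values:
        current += v
        totals.append(current)
    return totals


if __name__ == "__main__":
    inputs = list(range(10))

    # Independent tasks can be mapped across worker processes
    with Pool(4) as pool:
        parallel_result = pool.map(square, inputs)

    # The parallel result matches the plain serial loop
    print(parallel_result == [square(x) for x in inputs])
    print(running_total(inputs))
```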

### Example 3.1 - Generation of Data
As an example of a trivially parallelizable algorithm, we are going to generate lots of numbers in serial and in parallel. This will also teach you a very practical skill with Python's `multiprocessing` library: how to "unroll" a trivially parallelizable loop into processes that can run simultaneously.

In [7]:
%%time
# Generating numbers in serial

with open("/home/users/glick/intro-to-hpc/data/datagen.out", "w") as file:
    # Printing numbers to file
    for i in range(1,10000001):
        file.write(str(i)+'\n')
        
with open("/home/users/glick/intro-to-hpc/data/datagen.out") as file:
    # Reading numbers from file. Don't print them all out because there're too many
    print(len(file.readlines()))

10000000
CPU times: user 6.03 s, sys: 699 ms, total: 6.73 s
Wall time: 6.25 s


In [11]:
# Generating numbers in parallel

from multiprocessing import Pool
import time
file = open("/home/users/glick/intro-to-hpc/data/datagenparallel.out", "w")

# We simply replace the for loop with a function...
def process_single(i):
    file.write(str(i)+'\n')

# Create a multiprocessing Pool
pool = Pool(32)

#... And then map our inputs to it as follows:
tasks = [pool.apply_async(process_single, (j,)) for j in range(1,100001)]

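# Time how long we wait for the last submitted task. This is only a rough
# measure: earlier tasks may still be running when the last one finishes.
%time tasks[-1].get()

# Give the workers a moment to finish, then close the parent's file handle.
# Lines still sitting in a worker's write buffer at this point are not on disk yet.
time.sleep(3)
file.close()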

with open("/home/users/glick/intro-to-hpc/data/datagenparallel.out") as file2:
    # Reading numbers from file. Don't print them all out because there're too many
    print(len(file2.readlines()))


CPU times: user 4.04 s, sys: 2.58 s, total: 6.62 s
Wall time: 4.58 s
89273


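You can see that writing the code in parallel is not much more complex than writing it in serial. The two cells are not directly comparable, though: the parallel cell submits only 100,000 numbers against 10,000,000 in the serial cell, and `%time` measures only the wait for the last submitted task. The parallel run also lost data: the file holds 89,273 lines rather than 100,000, most likely because each worker process keeps its own buffered copy of `file`, and whatever is still in those buffers when the file is read back has not reached disk. A more robust pattern is to have each task return its result and let the parent process do the writing. In this case, the task is carried out so quickly in serial that it is not really worth parallelizing, but there are many real-life scenarios where a workflow, or at least part of one, is trivially parallelizable and parallelizing it makes sense.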
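
The parallel cell above printed 89273 lines for 100000 submitted numbers. One way to make sure every number is written is to keep all file access in the parent process and let the workers only format text. The sketch below is one possible arrangement under that assumption, not the notebook's original method: the helper names `format_chunk`, `make_chunks`, and `generate_parallel`, the temporary output path, the pool size of 4, and the chunk size are all illustrative.

```python
import os
import tempfile
from multiprocessing import Pool


def format_chunk(bounds):
    """Return the lines for the integers in [start, stop) as one string."""
    start, stop = bounds
    return "".join(f"{i}\n" for i in range(start, stop))


def make_chunks(first, last, chunk_size):
    """Split the inclusive range [first, last] into (start, stop) pairs."""
    return [(s, min(s + chunk_size, last + 1))
            for s in range(first, last + 1, chunk_size)]


def generate_parallel(path, last, workers=4, chunk_size=10_000):
    """Write 1..last to `path`, one number per line.

    Workers format chunks in parallel; only the parent writes to the file.
    """
    chunks = make_chunks(1, last, chunk_size)
    with Pool(workers) as pool, open(path, "w") as out:
        # imap yields results in input order, so the file stays sorted
        for text in pool.imap(format_chunk, chunks):
            out.write(text)


if __name__ == "__main__":
    # A temporary directory stands in for the notebook's data directory
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "datagenparallel.out")
        generate_parallel(path, 100_000)
        with open(path) as f:
            print(sum(1 for _ in f))
```

Grouping the numbers into chunks keeps the number of tasks small, so less time goes into passing work between processes than with one task per number.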

## II. Monte Carlo Simulations

### Example 3.2 - Monte Carlo Frog Hop Simulation

## III. Single Instruction, Multiple Data

### Example 3.3 - Vector Addition

## IV. Multiple Instruction, Multiple Data

### Example 3.4 - Asynchronous Branching Logic

## V. Concurrency vs. Parallelism

### Example 3.5a - Downloading and IO-Bound Tasks

### Example 3.5b - Data Generation and CPU-Bound Tasks

## VI. Dataflows

### Example 3.6 - `sleep_fail` dataflow

## Exercise 3. Write a dataflow that estimates the value of _e_ 