## I. What is High-Performance Computing?

One common definition of High-Performance Computing (HPC) is based on the architecture of the computer it runs on. In standard computing, one physical computer with one or more cores carries out a task. All of these cores can access a pool of [Shared Memory](https://en.wikipedia.org/wiki/Shared_memory). Shared memory makes it easy for two or more programs, or parts of a single program, to exchange data and see what the others are doing. The following diagram represents a single multicore computer working in this fashion: ![multicore architecture diagram](https://i.pinimg.com/originals/22/31/f8/2231f856a5341e19526d089e1ffbe630.jpg)

On the other hand, most high-performance systems are clusters: many separate multicore computers connected over a network. In this configuration, the parts of a program running on separate physical machines cannot share memory directly and must communicate over the network. The following diagram represents multicore computers working together as a single cluster: ![HPC cluster architecture diagram](http://cialisalto.com/wp-content/uploads/2018/01/excellent-hpc-cluster-architecture-on-and-contemporary-with-data-5.jpg)

Because of the dependence on networking, programming for HPC environments requires special considerations that normal programming does not.
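
To make the distinction concrete, the following sketch contrasts the two communication styles on a single machine, using only Python's standard `multiprocessing` module. It is an illustration under assumptions, not cluster code: `add_shared`, `add_message`, and `run_demo` are hypothetical names, and a `Pipe` merely stands in for the network link that real cluster software would use.

```python
import multiprocessing as mp


def add_shared(counter, amount):
    # Shared-memory style: the worker updates a value that the
    # parent process can read directly afterwards.
    with counter.get_lock():
        counter.value += amount


def add_message(conn, amount):
    # Message-passing style: nothing is shared, so the worker must
    # explicitly send its result over a connection.
    conn.send(amount)
    conn.close()


def run_demo():
    # Three workers add 1, 2 and 3 to one shared integer.
    counter = mp.Value('i', 0)
    workers = [mp.Process(target=add_shared, args=(counter, k))
               for k in (1, 2, 3)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    shared_total = counter.value

    # Three workers each send their number back as a message.
    message_total = 0
    pairs = []
    for k in (1, 2, 3):
        parent_end, child_end = mp.Pipe()
        p = mp.Process(target=add_message, args=(child_end, k))
        p.start()
        pairs.append((p, parent_end))
    for p, parent_end in pairs:
        message_total += parent_end.recv()
        p.join()
    return shared_total, message_total


if __name__ == "__main__":
    print(run_demo())
```

Both styles reach the same total, but only the second would still work if the workers lived on different machines, because it never relies on memory that both sides can see.
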

### Example 1.1 - Computing Pi on a Standard Computer
As an example of "traditional" computation, we will now estimate the numerical value of pi using a Monte Carlo simulation.

Monte Carlo simulations are a class of computational algorithms that rely on repeated random sampling to obtain numerical results. We will discuss them in more detail in the parallel algorithms notebook. A very common Monte Carlo simulation uses some basic geometry to estimate the numerical value of pi.

For this, we imagine a circle of radius 1 inscribed in a square. Then, we choose random points in the square and classify them by whether they fall inside or outside the circle. The fraction of points that land inside estimates the ratio of the circle's area to the square's area, which is pi/4, so multiplying that fraction by 4 gives an estimate of pi. If this doesn't make sense yet, don't worry. This image may help you understand a bit better: ![Monte Carlo Pi](https://ds055uzetaobb.cloudfront.net/image_optimizer/aabd5727316301f18f53bd4cbc63914fed0bcb2c.gif)
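
Written out as a short derivation, the estimate looks as follows. The symbols $N_\text{inside}$ and $N_\text{total}$ are introduced here for illustration and correspond to the `inside` and `total` counters in the code; the code samples only the quarter of the circle that lies in the unit square, which gives the same ratio.

$$
\frac{A_\text{circle}}{A_\text{square}} = \frac{\pi \cdot 1^2}{2^2} = \frac{\pi}{4},
\qquad
\frac{A_\text{quarter circle}}{A_\text{unit square}} = \frac{\pi/4}{1} = \frac{\pi}{4}
$$

$$
\frac{N_\text{inside}}{N_\text{total}} \approx \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx 4 \, \frac{N_\text{inside}}{N_\text{total}}
$$
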
In the example below, we calculate pi by Monte Carlo simulation.

In [None]:
# Monte Carlo Pi Simulation
import random as r
import math as m

# Number of darts that land inside.
inside = 0
# Total number of darts to throw.
total = 10000

# Iterate for the number of darts.
for i in range(0, total):
    # Generate random x, y in [0, 1) and square them.
    x2 = r.random()**2
    y2 = r.random()**2
    # Increment if inside unit circle.
    if m.sqrt(x2 + y2) < 1.0:
        inside += 1

# inside / total = pi / 4
pi = (float(inside) / total) * 4

# It works!
print(pi)

## II. Why is HPC important?
High-performance computing opens the door to large-scale data analysis, computational science, and research computing. It is useful in a number of scenarios, including where software is too time-critical, too performance-critical, or simply too big to run on a traditional system.

Let's take a look at a few scenarios where you would need an HPC system, or where an HPC system drastically changes your process.

- **Scenario 1: Real-Time App:** Let's imagine you're an app developer working for The Weather Channel. You want to make an app that automatically detects where a person is, very specifically, and then uses your complex AccuWeather analysis to predict the likelihood of rain any time in the next seven days for that specific location. You would like your user to be able to get this data within seconds. There is no way you will be able to serve this user, let alone your millions of other users, this data with a single traditional computer. Having an HPC system available here is extremely important, in order to ensure your users get their weather in close to real time. ![Weather Models](http://aemstatic-ww1.azureedge.net/content/ias/en/articles/2012/08/honeywell-3d/_jcr_content/leftcolumn/article/headerimage.img.jpg/1345596242801.jpg "")
- **Scenario 2: Designing a New Car or Plane:** You're a brand new aerospace engineer working for the Mercedes-Benz Formula One team. You have the off season (usually between December and May, or about five months) to design a new car which is better than all the cars that beat you last year. Traditionally, the way to do this is start with a small model, put it in a wind tunnel, evaluate it, and repeat this process. Then, you slowly scale up to bigger models and eventually start building concept cars. However, you only have five months, and each model may take a month to design and produce. You simply don't have time. Instead, you get started with your HPC system and start creating some [Computational Fluid Dynamics](https://en.wikipedia.org/wiki/Computational_fluid_dynamics) models which you can then use to create your new car with plenty of time to spare. The image below is the output of a CFD model. ![CFD model of car](https://upload.wikimedia.org/wikipedia/commons/f/fa/Verus_Engineering_Porsche_987.2_Ventus_2_Package.png " ")

### Example 1.2 - Multiplying Matrices in Parallel

Though it is not as large-scale as either of the scenarios above, matrix multiplication is a very common and useful application of parallel programming. The standard algorithm for multiplying two _n_ by _n_ matrices runs in O(_n<sup>3</sup>_) time, so the cost grows quickly with the size of the matrices. A good parallel algorithm that spreads this work over multiple cores is therefore very valuable. An example of a parallel algorithm to multiply matrices is below.
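
For reference before the parallel version, here is a minimal serial sketch of the standard algorithm. The function name `matmul_serial` and the `multiplications` counter are hypothetical, added only to make the O(_n<sup>3</sup>_) cost visible; they are not part of the course code.

```python
def matmul_serial(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    multiplications = 0
    for i in range(n):          # one pass per row of the result
        for k in range(n):
            a_ik = A[i][k]
            for j in range(n):
                C[i][j] += a_ik * B[k][j]
                multiplications += 1
    return C, multiplications


C, ops = matmul_serial([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C)    # [[19, 22], [43, 50]]
print(ops)  # 8, which is 2**3
```

Each pass of the outer loop writes only to row `i` of the result, so different rows can be computed independently. That independence is what lets the rows be handed out to separate processes.
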

In [None]:
# Necessary definitions for matrix multiplication
import ctypes
import multiprocessing
import numpy

def read(filename):
    """Read two tab-separated integer matrices separated by a blank line."""
    with open(filename, 'r') as f:
        lines = f.read().splitlines()
    A = []
    B = []
    matrix = A
    for line in lines:
        if line != "":
            # list() is needed in Python 3, where map() returns a one-shot iterator
            matrix.append(list(map(int, line.split("\t"))))
        else:
            matrix = B
    return A, B

def printMatrix(matrix, f):
    for line in matrix:
        f.write("\t".join(map(str,line)) + "\n")

def lineMult(start):
    # A, B, mp_arr and part are globals set by the calling cell
    global A, B, mp_arr, part
    n = len(A)
    # create a new numpy array using the same memory as mp_arr
    arr = numpy.frombuffer(mp_arr.get_obj(), dtype=ctypes.c_int)
    C = arr.reshape((n,n))
    # stop at n so the last block cannot run past the final row
    for i in range(start, min(start + part, n)):
        for k in range(n):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]

def ikjMatrixProduct(A, B, threadNumber):
    n = len(A)

    # the pool is shut down when the with-block exits
    with multiprocessing.Pool(threadNumber) as pool:
        # each task handles up to `part` consecutive rows, starting at the given row
        pool.map(lineMult, range(0, int(n), int(part)))
    # mp_arr and arr share the same memory
    arr = numpy.frombuffer(mp_arr.get_obj(), dtype=ctypes.c_int)
    C = arr.reshape((n,n))
    return C

def extant_file(x):
    """
    'Type' for argparse - checks that file exists but does not open.
    """
    if not isfile(x):
        raise argparse.ArgumentTypeError("{0} does not exist".format(x))
    return x

In [None]:
# Run the matrix multiplication

import argparse, sys
from os.path import isfile
from argparse import ArgumentParser

A, B = read("data/matrix_mult.in")

n, m, p = len(A), len(list(A[0])), len(list(B[0]))

threadNumber = 32
part = len(A) // threadNumber
if part < 1:
    part = 1

# shared, can be used from multiple processes
mp_arr = multiprocessing.Array(ctypes.c_int, n*p)
C = ikjMatrixProduct(A, B, threadNumber)
printMatrix(C, sys.stdout)

## III. Multiprocessing
One of the reasons this course is written in Python is that Python is an easy-to-use language which is commonly used for computational science and other HPC applications. Another reason is Python's powerful _multiprocessing_ module. _Multiprocessing_ allows users to start subprocesses from within Python in order to run many pieces of Python code at the same time. In the following examples, we will use _multiprocessing_ to speed up our computations. The following image describes how multiprocessing can help speed up Python programs: ![Multiprocessing Diagram](https://sebastianraschka.com/images/blog/2014/multiprocessing_intro/multiprocessing_scheme.png "")
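
As a first taste, here is a minimal sketch of the pattern used in the rest of this notebook: a pool of worker processes applies one function to many inputs. The names `square` and `parallel_squares` and the default of four workers are assumptions chosen for illustration.

```python
from multiprocessing import Pool


def square(x):
    # The work done by each task; it depends only on its own input.
    return x * x


def parallel_squares(values, workers=4):
    # Pool.map splits the inputs among the workers and returns
    # the results in the same order as the inputs.
    with Pool(processes=workers) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    print(parallel_squares(range(10)))
```

`Pool.map` behaves like the built-in `map`, except that the calls may run in different processes. It is a good fit whenever the tasks do not need to talk to each other.
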

### Example 1.3 - Basic Parallelization with Multiprocessing
In this example, we are going to create random strings in parallel and in serial. This is a good task to parallelize because each random string does not depend at all on the random strings that came before it. We will discuss in greater detail what makes tasks better candidates for parallelization in topic 6: algorithm analysis.

In [None]:
# Define rand_string function to generate one random string

import random
import string
import multiprocessing

random.seed(123)

# Define an output queue
output = multiprocessing.Queue()

# Define an example function
def rand_string(length, output):
    """ Generates a random string of numbers, lower- and uppercase chars. """
    rand_str = ''.join(random.choice(
                        string.ascii_lowercase
                        + string.ascii_uppercase
                        + string.digits)
                   for i in range(length))
    output.put(rand_str)


In [None]:
# Call our rand_string in serial

import time
NUM_STRINGS = 100

before = time.time()
# Call the function directly, one string at a time, in this process
for _ in range(NUM_STRINGS):
    rand_string(5, output)
results = [output.get() for _ in range(NUM_STRINGS)]
after = time.time()

print("Generated {} strings in {} seconds".format(NUM_STRINGS, after-before))

In [None]:
# Call our rand_string in parallel

import time

before = time.time()
NUM_STRINGS = 100
# Setup a list of processes that we want to run
processes = []
for _ in range(NUM_STRINGS):
    p = multiprocessing.Process(target=rand_string, args=(5, output))
    processes.append(p)

# Run processes
for p in processes:
    p.start()

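# Get process results from the output queue.
# Drain the queue before joining: a process that has put items on a
# queue may not exit until those items have been consumed.
results = [output.get() for p in processes]

# Wait for the completed processes to exit
for p in processes:
    p.join()
after = time.time()

print('\n')
print("Generated {} strings in {} seconds".format(NUM_STRINGS, after-before))

Compare this timing with the serial one. Each string is so cheap to generate that the cost of starting 100 processes can easily outweigh the work itself, so the parallel version is not guaranteed to be faster here. Parallelism pays off when each task does enough work to justify that startup cost.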

## IV. Parallelism and HPC
Parallelism is at the core of any HPC system. HPC systems can be many hundreds of thousands of times faster than traditional systems because of massive parallelism. Some HPC systems have many millions of cores in total, distributed among many machines (known as 'nodes'), compared to the 4-8 cores of a standard modern desktop. This massive hardware parallelism, combined with clever parallel algorithms, can lead to amazing processing power.

### Example 1.4 - Monte Carlo Pi in Parallel
In this example, we will estimate pi by Monte Carlo approximation, as we did before, except this time in parallel. Each random point is independent of all the others, so the Monte Carlo pi calculation is very easy to parallelize, and for a large enough number of points we should see a large speedup when it runs in parallel.

In [42]:
# Defining parallel monte carlo pi calculation 

import random
import multiprocessing
from multiprocessing import Pool


# Calculate the number of points that fall in the unit circle
# out of n points
def monte_carlo_pi_part(n):
    
    count = 0
    for i in range(n):
        x=random.random()
        y=random.random()
        
        # if it is within the unit circle
        if x*x + y*y <= 1:
            count=count+1
        
    #return
    return count   

In [44]:
# Calling parallel monte carlo pi simulation

np = multiprocessing.cpu_count()
print('You have {0:1d} CPUs'.format(np))

# Number of points to use for the Pi estimation
n = 10000000

# iterable with the number of points to generate in each worker
# each worker process gets about n/np points; the first n % np workers
# take one extra point so that the counts add up to exactly n

part_count=[n//np + (1 if i < n % np else 0) for i in range(np)]

# Create the worker pool; it is shut down when the with-block exits
# http://docs.python.org/library/multiprocessing.html#module-multiprocessing.pool
with Pool(processes=np) as pool:
    # parallel map
    count=pool.map(monte_carlo_pi_part, part_count)

print("Estimated value of Pi: {} ".format(sum(count)/(n*1.0)*4))

You have 32 CPUs
Estimated value of Pi: 3.1416516 


## V. Power and Speed Comparison
Another definition of HPC has to do with scale: some people believe that an HPC system is defined by how powerful it is rather than by how it is designed. Computational power has increased enormously in recent years, and HPC-scale systems are now accessible to many people. The following graphic shows how powerful modern systems can be: ![hpc system power comparison](https://i.imgur.com/frXsxpz.png)

As you might guess, parallel algorithms have the potential to be much faster than serial algorithms. To explore this, we are going to take the two versions of Monte Carlo pi that we have written in this notebook and time them against each other.

### Example 1.5 - Pi in Serial vs Parallel
In this example, we will pit our parallel and serial pi approximations against each other. To do this, we will use the Jupyter cell "magic" `%%time` to time the two algorithms.

In [49]:

%%time

# Calling parallel monte carlo pi simulation

# Number of points to use for the Pi estimation
n = 10000

# iterable with the number of points to generate in each worker
# each worker process gets about n/np points; the first n % np workers
# take one extra point so that the counts add up to exactly n

part_count=[n//np + (1 if i < n % np else 0) for i in range(np)]

# Create the worker pool; it is shut down when the with-block exits
# http://docs.python.org/library/multiprocessing.html#module-multiprocessing.pool
with Pool(processes=np) as pool:
    # parallel map
    count=pool.map(monte_carlo_pi_part, part_count)

print("Estimated value of Pi: {} ".format(sum(count)/(n*1.0)*4))

Estimated value of Pi: 3.158 
CPU times: user 105 ms, sys: 839 ms, total: 943 ms
Wall time: 712 ms


In [48]:
%%time

# Calling serial monte carlo pi simulation
# Monte Carlo Pi Simulation
import random as r
import math as m

# Number of darts that land inside.
inside = 0
# Total number of darts to throw.
total = 10000

# Iterate for the number of darts.
for i in range(0, total):
    # Generate random x, y in [0, 1) and square them.
    x2 = r.random()**2
    y2 = r.random()**2
    # Increment if inside unit circle.
    if m.sqrt(x2 + y2) < 1.0:
        inside += 1

# inside / total = pi / 4
pi = (float(inside) / total) * 4

# It works!
print(pi)

3.1624
CPU times: user 11 ms, sys: 406 µs, total: 11.4 ms
Wall time: 10.8 ms
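
At this problem size, the recorded timings show the parallel run taking longer than the serial one (712 ms versus 10.8 ms of wall time). A likely explanation, offered here as an assumption rather than a measurement, is that creating a pool of worker processes costs far more than throwing 10,000 darts. The sketch below turns the two wall times into the usual speedup and efficiency figures. The helper names `speedup` and `efficiency` are hypothetical, and the worker count of 32 is taken from the `You have 32 CPUs` output earlier, assuming both runs used the same machine.

```python
def speedup(serial_seconds, parallel_seconds):
    # Greater than 1 means the parallel version was faster.
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, workers):
    # Speedup per worker; 1.0 would be perfect scaling.
    return speedup(serial_seconds, parallel_seconds) / workers


# Wall times recorded in the two cells above, in seconds.
serial_wall = 10.8e-3
parallel_wall = 712e-3
workers = 32  # assumed from the earlier "You have 32 CPUs" output

print("speedup:    {:.3f}".format(speedup(serial_wall, parallel_wall)))
print("efficiency: {:.5f}".format(efficiency(serial_wall, parallel_wall, workers)))
```

A speedup well below 1 means the parallel version lost at this size. Rerunning both cells with a much larger number of points, such as the 10,000,000 used in Example 1.4, gives each worker enough work to offset the startup cost.
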


## VI. HPC Architecture

### Example 1.6 - Pinging Other Nodes 
"You're not alone out there"

## Exercise 1. Write a program to compute sums of _n_ integers in parallel 

In [None]:
# Your code goes here