## <center> Introduction to Parallel and Distributed Computing </center>
#### <center> Linh B. Ngo </center>
#### <center> CPSC 3620 </center>

### <center> What is Parallel Computing? </center>

<center> <img src="pictures/payroll-serial.png" width="700"/> 
<sub> *https://computing.llnl.gov/tutorials/parallel_comp/* </sub>
</center>

<center> <img src="pictures/payroll-parallel.png" width="700"/> 
<sub> *https://computing.llnl.gov/tutorials/parallel_comp/* </sub>
</center>

- Problem
- Execution framework
- Compute resource

The **problem** should be able to

- Be broken apart into discrete pieces of work that can be solved simultaneously
- Be solved in less time with multiple compute resources than with a single compute resource

The **execution framework** should be able to

- Execute multiple program instructions concurrently at any moment in time

The **compute resource** might be

- A single computer with multiple processors
- An arbitrary number of computers connected by a network
- A combination of both
- A special computational component inside a single computer, separate from the main processors (GPU)

### <center> Progress of Parallel and Distributed Computing </center>

<center> Single computer, single core </center>
<br>
<center> <img src="pictures/single-core.png" width="250"/> 
<sub> *http://www.intel.com/pressroom/kits/pentiumee/* </sub>
</center>

<center> Single site, single computer, multiple cores </center>
<br>
<center> <img src="pictures/multiple-core.png" width="250"/> 
<sub> *http://www.intel.com/pressroom/kits/pentiumee/* </sub>
</center>

<center> Single site, multiple computers, multiple cores </center>
<br>
<center> <img src="pictures/cluster-computers.png" width="500"/> </center>

<center> Multiple sites, multiple computers, multiple cores </center>
<br>
<center> <img src="pictures/grid-computing.png" width="500"/> </center>

<center> Multiple sites, multiple computers, multiple cores, virtual unified domain </center>
<br>
<center> <img src="pictures/cloud.png" width="400"/> </center>

### <center> Distributed Computing Systems </center>

"A collection of individual computing devices that **can communicate with each other**." (Attiya and Welch, 2004)

### <center> Quantification of Performance Improvement </center>

Can we just throw more compute resources at the problem?

**Parallel Speedup**: How much faster the program becomes once *some* computing resources are added


**Parallel Efficiency**: Ratio of performance improvement per individual unit of computing resource

Given $p$ processors, the speedup $S(p)$ is the ratio of the time it takes to run the program on a single processor to the time it takes to run it on $p$ processors.

<br>

<center> $S(p) = \frac{\textrm{Sequential runtime}}{\textrm{Parallel runtime}} = \frac{t_{s}}{t_{p}}$ </center>


In [2]:
# A program takes 30 seconds to run on a single-core machine, and 15 seconds to run on a dual-core machine
ts = 30 
tp = 15 
S = ts / tp
print(S)

2.0


**Theoretical Max**: Let $f$ be the fraction of the program that is not parallelizable. 
Assume no communication overhead. 

<center> <img src="pictures/amdahl-law.png" width="600"/> </center>

<center> ${t_{p}} = f{t_{s}} + \frac{(1-f)t_{s}}{p}$ </center>

<br>

<center> $S(p) = \frac{t_{s}}{f{t_{s}} + \frac{(1-f)t_{s}}{p}} = \frac{p}{pf+1-f} = \frac{p}{(p-1)f + 1}$ </center>

This is known as Amdahl's Law.
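The formula above can be sketched as a small Python function. The function name `amdahl_speedup` and the 10% serial fraction are illustrative assumptions, not part of the original notes.

```python
def amdahl_speedup(f, p):
    """Predicted speedup S(p) = p / ((p - 1) * f + 1) for serial fraction f."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return p / ((p - 1) * f + 1)

# Hypothetical program in which 10% of the runtime cannot be parallelized.
f = 0.10
for p in (1, 2, 4, 8, 16, 1024):
    print(f"p = {p:4d}  S(p) = {amdahl_speedup(f, p):6.3f}")
```

As $p$ grows, the predicted speedup approaches $\frac{1}{f}$ (here 10), no matter how many processors are added.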


The efficiency $E$ is then defined as the ratio of the speedup to the number of processors, $p$.

<br>
  
<center> $E = \frac{t_{s}}{t_{p} \times p} = \frac{S(p)}{p} 100\% = \frac{1}{(p-1)f + 1} 100\%$ </center>

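As a minimal sketch, efficiency can be computed either from measured runtimes or from Amdahl's Law. The function names and the sample values (a 30-second serial run, a 15-second dual-core run, and a 10% serial fraction on 8 processors) are assumptions chosen for illustration.

```python
def efficiency_from_runtimes(ts, tp, p):
    """E = ts / (tp * p), using measured sequential and parallel runtimes."""
    return ts / (tp * p)

def amdahl_efficiency(f, p):
    """E = 1 / ((p - 1) * f + 1), the efficiency predicted by Amdahl's Law."""
    return 1.0 / ((p - 1) * f + 1)

# Measured: 30 s on one core, 15 s on two cores.
print(f"{efficiency_from_runtimes(30, 15, 2):.0%}")  # 100%

# Predicted: 10% serial code on 8 processors.
print(f"{amdahl_efficiency(0.10, 8):.1%}")  # 58.8%
```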

In [3]:
# Suppose that 4% of my application is serial.  
# What is my predicted speedup according to Amdahl’s Law on 5 processors?
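
One possible way to work this exercise is to substitute $f = 0.04$ and $p = 5$ into Amdahl's Law. The variable names below are illustrative.

```python
f = 0.04  # serial fraction (4%)
p = 5     # number of processors

# Amdahl's Law: S(p) = p / ((p - 1) * f + 1)
speedup = p / ((p - 1) * f + 1)
print(round(speedup, 2))  # 4.31
```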

In [4]:
# Suppose that I get a speedup of 8 when I run my application on 10 processors.  
# According to Amdahl's Law, what portion is serial?  
# What is the speedup on 20 processors?  What is the efficiency?  
# What is the best speedup that I could hope for?
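
A sketch of one way to approach this exercise: solve Amdahl's Law for $f$, which gives $f = \frac{p/S - 1}{p - 1}$, then reuse $f$ for the remaining questions. The variable names are illustrative, and the best possible speedup is taken as the limit $\frac{1}{f}$ as $p \to \infty$.

```python
S_observed = 8.0  # measured speedup
p = 10            # processors used for that measurement

# Rearranging S = p / ((p - 1) * f + 1) gives f = (p / S - 1) / (p - 1).
f = (p / S_observed - 1) / (p - 1)

# Predicted speedup and efficiency on 20 processors.
p_new = 20
S_new = p_new / ((p_new - 1) * f + 1)
E_new = S_new / p_new

# Upper bound on speedup as p grows without limit.
S_max = 1 / f

print(f"serial fraction f = {f:.4f}")        # 0.0278
print(f"S(20)             = {S_new:.2f}")    # 13.09
print(f"E(20)             = {E_new:.1%}")    # 65.5%
print(f"best speedup      = {S_max:.0f}")    # 36
```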

Since $S(p)=\frac{p}{(p-1)f + 1}$, we have $S(p) \leq p $

Limiting factors:
    
- Non-parallelizable code
- Communication overhead

Example Scaling:
https://github.com/linhbngo/parallel-r/JAGS.ipynb

**Superlinear speedup:** $S(p)>p$

- Poor sequential reference implementation
- Memory caching
- I/O Blocking

### <center> Types of Distributed Computing Systems </center>

<center> **Flynn's Taxonomy**
<img src="pictures/flynn.png" width="500"/> 
<sub>*http://www.slideshare.net/hlshih/high-performance-computing-building-blocks-production-perspective*</sub>
</center>

- Streaming SIMD extensions for x86 architectures
- Shared Memory
- Distributed Shared Memory
- Heterogeneous Computing (Accelerators)
- Message Passing

<center> **Streaming SIMD** </center>

<center>
<img src="pictures/intel-simd.png" width="600"/>

<sub> https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions </sub>
</center>

<center> **Shared Memory (Distributed Shared Memory)** </center>

<center> <img src="pictures/shared-mem.png" width="500"/>
<br>
<sub>*https://computing.llnl.gov/tutorials/parallel_comp/*</sub></center>

-   One processor, multiple threads
-   All threads have read/write access to the same memory
-   Programming models:
        - Threads (pthread) – programmer manages all parallelism
        - OpenMP: Compiler extensions handle parallelization through in-code markers
        - Vendor libraries (e.g. Intel Math Kernel Library)
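
The shared-memory model above can be sketched with Python's standard `threading` module, used here as a stand-in for pthread or OpenMP. The data, chunking scheme, and names are illustrative assumptions. Note that in standard CPython, threads do not speed up CPU-bound code, so this shows the programming model rather than a performance gain.

```python
import threading

data = list(range(1, 1001))   # shared memory: every thread reads this list
shared = {"total": 0}         # shared memory: every thread updates this value
lock = threading.Lock()       # protects the shared total from concurrent writes

def worker(start, stop):
    """Sum one slice of the shared data and add it to the shared total."""
    partial = sum(data[start:stop])
    with lock:
        shared["total"] += partial

num_threads = 4
chunk = len(data) // num_threads
threads = [
    threading.Thread(target=worker, args=(i * chunk, (i + 1) * chunk))
    for i in range(num_threads)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["total"])  # 500500
```

The lock is the programmer's responsibility: without it, two threads could read and write the shared total at the same time and lose an update.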


<center> **Heterogeneous Computing (Accelerators)** </center>

<center> <img src="pictures/gpu-computing.png" width="500"/> 
<br>
<sub> *http://www.nvidia.com/docs/IO/143716/how-gpu-acceleration-works.png* </sub>
</center>

- GPU (Graphics Processing Unit)
        - Processor unit on graphics cards designed to support graphics rendering (numerical manipulation)
        - Significant advantage for certain classes of scientific problem
        - CUDA – Library developed by NVIDIA for their GPUs
        - OpenACC – Standard devised by NVIDIA, Cray, and the Portland Group (PGI)
        - C++ AMP – Extensions to Visual C++ (Microsoft) to direct computation to GPU
        - OpenCL – Set of standards by the group behind OpenGL
- FPGA (field programmable gate array)
        - Dynamically reconfigurable circuit board
        - Expensive, difficult to program
        - Power efficient, low heat

<center> **Message Passing** </center>

<center> <img src="pictures/message-passing.png" width="500"/> 
<sub> *https://computing.llnl.gov/tutorials/parallel_comp/*</sub>
</center>

- Processes handle their own memory, data is passed between processes via messages.
        - Scales well
        - Commodity parts
        - Expandable
        - Heterogeneous
- Programming Models:
        - MPI: standardized message passing library
        - MPI + OpenMP (hybrid model)
        - MapReduce programming model
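
As a rough illustration of the MapReduce programming model, the sketch below counts words with explicit map, shuffle, and reduce phases. It runs sequentially in a single process. In a real framework, each map and reduce call would run on a different node and the intermediate pairs would travel as messages. The function names and sample documents are assumptions made for this example.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(counts)
```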



### <center> Benchmarking </center>

- LINPACK (Linear Algebra Package): Dense Matrix Solver
- HPCC: High-Performance Computing Challenge
        - HPL (LINPACK to solve a linear system of equations)
        - DGEMM (Double Precision General Matrix Multiply)
        - STREAM (Memory bandwidth)
        - PTRANS (Parallel Matrix Transpose to measure inter-processor communication)
        - RandomAccess (Random memory updates)
        - FFT (double precision complex discrete Fourier transform)
        - Communication bandwidth and latency


- SHOC: Scalable Heterogeneous Computing
        - Non-traditional systems (GPU)
- TestDFSIO
        - I/O Performance of MapReduce/Hadoop Distributed File System

### <center> Ranking </center>

- TOP500: Rank the supercomputers based on their LINPACK score
- GREEN500: Rank the supercomputers with emphasis on energy usage (LINPACK / power consumption)
- GRAPH500: Rank systems based on benchmarks designed for data-intensive computing