<img src="./images/galvanize-logo.png" alt="galvanize-logo" align="center" style="width: 200px;"/>

<hr />

### Parallel Programming


## Objectives

* Describe some algorithms that contribute to fast code.

* Describe a typical process for parallel programming.

* Describe how to measure application run time.

## Algorithms for code optimization

   * Adaptive methods (e.g. adaptive quadrature, adaptive Runge-Kutta)

   * Divide and conquer (e.g. Barnes-Hut, Fast Fourier Transform)

   * Tabling and dynamic programming (e.g. Viterbi algorithm for Hidden Markov Models)

   * Graphs and network algorithms (e.g. shortest path, max flow min cut)

   * Hashing (e.g. locality-sensitive hashing, Bloom filters)

   * Probabilistic algorithms (e.g. randomized projections, Monte Carlo integration)
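
As a small illustration of tabling, the sketch below caches the results of a recursive function so that each subproblem is computed only once. The Fibonacci recurrence and the names `fib_naive` and `fib_tabled` are chosen here for illustration and are not part of the original notes.

```python
from functools import lru_cache


def fib_naive(n):
    """Recompute every subproblem: the number of calls grows exponentially."""
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)


@lru_cache(maxsize=None)
def fib_tabled(n):
    """Same recurrence, but each result is stored after its first computation."""
    if n < 2:
        return n
    return fib_tabled(n - 1) + fib_tabled(n - 2)


if __name__ == "__main__":
    # Both versions agree; only the amount of repeated work differs.
    print(fib_naive(20), fib_tabled(20))  # 6765 6765
```

The algorithm is unchanged; the speedup comes entirely from not repeating work, which is often a cheaper first step than parallelizing.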

## A typical process for parallel programming



1. Start by profiling a serial program to identify bottlenecks

2. Identify opportunities for parallelism by asking the following questions.

   * Can tasks be performed in parallel?
       * Function calls
       * Loops
   * Can data be split and operated on in parallel?
       * Decomposition of arrays along rows, columns, blocks
       * Decomposition of trees into sub-trees
   * Is there a pipeline with a sequence of stages?
       * Data preprocessing and analysis
       * Graphics rendering



3. Identify the nature of the parallelism

   * **Linear** - Embarrassingly parallel programs
   * **Recursive** - Adaptive partitioning methods



4. Determine the granularity

   * Coarse-grained: 10s of jobs
   * Fine-grained: 1000s of jobs



5. Choose an algorithm

   * Organize by tasks
       * Task parallelism
       * Divide and conquer
   * Organize by data
       * Geometric decomposition
       * Recursive decomposition
   * Organize by flow
       * Pipeline
       * Event-based processing



6. Map to program and data structures

   * Program structures
      * Single program multiple data (SPMD)
      * Master/worker
      * Loop parallelism
      * Fork/join
   * Data structures
      * Shared data
      * Shared queue
      * Distributed array



7. Map to parallel environment

    * Multi-core shared memory
       * Cython with OpenMP
       * multiprocessing
       * ipyparallel (formerly IPython.parallel)
    * Multi-computer
       * ipyparallel (formerly IPython.parallel)
       * MPI
       * Hadoop / Spark
    * GPU
       * CUDA
       * OpenCL



8. Execute, debug, tune in parallel environment
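
One way the steps above might fit together for an embarrassingly parallel problem is sketched below: the data is decomposed into chunks, a master hands each chunk to a worker process through `multiprocessing.Pool`, and the partial results are combined. The function names, the sum-of-squares task, and the default of four workers are assumptions made for illustration.

```python
import multiprocessing as mp

import numpy as np


def sum_of_squares(chunk):
    """Worker task: reduce one chunk independently of all the others."""
    return int(np.sum(chunk.astype(np.int64) ** 2))


def split_into_chunks(data, n_chunks):
    """Data decomposition: split an array into roughly equal pieces."""
    return np.array_split(data, n_chunks)


def parallel_sum_of_squares(data, n_workers=4):
    """Master: give one chunk to each worker, then combine the partial sums."""
    chunks = split_into_chunks(data, n_workers)
    with mp.Pool(processes=n_workers) as pool:
        partial_sums = pool.map(sum_of_squares, chunks)
    return sum(partial_sums)
```

When calling `parallel_sum_of_squares` from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on platforms that spawn rather than fork. For a task this cheap, the cost of starting processes and copying chunks may outweigh the gain, so it is worth timing the serial and parallel versions before committing to either.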

## Getting application run time

* The [Python Debugger](https://docs.python.org/3/library/pdb.html) (`pdb`) is one way to step through troublesome sections of code, but it is not always the best way to compare multiple functions meant for the same task.

* The [`timeit`](https://docs.python.org/3/library/timeit.html#basic-examples) module is commonly used to compare the run time of different implementations of the same task. In IPython and Jupyter it is available as the `%timeit` magic.

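```python
import numpy as np

def special_squares(n):
    """Return the squares of the even integers in [0, n)."""
    v = np.arange(n)
    return v[v % 2 == 0] ** 2

n = 1000000
# %timeit is an IPython/Jupyter magic; it will not run in a plain Python script
%timeit special_squares(n)
```

```
16.1 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```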
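
Outside IPython, roughly the same measurement can be made with the standard-library `timeit` module, and `cProfile` can show where the time goes inside a single call. The sketch below compares `special_squares` with a pure-Python alternative; `special_squares_loop`, `best_time`, and `profile_report` are hypothetical helpers introduced here for illustration.

```python
import cProfile
import io
import pstats
import timeit

import numpy as np


def special_squares(n):
    """Vectorized version from the example above."""
    v = np.arange(n)
    return v[v % 2 == 0] ** 2


def special_squares_loop(n):
    """Hypothetical pure-Python alternative, included only for comparison."""
    return [i ** 2 for i in range(n) if i % 2 == 0]


def best_time(func, n, number=10, repeat=5):
    """Return the best average time per call, in seconds, over several repeats."""
    timings = timeit.repeat(lambda: func(n), number=number, repeat=repeat)
    return min(timings) / number


def profile_report(func, n, top=5):
    """Profile one call and return the top entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(n)
    profiler.disable()
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


if __name__ == "__main__":
    n = 100_000
    for func in (special_squares, special_squares_loop):
        print(f"{func.__name__}: {best_time(func, n):.6f} s per call")
    print(profile_report(special_squares_loop, n))
```

Taking the minimum over repeats reduces the influence of other activity on the machine. Absolute timings will differ between systems, so only the relative comparison is meaningful.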
