<img src="https://snipboard.io/Kx6OAi.jpg">

# Session 1. Intro to Parallel Computing
<div style="margin-top: -10px; padding: 5px; line-height: 20px;"><img src="https://snipboard.io/v5q47G.jpg" style="width: 35px; float: left; margin-right: 10px;"> Author:  <a href="http://www.linkedin.com/in/davidyerrington">David Yerrington</a>, Data Scientist<br>San Francisco, CA</div>

## Learning Objectives

- Explain the core concepts of parallel computing
- Be familiar with types of parallel processing provided by Dask
- Identify the major components of Dask: Collection Types and its Scheduler

### Prerequisite Knowledge
- Basic Pandas 
  - Difference between Series vs Dataframe
  - Bitmasks, query function, selecting data
  - Aggregations

## Environment Setup

We will first review a few basic points for setting up Python and the environment in [the setup guide](../environment.md).


In [1]:
from IPython.display import Video

# 1. Intro to Parallel Computing with Dask

<a name="big-data"></a>
## What is "big data"?
---

Big data is a term used for problems that exceed the processing capacity of typical databases.  We still need to understand the general characteristics of such data and to model it both predictively and structurally.  When data grows beyond the capacity of a single machine, whether in the speed needed to process it, the size to manage, or the variety of formats, managing it requires a different solution.

**The 3 V's of Big Data**
- **Volume**: The amount of data
- **Variety**: The different formats of data
- **Velocity**: The speed at which data can be analyzed

> **Dave Yerrington's 4th V (unofficial big data tenet):**
> - **Value**: It's important to assess the value of any solution in terms of the business.  Understanding the underpinnings of cost vs benefit is even more essential in the context of big data.  It's easy to misunderstand the 3 V's without looking at the bigger picture and connecting them to the value of the business cases involved.

![3v](https://snipboard.io/ewbKGk.jpg)

<a id='parallelism'></a>
## 1.1 Parallelism
---

The conceptual basis of Big Data processing is that a data transformation can be broken down and solved through a process of computing many smaller transformations.  The simplest form of parallel processing involves independent tasks with no need to communicate with each other.

- Running multiple instances to process data
- Data can be subset and solved iteratively 
- Sub-solutions can be solved independently

### Example:  Mowing Lawns (1 Mower)

A good analogy to start with involves mowing a lawn.  Think of the mower as a single function that has to complete its task over an entire lawn of grass (a dataset).  Regardless of how big the lawn is, a fixed-rate mower can eventually mow all of it; a bigger lawn simply takes longer.

![](https://snipboard.io/CR89sH.jpg)

### Example:  Mowing Lawns (4 Mowers)

With 4x the mowers, each mower technically only needs to mow 1/4 of the lawn-space to complete its task.  
- Each mower can complete its task independently
- Each mower operates at the same time

![](https://snipboard.io/Bm3ER1.jpg)

### 1.2 Example:  Summation

It is possible to sum an entire list with a single function or basic iteration.  However, to illustrate the concept of parallel processing, breaking a list up into smaller "chunks" enables tasks to be computed independently.  The idea of many tasks solving portions of a larger problem is the core idea that underpins the existence of parallel processing.

In [3]:
Video("../media/parallel_processing_demo_1.m4v")
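
The chunked summation described above can be sketched in plain Python.  This is a minimal illustration, not the code from the video: the `chunked` helper, the choice of four chunks, and the list of numbers are assumptions made for the example.  Threads are used only to keep the sketch short; for CPU-bound work in Python, processes would be the more usual choice.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, n_chunks):
    """Split a list into n_chunks contiguous, roughly equal pieces (hypothetical helper)."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


numbers = list(range(1, 101))            # 1..100
chunks = chunked(numbers, n_chunks=4)    # four independent sub-problems

# Each chunk is summed on its own; no task needs to talk to another.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

# Combining the partial results gives the same answer as one big sum.
total = sum(partial_sums)
print(partial_sums, total)
```

The important part is the shape of the work: split, solve each piece independently, then combine.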

### Question:  What are some problems you've run into with Pandas?

# 2. What is Dask?

![](https://snipboard.io/HNWi85.jpg)

Dask is:

- A Python-based parallel computing framework
- Built around an API similar to Pandas

### Compared to Spark / PySpark

Spark is another great framework.  Compared to Dask, Spark is an all-encompassing solution:

- Also a parallel computing framework
- Has a machine learning component like Scikit-Learn
- Interoperable with many big data frameworks like Hadoop/HDFS
- DataFrames / Panel-like data object
- SQL interface
- Graph capabilities (GraphX)

Dask's area of focus is distributed computing in Python, but it now also includes ML through Dask ML.  Spark is fundamentally written in Scala (a really powerful language), so even if you're using Python through PySpark, all of its bindings and functionality sit on top of the underlying Scala code.  Spark is an ecosystem of tools and libraries for building pipelines and solving streaming, graph, and ML problems.

## Why Dask?

As practitioners of Python, it's quite nice to stay in the ecosystem of data science toolsets that are familiar.  Being able to go from exploratory analysis, to ETL, to feature engineering, to data warehouses, modeling, and production systems without having to leave Python is a huge plus.  You can do everything in Spark, but there is quite a lot of prerequisite knowledge involved in working in an entirely different ecosystem of tools.  With Dask, you can work with all your favorite Python tools with the possibility of scaling.

![](https://snipboard.io/rM7sSJ.jpg)


> # Also:  Dask comes with Anaconda by default now!!


# 3.  Core Concepts

We'll get into more specifics in the next notebook / session, but it's helpful to know a few basic concepts before diving into specifics.


## 3.1 "Dask is a graph execution engine"

> "... specifically, a directed acyclic graph of tasks with data dependencies – using ordinary Python data structures, namely dicts, tuples, functions, and arbitrary Python values" -Dask Documentation

![](https://snipboard.io/jNO1WE.jpg)
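
To make the quote concrete, here is a small sketch of a task graph written as an ordinary dict.  The key names (`a`, `b`, `total`, `doubled`) are made up for illustration; `dask.get` is Dask's simple synchronous scheduler, used here only because it is the easiest way to run a hand-written graph.

```python
from operator import add, mul

import dask

# A task graph is a plain dict: keys name results, values are either
# literal data or tuples of (function, *arguments). An argument that
# matches another key refers to that key's result.
graph = {
    "a": 1,
    "b": 2,
    "total": (add, "a", "b"),       # depends on a and b
    "doubled": (mul, "total", 2),   # depends on total
}

# Ask for one key; the scheduler walks its dependencies and runs
# only the tasks needed to produce it.
result = dask.get(graph, "doubled")
print(result)
```

In practice you rarely write graphs by hand; the collections and `dask.delayed` build them for you.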

> **Did You Know: DAGs**
>
> If you've ever used a decision tree classifier, you've used a [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph).   Generally, DAGs flow in one direction only: they represent a unidirectional ordering and model relationships between entities ([topological ordering](https://en.wikipedia.org/wiki/Topological_ordering)).  Depending on the domain of graph theory or computer science that you think in, DAGs can describe a variety of applications.
>
> - Scheduling of process executions
> - Family Trees
> - Decision Trees

When it comes to Dask, a "plan" is devised before any processing occurs.  The "plan" determines which steps will be performed and in which order.

## The "plan" is the **DAG**

![](https://snipboard.io/WAE4Rv.jpg)

Before anything is computed, a DAG is devised that considers the data, the tasks to be run, and the available resources.  Dask processes can be used across several machines with many nodes, or on a single machine.
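
One way to see the "plan first, compute later" behavior is with `dask.delayed`.  This is a minimal sketch: the functions `load`, `total`, and `combine` and the `calls` list are hypothetical and exist only to show when work actually happens.

```python
import dask

calls = []  # records which tasks have actually run


@dask.delayed
def load(n):
    calls.append(f"load({n})")
    return list(range(n))


@dask.delayed
def total(values):
    calls.append("total")
    return sum(values)


@dask.delayed
def combine(x, y):
    calls.append("combine")
    return x + y


# Building the expression only records tasks in a graph; nothing runs yet.
plan = combine(total(load(3)), total(load(5)))
ran_before_compute = list(calls)  # still empty at this point

# Asking for the result hands the graph to a scheduler, which runs it.
answer = plan.compute()
print(ran_before_compute, answer, len(calls))
```

If the optional graphviz dependency is installed, `plan.visualize()` can draw the graph that was built.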

### Question:  Why do you think the **DAG** is generated but not executed by default?

## 3.2 A **Scheduler** "runs" the **DAG**

The DAG defines the scope of work that will take place once the time comes to take action.  The "scheduler" takes the DAG and actually executes it.  The behavior of the scheduler also depends on the data and on which Dask collection types are used (more on that soon).
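
As a rough illustration of the split between the graph and the scheduler, the same graph can be handed to different schedulers at compute time.  The functions `square` and `add_all` are made up for this sketch; the `scheduler=` keyword on `.compute()` selects one of Dask's single-machine schedulers.

```python
import dask


@dask.delayed
def square(x):
    return x * x


@dask.delayed
def add_all(values):
    return sum(values)


# One graph ...
plan = add_all([square(i) for i in range(5)])

# ... run by two different single-machine schedulers.
serial = plan.compute(scheduler="synchronous")  # one task at a time, easy to debug
threaded = plan.compute(scheduler="threads")    # local thread pool
print(serial, threaded)
```

The result is the same either way; only how and where the tasks run changes.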

## 4. Dask Interface Types

We'll learn that you can't just take a Pandas DataFrame, plug it into Dask, and expect it to work.  Generally, the Dask API is very similar to Pandas and quite easy to adjust to.  However, Dask is more than DataFrames.  Dask is fundamentally **Python** in nature and provides many object types for solving distributed problems.  We will be exploring Dask in the context of DataFrames for the majority of our sessions, but it is helpful to understand the Dask ecosystem and, more importantly, the interfaces that exist and how the scheduler behaves in relation to them.

> ![](https://snipboard.io/N75dYH.jpg) <br>
> Source:  [Bicortex](http://bicortex.com/data-analysis-with-dask-a-python-scale-out-parallel-computation-framework-for-big-data/)

### 4.1 High Level Collections

For working with datasets that are "larger than memory", Dask provides these high-level collection types that can be processed in parallel.

 - Array -> `np.array`
 - Bag -> `list`
 - DataFrame -> Pandas `DataFrame`


### 4.2 Low Level Schedulers

![](https://snipboard.io/4Mwi71.jpg)

Collections contain and describe data.  Schedulers execute task graphs in parallel.

#### Single Machine Scheduler  
Using a single machine, tasks are computed on a local process or thread pool.  Single machine scheduling is the default.

> ##### Local Threads
> 
> ```python
> import dask
> dask.config.set(scheduler='threads')  # overwrite default with threaded scheduler
> ```

> ##### Local Multiprocessing
> 
> ```python
> import dask.multiprocessing
> dask.config.set(scheduler='processes')  # overwrite default with multiprocessing scheduler
> ```

#### Distributed Scheduler  
The most flexible and feature-rich scheduler.  Can run locally or on a cluster.  We could have an entire session on this topic alone.  For our sessions, we will be using the **distributed scheduler** in local mode.

Some notable features with using a distributed scheduler that we will explore in the next session:

- A diagnostic dashboard that can be used to measure resource utilization and task progress.
- The docs say it can be more efficient than the local multiprocessing scheduler.  (I found this to be true in most cases.)
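
For reference, a minimal sketch of starting the distributed scheduler in local mode, assuming the `distributed` package is installed.  `processes=False` is a choice made here to keep everything inside one Python process; the next session may configure the client differently.

```python
from dask.distributed import Client

# processes=False keeps the scheduler and workers inside this Python
# process (threads only), which is convenient for a quick local test.
client = Client(processes=False)

# Once a Client exists, it becomes the default scheduler for Dask work.
future = client.submit(sum, [1, 2, 3])  # run sum([1, 2, 3]) on a worker
result = future.result()                # block until the value is ready
print(result)

client.close()
```

While the client is open, `client.dashboard_link` gives the address of the diagnostic dashboard when one is available.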

![](https://snipboard.io/Av73Mn.jpg)

> If you've used a scheduling system like Luigi or Airflow, these systems are designed with similar features.  Dask is an alternative to using the multiprocessing library or these big fancy scheduling systems, and it allows you to programmatically configure and run parallel tasks.


# 5. Summary

## 1.  Parallel Computing 

- Running multiple instances to process data
- Data can be subset and solved iteratively
- Sub-solutions can be solved independently

![](https://snipboard.io/WH6FfT.jpg)

## 2.  Parallel Processing in Dask
- Generally, processing is "lazy" and starts with a "plan"
- The "plan" is the DAG
- Each step in a DAG models tasks that will be performed
- The DAG is not executed until requested

## 3. Dask Collections and Scheduling
- Dask provides collections that are roughly equivalent to familiar Python data types we use when working with data.
  - `Array` -> `np.array`
  - `Bag` -> `list`
  - `DataFrame` -> `Pandas DataFrame`
- The scheduler is responsible for executing the DAG (ie: the "plan")
  - Local scheduling (multiprocess and threaded)
    - Great for debugging and can work great in a pinch for speeding up slow `.apply` functions
  - Distributed
    - Offers the most flexibility and features.
    - Tends to be a bit more efficient by default, but this largely depends on your dataset / transformations / processing
