<img src="images/dask_horizontal.svg"
     width="45%"
     alt="Dask logo">
     
# Scaling your data work with Dask

### Materials & setup

- Tutorial materials available at https://github.com/coiled/data-science-at-scale
- Two ways to go through the tutorial:
    1. Run locally on your laptop
    2. Run using Binder (no setup required)

### About the speakers

- **[James Bourbeau](https://www.jamesbourbeau.com/)**: Dask maintainer and Software Engineer at [Coiled](https://coiled.io/).
- **[Hugo Bowne-Anderson](http://hugobowne.github.io/)**: Head of Data Science Evangelism and Marketing at [Coiled](https://coiled.io/).

# Overview

Dask is a flexible, open source library for parallel computing in Python.

- Documentation: https://docs.dask.org
- GitHub: https://github.com/dask/dask

At a high level, Dask:

- Enables parallel and larger-than-memory computations
- Scales the existing Python ecosystem
    - Uses familiar APIs you're used to from projects like NumPy, Pandas, and scikit-learn
    - Allows you to scale existing workflows with minimal code changes
- Works on your laptop, but also scales out to large clusters
- Offers great built-in diagnostic tools
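
To make the "familiar APIs" point concrete, here is a minimal sketch comparing NumPy with `dask.array`. The array size and chunk size are arbitrary choices for illustration, not values from this tutorial.

```python
import numpy as np
import dask.array as da

# A NumPy array and a chunked Dask array holding the same values.
x_np = np.arange(1_000_000, dtype="float64")
x_da = da.from_array(x_np, chunks=100_000)  # split into 10 chunks

# The same expression works on both objects.
np_result = (x_np ** 2).mean()
lazy_result = (x_da ** 2).mean()  # only builds a task graph, no work yet

# .compute() executes the graph chunk by chunk and returns a concrete value.
da_result = lazy_result.compute()
print(np_result, da_result)
```

The only changes from the NumPy version are the import, the chunking, and the explicit `.compute()` call.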

<img src="images/dask-components.svg"
     width="85%"
     alt="Dask components">

## Dask Schedulers, Workers, and Beyond

Work (Python code) is performed on a cluster, which consists of:

* a scheduler, which manages the work and sends tasks to the workers, and
* workers, which compute the tasks.

The client is "the user-facing entry point for cluster users." In practice, the client lives wherever you write your Python code and talks to the scheduler, passing it the tasks.
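
As a rough sketch of that flow, the snippet below starts a small in-process cluster and hands it a few tasks through the client. The `square` function is a hypothetical example. `processes=False` and `dashboard_address=None` are choices made only to keep this sketch lightweight, not settings the tutorial relies on.

```python
from dask.distributed import Client


def square(x):
    """Hypothetical task used only for this illustration."""
    return x * x


# Start a scheduler plus two workers inside this process and connect a client.
client = Client(
    n_workers=2,
    threads_per_worker=1,
    processes=False,          # keep everything in one process for the sketch
    dashboard_address=None,   # skip the diagnostics dashboard here
)

# The client passes tasks to the scheduler, which assigns them to workers.
futures = client.map(square, range(5))
results = client.gather(futures)  # pull the finished results back to the client

# Futures can be passed to further tasks without gathering them first.
total_value = client.submit(sum, futures).result()
print(results, total_value)

client.close()
```

`client.map` and `client.submit` return futures immediately; the work happens on the workers, and `gather` or `result` brings the values back to where the client lives.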

<img src="images/dask-cluster.svg"
     width="85%"
     alt="Dask cluster">

# Dask in action!

In [None]:
# Sets up Dask's distributed scheduler
from dask.distributed import Client

client = Client()
client

In [None]:
# Download data
%run prep.py -d flights

In [None]:
# Perform Pandas-like operations
import os
import dask.dataframe as dd

df = dd.read_csv(os.path.join("data", "nycflights", "*.csv"),
                 parse_dates={"Date": [0, 1, 2]},
                 dtype={"TailNum": str,
                        "CRSElapsedTime": float,
                        "Cancelled": bool})

df.groupby("Origin").DepDelay.mean().compute()

## Tutorial goals

The goal of this tutorial is to cover the basics of Dask. Attendees should walk away with an understanding of what Dask offers, how it works, and ideas for how Dask can help them effectively scale their own data-intensive workloads.

The tutorial consists of several Jupyter notebooks which contain explanatory material on how Dask works. Specifically, the notebooks presented cover the following topics:

- [Dask Delayed](1-delayed.ipynb)
- [Dask DataFrame](2-dataframe.ipynb)
- [Machine Learning](3-machine-learning.ipynb)

Each notebook also contains hands-on exercises to illustrate the concepts being presented. Let's look at our first example to get a sense for how they work.

### Exercise: Print `"Hello world!"`

Use Python to print the string "Hello world!" to the screen.

In [None]:
# Your solution here

In [None]:
# Run this cell to see a solution
%load solutions/overview.py

Note that several of the examples here have been adapted from the Dask tutorial at https://tutorial.dask.org.

## Optional: Work directly from the cloud with Coiled 

<br>
<img src="images/horizontal.png" alt="Coiled logo" style="width: 500px;"/>
<br>

Here I'll spin up a Dask cluster using Coiled to show you just how easy it can be. Note that to do so, I've also signed into the [Coiled Beta](https://cloud.coiled.io/), pip installed `coiled`, and authenticated. You can do the same!

You can also spin up [this hosted Coiled notebook](https://cloud.coiled.io/jobs/coiled/quickstart), which means you don't have to do anything locally.

The plan:

* use Coiled to load in **all** of the NYC taxi dataset from 10+ CSVs (8+ GB) on an AWS cluster,
* massage the data, 
* engineer a feature, and
* compute the average tip as a function of the number of passengers.

In [None]:
import coiled
from dask.distributed import Client

In [None]:
# Spin up cluster
cluster = coiled.Cluster(n_workers=10)

In [None]:
# Connect Dask to my cluster
client = Client(cluster)
client

In [None]:
import dask.dataframe as dd

# Read data into a Dask DataFrame
df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        "RatecodeID": "float64",
        "VendorID": "float64",
        "passenger_count": "float64",
        "payment_type": "float64",
    },
    storage_options={"anon": True},
)
df

In [None]:
%%time

# Prepare to compute the average tip 
# as a function of the number of passengers
mean_amount = df.groupby("passenger_count").tip_amount.mean()

In [None]:
%%time

# Compute the average tip 
# as a function of the number of passengers
mean_amount.compute()

In [None]:
client.shutdown()

**Recap:** We have
* used Coiled to load in **all** of the NYC taxi dataset from 10+ CSVs (10 GBs) on an AWS cluster,
* computed the average tip as a function of the number of passengers, and 
* learned a bunch about using Dask on cloud-based clusters!