# Data Science and Machine Learning at Scale

<img src="images/logo-dask-matrix.jpeg"
     width="50%"
     alt="Dask logo">
     
## Materials & setup

- Tutorial materials available at https://github.com/coiled/data-science-at-scale
- Two ways to go through the tutorial:
    1. Run locally on your laptop
    2. Run using [Binder](https://mybinder.org/v2/gh/coiled/data-science-at-scale/master?urlpath=lab) (no setup required)

## Overview

Dask is a flexible, open source library for parallel computing in Python.

- Documentation: https://docs.dask.org
- GitHub: https://github.com/dask/dask

At a high level, Dask:

- Enables parallel and larger-than-memory computations
- Scales the existing Python ecosystem
    - Uses familiar APIs you're used to from projects like NumPy, pandas, and scikit-learn
    - Allows you to scale existing workflows with minimal code changes
- Works on your laptop, but also scales out to large clusters
- Offers great built-in diagnostic tools

<img src="images/dask-overview.png"
     width="75%"
     alt="Dask components">

## Dask Schedulers, Workers, and Beyond

Work (Python code) is performed on a cluster, which consists of:

* a scheduler, which manages the work and sends tasks to the workers, and
* workers, which compute the tasks.

The client is "the user-facing entry point for cluster users." In other words, the client lives wherever you write your Python code, and it talks to the scheduler, passing it the tasks.

<img src="images/dask-cluster.png"
     width="75%"
     alt="Dask components">
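
As a rough sketch of that division of labor, the snippet below hands a few tasks to the scheduler through the client and collects the result. The `square` function is a hypothetical stand-in for real work, and the settings (`processes=False`, two single-threaded workers, no dashboard) are assumptions chosen only to keep the example small and self-contained.

```python
from dask.distributed import Client


def square(x):
    """Hypothetical stand-in for real work; executed by a worker."""
    return x * x


# processes=False runs the scheduler and workers as threads in this process,
# which is enough to illustrate the client -> scheduler -> worker flow.
client = Client(
    processes=False,
    n_workers=2,
    threads_per_worker=1,
    dashboard_address=None,  # skip the dashboard for this small demo
)

# The client sends tasks to the scheduler, which assigns them to workers.
futures = client.map(square, range(5))

# Futures can be passed to further tasks; the data stays on the cluster.
total = client.submit(sum, futures)

# .result() brings the final value back to the client.
result = total.result()
print(result)

client.close()
```

Only the final `result()` call moves data back to where your code is running; everything before it is the client describing work for the scheduler to hand out.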

## Dask in action!

In [1]:
# Sets up Dask's distributed scheduler
from dask.distributed import Client

client = Client()
client

Client

- Connection method: Cluster object
- Cluster type: distributed.LocalCluster
- Dashboard: http://127.0.0.1:8787/status

LocalCluster

- Dashboard: http://127.0.0.1:8787/status
- Workers: 4
- Total threads: 8
- Total memory: 16.00 GiB
- Status: running
- Using processes: True

Scheduler

- Comm: tcp://127.0.0.1:57979
- Workers: 4
- Dashboard: http://127.0.0.1:8787/status
- Total threads: 8
- Started: Just now
- Total memory: 16.00 GiB

Workers

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tcp://192.168.1.6:57992 | http://192.168.1.6:57994/status | tcp://127.0.0.1:57983 | 2 | 4.00 GiB | /Users/rpelgrim/Documents/git/data-science-at-scale/dask-worker-space/worker-qnmfqxtf |
| tcp://192.168.1.6:57993 | http://192.168.1.6:57997/status | tcp://127.0.0.1:57985 | 2 | 4.00 GiB | /Users/rpelgrim/Documents/git/data-science-at-scale/dask-worker-space/worker-edzcaz9r |
| tcp://192.168.1.6:57990 | http://192.168.1.6:57995/status | tcp://127.0.0.1:57982 | 2 | 4.00 GiB | /Users/rpelgrim/Documents/git/data-science-at-scale/dask-worker-space/worker-4xkqli1u |
| tcp://192.168.1.6:57991 | http://192.168.1.6:57996/status | tcp://127.0.0.1:57984 | 2 | 4.00 GiB | /Users/rpelgrim/Documents/git/data-science-at-scale/dask-worker-space/worker-bu5xcnnm |


In [2]:
# Download data
%run prep.py -d flights

- Downloading NYC Flights dataset... done
- Extracting flight data... done
- Creating json data... done
** Created flights dataset! in 8.79s**


In [3]:
# Perform pandas-like operations
import os
import dask.dataframe as dd

df = dd.read_csv(os.path.join("data", "nycflights", "*.csv"),
                 parse_dates={"Date": [0, 1, 2]},
                 dtype={"TailNum": str,
                        "CRSElapsedTime": float,
                        "Cancelled": bool})

df.groupby("Origin").DepDelay.mean().compute()

Origin
EWR    10.295469
JFK    10.351299
LGA     7.431142
Name: DepDelay, dtype: float64

## Tutorial goals

The goal of this tutorial is to cover the basics of Dask. Attendees should walk away with an understanding of what Dask offers, how it works, and ideas for how Dask can help them scale their own data-intensive workloads.

The tutorial consists of several Jupyter notebooks which contain explanatory material on how Dask works. Specifically, the notebooks presented cover the following topics:

- [Dask Delayed](1-delayed.ipynb)
- [Dask DataFrame](2-dataframe.ipynb)
- [Machine Learning](3-machine-learning.ipynb)

Each notebook also contains hands-on exercises to illustrate the concepts being presented. Let's look at our first example to get a sense for how they work.

### Exercise: Print `"Hello world!"`

Use Python to print the string "Hello world!" to the screen.

In [None]:
# Your solution here

In [None]:
# Run this cell to see a solution
%load solutions/overview.py

Note that several of the examples here have been adapted from the Dask tutorial at https://tutorial.dask.org.

## Optional: Work directly from the cloud with Coiled 

<br>
<img src="images/Coiled_Social-Templates_sand.png" alt="Coiled logo" width="50%">
<br>

Here, I'll spin up a cluster on Coiled to show you just how easy it can be. To do so, I've signed in to [Coiled Cloud](https://cloud.coiled.io/), installed `coiled` with pip or conda, and authenticated. You can do the same!

(Note: Coiled will already be installed if you followed the instructions for local setup, but you still need to authenticate.)

You can also spin up [this hosted Coiled notebook](https://cloud.coiled.io/jobs/coiled/quickstart), which means you don't have to do anything locally.

The plan:

* use Coiled to load in **all** of the NYC taxi dataset from 10+ CSVs (8+ GB) on an AWS cluster,
* massage the data, 
* engineer a feature, and
* compute the average tip as a function of the number of passengers.

In [None]:
import coiled
from dask.distributed import Client

In [None]:
cluster = coiled.Cluster(n_workers=10)
client = Client(cluster)
print('Dashboard:', client.dashboard_link)

In [None]:
import dask.dataframe as dd

# Read data into a Dask DataFrame
df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        "RatecodeID": "float64",
        "VendorID": "float64",
        "passenger_count": "float64",
        "payment_type": "float64",
    },
    storage_options={"anon": True},
)
df

In [None]:
%%time

# Define the average tip as a function of the number of passengers.
# This is lazy: nothing is computed until .compute() is called
mean_amount = df.groupby("passenger_count").tip_amount.mean()

In [None]:
%%time

# Compute the average tip 
# as a function of the number of passengers
mean_amount.compute()
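
The two timed cells above differ because Dask is lazy: building the groupby only records what to do, and `.compute()` triggers the actual work. The sketch below shows the same behavior on a small scale with `dask.delayed`. The `double` function and the `calls` list are hypothetical helpers added only to make the laziness visible, and it runs on Dask's default local scheduler rather than on a cluster.

```python
import dask

calls = []  # records which inputs have actually been processed


@dask.delayed
def double(x):
    """Hypothetical stand-in for an expensive step."""
    calls.append(x)
    return 2 * x


# Building the computation is fast: no task has run yet.
lazy_total = dask.delayed(sum)([double(i) for i in range(4)])
ran_before_compute = len(calls)

# .compute() executes the task graph and returns a concrete value.
total = lazy_total.compute()

print(ran_before_compute, total, len(calls))
```

This is why the first `%%time` cell returns almost immediately while the second one takes as long as the computation itself.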

In [None]:
client.shutdown()

## Recap

We have:

* used Coiled to load in **all** of the NYC taxi dataset from 10+ CSVs (8+ GB) on an AWS cluster,
* computed the average tip as a function of the number of passengers, and 
* learned a bunch about using Dask on cloud-based clusters!