# SCALING YOUR DATA SCIENCE WORKFLOWS

## Coiled Video Tutorial

<p float="center">
  <img src="images/Coiled-Logo_Horizontal-Small_black_RGB.png" alt="Coiled logo" width="350" hspace="10"/>
</p>

<p float="center">
  <img src="images/dask_horizontal.png" alt="Dask logo" width="400" hspace="10" />
</p>

# Outline

In this notebook, we'll:

1. Perform a basic analytics workflow on the NYC taxi dataset using **pandas**;
2. **Scale up** this workflow to a dataset that doesn't fit in RAM using **Dask**;
3. **Scale out** this workflow to leverage a cluster on the Cloud using **Coiled**.

## 1. Pandas: Import Data and Perform a .groupby()

<img src="images/1200px-Pandas_logo.png" alt="pandas logo" style="width: 500px;"/>

### Download the data from Amazon

In [3]:
# download all 12 csv files to the current directory (~10GB)
!wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-{01..12}.csv

--2021-06-28 10:59:57--  https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.36.230
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.36.230|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 687088084 (655M) [text/csv]
Saving to: ‘yellow_tripdata_2019-01.csv’


2021-06-28 11:06:18 (1,73 MB/s) - ‘yellow_tripdata_2019-01.csv’ saved [687088084/687088084]

--2021-06-28 11:06:18--  https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv
Reusing existing connection to s3.amazonaws.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 649882828 (620M) [text/csv]
Saving to: ‘yellow_tripdata_2019-02.csv’


2021-06-28 11:11:14 (2,10 MB/s) - ‘yellow_tripdata_2019-02.csv’ saved [649882828/649882828]

--2021-06-28 11:11:14--  https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.36.230|:443... con

### Investigate part of the data locally with Pandas

In [6]:
%%time
# Import pandas and read in 1st file
import pandas as pd
df = pd.read_csv(
    "yellow_tripdata_2019-01.csv",
)
df

CPU times: user 20.1 s, sys: 8 s, total: 28.1 s
Wall time: 34 s


,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.50,1,N,151,239,1,7.0,0.5,0.5,1.65,0.0,0.3,9.95,
1,1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.60,1,N,239,246,1,14.0,0.5,0.5,1.00,0.0,0.3,16.30,
2,2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,0.00,1,N,236,236,1,4.5,0.5,0.5,0.00,0.0,0.3,5.80,
3,2,2018-11-28 15:52:25,2018-11-28 15:55:45,5,0.00,1,N,193,193,2,3.5,0.5,0.5,0.00,0.0,0.3,7.55,
4,2,2018-11-28 15:56:57,2018-11-28 15:58:33,5,0.00,2,N,193,193,2,52.0,0.0,0.5,0.00,0.0,0.3,55.55,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7667787,2,2019-01-31 23:57:36,2019-02-01 00:18:39,1,4.79,1,N,263,4,1,18.0,0.5,0.5,3.86,0.0,0.3,23.16,0.0
7667788,2,2019-01-31 23:32:03,2019-01-31 23:33:11,1,0.00,1,N,193,193,1,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0
7667789,2,2019-01-31 23:36:36,2019-01-31 23:36:40,1,0.00,1,N,264,264,1,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0
7667790,2,2019-01-31 23:14:53,2019-01-31 23:15:20,1,0.00,1,N,264,7,1,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0


### Performing a Basic Analytics Operation

In [6]:
%%time
# Compute average tip as a function of the number of passengers
df.groupby("passenger_count").tip_amount.mean()

CPU times: user 263 ms, sys: 449 ms, total: 712 ms
Wall time: 1.47 s


passenger_count
0    1.786901
1    1.828308
2    1.833877
3    1.795579
4    1.702710
5    1.869868
6    1.856830
7    6.542632
8    6.480690
9    3.116667
Name: tip_amount, dtype: float64

### How to Operate on the Entire Dataset?

In [8]:
# get size of csv in memory
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7667792 entries, 0 to 7667791
Data columns (total 18 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   VendorID               int64  
 1   tpep_pickup_datetime   object 
 2   tpep_dropoff_datetime  object 
 3   passenger_count        int64  
 4   trip_distance          float64
 5   RatecodeID             int64  
 6   store_and_fwd_flag     object 
 7   PULocationID           int64  
 8   DOLocationID           int64  
 9   payment_type           int64  
 10  fare_amount            float64
 11  extra                  float64
 12  mta_tax                float64
 13  tip_amount             float64
 14  tolls_amount           float64
 15  improvement_surcharge  float64
 16  total_amount           float64
 17  congestion_surcharge   float64
dtypes: float64(9), int64(6), object(3)
memory usage: 1.0+ GB
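
The `+` in `memory usage: 1.0+ GB` marks a lower bound: by default pandas counts only the pointers held in `object` columns (here the two timestamp columns and `store_and_fwd_flag`), not the Python strings they point to. A minimal sketch of the difference, using a made-up three-row frame (`small`, `shallow` and `deep` are illustrative names):

```python
import pandas as pd

# Made-up miniature frame: one numeric column and one object (string) column.
small = pd.DataFrame(
    {
        "tip_amount": [1.65, 1.00, 0.00],
        "tpep_pickup_datetime": pd.Series(
            ["2019-01-01 00:46:40", "2019-01-01 00:59:47", "2018-12-21 13:48:30"],
            dtype="object",
        ),
    }
)

# deep=False counts only the 8-byte references in object columns;
# deep=True also measures the Python string objects themselves.
shallow = small.memory_usage(deep=False).sum()
deep = small.memory_usage(deep=True).sum()
print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

On the real frame, `df.info(memory_usage="deep")` reports the deep figure, at the cost of a slower call.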


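We could write a **for** loop like this one, but it's not ideal: it processes one file at a time and leaves combining the per-file results to us.

```python
import os
from glob import glob

# glob does not expand "~", so expand it first
for filename in glob(os.path.expanduser("~/data/nyctaxi/yellow_tripdata_2019-*.csv")):
    df = pd.read_csv(filename)
    ...
    df.to_parquet(...)
```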
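
One way to see why the loop is awkward: a per-passenger mean cannot be obtained by averaging per-file means, so the loop has to carry sums and counts and combine them at the end. A minimal sketch, using two tiny in-memory CSVs as stand-ins for the monthly files (the data and the names `fake_files`, `partial_sums` and `partial_counts` are made up for illustration):

```python
import io

import pandas as pd

# Two tiny in-memory "files" standing in for the monthly CSVs (made-up data).
fake_files = [
    io.StringIO("passenger_count,tip_amount\n1,1.0\n1,3.0\n2,2.0\n"),
    io.StringIO("passenger_count,tip_amount\n1,5.0\n2,4.0\n2,6.0\n"),
]

partial_sums, partial_counts = [], []
for f in fake_files:
    part = pd.read_csv(f)
    grouped = part.groupby("passenger_count")["tip_amount"]
    # Keep sums and counts separately; per-file means cannot be combined directly.
    partial_sums.append(grouped.sum())
    partial_counts.append(grouped.count())

# Combine the partial results by passenger_count, then divide once.
total_sum = pd.concat(partial_sums).groupby(level=0).sum()
total_count = pd.concat(partial_counts).groupby(level=0).sum()
mean_tip = total_sum / total_count
print(mean_tip)
```

This is the bookkeeping that Dask takes over in the next section.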

## 2. SCALE UP: Use Dask locally to process the full dataset

<img src="images/dask_horizontal.png" alt="Dask logo" style="width: 400px;"/>

In [9]:
# Import Dask parts, spin up a local cluster, and instantiate a Client
from dask.distributed import LocalCluster, Client
cluster = LocalCluster(n_workers=4)
client = Client(cluster)
client

Client: Scheduler: tcp://127.0.0.1:60698, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 4.00 GiB


In [10]:
%%time

import dask.dataframe as dd

# Import the full dataset (note the very familiar Dask API!)
df = dd.read_csv(
    "yellow_tripdata_2019-*.csv", 
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        'RatecodeID': 'float64',
        'VendorID': 'float64',
        'passenger_count': 'float64',
        'payment_type': 'float64',
    },
)
df

CPU times: user 564 ms, sys: 269 ms, total: 834 ms
Wall time: 2.18 s


npartitions=127,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
,float64,datetime64[ns],datetime64[ns],float64,float64,float64,object,int64,int64,float64,float64,float64,float64,float64,float64,float64,float64,float64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
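
The explicit `dtype` mapping in the `dd.read_csv` call is a workaround rather than a Dask requirement. A likely reason, stated here as an assumption since the notebook does not spell it out: `dd.read_csv` infers dtypes from a sample at the start of the data, and an integer column that contains missing values further on cannot stay `int64`. The pandas-only sketch below, with made-up data, shows the underlying behaviour:

```python
import io

import pandas as pd

# Made-up data: the same column with and without a missing value.
complete = pd.read_csv(io.StringIO("passenger_count,tip_amount\n1,1.65\n2,0.00\n"))
with_gap = pd.read_csv(io.StringIO("passenger_count,tip_amount\n1,1.65\n,0.00\n"))

# Without gaps the column is read as an integer column;
# a single missing value forces it to float64 so NaN can be stored.
print(complete["passenger_count"].dtype)
print(with_gap["passenger_count"].dtype)
```

Declaring the affected columns as `float64` up front keeps every partition on the same dtype.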


In [11]:
%%time

# Prepare to compute the average tip 
# as a function of the number of passengers
mean_amount = df.groupby("passenger_count").tip_amount.mean()

CPU times: user 21.9 ms, sys: 6.09 ms, total: 28 ms
Wall time: 48.9 ms


In [None]:
%%time

# compute the average tip as a function of the number of passengers
mean_amount.compute()

In [None]:
# shut down our cluster
client.shutdown()
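
An alternative to calling `client.shutdown()` by hand is to use both objects as context managers, so cleanup happens even if a cell raises. A minimal sketch, assuming a recent `dask.distributed`; `processes=False` and `dashboard_address=None` are choices made here to keep the example lightweight, not settings from this tutorial:

```python
from dask.distributed import Client, LocalCluster

# Both LocalCluster and Client close themselves when the with-block exits.
with LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,         # run workers as threads in this process
    dashboard_address=None,  # skip starting the dashboard
) as cluster, Client(cluster) as client:
    # Submit a trivial task to confirm the cluster is usable.
    total = client.submit(sum, [1, 2, 3]).result()

print(total)
```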

## 3. SCALE OUT: Use Coiled to work directly in the cloud

<br>
<img src="images/Coiled-Logo_Horizontal-Small_black_RGB.png" alt="Coiled logo" style="width: 500px;"/>
<br>


### Using Coiled

In [10]:
!coiled login --token <YOUR-API-TOKEN>

Authentication successful
Credentials have been saved at /Users/richard/.config/dask/coiled.yaml


In [8]:
import coiled
from dask.distributed import LocalCluster, Client

Setting up a Coiled cluster takes just four steps:
1. Create a software environment
2. Create a cluster configuration
3. Spin up the cluster
4. Connect the cluster to a Dask client

In [9]:
%%time

# Create a Software Environment
coiled.create_software_environment(
    name="my-software-env-test",
    conda="environment-test.yml",
)

Found existing software environment build, returning
CPU times: user 56.6 ms, sys: 148 ms, total: 205 ms
Wall time: 4.99 s


In [10]:
%%time 

# Control the resources of your cluster by creating a cluster configuration
coiled.create_cluster_configuration(
    name="my-cluster-config-test",
    worker_memory="16 GiB",
    worker_cpu=4,
    scheduler_memory="8 GiB",
    scheduler_cpu=2,
    software="my-software-env-test",
)


CPU times: user 32 ms, sys: 45 ms, total: 77 ms
Wall time: 1.47 s


In [11]:
# Spin up cluster, instantiate a Client
cluster = coiled.Cluster(n_workers=20, configuration="my-cluster-config-test")
client = Client(cluster)
client

Found software environment build


Client: Scheduler: tls://ec2-54-164-69-190.compute-1.amazonaws.com:8786, Dashboard: http://ec2-54-164-69-190.compute-1.amazonaws.com:8787
Cluster: Workers: 18, Cores: 72, Memory: 288.00 GiB


In [12]:
import dask.dataframe as dd

# Read data into a Dask DataFrame (not local data!)
df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv", 
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        'RatecodeID': 'float64',
        'VendorID': 'float64',
        'passenger_count': 'float64',
        'payment_type': 'float64',
    },
    storage_options={"anon": True},
)
df

npartitions=127,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
,float64,datetime64[ns],datetime64[ns],float64,float64,float64,object,int64,int64,float64,float64,float64,float64,float64,float64,float64,float64,float64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


In [13]:
%%time

# Prepare to compute the average tip 
# as a function of the number of passengers
mean_amount = df.groupby("passenger_count").tip_amount.mean()

CPU times: user 40.2 ms, sys: 20.4 ms, total: 60.5 ms
Wall time: 281 ms


In [14]:
%%time

# Compute the average tip 
# as a function of the number of passengers
mean_amount.compute()

CPU times: user 106 ms, sys: 31.2 ms, total: 138 ms
Wall time: 14.2 s


passenger_count
0.0    2.122789
1.0    2.206790
2.0    2.214306
3.0    2.137775
4.0    2.023804
5.0    2.235441
6.0    2.221105
7.0    6.675962
8.0    7.111625
9.0    7.377822
Name: tip_amount, dtype: float64

And let's not forget our basic Dask hygiene practices:

In [None]:
# shut down the cluster
client.shutdown()

That's it, folks!

For questions and feedback, please do reach out via our Coiled Community Slack channel or support@coiled.io.