<br>
<br>
<center><img src="images/horizontal.png" alt="Coiled logo" style="width: 500px;" align="center"/></center>
<br>
<center><img src="images/dask_horizontal_no_pad.svg" alt="Dask logo" style="width: 500px;"/></center>

# Scalable Data Science

In this notebook, we'll 

* Perform a basic analytics and ETL workflow on the NYC taxi dataset using Pandas;
* Scale up this workflow to a dataset that doesn't fit in RAM using Dask;
* (Optional) Scale out this workflow to leverage a cluster on the Cloud using Coiled.

The workflow is intentionally boring so that we can see the power of scalable data science immediately: we'll load some data, convert some data types for more efficient storage, add a new column, and save the new DataFrame to file.

In the notebooks that follow, we'll jump into more interesting examples, including machine learning.

*A bit about me:* I'm Hugo Bowne-Anderson, Head of Data Science Evangelism and Marketing at [Coiled](https://coiled.io/). We build products that bring the power of scalable data science and machine learning to you, such as single-click hosted clusters on the cloud. We want to take the DevOps out of data science so you can get back to your real job. If you're interested in taking Coiled for a test drive, you can sign up for our [free Beta here](https://beta.coiled.io/).

Before scaling up, let's look at a common workflow in Pandas.

## 1. Pandas: Convert CSV to Parquet and Engineer a Feature

<img src="images/pandas-logo.svg" alt="pandas logo" style="width: 500px;"/>

In the following, we'll 

* use Pandas to load in part of the NYC taxi dataset from a CSV, 
* massage the data, 
* engineer a feature, 
* compute the average tip as a function of the number of passengers, and 
* save to [Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) (far more efficient than CSV, but not human-readable).

If you're following along in Binder, you won't be able to execute the code but you can read it.

### Download the data from Amazon

In [None]:
# -P saves the files into data_taxi/, the directory the cells below read from
!wget -P data_taxi https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-{01..12}.csv

Note: downloading these files will take at least several minutes.
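
The `{01..12}` pattern in the command above is shell brace expansion, which bash supports but some other shells do not. As a minimal sketch, assuming the same URL pattern, the twelve monthly URLs can also be built in Python. The names `BASE_URL` and `urls` are illustrative, and nothing is downloaded here:

```python
# Build the twelve monthly URLs that the brace expansion {01..12} stands for.
# BASE_URL mirrors the pattern in the wget command; no download happens here.
BASE_URL = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-{month:02d}.csv"

urls = [BASE_URL.format(month=month) for month in range(1, 13)]

print(len(urls))
print(urls[0])
print(urls[-1])
```

Each entry in `urls` could then be passed to whichever download tool is available.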

### Investigate data locally with Pandas


In [1]:
# Import pandas and read in beginning of 1st file
import pandas as pd
df = pd.read_csv(
    "data_taxi/yellow_tripdata_2019-01.csv", 
    nrows=10000,
)
df

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.50,1,N,151,239,1,7.0,0.5,0.5,1.65,0.0,0.3,9.95,
1,1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.60,1,N,239,246,1,14.0,0.5,0.5,1.00,0.0,0.3,16.30,
2,2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,0.00,1,N,236,236,1,4.5,0.5,0.5,0.00,0.0,0.3,5.80,
3,2,2018-11-28 15:52:25,2018-11-28 15:55:45,5,0.00,1,N,193,193,2,3.5,0.5,0.5,0.00,0.0,0.3,7.55,
4,2,2018-11-28 15:56:57,2018-11-28 15:58:33,5,0.00,2,N,193,193,2,52.0,0.0,0.5,0.00,0.0,0.3,55.55,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,2,2019-01-01 00:20:18,2019-01-01 00:24:40,2,1.03,1,N,263,141,2,5.5,0.5,0.5,0.00,0.0,0.3,6.80,
9996,2,2019-01-01 00:25:29,2019-01-01 00:45:00,2,4.97,1,N,141,231,1,17.0,0.5,0.5,3.66,0.0,0.3,21.96,
9997,2,2019-01-01 00:47:19,2019-01-01 00:55:38,2,2.16,1,N,231,33,1,9.0,0.5,0.5,2.06,0.0,0.3,12.36,
9998,2,2019-01-01 00:31:41,2019-01-01 00:56:26,1,1.08,1,N,249,186,1,15.0,0.5,0.5,1.00,0.0,0.3,17.30,


In [2]:
# Check out head of file
!head data_taxi/yellow_tripdata_2019-01.csv

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
1,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.50,1,N,151,239,1,7,0.5,0.5,1.65,0,0.3,9.95,
1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.60,1,N,239,246,1,14,0.5,0.5,1,0,0.3,16.3,
2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,.00,1,N,236,236,1,4.5,0.5,0.5,0,0,0.3,5.8,
2,2018-11-28 15:52:25,2018-11-28 15:55:45,5,.00,1,N,193,193,2,3.5,0.5,0.5,0,0,0.3,7.55,
2,2018-11-28 15:56:57,2018-11-28 15:58:33,5,.00,2,N,193,193,2,52,0,0.5,0,0,0.3,55.55,
2,2018-11-28 16:25:49,2018-11-28 16:28:26,5,.00,1,N,193,193,2,3.5,0.5,0.5,0,5.76,0.3,13.31,
2,2018-11-28 16:29:37,2018-11-28 16:33:43,5,.00,2,N,193,193,2,52,0,0.5,0,0,0.3,55.55,
1,2019-01-01 00:21:28,2019-01-01 00:28:37,1,1.30,1,N,163,229,1,6.5,0.5,0.5,1.25,0,0.3,9.05,
1,2019-01-01 00:32:01,2019-01-01 0

### Massage data types 

Before we convert to Parquet format, let's clean up our types a little.

In [3]:
%%time

# Read in data, parse dates
df = pd.read_csv(
    "data_taxi/yellow_tripdata_2019-01.csv", 
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)
df

CPU times: user 20.1 s, sys: 5.76 s, total: 25.8 s
Wall time: 27.9 s


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.50,1,N,151,239,1,7.0,0.5,0.5,1.65,0.0,0.3,9.95,
1,1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.60,1,N,239,246,1,14.0,0.5,0.5,1.00,0.0,0.3,16.30,
2,2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,0.00,1,N,236,236,1,4.5,0.5,0.5,0.00,0.0,0.3,5.80,
3,2,2018-11-28 15:52:25,2018-11-28 15:55:45,5,0.00,1,N,193,193,2,3.5,0.5,0.5,0.00,0.0,0.3,7.55,
4,2,2018-11-28 15:56:57,2018-11-28 15:58:33,5,0.00,2,N,193,193,2,52.0,0.0,0.5,0.00,0.0,0.3,55.55,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7667787,2,2019-01-31 23:57:36,2019-02-01 00:18:39,1,4.79,1,N,263,4,1,18.0,0.5,0.5,3.86,0.0,0.3,23.16,0.0
7667788,2,2019-01-31 23:32:03,2019-01-31 23:33:11,1,0.00,1,N,193,193,1,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0
7667789,2,2019-01-31 23:36:36,2019-01-31 23:36:40,1,0.00,1,N,264,264,1,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0
7667790,2,2019-01-31 23:14:53,2019-01-31 23:15:20,1,0.00,1,N,264,7,1,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0


In [4]:
%%time

# Alter data types for efficiency
df = df.astype({
    "VendorID": "uint8",
    "passenger_count": "uint8",
    "RatecodeID": "uint8",
    "store_and_fwd_flag": "category",
    "PULocationID": "uint16",
    "DOLocationID": "uint16",    
})

# Create new feature in dataset: tip_ratio
df["tip_ratio"] = df.tip_amount / df.total_amount

CPU times: user 1.52 s, sys: 2.2 s, total: 3.72 s
Wall time: 4.78 s
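
To get a feel for why the smaller dtypes matter, here is a minimal sketch on synthetic data (the values are made up and are not the taxi dataset). It compares memory use before and after the same kind of `astype` call:

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the taxi data (values are made up).
rng = np.random.default_rng(0)
n = 100_000
sample = pd.DataFrame({
    "VendorID": rng.integers(1, 3, size=n),           # 64-bit integers by default
    "passenger_count": rng.integers(0, 10, size=n),
    "PULocationID": rng.integers(1, 266, size=n),
    "store_and_fwd_flag": rng.choice(["N", "Y"], size=n),  # stored as strings
})

before = sample.memory_usage(deep=True).sum()

# Same idea as the cell above: small unsigned ints and a categorical flag.
compact = sample.astype({
    "VendorID": "uint8",
    "passenger_count": "uint8",
    "PULocationID": "uint16",
    "store_and_fwd_flag": "category",
})
after = compact.memory_usage(deep=True).sum()

print(f"before: {before / 1e6:.2f} MB, after: {after / 1e6:.2f} MB")
```

The exact numbers depend on the pandas version and platform, but the compact frame should be several times smaller.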


### Basic analytics

In [5]:
%%time

# Compute average tip as a function of the number of passengers
df.groupby("passenger_count").tip_amount.mean()

CPU times: user 107 ms, sys: 40 ms, total: 147 ms
Wall time: 157 ms


passenger_count
0    1.786901
1    1.828308
2    1.833877
3    1.795579
4    1.702710
5    1.869868
6    1.856830
7    6.542632
8    6.480690
9    3.116667
Name: tip_amount, dtype: float64

### Convert to Parquet

In [6]:
%%time
df.to_parquet("data_taxi/yellow_tripdata_2019-01.parq")

CPU times: user 6.27 s, sys: 2.53 s, total: 8.8 s
Wall time: 6.9 s


In [7]:
%%time

# Read from parquet and time it
df = pd.read_parquet("data_taxi/yellow_tripdata_2019-01.parq", columns=["passenger_count"])
df

CPU times: user 63.4 ms, sys: 99 ms, total: 162 ms
Wall time: 186 ms


Unnamed: 0,passenger_count
0,1
1,1
2,3
3,5
4,5
...,...
7667787,1
7667788,1
7667789,1
7667790,1


**Recap:** We have

* used Pandas to load in part of the NYC taxi dataset from a CSV, 
* massaged the data, 
* engineered a feature, 
* computed the average tip as a function of the number of passengers, and 
* saved to [Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) (far more efficient than CSV, but not human-readable).

### Operate on many files in a for loop?

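We could do this, but it's unpleasant:

```python
import os
from glob import glob

# glob does not expand "~", so expand it first; the inputs are the CSV files
for filename in glob(os.path.expanduser("~/data/nyctaxi/yellow_tripdata_2019-*.csv")):
    df = pd.read_csv(filename)
    ...
    df.to_parquet(...)
```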

## 2. Use Dask locally to process the full dataset

<img src="images/dask_horizontal_no_pad.svg" alt="Dask logo" style="width: 500px;"/>

The full NYC taxi dataset won't even fit in the RAM of my laptop. Do I need a cluster yet? No. First, I can take advantage of all the cores on my laptop in parallel and process the data in pieces, so that it never has to sit in memory all at once (out-of-core computing). This is what we call *scaling up* our computation. Later we'll see how to *scale out* computation across a cluster.

One way of doing this is with [Dask](https://dask.org/). As we're about to see, part of the value of Dask lies in its API being as close as possible to the PyData APIs we know and love, in this case, Pandas.

In [the words of Matthew Rocklin](https://coiled.io/blog/history-dask/), core developer and co-maintainer of Dask and CEO of Coiled, Dask had a social goal:
> Invent nothing. We wanted to be as familiar as possible to what users already knew in the PyData stack

Let's do it!

The plan:

* use Dask to load in **all** of the NYC taxi dataset from 10+ CSVs (8+ GBs), 
* massage the data, 
* engineer a feature,
* compute the average tip as a function of the number of passengers, and
* save to [Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) (far more efficient than CSV, but not human-readable).

We'll also dive into the basics of Dask and distributed compute (but we'll execute some code first and dive into this part while it runs!).

In [2]:
# Import Dask parts, spin up a local cluster, and instantiate a Client
from dask.distributed import LocalCluster, Client
cluster = LocalCluster(n_workers=4)
client = Client(cluster)
client

0,1
Client  Scheduler: tcp://127.0.0.1:54105  Dashboard: http://127.0.0.1:8787/status,Cluster  Workers: 4  Cores: 8  Memory: 8.59 GB


In [3]:
%%time

import dask.dataframe as dd

# Import the full dataset (note the Dask API!)
df = dd.read_csv(
    "~/Downloads/data-science-at-scale-master/data_taxi/yellow_tripdata_2019-*.csv", 
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        "RatecodeID": "float64",
        "VendorID": "float64",
        "passenger_count": "float64",
        "payment_type": "float64",
    },
)
df

CPU times: user 1.93 s, sys: 554 ms, total: 2.49 s
Wall time: 4.01 s


npartitions=117,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
,float64,datetime64[ns],datetime64[ns],float64,float64,float64,object,int64,int64,float64,float64,float64,float64,float64,float64,float64,float64,float64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


In [4]:
# Alter data types for efficiency
df = df.astype({
    "VendorID": "UInt8",
    "passenger_count": "UInt8",
    "RatecodeID": "UInt8",
    "store_and_fwd_flag": "category",
    "PULocationID": "UInt16",
    "DOLocationID": "UInt16",    
})

# Create new feature in dataset: tip_ratio
df["tip_ratio"] = df.tip_amount / df.total_amount
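
Note the capitalized `"UInt8"` and `"UInt16"` here, where the Pandas section used lowercase `"uint8"` and `"uint16"`. The capitalized names are pandas' nullable integer dtypes. A plausible reason for the switch (an assumption, since the notebook doesn't say) is that some of the monthly files contain missing values, which is also consistent with reading these columns as `float64` first. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# A float column with a missing value, like a CSV column with an empty cell.
passenger_count = pd.Series([1.0, 2.0, np.nan, 5.0])

# "UInt8" (capital U) is a nullable integer dtype: the missing value survives as <NA>.
nullable = passenger_count.astype("UInt8")

print(nullable.dtype)
print(nullable.isna().sum())
print(nullable.dropna().tolist())
```

NumPy's plain `uint8` has no way to represent a missing value, which is why the nullable variant is the safer choice when gaps are possible.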

In [5]:
%%time

# Prepare to compute the average tip 
# as a function of the number of passengers
mean_amount = df.groupby("passenger_count").tip_amount.mean()

CPU times: user 68.1 ms, sys: 12.1 ms, total: 80.2 ms
Wall time: 89 ms


In [None]:
%%time

# compute the average tip as a function of the number of passengers
mean_amount.compute()

In [13]:
%%time
df.to_parquet("data_taxi/yellow_tripdata_2019.parq")

CPU times: user 42 s, sys: 6.79 s, total: 48.8 s
Wall time: 3min 50s


### 2a Notes on what is happening in Dask and Python

The above code will take some time to run so let's take this opportunity to see what is going on with Dask, Python, and the distributed computation.

#### Components of Dask

Dask contains 3 main components and we have already seen two of them above:
* High-level collections in the form of Dask DataFrames;
* Schedulers (in this case, on a single machine).

Let's get a sense for what these are.

<img src="images/dask-components.svg" width="400px">

#### Dask DataFrames

What exactly is this Dask DataFrame? A schematic is worth a thousand words:

<img src="images/dask-dataframe.svg" width="400px">

Essentially, the Dask DataFrame is a large, virtual dataframe divided along the index into multiple Pandas dataframes.
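
To make the "many Pandas dataframes" picture concrete, here is a minimal sketch in plain Pandas of how a grouped mean can be computed partition by partition and then combined. It illustrates the idea only and is not Dask's actual implementation. The data and the names `full`, `partitions`, and `partials` are made up:

```python
import pandas as pd

# Made-up data standing in for the full dataset.
full = pd.DataFrame({
    "passenger_count": [1, 2, 1, 2, 1, 3],
    "tip_amount": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# "Partitions": ordinary pandas DataFrames split along the index.
partitions = [full.iloc[0:2], full.iloc[2:4], full.iloc[4:6]]

# Each partition is reduced independently to per-group sums and counts...
partials = [
    part.groupby("passenger_count").tip_amount.agg(["sum", "count"])
    for part in partitions
]

# ...and the small partial results are combined into the final mean.
combined = pd.concat(partials).groupby(level=0).sum()
partitioned_mean = combined["sum"] / combined["count"]

print(partitioned_mean)
```

Because each partition is reduced on its own, only one partition plus the small partial results need to be in memory at a time, and the per-partition steps can run in parallel.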

#### Dask Schedulers, Workers, and beyond

<img src="images/dask-cluster.svg" width="400px">

Work (Python code) is performed on a cluster, which consists of

* a _scheduler_, which manages the work and sends tasks to the workers, and
* _workers_, which compute the tasks.

The _client_ is "the user-facing entry point for cluster users." What this means is that the client lives wherever you are writing your Python code and the client talks to the scheduler, passing it the tasks.
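
The division of labour between client, scheduler, and workers can be sketched with the standard library alone. This is a toy illustration under simplifying assumptions, not how Dask is implemented: the "scheduler" is a small recursive function, the "workers" are a thread pool, and the names `graph` and `run` are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add, mul

# A toy task graph: each key maps to a literal value or to (function, *argument_keys).
graph = {
    "a": 1,
    "b": 2,
    "sum": (add, "a", "b"),        # a + b
    "product": (mul, "sum", "b"),  # (a + b) * b
}

def run(graph, target, pool):
    """Toy 'scheduler': resolve dependencies, then hand the task to a worker."""
    task = graph[target]
    if not isinstance(task, tuple):
        return task  # literal value, nothing to compute
    func, *arg_keys = task
    args = [run(graph, key, pool) for key in arg_keys]
    # A worker in the pool executes the function; the scheduler waits for the result.
    return pool.submit(func, *args).result()

# The "client" side: hand over the graph and ask for one result.
with ThreadPoolExecutor(max_workers=4) as workers:
    result = run(graph, "product", workers)

print(result)
```

A real scheduler also tracks which results are already available, runs independent tasks concurrently, and decides which worker gets which task. The toy version skips all of that.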

**Recap:** We have

* used Dask to load in **all** of the NYC taxi dataset from 10+ CSVs (8+ GBs), 
* massaged the data, 
* engineered a feature, 
* computed the average tip as a function of the number of passengers,
* saved to [Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) (far more efficient than CSV, but not human-readable), and
* dived into the basics of Dask and distributed compute.

## 3. Optional: Work directly from the cloud with Coiled 

<br>
<img src="images/horizontal.png" alt="Coiled logo" style="width: 500px;"/>
<br>

Here I'll spin up a cluster on Coiled to show you just how easy it can be. Note that to do so, I've also signed into the [Coiled Beta](https://beta.coiled.io/), pip installed `coiled`, and authenticated. You can do the same!

The plan:

* use Coiled to load in **all** of the NYC taxi dataset from 10+ CSVs (8+ GBs) on an AWS cluster, 
* massage the data, 
* engineer a feature, 
* compute the average tip as a function of the number of passengers, and 
* save to [Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) (far more efficient than CSV, but not human-readable).

In [14]:
import coiled
from dask.distributed import Client

In [None]:
# Create a Software Environment
coiled.create_software_environment(
    name="my-software-env",
    conda="binder/environment.yml",
)

In [None]:
# Control the resources of your cluster by creating a new cluster configuration
coiled.create_cluster_configuration(
    name="my-cluster-config",
    worker_memory="16 GiB",
    worker_cpu=4,
    scheduler_memory="8 GiB",
    scheduler_cpu=2,
    software="my-software-env",
)


In [15]:
# Spin up cluster, instantiate a Client
cluster = coiled.Cluster(n_workers=10, configuration="my-cluster-config")
client = Client(cluster)
client

Creating Cluster. This takes about a minute ...Checking environment images
Valid environment image found


0,1
Client  Scheduler: tls://ec2-18-224-64-117.us-east-2.compute.amazonaws.com:8786  Dashboard: http://ec2-18-224-64-117.us-east-2.compute.amazonaws.com:8787/status,Cluster  Workers: 0  Cores: 0  Memory: 0 B


In [16]:
import dask.dataframe as dd

# Read data into a Dask DataFrame
df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv", 
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        "RatecodeID": "float64",
        "VendorID": "float64",
        "passenger_count": "float64",
        "payment_type": "float64",
    },
    storage_options={"anon":True}
)
df

npartitions=127,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
,float64,datetime64[ns],datetime64[ns],float64,float64,float64,object,int64,int64,float64,float64,float64,float64,float64,float64,float64,float64,float64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


In [17]:
%%time

# Prepare to compute the average tip 
# as a function of the number of passengers
mean_amount = df.groupby("passenger_count").tip_amount.mean()

CPU times: user 27.2 ms, sys: 5.49 ms, total: 32.7 ms
Wall time: 32.5 ms


In [18]:
%%time

# Compute the average tip 
# as a function of the number of passengers
mean_amount.compute()

CPU times: user 1.94 s, sys: 275 ms, total: 2.21 s
Wall time: 28.4 s


passenger_count
0.0    2.122789
1.0    2.206790
2.0    2.214306
3.0    2.137775
4.0    2.023804
5.0    2.235441
6.0    2.221105
7.0    6.675962
8.0    7.111625
9.0    7.377822
Name: tip_amount, dtype: float64

In [19]:
# Alter data types for efficiency
df = df.astype({
    "VendorID": "UInt8",
    "passenger_count": "UInt8",
    "RatecodeID": "UInt8",
    "store_and_fwd_flag": "category",
    "PULocationID": "UInt16",
    "DOLocationID": "UInt16",    
})

# Create new feature in dataset: tip_ratio
df["tip_ratio"] = df.tip_amount / df.total_amount
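
One caveat about `tip_ratio`: the January output earlier in this notebook includes trips whose `total_amount` is 0.00, so the division can produce missing or infinite values. Here is a minimal sketch on made-up rows, with one possible cleanup (a suggestion, not part of the original workflow):

```python
import numpy as np
import pandas as pd

# Made-up rows: a normal trip, a trip with all-zero amounts, and a tip with a zero total.
trips = pd.DataFrame({
    "tip_amount": [1.65, 0.0, 2.0],
    "total_amount": [9.95, 0.0, 0.0],
})

trips["tip_ratio"] = trips.tip_amount / trips.total_amount
print(trips.tip_ratio.tolist())  # a finite value, NaN (0/0), and inf (2/0)

# One possible cleanup: treat infinite ratios as missing too.
trips["tip_ratio"] = trips.tip_ratio.replace([np.inf, -np.inf], np.nan)
print(trips.tip_ratio.isna().sum())
```

Whether to drop, mask, or keep such rows depends on the analysis. The point is only that they exist.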

In [20]:
%%time
df.to_parquet("s3://hugo-coiled-tutorial/nyctaxi-2019.parq")

NoCredentialsError: Unable to locate credentials

In [21]:
cluster.close()

**Recap:** We have
* used Coiled to load in **all** of the NYC taxi dataset from 10+ CSVs (8+ GBs) on an AWS cluster, 
* massaged the data, 
* engineered a feature, 
* computed the average tip as a function of the number of passengers, and 
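* tried to save to [Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) (far more efficient than CSV, but not human-readable); the write above failed with `NoCredentialsError` because no AWS credentials were found for the target bucket.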