# Dask DataFrame

## Notebook Objectives
* **Download NYC Yellow Taxi Cab Dataset for 2019**.
* **Reading and working with tabular data using pandas**, a popular library for data analysis.
* **Reading and working with tabular data using Dask DataFrame** - an interface to scale pandas code, and a look at **Dask Dashboards** for real-time visualization of the state of your cluster.
* **Scaling Dask computation to the Cloud** using Coiled, a deployment-as-a-service library for scaling Python. (Optional)
* **Limitations of Dask DataFrame**.
* **References** for further reading.

## Download NYC Yellow Taxi Cab Dataset for 2019

A typical data science workflow starts with some data that needs to be understood. The first step is usually data cleaning and exploratory analysis to find interesting details and patterns.

In this notebook, we will be working with the [New York City Yellow Taxi Trips Dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) for 2019.

The following code cell downloads the data to the current folder, `/1-introduction-to-dask`. We recommend moving all CSV files to a `/data` subdirectory, because the rest of the notebook assumes that the data is in a `/data` directory.

In [4]:
!wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-{01..12}.csv

--2020-12-30 00:15:19--  https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.85.174
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.85.174|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 687088084 (655M) [text/csv]
Saving to: ‘yellow_tripdata_2019-01.csv’


2020-12-30 00:18:50 (3.15 MB/s) - ‘yellow_tripdata_2019-01.csv’ saved [687088084/687088084]

--2020-12-30 00:18:50--  https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv
Reusing existing connection to s3.amazonaws.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 649882828 (620M) [text/csv]
Saving to: ‘yellow_tripdata_2019-02.csv’


2020-12-30 00:21:10 (4.46 MB/s) - ‘yellow_tripdata_2019-02.csv’ saved [649882828/649882828]

--2020-12-30 00:21:10--  https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv
Reusing existing connection to s3.amazonaws.com:443.
HTTP request sent, awa
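
The recommended move into a `data` subdirectory can be scripted. The following is a minimal sketch, assuming the CSV files were downloaded into the current working directory and keep the `yellow_tripdata_2019-*.csv` naming used above. The helper name `move_csv_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
import shutil


def move_csv_files(source_dir=".", target_dir="data"):
    """Move the downloaded 2019 taxi CSV files from source_dir into target_dir.

    Hypothetical helper for illustration; returns the new paths of the moved files.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    # Create the target directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    # glob() is not recursive, so files already inside target are not touched.
    for csv_path in sorted(source.glob("yellow_tripdata_2019-*.csv")):
        destination = target / csv_path.name
        shutil.move(str(csv_path), str(destination))
        moved.append(destination)
    return moved
```

Calling `move_csv_files()` from the notebook folder would move the twelve monthly files into `data/`, which matches the paths used in the rest of the notebook.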

## Reading and working with tabular data using **pandas**

### Reading data

pandas has a `read_csv` function to import data into your workspace. We use it to read the taxi data for January 2019.

`%%time` is a [cell magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html) in IPython that reports the CPU time and wall-clock time taken to run a cell.
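Outside IPython, a similar measurement can be approximated with the standard library. This is only a sketch: the helper name `timed` is hypothetical, and `time.process_time()` returns user and system CPU time combined rather than the separate figures that `%%time` prints.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, cpu_seconds, wall_seconds).

    Hypothetical helper that roughly mirrors what %%time reports.
    """
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # user + system CPU time of this process
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, cpu_seconds, wall_seconds


total, cpu_seconds, wall_seconds = timed(sum, range(1_000_000))
print(f"CPU time: {cpu_seconds:.3f} s, wall time: {wall_seconds:.3f} s")
```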

pandas reads data in the form of a 'dataframe': a structured format consisting of rows and columns, along with some metadata about the values.

In [2]:
%%time

import pandas as pd

df = pd.read_csv("data/yellow_tripdata_2019-01.csv")
df

CPU times: user 9.87 s, sys: 1.48 s, total: 11.4 s
Wall time: 12 s


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.50,1,N,151,239,1,7.0,0.5,0.5,1.65,0.0,0.3,9.95,
1,1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.60,1,N,239,246,1,14.0,0.5,0.5,1.00,0.0,0.3,16.30,
2,2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,0.00,1,N,236,236,1,4.5,0.5,0.5,0.00,0.0,0.3,5.80,
3,2,2018-11-28 15:52:25,2018-11-28 15:55:45,5,0.00,1,N,193,193,2,3.5,0.5,0.5,0.00,0.0,0.3,7.55,
4,2,2018-11-28 15:56:57,2018-11-28 15:58:33,5,0.00,2,N,193,193,2,52.0,0.0,0.5,0.00,0.0,0.3,55.55,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7667787,2,2019-01-31 23:57:36,2019-02-01 00:18:39,1,4.79,1,N,263,4,1,18.0,0.5,0.5,3.86,0.0,0.3,23.16,0.0
7667788,2,2019-01-31 23:32:03,2019-01-31 23:33:11,1,0.00,1,N,193,193,1,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0
7667789,2,2019-01-31 23:36:36,2019-01-31 23:36:40,1,0.00,1,N,264,264,1,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0
7667790,2,2019-01-31 23:14:53,2019-01-31 23:15:20,1,0.00,1,N,264,7,1,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0


Note that the computation takes ~12s to complete. pandas has read all the data for January and inferred the datatypes for each column. The `.info()` method can be used to gather a concise summary of the dataframe.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7667792 entries, 0 to 7667791
Data columns (total 18 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   VendorID               int64  
 1   tpep_pickup_datetime   object 
 2   tpep_dropoff_datetime  object 
 3   passenger_count        int64  
 4   trip_distance          float64
 5   RatecodeID             int64  
 6   store_and_fwd_flag     object 
 7   PULocationID           int64  
 8   DOLocationID           int64  
 9   payment_type           int64  
 10  fare_amount            float64
 11  extra                  float64
 12  mta_tax                float64
 13  tip_amount             float64
 14  tolls_amount           float64
 15  improvement_surcharge  float64
 16  total_amount           float64
 17  congestion_surcharge   float64
dtypes: float64(9), int64(6), object(3)
memory usage: 1.0+ GB
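
The `+` in the `memory usage` line means the figure is a lower bound: for `object` columns, pandas counts only the references and not the Python strings they point to. The sketch below uses a made-up three-row frame (not the taxi data) to show how `memory_usage(deep=True)` includes the strings as well.

```python
import pandas as pd

# Made-up rows for illustration; the object dtype is set explicitly to mirror
# the dtype shown for store_and_fwd_flag in the info() output above.
frame = pd.DataFrame(
    {
        "store_and_fwd_flag": pd.Series(["N", "Y", "N"], dtype="object"),
        "fare_amount": [7.0, 14.0, 4.5],
    }
)

# Shallow measurement: object columns are counted as references only.
shallow_bytes = frame.memory_usage(deep=False).sum()

# Deep measurement: also counts the Python string objects themselves.
deep_bytes = frame.memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
```

On the full January dataframe, `df.info(memory_usage="deep")` reports the deep figure directly, at the cost of a slower call.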


### Working with the data

After importing the data, the next step is working on the data to find some useful information.

In the following blocks, the mean of the tip amount is calculated as a function of passenger count.

In pandas, you can use `groupby()` to group the rows by the values in a column, and `mean()` to calculate the mean within each group.

In [3]:
%%time

df.groupby("passenger_count").tip_amount.mean()

CPU times: user 88.3 ms, sys: 14.2 ms, total: 103 ms
Wall time: 102 ms


passenger_count
0    1.786901
1    1.828308
2    1.833877
3    1.795579
4    1.702710
5    1.869868
6    1.856830
7    6.542632
8    6.480690
9    3.116667
Name: tip_amount, dtype: float64
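
To see what `groupby()` followed by `mean()` does, it helps to try it on a table small enough to check by hand. The rows below are made up for illustration and are not taken from the taxi data.

```python
import pandas as pd

# Five made-up trips: two with one passenger, three with two passengers.
trips = pd.DataFrame(
    {
        "passenger_count": [1, 1, 2, 2, 2],
        "tip_amount": [1.0, 3.0, 0.0, 3.0, 6.0],
    }
)

# Group the rows by passenger_count, then average tip_amount within each group.
mean_tip = trips.groupby("passenger_count")["tip_amount"].mean()
print(mean_tip)
# passenger_count 1 -> (1.0 + 3.0) / 2 = 2.0
# passenger_count 2 -> (0.0 + 3.0 + 6.0) / 3 = 3.0
```

The bracket form `["tip_amount"]` is equivalent to the attribute form `.tip_amount` used in the cell above.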

### Limitation in pandas

pandas is the most popular library for exploratory data analysis, but it has a limitation. pandas loads the entire dataset into memory, so it works well for data that fits comfortably in RAM, but it can fail with a `MemoryError` on larger datasets. This is where Dask comes in.

Optional: Uncomment and run the following code block to read the entire dataset in pandas.

In [1]:
# import glob

# df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))
# df

## Reading and working with tabular data using **Dask DataFrame**

### Reading data

Dask can be used to scale pandas to larger datasets. A Dask DataFrame is made up of many pandas dataframes, so its API mirrors a large part of the pandas API. This makes Dask code familiar and easy to use.

First, spin up a cluster! 

In [1]:
from dask.distributed import Client

client = Client(n_workers=4)
client

0,1
Client  Scheduler: tcp://127.0.0.1:53261  Dashboard: http://127.0.0.1:8787/status,Cluster  Workers: 4  Cores: 12  Memory: 17.18 GB


Open the Dask Dashboard in JupyterLab and add the Cluster Map, Task Stream, and Dask Workers panes:

* **Cluster map** (also called the pew-pew map) visualizes interactions between the scheduler and the workers.
* **Task stream** shows tasks performed by each worker in real-time.
* **Dask workers** displays CPU and memory being used by each worker.

This is the same reading operation with Dask, but this time we read the complete dataset: the data for all twelve months of 2019.

In [5]:
%%time

import dask.dataframe as dd

df = dd.read_csv("data/yellow_tripdata_2019-*.csv")
df

CPU times: user 234 ms, sys: 68.3 ms, total: 302 ms
Wall time: 5.88 s


Unnamed: 0_level_0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
npartitions=127,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
,int64,object,object,int64,float64,int64,object,int64,int64,int64,float64,float64,float64,float64,float64,float64,float64,float64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


That took only a few seconds of wall time (and about 300 ms of CPU time) because Dask hasn't actually imported all the data. It has only created partitions and estimated the datatypes of each column.

Let's look at the first few rows. The pandas `head()` method can be used for this.

In [3]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.5,1,N,151,239,1,7.0,0.5,0.5,1.65,0.0,0.3,9.95,
1,1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.6,1,N,239,246,1,14.0,0.5,0.5,1.0,0.0,0.3,16.3,
2,2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,0.0,1,N,236,236,1,4.5,0.5,0.5,0.0,0.0,0.3,5.8,
3,2,2018-11-28 15:52:25,2018-11-28 15:55:45,5,0.0,1,N,193,193,2,3.5,0.5,0.5,0.0,0.0,0.3,7.55,
4,2,2018-11-28 15:56:57,2018-11-28 15:58:33,5,0.0,2,N,193,193,2,52.0,0.0,0.5,0.0,0.0,0.3,55.55,


To look at the last few rows, use the `tail()` pandas method.

In [4]:
df.tail()

ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+-----------------+---------+----------+
| Column          | Found   | Expected |
+-----------------+---------+----------+
| RatecodeID      | float64 | int64    |
| VendorID        | float64 | int64    |
| passenger_count | float64 | int64    |
| payment_type    | float64 | int64    |
+-----------------+---------+----------+

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'RatecodeID': 'float64',
       'VendorID': 'float64',
       'passenger_count': 'float64',
       'payment_type': 'float64'}

to the call to `read_csv`/`read_table`.

Alternatively, provide `assume_missing=True` to interpret
all unspecified integer columns as floats.

This throws an error because `tail()` reads the last partition, and the datatypes found there did not match the datatypes Dask had estimated initially. This is different from pandas. pandas reads the complete dataset before inferring the datatypes and null-value information, which wouldn't be ideal for a larger-than-memory dataset.

Dask estimates the datatypes from a small sample of the data to stay efficient, so it's common to run into this error. A good practice is to specify the datatypes in the call to `read_csv`.

*Note that Dask also provides a helpful error message to diagnose this issue.*

In [6]:
df = dd.read_csv("data/yellow_tripdata_2019-*.csv",
                 dtype={'RatecodeID': 'float64',
                        'VendorID': 'float64',
                        'passenger_count': 'float64',
                        'payment_type': 'float64'
                       })

In [6]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2019-01-01 00:46:40,2019-01-01 00:53:20,1.0,1.5,1.0,N,151,239,1.0,7.0,0.5,0.5,1.65,0.0,0.3,9.95,
1,1.0,2019-01-01 00:59:47,2019-01-01 01:18:59,1.0,2.6,1.0,N,239,246,1.0,14.0,0.5,0.5,1.0,0.0,0.3,16.3,
2,2.0,2018-12-21 13:48:30,2018-12-21 13:52:40,3.0,0.0,1.0,N,236,236,1.0,4.5,0.5,0.5,0.0,0.0,0.3,5.8,
3,2.0,2018-11-28 15:52:25,2018-11-28 15:55:45,5.0,0.0,1.0,N,193,193,2.0,3.5,0.5,0.5,0.0,0.0,0.3,7.55,
4,2.0,2018-11-28 15:56:57,2018-11-28 15:58:33,5.0,0.0,2.0,N,193,193,2.0,52.0,0.0,0.5,0.0,0.0,0.3,55.55,


In [7]:
df.tail()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
685448,,2019-12-31 00:07:00,2019-12-31 00:46:00,,12.78,,,230,72,,32.32,2.75,0.5,0.0,6.12,0.3,41.99,0.0
685449,,2019-12-31 00:20:00,2019-12-31 00:47:00,,18.52,,,219,32,,51.63,2.75,0.5,0.0,6.12,0.3,61.3,0.0
685450,,2019-12-31 00:50:00,2019-12-31 01:21:00,,13.13,,,161,76,,38.02,2.75,0.5,0.0,6.12,0.3,47.69,0.0
685451,,2019-12-31 00:38:19,2019-12-31 01:19:37,,14.51,,,230,21,,41.86,2.75,0.0,0.0,6.12,0.3,51.03,0.0
685452,,2019-12-31 00:21:00,2019-12-31 00:56:00,,-17.16,,,193,219,,44.62,2.75,0.5,0.0,0.0,0.3,48.17,0.0


This now works!
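
The reason the float datatypes help can be reproduced with pandas alone. The sketch below uses two made-up CSV snippets (not the taxi files): an integer column is inferred as `int64` when it is complete, but as `float64` when a value is missing, because a missing value is stored as `NaN`. This mirrors the mismatch between the sampled rows and the final rows above.

```python
import io

import pandas as pd

# Made-up CSV text: the second snippet has a missing passenger_count.
complete_csv = "passenger_count,tip_amount\n1,1.65\n3,0.00\n"
with_gap_csv = "passenger_count,tip_amount\n1,1.65\n,0.00\n"

# Inferred from complete data: an integer column.
dtype_complete = pd.read_csv(io.StringIO(complete_csv))["passenger_count"].dtype

# Inferred from data with a gap: the column becomes float to hold NaN.
dtype_with_gap = pd.read_csv(io.StringIO(with_gap_csv))["passenger_count"].dtype

# Declaring the dtype up front gives the same result for both inputs.
dtype_declared = pd.read_csv(
    io.StringIO(complete_csv), dtype={"passenger_count": "float64"}
)["passenger_count"].dtype

print(dtype_complete, dtype_with_gap, dtype_declared)
```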

### Working with the data

The same computation (to calculate mean for the tip amount as a function of passenger count) is now performed on the entire dataset using Dask DataFrame.

*Note that Dask code is similar to pandas code.*

In [8]:
%%time

mean_tip_amount = df.groupby("passenger_count").tip_amount.mean()
mean_tip_amount

CPU times: user 12.2 ms, sys: 1.72 ms, total: 13.9 ms
Wall time: 12.5 ms


Dask Series Structure:
npartitions=1
    float64
        ...
Name: tip_amount, dtype: float64
Dask Name: truediv, 420 tasks

Dask DataFrame is backed by the Delayed API we saw in the previous notebook, so the evaluations here are also lazy.

You can use `compute()` to get the output.

In [9]:
%%time

mean_tip_amount.compute()

CPU times: user 4.41 s, sys: 653 ms, total: 5.07 s
Wall time: 1min 17s


passenger_count
0.0    2.122789
1.0    2.206790
2.0    2.214306
3.0    2.137775
4.0    2.023804
5.0    2.235441
6.0    2.221105
7.0    6.675962
8.0    7.111625
9.0    7.377822
Name: tip_amount, dtype: float64

Dask deletes intermediate results, like the full pandas dataframe for each file. This lets us handle datasets that are larger than memory, but also means that repeated computations will have to load all of the data in each time.

You can use `persist()` to store intermediate results for future use:

```
mean_tip_persist = mean_tip_amount.persist()
```

### Checkpoint

**Question:** Compute the standard deviation for tip_amount as a function of passenger_count for the entire dataset.

In [None]:
#your answer here

In [None]:
# Solution 1

# Standard deviation of tip_amount for each passenger_count, on the full dataset
std_tip = df.groupby("passenger_count").tip_amount.std().compute()

### Sharing intermediate outputs

Sometimes individual computations may be related to each other, and can benefit from sharing intermediate results. For example, computing the minimum and maximum values of the same column both require reading that column.

In pandas (and therefore in Dask DataFrame), you can use `min()` and `max()` to compute minimum and maximum respectively.

In [13]:
max_tip_amount = df.tip_amount.max()
min_tip_amount = df.tip_amount.min()

### Without Sharing

In [15]:
%%time
max_tip = max_tip_amount.compute()
min_tip = min_tip_amount.compute()

CPU times: user 3.17 s, sys: 422 ms, total: 3.6 s
Wall time: 1min 12s


### With Sharing

In [16]:
import dask

In [17]:
%%time
max_tip, min_tip = dask.compute(max_tip_amount, min_tip_amount)

CPU times: user 2.18 s, sys: 296 ms, total: 2.47 s
Wall time: 39.9 s


Notice that the shared computation is significantly faster: about 40 s instead of 1 min 12 s, because the data is read only once for both results.
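
The effect of sharing can be made visible with a counter. The following is a sketch using `dask.delayed` on made-up numbers rather than Dask DataFrame. The `load` function and the `calls` counter are hypothetical stand-ins for the expensive step of reading the CSV files.

```python
import dask

# Counts how often the "expensive" loading step actually runs.
calls = {"load": 0}


@dask.delayed
def load():
    calls["load"] += 1
    return [3.0, 0.5, 7.25, 1.0]  # made-up tip amounts


data = load()                       # one lazy task, shared by both results
largest = dask.delayed(max)(data)
smallest = dask.delayed(min)(data)

# Without sharing: each compute() call runs its own copy of the graph.
largest.compute(scheduler="synchronous")
smallest.compute(scheduler="synchronous")
separate_loads = calls["load"]

# With sharing: dask.compute() merges the graphs, so load() runs once.
calls["load"] = 0
max_value, min_value = dask.compute(largest, smallest, scheduler="synchronous")
shared_loads = calls["load"]

print(separate_loads, shared_loads, max_value, min_value)
```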

### Checkpoint

**Question:** Compute the mean and standard deviation for total amount by sharing intermediate results.

In [None]:
#your answer here

In [None]:
# Solution 2

# Build both lazy results first, then compute them together to share the read
mean_total = df.total_amount.mean()
std_total = df.total_amount.std()

dask.compute(mean_total, std_total)

In [2]:
client.close()

## Scaling to the Cloud (Optional)

We can now scale our Dask workflow to the Cloud. There are many different ways to do this, but here we'll use Coiled. Coiled allows us to stay in this same notebook and makes the process much easier.

1. Sign in to [cloud.coiled.io](https://cloud.coiled.io/)
2. Navigate to the Dashboard tab and copy the login token.
3. Open the terminal (or command prompt in Windows), execute `coiled login`, and paste the token when prompted.

*Note: It's free while in beta!*

That's it! Now in the same notebook, let's connect to our Coiled cluster.

In [1]:
from dask.distributed import Client

In [2]:
import coiled

cluster = coiled.Cluster(n_workers=10)

client = Client(cluster)
client

Output()


+---------+--------+-----------+---------+
| Package | client | scheduler | workers |
+---------+--------+-----------+---------+
| blosc   | None   | 1.10.2    | 1.10.2  |
| lz4     | None   | 3.1.3     | 3.1.3   |
+---------+--------+-----------+---------+


0,1
Client  Scheduler: tls://ec2-18-219-229-106.us-east-2.compute.amazonaws.com:8786  Dashboard: http://ec2-18-219-229-106.us-east-2.compute.amazonaws.com:8787,Cluster  Workers: 10  Cores: 40  Memory: 171.80 GB


In [4]:
import dask.dataframe as dd

In [5]:
%%time

df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        "payment_type": "UInt8",
        "VendorID": "UInt8",
        "passenger_count": "UInt8",
        "RatecodeID": "UInt8",
        "store_and_fwd_flag": "category",
        "PULocationID": "UInt16",
        "DOLocationID": "UInt16",
    },
    storage_options={"anon": True},
    blocksize="16 MiB",
).persist()

df.groupby("passenger_count").tip_amount.mean().compute()

CPU times: user 288 ms, sys: 72.5 ms, total: 361 ms
Wall time: 35.9 s


passenger_count
0    1.786901
1    1.828308
2    1.833877
3    1.795579
4    1.702710
5    1.869868
6    1.856830
7    6.542632
8    6.480690
9    3.116667
Name: tip_amount, dtype: float64

## Limitations of Dask DataFrame

The Dask DataFrame API does not implement the complete pandas interface because some pandas operations are not suited for a parallel and distributed environment.

### Data Shuffling

Dask DataFrames consist of multiple pandas dataframes, each of which has its own index starting from zero. Some operations, such as setting a new index with `set_index`, may need the data to be sorted, which requires a lot of time-consuming shuffling of data between partitions. These operations are slower in Dask than in pandas. Hence, presorting the index and making logical partitions are good practices.
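
The per-partition index can be pictured with pandas alone. This sketch stacks two made-up dataframes, standing in for two partitions read from separate files. It illustrates the idea and does not show Dask's internals.

```python
import pandas as pd

# Two made-up "partitions", each with its own index starting from zero.
january = pd.DataFrame({"tip_amount": [1.65, 1.00]})
february = pd.DataFrame({"tip_amount": [0.00, 3.86]})

# Stacking them keeps each partition's index, so labels repeat: 0, 1, 0, 1.
stacked = pd.concat([january, february])
print(stacked.index.tolist(), stacked.index.is_unique)

# A single global index needs information from every partition.
renumbered = pd.concat([january, february], ignore_index=True)
print(renumbered.index.tolist(), renumbered.index.is_unique)
```

In pandas the renumbering is cheap because all rows are in one process. In a distributed setting, the equivalent step needs coordination across partitions, which is part of why index operations cost more.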


## References

* [Dask DataFrame documentation](https://docs.dask.org/en/latest/dataframe.html)
* [Dask DataFrame API](https://docs.dask.org/en/latest/dataframe-api.html)
* [Dask DataFrame examples](https://examples.dask.org/dataframe.html)
* [Dask Tutorial - DataFrames](https://github.com/pavithraes/dask-tutorial/blob/master/04_dataframe.ipynb)