<img src="images/dask_horizontal.svg" width="45%" alt="Dask logo">

# Performance Optimization

This notebook walks through a Dask DataFrame ETL workload. We'll demonstrate how to diagnose performance issues, utilize the Dask dashboard, and cover several common DataFrame best practices. 

## Dataset: Uber/Lyft TLC Trip Records

The New York City Taxi and Limousine Commission (TLC) collects trip information for each taxi and for-hire vehicle trip completed by licensed drivers and vehicles. Here we'll analyze a subset of the [High-Volume For-Hire Services](https://www.nyc.gov/site/tlc/businesses/high-volume-for-hire-services.page) dataset, which is a good example of an out-of-core dataset: it is too large to fit in the memory of a standard laptop.

Some characteristics of the dataset:

- CSV dataset that's ~60 GB in memory
- Stored in `s3://coiled-datasets/uber-lyft-tlc-sample/csv-10/`
- In region `us-east-2`

## Cluster setup

Because the dataset is too large for a laptop, we'll create a larger Dask cluster on AWS using [Coiled](https://www.coiled.io).
(Disclaimer: some of the instructors for this tutorial are employed by Coiled.)

In [1]:
import coiled

cluster = coiled.Cluster(
    n_workers=50,
    region="us-east-2",  # start workers close to data to minimize costs
)
client = cluster.get_client()

Once we have initialized a cluster and client, we can easily view the Dask dashboard either through widgets provided by [dask-labextension](https://github.com/dask/dask-labextension), or by visiting the dashboard URL directly:

In [3]:
client

Connection method: Cluster object
Cluster type: coiled.Cluster
Dashboard: https://cluster-lcrgs.dask.host/_r9nYKAZQpcqf96s/status

Workers: 50
Total threads: 200
Total memory: 741.96 GiB

Scheduler comm: tls://10.0.42.56:8786
Scheduler dashboard: http://10.0.42.56:8787/status
Started: 9 minutes ago

Per-worker details: 50 workers, each with 4 threads and roughly 14.84 GiB of memory, using local directories under /scratch/dask-scratch-space/.

Using `dask.dataframe.read_csv()`, we can lazily read this data in and do some low-level exploration before performing more complex computations:

In [2]:
%%time

import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://coiled-datasets/uber-lyft-tlc-sample/csv-10/*", 
    dtype={"wav_match_flag": "category"},
)

CPU times: user 52.2 s, sys: 2.93 s, total: 55.1 s
Wall time: 58.6 s


In [4]:
ddf.dtypes

Unnamed: 0                 int64
hvfhs_license_num         object
dispatching_base_num      object
originating_base_num      object
request_datetime          object
on_scene_datetime         object
pickup_datetime           object
dropoff_datetime          object
PULocationID               int64
DOLocationID               int64
trip_miles               float64
trip_time                  int64
base_passenger_fare      float64
tolls                    float64
bcf                      float64
sales_tax                float64
congestion_surcharge     float64
airport_fee              float64
tips                     float64
driver_pay               float64
shared_request_flag       object
shared_match_flag         object
access_a_ride_flag        object
wav_request_flag          object
wav_match_flag          category
dtype: object

After some initial exploration, we see that the columns representing on-scene and pickup times are stored as `object`s. We decide to do some feature engineering by converting these to datetimes and moving relevant date components into separate columns.

In [5]:
%%time

# Convert to datetime
ddf["on_scene_datetime"] = dd.to_datetime(ddf["on_scene_datetime"], format="mixed")
ddf["pickup_datetime"] = dd.to_datetime(ddf["pickup_datetime"], format="mixed")

# Unpack columns
ddf = ddf.assign(
    accessible_vehicle=ddf.on_scene_datetime.isnull(),
    pickup_month=ddf.pickup_datetime.dt.month,
    pickup_dow=ddf.pickup_datetime.dt.dayofweek,
    pickup_hour=ddf.pickup_datetime.dt.hour,
)
ddf = ddf.drop(columns=["on_scene_datetime", "pickup_datetime"])

CPU times: user 70.4 ms, sys: 8.73 ms, total: 79.2 ms
Wall time: 103 ms
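
The same feature extraction can be tried on a couple of rows with plain pandas, which makes the resulting columns easy to inspect. This is a sketch on made-up rows, not the TLC data; a missing on-scene time is parsed as `NaT`, which is what `isnull()` picks up.

```python
import pandas as pd

# Made-up sample rows; the second row has no on-scene time.
pdf = pd.DataFrame(
    {
        "on_scene_datetime": ["2022-01-03 08:10:00", None],
        "pickup_datetime": ["2022-01-03 08:15:00", "2022-07-09 23:40:00"],
    }
)
pdf["on_scene_datetime"] = pd.to_datetime(pdf["on_scene_datetime"])
pdf["pickup_datetime"] = pd.to_datetime(pdf["pickup_datetime"])

features = pdf.assign(
    accessible_vehicle=pdf["on_scene_datetime"].isnull(),
    pickup_month=pdf["pickup_datetime"].dt.month,
    pickup_dow=pdf["pickup_datetime"].dt.dayofweek,  # Monday=0 ... Sunday=6
    pickup_hour=pdf["pickup_datetime"].dt.hour,
).drop(columns=["on_scene_datetime", "pickup_datetime"])
print(features)
```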


Next, we sanitize the data and make it more readable:

- Normalize airport fees to non-null floats
- Remove trip time outliers
- Rename service codes to their corresponding rideshare companies

In [9]:
%%time

# Format airport_fee
ddf["airport_fee"] = ddf["airport_fee"].fillna(0)

# Remove outliers
lower_bound = 0
Q3 = ddf["trip_time"].quantile(0.75)
upper_bound = Q3 + (1.5 * (Q3 - lower_bound))
ddf = ddf.loc[(ddf["trip_time"] >= lower_bound) & (ddf["trip_time"] <= upper_bound)]

service_names = {
    "HV0002": "juno",
    "HV0005": "lyft",
    "HV0003": "uber",
    "HV0004": "via",
}

ddf["service_names"] = ddf["hvfhs_license_num"].map(service_names)
ddf = ddf.drop(columns=["hvfhs_license_num"])
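
The outlier rule above keeps trips whose duration lies between `lower_bound` and `Q3 + 1.5 * (Q3 - lower_bound)`. As a worked example of that arithmetic, here is a small standard-library sketch on made-up trip times. Note that Dask's `quantile` is approximate on large data, whereas this toy computes an exact quartile, so treat it only as an illustration of the formula.

```python
import statistics

# Made-up trip times in seconds, including one negative and one extreme value.
trip_times = [-50, 300, 600, 900, 1200, 1500, 1800, 2100, 100_000]

lower_bound = 0
# Third quartile; method="inclusive" interpolates between the observed min and max.
q3 = statistics.quantiles(trip_times, n=4, method="inclusive")[2]
upper_bound = q3 + 1.5 * (q3 - lower_bound)

# Keep only the trips inside [lower_bound, upper_bound].
kept = [t for t in trip_times if lower_bound <= t <= upper_bound]
print(q3, upper_bound, kept)
```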

Now that the data is cleaned up, we can do some computations on our data.

First, let's compute the fraction of trips that included a tip:

In [26]:
%%time

(ddf.tips > 0).mean().compute()

CPU times: user 269 ms, sys: 19.8 ms, total: 289 ms
Wall time: 15 s


0.1583987882760626

We can also compute the total and mean tip amounts, among trips that included a tip, grouped by rideshare company:

In [13]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

CPU times: user 820 ms, sys: 67.7 ms, total: 888 ms
Wall time: 1min 13s


service_names
juno    1.053204e+06
lyft    9.243761e+07
uber    1.984338e+08
via     1.486593e+06
Name: tips, dtype: float64

In [14]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

CPU times: user 983 ms, sys: 83.5 ms, total: 1.07 s
Wall time: 1min 24s


service_names
juno    3.928619
lyft    4.837245
uber    4.763513
via     2.314891
Name: tips, dtype: float64

# Persist when possible

Looking at the dashboard while performing the above analysis, it should become clear that whenever we compute operations on `ddf`, we must also run through all the dependent operations that read in and sanitize `ddf`, which forces several repeated computation steps.

When doing multiple computations on the same dataset, it can save both time and money to `.persist()` it first. This incurs the time and cost of computing the dataset once; in exchange, future computations work with an in-memory copy of the computed data:

In [15]:
%%time

ddf = ddf.persist()

CPU times: user 574 ms, sys: 16.9 ms, total: 591 ms
Wall time: 591 ms


In [16]:
%%time

from distributed import wait
wait(ddf);

CPU times: user 1.16 s, sys: 87 ms, total: 1.24 s
Wall time: 1min 4s
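
The effect of persisting can be reproduced locally with a toy graph. The sketch below uses `dask.delayed` and a call counter instead of DataFrames (an illustrative substitution; the functions `load`, `count_tipped`, and `total_tips` are hypothetical) to show that upstream work re-runs on every `.compute()` unless the result has been persisted.

```python
from dask import delayed

calls = {"load": 0}

@delayed
def load():
    # Stand-in for the expensive read-and-sanitize steps.
    calls["load"] += 1
    return [0.0, 2.5, 0.0, 4.0]

@delayed
def count_tipped(tips):
    return sum(t > 0 for t in tips)

@delayed
def total_tips(tips):
    return sum(tips)

data = load()

# Without persist: every compute re-runs the upstream work.
count_tipped(data).compute(scheduler="synchronous")
total_tips(data).compute(scheduler="synchronous")
calls_without_persist = calls["load"]

# With persist: the upstream work runs once and the result is kept in memory.
calls["load"] = 0
data = data.persist(scheduler="synchronous")
n_tipped = count_tipped(data).compute(scheduler="synchronous")
tip_sum = total_tips(data).compute(scheduler="synchronous")
calls_with_persist = calls["load"]

print(calls_without_persist, calls_with_persist)
```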


Now that `ddf` has been persisted, the same analysis no longer includes the initial creation of `ddf`, and the groupby computations in particular run much faster:

In [23]:
%%time

(ddf.tips > 0).mean().compute()

CPU times: user 286 ms, sys: 21.8 ms, total: 308 ms
Wall time: 14.3 s


0.1583987882760626

In [24]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

CPU times: user 275 ms, sys: 16.4 ms, total: 291 ms
Wall time: 7.32 s


service_names
juno    1.053204e+06
lyft    9.243761e+07
uber    1.984338e+08
via     1.486593e+06
Name: tips, dtype: float64

In [25]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

CPU times: user 306 ms, sys: 26.6 ms, total: 333 ms
Wall time: 17.4 s


service_names
juno    3.928619
lyft    4.837245
uber    4.763513
via     2.314891
Name: tips, dtype: float64

Note that the choice to persist data depends on several factors, including:

- Whether it fits into your cluster's memory
- Whether it's reused in enough computations to justify the up-front cost

In general, a best practice to follow is persisting the dataset(s) you expect to use the most throughout computations.

# Avoid repeated compute calls

When working with related results that share computations between one another, calling `.compute()` on each object individually forces us to discard shared work that could otherwise be used to speed up future computations.

For example:

In [21]:
trip_frac = (ddf.tips > 0).mean()
gb_sum = ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum()
gb_mean = ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean()

Intuitively, we know that `gb_sum` and `gb_mean` both depend on `ddf.loc[lambda x: x.tips > 0].groupby("service_names")`, but calling `.compute()` on each object forces us to compute this result twice.

To compute all of these objects in parallel and compute shared parts of the computation only once, we can use [`dask.compute()`](https://docs.dask.org/en/stable/api.html#dask.compute):

In [22]:
%%time

import dask

trip_frac, gb_sum, gb_mean = dask.compute(trip_frac, gb_sum, gb_mean)

CPU times: user 587 ms, sys: 36.1 ms, total: 623 ms
Wall time: 31.2 s
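
To see the shared work explicitly, here is a toy sketch with `dask.delayed` and a call counter (an illustrative substitution for the DataFrame graph; `tipped_only`, `tip_sum`, and `tip_mean` are hypothetical helpers). Two separate `.compute()` calls run the shared filter twice, while a single `dask.compute()` call runs it once.

```python
import dask
from dask import delayed

calls = {"filter": 0}

@delayed
def tipped_only(tips):
    # Stand-in for the shared filtering step.
    calls["filter"] += 1
    return [t for t in tips if t > 0]

@delayed
def tip_sum(tips):
    return sum(tips)

@delayed
def tip_mean(tips):
    return sum(tips) / len(tips)

tipped = tipped_only([0.0, 2.0, 0.0, 4.0])
s, m = tip_sum(tipped), tip_mean(tipped)

# Separate compute calls: the shared filter runs once per call.
s.compute(scheduler="synchronous")
m.compute(scheduler="synchronous")
separate_calls = calls["filter"]

# One dask.compute call: the shared filter runs once for both results.
calls["filter"] = 0
sum_result, mean_result = dask.compute(s, m, scheduler="synchronous")
shared_calls = calls["filter"]

print(separate_calls, shared_calls, sum_result, mean_result)
```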


# Store data efficiently

Up until this point, all of our performance optimizations have taken place after the initial reading of the data.
However, as compute capacity increases, data access and I/O become more significant bottlenecks.
Additionally, parallel computing often adds new constraints on how you store your data, particularly around providing random access to blocks of data that line up with how you plan to compute on it.

## File format

[Parquet](https://parquet.apache.org) is a popular, columnar file format designed for efficient data storage and retrieval. It handles random access, metadata storage, and binary encoding well. We [recommend using Parquet](https://docs.dask.org/en/stable/dataframe-best-practices.html#use-parquet) when working with tabular data.

In [62]:
%%time

import dask.dataframe as dd

# ddf = dd.read_csv(
#     "s3://coiled-datasets/uber-lyft-tlc-sample/csv-ill/*", 
#     dtype={"wav_match_flag": "category"},
# )

ddf = dd.read_parquet("s3://coiled-datasets/uber-lyft-tlc-sample/parquet-10/")

CPU times: user 1.65 s, sys: 60.9 ms, total: 1.71 s
Wall time: 3.96 s


In [63]:
ddf.dtypes

hvfhs_license_num       string[python]
dispatching_base_num    string[python]
originating_base_num    string[python]
request_datetime        datetime64[ns]
on_scene_datetime       datetime64[ns]
pickup_datetime         datetime64[ns]
dropoff_datetime        datetime64[ns]
PULocationID                     int32
DOLocationID                     int32
trip_miles                     float32
trip_time                        int32
base_passenger_fare            float32
tolls                          float32
bcf                            float32
sales_tax                      float32
congestion_surcharge           float32
airport_fee                    float32
tips                           float32
driver_pay                     float32
shared_request_flag           category
shared_match_flag             category
access_a_ride_flag            category
wav_request_flag              category
wav_match_flag                category
dtype: object

Because Parquet stores column types, the datetime columns are already parsed, so we can skip the conversion step and apply the rest of the sanitization as before:

In [64]:
%%time

# # Convert to datetime
# ddf["on_scene_datetime"] = dd.to_datetime(ddf["on_scene_datetime"], format="mixed")
# ddf["pickup_datetime"] = dd.to_datetime(ddf["pickup_datetime"], format="mixed")

# Unpack columns
ddf = ddf.assign(
    accessible_vehicle=ddf.on_scene_datetime.isnull(),
    pickup_month=ddf.pickup_datetime.dt.month,
    pickup_dow=ddf.pickup_datetime.dt.dayofweek,
    pickup_hour=ddf.pickup_datetime.dt.hour,
)
ddf = ddf.drop(columns=["on_scene_datetime", "pickup_datetime"])

# Format airport_fee
ddf["airport_fee"] = ddf["airport_fee"].fillna(0)

# Remove outliers
lower_bound = 0
Q3 = ddf["trip_time"].quantile(0.75)
upper_bound = Q3 + (1.5 * (Q3 - lower_bound))
ddf = ddf.loc[(ddf["trip_time"] >= lower_bound) & (ddf["trip_time"] <= upper_bound)]

service_names = {
    "HV0002": "juno",
    "HV0005": "lyft",
    "HV0003": "uber",
    "HV0004": "via",
}

ddf["service_names"] = ddf["hvfhs_license_num"].map(service_names)
ddf = ddf.drop(columns=["hvfhs_license_num"])

CPU times: user 122 ms, sys: 8.05 ms, total: 130 ms
Wall time: 128 ms


Following best practices, we will now persist this sanitized dataset, so we no longer need to incur repeated I/O costs:

In [30]:
ddf = ddf.persist()

In [31]:
%%time

from distributed import wait
wait(ddf);

CPU times: user 1.14 s, sys: 76.6 ms, total: 1.22 s
Wall time: 46.4 s


From here, analysis can continue as normal:

In [32]:
%%time

(ddf.tips > 0).mean().compute()

CPU times: user 261 ms, sys: 18.8 ms, total: 279 ms
Wall time: 15.2 s


0.1583987882760626

In [33]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

CPU times: user 174 ms, sys: 15.9 ms, total: 190 ms
Wall time: 7.82 s


service_names
juno    1.053204e+06
lyft    9.243762e+07
uber    1.984339e+08
via     1.486593e+06
Name: tips, dtype: float32

In [34]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

CPU times: user 289 ms, sys: 21.7 ms, total: 311 ms
Wall time: 17.1 s


service_names
juno    3.928619
lyft    4.837245
uber    4.763513
via     2.314891
Name: tips, dtype: float64

Note that since we persisted the data, the impact of the improved I/O is gone by the time we get to the analysis.
This is because at this point, the data is stored in memory with pandas objects and datatypes; how it was originally stored no longer matters.
Put differently, all analysis beyond I/O and sanitization creates an identical task graph to the previous dataset.
In the next section, we will see how to troubleshoot and optimize our analysis independent of I/O.

## Partition size

So far, we've been working with the default partition size, which for this dataset is pretty small (~10 MB).
A small partition size results in a large number of partitions, which in turn results in a large number of tasks in our computation graphs.

When choosing a partition size, the goal is to give Dask enough to do per task that the scheduler overhead isn't taking up a disproportionate amount of time, but not so much that the workers run out of memory.
A good rule of thumb for partition sizes is between 100 MB and 1 GB per partition ([excellent blog post on this](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes)).
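
As a rough back-of-the-envelope sketch, the partition count scales inversely with partition size, and each partition contributes tasks to the graph. The numbers below are illustrative only: the ~60 GB figure is taken from the dataset description above, and `n_partitions` is a hypothetical helper.

```python
# Illustrative size only: a dataset of roughly 60 GB in memory.
DATASET_BYTES = 60 * 1000**3

def n_partitions(dataset_bytes: int, partition_bytes: int) -> int:
    """Number of partitions needed to hold the dataset, rounding up."""
    return -(-dataset_bytes // partition_bytes)

for label, size in [("10 MB", 10 * 1000**2), ("100 MB", 100 * 1000**2), ("1 GB", 1000**3)]:
    print(f"{label:>6} per partition -> {n_partitions(DATASET_BYTES, size):>5} partitions")
```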

So the first step is to see what our partition size currently is:

In [35]:
import dask
dask.utils.format_bytes(ddf.partitions[0].compute().memory_usage(deep=True).sum())

'9.60 MiB'

Let's repartition to a bigger size.

In [65]:
%%time

ddf = ddf.repartition("100MiB")
ddf = ddf.persist()
wait(ddf);

CPU times: user 3.81 s, sys: 328 ms, total: 4.14 s
Wall time: 1min 47s


Note that we persist after we repartition so we don't repeat the repartitioning work every time we compute.

As a sanity check, let's check the new partition size:

In [66]:
dask.utils.format_bytes(ddf.partitions[0].compute().memory_usage(deep=True).sum())

'95.99 MiB'

Nice! Now let's do our analyses again.
Remember that this time, the task graph will be much smaller.
You can always inspect the graph by calling `.visualize()` rather than `.compute()` or by looking at the "Graph" page in the dashboard.

In [46]:
%%time

(ddf.tips > 0).mean().compute()

CPU times: user 55.4 ms, sys: 5.16 ms, total: 60.6 ms
Wall time: 1.32 s


0.1583987882760626

In [52]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

CPU times: user 44.1 ms, sys: 4.85 ms, total: 48.9 ms
Wall time: 1.03 s


service_names
juno    1.053204e+06
lyft    9.243762e+07
uber    1.984338e+08
via     1.486593e+06
Name: tips, dtype: float32

In [53]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

CPU times: user 64.6 ms, sys: 5.88 ms, total: 70.5 ms
Wall time: 1.82 s


service_names
juno    3.928619
lyft    4.837245
uber    4.763513
via     2.314891
Name: tips, dtype: float64

That was fast 🔥

Here we improved on the task graph by increasing the partition size, but we haven't improved the performance of the tasks themselves.
In the next section, we'll explore how changing the data type of your columns can make individual tasks more performant.

# Use efficient data types

Up until this point, we've been using the default data types inferred by Dask for most of our columns. In the case of string data, this means we are using strings backed by Python objects (shown as `string[python]` below), which can be slow to process:

In [54]:
ddf.dtypes

dispatching_base_num    string[python]
originating_base_num    string[python]
request_datetime        datetime64[ns]
dropoff_datetime        datetime64[ns]
PULocationID                     int32
DOLocationID                     int32
trip_miles                     float32
trip_time                        int32
base_passenger_fare            float32
tolls                          float32
bcf                            float32
sales_tax                      float32
congestion_surcharge           float32
airport_fee                    float32
tips                           float32
driver_pay                     float32
shared_request_flag           category
shared_match_flag             category
access_a_ride_flag            category
wav_request_flag              category
wav_match_flag                category
accessible_vehicle                bool
pickup_month                     int32
pickup_dow                       int32
pickup_hour                      int32
service_names            

Recent versions of [Dask and pandas have improved support for PyArrow data types, most notably PyArrow strings](https://medium.com/coiled-hq/pyarrow-strings-in-dask-dataframes-55a0c4871586), which are faster and more memory efficient than Python `objects`.

Let's enjoy some of the benefits of PyArrow strings by casting relevant string columns to `string[pyarrow]`:

In [67]:
%%time

ddf = ddf.astype({
    "service_names": "string[pyarrow]",
    "dispatching_base_num": "string[pyarrow]",
    "originating_base_num": "string[pyarrow]",
})

ddf = ddf.persist()
wait(ddf);

CPU times: user 485 ms, sys: 72.6 ms, total: 558 ms
Wall time: 3.05 s


In [68]:
ddf.dtypes

dispatching_base_num    string[pyarrow]
originating_base_num    string[pyarrow]
request_datetime         datetime64[ns]
dropoff_datetime         datetime64[ns]
PULocationID                      int32
DOLocationID                      int32
trip_miles                      float32
trip_time                         int32
base_passenger_fare             float32
tolls                           float32
bcf                             float32
sales_tax                       float32
congestion_surcharge            float32
airport_fee                     float32
tips                            float32
driver_pay                      float32
shared_request_flag            category
shared_match_flag              category
access_a_ride_flag             category
wav_request_flag               category
wav_match_flag                 category
accessible_vehicle                 bool
pickup_month                      int32
pickup_dow                        int32
pickup_hour                       int32


With that done, let's revisit our partition sizes to see how they've been impacted:

In [69]:
dask.utils.format_bytes(ddf.partitions[1].compute().memory_usage(deep=True).sum())

'41.40 MiB'

Nice! With PyArrow strings, our partitions are noticeably smaller, and we can once again repartition our data to land at a solid 100 MB partition size:

In [70]:
%%time

ddf = ddf.repartition("100MB")
ddf = ddf.persist()
wait(ddf);

CPU times: user 643 ms, sys: 77.3 ms, total: 720 ms
Wall time: 6.34 s


With these new data types, the analyses result in an even smaller task graph; on top of that, the improved performance of PyArrow strings means that each individual task is faster:

In [72]:
%%time

(ddf.tips != 0).mean().compute()

CPU times: user 27.8 ms, sys: 3.26 ms, total: 31 ms
Wall time: 919 ms


0.1583987882760626

In [75]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

CPU times: user 37.1 ms, sys: 4.88 ms, total: 42 ms
Wall time: 883 ms


service_names
juno    1.053204e+06
lyft    9.243762e+07
uber    1.984338e+08
via     1.486593e+06
Name: tips, dtype: float32

In [76]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

CPU times: user 47.4 ms, sys: 4.8 ms, total: 52.2 ms
Wall time: 1.1 s


service_names
juno    3.928619
lyft    4.837245
uber    4.763513
via     2.314891
Name: tips, dtype: float64

Note that as of `dask=2023.3.1`, we can skip the effort of manually recasting Python object columns to PyArrow strings by modifying the value of `dataframe.convert-string` in our Dask config:

In [None]:
# dask.config.set({"dataframe.convert-string": True});

The benefits of PyArrow strings aren't just limited to computation. By setting them as the default data type when reading in Parquet data, we can also improve the performance of I/O.
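
If you'd rather not change the setting globally, `dask.config.set` can also be used as a context manager. The sketch below only shows the option being scoped to a block; it assumes a Dask version that knows the `dataframe.convert-string` key and does not read any data.

```python
import dask

# Scope the option to a block instead of setting it globally.
with dask.config.set({"dataframe.convert-string": True}):
    inside = dask.config.get("dataframe.convert-string")

print(inside)
```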

# Summary

In this notebook, we took a look at a representative DataFrame ETL workload that could benefit from Dask.

Starting from a suboptimal place performance-wise, we explored the dashboard to find potential areas for improvement.
We then went through some basic Dask best practices that allowed us to shrink our task graph and improve the performance of individual tasks, which was reflected in both our analysis runtimes and dashboard plots.

# Additional Resources

- Repositories on GitHub:
    - Dask https://github.com/dask/dask
    - Distributed https://github.com/dask/distributed

- Documentation:
    - Dask documentation https://docs.dask.org
    - Distributed documentation https://distributed.dask.org

- If you have a Dask usage question, please ask it on the [Dask GitHub discussions board](https://github.com/dask/dask/discussions).

- If you run into a bug, feel free to file a report on the [Dask GitHub issue tracker](https://github.com/dask/dask/issues).

- If you're interested in getting involved and contributing to Dask, please check out our [contributing guide](https://docs.dask.org/en/latest/develop.html).

# Thank you!