<img src="images/dask_horizontal.svg" width="45%" alt="Dask logo">

# Optimizing Dask Workloads

This notebook walks through a common Dask ETL workload and demonstrates how to diagnose and resolve performance issues using the dashboard. We provide a motivating example of where a user may need Dask, and outline some common dos and don'ts of Dask dataframe operations.

## Cluster setup

Because the data we are using resides in AWS S3, we will spin up a [Coiled cluster](https://docs.coiled.io/user_guide/clusters/index.html) in the same region to minimize I/O costs:

In [None]:
import coiled

cluster = coiled.Cluster(n_workers=50, region="us-east-2")  # start workers close to data to minimize costs
client = cluster.get_client()

Once we have initialized a cluster and client, we can easily view the Dask dashboard either through widgets provided by [dask-labextension](https://github.com/dask/dask-labextension), or by visiting the dashboard URL directly:

In [None]:
client

## Motivation: TLC Trip Records

The New York City Taxi and Limousine Commission (TLC) collects trip record information for each taxi and for-hire vehicle trip completed by licensed drivers and vehicles; a subset of this data (~60 GB) stored on S3 provides a good example of an out-of-core dataset that we would otherwise be unable to explore on a standard laptop.

Using `dask.dataframe.read_csv`, we can lazily read this data in and do some low-level exploration before performing more complex computations:

In [None]:
%%time

import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://coiled-datasets/uber-lyft-tlc-sample/csv-10/*", 
    # "s3://coiled-datasets/uber-lyft-tlc-sample/csv-ill/*", 
    dtype={"wav_match_flag": "category"},  # read this flag column as a categorical
)

After some initial exploration, we see that the columns representing on-scene and pickup times are stored as strings. Using `dd.to_datetime`, we can convert these columns to datetimes and extract the relevant date components into their own new columns:

In [None]:
%%time

# Convert to datetime
ddf["on_scene_datetime"] = dd.to_datetime(ddf["on_scene_datetime"], format="mixed")
ddf["pickup_datetime"] = dd.to_datetime(ddf["pickup_datetime"], format="mixed")

# Unpack columns
ddf = ddf.assign(
    accessible_vehicle=ddf.on_scene_datetime.isnull(),
    pickup_month=ddf.pickup_datetime.dt.month,
    pickup_dow=ddf.pickup_datetime.dt.dayofweek,
    pickup_hour=ddf.pickup_datetime.dt.hour,
)
ddf = ddf.drop(columns=["on_scene_datetime", "pickup_datetime"])
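
To make the derived columns concrete, here is a minimal pandas sketch of the same steps on three made-up rows. The sample values are assumptions for illustration only, and `format="mixed"` assumes pandas 2.0 or later.

```python
import pandas as pd

# Made-up sample values; the last on-scene time is missing.
sample = pd.DataFrame({
    "on_scene_datetime": ["2022-01-03 08:10:00", "2022-01-08 23:40:00", None],
    "pickup_datetime": ["2022-01-03 08:15:00", "2022-01-08 23:45:00", "2022-02-14 17:05:00"],
})

# Parse the string columns into datetimes (missing values become NaT).
on_scene = pd.to_datetime(sample["on_scene_datetime"], format="mixed")
pickup = pd.to_datetime(sample["pickup_datetime"], format="mixed")

# Mirror the columns derived in the cell above.
features = pd.DataFrame({
    "accessible_vehicle": on_scene.isnull(),
    "pickup_month": pickup.dt.month,
    "pickup_dow": pickup.dt.dayofweek,  # Monday is 0, Sunday is 6
    "pickup_hour": pickup.dt.hour,
})
print(features)
```

The Dask version behaves the same way per partition, except that nothing is evaluated until a result is requested.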

From here, we sanitize the data and make it easier to read:

- Normalize airport fees to non-null floats
- Remove outlier data
- Rename service codes to their corresponding rideshare companies

In [None]:

# Format airport_fee
ddf["airport_fee"] = ddf["airport_fee"].replace("None", 0).astype(float).fillna(0)

# Remove outliers
lower_bound = 0
Q3 = ddf["trip_time"].quantile(0.75)
upper_bound = Q3 + (1.5 * (Q3 - lower_bound))
ddf = ddf.loc[(ddf["trip_time"] >= lower_bound) & (ddf["trip_time"] <= upper_bound)]

# # Categorize *_flag columns
# ddf = ddf.categorize(columns=["shared_request_flag", "shared_match_flag", "access_a_ride_flag", "wav_request_flag", "wav_match_flag"])

service_names = {
    "HV0002": "juno",
    "HV0005": "lyft",
    "HV0003": "uber",
    "HV0004": "via",
}

ddf["service_names"] = ddf["hvfhs_license_num"].map(service_names)

In [None]:
ddf
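
The outlier rule used above is a variant of the usual interquartile fence in which the fixed lower bound of 0 takes the place of the first quartile. A worked example in plain Python, with made-up trip times, shows the arithmetic. Note that in the notebook `Q3` is a lazy Dask scalar and Dask's quantile is approximate, so real values will differ slightly from an exact calculation like this one.

```python
import statistics

# Made-up trip times in seconds; the last value is an obvious outlier.
trip_times = [100, 200, 300, 400, 500, 10_000]

lower_bound = 0
# The third quartile, interpolating linearly between data points.
q3 = statistics.quantiles(trip_times, n=4, method="inclusive")[2]
upper_bound = q3 + 1.5 * (q3 - lower_bound)

# Keep only trips inside [lower_bound, upper_bound].
kept = [t for t in trip_times if lower_bound <= t <= upper_bound]

print(q3)           # 475.0
print(upper_bound)  # 1187.5
print(kept)         # [100, 200, 300, 400, 500]
```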

Now that the data is cleaned up, we can do some preliminary analysis; for the sake of this tutorial, we are mostly interested in how long each computation takes to run.

One metric we may be interested in is the fraction of trips across all riders that included a tip:

In [None]:
%%time

(ddf.tips != 0).mean().compute()
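
The expression above averages a boolean comparison, so it yields the share of trips with a nonzero tip rather than an average amount. A tiny sketch in plain Python, standing in for the dataframe operations and using made-up tips, shows the difference:

```python
# Made-up tip amounts for four trips.
tips = [0.0, 2.0, 0.0, 5.0]

# Mean of a boolean comparison: share of trips with a nonzero tip.
tipped_fraction = sum(t != 0 for t in tips) / len(tips)

# Mean of the values themselves: average tip amount.
average_tip = sum(tips) / len(tips)

print(tipped_fraction)  # 0.5
print(average_tip)      # 1.75
```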

Or some metrics of tipping grouped by rideshare company:

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

## Do: persist intelligently

Looking at the dashboard while performing the above analysis, it becomes clear that every time we compute an operation on `ddf`, Dask also reruns all of the upstream operations that read in and sanitize `ddf`. This makes each operation take much longer than necessary and incurs unnecessary I/O costs. Calling `persist` computes the cleaned dataframe once and keeps the resulting partitions in cluster memory, so that later computations start from them:

In [None]:
%%time

ddf = ddf.persist()

In [None]:
%%time

from distributed import wait
wait(ddf);

In [None]:
%%time

(ddf.tips != 0).mean().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

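## Do: compute intelligently

Each `.compute()` call is submitted as its own task graph, so any work shared between results is repeated. Passing several collections to a single `dask.compute` call lets Dask merge their graphs and share the common intermediate results.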

In [None]:
trip_frac = (ddf.tips != 0).mean()
gb_sum = ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum()
gb_mean = ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean()

In [None]:
%%time

import dask

trip_frac, gb_sum, gb_mean = dask.compute(trip_frac, gb_sum, gb_mean)
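
The sharing that `dask.compute` provides is easiest to see with a counter. This minimal sketch uses `dask.delayed` and made-up functions (`load`, `total`, `largest`) in place of the notebook's dataframe, and runs on the synchronous scheduler so that the counter is updated in the local process:

```python
import dask

calls = []  # records every time the expensive step actually runs


@dask.delayed
def load():
    # Hypothetical expensive step shared by both results.
    calls.append("load")
    return list(range(10))


@dask.delayed
def total(values):
    return sum(values)


@dask.delayed
def largest(values):
    return max(values)


data = load()
t = total(data)
m = largest(data)

# Two separate computes: the shared step runs once per call.
t.compute(scheduler="synchronous")
m.compute(scheduler="synchronous")
separate_calls = len(calls)

# One combined compute: the shared step runs a single time.
calls.clear()
total_value, max_value = dask.compute(t, m, scheduler="synchronous")
shared_calls = len(calls)

print(separate_calls, shared_calls)  # 2 1
```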

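## File format

Parquet is a binary, columnar format that stores column types alongside the data, so here we read the same dataset from Parquet and skip the string-to-datetime conversion.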

In [None]:
%%time

import dask.dataframe as dd

# ddf = dd.read_csv(
#     "s3://coiled-datasets/uber-lyft-tlc-sample/csv-ill/*", 
#     dtype={"wav_match_flag": "category"},
# )

# ddf = dd.read_parquet("s3://coiled-datasets/uber-lyft-tlc-sample/parquet-ill/")
ddf = dd.read_parquet("s3://coiled-datasets/uber-lyft-tlc-sample/parquet-10/")

In [None]:
ddf.dtypes

In [None]:
%%time

# # Convert to datetime
# ddf["on_scene_datetime"] = dd.to_datetime(ddf["on_scene_datetime"], format="mixed")
# ddf["pickup_datetime"] = dd.to_datetime(ddf["pickup_datetime"], format="mixed")

# Unpack columns
ddf = ddf.assign(
    accessible_vehicle=ddf.on_scene_datetime.isnull(),
    pickup_month=ddf.pickup_datetime.dt.month,
    pickup_dow=ddf.pickup_datetime.dt.dayofweek,
    pickup_hour=ddf.pickup_datetime.dt.hour,
)
ddf = ddf.drop(columns=["on_scene_datetime", "pickup_datetime"])

# Format airport_fee
ddf["airport_fee"] = ddf["airport_fee"].replace("None", 0).astype(float).fillna(0)

# Remove outliers
lower_bound = 0
Q3 = ddf["trip_time"].quantile(0.75)
upper_bound = Q3 + (1.5 * (Q3 - lower_bound))
ddf = ddf.loc[(ddf["trip_time"] >= lower_bound) & (ddf["trip_time"] <= upper_bound)]

# # Categorize *_flag columns
# ddf = ddf.categorize(columns=["shared_request_flag", "shared_match_flag", "access_a_ride_flag", "wav_request_flag", "wav_match_flag"])

service_names = {
    "HV0002": "juno",
    "HV0005": "lyft",
    "HV0003": "uber",
    "HV0004": "via",
}

ddf["service_names"] = ddf["hvfhs_license_num"].map(service_names)

In [None]:
ddf = ddf.persist()

In [None]:
%%time

from distributed import wait
wait(ddf);

In [None]:
%%time

(ddf.tips != 0).mean().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

Since we persisted the data, the impact of the improved I/O is gone by the time we get to the computations. This is because at this point the data is held in pandas objects with pandas data types, so how it was originally stored no longer matters. To put it a different way, we have exactly the same task graph as we had in the previous section. In the next section we will see how to change that task graph.

## Do: choose a reasonable partition size

So far we've been working with the default partition size, which in this case is pretty small. Since the data has a fixed size, a small partition size means many partitions, and many partitions mean even more tasks, since every partition results in at least one task.

The goal is to give Dask enough to do per task so that the scheduler overhead isn't taking up a disproportionate amount of time, but not so much that the workers run out of memory. A good rule of thumb for partition sizes is between 100MB and 1GB per partition (see [this excellent blog post](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes)).
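
A quick back-of-the-envelope calculation shows how strongly partition size drives partition count. The sketch below treats the roughly 60 GB figure quoted earlier as the in-memory size, which is an assumption (the in-memory footprint of a CSV dataset usually differs from its size on disk), and `partition_count` is a hypothetical helper written for this illustration.

```python
def partition_count(total_bytes, partition_bytes):
    # Ceiling division: a final, partially filled partition still counts.
    return (total_bytes + partition_bytes - 1) // partition_bytes


MB = 10**6
GB = 10**9
dataset_bytes = 60 * GB  # assumed in-memory size of the dataset

counts = {
    "10MB": partition_count(dataset_bytes, 10 * MB),
    "100MB": partition_count(dataset_bytes, 100 * MB),
    "1GB": partition_count(dataset_bytes, 1 * GB),
}
print(counts)  # {'10MB': 6000, '100MB': 600, '1GB': 60}
```

Since every partition produces at least one task per operation, moving from 10MB to 100MB partitions cuts the task count for each step by roughly a factor of ten.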

So the first step is to see what our partition size currently is:

In [None]:
import dask
dask.utils.format_bytes(ddf.partitions[0].compute().memory_usage(deep=True).sum())

That is small! Now we can repartition to a bigger size. 

In [None]:
%%time

ddf = ddf.repartition(partition_size="100MiB")  # the first positional argument is `divisions`, so pass the size by keyword
ddf = ddf.persist()
wait(ddf);

Note: we persist after we repartition because we don't want to be doing that repartitioning work every time we call compute.

Let's check that that worked.

In [None]:
dask.utils.format_bytes(ddf.partitions[0].compute().memory_usage(deep=True).sum())

Ok! Now let's do our calculations again. Remember that this time the task graph will have many fewer nodes. You can always inspect the graph by calling `.visualize()` rather than `.compute()` or by looking at the "Graph" page in the dashboard.

In [None]:
%%time

(ddf.tips != 0).mean().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

Woohoo! That was fast! Ok, last section. We have improved on the task graph by changing the partition size, but we haven't improved the performance of the tasks themselves. In this next section we'll explore how changing the data type of your columns can make individual tasks more performant.

## Data types

Look at a dashboard plot (for example, the GIL contention plot) that demonstrates we could benefit from PyArrow strings.

In [None]:
ddf.dtypes

In [None]:
ddf = ddf.astype({
    "service_names": "string[pyarrow]",
    "hvfhs_license_num": "string[pyarrow]",
    "dispatching_base_num": "string[pyarrow]",
    "originating_base_num": "string[pyarrow]",
})

In [None]:
%%time

ddf = ddf.persist()
wait(ddf);

In [None]:
ddf.dtypes

In [None]:
dask.utils.format_bytes(ddf.partitions[1].compute().memory_usage(deep=True).sum())

In [None]:
# dask.config.set({"dataframe.convert-string": True});

In [None]:
%%time

ddf = ddf.repartition(partition_size="100MB")  # the first positional argument is `divisions`, so pass the size by keyword
ddf = ddf.persist()
wait(ddf);

In [None]:
%%time

(ddf.tips != 0).mean().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

## Conclusions

- Explored Dask dashboard
    - Saw plots that are available
    - Learned to interpret them
- Best practices
    - File format, partition size, data types, etc.