<img src="images/dask_horizontal.svg" width="45%" alt="Dask logo">

# Performance Optimization

This notebook walks through a Dask DataFrame ETL workload. We'll demonstrate how to diagnose performance issues, utilize the Dask dashboard, and cover several common DataFrame best practices. 

## Dataset: Uber/Lyft TLC Trip Records

The New York City Taxi and Limousine Commission (TLC) collects trip information for each taxi and for-hire vehicle trip completed by licensed drivers and vehicles. Here we'll analyze a subset of the [High-Volume For-Hire Services](https://www.nyc.gov/site/tlc/businesses/high-volume-for-hire-services.page) dataset, which is a good example of an out-of-core dataset: it's too large to fit in memory on a standard laptop.

Some characteristics of the dataset:

- CSV dataset that's ~115 GB in memory
- Stored in `s3://coiled-datasets/uber-lyft-tlc-sample/csv-10/`
- In region `us-east-2`

## Cluster setup

Because the dataset is too large for a laptop, we'll create a larger Dask cluster on AWS using [Coiled](https://www.coiled.io).
(Disclaimer: some of the instructors for this tutorial are employed by Coiled.)

In [None]:
import coiled

cluster = coiled.Cluster(
    n_workers=20,
    region="us-east-2",  # start workers close to data to minimize costs
)
client = cluster.get_client()

Once we have initialized a cluster and client, we can easily view the Dask dashboard either through widgets provided by [dask-labextension](https://github.com/dask/dask-labextension), or by visiting the dashboard URL directly:

In [None]:
client

Using `dask.dataframe.read_csv()`, we can lazily read this data in and do some low-level exploration before performing more complex computations:

In [None]:
%%time

import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://coiled-datasets/uber-lyft-tlc-sample/csv-0.2-10/*", 
    dtype={"wav_match_flag": "category"},
)

In [None]:
ddf.dtypes

After some initial exploration, we see that the columns representing on-scene and pickup times are stored as `object`s. We decide to do some feature engineering by converting these to datetimes and moving relevant date components into separate columns.

In [None]:
%%time

# Convert to datetime
ddf["on_scene_datetime"] = dd.to_datetime(ddf["on_scene_datetime"], format="mixed")
ddf["pickup_datetime"] = dd.to_datetime(ddf["pickup_datetime"], format="mixed")

# Unpack columns
ddf = ddf.assign(
    accessible_vehicle=ddf.on_scene_datetime.isnull(),
    pickup_month=ddf.pickup_datetime.dt.month,
    pickup_dow=ddf.pickup_datetime.dt.dayofweek,
    pickup_hour=ddf.pickup_datetime.dt.hour,
)
ddf = ddf.drop(columns=["on_scene_datetime", "pickup_datetime"])
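
As a quick sanity check on what these accessors return, here's a small sketch that uses Python's standard `datetime` on a single made-up timestamp (it is not taken from the dataset). pandas' `dt.dayofweek` follows the same convention as `datetime.weekday()`: Monday is 0 and Sunday is 6.

```python
from datetime import datetime

# Hypothetical pickup timestamp, for illustration only
pickup = datetime(2022, 1, 1, 13, 5)  # Saturday, 1 January 2022, 13:05

features = {
    "pickup_month": pickup.month,
    "pickup_dow": pickup.weekday(),  # Monday=0 ... Sunday=6
    "pickup_hour": pickup.hour,
}
print(features)
```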

From here, we'll do some data sanitization and improve readability:

- Normalize airport fees to non-null floats
- Remove trip time outliers
- Rename service codes to their corresponding rideshare companies

In [None]:
%%time

# Format airport_fee
ddf["airport_fee"] = ddf["airport_fee"].fillna(0)

# Remove outliers
lower_bound = 0
Q3 = ddf["trip_time"].quantile(0.75)
upper_bound = Q3 + (1.5 * (Q3 - lower_bound))
ddf = ddf.loc[(ddf["trip_time"] >= lower_bound) & (ddf["trip_time"] <= upper_bound)]

service_names = {
    "HV0002": "juno",
    "HV0005": "lyft",
    "HV0003": "uber",
    "HV0004": "via",
}

ddf["service_names"] = ddf["hvfhs_license_num"].map(service_names)
ddf = ddf.drop(columns=["hvfhs_license_num"])
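
To make the outlier rule concrete, here's a minimal sketch on made-up trip times using only the standard library. The helper name `trip_time_bounds` and the sample values are hypothetical. Note that the rule above is a variant of the usual interquartile-range fence, with the fixed `lower_bound` standing in where the textbook version would use Q1. Dask's `quantile` returns an approximation, so on real data the bound will not match `statistics.quantiles` exactly.

```python
import statistics

def trip_time_bounds(trip_times, lower_bound=0):
    """Return (lower_bound, upper_bound) using Q3 + 1.5 * (Q3 - lower_bound)."""
    q3 = statistics.quantiles(trip_times, n=4)[2]  # third quartile
    upper_bound = q3 + 1.5 * (q3 - lower_bound)
    return lower_bound, upper_bound

# Made-up trip times in seconds: one negative value and one extreme value
trip_times = [-5] + list(range(60, 1260, 60)) + [100_000]

lower, upper = trip_time_bounds(trip_times)
kept = [t for t in trip_times if lower <= t <= upper]
print(f"upper bound: {upper}, kept {len(kept)} of {len(trip_times)} trips")
```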

Now that the data is cleaned up, we can do some computations on our data.

First, let's compute the fraction of trips that included a tip:

In [None]:
%%time

(ddf.tips > 0).mean().compute()
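
Taking the mean of a boolean mask gives the fraction of `True` values, which is not the same as the mean of the column itself. A plain-Python sketch on made-up values shows the distinction:

```python
# Made-up tip amounts in dollars, for illustration only
tips = [0.0, 2.0, 0.0, 5.0]

# Mean of the boolean mask: the share of trips with a non-zero tip
tipped_fraction = sum(t > 0 for t in tips) / len(tips)

# Mean of the values themselves: the average tip amount
average_tip = sum(tips) / len(tips)

print(tipped_fraction, average_tip)
```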

We might also be interested in some metrics of tipping grouped by rideshare company:

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

# Persist when possible

Looking at the dashboard while performing the above analysis, it should become clear that whenever we compute operations on `ddf`, we must also run through all the dependent operations that read in and sanitize `ddf`, which forces several repeated computation steps.

When doing multiple computations on the same dataset, it can save both time and money to `persist()` it first. This incurs the time and cost of computing the dataset once; in exchange, future computations work with an in-memory copy of the computed data:

In [None]:
%%time

ddf = ddf.persist()

In [None]:
%%time

from distributed import wait
wait(ddf);

Now that `ddf` has been persisted, we can see that the same analysis as above can be computed much faster, with the initial creation of `ddf` no longer being included:

In [None]:
%%time

(ddf.tips > 0).mean().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

Note that the choice to persist data depends on several factors, including:

- Whether it fits into your cluster's memory
- Whether it's reused in enough computations

In general, a best practice to follow is persisting the dataset(s) you expect to use the most throughout computations.
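
The first factor can be estimated with some back-of-the-envelope arithmetic. The sketch below is only a rough guide: the helper `fits_in_cluster_memory`, the 16 GB per worker, and the 50% headroom are all assumptions made for illustration, and the real numbers depend on the instance types your cluster uses.

```python
def fits_in_cluster_memory(dataset_bytes, n_workers, memory_per_worker_bytes, headroom=0.5):
    """Return True if the dataset fits in the share of memory set aside for it.

    `headroom` is the fraction of total cluster memory we allow persisted data
    to occupy, leaving the rest for intermediate results (0.5 is an assumption).
    """
    usable = n_workers * memory_per_worker_bytes * headroom
    return dataset_bytes <= usable

GB = 10**9

# 115 GB and 20 workers come from this notebook; 16 GB per worker is hypothetical
print(fits_in_cluster_memory(115 * GB, n_workers=20, memory_per_worker_bytes=16 * GB))
```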

# Avoid repeated compute calls

When working with related results that share computations between one another, calling `compute()` on each object individually forces us to discard shared work that could otherwise be used to speed up future computations.

For example:

In [None]:
trip_frac = (ddf.tips > 0).mean()
gb_sum = ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum()
gb_mean = ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean()

Intuitively, we know that `gb_sum` and `gb_mean` both depend on `ddf.loc[lambda x: x.tips > 0].groupby("service_names")`, but calling `.compute()` on each object forces us to compute this result twice.

To compute all of these objects in parallel and compute shared parts of the computation only once, we can use [`dask.compute()`](https://docs.dask.org/en/stable/api.html#dask.compute):

In [None]:
%%time

import dask

trip_frac, gb_sum, gb_mean = dask.compute(trip_frac, gb_sum, gb_mean)
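
To see the shared work being reused, here's a toy sketch with `dask.delayed` in which a stand-in for the expensive shared step counts how often it runs. The names `load_tips` and `load_calls` are hypothetical, and the synchronous scheduler is used so the counter lives in the local process even if a distributed client is active.

```python
import dask

load_calls = {"count": 0}

@dask.delayed
def load_tips():
    # Stand-in for the expensive shared step (read + filter); counts its own runs
    load_calls["count"] += 1
    return [2.0, 5.0, 3.0]

shared = load_tips()
total = dask.delayed(sum)(shared)
n_trips = dask.delayed(len)(shared)

# Separate compute() calls: the shared step runs once per call
total.compute(scheduler="synchronous")
n_trips.compute(scheduler="synchronous")
separate_runs = load_calls["count"]

# One dask.compute() call: the graphs are merged and the shared step runs once
load_calls["count"] = 0
total_value, n_value = dask.compute(total, n_trips, scheduler="synchronous")
combined_runs = load_calls["count"]

print(separate_runs, combined_runs, total_value, n_value)
```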

# Store data efficiently

Up until this point, all of our performance optimizations have taken place after the initial reading of the data.
However, as our ability to compute increases, data access and I/O become more significant bottlenecks.
Additionally, parallel computing will often add new constraints to how you store your data, particularly around providing random access to blocks of your data that are in line with how you plan to compute on it.

## File format

[Parquet](https://parquet.apache.org) is a popular, columnar file format designed for efficient data storage and retrieval. It handles random access, metadata storage, and binary encoding well. We [recommend using Parquet](https://docs.dask.org/en/stable/dataframe-best-practices.html#use-parquet) when working with tabular data.

In [None]:
%%time

import dask.dataframe as dd

# ddf = dd.read_csv(
#     "s3://coiled-datasets/uber-lyft-tlc-sample/csv-ill/*", 
#     dtype={"wav_match_flag": "category"},
# )

ddf = dd.read_parquet("s3://coiled-datasets/uber-lyft-tlc-sample/parquet-0.2-10/")

In [None]:
ddf.dtypes

From here, we can see that the same data sanitization as earlier can be done much faster:

In [None]:
%%time

# # Convert to datetime
# ddf["on_scene_datetime"] = dd.to_datetime(ddf["on_scene_datetime"], format="mixed")
# ddf["pickup_datetime"] = dd.to_datetime(ddf["pickup_datetime"], format="mixed")

# Unpack columns
ddf = ddf.assign(
    accessible_vehicle=ddf.on_scene_datetime.isnull(),
    pickup_month=ddf.pickup_datetime.dt.month,
    pickup_dow=ddf.pickup_datetime.dt.dayofweek,
    pickup_hour=ddf.pickup_datetime.dt.hour,
)
ddf = ddf.drop(columns=["on_scene_datetime", "pickup_datetime"])

# Format airport_fee
ddf["airport_fee"] = ddf["airport_fee"].fillna(0)

# Remove outliers
lower_bound = 0
Q3 = ddf["trip_time"].quantile(0.75)
upper_bound = Q3 + (1.5 * (Q3 - lower_bound))
ddf = ddf.loc[(ddf["trip_time"] >= lower_bound) & (ddf["trip_time"] <= upper_bound)]

service_names = {
    "HV0002": "juno",
    "HV0005": "lyft",
    "HV0003": "uber",
    "HV0004": "via",
}

ddf["service_names"] = ddf["hvfhs_license_num"].map(service_names)
ddf = ddf.drop(columns=["hvfhs_license_num"])

Following best practices, we will now persist this sanitized dataset, so we no longer need to incur repeated I/O costs:

In [None]:
ddf = ddf.persist()

In [None]:
%%time

from distributed import wait
wait(ddf);

From here, analysis can continue as normal:

In [None]:
%%time

(ddf.tips > 0).mean().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

Note that since we persisted the data, the impact of the improved I/O is gone by the time we get to the analysis.
This is because at this point, the data is stored in memory with pandas objects and datatypes; how it was originally stored no longer matters.
Put differently, all analysis beyond I/O and sanitization creates an identical task graph to the previous dataset.
In the next section, we will see how to troubleshoot and optimize our analysis independent of I/O.

## Partition size

So far, we've been working with the default partition size which, for this dataset, is pretty small (~10 MB).
A small partition size results in very many partitions, which in turn results in very many tasks in our computation graphs.

When choosing a partition size, the goal is to give Dask enough to do per task that the scheduler overhead isn't taking up a disproportionate amount of time, but not so much that the workers run out of memory.
A good rule of thumb for partition sizes is between 100 MB and 1 GB per partition ([excellent blog post on this](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes)).
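
To get a feel for how partition size drives the number of partitions (and so the number of tasks), here's a rough sketch using the ~115 GB in-memory figure quoted at the top of this notebook. The helper `n_partitions` is hypothetical, and real partitions won't all be exactly the same size.

```python
def n_partitions(dataset_bytes, partition_bytes):
    """Number of partitions needed to hold the dataset at a given partition size."""
    return -(-dataset_bytes // partition_bytes)  # integer ceiling division

MB = 10**6
GB = 10**9
dataset = 115 * GB  # approximate in-memory size quoted earlier in this notebook

for size in (10 * MB, 100 * MB, 1 * GB):
    print(f"{size // MB:>5} MB partitions -> {n_partitions(dataset, size):>6} partitions")
```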

So the first step is to see what our partition size currently is:

In [None]:
import dask
dask.utils.format_bytes(ddf.partitions[0].compute().memory_usage(deep=True).sum())

Let's repartition to a bigger size.

In [None]:
%%time

ddf = ddf.repartition(partition_size="100MiB")
ddf = ddf.persist()
wait(ddf);

Note that we persist after we repartition so we don't repeat the repartitioning work every time we compute.

As a sanity check, let's check the new partition size:

In [None]:
dask.utils.format_bytes(ddf.partitions[0].compute().memory_usage(deep=True).sum())

Nice! Now let's do our analyses again.
Remember that this time, the task graph will be much smaller.
You can always inspect the graph by calling `visualize()` rather than `compute()` or by looking at the "Graph" page in the dashboard.

In [None]:
%%time

(ddf.tips > 0).mean().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

That was fast 🔥

Here we improved on the task graph by increasing the partition size, but we haven't improved the performance of the tasks themselves.
In the next section, we'll explore how changing the data type of your columns can make individual tasks more performant.

# Use efficient data types

Up until this point, we've been using the default data types inferred by Dask for most of our columns. In the case of string data, this means we are using the Python `object` type, which can be slow to process:

In [None]:
ddf.dtypes

Recent versions of [Dask and pandas have improved support for PyArrow data types, most notably PyArrow strings](https://medium.com/coiled-hq/pyarrow-strings-in-dask-dataframes-55a0c4871586), which are faster and more memory efficient than Python `object`s.
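
A rough way to see where the memory savings come from is to compare the per-value cost of a Python string object with a packed layout. The sketch below is a simplification under stated assumptions: it is CPython-specific, it models the packed layout as raw UTF-8 bytes plus one 4-byte offset per value, and it ignores everything else about how Arrow actually stores arrays.

```python
import sys

value = "HV0003"  # a service code like those in `hvfhs_license_num`

# As a Python object: the str object itself plus an 8-byte pointer to it
object_bytes = sys.getsizeof(value) + 8

# In a packed layout (assumed): the UTF-8 bytes plus one 4-byte offset
packed_bytes = len(value.encode("utf-8")) + 4

print(f"object: {object_bytes} bytes, packed: {packed_bytes} bytes")
```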

Let's enjoy some of the benefits of PyArrow strings by casting relevant string columns to `string[pyarrow]`:

In [None]:
%%time

ddf = ddf.astype({
    "service_names": "string[pyarrow]",
    "dispatching_base_num": "string[pyarrow]",
    "originating_base_num": "string[pyarrow]",
})

ddf = ddf.persist()
wait(ddf);

In [None]:
ddf.dtypes

With that done, let's revisit our partition sizes to see how they've been impacted:

In [None]:
dask.utils.format_bytes(ddf.partitions[1].compute().memory_usage(deep=True).sum())

Nice! With PyArrow strings, our partitions are noticeably smaller, and we can once again repartition our data to land at a solid 100 MB partition size:

In [None]:
%%time

ddf = ddf.repartition(partition_size="100MB")
ddf = ddf.persist()
wait(ddf);

With these new data types, we can now see that the analyses result in an even smaller task graph; on top of that, the improved performance of the PyArrow strings means that each individual task is more performant:

In [None]:
%%time

(ddf.tips > 0).mean().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.sum().compute()

In [None]:
%%time

ddf.loc[lambda x: x.tips > 0].groupby("service_names").tips.mean().compute()

Note that as of `dask=2023.3.1`, we can skip the effort of manually recasting Python `object` columns to PyArrow strings by modifying the value of `dataframe.convert-string` in our Dask config:

In [None]:
# dask.config.set({"dataframe.convert-string": True});
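
If you'd rather not change the setting globally, `dask.config.set` can also be used as a context manager, in which case the previous value is restored when the block exits. A minimal sketch that only reads the value back (it does not load any data):

```python
import dask

# Scope the setting to a block; the previous value is restored on exit
with dask.config.set({"dataframe.convert-string": True}):
    inside = dask.config.get("dataframe.convert-string")

print(inside)
```

My understanding is that the option takes effect when a DataFrame is created, so the `read_parquet` or `read_csv` call would need to sit inside the `with` block; treat that as an assumption to check against your Dask version.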

The benefits of PyArrow strings aren't just limited to computation. By setting them as the default data type when reading in Parquet data, we can also improve the performance of I/O.

# Summary

In this notebook, we took a look at a representative DataFrame ETL workload that could benefit from Dask.

Starting from a suboptimal place performance-wise, we explored the dashboard to find potential areas for improvement.
We then went through some basic Dask best practices that allowed us to shrink our task graph and improve the performance of individual tasks, which was reflected both in our analysis runtimes and dashboard plots.

# Additional Resources

- Repositories on GitHub:
    - Dask https://github.com/dask/dask
    - Distributed https://github.com/dask/distributed

- Documentation:
    - Dask documentation https://docs.dask.org
    - Distributed documentation https://distributed.dask.org

- If you have a Dask usage question, please ask it on the [Dask GitHub discussions board](https://github.com/dask/dask/discussions).

- If you run into a bug, feel free to file a report on the [Dask GitHub issue tracker](https://github.com/dask/dask/issues).

- If you're interested in getting involved and contributing to Dask, please check out our [contributing guide](https://docs.dask.org/en/latest/develop.html).

# Thank you!