<img src="images/coiled-logo.svg"
     align="right"
     width="5%"
     alt="Coiled logo">

### Sign up for the next live session https://www.coiled.io/tutorials

<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">

# Get better at Dask Dataframes

In this lesson, you will learn the advantages of working with the parquet data format and best practices when working with big data. You will learn how to manipulate inconvenient file sizes and datatypes, as well as how to make your data easier to manipulate. You will be exploring the Uber/Lyft dataset and learning some key practices of feature engineering with Dask Dataframes.

## Dask Dataframes 

<img src="https://docs.dask.org/en/stable/_images/dask-dataframe.svg"
     align="right"
     width="30%"
     alt="Dask DataFrame is composed of pandas DataFrames"/>

At its core, the `dask.dataframe` module implements a "blocked parallel" `DataFrame` object that looks and feels like the `pandas` API, but for parallel and distributed workflows. One Dask `DataFrame` is comprised of many in-memory pandas `DataFrame`s separated along the index. One operation on a Dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.
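
To make the "blocked" idea concrete, here is a minimal sketch in plain pandas. It is only an illustration of the concept, not how Dask is implemented; the toy data and the `block_size` value are assumptions chosen for the example.

```python
import pandas as pd

# A small pandas DataFrame standing in for a large dataset (toy data).
pdf = pd.DataFrame({"x": range(10)}, index=range(10))

# Split it along the index into blocks, similar to Dask partitions.
block_size = 4  # hypothetical block size
blocks = [pdf.iloc[i : i + block_size] for i in range(0, len(pdf), block_size)]

# One logical operation (the mean of "x") becomes one pandas operation
# per block, followed by a cheap combination step.
partial_sums = [block["x"].sum() for block in blocks]
partial_counts = [block["x"].count() for block in blocks]
blocked_mean = sum(partial_sums) / sum(partial_counts)

print(len(blocks))   # 3
print(blocked_mean)  # 4.5
```

Each block fits in memory on its own and the per-block work is independent, which is what makes it possible to run the blocks in parallel.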

Dask DataFrames are very useful, but getting the most out of them can be tricky. Where your data is stored, the format it was saved in, the size of each file, and the data types are some of the things you need to care about when working with DataFrames.

### Work close to your data

When your data lives in the cloud, work as close to it as possible to minimize the cost of network IO.

In this lesson, we will use Coiled clusters created in the same region where our datasets are stored (`"us-east-2"`).

**NOTE:**
If you do not have access to a Coiled cluster, you can still follow along; just make sure you use the smaller datasets (the `"0.5GB-"` ones).


## Parquet vs CSV

Most people are familiar with **csv** files, but when it comes to working with data, **parquet** can make a big difference.

### Parquet is where it's at!!

The Parquet file format is column-oriented and it is designed to efficiently store and retrieve data. Columnar formats provide better compression and improved performance, and enable you to query data column by column. Consequently, aggregation queries are faster compared to row-oriented storage.

<img src="https://raw.githubusercontent.com/coiled/dask-tutorial/main/images/storage-files.png"
     align="right"
     width="50%"
     alt="Comparison of file storage formats"/>
     
     
- **Column pruning:** Parquet lets you read specific columns from a dataset without reading the entire file.
- **Better compression:** Values within a column share a data type and tend to be similar, so each column compresses well, which saves storage.
- **Schema:** Parquet stores the file schema in the file metadata.
- **Column metadata:** Parquet stores metadata statistics for each column, which can make certain types of queries a lot more efficient.
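
The difference between the two layouts can be sketched with plain Python containers. This is a toy model only (it does not read or write real Parquet files); the column names `id1` and `v1` mirror the benchmark data used below, while `v2` and all values are made up for illustration.

```python
# Row-oriented layout (like CSV): each record keeps all of its fields together.
rows = [
    {"id1": "id001", "v1": 1, "v2": 2.5},
    {"id1": "id002", "v1": 4, "v2": 0.5},
    {"id1": "id001", "v1": 3, "v2": 1.0},
]

# Column-oriented layout (like Parquet): each column is stored contiguously.
columns = {
    "id1": ["id001", "id002", "id001"],
    "v1": [1, 4, 3],
    "v2": [2.5, 0.5, 1.0],
}

# Summing v1 from rows has to walk every record, including unused fields.
total_from_rows = sum(record["v1"] for record in rows)

# Summing v1 from columns touches a single list: this is column pruning.
total_from_columns = sum(columns["v1"])

# Per-column statistics, similar in spirit to Parquet column metadata.
v1_stats = {"min": min(columns["v1"]), "max": max(columns["v1"])}

print(total_from_rows, total_from_columns)  # 8 8
print(v1_stats)  # {'min': 1, 'max': 4}
```

With statistics like `v1_stats` stored next to the data, a reader can skip a whole chunk when a filter such as `v1 > 10` cannot match anything in it.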

    
### Small motivation example: 

Let's compare reading the same data stored as `csv` files in one case and as `parquet` files in the other.

**Note - Windows Users**

Unless you are using WSL, you will need to go to a command prompt or PowerShell window within an environment that includes coiled and run the following command from there.


In [None]:
### coiled login
#!coiled login --token ### --account dask-tutorials

In [None]:
import coiled
import dask
import dask.dataframe as dd
from dask.distributed import Client

In [None]:
# we use this to avoid re-using clusters on a team
import uuid

id_cluster = uuid.uuid4().hex[:4]

In [None]:
%%time
cluster = coiled.Cluster(
    n_workers=10,
    name=f"nyc-uber-lyft-{id_cluster}",
    account="dask-tutorials",
    worker_vm_types=["r6i.2xlarge"],
    backend_options={"region_name": "us-east-2"},
)

In [None]:
client = Client(cluster)

In [None]:
client

In [None]:
# data dictionary
data = {
    "5GB-csv": "s3://coiled-datasets/h2o-benchmark/N_1e8_K_1e2/*.csv",
    "5GB-pq": "s3://coiled-datasets/h2o-benchmark/N_1e8_K_1e2_parquet/*.parquet",
}

In [None]:
ddf_csv = dd.read_csv(data["5GB-csv"], storage_options={"anon": True})
ddf_pq = dd.read_parquet(data["5GB-pq"], storage_options={"anon": True})

In [None]:
ddf_csv

In [None]:
ddf_pq

In [None]:
%%time
ddf_csv.groupby("id1").agg({"v1": "sum"}).compute()

In [None]:
%%time
ddf_pq.groupby("id1").agg({"v1": "sum"}).compute()

### Memory usage 

Notice that, without any extra effort, the `parquet` version is already ~7X faster. Let's take a look at the memory usage as well as the `dtypes` in both cases.

In [None]:
## memory usage for 1 partition
ddf_csv.partitions[0].memory_usage(deep=True).compute().apply(dask.utils.format_bytes)

In [None]:
ddf_pq.partitions[0].memory_usage(deep=True).compute().apply(dask.utils.format_bytes)

## Uber/Lyft data transformation

In the example above we saw that the format in which the data is stored already makes a big difference.

**Working with parquet** 

Let's use the Uber/Lyft dataset as an example of a `parquet` dataset to learn how to troubleshoot the nuances of working with real data. The data comes from [High-Volume For-Hire Services](https://www.nyc.gov/site/tlc/businesses/high-volume-for-hire-services.page).

_Data dictionary:_

https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_hvfhs.pdf

In [None]:
# inspect data
import s3fs

s3 = s3fs.S3FileSystem()
files = s3.glob("nyc-tlc/trip data/fhvhv_tripdata_*.parquet")
files[:3]

In [None]:
len(files)

### Let's get a cluster 

From experience we know that we will need a cluster where the workers have plenty of memory. 

**Inspect the data**

In [None]:
ddf = dd.read_parquet(
    "s3://nyc-tlc/trip data/fhvhv_tripdata_*.parquet",
    # storage_options={"anon": True},  # needed if on binder
)
ddf

In [None]:
# inspect dtypes
ddf.dtypes

In [None]:
%%time
# inspect memory usage of 1 partition
ddf.partitions[0].memory_usage(deep=True).compute().apply(dask.utils.format_bytes)

### Challenges:

- Big partitions
- Inefficient data types

### Recommendations and best practices:

**Partition size**

In general, we recommend starting with partitions that are on the order of ~100MB (in memory). However, the right partition size can vary depending on the worker memory that you have available.

For documentation on partition sizes visit the [repartition docs](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.repartition.html) as well as the repartition section in the Dask Dataframe [best practices](https://docs.dask.org/en/stable/dataframe-best-practices.html#repartition-to-reduce-overhead)
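
As a rough back-of-the-envelope check, you can estimate how many partitions a target size implies. The helper below is a hypothetical sketch, not part of Dask, and the ~80 GB total is an assumed figure used only for the arithmetic.

```python
import math


def estimate_npartitions(total_bytes: int, target_partition_bytes: int) -> int:
    """Return how many partitions of roughly the target size are needed."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))


total_bytes = 80 * 10**9  # assumed dataset size: ~80 GB in memory
target_partition_bytes = 100 * 10**6  # ~100 MB per partition

npartitions = estimate_npartitions(total_bytes, target_partition_bytes)
print(npartitions)  # 800
```

If that count looks too high or too low for your cluster, adjust the target size rather than aiming for exactly 100MB.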

**Data Types**

- Avoid object types for strings: use `"string[pyarrow]"`
- Reduce int/float representation if possible
- Use categorical dtypes when possible (avoid high cardinality).
- Consider using Nullable dtypes (very new/experimental)
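
The effect of these recommendations can be seen on a small pandas DataFrame. This is a sketch with made-up data and hypothetical column names; it leaves out `"string[pyarrow]"` so that it runs without `pyarrow` installed.

```python
import pandas as pd

n = 10_000
pdf = pd.DataFrame(
    {
        "flag": ["Y", "N"] * (n // 2),  # low-cardinality strings
        "fare": [12.5, 7.25] * (n // 2),  # float64 by default
        "passengers": [1, 2] * (n // 2),  # int64 by default
    }
)
before = pdf.memory_usage(deep=True)

converted = pdf.astype(
    {
        "flag": pd.CategoricalDtype(categories=["Y", "N"]),
        "fare": "float32",
        "passengers": "int32",
    }
)
after = converted.memory_usage(deep=True)

# Compare bytes per column before and after the conversion.
print(pd.DataFrame({"before": before, "after": after}))
```

Downcasting halves the numeric columns, and the categorical column shrinks the most because each value is stored as a small integer code. Only downcast when the smaller type can still represent your values with acceptable precision.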

### Create conversions dictionary

Based on these recommendations, let's work on better `dtypes`

In [None]:
import pandas as pd

In [None]:
conversions = {}
for column, dtype in ddf.dtypes.items():
    if dtype == "object":
        conversions[column] = "string[pyarrow]"
    if dtype == "float64":
        conversions[column] = "float32"
    if dtype == "int64":
        conversions[column] = "int32"
    if "flag" in column:
        conversions[column] = pd.CategoricalDtype(categories=["Y", "N"])
    if column == "airport_fee":
        # this column holds floats; the <NA> values make it an object dtype
        conversions[column] = "float32"
conversions

In [None]:
# use the new dtypes; this takes a bit of time
ddf = ddf.astype(conversions)
ddf = ddf.persist()

In [None]:
ddf.partitions[0].memory_usage(deep=True).compute().apply(dask.utils.format_bytes)

In [None]:
dask.utils.format_bytes(ddf.partitions[0].memory_usage(deep=True).compute().sum())

### Repartition

In [None]:
ddf = ddf.repartition(partition_size="128MB").persist()

In [None]:
dask.utils.format_bytes(ddf.memory_usage(deep=True).compute().sum())

In [None]:
ddf.npartitions

In [None]:
dask.utils.format_bytes(ddf.partitions[0].memory_usage(deep=True).compute().sum())

### Other repartition options 

Sometimes, repartitioning by size is not convenient for your use case. If you have a timeseries with a datetime index, you can also repartition by a period of time. For example, if you need one partition per day (`1d`), you can do:

```python
ddf = ddf.set_index("request_datetime")
ddf = ddf.repartition(freq="1d")
```

**Note:**
Read more about repartition in the [dask documentation on this feature](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.repartition.html#dask-dataframe-dataframe-repartition)

### Save data to an S3 bucket

In [None]:
# creds to be provided in live.
s3_storage_options = {"key": "***", "secret": "***"}

In [None]:
usr_id = "your_name"

In [None]:
ddf.to_parquet(
    f"s3://dask-tutorials-datasets/{usr_id}/",
    storage_options=s3_storage_options,
)

In [None]:
cluster.shutdown()
client.close()

## Let's do some data analysis

At this stage, our whole dataset is ~80GB in memory. When exploring data we do not necessarily need the whole dataset: we can work with a sample and select only a subset of columns. One of the beauties of the parquet file format is **column pruning**.

Note: Keep in mind that feature engineering will increase the size of your data, so having extra memory can help.

### Read data back

After you save your data, you will want to read it back to do some data analysis or train a model. When reading data back, there are some caveats regarding the `dtypes`.

- **Round-tripping for the string pyarrow dtype** is not yet supported in pandas/dask. Hence, when you read your data back, you need to tell pandas/dask to cast those columns to `"string[pyarrow]"`; otherwise they will be `"string[python]"`.
- **Nullable dtypes:** Using nullable dtypes is a fairly new feature and still under development, consider this experimental. Available in `dask >= 2022.12.0`

**What are nullable dtypes?**

Pandas (hence Dask) primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. 

Nullable dtypes allow you to work around this issue.
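
A minimal pandas sketch of the difference, using made-up identifier values:

```python
import pandas as pd

# With the default dtype, a missing value forces the integers to float.
ids_default = pd.Series([101, 102, None])
print(ids_default.dtype)  # float64

# With the nullable "Int64" dtype, the values stay integers and the gap is <NA>.
ids_nullable = pd.Series([101, 102, None], dtype="Int64")
print(ids_nullable.dtype)  # Int64
print(ids_nullable.isna().sum())  # 1
```

Note the capital "I" in `"Int64"`: it selects the nullable extension dtype, while lowercase `"int64"` is the NumPy dtype that cannot hold missing values.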

If you want to read more about nullable dtypes, check the pandas [missing data docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data).

NOTE: 
1. If you are in a live session you will be able to read the parquet files we stored, providing the credentials that we share with you live. 
2. If you are following this tutorial on your own the credentials will not work, but you can read a copy of the dataset we wrote, from `"s3://coiled-datasets/uber-lyft-tlc/"`

In [None]:
%%time
cluster = coiled.Cluster(
    name=f"uber-lyft-{id_cluster}",
    n_workers=15,
    account="dask-tutorials",
    worker_vm_types=["m6i.xlarge"],
    backend_options={"region_name": "us-east-2"},
)

In [None]:
client = Client(cluster)
client

In [None]:
# use the bucket you wrote to above if you are following a live session,
# or the public data uri ("s3://coiled-datasets/uber-lyft-tlc/") otherwise
file_to_read = (
    f"s3://dask-tutorials-datasets/{usr_id}/"  # replace for public uri if needed
)

In [None]:
# if reading from the public uri on binder, use storage_options={"anon": True}
df = dd.read_parquet(
    file_to_read,  # replace with "s3://coiled-datasets/uber-lyft-tlc/" if needed
    storage_options=s3_storage_options,
).astype(
    {
        "hvfhs_license_num": "string[pyarrow]",
        "dispatching_base_num": "string[pyarrow]",
        "originating_base_num": "string[pyarrow]",
    }
)

In [None]:
df.dtypes

In [None]:
df.hvfhs_license_num.dtype

## Memory usage 

```python
dask.utils.format_bytes(
    df.memory_usage(deep=True).sum().compute()
)
```
'82.81 GiB'


In [None]:
df.head()

In [None]:
df.columns

In [None]:
# len(df)
## 783_431_901

In [None]:
%%time
# count the non-missing values per column to spot NaNs
df.count().compute()

In [None]:
# Create a boolean column that is True when tips > 0
df["tip_flag"] = df.tips > 0

df_small = df[
    [
        "hvfhs_license_num",
        "tips",
        "base_passenger_fare",
        "driver_pay",
        "trip_miles",
        "trip_time",
        "shared_request_flag",
        "tip_flag",
    ]
].persist()

In [None]:
df_small.head()

In [None]:
df_small.base_passenger_fare.sum().compute() / 1e9

In [None]:
df_small.driver_pay.sum().compute() / 1e9

In [None]:
df_small.tips.sum().compute() / 1e6

In [None]:
df_small.columns

### Are New Yorkers tippers? 

Let's see how many trips include a tip, by provider.

In [None]:
tip_counts = df_small.groupby(["hvfhs_license_num"]).tip_flag.value_counts().compute()

In [None]:
tip_counts

**From the data dictionary we know:**

As of September 2019, the HVFHS licenses are the following:

- HV0002: Juno  
- HV0003: Uber  
- HV0004: Via  
- HV0005: Lyft  

In [None]:
type(tip_counts)

In [None]:
# tip_counts is already a pandas object, so this runs locally
tip_counts = tip_counts.unstack(level="tip_flag")
tip_counts / 1e6

### Percentage of total rides that tip

In [None]:
tip_counts[True] * 100 / (tip_counts[True] + tip_counts[False])
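
To see what the `value_counts` and `unstack` steps do, here is the same chain on a tiny pandas DataFrame. The five rows are made up; only the column names and license codes come from the dataset.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "hvfhs_license_num": ["HV0003", "HV0003", "HV0003", "HV0005", "HV0005"],
        "tip_flag": [True, False, False, True, False],
    }
)

# A Series indexed by (hvfhs_license_num, tip_flag) holding the counts ...
counts = toy.groupby("hvfhs_license_num").tip_flag.value_counts()

# ... reshaped so that each tip_flag value becomes its own column.
counts = counts.unstack(level="tip_flag")

# Share of trips with a tip, per provider, in percent.
pct_tipped = counts[True] * 100 / (counts[True] + counts[False])
print(pct_tipped)
```

In this toy data one of three `HV0003` trips and one of two `HV0005` trips has a tip, so the result is about 33.3% and 50%.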

### Sum and mean of tips by provider

In [None]:
tips_total = (
    df_small.loc[lambda x: x.tip_flag]
    .groupby("hvfhs_license_num")
    .tips.agg(["sum", "mean"])
    .compute()
)
tips_total

In [None]:
provider = {"HV0002": "Juno", "HV0005": "Lyft", "HV0003": "Uber", "HV0004": "Via"}

In [None]:
tips_total = tips_total.assign(provider=lambda df: df.index.map(provider)).set_index(
    "provider"
)
tips_total

### What percentage of the passenger fare is the tip

### Exercise
- Create a new column named "tip_percentage" that represents what fraction of the passenger fare the tip is

In [None]:
# solution
tip_percentage = df_small.tips / df_small.base_passenger_fare
df_small["tip_percentage"] = tip_percentage

In [None]:
df_small = df_small.persist()

## Mean tip percentage of trips with a tip

In [None]:
tips_perc_mean = (
    df_small.loc[lambda x: x.tip_flag]
    .groupby("hvfhs_license_num")
    .tip_percentage.mean()
    .compute()
)
tips_perc_mean

In [None]:
(tips_perc_mean.to_frame().set_index(tips_perc_mean.index.map(provider)))

### Base passenger fare per mile, by provider


In [None]:
dollars_per_mile = df_small.base_passenger_fare / df_small.trip_miles
df_small["dollars_per_mile"] = dollars_per_mile
df_small = df_small.persist()

In [None]:
(
    df_small.groupby("hvfhs_license_num")
    .dollars_per_mile.agg(["min", "max", "mean", "std"])
    .compute()
)

In [None]:
# filter: check only trips with tip
(
    df_small.loc[lambda x: x.tip_flag]
    .groupby("hvfhs_license_num")
    .dollars_per_mile.agg(["min", "max", "mean", "std"])
    .compute()
)

### Get insight on the data

We are seeing weird numbers. Let's take a deeper look and remove some outliers.
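
One possible source of such numbers, assuming some trips record `trip_miles` of 0, is division by zero. The sketch below uses made-up values to show how pandas handles it.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "base_passenger_fare": [10.0, 8.0, 0.0],
        "trip_miles": [2.0, 0.0, 0.0],
    }
)

# Float division by zero does not raise: it yields inf, and 0 / 0 yields NaN.
dollars_per_mile = toy.base_passenger_fare / toy.trip_miles
print(dollars_per_mile.tolist())  # [5.0, inf, nan]

# A single inf is enough to dominate aggregations such as max.
print(dollars_per_mile.max())  # inf
```

Whether this is what happens in the real dataset is something to verify with `describe()`, as done next.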

In [None]:
(
    df_small[["trip_miles", "base_passenger_fare", "tips", "tip_flag"]]
    .loc[lambda x: x.tip_flag]
    .describe()
    .compute()
)

### Getting to know the data

- How would you get more insights on the data?
- Can you visualize it?

**Hint:** Get a small sample, like 0.1% of the data to plot ~700_000 rows (go smaller if needed depending on your machine), compute it and work with that pandas dataframe.

In [None]:
# needed to avoid plots from breaking
%matplotlib inline

In [None]:
## Take a sample
df_tiny = (
    df_small.loc[lambda x: x.tip_flag][["trip_miles", "base_passenger_fare", "tips"]]
    .sample(frac=0.001)
    .compute()
)

In [None]:
# box plot
df_tiny.boxplot()

### Cleaning up outliers

- Play with the pandas dataframe `df_tiny` to get insights on good filters for the bigger dataframe. 

Hint: think about pandas dataframe quantiles [docs here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html)

In [None]:
df_tiny.tips.quantile([0.25, 0.75])

### Exercise

- Calculate the first and third quantiles for `base_passenger_fare` and `trip_miles`

In [None]:
# solution
df_tiny.base_passenger_fare.quantile([0.25, 0.75])

In [None]:
# solution
df_tiny.trip_miles.quantile([0.25, 0.75])

### Conditions to filter the dataset

We can use Q1 and Q3 to create conditions to filter the dataset.

In [None]:
tips_filter_vals = df_tiny.tips.quantile([0.25, 0.75]).values
tips_condition = df_tiny.tips.between(*tips_filter_vals)

In [None]:
tips_condition
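
To see exactly which rows such a condition keeps, here is the same pattern on a five-value toy Series (the values are made up, with 100.0 as an obvious outlier).

```python
import pandas as pd

tips = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# First and third quartiles of the toy data.
q1, q3 = tips.quantile([0.25, 0.75]).values
print(q1, q3)  # 2.0 4.0

# between() is inclusive on both ends by default.
mask = tips.between(q1, q3)
kept = tips[mask]
print(kept.tolist())  # [2.0, 3.0, 4.0]
```

Note that this filter removes the outlier but also the low value 1.0: keeping only the range from Q1 to Q3 retains roughly the middle half of the data, which is a deliberately aggressive choice for this exploration.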

### Exercise

- Create filter conditions for the `base_passenger_fare` and `trip_miles`

In [None]:
## Solution
fare_filter_vals = df_tiny.base_passenger_fare.quantile([0.25, 0.75]).values
fares_condition = df_tiny.base_passenger_fare.between(*fare_filter_vals)

miles_filter_vals = df_tiny.trip_miles.quantile([0.25, 0.75]).values
miles_condition = df_tiny.trip_miles.between(*miles_filter_vals)

### Filter dataframe and plot

In [None]:
# solution
df_tiny.loc[(tips_condition & fares_condition) & miles_condition].boxplot()

## Filtering our big dataset based on the insights

Based on these numbers let's go back to our `df_small` dataset and try to filter it.

**Note:**

Sometimes, when you filter after doing feature engineering, you might get a "divisions not known" error.
If that's the case, you can do:

```python
df_small = df_small.reset_index()
df_small = (df_small
            .set_index("column_to_be_the_index")
            .persist()
           )
```

In [None]:
tips_condition = df_small.tips.between(*tips_filter_vals)
miles_condition = df_small.trip_miles.between(*miles_filter_vals)
fares_condition = df_small.base_passenger_fare.between(*fare_filter_vals)

In [None]:
df_small = df_small.loc[(tips_condition & fares_condition) & miles_condition].persist()

### Stats on `dollars_per_mile`

In [None]:
(
    df_small.groupby("hvfhs_license_num")
    .dollars_per_mile.agg(["min", "max", "mean", "std"])
    .compute()
)

### Let's look at the `tip_percentage` again

### Exercise 
- Compute the `tip_percentage` mean by provider 

In [None]:
#Solution
tips_perc_avg = df_small.groupby("hvfhs_license_num").tip_percentage.mean().compute()
tips_perc_avg

In [None]:
(tips_perc_avg.to_frame().set_index(tips_perc_avg.index.map(provider)))

In [None]:
len(df_small)

### Average trip time by provider

In [None]:
trips_time_avg = (
    df_small.groupby("hvfhs_license_num")
    .trip_time.agg(["min", "max", "mean", "std"])
    .compute()
)
trips_time_avg

### In minutes

In [None]:
trips_time_avg.set_index(trips_time_avg.index.map(provider)) / 60

## What we've learned
- Most New Yorkers do not tip
- Of those who do tip, it is common to tip around 20% regardless of the provider, except for Via riders, who tend to tip slightly less.
- The trip_time column needs some cleaning of outliers. 

In [None]:
cluster.shutdown()
client.close()

### Useful links

* [Dask tutorial: DataFrames](https://tutorial.dask.org/01_dataframe.html)
* [DataFrames documentation](https://docs.dask.org/en/stable/dataframe.html)
* [Dataframes and parquet](https://docs.dask.org/en/stable/dataframe-parquet.html)
* [Dataframes examples](https://examples.dask.org/dataframe.html)

### Other lesson

Register [here](https://www.coiled.io/tutorials) for reminders. 

We have another lesson, where we’ll parallelize a custom Python workflow that scrapes, parses, and cleans data from Stack Overflow. We’ll get to:

- Learn how to do arbitrary task scheduling using the Dask Futures API
- Utilize blocking and non-blocking distributed calculations

By the end, we’ll see how much faster this workflow is using Dask and how the Dask Futures API is particularly well-suited for this type of fine-grained execution.
