<img src="images/coiled-logo.svg"
     align="right"
     width="5%"
     alt="Coiled logo">

### Sign up for the next live session https://www.coiled.io/tutorials

<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">

# Get better at Dask Dataframes

In this lesson, you will learn the advantages of the Parquet data format and best practices for working with big data. You will learn how to handle inconvenient file sizes and data types, and how to make your data easier to work with. You will explore the Uber/Lyft dataset and learn some key feature engineering practices with Dask DataFrames.

## Dask Dataframes 

<img src="https://docs.dask.org/en/stable/_images/dask-dataframe.svg"
     align="right"
     width="30%"
     alt="Dask DataFrame is composed of pandas DataFrames"/>

At its core, the `dask.dataframe` module implements a "blocked parallel" `DataFrame` object that looks and feels like the `pandas` API, but for parallel and distributed workflows. One Dask `DataFrame` is comprised of many in-memory pandas `DataFrame`s separated along the index. One operation on a Dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.

Dask DataFrames are very useful, but getting the most out of them can be tricky. Where your data is stored, the format it was saved in, the size of each file, and the data types are some of the things you need to care about when working with DataFrames.

### Work close to your data

When your data lives in the cloud, it is always better to run your computation close to it, to minimize the cost of moving data over the network.

In this lesson, we will use Coiled clusters created in the same region where our datasets are stored (`"us-east-2"`).


## Parquet vs CSV

Most people are familiar with **CSV** files, but when it comes to working with data, using **Parquet** can make a big difference.

### Parquet is where it's at!!

The Parquet file format is column-oriented and it is designed to efficiently store and retrieve data. Columnar formats provide better compression and improved performance, and enable you to query data column by column. Consequently, aggregation queries are faster compared to row-oriented storage.

<img src="https://raw.githubusercontent.com/coiled/dask-tutorial/main/images/storage-files.png"
     align="right"
     width="50%"
     alt="Dask DataFrame is composed of pandas DataFrames"/>
     
     
- **Column pruning:** Parquet lets you read specific columns from a dataset without reading the entire file.
- **Better compression:** Values within a column share a data type and tend to be similar, so each column compresses well, which saves on storage.
- **Schema:** Parquet stores the file schema in the file metadata.
- **Column metadata:** Parquet stores metadata statistics for each column, which can make certain types of queries a lot more efficient.

    

In [None]:
### coiled login
#!coiled login --token ### --account dask-tutorials

In [None]:
import coiled
import dask
import dask.dataframe as dd
from dask.distributed import Client

In [None]:
# we use this to avoid re-using clusters on a team
import uuid

id_cluster = uuid.uuid4().hex[:4]

## Uber/Lyft data transformation

The NYC Taxi dataset is a timeless classic.

The NYC Taxi and Limousine Commission (TLC) has data from all ride-share services in the city of New York. This includes private limousine services, van services, and a new category, "High Volume For Hire Vehicle" services: those that dispatch 10,000 rides per day or more. This is a special category defined for Uber and Lyft.

Let's use the Uber/Lyft dataset as an example of a Parquet dataset to learn how to troubleshoot the nuances of working with real data. The data comes from [High-Volume For-Hire Services](https://www.nyc.gov/site/tlc/businesses/high-volume-for-hire-services.page).

_Data dictionary:_

https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_hvfhs.pdf

### Let's get a cluster

In [None]:
%%time
cluster = coiled.Cluster(
    name=f"uber-lyft-{id_cluster}",
    n_workers=20,
    account="dask-tutorials",
    worker_vm_types=["m6i.xlarge"],
    backend_options={"region_name": "us-east-2"},
)

In [None]:
client = Client(cluster)
client

### Explore the data

We have a public version of this data set that is ready to use to get some insights, at
`"s3://coiled-datasets/uber-lyft-tlc/"`

In [None]:
dask.config.set({"dataframe.convert-string": True}) #Use PyArrow strings

In [None]:
df = dd.read_parquet(
   "s3://coiled-datasets/uber-lyft-tlc/" 
)

In [None]:
df.dtypes

## Memory usage 

```python
dask.utils.format_bytes(
    df.memory_usage(deep=True).sum().compute()
)
```
'82.81 GiB'


In [None]:
df.head()

In [None]:
df.columns

In [None]:
# len(df)
## 783_431_901

In [None]:
#We have enough memory so we persist the data set
df = df.persist()

### Get some insights

We assume you know pandas, so use pandas syntax, add `.compute()` at the end, and compute the following quantities.

How much did New Yorkers pay Uber/Lyft? Sum the `base_passenger_fare` column.

In [None]:
#solution
df.base_passenger_fare.sum().compute() / 1e9

How much did Uber/Lyft pay drivers?

In [None]:
#solution
df.driver_pay.sum().compute() / 1e9

How much did Uber/Lyft drivers make in tips?

In [None]:
#solution
df.tips.sum().compute() / 1e6

### Are New Yorkers tippers? 

Let's make our data set smaller and create a column that is a Yes/No for the tip. 

In [None]:
%%time
# count the non-null values in each column to spot NaNs
df.count().compute()

In [None]:
# Create a column tip > 0 = True
df["tip_flag"] = df.tips > 0

df = df[
    [
        "hvfhs_license_num",
        "tips",
        "base_passenger_fare",
        "driver_pay",
        "trip_miles",
        "trip_time",
        "shared_request_flag",
        "tip_flag",
    ]
].persist()

In [None]:
df.head()

In [None]:
df.columns

### Exercise

What percentage of rides received a tip?

In [None]:
#solution
tip_count = df["tip_flag"].value_counts().compute()

perc_trip_tips = tip_count[True] * 100 / (tip_count[True] + tip_count[False])
perc_trip_tips
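
An equivalent shortcut, sketched here on a toy pandas Series with made-up values: the mean of a boolean column is the fraction of `True` values, so the percentage can be read off directly without counting both classes.

```python
import pandas as pd

# Toy flags; the values are made up
tip_flag = pd.Series([True, False, False, True, False])

# Same approach as the solution: count each class, then divide
counts = tip_flag.value_counts()
pct_from_counts = counts[True] * 100 / (counts[True] + counts[False])

# The mean of a boolean column is the fraction of True values
pct_from_mean = tip_flag.mean() * 100

print(pct_from_counts, pct_from_mean)
```

On the Dask DataFrame the same idea would read `df.tip_flag.mean().compute() * 100`.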

### How many trips have a tip, by provider?

In [None]:
tip_by_provider = df.groupby(["hvfhs_license_num"]).tip_flag.value_counts().compute()

In [None]:
tip_by_provider

**From the data dictionary we know:**

As of September 2019, the HVFHS licenses are the following:

- HV0002: Juno  
- HV0003: Uber  
- HV0004: Via  
- HV0005: Lyft  

In [None]:
type(tip_by_provider)

In [None]:
# after .compute() this is a pandas Series, so we can use pandas methods
tip_by_provider = tip_by_provider.unstack(level="tip_flag")
tip_by_provider / 1e6
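
To see what `unstack` does to the grouped counts, here is a minimal sketch on a toy pandas DataFrame. The rows are made up; only the column names follow the dataset.

```python
import pandas as pd

# Toy rows; only the column names match the real dataset
pdf = pd.DataFrame(
    {
        "hvfhs_license_num": ["HV0003", "HV0003", "HV0003", "HV0005", "HV0005"],
        "tip_flag": [True, False, False, True, False],
    }
)

# groupby + value_counts returns a Series with a two-level index:
# (hvfhs_license_num, tip_flag)
counts = pdf.groupby("hvfhs_license_num").tip_flag.value_counts()

# unstack pivots the inner "tip_flag" level into columns,
# giving one row per provider and one column per flag value
table = counts.unstack(level="tip_flag")
print(table)
```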

### sum and mean of tips by provider 

In [None]:
tips_total = (
    df.loc[lambda x: x.tip_flag]
    .groupby("hvfhs_license_num")
    .tips.agg(["sum", "mean"])
    .compute()
)
tips_total

In [None]:
provider = {"HV0002": "Juno", "HV0005": "Lyft", "HV0003": "Uber", "HV0004": "Via"}

In [None]:
tips_total = tips_total.assign(provider=lambda df: df.index.map(provider)).set_index(
    "provider"
)
tips_total
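
The relabeling step relies on `Index.map` with a dictionary. A minimal sketch with made-up statistics shows how it behaves, including what happens to a license number that is missing from the dictionary (`"HV9999"` is a hypothetical code used only for this illustration).

```python
import pandas as pd

provider = {"HV0002": "Juno", "HV0003": "Uber", "HV0004": "Via", "HV0005": "Lyft"}

# Made-up statistics indexed by license number
stats = pd.DataFrame(
    {"sum": [10.0, 4.0], "mean": [2.5, 2.0]},
    index=pd.Index(["HV0003", "HV0005"], name="hvfhs_license_num"),
)

# Index.map looks each label up in the dict
named = stats.set_index(stats.index.map(provider))

# Labels missing from the dict become NaN, so check before relying on the names
unmapped = pd.Index(["HV0003", "HV9999"]).map(provider)

print(list(named.index), unmapped.isna().tolist())
```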

### What percentage of the passenger fare is the tip?

### Exercise

Create a new column named `tip_percentage` that represents the tip as a fraction of the passenger fare.

In [None]:
# solution
tip_percentage = df.tips / df.base_passenger_fare
df["tip_percentage"] = tip_percentage

In [None]:
df = df.persist()

## Mean tip percentage by provider

In [None]:
tips_perc_mean = (
    df.loc[lambda x: x.tip_flag]
    .groupby("hvfhs_license_num")
    .tip_percentage.mean()
    .compute()
)
tips_perc_mean

In [None]:
(tips_perc_mean.to_frame().set_index(tips_perc_mean.index.map(provider)))

### Get insight on the data

We are seeing weird numbers. Let's take a deeper look and remove some outliers.
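
One plausible contributor to odd averages, sketched on made-up values: when the fare is zero or very small, the tip-to-fare ratio becomes infinite or huge, and a single such row dominates the mean. Whether this is what happens in the real data is an assumption that the `describe()` output can confirm or rule out.

```python
import numpy as np
import pandas as pd

# Made-up trips: one zero fare and one near-zero fare
pdf = pd.DataFrame(
    {
        "tips": [2.0, 3.0, 1.0, 5.0],
        "base_passenger_fare": [10.0, 15.0, 0.0, 0.05],
    }
)

ratio = pdf.tips / pdf.base_passenger_fare

# The zero fare gives inf and the near-zero fare gives a very large ratio
print(ratio.tolist())

# Either of them is enough to distort the mean
print(ratio.mean())

# Restricting to plausible fares gives a usable average;
# the 1.0 threshold is an arbitrary illustration, not a recommendation
clean_mean = ratio[pdf.base_passenger_fare > 1.0].mean()
print(clean_mean)
```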

In [None]:
(
    df[["trip_miles", "base_passenger_fare", "tips", "tip_flag"]]
    .loc[lambda x: x.tip_flag]
    .describe()
    .compute()
)

### Getting to know the data

- How would you get more insights on the data?
- Can you visualize it?

**Hint:** Get a small sample, like 0.1% of the data to plot ~700_000 rows (go smaller if needed depending on your machine), compute it and work with that pandas dataframe.

In [None]:
# needed to avoid plots from breaking
%matplotlib inline

In [None]:
## Take a sample
df_sample = (
    df.loc[lambda x: x.tip_flag][["trip_miles", "base_passenger_fare", "tips"]]
    .sample(frac=0.001)
    .compute()
)

In [None]:
# box plot
df_sample.boxplot()

### Cleaning up outliers

- Play with the pandas dataframe `df_sample` to get insights on good filters for the bigger dataframe. 

Hint: think about pandas dataframe quantiles [docs here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html)

In [None]:
df_sample.tips.quantile([0.25, 0.75])

### Exercise

Calculate the first and third quartiles for `base_passenger_fare` and `trip_miles`.

In [None]:
# solution
df_sample.base_passenger_fare.quantile([0.25, 0.75])

In [None]:
# solution
df_sample.trip_miles.quantile([0.25, 0.75])

### Conditions to filter the dataset

We can use Q1 and Q3 to create conditions to filter the dataset.
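
As a small worked example of this kind of condition, the sketch below builds a Q1 to Q3 filter on a toy Series with one obvious outlier. The values are made up for illustration.

```python
import pandas as pd

# Toy tips with one extreme value
tips = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# First and third quartiles
q1, q3 = tips.quantile([0.25, 0.75]).values

# between() is inclusive on both ends by default
condition = tips.between(q1, q3)

kept = tips[condition]
print(q1, q3, kept.tolist())
```

Note that a Q1 to Q3 band keeps only about the middle half of each column, and combining several such conditions with `&` keeps less. A wider band, for example the 1st and 99th percentiles, would discard fewer ordinary trips.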

In [None]:
tips_filter_vals = df_sample.tips.quantile([0.25, 0.75]).values
tips_condition = df_sample.tips.between(*tips_filter_vals)

In [None]:
tips_condition

### Exercise

Create filter conditions for the `base_passenger_fare` and `trip_miles`

In [None]:
## Solution
fare_filter_vals = df_sample.base_passenger_fare.quantile([0.25, 0.75]).values
fares_condition = df_sample.base_passenger_fare.between(*fare_filter_vals)

miles_filter_vals = df_sample.trip_miles.quantile([0.25, 0.75]).values
miles_condition = df_sample.trip_miles.between(*miles_filter_vals)

### Filter dataframe and plot

In [None]:
df_sample.loc[(tips_condition & fares_condition) & miles_condition].boxplot()

## Filtering our big dataset based on the insights

Based on these numbers let's go back to our `df` dataset and try to filter it.


In [None]:
tips_condition = df.tips.between(*tips_filter_vals)
miles_condition = df.trip_miles.between(*miles_filter_vals)
fares_condition = df.base_passenger_fare.between(*fare_filter_vals)

In [None]:
df = df.loc[(tips_condition & fares_condition) & miles_condition].persist()

### Let's look at the `tip_percentage` again

### Exercise 
Compute the `tip_percentage` mean by provider 

In [None]:
#Solution
tips_perc_avg = df.groupby("hvfhs_license_num").tip_percentage.mean().compute()
tips_perc_avg

In [None]:
(tips_perc_avg.to_frame().set_index(tips_perc_avg.index.map(provider)))

In [None]:
len(df)

### Average trip time by provider

In [None]:
trips_time_avg = (
    df.groupby("hvfhs_license_num")
    .trip_time.agg(["min", "max", "mean", "std"])
    .compute()
)
trips_time_avg

### In minutes

In [None]:
trips_time_avg.set_index(trips_time_avg.index.map(provider)) / 60

## What we've learned
- Most New Yorkers do not tip
- Of those who do tip, it is common to tip around 20% regardless of the provider, except on Via, where riders tend to tip slightly less.
- The trip_time column needs some cleaning of outliers. 

In [None]:
# close the client first, then shut down the cluster it was connected to
client.close()
cluster.shutdown()

### Useful links

* https://tutorial.dask.org/01_dataframe.html
* [DataFrames documentation](https://docs.dask.org/en/stable/dataframe.html)
* [Dataframes and parquet](https://docs.dask.org/en/stable/dataframe-parquet.html)
* [Dataframes examples](https://examples.dask.org/dataframe.html)

### Other lesson

Register [here](https://www.coiled.io/tutorials) for reminders. 

We have another lesson, where we’ll parallelize a custom Python workflow that scrapes, parses, and cleans data from Stack Overflow. We’ll get to:

- Learn how to do arbitrary task scheduling using the Dask Futures API
- Utilize blocking and non-blocking distributed calculations

By the end, we’ll see how much faster this workflow is using Dask and how the Dask Futures API is particularly well-suited for this type of fine-grained execution.
