# Dask Exercise 1: switching from `pandas`

**Dask intro**
* [Why Dask?](https://docs.dask.org/en/stable/why.html)
* [10 min to Dask](https://docs.dask.org/en/stable/10-minutes-to-dask.html)
* [Dask dataframe](https://docs.dask.org/en/stable/dataframe.html) -- scope of what is easily ported from `pandas` vs what might be slow / not implemented in Dask.
* [Dask and parquets](https://docs.dask.org/en/stable/dataframe-parquet.html)

The first and easiest step in starting to use `dask` is to make the switch from `pandas` dataframes (dfs) to `dask` dataframes (ddfs) and find the equivalent methods. The look and feel of this should be very familiar.

The **major** difference between `pandas` dfs and `dask` ddfs is that ddfs are not read into memory. The schema and certain attributes of the df are available, but to actually get computed results, you have to call `.compute()`. Dask uses lazy evaluation, which means it stores the steps and the order you want them done in, then evaluates them all at once when you ask for the result.

The `dask` equivalent of `numpy` arrays is the `dask` array (`dask.array`).
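For comparison, here is a small sketch of a `dask` array wrapping a `numpy` array. The array shape and chunk sizes are arbitrary choices for illustration.

```python
import numpy as np
import dask.array as da

# A small numpy array, split into two column chunks by dask
np_arr = np.arange(12).reshape(3, 4)
darr = da.from_array(np_arr, chunks=(3, 2))

# Same lazy pattern as ddfs: describe the work, then compute it
col_sums = darr.sum(axis=0).compute()
print(col_sums)
```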

Skills:
* equivalent methods for dask dataframes
* task graphs
* concatenation
* merges
* partitioned parquets

In [None]:
import dask.dataframe as dd
import dask_geopandas as dg
import geopandas as gpd
import pandas as pd

TAXI_DATA = ("https://raw.githubusercontent.com/mwaskom/"
             "seaborn-data/master/taxis.csv"
            )

### Basics

Most `pandas` methods can be called on a ddf, but you need to add `.compute()` to get the result.

To convert a ddf to a df, simply use `ddf.compute()`. 

Alternatively, you can change any df to a ddf.

In [None]:
df = pd.read_csv(TAXI_DATA)

ddf = dd.from_pandas(df, npartitions=1)

In [None]:
ddf = dd.read_csv(TAXI_DATA)
ddf

In [None]:
ddf.columns

In [None]:
ddf.dtypes

In [None]:
ddf.describe().compute()

In [None]:
# ddf.shape does not return a concrete row count, because the rows are not in memory
# to find how many rows there are, find the length of the index (this triggers a computation)
len(ddf.index)

### Cleaning Columns / Apply Row-Wise Functions  

For the most part, this works the same as in `pandas`. Sometimes there is a `dask` equivalent of a `pandas` method, so always check for one first.

In [None]:
# Lambda functions in pandas
df = df.assign(
    same_borough = df.apply(
        lambda x: 
        1 if x.pickup_borough == x.dropoff_borough
        else 0, axis=1)
)

In [None]:
# Apply the same lambda function for dask df
# Make sure to add the metadata argument to specify the data type
ddf = ddf.assign(
    same_borough = ddf.apply(
        lambda x:
        1 if x.pickup_borough == x.dropoff_borough
        else 0, axis=1, meta=('same_borough', 'int')
    )
)

In [None]:
ddf.same_borough.value_counts().compute()

In [None]:
def type_of_trip(row) -> str: 
    if ((row.passengers == 1) and 
        (row.distance >= 5) and 
        (row.pickup_borough != row.dropoff_borough)
       ):
        return "individual_long_trips"
    elif ((row.passengers > 1) and 
          (row.distance >= 5) and 
          (row.pickup_borough != row.dropoff_borough)
    ):
        return "group_long_trips"
    
    else:
        return "short_trips"
        

In [None]:
ddf = ddf.assign(
    trip_type = ddf.apply(
        type_of_trip,
        axis=1, meta=("trip_type", "str"))
)

In [None]:
ddf.groupby("trip_type").agg(
    {"total": "sum", 
     "passengers": "sum"}).reset_index().compute()

### DateTimes

A lot of the methods are similar here!

In [None]:
ddf = ddf.assign(
    pickup_time = dd.to_datetime(ddf.pickup),
    pickup_hour = dd.to_datetime(ddf.pickup).dt.hour
)

In [None]:
ddf.pickup_hour.value_counts().compute()

## Task Graphs

Since Dask is lazily evaluated, it's really just storing the order of operations for you. To see all the transformations you're doing to the dataframe, look at the task graph.

In [None]:
# Split up the ddf into multiple partitions
ddf2 = ddf.repartition(npartitions=5)

In [None]:
ddf2.visualize()

## Concatenation

Instead of `pd.concat`, use `dd.multi.concat` (also available as `dd.concat`). [Docs](https://docs.dask.org/en/stable/generated/dask.dataframe.multi.concat.html).

Concatenation takes a list of dfs or ddfs. This pattern of building a list and then combining it is also very useful with `dask.delayed`.

In [None]:
yellow = df[df.color=="yellow"].reset_index(drop=True)
green = df[df.color=="green"].reset_index(drop=True)

yellow_ddf = dd.from_pandas(yellow, npartitions=1)
green_ddf = dd.from_pandas(green, npartitions=1)

In [None]:
combined_df = pd.concat([yellow, green], axis=0)
combined_df.shape

In [None]:
combined_ddf = dd.multi.concat([yellow_ddf, green_ddf], axis=0)
len(combined_ddf.index)
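The list-then-concatenate pattern carries over to `dask.delayed`. In this sketch, `load_color` is a hypothetical stand-in for whatever function produces one dataframe per input, and its contents are made up.

```python
import dask
import pandas as pd

@dask.delayed
def load_color(color: str) -> pd.DataFrame:
    # Hypothetical stand-in for reading or building one dataframe per color
    return pd.DataFrame({"color": [color] * 2, "fare": [10.0, 12.0]})

# A list of delayed objects; nothing has been executed yet
parts = [load_color(c) for c in ["yellow", "green"]]

# Wrap pd.concat so the concatenation is also part of the task graph
combined = dask.delayed(pd.concat)(parts, axis=0, ignore_index=True)

# Everything runs here
result = combined.compute()
print(result)
```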

## Merges

This isn't a very meaningful merge, but we'll use it to demonstrate anyway.

Let's say that there's a column called `manhattan_flag` in `yellow_ddf` and we want to bring that column in for `green_ddf`. 

We cannot use the `validate` parameter in the merge, but most of the other arguments are present. [Docs](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.merge.html).

In [None]:
# Not that this is a meaningful merge, but, we can!
yellow_ddf = yellow_ddf.assign(
    manhattan_flag = yellow_ddf.apply(
        lambda x: 
        1 if (x.pickup_borough == "Manhattan") or 
        (x.dropoff_borough=="Manhattan") 
         else 0, axis=1, meta=("manhattan_flag", "int")
    )
)

yellow_ddf


In [None]:
yellow_ddf_flag = yellow_ddf[["pickup_borough", "dropoff_borough", 
                              "manhattan_flag"]].drop_duplicates()

yellow_ddf_flag.visualize()

In [None]:
m1 = dd.merge(
    green_ddf,
    yellow_ddf_flag,
    on = ["pickup_borough", "dropoff_borough"],
    how = "inner"
)

In [None]:
m1.manhattan_flag.value_counts().compute()

## Partitioned Parquets

We already use parquets because they're a lot faster than csv or geojson. We can also use partitioned parquets (a folder of lots of smaller parquet files).

The folder of partitioned parquets can be easily read back in or filtered against.

If you find that the `.compute()` step in bringing a very large ddf into memory is holding you back, consider exporting it out as a partitioned parquet. Reading the partitioned parquet back in and then exporting it as a single parquet can be faster than calling `.compute()` on the original ddf directly.

In [None]:
type(ddf.compute())

In [None]:
type(ddf)

In [None]:
# We don't have trouble computing and bringing this back into memory
ddf.compute()

In [None]:
# Look at this task graph
ddf3 = dd.read_csv(TAXI_DATA).repartition(npartitions=3)
ddf3 = ddf3.assign(
    trip_type = ddf3.apply(
        type_of_trip,
        axis=1, meta=("trip_type", "str"))
)

ddf3.visualize()

In [None]:
ddf3

In [None]:
# look at the format of this partitioned parquet
ddf3.to_parquet("dask1_multipart", overwrite=True)

In [None]:
read_in_ddf3 = dd.read_parquet("dask1_multipart/")
read_in_ddf3

In [None]:
# task graph just reads in each part
read_in_ddf3.visualize()