# Dask Exercise 2: dask delayed

* [dask delayed](https://docs.dask.org/en/stable/delayed.html)
* [dask tutorial](https://tutorial.dask.org/03_dask.delayed.html)

Skills:
* Convert a for loop into a simple dask delayed workflow 
* Get more familiar with `dask.dataframe` wrangling

In [None]:
import dask.dataframe as dd
import pandas as pd

from dask import delayed, compute

GCS_FILE_PATH = ("gs://calitp-analytics-data/data-analyses/"
                 "rt_delay/v2_rt_trips/"
                )

analysis_date = "2023-03-15"
la_metro = 182
big_blue_bus = 300
muni = 282

operators = [la_metro, big_blue_bus, muni]

## Simple Workflow to Parallelize

This is a typical workflow. 
1. Read in pandas df.
2. Apply a certain function.
3. Export df.

Let's say we have a df corresponding to each operator. We want to apply the same aggregation function and then save out the results.

Typically, we would use a loop, and a loop is **sequential**: operator 1 finishes before operator 2 starts, and so on through operator N. With `dask.delayed` objects, each operator's work becomes an independent task that the scheduler can run **simultaneously**. Why not let them run at the same time and save out the results?

There is nothing inherent in our workflow that specifies that operator 1 must be run before operator 2. We are applying the same function to each operator. To speed it up, let's use dask to run it in parallel and get our results.
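
The contrast between a sequential loop and a delayed workflow can be sketched with made-up data. The `toy_data` dictionary and the `count_routes` function below are hypothetical stand-ins for the operator parquets and the aggregation; they are not part of the real workflow.

```python
import pandas as pd
from dask import delayed, compute

# Made-up stand-in data: one small DataFrame per hypothetical operator ID.
toy_data = {
    1: pd.DataFrame({"route_id": ["a", "a", "b"]}),
    2: pd.DataFrame({"route_id": ["c", "d", "e"]}),
    3: pd.DataFrame({"route_id": ["f"]}),
}


def count_routes(df: pd.DataFrame) -> int:
    """Count the unique route_ids in one operator's DataFrame."""
    return int(df["route_id"].nunique())


# Sequential: each operator finishes before the next one starts.
loop_results = [count_routes(toy_data[op]) for op in toy_data]

# Delayed: nothing runs yet; each call only records a task.
lazy_results = [delayed(count_routes)(toy_data[op]) for op in toy_data]

# A single compute call hands all the tasks to the scheduler together,
# which is what allows independent tasks to run in parallel.
delayed_results = list(compute(*lazy_results))
```

Both approaches return the same values; the difference is only in when and how the work is scheduled.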

In [None]:
df = pd.read_parquet(
    f"{GCS_FILE_PATH}{big_blue_bus}_{analysis_date}.parquet")

In [None]:
# Set up a function that counts the number of 
# unique route_ids and route_type
def simple_route_aggregation(df: pd.DataFrame) -> pd.DataFrame:
    aggregated = (df.groupby(["calitp_itp_id",
                              "organization_name"])
                  .agg({"route_id": "nunique", 
                        "route_type": "nunique"})
                  .reset_index()
                 )

    return aggregated


In [None]:
df_agg = simple_route_aggregation(df)

In [None]:
df_agg

### Move it to delayed

We can use the `@delayed` decorator right above our defined function.


Alternatively, you can wrap the function, like `delayed(my_function)(args)`. These are equivalent.

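```
# Option 1: decorate the function when you define it
@delayed
def my_function(df):
    df2 = df  # placeholder: apply your transformation here
    return df2


# Option 2: for a function defined without the decorator,
# wrap it at call time
delayed(my_function)(df)
```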

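**Note where the parentheses fall**...it is not a typo. `delayed(my_function)` returns a lazy version of the function, and the second set of parentheses passes the arguments to it.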
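
A minimal runnable check of that equivalence, using two hypothetical toy functions (`add_one_decorated` and `add_one_plain` are illustrative names only):

```python
from dask import delayed


@delayed
def add_one_decorated(x: int) -> int:
    return x + 1


def add_one_plain(x: int) -> int:
    return x + 1


# Calling the decorated function returns a delayed object, not a number.
lazy_a = add_one_decorated(1)

# Wrapping at call time: delayed(...) first, then the arguments.
lazy_b = delayed(add_one_plain)(1)

# Only .compute() actually runs the function.
value_a = lazy_a.compute()
value_b = lazy_b.compute()
```

Both `value_a` and `value_b` hold the same result; the only difference is where `delayed` is applied.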


In [None]:
# We can use a decorator to make it a delayed function
@delayed
def import_data(itp_id: int):
    return pd.read_parquet(
        f"{GCS_FILE_PATH}{itp_id}_{analysis_date}.parquet")

In [None]:
# We have a list of 3 operators we would have looped over
operators

In [None]:
# Let's read in our data using list comprehension
dfs = [import_data(x) for x in operators]

In [None]:
# We have a list of delayed objects
# these dfs are not materialized / read into memory
dfs

In [None]:
# Set up a list to store our results
results = [delayed(simple_route_aggregation)(df) for df in dfs]

In [None]:
# The results are also delayed objects
results

In [None]:
# Wrap compute around each of the items in the results list 
# and see what's inside
results_computed = [compute(i) for i in results]

In [None]:
# This is a list of tuples...that's not what we want
results_computed

In [None]:
type(results_computed[0])

In [None]:
# We need the first item of the tuple...that's our df
type(results_computed[0][0])

In [None]:
results_computed_correct = [compute(i)[0] for i in results]

In [None]:
results_computed_correct

In [None]:
type(results_computed_correct[0])
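
Calling `compute` once per item works, but each call is a separate run, so the items are processed one after another. Passing every delayed object to a single `compute` call lets the scheduler see all the tasks together. A small sketch with a hypothetical `double` function:

```python
from dask import delayed, compute


def double(x: int) -> int:
    return 2 * x


lazy_values = [delayed(double)(i) for i in [1, 2, 3]]

# One compute call per item: each call returns a 1-item tuple,
# so we take element [0] every time.
one_at_a_time = [compute(v)[0] for v in lazy_values]

# One compute call for everything: returns a single tuple
# with one result per delayed object.
all_at_once = compute(*lazy_values)
```

Here `all_at_once` is a tuple with one entry per delayed object, so it can be turned into a list with `list(all_at_once)` if needed.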

In [None]:
# Alternatively, the code can be written like a loop, 
# but it won't run like a loop. The loop only builds delayed objects 
# for the three operators; nothing is read or aggregated until 
# compute is called

results2 = []

for itp_id in operators:
    operator_df = import_data(itp_id)
    print(f"type for operator_df: {type(operator_df)}")
    
    aggregated_df = delayed(simple_route_aggregation)(operator_df)
    print(f"type for aggregated_df: {type(aggregated_df)}")
    
    results2.append(aggregated_df)


In [None]:
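# Pass all the delayed objects to a single compute call so the
# three operators can run at the same time; compute returns a tuple
results_computed2 = list(compute(*results2))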

In [None]:
results_computed2

At this point, you can either write a function to export each individual aggregated pandas df as its own standalone parquet, or combine them all.

We will not export and overwrite the file in the GCS bucket right now.

Since our results are just pandas dfs, we could also concatenate them.

In [None]:
pd.concat(results_computed2, axis=0)

In [None]:
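# This is rather pointless for such a small df, but for larger
# ones, we may want to concatenate and export it as a partitioned parquet.
# dd.concat is the public function; convert each pandas df to dask first
dd.concat(
    [dd.from_pandas(result_df, npartitions=1) 
     for result_df in results_computed2],
    axis=0
).compute()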

## To Do

* For the same 3 operators, use delayed functions throughout: importing the parquet, applying a function, and saving the results to a list.
* Your function should group each trip into a category based on its `mean_speed_mph`.
   * < 10 mph
   * 10-15 mph
   * 15-20 mph
   * 20+ mph
* For each operator, get the count of trips in each category and each category's proportion of that operator's trips
* Save the results in a list, compute the results for all the operators at once
* Concatenate the aggregated results for all the operators into one dask df