## RT vs Schedule Route Aggregation Issues

#### 1. Keep schedule and vp apart 
For metrics derived from `vp_usable`, merge in schedule info wherever we can. These metrics will cover `vp_only` and `vp_and_schedule` trips, but **not** `schedule_only`. Schedule-derived metrics (`schedule_only` and `vp_and_schedule`) will be handled in `route_typologies`.

Merge RT and schedule metrics **after** aggregating to route-direction-time period.
Trips that are in vp but not in schedule will not have a `route_id` or `direction_id`, so our aggregation will group all of those into "Unknown" routes.

For RT stuff, keep speed separate from other metrics for now.
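
As a rough illustration of "aggregate first, merge after", the sketch below aggregates toy RT and schedule trip tables to the route grain separately and then merges them with an indicator. The data and the `n_vp_trips` / `n_sched_trips` column names are made up for illustration; only the three category labels come from the notes above.

```python
import pandas as pd

# Toy trip tables (hypothetical values). The vp trip that is missing from
# schedule has no route_id, so it is labeled "Unknown" before aggregating.
vp_trips = pd.DataFrame({
    "trip_instance_key": ["a", "b", "c", "d"],
    "route_id": ["1", "1", "2", None],
}).fillna({"route_id": "Unknown"})

sched_trips = pd.DataFrame({
    "trip_instance_key": ["a", "b", "c", "e"],
    "route_id": ["1", "1", "2", "3"],
})

# Aggregate each source to the route grain first...
vp_by_route = (vp_trips.groupby("route_id")
               .agg(n_vp_trips=("trip_instance_key", "count"))
               .reset_index())
sched_by_route = (sched_trips.groupby("route_id")
                  .agg(n_sched_trips=("trip_instance_key", "count"))
                  .reset_index())

# ...then merge and relabel the merge indicator.
merged = pd.merge(
    vp_by_route, sched_by_route,
    on="route_id", how="outer", indicator="sched_rt_category"
)
merged["sched_rt_category"] = merged.sched_rt_category.astype(str).map({
    "left_only": "vp_only",
    "both": "vp_and_schedule",
    "right_only": "schedule_only",
})
```

With `how="left"` instead of `how="outer"`, the `schedule_only` row drops out, which matches the scope described for the vp-derived metrics.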

#### 2. Add columns for trip table 
* Need `schedule_gtfs_dataset_key` from `vp_usable`, plus `helpers.import_scheduled_trips` to get route-direction info.
* Add `time_of_day` and `peak_offpeak` columns with `gtfs_schedule_wrangling`

#### 3. Get metrics
Set up a function to do the division for certain percentages or other normalized metrics.

This can be used on the trip-level table, but it will also need to be applied after route-direction aggregation.

#### 4. Set up for weighted metrics 
Set up a function to help with weighted averages or percents. This should include all the columns we need to sum for a given grouping. 

For trips, this step is a no-op (each trip is already its own group), so the table can be passed straight to the metrics function in step 3.
For route-direction-time_period, this step sums the columns within each group, and the result is then passed to the metrics function in step 3.
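
A small worked example of why the sums need to come before the division. The numbers are invented; the column names mirror the spatial accuracy columns used later in this notebook.

```python
import pandas as pd

# Toy trip-level table (hypothetical values): one short trip, one long trip.
trips = pd.DataFrame({
    "route_id": ["1", "1"],
    "total_vp": [100, 900],
    "vp_in_shape": [50, 900],
})

# Simple average of trip-level percents: each trip counts equally.
trips["pct_in_shape"] = trips.vp_in_shape / trips.total_vp
simple_avg = trips.groupby("route_id").pct_in_shape.mean()

# Weighted version: sum numerator and denominator, then divide.
sums = trips.groupby("route_id")[["vp_in_shape", "total_vp"]].sum()
weighted_avg = sums.vp_in_shape / sums.total_vp
```

Here the simple average is 0.75, while the weighted value is 0.95 because the long trip contributes most of the vehicle positions.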

#### 5. Are functions generalizable?
Put these aggregation functions in a separate script / `segment_speed_utils`. Hold off on this until it's obvious what can be reused.

#### 6. References to review while making changes
* how to set up speed-trip tables [add natural identifiers where necessary](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/scripts/stop_arrivals_to_speed.py)
* [averaging of speeds](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/scripts/average_speeds.py)
* crosswalk of operator identifiers [created here](https://github.com/cal-itp/data-analyses/blob/main/gtfs_funnel/crosswalk_gtfs_dataset_key_to_organization.py) and there is a [helper function](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/segment_speed_utils/helpers.py#L169)...so use this!  
* [segment_calcs](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/segment_speed_utils/segment_calcs.py) for some aggregation
* [time helpers](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/segment_speed_utils/time_helpers.py)

In [1]:
import dask.dataframe as dd
import pandas as pd
import yaml

from shared_utils import rt_dates 
from segment_speed_utils import helpers, gtfs_schedule_wrangling
from segment_speed_utils.project_vars import RT_SCHED_GCS, SEGMENT_GCS

In [2]:
analysis_date = rt_dates.DATES["dec2023"]

In [3]:
with open("config.yml") as f:
    config_dict = yaml.safe_load(f)
    
config_dict

{'trip_metrics': 'vp_trip/trip_metrics',
 'route_direction_metrics': 'vp_route_dir/route_direction_metrics'}

In [4]:
EXPORT_FILE = config_dict["trip_metrics"]

In [5]:
df = pd.read_parquet(
    f"{RT_SCHED_GCS}trip_level_metrics/{analysis_date}_metrics.parquet"
)
df.to_parquet(f"{RT_SCHED_GCS}{EXPORT_FILE}_{analysis_date}.parquet")

In [6]:
gtfs_key = dd.read_parquet(
    f"{SEGMENT_GCS}vp_usable_{analysis_date}",
    columns = ["schedule_gtfs_dataset_key", "trip_instance_key"]
).drop_duplicates().compute()

In [7]:
df2 = pd.merge(
    gtfs_key,
    df,
    on = "trip_instance_key",
    how = "inner",
)

In [8]:
trip_to_route = helpers.import_scheduled_trips(
    analysis_date,
    columns = ["trip_instance_key", "route_id", "direction_id"],
    get_pandas = True
)

In [9]:
# The left only merges are in vp, but not in schedule
# Fill in route_id and direction_id with missing
df3 = pd.merge(
    df2,
    trip_to_route,
    on = "trip_instance_key",
    how = "left",
    indicator = "sched_rt_category"
)

In [10]:
df3.sched_rt_category.value_counts()

both          77977
left_only      8151
right_only        0
Name: sched_rt_category, dtype: int64

In [11]:
df3 = df3.assign(
    route_id = df3.route_id.fillna("Unknown"),
    direction_id = df3.direction_id.astype("Int64"),
    # vectorized relabel of the merge indicator; a left merge
    # only produces "left_only" and "both"
    sched_rt_category = df3.sched_rt_category.astype(str).map(
        {"left_only": "vp_only", "both": "vp_sched"}
    )
)

In [12]:
df3.sched_rt_category.value_counts()

vp_sched    77977
vp_only      8151
Name: sched_rt_category, dtype: int64

In [13]:
time_of_day = gtfs_schedule_wrangling.get_trip_time_buckets(
    analysis_date
).pipe(
    gtfs_schedule_wrangling.add_peak_offpeak_column
)

In [14]:
df4 = pd.merge(
    df3.drop(columns = "service_minutes"),
    time_of_day,
    on  = "trip_instance_key",
    how = "inner"
)

Base the weighted average function on this:
https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/segment_speed_utils/segment_calcs.py#L89-L132

In [15]:
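def weighted_average_function(
    df: pd.DataFrame, 
    group_cols: list
) -> pd.DataFrame:
    """
    Sum the numerator and denominator columns needed for weighted
    averages / percents, grouped by group_cols + peak_offpeak.
    The division itself is left to the metrics function.
    """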
    df2 = (df.groupby(group_cols + ["peak_offpeak"])
            .agg({
                 "trip_instance_key": "count",
                 #"rt_service_min": "mean", # can't use this twice...
                 # only if we move this to portfolio_utils.aggregate()

                 # weighted average for trip updates
                 "total_min_w_gtfs": "sum",
                 "rt_service_min": "sum",

                 # weighted average of pings per min
                 "total_pings_for_trip": "sum",
                 "service_minutes": "sum", # is it this one or rt_service_min?

                 # weighted spatial accuracy  
                 "total_vp": "sum",
                 "vp_in_shape": "sum",
             }).reset_index()
            )

    return df2

In [16]:
weighted_average_function(
    df4, 
    ["schedule_gtfs_dataset_key", "route_id", 
     "direction_id", "sched_rt_category"]
)

| | schedule_gtfs_dataset_key | route_id | direction_id | sched_rt_category | peak_offpeak | trip_instance_key | total_min_w_gtfs | rt_service_min | total_pings_for_trip | service_minutes | total_vp | vp_in_shape |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 015d67d5b75b5cf2b710bbadadfb75f5 | 17 | 0 | vp_sched | offpeak | 10 | 729 | 732.983333 | 2153 | 567.0 | 2153.0 | 1708.0 |
| 1 | 015d67d5b75b5cf2b710bbadadfb75f5 | 17 | 0 | vp_sched | peak | 12 | 808 | 839.933333 | 2388 | 690.0 | 2388.0 | 2195.0 |
| 2 | 015d67d5b75b5cf2b710bbadadfb75f5 | 17 | 1 | vp_sched | offpeak | 11 | 675 | 697.966667 | 1992 | 569.0 | 1992.0 | 1705.0 |
| 3 | 015d67d5b75b5cf2b710bbadadfb75f5 | 17 | 1 | vp_sched | peak | 11 | 596 | 618.783333 | 1757 | 595.0 | 1757.0 | 1693.0 |
| 4 | 015d67d5b75b5cf2b710bbadadfb75f5 | 219 | 0 | vp_sched | offpeak | 9 | 183 | 242.500000 | 533 | 152.0 | 533.0 | 428.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5620 | ff1bc5dde661d62c877165421e9ca257 | ROUTEA | 0 | vp_sched | peak | 3 | 113 | 114.033333 | 340 | 90.0 | 340.0 | 155.0 |
| 5621 | ff1bc5dde661d62c877165421e9ca257 | ROUTEA | 1 | vp_sched | offpeak | 8 | 409 | 408.433333 | 1252 | 222.0 | 1252.0 | 481.0 |
| 5622 | ff1bc5dde661d62c877165421e9ca257 | ROUTEA | 1 | vp_sched | peak | 8 | 337 | 336.850000 | 1038 | 239.0 | 1038.0 | 522.0 |
| 5623 | ff1bc5dde661d62c877165421e9ca257 | ROUTEB | 1 | vp_sched | offpeak | 3 | 159 | 159.416667 | 467 | 141.0 | 467.0 | 335.0 |


In [None]:
def calculate_percent_normalized_metrics(df: pd.DataFrame):
    # metrics like pings per minute, percent of trip with RT
    # should be calculated after aggregation
    # can be done at trip-level, can be done after sums are taken for route-direction
    # do not do simple averages in aggregation
    return
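
One possible way to fill in the stub above, offered as a sketch rather than the final implementation. The numerator/denominator pairings follow the comments in the aggregation cell. The output column names (`pings_per_min`, `pct_min_w_gtfs`, `pct_vp_in_shape`), the `_sketch` suffix, and the `ping_denominator` parameter are assumptions added here. The parameter exists because the choice between `service_minutes` and `rt_service_min` is still an open question in the notes.

```python
import pandas as pd

def calculate_percent_normalized_metrics_sketch(
    df: pd.DataFrame,
    ping_denominator: str = "service_minutes",
) -> pd.DataFrame:
    """
    Divide summed numerators by summed denominators.
    Works on the trip-level table or on the output of a
    route-direction aggregation, as long as the columns exist.
    Output column names are hypothetical.
    """
    def ratio(num: str, denom: str) -> pd.Series:
        # Zero denominators become NaN rather than inf
        safe_denom = df[denom].where(df[denom] != 0)
        return df[num] / safe_denom

    return df.assign(
        pings_per_min = ratio("total_pings_for_trip", ping_denominator),
        pct_min_w_gtfs = ratio("total_min_w_gtfs", "rt_service_min"),
        pct_vp_in_shape = ratio("vp_in_shape", "total_vp"),
    )
```

Because the function only divides columns that are already present, the same call should work at either grain: directly on the trip table, or on the result of `weighted_average_function`.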