## RT vs Schedule Route Aggregation Issues

#### 1. Keep schedule and vp apart 
For metrics derived from `vp_usable`, merge in schedule info wherever we can. These metrics will cover `vp_only` and `vp_and_schedule` trips, but **not** `schedule_only`. Schedule-based metrics (`schedule_only` and `vp_and_schedule`) will be done in `route_typologies`.

Merge RT and schedule metrics **after** aggregating to route-direction-time period.
Trips that are only in vp (not in schedule) will not have a `route_id` or `direction_id`, so our aggregation will roll all of those up into "Unknown" routes.
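
Why the "Unknown" fill matters: pandas `groupby` drops rows whose key is missing by default (`dropna=True`), so unmatched vp trips would silently fall out of the aggregation. A minimal sketch with made-up numbers (the `trips` frame below is hypothetical, not project data):

```python
import pandas as pd

# Hypothetical trip-level rows; the last trip is in vp but not in schedule
trips = pd.DataFrame({
    "route_id": ["17", "17", None],
    "total_vp": [100, 50, 30],
})

# Default groupby drops the row with a missing route_id
dropped = trips.groupby("route_id").total_vp.sum()

# Filling the key first keeps vp-only trips under an "Unknown" route
kept = (trips.assign(route_id = trips.route_id.fillna("Unknown"))
        .groupby("route_id").total_vp.sum())

print(dropped.sum(), kept.sum())  # 150 180
```

The same default applies to any other grouping column that is still missing, so it is worth checking `direction_id` too before grouping on it.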

For RT stuff, keep speed separate from other metrics for now.

#### 2. Add columns for trip table 
* Need `schedule_gtfs_dataset_key` from `vp_usable` and also `helpers.import_scheduled_trips` to get route-direction info.
* Add `time_of_day` and `peak_offpeak` columns with `gtfs_schedule_wrangling`

#### 3. Get metrics
Set up a function to do the division for certain percentages or other normalized metrics.

This can be used on the trip-level table, but it will also need to be applied again after route-direction aggregation.

#### 4. Set up for weighted metrics 
Set up a function to help with weighted averages or percents. It should sum all the columns (numerators and denominators) we need for a given grouping.

For trips, this is a no-op (each trip is its own group), so the trip table can be passed straight to the metrics function in 3.
For route-direction-time_period, this sums the columns within each group, and the result is then passed to the metrics function in 3.
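
To see why the sums are carried forward instead of averaging trip-level ratios, here is a small worked example using the two trips shown in the `df.head(2)` output in this notebook (216 pings over 74.03 RT minutes, and 59 pings over 23.45). The variable names are illustrative only.

```python
import pandas as pd

trips = pd.DataFrame({
    "total_pings_for_trip": [216, 59],
    "rt_service_min": [74.03, 23.45],
})

# Simple mean of trip-level ratios: every trip counts equally
simple_mean = (trips.total_pings_for_trip / trips.rt_service_min).mean()

# Weighted: sum numerator and denominator first, then divide once
sums = trips.sum()
weighted = sums.total_pings_for_trip / sums.rt_service_min

print(round(simple_mean, 2), round(weighted, 2))  # 2.72 2.82
```

The longer trip contributes more minutes, so it pulls the weighted value toward its own ratio; the simple mean ignores that.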

#### 5. Are functions generalizable?
Put these aggregation functions in a separate script or in `segment_speed_utils`. Hold off on this until it's obvious what can be reused.

#### 6. References to review while making changes
* how to set up speed-trip tables [add natural identifiers where necessary](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/scripts/stop_arrivals_to_speed.py)
* [averaging of speeds](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/scripts/average_speeds.py)
* crosswalk of operator identifiers [created here](https://github.com/cal-itp/data-analyses/blob/main/gtfs_funnel/crosswalk_gtfs_dataset_key_to_organization.py) and there is a [helper function](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/segment_speed_utils/helpers.py#L169)...so use this!  
* [segment_calcs](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/segment_speed_utils/segment_calcs.py) for some aggregation
* [time helpers](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/segment_speed_utils/time_helpers.py)

In [31]:
import dask.dataframe as dd
import pandas as pd
import yaml

from shared_utils import rt_dates 
from segment_speed_utils import helpers, gtfs_schedule_wrangling
from segment_speed_utils.project_vars import RT_SCHED_GCS, SEGMENT_GCS

In [32]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [33]:
analysis_date = rt_dates.DATES["dec2023"]

In [34]:
with open("config.yml") as f:
    config_dict = yaml.safe_load(f)
    
config_dict

{'trip_metrics': 'vp_trip/trip_metrics',
 'route_direction_metrics': 'vp_route_dir/route_direction_metrics'}

In [35]:
EXPORT_FILE = config_dict["trip_metrics"]

In [36]:
# Read the existing trip-level metrics and re-export them
# under the path set in config.yml
df = pd.read_parquet(
    f"{RT_SCHED_GCS}trip_level_metrics/{analysis_date}_metrics.parquet"
)
df.to_parquet(f"{RT_SCHED_GCS}{EXPORT_FILE}_{analysis_date}.parquet")

In [37]:
df.head(2)

Unnamed: 0,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape,speed_mph,service_minutes,pings_per_min,spatial_accuracy_pct,rt_triptime_w_gtfs_pct,rt_v_scheduled_trip_time_pct
0,5d25a4366c173007d9c29fdead0299d7,74.03,73,216,74,216.0,148.0,21.01,58.0,2.92,68.52,99.95,27.64
1,4b72b80fc9cfe5e613bab95585cbe7e4,23.45,21,59,23,59.0,19.0,54.95,58.0,2.52,32.2,98.08,-59.57


In [38]:
gtfs_key = dd.read_parquet(
    f"{SEGMENT_GCS}vp_usable_{analysis_date}",
    columns = ["schedule_gtfs_dataset_key", "trip_instance_key"]
).drop_duplicates().compute()

In [39]:
gtfs_key.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key
29901,63029a23cb0e73f2a5d98a345c5e2e40,5d25a4366c173007d9c29fdead0299d7
30117,63029a23cb0e73f2a5d98a345c5e2e40,4b72b80fc9cfe5e613bab95585cbe7e4


In [40]:
df2 = pd.merge(
    gtfs_key,
    df,
    on = "trip_instance_key",
    how = "inner",
)

In [41]:
df2.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape,speed_mph,service_minutes,pings_per_min,spatial_accuracy_pct,rt_triptime_w_gtfs_pct,rt_v_scheduled_trip_time_pct
0,63029a23cb0e73f2a5d98a345c5e2e40,5d25a4366c173007d9c29fdead0299d7,74.03,73,216,74,216.0,148.0,21.01,58.0,2.92,68.52,99.95,27.64
1,63029a23cb0e73f2a5d98a345c5e2e40,4b72b80fc9cfe5e613bab95585cbe7e4,23.45,21,59,23,59.0,19.0,54.95,58.0,2.52,32.2,98.08,-59.57


In [42]:
trip_to_route = helpers.import_scheduled_trips(
    analysis_date,
    columns = ["trip_instance_key", "route_id", "direction_id"],
    get_pandas = True
)

In [43]:
trip_to_route.head(2)

Unnamed: 0,trip_instance_key,route_id,direction_id
0,595914b0c046d093f4fd5f9e88ab5635,3402,1.0
1,5ad8f3475c016f517dcb2611ccd69764,3402,1.0


In [44]:
# left_only rows are trips in vp, but not in schedule
# Their route_id and direction_id come back missing and are filled in below
df3 = pd.merge(
    df2,
    trip_to_route,
    on = "trip_instance_key",
    how = "left",
    indicator = "sched_rt_category"
)

In [45]:
df3.sched_rt_category.value_counts()

both          77977
left_only      8151
right_only        0
Name: sched_rt_category, dtype: int64

In [46]:
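df3 = df3.assign(
    route_id = df3.route_id.fillna("Unknown"),
    direction_id = df3.direction_id.astype("Int64"),
    # left_only = in vp but not in schedule; everything else matched schedule
    # map over the single column instead of a row-wise apply
    sched_rt_category = df3.sched_rt_category.astype(str).map(
        lambda x: "vp_only" if x == "left_only" else "vp_sched"
    )
)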

In [47]:
df3.sched_rt_category.value_counts()

vp_sched    77977
vp_only      8151
Name: sched_rt_category, dtype: int64

In [48]:
df3.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape,speed_mph,service_minutes,pings_per_min,spatial_accuracy_pct,rt_triptime_w_gtfs_pct,rt_v_scheduled_trip_time_pct,route_id,direction_id,sched_rt_category
0,63029a23cb0e73f2a5d98a345c5e2e40,5d25a4366c173007d9c29fdead0299d7,74.03,73,216,74,216.0,148.0,21.01,58.0,2.92,68.52,99.95,27.64,3428,1,vp_sched
1,63029a23cb0e73f2a5d98a345c5e2e40,4b72b80fc9cfe5e613bab95585cbe7e4,23.45,21,59,23,59.0,19.0,54.95,58.0,2.52,32.2,98.08,-59.57,3428,1,vp_sched


In [49]:
gtfs_schedule_wrangling.get_trip_time_buckets(
    analysis_date
).head(2)

Unnamed: 0,trip_instance_key,service_hours,trip_first_departure_datetime_pacific,time_of_day,service_minutes
0,595914b0c046d093f4fd5f9e88ab5635,0.55,2023-12-13 18:35:00,PM Peak,33.0
1,5ad8f3475c016f517dcb2611ccd69764,0.55,2023-12-13 19:05:00,PM Peak,33.0


In [50]:
time_of_day = gtfs_schedule_wrangling.get_trip_time_buckets(
    analysis_date
).pipe(
    gtfs_schedule_wrangling.add_peak_offpeak_column
)

In [51]:
time_of_day.head(2)

Unnamed: 0,trip_instance_key,service_hours,trip_first_departure_datetime_pacific,time_of_day,service_minutes,peak_offpeak
0,595914b0c046d093f4fd5f9e88ab5635,0.55,2023-12-13 18:35:00,PM Peak,33.0,peak
1,5ad8f3475c016f517dcb2611ccd69764,0.55,2023-12-13 19:05:00,PM Peak,33.0,peak


In [52]:
df4 = pd.merge(
    df3.drop(columns = "service_minutes"),
    time_of_day,
    on  = "trip_instance_key",
    how = "inner"
)

In [53]:
df4.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape,speed_mph,pings_per_min,spatial_accuracy_pct,rt_triptime_w_gtfs_pct,rt_v_scheduled_trip_time_pct,route_id,direction_id,sched_rt_category,service_hours,trip_first_departure_datetime_pacific,time_of_day,service_minutes,peak_offpeak
0,63029a23cb0e73f2a5d98a345c5e2e40,5d25a4366c173007d9c29fdead0299d7,74.03,73,216,74,216.0,148.0,21.01,2.92,68.52,99.95,27.64,3428,1,vp_sched,0.97,2023-12-13 05:34:00,Early AM,58.0,offpeak
1,63029a23cb0e73f2a5d98a345c5e2e40,4b72b80fc9cfe5e613bab95585cbe7e4,23.45,21,59,23,59.0,19.0,54.95,2.52,32.2,98.08,-59.57,3428,1,vp_sched,0.97,2023-12-13 06:34:00,Early AM,58.0,offpeak


Base the aggregation on this:
https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/segment_speed_utils/segment_calcs.py#L89-L132

In [54]:
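def weighted_average_function(
    df: pd.DataFrame, 
    group_cols: list
) -> pd.DataFrame:
    """
    Sum the numerator and denominator columns for each group
    (group_cols + peak_offpeak), so ratios can be calculated
    after aggregation instead of averaging trip-level ratios.
    """
    df2 = (df.groupby(group_cols + ["peak_offpeak"])
           .agg({
               # number of trips in the group
               "trip_instance_key": "count",
               #"rt_service_min": "mean", # can't use this twice...
               # only if we move this to portfolio_utils.aggregate()

               # weighted average for trip updates
               "total_min_w_gtfs": "sum",
               "rt_service_min": "sum",

               # weighted average of pings per min
               "total_pings_for_trip": "sum",
               # open question: is it this one or rt_service_min?
               # (trip-level pings_per_min above matches
               # total_pings_for_trip / rt_service_min: 216 / 74.03 = 2.92)
               "service_minutes": "sum",

               # weighted spatial accuracy
               "total_vp": "sum",
               "vp_in_shape": "sum",
           }).reset_index()
          )

    return df2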

In [55]:
weighted_average_function(
    df4, 
    ["schedule_gtfs_dataset_key", "route_id", 
     "direction_id", "sched_rt_category"]
)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,sched_rt_category,peak_offpeak,trip_instance_key,total_min_w_gtfs,rt_service_min,total_pings_for_trip,service_minutes,total_vp,vp_in_shape
0,015d67d5b75b5cf2b710bbadadfb75f5,17,0,vp_sched,offpeak,10,729,732.98,2153,567.0,2153.0,1708.0
1,015d67d5b75b5cf2b710bbadadfb75f5,17,0,vp_sched,peak,12,808,839.93,2388,690.0,2388.0,2195.0
2,015d67d5b75b5cf2b710bbadadfb75f5,17,1,vp_sched,offpeak,11,675,697.97,1992,569.0,1992.0,1705.0
3,015d67d5b75b5cf2b710bbadadfb75f5,17,1,vp_sched,peak,11,596,618.78,1757,595.0,1757.0,1693.0
4,015d67d5b75b5cf2b710bbadadfb75f5,219,0,vp_sched,offpeak,9,183,242.5,533,152.0,533.0,428.0
5,015d67d5b75b5cf2b710bbadadfb75f5,219,0,vp_sched,peak,10,170,169.43,490,168.0,490.0,436.0
6,015d67d5b75b5cf2b710bbadadfb75f5,219,1,vp_sched,offpeak,9,202,201.92,589,126.0,589.0,578.0
7,015d67d5b75b5cf2b710bbadadfb75f5,219,1,vp_sched,peak,12,278,279.62,816,184.0,816.0,796.0
8,015d67d5b75b5cf2b710bbadadfb75f5,22,0,vp_sched,offpeak,14,803,809.65,2389,569.0,664.0,522.0
9,015d67d5b75b5cf2b710bbadadfb75f5,22,0,vp_sched,peak,12,641,654.38,1895,484.0,501.0,369.0


In [56]:
def calculate_percent_normalized_metrics(df: pd.DataFrame):
    # metrics like pings per minute, percent of trip with RT
    # should be calculated after aggregation
    # can be done at trip-level, can be done after sums are taken for route-direction
    # do not do simple averages in aggregation
    return
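
A possible shape for this function, as a sketch only. The formulas are inferred from the trip-level values shown earlier in this notebook (for example 216 / 74.03 = 2.92 pings per minute and 148 / 216 = 68.52% spatial accuracy), not taken from the upstream script, and the name `calculate_metrics_sketch` is hypothetical. Because it only divides summed columns, the same function should work on the trip table and on the route-direction sums.

```python
import pandas as pd

def calculate_metrics_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """
    Hypothetical helper: derive ratio metrics from count / minute columns.
    Assumes the columns below are present, either per trip or summed per group.
    """
    return df.assign(
        pings_per_min = df.total_pings_for_trip / df.rt_service_min,
        spatial_accuracy_pct = df.vp_in_shape / df.total_vp * 100,
        rt_triptime_w_gtfs_pct = df.total_min_w_gtfs / df.rt_service_min * 100,
        rt_v_scheduled_trip_time_pct = (
            (df.rt_service_min - df.service_minutes) / df.service_minutes * 100
        ),
    )

# First row: trip-level values from df.head(2)
# Second row: first route-direction row from the aggregation output
sample = pd.DataFrame({
    "total_min_w_gtfs": [74, 729],
    "rt_service_min": [74.03, 732.98],
    "total_pings_for_trip": [216, 2153],
    "service_minutes": [58.0, 567.0],
    "total_vp": [216.0, 2153.0],
    "vp_in_shape": [148.0, 1708.0],
})

metrics = calculate_metrics_sketch(sample)
print(metrics.round(2))
```

Zero denominators are not handled here; a group with no service minutes or no vp would produce `inf` or `NaN` and may need masking.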