## Update `trips`
* Setup: `cd rt_segment_speeds && pip install -r requirements.txt && cd ../_shared_utils && make setup_env`
* https://github.com/cal-itp/data-analyses/pull/1016
    * Keep source data + metrics tightly defined with GCS bucket organization.
    * `vp_usable` is the source data for the rt_vs_sched metrics. Do not merge in schedule data until the gtfs_digest report; only bring in the `schedule_gtfs_dataset_key` column.
    * Add `route_id` and `direction_id` to `vp_usable` for trips that are also present in the schedule. If a trip is not in the schedule, fill `route_id` with "Unknown" and keep `direction_id` as a nullable Int64.
    * Add function to concatenate trip file, enable us to put in 1 day or 7 days for aggregation
    * A single function for normalized metrics (percent, per min, etc)
    * A single function for aggregation (summing up numerator / denominator)
    
* https://github.com/cal-itp/data-analyses/issues/989

* Notes 2/6
    * GTFS digest creates four datasets: schedule, average speeds, segment speeds, and rt vs schedule
    * Currently, merging is challenging.
    * Time categories are not necessarily the same (peak/offpeak/all-day)
    * Want all datasets to merge on the same set of columns (schedule gtfs key, route id, dir id, service date, and time categories) because `shapes` are unstable.
    * `Route ID` has been stabilized by Tiffany 
    * Update work from `rt_v_scheduled.py` (steps already outlined in `scripts/route_aggregation.ipynb`)
        * Run the steps up to row 339, where the percentages are calculated.
        * Take away `speeds`.
        * Save this "pre-metric" data somewhere since it takes so long to run?
        * Bring in schedule gtfs key, trip instance key, route id, direction id either at the beginning or the end using `helpers.import_scheduled_trips`
        * Coerce `direction_id` to Int64; don't fill it in with 0. A missing value is NaN, not 0.
        * Save files with the analysis date at the end instead of the beginning.
        * Split off the workstream -> one for trip level and one for route level
            * Use the config.yml to save the trips and routes stuff into their own folder.
            * Routes:
                * For routes, total up the minutes and pings before computing metrics. Taking the average of per-trip averages is not accurate because it weights short and long trips equally.
                * The route level should be able to take multiple days of data and concatenate so we can get metrics for a week/2 weeks/etc instead of for a single day. [Done here](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/scripts/average_speeds.py)
                * Add the route frequency as well?
            * Trips:
                * Do up to row 339 in `rt_v_scheduled.py`
                * Write a new generalized function to create all the percentages
            
* Notes 2/12
    * Figure out how to set up Config file
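
The file-organization points above (analysis date at the end of the filename, separate folders for trips and routes) could be sketched as a small path builder. This is only an illustration: the config keys, folder names, example bucket, and the `build_path` helper are all hypothetical, and the dictionary stands in for whatever `config.yml` ends up containing once it is loaded.

```python
# Hypothetical stand-in for a parsed config.yml; the keys and folder names are
# assumptions for illustration, not the project's actual configuration.
CONFIG = {
    "trip_metrics": {"dir": "vp_trip", "file": "trip_summary"},
    "route_metrics": {"dir": "vp_route_dir", "file": "route_direction_summary"},
}


def build_path(bucket: str, config: dict, level: str, analysis_date: str) -> str:
    """Return {bucket}{dir}/{file}_{analysis_date}.parquet for one level."""
    entry = config[level]
    # The analysis date goes at the end of the filename, per the notes.
    return f"{bucket}{entry['dir']}/{entry['file']}_{analysis_date}.parquet"


trip_path = build_path("gs://example-bucket/", CONFIG, "trip_metrics", "2023-12-01")
route_path = build_path("gs://example-bucket/", CONFIG, "route_metrics", "2023-11-15")
print(trip_path)
print(route_path)
```

Keeping the folder and file stem in the config means the trip-level and route-level scripts only differ in which key they read.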

In [73]:
import dask.dataframe as dd
import pandas as pd
import yaml
from segment_speed_utils import gtfs_schedule_wrangling, helpers
from segment_speed_utils.project_vars import RT_SCHED_GCS, SEGMENT_GCS
from shared_utils import rt_dates

In [74]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [75]:
# analysis_date = rt_dates.DATES["dec2023"]

### Load in `rt_v_scheduled_trip` functions

In [76]:
dec_df = pd.read_parquet("./ah_testing_2023-12-01.parquet")

In [77]:
dec_df.sample()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape
53008,baeeb157e85a901e47b828ef9fe75091,d64ae007985922e56621dbbd9cee440d,25.68,24,55,25,55.0,55.0


In [78]:
nov_df = pd.read_parquet("./ah_testing_2023-11-15.parquet")

In [79]:
nov_df.sample()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape
64508,cf0f7df88da36cd9ca4248eb1d6a0f39,8f50ebbbf74960f56852f0bca654644a,43.37,43,128,43,128.0,124.0


### Add back routes-schedule-trip instance
* This will go into rt_v_scheduled.trip

In [80]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/"
COMPILED_CACHED_VIEWS = f"{GCS_FILE_PATH}rt_delay/compiled_cached_views/"

In [81]:
# FILE = f"{COMPILED_CACHED_VIEWS}trips_{analysis_date}.parquet"
# RENAME_DICT = {"gtfs_dataset_key": "schedule_gtfs_dataset_key"}

In [82]:
# routes_df_og = pd.read_parquet(FILE)

In [83]:
# routes_df_og.sample()

#### Notes/Questions
* How come we want to do the inner join?
    `df3 = pd.merge(df2, time_buckets, on=["trip_instance_key"], how="inner")`
* Am I supposed to add back speeds?
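
On the inner-join question: the outputs in this notebook show every `vp_only` trip disappearing after the merge with `time_buckets`, which is consistent with the time buckets covering only scheduled trips. A toy sketch with made-up keys (not project data) shows the difference between the two join types:

```python
import pandas as pd

# Toy data: trip "c" has vehicle positions but no scheduled time bucket.
vp = pd.DataFrame({
    "trip_instance_key": ["a", "b", "c"],
    "sched_rt_category": ["vp_sched", "vp_sched", "vp_only"],
})
time_buckets = pd.DataFrame({
    "trip_instance_key": ["a", "b"],
    "peak_offpeak": ["peak", "offpeak"],
})

# Inner join keeps only trips present in both frames.
inner = pd.merge(vp, time_buckets, on="trip_instance_key", how="inner")
# Left join keeps every vp trip; unmatched trips get a missing peak_offpeak.
left = pd.merge(vp, time_buckets, on="trip_instance_key", how="left")

print(inner.sched_rt_category.value_counts().to_dict())
print(left.peak_offpeak.isna().sum())
```

If `vp_only` trips need to survive into the metrics, a left join plus an explicit fill for `peak_offpeak` would be one option; whether that is wanted is still an open question here.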

In [84]:
def temp_function(df: pd.DataFrame, analysis_date: str) -> pd.DataFrame:
    """
    Add route_id, direction_id, and peak_offpeak to trip-level RT metrics.
    """
    # Scheduled trips for the date. The merge below relies on the key
    # coming back named schedule_gtfs_dataset_key.
    routes_df = helpers.import_scheduled_trips(
        analysis_date,
        columns=[
            "gtfs_dataset_key",
            "route_id",
            "direction_id",
            "trip_instance_key",
        ],
        get_pandas=True,
    )

    # Left merge keeps trips that have vehicle positions but no schedule match.
    df2 = pd.merge(
        df,
        routes_df,
        on=["schedule_gtfs_dataset_key", "trip_instance_key"],
        how="left",
        indicator="sched_rt_category",
    )

    df2 = df2.assign(
        route_id=df2.route_id.fillna("Unknown"),
        # Nullable integer: unmatched trips stay <NA> instead of becoming 0.
        direction_id=df2.direction_id.astype("Int64"),
        sched_rt_category=df2.apply(
            lambda x: "vp_only" if x.sched_rt_category == "left_only" else "vp_sched",
            axis=1,
        ),
    )

    display(df2.sched_rt_category.value_counts())

    time_buckets = (
        gtfs_schedule_wrangling.get_trip_time_buckets(analysis_date)[
            ["trip_instance_key", "time_of_day", "service_minutes"]
        ]
        .pipe(gtfs_schedule_wrangling.add_peak_offpeak_column)[
            ["trip_instance_key", "service_minutes", "peak_offpeak"]
        ]
    )
    #### How come we want to do the inner join?
    df3 = pd.merge(df2, time_buckets, on=["trip_instance_key"], how="inner")
    return df3
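
The `direction_id` handling in the function above relies on pandas' nullable `Int64` dtype, which keeps missing values as `<NA>` instead of forcing a float column or a misleading 0. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# After a left merge, unmatched trips have NaN direction_id,
# which makes the column float.
direction_id = pd.Series([0.0, 1.0, np.nan])

# Nullable integer: values become integers, missing stays <NA>.
nullable = direction_id.astype("Int64")

# Filling with 0 silently turns "unknown direction" into direction 0.
filled = direction_id.fillna(0).astype(int)

print(nullable.dtype, int(nullable.isna().sum()))
print(filled.tolist())
```

With `fillna(0)`, the unmatched trip becomes indistinguishable from a real direction 0 trip, which is why the notes say not to fill.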

In [85]:
rt_dates.DATES["nov2023"]

'2023-11-15'

In [86]:
nov_df2 = temp_function(nov_df, rt_dates.DATES["nov2023"])

vp_sched    78190
vp_only      8642
Name: sched_rt_category, dtype: int64

In [87]:
nov_df2.sample()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape,route_id,direction_id,sched_rt_category,service_minutes,peak_offpeak
11360,9809d3f8121513057bc5cb8de7b54ce2,9c6e52563021b64280d19f1eb6724561,33.68,9,26,16,26.0,25.0,034-131,0,vp_sched,24.0,offpeak


In [88]:
nov_df2.sched_rt_category.value_counts()

vp_sched    78190
Name: sched_rt_category, dtype: int64

In [89]:
dec_df2 = temp_function(dec_df, rt_dates.DATES["dec2023"])

vp_sched    77977
vp_only      8151
Name: sched_rt_category, dtype: int64

In [90]:
dec_df2.sched_rt_category.value_counts()

vp_sched    77977
Name: sched_rt_category, dtype: int64

In [91]:
dec_df2.to_parquet("./concat_test_2023-12-01.parquet")

In [92]:
nov_df2.to_parquet("./concat_test_2023-11-15.parquet")

In [93]:
dec_df2.sample()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape,route_id,direction_id,sched_rt_category,service_minutes,peak_offpeak
13180,1c7027faabfeec976ea388973100bcf3,eead101b073c18844020a706b249f6dd,39.5,25,57,32,57.0,57.0,40,1,vp_sched,28.0,offpeak


### Trips: add back metrics

In [94]:
def add_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate per-minute and percentage metrics from the totals,
    then drop the totals that are no longer needed.
    """
    # Work on a copy so the caller's dataframe is not modified in place.
    df = df.copy()

    df["pings_per_min"] = df.total_pings_for_trip / df.rt_service_min
    df["spatial_accuracy_pct"] = (df.vp_in_shape / df.total_vp) * 100
    df["rt_triptime_w_gtfs_pct"] = (df.total_min_w_gtfs / df.rt_service_min) * 100
    df["rt_v_scheduled_trip_time_pct"] = (
        df.rt_service_min / df.service_minutes - 1
    ) * 100

    # Cap rt_triptime_w_gtfs_pct at 100%. Note that fillna(100) also sets
    # any missing values (not just the masked ones) to 100.
    df.rt_triptime_w_gtfs_pct = df.rt_triptime_w_gtfs_pct.mask(
        df.rt_triptime_w_gtfs_pct > 100
    ).fillna(100)

    drop_cols = [
        "total_pings_for_trip",
        "vp_in_shape",
        "total_vp",
        "total_min_w_gtfs",
    ]
    return df.drop(columns=drop_cols)
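
The notes call for a single generalized function to create the percentage columns. One possible shape, as a sketch only: `add_pct_column` and its parameters are hypothetical names, and treating a zero denominator as missing (rather than infinite) is an assumption, not a decision recorded in these notes.

```python
from typing import Optional

import numpy as np
import pandas as pd


def add_pct_column(
    df: pd.DataFrame,
    numerator: str,
    denominator: str,
    new_col: str,
    cap: Optional[float] = None,
) -> pd.DataFrame:
    """Return a copy of df with new_col = numerator / denominator * 100."""
    # Zero denominators become NaN so the result is missing rather than inf.
    denom = df[denominator].replace(0, np.nan)
    pct = df[numerator] / denom * 100
    if cap is not None:
        # clip leaves NaN untouched, so missing values stay missing.
        pct = pct.clip(upper=cap)
    return df.assign(**{new_col: pct})


toy = pd.DataFrame({
    "vp_in_shape": [50.0, 10.0, 30.0],
    "total_vp": [100.0, 0.0, 25.0],
})
out = add_pct_column(toy, "vp_in_shape", "total_vp", "spatial_accuracy_pct", cap=100)
print(out)
```

Unlike the `mask(...).fillna(100)` pattern, `clip` caps values above the limit without also converting missing values to 100.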

In [95]:
dec_trip = add_metrics(dec_df2)

In [96]:
nov_trip = add_metrics(nov_df2)

In [97]:
nov_trip.sample()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,route_id,direction_id,sched_rt_category,service_minutes,peak_offpeak,pings_per_min,spatial_accuracy_pct,rt_triptime_w_gtfs_pct,rt_v_scheduled_trip_time_pct
39499,d9272b05e39a35ce5f7e774170e94ff1,0a9687e156cf0f219ce259d3a0490b99,52.97,52,47,0,vp_sched,46.0,peak,2.96,91.72,100.0,15.14


### Routes: add multiple days

In [98]:
def concatenate_trip_segment_speeds(
    analysis_date_list: list
) -> pd.DataFrame:
    """
    Concatenate the trip parquets together,
    whether it's for single day or multi-day averages.
    Adds a service_date column holding each file's analysis date.
    """
    # Version reading from GCS, kept for reference:
    # SPEED_FILE = dict_inputs["stage4"]
    #
    # df = pd.concat([
    #     pd.read_parquet(
    #         f"{SEGMENT_GCS}{SPEED_FILE}_{analysis_date}.parquet").assign(
    #         service_date = pd.to_datetime(analysis_date)
    #     ) for analysis_date in analysis_date_list],
    #     axis=0, ignore_index = True
    # )

    # Local test version: read the files saved earlier in this notebook.
    df = pd.concat([
        pd.read_parquet(
            f"./concat_test_{analysis_date}.parquet").assign(
            service_date = pd.to_datetime(analysis_date)
        ) for analysis_date in analysis_date_list], 
        axis=0, ignore_index = True
    )
    return df

In [99]:
all_routes = concatenate_trip_segment_speeds(['2023-11-15', '2023-12-01'])

In [100]:
all_routes.sched_rt_category.value_counts()

vp_sched    156167
Name: sched_rt_category, dtype: int64

In [101]:
all_routes.head()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape,route_id,direction_id,sched_rt_category,service_minutes,peak_offpeak,service_date
0,63029a23cb0e73f2a5d98a345c5e2e40,56f15f118776aaafbf3a1c69c5821c14,62.38,62,185,63,185.0,144.0,3428,1,vp_sched,58.0,offpeak,2023-11-15
1,63029a23cb0e73f2a5d98a345c5e2e40,4244cbaa19bdbc3f6e4cc95cb792ccb0,67.7,67,201,68,201.0,147.0,3428,1,vp_sched,58.0,offpeak,2023-11-15
2,63029a23cb0e73f2a5d98a345c5e2e40,ce51c00d412991d09ad1de4ea2715f6e,127.38,127,377,127,377.0,207.0,3428,0,vp_sched,58.0,peak,2023-11-15
3,63029a23cb0e73f2a5d98a345c5e2e40,d01f03119c56bdda01210558a6f25ec2,152.02,151,449,151,449.0,186.0,3428,0,vp_sched,58.0,peak,2023-11-15
4,63029a23cb0e73f2a5d98a345c5e2e40,90e793547709584c8921f0786f9d310f,76.3,75,227,76,227.0,124.0,3429,1,vp_sched,55.0,offpeak,2023-11-15


In [102]:
all_routes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156167 entries, 0 to 156166
Data columns (total 14 columns):
 #   Column                       Non-Null Count   Dtype         
---  ------                       --------------   -----         
 0   schedule_gtfs_dataset_key    156167 non-null  object        
 1   trip_instance_key            156167 non-null  object        
 2   rt_service_min               156167 non-null  float64       
 3   min_w_atleast2_trip_updates  156167 non-null  int64         
 4   total_pings_for_trip         156167 non-null  int64         
 5   total_min_w_gtfs             156167 non-null  int64         
 6   total_vp                     149500 non-null  float64       
 7   vp_in_shape                  149500 non-null  float64       
 8   route_id                     156167 non-null  object        
 9   direction_id                 153098 non-null  Int64         
 10  sched_rt_category            156167 non-null  object        
 11  service_minutes           

#### Add back metrics

In [103]:
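def weighted_average_function(df: pd.DataFrame, group_cols: list):
    """
    Sum the totals by group_cols + peak_offpeak, count trips,
    then recompute the metrics from the summed totals.
    """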
    sum_cols = [
        "total_min_w_gtfs",
        "rt_service_min",
        "total_pings_for_trip",
        "service_minutes",
        "total_vp",
        "vp_in_shape",
    ]

    count_cols = ["trip_instance_key"]
    df2 = (
        df.groupby(group_cols + ["peak_offpeak"])
        .agg({**{e: "sum" for e in sum_cols}, **{e: "count" for e in count_cols}})
        .reset_index()
    )
    
    df2 = df2.rename(columns = {'trip_instance_key':'n_trips'})
    
    df2 = add_metrics(df2)
    
    return df2
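
Summing numerators and denominators before dividing, as the function above does, weights each trip by its length; averaging per-trip percentages does not. A toy illustration with made-up numbers:

```python
import pandas as pd

# Two trips on one route: a long trip with good coverage,
# and a short trip with poor coverage.
trips = pd.DataFrame({
    "route_id": ["17", "17"],
    "total_min_w_gtfs": [90, 5],
    "rt_service_min": [100.0, 10.0],
})

# Average of per-trip percentages: each trip counts equally.
mean_of_pcts = (trips.total_min_w_gtfs / trips.rt_service_min * 100).mean()

# Sum first, then divide: each minute of service counts equally.
sums = trips.groupby("route_id")[["total_min_w_gtfs", "rt_service_min"]].sum()
pct_of_sums = (sums.total_min_w_gtfs / sums.rt_service_min * 100).iloc[0]

print(round(mean_of_pcts, 2), round(pct_of_sums, 2))
```

Here the mean of the two trip percentages is 70.0, while the summed version is about 86.36, because the short trip no longer carries the same weight as the long one.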

In [104]:
all_routes2 = weighted_average_function(
    all_routes,
    ["schedule_gtfs_dataset_key", "route_id", "direction_id", "sched_rt_category"],
)

In [105]:
all_routes2.columns

Index(['schedule_gtfs_dataset_key', 'route_id', 'direction_id',
       'sched_rt_category', 'peak_offpeak', 'rt_service_min',
       'service_minutes', 'n_trips', 'pings_per_min', 'spatial_accuracy_pct',
       'rt_triptime_w_gtfs_pct', 'rt_v_scheduled_trip_time_pct'],
      dtype='object')

#### rt_v_scheduled_trip_time_pct -> just delete entirely?
* This measures trip timeliness: how much longer (or shorter) a trip took according to RT data than its scheduled duration, expressed as a percentage.
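
As a worked check, the formula from `add_metrics`, `(rt_service_min / service_minutes - 1) * 100`, can be applied to two of the rounded route 17 totals printed in this notebook. The helper name `rt_v_scheduled_pct` is hypothetical and exists only for this illustration.

```python
def rt_v_scheduled_pct(rt_service_min: float, service_minutes: float) -> float:
    """Percent by which RT trip time exceeds (+) or undercuts (-) scheduled time."""
    return (rt_service_min / service_minutes - 1) * 100


# Rounded totals copied from the route 17 output in this notebook.
slower = round(rt_v_scheduled_pct(1539.42, 1134.0), 2)
faster = round(rt_v_scheduled_pct(212.5, 224.0), 2)
print(slower)  # 35.75
print(faster)  # -5.13
```

A positive value means the trips ran longer than scheduled and a negative value means they ran shorter, so the extreme maximum in the `describe()` output points to groups where scheduled minutes are very small relative to RT minutes.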

In [110]:
all_routes2.loc[all_routes2.route_id == '17']

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,sched_rt_category,peak_offpeak,rt_service_min,service_minutes,n_trips,pings_per_min,spatial_accuracy_pct,rt_triptime_w_gtfs_pct,rt_v_scheduled_trip_time_pct
0,015d67d5b75b5cf2b710bbadadfb75f5,17,0,vp_sched,offpeak,1539.42,1134.0,20,2.91,75.62,98.28,35.75
1,015d67d5b75b5cf2b710bbadadfb75f5,17,0,vp_sched,peak,1723.65,1380.0,24,2.84,90.06,96.02,24.9
2,015d67d5b75b5cf2b710bbadadfb75f5,17,1,vp_sched,offpeak,1433.97,1138.0,22,2.79,86.59,94.42,26.01
3,015d67d5b75b5cf2b710bbadadfb75f5,17,1,vp_sched,peak,1309.88,1190.0,22,2.8,95.43,95.12,10.07
445,1ebafaca8716652559b2017b6eedc4ef,17,0,vp_sched,peak,212.5,224.0,6,1.97,100.0,99.76,-5.13
521,239f3baf3dd3b9e9464f66a777f9897d,17,0,vp_sched,offpeak,190.1,122.0,11,1.11,80.09,86.27,55.82
522,239f3baf3dd3b9e9464f66a777f9897d,17,0,vp_sched,peak,230.85,134.0,12,1.07,74.19,84.04,72.28
523,239f3baf3dd3b9e9464f66a777f9897d,17,1,vp_sched,offpeak,147.85,171.0,9,1.23,73.63,92.66,-13.54
524,239f3baf3dd3b9e9464f66a777f9897d,17,1,vp_sched,peak,145.65,171.0,9,1.24,80.0,94.75,-14.82
1721,43d8d305ee692724a532f30ea63a1cbe,17,1,vp_sched,offpeak,2787.37,1944.0,36,1.91,96.6,96.97,43.38


In [107]:
all_routes2.rt_v_scheduled_trip_time_pct.describe()

count   6276.00
mean      39.02
std       72.62
min      -76.17
25%       14.84
50%       27.27
75%       44.34
max     2320.67
Name: rt_v_scheduled_trip_time_pct, dtype: float64

In [108]:
len(all_routes2)

6276