## Update `trips`
* Setup: `cd rt_segment_speeds && pip install -r requirements.txt && cd ../_shared_utils && make setup_env`
* https://github.com/cal-itp/data-analyses/pull/1016
    * Keep source data + metrics tightly defined with GCS bucket organization.
    * `vp_usable` is the source data for the rt_vs_sched metrics; do not merge in schedule data until the gtfs_digest report. Only bring in the `schedule_gtfs_dataset_key` column in `vp_usable`, plus `route_id` and `direction_id` for trips that are also present in the schedule. For trips not in the schedule, fill `route_id` with `Unknown` and keep `direction_id` as a nullable `Int64`.
    * Add a function to concatenate trip files, so that 1 day or 7 days can be passed in for aggregation
    * A single function for normalized metrics (percent, per min, etc)
    * A single function for aggregation (summing up numerator / denominator)
    
* https://github.com/cal-itp/data-analyses/issues/989

* Notes 2/6
    * GTFS digest creates four datasets: schedule, average speeds, segment speeds, and rt vs schedule
    * Currently, merging is challenging.
    * Time categories are not necessarily the same (peak/offpeak/all-day)
    * Want all datasets to merge on the same set of columns (schedule gtfs key, route id, dir id, service date, and time categories) because `shapes` are unstable.
    * `Route ID` has been stabilized by Tiffany 
    * Update work from `rt_v_scheduled.py` (steps already outlined in `scripts/route_aggregation.ipynb`)
        * Do the steps up to row 339, where the percentages are calculated.
        * Take away `speeds`.
        * Save this "pre-metric" data somewhere, since it takes so long to run?
        * Bring in the schedule gtfs key, trip instance key, route id, and direction id, either at the beginning or the end, using `helpers.import_scheduled_trips`
        * Coerce `direction_id` to `Int64`; don't fill it in with 0. A missing direction is not 0, it's NaN.
        * Save files with the analysis date at the end of the file name instead of the beginning.
        * Split off the workstream -> one for trip level and one for route level
            * Use the config.yml to save the trips and routes stuff into their own folder.
            * Routes:
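                * For routes, the minutes and pings should be totalled up before any metric is calculated. Currently, the route value is an average of trip-level averages, which weights a short trip the same as a long one and so isn't accurate.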
                * The route level should be able to take multiple days of data and concatenate so we can get metrics for a week/2 weeks/etc instead of for a single day. [Done here](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/scripts/average_speeds.py)
                * Add the route frequency as well?
            * Trips:
                * Do up to step 339 in `rt_v_scheduled.py`
                * Write a new generalized function to create all the %
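
To make the route-level note above concrete, here is a small sketch of why an average of trip-level averages differs from a ratio of totals. The two toy trips and the name `toy_trips` are made up for illustration; only the column names come from this notebook.

```python
import pandas as pd

# Two hypothetical trips on the same route: one long, one short.
toy_trips = pd.DataFrame(
    {
        "route_id": ["17", "17"],
        "total_pings_for_trip": [300, 20],
        "rt_service_min": [100.0, 20.0],
    }
)
toy_trips["pings_per_min"] = toy_trips.total_pings_for_trip / toy_trips.rt_service_min

# Average of averages: each trip counts equally, whatever its length.
mean_of_ratios = toy_trips.groupby("route_id").pings_per_min.mean()

# Ratio of totals: sum numerator and denominator first, then divide.
totals = toy_trips.groupby("route_id")[["total_pings_for_trip", "rt_service_min"]].sum()
ratio_of_totals = totals.total_pings_for_trip / totals.rt_service_min

print(mean_of_ratios["17"], ratio_of_totals["17"])
```

The first value is (3.0 + 1.0) / 2 = 2.0, while the second is 320 / 120, about 2.67, because the long trip contributes more minutes and pings to the totals.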
            
            

In [1]:
import dask.dataframe as dd
import pandas as pd
import yaml
from segment_speed_utils import gtfs_schedule_wrangling, helpers
from segment_speed_utils.project_vars import RT_SCHED_GCS, SEGMENT_GCS
from shared_utils import rt_dates

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [3]:
analysis_date = rt_dates.DATES["dec2023"]

### Load in `rt_v_scheduled_trip` functions

In [4]:
dec_df = pd.read_parquet("./ah_testing_dec_2023.parquet")

In [5]:
dec_df.sample()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape
69534,7cc0cb1871dfd558f11a2885c145d144,6326acd03cbd1e868a3adb64e0e66aed,39.75,39,117,40,117.0,87.0


### Add back routes-schedule-trip instance

In [6]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/"
COMPILED_CACHED_VIEWS = f"{GCS_FILE_PATH}rt_delay/compiled_cached_views/"

In [7]:
FILE = f"{COMPILED_CACHED_VIEWS}trips_{analysis_date}.parquet"
RENAME_DICT = {"gtfs_dataset_key": "schedule_gtfs_dataset_key"}

In [8]:
routes_df_og = pd.read_parquet(FILE)

In [9]:
routes_df_og.sample()

Unnamed: 0,feed_key,gtfs_dataset_key,name,regional_feed_type,service_date,trip_start_date_pacific,trip_id,trip_instance_key,route_key,route_id,route_type,route_short_name,route_long_name,route_desc,direction_id,shape_array_key,shape_id,trip_first_departure_datetime_pacific,trip_last_arrival_datetime_pacific,service_hours,trip_start_date_local_tz,trip_first_departure_datetime_local_tz,trip_last_arrival_datetime_local_tz
84810,82dd924efb837dcc0cd2c9ad2fb0f418,3f3f36b4c41cc6b5df3eb7f5d8ea6e3c,LA Metro Bus Schedule,,2023-12-13,2023-12-13,10460002001513-DEC23,580e5c5af3a8083d8f9df591fdc71017,886cd5ec3b256eb3743e11083948e5de,460-13172,3,460,Metro Express Line,DOWNTOWN LA - DISNEYLAND VIA HARBOR TWAY-105 FWY,0.0,939223c5ca048d378cfd8c039ff46f5b,4600200_DEC23,2023-12-13 15:13:00,2023-12-13 17:34:00,2.35,2023-12-13,2023-12-13 15:13:00,2023-12-13 17:34:00


In [10]:
routes_df = helpers.import_scheduled_trips(
    analysis_date,
    columns=[
        "gtfs_dataset_key",
        "route_id",
        "direction_id",
        "trip_instance_key",
    ],
    get_pandas=True,
)

In [11]:
pd.merge(
    dec_df,
    routes_df,
    on=["trip_instance_key"],
    how="outer",
    indicator="sched_rt_category",
)[["sched_rt_category"]].value_counts()

sched_rt_category
both                 77977
right_only           24342
left_only             8151
dtype: int64
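
For readers less familiar with the `indicator` argument used in the cell above, the following toy sketch shows how an outer merge labels each row by where its key was found. The two small dataframes are invented for illustration and stand in for `dec_df` and `routes_df`.

```python
import pandas as pd

# Hypothetical stand-ins: vehicle position trips and scheduled trips.
vp = pd.DataFrame({"trip_instance_key": ["a", "b", "c"], "total_vp": [10, 20, 30]})
sched = pd.DataFrame({"trip_instance_key": ["b", "c", "d"], "route_id": ["1", "2", "3"]})

audit = pd.merge(
    vp,
    sched,
    on="trip_instance_key",
    how="outer",
    indicator="sched_rt_category",  # column name for the merge label
)

# left_only = only in vp, right_only = only in sched, both = in each.
counts = audit.sched_rt_category.value_counts()
print(counts)
```

In the real output above, `left_only` rows are trips with vehicle positions but no scheduled match, and `right_only` rows are scheduled trips with no vehicle positions.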

In [12]:
dec_df2 = pd.merge(
    dec_df,
    routes_df,
    on=["schedule_gtfs_dataset_key", "trip_instance_key"],
    how="left",
    indicator="sched_rt_category",
)

In [13]:
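# Trips with vehicle positions but no schedule match get a placeholder route_id;
# direction_id stays missing (<NA>) rather than being filled with 0.
dec_df2 = dec_df2.assign(
    route_id=dec_df2.route_id.fillna("Unknown"),
    direction_id=dec_df2.direction_id.astype("Int64"),
    # After a left merge the indicator is only ever "left_only" or "both".
    sched_rt_category=dec_df2.sched_rt_category.astype(str).map(
        {"left_only": "vp_only", "both": "vp_sched"}
    ),
)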
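
As a quick check on the `Int64` coercion in the cell above, this sketch uses a made-up two-row dataframe to show that the nullable integer dtype keeps a missing `direction_id` as `<NA>` instead of turning it into 0.

```python
import pandas as pd

# Hypothetical result of a left merge: trip "b" had no schedule match.
merged = pd.DataFrame(
    {
        "trip_instance_key": ["a", "b"],
        "route_id": ["17", None],
        "direction_id": [0.0, float("nan")],  # float because of the NaN
    }
)

cleaned = merged.assign(
    route_id=merged.route_id.fillna("Unknown"),
    # Capital-I "Int64" is pandas' nullable integer dtype.
    direction_id=merged.direction_id.astype("Int64"),
)

print(cleaned.dtypes)
print(cleaned)
```

Trip "a" keeps a real direction of 0, and trip "b" stays missing, so the two cases remain distinguishable downstream.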

In [14]:
len(dec_df2)

86128

In [15]:
time_buckets = (
    gtfs_schedule_wrangling.get_trip_time_buckets(analysis_date)[
        ["trip_instance_key", "time_of_day", "service_minutes"]
    ]
    .pipe(gtfs_schedule_wrangling.add_peak_offpeak_column)[
        ["trip_instance_key", "service_minutes", "peak_offpeak"]
    ]
)

In [16]:
time_buckets.sample()

Unnamed: 0,trip_instance_key,service_minutes,peak_offpeak
70088,6b6ad088430e0a2d309a2c52b4e0825c,47.0,offpeak


In [17]:
pd.merge(dec_df2, time_buckets, on=["trip_instance_key"], how="outer", indicator=True)[
    ["_merge"]
].value_counts()

_merge    
both          77977
right_only    24342
left_only      8151
dtype: int64

#### Why do we want an inner join here?

In [18]:
dec_df3 = pd.merge(dec_df2, time_buckets, on=["trip_instance_key"], how="inner")

In [19]:
dec_df3.sample()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape,route_id,direction_id,sched_rt_category,service_minutes,peak_offpeak
14659,3f3f36b4c41cc6b5df3eb7f5d8ea6e3c,03a7307eb3dd2f4a9e559ef2b048155c,123.0,99,312,123,312.0,312.0,4-13172,0,vp_sched,114.0,peak


In [20]:
dec_df3.sched_rt_category.value_counts()

vp_sched    77977
Name: sched_rt_category, dtype: int64
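
The inner join leaves 77,977 rows, which is the `both` count from the earlier audit, so the 8,151 `vp_only` trips (86,128 minus 77,977) are no longer present. One possible reason to accept this, offered here as an assumption rather than the author's stated rationale, is that metrics comparing RT to scheduled time need `service_minutes`. The toy sketch below, with invented data, contrasts the two join types.

```python
import pandas as pd

# Hypothetical RT trips: "b" has no scheduled counterpart.
rt = pd.DataFrame(
    {
        "trip_instance_key": ["a", "b"],
        "rt_service_min": [50.0, 30.0],
        "sched_rt_category": ["vp_sched", "vp_only"],
    }
)
# Hypothetical scheduled time buckets: only trip "a" is scheduled.
time_buckets_toy = pd.DataFrame(
    {"trip_instance_key": ["a"], "service_minutes": [40.0]}
)

inner = rt.merge(time_buckets_toy, on="trip_instance_key", how="inner")
left = rt.merge(time_buckets_toy, on="trip_instance_key", how="left")

# With a left join the vp_only trip survives, but its schedule-based metric is NaN.
left["rt_v_scheduled_trip_time_pct"] = (
    left.rt_service_min / left.service_minutes - 1
) * 100

print(len(inner), len(left))
print(left)
```

A left join would keep `vp_only` trips available for metrics that need only vehicle positions, such as `pings_per_min`, at the cost of NaN values in the schedule-based columns.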

### Trips: add back metrics

In [21]:
def add_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate metrics from the numerator / denominator columns,
    then drop the raw inputs that are no longer needed.
    """
    # Use assign so the caller's dataframe is not modified in place.
    df = df.assign(
        pings_per_min=df.total_pings_for_trip / df.rt_service_min,
        spatial_accuracy_pct=(df.vp_in_shape / df.total_vp) * 100,
        # Cap at 100%: total_min_w_gtfs can slightly exceed rt_service_min.
        # clip leaves missing values as NaN instead of reporting them as 100.
        rt_triptime_w_gtfs_pct=(
            (df.total_min_w_gtfs / df.rt_service_min) * 100
        ).clip(upper=100),
        # Positive values mean the trip ran longer than scheduled.
        rt_v_scheduled_trip_time_pct=(
            df.rt_service_min / df.service_minutes - 1
        ) * 100,
    )

    drop_cols = [
        "total_pings_for_trip",
        "vp_in_shape",
        "total_vp",
        "total_min_w_gtfs",
    ]
    return df.drop(columns=drop_cols)
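
The notes ask for one generalized function to create all the percentages. A minimal sketch of what that could look like follows; `add_ratio_column` and its parameters are hypothetical names, not an existing helper in `segment_speed_utils`, and the two-row dataframe is made up.

```python
from typing import Optional

import pandas as pd


def add_ratio_column(
    df: pd.DataFrame,
    numerator: str,
    denominator: str,
    new_col: str,
    multiplier: float = 100.0,
    cap: Optional[float] = None,
) -> pd.DataFrame:
    """Return a copy of df with new_col = numerator / denominator * multiplier."""
    ratio = df[numerator] / df[denominator] * multiplier
    if cap is not None:
        # Limit values above the cap; missing values stay missing.
        ratio = ratio.clip(upper=cap)
    return df.assign(**{new_col: ratio})


# Hypothetical trips using this notebook's column names.
toy = pd.DataFrame(
    {
        "vp_in_shape": [87.0, 312.0],
        "total_vp": [117.0, 312.0],
        "total_min_w_gtfs": [40, 130],
        "rt_service_min": [39.75, 123.0],
    }
)

out = (
    toy.pipe(add_ratio_column, "vp_in_shape", "total_vp", "spatial_accuracy_pct")
    .pipe(
        add_ratio_column,
        "total_min_w_gtfs",
        "rt_service_min",
        "rt_triptime_w_gtfs_pct",
        cap=100,
    )
)
print(out)
```

Passing `multiplier=1` would give a per-minute rate such as `pings_per_min`, so the same function could cover both percent and rate metrics.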

In [22]:
trips = add_metrics(dec_df3)

In [23]:
trips.sample()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,route_id,direction_id,sched_rt_category,service_minutes,peak_offpeak,pings_per_min,spatial_accuracy_pct,rt_triptime_w_gtfs_pct,rt_v_scheduled_trip_time_pct
23007,3f3f36b4c41cc6b5df3eb7f5d8ea6e3c,b15dcf960e625d742d49413fea5095cf,78.18,67,212-13172,1,vp_sched,69.0,peak,2.56,99.5,99.77,13.31


### Routes: add back metrics

In [24]:
def weighted_average_function(df: pd.DataFrame, group_cols: list) -> pd.DataFrame:
    """
    Aggregate trips to group_cols + peak_offpeak by summing the
    numerators and denominators, then calculate metrics on the totals.
    """
    sum_cols = [
        "total_min_w_gtfs",
        "rt_service_min",
        "total_pings_for_trip",
        "service_minutes",
        "total_vp",
        "vp_in_shape",
    ]
    count_cols = ["trip_instance_key"]

    df2 = (
        df.groupby(group_cols + ["peak_offpeak"])
        .agg({**{e: "sum" for e in sum_cols}, **{e: "count" for e in count_cols}})
        .reset_index()
        .rename(columns={"trip_instance_key": "n_trips"})
    )

    return add_metrics(df2)

In [25]:
routes = weighted_average_function(
    dec_df3,
    ["schedule_gtfs_dataset_key", "route_id", "direction_id", "sched_rt_category"],
)
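
One thing to watch if `vp_only` trips are ever kept in this aggregation: pandas `groupby` drops rows whose grouping key is missing unless `dropna=False` is passed, so trips with a `<NA>` `direction_id` would silently disappear from the route table. The sketch below uses invented data and assumes a reasonably recent pandas version.

```python
import pandas as pd

# Hypothetical trips: the "Unknown" route has no direction_id.
toy = pd.DataFrame(
    {
        "route_id": ["17", "17", "Unknown"],
        "direction_id": pd.array([0, 0, pd.NA], dtype="Int64"),
        "total_vp": [10, 20, 5],
    }
)

group_cols = ["route_id", "direction_id"]

# Default behaviour: the row with a missing direction_id is dropped.
default = toy.groupby(group_cols).total_vp.sum().reset_index()

# dropna=False keeps missing keys as their own group.
keep_na = toy.groupby(group_cols, dropna=False).total_vp.sum().reset_index()

print(len(default), len(keep_na))
```

In this notebook the inner join has already removed the `vp_only` trips, so the default behaviour does not change the result here.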

In [26]:
routes.columns

Index(['schedule_gtfs_dataset_key', 'route_id', 'direction_id',
       'sched_rt_category', 'peak_offpeak', 'rt_service_min',
       'service_minutes', 'n_trips', 'pings_per_min', 'spatial_accuracy_pct',
       'rt_triptime_w_gtfs_pct', 'rt_v_scheduled_trip_time_pct'],
      dtype='object')

In [27]:
routes.head()

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,sched_rt_category,peak_offpeak,rt_service_min,service_minutes,n_trips,pings_per_min,spatial_accuracy_pct,rt_triptime_w_gtfs_pct,rt_v_scheduled_trip_time_pct
0,015d67d5b75b5cf2b710bbadadfb75f5,17,0,vp_sched,offpeak,732.98,567.0,10,2.94,79.33,99.46,29.27
1,015d67d5b75b5cf2b710bbadadfb75f5,17,0,vp_sched,peak,839.93,690.0,12,2.84,91.92,96.2,21.73
2,015d67d5b75b5cf2b710bbadadfb75f5,17,1,vp_sched,offpeak,697.97,569.0,11,2.85,85.59,96.71,22.67
3,015d67d5b75b5cf2b710bbadadfb75f5,17,1,vp_sched,peak,618.78,595.0,11,2.84,96.36,96.32,4.0
4,015d67d5b75b5cf2b710bbadadfb75f5,219,0,vp_sched,offpeak,242.5,152.0,9,2.2,80.3,75.46,59.54


In [28]:
len(routes)

5625
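
The notes also call for a function that concatenates trip files so that 1 day or 7 days can feed the route-level aggregation. A minimal sketch follows; `concatenate_trip_files`, the `path_template` pattern, and the in-memory fake files are all hypothetical and only stand in for the real parquet files.

```python
from typing import Callable, Sequence

import pandas as pd


def concatenate_trip_files(
    dates: Sequence[str],
    path_template: str,
    read_func: Callable[[str], pd.DataFrame] = pd.read_parquet,
) -> pd.DataFrame:
    """Read one trip-level file per date and stack them, tagging rows with the date."""
    frames = [
        read_func(path_template.format(date=d)).assign(service_date=pd.to_datetime(d))
        for d in dates
    ]
    return pd.concat(frames, axis=0, ignore_index=True)


# In-memory stand-ins for per-day files, so the sketch runs without GCS access.
_fake_files = {
    "trips_2023-12-13.parquet": pd.DataFrame(
        {"trip_instance_key": ["a", "b"], "total_vp": [10, 20]}
    ),
    "trips_2023-12-14.parquet": pd.DataFrame(
        {"trip_instance_key": ["c"], "total_vp": [30]}
    ),
}

week = concatenate_trip_files(
    ["2023-12-13", "2023-12-14"],
    "trips_{date}.parquet",
    read_func=_fake_files.__getitem__,
)
print(week)
```

The stacked dataframe could then be passed to `weighted_average_function`, since that function sums numerators and denominators before calculating metrics and so works the same for one day or several.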