## Update `trips`
* `cd rt_segment_speeds && pip install -r requirements.txt && cd ../_shared_utils && make setup_env`
* https://github.com/cal-itp/data-analyses/pull/1016
    * Keep source data + metrics tightly defined with GCS bucket organization.
    * `vp_usable` is the source data for rt_vs_sched metrics; do not merge in schedule data until the gtfs_digest report. Only bring in the `schedule_gtfs_dataset_key` column.
    * `vp_usable` + route_id-direction_id for trips also present in schedule. If a trip is not in schedule, fill route_id with `Unknown` and keep direction_id as Int64.
    * Add function to concatenate trip file, enable us to put in 1 day or 7 days for aggregation
    * A single function for normalized metrics (percent, per min, etc)
    * A single function for aggregation (summing up numerator / denominator)
    
* https://github.com/cal-itp/data-analyses/issues/989

* Notes 2/6
    * GTFS digest creates four datasets: schedule, average speeds, segment speeds, and rt vs schedule
    * Currently, merging is challenging.
    * Time categories are not necessarily the same (peak/offpeak/all-day)
    * Want all datasets to merge on the same set of columns (schedule gtfs key, route id, dir id, service date, and time categories) because `shapes` are unstable.
    * `Route ID` has been stabilized by Tiffany 
    * Update work from `rt_v_scheduled.py` (steps already outlined in `scripts/route_aggregation.ipynb`)
        * Do steps up until row 339 when the % are calculated. 
        * Take away `speeds`.
        * Bring in schedule gtfs key, trip instance key, route id, direction id either at the beginning or the end using `helpers.import_scheduled_trips`
        * Coerce direction id to Int64; don't fill it in with 0. A missing direction is NaN, not 0.
        * Save files with the analysis date at the end instead of the beginning.
        * Split off the workstream -> one for trip level and one for route level
            * Use the config.yml to save the trips and routes stuff into their own folder.
            * Routes:
                * For routes, the minutes/pings should be totalled up. Currently, just taking the average of an average isn't really accurate.
                * The route level should be able to take multiple days of data and concatenate so we can get metrics for a week/2 weeks/etc instead of for a single day. [Done here](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/scripts/average_speeds.py)
                * Add the route frequency as well?
            * Trips:
                * Do up to step 339 in `rt_v_scheduled.py`
                * Write a new generalized function to create all the % 
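
The point in the route notes about averaging an average can be illustrated with a small sketch. The numbers below are made up for illustration and are not from the actual trip table; only the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy trip-level data (made-up numbers): one short trip with sparse pings,
# one long trip with dense pings, both on the same route.
trips = pd.DataFrame(
    {
        "route_id": ["A", "A"],
        "total_pings_for_trip": [10, 300],
        "rt_service_min": [10, 100],
    }
)
trips["pings_per_min"] = trips.total_pings_for_trip / trips.rt_service_min

# Average of trip-level averages: every trip counts equally.
mean_of_means = trips.groupby("route_id").pings_per_min.mean()

# Pooled ratio: sum numerator and denominator first, then divide.
totals = trips.groupby("route_id")[["total_pings_for_trip", "rt_service_min"]].sum()
pooled = totals.total_pings_for_trip / totals.rt_service_min

print(mean_of_means["A"])  # 2.0
print(round(pooled["A"], 2))  # 2.82
```

The pooled version weights each trip by its service minutes, which is why the route level should total up the minutes and pings before dividing.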
            
* Notes 2/13
    * Figure out how to set up Config file
    * Tiffany:
        * add_metrics looks good, just move the coercing of percents to 0-100 into a separate function. I want everything from 0-1, and then before charting, scaled up to 0-100 all at once. Can you write a general function for this? All the chart display / cleaning functions should live in 1 script in segment_speed_utils.
        * Another tweak for a step somewhere before add_metrics. Certain columns can be coerced to be integers, like total_vp and vp_in_shape, just like how total_min_w_gtfs is an integer. Coerce all the ones that can be integers to be integers for your trip table, and this will save on the rounding step later.
        * Column naming: think about how you want to change the column names. total_pings_for_trip is not going to make sense once you aggregate, so maybe go with something more generic. Otherwise, you're going to be aggregating and renaming columns constantly. I would just rely on the other columns in the row to tell us whether it's per trip or per route, and have the metrics all use generic names that are suitable for passing through aggregation functions.
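
A minimal sketch of the two requests in the notes above: keep proportions on a 0-1 scale until just before charting, and coerce count-like columns to integers early. The helper names `scale_pct_for_display` and `coerce_to_int` are hypothetical, not existing functions in `segment_speed_utils`, and the toy values are illustrative. This sketch keeps missing counts as `<NA>`; filling them with 0 instead is a separate choice.

```python
import pandas as pd


def scale_pct_for_display(df: pd.DataFrame, pct_cols: list) -> pd.DataFrame:
    """Hypothetical helper: return a copy with 0-1 proportions scaled to 0-100."""
    df = df.copy()
    df[pct_cols] = df[pct_cols] * 100
    return df


def coerce_to_int(df: pd.DataFrame, int_cols: list) -> pd.DataFrame:
    """Hypothetical helper: cast count-like columns to nullable Int64."""
    return df.astype({c: "Int64" for c in int_cols})


toy = pd.DataFrame(
    {
        "total_vp": [151.0, None],
        "vp_in_shape": [123.0, None],
        "spatial_accuracy_pct": [0.5, None],
    }
)

# Integer coercion happens once, on the trip table.
toy = coerce_to_int(toy, ["total_vp", "vp_in_shape"])

# Scaling happens once, right before charting; the 0-1 table is left untouched.
chart_df = scale_pct_for_display(toy, ["spatial_accuracy_pct"])
```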

In [71]:
import dask.dataframe as dd
import pandas as pd
import yaml
from segment_speed_utils import gtfs_schedule_wrangling, helpers
from segment_speed_utils.project_vars import RT_SCHED_GCS, SEGMENT_GCS
from shared_utils import portfolio_utils, rt_dates, rt_utils

In [72]:
# Times
import datetime

from loguru import logger

In [73]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [74]:
# analysis_date = rt_dates.DATES["dec2023"]

### Load in `rt_v_scheduled_trip` functions

In [75]:
dec_df = pd.read_parquet("./ah_testing_2023-12-01.parquet")

In [76]:
len(dec_df)

86128

In [77]:
dec_df.sample()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape
66485,7cc0cb1871dfd558f11a2885c145d144,7e43220c566ccce16a98ebec64474a5a,50.83,50,151,51,151.0,123.0


In [78]:
nov_df = pd.read_parquet("./ah_testing_2023-11-15.parquet")

In [79]:
nov_df.sample()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape
31451,a068c9b5e54692d4f729dde66c36cdb8,760f8f5f7f3e3ee2d72cc465cea47e6f,76.0,75,216,76,,


In [80]:
len(nov_df)

86832

### Add back routes-schedule-trip instance
* This will go into rt_v_scheduled.trip
#### Fix time_of_day buckets
* https://github.com/cal-itp/data-analyses/blob/route_agg/rt_segment_speeds/segment_speed_utils/gtfs_schedule_wrangling.py


In [81]:
def temp_function(df, analysis_date: str):
    routes_df = helpers.import_scheduled_trips(
        analysis_date,
        columns=[
            "gtfs_dataset_key",
            "route_id",
            "direction_id",
            "trip_instance_key",
        ],
        get_pandas=True,
    )

    df2 = pd.merge(
        df,
        routes_df,
        on=["schedule_gtfs_dataset_key", "trip_instance_key"],
        how="left",
        indicator="sched_rt_category",
    )

    df2 = df2.assign(
        route_id=df2.route_id.fillna("Unknown"),
        direction_id=df2.direction_id.astype("Int64"),
        total_vp=df2.total_vp.fillna(0).astype("Int64"),
        vp_in_shape=df2.vp_in_shape.fillna(0).astype("Int64"),
        rt_service_min=df2.rt_service_min.round(0).astype("Int64"),
        sched_rt_category=df2.apply(
            lambda x: "vp_only" if x.sched_rt_category == "left_only" else "vp_sched",
            axis=1,
        ),
    )

    sched_time_of_day = gtfs_schedule_wrangling.get_trip_time_buckets(analysis_date)[
        ["trip_instance_key", "time_of_day", "service_minutes"]
    ].pipe(gtfs_schedule_wrangling.add_peak_offpeak_column)[
        ["trip_instance_key", "service_minutes", "peak_offpeak"]
    ]

    df3 = pd.merge(df2, sched_time_of_day, on="trip_instance_key", how="left")

    rt_time_of_day = gtfs_schedule_wrangling.get_vp_trip_time_buckets(analysis_date)

    df4 = pd.merge(
        df3,
        rt_time_of_day,
        on=["schedule_gtfs_dataset_key", "trip_instance_key"],
        how="inner",
    )
    df4 = df4.assign(peak_offpeak=df4.peak_offpeak_x.fillna(df4.peak_offpeak_y))

    df4 = df4.drop(columns=["peak_offpeak_x", "peak_offpeak_y"])

    return df4
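
The core of the merge step can be tried on toy data without touching GCS. This is a sketch under assumptions: the two small dataframes are made-up stand-ins for the vehicle positions table and the scheduled trips table, and the `.map()` relabel is one possible vectorized alternative to the row-wise `apply`, not what the function above does.

```python
import pandas as pd

# Toy stand-ins (made-up keys): two trips with vehicle positions, one of them scheduled.
vp = pd.DataFrame({"trip_instance_key": ["t1", "t2"], "rt_service_min": [50, 76]})
sched = pd.DataFrame(
    {"trip_instance_key": ["t1"], "route_id": ["49"], "direction_id": [0.0]}
)

merged = pd.merge(
    vp, sched, on="trip_instance_key", how="left", indicator="sched_rt_category"
)

merged = merged.assign(
    route_id=merged.route_id.fillna("Unknown"),
    # Nullable Int64 keeps the missing direction as <NA> rather than forcing it to 0.
    direction_id=merged.direction_id.astype("Int64"),
    # Vectorized relabel of the merge indicator; a left merge only yields these two values.
    sched_rt_category=merged.sched_rt_category.astype(str).map(
        {"left_only": "vp_only", "both": "vp_sched"}
    ),
)

print(merged)
```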

In [82]:
start = datetime.datetime.now()
print(start)
nov_df2 = temp_function(nov_df, rt_dates.DATES["nov2023"])
end = datetime.datetime.now()
print(end)

2024-02-14 15:52:01.329677
2024-02-14 15:53:13.646792


In [83]:
start = datetime.datetime.now()
print(start)
dec_df2 = temp_function(dec_df, rt_dates.DATES["dec2023"])
end = datetime.datetime.now()
print(end)

2024-02-14 15:53:13.655648
2024-02-14 15:54:21.711426


### Trips: add back metrics

In [84]:
def add_metrics(df: pd.DataFrame) -> pd.DataFrame:

    df["pings_per_min"] = df.total_pings_for_trip / df.rt_service_min
    df["spatial_accuracy_pct"] = df.vp_in_shape / df.total_vp
    df["rt_w_gtfs_pct"] = df.total_min_w_gtfs / df.rt_service_min
    df["rt_v_scheduled_time_pct"] = df.rt_service_min / df.service_minutes - 1

    # Cap rt_w_gtfs_pct at 1 (100%) for any values above 1
    df.rt_w_gtfs_pct = df.rt_w_gtfs_pct.mask(df.rt_w_gtfs_pct > 1, 1)

    drop_cols = ["total_pings_for_trip", "vp_in_shape", "total_vp", "total_min_w_gtfs"]
    df = df.drop(columns=drop_cols)
    return df
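
As a quick arithmetic check on two of these metrics, the sketch below uses values copied from rows printed elsewhere in this notebook (15 minutes with GTFS over 14 service minutes, 72 over 73, and 48 real-time minutes against 40.48 scheduled minutes). The displayed values are rounded, so treat this as illustrative rather than exact.

```python
import pandas as pd

# total_min_w_gtfs / rt_service_min for two trips: 15/14 exceeds 1, 72/73 does not.
rt_w_gtfs_pct = pd.Series([15 / 14, 72 / 73])

# Same capping rule as add_metrics: anything above 1 is set to 1.
capped = rt_w_gtfs_pct.mask(rt_w_gtfs_pct > 1, 1)

# rt_service_min / service_minutes - 1: positive means the trip ran longer than scheduled.
rt_v_scheduled_time_pct = 48 / 40.48 - 1

print(capped.round(2).tolist())  # [1.0, 0.99]
print(round(rt_v_scheduled_time_pct, 2))  # 0.19
```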

In [85]:
dec_trip = add_metrics(dec_df2)

In [86]:
dec_trip.sample()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,route_id,direction_id,sched_rt_category,service_minutes,peak_offpeak,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
72683,7cc0cb1871dfd558f11a2885c145d144,6b7f2521d00760a4e61061db2f20810f,48,47,M,0,vp_sched,40.48,peak,2.94,1.0,1.0,0.19


In [87]:
nov_trip = add_metrics(nov_df2)

In [88]:
nov_trip.sample()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,route_id,direction_id,sched_rt_category,service_minutes,peak_offpeak,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
20507,3f3f36b4c41cc6b5df3eb7f5d8ea6e3c,b4db78155afd5d1a53ada36a935e703c,101,100,117-13168,1,vp_sched,81.0,peak,2.98,1.0,1.0,0.25


In [89]:
nov_trip.sched_rt_category.value_counts()

vp_sched    78190
vp_only      8642
Name: sched_rt_category, dtype: int64

In [90]:
dec_trip.sched_rt_category.value_counts()

vp_sched    77977
vp_only      8151
Name: sched_rt_category, dtype: int64

In [91]:
nov_trip.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 86832 entries, 0 to 86831
Data columns (total 13 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   schedule_gtfs_dataset_key    86832 non-null  object 
 1   trip_instance_key            86832 non-null  object 
 2   rt_service_min               86832 non-null  Int64  
 3   min_w_atleast2_trip_updates  86832 non-null  int64  
 4   route_id                     86832 non-null  object 
 5   direction_id                 76528 non-null  Int64  
 6   sched_rt_category            86832 non-null  object 
 7   service_minutes              78190 non-null  float64
 8   peak_offpeak                 86832 non-null  object 
 9   pings_per_min                86832 non-null  Float64
 10  spatial_accuracy_pct         86832 non-null  Float64
 11  rt_w_gtfs_pct                86832 non-null  Float64
 12  rt_v_scheduled_time_pct      86832 non-null  Float64
dtypes: Float64(4), I

In [92]:
nov_df2.to_parquet("./concat_test_2023-11-15.parquet")

In [93]:
dec_df2.to_parquet("./concat_test_2023-12-01.parquet")

### Routes add multiple days

In [94]:
def concatenate_trip_segment_speeds(analysis_date_list: list) -> pd.DataFrame:
    """
    Concatenate the trip-level parquets together,
    whether it's for a single day or multiple days.
    Adds a service_date column to each day's rows.
    """
    # GCS version, kept for reference:
    # SPEED_FILE = dict_inputs["stage4"]
    #
    # df = pd.concat([
    #     pd.read_parquet(
    #         f"{SEGMENT_GCS}{SPEED_FILE}_{analysis_date}.parquet").assign(
    #         service_date = pd.to_datetime(analysis_date)
    #     ) for analysis_date in analysis_date_list],
    #     axis=0, ignore_index = True
    # )
    df = pd.concat(
        [
            pd.read_parquet(f"./concat_test_{analysis_date}.parquet").assign(
                service_date=pd.to_datetime(analysis_date)
            )
            for analysis_date in analysis_date_list
        ],
        axis=0,
        ignore_index=True,
    )
    return df

In [95]:
all_routes = concatenate_trip_segment_speeds(["2023-11-15", "2023-12-01"])

In [96]:
all_routes.sched_rt_category.value_counts()

vp_sched    156167
vp_only      16793
Name: sched_rt_category, dtype: int64

In [97]:
all_routes.head()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape,route_id,direction_id,sched_rt_category,service_minutes,peak_offpeak,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct,service_date
0,ddad56d2731ac6296304cecfba77d88e,a3647253d4cc8f847e972ed8c83d1b9b,23,22,65,23,0,0,Unknown,,vp_only,,peak,2.83,,1.0,,2023-11-15
1,ddad56d2731ac6296304cecfba77d88e,7029f592047be84e5bb1d28d299be35d,17,16,48,17,0,0,Unknown,,vp_only,,peak,2.82,,1.0,,2023-11-15
2,ddad56d2731ac6296304cecfba77d88e,1040196034fd380818a2cbcf1eafd9b8,41,40,118,41,0,0,Unknown,,vp_only,,offpeak,2.88,,1.0,,2023-11-15
3,ddad56d2731ac6296304cecfba77d88e,5c6d43026fe5f02e5b31c18fcb8c0bf5,63,61,176,63,0,0,Unknown,,vp_only,,offpeak,2.79,,1.0,,2023-11-15
4,ddad56d2731ac6296304cecfba77d88e,ee2f1fd83d87e85119f66014da5d74d5,14,13,37,15,0,0,Unknown,,vp_only,,offpeak,2.64,,1.0,,2023-11-15


#### Add back metrics

In [98]:
def weighted_average_function(df: pd.DataFrame, group_cols: list):
    sum_cols = [
        "total_min_w_gtfs",
        "rt_service_min",
        "total_pings_for_trip",
        "service_minutes",
        "total_vp",
        "vp_in_shape",
    ]

    count_cols = ["trip_instance_key"]
    df2 = (
        df.groupby(group_cols + ["peak_offpeak"])
        .agg({**{e: "sum" for e in sum_cols}, **{e: "count" for e in count_cols}})
        .reset_index()
    )

    df2 = df2.rename(columns={"trip_instance_key": "n_trips"})

    df2 = add_metrics(df2)

    return df2
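
One thing worth checking with this aggregation: by default, pandas `groupby` drops rows whose group key is missing. Since `direction_id` is missing for trips that are not in the schedule, those rows may silently fall out of the route-level table. The toy data below is made up, and `direction_id` is left as a plain float column for simplicity.

```python
import pandas as pd

# Toy data (made-up): the last trip has no schedule match, so direction_id is missing.
trips = pd.DataFrame(
    {
        "route_id": ["49", "49", "Unknown"],
        "direction_id": [0, 0, None],
        "rt_service_min": [71, 68, 23],
    }
)
group_cols = ["route_id", "direction_id"]

# Default behavior: the row with a missing direction_id is excluded.
default_agg = trips.groupby(group_cols).rt_service_min.sum().reset_index()

# dropna=False keeps missing keys as their own group.
keep_na_agg = trips.groupby(group_cols, dropna=False).rt_service_min.sum().reset_index()

print(len(default_agg))  # 1
print(len(keep_na_agg))  # 2
```

If the `vp_only` rows should appear in the route-level output, passing `dropna=False` is one option; aggregating them separately is another.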

In [99]:
all_routes2 = weighted_average_function(
    all_routes,
    ["schedule_gtfs_dataset_key", "route_id", "direction_id", "sched_rt_category"],
)

In [100]:
all_routes2.sample()

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,sched_rt_category,peak_offpeak,rt_service_min,service_minutes,n_trips,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
1164,3f3f36b4c41cc6b5df3eb7f5d8ea6e3c,222-13168,1,vp_sched,peak,1278,943.0,16,2.96,0.99,1.0,0.36


#### Why is `rt_w_gtfs_pct` missing for some rows even when the category is `vp_sched`?

In [101]:
all_routes2.loc[
    (all_routes2["route_id"] == "49")
    & (all_routes2["schedule_gtfs_dataset_key"] == "015d67d5b75b5cf2b710bbadadfb75f5")
]

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,sched_rt_category,peak_offpeak,rt_service_min,service_minutes,n_trips,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
40,015d67d5b75b5cf2b710bbadadfb75f5,49,0,vp_sched,offpeak,1758,1382.0,23,2.98,0.94,1.0,0.27
41,015d67d5b75b5cf2b710bbadadfb75f5,49,0,vp_sched,peak,1935,1702.0,27,2.97,0.98,1.0,0.14
42,015d67d5b75b5cf2b710bbadadfb75f5,49,1,vp_sched,offpeak,2321,1946.0,28,2.89,0.87,0.97,0.19
43,015d67d5b75b5cf2b710bbadadfb75f5,49,1,vp_sched,peak,2128,1854.0,26,2.97,0.91,1.0,0.15


In [102]:
all_routes.loc[
    (all_routes.route_id == "49")
    & (all_routes.schedule_gtfs_dataset_key == "015d67d5b75b5cf2b710bbadadfb75f5")
][["service_date"]].drop_duplicates()

Unnamed: 0,service_date
86560,2023-11-15
172675,2023-12-01


In [103]:
(
    nov_df2.loc[
        (nov_df2.route_id == "49")
        & (nov_df2.schedule_gtfs_dataset_key == "015d67d5b75b5cf2b710bbadadfb75f5")
        & (nov_df2.direction_id == 0)
        & (nov_df2.peak_offpeak == "offpeak")
    ]
)

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape,route_id,direction_id,sched_rt_category,service_minutes,peak_offpeak,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
86561,015d67d5b75b5cf2b710bbadadfb75f5,0921a3f1d8a4624b2f6f2821ee726905,126,125,375,126,375,236,49,0,vp_sched,61.0,offpeak,2.98,0.63,1.0,1.07
86575,015d67d5b75b5cf2b710bbadadfb75f5,d08a2a58b37fb3b37b03a005322e1f3f,71,71,212,71,212,212,49,0,vp_sched,60.0,offpeak,2.99,1.0,1.0,0.18
86577,015d67d5b75b5cf2b710bbadadfb75f5,36d9c3b6bc315ca01fdb04c57eeb730f,71,71,211,71,211,211,49,0,vp_sched,60.0,offpeak,2.97,1.0,1.0,0.18
86579,015d67d5b75b5cf2b710bbadadfb75f5,675534f3670096c51444371c1d584ef4,68,68,203,68,203,202,49,0,vp_sched,60.0,offpeak,2.99,1.0,1.0,0.13
86581,015d67d5b75b5cf2b710bbadadfb75f5,a39fce1e065f5a8f61afa20b75ffb4d0,73,72,216,72,216,213,49,0,vp_sched,60.0,offpeak,2.96,0.99,0.99,0.22
86583,015d67d5b75b5cf2b710bbadadfb75f5,10b07ff8e7ffeafd530d184e92533abd,74,74,221,74,221,219,49,0,vp_sched,60.0,offpeak,2.99,0.99,1.0,0.23
86585,015d67d5b75b5cf2b710bbadadfb75f5,1b69255df814f27cea09f41f00a4ea54,74,72,219,74,219,218,49,0,vp_sched,60.0,offpeak,2.96,1.0,1.0,0.23
86587,015d67d5b75b5cf2b710bbadadfb75f5,9ce296156000f273b09c4e715555b83a,76,76,226,76,226,226,49,0,vp_sched,60.0,offpeak,2.97,1.0,1.0,0.27
86589,015d67d5b75b5cf2b710bbadadfb75f5,4a4d590315b63f42d1459fdf1fd87503,71,71,213,72,213,213,49,0,vp_sched,60.0,offpeak,3.0,1.0,1.0,0.18
86591,015d67d5b75b5cf2b710bbadadfb75f5,dff41cc4c4322ecb35d348bc3ca62b93,88,87,262,88,262,262,49,0,vp_sched,60.0,offpeak,2.98,1.0,1.0,0.47


In [104]:
(
    dec_df2.loc[
        (dec_df2.route_id == "49")
        & (dec_df2.schedule_gtfs_dataset_key == "015d67d5b75b5cf2b710bbadadfb75f5")
        & (dec_df2.direction_id == 0)
        & (dec_df2.peak_offpeak == "offpeak")
    ]
)

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings_for_trip,total_min_w_gtfs,total_vp,vp_in_shape,route_id,direction_id,sched_rt_category,service_minutes,peak_offpeak,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
85844,015d67d5b75b5cf2b710bbadadfb75f5,7be0bbcb121153455bdad9d0c5296654,110,110,329,110,329,157,49,0,vp_sched,61.0,offpeak,2.99,0.48,1.0,0.8
85858,015d67d5b75b5cf2b710bbadadfb75f5,6bd8394f9eb2edf599a9c083a3235e70,67,66,198,67,198,198,49,0,vp_sched,60.0,offpeak,2.96,1.0,1.0,0.12
85860,015d67d5b75b5cf2b710bbadadfb75f5,dfe63da8f5f1dd27093c216d36360853,65,64,192,65,192,192,49,0,vp_sched,60.0,offpeak,2.95,1.0,1.0,0.08
85862,015d67d5b75b5cf2b710bbadadfb75f5,8428f904ef047f7dea1655e1251005d6,71,71,212,71,212,211,49,0,vp_sched,60.0,offpeak,2.99,1.0,1.0,0.18
85864,015d67d5b75b5cf2b710bbadadfb75f5,8b01c4e4db831afdf5582f027ce348cf,75,75,223,75,223,223,49,0,vp_sched,60.0,offpeak,2.97,1.0,1.0,0.25
85866,015d67d5b75b5cf2b710bbadadfb75f5,4a493b90720580da520545dd56b3c49d,70,69,208,70,208,207,49,0,vp_sched,60.0,offpeak,2.97,1.0,1.0,0.17
85868,015d67d5b75b5cf2b710bbadadfb75f5,c6b623aeec97d4ea17a2b3faad38139d,72,71,213,72,213,212,49,0,vp_sched,60.0,offpeak,2.96,1.0,1.0,0.2
85870,015d67d5b75b5cf2b710bbadadfb75f5,e03f52a045f61e42c0a166a54cccc472,76,76,226,76,226,226,49,0,vp_sched,60.0,offpeak,2.97,1.0,1.0,0.27
85872,015d67d5b75b5cf2b710bbadadfb75f5,ac0b8330ef9e7945e888b1f85294e0c3,71,70,212,72,212,211,49,0,vp_sched,60.0,offpeak,2.99,1.0,1.0,0.18
85874,015d67d5b75b5cf2b710bbadadfb75f5,37c09fdb9f26edbbe7e81e3322e08778,81,81,241,81,241,241,49,0,vp_sched,60.0,offpeak,2.98,1.0,1.0,0.35


In [105]:
all_routes2.columns

Index(['schedule_gtfs_dataset_key', 'route_id', 'direction_id',
       'sched_rt_category', 'peak_offpeak', 'rt_service_min',
       'service_minutes', 'n_trips', 'pings_per_min', 'spatial_accuracy_pct',
       'rt_w_gtfs_pct', 'rt_v_scheduled_time_pct'],
      dtype='object')

In [106]:
all_routes2.sample()

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,sched_rt_category,peak_offpeak,rt_service_min,service_minutes,n_trips,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
2076,5456c80d420043e15c8eb7368a8a4d89,50,1,vp_sched,offpeak,66,48.0,2,1.91,0.87,1.0,0.38


#### Cleaning Function

In [107]:
pct_cols = [
    "rt_w_gtfs_pct",
    "rt_v_scheduled_time_pct",
    "spatial_accuracy_pct",
]

In [108]:
int_cols = [
    "rt_service_min",
    "service_minutes",
]

In [109]:
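def clean_df(df: pd.DataFrame, pct_cols: list, int_cols: list) -> pd.DataFrame:
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()
    # Scale proportions (0-1) up to percentages (0-100) for display
    for col in pct_cols:
        df[col] = df[col] * 100
    # Fill missing values with 0 and round to whole numbers
    for col in int_cols:
        df[col] = df[col].fillna(0).round()

    # Make column names human-readable, e.g. rt_service_min -> Rt Service Min
    df.columns = df.columns.str.replace("_", " ").str.strip().str.title()
    return df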

In [110]:
all_routes3 = clean_df(all_routes2, pct_cols, int_cols)