## Update `trips`
* Setup: `cd rt_segment_speeds && pip install -r requirements.txt && cd ../_shared_utils && make setup_env`
* https://github.com/cal-itp/data-analyses/pull/1016
    * Keep source data + metrics tightly defined with GCS bucket organization.
    * `vp_usable` is the source data for rt_vs_sched metrics. Do not merge in schedule data until the gtfs_digest report; only bring in the `schedule_gtfs_dataset_key` column.
    * `vp_usable` + `route_id`-`direction_id` for trips also present in schedule. If a trip is not in schedule, fill `route_id` with `Unknown` and keep `direction_id` as Int64.
    * Add function to concatenate trip file, enable us to put in 1 day or 7 days for aggregation
    * A single function for normalized metrics (percent, per min, etc)
    * A single function for aggregation (summing up numerator / denominator)
    
* https://github.com/cal-itp/data-analyses/issues/989

* Notes 2/6
    * GTFS digest creates four datasets: schedule, average speeds, segment speeds, and rt vs schedule
    * Currently, merging is challenging.
    * Time categories are not necessarily the same (peak/offpeak/all-day)
    * Want all datasets to merge on the same set of columns (schedule gtfs key, route id, dir id, service date, and time categories) because `shapes` are unstable.
    * `Route ID` has been stabilized by Tiffany 
    * Update work from `rt_v_scheduled.py` (steps already outlined in `scripts/route_aggregation.ipynb`)
        * Do steps up until row 339 when the % are calculated. 
        * Take away `speeds`.
        * Bring in schedule gtfs key, trip instance key, route id, direction id either at the beginning or the end using `helpers.import_scheduled_trips`
        * Coerce `direction_id` to Int64; don't fill missing values with 0. A missing direction is NaN, not 0.
        * Save files with the analysis date at the end instead of the beginning.
        * Split off the workstream -> one for trip level and one for route level
            * Use the config.yml to save the trips and routes stuff into their own folder.
            * Routes:
                * For routes, the minutes/pings should be totalled up. Currently, just taking the average of an average isn't really accurate.
                * The route level should be able to take multiple days of data and concatenate so we can get metrics for a week/2 weeks/etc instead of for a single day. [Done here](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/scripts/average_speeds.py)
                * Add the route frequency as well?
            * Trips:
                * Do up to step 339 in `rt_v_scheduled.py`
                * Write a new generalized function to create all the %
            
* Notes 2/13
    * Figure out how to set up Config file
    * Tiffany:
        * `add_metrics` looks good, just move the coercing of percents to 0-100 into a separate function. I want everything from 0-1, and then before charting, scaled up to 0-100 all at once. Can you write a general function for this? All the chart display / cleaning functions should live in 1 script in `segment_speed_utils`.
        * Another tweak for a step somewhere before add_metrics. Certain columns can be coerced to be integers, like total_vp and vp_in_shape, just like how total_min_w_gtfs is an integer. Coerce all the ones that can be integers to be integers for your trip table, and this will save on the rounding step later.
        * Column naming: think about how you want to change the column names. `total_pings_for_trip` is not going to make sense once you aggregate, so maybe go with something more generic. Otherwise, you're going to be aggregating and renaming columns constantly. I would just rely on the other columns in the row to tell us whether it's per trip or per route, and the metrics all use generic names that are suitable for passing through aggregation functions.
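
The notes above ask for three small pieces: a function to concatenate several days of data, a single aggregation function that sums numerators and denominators, and a single function for normalized metrics. A minimal sketch of how these could fit together follows. The function names (`concatenate_days`, `aggregate_counts`, `add_ratio`) and the toy data are hypothetical and are not taken from `segment_speed_utils`. The sketch also contrasts the result with the average of daily averages, which weights a 100-ping day the same as a 900-ping day.

```python
import pandas as pd

# Hypothetical per-day tables; column names follow the route table in these notes.
day1 = pd.DataFrame({
    "route_id": ["17", "17"],
    "direction_id": [0, 1],
    "total_vp": [100, 400],
    "vp_in_shape": [50, 400],
})
day2 = pd.DataFrame({
    "route_id": ["17", "17"],
    "direction_id": [0, 1],
    "total_vp": [900, 100],
    "vp_in_shape": [900, 50],
})


def concatenate_days(dfs: list) -> pd.DataFrame:
    """Stack any number of daily tables (1 day, 7 days, ...)."""
    return pd.concat(dfs, axis=0, ignore_index=True)


def aggregate_counts(df: pd.DataFrame, group_cols: list, sum_cols: list) -> pd.DataFrame:
    """Sum raw counts (numerators and denominators) within each group."""
    # dropna=False keeps groups whose key is missing, e.g. an unknown direction_id.
    return df.groupby(group_cols, dropna=False)[sum_cols].sum().reset_index()


def add_ratio(df: pd.DataFrame, numerator: str, denominator: str, new_col: str) -> pd.DataFrame:
    """Add one normalized metric on a 0-1 scale, computed from summed counts."""
    return df.assign(**{new_col: df[numerator] / df[denominator]})


group_cols = ["route_id", "direction_id"]
multi_day = concatenate_days([day1, day2])

# Sum first, then divide.
route_totals = aggregate_counts(multi_day, group_cols, ["total_vp", "vp_in_shape"])
route_metrics = add_ratio(route_totals, "vp_in_shape", "total_vp", "spatial_accuracy_pct")

# For contrast: divide per day, then average the daily ratios.
daily = add_ratio(multi_day, "vp_in_shape", "total_vp", "spatial_accuracy_pct")
mean_of_means = (
    daily.groupby(group_cols, dropna=False)["spatial_accuracy_pct"].mean().reset_index()
)

print(route_metrics)   # direction 0: 950 / 1000 = 0.95
print(mean_of_means)   # direction 0: (0.50 + 1.00) / 2 = 0.75
```

Because the metric columns use generic names, the same `add_ratio` call works on a trip table or a route table without renaming.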

In [42]:
import dask.dataframe as dd
import pandas as pd
import yaml
from segment_speed_utils import gtfs_schedule_wrangling, helpers
from segment_speed_utils.project_vars import RT_SCHED_GCS, SEGMENT_GCS
from shared_utils import portfolio_utils, rt_dates, rt_utils

In [43]:
# Times
import datetime

from loguru import logger

In [44]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [45]:
# analysis_date = rt_dates.DATES["dec2023"]

In [46]:
RT_SCHED_GCS

'gs://calitp-analytics-data/data-analyses/rt_vs_schedule/'

In [47]:
rt_dates.DATES

{'feb2022': '2022-02-08',
 'mar2022': '2022-03-30',
 'may2022': '2022-05-04',
 'jun2022': '2022-06-15',
 'jul2022': '2022-07-13',
 'aug2022': '2022-08-17',
 'sep2022': '2022-09-14',
 'sep2022a': '2022-09-21',
 'oct2022': '2022-10-12',
 'nov2022a': '2022-11-07',
 'nov2022b': '2022-11-08',
 'nov2022c': '2022-11-09',
 'nov2022d': '2022-11-10',
 'nov2022': '2022-11-16',
 'dec2022': '2022-12-14',
 'jan2023': '2023-01-18',
 'feb2023': '2023-02-15',
 'mar2023': '2023-03-15',
 'apr2023a': '2023-04-10',
 'apr2023b': '2023-04-11',
 'apr2023': '2023-04-12',
 'apr2023c': '2023-04-13',
 'apr2023d': '2023-04-14',
 'apr2023e': '2023-04-15',
 'apr2023f': '2023-04-16',
 'may2023': '2023-05-17',
 'jun2023': '2023-06-14',
 'jul2023': '2023-07-12',
 'aug2023': '2023-08-15',
 'aug2023a': '2023-08-23',
 'sep2023': '2023-09-13',
 'oct2023a': '2023-10-09',
 'oct2023b': '2023-10-10',
 'oct2023': '2023-10-11',
 'oct2023c': '2023-10-12',
 'oct2023d': '2023-10-13',
 'oct2023e': '2023-10-14',
 'oct2023f': '2023-10

### Routes add multiple days

In [48]:
routes_df = pd.read_parquet("gs://calitp-analytics-data/data-analyses/rt_vs_schedule/vp_route_dir/route_direction_metrics/trip_2023_09_13_to_2023_10_11.parquet")

In [49]:
routes_df.head(20)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,sched_rt_category,peak_offpeak,time_of_day,total_min_w_gtfs,rt_service_min,total_pings,service_minutes,total_vp,vp_in_shape,n_trips,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
0,015d67d5b75b5cf2b710bbadadfb75f5,17,0,vp_sched,offpeak,Early AM,139,139,411,90.0,411,258,2,2.96,0.63,1.0,0.54
1,015d67d5b75b5cf2b710bbadadfb75f5,17,0,vp_sched,offpeak,Evening,311,313,919,270.0,919,755,5,2.94,0.82,0.99,0.16
2,015d67d5b75b5cf2b710bbadadfb75f5,17,0,vp_sched,offpeak,Midday,843,866,2500,720.0,2500,1952,12,2.89,0.78,0.97,0.2
3,015d67d5b75b5cf2b710bbadadfb75f5,17,0,vp_sched,peak,AM Peak,689,729,2045,556.0,2045,1695,10,2.81,0.83,0.95,0.31
4,015d67d5b75b5cf2b710bbadadfb75f5,17,0,vp_sched,peak,PM Peak,825,874,2446,648.0,2446,2117,11,2.8,0.87,0.94,0.35
5,015d67d5b75b5cf2b710bbadadfb75f5,17,1,vp_sched,offpeak,Early AM,361,361,1071,294.0,1071,836,6,2.97,0.78,1.0,0.23
6,015d67d5b75b5cf2b710bbadadfb75f5,17,1,vp_sched,offpeak,Evening,173,173,512,147.0,512,483,3,2.96,0.94,1.0,0.18
7,015d67d5b75b5cf2b710bbadadfb75f5,17,1,vp_sched,offpeak,Midday,651,650,1930,594.0,1930,1846,11,2.97,0.96,1.0,0.09
8,015d67d5b75b5cf2b710bbadadfb75f5,17,1,vp_sched,peak,AM Peak,607,635,1790,562.0,1790,1678,10,2.82,0.94,0.96,0.13
9,015d67d5b75b5cf2b710bbadadfb75f5,17,1,vp_sched,peak,PM Peak,539,551,1601,471.0,1601,1493,9,2.91,0.93,0.98,0.17
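
As a sanity check on the table above, the four metric columns can be recomputed from the count columns. The formulas below are an assumption: they are inferred from the displayed values (they reproduce several of the rows shown) and are not copied from `rt_v_scheduled.py`. The sketch uses the first row (route 17, direction 0, Early AM).

```python
import pandas as pd

# Count columns from the first displayed row of routes_df.
row = pd.DataFrame({
    "total_min_w_gtfs": [139],
    "rt_service_min": [139],
    "total_pings": [411],
    "service_minutes": [90.0],
    "total_vp": [411],
    "vp_in_shape": [258],
})

# Assumed formulas, inferred from the displayed output.
derived = row.assign(
    pings_per_min=row["total_pings"] / row["rt_service_min"],
    spatial_accuracy_pct=row["vp_in_shape"] / row["total_vp"],
    rt_w_gtfs_pct=row["total_min_w_gtfs"] / row["rt_service_min"],
    rt_v_scheduled_time_pct=row["rt_service_min"] / row["service_minutes"] - 1,
).round(2)

metric_cols = [
    "pings_per_min",
    "spatial_accuracy_pct",
    "rt_w_gtfs_pct",
    "rt_v_scheduled_time_pct",
]
print(derived[metric_cols])  # 2.96, 0.63, 1.0, 0.54, matching the row above
```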


In [50]:
routes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11810 entries, 0 to 11809
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   schedule_gtfs_dataset_key  11810 non-null  object 
 1   route_id                   11810 non-null  object 
 2   direction_id               11810 non-null  Int64  
 3   sched_rt_category          11810 non-null  object 
 4   peak_offpeak               11810 non-null  object 
 5   time_of_day                11810 non-null  object 
 6   total_min_w_gtfs           11810 non-null  int64  
 7   rt_service_min             11810 non-null  Int64  
 8   total_pings                11810 non-null  int64  
 9   service_minutes            11810 non-null  float64
 10  total_vp                   11810 non-null  Int64  
 11  vp_in_shape                11810 non-null  Int64  
 12  n_trips                    11810 non-null  int64  
 13  pings_per_min              11810 non-null  Flo

### Cleaning Function

In [51]:
pct_cols = [
    "rt_w_gtfs_pct",
    "rt_v_scheduled_time_pct",
    "spatial_accuracy_pct",
]

In [52]:
int_cols = [
    "rt_service_min",
    "service_minutes",
]

In [53]:
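def clean_df(df: pd.DataFrame, pct_cols: list, int_cols: list) -> pd.DataFrame:
    """Scale percents to 0-100, round minute columns, and title-case column names."""
    # Work on a copy so the caller's dataframe is not modified in place.
    df = df.copy()
    for col in pct_cols:
        df[col] = df[col] * 100
    for col in int_cols:
        df[col] = df[col].fillna(0).round()

    # e.g. "rt_service_min" -> "Rt Service Min"
    df.columns = df.columns.str.replace("_", " ").str.strip().str.title()
    return df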

In [54]:
routes_df2 = clean_df(routes_df, pct_cols, int_cols)

In [56]:
routes_df2.sample(3)

Unnamed: 0,Schedule Gtfs Dataset Key,Route Id,Direction Id,Sched Rt Category,Peak Offpeak,Time Of Day,Total Min W Gtfs,Rt Service Min,Total Pings,Service Minutes,Total Vp,Vp In Shape,N Trips,Pings Per Min,Spatial Accuracy Pct,Rt W Gtfs Pct,Rt V Scheduled Time Pct
1944,3f3f36b4c41cc6b5df3eb7f5d8ea6e3c,162-13168,1,vp_sched,offpeak,Early AM,1917,1985,5642,1486.0,5642,5131,18,2.84,90.94,96.57,33.58
9179,d9d0325e50e50064e3cc8384b1751d67,24,0,vp_sched,offpeak,Early AM,166,167,327,108.0,327,327,2,1.96,100.0,99.4,54.63
913,239f3baf3dd3b9e9464f66a777f9897d,1,0,vp_sched,peak,PM Peak,486,541,618,420.0,618,496,28,1.14,80.26,89.83,28.81
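
The 2/13 notes ask for two things the cleaning step could separate: coerce integer-like columns early, and keep percents on 0-1 until just before charting. One possible split is sketched below. The function names `coerce_to_int` and `scale_pct_for_display` and the toy `trips` table are hypothetical. Using the nullable `Int64` dtype means a missing `direction_id` stays missing instead of becoming 0.

```python
import pandas as pd


def coerce_to_int(df: pd.DataFrame, int_cols: list) -> pd.DataFrame:
    """Round and cast columns to nullable Int64, preserving missing values."""
    out = df.copy()
    for col in int_cols:
        out[col] = out[col].round().astype("Int64")
    return out


def scale_pct_for_display(df: pd.DataFrame, pct_cols: list) -> pd.DataFrame:
    """Scale 0-1 ratios to 0-100 once, right before charting."""
    out = df.copy()
    out[pct_cols] = out[pct_cols] * 100
    return out


# Toy trip table: the second trip has no direction_id.
trips = pd.DataFrame({
    "direction_id": [0.0, None],
    "service_minutes": [90.0, 270.0],
    "spatial_accuracy_pct": [0.63, 0.82],
})

tidy = coerce_to_int(trips, ["direction_id", "service_minutes"])
chart_df = scale_pct_for_display(tidy, ["spatial_accuracy_pct"])

print(tidy.dtypes)
print(chart_df)
```

Keeping `tidy` on the 0-1 scale means it can still be aggregated; only `chart_df` carries the display scaling.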
