## Update `trips`
* `cd rt_segment_speeds && pip install -r requirements.txt && cd ../_shared_utils && make setup_env`
* https://github.com/cal-itp/data-analyses/pull/1016
    * Keep source data + metrics tightly defined with GCS bucket organization.
    * `vp_usable` is the source data for rt_vs_sched metrics. Do not merge in schedule data until the gtfs_digest report; only bring in the `schedule_gtfs_dataset_key` column.
    * Add `route_id` and `direction_id` to `vp_usable` for trips that are also present in the schedule. If a trip is not in the schedule, fill `route_id` with `Unknown` and keep `direction_id` as Int64.
    * Add a function to concatenate trip files, so we can pass in 1 day or 7 days for aggregation.
    * A single function for normalized metrics (percent, per min, etc)
    * A single function for aggregation (summing up numerator / denominator)
    
* https://github.com/cal-itp/data-analyses/issues/989

* Notes 2/6
    * GTFS digest creates four datasets: schedule, average speeds, segment speeds, and rt vs schedule
    * Currently, merging is challenging.
    * Time categories are not necessarily the same (peak/offpeak/all-day)
    * Want all datasets to merge on the same set of columns (schedule gtfs key, route id, dir id, service date, and time categories) because `shapes` are unstable.
    * `Route ID` has been stabilized by Tiffany 
    * Update work from `rt_v_scheduled.py` (steps already outlined in `scripts/route_aggregation.ipynb`)
        * Do steps up until row 339 when the % are calculated. 
        * Take away `speeds`.
        * Bring in schedule gtfs key, trip instance key, route id, direction id either at the beginning or the end using `helpers.import_scheduled_trips`
        * Coerce `direction_id` to Int64; don't fill it in with 0. A missing direction is not 0, it's NaN.
        * Save files with the analysis date at the end instead of the beginning.
        * Split off the workstream -> one for trip level and one for route level
            * Use the config.yml to save the trips and routes stuff into their own folder.
            * Routes:
                * For routes, the minutes/pings should be totalled up before any metric is calculated. Currently we take an average of per-trip averages, which isn't accurate because every trip gets the same weight regardless of its length.
                * The route level should be able to take multiple days of data and concatenate so we can get metrics for a week/2 weeks/etc instead of for a single day. [Done here](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/scripts/average_speeds.py)
                * Add the route frequency as well?
            * Trips:
                * Do the steps up until row 339 in `rt_v_scheduled.py`.
                * Write a new generalized function to create all the % columns.
            
* Notes 2/13
    * Figure out how to set up Config file
    * Tiffany:
        * add_metrics looks good, just move the coercing of percents to 0-100 into a separate function. I want everything from 0-1, and then, before charting, scaled up to 0-100 all at once. Can you write a general function for this? All the chart display / cleaning functions should live in 1 script in segment_speed_utils.
        * Another tweak for a step somewhere before add_metrics. Certain columns can be coerced to be integers, like total_vp and vp_in_shape, just like how total_min_w_gtfs is an integer. Coerce all the ones that can be integers to be integers for your trip table, and this will save on the rounding step later.
        * Column naming: think about how you want to change the column names. total_pings_for_trip is not going to make sense once you aggregate, so maybe go with something more generic. Otherwise, you're going to be aggregating and renaming columns constantly. I would just rely on the other columns in the row to tell us whether it's per trip or per route, and have the metrics all use generic names that are suitable for passing through aggregation functions.
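
A minimal sketch of two rules from the notes above: trips with no schedule match get `route_id = Unknown` and a missing (not 0) `direction_id`, and count columns such as `total_vp` and `vp_in_shape` are stored as integers. The helper name `fill_unmatched_trips` and the toy dataframe are made up for illustration; they are not part of `segment_speed_utils`.

```python
import pandas as pd


def fill_unmatched_trips(df: pd.DataFrame, int_cols: list) -> pd.DataFrame:
    """Hypothetical helper: label unmatched trips and use nullable integers."""
    df = df.copy()
    # Trips with no schedule match get an explicit label instead of a missing route.
    df["route_id"] = df["route_id"].fillna("Unknown")
    # Int64 (capital I) is pandas' nullable integer dtype: missing values stay <NA>,
    # so an unknown direction is never confused with direction 0.
    for col in ["direction_id", *int_cols]:
        df[col] = df[col].astype("Int64")
    return df


# Toy data (assumed): trip "b" has no match in the schedule.
vp_trips = pd.DataFrame(
    {
        "trip_instance_key": ["a", "b", "c"],
        "route_id": ["17", None, "22"],
        "direction_id": [0.0, float("nan"), 1.0],
        "total_vp": [135.0, 51.0, 68.0],
        "vp_in_shape": [120.0, 0.0, 60.0],
    }
)
clean_trips = fill_unmatched_trips(vp_trips, int_cols=["total_vp", "vp_in_shape"])
```

Keeping the missing direction as `<NA>` means a later `groupby` has to pass `dropna=False` if those trips should stay in the route-level table.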

In [1]:
import dask.dataframe as dd
import pandas as pd
import yaml
from segment_speed_utils import gtfs_schedule_wrangling, helpers
from segment_speed_utils.project_vars import RT_SCHED_GCS, SEGMENT_GCS
from shared_utils import portfolio_utils, rt_dates, rt_utils

In [2]:
# Times
import datetime

from loguru import logger

In [3]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [4]:
# analysis_date = rt_dates.DATES["dec2023"]

In [5]:
RT_SCHED_GCS

'gs://calitp-analytics-data/data-analyses/rt_vs_schedule/'

In [6]:
rt_dates.DATES

{'feb2022': '2022-02-08',
 'mar2022': '2022-03-30',
 'may2022': '2022-05-04',
 'jun2022': '2022-06-15',
 'jul2022': '2022-07-13',
 'aug2022': '2022-08-17',
 'sep2022': '2022-09-14',
 'sep2022a': '2022-09-21',
 'oct2022': '2022-10-12',
 'nov2022a': '2022-11-07',
 'nov2022b': '2022-11-08',
 'nov2022c': '2022-11-09',
 'nov2022d': '2022-11-10',
 'nov2022': '2022-11-16',
 'dec2022': '2022-12-14',
 'jan2023': '2023-01-18',
 'feb2023': '2023-02-15',
 'mar2023': '2023-03-15',
 'apr2023a': '2023-04-10',
 'apr2023b': '2023-04-11',
 'apr2023': '2023-04-12',
 'apr2023c': '2023-04-13',
 'apr2023d': '2023-04-14',
 'apr2023e': '2023-04-15',
 'apr2023f': '2023-04-16',
 'may2023': '2023-05-17',
 'jun2023': '2023-06-14',
 'jul2023': '2023-07-12',
 'aug2023': '2023-08-15',
 'aug2023a': '2023-08-23',
 'sep2023': '2023-09-13',
 'oct2023a': '2023-10-09',
 'oct2023b': '2023-10-10',
 'oct2023': '2023-10-11',
 'oct2023c': '2023-10-12',
 'oct2023d': '2023-10-13',
 'oct2023e': '2023-10-14',
 'oct2023f': '2023-10

### Routes add multiple days

In [7]:
routes_df = pd.read_parquet("gs://calitp-analytics-data/data-analyses/rt_vs_schedule/vp_route_dir/route_direction_metrics/trip_2023_09_13_to_2023_10_11.parquet")

In [8]:
routes_df.head(20)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,total_min_w_gtfs,rt_service_min,total_pings,service_minutes,total_vp,vp_in_shape,n_trips,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
0,015d67d5b75b5cf2b710bbadadfb75f5,17,0,2807,2921,8321,2284.0,8321,6777,40,2.85,0.81,0.96,0.28
1,015d67d5b75b5cf2b710bbadadfb75f5,17,1,2331,2370,6904,2068.0,6904,6336,39,2.91,0.92,0.98,0.15
2,015d67d5b75b5cf2b710bbadadfb75f5,219,0,760,830,2197,640.0,2197,1708,38,2.65,0.78,0.92,0.3
3,015d67d5b75b5cf2b710bbadadfb75f5,219,1,815,821,2378,522.0,2378,2322,35,2.9,0.98,0.99,0.57
4,015d67d5b75b5cf2b710bbadadfb75f5,22,0,2789,2791,8268,1995.0,2193,1260,50,2.96,0.57,1.0,0.4
5,015d67d5b75b5cf2b710bbadadfb75f5,22,1,2401,2697,7112,1942.0,1838,1722,51,2.64,0.94,0.89,0.39
6,015d67d5b75b5cf2b710bbadadfb75f5,228,0,886,898,2610,748.0,2610,2540,15,2.91,0.97,0.99,0.2
7,015d67d5b75b5cf2b710bbadadfb75f5,228,1,1096,1094,3246,934.0,3246,2841,18,2.97,0.88,1.0,0.17
8,015d67d5b75b5cf2b710bbadadfb75f5,23,0,2455,2454,7279,2120.0,7279,7172,44,2.97,0.99,1.0,0.16
9,015d67d5b75b5cf2b710bbadadfb75f5,23,1,2848,3501,8429,2305.0,8429,8033,47,2.41,0.95,0.81,0.52


In [9]:
routes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3174 entries, 0 to 3173
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   schedule_gtfs_dataset_key  3174 non-null   object 
 1   route_id                   3174 non-null   object 
 2   direction_id               3174 non-null   Int64  
 3   total_min_w_gtfs           3174 non-null   int64  
 4   rt_service_min             3174 non-null   Int64  
 5   total_pings                3174 non-null   int64  
 6   service_minutes            3174 non-null   float64
 7   total_vp                   3174 non-null   Int64  
 8   vp_in_shape                3174 non-null   Int64  
 9   n_trips                    3174 non-null   int64  
 10  pings_per_min              3174 non-null   Float64
 11  spatial_accuracy_pct       3174 non-null   Float64
 12  rt_w_gtfs_pct              3174 non-null   Float64
 13  rt_v_scheduled_time_pct    3174 non-null   Float
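
A sketch of the "sum first, then divide" approach for multi-day route metrics: concatenate the daily trip tables, total the numerators and denominators per route-direction, and only then compute the ratios. The two formulas are inferred from the rows displayed above (for example 8321 / 2921 ≈ 2.85 pings per minute and 6777 / 8321 ≈ 0.81 spatial accuracy), so treat them as an assumption. The function names and the toy data are hypothetical.

```python
import pandas as pd

# Raw counts that are safe to add up across trips and days.
SUM_COLS = ["total_pings", "rt_service_min", "total_vp", "vp_in_shape"]


def concatenate_days(daily_tables: list) -> pd.DataFrame:
    """Stack one trip-level table per service date into a single table."""
    return pd.concat(daily_tables, ignore_index=True)


def aggregate_counts(trips: pd.DataFrame, group_cols: list) -> pd.DataFrame:
    """Sum the numerators and denominators within each group."""
    return (
        trips.groupby(group_cols, dropna=False)
        .agg(
            **{col: (col, "sum") for col in SUM_COLS},
            n_trips=("trip_instance_key", "nunique"),
        )
        .reset_index()
    )


def add_ratio_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Compute ratios from the summed counts (formulas inferred, see lead-in)."""
    df = df.copy()
    df["pings_per_min"] = df["total_pings"] / df["rt_service_min"]
    df["spatial_accuracy_pct"] = df["vp_in_shape"] / df["total_vp"]
    return df


# Toy data (assumed): two trips on day 1 and one trip on day 2.
day1 = pd.DataFrame(
    {
        "trip_instance_key": ["a", "b"],
        "route_id": ["17", "17"],
        "direction_id": [0, 0],
        "total_pings": [90, 10],
        "rt_service_min": [30, 10],
        "total_vp": [90, 10],
        "vp_in_shape": [45, 10],
    }
)
day2 = pd.DataFrame(
    {
        "trip_instance_key": ["c"],
        "route_id": ["17"],
        "direction_id": [0],
        "total_pings": [20],
        "rt_service_min": [10],
        "total_vp": [20],
        "vp_in_shape": [20],
    }
)

all_trips = concatenate_days([day1, day2])
route_metrics = add_ratio_metrics(
    aggregate_counts(all_trips, ["route_id", "direction_id"])
)
```

In this toy case the pooled value is 120 pings / 50 minutes = 2.4 pings per minute, while the mean of the three per-trip rates (3, 1 and 2) is 2.0. The gap is what averaging an average hides.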

### Cleaning Function

In [10]:
pct_cols = [
    "rt_w_gtfs_pct",
    "rt_v_scheduled_time_pct",
    "spatial_accuracy_pct",
]

In [11]:
int_cols = [
    "rt_service_min",
    "service_minutes",
]

In [12]:
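def clean_df(df: pd.DataFrame, pct_cols: list, int_cols: list) -> pd.DataFrame:
    # Work on a copy so re-running the cell does not scale the input a second time.
    df = df.copy()
    # Scale proportions (0-1) up to percentages (0-100) for display.
    for col in pct_cols:
        df[col] = df[col] * 100
    # Fill missing values with 0 and round; the column dtype is left unchanged.
    for col in int_cols:
        df[col] = df[col].fillna(0).round()

    # Make column names readable, e.g. "route_id" -> "Route Id".
    df.columns = df.columns.str.replace("_", " ").str.strip().str.title()
    return df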

In [13]:
routes_df2 = clean_df(routes_df, pct_cols, int_cols)

In [14]:
routes_df2.sample(3)

Unnamed: 0,Schedule Gtfs Dataset Key,Route Id,Direction Id,Total Min W Gtfs,Rt Service Min,Total Pings,Service Minutes,Total Vp,Vp In Shape,N Trips,Pings Per Min,Spatial Accuracy Pct,Rt W Gtfs Pct,Rt V Scheduled Time Pct
195,1770249a5a2e770ca90628434d4934b1,4251,0,239,239,702,150.0,702,620,5,2.94,88.32,100.0,59.33
525,3f3f36b4c41cc6b5df3eb7f5d8ea6e3c,158-13168,0,2896,3194,8342,2496.0,8342,6578,34,2.61,78.85,90.67,27.96
1699,9f9f2ccf5e29b2e48891b9716f66476b,799,1,478,484,749,483.0,749,668,5,1.55,89.19,98.76,0.21
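
The 2/13 notes ask for percents to stay on a 0-1 scale in the saved tables and be scaled to 0-100 once, right before charting. One possible split of that display step into two small functions is sketched below. The names `scale_pct_columns` and `title_case_columns` are hypothetical, and the toy values are chosen only to make the arithmetic easy to check.

```python
import pandas as pd


def scale_pct_columns(df: pd.DataFrame, pct_cols: list) -> pd.DataFrame:
    """Return a copy with 0-1 proportions scaled to 0-100 for display."""
    df = df.copy()
    df[pct_cols] = df[pct_cols] * 100
    return df


def title_case_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with readable column names, e.g. route_id -> Route Id."""
    df = df.copy()
    df.columns = df.columns.str.replace("_", " ").str.strip().str.title()
    return df


# Toy data (assumed): metrics stored on the 0-1 scale.
metrics = pd.DataFrame(
    {
        "route_id": ["17", "219"],
        "rt_w_gtfs_pct": [0.5, 1.0],
        "spatial_accuracy_pct": [0.25, 0.75],
    }
)

# Apply the display steps only at charting time; `metrics` itself is untouched.
chart_df = title_case_columns(
    scale_pct_columns(metrics, ["rt_w_gtfs_pct", "spatial_accuracy_pct"])
)
```

Because both functions return copies, the stored table can still be passed through aggregation functions after a chart has been drawn.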


### Checking
* Fix time of day 
* https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/segment_speed_utils/segment_calcs.py#L135-L163

In [24]:
trips = pd.read_parquet("gs://calitp-analytics-data/data-analyses/rt_vs_schedule/vp_trip/trip_metrics/trip_2023-10-11.parquet")

In [29]:
trips.head()

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,rt_service_min,min_w_atleast2_trip_updates,total_pings,total_min_w_gtfs,total_vp,vp_in_shape,route_id,direction_id,sched_rt_category,service_minutes,time_of_day,peak_offpeak,pings_per_min,spatial_accuracy_pct,rt_w_gtfs_pct,rt_v_scheduled_time_pct
0,ddad56d2731ac6296304cecfba77d88e,0975f532d046ada21eb55491d265ccde,53,46,135,49,0,0,Unknown,,vp_only,,PM Peak,peak,2.55,,0.92,
1,ddad56d2731ac6296304cecfba77d88e,a1bd6eed047cc2658463eb818f408fb0,19,19,51,19,0,0,Unknown,,vp_only,,PM Peak,peak,2.68,,1.0,
2,ddad56d2731ac6296304cecfba77d88e,1121fc9b119c91116d65a5e9d9a4cb7b,24,24,68,24,0,0,Unknown,,vp_only,,PM Peak,peak,2.83,,1.0,
3,ddad56d2731ac6296304cecfba77d88e,6d5f8a082af441b633c077d9c6697092,14,13,39,14,0,0,Unknown,,vp_only,,PM Peak,peak,2.79,,1.0,
4,ddad56d2731ac6296304cecfba77d88e,565ea1da39170b3fafa5df1c206e4531,48,17,48,17,0,0,Unknown,,vp_only,,Midday,offpeak,1.0,,0.35,


In [30]:
trips.time_of_day.value_counts()

Midday      26484
PM Peak     25686
AM Peak     17094
Early AM     9216
Evening      6938
Owl          1068
Name: time_of_day, dtype: int64

In [31]:
trips.peak_offpeak.value_counts()

offpeak    43706
peak       42780
Name: peak_offpeak, dtype: int64

In [25]:
roll_singleday_route_dir_df = pd.read_parquet("gs://calitp-analytics-data/data-analyses/rt_segment_speeds/rollup_singleday/speeds_route_dir_2023-10-15.parquet")

In [26]:
roll_singleday_route_dir_df.time_period.value_counts()

all_day    1615
offpeak    1601
peak       1575
Name: time_period, dtype: int64

In [27]:
roll_singleday_speeds_df = pd.read_parquet("gs://calitp-analytics-data/data-analyses/rt_segment_speeds/rollup_singleday/speeds_trip_2023-10-15.parquet")

In [28]:
roll_singleday_speeds_df.time_period.value_counts()

all_day    42092
offpeak    21742
peak       20350
Name: time_period, dtype: int64
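
The trip metrics use `time_of_day` and `peak_offpeak`, while the speed rollups use `time_period` with an extra `all_day` category. A rough sketch of one way to put the trip table on the same three categories is below. The mapping is inferred from the value counts above (AM Peak + PM Peak = 42780 = peak), and stacking an `all_day` copy mirrors the trip-level rollup, where all_day equals peak plus offpeak (20350 + 21742 = 42092). The function name is hypothetical, and this is not necessarily what `segment_calcs.py` does.

```python
import pandas as pd

# Assumed mapping, consistent with the value counts shown in this notebook.
TIME_OF_DAY_TO_PERIOD = {
    "AM Peak": "peak",
    "PM Peak": "peak",
    "Early AM": "offpeak",
    "Midday": "offpeak",
    "Evening": "offpeak",
    "Owl": "offpeak",
}


def add_time_period(trips: pd.DataFrame) -> pd.DataFrame:
    """Return every trip twice: tagged peak/offpeak and tagged all_day."""
    by_period = trips.assign(
        time_period=trips["time_of_day"].map(TIME_OF_DAY_TO_PERIOD)
    )
    all_day = trips.assign(time_period="all_day")
    return pd.concat([by_period, all_day], ignore_index=True)


# Toy data (assumed).
example_trips = pd.DataFrame(
    {
        "trip_instance_key": ["a", "b", "c"],
        "time_of_day": ["AM Peak", "Midday", "Owl"],
    }
)
stacked = add_time_period(example_trips)
period_counts = stacked["time_period"].value_counts().to_dict()
```

With `time_period` in place, it can join schedule gtfs key, route id, direction id and service date as a shared merge column across the four digest datasets.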