## Merges going wrong
* Ideally, each route should have exactly 6 rows per service date when its `sched_rt_category` is `schedule_and_vp`:
    * Direction 0: `all_day`, `peak`, `offpeak`
    * Direction 1: `all_day`, `peak`, `offpeak`
* This affects the graphs for Timeliness, Frequency of Trips, and Average Speed.
* December 2024 looks fine, but January 2025 is incomplete.
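
The 6-row expectation can be checked mechanically. Below is a minimal sketch on a toy frame; the route ids `A`, `B`, `C` and the helper `flag_incomplete` are hypothetical, and the column names are assumed to match `df_sched`. It flags every route and date whose row count is not 6, which catches both dropped rows and duplicated rows.

```python
import pandas as pd

EXPECTED_ROWS = 6  # 2 directions x 3 time periods

periods = ["all_day", "peak", "offpeak"]
full = [(d, p) for d in (0, 1) for p in periods]

# Toy data (hypothetical routes): A is complete, B lost rows, C has a duplicate.
toy = pd.DataFrame(
    [("A", d, p) for d, p in full]
    + [("B", d, p) for d, p in full[:4]]
    + [("C", d, p) for d, p in full + full[:1]],
    columns=["route_id", "direction_id", "time_period"],
)
toy["service_date"] = "2025-01-15"


def flag_incomplete(df: pd.DataFrame) -> pd.DataFrame:
    """Return route/date groups whose row count differs from EXPECTED_ROWS."""
    counts = (
        df.groupby(["route_id", "service_date"])
        .size()
        .reset_index(name="n_rows")
    )
    return counts.loc[counts.n_rows != EXPECTED_ROWS].reset_index(drop=True)


incomplete = flag_incomplete(toy)
print(incomplete)
```

On the real data, the same function could be applied to the rows where `sched_rt_category` is `schedule_and_vp`, assuming that column is present at that stage.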

In [1]:
import geopandas as gpd
import numpy as np
import pandas as pd

In [3]:
import _section1_utils
import _section2_utils

from segment_speed_utils import gtfs_schedule_wrangling
from shared_utils import rt_dates
from update_vars import GTFS_DATA_DICT, RT_SCHED_GCS, SCHED_GCS, SEGMENT_GCS

In [9]:
import merge_data

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [4]:
analysis_date_list = [rt_dates.y2024_dates[-1]] + rt_dates.y2025_dates

In [5]:
# Test with SF
schd_key = "7cc0cb1871dfd558f11a2885c145d144"

In [6]:
org_name = "City and County of San Francisco"

In [7]:
route_id = "22"

### `df_sched` appears to be the culprit: the two rows that should hold `peak` and `offpeak` are empty. To fix this, update `gtfs_funnel/schedule_stats_by_route_direction`

In [10]:
df_sched = merge_data.concatenate_schedule_by_route_direction(analysis_date_list)

In [11]:
df_sched = df_sched.loc[
    (df_sched.schedule_gtfs_dataset_key == schd_key) & (df_sched.route_id == route_id)
]

In [12]:
df_sched.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,route_primary_direction,avg_scheduled_service_minutes,avg_stop_miles,n_scheduled_trips,frequency,is_express,is_rapid,is_rail,is_coverage,is_downtown_local,is_local,service_date
9646,7cc0cb1871dfd558f11a2885c145d144,22,0.0,all_day,Southbound,46.87,0.13,186,7.75,0.0,0.0,0.0,0.0,1.0,0.0,2024-12-11
9647,7cc0cb1871dfd558f11a2885c145d144,22,0.0,all_day,Southbound,46.87,2.35,186,7.75,0.0,0.0,0.0,0.0,1.0,0.0,2025-01-15


In [13]:
df_sched[['service_date','time_period','direction_id']].sort_values(by = ['service_date'])

Unnamed: 0,service_date,time_period,direction_id
9646,2024-12-11,all_day,0.0
9648,2024-12-11,offpeak,0.0
9650,2024-12-11,peak,0.0
9652,2024-12-11,all_day,1.0
9654,2024-12-11,offpeak,1.0
9656,2024-12-11,peak,1.0
9647,2025-01-15,all_day,0.0
9649,2025-01-15,offpeak,0.0
9651,2025-01-15,peak,0.0
9653,2025-01-15,all_day,1.0


In [14]:
df_avg_speeds = merge_data.concatenate_speeds_by_route_direction(analysis_date_list)

In [17]:
sched_vp_df = _section2_utils.load_schedule_vp_metrics(org_name)

In [19]:
sched_vp_df.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,dir_0_1,Period,Average Scheduled Service (trip minutes),Average Stop Distance (miles),# scheduled trips,Trips per Hour,is_express,is_rapid,is_rail,is_coverage,is_downtown_local,is_local,Date,Route typology,# Minutes with 1+ VP per Minute,# Minutes with 2+ VP per Minute,Aggregate Actual Service Minutes,Aggregate Scheduled Service Minutes (all trips),# VP,# VP within Scheduled Shape,# Early Arrival Trips,# On-Time Trips,# Late Trips,# Trips with VP,Average VP per Minute,% VP within Scheduled Shape,pct_rt_journey_atleast1_vp,pct_rt_journey_atleast2_vp,% Scheduled Trip w/ 1+ VP/Minute,% Scheduled Trip w/ 2+ VP/Minute,Realtime versus Scheduled Service Ratio,Average Actual Service (Trip Minutes),GTFS Availability,Speed (MPH),route_long_name,route_short_name,Route,Route ID,Base64 Encoded Feed URL,Organization ID,Organization,District,Direction,schedule_source_record_id,Transit Operator,ruler_100_pct,ruler_for_vp_per_min,headway_in_minutes
0,7cc0cb1871dfd558f11a2885c145d144,0.0,all_day,41.33,0.12,151,6.29,0.0,0.0,0.0,0.0,1.0,0.0,2023-04-12,downtown_local,7816,7708,12084.08,6194.0,23106,21485,4,28,118,150,1.91,93.0,65.0,64.0,100.0,100.0,1.95,80.56,schedule_and_vp,5.56,CALIFORNIA,1,1 CALIFORNIA,1,aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1TRg==,rechaapWbeffO33OX,City and County of San Francisco,04 - Oakland,Westbound,recHD22phgJs34JHP,Bay Area 511 Muni Schedule,100,2,9.54
1,7cc0cb1871dfd558f11a2885c145d144,0.0,all_day,41.33,0.11,151,6.29,0.0,0.0,0.0,0.0,1.0,0.0,2023-05-17,downtown_local,8015,7898,12137.89,6194.0,23681,21951,0,27,123,150,1.95,93.0,66.0,65.0,100.0,100.0,1.96,80.92,schedule_and_vp,5.1,CALIFORNIA,1,1 CALIFORNIA,1,aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1TRg==,rechaapWbeffO33OX,City and County of San Francisco,04 - Oakland,Westbound,recHD22phgJs34JHP,Bay Area 511 Muni Schedule,100,2,9.54


In [26]:
jan_only = sched_vp_df.loc[sched_vp_df.Date == '2025-01-15T00:00:00.000000000']

In [27]:
jan_only.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 360 entries, 19 to 7093
Data columns (total 49 columns):
 #   Column                                           Non-Null Count  Dtype         
---  ------                                           --------------  -----         
 0   schedule_gtfs_dataset_key                        360 non-null    object        
 1   dir_0_1                                          360 non-null    float64       
 2   Period                                           360 non-null    object        
 3   Average Scheduled Service (trip minutes)         360 non-null    float64       
 4   Average Stop Distance (miles)                    360 non-null    float64       
 5   # scheduled trips                                360 non-null    int64         
 6   Trips per Hour                                   360 non-null    float64       
 7   is_express                                       360 non-null    float64       
 8   is_rapid                              

In [23]:
sched_vp_df.Period.value_counts()

all_day    2633
offpeak    2548
peak       2365
Name: Period, dtype: int64
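
The overall counts show fewer `peak` and `offpeak` rows than `all_day` rows, but not which dates are responsible. One way to localize the shortfall, sketched here on a toy frame that assumes the `Date` and `Period` columns of `sched_vp_df`, is to cross-tabulate the two and compare each date's `all_day` count against its smallest period count. The column name `gap_vs_all_day` is made up for this example.

```python
import pandas as pd

# Toy stand-in for sched_vp_df: December is complete, January lacks two rows.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2024-12-11"] * 6 + ["2025-01-15"] * 4),
        "Period": ["all_day", "offpeak", "peak"] * 3 + ["all_day"],
    }
)

# Rows per date and period.
by_date = pd.crosstab(toy["Date"], toy["Period"])

# A positive gap means some route-directions have all_day but lack peak or offpeak.
by_date["gap_vs_all_day"] = by_date["all_day"] - by_date[["peak", "offpeak"]].min(axis=1)
print(by_date)
```

A nonzero gap is only a hint: a route that legitimately runs in a single period would also produce one, so flagged dates still need a per-route look.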