# Correct Crosswalk
OK, rewinding a lot. Within `merge_data`, there's the comment shown below. After `portfolio_organization_name` was merged in, we looked at what made the grain correct, which is `name`-`portfolio_organization_name`-`service_date`. That means that if you back up to the lines before that, reading in columns like `organization_name` to begin with is what created the 1:m issue, which we now know how to handle.
 
```
df = df.assign(
    caltrans_district = df.caltrans_district.map(
        portfolio_utils.CALTRANS_DISTRICT_DICT
    )
).pipe(
    portfolio_utils.standardize_portfolio_organization_names,
    PORTFOLIO_ORGANIZATIONS_DICT
)
# to aggregate up to organization,
# group by name-service_date-portfolio_organization_name
# because name indicates different feeds, so we want to sum those.
```
Back up to this section right before: 

```
crosswalk_cols = [
    "schedule_gtfs_dataset_key",
    "name",
    "schedule_source_record_id",
    "base64_url",
    #"organization_source_record_id", # remove this column which gave us 1:m issue
    #"organization_name", # remove this column which gave us the 1:m issue
    "caltrans_district"
]
df = time_series_utils.concatenate_datasets_across_dates(
    SCHED_GCS,
    FILE,
    date_list,
    data_type = "df",
    columns = crosswalk_cols
)
```
Then continue the pipe for caltrans_district, and continue the pipe for portfolio_organization_name, which uses the (schedule_gtfs_dataset) name to map the dictionary.
 
This change addresses the core issue: we looked at it and determined that (schedule_gtfs_dataset) name-portfolio_organization_name-service_date is the correct grain. Make this change and check how your routes look.
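One way to confirm that grain is to check that no combination of `name`, `portfolio_organization_name`, and `service_date` repeats. The sketch below runs on a toy crosswalk with made-up values; `check_grain` is a hypothetical helper, not something that exists in `merge_data`.

```python
import pandas as pd


def check_grain(df: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Return the rows whose key combination appears more than once."""
    return df[df.duplicated(subset=keys, keep=False)]


# Toy crosswalk (made-up values): one feed observed on two dates.
toy_crosswalk = pd.DataFrame(
    {
        "name": ["Feed A Schedule", "Feed A Schedule"],
        "portfolio_organization_name": ["Org A", "Org A"],
        "service_date": pd.to_datetime(["2025-01-15", "2025-02-12"]),
    }
)

grain = ["name", "portfolio_organization_name", "service_date"]
violations = check_grain(toy_crosswalk, grain)
print(len(violations))  # 0 rows means the grain holds
```

Running the same check on the real crosswalk after the pipe should return an empty dataframe if the grain is right.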
 
I got to this point because I looked at why your rows had City of Moorpark and City of Thousand Oaks, which doesn't seem correct.
In the yaml, there is one entry, `Thousand Oaks Flex: City of Thousand Oaks`, but when I read in monthly_route_schedule_vp, I did not find Thousand Oaks Flex in there at all, so Flex is not the problem.
However, the presence of City of Thousand Oaks tells me organization_name is still in there somewhere. It shouldn't be used because we've already moved away from it, which is why it never appears in the aggregation afterward.
It's showing up in merge_data because the crosswalk wasn't adjusted to remove it. We don't need those columns; they already gave us issues, and now they need to be cleaned up and removed.
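To illustrate why leaving `organization_name` in the crosswalk inflates rows, here is a minimal sketch on toy data. The feed and organization names are borrowed from the notes above, but the rows themselves are made up. It also shows pandas' `validate="m:1"` merge option, which is one way to catch a non-unique crosswalk early.

```python
import pandas as pd

# Toy route-level table: one row for one feed-route (values are made up).
routes = pd.DataFrame(
    {
        "name": ["VCTC GMV Schedule"],
        "route_id": ["4134"],
        "n_scheduled_trips": [5],
    }
)

# Crosswalk that still carries organization_name: one feed, two organizations.
crosswalk_with_org = pd.DataFrame(
    {
        "name": ["VCTC GMV Schedule", "VCTC GMV Schedule"],
        "organization_name": ["City of Moorpark", "City of Thousand Oaks"],
    }
)

# The merge fans out: the single route row becomes two.
fanned_out = routes.merge(crosswalk_with_org, on="name", how="left")
print(len(fanned_out))  # 2

# Dropping the column and deduplicating leaves one row per feed.
# validate="m:1" raises pandas.errors.MergeError if name is not unique.
crosswalk_fixed = crosswalk_with_org[["name"]].drop_duplicates()
merged = routes.merge(crosswalk_fixed, on="name", how="left", validate="m:1")
print(len(merged))  # 1
```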
 

In [1]:
import altair as alt
import calitp_data_analysis.magics
import geopandas as gpd
import google.auth
import pandas as pd
import yaml
from IPython.display import HTML, Image, Markdown, display, display_html
from omegaconf import OmegaConf
from segment_speed_utils import gtfs_schedule_wrangling, time_series_utils
from shared_utils import (
    catalog_utils,
    gtfs_utils_v2,
    portfolio_utils,
    publish_utils,
    rt_dates,
    rt_utils,
)
from update_vars import GTFS_DATA_DICT, RT_SCHED_GCS, SCHED_GCS, SEGMENT_GCS

readable_dict = OmegaConf.load("readable2.yml")
credentials, project = google.auth.default()

import _report_route_dir_visuals
import merge_data

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [3]:
# portfolio_name = "City and County of San Francisco"
portfolio_name = "Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)"

In [4]:
date_list = rt_dates.y2025_dates

## `concatenate_crosswalk_organization` in `merge_data`

In [5]:
FILE = GTFS_DATA_DICT.schedule_tables.gtfs_key_crosswalk

crosswalk_cols = [
    "schedule_gtfs_dataset_key",
    "name",
    "schedule_source_record_id",
    "base64_url",
    # "organization_source_record_id",
    # "organization_name",
    "caltrans_district",
]

In [6]:
df = time_series_utils.concatenate_datasets_across_dates(
    SCHED_GCS, FILE, date_list, data_type="df", columns=crosswalk_cols
)

In [7]:
df.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,name,schedule_source_record_id,base64_url,caltrans_district,service_date
0,ff1bc5dde661d62c877165421e9ca257,Santa Ynez Mecatran Schedule,recuWhPXfxMatv6rL,aHR0cDovL2FwcC5tZWNhdHJhbi5jb20vdXJiL3dzL2ZlZWQvYzJsMFpUMXplWFowTzJOc2FXVnVkRDF6Wld4bU8yVjRjR2x5WlQwN2RIbHdaVDFuZEdaek8ydGxlVDAwTWpjd056UTBaVFk0TlRBek9UTXlNREl4TURkak56STBNRFJrTXpZeU5UTTRNekkwWXpJMA==,05 - San Luis Obispo,2025-01-15
1,f4c3ea214214ee0d96f7646b3e9d69dc,SLO Peak Transit Schedule,rec0EeeizKvsEDfRQ,aHR0cDovL2RhdGEucGVha3RyYW5zaXQuY29tL3N0YXRpY2d0ZnMvMS9ndGZzLnppcA==,05 - San Luis Obispo,2025-01-15


In [8]:
with open("../_shared_utils/shared_utils/portfolio_organization_name.yml", "r") as f:
    PORTFOLIO_ORGANIZATIONS_DICT = yaml.safe_load(f)

In [9]:
df = df.assign(
    caltrans_district=df.caltrans_district.map(portfolio_utils.CALTRANS_DISTRICT_DICT)
).pipe(
    portfolio_utils.standardize_portfolio_organization_names,
    PORTFOLIO_ORGANIZATIONS_DICT,
)

In [10]:
df.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,name,schedule_source_record_id,base64_url,caltrans_district,service_date,portfolio_organization_name
0,ff1bc5dde661d62c877165421e9ca257,Santa Ynez Mecatran Schedule,recuWhPXfxMatv6rL,aHR0cDovL2FwcC5tZWNhdHJhbi5jb20vdXJiL3dzL2ZlZWQvYzJsMFpUMXplWFowTzJOc2FXVnVkRDF6Wld4bU8yVjRjR2x5WlQwN2RIbHdaVDFuZEdaek8ydGxlVDAwTWpjd056UTBaVFk0TlRBek9UTXlNREl4TURkak56STBNRFJrTXpZeU5UTTRNekkwWXpJMA==,05 - San Luis Obispo / Santa Barbara,2025-01-15,City of Solvang
1,f4c3ea214214ee0d96f7646b3e9d69dc,SLO Peak Transit Schedule,rec0EeeizKvsEDfRQ,aHR0cDovL2RhdGEucGVha3RyYW5zaXQuY29tL3N0YXRpY2d0ZnMvMS9ndGZzLnppcA==,05 - San Luis Obispo / Santa Barbara,2025-01-15,San Luis Obispo Regional Transit Authority


In [11]:
df.loc[df.portfolio_organization_name == portfolio_name]

Unnamed: 0,schedule_gtfs_dataset_key,name,schedule_source_record_id,base64_url,caltrans_district,service_date,portfolio_organization_name
162,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,07 - Los Angeles / Ventura,2025-01-15,"Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)"
162,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,07 - Los Angeles / Ventura,2025-02-12,"Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)"
159,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,07 - Los Angeles / Ventura,2025-03-12,"Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)"
162,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,07 - Los Angeles / Ventura,2025-04-16,"Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)"
158,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,07 - Los Angeles / Ventura,2025-05-14,"Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)"


In [12]:
df_sched = merge_data.concatenate_schedule_by_route_direction(date_list)

df_avg_speeds = merge_data.concatenate_speeds_by_route_direction(date_list)

df_rt_sched = merge_data.concatenate_rt_vs_schedule_by_route_direction(date_list)

In [13]:
df_sched.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,avg_scheduled_service_minutes,avg_stop_miles,n_scheduled_trips,frequency,service_date,is_express,is_ferry,is_rail,is_coverage,is_local,is_downtown_local,is_rapid,typology,name,combined_name,recent_combined_name,recent_route_id,route_primary_direction
0,0139b1253130b33adcd4b3a4490530d2,0177a66b-9f33-407d-a72e-776429fb73d4,,all_day,63.5,0.92,2,0.08,2025-01-15,0.0,0.0,0.0,1.0,0.0,0.0,0.0,coverage,TCRTA TripShot Schedule,C70 LOOP__70,C70 LOOP 70,0177a66b-9f33-407d-a72e-776429fb73d4,Eastbound
1,0139b1253130b33adcd4b3a4490530d2,0177a66b-9f33-407d-a72e-776429fb73d4,,all_day,63.5,5.98,2,0.08,2025-02-12,0.0,0.0,0.0,1.0,0.0,0.0,0.0,coverage,TCRTA TripShot Schedule,C70 LOOP__70,C70 LOOP 70,0177a66b-9f33-407d-a72e-776429fb73d4,Eastbound


In [14]:
df_avg_speeds.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,speed_mph,service_date
0,0139b1253130b33adcd4b3a4490530d2,0177a66b-9f33-407d-a72e-776429fb73d4,,all_day,33.49,2025-04-16
1,0139b1253130b33adcd4b3a4490530d2,0177a66b-9f33-407d-a72e-776429fb73d4,,peak,33.49,2025-04-16


In [15]:
df_rt_sched.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,minutes_atleast1_vp,minutes_atleast2_vp,total_rt_service_minutes,total_scheduled_service_minutes,total_vp,vp_in_shape,is_early,is_ontime,is_late,n_vp_trips,vp_per_minute,pct_in_shape,pct_rt_journey_atleast1_vp,pct_rt_journey_atleast2_vp,pct_sched_journey_atleast1_vp,pct_sched_journey_atleast2_vp,rt_sched_journey_ratio,avg_rt_service_minutes,service_date
0,0139b1253130b33adcd4b3a4490530d2,0177a66b-9f33-407d-a72e-776429fb73d4,,all_day,159,158,156.95,127.0,473,390,0,1,1,2,3.01,0.82,1.0,1.0,1.0,1.0,1.24,78.47,2025-01-15
1,0139b1253130b33adcd4b3a4490530d2,0177a66b-9f33-407d-a72e-776429fb73d4,,all_day,161,159,158.92,127.0,479,378,0,1,1,2,3.01,0.79,1.0,1.0,1.0,1.0,1.25,79.46,2025-02-12


In [16]:
df2 = merge_data.merge_data_sources_by_route_direction(
    df_sched, df_rt_sched, df_avg_speeds, df
)

In [17]:
df2.columns

Index(['schedule_gtfs_dataset_key', 'route_id', 'direction_id', 'time_period',
       'avg_scheduled_service_minutes', 'avg_stop_miles', 'n_scheduled_trips',
       'frequency', 'service_date', 'is_express', 'is_ferry', 'is_rail',
       'is_coverage', 'is_local', 'is_downtown_local', 'is_rapid', 'typology',
       'name', 'combined_name', 'recent_combined_name', 'recent_route_id',
       'route_primary_direction', 'minutes_atleast1_vp', 'minutes_atleast2_vp',
       'total_rt_service_minutes', 'total_scheduled_service_minutes',
       'total_vp', 'vp_in_shape', 'is_early', 'is_ontime', 'is_late',
       'n_vp_trips', 'vp_per_minute', 'pct_in_shape',
       'pct_rt_journey_atleast1_vp', 'pct_rt_journey_atleast2_vp',
       'pct_sched_journey_atleast1_vp', 'pct_sched_journey_atleast2_vp',
       'rt_sched_journey_ratio', 'avg_rt_service_minutes', 'sched_rt_category',
       'speed_mph', 'schedule_source_record_id', 'base64_url',
       'caltrans_district', 'portfolio_organization_name']

## Check Ventura County Route 80 

In [19]:
df2.service_date.unique()

array(['2025-01-15T00:00:00.000000000', '2025-02-12T00:00:00.000000000',
       '2025-03-12T00:00:00.000000000', '2025-04-16T00:00:00.000000000',
       '2025-05-14T00:00:00.000000000'], dtype='datetime64[ns]')

In [20]:
route_80_89 = df2.loc[
    (df2.portfolio_organization_name == portfolio_name)
    & (df2.recent_combined_name == "80-89 Coastal Express")
    & (df2.service_date == "2025-05-14T00:00:00.000000000")
]

In [21]:
route_80_89.name.value_counts()

VCTC GMV Schedule    239
Name: name, dtype: int64

### There are 2 directions and 3 time periods, so I should only see 6 rows per date?

In [22]:
len(route_80_89), len(route_80_89.drop_duplicates())

(239, 41)

In [23]:
route_80_89_dedup = route_80_89.drop_duplicates()

In [24]:
route_80_89_dedup.columns

Index(['schedule_gtfs_dataset_key', 'route_id', 'direction_id', 'time_period',
       'avg_scheduled_service_minutes', 'avg_stop_miles', 'n_scheduled_trips',
       'frequency', 'service_date', 'is_express', 'is_ferry', 'is_rail',
       'is_coverage', 'is_local', 'is_downtown_local', 'is_rapid', 'typology',
       'name', 'combined_name', 'recent_combined_name', 'recent_route_id',
       'route_primary_direction', 'minutes_atleast1_vp', 'minutes_atleast2_vp',
       'total_rt_service_minutes', 'total_scheduled_service_minutes',
       'total_vp', 'vp_in_shape', 'is_early', 'is_ontime', 'is_late',
       'n_vp_trips', 'vp_per_minute', 'pct_in_shape',
       'pct_rt_journey_atleast1_vp', 'pct_rt_journey_atleast2_vp',
       'pct_sched_journey_atleast1_vp', 'pct_sched_journey_atleast2_vp',
       'rt_sched_journey_ratio', 'avg_rt_service_minutes', 'sched_rt_category',
       'speed_mph', 'schedule_source_record_id', 'base64_url',
       'caltrans_district', 'portfolio_organization_name']

In [25]:
route_80_89_dedup.time_period.value_counts()

all_day    19
offpeak    13
peak        9
Name: time_period, dtype: int64

## What about unique route IDs that are repeated for the same `key`, `recent_combined_name`, `service_date`, and `portfolio_organization_name` combo?

In [26]:
route_80_89_dedup.loc[
    (route_80_89_dedup.time_period == "peak") & (route_80_89_dedup.direction_id == 1)
][
    [
        "recent_combined_name",
        "route_id",
        "direction_id",
        "time_period",
        "avg_scheduled_service_minutes",
        "avg_stop_miles",
        "n_scheduled_trips",
        "frequency",
        "route_primary_direction",
        "minutes_atleast1_vp",
        "minutes_atleast2_vp",
        "total_rt_service_minutes",
        "total_scheduled_service_minutes",
        "total_vp",
        "vp_in_shape",
        "is_early",
        "is_ontime",
        "is_late",
        "n_vp_trips",
        "vp_per_minute",
        "pct_in_shape",
        "pct_rt_journey_atleast1_vp",
        "pct_rt_journey_atleast2_vp",
        "pct_sched_journey_atleast1_vp",
        "pct_sched_journey_atleast2_vp",
        "rt_sched_journey_ratio",
        "avg_rt_service_minutes",
        "speed_mph",
        "portfolio_organization_name",
    ]
]

Unnamed: 0,recent_combined_name,route_id,direction_id,time_period,avg_scheduled_service_minutes,avg_stop_miles,n_scheduled_trips,frequency,route_primary_direction,minutes_atleast1_vp,minutes_atleast2_vp,total_rt_service_minutes,total_scheduled_service_minutes,total_vp,vp_in_shape,is_early,is_ontime,is_late,n_vp_trips,vp_per_minute,pct_in_shape,pct_rt_journey_atleast1_vp,pct_rt_journey_atleast2_vp,pct_sched_journey_atleast1_vp,pct_sched_journey_atleast2_vp,rt_sched_journey_ratio,avg_rt_service_minutes,speed_mph,portfolio_organization_name
13835,80-89 Coastal Express,4134,1,peak,86.8,7.71,5,0.62,Eastbound,466,455,589.67,346.0,1367,0,0,0,4,4,2.32,0.0,0.79,0.77,1.0,1.0,1.7,147.42,15.47,"Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)"
13934,80-89 Coastal Express,4136,1,peak,50.0,32.1,1,0.12,Eastbound,90,85,159.92,50.0,279,0,0,0,1,1,1.74,0.0,0.56,0.53,1.0,1.0,3.2,159.92,,"Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)"
14004,80-89 Coastal Express,4137,1,peak,125.5,28.88,2,0.25,Eastbound,78,78,77.67,134.0,234,0,1,0,0,1,3.01,0.0,1.0,1.0,0.58,0.58,0.58,77.67,19.46,"Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)"
14109,80-89 Coastal Express,4141,1,peak,128.0,17.82,2,0.25,Eastbound,309,264,375.4,262.0,816,0,0,0,2,2,2.17,0.0,0.82,0.7,1.0,1.0,1.43,187.7,,"Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)"
14173,80-89 Coastal Express,4143,1,peak,129.0,3.7,1,0.12,Eastbound,0,0,,,0,0,0,0,0,0,,,,,,,,,,"Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)"
14237,80-89 Coastal Express,4145,1,peak,106.0,2.91,2,0.25,Eastbound,349,341,1428.93,212.0,1029,0,0,0,2,2,0.72,0.0,0.24,0.24,1.0,1.0,6.74,714.46,30.45,"Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)"
14307,80-89 Coastal Express,4146,1,peak,129.0,38.06,1,0.12,Eastbound,263,258,579.38,129.0,777,0,0,0,1,1,1.34,0.0,0.45,0.45,1.0,1.0,4.49,579.38,,"Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)"
14377,80-89 Coastal Express,4148,1,peak,149.67,5.57,3,0.38,Eastbound,622,621,974.28,449.0,1858,0,0,0,3,3,1.91,0.0,0.64,0.64,1.0,1.0,2.17,324.76,,"Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)"
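The table above suggests that each extra row is a distinct `route_id` sharing one `recent_combined_name`. A minimal sketch for confirming that is to count rows and distinct route IDs per expected key. The toy frame below stands in for `route_80_89_dedup`, and its values are made up.

```python
import pandas as pd

# Toy stand-in for route_80_89_dedup (values are made up).
toy = pd.DataFrame(
    {
        "recent_combined_name": ["80-89 Coastal Express"] * 4,
        "route_id": ["4134", "4136", "4134", "4136"],
        "direction_id": [1, 1, 0, 0],
        "time_period": ["peak"] * 4,
    }
)

expected_grain = ["recent_combined_name", "direction_id", "time_period"]

# If n_rows equals n_route_ids, every extra row is explained by a route_id.
rows_per_key = (
    toy.groupby(expected_grain)
    .agg(n_rows=("route_id", "size"), n_route_ids=("route_id", "nunique"))
    .reset_index()
)
print(rows_per_key)
```

If `n_rows` is larger than `n_route_ids` for some key, something other than `route_id` is producing the duplicates.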


# Aggregation per my notebook #17

In [27]:
df2.head(1)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,avg_scheduled_service_minutes,avg_stop_miles,n_scheduled_trips,frequency,service_date,is_express,is_ferry,is_rail,is_coverage,is_local,is_downtown_local,is_rapid,typology,name,combined_name,recent_combined_name,recent_route_id,route_primary_direction,minutes_atleast1_vp,minutes_atleast2_vp,total_rt_service_minutes,total_scheduled_service_minutes,total_vp,vp_in_shape,is_early,is_ontime,is_late,n_vp_trips,vp_per_minute,pct_in_shape,pct_rt_journey_atleast1_vp,pct_rt_journey_atleast2_vp,pct_sched_journey_atleast1_vp,pct_sched_journey_atleast2_vp,rt_sched_journey_ratio,avg_rt_service_minutes,sched_rt_category,speed_mph,schedule_source_record_id,base64_url,caltrans_district,portfolio_organization_name
0,0139b1253130b33adcd4b3a4490530d2,0177a66b-9f33-407d-a72e-776429fb73d4,0,all_day,63.5,0.92,2,0.08,2025-01-15,0.0,0.0,0.0,1.0,0.0,0.0,0.0,coverage,TCRTA TripShot Schedule,C70 LOOP__70,C70 LOOP 70,0177a66b-9f33-407d-a72e-776429fb73d4,Eastbound,159,158,156.95,127.0,473,390,0,1,1,2,3.01,0.82,1.0,1.0,1.0,1.0,1.24,78.47,schedule_and_vp,,recGeFW9Cz2cr1jJd,aHR0cHM6Ly90Y3J0YS50cmlwc2hvdC5jb20vdjEvZ3Rmcy56aXA_cmVnaW9uSWQ9Q0E1NThEREMtRDdGMi00QjQ4LTlDQUMtREVFQTExMzRGODIw,06 - Fresno / Bakersfield,Tulare County Regional Transit Agency


## Check if route_name-direction are unique to the portfolio_organization_name

In [28]:
unique_route_names = (
    df2.groupby(["service_date", "recent_combined_name"])
    .agg({"portfolio_organization_name": "nunique"})
    .reset_index()
)

In [29]:
unique_route_names.head(2)

Unnamed: 0,service_date,recent_combined_name,portfolio_organization_name
0,2025-01-15,,1
1,2025-01-15,01 City Hall - Armory + Arbor,0


In [30]:
unique_route_names2 = unique_route_names.loc[
    unique_route_names.portfolio_organization_name > 1
]

In [31]:
unique_route_names2.shape

(193, 3)

In [32]:
unique_route_names2.sort_values(
    by=["portfolio_organization_name"], ascending=False
).head()

Unnamed: 0,service_date,recent_combined_name,portfolio_organization_name
2119,2025-01-15,Route 4,6
11358,2025-05-14,Route 4,6
9044,2025-04-16,Route 4,6
4441,2025-02-12,Route 4,6
6694,2025-03-12,Route 4,6


In [33]:
df2.loc[df2.recent_combined_name == "Route 4"].portfolio_organization_name.unique()

array(['Ventura County (VCTC, Gold Coast, Cities of Camarillo, Moorpark, Ojai, Simi Valley, Thousand Oaks)',
       'City of Visalia', 'City of Beaumont',
       'Antelope Valley Transit Authority', 'Redding Area Bus Authority',
       'City of Monterey Park'], dtype=object)
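
Because a generic name like "Route 4" is shared by six organizations, `recent_combined_name` cannot serve as a grouping key by itself. The sketch below, on made-up trip counts, shows how grouping on the route name alone mixes operators together, while adding `portfolio_organization_name` keeps them apart.

```python
import pandas as pd

# Toy rows (trip counts are made up).
toy = pd.DataFrame(
    {
        "service_date": pd.to_datetime(["2025-05-14"] * 3),
        "portfolio_organization_name": [
            "City of Visalia",
            "City of Beaumont",
            "City of Visalia",
        ],
        "recent_combined_name": ["Route 4", "Route 4", "Route 1"],
        "n_scheduled_trips": [10, 7, 12],
    }
)

# Grouping on the route name alone adds two different operators together.
by_name_only = toy.groupby(
    ["service_date", "recent_combined_name"]
).n_scheduled_trips.sum()

# Adding the organization keeps each operator's Route 4 separate.
by_org_and_name = toy.groupby(
    ["service_date", "portfolio_organization_name", "recent_combined_name"]
).n_scheduled_trips.sum()

print(by_name_only)
print(by_org_and_name)
```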

In [34]:
unique_route_ids = (
    df2.groupby(["service_date", "recent_combined_name"])
    .agg({"route_id": "nunique"})
    .reset_index()
)

In [35]:
unique_route_ids2 = unique_route_ids.loc[unique_route_ids.route_id > 1]

In [36]:
unique_route_ids2.sort_values(by=["route_id"], ascending=False).head()

Unnamed: 0,service_date,recent_combined_name,route_id
1408,2025-01-15,80-89 Coastal Express,12
6003,2025-03-12,80-89 Coastal Express,12
8321,2025-04-16,80-89 Coastal Express,12
10648,2025-05-14,80-89 Coastal Express,12
3735,2025-02-12,80-89 Coastal Express,12


## Deduplicating on key, name, and portfolio_organization_name

In [38]:
df3 = df2.drop_duplicates(
    subset=[
        "name",
        "service_date",
        "schedule_gtfs_dataset_key",
        "portfolio_organization_name",
        "recent_combined_name",
        "route_id",
        "time_period",
        "direction_id",
    ]
)

In [43]:
# Use only 80-89 Coastal Express
df3 = df3.loc[
    (df3.portfolio_organization_name == portfolio_name)
    & (df3.recent_combined_name == "80-89 Coastal Express")
    & (df3.service_date == "2025-05-14T00:00:00.000000000")
    & (df3.time_period == "all_day")
]

In [40]:
crosswalk_cols = [
    "schedule_gtfs_dataset_key",
    "name",
    "portfolio_organization_name",
]

## Aggregate similar to quarterly rollup?
* `route_id`, `recent_route_id`, and `combined_name` all cause multiple rows to pop up for 80-89 Coastal Express


In [41]:
groupby_cols = [
    "portfolio_organization_name",
    "direction_id",
    "service_date",
    "recent_combined_name",
    "time_period",
]

### To Calculate again
* `avg_rt_service_minutes`
* `avg_scheduled_service_minutes`
* `avg_stop_miles`
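
Averages cannot simply be summed across route IDs, so they have to be re-derived after aggregating. In the rows shown earlier, `avg_rt_service_minutes` is consistent with `total_rt_service_minutes / n_vp_trips` (for example 589.67 / 4 ≈ 147.42). For `avg_scheduled_service_minutes`, a trip-weighted average is used below. That weighting is an assumption, not the pipeline's confirmed method, and the same idea could be tried for `avg_stop_miles`. `recompute_averages` is a hypothetical helper.

```python
import pandas as pd

# Two rows from the peak, direction 1 table above (route_ids 4134 and 4137).
toy = pd.DataFrame(
    {
        "direction_id": [1, 1],
        "avg_scheduled_service_minutes": [86.8, 125.5],
        "n_scheduled_trips": [5, 2],
        "total_rt_service_minutes": [589.67, 77.67],
        "n_vp_trips": [4, 1],
    }
)


def recompute_averages(df: pd.DataFrame, group_cols: list) -> pd.DataFrame:
    # Weight each row's scheduled average by its trip count before summing.
    df = df.assign(
        weighted_sched_minutes=df.avg_scheduled_service_minutes
        * df.n_scheduled_trips
    )
    out = (
        df.groupby(group_cols)
        .agg(
            weighted_sched_minutes=("weighted_sched_minutes", "sum"),
            n_scheduled_trips=("n_scheduled_trips", "sum"),
            total_rt_service_minutes=("total_rt_service_minutes", "sum"),
            n_vp_trips=("n_vp_trips", "sum"),
        )
        .reset_index()
    )
    # Divide summed totals by summed counts to get the new averages.
    return out.assign(
        avg_scheduled_service_minutes=out.weighted_sched_minutes
        / out.n_scheduled_trips,
        avg_rt_service_minutes=out.total_rt_service_minutes / out.n_vp_trips,
    ).drop(columns="weighted_sched_minutes")


recomputed = recompute_averages(toy, ["direction_id"])
print(recomputed)
```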

In [58]:
df3.columns

Index(['schedule_gtfs_dataset_key', 'route_id', 'direction_id', 'time_period',
       'avg_scheduled_service_minutes', 'avg_stop_miles', 'n_scheduled_trips',
       'frequency', 'service_date', 'is_express', 'is_ferry', 'is_rail',
       'is_coverage', 'is_local', 'is_downtown_local', 'is_rapid', 'typology',
       'name', 'combined_name', 'recent_combined_name', 'recent_route_id',
       'route_primary_direction', 'minutes_atleast1_vp', 'minutes_atleast2_vp',
       'total_rt_service_minutes', 'total_scheduled_service_minutes',
       'total_vp', 'vp_in_shape', 'is_early', 'is_ontime', 'is_late',
       'n_vp_trips', 'vp_per_minute', 'pct_in_shape',
       'pct_rt_journey_atleast1_vp', 'pct_rt_journey_atleast2_vp',
       'pct_sched_journey_atleast1_vp', 'pct_sched_journey_atleast2_vp',
       'rt_sched_journey_ratio', 'avg_rt_service_minutes', 'sched_rt_category',
       'speed_mph', 'schedule_source_record_id', 'base64_url',
       'caltrans_district', 'portfolio_organization_name']

In [70]:
to_sum = [
    "n_scheduled_trips",
    "minutes_atleast1_vp",
    "minutes_atleast2_vp",
    "total_rt_service_minutes",
    "total_scheduled_service_minutes",
    "total_vp",
    "vp_in_shape",
    "is_early",
    "is_ontime",
    "is_late",
    "n_vp_trips",
    "frequency",
]

In [71]:
agg1 = (
    df3.groupby(
        [
            "schedule_gtfs_dataset_key",
            "direction_id",
            "time_period",
            "service_date",
            "is_express",
            "is_ferry",
            "is_rail",
            "is_coverage",
            "is_local",
            "is_downtown_local",
            "is_rapid",
            "typology",
            "recent_combined_name",
            "route_primary_direction",
            "schedule_source_record_id",
            "base64_url",
            "caltrans_district",
            "portfolio_organization_name",
            
        ]
    )
    .agg({col: "sum" for col in to_sum})
    .reset_index()
)
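
One thing worth checking with a key list this long: by default, pandas `groupby` drops any group whose key contains NaN. `direction_id` is blank in `df_sched.head(2)` above, although `df2.head(1)` shows 0 for the same route, so the merge step may already fill it. The toy sketch below (made-up values) shows the difference `dropna=False` makes; it requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Toy rows: one has a missing direction_id (values are made up).
toy = pd.DataFrame(
    {
        "direction_id": [0.0, 1.0, np.nan],
        "n_scheduled_trips": [3, 4, 2],
    }
)

default_sum = toy.groupby("direction_id").n_scheduled_trips.sum()
keep_nan_sum = toy.groupby("direction_id", dropna=False).n_scheduled_trips.sum()

print(default_sum.sum())   # 7: the NaN-direction row is silently dropped
print(keep_nan_sum.sum())  # 9: all rows are kept
```

Comparing the summed `n_scheduled_trips` before and after the groupby is a quick way to see whether any rows were lost.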

In [73]:
agg1 = agg1.rename(
    columns={
        "total_rt_service_minutes": "rt_service_minutes",
        "total_scheduled_service_minutes": "scheduled_service_minutes",
    }).pipe(
     metrics.calculate_rt_vs_schedule_metrics
    ).rename(
       columns={
        "rt_service_minutes": "total_rt_service_minutes",
        "scheduled_service_minutes": "total_scheduled_service_minutes"

    })

In [76]:
set(list(df3.columns))-set(list(agg1.columns))

{'avg_rt_service_minutes',
 'avg_scheduled_service_minutes',
 'avg_stop_miles',
 'combined_name',
 'name',
 'recent_route_id',
 'route_id',
 'rt_sched_journey_ratio',
 'sched_rt_category',
 'speed_mph'}

### Direction 1's `pct_sched_journey_atleast1_vp` and `pct_sched_journey_atleast2_vp` are more than 1, which is wrong

In [79]:
metrics.calculate_rt_vs_schedule_metrics??

Signature: metrics.calculate_rt_vs_schedule_metrics(df: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame
Source:   
def calculate_rt_vs_schedule_metrics(df:pd.DataFrame)->pd.DataFrame:
    """
    Calculate RT vs schedule metrics
    """
    df = df.assign(
        vp_per_minute = df.total_vp / df.rt_service_minutes,

In [83]:
df3.loc[df3.direction_id == 1].minutes_atleast1_vp.sum()

2309

In [84]:
df3.loc[df3.direction_id == 1].total_scheduled_service_minutes.sum()

1704.0

In [85]:
df3.loc[df3.direction_id == 1].minutes_atleast2_vp.sum()

2232

In [82]:
df3.loc[df3.direction_id == 1][["minutes_atleast1_vp", "total_scheduled_service_minutes", "minutes_atleast2_vp"]]

Unnamed: 0,minutes_atleast1_vp,total_scheduled_service_minutes,minutes_atleast2_vp
13800,466,346.0,455
13870,90,50.0,85
13969,78,134.0,78
14039,441,384.0,394
14144,0,,0
14202,349,212.0,341
14272,263,129.0,258
14342,622,449.0,621
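
The arithmetic behind the values above 1 can be reproduced from these eight rows. The sketch assumes `pct_sched_journey_atleast1_vp` is `minutes_atleast1_vp` divided by scheduled service minutes. That matches the 78 / 134 ≈ 0.58 row shown earlier, but it is not confirmed here because the source listing above is cut off. The cap at the end is one possible guard, not the pipeline's confirmed behavior.

```python
import numpy as np
import pandas as pd

# Direction 1, all_day rows copied from the table above.
rows = pd.DataFrame(
    {
        "minutes_atleast1_vp": [466, 90, 78, 441, 0, 349, 263, 622],
        "total_scheduled_service_minutes": [
            346.0, 50.0, 134.0, 384.0, np.nan, 212.0, 129.0, 449.0,
        ],
        "minutes_atleast2_vp": [455, 85, 78, 394, 0, 341, 258, 621],
    }
)

totals = rows.sum()  # NaN is skipped by default
pct_atleast1 = totals.minutes_atleast1_vp / totals.total_scheduled_service_minutes
pct_atleast2 = totals.minutes_atleast2_vp / totals.total_scheduled_service_minutes
print(round(pct_atleast1, 2), round(pct_atleast2, 2))  # 1.36 1.31

# Assumption: cap the ratio at 1 after aggregating.
capped_atleast1 = min(pct_atleast1, 1.0)
capped_atleast2 = min(pct_atleast2, 1.0)
```

Several individual rows already have more minutes with vehicle positions than scheduled minutes (for example 466 vs. 346), so summing them carries the excess into the aggregate.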


In [87]:
route_80_89_dedup.loc[
    (route_80_89_dedup.time_period == "all_day") & (route_80_89_dedup.direction_id == 1)
][["minutes_atleast1_vp", "total_scheduled_service_minutes", "minutes_atleast2_vp"]]

Unnamed: 0,minutes_atleast1_vp,total_scheduled_service_minutes,minutes_atleast2_vp
13800,466,346.0,455
13870,90,50.0,85
13969,78,134.0,78
14039,441,384.0,394
14144,0,,0
14202,349,212.0,341
14272,263,129.0,258
14342,622,449.0,621


In [88]:
route_80_89.loc[
    (route_80_89.time_period == "all_day") & (route_80_89.direction_id == 1)
][["minutes_atleast1_vp", "total_scheduled_service_minutes", "minutes_atleast2_vp"]]

Unnamed: 0,minutes_atleast1_vp,total_scheduled_service_minutes,minutes_atleast2_vp
13800,466,346.0,455
13870,90,50.0,85
13969,78,134.0,78
14039,441,384.0,394
14144,0,,0
14202,349,212.0,341
14272,263,129.0,258
14342,622,449.0,621


In [78]:
agg1.T

Unnamed: 0,0,1
schedule_gtfs_dataset_key,1770249a5a2e770ca90628434d4934b1,1770249a5a2e770ca90628434d4934b1
direction_id,0,1
time_period,all_day,all_day
service_date,2025-05-14 00:00:00,2025-05-14 00:00:00
is_express,1.00,1.00
is_ferry,0.00,0.00
is_rail,0.00,0.00
is_coverage,0.00,0.00
is_local,0.00,0.00
is_downtown_local,0.00,0.00
