# Route identification (time-series)

Over time, even `route_id` values change. This notebook picks out a couple of examples.

In [None]:
import pandas as pd
import yaml

from update_vars import SCHED_GCS, GTFS_DATA_DICT
from shared_utils import portfolio_utils
from segment_speed_utils import time_series_utils

with open(
    "../_shared_utils/shared_utils/portfolio_organization_name.yml", "r"
) as f:
    PORTFOLIO_ORGANIZATIONS_DICT = yaml.safe_load(f)

In [None]:
CLEANED_ROUTE_NAMING = GTFS_DATA_DICT.schedule_tables.route_identification

df = pd.read_parquet(
    f"{SCHED_GCS}{CLEANED_ROUTE_NAMING}.parquet"
).pipe(
    portfolio_utils.standardize_portfolio_organization_names, 
    PORTFOLIO_ORGANIZATIONS_DICT
)

## LA Metro

`route_id` has a suffix added every time a new feed goes into effect.

In [None]:
subset_cols = [
    "name", "portfolio_organization_name", 
    "route_id", "service_date", "combined_name",
    "recent_combined_name", "recent_route_id2"
]

df[(df.name.str.contains("LA Metro")) & 
   (df.recent_combined_name == "2__Metro Local Line")
][subset_cols]

In [None]:
df[(df.name.str.contains("LA Metro")) & 
   (df.recent_combined_name == "2__Metro Local Line")
  ].route_id.value_counts()
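
One way to see how many suffixed variants sit behind a single route is to strip the suffix and group on what remains. The sketch below uses made-up rows and assumes the suffix is everything after the last hyphen (a `<base>-<suffix>` pattern); that pattern and the `base_route_id` column are assumptions for illustration, not something confirmed by the table above.

```python
import pandas as pd

# Hypothetical rows; the "<base>-<feed suffix>" pattern is an assumption.
toy = pd.DataFrame({
    "route_id": ["2-13153", "2-13159", "2-13164", "4-13153"],
    "service_date": pd.to_datetime(
        ["2023-03-15", "2023-04-12", "2023-05-17", "2023-03-15"]
    ),
})

# Drop everything from the last hyphen onward to recover a stable key.
toy["base_route_id"] = toy.route_id.str.replace(r"-[^-]*$", "", regex=True)

# Number of distinct suffixed route_ids behind each base id.
suffix_counts = toy.groupby("base_route_id").route_id.nunique()
print(suffix_counts)
```

If the real suffix format differs, only the regular expression needs to change.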

## VCTC

VCTC routes were flagged as a complicated case where metrics were duplicated in GTFS Digest.

When the extra cleaning in [time_series_utils](https://github.com/cal-itp/data-analyses/blob/main/rt_segment_speeds/segment_speed_utils/time_series_utils.py#L84-L105) was added in Apr 2024, VCTC was set to keep only `route_long_name`.

Since Sep 2024, it appears they have grouped a set of routes together, which now shows up as 80-89 Coastal Express.
Given that, we should remove VCTC GMV from the list of operators that get extra route cleaning.

In [None]:
vctc_df = df[(df.name=="VCTC GMV Schedule") & 
   (df.recent_combined_name.str.contains("Coastal Express"))
  ]
vctc_df[subset_cols]

In [None]:
vctc_df.pipe(
    time_series_utils.clean_standardized_route_names
)

In [None]:
unique_routes = vctc_df.pipe(
    time_series_utils.clean_standardized_route_names
).route_id.unique()

From Sep 2024 through May 2025, the new `recent_combined_name` captures many `route_id` values!

This confirms what was seen in GTFS Digest.

In [None]:
vctc_df.pipe(
    time_series_utils.clean_standardized_route_names
).astype(str).groupby(
    ["recent_combined_name"]
).agg({
    "route_id": lambda x: list(set(x)),
    "service_date": ["min", "max"],
}).reset_index()
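
A count is easier to scan than a list of ids. The sketch below is a hypothetical helper, `flag_crowded_names`, that counts distinct `route_id` values per `recent_combined_name` and flags names above a threshold. The helper name, the `max_route_ids` threshold, and the toy rows are all assumptions for illustration.

```python
import pandas as pd

def flag_crowded_names(df: pd.DataFrame, max_route_ids: int = 2) -> pd.DataFrame:
    """Hypothetical helper: count distinct route_ids per recent_combined_name."""
    counts = (
        df.groupby("recent_combined_name")
        .agg(
            n_route_ids=("route_id", "nunique"),
            first_date=("service_date", "min"),
            last_date=("service_date", "max"),
        )
        .reset_index()
    )
    # Flag names that absorb more route_ids than the (assumed) threshold.
    counts["too_many"] = counts.n_route_ids > max_route_ids
    return counts

# Made-up rows; the route_id values and dates are not real data.
toy = pd.DataFrame({
    "recent_combined_name": ["80-89 Coastal Express"] * 4 + ["Route A"],
    "route_id": ["80", "81", "84", "85", "A"],
    "service_date": pd.to_datetime(
        ["2024-09-18", "2024-10-16", "2024-11-13", "2024-12-11", "2024-09-18"]
    ),
})

flagged = flag_crowded_names(toy)
print(flagged)
```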

In Apr 2024, this looks ok: `recent_combined_name` is reasonable.

In [None]:
vctc_df[vctc_df.service_date <= pd.to_datetime("2024-04-01")].pipe(
    time_series_utils.clean_standardized_route_names
).astype(str).groupby(
    ["recent_combined_name"]
).agg({
    "route_id": lambda x: list(set(x)),
    "service_date": ["min", "max"],
}).reset_index()

In [None]:
def display_subset(df: pd.DataFrame) -> None:
    """
    Compare what happens when we pipe vs don't pipe for the operators in the list we
    do extra data wrangling on.
    """
    def summarize(subset: pd.DataFrame) -> pd.DataFrame:
        # List the distinct route_ids and the date range behind each name
        return subset.astype(str).groupby(
            ["recent_combined_name"]
        ).agg({
            "route_id": lambda x: list(set(x)),
            "service_date": ["min", "max"],
        }).reset_index()

    with_pipe = df.pipe(
        time_series_utils.clean_standardized_route_names
    ).pipe(summarize)

    no_pipe = df.pipe(summarize)
    
    print("status quo, with time_series_utils.pipe affecting it")
    display(with_pipe)
    
    print("status quo, no pipe, no extra parsing affecting it")
    display(no_pipe)
    
    return

In [None]:
display_subset(vctc_df)

## Check other operators

Find other operators that might also have had changes and see if any can benefit from having the extra parsing removed.

In [None]:
for operator in time_series_utils.operators_only_route_long_name:
    print(f"Operator: {operator}")
          
    display_subset(df[df.name==operator])
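
Rather than eyeballing every operator's output, a quick filter can narrow the list to routes whose naming actually changed. The sketch below is a hypothetical helper, `routes_with_name_changes`, that keeps each operator and `route_id` pair with more than one distinct `combined_name` over time. The helper and the toy rows are assumptions for illustration, and whether `combined_name` is the right column to compare is a judgment call.

```python
import pandas as pd

def routes_with_name_changes(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: route_ids whose combined_name varies over time."""
    n_names = (
        df.groupby(["name", "route_id"])
        .combined_name.nunique()
        .reset_index(name="n_combined_names")
    )
    # Keep only the routes that appear under more than one name.
    return n_names[n_names.n_combined_names > 1].reset_index(drop=True)

# Made-up rows; operator names and route names are not real data.
toy = pd.DataFrame({
    "name": ["Operator A"] * 4 + ["Operator B"] * 2,
    "route_id": ["10", "10", "20", "20", "1", "1"],
    "combined_name": [
        "10__Main St", "10__Main Street",
        "20__Airport", "20__Airport",
        "1__Downtown", "1__Downtown",
    ],
})

changed = routes_with_name_changes(toy)
print(changed)
```

Operators that never show up in this output are less likely to need a closer look.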