# SHN and Route Typology
* https://docs.google.com/spreadsheets/d/1gmRmVC4phwA3EunOhI4-aJ7uF5R2nZhIhPV2h3FM25w/edit?gid=0#gid=0

## Questions
* Is `route_typology` refreshed in January of each year?
* Do I need to go back to 2023 and add back the route typologies? Or can I just add route typologies from August onward?
* What's the best way to troubleshoot why a dataframe increases in rows after a merge?
* What's the difference between `shape_id` in `open_data/create_routes` vs `common_shape_id` in `route_typology_df`?
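
On the row-increase question, one common cause is a merge key that appears more than once in the right-hand dataframe. The sketch below uses made-up toy data (the operator names, flags, and years are placeholders, not real results) to show three checks: listing duplicated keys, counting rows per key, and asking pandas to enforce the expected m:1 relationship with `validate`.

```python
import pandas as pd

# Toy stand-ins for the open data routes (left) and route typologies (right).
left = pd.DataFrame(
    {
        "name": ["Op A", "Op A", "Op B"],
        "route_id": ["1", "2", "1"],
        "shape_id": ["s1", "s2", "s3"],
    }
)
right = pd.DataFrame(
    {
        "name": ["Op A", "Op A", "Op B"],
        "route_id": ["1", "1", "1"],
        "is_rail": [0, 1, 0],
        "year": [2023, 2025, 2025],
    }
)
merge_cols = ["name", "route_id"]

# 1. Any key that appears more than once on the right multiplies left rows.
dup_keys = right[right.duplicated(subset=merge_cols, keep=False)]

# 2. Count how many right-hand rows each key has.
rows_per_key = right.groupby(merge_cols).size()

# 3. Ask pandas to enforce the expected m:1 relationship.
try:
    pd.merge(left, right, on=merge_cols, how="left", validate="m:1")
    merge_is_m1 = True
except pd.errors.MergeError:
    merge_is_m1 = False

# Without validation the merge succeeds silently and the row count grows.
merged = pd.merge(left, right, on=merge_cols, how="left")
extra_rows = len(merged) - len(left)
```

Inspecting `dup_keys` usually points to the column that differs between the duplicated rows, which suggests whether to deduplicate, aggregate, or add a merge key.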

## Steps
1. Start with the open data routes (1 day? the most recent date; for the most part the grain is operator-service_date-route-shape). Bring in route typology, which is a route's designation for that year. It needs an aggregation to operator-route first, then it should be merged onto the open data routes with an m:1 merge (open data routes on the left, route typology on the right, merge on route).
2. To the above, tag those routes as being on the SHN or not. The open data routes will have columns for `is_on_shn` and route typology (`is_express`, `is_rapid`, `is_rail`, etc.).
3. Newer research task: the comparison of "what do you miss when you only sample Wednesday" uses `merge_data.concatenate_schedule_data_by_route_direction` (use a week here, instead of only Wednesday). This happens at some later point, starting from `gtfs_digest/merge_data`, and it is exploratory work in a notebook.

In [1]:
import geopandas as gpd
import google.auth
import numpy as np
import pandas as pd

credentials, project = google.auth.default()

import gcsfs

fs = gcsfs.GCSFileSystem()
import yaml

In [2]:
from calitp_data_analysis import geography_utils, utils
from segment_speed_utils import gtfs_schedule_wrangling, helpers, time_series_utils
from shared_utils import (
    catalog_utils,
    dask_utils,
    gtfs_utils_v2,
    portfolio_utils,
    publish_utils,
    rt_dates,
    rt_utils,
    shared_data,  # used by routes_shn_intersection when the buffered SHN file is missing
)
from update_vars import GTFS_DATA_DICT, RT_SCHED_GCS, SCHED_GCS, SEGMENT_GCS

In [3]:
TRAFFIC_OPS_GCS = f"{GTFS_DATA_DICT.gcs_paths.GCS}traffic_ops/"

In [4]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [5]:
bart_org_name = "San Francisco Bay Area Rapid Transit District"
bart_gtfs_dataset_name = "Bay Area 511 BART Schedule"

In [6]:
analysis_date = rt_dates.DATES["jul2025"]

* Insert route typology before organization name is merged in. 
* While I'm still working with the feed key / GTFS schedule tables is when I want to start adding things.
* Move my merge to just `name` and `route_id`; that will be a lot cleaner.

In [7]:
ca_transit_routes = gpd.read_parquet(
    f"gs://calitp-analytics-data/data-analyses/traffic_ops/ca_transit_routes.parquet",
    filters=[[("agency", "==", bart_org_name)]],
    storage_options={"token": credentials.token},
)

* Make sure the year is something I consider.
* Can concat all of the years to see which ones can be merged on.
* Because we patch previous dates in, we might be missing stuff if we only grab 2025. 
* After dataframe has been patched with `patch_previous_dates`, see what's left.
* `standardize_operator_info_for_exports` needs to be moved out and added on after I add in the SHN and route typologies. 
* All organization name renaming stuff needs to move to the end. 
* Read only what I need.

In [8]:
year = "2025"
route_typologies = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/gtfs_schedule/"
    f"nacto_typologies/route_typologies_{year}.parquet",
    filters=[[("name", "==", bart_gtfs_dataset_name)]],
)

In [9]:
ca_transit_routes.head(1).drop(columns=["geometry"])

Unnamed: 0,org_id,agency,route_id,route_type,route_name,shape_id,n_trips,base64_url
0,recoQLeNRISCKF8I0,San Francisco Bay Area Rapid Transit District,Blue-N,1,Blue-N,012A_shp,58,aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1CQQ==


In [10]:
route_typologies.head(1)

Unnamed: 0,schedule_gtfs_dataset_key,name,route_id,route_type,route_long_name,route_short_name,combined_name,is_express,is_rapid,is_rail,is_ferry,is_local,common_shape_id,is_coverage,is_downtown_local
0,8a1405af8da1379acc062e346187ac98,Bay Area 511 BART Schedule,Beige-N,1,Oakland Airport to Coliseum,Beige-N,Beige-N__Oakland Airport to Coliseum,0,0,1,0,0,,0,0


* Always use `name` and `route_id`.
* `route_typologies` is not always filled in for everything we have.
* There are a number of reasons: Tiffany samples the dates and only runs it quarterly.
* Merging something daily to quarterly doesn't always work well.
* Don't use `organization_name`.
* Get rid of all the `shape_id` variations for intersecting.
    * We just have to publish all of the `shape_id` values because there are truly different variations.

Steps
1. Patch missing data
2. Merge `route_typologies` with `open_data` on the left
3. Check that I'm not missing too many.
4. Add in SHN stuff.
    * Can dedup `name, route_id,` keep one `shape_id` with the longest route length in a separate dataframe as the "representative" shape. 
    * Use that `shape_id` for doing the intersection between transit route x SHN to determine the column we are creating.
    * Merge this new dataframe back to the original dataframe. 
5. Add in `standardize_org_name` function
6. Add in `finalize_export_df` 
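
Step 4 above can be sketched on toy data. This is only an illustration under stated assumptions: the rows are made up, and the `is_on_shn` values are placeholders standing in for the result of the real transit route x SHN intersection. The point is to keep one representative `shape_id` per `name`-`route_id`, attach the route-level result to it, and merge that back onto every shape without growing the row count.

```python
import pandas as pd

# Toy routes table: one row per operator-route-shape.
routes = pd.DataFrame(
    {
        "name": ["Op A", "Op A", "Op A", "Op B"],
        "route_id": ["1", "1", "2", "1"],
        "shape_id": ["s1", "s2", "s3", "s4"],
        "route_length_feet": [1000.0, 2500.0, 800.0, 1200.0],
    }
)
route_cols = ["name", "route_id"]

# Keep the longest shape per operator-route as the representative shape.
representative = (
    routes.sort_values(
        route_cols + ["route_length_feet"], ascending=[True, True, False]
    )
    .drop_duplicates(subset=route_cols)
    .reset_index(drop=True)
)

# Placeholder for the SHN intersection result, one value per operator-route.
representative = representative.assign(is_on_shn=[1, 0, 1])

# Merge the route-level result back onto every shape.
# validate="m:1" raises if the right side has duplicate keys.
routes_with_shn = pd.merge(
    routes,
    representative[route_cols + ["is_on_shn"]],
    on=route_cols,
    how="left",
    validate="m:1",
)
```

Keeping `shape_id` in `representative` makes it possible to trace which shape was used for the intersection.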

In [11]:
test_m1 = pd.merge(
    ca_transit_routes.assign(name=bart_gtfs_dataset_name),
    route_typologies,
    on=["name", "route_id"],
    how="left",
    indicator=True,
)
test_m1._merge.value_counts()

both          24
left_only      2
right_only     0
Name: _merge, dtype: int64

## Step 1: Concat all the years for `route_typologies`

In [12]:
ROUTE_TYPOLOGIES_FILE = GTFS_DATA_DICT.schedule_tables.route_typologies

In [13]:
route_typology_paths = [
    f"{SCHED_GCS}{ROUTE_TYPOLOGIES_FILE}" for year in rt_dates.years_available
]

In [14]:
rt_dates.years_available

[2023, 2024, 2025]

In [15]:
route_typology_paths

['gs://calitp-analytics-data/data-analyses/gtfs_schedule/nacto_typologies/route_typologies',
 'gs://calitp-analytics-data/data-analyses/gtfs_schedule/nacto_typologies/route_typologies',
 'gs://calitp-analytics-data/data-analyses/gtfs_schedule/nacto_typologies/route_typologies']

In [16]:
route_typology_df = dask_utils.get_ddf(
    route_typology_paths,
    rt_dates.years_available,
    data_type="df",
    get_pandas=True,
    columns=[
        "name",
        "route_id",
        # "route_type",
        "schedule_gtfs_dataset_key",
        "is_express",
        "is_ferry",
        "is_rail",
        "is_coverage",
        "is_local",
        "is_downtown_local",
        "is_rapid",
    ],
    add_date=False,
    add_year=True,
)

In [17]:
route_typology_df.head(2)

Unnamed: 0,name,route_id,schedule_gtfs_dataset_key,is_express,is_ferry,is_rail,is_coverage,is_local,is_downtown_local,is_rapid,year
0,TCRTA TripShot Schedule,0177a66b-9f33-407d-a72e-776429fb73d4,0139b1253130b33adcd4b3a4490530d2,0,0,0,1,0,0,1,2023
1,TCRTA TripShot Schedule,0ad6c6aa-1939-45a0-a3a8-02ebe8e19092,0139b1253130b33adcd4b3a4490530d2,0,0,0,1,0,0,1,2023


In [18]:
route_typology_df.columns

Index(['name', 'route_id', 'schedule_gtfs_dataset_key', 'is_express',
       'is_ferry', 'is_rail', 'is_coverage', 'is_local', 'is_downtown_local',
       'is_rapid', 'year'],
      dtype='object')

In [19]:
len(route_typology_df)

12353

* Need to drop duplicates for `route_typology_df` for routes that do appear repeatedly over the years.

In [20]:
route_typology_df2 = route_typology_df.sort_values(
    by=["year"], ascending=False
).drop_duplicates(
    subset=[
        "name",
        "route_id",
        # "route_type",
        "schedule_gtfs_dataset_key",
        "is_express",
        "is_ferry",
        "is_rail",
        "is_coverage",
        "is_local",
        "is_downtown_local",
        "is_rapid",
    ]
)

In [21]:
route_typology_df2.year.value_counts()

2023    3195
2025    3000
2024    1911
Name: year, dtype: int64

In [22]:
len(route_typology_df2)

8106
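
Because the typology flags are part of the `subset` above, a route whose flags changed between years keeps one row per distinct set of flags. That may be one reason a later merge on `name` and `route_id` returns more rows than it started with. A minimal sketch on made-up data, assuming the most recent year should win:

```python
import pandas as pd

# Toy typologies: Op A route 1 changed from local to rapid between years.
typologies = pd.DataFrame(
    {
        "name": ["Op A", "Op A", "Op B"],
        "route_id": ["1", "1", "7"],
        "is_rapid": [0, 1, 0],
        "is_local": [1, 0, 1],
        "year": [2023, 2025, 2024],
    }
)
key_cols = ["name", "route_id"]

# Deduplicating on keys + flags keeps both Op A rows because the flags differ.
dedup_with_flags = typologies.sort_values("year", ascending=False).drop_duplicates(
    subset=key_cols + ["is_rapid", "is_local"]
)

# Deduplicating on the keys alone keeps only the most recent year per route.
latest_only = (
    typologies.sort_values("year", ascending=False)
    .drop_duplicates(subset=key_cols)
    .reset_index(drop=True)
)
```

Whether the latest year is the right tie-breaker is a judgment call; the alternative is to aggregate the flags across years (for example with `max`).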

## Step 2: Patch missing data (remove the `standardize_org_name` function)

In [23]:
trips = helpers.import_scheduled_trips(analysis_date, get_pandas=True).dropna(
    subset="shape_array_key"
)

In [24]:
trips.columns

Index(['schedule_gtfs_dataset_key', 'name', 'trip_id', 'shape_id',
       'shape_array_key', 'route_id', 'route_key', 'direction_id',
       'route_short_name'],
      dtype='object')

In [25]:
def create_routes_file_for_export(date: str) -> gpd.GeoDataFrame:
    """
    Create a shapes (with associated route info) file for export.
    This allows users to plot the various shapes,
    transit path options, and select between variations for
    a given route.
    """
    # Read in local parquets
    trips = helpers.import_scheduled_trips(
        date,
        columns=[
            "name",
            "gtfs_dataset_key",
            "route_id",
            "route_type",
            "shape_id",
            "shape_array_key",
            "route_long_name",
            "route_short_name",
            "route_desc",
        ],
        get_pandas=True,
    ).dropna(subset="shape_array_key")

    shapes = helpers.import_scheduled_shapes(
        date,
        columns=["shape_array_key", "n_trips", "geometry"],
        get_pandas=True,
        crs=geography_utils.WGS84,
    ).dropna(subset="shape_array_key")

    df = (
        pd.merge(shapes, trips, on="shape_array_key", how="inner")
        .drop_duplicates(subset="shape_array_key")
        .drop(columns="shape_array_key")
    )

    drop_cols = ["route_short_name", "route_long_name", "route_desc"]
    route_shape_cols = ["schedule_gtfs_dataset_key", "route_id", "shape_id"]

    routes_assembled = (
        portfolio_utils.add_route_name(df)
        .drop(columns=drop_cols)
        .sort_values(route_shape_cols)
        .drop_duplicates(subset=route_shape_cols)
        .reset_index(drop=True)
    )

    routes_assembled = routes_assembled.assign(
        route_length_feet=routes_assembled.geometry.to_crs(
            geography_utils.CA_NAD83Albers_ft
        ).length
    )
    return routes_assembled

In [26]:
routes = create_routes_file_for_export(analysis_date)

In [27]:
def patch_previous_dates(
    current_routes: gpd.GeoDataFrame,
    current_date: str,
    published_operators_yaml: str = "../gtfs_funnel/published_operators.yml",
) -> gpd.GeoDataFrame:
    """
    Compare to the yaml for what operators we want, and
    patch in previous dates for the 10 or so operators
    that do not have data for this current date.
    """
    # Read in the published operators file
    with open(published_operators_yaml) as f:
        published_operators_dict = yaml.safe_load(f)

    # Convert the published operators file into a dict mapping dates to an iterable of operators
    patch_operators_dict = {
        str(date): operator_list
        for date, operator_list in published_operators_dict.items()
        if str(date)
        != current_date  # Exclude the current (analysis) date, since that does not need to be patched
    }

    partial_dfs = []

    # For each date and corresponding iterable of operators, get the data from the last time they appeared
    for one_date, operator_list in patch_operators_dict.items():
        df_to_add = publish_utils.subset_table_from_previous_date(
            gcs_bucket=TRAFFIC_OPS_GCS,
            filename=f"ca_transit_routes",
            operator_and_dates_dict=patch_operators_dict,
            date=one_date,
            crosswalk_col="schedule_gtfs_dataset_key",
            data_type="gdf",
        )

        partial_dfs.append(df_to_add)

    patch_routes = pd.concat(partial_dfs, axis=0, ignore_index=True)

    # Concat the current data to the "backfill" data
    published_routes = pd.concat(
        [current_routes, patch_routes], axis=0, ignore_index=True
    )

    return published_routes

In [28]:
routes.sample().drop(columns=["geometry"])

Unnamed: 0,n_trips,name,schedule_gtfs_dataset_key,route_id,route_type,shape_id,route_name_used,route_length_feet
6179,15,Foothill Schedule,f74424acf8c41e4c1e9fd42838c4875c,195,3,14404_shp,195,54509.81


In [29]:
len(routes)

6616

In [30]:
published_routes = patch_previous_dates(
    routes,
    analysis_date,
)

In [31]:
len(published_routes)

7442

## Step 3: Merge `route_typologies` with `open_data`

In [32]:
m1 = pd.merge(
    published_routes,
    route_typology_df2,
    on=["name", "route_id", "schedule_gtfs_dataset_key"],
    how="left",
    indicator=True,
)

In [33]:
m1._merge.value_counts()

both          6344
left_only     2114
right_only       0
Name: _merge, dtype: int64

In [34]:
len(m1) - len(published_routes)

1016

In [35]:
m2 = pd.merge(
    published_routes,
    route_typology_df2,
    on=["name", "route_id"],
    how="left",
    indicator=True,
)

In [36]:
m2._merge.value_counts()

both          13259
left_only       923
right_only        0
Name: _merge, dtype: int64

* Adding `route_type` doesn't really improve things.

## Step 4: Add in SHN 
* Keep only the `shape_id` with the longest length

In [37]:
m1.columns

Index(['n_trips', 'geometry', 'name', 'schedule_gtfs_dataset_key', 'route_id',
       'route_type', 'shape_id', 'route_name_used', 'route_length_feet',
       'base64_url', 'organization_source_record_id', 'organization_name',
       'caltrans_district', 'is_express', 'is_ferry', 'is_rail', 'is_coverage',
       'is_local', 'is_downtown_local', 'is_rapid', 'year', '_merge'],
      dtype='object')

In [38]:
longest_shape = m1.sort_values(
    by=["name", "route_id", "route_length_feet"], ascending=[True, True, False]
)[["name", "route_id", "route_length_feet", "geometry"]].drop_duplicates(
    subset=["name", "route_id"]
)

In [39]:
len(longest_shape)

2661

In [54]:
def routes_shn_intersection(
    routes_gdf: gpd.GeoDataFrame, buffer_amount: int
) -> gpd.GeoDataFrame:
    """
    Overlay the most recent transit routes with a buffered version
    of the SHN
    """
    GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/shared_data/"

    # Read in buffered shn here or re buffer if we don't have it available.
    HWY_FILE = f"{GCS_FILE_PATH}shn_buffered_{buffer_amount}_ft_shn_dissolved_by_ct_district_route.parquet"

    if fs.exists(HWY_FILE):
        shn_routes_gdf = gpd.read_parquet(
            HWY_FILE, storage_options={"token": credentials.token}
        )
    else:
        shn_routes_gdf = shared_data.buffer_shn(buffer_amount)

    # Process the most recent transit route geographies and ensure the
    # CRS matches the SHN routes' GDF so the overlay doesn't go wonky.
    routes_gdf = routes_gdf.to_crs(shn_routes_gdf.crs)

    # Overlay transit routes with the SHN geographies.
    gdf = gpd.overlay(
        routes_gdf, shn_routes_gdf, how="intersection", keep_geom_type=True
    )

    # Calculate the share of the transit route that runs on a highway, round it to
    # three decimals, and multiply it by 100 to get a percent. Drop the geometry
    # because we want the original transit route shapes.
    gdf = gdf.assign(
        pct_route_on_hwy=(gdf.geometry.length / gdf.route_length_feet).round(3) * 100,
    )
    # Subset
    gdf2 = gdf[
        [
            "name",
            "pct_route_on_hwy",
            "route_id",
            "shape_id", # maybe comment out later
            "district",
            "shn_route",
        ]
    ]

    # Clean up. Use assign so the change lands on a new dataframe instead of a
    # slice of gdf, which avoids the SettingWithCopyWarning.
    gdf2 = gdf2.assign(district=gdf2.district.fillna(0).astype(int))

    gdf2 = gdf2.rename(
        columns={
            "pct_route_on_hwy": "pct_route_on_hwy_across_districts",
            "district": "shn_districts",
        }
    )
    return gdf2

In [56]:
shn_routes = routes_shn_intersection(m1, 50)


In [57]:
shn_routes.sample(5)

Unnamed: 0,name,pct_route_on_hwy_across_districts,route_id,shape_id,shn_districts,shn_route
7151,Big Blue Bus Schedule,0.2,16,shp-016-51,7,2
5275,LA Metro Bus Schedule,0.2,236-13188,2360136_JUNE25,7,5
995,Bay Area 511 Muni Schedule,0.7,12,1256,4,101
14345,Flixbus Schedule,0.0,2017,cea2d568e77875df73d3b78b6c496ee3,12,1
15242,San Diego Schedule,0.5,7,S2_7_0_277,11,805


In [58]:
shn_routes.pct_route_on_hwy_across_districts.describe()

count   19950.00
mean        6.73
std        16.01
min         0.00
25%         0.10
50%         0.40
75%         2.70
max        99.20
Name: pct_route_on_hwy_across_districts, dtype: float64

In [59]:
def group_route_district(df: pd.DataFrame, pct_route_on_hwy_agg: str) -> pd.DataFrame:
    """
    Aggregate by adding all the districts and SHN to a single row, rather than
    multiple and sum up the total % of SHN a transit route intersects with.

    df: the dataframe you want to aggregate
    pct_route_on_hwy_agg: whether you want to find the max, min, sum, etc on the column
    "pct_route_on_hwy_across_districts"
    """

    agg1 = (
        df.groupby(
            [
                "name",
                "route_id",
                "shape_id" # maybe comment out later
            ],
            as_index=False,
        )[["shn_route", "shn_districts", "pct_route_on_hwy_across_districts"]]
        .agg(
            {
                "shn_route": lambda x: ", ".join(set(x.astype(str))),
                "shn_districts": lambda x: ", ".join(set(x.astype(str))),
                "pct_route_on_hwy_across_districts": pct_route_on_hwy_agg,
            }
        )
        .reset_index(drop=True)
    )

    # Clean up
    agg1.pct_route_on_hwy_across_districts = (
        agg1.pct_route_on_hwy_across_districts.astype(float).round(2)
    )

    return agg1

In [60]:
grouped = group_route_district(shn_routes, "sum")

In [61]:
grouped.sample(10)

Unnamed: 0,name,route_id,shape_id,shn_route,shn_districts,pct_route_on_hwy_across_districts
4313,SBMTD Schedule,5,shp-5-55,101,5,0.5
4348,SLORTA Schedule,21,9,"101, 1",5,33.7
2833,LA Metro Bus Schedule,246-13188,2460082_JUNE25,"110, 47, 405, 1",7,6.5
4299,SBMTD Schedule,25,shp-25-03,101,5,1.3
1653,Big Blue Bus Swiftly Schedule,3908,27519,"405, 10, 1",7,1.2
3334,Madera Metro Schedule,76873,p_1436527,"145, 99",6,0.0
5043,Stanford Schedule,3,3:4,82,4,17.6
457,Bay Area 511 Capitol Corridor Schedule,SF,ejnn,"580, 80, 880",4,138.4
3554,Monterey Salinas Schedule,094,0940061,"218, 68, 1",5,13.4
283,Bay Area 511 AC Transit Schedule,376,shp-376-51,"123, 80",4,6.6


In [62]:
grouped.pct_route_on_hwy_across_districts.describe()

count   5679.00
mean      23.63
std       37.80
min        0.00
25%        0.80
50%        2.60
75%       37.35
max      275.70
Name: pct_route_on_hwy_across_districts, dtype: float64
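
The maximum above is well over 100, which suggests that summing counts some stretch of a route more than once. One possible cause, stated here as an assumption rather than something verified, is that a route intersects overlapping buffered highway segments or the same highway in more than one district. The toy sketch below (made-up numbers, hypothetical helper name `summarize`) compares `"sum"` with `"max"` and flags routes whose summed percent exceeds 100.

```python
import pandas as pd

# Toy intersection output: one row per route x SHN route x district.
intersections = pd.DataFrame(
    {
        "name": ["Op A", "Op A", "Op A", "Op B"],
        "route_id": ["1", "1", "1", "9"],
        "shn_route": [101, 101, 1, 5],
        "shn_districts": [4, 5, 5, 7],
        "pct_route_on_hwy_across_districts": [60.0, 55.0, 10.0, 2.5],
    }
)


def summarize(df: pd.DataFrame, how: str) -> pd.DataFrame:
    """One row per operator-route; sorting makes the joined strings stable."""
    return df.groupby(["name", "route_id"], as_index=False).agg(
        shn_route=("shn_route", lambda x: ", ".join(sorted(set(x.astype(str))))),
        shn_districts=(
            "shn_districts",
            lambda x: ", ".join(sorted(set(x.astype(str)))),
        ),
        pct=("pct_route_on_hwy_across_districts", how),
    )


summed = summarize(intersections, "sum")
maxed = summarize(intersections, "max")

# Routes whose summed percent is not a plausible share of the route.
over_100 = summed[summed.pct > 100]
```

Looking at the rows behind `over_100` in the ungrouped dataframe is a quick way to see which highway and district combinations are being stacked.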

In [None]:
m1.columns

In [None]:
grouped.columns

In [None]:
len(grouped)

In [None]:
# Merge back with the original dataframe
shn_typology = pd.merge(m1, grouped, on=["route_id", "name"], how="left")

In [None]:
len(shn_typology), len(m1)

In [None]:
def add_shn_information(gdf: gpd.GeoDataFrame, buffer_amt: int) -> pd.DataFrame:
    """
    Prepare the gdf to join with the existing transit_routes
    dataframe that is published on the Open Data Portal
    """
    # Retain only the longest shape for each name-route_id combo
    # so finding the intersection with SHN won't take as long
    """
    longest_shape = gdf.sort_values(
        by=["name", "route_id", "route_length_feet"], ascending=[True, True, False]
    )[["name", "route_id", "route_length_feet", "geometry"]].drop_duplicates(
        subset=["name", "route_id"]
    )
    """
    # Overlay
    intersecting = routes_shn_intersection(gdf, buffer_amt)
    
    # Group the dataframe so that one route only has one
    # row instead of multiple rows after finding its
    # intersection with any SHN routes.
    # print(intersecting.columns)
    agg1 = group_route_district(intersecting, "sum")

    # Merge the dataframe with all the SHS info with the original
    # gdf so we can get the original transit route geometries &
    # any routes that don't intersect with the state highway routes.
    m1 = pd.merge(gdf, agg1, on=["route_id", "name"], how="left")

    # Add yes/no column to signify if a transit route intersects
    # with a SHN route. Routes with no intersection are NaN after the
    # left merge; NaN > 0 is False, so they are tagged "N".
    m1["on_shs"] = np.where(m1["pct_route_on_hwy_across_districts"] > 0, "Y", "N")

    # Clean up rows that are tagged as "on_shs==N" but still have values
    # that appear.
    m1.loc[
        (m1["on_shs"] == "N") & (m1["shn_districts"] != "0"),
        ["shn_districts", "shn_route"],
    ] = np.nan

    return m1
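
The direction of the comparison used for `on_shs` matters because routes that do not intersect any SHN route come out of the left merge with `NaN` in the percent column. A small sketch with made-up values shows how the two comparisons treat `NaN`:

```python
import numpy as np
import pandas as pd

# Toy percent column: zero overlap, some overlap, and no match in the merge.
pct = pd.Series([0.0, 12.5, np.nan], name="pct_route_on_hwy_across_districts")

# NaN == 0 is False, so unmatched routes fall through to "Y".
flag_eq_zero = np.where(pct == 0, "N", "Y")

# NaN > 0 is also False, so unmatched routes fall through to "N".
flag_gt_zero = np.where(pct > 0, "Y", "N")
```

Checking `on_shs.value_counts()` against the count of null percent values is a quick sanity check on whichever version is used.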

In [None]:
shn_typology = add_shn_information(m1, 50)

In [None]:
type(shn_typology)

In [None]:
len(shn_typology)

In [None]:
shn_typology.on_shs.value_counts()

In [None]:
shn_typology.loc[(shn_typology.name == "Antelope Valley Transit Authority Schedule")
                & (shn_typology.route_id == "50")].drop(columns = ["geometry", "schedule_gtfs_dataset_key"]).T

In [None]:
shn_typology.drop(columns = ["geometry", "schedule_gtfs_dataset_key"]).sample(3).T

## Step 5: Add back `standardize_org_name`

## Step 6: Add in `finalize_export_df`