# SHN and Route Typology
* https://docs.google.com/spreadsheets/d/1gmRmVC4phwA3EunOhI4-aJ7uF5R2nZhIhPV2h3FM25w/edit?gid=0#gid=0

## Questions
* Is `route_typology` refreshed January of each year?
* Do I need to go back to 2023 and add back the route typologies? Or can I just add route typologies from August onward?
* Best way to troubleshoot why a dataframe increases in rows after a merge?
* What's the difference between `shape_id` in `open_data/create_routes` and `common_shape_id` in `route_typology_df`?
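
On the question of why a dataframe gains rows after a merge: a left merge adds rows whenever a key on the left matches more than one row on the right. Below is a minimal sketch with toy data (the frames and values are invented for illustration). It uses `duplicated` to find the repeated keys and `validate="m:1"` so that pandas raises instead of silently adding rows.

```python
import pandas as pd

# Toy stand-ins for the open data routes (left) and route typologies (right).
left = pd.DataFrame(
    {
        "name": ["Feed A", "Feed A", "Feed B"],
        "route_id": ["1", "2", "1"],
        "shape_id": ["s1", "s2", "s3"],
    }
)
right = pd.DataFrame(
    {
        "name": ["Feed A", "Feed A", "Feed B"],
        "route_id": ["1", "1", "1"],  # Feed A / route 1 appears twice
        "is_rail": [0, 1, 0],
    }
)
merge_cols = ["name", "route_id"]

# 1. Find keys that appear more than once on the right side.
dup_keys = right[right.duplicated(subset=merge_cols, keep=False)]

# 2. Ask pandas to enforce the expected m:1 relationship.
try:
    pd.merge(left, right, on=merge_cols, how="left", validate="m:1")
    merge_is_many_to_one = True
except pd.errors.MergeError:
    merge_is_many_to_one = False

# 3. Without validate, the merge silently gains rows.
merged = pd.merge(left, right, on=merge_cols, how="left")
rows_added = len(merged) - len(left)
```

Inspecting `dup_keys` usually shows which grain the right-hand table is really at, which then suggests what to deduplicate or aggregate before merging.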

## Steps
1. Start with the open data routes (one day, for the most part the most recent date), which are at the operator-service_date-route-shape grain. Bring in the route typology, which is a route's designation for that year. It needs to be aggregated to the operator-route grain first, then merged onto the open data routes with an m:1 merge (open data routes on the left, route typology on the right, merged on route).
2. Tag those routes as being on the SHN or not. The open data routes will then have columns for `is_on_shn` and the route typology (`is_express`, `is_rapid`, `is_rail`, etc.).
3. Newer research task: the comparison of "what do you miss when you only sample Wednesday" uses `merge_data.concatenate_schedule_data_by_route_direction` (with a week here, instead of only Wednesday). This comes at some later point, starts from `gtfs_digest/merge_data`, and is exploratory work in a notebook.
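
Step 1 can be sketched as follows. This is a toy illustration rather than the pipeline code: the frames are invented, and taking the `max` of each flag when collapsing to the operator-route grain is an assumption about how the aggregation might be done.

```python
import pandas as pd

# Open data routes: operator-route-shape grain (toy values).
open_data_routes = pd.DataFrame(
    {
        "name": ["Feed A", "Feed A", "Feed A"],
        "route_id": ["10", "10", "20"],
        "shape_id": ["a", "b", "c"],
    }
)

# Route typology at a finer grain than operator-route (toy values).
typology = pd.DataFrame(
    {
        "name": ["Feed A", "Feed A", "Feed A"],
        "route_id": ["10", "10", "20"],
        "is_express": [0, 1, 0],
        "is_rapid": [0, 0, 1],
    }
)

# Collapse to one row per operator-route. Using max means a route is
# flagged if any of its rows is flagged (an assumption for this sketch).
typology_by_route = typology.groupby(["name", "route_id"], as_index=False)[
    ["is_express", "is_rapid"]
].max()

# m:1 merge: many shapes per route on the left, one typology row on the right.
tagged = pd.merge(
    open_data_routes,
    typology_by_route,
    on=["name", "route_id"],
    how="left",
    validate="m:1",
)
```

Because the right side has exactly one row per operator-route, the merged frame keeps the same number of rows as the open data routes.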

In [1]:
import geopandas as gpd
import google.auth
import numpy as np
import pandas as pd

credentials, project = google.auth.default()

import gcsfs

fs = gcsfs.GCSFileSystem()
import yaml

In [2]:
from calitp_data_analysis import geography_utils, utils
from segment_speed_utils import gtfs_schedule_wrangling, helpers, time_series_utils
from shared_utils import (
    catalog_utils,
    dask_utils,
    gtfs_utils_v2,
    portfolio_utils,
    publish_utils,
    rt_dates,
    rt_utils,
    shared_data,  # used by the re-buffer fallback in routes_shn_intersection
)
from update_vars import GTFS_DATA_DICT, RT_SCHED_GCS, SCHED_GCS, SEGMENT_GCS

In [3]:
TRAFFIC_OPS_GCS = f"{GTFS_DATA_DICT.gcs_paths.GCS}traffic_ops/"

In [4]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [5]:
bart_org_name = "San Francisco Bay Area Rapid Transit District"
bart_gtfs_dataset_name = "Bay Area 511 BART Schedule"

In [6]:
analysis_date = rt_dates.DATES["jul2025"]

* Insert route typology before the organization name is merged in.
* The time to start adding things is while I'm still working with the feed key / GTFS schedule.
* Move my merge to just `name` and `route_id`; that will be a lot cleaner.

In [7]:
ca_transit_routes = gpd.read_parquet(
    f"gs://calitp-analytics-data/data-analyses/traffic_ops/ca_transit_routes.parquet",
    filters=[[("agency", "==", bart_org_name)]],
    storage_options={"token": credentials.token},
)

* Make sure the year is something I consider.
* Can concat all of the years to see which ones can merge on.
* Because we patch previous dates in, we might be missing stuff if we only grab 2025. 
* After dataframe has been patched with `patch_previous_dates`, see what's left.
* `standardize_operator_info_for_exports` needs to be moved out and added on after I add in the SHN and route typologies. 
* All organization name renaming stuff needs to move to the end. 
* Read only what I need.

In [8]:
year = "2025"
route_typologies = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/gtfs_schedule/"
    f"nacto_typologies/route_typologies_{year}.parquet",
    filters=[[("name", "==", bart_gtfs_dataset_name)]],
)

In [9]:
ca_transit_routes.head(1).drop(columns=["geometry"])

Unnamed: 0,org_id,agency,route_id,route_type,route_name,route_length_feet,shape_id,n_trips,base64_url,shn_route,on_shs,shn_districts,pct_route_on_hwy_across_districts


In [10]:
route_typologies.head(1)

Unnamed: 0,schedule_gtfs_dataset_key,name,route_id,route_type,route_long_name,route_short_name,combined_name,is_express,is_rapid,is_rail,is_ferry,is_local,common_shape_id,is_coverage,is_downtown_local
0,8a1405af8da1379acc062e346187ac98,Bay Area 511 BART Schedule,Beige-N,1,Oakland Airport to Coliseum,Beige-N,Beige-N__Oakland Airport to Coliseum,0,0,1,0,0,,0,0


* Always use `name` and `route_id`.
* `route_typologies` is not always filled in for everything we have.
* There are a number of reasons: Tiffany samples the dates and only runs it quarterly.
* Merging something daily to quarterly doesn't always work the best.
* Don't use `organization_name`.
* Get rid of all the `shape_id` variations for intersecting.
    * We just have to publish all of the `shape_id` values because there are truly different variations.

Steps
1. Patch missing data
2. Merge `route_typologies` with `open_data` on the left
3. Check that I'm not missing too many.
4. Add in SHN stuff.
    * Can dedup on `name` and `route_id`, keeping the one `shape_id` with the longest route length in a separate dataframe as the "representative" shape.
    * Use that `shape_id` for doing the intersection between transit route x SHN to determine the column we are creating.
    * Merge this new dataframe back to the original dataframe. 
5. Add in `standardize_org_name` function
6. Add in `finalize_export_df` 
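
The "representative shape" idea in step 4 can be sketched with toy data. The frame below is invented, and the `on_shs` values stand in for the result of the SHN overlay, which is not run here.

```python
import pandas as pd

routes = pd.DataFrame(
    {
        "name": ["Feed A", "Feed A", "Feed A"],
        "route_id": ["50", "50", "51"],
        "shape_id": ["9011_shp", "9014_shp", "9020_shp"],
        "route_length_feet": [164112.17, 159511.76, 1000.0],
    }
)

# Sort so the longest shape comes first within each name-route_id,
# then keep that first row as the representative shape.
representative = (
    routes.sort_values(
        by=["name", "route_id", "route_length_feet"],
        ascending=[True, True, False],
    )
    .drop_duplicates(subset=["name", "route_id"])
    .reset_index(drop=True)
)

# Hypothetical stand-in for the SHN overlay, computed on representative shapes only.
representative = representative.assign(on_shs=["Y", "N"])

# Merge the route-level result back onto every shape of the route.
routes_with_shn = pd.merge(
    routes,
    representative[["name", "route_id", "on_shs"]],
    on=["name", "route_id"],
    how="left",
    validate="m:1",
)
```

The trade-off is that every shape of a route inherits the representative shape's SHN result, even when a shorter variation takes a different path.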

In [11]:
test_m1 = pd.merge(
    ca_transit_routes.assign(name=bart_gtfs_dataset_name),
    route_typologies,
    on=["name", "route_id"],
    how="left",
    indicator=True,
)
test_m1._merge.value_counts()

left_only     0
right_only    0
both          0
Name: _merge, dtype: int64

In [12]:
test_m1.loc[test_m1.route_id.isin(["Grey-N", "Grey-S"])].drop(
    columns=["geometry", "base64_url"]
)

Unnamed: 0,org_id,agency,route_type_x,route_name,route_length_feet,shape_id,n_trips,shn_route,on_shs,shn_districts,pct_route_on_hwy_across_districts,schedule_gtfs_dataset_key,name,route_id,route_type_y,route_long_name,route_short_name,combined_name,is_express,is_rapid,is_rail,is_ferry,is_local,common_shape_id,is_coverage,is_downtown_local,_merge


In [13]:
test_m1.loc[test_m1._merge == "left_only"].drop(columns=["geometry"])

Unnamed: 0,org_id,agency,route_type_x,route_name,route_length_feet,shape_id,n_trips,base64_url,shn_route,on_shs,shn_districts,pct_route_on_hwy_across_districts,schedule_gtfs_dataset_key,name,route_id,route_type_y,route_long_name,route_short_name,combined_name,is_express,is_rapid,is_rail,is_ferry,is_local,common_shape_id,is_coverage,is_downtown_local,_merge


In [14]:
test_m2 = pd.merge(
    ca_transit_routes.assign(name=bart_gtfs_dataset_name),
    route_typologies,
    on=["name", "route_id", "route_type"],
    how="left",
    indicator=True,
)
test_m2._merge.value_counts()

left_only     0
right_only    0
both          0
Name: _merge, dtype: int64

In [15]:
len(ca_transit_routes)

0

## Step 1: Concat all the years for `route_typologies`

In [16]:
def concatenate_route_typologies() -> pd.DataFrame:
    """
    Concatenate the years available for
    route typologies on the operator-route_id
    grain.
    """
    ROUTE_TYPOLOGIES_FILE = GTFS_DATA_DICT.schedule_tables.route_typologies

    route_typology_paths = [
        f"{SCHED_GCS}{ROUTE_TYPOLOGIES_FILE}" for year in rt_dates.years_available
    ]
    route_typology_df = dask_utils.get_ddf(
        route_typology_paths,
        rt_dates.years_available,
        data_type="df",
        get_pandas=True,
        columns=[
            "name",
            "route_id",
            "is_express",
            "is_ferry",
            "is_rail",
            "is_coverage",
            "is_local",
            "is_downtown_local",
            "is_rapid",
        ],
        add_date=False,
        add_year=True,
    )

    # Drop duplicates of operator-route_id to keep only the
    # row with the most current year.
    route_typology_df2 = route_typology_df.sort_values(
        by=["name", "route_id", "year"], ascending=[True, True, False]
    ).drop_duplicates(
        subset=[
            "name",
            "route_id",
        ]
    )
    return route_typology_df2

In [17]:
ROUTE_TYPOLOGIES_FILE = GTFS_DATA_DICT.schedule_tables.route_typologies

In [18]:
route_typology_df = concatenate_route_typologies()

In [19]:
route_typology_df.loc[route_typology_df.name == "Yolobus Schedule"].head(10)

Unnamed: 0,name,route_id,is_express,is_ferry,is_rail,is_coverage,is_local,is_downtown_local,is_rapid,year
917,Yolobus Schedule,021ced44-152f-40cc-b486-f04e0a43e5ff,0,0,0,1,0,1,0,2023
918,Yolobus Schedule,04818ef1-576a-492b-9a18-e9fa6577ba3e,0,0,0,1,0,0,1,2023
828,Yolobus Schedule,07959480-2a40-4a51-92ac-8ca2029d5f4f,0,0,0,1,0,0,1,2024
829,Yolobus Schedule,08a6c620-4d4c-4be5-8dd9-9172952a13d8,0,0,0,0,0,1,1,2024
2871,Yolobus Schedule,097918a3-9cb6-43c0-abf8-f66bc22a1489,0,0,0,1,0,1,0,2023
2872,Yolobus Schedule,12a618d6-d441-4f89-be83-aa402b867ed8,0,0,0,1,0,1,0,2023
701,Yolobus Schedule,138,0,0,0,0,1,0,0,2025
2873,Yolobus Schedule,178155d4-718e-49ca-9e50-a32a61e64a71,0,0,0,1,0,1,0,2023
831,Yolobus Schedule,1a3ef0a8-fa0f-4e27-bb45-934fc45b3181,0,0,0,1,0,1,0,2024
832,Yolobus Schedule,1ae28e33-41d5-41b3-a481-9f51490ec40e,0,0,0,1,0,0,1,2024


In [20]:
# route_typology_paths = [
#    f"{SCHED_GCS}{ROUTE_TYPOLOGIES_FILE}" for year in rt_dates.years_available
# ]

In [21]:
# route_typology_df.head(2)

In [22]:
# route_typology_df.sort_values(
#    by=["name", "route_id", "year"], ascending=[True, True, False]
# ).head()

In [23]:
# route_typology_df2 = route_typology_df.sort_values(
#    by=["name", "route_id", "year"], ascending=[True, True, False]
# ).drop_duplicates(
#    subset=[
#        "name",
#        "route_id",
#    ]
# )

In [24]:
# route_typology_df2.year.value_counts()

## Step 2: Patch missing data (remove the `standardize_org_name` function)

In [25]:
def remove_erroneous_shapes(
    shapes_with_route_info: gpd.GeoDataFrame,
) -> gpd.GeoDataFrame:
    """
    Check if line is simple for Amtrak. If it is, keep.
    If it's not simple (line crosses itself), drop.

    In Jun 2023, some Amtrak shapes appeared to be funky,
    but in prior months, it's been ok.
    Checking for length is fairly time-consuming.
    """
    amtrak = "Amtrak Schedule"

    possible_error = shapes_with_route_info[shapes_with_route_info.name == amtrak]
    ok = shapes_with_route_info[shapes_with_route_info.name != amtrak]

    # Check if the line crosses itself
    ok_amtrak = (
        possible_error.assign(simple=possible_error.geometry.is_simple)
        .query("simple == True")
        .drop(columns="simple")
    )

    ok_shapes = pd.concat([ok, ok_amtrak], axis=0).reset_index(drop=True)

    return ok_shapes

In [26]:
def create_routes_file_for_export(date: str) -> gpd.GeoDataFrame:
    """
    Create a shapes (with associated route info) file for export.
    This allows users to plot the various shapes,
    transit path options, and select between variations for
    a given route.
    """
    # Read in local parquets
    trips = helpers.import_scheduled_trips(
        date,
        columns=[
            "name",
            "gtfs_dataset_key",
            "route_id",
            "route_type",
            "shape_id",
            "shape_array_key",
            "route_long_name",
            "route_short_name",
            "route_desc",
        ],
        get_pandas=True,
    ).dropna(subset="shape_array_key")

    shapes = helpers.import_scheduled_shapes(
        date,
        columns=["shape_array_key", "n_trips", "geometry"],
        get_pandas=True,
        crs=geography_utils.WGS84,
    ).dropna(subset="shape_array_key")

    df = (
        pd.merge(shapes, trips, on="shape_array_key", how="inner")
        .drop_duplicates(subset="shape_array_key")
        .drop(columns="shape_array_key")
    )

    drop_cols = ["route_short_name", "route_long_name", "route_desc"]
    route_shape_cols = ["schedule_gtfs_dataset_key", "route_id", "shape_id"]

    routes_assembled = (
        portfolio_utils.add_route_name(df)
        .drop(columns=drop_cols)
        .sort_values(route_shape_cols)
        .drop_duplicates(subset=route_shape_cols)
        .reset_index(drop=True)
    )

    routes_assembled = routes_assembled.pipe(remove_erroneous_shapes)

    routes_assembled = routes_assembled.assign(
        route_length_feet=routes_assembled.geometry.to_crs(
            geography_utils.CA_NAD83Albers_ft
        ).length
    )
    return routes_assembled

In [27]:
routes = create_routes_file_for_export(analysis_date)

In [28]:
def patch_previous_dates(
    current_routes: gpd.GeoDataFrame,
    current_date: str,
    published_operators_yaml: str = "../gtfs_funnel/published_operators.yml",
) -> gpd.GeoDataFrame:
    """
    Compare to the yaml for what operators we want, and
    patch in previous dates for the 10 or so operators
    that do not have data for this current date.
    """
    # Read in the published operators file
    with open(published_operators_yaml) as f:
        published_operators_dict = yaml.safe_load(f)

    # Convert the published operators file into a dict mapping dates to an iterable of operators
    patch_operators_dict = {
        str(date): operator_list
        for date, operator_list in published_operators_dict.items()
        if str(date)
        != current_date  # Exclude the current (analysis) date, since that does not need to be patched
    }

    partial_dfs = []

    # For each date and corresponding iterable of operators, get the data from the last time they appeared
    for one_date, operator_list in patch_operators_dict.items():
        df_to_add = publish_utils.subset_table_from_previous_date(
            gcs_bucket=TRAFFIC_OPS_GCS,
            filename="ca_transit_routes",
            operator_and_dates_dict=patch_operators_dict,
            date=one_date,
            crosswalk_col="schedule_gtfs_dataset_key",
            data_type="gdf",
        )

        partial_dfs.append(df_to_add)

    patch_routes = pd.concat(partial_dfs, axis=0, ignore_index=True)

    # Concat the current data to the "backfill" data
    published_routes = pd.concat(
        [current_routes, patch_routes], axis=0, ignore_index=True
    )

    # Drop Duplicates
    published_routes = published_routes.drop_duplicates().reset_index(drop=True)

    return published_routes
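
The dictionary handling inside `patch_previous_dates` can be sketched in isolation. The dates and operator names below are invented. The keys are `datetime.date` objects because PyYAML's `safe_load` parses unquoted ISO dates that way, which is why the comparison goes through `str(date)`.

```python
import datetime

# Toy version of the parsed published_operators.yml: date -> operators.
published_operators_dict = {
    datetime.date(2025, 7, 16): ["Feed A", "Feed B"],
    datetime.date(2025, 6, 11): ["Feed C"],
    datetime.date(2025, 5, 14): ["Feed D", "Feed E"],
}
current_date = "2025-07-16"

# Keep every date except the current one, with string keys.
patch_operators_dict = {
    str(date): operator_list
    for date, operator_list in published_operators_dict.items()
    if str(date) != current_date
}

# Flatten to see which operators would be backfilled from an earlier date.
operators_to_patch = [
    operator
    for operator_list in patch_operators_dict.values()
    for operator in operator_list
]
```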

In [29]:
routes.sample().drop(columns=["geometry"])

Unnamed: 0,n_trips,name,schedule_gtfs_dataset_key,route_id,route_type,shape_id,route_name_used,route_length_feet
4628,1,Flixbus Schedule,a37760dde6b9fdcb76b82e57afab7274,US1000,3,241cbec0e51d5674bce61cbad646efb9,Greyhound US1000,7786211.37


In [30]:
len(routes)

6600

In [31]:
published_routes = patch_previous_dates(
    routes,
    analysis_date,
)

In [32]:
len(published_routes)

7326

In [33]:
published_routes.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 7326 entries, 0 to 7325
Data columns (total 13 columns):
 #   Column                         Non-Null Count  Dtype   
---  ------                         --------------  -----   
 0   n_trips                        7326 non-null   int64   
 1   geometry                       7326 non-null   geometry
 2   name                           7326 non-null   object  
 3   schedule_gtfs_dataset_key      7326 non-null   object  
 4   route_id                       7326 non-null   object  
 5   route_type                     7326 non-null   object  
 6   shape_id                       7326 non-null   object  
 7   route_name_used                7326 non-null   object  
 8   route_length_feet              6600 non-null   float64 
 9   base64_url                     726 non-null    object  
 10  organization_source_record_id  726 non-null    object  
 11  organization_name              726 non-null    object  
 12  caltrans_district         

## Step 3: Merge `route_typologies` with `open_data`

In [34]:
m1 = pd.merge(
    published_routes,
    route_typology_df,
    on=["name", "route_id"],
    how="left",
)

In [35]:
len(m1)

7326

In [36]:
len(published_routes)

7326

In [37]:
len(published_routes.drop_duplicates())

7326

In [38]:
len(m1.drop_duplicates())

7326

## Step 4: Add in SHN 
* <s>Keep only the `shape_id` with the longest length</s> joining on all the rows works pretty quickly.

In [39]:
def routes_shn_intersection(
    routes_gdf: gpd.GeoDataFrame, buffer_amount: int
) -> gpd.GeoDataFrame:
    """
    Overlay the most recent transit routes with a buffered version
    of the SHN
    """
    GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/shared_data/"

    # Read in buffered shn here or re buffer if we don't have it available.
    HWY_FILE = f"{GCS_FILE_PATH}shn_buffered_{buffer_amount}_ft_shn_dissolved_by_ct_district_route.parquet"

    if fs.exists(HWY_FILE):
        shn_routes_gdf = gpd.read_parquet(
            HWY_FILE, storage_options={"token": credentials.token}
        )
    else:
        shn_routes_gdf = shared_data.buffer_shn(buffer_amount)

    # Process the most recent transit route geographies and ensure the
    # CRS matches the SHN routes' GDF so the overlay doesn't go wonky.
    routes_gdf = routes_gdf.to_crs(shn_routes_gdf.crs)

    # Overlay transit routes with the SHN geographies.
    gdf = gpd.overlay(
        routes_gdf, shn_routes_gdf, how="intersection", keep_geom_type=True
    )

    # Calculate the share of the transit route that runs on a highway, round it
    # to 3 decimal places, and multiply by 100 to express it as a percentage.
    # Drop the geometry afterward because we want the original transit route shapes.
    gdf = gdf.assign(
        pct_route_on_hwy=(gdf.geometry.length / gdf.route_length_feet).round(3) * 100,
    )
    # Subset
    gdf2 = gdf[
        [
            "name",
            "pct_route_on_hwy",
            "route_id",
            "shape_id",  # maybe comment out later
            "district",
            "shn_route",
        ]
    ]

    # Clean up
    gdf2.district = gdf2.district.fillna(0).astype(int)

    gdf2 = gdf2.rename(
        columns={
            "pct_route_on_hwy": "pct_route_on_hwy_across_districts",
            "district": "shn_districts",
        }
    )
    return gdf2

In [40]:
# shn_routes = routes_shn_intersection(m1, 50)

In [41]:
# shn_routes.sample(5)

In [42]:
# shn_routes.pct_route_on_hwy_across_districts.describe()

In [43]:
def group_route_district(df: pd.DataFrame, pct_route_on_hwy_agg: str) -> pd.DataFrame:
    """
    Aggregate so each route-shape has a single row: combine all of its
    districts and SHN routes into comma-separated strings, and aggregate
    (sum, max, etc.) the % of the transit route that intersects with the SHN.

    df: the dataframe you want to aggregate
    pct_route_on_hwy_agg: whether you want to find the max, min, sum, etc on the column
    "pct_route_on_hwy_across_districts"
    """

    agg1 = (
        df.groupby(
            ["name", "route_id", "shape_id"],  # maybe comment out later
            as_index=False,
        )[["shn_route", "shn_districts", "pct_route_on_hwy_across_districts"]]
        .agg(
            {
                "shn_route": lambda x: ", ".join(set(x.astype(str))),
                "shn_districts": lambda x: ", ".join(set(x.astype(str))),
                "pct_route_on_hwy_across_districts": pct_route_on_hwy_agg,
            }
        )
        .reset_index(drop=True)
    )

    # Clean up
    agg1.pct_route_on_hwy_across_districts = (
        agg1.pct_route_on_hwy_across_districts.astype(float).round(2)
    )

    return agg1
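
A toy run of the same aggregation pattern, with invented values. One deliberate difference from the function above: the sketch sorts the distinct values before joining, since the iteration order of a `set` is not guaranteed and the joined string could otherwise differ between runs.

```python
import pandas as pd

intersections = pd.DataFrame(
    {
        "name": ["Feed A"] * 3,
        "route_id": ["1"] * 3,
        "shape_id": ["s1"] * 3,
        "shn_route": [5, 5, 101],
        "shn_districts": [7, 12, 7],
        "pct_route_on_hwy_across_districts": [10.0, 5.5, 20.0],
    }
)


def join_unique(values: pd.Series) -> str:
    """Join the distinct values of a column, as strings, in sorted order."""
    return ", ".join(sorted(set(values.astype(str))))


grouped = intersections.groupby(
    ["name", "route_id", "shape_id"], as_index=False
).agg(
    shn_route=("shn_route", join_unique),
    shn_districts=("shn_districts", join_unique),
    pct_route_on_hwy_across_districts=(
        "pct_route_on_hwy_across_districts",
        "sum",
    ),
)
```

Note that the sort is on strings, so `"101"` comes before `"5"`.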

In [44]:
# grouped = group_route_district(shn_routes, "sum")

In [45]:
# grouped.sample(10)

In [46]:
# grouped.pct_route_on_hwy_across_districts.describe()

In [47]:
# m1.columns

In [48]:
# grouped.columns

In [49]:
# len(grouped)

In [50]:
# Merge back with the original dataframe
# shn_typology = pd.merge(m1, grouped, on=["route_id", "name", "shape_id"], how="left")

In [51]:
# len(shn_typology), len(m1)

In [52]:
def add_shn_information(gdf: gpd.GeoDataFrame, buffer_amt: int) -> gpd.GeoDataFrame:
    """
    Prepare the gdf to join with the existing transit_routes
    dataframe that is published on the Open Data Portal
    """
    # Retain only the longest shape for each name-route_id combo
    # so finding the intersection with SHN won't take as long
    """
    gdf = gdf.sort_values(
        by=["name", "route_id", "route_length_feet"], ascending=[True, True, False]
    )[["name", "route_id", "route_length_feet", "geometry"]].drop_duplicates(
        subset=["name", "route_id"]
    )
    """
    # Overlay
    intersecting = routes_shn_intersection(gdf, buffer_amt)

    # Group the dataframe so that one route only has one
    # row instead of multiple rows after finding its
    # intersection with any SHN routes.
    # print(intersecting.columns)
    agg1 = group_route_district(intersecting, "sum")

    # Merge the dataframe with all the SHS info with the original
    # gdf so we can get the original transit route geometries &
    # any routes that don't intersect with the state highway routes.
    m1 = pd.merge(gdf, agg1, on=["route_id", "name", "shape_id"], how="left")

    # Add yes/no column to signify if a transit route intersects
    # with a SHN route
    m1.pct_route_on_hwy_across_districts = m1.pct_route_on_hwy_across_districts.fillna(
        0
    )
    m1["on_shs"] = np.where(m1["pct_route_on_hwy_across_districts"] == 0, "N", "Y")

    # Clean up rows that are tagged as "on_shs==N" but still have values
    # that appear.
    m1.loc[
        (m1["on_shs"] == "N") & (m1["shn_districts"] != "0"),
        ["shn_districts", "shn_route"],
    ] = np.nan

    return m1
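
The tagging and clean-up at the end of `add_shn_information` can be traced on a toy frame (values invented). Route 2 has no intersection at all, and route 3 stands for a route whose overlap rounds to 0 but that still carries SHN values from the overlay.

```python
import numpy as np
import pandas as pd

m1 = pd.DataFrame(
    {
        "route_id": ["1", "2", "3"],
        "shn_route": ["5", np.nan, "101"],
        "shn_districts": ["7", np.nan, "4"],
        "pct_route_on_hwy_across_districts": [12.5, np.nan, 0.0],
    }
)

# Routes with no intersection get 0% and are tagged "N".
m1["pct_route_on_hwy_across_districts"] = m1[
    "pct_route_on_hwy_across_districts"
].fillna(0)
m1["on_shs"] = np.where(m1["pct_route_on_hwy_across_districts"] == 0, "N", "Y")

# Blank out leftover SHN values on rows tagged "N".
m1.loc[
    (m1["on_shs"] == "N") & (m1["shn_districts"] != "0"),
    ["shn_districts", "shn_route"],
] = np.nan
```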

In [53]:
shn_typology = add_shn_information(m1, 50)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gdf2.district = gdf2.district.fillna(0).astype(int)
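
The warning above comes from assigning to a column on `gdf2`, which was created by subsetting `gdf`. One common way to avoid it, sketched here on a toy frame, is to take an explicit copy of the subset before modifying it.

```python
import pandas as pd

gdf = pd.DataFrame(
    {
        "name": ["Feed A", "Feed B"],
        "district": [7.0, None],
        "extra": ["x", "y"],
    }
)

# Take an explicit copy of the subset so later assignments are unambiguous.
gdf2 = gdf[["name", "district"]].copy()
gdf2["district"] = gdf2["district"].fillna(0).astype(int)
```

The original frame is left untouched, and the assignment no longer targets a possible view of it.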


In [54]:
type(shn_typology)

geopandas.geodataframe.GeoDataFrame

In [55]:
len(shn_typology)

7326

In [56]:
len(m1)

7326

In [57]:
shn_typology.on_shs.value_counts()

Y    5100
N    2226
Name: on_shs, dtype: int64

In [58]:
shn_typology.loc[
    (shn_typology.name == "Antelope Valley Transit Authority Schedule")
    & (shn_typology.route_id == "50")
].drop(columns=["geometry", "schedule_gtfs_dataset_key"]).T

Unnamed: 0,3160,3161
n_trips,8,8
name,Antelope Valley Transit Authority Schedule,Antelope Valley Transit Authority Schedule
route_id,50,50
route_type,3,3
shape_id,9011_shp,9014_shp
route_name_used,50,50
route_length_feet,164112.17,159511.76
base64_url,,
organization_source_record_id,,
organization_name,,


In [59]:
shn_typology.drop(columns=["geometry", "schedule_gtfs_dataset_key"]).sample(3).T

Unnamed: 0,6590,5507,4222
n_trips,1,4,1
name,Amtrak Schedule,Bay Area 511 ACE Schedule,Flixbus Schedule
route_id,88,ACETrain,2704
route_type,2,2,3
shape_id,90,pv35,007a919289586aaccc791d019c7f8549
route_name_used,Northeast Regional,ACETrain,FlixBus 2704
route_length_feet,3357917.11,453280.30,467406.30
base64_url,,,
organization_source_record_id,,,
organization_name,,,


In [60]:
shn_typology.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 7326 entries, 0 to 7325
Data columns (total 25 columns):
 #   Column                             Non-Null Count  Dtype   
---  ------                             --------------  -----   
 0   n_trips                            7326 non-null   int64   
 1   geometry                           7326 non-null   geometry
 2   name                               7326 non-null   object  
 3   schedule_gtfs_dataset_key          7326 non-null   object  
 4   route_id                           7326 non-null   object  
 5   route_type                         7326 non-null   object  
 6   shape_id                           7326 non-null   object  
 7   route_name_used                    7326 non-null   object  
 8   route_length_feet                  6600 non-null   float64 
 9   base64_url                         726 non-null    object  
 10  organization_source_record_id      726 non-null    object  
 11  organization_name                  

## Step 5: Add back `standardize_org_name`

In [61]:
def standardize_operator_info_for_exports(df: pd.DataFrame, date: str) -> pd.DataFrame:
    """
    Use our crosswalk file created in gtfs_funnel
    and add in the organization columns we want to
    publish on.
    """

    CROSSWALK_FILE = GTFS_DATA_DICT.schedule_tables.gtfs_key_crosswalk

    public_feeds = gtfs_utils_v2.filter_to_public_schedule_gtfs_dataset_keys()

    # Get the crosswalk file
    crosswalk = pd.read_parquet(
        f"{SCHED_GCS}{CROSSWALK_FILE}_{date}.parquet",
        columns=[
            "schedule_gtfs_dataset_key",
            "name",
            "base64_url",
            "organization_source_record_id",
            "organization_name",
            "caltrans_district",
        ],
        filters=[[("schedule_gtfs_dataset_key", "in", public_feeds)]],
    )

    # Checked whether we need a left merge to keep stops outside of CA
    # that may not have caltrans_district
    # and inner merge is fine. All operators are assigned a caltrans_district
    # so Amtrak / FlixBus stops have values populated

    # Merge the crosswalk and the input DF
    crosswalk_input_merged = pd.merge(
        df,
        crosswalk,
        on=["schedule_gtfs_dataset_key"],
        suffixes=[
            "_original",
            None,
        ],  # Keep the source record id from the crosswalk as the "definitive" version
        how="inner",
    )

    # Drop dups
    crosswalk_input_merged = crosswalk_input_merged.drop_duplicates()
    return crosswalk_input_merged

In [62]:
CROSSWALK_FILE = GTFS_DATA_DICT.schedule_tables.gtfs_key_crosswalk

public_feeds = gtfs_utils_v2.filter_to_public_schedule_gtfs_dataset_keys()

In [63]:
len(public_feeds)

2254

In [64]:
# Get the crosswalk file
crosswalk = pd.read_parquet(
    f"{SCHED_GCS}{CROSSWALK_FILE}_{analysis_date}.parquet",
    columns=[
        "schedule_gtfs_dataset_key",
        "name",
        "base64_url",
        "organization_source_record_id",
        "organization_name",
        "caltrans_district",
    ],
    filters=[[("schedule_gtfs_dataset_key", "in", public_feeds)]],
)

In [65]:
crosswalk.shape

(199, 6)

In [66]:
crosswalk.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,name,base64_url,organization_source_record_id,organization_name,caltrans_district
0,ff1bc5dde661d62c877165421e9ca257,Santa Ynez Mecatran Schedule,aHR0cDovL2FwcC5tZWNhdHJhbi5jb20vdXJiL3dzL2ZlZWQvYzJsMFpUMXplWFowTzJOc2FXVnVkRDF6Wld4bU8yVjRjR2x5WlQwN2RIbHdaVDFuZEdaek8ydGxlVDAwTWpjd056UTBaVFk0TlRBek9UTXlNREl4TURkak56STBNRFJrTXpZeU5UTTRNekkwWXpJMA==,reckp33bhAuZlmO1M,City of Solvang,05 - San Luis Obispo / Santa Barbara
1,d3ec92741001094ed14a27847c72e9d0,GET Schedule,aHR0cDovL2V0YS5nZXRidXMub3JnL3J0dC9wdWJsaWMvdXRpbGl0eS9ndGZzLmFzcHg=,recIh3vq8jwuuJlvL,Golden Empire Transit District,06 - Fresno / Bakersfield


In [67]:
shn_typology2 = standardize_operator_info_for_exports(shn_typology, analysis_date)

## Step 6: Add in `finalize_export_df`

In [68]:
STANDARDIZED_COLUMNS_DICT = {
    "caltrans_district": "district_name",
    "organization_source_record_id": "org_id",
    "organization_name": "agency",
    "agency_name_primary": "agency_primary",
    "agency_name_secondary": "agency_secondary",
    "route_name_used": "route_name",
    "route_types_served": "routetypes",
    "meters_to_shn": "meters_to_ca_state_highway",
}

In [69]:
def finalize_export_df(df: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """
    Suppress certain columns used in our internal modeling for export.
    """
    # Change column order
    route_cols = [
        "organization_source_record_id",
        "organization_name",
        "route_id",
        "route_type",
        "route_name_used",
        "route_length_feet",
    ]
    shape_cols = ["shape_id", "n_trips"]
    agency_ids = ["base64_url"]
    shn_cols = [
        "shn_route",
        "on_shs",
        "shn_districts",
        "pct_route_on_hwy_across_districts",
    ]
    col_order = route_cols + shape_cols + agency_ids + shn_cols + ["geometry"]

    df2 = (
        df[col_order]
        .reindex(columns=col_order)
        .rename(columns=STANDARDIZED_COLUMNS_DICT)
        .reset_index(drop=True)
    )

    return df2

In [70]:
shn_typology3 = shn_typology2.pipe(finalize_export_df)

## Check 1: Why are there many more rows after piping?

In [71]:
shn_typology3.shape

(9261, 14)

In [72]:
shn_typology2.shape

(9261, 30)

In [73]:
len(shn_typology2.drop_duplicates())

9261

In [74]:
shn_typology.shape

(7326, 25)
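
The row count grows in the inner merge on `schedule_gtfs_dataset_key`, not in `finalize_export_df`. One possible explanation, not confirmed here, is that the crosswalk has more than one row for some keys. A sketch with toy frames shows how to count rows per key on each side, predict the merged row count, and list the repeated keys.

```python
import pandas as pd

routes = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["k1", "k1", "k2", "k3"],
        "route_id": ["1", "2", "1", "9"],
    }
)
crosswalk = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["k1", "k1", "k2"],
        "organization_name": ["Org A", "Org B", "Org C"],
    }
)
key = "schedule_gtfs_dataset_key"

# Rows per key on each side.
left_counts = routes[key].value_counts()
right_counts = crosswalk[key].value_counts()

# An inner merge yields (left rows x right rows) for every shared key;
# keys missing from either side become NaN here and are dropped.
expected_rows = int((left_counts * right_counts).dropna().sum())

merged = pd.merge(routes, crosswalk, on=key, how="inner")

# Keys in the crosswalk that map to more than one row.
repeated_keys = right_counts[right_counts > 1].index.tolist()
```

In the toy data, `k1` doubles its routes because it maps to two organizations, while `k3` disappears because it is not in the crosswalk at all, so both effects can hide behind a single row count.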

### Check original dataframes and see if this is the case.

In [75]:
published_transit_routes = gpd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/traffic_ops/ca_transit_routes.parquet",
    storage_options={"token": credentials.token},
)

In [76]:
july_16_routes = gpd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/traffic_ops/ca_transit_routes_2025-07-16.parquet",
    storage_options={"token": credentials.token},
)

In [77]:
len(published_transit_routes) - len(july_16_routes)

726

In [78]:
len(published_transit_routes)

7326

In [79]:
len(july_16_routes)

6600

In [80]:
july_16_routes.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 6600 entries, 0 to 6599
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   n_trips                    6600 non-null   int64   
 1   geometry                   6600 non-null   geometry
 2   name                       6600 non-null   object  
 3   schedule_gtfs_dataset_key  6600 non-null   object  
 4   route_id                   6600 non-null   object  
 5   route_type                 6600 non-null   object  
 6   shape_id                   6600 non-null   object  
 7   route_name_used            6600 non-null   object  
 8   route_length_feet          6600 non-null   float64 
dtypes: float64(1), geometry(1), int64(1), object(6)
memory usage: 464.2+ KB


In [81]:
shn_typology3.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 9261 entries, 0 to 9260
Data columns (total 14 columns):
 #   Column                             Non-Null Count  Dtype   
---  ------                             --------------  -----   
 0   org_id                             9261 non-null   object  
 1   agency                             9261 non-null   object  
 2   route_id                           9261 non-null   object  
 3   route_type                         9261 non-null   object  
 4   route_name                         9261 non-null   object  
 5   route_length_feet                  9261 non-null   float64 
 6   shape_id                           9261 non-null   object  
 7   n_trips                            9261 non-null   int64   
 8   base64_url                         9261 non-null   object  
 9   shn_route                          6718 non-null   object  
 10  on_shs                             9261 non-null   object  
 11  shn_districts                      

## Check 2: Make sure `create_routes_data` works with my newly added lines

In [82]:
f"{GTFS_DATA_DICT.gcs_paths.GCS}ah_testing/"

'gs://calitp-analytics-data/data-analyses/ah_testing/'

In [83]:
url1 = "gs://calitp-analytics-data/data-analyses/traffic_ops/ca_transit_routes.parquet"