# Research Request - GTFS Digest: Add Rail and Ferry Operators. #1386

Tiffany's comment:
If it's just a couple of rail operators (Amtrak, Metrolink) and a handful of ferry operators, it's worth digging into why they dropped off. Start by looking for their rows in the 4 schedule tables (trips, shapes, stops, stop_times), then look for them in a vp table.

* I think the ferry operators and Metrolink are already associated with a district. Even Amtrak might be? But if Amtrak isn't, you can create a separate "district = Amtrak" in the merged df so it always has a tab for itself. Amtrak plots for the entire country!
* District 4: San Francisco Bay Area Rapid Transit (BART), City and County of San Francisco (Muni)
* District 7: Los Angeles County Metropolitan Transportation Authority (LA Metro)
* District 11: San Diego Metropolitan Transit System
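The check Tiffany describes (look for an operator's rows in each schedule table, then in a vp table) could be sketched roughly as below. The toy DataFrames, the key values, and the `presence_by_table` helper are all hypothetical stand-ins, and the sketch assumes every table carries a `schedule_gtfs_dataset_key` column.

```python
import pandas as pd

# Toy stand-ins for the real tables (assumption: each one has a
# schedule_gtfs_dataset_key column). Key values are made up.
KEY_COL = "schedule_gtfs_dataset_key"
trips = pd.DataFrame({KEY_COL: ["rail_key", "ferry_key", "bus_key"]})
shapes = pd.DataFrame({KEY_COL: ["rail_key", "bus_key"]})
stops = pd.DataFrame({KEY_COL: ["rail_key", "ferry_key", "bus_key"]})
stop_times = pd.DataFrame({KEY_COL: ["rail_key", "bus_key"]})
vp = pd.DataFrame({KEY_COL: ["bus_key"]})


def presence_by_table(tables: dict, keys: list, key_col: str = KEY_COL) -> pd.DataFrame:
    """Return a boolean grid: one row per key, one column per table."""
    grid = {}
    for table_name, df in tables.items():
        present = set(df[key_col].dropna())
        grid[table_name] = [k in present for k in keys]
    return pd.DataFrame(grid, index=keys)


tables = {
    "trips": trips,
    "shapes": shapes,
    "stops": stops,
    "stop_times": stop_times,
    "vp": vp,
}
presence = presence_by_table(tables, ["rail_key", "ferry_key"])
print(presence)
```

The first `False` reading left to right in a row hints at which table the operator drops out of.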

Amanda
* All the ferry operators are gone. 
* Amtrak is in District 3 but it has schedule_only data, which isn't true? 
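One way to give Amtrak its own tab, per Tiffany's suggestion, might be to overwrite its district in the merged df. This is only a sketch on a toy frame: the `"Amtrak"` district label is a hypothetical choice, and the second row is copied from the digest output purely for illustration.

```python
import pandas as pd

# Toy version of the merged df; only the two columns needed here.
merged = pd.DataFrame(
    {
        "organization_name": ["Amtrak", "Amador Regional Transit System"],
        "caltrans_district": ["03 - Marysville", "10 - Stockton"],
    }
)

# Hypothetical label for the stand-alone tab.
AMTRAK_DISTRICT = "Amtrak"

# mask() replaces values where the condition is True and leaves the rest alone.
is_amtrak = merged.organization_name == "Amtrak"
merged2 = merged.assign(
    caltrans_district=merged.caltrans_district.mask(is_amtrak, AMTRAK_DISTRICT)
)
```

Using `assign` keeps the original `merged` untouched, so the District 3 assignment is still available for comparison.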

Other operators (thanks Meta.AI)
* Strikethroughs = these operators are already in our `schd_vp_df2`

Rail Services
* <s>Amtrak California: Offers intercity rail services throughout the state</s>
* <s>BART (Bay Area Rapid Transit): Provides rail services in the San Francisco Bay Area</s>
* Caltrain: Offers commuter rail services in the San Francisco Bay Area
* LA Metro Rail: Provides rail services in Los Angeles County. **There's Los Angeles County Metropolitan Transportation Authority**
* Metrolink: Offers commuter rail services in Southern California. **Would this be Southern California Regional Rail Authority?**
* San Diego Trolley: Provides light rail services in San Diego. **Is this part of San Diego Metropolitan Transit System?**
* San Joaquin Regional Rail Commission (ACE): Offers commuter rail services in the San Joaquin Valley
* <s>SMART (Sonoma-Marin Area Rail Transit): Provides commuter rail services in Sonoma and Marin counties</s>
* <s>VTA (Santa Clara Valley Transportation Authority): Offers light rail services in Santa Clara County</s>

In [1]:
import _section1_utils as section1
import _section2_utils as section2
import geopandas as gpd
import merge_data
import merge_operator_data
import numpy as np
import pandas as pd
from segment_speed_utils import gtfs_schedule_wrangling, helpers
from segment_speed_utils.project_vars import COMPILED_CACHED_VIEWS, PROJECT_CRS
from shared_utils import catalog_utils, portfolio_utils, rt_dates
from update_vars import GTFS_DATA_DICT, RT_SCHED_GCS, SCHED_GCS, SEGMENT_GCS

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [3]:
analysis_date_list = [rt_dates.DATES["feb2025"]]

In [4]:
type(analysis_date_list)

list

In [5]:
analysis_date = rt_dates.DATES["feb2025"]

## Look at `operators_prep`
* Ferry operators aren't here.

In [6]:
schd_vp_url = f"{GTFS_DATA_DICT.digest_tables.dir}{GTFS_DATA_DICT.digest_tables.route_schedule_vp}.parquet"

In [7]:
schd_vp_df = pd.read_parquet(
    schd_vp_url,
    columns=[
        "schedule_gtfs_dataset_key",
        "caltrans_district",
        "organization_name",
        "name",
        "sched_rt_category",
        "service_date",
    ],
)

In [8]:
schd_vp_df2 = schd_vp_df.loc[
    (schd_vp_df.service_date == "2025-01-15")
    | (schd_vp_df.service_date == "2024-12-11")
]

In [9]:
schd_vp_df2[
    ["organization_name", "service_date", "sched_rt_category", "caltrans_district"]
].drop_duplicates(subset=["organization_name"]).sort_values(by=["organization_name"])

| | organization_name | service_date | sched_rt_category | caltrans_district |
|---|---|---|---|---|
| 235475 | Alameda-Contra Costa Transit District | 2024-12-11 | schedule_and_vp | 04 - Oakland |
| 79911 | Amador Regional Transit System | 2024-12-11 | schedule_only | 10 - Stockton |
| 102929 | Amtrak | 2024-12-11 | schedule_only | 03 - Marysville |
| 235184 | Anaheim Transportation Network | 2025-01-15 | schedule_only | 12 - Irvine |
| 163413 | Antelope Valley Transit Authority | 2024-12-11 | schedule_and_vp | 07 - Los Angeles |
| 196075 | Basin Transit | 2024-12-11 | schedule_and_vp | 08 - San Bernardino |
| 131713 | Butte County Association of Governments | 2024-12-11 | schedule_and_vp | 03 - Marysville |
| 129319 | Calaveras Transit Agency | 2024-12-11 | schedule_only | 10 - Stockton |
| 301695 | Capitol Corridor Joint Powers Authority | 2024-12-11 | schedule_only | 04 - Oakland |
| 120619 | Central Contra Costa Transit Authority | 2024-12-11 | schedule_and_vp | 04 - Oakland |
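Because `drop_duplicates(subset=["organization_name"])` keeps only the first row per operator, the output above shows a single `sched_rt_category` per operator even if the category differs across dates or feeds. A sketch of listing every category per operator, on made-up data with hypothetical operator names:

```python
import pandas as pd

# Made-up rows; operator names and categories are illustrative only.
toy = pd.DataFrame(
    {
        "organization_name": ["Operator A", "Operator A", "Operator B", "Operator B"],
        "service_date": pd.to_datetime(
            ["2024-12-11", "2025-01-15", "2024-12-11", "2025-01-15"]
        ),
        "sched_rt_category": [
            "schedule_only",
            "schedule_only",
            "schedule_and_vp",
            "schedule_only",
        ],
    }
)

# Collect the distinct categories seen for each operator.
categories = (
    toy.groupby("organization_name")["sched_rt_category"]
    .agg(lambda s: sorted(s.unique()))
    .reset_index()
)
print(categories)
```

Running the same aggregation on `schd_vp_df2` might help confirm whether Amtrak is `schedule_only` on every date or only on the first row that happened to be kept.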


In [10]:
CLEAN_ROUTES = GTFS_DATA_DICT.schedule_tables.route_identification

In [11]:
route_names_df = pd.read_parquet(f"{SCHED_GCS}{CLEAN_ROUTES}.parquet")

In [12]:
route_names_df.columns

Index(['schedule_gtfs_dataset_key', 'name', 'route_id', 'route_long_name',
       'route_short_name', 'route_desc', 'service_date', 'combined_name',
       'route_id2', 'recent_combined_name', 'recent_route_id2'],
      dtype='object')

In [13]:
operators_to_keep = [
    "Amtrak",
    "Los Angeles County Metropolitan Transportation Authority",
    "San Diego Metropolitan Transit System",
    "Capitol Corridor Joint Powers Authority",
    "Southern California Regional Rail Authority",
    "San Joaquin Regional Rail Commission",
    "City and County of San Francisco"
]

In [14]:
schd_vp_df3 = pd.read_parquet(schd_vp_url)

In [15]:
schd_vp_df4 = schd_vp_df3.loc[schd_vp_df3.organization_name.isin(operators_to_keep)]

In [16]:
sched_keys_to_keep = list(
    schd_vp_df2.loc[
        schd_vp_df2.organization_name.isin(operators_to_keep)
    ].schedule_gtfs_dataset_key.unique()
)

In [17]:
schd_vp = schd_vp_df4[
    ["organization_name", "route_combined_name", "route_id"]
].drop_duplicates()

In [18]:
route_names_df2 = route_names_df.loc[
    route_names_df.schedule_gtfs_dataset_key.isin(sched_keys_to_keep)
]

In [19]:
route_names_df2.columns

Index(['schedule_gtfs_dataset_key', 'name', 'route_id', 'route_long_name',
       'route_short_name', 'route_desc', 'service_date', 'combined_name',
       'route_id2', 'recent_combined_name', 'recent_route_id2'],
      dtype='object')

In [20]:
unique_routes_df = route_names_df2[
    ["route_id", "route_long_name", "route_short_name", "route_desc"]
].drop_duplicates()

### Bring in `route_typologies`

In [24]:
EXPORT = GTFS_DATA_DICT.schedule_tables.route_typologies

In [25]:
route_typologies = pd.read_parquet(
    f"{SCHED_GCS}{EXPORT}_{analysis_date}.parquet"
)

In [36]:
route_typologies.columns

Index(['schedule_gtfs_dataset_key', 'name', 'route_type', 'route_id',
       'route_long_name', 'route_short_name', 'combined_name', 'is_express',
       'is_rapid', 'is_rail', 'is_local', 'direction_id', 'common_shape_id',
       'route_name', 'route_meters', 'is_coverage', 'is_downtown_local'],
      dtype='object')

In [32]:
route_typologies2 = route_typologies[
    [
        "route_type",
        "route_id",
        "combined_name",
        "is_express",
        "is_rapid",
        "is_rail",
        "is_local",
    ]
].drop_duplicates()

In [33]:
m1 = pd.merge(schd_vp, unique_routes_df, on=["route_id"], how="left")

In [34]:
m2 = pd.merge(m1, route_typologies2, on="route_id", how="left")

In [38]:
m2 = m2.drop(columns=["combined_name"])

In [40]:
m3 = m2.sort_values(by=["organization_name"]).drop_duplicates()
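The merges above join on `route_id` alone. Since `route_id` values are only unique within a feed, two operators that both have a route `"1"` would match each other's rows. The sketch below illustrates the difference on toy data; it assumes `schedule_gtfs_dataset_key` is kept on both sides, which the frames above do not currently do, and all names and values are hypothetical.

```python
import pandas as pd

# Two hypothetical operators that both use route_id "1".
left = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["key_a", "key_b"],
        "organization_name": ["Operator A", "Operator B"],
        "route_id": ["1", "1"],
    }
)
right = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["key_a", "key_b"],
        "route_id": ["1", "1"],
        "is_rail": [1, 0],
    }
)

# Joining on route_id alone pairs every "1" with every other "1".
loose = pd.merge(left, right[["route_id", "is_rail"]], on="route_id", how="left")

# Joining on feed key + route_id keeps each operator's own row;
# validate raises if the right side has duplicate keys.
strict = pd.merge(
    left,
    right,
    on=["schedule_gtfs_dataset_key", "route_id"],
    how="left",
    validate="many_to_one",
)

print(len(loose), len(strict))
```

`drop_duplicates()` removes exact repeats but would not remove cross-operator mismatches like the ones in `loose`, since those rows differ in `is_rail`.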

In [42]:
m3 = m3.fillna("not in route typologies")

In [43]:
m3.loc[m3.is_rail == "not in route typologies"]

| | organization_name | route_combined_name | route_id | route_long_name | route_short_name | route_desc | route_type | is_express | is_rapid | is_rail | is_local |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3748 | City and County of San Francisco | KT INGLESIDE-THIRD | KT | INGLESIDE-THIRD | KT | Weekdays 6am-12 midnight Weekends 8am-8pm | not in route typologies | not in route typologies | not in route typologies | not in route typologies | not in route typologies |
| 3401 | City and County of San Francisco | S SHUTTLE | S | SHUTTLE | S | Additional Weekday Service | not in route typologies | not in route typologies | not in route typologies | not in route typologies | not in route typologies |
| 3390 | City and County of San Francisco | MBUS OCEAN VIEW BUS | MBUS | OCEANVIEW BUS | MBUS | Weekend 6am-8pm | not in route typologies | not in route typologies | not in route typologies | not in route typologies | not in route typologies |
| 3389 | City and County of San Francisco | MBUS OCEAN VIEW BUS | MBUS | OCEAN VIEW BUS | MBUS | 1130 pm-1 am daily | not in route typologies | not in route typologies | not in route typologies | not in route typologies | not in route typologies |
| 3372 | City and County of San Francisco | KLM MUNI METRO SHUTTLE | KLM | MUNI METRO SHUTTLE | KLM | 9pm-12 midnight daily | not in route typologies | not in route typologies | not in route typologies | not in route typologies | not in route typologies |
| 3373 | City and County of San Francisco | KLM MUNI METRO SHUTTLE | KLM | MUNI METRO SHUTTLE | KLM | 9 pm-12 midnight daily | not in route typologies | not in route typologies | not in route typologies | not in route typologies | not in route typologies |
| 3380 | City and County of San Francisco | LBUS TARAVAL BUS | LBUS | TARAVAL BUS | LBUS | 5am-10 pm daily | not in route typologies | not in route typologies | not in route typologies | not in route typologies | not in route typologies |
| 3381 | City and County of San Francisco | LBUS TARAVAL BUS | LBUS | TARAVAL BUS | LBUS | 9 pm -12 am daily | not in route typologies | not in route typologies | not in route typologies | not in route typologies | not in route typologies |
| 1935 | Los Angeles County Metropolitan Transportation Authority | 487 Metro Express Line | 487 | not in route typologies | not in route typologies | not in route typologies | not in route typologies | not in route typologies | not in route typologies | not in route typologies | not in route typologies |
| 1929 | Los Angeles County Metropolitan Transportation Authority | Metro E Line (Expo) | 806 | Metro E Line (Expo) | | not in route typologies | not in route typologies | not in route typologies | not in route typologies | not in route typologies | not in route typologies |
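Note that a blanket `fillna("not in route typologies")` also overwrites missing values in columns that did not come from `route_typologies` (for example `route_long_name` and `route_desc` on the LA Metro rows). An alternative that might be easier to read is pandas' `indicator=True`, which flags unmatched rows without touching any other column. A minimal sketch on toy data with hypothetical route IDs:

```python
import pandas as pd

# Toy route list; R3 has a missing long name that should stay missing.
routes = pd.DataFrame(
    {
        "route_id": ["R1", "R2", "R3"],
        "route_long_name": ["Line One", "Line Two", None],
    }
)

# Toy typologies table that only knows about R3.
typologies = pd.DataFrame({"route_id": ["R3"], "is_rail": [0]})

# indicator=True adds a _merge column: "both" or "left_only" for a left join.
merged = pd.merge(routes, typologies, on="route_id", how="left", indicator=True)

# Rows with no match in the typologies table.
missing = merged.loc[merged["_merge"] == "left_only"]
print(missing[["route_id", "route_long_name"]])
```

With this approach the missing `route_long_name` on `R3` stays missing rather than being relabeled as a typology problem.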


## Scheduled Trips

In [None]:
scheduled_trips_df = pd.concat(
    [
        helpers.import_scheduled_trips(
            analysis_date,
            columns=[
                "gtfs_dataset_key",
                "name",
                "route_id",
                "route_long_name",
                "route_short_name",
                "route_desc",
            ],
            get_pandas=True,
        ).assign(service_date=pd.to_datetime(analysis_date))
        for analysis_date in analysis_date_list
    ],
    axis=0,
    ignore_index=True,
)

In [None]:
scheduled_trips_df.head(1)

### Find the ferry

In [None]:
scheduled_trips_df.loc[scheduled_trips_df.name.str.contains("Ferry")][
    ["name"]
].drop_duplicates()
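The `"Ferry"` match above is case-sensitive and would raise on a boolean mask containing missing values if any `name` is null. A slightly more defensive variant, sketched on made-up feed names, uses the `case` and `na` parameters of `Series.str.contains`:

```python
import pandas as pd

# Made-up feed names, including a lowercase match and a missing value.
names = pd.Series(
    [
        "Example Bay Ferry Schedule",
        "example ferry co",
        None,
        "Example Bus Schedule",
    ],
    name="name",
)

# case=False ignores capitalization; na=False treats missing names as non-matches.
is_ferry = names.str.contains("ferry", case=False, na=False)
ferry_only = names.loc[is_ferry]
print(ferry_only.tolist())
```

Ferry operators whose feed name does not include the word "ferry" at all would still be missed, so a name-based filter is only a first pass.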

In [None]:
scheduled_trips_df.columns

In [None]:
ferry_schd_keys = list(
    scheduled_trips_df.loc[
        scheduled_trips_df.name.str.contains("Ferry")
    ].schedule_gtfs_dataset_key.unique()
)

In [None]:
ferry_names = list(
    scheduled_trips_df.loc[scheduled_trips_df.name.str.contains("Ferry")].name.unique()
)

In [None]:
scheduled_trips_df2 = scheduled_trips_df.loc[
    scheduled_trips_df.schedule_gtfs_dataset_key.isin(ferry_schd_keys)
]

In [None]:
len(scheduled_trips_df2)

In [None]:
scheduled_trips_df2.head(2)

In [None]:
# scheduled_trips_df2

## Scheduled Shapes 

In [None]:
TABLE = GTFS_DATA_DICT.schedule_downloads.shapes
FILE = f"{COMPILED_CACHED_VIEWS}{TABLE}_{analysis_date}.parquet"

In [None]:
shapes = gpd.read_parquet(FILE)

In [None]:
shapes.columns

In [None]:
scheduled_shapes_df = helpers.import_scheduled_shapes(
    analysis_date,
    columns=["shape_array_key", "geometry"],
    get_pandas=True,
    crs=PROJECT_CRS,
)

In [None]:
scheduled_shapes_df.columns

## Scheduled Stops

In [None]:
TABLE = GTFS_DATA_DICT.rt_vs_schedule_tables.stop_times_direction
FILE = f"{RT_SCHED_GCS}{TABLE}_{analysis_date}.parquet"

In [None]:
stops_df = gpd.read_parquet(FILE)

In [None]:
stops_df.columns

In [None]:
stops_df2 = stops_df.loc[stops_df.schedule_gtfs_dataset_key.isin(ferry_schd_keys)]

In [None]:
len(stops_df2)

In [None]:
# stops_df2.explore()

## Scheduled Stop Times

In [None]:
TABLE = GTFS_DATA_DICT.rt_vs_schedule_tables.stop_times_direction
FILE = f"{RT_SCHED_GCS}{TABLE}_{analysis_date}.parquet"

In [None]:
sched_stops = gpd.read_parquet(FILE)

In [None]:
sched_stops.columns

In [None]:
sched_stops2 = sched_stops.loc[
    sched_stops.schedule_gtfs_dataset_key.isin(ferry_schd_keys)
]

In [None]:
# sched_stops2.explore()