# Research Request - GTFS Digest: Add Rail and Ferry Operators. #1386

Tiffany's comment:
If it's just a couple of rail operators (Amtrak, Metrolink) and a handful of ferry operators, it's worth digging into why they dropped off. Start by looking for their rows in the 4 schedule tables (trips, shapes, stops, stop_times), and then look for them in a vp table.

* I think the ferry operators and Metrolink are already associated with a district. Even Amtrak might be? But if Amtrak isn't, you can create a separate "district = Amtrak" in the merged df so it always has a tab for itself. Amtrak plots for the entire country!
* District 4: San Francisco Bay Area Rapid Transit (BART), City and County of San Francisco (Muni)
* District 7: Los Angeles County Metropolitan Transportation Authority (LA Metro)
* District 11: San Diego Metropolitan Transit System
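
The "district = Amtrak" idea from the comment above could look roughly like the sketch below. The toy `merged_df` and its values are made up for illustration; only the column names `organization_name` and `caltrans_district` come from the digest table, and the real merged dataframe may need extra handling (for example if the district column is categorical).

```python
import pandas as pd

# Toy stand-in for the merged digest dataframe (values are illustrative only).
merged_df = pd.DataFrame(
    {
        "organization_name": ["Amtrak", "City and County of San Francisco"],
        "caltrans_district": ["03 - Marysville", "04 - Oakland"],
    }
)

# Overwrite the district for Amtrak rows so they always land in their own tab.
is_amtrak = merged_df.organization_name == "Amtrak"
merged_df.loc[is_amtrak, "caltrans_district"] = "Amtrak"
```

Rows for every other operator keep whatever district they already had.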

Amanda
* All the ferry operators are gone. 
* Amtrak is in District 3 but it has schedule_only data, which isn't true? 

Other operators (thanks Meta.AI)
* Strikethroughs = these operators are already in our `schd_vp_df2`

Rail Services

* <s>Amtrak California: Offers intercity rail services throughout the state</s>
* <s>BART (Bay Area Rapid Transit): Provides rail services in the San Francisco Bay Area</s>
* Caltrain: Offers commuter rail services in the San Francisco Bay Area ¹
* LA Metro Rail: Provides rail services in Los Angeles County ¹ **There's Los Angeles County Metropolitan Transportation Authority**
* Metrolink: Offers commuter rail services in Southern California ¹ **Would this be Southern California Regional Rail Authority?**
* San Diego Trolley: Provides light rail services in San Diego ¹ **Is this part of San Diego Metropolitan Transit System?**
* San Joaquin Regional Rail Commission (ACE): Offers commuter rail services in the San Joaquin Valley ¹
* <s>SMART (Sonoma-Marin Area Rail Transit): Provides commuter rail services in Sonoma and Marin counties ¹</s>
* <s>VTA (Santa Clara Valley Transportation Authority): Offers light rail services in Santa Clara County ¹</s>
    
Here's a list of ferry operators in California:

* San Francisco Bay Ferry: operates 10 ferry routes in the San Francisco Bay Area, with two seasonal routes ¹
* Golden Gate Ferry: operates ferry services between Larkspur, Sausalito, Tiburon, and San Francisco ¹
* Blue and Gold Fleet: connects San Francisco with Sausalito, Tiburon, Angel Island, Oakland, Alameda, and Vallejo ²
* Balboa Island Ferry: provides daily ferry service between the Balboa Peninsula in Newport Beach and Balboa Island ¹
* Tideline Marine Group: operates commuter ferry service between Berkeley and San Francisco ¹
* Caltrans: operates the J-Mack Ferry, a cable ferry service between Ryde and Ryer Island near Sacramento ¹
* California Department of Transportation: operates the Howard Landing Ferry on the California Delta ²

In [1]:
import _section1_utils as section1
import _section2_utils as section2
import geopandas as gpd
import merge_data
import merge_operator_data
import numpy as np
import pandas as pd
from segment_speed_utils import gtfs_schedule_wrangling, helpers
from segment_speed_utils.project_vars import COMPILED_CACHED_VIEWS, PROJECT_CRS
from shared_utils import catalog_utils, portfolio_utils, rt_dates
from update_vars import GTFS_DATA_DICT, RT_SCHED_GCS, SCHED_GCS, SEGMENT_GCS

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [3]:
analysis_date_list = [rt_dates.DATES["feb2025"]]

In [4]:
type(analysis_date_list)

list

In [5]:
analysis_date = rt_dates.DATES["feb2025"]

## Look at operators in `digest/schedule_vp_metrics` without any filters to see if ferry and rail operators are in here.
* Ferry operators aren't here, except San Francisco Bay Area Water Emergency Transit Authority, which is the agency that operates San Francisco Bay Ferry.

In [6]:
schd_vp_url = f"{GTFS_DATA_DICT.digest_tables.dir}{GTFS_DATA_DICT.digest_tables.route_schedule_vp}.parquet"

In [7]:
GTFS_DATA_DICT.digest_tables.route_schedule_vp

'digest/schedule_vp_metrics'

In [8]:
schd_vp_df = pd.read_parquet(
    schd_vp_url,
    columns=[
        "schedule_gtfs_dataset_key",
        "caltrans_district",
        "organization_name",
        "name",
        "sched_rt_category",
        "service_date",
    ],
)

In [9]:
# Filter for the Dec 2024 and Jan 2025 service dates
schd_vp_df2 = schd_vp_df.loc[
    (schd_vp_df.service_date == "2025-01-15")
    | (schd_vp_df.service_date == "2024-12-11")
]

In [10]:
# Drop duplicates
schd_vp_df3 = (
    schd_vp_df2[
        ["organization_name", "service_date", "sched_rt_category", "caltrans_district"]
    ]
    .drop_duplicates(subset=["organization_name"])
    .sort_values(by=["organization_name"])
)

In [11]:
schd_vp_df3.sched_rt_category.value_counts()

schedule_and_vp    102
schedule_only       89
vp_only              6
Name: sched_rt_category, dtype: int64

### Southern California Regional Rail Authority is vehicle positions only, sort of strange.

In [12]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Rail")]

Unnamed: 0,organization_name,service_date,sched_rt_category,caltrans_district
259484,San Joaquin Regional Rail Commission,2024-12-11,schedule_and_vp,10 - Stockton
11331,Sonoma-Marin Area Rail Transit District,2024-12-11,schedule_and_vp,04 - Oakland
338245,Southern California Regional Rail Authority,2024-12-11,vp_only,07 - Los Angeles
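
One thing worth ruling out before reading too much into the `vp_only` label: `In [10]` keeps only the first row per `organization_name`, so an operator with several feeds or categories shows whichever row happened to come first. The sketch below uses made-up operators and feed names to show the effect and one way to list every category per operator instead.

```python
import pandas as pd

# Toy data: one operator with two feeds in different categories (made up).
toy = pd.DataFrame(
    {
        "organization_name": ["Operator A", "Operator A", "Operator B"],
        "name": ["A VehiclePositions", "A Schedule", "B Schedule"],
        "sched_rt_category": ["vp_only", "schedule_and_vp", "schedule_only"],
    }
)

# Keeping one row per operator hides the second category for Operator A.
first_only = toy.drop_duplicates(subset=["organization_name"])

# Listing every distinct category per operator shows the full picture.
categories_per_operator = (
    toy.groupby("organization_name")["sched_rt_category"]
    .agg(lambda s: sorted(s.unique()))
    .reset_index()
)
```

Whether this explains the Southern California Regional Rail Authority row is an open question; it only shows that the de-duplicated view cannot tell the two cases apart.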


In [13]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Water")]

Unnamed: 0,organization_name,service_date,sched_rt_category,caltrans_district
334697,San Francisco Bay Area Water Emergency Transit Authority,2024-12-11,vp_only,04 - Oakland


In [14]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Metropolitan")]

Unnamed: 0,organization_name,service_date,sched_rt_category,caltrans_district
4127,Los Angeles County Metropolitan Transportation Authority,2024-12-11,schedule_and_vp,07 - Los Angeles
202479,San Diego Metropolitan Transit System,2024-12-11,schedule_and_vp,11 - San Diego
69932,Santa Barbara Metropolitan Transit District,2024-12-11,schedule_and_vp,05 - San Luis Obispo
127086,Santa Cruz Metropolitan Transit District,2024-12-11,schedule_and_vp,05 - San Luis Obispo


In [15]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Fleet")]

Unnamed: 0,organization_name,service_date,sched_rt_category,caltrans_district


In [16]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Ferry")]

Unnamed: 0,organization_name,service_date,sched_rt_category,caltrans_district
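
The one-keyword-at-a-time searches above could be folded into a single pass. This is a minimal sketch on a toy dataframe: the first two names come from the outputs above, while the "Hypothetical Harbor ferry Co." row and the missing name are made up to show the `case` and `na` options of `str.contains`.

```python
import pandas as pd

# Toy operator list; the last two rows are invented for illustration.
orgs = pd.DataFrame(
    {
        "organization_name": [
            "San Francisco Bay Area Water Emergency Transit Authority",
            "Sonoma-Marin Area Rail Transit District",
            "Hypothetical Harbor ferry Co.",
            None,
        ]
    }
)

keywords = ["Ferry", "Fleet", "Water"]
pattern = "|".join(keywords)  # regex alternation: match any of the keywords

# case=False also catches "ferry"; na=False treats missing names as non-matches.
matches = orgs.loc[orgs.organization_name.str.contains(pattern, case=False, na=False)]
```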


### Look at how rail routes are recorded.

In [18]:
operators_to_keep = [
    "Amtrak",
    "Los Angeles County Metropolitan Transportation Authority",
    "San Diego Metropolitan Transit System",
    "Capitol Corridor Joint Powers Authority",
    "Southern California Regional Rail Authority",
    "San Joaquin Regional Rail Commission",
    "City and County of San Francisco",
    "San Francisco Bay Area Water Emergency Transit Authority",
    "Sonoma-Marin Area Rail Transit District"
]

In [19]:
rail_ops_only = pd.read_parquet(schd_vp_url)

In [20]:
rail_ops_only2 = rail_ops_only.loc[
    rail_ops_only.organization_name.isin(operators_to_keep)
]

In [21]:
rail_ops_only2.organization_name.value_counts()

Los Angeles County Metropolitan Transportation Authority    16877
San Diego Metropolitan Transit System                       13025
City and County of San Francisco                             8531
San Francisco Bay Area Water Emergency Transit Authority      898
Southern California Regional Rail Authority                   542
Capitol Corridor Joint Powers Authority                       240
Sonoma-Marin Area Rail Transit District                       144
San Joaquin Regional Rail Commission                          123
Amtrak                                                         24
Name: organization_name, dtype: int64

In [22]:
sched_keys_to_keep = list(
    rail_ops_only2.loc[
        rail_ops_only2.organization_name.isin(operators_to_keep)
    ].schedule_gtfs_dataset_key.unique()
)

### Bring in route typologies & merge everything together 
* There aren't any routes categorized as ferries in `route_typologies`

In [23]:
EXPORT = GTFS_DATA_DICT.schedule_tables.route_typologies

In [24]:
route_typologies = pd.read_parquet(f"{SCHED_GCS}{EXPORT}_{analysis_date}.parquet")

In [25]:
rail_ops_only2.head(1)

Unnamed: 0,schedule_gtfs_dataset_key,direction_id,time_period,avg_scheduled_service_minutes,avg_stop_miles,n_scheduled_trips,frequency,is_express,is_rapid,is_rail,is_coverage,is_downtown_local,is_local,service_date,typology,minutes_atleast1_vp,minutes_atleast2_vp,total_rt_service_minutes,total_scheduled_service_minutes,total_vp,vp_in_shape,is_early,is_ontime,is_late,n_vp_trips,vp_per_minute,pct_in_shape,pct_rt_journey_atleast1_vp,pct_rt_journey_atleast2_vp,pct_sched_journey_atleast1_vp,pct_sched_journey_atleast2_vp,rt_sched_journey_ratio,avg_rt_service_minutes,sched_rt_category,speed_mph,route_long_name,route_short_name,route_combined_name,route_id,base64_url,organization_source_record_id,organization_name,caltrans_district,route_primary_direction,name,schedule_source_record_id
4110,0666caf3ec1ecc96b74f4477ee4bc939,0.0,all_day,81.85,0.18,65,2.71,0.0,0.0,0.0,0.0,1.0,0.0,2024-05-22,downtown_local,6721,6615,9461.83,5320.0,19827,18501,1,3,61,65,2.1,0.93,0.71,0.7,1.0,1.0,1.78,145.57,schedule_and_vp,9.75,Metro Local Line,10/48,10/48 Metro Local Line,10,aHR0cHM6Ly9naXRsYWIuY29tL0xBQ01UQS9ndGZzX2J1cy9yYXcvbWFzdGVyL2d0ZnNfYnVzLnppcA==,recPnGkwdpnr8jmHB,Los Angeles County Metropolitan Transportation Authority,07 - Los Angeles,Southbound,LA Metro Bus Schedule,recX8JOPmBQM9aWLC


In [26]:
route_typologies2 = route_typologies[
    [
        "route_type",
        "route_id",
        "schedule_gtfs_dataset_key",
    ]
].drop_duplicates()

In [27]:
route_typologies2.route_type.unique()

array(['3', '2', '0', '1', '5'], dtype=object)

In [28]:
route_typologies2.loc[route_typologies2.route_type == "4"]

Unnamed: 0,route_type,route_id,schedule_gtfs_dataset_key


In [29]:
m1 = pd.merge(
    rail_ops_only2,
    route_typologies2,
    on=["schedule_gtfs_dataset_key", "route_id"],
    how="left",
    indicator=True,
)

In [30]:
m1._merge.value_counts()

left_only     21852
both          18552
right_only        0
Name: _merge, dtype: int64
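
More than half of the rows are `left_only`. A possible explanation, stated here as an assumption rather than a finding, is that the digest table spans many service dates while `route_typologies` was read for a single `analysis_date`. Breaking the merge indicator down by `service_date` would show whether the misses cluster on older dates. The keys and rows below are toy values.

```python
import pandas as pd

# Toy digest rows spanning two service dates (keys and values are made up).
digest = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["old_key", "new_key", "new_key"],
        "route_id": ["801", "801", "802"],
        "service_date": pd.to_datetime(["2024-05-22", "2025-02-12", "2025-02-12"]),
    }
)

# Toy typologies for a single analysis date only.
typologies = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["new_key", "new_key"],
        "route_id": ["801", "802"],
        "route_type": ["0", "1"],
    }
)

merged = pd.merge(
    digest,
    typologies,
    on=["schedule_gtfs_dataset_key", "route_id"],
    how="left",
    indicator=True,
)

# Cross-tabulate merge status by service date to see where the misses cluster.
merge_by_date = pd.crosstab(merged.service_date, merged._merge.astype(str))
```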

In [31]:
m1.route_type.unique()

array([nan, '2', '0', '1', '3', '5'], dtype=object)

In [32]:
rail_only = m1.loc[(m1.is_rail == 1) | m1.route_type.isin(["1", "2", "12"])]

In [33]:
rail_only.columns

Index(['schedule_gtfs_dataset_key', 'direction_id', 'time_period',
       'avg_scheduled_service_minutes', 'avg_stop_miles', 'n_scheduled_trips',
       'frequency', 'is_express', 'is_rapid', 'is_rail', 'is_coverage',
       'is_downtown_local', 'is_local', 'service_date', 'typology',
       'minutes_atleast1_vp', 'minutes_atleast2_vp',
       'total_rt_service_minutes', 'total_scheduled_service_minutes',
       'total_vp', 'vp_in_shape', 'is_early', 'is_ontime', 'is_late',
       'n_vp_trips', 'vp_per_minute', 'pct_in_shape',
       'pct_rt_journey_atleast1_vp', 'pct_rt_journey_atleast2_vp',
       'pct_sched_journey_atleast1_vp', 'pct_sched_journey_atleast2_vp',
       'rt_sched_journey_ratio', 'avg_rt_service_minutes', 'sched_rt_category',
       'speed_mph', 'route_long_name', 'route_short_name',
       'route_combined_name', 'route_id', 'base64_url',
       'organization_source_record_id', 'organization_name',
       'caltrans_district', 'route_primary_direction', 'name',
       'schedule_source_record_id', 'route_type', '_merge'],
      dtype='object')

In [34]:
rail_only2 = rail_only.loc[rail_only.service_date == pd.to_datetime(analysis_date)]

In [35]:
rail_only2.organization_name.value_counts()

City and County of San Francisco                            58
Los Angeles County Metropolitan Transportation Authority    36
Sonoma-Marin Area Rail Transit District                      6
Amtrak                                                       6
San Joaquin Regional Rail Commission                         6
Capitol Corridor Joint Powers Authority                      6
Name: organization_name, dtype: int64

### Why aren't rail routes showing for operators that certifiably do have rail, such as SF Muni and Amtrak, when `is_rail == 0`?

In [36]:
rail_only2.sched_rt_category.value_counts()

schedule_and_vp    106
schedule_only        6
vp_only              6
Name: sched_rt_category, dtype: int64

In [37]:
# https://gtfs.org/documentation/schedule/reference/#
route_type_crosswalk = {
    "route_type": ["0", "1", "2", "3", "4", "5", "6", "7", "11", "12"],
    "route_type_str": [
        "Tram, Streetcar, Light rail",
        "Subway, Metro",
        "Rail",
        "Bus",
        "Ferry.",
        "Cable tram.",
        "Aerial lift, suspended cable car (e.g., gondola lift, aerial tramway).",
        "Funicular.",
        "Trolleybus.",
        "Monorail.",
    ],
}

In [38]:
route_type_crosswalk_df = pd.DataFrame(route_type_crosswalk)

In [39]:
route_type_crosswalk_df

Unnamed: 0,route_type,route_type_str
0,0,"Tram, Streetcar, Light rail"
1,1,"Subway, Metro"
2,2,Rail
3,3,Bus
4,4,Ferry.
5,5,Cable tram.
6,6,"Aerial lift, suspended cable car (e.g., gondola lift, aerial tramway)."
7,7,Funicular.
8,11,Trolleybus.
9,12,Monorail.
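
A crosswalk like this is typically attached with a left merge so that every route keeps its row even when `route_type` is missing. The sketch below uses a trimmed copy of the crosswalk and a toy `routes` dataframe. Route IDs 801, 802 and 78 appear in this notebook's outputs, while the "X" row with no route type is made up.

```python
import pandas as pd

# Trimmed copy of the crosswalk, enough for the toy routes below.
route_type_crosswalk_df = pd.DataFrame(
    {
        "route_type": ["0", "1", "2", "3"],
        "route_type_str": ["Tram, Streetcar, Light rail", "Subway, Metro", "Rail", "Bus"],
    }
)

# Toy routes; the last row stands in for a route that failed the typology merge.
routes = pd.DataFrame(
    {
        "route_id": ["801", "802", "78", "X"],
        "route_type": ["0", "1", "2", None],
    }
)

# how="left" keeps all routes; unmatched route types get a missing label.
labeled = routes.merge(route_type_crosswalk_df, on="route_type", how="left")
```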


In [40]:
agg1 = (
    rail_only2.groupby(["organization_name", "sched_rt_category", "route_type"])
    .agg({"route_id": "count", "is_rail": "sum"})
    .reset_index()
)

In [41]:
agg1 = agg1.sort_values(by=["organization_name", "route_id"], ascending=[True, False])

### Sonoma disappeared completely?

In [42]:
agg1.loc[(agg1.route_id != 0) | (agg1.is_rail != 0)]

Unnamed: 0,organization_name,sched_rt_category,route_type,route_id,is_rail
2,Amtrak,schedule_only,2,6,0.0
18,Capitol Corridor Joint Powers Authority,vp_only,2,6,0.0
32,City and County of San Francisco,schedule_and_vp,0,40,40.0
35,City and County of San Francisco,schedule_and_vp,5,18,18.0
44,Los Angeles County Metropolitan Transportation Authority,schedule_and_vp,0,24,24.0
45,Los Angeles County Metropolitan Transportation Authority,schedule_and_vp,1,12,12.0
58,San Joaquin Regional Rail Commission,schedule_and_vp,2,6,0.0
70,Sonoma-Marin Area Rail Transit District,schedule_and_vp,2,6,0.0
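
The table shows several operators whose routes are GTFS `route_type` 2 (rail) but whose `is_rail` sum is 0. A quick way to isolate that disagreement is sketched below on toy rows. Which route types count as rail is an assumption made for this sketch, and "Some Bus Operator" is a made-up name.

```python
import pandas as pd

# Toy rows mirroring the pattern in the table above (not real counts).
toy = pd.DataFrame(
    {
        "organization_name": [
            "Amtrak",
            "Los Angeles County Metropolitan Transportation Authority",
            "Some Bus Operator",
        ],
        "route_type": ["2", "1", "3"],
        "is_rail": [0.0, 1.0, 0.0],
    }
)

# GTFS route types treated as rail here; this set is an assumption for the sketch.
RAIL_ROUTE_TYPES = ["0", "1", "2", "12"]

# Rows that route_type calls rail but the typology flag does not.
mismatch = toy.loc[toy.route_type.isin(RAIL_ROUTE_TYPES) & (toy.is_rail == 0)]
```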


In [43]:
rail_only2[['organization_name','route_long_name', 'route_short_name',
       'route_combined_name', 'route_id',"route_type"]].drop_duplicates()

Unnamed: 0,organization_name,route_long_name,route_short_name,route_combined_name,route_id,route_type
6629,Sonoma-Marin Area Rail Transit District,Main Line,SMART,SMART Main Line,SMART,2
9404,Los Angeles County Metropolitan Transportation Authority,Metro A Line,,Metro A Line,801,0
9530,Los Angeles County Metropolitan Transportation Authority,Metro B Line,,Metro B Line,802,1
9656,Los Angeles County Metropolitan Transportation Authority,Metro C Line,,Metro C Line,803,0
9782,Los Angeles County Metropolitan Transportation Authority,Metro E Line,,Metro E Line,804,0
9908,Los Angeles County Metropolitan Transportation Authority,Metro D Line,,Metro D Line,805,1
10039,Los Angeles County Metropolitan Transportation Authority,Metro K Line,,Metro K Line,807,0
19532,Amtrak,Pacific Surfliner,,Pacific Surfliner,78,2
25551,City and County of San Francisco,CALIFORNIA STREET CABLE CAR,CA,CA CALIFORNIA STREET CABLE CAR,CA,5
25677,City and County of San Francisco,MARKET & WHARVES,F,F MARKET & WHARVES,F,0


## Scheduled Trips

In [44]:
scheduled_trips_df = pd.concat(
    [
        helpers.import_scheduled_trips(
            analysis_date,
            columns=[
                "gtfs_dataset_key",
                "name",
                "route_id",
                "route_long_name",
                "route_short_name",
                "route_desc",
            ],
            get_pandas=True,
        ).assign(service_date=pd.to_datetime(analysis_date))
        for analysis_date in analysis_date_list
    ],
    axis=0,
    ignore_index=True,
)

In [45]:
scheduled_trips_df.head(1)

Unnamed: 0,schedule_gtfs_dataset_key,name,route_id,route_long_name,route_short_name,route_desc,service_date
0,ff1bc5dde661d62c877165421e9ca257,Santa Ynez Mecatran Schedule,ROUTEA,Express,,,2025-02-12


### Find the ferry

In [46]:
scheduled_trips_df.loc[scheduled_trips_df.name.str.contains("Ferry")][
    ["name"]
].drop_duplicates()

Unnamed: 0,name
1102,Bay Area 511 Golden Gate Ferry Schedule
1165,Bay Area 511 San Francisco Bay Ferry Schedule
1445,Bay Area 511 Treasure Island Ferry Schedule
2288,Havasu Landing Ferry Schedule


In [47]:
scheduled_trips_df.columns

Index(['schedule_gtfs_dataset_key', 'name', 'route_id', 'route_long_name',
       'route_short_name', 'route_desc', 'service_date'],
      dtype='object')

In [48]:
ferry_schd_keys = list(
    scheduled_trips_df.loc[
        scheduled_trips_df.name.str.contains("Ferry")
    ].schedule_gtfs_dataset_key.unique()
)

In [49]:
ferry_names = list(
    scheduled_trips_df.loc[scheduled_trips_df.name.str.contains("Ferry")].name.unique()
)

In [50]:
scheduled_trips_df2 = scheduled_trips_df.loc[
    scheduled_trips_df.schedule_gtfs_dataset_key.isin(ferry_schd_keys)
]

In [51]:
len(scheduled_trips_df2)

13

In [52]:
scheduled_trips_df2.head(2)

Unnamed: 0,schedule_gtfs_dataset_key,name,route_id,route_long_name,route_short_name,route_desc,service_date
1102,ca270cd1ac30a9ec5336a11bc9223c41,Bay Area 511 Golden Gate Ferry Schedule,AISF,Angel Island - San Francisco Ferry,AISF,,2025-02-12
1103,ca270cd1ac30a9ec5336a11bc9223c41,Bay Area 511 Golden Gate Ferry Schedule,LSSF,Larkspur - San Francisco Ferry,LSSF,,2025-02-12
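
To see how the ferry rows split across feeds, a per-feed route count is a small next step. This sketch runs on a toy dataframe: the feed names and the `AISF` and `LSSF` route IDs come from the outputs above, and `R1` is a made-up route ID.

```python
import pandas as pd

# Toy subset of scheduled trips for ferry feeds.
trips = pd.DataFrame(
    {
        "name": [
            "Bay Area 511 Golden Gate Ferry Schedule",
            "Bay Area 511 Golden Gate Ferry Schedule",
            "Havasu Landing Ferry Schedule",
        ],
        "route_id": ["AISF", "LSSF", "R1"],
    }
)

# Number of distinct routes per feed name.
routes_per_feed = trips.groupby("name").route_id.nunique().reset_index(name="n_routes")
```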


In [53]:
# scheduled_trips_df2

## Scheduled Shapes 

In [54]:
TABLE = GTFS_DATA_DICT.schedule_downloads.shapes
FILE = f"{COMPILED_CACHED_VIEWS}{TABLE}_{analysis_date}.parquet"

In [55]:
shapes = gpd.read_parquet(FILE)

In [56]:
shapes.columns

Index(['feed_key', 'feed_timezone', 'service_date',
       'shape_first_departure_datetime_pacific',
       'shape_last_arrival_datetime_pacific', 'shape_id', 'shape_array_key',
       'n_trips', 'geometry'],
      dtype='object')

In [57]:
scheduled_shapes_df = helpers.import_scheduled_shapes(
    analysis_date,
    columns=["shape_array_key", "geometry"],
    get_pandas=True,
    crs=PROJECT_CRS,
)

In [58]:
scheduled_shapes_df.columns

Index(['shape_array_key', 'geometry'], dtype='object')

## Scheduled Stops

In [59]:
TABLE = GTFS_DATA_DICT.rt_vs_schedule_tables.stop_times_direction
FILE = f"{RT_SCHED_GCS}{TABLE}_{analysis_date}.parquet"

In [60]:
stops_df = gpd.read_parquet(FILE)

In [61]:
stops_df.columns

Index(['feed_key', 'stop_id', 'stop_sequence', 'schedule_gtfs_dataset_key',
       'trip_instance_key', 'shape_array_key', 'stop_name', 'geometry',
       'stop_meters', 'prior_stop_sequence', 'subseq_stop_sequence',
       'stop_primary_direction', 'stop_pair', 'stop_pair_name'],
      dtype='object')

In [62]:
stops_df2 = stops_df.loc[stops_df.schedule_gtfs_dataset_key.isin(ferry_schd_keys)]

In [63]:
len(stops_df2)

687

In [64]:
# stops_df2.explore()

## Scheduled Stop Times

In [65]:
TABLE = GTFS_DATA_DICT.rt_vs_schedule_tables.stop_times_direction
FILE = f"{RT_SCHED_GCS}{TABLE}_{analysis_date}.parquet"

In [66]:
sched_stops = gpd.read_parquet(FILE)

In [67]:
sched_stops.columns

Index(['feed_key', 'stop_id', 'stop_sequence', 'schedule_gtfs_dataset_key',
       'trip_instance_key', 'shape_array_key', 'stop_name', 'geometry',
       'stop_meters', 'prior_stop_sequence', 'subseq_stop_sequence',
       'stop_primary_direction', 'stop_pair', 'stop_pair_name'],
      dtype='object')

In [68]:
sched_stops2 = sched_stops.loc[
    sched_stops.schedule_gtfs_dataset_key.isin(ferry_schd_keys)
]

In [69]:
# sched_stops2.explore()
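
The table-by-table checks above can be summarized in one presence matrix: one row per ferry feed key, one column per table. The sketch below uses hypothetical keys and toy tables in place of the real dataframes. It also assumes each table carries `schedule_gtfs_dataset_key`; the shapes file read above only has `feed_key` and `shape_array_key`, so it would need a join through trips first.

```python
import pandas as pd

# Hypothetical feed keys and toy tables; the real ones come from the cells above.
ferry_keys = ["key_a", "key_b"]
tables = {
    "trips": pd.DataFrame({"schedule_gtfs_dataset_key": ["key_a", "key_b", "key_c"]}),
    "stop_times": pd.DataFrame({"schedule_gtfs_dataset_key": ["key_a", "key_c"]}),
    "vp": pd.DataFrame({"schedule_gtfs_dataset_key": ["key_c"]}),
}

# True when the feed key has at least one row in that table.
presence = pd.DataFrame(
    {
        table_name: [key in set(df.schedule_gtfs_dataset_key) for key in ferry_keys]
        for table_name, df in tables.items()
    },
    index=ferry_keys,
)
```

A key that is `True` for trips but `False` further along points to the step where the operator drops off.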