# Research Request - GTFS Digest: Add Rail and Ferry Operators. #1386

Tiffany's comment:
If it's just a couple of rail operators (Amtrak, Metrolink) and a handful of ferry operators, it's worth digging into why they dropped off. Start by looking for their rows in the 4 schedule tables (trips, shapes, stops, stop_times), and then look for them in a vp table.

* I think the ferry operators and Metrolink are already associated with a district. Even Amtrak might be? But if Amtrak isn't, you can create a separate "district = Amtrak" in the merged df so it always has a tab for itself. Amtrak plots for the entire country!
* District 4: San Francisco Bay Area Rapid Transit (BART), City and County of San Francisco (Muni)
* District 7: Los Angeles County Metropolitan Transportation Authority (LA Metro)
* District 11: San Diego Metropolitan Transit System
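
The "district = Amtrak" idea above could be sketched roughly as follows. This is a minimal sketch on toy data, assuming the merged dataframe carries `organization_name` and `caltrans_district` columns (both appear in the digest table). The helper name `assign_amtrak_district` and the substring match on "Amtrak" are hypothetical, not part of the existing pipeline.

```python
import pandas as pd


def assign_amtrak_district(
    df: pd.DataFrame,
    org_col: str = "organization_name",
    district_col: str = "caltrans_district",
) -> pd.DataFrame:
    """Return a copy where every Amtrak row gets its own 'district' label."""
    out = df.copy()
    # Hypothetical matching rule: any organization name containing "Amtrak".
    is_amtrak = out[org_col].str.contains("Amtrak", case=False, na=False)
    out.loc[is_amtrak, district_col] = "Amtrak"
    return out


# Toy rows for illustration only.
toy = pd.DataFrame(
    {
        "organization_name": ["Amtrak", "City of Alameda"],
        "caltrans_district": ["03 - Marysville", "04 - Oakland"],
    }
)
toy_with_amtrak_tab = assign_amtrak_district(toy)
```

Because the helper works on a copy, the original district assignment stays available if the override needs to be reverted.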

Amanda
* Ferry operators: Bay Area WETA, City of Alameda, and Golden Gate Bridge, Highway and Transportation District show up. All 3 are `vp_only`, so they were filtered out. Should we incorporate them?
* The only ferry operator missing is Santa Cruz Harbor. 
* Amtrak is in District 3 but it has schedule_only data, which isn't true? 
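
One way to bring the `vp_only` ferry operators back in is to exempt them from the filter by name. The sketch below assumes the existing filter simply drops every `vp_only` row; the `FERRY_OPERATORS` allowlist and the `keep_rows` helper are hypothetical, and the toy rows are for illustration only.

```python
import pandas as pd

# Names as they appear in the digest table's organization_name column.
FERRY_OPERATORS = [
    "City of Alameda",
    "Golden Gate Bridge, Highway and Transportation District",
    "San Francisco Bay Area Water Emergency Transit Authority",
]


def keep_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop vp_only rows unless the operator is on the ferry allowlist."""
    not_vp_only = df["sched_rt_category"] != "vp_only"
    is_ferry_op = df["organization_name"].isin(FERRY_OPERATORS)
    return df.loc[not_vp_only | is_ferry_op].reset_index(drop=True)


toy = pd.DataFrame(
    {
        "organization_name": [
            "City of Alameda",
            "Dumbarton Bridge Regional Operations Consortium",
            "Los Angeles County Metropolitan Transportation Authority",
        ],
        "sched_rt_category": ["vp_only", "vp_only", "schedule_and_vp"],
    }
)
kept = keep_rows(toy)
```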

Here's a list of ferry operators in California from Evan's comment [here](https://github.com/cal-itp/data-analyses/issues/1357):
    
    * City of Alameda
    * Golden Gate
    * SF WETA
    * Santa Cruz Harbor

List from Meta AI

    * San Francisco Bay Ferry: operates 10 ferry routes in the San Francisco Bay Area, with two seasonal routes ¹
    * Golden Gate Ferry: operates ferry services between Larkspur, Sausalito, Tiburon, and San Francisco ¹
    * Blue and Gold Fleet: connects San Francisco with Sausalito, Tiburon, Angel Island, Oakland, Alameda, and Vallejo ²
    * Balboa Island Ferry: provides daily ferry service between the Balboa Peninsula in Newport Beach and Balboa Island ¹
    * Tideline Marine Group: operates commuter ferry service between Berkeley and San Francisco ¹
    * Caltrans: operates the J-Mack Ferry, a cable ferry service between Ryde and Ryer Island near Sacramento ¹
    * California Department of Transportation: operates the Howard Landing Ferry on the California Delta ²

In [1]:
import _section1_utils as section1
import _section2_utils as section2
import geopandas as gpd
import merge_data
import merge_operator_data
import numpy as np
import pandas as pd
from segment_speed_utils import gtfs_schedule_wrangling, helpers
from segment_speed_utils.project_vars import COMPILED_CACHED_VIEWS, PROJECT_CRS
from shared_utils import catalog_utils, portfolio_utils, rt_dates
from update_vars import GTFS_DATA_DICT, RT_SCHED_GCS, SCHED_GCS, SEGMENT_GCS

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [3]:
analysis_date_list = [rt_dates.DATES["feb2025"]]

In [4]:
analysis_date = rt_dates.DATES["feb2025"]

In [5]:
schd_vp_url = f"{GTFS_DATA_DICT.digest_tables.dir}{GTFS_DATA_DICT.digest_tables.route_schedule_vp}.parquet"

In [6]:
EXPORT = GTFS_DATA_DICT.schedule_tables.route_typologies

In [7]:
route_typologies = pd.read_parquet(f"{SCHED_GCS}{EXPORT}_{analysis_date}.parquet")

## Look at operators in `digest/schedule_vp_metrics` without any filters to see if ferry and rail operators are in here.
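* No operator has "Ferry" or "Fleet" in its name. The ferry operators that do appear (City of Alameda; Golden Gate Bridge, Highway and Transportation District; San Francisco Bay Area Water Emergency Transit Authority) are all `vp_only`.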

In [8]:
schd_vp_df = pd.read_parquet(
    schd_vp_url,
    columns=[
        "schedule_gtfs_dataset_key",
        "caltrans_district",
        "organization_name",
        "name",
        "sched_rt_category",
        "service_date",
    ],
)

In [9]:
# Filter for the Dec 2024 and Jan 2025 service dates
schd_vp_df2 = schd_vp_df.loc[
    (schd_vp_df.service_date == "2025-01-15")
    | (schd_vp_df.service_date == "2024-12-11")
]

In [10]:
# Keep one row per operator (the first row encountered for each name)
schd_vp_df3 = (
    schd_vp_df2[
        ["organization_name", "service_date", "sched_rt_category", "caltrans_district"]
    ]
    .drop_duplicates(subset=["organization_name"])
    .sort_values(by=["organization_name"])
)

In [11]:
schd_vp_df3.sched_rt_category.value_counts()

schedule_and_vp    102
schedule_only       89
vp_only              6
Name: sched_rt_category, dtype: int64
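
Since `drop_duplicates(subset=["organization_name"])` keeps only the first row per operator, the counts above reflect one category per operator even if an operator has several. A minimal sketch for listing every category per operator is shown below, using toy data with the same two column names; the operator names "A" and "B" are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "organization_name": ["A", "A", "B"],
        "sched_rt_category": ["schedule_only", "vp_only", "schedule_and_vp"],
    }
)

# Collect the sorted, distinct categories seen for each operator.
categories_per_operator = toy.groupby("organization_name")[
    "sched_rt_category"
].agg(lambda s: sorted(s.astype(str).unique()))

# Operators that fall into more than one category.
mixed = categories_per_operator[categories_per_operator.map(len) > 1]
```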

In [67]:
schd_vp_df3.loc[schd_vp_df3.sched_rt_category == "vp_only"]

Unnamed: 0,organization_name,service_date,sched_rt_category,caltrans_district
334698,City of Alameda,2024-12-11,vp_only,04 - Oakland
322978,Dumbarton Bridge Regional Operations Consortium,2024-12-11,vp_only,04 - Oakland
335613,"Golden Gate Bridge, Highway and Transportation District",2024-12-11,vp_only,04 - Oakland
338244,San Bernardino County Transportation Authority,2024-12-11,vp_only,08 - San Bernardino
334697,San Francisco Bay Area Water Emergency Transit Authority,2024-12-11,vp_only,04 - Oakland
338245,Southern California Regional Rail Authority,2024-12-11,vp_only,07 - Los Angeles


### Southern California Regional Rail Authority (Metrolink) is `vp_only`, which is unexpected.

In [12]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Rail")]

Unnamed: 0,organization_name,service_date,sched_rt_category,caltrans_district
259484,San Joaquin Regional Rail Commission,2024-12-11,schedule_and_vp,10 - Stockton
11331,Sonoma-Marin Area Rail Transit District,2024-12-11,schedule_and_vp,04 - Oakland
338245,Southern California Regional Rail Authority,2024-12-11,vp_only,07 - Los Angeles


In [13]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Metropolitan")]

Unnamed: 0,organization_name,service_date,sched_rt_category,caltrans_district
4127,Los Angeles County Metropolitan Transportation Authority,2024-12-11,schedule_and_vp,07 - Los Angeles
202479,San Diego Metropolitan Transit System,2024-12-11,schedule_and_vp,11 - San Diego
69932,Santa Barbara Metropolitan Transit District,2024-12-11,schedule_and_vp,05 - San Luis Obispo
127086,Santa Cruz Metropolitan Transit District,2024-12-11,schedule_and_vp,05 - San Luis Obispo


In [14]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Fleet")]

Unnamed: 0,organization_name,service_date,sched_rt_category,caltrans_district


In [15]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Ferry")]

Unnamed: 0,organization_name,service_date,sched_rt_category,caltrans_district


In [16]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Bay")]

Unnamed: 0,organization_name,service_date,sched_rt_category,caltrans_district
246865,City of Morro Bay,2024-12-11,schedule_only,05 - San Luis Obispo
144763,Mission Bay Transportation Management Agency,2024-12-11,schedule_only,04 - Oakland
167289,San Francisco Bay Area Rapid Transit District,2024-12-11,schedule_only,04 - Oakland
334697,San Francisco Bay Area Water Emergency Transit Authority,2024-12-11,vp_only,04 - Oakland


In [17]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Alameda")]

Unnamed: 0,organization_name,service_date,sched_rt_category,caltrans_district
235475,Alameda-Contra Costa Transit District,2024-12-11,schedule_and_vp,04 - Oakland
334698,City of Alameda,2024-12-11,vp_only,04 - Oakland


In [18]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Golden")]

Unnamed: 0,organization_name,service_date,sched_rt_category,caltrans_district
104541,Golden Empire Transit District,2024-12-11,schedule_and_vp,06 - Fresno
335613,"Golden Gate Bridge, Highway and Transportation District",2024-12-11,vp_only,04 - Oakland


In [19]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Santa Cruz")]

Unnamed: 0,organization_name,service_date,sched_rt_category,caltrans_district
198967,City of Santa Cruz,2024-12-11,schedule_only,05 - San Luis Obispo
127086,Santa Cruz Metropolitan Transit District,2024-12-11,schedule_and_vp,05 - San Luis Obispo
198966,"University of California, Santa Cruz",2024-12-11,schedule_only,05 - San Luis Obispo
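
The keyword searches above could be collapsed into one call. This is a minimal sketch, assuming a case-insensitive substring match is what is wanted; `find_operators` is a hypothetical helper and the toy frame only reuses a few operator names from the outputs above.

```python
import re

import pandas as pd


def find_operators(
    df: pd.DataFrame, keywords: list, col: str = "organization_name"
) -> pd.DataFrame:
    """Return rows whose name contains any keyword (case-insensitive)."""
    # Escape each keyword so punctuation is matched literally.
    pattern = "|".join(re.escape(k) for k in keywords)
    mask = df[col].str.contains(pattern, case=False, na=False, regex=True)
    return df.loc[mask]


toy = pd.DataFrame(
    {
        "organization_name": [
            "City of Alameda",
            "Golden Empire Transit District",
            "City of Morro Bay",
            "Golden Gate Bridge, Highway and Transportation District",
        ]
    }
)
matches = find_operators(toy, ["Alameda", "Golden Gate", "Ferry"])
```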


## Look at ferry operators and see how to incorporate them
* San Francisco Bay Area Water Emergency Transit Authority
* City of Alameda
* Golden Gate Bridge, Highway and Transportation District

### City of Alameda

In [20]:
city_of_alameda_df = pd.read_parquet(
    schd_vp_url,
    filters=[
        [
            ("organization_name", "==", "City of Alameda"),
        ]
    ],
)

In [21]:
city_of_alameda_df[
    [
        "route_primary_direction",
        "route_long_name",
        "route_short_name",
        "route_combined_name",
        "route_id",
        "typology",
    ]
].drop_duplicates()

Unnamed: 0,route_primary_direction,route_long_name,route_short_name,route_combined_name,route_id,typology
334696,,Harbor Bay,HB,HB Harbor Bay,HB,
334744,,Oakland & Alameda,OA,OA Oakland & Alameda,OA,
334792,,Oakland Alameda Water Shuttle,OAS,OAS Oakland Alameda Water Shuttle,OAS,
334840,,Richmond,RCH,RCH Richmond,RCH,
334888,,Alameda Seaplane,SEA,SEA Alameda Seaplane,SEA,
334936,,South San Francisco,SSF,SSF South San Francisco,SSF,
334976,,Vallejo,VJO,VJO Vallejo,VJO,


In [22]:
city_of_alameda_df.schedule_gtfs_dataset_key.unique()

array(['82f30e22dafe8156367297eb9a316c57'], dtype=object)

#### No ferry typologies.

In [23]:
route_typologies.loc[
    route_typologies.schedule_gtfs_dataset_key == "82f30e22dafe8156367297eb9a316c57"
]

Unnamed: 0,schedule_gtfs_dataset_key,name,route_type,route_id,route_long_name,route_short_name,combined_name,is_express,is_rapid,is_rail,is_local,direction_id,common_shape_id,route_name,route_meters,is_coverage,is_downtown_local


### San Francisco Bay Area Water Emergency Transit Authority
* Duplicates the City of Alameda data, except for the additional Oyster Point Limited (OPL) route.

In [24]:
weta_df = pd.read_parquet(
    schd_vp_url,
    filters=[
        [
            (
                "organization_name",
                "==",
                "San Francisco Bay Area Water Emergency Transit Authority",
            ),
        ]
    ],
)

In [25]:
weta_df[
    [
        "route_primary_direction",
        "route_long_name",
        "route_short_name",
        "route_combined_name",
        "route_id",
        "typology",
    ]
].drop_duplicates()

Unnamed: 0,route_primary_direction,route_long_name,route_short_name,route_combined_name,route_id,typology
334695,,Harbor Bay,HB,HB Harbor Bay,HB,
334743,,Oakland & Alameda,OA,OA Oakland & Alameda,OA,
334791,,Oakland Alameda Water Shuttle,OAS,OAS Oakland Alameda Water Shuttle,OAS,
334839,,Richmond,RCH,RCH Richmond,RCH,
334887,,Alameda Seaplane,SEA,SEA Alameda Seaplane,SEA,
334935,,South San Francisco,SSF,SSF South San Francisco,SSF,
334975,,Vallejo,VJO,VJO Vallejo,VJO,
342814,,Oyster Point Limited,OPL,OPL Oyster Point Limited,OPL,
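
The overlap between the two operators can be checked with a set difference on `route_id`. The sketch below hard-codes the route IDs from the two outputs above into toy frames rather than reading the parquet file.

```python
import pandas as pd

# Route IDs copied from the City of Alameda and WETA outputs above.
alameda = pd.DataFrame(
    {"route_id": ["HB", "OA", "OAS", "RCH", "SEA", "SSF", "VJO"]}
)
weta = pd.DataFrame(
    {"route_id": ["HB", "OA", "OAS", "RCH", "SEA", "SSF", "VJO", "OPL"]}
)

only_in_weta = sorted(set(weta["route_id"]) - set(alameda["route_id"]))
only_in_alameda = sorted(set(alameda["route_id"]) - set(weta["route_id"]))
```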


### Golden Gate
* Only bus routes appear; there are no ferry routes.
* This operator should have schedule data too?

In [26]:
goldengate_df = pd.read_parquet(
    schd_vp_url,
    filters=[
        [
            (
                "organization_name",
                "==",
                "Golden Gate Bridge, Highway and Transportation District",
            ),
        ]
    ],
)

In [59]:
goldengate_df.sched_rt_category.value_counts()

vp_only            1287
schedule_only         0
schedule_and_vp       0
Name: sched_rt_category, dtype: int64

In [58]:
goldengate_df.columns

Index(['schedule_gtfs_dataset_key', 'direction_id', 'time_period',
       'avg_scheduled_service_minutes', 'avg_stop_miles', 'n_scheduled_trips',
       'frequency', 'is_express', 'is_rapid', 'is_rail', 'is_coverage',
       'is_downtown_local', 'is_local', 'service_date', 'typology',
       'minutes_atleast1_vp', 'minutes_atleast2_vp',
       'total_rt_service_minutes', 'total_scheduled_service_minutes',
       'total_vp', 'vp_in_shape', 'is_early', 'is_ontime', 'is_late',
       'n_vp_trips', 'vp_per_minute', 'pct_in_shape',
       'pct_rt_journey_atleast1_vp', 'pct_rt_journey_atleast2_vp',
       'pct_sched_journey_atleast1_vp', 'pct_sched_journey_atleast2_vp',
       'rt_sched_journey_ratio', 'avg_rt_service_minutes', 'sched_rt_category',
       'speed_mph', 'route_long_name', 'route_short_name',
       'route_combined_name', 'route_id', 'base64_url',
       'organization_source_record_id', 'organization_name',
       'caltrans_district', 'route_primary_direction', 'name',
       '

In [65]:
goldengate_df.schedule_gtfs_dataset_key.unique()

array(['aea4108997c66a74fbdae27b34b69fde'], dtype=object)

In [27]:
goldengate_df[
    [
        "route_primary_direction",
        "route_long_name",
        "route_short_name",
        "route_combined_name",
        "route_id",
        "typology",
    ]
].drop_duplicates()

Unnamed: 0,route_primary_direction,route_long_name,route_short_name,route_combined_name,route_id,typology
335593,,Santa Rosa - San Francisco,101,101 Santa Rosa - San Francisco,101,
335731,,Mill Valley - San Francisco,114,114 Mill Valley - San Francisco,114,
335862,,San Rafael - San Francisco,130,130 San Rafael - San Francisco,130,
336000,,San Anselmo - San Francisco,132,132 San Anselmo - San Francisco,132,
336115,,San Rafael - San Francisco,150,150 San Rafael - San Francisco,150,
336253,,Novato - San Francisco,154,154 Novato - San Francisco,154,
336368,,Petaluma - San Francisco,164,164 Petaluma - San Francisco,164,
336433,,Santa Rosa - San Francisco,172,172 Santa Rosa - San Francisco,172,
336571,,Santa Rosa - San Francisco Express,172X,172X Santa Rosa - San Francisco Express,172X,
336623,,Del Norte BART Station - San Rafael,580,580 Del Norte BART Station - San Rafael,580,


### Seeing which graphs work for a `vp_only` operator, using the WETA data

In [28]:
import _report_utils
import altair as alt
import yaml

In [29]:
with open("readable.yml") as f:
    readable_dict = yaml.safe_load(f)

In [30]:
with open("color_palettes.yml") as f:
    color_dict = yaml.safe_load(f)

In [31]:
df = weta_df.copy()

In [32]:
# Round float columns
float_columns = df.select_dtypes(include=["float"])
for i in float_columns:
    df[i] = df[i].round(2)

# Scale percent columns from fractions to a 0-100 range
pct_cols = df.columns[df.columns.str.contains("pct")].tolist()
for i in pct_cols:
    df[i] = df[i] * 100

In [33]:
# Add column to create rulers for the charts
df["ruler_100_pct"] = 100
df["ruler_for_vp_per_min"] = 2

# Add a column that flips frequency to be every X minutes instead
# of every hour.
df["headway_in_minutes"] = 60 / df.frequency
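
For `vp_only` operators there are no scheduled trips, so `frequency` may be missing or zero, and `60 / frequency` then yields `NaN` or `inf`. A minimal sketch of a guarded version on toy data is below; treating infinite headways as missing is an assumption, not something the report currently does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"frequency": [2.0, 0.0, np.nan]})

# Plain division gives 30.0, inf, NaN for the three toy rows.
raw_headway = 60 / toy["frequency"]

# Assumption: an infinite headway is treated the same as a missing one.
toy["headway_in_minutes"] = raw_headway.replace([np.inf, -np.inf], np.nan)
```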

In [34]:
df.route_primary_direction = df.route_primary_direction.fillna("None")

In [35]:
df = _report_utils.replace_column_names(df)

In [36]:
routes_list = df["Route"].unique().tolist()

route_dropdown = alt.binding_select(
    options=routes_list,
    name="Routes: ",
)
# Column that controls the bar charts
xcol_param = alt.selection_point(
    fields=["Route"], value=routes_list[0], bind=route_dropdown
)

# Filter for only rows that are "all day" statistics
all_day = df.loc[df["Period"] == "all_day"].reset_index(drop=True)

In [37]:
timeliness_df = section2.timeliness_trips(df)

In [38]:
timeliness_df.head(2)

Unnamed: 0,Date,Organization,Route,Period,Direction,dir_0_1,variable,value
0,2024-11-13,San Francisco Bay Area Water Emergency Transit Authority,HB Harbor Bay,offpeak,,0.0,# Early Arrival Trips,0
1,2024-12-11,San Francisco Bay Area Water Emergency Transit Authority,HB Harbor Bay,offpeak,,0.0,# Early Arrival Trips,0


In [39]:
def pct_vp_journey(df: pd.DataFrame, col1: str, col2: str) -> pd.DataFrame:
    """
    Reshape the data for the charts that display the % of
    a journey that recorded 2+ vehicle positions/minute.
    """
    to_keep = [
        "Date",
        "Organization",
        "dir_0_1",
        col1,
        col2,
        "Route",
        "Period",
        "ruler_100_pct",
    ]
    df2 = df[to_keep]

    df3 = df2.melt(
        id_vars=[
            "Date",
            "Organization",
            "Route",
            "dir_0_1",
            "Period",
            "ruler_100_pct",
        ],
        value_vars=[col1, col2],
    )

    df3 = df3.rename(
        columns={"variable": "Category", "value": "% of Actual Trip Minutes"}
    )
    return df3

In [40]:
sched_journey_vp = pct_vp_journey(
    all_day,
    "% Scheduled Trip w/ 1+ VP/Minute",
    "% Scheduled Trip w/ 2+ VP/Minute",
)

In [41]:
sched_journey_vp.head(2)

Unnamed: 0,Date,Organization,Route,dir_0_1,Period,ruler_100_pct,Category,% of Actual Trip Minutes
0,2024-11-13,San Francisco Bay Area Water Emergency Transit Authority,HB Harbor Bay,0.0,all_day,100,% Scheduled Trip w/ 1+ VP/Minute,100.0
1,2024-12-11,San Francisco Bay Area Water Emergency Transit Authority,HB Harbor Bay,0.0,all_day,100,% Scheduled Trip w/ 1+ VP/Minute,100.0


In [42]:
route_stats_df = section2.route_stats(df)

In [43]:
route_stats_df.head(2)

Unnamed: 0,Route,Direction,Dir 0 1,Average Scheduled Service (Trip Minutes),Average Stop Distance (Miles),# Scheduled Trips,Gtfs Availability,Peak Avg Speed,Peak Scheduled Trips,Peak Hourly Freq,Offpeak Avg Speed,Offpeak Scheduled Trips,Trips Per Hour
0,HB Harbor Bay,,0.0,0.0,0.0,0,vp_only,0.0,0,0.0,0.0,0.0,0.0
1,HB Harbor Bay,,1.0,0.0,0.0,0,vp_only,0.0,0,0.0,0.0,0.0,0.0


### `Average Scheduled Minutes` chart doesn't work.

In [44]:
(
    (
        section2.base_facet_chart(
            timeliness_df.loc[timeliness_df["dir_0_1"] == 0],
            0,
            "value",
            "variable",
            "Period",
            readable_dict["timeliness_trips_graph"]["title"],
            readable_dict["timeliness_trips_graph"]["subtitle"],
        )
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["Period"] = df["Period"].str.replace("_", " ").str.title()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[y_col] = df[y_col].fillna(0).astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[f"{y_col}_str"] = df[y_col].astype(str)
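
The message above is pandas' `SettingWithCopyWarning`: the chart helper assigns columns on a dataframe that was produced by filtering with `.loc`. One common way to avoid it is to pass an explicit copy. The sketch below uses a hypothetical stand-in, `format_period`, for the assignment quoted in the warning; it is not the actual `section2` code.

```python
import pandas as pd


def format_period(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in for the column assignment inside the chart helper.
    df["Period"] = df["Period"].str.replace("_", " ").str.title()
    return df


toy = pd.DataFrame({"dir_0_1": [0, 1], "Period": ["all_day", "offpeak"]})

# Take an explicit copy of the filtered rows before handing them over.
direction_0 = toy.loc[toy["dir_0_1"] == 0].copy()
formatted = format_period(direction_0)
```

With the copy in place, the assignment modifies only `direction_0`, and `toy` keeps its original `Period` values.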


In [45]:
(
    (
        section2.base_facet_chart(
            timeliness_df.loc[timeliness_df["dir_0_1"] == 1],
            1,
            "value",
            "variable",
            "Period",
            readable_dict["timeliness_trips_graph"]["title"],
            "",
        )
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)


In [46]:
df.headway_in_minutes = df.headway_in_minutes.fillna(0)

### `Frequency` doesn't work.

In [47]:
(
    section2.frequency_chart(
        df,
        0,
        readable_dict["frequency_graph"]["title"],
        readable_dict["frequency_graph"]["subtitle"],
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

#### `speed` also doesn't work.

In [48]:
(
    section2.grouped_bar_chart(
        df.loc[df.dir_0_1 == 0],
        "Period",
        "Speed (MPH)",
        "Period",
        readable_dict["speed_graph_dir_0"]["title"],
        readable_dict["speed_graph_dir_0"]["subtitle"],
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)


In [49]:
all_day.head(1).T

Unnamed: 0,0
schedule_gtfs_dataset_key,82f30e22dafe8156367297eb9a316c57
dir_0_1,0.00
Period,all_day
Average Scheduled Service (trip minutes),
Average Stop Distance (miles),
# scheduled trips,0
Trips per Hour,
is_express,
is_rapid,
is_rail,


In [50]:
(
    (
        section2.base_facet_with_ruler_chart(
            all_day.loc[all_day.dir_0_1 == 0],
            "Average VP per Minute",
            "ruler_for_vp_per_min",
            readable_dict["vp_per_min_graph"]["title"],
            readable_dict["vp_per_min_graph"]["subtitle"],
            color_dict["vp_domain"],
            color_dict["vp_range"],
        )
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)


In [51]:
(
    (
        section2.base_facet_with_ruler_chart(
            all_day.loc[all_day.dir_0_1 == 1],
            "Average VP per Minute",
            "ruler_for_vp_per_min",
            readable_dict["vp_per_min_graph"]["title"],
            readable_dict["vp_per_min_graph"]["subtitle"],
            color_dict["vp_domain"],
            color_dict["vp_range"],
        )
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)


In [53]:
sched_journey_vp.columns

Index(['Date', 'Organization', 'Route', 'dir_0_1', 'Period', 'ruler_100_pct',
       'Category', '% of Actual Trip Minutes', '% of Actual Trip Minutes_str'],
      dtype='object')

In [54]:
sched_journey_vp = sched_journey_vp.rename(columns={"dir_0_1": "Direction"})

In [55]:
(
    section2.base_facet_circle(
        sched_journey_vp,
        "% of Actual Trip Minutes",
        "Category",
        "ruler_100_pct",
        readable_dict["sched_vp_per_min_graph"]["title"],
        readable_dict["sched_vp_per_min_graph"]["subtitle"],
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

#### The bars are stacked because directions 0 and 1 are both coded as "None" in `route_primary_direction`
* Need to drop `Direction` and rename `dir_0_1` to `Direction`.

In [56]:
all_day = all_day.drop(columns=["Direction"]).rename(columns={"dir_0_1": "Direction"})

In [57]:
(
    section2.base_facet_with_ruler_chart(
        all_day,
        "% VP within Scheduled Shape",
        "ruler_100_pct",
        readable_dict["spatial_accuracy_graph"]["title"],
        readable_dict["spatial_accuracy_graph"]["subtitle"],
        color_dict["spatial_accuracy_domain"],
        color_dict["spatial_accuracy_range"],
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

## Why is <i>Golden Gate Bridge, Highway and Transportation District</i> `vp_only`? It should have schedule data!

In [60]:
import merge_data

In [62]:
analysis_date

'2025-02-12'

In [63]:
sched_df = merge_data.concatenate_schedule_by_route_direction(analysis_date_list)

In [66]:
sched_df.loc[sched_df.schedule_gtfs_dataset_key == 'aea4108997c66a74fbdae27b34b69fde']

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,route_primary_direction,avg_scheduled_service_minutes,avg_stop_miles,n_scheduled_trips,frequency,is_express,is_rapid,is_rail,is_coverage,is_downtown_local,is_local,service_date
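
The empty result above says the key is missing from this one schedule table. To follow the advice of checking each schedule table and then a vp table, a small presence report could look like the sketch below. It assumes every table exposes a `schedule_gtfs_dataset_key` column, which may not hold for all of them; `presence_by_table` is a hypothetical helper and the toy tables contain made-up rows.

```python
import pandas as pd


def presence_by_table(
    tables: dict, key: str, key_col: str = "schedule_gtfs_dataset_key"
) -> pd.Series:
    """Count the rows matching `key` in each named table."""
    counts = {name: int((df[key_col] == key).sum()) for name, df in tables.items()}
    return pd.Series(counts, name="n_rows")


golden_gate_key = "aea4108997c66a74fbdae27b34b69fde"
col = "schedule_gtfs_dataset_key"

# Toy tables: only the vp table contains the key.
tables = {
    "trips": pd.DataFrame({col: ["some_other_key"]}),
    "shapes": pd.DataFrame({col: ["some_other_key"]}),
    "stops": pd.DataFrame({col: ["some_other_key"]}),
    "stop_times": pd.DataFrame({col: ["some_other_key"]}),
    "vp": pd.DataFrame({col: [golden_gate_key, golden_gate_key]}),
}
presence = presence_by_table(tables, golden_gate_key)
```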


## Look at how rail routes are recorded.

In [None]:
operators_to_keep = [
    "Amtrak",
    "Los Angeles County Metropolitan Transportation Authority",
    "San Diego Metropolitan Transit System",
    "Capitol Corridor Joint Powers Authority",
    "Southern California Regional Rail Authority",
    "San Joaquin Regional Rail Commission",
    "City and County of San Francisco",
    "San Francisco Bay Area Water Emergency Transit Authority",
    "Sonoma-Marin Area Rail Transit District",
]

In [None]:
rail_ops_only = pd.read_parquet(schd_vp_url)

In [None]:
rail_ops_only2 = rail_ops_only.loc[
    rail_ops_only.organization_name.isin(operators_to_keep)
]

In [None]:
rail_ops_only2.organization_name.value_counts()

In [None]:
# rail_ops_only2 is already filtered to operators_to_keep
sched_keys_to_keep = list(rail_ops_only2.schedule_gtfs_dataset_key.unique())

### Bring in route typologies & merge everything together 
* There aren't any routes categorized as ferries in `route_typologies`

In [None]:
route_typologies2 = route_typologies[
    [
        "route_type",
        "route_id",
        "schedule_gtfs_dataset_key",
    ]
].drop_duplicates()

In [None]:
route_typologies2.route_type.unique()

In [None]:
route_typologies2.loc[route_typologies2.route_type == "4"]

In [None]:
m1 = pd.merge(
    rail_ops_only2,
    route_typologies2,
    on=["schedule_gtfs_dataset_key", "route_id"],
    how="left",
    indicator=True,
)

In [None]:
m1._merge.value_counts()
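
With `how="left"` and `indicator=True`, rows marked `left_only` are digest rows that found no match in `route_typologies`, so their `route_type` is missing. A minimal sketch on toy data (the keys and route types here are made up) shows how to pull those rows out:

```python
import pandas as pd

merge_cols = ["schedule_gtfs_dataset_key", "route_id"]

left = pd.DataFrame(
    {"schedule_gtfs_dataset_key": ["k1", "k1"], "route_id": ["HB", "OA"]}
)
right = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["k1"],
        "route_id": ["HB"],
        "route_type": ["4"],
    }
)

merged = pd.merge(left, right, on=merge_cols, how="left", indicator=True)

# Rows present in the digest table but absent from the typologies table.
unmatched = merged.loc[merged["_merge"] == "left_only"]
```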

In [None]:
m1.route_type.unique()

In [None]:
rail_only = m1.loc[(m1.is_rail == 1) | m1.route_type.isin(["1", "2", "12"])]

In [None]:
rail_only2 = rail_only.loc[rail_only.service_date == pd.to_datetime(analysis_date)]

In [None]:
rail_only2.organization_name.value_counts()

### Why aren't rail routes showing for operators that certifiably do have rail, such as SF Muni and Amtrak, when you filter on `is_rail == 1`?

In [None]:
rail_only2.sched_rt_category.value_counts()

In [None]:
# https://gtfs.org/documentation/schedule/reference/#
route_type_crosswalk = {
    "route_type": ["0", "1", "2", "3", "4", "5", "6", "7", "11", "12"],
    "route_type_str": [
        "Tram, Streetcar, Light rail",
        "Subway, Metro",
        "Rail",
        "Bus",
        "Ferry",
        "Cable tram",
        "Aerial lift, suspended cable car (e.g., gondola lift, aerial tramway)",
        "Funicular",
        "Trolleybus",
        "Monorail",
    ],
}

In [None]:
route_type_crosswalk_df = pd.DataFrame(route_type_crosswalk)

In [None]:
route_type_crosswalk_df
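
The crosswalk can be applied to the route rows with a plain dictionary lookup instead of a merge. This is a minimal sketch on toy data; the "Unknown" label for missing codes is an assumption, and the route IDs and codes in the toy frame are made up.

```python
import pandas as pd

# GTFS route_type codes and labels, matching the crosswalk above.
ROUTE_TYPE_LABELS = {
    "0": "Tram, Streetcar, Light rail",
    "1": "Subway, Metro",
    "2": "Rail",
    "3": "Bus",
    "4": "Ferry",
    "5": "Cable tram",
    "6": "Aerial lift, suspended cable car",
    "7": "Funicular",
    "11": "Trolleybus",
    "12": "Monorail",
}

toy = pd.DataFrame(
    {"route_id": ["r1", "r2", "r3"], "route_type": ["4", "2", None]}
)

# Codes without a label (including missing values) become "Unknown".
toy["route_type_str"] = toy["route_type"].map(ROUTE_TYPE_LABELS).fillna("Unknown")
```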

In [None]:
agg1 = (
    rail_only2.groupby(["organization_name", "sched_rt_category", "route_type"])
    .agg(
        {
            "route_id": "count",
            "is_rail": "sum",
        }
    )
    .reset_index()
)

In [None]:
agg1 = agg1.sort_values(by=["organization_name", "route_id"], ascending=[True, False])

### Sonoma-Marin Area Rail Transit District disappeared completely?

In [None]:
agg1.loc[(agg1.route_id != 0) | (agg1.is_rail != 0)]
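
One possible reason an operator vanishes from a grouped table is that `groupby` drops rows whose group key is missing by default. If an operator's rows have `is_rail == 1` but no `route_type` after the left merge, they would be excluded. Whether that is what happened here is not confirmed; the sketch below only demonstrates the behavior on made-up rows.

```python
import numpy as np
import pandas as pd

# Toy rows: the first operator has no route_type after a hypothetical merge miss.
toy = pd.DataFrame(
    {
        "organization_name": [
            "Sonoma-Marin Area Rail Transit District",
            "Amtrak",
        ],
        "route_type": [np.nan, "2"],
        "is_rail": [1, 1],
    }
)
group_cols = ["organization_name", "route_type"]

# Default behavior: groups with a NaN key are silently dropped.
default_groups = toy.groupby(group_cols).size()

# dropna=False keeps the NaN key as its own group.
keep_nan_groups = toy.groupby(group_cols, dropna=False).size()

default_orgs = set(default_groups.index.get_level_values("organization_name"))
kept_orgs = set(keep_nan_groups.index.get_level_values("organization_name"))
```

Comparing `default_orgs` with `kept_orgs` on the real data would show quickly whether missing `route_type` values explain the disappearance.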

In [None]:
rail_only2[
    [
        "organization_name",
        "route_long_name",
        "route_short_name",
        "route_combined_name",
        "route_id",
        "route_type",
    ]
].drop_duplicates()

## Scheduled Trips

In [None]:
scheduled_trips_df = pd.concat(
    [
        helpers.import_scheduled_trips(
            analysis_date,
            columns=[
                "gtfs_dataset_key",
                "name",
                "route_id",
                "route_long_name",
                "route_short_name",
                "route_desc",
            ],
            get_pandas=True,
        ).assign(service_date=pd.to_datetime(analysis_date))
        for analysis_date in analysis_date_list
    ],
    axis=0,
    ignore_index=True,
)

In [None]:
scheduled_trips_df.head(1)

### Find the ferry

In [None]:
scheduled_trips_df.loc[scheduled_trips_df.name.str.contains("Ferry")][
    ["name"]
].drop_duplicates()

In [None]:
scheduled_trips_df.columns

In [None]:
ferry_schd_keys = list(
    scheduled_trips_df.loc[
        scheduled_trips_df.name.str.contains("Ferry")
    ].schedule_gtfs_dataset_key.unique()
)

In [None]:
ferry_names = list(
    scheduled_trips_df.loc[scheduled_trips_df.name.str.contains("Ferry")].name.unique()
)

In [None]:
scheduled_trips_df2 = scheduled_trips_df.loc[
    scheduled_trips_df.schedule_gtfs_dataset_key.isin(ferry_schd_keys)
]

In [None]:
len(scheduled_trips_df2)

In [None]:
scheduled_trips_df2.head(2)

In [None]:
# scheduled_trips_df2

## Scheduled Shapes 

In [None]:
TABLE = GTFS_DATA_DICT.schedule_downloads.shapes
FILE = f"{COMPILED_CACHED_VIEWS}{TABLE}_{analysis_date}.parquet"

In [None]:
shapes = gpd.read_parquet(FILE)

In [None]:
shapes.columns

In [None]:
scheduled_shapes_df = helpers.import_scheduled_shapes(
    analysis_date,
    columns=["shape_array_key", "geometry"],
    get_pandas=True,
    crs=PROJECT_CRS,
)

In [None]:
scheduled_shapes_df.columns

## Scheduled Stops

In [None]:
TABLE = GTFS_DATA_DICT.rt_vs_schedule_tables.stop_times_direction
FILE = f"{RT_SCHED_GCS}{TABLE}_{analysis_date}.parquet"

In [None]:
stops_df = gpd.read_parquet(FILE)

In [None]:
stops_df.columns

In [None]:
stops_df2 = stops_df.loc[stops_df.schedule_gtfs_dataset_key.isin(ferry_schd_keys)]

In [None]:
len(stops_df2)

In [None]:
# stops_df2.explore()

## Scheduled Stop Times

In [None]:
TABLE = GTFS_DATA_DICT.rt_vs_schedule_tables.stop_times_direction
FILE = f"{RT_SCHED_GCS}{TABLE}_{analysis_date}.parquet"

In [None]:
sched_stops = gpd.read_parquet(FILE)

In [None]:
sched_stops.columns

In [None]:
sched_stops2 = sched_stops.loc[
    sched_stops.schedule_gtfs_dataset_key.isin(ferry_schd_keys)
]

In [None]:
# sched_stops2.explore()