# Research Request - GTFS Digest: Add Rail and Ferry Operators. #1386

Tiffany's comment:
If it's just a couple of rail operators (Amtrak, Metrolink) and a handful of ferry operators, it's worth digging into why they dropped off. Start by looking for their rows in the 4 schedule tables (trips, shapes, stops, stop_times), then look for them in a vp table.

* I think the ferry operators and Metrolink are already associated with a district. Even Amtrak might be? But if Amtrak isn't, you can create a separate "district = Amtrak" in the merged df so it always has a tab for itself. Amtrak plots for the entire country!
* District 4: San Francisco Bay Area Rapid Transit (BART), City and County of San Francisco (Muni)
* District 7: Los Angeles County Metropolitan Transportation Authority (LA Metro)
* District 11: San Diego Metropolitan Transit System
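
A minimal sketch of the "district = Amtrak" idea on toy data. The organization name `Amtrak`, the district labels, and the helper `assign_amtrak_district` are all hypothetical; the real values in the merged df may differ.

```python
import pandas as pd


def assign_amtrak_district(
    df: pd.DataFrame, amtrak_name: str = "Amtrak"
) -> pd.DataFrame:
    """Give Amtrak its own pseudo-district so it always gets its own tab."""
    df = df.copy()
    is_amtrak = df["organization_name"] == amtrak_name
    df.loc[is_amtrak, "caltrans_district"] = "Amtrak"
    return df


# Toy rows; names and district labels are made up for illustration.
toy = pd.DataFrame(
    {
        "organization_name": ["Amtrak", "City of Alameda"],
        "caltrans_district": ["District 3", "District 4"],
    }
)
toy_with_amtrak = assign_amtrak_district(toy)
```

Working on a copy keeps the original district assignment available in case the override needs to be undone.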

Amanda
* Ferry operator: Bay Area WETA, City of Alameda, and Golden Gate Bridge, Highway and Transportation District show up. All 3 are vp_only so they were filtered out -> incorporate them in? 
* The only ferry operator missing is Santa Cruz Harbor. 
* Amtrak is in District 3 but it has schedule_only data, which isn't true? 

* Here's a list of ferry operators in California from Evan's comment [here](https://github.com/cal-itp/data-analyses/issues/1357):
    
    * City of Alameda
    * Golden Gate
    * SF WETA
    * Santa Cruz Harbor
* 2/26
    * Don't filter out any of the operators when generating the yaml that powers GTFS Digest. 
    * Come up with a different section of code for operators that are `vp_only`
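
One possible way to route operators to different report sections, sketched on toy data. The helper `operators_by_category` and the operator names are hypothetical, and `schedule_and_vp` is an assumed category label; only `vp_only` and `schedule_only` are mentioned in these notes.

```python
import pandas as pd

# Toy data; operator names are placeholders.
toy = pd.DataFrame(
    {
        "organization_name": ["Operator A", "Operator B", "Operator C"],
        "sched_rt_category": ["schedule_and_vp", "vp_only", "schedule_only"],
    }
)


def operators_by_category(df: pd.DataFrame) -> dict:
    """Map each sched_rt_category to a sorted list of its operators."""
    return (
        df.groupby("sched_rt_category")["organization_name"]
        .apply(lambda s: sorted(s.unique()))
        .to_dict()
    )


groups = operators_by_category(toy)
# Operators that would get the separate vp_only section.
vp_only_operators = groups.get("vp_only", [])
```

Nothing is filtered out here: every operator lands in exactly one group, and the report code can then pick a section per group.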

In [None]:
import _section1_utils as section1
import _section2_utils as section2
import geopandas as gpd
import merge_data
import merge_operator_data
import numpy as np
import pandas as pd
from segment_speed_utils import gtfs_schedule_wrangling, helpers
from segment_speed_utils.project_vars import COMPILED_CACHED_VIEWS, PROJECT_CRS
from shared_utils import catalog_utils, portfolio_utils, rt_dates
from update_vars import GTFS_DATA_DICT, RT_SCHED_GCS, SCHED_GCS, SEGMENT_GCS

In [None]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [None]:
analysis_date_list = [rt_dates.DATES["feb2025"]]

In [None]:
analysis_date = rt_dates.DATES["feb2025"]

In [None]:
schd_vp_url = f"{GTFS_DATA_DICT.digest_tables.dir}{GTFS_DATA_DICT.digest_tables.route_schedule_vp}.parquet"

In [None]:
EXPORT = GTFS_DATA_DICT.schedule_tables.route_typologies

In [None]:
route_typologies = pd.read_parquet(f"{SCHED_GCS}{EXPORT}_{analysis_date}.parquet")

## Look at operators in `digest/schedule_vp_metrics` without any filters to see if ferry and rail operators are in here.
* Ferry operators except San Francisco Bay Area Water Emergency Transit Authority (which isn't even a ferry?) aren't here.

In [None]:
schd_vp_df = pd.read_parquet(
    schd_vp_url,
    columns=[
        "schedule_gtfs_dataset_key",
        "caltrans_district",
        "organization_name",
        "name",
        "sched_rt_category",
        "service_date",
    ],
)

In [None]:
# Filter for the Dec 2024 and Jan 2025 service dates
schd_vp_df2 = schd_vp_df.loc[
    (schd_vp_df.service_date == "2025-01-15")
    | (schd_vp_df.service_date == "2024-12-11")
]

In [None]:
# Keep one row per organization (the first one found)
schd_vp_df3 = (
    schd_vp_df2[
        ["organization_name", "service_date", "sched_rt_category", "caltrans_district"]
    ]
    .drop_duplicates(subset=["organization_name"])
    .sort_values(by=["organization_name"])
)
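
Keeping only the first row per organization can hide an operator whose category differs between dates. A small check on toy data (operator names and the `schedule_and_vp` label are assumptions) that flags such operators:

```python
import pandas as pd

# Toy data: Operator A changes category between the two dates.
toy = pd.DataFrame(
    {
        "organization_name": ["Operator A", "Operator A", "Operator B"],
        "service_date": ["2024-12-11", "2025-01-15", "2025-01-15"],
        "sched_rt_category": ["vp_only", "schedule_and_vp", "vp_only"],
    }
)

# Count distinct categories per operator across all dates.
n_categories = toy.groupby("organization_name")["sched_rt_category"].nunique()
mixed_operators = n_categories[n_categories > 1].index.tolist()
```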

In [None]:
schd_vp_df3.sched_rt_category.value_counts()

In [None]:
schd_vp_df3.loc[schd_vp_df3.sched_rt_category == "vp_only"]

In [None]:
schd_vp_df3.columns

### Southern California Regional Rail Authority is vehicle positions only, sort of strange.

In [None]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Rail")]

In [None]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Metropolitan")]

In [None]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Fleet")]

In [None]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Ferry")]

In [None]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Bay")]

In [None]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Alameda")]

In [None]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Golden")]

In [None]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Santa Cruz")]
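
The separate name lookups above could be collapsed into one regex search. A sketch on toy data, where "Some Bus Operator" is a made-up name and the keyword list is just a subset of the ones tried above:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "organization_name": [
            "City of Alameda",
            "Golden Gate Bridge, Highway and Transportation District",
            "Southern California Regional Rail Authority",
            "Some Bus Operator",
        ]
    }
)

# Join the keywords into a single alternation pattern.
keywords = ["Rail", "Ferry", "Bay", "Alameda", "Golden", "Santa Cruz"]
pattern = "|".join(keywords)

# na=False treats missing names as non-matches instead of raising.
matches = toy.loc[toy["organization_name"].str.contains(pattern, na=False)]
```

This only works as-is because none of the keywords contain regex special characters; otherwise each one would need `re.escape` first.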

## Look at ferry operators and see how to incorporate them
* San Francisco Bay Area Water Emergency Transit Authority
* City of Alameda
* Golden Gate Bridge, Highway and Transportation District

### City of Alameda

In [None]:
city_of_alameda_df = pd.read_parquet(
    schd_vp_url,
    filters=[
        [
            ("organization_name", "==", "City of Alameda"),
        ]
    ],
)

In [None]:
city_of_alameda_df[
    [
        "route_primary_direction",
        "route_long_name",
        "route_short_name",
        "route_combined_name",
        "route_id",
        "typology",
    ]
].drop_duplicates()

In [None]:
city_of_alameda_df.schedule_gtfs_dataset_key.unique()

In [None]:
city_of_alameda_df.columns

#### No ferry typologies.

In [None]:
route_typologies.loc[
    route_typologies.schedule_gtfs_dataset_key == "82f30e22dafe8156367297eb9a316c57"
]

### San Francisco Bay Area Water Emergency Transit Authority
* Duplicates City of Alameda data except for Oyster Bay.

In [None]:
weta_df = pd.read_parquet(
    schd_vp_url,
    filters=[
        [
            (
                "organization_name",
                "==",
                "San Francisco Bay Area Water Emergency Transit Authority",
            ),
        ]
    ],
)

In [None]:
weta_df[
    [
        "route_primary_direction",
        "route_long_name",
        "route_short_name",
        "route_combined_name",
        "route_id",
        "typology",
    ]
].drop_duplicates()
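
To confirm which routes are shared between two operators and which are unique to one, an outer merge with `indicator=True` is one option. The route ids below are placeholders, not the real City of Alameda or WETA routes.

```python
import pandas as pd

# Placeholder route ids for illustration only.
alameda_routes = pd.DataFrame({"route_id": ["R1", "R2"]})
weta_routes = pd.DataFrame({"route_id": ["R1", "R2", "R3"]})

# The _merge column reports where each route_id was found.
comparison = alameda_routes.merge(
    weta_routes, on="route_id", how="outer", indicator=True
)
only_in_weta = comparison.loc[
    comparison["_merge"] == "right_only", "route_id"
].tolist()
```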

### Golden Gate
* Only Bus Routes.
* This should be schedule too? 

In [None]:
goldengate_df = pd.read_parquet(
    schd_vp_url,
    filters=[
        [
            (
                "organization_name",
                "==",
                "Golden Gate Bridge, Highway and Transportation District",
            ),
        ]
    ],
)

In [None]:
goldengate_df.sched_rt_category.value_counts()

In [None]:
goldengate_df.columns

In [None]:
goldengate_df.schedule_gtfs_dataset_key.unique()

In [None]:
goldengate_df[
    [
        "route_primary_direction",
        "route_long_name",
        "route_short_name",
        "route_combined_name",
        "route_id",
        "typology",
    ]
].drop_duplicates()

### Seeing which graphs are vp_only using 

In [None]:
import _report_utils
import altair as alt
import yaml

In [None]:
with open("readable.yml") as f:
    readable_dict = yaml.safe_load(f)

In [None]:
with open("color_palettes.yml") as f:
    color_dict = yaml.safe_load(f)

In [None]:
df = weta_df.copy()

In [None]:
# Round float columns
float_columns = df.select_dtypes(include=["float"])
for i in float_columns:
    df[i] = df[i].round(2)

# Scale percent columns from fractions (0-1) to percentages (0-100)
pct_cols = df.columns[df.columns.str.contains("pct")].tolist()
for i in pct_cols:
    df[i] = df[i] * 100

In [None]:
# Add column to create rulers for the charts
df["ruler_100_pct"] = 100
df["ruler_for_vp_per_min"] = 2

# Add a column that flips frequency to be every X minutes instead
# of every hour.
df["headway_in_minutes"] = 60 / df.frequency
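
A sketch of the same headway calculation on a toy series, assuming `frequency` (trips per hour) can be zero or missing for `vp_only` operators, which is not confirmed here. Dividing by zero yields `inf` in pandas, so it is replaced with NaN before charting.

```python
import pandas as pd

# Toy frequencies: 2 trips/hour, 0 trips/hour, and a missing value.
frequency = pd.Series([2.0, 0.0, float("nan")], name="frequency")

# 60 minutes divided by trips per hour gives minutes between trips.
headway = 60 / frequency

# Zero frequency produces inf; treat it as missing instead.
headway = headway.replace([float("inf"), float("-inf")], float("nan"))
```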

In [None]:
df.route_primary_direction = df.route_primary_direction.fillna("None")

In [None]:
df = _report_utils.replace_column_names(df)

In [None]:
routes_list = df["Route"].unique().tolist()

route_dropdown = alt.binding_select(
    options=routes_list,
    name="Routes: ",
)
# Column that controls the bar charts
xcol_param = alt.selection_point(
    fields=["Route"], value=routes_list[0], bind=route_dropdown
)

# Filter for only rows that are "all day" statistics
all_day = df.loc[df["Period"] == "all_day"].reset_index(drop=True)

In [None]:
timeliness_df = section2.timeliness_trips(df)

In [None]:
timeliness_df.head(2)

In [None]:
def pct_vp_journey(df: pd.DataFrame, col1: str, col2: str) -> pd.DataFrame:
    """
    Reshape the data for the charts that display the % of
    a journey that recorded 2+ vehicle positions/minute.
    """
    to_keep = [
        "Date",
        "Organization",
        "dir_0_1",
        col1,
        col2,
        "Route",
        "Period",
        "ruler_100_pct",
    ]
    df2 = df[to_keep]

    df3 = df2.melt(
        id_vars=[
            "Date",
            "Organization",
            "Route",
            "dir_0_1",
            "Period",
            "ruler_100_pct",
        ],
        value_vars=[col1, col2],
    )

    df3 = df3.rename(
        columns={"variable": "Category", "value": "% of Actual Trip Minutes"}
    )
    return df3
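
The reshape in `pct_vp_journey` is a wide-to-long `melt`. A toy example with made-up values shows the shape change: two metric columns become one `Category` column and one value column.

```python
import pandas as pd

# Made-up values; one row per route-direction.
wide = pd.DataFrame(
    {
        "Route": ["Route X", "Route X"],
        "dir_0_1": [0, 1],
        "% Scheduled Trip w/ 1+ VP/Minute": [90.0, 80.0],
        "% Scheduled Trip w/ 2+ VP/Minute": [70.0, 60.0],
    }
)

# Each metric column becomes its own set of rows.
long = wide.melt(
    id_vars=["Route", "dir_0_1"],
    value_vars=[
        "% Scheduled Trip w/ 1+ VP/Minute",
        "% Scheduled Trip w/ 2+ VP/Minute",
    ],
).rename(columns={"variable": "Category", "value": "% of Actual Trip Minutes"})
```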

In [None]:
sched_journey_vp = pct_vp_journey(
    all_day,
    "% Scheduled Trip w/ 1+ VP/Minute",
    "% Scheduled Trip w/ 2+ VP/Minute",
)

In [None]:
sched_journey_vp.head(2)

In [None]:
route_stats_df = section2.route_stats(df)

In [None]:
route_stats_df.head(2)

## Build this into a function

In [None]:
def load_vp_metrics(organization:str)->pd.DataFrame:
    """
    Load schedule versus realtime file.
    """
    schd_vp_url = f"{GTFS_DATA_DICT.digest_tables.dir}{GTFS_DATA_DICT.digest_tables.route_schedule_vp}.parquet"
   
    # Load only the rows for this organization
    df = pd.read_parquet(
        schd_vp_url,
        filters=[[("organization_name", "==", organization)]],
    )
    
    # Delete duplicates
    df = df.drop_duplicates().reset_index(drop = True)
    
    # Round float columns
    float_columns = df.select_dtypes(include=['float'])
    for i in float_columns:
        df[i] = df[i].round(2)
    
    # Scale percent columns from fractions (0-1) to percentages (0-100)
    pct_cols = df.columns[df.columns.str.contains("pct")].tolist()
    for i in pct_cols:
        df[i] = df[i] * 100
        
    # Add column to create rulers for the charts
    df["ruler_100_pct"] = 100
    df["ruler_for_vp_per_min"] = 2
    
    # Add a column that flips frequency to be every X minutes instead
    # of every hour.
    df["headway_in_minutes"] = 60/df.frequency
    
    # Replace missing values in route_primary_direction
    df.route_primary_direction = df.route_primary_direction.fillna(df.direction_id)
    
    # Replace column names
    df = _report_utils.replace_column_names(df)

    
    return df

In [None]:
dumbardton_df = load_vp_metrics("Dumbarton Bridge Regional Operations Consortium")

In [None]:
dumbardton_df.head(1)

In [None]:
socal_rail_df = load_vp_metrics("Southern California Regional Rail Authority")

In [None]:
gg_df = load_vp_metrics("Golden Gate Bridge, Highway and Transportation District")

In [None]:
def filtered_route(
    df: pd.DataFrame,
) -> alt.Chart:
    """
    This combines all the charts together, controlled by a single
    dropdown.
    
    Resources:
        https://stackoverflow.com/questions/58919888/multiple-selections-in-altair
    """
    # Create dropdown
    routes_list = df["Route"].unique().tolist()

    route_dropdown = alt.binding_select(
        options=routes_list,
        name="Routes: ",
    )
    # Column that controls the bar charts
    xcol_param = alt.selection_point(
        fields=["Route"], value=routes_list[0], bind=route_dropdown
    )

    # Filter for only rows that are "all day" statistics
    all_day = df.loc[df["Period"] == "all_day"].reset_index(drop=True)

    # Manipulate the df for some of the metrics
    timeliness_df = section2.timeliness_trips(df)
    sched_journey_vp = section2.pct_vp_journey(
        all_day,
        "% Scheduled Trip w/ 1+ VP/Minute",
        "% Scheduled Trip w/ 2+ VP/Minute",
    )
    route_stats_df = section2.route_stats(df)
    
    # Create the charts
    timeliness_trips_dir_0 = (
            (
                section2.base_facet_chart(
                    timeliness_df.loc[timeliness_df["dir_0_1"] == 0],
                    0,
                    "value",
                    "variable",
                    "Period",
                    readable_dict["timeliness_trips_graph"]["title"],
                    readable_dict["timeliness_trips_graph"]["subtitle"],
                    color_dict["tri_color2"]
                )
            )
            .add_params(xcol_param)
            .transform_filter(xcol_param)
        )
    timeliness_trips_dir_1 = (
            (
                section2.base_facet_chart(
                    timeliness_df.loc[timeliness_df["dir_0_1"] == 1],
                    1,
                    "value",
                    "variable",
                    "Period",
                    readable_dict["timeliness_trips_graph"]["title"],
                    "",
                    color_dict["tri_color2"],
                )
            )
            .add_params(xcol_param)
            .transform_filter(xcol_param)
        )

    speed_graph_dir_0 = (
       section2.grouped_bar_chart(
            df.loc[df.dir_0_1 == 0],
            "Period",
            "Speed (MPH)",
            "Period",
            readable_dict["speed_graph_dir_0"]["title"],
            readable_dict["speed_graph_dir_0"]["subtitle"],
            color_dict["tri_color2"]
        )
        .add_params(xcol_param)
        .transform_filter(xcol_param)
    )
    speed_graph_dir_1 = (
       section2.grouped_bar_chart(
            df.loc[df.dir_0_1 == 1],
            "Period",
            "Speed (MPH)",
            "Period",
            readable_dict["speed_graph_dir_1"]["title"],
            readable_dict["speed_graph_dir_0"]["subtitle"],
           color_dict["tri_color2"]
        )
        .add_params(xcol_param)
        .transform_filter(xcol_param)
    )
    vp_per_min_graph = (
        (
            section2.base_facet_with_ruler_chart(
                all_day,
                "Average VP per Minute",
                "ruler_for_vp_per_min",
                readable_dict["vp_per_min_graph"]["title"],
                readable_dict["vp_per_min_graph"]["subtitle"],
                color_dict["vp_domain"],
                color_dict["vp_range"]
            )
        )
        .add_params(xcol_param)
        .transform_filter(xcol_param)
    )

    sched_vp_per_min = (
        section2.base_facet_circle(
            sched_journey_vp,
            "% of Actual Trip Minutes",
            "Category",
            "ruler_100_pct",
            readable_dict["sched_vp_per_min_graph"]["title"],
            readable_dict["sched_vp_per_min_graph"]["subtitle"],
            color_dict["tri_color2"]
        )
        .add_params(xcol_param)
        .transform_filter(xcol_param)
    )
    spatial_accuracy = (
        section2.base_facet_with_ruler_chart(
            all_day,
            "% VP within Scheduled Shape",
            "ruler_100_pct",
            readable_dict["spatial_accuracy_graph"]["title"],
            readable_dict["spatial_accuracy_graph"]["subtitle"],
            color_dict["spatial_accuracy_domain"],
            color_dict["spatial_accuracy_range"]
        )
        .add_params(xcol_param)
        .transform_filter(xcol_param)
    )
    # Separate out the charts thematically.
    ride_quality = section2.divider_chart(df, readable_dict["ride_quality_graph"]["title"])
    data_quality = section2.divider_chart(df, readable_dict["data_quality_graph"]["title"])
    
    # Combine all the charts
    chart_list = [
        ride_quality,
        timeliness_trips_dir_0,
        timeliness_trips_dir_1,
        speed_graph_dir_0,
        speed_graph_dir_1,
        data_quality,
        vp_per_min_graph,
        sched_vp_per_min,
        spatial_accuracy,
    ]

    chart = alt.vconcat(*chart_list)

    return chart


In [None]:
filtered_route(gg_df)

In [None]:
filtered_route(socal_rail_df)

In [None]:
filtered_route(dumbardton_df)

### `Average Scheduled Minutes` chart doesn't work.

In [None]:
(
    section2.base_facet_chart(
        timeliness_df.loc[timeliness_df["dir_0_1"] == 1],
        1,
        "value",
        "variable",
        "Period",
        readable_dict["timeliness_trips_graph"]["title"],
        "",
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

In [None]:
df.headway_in_minutes = df.headway_in_minutes.fillna(0)

### `Frequency` doesn't work.

In [None]:
(
    section2.frequency_chart(
        df,
        0,
        readable_dict["frequency_graph"]["title"],
        readable_dict["frequency_graph"]["subtitle"],
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

### `Speed` also doesn't work.

In [None]:
(
    section2.grouped_bar_chart(
        df.loc[df.dir_0_1 == 0],
        "Period",
        "Speed (MPH)",
        "Period",
        readable_dict["speed_graph_dir_0"]["title"],
        readable_dict["speed_graph_dir_0"]["subtitle"],
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

In [None]:
all_day.head(1).T

In [None]:
(
    (
        section2.base_facet_with_ruler_chart(
            all_day.loc[all_day.dir_0_1 == 0],
            "Average VP per Minute",
            "ruler_for_vp_per_min",
            readable_dict["vp_per_min_graph"]["title"],
            readable_dict["vp_per_min_graph"]["subtitle"],
            color_dict["vp_domain"],
            color_dict["vp_range"],
        )
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

In [None]:
(
    (
        section2.base_facet_with_ruler_chart(
            all_day.loc[all_day.dir_0_1 == 1],
            "Average VP per Minute",
            "ruler_for_vp_per_min",
            readable_dict["vp_per_min_graph"]["title"],
            readable_dict["vp_per_min_graph"]["subtitle"],
            color_dict["vp_domain"],
            color_dict["vp_range"],
        )
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

In [None]:
sched_journey_vp.columns

In [None]:
sched_journey_vp = sched_journey_vp.rename(columns={"dir_0_1": "Direction"})

In [None]:
(
    section2.base_facet_circle(
        sched_journey_vp,
        "% of Actual Trip Minutes",
        "Category",
        "ruler_100_pct",
        readable_dict["sched_vp_per_min_graph"]["title"],
        readable_dict["sched_vp_per_min_graph"]["subtitle"],
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

#### The bars are stacked because directions 0 and 1 are both coded as "None" in `route_primary_direction`
* Need to drop Direction and rename `dir_0_1` as Direction.
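
A toy illustration of why the bars stack: filling every missing direction with the same constant leaves a single category, while falling back to the numeric direction keeps the two directions apart. The data below is made up.

```python
import pandas as pd

# Both rows are missing a primary direction, as for vp_only operators.
toy = pd.DataFrame(
    {
        "route_primary_direction": [None, None],
        "direction_id": [0, 1],
    }
)

# Filling with a constant collapses both directions into one label.
collapsed = toy["route_primary_direction"].fillna("None")

# Falling back to the numeric direction keeps them distinct.
distinct = toy["route_primary_direction"].fillna(
    toy["direction_id"].astype(str)
)
```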

In [None]:
all_day = all_day.drop(columns=["Direction"]).rename(columns={"dir_0_1": "Direction"})

In [None]:
(
    section2.base_facet_with_ruler_chart(
        all_day,
        "% VP within Scheduled Shape",
        "ruler_100_pct",
        readable_dict["spatial_accuracy_graph"]["title"],
        readable_dict["spatial_accuracy_graph"]["subtitle"],
        color_dict["spatial_accuracy_domain"],
        color_dict["spatial_accuracy_range"],
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

## Why is <i>Golden Gate Bridge, Highway and Transportation District</i> `vp_only`? It should have schedule data!

In [None]:
import merge_data

In [None]:
analysis_date

In [None]:
sched_df = merge_data.concatenate_schedule_by_route_direction(analysis_date_list)

In [None]:
sched_df.loc[sched_df.schedule_gtfs_dataset_key == 'aea4108997c66a74fbdae27b34b69fde']
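
One way to see on which side a feed drops out is to compare the keys present in the schedule data against those in the vp data. This is only a sketch with placeholder keys; how `sched_rt_category` is actually assigned upstream is not shown in this notebook, so the label mapping (including `schedule_and_vp`) is an assumption.

```python
import pandas as pd

# Placeholder keys, not real dataset keys.
schedule_keys = pd.DataFrame({"schedule_gtfs_dataset_key": ["key_a", "key_b"]})
vp_keys = pd.DataFrame({"schedule_gtfs_dataset_key": ["key_b", "key_c"]})

# The _merge column records which side each key came from.
merged = schedule_keys.merge(
    vp_keys, on="schedule_gtfs_dataset_key", how="outer", indicator=True
)

# Assumed mapping from merge result to category label.
labels = {
    "left_only": "schedule_only",
    "right_only": "vp_only",
    "both": "schedule_and_vp",
}
merged["sched_rt_category"] = merged["_merge"].astype(str).map(labels)
category_by_key = merged.set_index("schedule_gtfs_dataset_key")[
    "sched_rt_category"
].to_dict()
```

If a key that should have schedule data comes out as `vp_only` in a check like this, the schedule-side table is the place to look next.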