# Research Request - GTFS Digest: Add Rail and Ferry Operators. #1386

Tiffany's comment:
If it's just a couple of rail operators (Amtrak, Metrolink) and a handful of ferry operators, it's worth digging into why they dropped off. Start by looking for their rows in the 4 schedule tables (trips, shapes, stops, stop_times), and then look for them in a vp table.

* I think the ferry operators and Metrolink are already associated with a district. Even Amtrak might be? But if Amtrak isn't, you can create a separate "district = Amtrak" in the merged df so it always has a tab for itself. Amtrak plots for the entire country!
* District 4: San Francisco Bay Area Rapid Transit (BART), City and County of San Francisco (Muni)
* District 7: Los Angeles County Metropolitan Transportation Authority (LA Metro)
* District 11: San Diego Metropolitan Transit System
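
The "district = Amtrak" idea above could look roughly like the sketch below. The toy DataFrame, the `assign_amtrak_district` helper, and matching on the string "Amtrak" in `organization_name` are all assumptions for illustration, not code from the digest.

```python
import pandas as pd


def assign_amtrak_district(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: give Amtrak rows their own district label."""
    # Assumption: Amtrak rows can be identified by organization_name.
    is_amtrak = df["organization_name"].str.contains("Amtrak", na=False)
    return df.assign(
        caltrans_district=df["caltrans_district"].mask(is_amtrak, "Amtrak")
    )


# Made-up rows that only mimic the shape of the merged df.
merged = pd.DataFrame(
    {
        "organization_name": ["Amtrak", "City of Alameda"],
        "caltrans_district": ["03 - Marysville", "04 - Oakland"],
    }
)
merged = assign_amtrak_district(merged)
```

With its own district value, Amtrak would fall into a separate group wherever the report is split by `caltrans_district`.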

Amanda
* Ferry operator: Bay Area WETA, City of Alameda, and Golden Gate Bridge, Highway and Transportation District show up. All 3 are vp_only so they were filtered out -> incorporate them in? 
* The only ferry operator missing is Santa Cruz Harbor. 
* Amtrak is in District 3 but it has schedule_only data, which isn't true? 

* Here's a list of ferry operators in California from Evan's comment [here](https://github.com/cal-itp/data-analyses/issues/1357):
    
    * City of Alameda
    * Golden Gate
    * SF WETA
    * Santa Cruz Harbor
* **Goal**: all operators that are vp_only should also have schedule data. It is not possible to have realtime data without scheduled data. 
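
One way to check the goal above is an anti-join: take the `vp_only` keys and see which ones have no match in the schedule data. This is a minimal sketch on toy tables; the key values and the `digest` / `schedule` frames are made up, and only the column names come from this notebook.

```python
import pandas as pd

# Toy stand-ins for the digest table and a schedule table (made-up keys).
digest = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["key_a", "key_b", "key_c"],
        "sched_rt_category": ["vp_only", "schedule_and_vp", "vp_only"],
    }
)
schedule = pd.DataFrame({"schedule_gtfs_dataset_key": ["key_b", "key_c"]})

vp_only = digest.loc[digest.sched_rt_category == "vp_only"]

# indicator=True adds a `_merge` column saying where each key was found.
check = vp_only.merge(
    schedule.drop_duplicates(),
    on="schedule_gtfs_dataset_key",
    how="left",
    indicator=True,
)

# Rows marked left_only are vp_only keys with no schedule rows at all.
missing_schedule = check.loc[check["_merge"] == "left_only"]
```

Keys that land in `missing_schedule` are the ones to trace back through the schedule pipeline; keys marked `both` have schedule rows somewhere, so the `vp_only` label is likely coming from a later merge.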

In [1]:
import _section1_utils as section1
import _section2_utils as section2
import geopandas as gpd
import merge_data
import merge_operator_data
import numpy as np
import pandas as pd
from segment_speed_utils import gtfs_schedule_wrangling, helpers
from segment_speed_utils.project_vars import COMPILED_CACHED_VIEWS, PROJECT_CRS
from shared_utils import catalog_utils, portfolio_utils, rt_dates
from update_vars import GTFS_DATA_DICT, RT_SCHED_GCS, SCHED_GCS, SEGMENT_GCS

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [3]:
analysis_date_list = [rt_dates.DATES["feb2025"]]

In [4]:
analysis_date = rt_dates.DATES["feb2025"]

In [5]:
schd_vp_url = f"{GTFS_DATA_DICT.digest_tables.dir}{GTFS_DATA_DICT.digest_tables.route_schedule_vp}.parquet"

In [6]:
EXPORT = GTFS_DATA_DICT.schedule_tables.route_typologies

In [7]:
route_typologies = pd.read_parquet(f"{SCHED_GCS}{EXPORT}_{analysis_date}.parquet")

## Look at operators in `digest/schedule_vp_metrics` without any filters to see if ferry and rail operators are in here.
* Ferry operators except Bay Area Water Emergency Services (which isn't even a ferry?) aren't here.

In [8]:
schd_vp_df = pd.read_parquet(
    schd_vp_url,
    columns=[
        "schedule_gtfs_dataset_key",
        "caltrans_district",
        "organization_name",
        "name",
        "sched_rt_category",
        "service_date",
    ],
)

In [9]:
# Filter for Dec 2024 and Jan 2025
schd_vp_df2 = schd_vp_df.loc[
    (schd_vp_df.service_date == "2025-01-15")
    | (schd_vp_df.service_date == "2024-12-11")
]

In [10]:
# Drop duplicates
schd_vp_df3 = (
    schd_vp_df2[
        [
            "schedule_gtfs_dataset_key",
            "organization_name",
            "service_date",
            "sched_rt_category",
            "caltrans_district",
        ]
    ]
    .drop_duplicates(subset=["organization_name"])
    .sort_values(by=["organization_name"])
)

In [11]:
schd_vp_df3.sched_rt_category.value_counts()

schedule_and_vp    107
schedule_only       89
vp_only              2
Name: sched_rt_category, dtype: int64

In [12]:
schd_vp_df3.loc[schd_vp_df3.sched_rt_category == "vp_only"]

Unnamed: 0,schedule_gtfs_dataset_key,organization_name,service_date,sched_rt_category,caltrans_district
342768,c4092405159366c705b62df938293a4e,San Bernardino County Transportation Authority,2024-12-11,vp_only,08 - San Bernardino
342769,c4092405159366c705b62df938293a4e,Southern California Regional Rail Authority,2024-12-11,vp_only,07 - Los Angeles


### Try to see if I am categorizing these supposedly vp only operators incorrectly because I am merely dropping duplicates.
* No, they really are `vp_only`.
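
The concern here is that `drop_duplicates(subset=["organization_name"])` keeps only the first row per operator, so an operator with several categories could look `vp_only` by accident. A small sketch of that check on made-up rows (the organization names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "organization_name": ["Org A", "Org A", "Org B"],
        "sched_rt_category": ["vp_only", "schedule_and_vp", "vp_only"],
    }
)

# Keeps only the first row per operator, so Org A looks vp_only here.
first_rows = df.drop_duplicates(subset=["organization_name"])

# Compare against the full set of categories each operator ever has.
truly_vp_only = [
    org
    for org, grp in df.groupby("organization_name")
    if set(grp["sched_rt_category"]) == {"vp_only"}
]
```

Only operators whose full category set is exactly `{"vp_only"}` are kept, which is the same question the `groupby(...).agg(...)` cell below answers for the real data.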

In [13]:
vp_only_ops = list(
    schd_vp_df3.loc[
        schd_vp_df3.sched_rt_category == "vp_only"
    ].organization_name.unique()
)

In [14]:
vp_only_ops_df = schd_vp_df.loc[schd_vp_df.organization_name.isin(vp_only_ops)]

In [15]:
len(vp_only_ops)

2

In [16]:
vp_only_ops_df.groupby(["organization_name", "sched_rt_category"]).agg(
    {"service_date": "max"}
)

Unnamed: 0_level_0,Unnamed: 1_level_0,service_date
organization_name,sched_rt_category,Unnamed: 2_level_1
San Bernardino County Transportation Authority,schedule_only,NaT
San Bernardino County Transportation Authority,vp_only,2025-02-12
San Bernardino County Transportation Authority,schedule_and_vp,NaT
Southern California Regional Rail Authority,schedule_only,NaT
Southern California Regional Rail Authority,vp_only,2025-02-12
Southern California Regional Rail Authority,schedule_and_vp,NaT


### Southern California Regional Rail Authority is vehicle positions only, sort of strange.

In [17]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Rail")]

Unnamed: 0,schedule_gtfs_dataset_key,organization_name,service_date,sched_rt_category,caltrans_district
265319,ce940c5c982e7d8e9cf790028a5cd134,San Joaquin Regional Rail Commission,2024-12-11,schedule_and_vp,10 - Stockton
11406,0881af3822466784992a49f1cc57d38f,Sonoma-Marin Area Rail Transit District,2024-12-11,schedule_and_vp,04 - Oakland
342769,c4092405159366c705b62df938293a4e,Southern California Regional Rail Authority,2024-12-11,vp_only,07 - Los Angeles


In [18]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Metropolitan")]

Unnamed: 0,schedule_gtfs_dataset_key,organization_name,service_date,sched_rt_category,caltrans_district
4190,0666caf3ec1ecc96b74f4477ee4bc939,Los Angeles County Metropolitan Transportation Authority,2024-12-11,schedule_and_vp,07 - Los Angeles
207802,baeeb157e85a901e47b828ef9fe75091,San Diego Metropolitan Transit System,2024-12-11,schedule_and_vp,11 - San Diego
70754,239f3baf3dd3b9e9464f66a777f9897d,Santa Barbara Metropolitan Transit District,2024-12-11,schedule_and_vp,05 - San Luis Obispo
128713,62cae2cb469ba696ca1b29a4cd274b96,Santa Cruz Metropolitan Transit District,2024-12-11,schedule_and_vp,05 - San Luis Obispo


In [19]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Fleet")]

Unnamed: 0,schedule_gtfs_dataset_key,organization_name,service_date,sched_rt_category,caltrans_district


In [20]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Ferry")]

Unnamed: 0,schedule_gtfs_dataset_key,organization_name,service_date,sched_rt_category,caltrans_district


In [21]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Bay")]

Unnamed: 0,schedule_gtfs_dataset_key,organization_name,service_date,sched_rt_category,caltrans_district
252640,c85fc19ac90c75c242a4955d294091a6,City of Morro Bay,2024-12-11,schedule_only,05 - San Luis Obispo
146390,749380f1a9f225d9123762d83ea2f50d,Mission Bay Transportation Management Agency,2024-12-11,schedule_only,04 - Oakland
169152,8a1405af8da1379acc062e346187ac98,San Francisco Bay Area Rapid Transit District,2024-12-11,schedule_only,04 - Oakland
166080,82f30e22dafe8156367297eb9a316c57,San Francisco Bay Area Water Emergency Transit Authority,2025-01-15,schedule_and_vp,04 - Oakland


In [22]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Alameda")]

Unnamed: 0,schedule_gtfs_dataset_key,organization_name,service_date,sched_rt_category,caltrans_district
240978,c499f905e33929a641f083dad55c521e,Alameda-Contra Costa Transit District,2024-12-11,schedule_and_vp,04 - Oakland
166079,82f30e22dafe8156367297eb9a316c57,City of Alameda,2025-01-15,schedule_and_vp,04 - Oakland


In [23]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Golden")]

Unnamed: 0,schedule_gtfs_dataset_key,organization_name,service_date,sched_rt_category,caltrans_district
106132,4c105bd9f414afe82dba2c3687cc1d88,Golden Empire Transit District,2024-12-11,schedule_and_vp,06 - Fresno
203465,aea4108997c66a74fbdae27b34b69fde,"Golden Gate Bridge, Highway and Transportation District",2025-01-15,schedule_and_vp,04 - Oakland


In [24]:
schd_vp_df3.loc[schd_vp_df3.organization_name.str.contains("Santa Cruz")]

Unnamed: 0,schedule_gtfs_dataset_key,organization_name,service_date,sched_rt_category,caltrans_district
204278,b34f8d2270968f55f23f80b267df1d5f,City of Santa Cruz,2024-12-11,schedule_only,05 - San Luis Obispo
128713,62cae2cb469ba696ca1b29a4cd274b96,Santa Cruz Metropolitan Transit District,2024-12-11,schedule_and_vp,05 - San Luis Obispo
204277,b34f8d2270968f55f23f80b267df1d5f,"University of California, Santa Cruz",2024-12-11,schedule_only,05 - San Luis Obispo


## Look at ferry operators and see how to incorporate them
* San Francisco Bay Area Water Emergency Transit Authority
* City of Alameda
* Golden Gate Bridge, Highway and Transportation District

### City of Alameda

In [25]:
city_of_alameda_df = pd.read_parquet(
    schd_vp_url,
    filters=[
        [
            ("organization_name", "==", "City of Alameda"),
        ]
    ],
)

In [26]:
city_of_alameda_df[
    [
        "route_primary_direction",
        "route_long_name",
        "route_short_name",
        "route_combined_name",
        "route_id",
        "typology",
    ]
].drop_duplicates()

Unnamed: 0,route_primary_direction,route_long_name,route_short_name,route_combined_name,route_id,typology
166079,Eastbound,Harbor Bay,HB,HB Harbor Bay,HB,unknown
166091,Westbound,Harbor Bay,HB,HB Harbor Bay,HB,unknown
166103,Eastbound,Oakland & Alameda,OA,OA Oakland & Alameda,OA,unknown
166115,Westbound,Oakland & Alameda,OA,OA Oakland & Alameda,OA,unknown
166127,Northbound,Oakland Alameda Water Shuttle,OAS,OAS Oakland Alameda Water Shuttle,OAS,unknown
166139,Southbound,Oakland Alameda Water Shuttle,OAS,OAS Oakland Alameda Water Shuttle,OAS,unknown
166151,Northbound,Richmond,RCH,RCH Richmond,RCH,unknown
166163,Southbound,Richmond,RCH,RCH Richmond,RCH,unknown
166175,Eastbound,Alameda Seaplane,SEA,SEA Alameda Seaplane,SEA,unknown
166187,Westbound,Alameda Seaplane,SEA,SEA Alameda Seaplane,SEA,unknown


In [27]:
city_of_alameda_df.schedule_gtfs_dataset_key.unique()

array(['82f30e22dafe8156367297eb9a316c57'], dtype=object)

In [28]:
city_of_alameda_df.columns

Index(['schedule_gtfs_dataset_key', 'direction_id', 'time_period',
       'avg_scheduled_service_minutes', 'avg_stop_miles', 'n_scheduled_trips',
       'frequency', 'is_express', 'is_rapid', 'is_rail', 'is_coverage',
       'is_downtown_local', 'is_local', 'service_date', 'typology',
       'minutes_atleast1_vp', 'minutes_atleast2_vp',
       'total_rt_service_minutes', 'total_scheduled_service_minutes',
       'total_vp', 'vp_in_shape', 'is_early', 'is_ontime', 'is_late',
       'n_vp_trips', 'vp_per_minute', 'pct_in_shape',
       'pct_rt_journey_atleast1_vp', 'pct_rt_journey_atleast2_vp',
       'pct_sched_journey_atleast1_vp', 'pct_sched_journey_atleast2_vp',
       'rt_sched_journey_ratio', 'avg_rt_service_minutes', 'sched_rt_category',
       'speed_mph', 'route_long_name', 'route_short_name',
       'route_combined_name', 'route_id', 'base64_url',
       'organization_source_record_id', 'organization_name',
       'caltrans_district', 'route_primary_direction', 'name',
       '

#### No ferry typologies.

In [29]:
route_typologies.loc[
    route_typologies.schedule_gtfs_dataset_key == "82f30e22dafe8156367297eb9a316c57"
]

Unnamed: 0,schedule_gtfs_dataset_key,name,route_type,route_id,route_long_name,route_short_name,combined_name,is_express,is_rapid,is_rail,is_local,direction_id,common_shape_id,route_name,route_meters,is_coverage,is_downtown_local


### San Francisco Bay Area Water Emergency Transit Authority
* Duplicates City of Alameda data except for Oyster Bay.

In [30]:
weta_df = pd.read_parquet(
    schd_vp_url,
    filters=[
        [
            (
                "organization_name",
                "==",
                "San Francisco Bay Area Water Emergency Transit Authority",
            ),
        ]
    ],
)

In [31]:
weta_df[
    [
        "route_primary_direction",
        "route_long_name",
        "route_short_name",
        "route_combined_name",
        "route_id",
        "typology",
    ]
].drop_duplicates()

Unnamed: 0,route_primary_direction,route_long_name,route_short_name,route_combined_name,route_id,typology
166080,Eastbound,Harbor Bay,HB,HB Harbor Bay,HB,unknown
166092,Westbound,Harbor Bay,HB,HB Harbor Bay,HB,unknown
166104,Eastbound,Oakland & Alameda,OA,OA Oakland & Alameda,OA,unknown
166116,Westbound,Oakland & Alameda,OA,OA Oakland & Alameda,OA,unknown
166128,Northbound,Oakland Alameda Water Shuttle,OAS,OAS Oakland Alameda Water Shuttle,OAS,unknown
166140,Southbound,Oakland Alameda Water Shuttle,OAS,OAS Oakland Alameda Water Shuttle,OAS,unknown
166152,Northbound,Richmond,RCH,RCH Richmond,RCH,unknown
166164,Southbound,Richmond,RCH,RCH Richmond,RCH,unknown
166176,Eastbound,Alameda Seaplane,SEA,SEA Alameda Seaplane,SEA,unknown
166188,Westbound,Alameda Seaplane,SEA,SEA Alameda Seaplane,SEA,unknown


### Golden Gate
* Only Bus Routes.
* This should be schedule too? 

In [32]:
goldengate_df = pd.read_parquet(
    schd_vp_url,
    filters=[
        [
            (
                "organization_name",
                "==",
                "Golden Gate Bridge, Highway and Transportation District",
            ),
        ]
    ],
)

In [33]:
goldengate_df.sched_rt_category.value_counts()

vp_only            1163
schedule_and_vp     124
schedule_only        48
Name: sched_rt_category, dtype: int64

In [34]:
goldengate_df.columns

Index(['schedule_gtfs_dataset_key', 'direction_id', 'time_period',
       'avg_scheduled_service_minutes', 'avg_stop_miles', 'n_scheduled_trips',
       'frequency', 'is_express', 'is_rapid', 'is_rail', 'is_coverage',
       'is_downtown_local', 'is_local', 'service_date', 'typology',
       'minutes_atleast1_vp', 'minutes_atleast2_vp',
       'total_rt_service_minutes', 'total_scheduled_service_minutes',
       'total_vp', 'vp_in_shape', 'is_early', 'is_ontime', 'is_late',
       'n_vp_trips', 'vp_per_minute', 'pct_in_shape',
       'pct_rt_journey_atleast1_vp', 'pct_rt_journey_atleast2_vp',
       'pct_sched_journey_atleast1_vp', 'pct_sched_journey_atleast2_vp',
       'rt_sched_journey_ratio', 'avg_rt_service_minutes', 'sched_rt_category',
       'speed_mph', 'route_long_name', 'route_short_name',
       'route_combined_name', 'route_id', 'base64_url',
       'organization_source_record_id', 'organization_name',
       'caltrans_district', 'route_primary_direction', 'name',
       '

In [35]:
goldengate_df.schedule_gtfs_dataset_key.unique()

array(['aea4108997c66a74fbdae27b34b69fde',
       'ca270cd1ac30a9ec5336a11bc9223c41'], dtype=object)

In [36]:
goldengate_df[
    [
        "route_primary_direction",
        "route_long_name",
        "route_short_name",
        "route_combined_name",
        "route_id",
        "typology",
    ]
].drop_duplicates()

Unnamed: 0,route_primary_direction,route_long_name,route_short_name,route_combined_name,route_id,typology
203465,Northbound,Santa Rosa - San Francisco,101,101 Santa Rosa - San Francisco,101,rapid
203471,Southbound,Santa Rosa - San Francisco,101,101 Santa Rosa - San Francisco,101,rapid
203477,Northbound,Mill Valley - San Francisco,114,114 Mill Valley - San Francisco,114,rapid
203483,Southbound,Mill Valley - San Francisco,114,114 Mill Valley - San Francisco,114,downtown_local
203489,Northbound,San Rafael - San Francisco,130,130 San Rafael - San Francisco,130,rapid
203495,Southbound,San Rafael - San Francisco,130,130 San Rafael - San Francisco,130,downtown_local
203501,Westbound,San Anselmo - San Francisco,132,132 San Anselmo - San Francisco,132,rapid
203505,Southbound,San Anselmo - San Francisco,132,132 San Anselmo - San Francisco,132,downtown_local
203511,Northbound,San Rafael - San Francisco,150,150 San Rafael - San Francisco,150,rapid
203517,Southbound,San Rafael - San Francisco,150,150 San Rafael - San Francisco,150,downtown_local


## Go back to schedule portion of `merge_data` and see why these vp_only operators are being dropped.

In [37]:
sched_data = merge_data.concatenate_schedule_by_route_direction(analysis_date_list)

In [38]:
sched_data.head(1)

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,route_primary_direction,avg_scheduled_service_minutes,avg_stop_miles,n_scheduled_trips,frequency,is_express,is_rapid,is_rail,is_coverage,is_downtown_local,is_local,service_date
0,0139b1253130b33adcd4b3a4490530d2,0177a66b-9f33-407d-a72e-776429fb73d4,0.0,all_day,Eastbound,63.5,2.23,2,0.08,0.0,0.0,0.0,1.0,0.0,0.0,2025-02-12


In [39]:
schd_vp_df2.columns

Index(['schedule_gtfs_dataset_key', 'caltrans_district', 'organization_name',
       'name', 'sched_rt_category', 'service_date'],
      dtype='object')

In [40]:
vp_only_sched_keys = list(vp_only_ops_df.schedule_gtfs_dataset_key.unique())

In [41]:
sched_data2 = sched_data.loc[
    sched_data.schedule_gtfs_dataset_key.isin(vp_only_sched_keys)
]

In [42]:
sched_data2

Unnamed: 0,schedule_gtfs_dataset_key,route_id,direction_id,time_period,route_primary_direction,avg_scheduled_service_minutes,avg_stop_miles,n_scheduled_trips,frequency,is_express,is_rapid,is_rail,is_coverage,is_downtown_local,is_local,service_date


### Digging into `gtfs_funnel/schedule_stats_by_route_direction.py`

In [43]:
import sys

sys.path.append("../gtfs_funnel/")
import schedule_stats_by_route_direction

#### Line 203

In [44]:
trip_metrics = schedule_stats_by_route_direction.assemble_scheduled_trip_metrics(
    analysis_date, GTFS_DATA_DICT
)

In [45]:
trip_metrics = trip_metrics.loc[
    trip_metrics.schedule_gtfs_dataset_key.isin(vp_only_sched_keys)
]

In [46]:
trip_metrics.head(1)

Unnamed: 0,schedule_gtfs_dataset_key,trip_instance_key,median_stop_meters,time_of_day,scheduled_service_minutes,route_id,direction_id
80393,c4092405159366c705b62df938293a4e,01dc5389671b18ec6c00c52306792d6b,28819.73,Midday,106.0,San Bernardino Line,1.0


In [47]:
len(vp_only_sched_keys)

2

#### Some operators have duplicate schedule_gtfs_dataset_keys

In [48]:
trip_sched_keys = set(list(trip_metrics.schedule_gtfs_dataset_key.unique()))

In [49]:
type(trip_sched_keys)

set

In [50]:
og = set(vp_only_sched_keys)

In [51]:
type(og)

set

In [52]:
og - trip_sched_keys

{'759ad28de7d4bb8b2bf9bb7d83655100'}

#### Some `schedule_gtfs_dataset_key` values are repeated for two different organizations
* SF Bay Area WETA = has two different keys
* SoCal Regional Rail Authority = also has two keys
* City of Alameda & SF WETA share the same key. 
* San Bernardino & Southern California Regional Rail share the same key.
* Maybe need to drop duplicates by schedule_gtfs_dataset_key as well. 
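
The key/organization overlap described above can be counted in both directions with `nunique`. This is a sketch on made-up pairs that mimic the pattern (one key shared by two organizations, one organization with two keys); the key and organization names are placeholders.

```python
import pandas as pd

pairs = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["key_1", "key_2", "key_2"],
        "organization_name": ["Rail Org", "County Org", "Rail Org"],
    }
).drop_duplicates()

# How many organizations point at each key, and how many keys each org has.
orgs_per_key = pairs.groupby("schedule_gtfs_dataset_key")["organization_name"].nunique()
keys_per_org = pairs.groupby("organization_name")["schedule_gtfs_dataset_key"].nunique()

shared_keys = orgs_per_key.loc[orgs_per_key > 1].index.tolist()
multi_key_orgs = keys_per_org.loc[keys_per_org > 1].index.tolist()
```

If `shared_keys` is non-empty, dropping duplicates on `organization_name` alone will still leave the same feed counted more than once.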

In [53]:
vp_only_ops_df[["schedule_gtfs_dataset_key", "organization_name"]].drop_duplicates()

Unnamed: 0,schedule_gtfs_dataset_key,organization_name
338765,759ad28de7d4bb8b2bf9bb7d83655100,Southern California Regional Rail Authority
342768,c4092405159366c705b62df938293a4e,San Bernardino County Transportation Authority
342769,c4092405159366c705b62df938293a4e,Southern California Regional Rail Authority


In [54]:
trip_metrics.direction_id = trip_metrics.direction_id.fillna(0)

In [55]:
len(trip_metrics)

202

### Why doesn't anything show up with the function `schedule_metrics_by_route_direction`?

In [56]:
# Assumed to match `group_merge_cols` used in the "Line 148" cells below
route_group_merge_cols = ["schedule_gtfs_dataset_key", "route_id", "direction_id"]

route_dir_metrics = (
    schedule_stats_by_route_direction.schedule_metrics_by_route_direction(
        trip_metrics, analysis_date, route_group_merge_cols
    )
)

In [None]:
route_dir_metrics.head(1)

#### Line 148

In [None]:
group_merge_cols = ["schedule_gtfs_dataset_key", "route_id", "direction_id"]

In [None]:
service_freq_df = gtfs_schedule_wrangling.aggregate_time_of_day_to_peak_offpeak(
        trip_metrics, group_merge_cols, long_or_wide="long"
    )

In [None]:
service_freq_df.schedule_gtfs_dataset_key.unique()

In [None]:
metrics_df = (
        trip_metrics.groupby(group_merge_cols, observed=True, group_keys=False, dropna=False)
        .agg(
            {
                "median_stop_meters": "mean",
                # take mean of the median stop spacing for trip
                # does this make sense?
                # median is the single boiled down metric at the trip-level
                "scheduled_service_minutes": "mean",
            }
        )
        .reset_index()
        .rename(
            columns={
                "median_stop_meters": "avg_stop_meters",
                "scheduled_service_minutes": "avg_scheduled_service_minutes",
            }
        )
    )

In [None]:
metrics_df.schedule_gtfs_dataset_key.unique()

In [None]:
from shared_utils.rt_utils import METERS_PER_MILE

In [None]:
metrics_df = metrics_df.assign(
        avg_stop_miles=metrics_df.avg_stop_meters.divide(METERS_PER_MILE).round(2)
    ).drop(columns=["avg_stop_meters"])

round_me = ["avg_stop_miles", "avg_scheduled_service_minutes"]
metrics_df[round_me] = metrics_df[round_me].round(2)

#### Line 179 = where the routes of interest are getting dropped.
* Remove `.pipe(helpers.remove_shapes_outside_ca)` to bring the ferry operators back in.
* I think shapes that run over water are treated as falling outside the California boundary, so they get filtered out.
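
To illustrate the hunch above: if the filter keeps only shapes that sit entirely inside a land boundary, a route that crosses water fails even though it starts on land. The sketch below is pure Python with a made-up rectangular "land" box; I have not confirmed that `remove_shapes_outside_ca` uses a strict within-style test, so treat that as an assumption.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# Hypothetical land boundary as a box: (min_x, min_y, max_x, max_y).
LAND_BOX = (0.0, 0.0, 10.0, 10.0)


def point_on_land(point: Point) -> bool:
    x, y = point
    min_x, min_y, max_x, max_y = LAND_BOX
    return min_x <= x <= max_x and min_y <= y <= max_y


def fully_within(route: List[Point]) -> bool:
    """Strict filter: every vertex must be on land."""
    return all(point_on_land(p) for p in route)


def touches_land(route: List[Point]) -> bool:
    """Looser filter: at least one vertex is on land."""
    return any(point_on_land(p) for p in route)


bus_route = [(1.0, 1.0), (5.0, 5.0)]
# Starts at the shoreline, then leaves the box over "water".
ferry_route = [(9.0, 5.0), (12.0, 5.0), (15.0, 5.0)]
```

Under the strict test the ferry route is dropped; under the looser one it survives. A looser predicate might be an alternative to deleting the filter outright, if out-of-state shapes still need to be excluded.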

In [None]:
common_shape = gtfs_schedule_wrangling.most_common_shape_by_route_direction(
        analysis_date
    ).pipe(helpers.remove_shapes_outside_ca)


In [None]:
common_shape = common_shape.loc[
    common_shape.schedule_gtfs_dataset_key.isin(vp_only_sched_keys)
]

In [None]:
common_shape.schedule_gtfs_dataset_key.nunique()

In [None]:
common_shape2 = gtfs_schedule_wrangling.most_common_shape_by_route_direction(
        analysis_date
    )

In [None]:
common_shape2 = common_shape2.loc[
    common_shape2.schedule_gtfs_dataset_key.isin(vp_only_sched_keys)
]

In [None]:
len(common_shape2)

In [None]:
common_shape2.explore("schedule_gtfs_dataset_key")

## Seeing which graphs are vp_only using 

In [None]:
stop

In [None]:
import _report_utils
import altair as alt
import yaml

In [None]:
with open("readable.yml") as f:
    readable_dict = yaml.safe_load(f)

In [None]:
with open("color_palettes.yml") as f:
    color_dict = yaml.safe_load(f)

In [None]:
df = weta_df.copy()

In [None]:
# Round float columns
float_columns = df.select_dtypes(include=["float"])
for i in float_columns:
    df[i] = df[i].round(2)

# Multiply percent columns by 100
pct_cols = df.columns[df.columns.str.contains("pct")].tolist()
for i in pct_cols:
    df[i] = df[i] * 100

In [None]:
# Add column to create rulers for the charts
df["ruler_100_pct"] = 100
df["ruler_for_vp_per_min"] = 2

# Add a column that flips frequency to be every X minutes instead
# of every hour.
df["headway_in_minutes"] = 60 / df.frequency

In [None]:
df.route_primary_direction = df.route_primary_direction.fillna("None")

In [None]:
df = _report_utils.replace_column_names(df)

In [None]:
routes_list = df["Route"].unique().tolist()

route_dropdown = alt.binding_select(
    options=routes_list,
    name="Routes: ",
)
# Column that controls the bar charts
xcol_param = alt.selection_point(
    fields=["Route"], value=routes_list[0], bind=route_dropdown
)

# Filter for only rows that are "all day" statistics
all_day = df.loc[df["Period"] == "all_day"].reset_index(drop=True)

In [None]:
timeliness_df = section2.timeliness_trips(df)

In [None]:
timeliness_df.head(2)

In [None]:
def pct_vp_journey(df: pd.DataFrame, col1: str, col2: str) -> pd.DataFrame:
    """
    Reshape the data for the charts that display the % of
    a journey that recorded 2+ vehicle positions/minute.
    """
    to_keep = [
        "Date",
        "Organization",
        "dir_0_1",
        col1,
        col2,
        "Route",
        "Period",
        "ruler_100_pct",
    ]
    df2 = df[to_keep]

    df3 = df2.melt(
        id_vars=[
            "Date",
            "Organization",
            "Route",
            "dir_0_1",
            "Period",
            "ruler_100_pct",
        ],
        value_vars=[col1, col2],
    )

    df3 = df3.rename(
        columns={"variable": "Category", "value": "% of Actual Trip Minutes"}
    )
    return df3

In [None]:
sched_journey_vp = pct_vp_journey(
    all_day,
    "% Scheduled Trip w/ 1+ VP/Minute",
    "% Scheduled Trip w/ 2+ VP/Minute",
)

In [None]:
sched_journey_vp.head(2)

In [None]:
route_stats_df = section2.route_stats(df)

In [None]:
route_stats_df.head(2)

## Build this into a function

In [None]:
def load_vp_metrics(organization: str) -> pd.DataFrame:
    """
    Load schedule versus realtime file.
    """
    schd_vp_url = f"{GTFS_DATA_DICT.digest_tables.dir}{GTFS_DATA_DICT.digest_tables.route_schedule_vp}.parquet"

    # Load all rows for this organization, regardless of sched_rt_category
    df = pd.read_parquet(
        schd_vp_url,
        filters=[
            [
                ("organization_name", "==", organization),
            ]
        ],
    )

    # Delete duplicates
    df = df.drop_duplicates().reset_index(drop=True)

    # Round float columns
    float_columns = df.select_dtypes(include=["float"])
    for i in float_columns:
        df[i] = df[i].round(2)

    # Multiply percent columns by 100
    pct_cols = df.columns[df.columns.str.contains("pct")].tolist()
    for i in pct_cols:
        df[i] = df[i] * 100

    # Add column to create rulers for the charts
    df["ruler_100_pct"] = 100
    df["ruler_for_vp_per_min"] = 2

    # Add a column that flips frequency to be every X minutes instead
    # of every hour.
    df["headway_in_minutes"] = 60 / df.frequency

    # Replace missing values in route_primary_direction
    df.route_primary_direction = df.route_primary_direction.fillna(df.direction_id)

    # Replace column names
    df = _report_utils.replace_column_names(df)

    return df

In [None]:
dumbardton_df = load_vp_metrics("Dumbarton Bridge Regional Operations Consortium")

In [None]:
dumbardton_df.head(1)

In [None]:
socal_rail_df = load_vp_metrics("Southern California Regional Rail Authority")

In [None]:
gg_df = load_vp_metrics("Golden Gate Bridge, Highway and Transportation District")

In [None]:
def filtered_route(
    df: pd.DataFrame,
) -> alt.Chart:
    """
    This combines all the charts together, controlled by a single
    dropdown.

    Resources:
        https://stackoverflow.com/questions/58919888/multiple-selections-in-altair
    """
    # Create dropdown
    routes_list = df["Route"].unique().tolist()

    route_dropdown = alt.binding_select(
        options=routes_list,
        name="Routes: ",
    )
    # Column that controls the bar charts
    xcol_param = alt.selection_point(
        fields=["Route"], value=routes_list[0], bind=route_dropdown
    )

    # Filter for only rows that are "all day" statistics
    all_day = df.loc[df["Period"] == "all_day"].reset_index(drop=True)

    # Manipulate the df for some of the metrics
    timeliness_df = section2.timeliness_trips(df)
    sched_journey_vp = section2.pct_vp_journey(
        all_day,
        "% Scheduled Trip w/ 1+ VP/Minute",
        "% Scheduled Trip w/ 2+ VP/Minute",
    )
    route_stats_df = section2.route_stats(df)

    # Create the charts
    timeliness_trips_dir_0 = (
        (
            section2.base_facet_chart(
                timeliness_df.loc[timeliness_df["dir_0_1"] == 0],
                0,
                "value",
                "variable",
                "Period",
                readable_dict["timeliness_trips_graph"]["title"],
                readable_dict["timeliness_trips_graph"]["subtitle"],
                color_dict["tri_color2"],
            )
        )
        .add_params(xcol_param)
        .transform_filter(xcol_param)
    )
    timeliness_trips_dir_1 = (
        (
            section2.base_facet_chart(
                timeliness_df.loc[timeliness_df["dir_0_1"] == 1],
                1,
                "value",
                "variable",
                "Period",
                readable_dict["timeliness_trips_graph"]["title"],
                "",
                color_dict["tri_color2"],
            )
        )
        .add_params(xcol_param)
        .transform_filter(xcol_param)
    )

    speed_graph_dir_0 = (
        section2.grouped_bar_chart(
            df.loc[df.dir_0_1 == 0],
            "Period",
            "Speed (MPH)",
            "Period",
            readable_dict["speed_graph_dir_0"]["title"],
            readable_dict["speed_graph_dir_0"]["subtitle"],
            color_dict["tri_color2"],
        )
        .add_params(xcol_param)
        .transform_filter(xcol_param)
    )
    speed_graph_dir_1 = (
        section2.grouped_bar_chart(
            df.loc[df.dir_0_1 == 1],
            "Period",
            "Speed (MPH)",
            "Period",
            readable_dict["speed_graph_dir_1"]["title"],
            readable_dict["speed_graph_dir_0"]["subtitle"],
            color_dict["tri_color2"],
        )
        .add_params(xcol_param)
        .transform_filter(xcol_param)
    )
    vp_per_min_graph = (
        (
            section2.base_facet_with_ruler_chart(
                all_day,
                "Average VP per Minute",
                "ruler_for_vp_per_min",
                readable_dict["vp_per_min_graph"]["title"],
                readable_dict["vp_per_min_graph"]["subtitle"],
                color_dict["vp_domain"],
                color_dict["vp_range"],
            )
        )
        .add_params(xcol_param)
        .transform_filter(xcol_param)
    )

    sched_vp_per_min = (
        section2.base_facet_circle(
            sched_journey_vp,
            "% of Actual Trip Minutes",
            "Category",
            "ruler_100_pct",
            readable_dict["sched_vp_per_min_graph"]["title"],
            readable_dict["sched_vp_per_min_graph"]["subtitle"],
            color_dict["tri_color2"],
        )
        .add_params(xcol_param)
        .transform_filter(xcol_param)
    )
    spatial_accuracy = (
        section2.base_facet_with_ruler_chart(
            all_day,
            "% VP within Scheduled Shape",
            "ruler_100_pct",
            readable_dict["spatial_accuracy_graph"]["title"],
            readable_dict["spatial_accuracy_graph"]["subtitle"],
            color_dict["spatial_accuracy_domain"],
            color_dict["spatial_accuracy_range"],
        )
        .add_params(xcol_param)
        .transform_filter(xcol_param)
    )
    # Separate out the charts thematically.
    ride_quality = section2.divider_chart(
        df, readable_dict["ride_quality_graph"]["title"]
    )
    data_quality = section2.divider_chart(
        df, readable_dict["data_quality_graph"]["title"]
    )

    # Combine all the charts
    chart_list = [
        ride_quality,
        timeliness_trips_dir_0,
        timeliness_trips_dir_1,
        speed_graph_dir_0,
        speed_graph_dir_1,
        data_quality,
        vp_per_min_graph,
        sched_vp_per_min,
        spatial_accuracy,
    ]

    chart = alt.vconcat(*chart_list)

    return chart

In [None]:
filtered_route(gg_df)

In [None]:
filtered_route(socal_rail_df)

In [None]:
filtered_route(dumbardton_df)

### `Average Scheduled Minutes` chart doesn't work.

In [None]:
(
    (
        section2.base_facet_chart(
            timeliness_df.loc[timeliness_df["dir_0_1"] == 1],
            1,
            "value",
            "variable",
            "Period",
            readable_dict["timeliness_trips_graph"]["title"],
            "",
        )
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

In [None]:
df.headway_in_minutes = df.headway_in_minutes.fillna(0)
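
Since `headway_in_minutes` is computed as `60 / frequency`, a missing frequency gives `NaN` and a zero frequency gives `inf`. My guess is that `vp_only` rows have no scheduled frequency, which would explain the gaps being filled here. A toy sketch of that behavior (the values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"frequency": [2.0, 0.0, np.nan]})
toy["headway_in_minutes"] = 60 / toy["frequency"]

# Treat infinite headways (zero frequency) as missing, then count the gaps.
toy["headway_in_minutes"] = toy["headway_in_minutes"].replace(
    [np.inf, -np.inf], np.nan
)
n_missing = int(toy["headway_in_minutes"].isna().sum())
```

Filling with 0 makes the chart render, but a headway of 0 minutes reads as "continuous service", so leaving the value missing or dropping those rows from the frequency chart may be the safer choice.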

### `Frequency` doesn't work.

In [None]:
(
    section2.frequency_chart(
        df,
        0,
        readable_dict["frequency_graph"]["title"],
        readable_dict["frequency_graph"]["subtitle"],
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

#### `speed` also doesn't work.

In [None]:
(
    section2.grouped_bar_chart(
        df.loc[df.dir_0_1 == 0],
        "Period",
        "Speed (MPH)",
        "Period",
        readable_dict["speed_graph_dir_0"]["title"],
        readable_dict["speed_graph_dir_0"]["subtitle"],
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

In [None]:
all_day.head(1).T

In [None]:
(
    (
        section2.base_facet_with_ruler_chart(
            all_day.loc[all_day.dir_0_1 == 0],
            "Average VP per Minute",
            "ruler_for_vp_per_min",
            readable_dict["vp_per_min_graph"]["title"],
            readable_dict["vp_per_min_graph"]["subtitle"],
            color_dict["vp_domain"],
            color_dict["vp_range"],
        )
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

In [None]:
(
    (
        section2.base_facet_with_ruler_chart(
            all_day.loc[all_day.dir_0_1 == 1],
            "Average VP per Minute",
            "ruler_for_vp_per_min",
            readable_dict["vp_per_min_graph"]["title"],
            readable_dict["vp_per_min_graph"]["subtitle"],
            color_dict["vp_domain"],
            color_dict["vp_range"],
        )
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

In [None]:
sched_journey_vp.columns

In [None]:
sched_journey_vp = sched_journey_vp.rename(columns={"dir_0_1": "Direction"})

In [None]:
(
    section2.base_facet_circle(
        sched_journey_vp,
        "% of Actual Trip Minutes",
        "Category",
        "ruler_100_pct",
        readable_dict["sched_vp_per_min_graph"]["title"],
        readable_dict["sched_vp_per_min_graph"]["subtitle"],
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

#### The bars are stacked because the direction 0/1 are coded as "None" in `route_primary_direction`
* Need to drop Direction and rename `dir_0_1` as Direction.

In [None]:
all_day = all_day.drop(columns=["Direction"]).rename(columns={"dir_0_1": "Direction"})

In [None]:
(
    section2.base_facet_with_ruler_chart(
        all_day,
        "% VP within Scheduled Shape",
        "ruler_100_pct",
        readable_dict["spatial_accuracy_graph"]["title"],
        readable_dict["spatial_accuracy_graph"]["subtitle"],
        color_dict["spatial_accuracy_domain"],
        color_dict["spatial_accuracy_range"],
    )
    .add_params(xcol_param)
    .transform_filter(xcol_param)
)

## Why is <i>Golden Gate Bridge, Highway and Transportation District</i> `vp_only`? It should have schedule data!

In [None]:
import merge_data

In [None]:
analysis_date

In [None]:
sched_df = merge_data.concatenate_schedule_by_route_direction(analysis_date_list)

In [None]:
sched_df.loc[sched_df.schedule_gtfs_dataset_key == "aea4108997c66a74fbdae27b34b69fde"]