# Metric 2: Expected Wait Time

Expected Wait Time: the time interval between the Next Predicted Route Stop Arrival Time and Next Experienced Route Stop Arrival Time for a given route_id/shape_id/stop_id and minute of the day combination.

The Google Doc approach is to create a minute-by-minute timetable that covers the entire service span for a shape.
Stack all the trips for that shape, then see which trip a rider would catch at each minute.
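
The "see which trip you catch" step can be sketched with `pd.merge_asof`, which matches each minute to the next actual arrival without building a minute-by-trip grid. This is only a minimal sketch on made-up data for a single shape-stop; the column names `minute`, `next_arrival`, and `wait_min` are hypothetical, and it covers the experienced arrival side only, not the comparison against predictions.

```python
import pandas as pd

# Made-up actual arrivals at one stop for one shape (illustrative only)
arrivals = pd.DataFrame({
    "trip_id": ["a", "b", "c"],
    "actual_stop_arrival_pacific": pd.to_datetime([
        "2023-03-15 08:05:00",
        "2023-03-15 08:20:00",
        "2023-03-15 08:50:00",
    ]),
}).sort_values("actual_stop_arrival_pacific")

# One row per minute covering the span of service for the shape
minutes = pd.DataFrame({
    "minute": pd.date_range(
        start="2023-03-15 08:00:00",
        end="2023-03-15 08:50:00",
        freq="min",
    )
})

# direction="forward" picks the first arrival at or after each minute,
# i.e. the trip a rider showing up at that minute would catch
timetable = pd.merge_asof(
    minutes,
    arrivals,
    left_on="minute",
    right_on="actual_stop_arrival_pacific",
    direction="forward",
).rename(columns={"actual_stop_arrival_pacific": "next_arrival"})

# Minutes from each timetable row to the arrival that rider would catch
timetable["wait_min"] = (
    (timetable.next_arrival - timetable.minute).dt.total_seconds() / 60
)
```

Both inputs must be sorted on their time columns for `merge_asof` to work, which is why the arrivals are sorted first.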

## Rabbit Hole 
* Instead of stacking minutes (rows) against trips (columns), 
let's just group by `shape_id-stop_id-stop_sequence-hour`? 
   * Demo this approach and see how many trips in our sample this even applies to.
   * If there are 2 trips on the same shape in the same hour, we might be interested in aggregating the expected wait time across trips this way.
   
   
Summary Levels
* Route by hour of day/day of week
* Stops by hour of day/day of week
* Route by stops
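
Each summary level above amounts to a groupby over a different set of columns. A minimal sketch on made-up data, assuming a table that already has a wait time column; the helper `summarize` and the derived columns `hour`, `day_of_week`, and `mean_wait_min` are hypothetical names used only for illustration.

```python
import pandas as pd

# Made-up rows, one per observed stop arrival (illustrative only)
df = pd.DataFrame({
    "route_id": ["r1", "r1", "r1", "r2"],
    "stop_id": ["s1", "s1", "s2", "s1"],
    "actual_stop_arrival_pacific": pd.to_datetime([
        "2023-03-15 08:05:00", "2023-03-15 08:40:00",
        "2023-03-15 09:10:00", "2023-03-15 08:15:00",
    ]),
    "wait_time_min": [2.0, 4.0, 1.0, 3.0],
})

# Derive the time-of-day and day-of-week grouping columns
df = df.assign(
    hour=df.actual_stop_arrival_pacific.dt.hour,
    day_of_week=df.actual_stop_arrival_pacific.dt.day_name(),
)


def summarize(df: pd.DataFrame, group_cols: list) -> pd.DataFrame:
    """Mean wait time for each combination of group_cols."""
    return (df.groupby(group_cols)
            .agg(mean_wait_min=("wait_time_min", "mean"))
            .reset_index())


route_by_hour = summarize(df, ["route_id", "day_of_week", "hour"])
stop_by_hour = summarize(df, ["stop_id", "day_of_week", "hour"])
route_by_stop = summarize(df, ["route_id", "stop_id"])
```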

In [1]:
import numpy as np
import pandas as pd

import chart_utils
import utils
from segment_speed_utils.project_vars import PREDICTIONS_GCS 

analysis_date = utils.analysis_date




In [2]:
df = pd.read_parquet(
    f"{PREDICTIONS_GCS}rt_sched_stop_times_{analysis_date}.parquet", 
)

df._gtfs_dataset_name.unique()

array(['Anaheim Resort TripUpdates',
       'Bay Area 511 Dumbarton Express TripUpdates',
       'Bay Area 511 Fairfield and Suisun Transit TripUpdates',
       'Santa Cruz Trip Updates', 'Bear Trip Updates'], dtype=object)

Identify a shape where multiple trips use the same stop within an hour.

In [3]:
def identify_stops_where_multiple_trips_occur_within_hour(
    df: pd.DataFrame
) -> pd.DataFrame:
    """
    For each stop-hour, count the number of unique trip_ids.
    If there are multiple, then potentially, a rider could get
    to a bus stop and there are several viable trips (buses)
    that could arrive for the rider to get on.
    """
    df = df.assign(
        hour = df.actual_stop_arrival_pacific.dt.hour
    )
    
    # NOTE: actual_stop_arrival_pacific is part of the grouping columns,
    # so trips are only counted together when they share the exact same
    # arrival timestamp, not just the same hour.
    shape_stop_cols = [
        "gtfs_dataset_key", "_gtfs_dataset_name",
        "service_date", 
        "shape_id",
        "stop_id", "stop_sequence",
        "actual_stop_arrival_pacific", 
    ]
    
    indiv_stop = df[shape_stop_cols + 
                    ["trip_id", "hour"]].drop_duplicates()

    results = (indiv_stop.groupby(shape_stop_cols + ["hour"])
               .agg({"trip_id": "nunique"})
               .reset_index()
               .rename(columns = {"trip_id": "n_trips_per_hour"})
              )
    
    df2 = pd.merge(
        df,
        results,
        on = shape_stop_cols + ["hour"],
        how = "left"
    )
    
    return df2
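
As an aside, the per-group count can also be attached without a separate groupby and merge by using `groupby(...).transform("nunique")`. The sketch below is on made-up data and groups by shape-stop-hour only, as the docstring describes, which is an assumption and differs from the grouping columns used in the function above.

```python
import pandas as pd

# Made-up stop arrivals: shape A has two trips in the 10:00 hour
df = pd.DataFrame({
    "shape_id": ["A", "A", "A", "B"],
    "stop_id": ["1", "1", "1", "1"],
    "stop_sequence": [1, 1, 1, 1],
    "trip_id": ["t1", "t2", "t3", "t4"],
    "actual_stop_arrival_pacific": pd.to_datetime([
        "2023-03-15 10:25:00", "2023-03-15 10:35:00",
        "2023-03-15 11:05:00", "2023-03-15 10:30:00",
    ]),
})

df = df.assign(hour=df.actual_stop_arrival_pacific.dt.hour)

group_cols = ["shape_id", "stop_id", "stop_sequence", "hour"]

# transform returns one value per original row, so no merge is needed
df = df.assign(
    n_trips_per_hour=df.groupby(group_cols).trip_id.transform("nunique")
)
```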
    

In [4]:
df2 = identify_stops_where_multiple_trips_occur_within_hour(df)

In [5]:
df2[df2.n_trips_per_hour >=2]._gtfs_dataset_name.value_counts()

Anaheim Resort TripUpdates    52714
Name: _gtfs_dataset_name, dtype: int64

In [6]:
df2[df2.n_trips_per_hour >=2].trip_id.nunique()

300

In [7]:
single_trips = df2[df2.n_trips_per_hour < 2].reset_index(drop=True)
multiple_trips = df2[df2.n_trips_per_hour >= 2].reset_index(drop=True)

In [8]:
def get_shape_stop_arrival_times_for_multiple_trips(
    df: pd.DataFrame
) -> pd.DataFrame:
    """
    For just the subset of data that has multiple trips for 
    shape-stop-hour combination, get an array
    of the actual_stop_arrivals. Use this to do a comparison 
    for what riders' expected wait time would be, since 
    there might be an earlier trip that came.
    """
    shape_stop_hour_cols = [
        "gtfs_dataset_key", "shape_id", 
        "stop_id", "stop_sequence", "hour"
    ]
    
    shape_stop_arrival_array = (df.groupby(shape_stop_hour_cols)
       .agg({
           "actual_stop_arrival_pacific": lambda x: list(x.unique())
       }).reset_index()
       .rename(columns = {
           "actual_stop_arrival_pacific": "actual_stop_arrival_arr"
       })
    )
    
    df2 = pd.merge(
        df,
        shape_stop_arrival_array,
        on = shape_stop_hour_cols,
        how = "inner"
    )
        
    return df2

In [9]:
def expected_wait_time_by_row(
    extract_time: pd.Timestamp, 
    stop_arrivals_array: list,
    predicted_arrival: pd.Timestamp,
) -> float:
    """
    Loop by row to make this logic clear.
    For a given _extract_ts-stop, there is a prediction.
    
    This prediction time should be compared to
       * the value closest in time within the array of possible arrivals
       * AND occur after the _extract_ts
    
    We have an hour bin, so it's possible we're asking at 10:30, but
    the possible arrivals are 10:25 and 10:35. We want to compare the 
    10:30 prediction to the actual arrival of 10:35.
    
    Returns the number of seconds from prediction to next stop arrival.
    """
    # Of the array of stop arrivals, only the ones that occur
    # after the _extract_ts are valid
    valid_stop_arrivals = np.array(
        [i for i in stop_arrivals_array 
         if i >= extract_time]
    )
    
    # For the most part, we want to reduce the array above to valid arrivals
    # but, hour cut-offs can be arbitrary, and we might reduce to zero
    if len(valid_stop_arrivals) == 0:
        valid_stop_arrivals = stop_arrivals_array
    
    # Calculate expected wait time (seconds) as difference in 
    # prediction to earliest stop arrival
    expected_wait_sec = (
        # earliest stop arrival
        np.min(valid_stop_arrivals) - predicted_arrival
    ).total_seconds() 
    
    return expected_wait_sec
    
    
def calculate_expected_wait_for_multiple_trips(
    multiple_trips: pd.DataFrame
) -> pd.DataFrame: 
    """
    Input the 3 columns needed, get a list storing the results
    of expected wait time.
    Attach it to the df.
    Is this worth the effort if, in our sample, only 1 operator came up?
    Even from that operator, only 300 trips have some shape-stops 
    with multiple trips in an hour.
    """
    expected_wait_seconds = [expected_wait_time_by_row(
        extracted_i, 
        arrivals_arr_i,
        predicted_i
    ) for extracted_i, arrivals_arr_i, predicted_i  
          in zip(
              multiple_trips._extract_ts_local, 
              multiple_trips.actual_stop_arrival_arr, 
              multiple_trips.predicted_pacific
          )
         ]
    
    expected_wait_min = [i/60 for i in expected_wait_seconds]
    multiple_trips = multiple_trips.assign(
        expected_wait_min = expected_wait_min
    )
        
    return multiple_trips
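
The 10:30 example in the `expected_wait_time_by_row` docstring can be checked on a toy input. This sketch restates the selection logic in a small stand-alone helper (`next_arrival_after`, a hypothetical name) instead of calling the function above, and all timestamps are made up. It also shows one way the fallback branch can produce a negative value.

```python
import pandas as pd


def next_arrival_after(extract_time, stop_arrivals):
    """Earliest arrival at or after extract_time.
    Falls back to the earliest arrival overall if none qualify."""
    valid = [t for t in stop_arrivals if t >= extract_time]
    if len(valid) == 0:
        valid = list(stop_arrivals)
    return min(valid)


# Arrivals within the 10:00 hour bin (made up)
arrivals = [
    pd.Timestamp("2023-03-15 10:25:00"),
    pd.Timestamp("2023-03-15 10:35:00"),
]

# Asking at 10:30 with a prediction of 10:33: compare against 10:35
extract_time = pd.Timestamp("2023-03-15 10:30:00")
predicted = pd.Timestamp("2023-03-15 10:33:00")
next_arrival = next_arrival_after(extract_time, arrivals)
expected_wait_sec = (next_arrival - predicted).total_seconds()

# Asking at 10:50: no arrival in the bin is late enough, so the
# fallback compares against 10:25 and the result is negative
late_extract = pd.Timestamp("2023-03-15 10:50:00")
late_predicted = pd.Timestamp("2023-03-15 10:55:00")
fallback_wait_sec = (
    next_arrival_after(late_extract, arrivals) - late_predicted
).total_seconds()
```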

In [10]:
multiple_trips2 = get_shape_stop_arrival_times_for_multiple_trips(
    multiple_trips)

multiple_trips3 = calculate_expected_wait_for_multiple_trips(
    multiple_trips2)

In [11]:
# Quick look at the number of expected wait minutes
multiple_trips3.expected_wait_min.describe()

count    52714.000000
mean        13.368985
std         39.414384
min       -163.716667
25%          0.000000
50%         13.433333
75%         28.550000
max        518.666667
Name: expected_wait_min, dtype: float64

### Define Functions for Metrics

If the above approach is not worth the computation, the simpler alternative is to treat wait time as the prediction error (the term metric 4 uses, even though metric 2 calls it expected wait time).

In [12]:
def calculate_experienced_wait_time(
    df: pd.DataFrame,
    stop_cols: list,
) -> pd.DataFrame:
    """
    Calculate wait time, which is actual arrival - predicted.
    If it predicts 8:00 and bus actually comes at 8:05, we want it 
    to show +5 min.
    
    Then aggregate it across every minute we have predictions
    to a stop-level mean.
    As is, is this not similar to reliable accuracy? It's basically
    accuracy in that the difference is there, but...
    """
    df = df.assign(
        # this is the same as metric 4, prediction error
        prediction_error = (
            (df.actual_stop_arrival_pacific - df.predicted_pacific)
            .dt.total_seconds())
    )
    
    df2 = (df.groupby(stop_cols)
           .agg({"prediction_error": "mean"})
           .reset_index()
    )
    
    # Is this basically the same as metric 4, 
    # but just minutes, instead of seconds? metric 4 does more after this..
    df2 = df2.assign(
        wait_time_min = df2.prediction_error.divide(60).round(1)
    )
    
    return df2
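
The 8:00 versus 8:05 case from the docstring can be traced on a toy table. This is a sketch on made-up data that repeats the same steps inline (difference in seconds, mean per trip-stop, conversion to minutes) and does not call the function above; the three predictions are hypothetical.

```python
import pandas as pd

# Three made-up predictions for one trip-stop; the bus arrives at 8:05
preds = pd.DataFrame({
    "trip_id": ["t1", "t1", "t1"],
    "stop_id": ["s1", "s1", "s1"],
    "predicted_pacific": pd.to_datetime([
        "2023-03-15 08:00:00",
        "2023-03-15 08:02:00",
        "2023-03-15 08:04:00",
    ]),
    "actual_stop_arrival_pacific": pd.to_datetime(
        ["2023-03-15 08:05:00"] * 3
    ),
})

# Positive values mean the bus arrived later than predicted
preds = preds.assign(
    prediction_error=(
        preds.actual_stop_arrival_pacific - preds.predicted_pacific
    ).dt.total_seconds()
)

# Average over all predictions for the trip-stop, then seconds to minutes
by_stop = (preds.groupby(["trip_id", "stop_id"])
           .agg({"prediction_error": "mean"})
           .reset_index())
by_stop = by_stop.assign(
    wait_time_min=by_stop.prediction_error.divide(60).round(1)
)
```

The first prediction (8:00) is off by +5 minutes on its own; averaging it with the later, closer predictions pulls the trip-stop value down.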

In [13]:
def expected_wait_time_metric(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate expected wait time by trip-stop.
    Wait time is prediction error (metric 4).
    This prediction error is averaged over every minute
    of prediction duration and converted from seconds to
    minutes. We'll call this expected wait time?
    
    TODO: shape implementation is difficult.
    """
    timestamp_col = "_extract_ts_local"
    
    all_stop_cols = [
        "gtfs_dataset_key", "_gtfs_dataset_name", 
        "service_date", 
        "shape_id", "route_id",
        "trip_id", 
        "stop_id", "stop_sequence",
        "scheduled_arrival", "actual_stop_arrival_pacific", 
    ]
    
    df2 = utils.exclude_predictions_after_actual_stop_arrival(
        df, timestamp_col)

    df3 = utils.parse_hour_min(df2, [timestamp_col])
    # minute_cols is not used below; the aggregation is by all_stop_cols only
    minute_cols = utils.minute_cols(timestamp_col)

    df4 = calculate_experienced_wait_time(
        df3, 
        all_stop_cols,
    )
    
    return df4

In [14]:
by_trip_stop = expected_wait_time_metric(df)

### Compare results

In [15]:
# expected_wait_min is still at the prediction level;
# aggregate before we compare

shape_stop_cols = [
    "gtfs_dataset_key", "_gtfs_dataset_name",
    "service_date", 
    "shape_id",
    "stop_id", "stop_sequence",
]

multiple_trips4 = (multiple_trips3.groupby(shape_stop_cols)
                   .agg({"expected_wait_min": "mean"})
                   .reset_index()
                  )

In [16]:
comparison = pd.merge(
    by_trip_stop, 
    multiple_trips4[shape_stop_cols + ["expected_wait_min"]],
    on = shape_stop_cols,
    how = "inner"
)
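
Because one side of this merge is at the trip-stop level and the other at the shape-stop level, each shape-stop mean is repeated across every trip that serves that shape-stop. A minimal sketch on made-up data showing that shape of merge, with pandas' `validate="m:1"` option added as a guard (an addition for illustration, not part of the merge above):

```python
import pandas as pd

# Trip-stop level: one row per trip at a shape-stop (made up)
trip_level = pd.DataFrame({
    "shape_id": ["A", "A", "B"],
    "stop_id": ["1", "1", "1"],
    "trip_id": ["t1", "t2", "t3"],
    "wait_time_min": [2.7, 29.6, 1.0],
})

# Shape-stop level: one row per shape-stop (made up)
shape_level = pd.DataFrame({
    "shape_id": ["A"],
    "stop_id": ["1"],
    "expected_wait_min": [14.6],
})

# validate="m:1" raises if the right side has duplicate keys,
# which would otherwise silently multiply rows
comparison = pd.merge(
    trip_level,
    shape_level,
    on=["shape_id", "stop_id"],
    how="inner",
    validate="m:1",
)
```

Here both trips on shape A receive the same `expected_wait_min`, and shape B drops out of the inner merge because it has no shape-stop row.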

In [17]:
comparison.head()

| | gtfs_dataset_key | _gtfs_dataset_name | service_date | shape_id | route_id | trip_id | stop_id | stop_sequence | scheduled_arrival | actual_stop_arrival_pacific | prediction_error | wait_time_min | expected_wait_min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 262d7b27183fa8d174ab8fc83ad5848f | Anaheim Resort TripUpdates | 2023-03-15 | 063c940e-ae42-4473-b9d8-c36083d3ec23 | 8f305689-4315-445e-abea-920dbbf0be5e | 3662f00e-b8d2-4e8b-8905-c1a5624ce879:1 | 1100 | 1.0 | 2023-03-15 19:00:00 | 2023-03-15 20:35:37 | 1773.777778 | 29.6 | 14.633382 |
| 1 | 262d7b27183fa8d174ab8fc83ad5848f | Anaheim Resort TripUpdates | 2023-03-15 | 063c940e-ae42-4473-b9d8-c36083d3ec23 | 8f305689-4315-445e-abea-920dbbf0be5e | 3662f00e-b8d2-4e8b-8905-c1a5624ce879:2 | 1100 | 1.0 | 2023-03-15 19:12:00 | 2023-03-15 20:35:37 | 161.290323 | 2.7 | 14.633382 |
| 2 | 262d7b27183fa8d174ab8fc83ad5848f | Anaheim Resort TripUpdates | 2023-03-15 | 063c940e-ae42-4473-b9d8-c36083d3ec23 | 8f305689-4315-445e-abea-920dbbf0be5e | 3662f00e-b8d2-4e8b-8905-c1a5624ce879:3 | 1100 | 1.0 | 2023-03-15 19:24:00 | 2023-03-15 21:20:06 | 1331.075269 | 22.2 | 14.633382 |
| 3 | 262d7b27183fa8d174ab8fc83ad5848f | Anaheim Resort TripUpdates | 2023-03-15 | 063c940e-ae42-4473-b9d8-c36083d3ec23 | 8f305689-4315-445e-abea-920dbbf0be5e | 3662f00e-b8d2-4e8b-8905-c1a5624ce879:4 | 1100 | 1.0 | 2023-03-15 19:36:00 | 2023-03-15 21:20:06 | 153.333333 | 2.6 | 14.633382 |
| 4 | 262d7b27183fa8d174ab8fc83ad5848f | Anaheim Resort TripUpdates | 2023-03-15 | 063c940e-ae42-4473-b9d8-c36083d3ec23 | 8f305689-4315-445e-abea-920dbbf0be5e | 3662f00e-b8d2-4e8b-8905-c1a5624ce879:5 | 1100 | 1.0 | 2023-03-15 19:48:00 | 2023-03-15 23:46:37 | 3911.328205 | 65.2 | 14.633382 |


In [18]:
(comparison.expected_wait_min <= 
 comparison.wait_time_min).value_counts()

True     465
False    343
dtype: int64

In [19]:
# When would this occur? when would comparing to the earlier arrival
# actually yield a worse result?
comparison[comparison.expected_wait_min > comparison.wait_time_min].head()

| | gtfs_dataset_key | _gtfs_dataset_name | service_date | shape_id | route_id | trip_id | stop_id | stop_sequence | scheduled_arrival | actual_stop_arrival_pacific | prediction_error | wait_time_min | expected_wait_min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 262d7b27183fa8d174ab8fc83ad5848f | Anaheim Resort TripUpdates | 2023-03-15 | 063c940e-ae42-4473-b9d8-c36083d3ec23 | 8f305689-4315-445e-abea-920dbbf0be5e | 3662f00e-b8d2-4e8b-8905-c1a5624ce879:2 | 1100 | 1.0 | 2023-03-15 19:12:00 | 2023-03-15 20:35:37 | 161.290323 | 2.7 | 14.633382 |
| 3 | 262d7b27183fa8d174ab8fc83ad5848f | Anaheim Resort TripUpdates | 2023-03-15 | 063c940e-ae42-4473-b9d8-c36083d3ec23 | 8f305689-4315-445e-abea-920dbbf0be5e | 3662f00e-b8d2-4e8b-8905-c1a5624ce879:4 | 1100 | 1.0 | 2023-03-15 19:36:00 | 2023-03-15 21:20:06 | 153.333333 | 2.6 | 14.633382 |
| 5 | 262d7b27183fa8d174ab8fc83ad5848f | Anaheim Resort TripUpdates | 2023-03-15 | 063c940e-ae42-4473-b9d8-c36083d3ec23 | 8f305689-4315-445e-abea-920dbbf0be5e | 3662f00e-b8d2-4e8b-8905-c1a5624ce879:6 | 1100 | 1.0 | 2023-03-15 20:00:00 | 2023-03-15 23:46:37 | 49.444444 | 0.8 | 14.633382 |
| 8 | 262d7b27183fa8d174ab8fc83ad5848f | Anaheim Resort TripUpdates | 2023-03-15 | 063c940e-ae42-4473-b9d8-c36083d3ec23 | 8f305689-4315-445e-abea-920dbbf0be5e | ba3f8c8d-99a9-41c0-9a3c-b4e1e26be7bb:3 | 1100 | 1.0 | 2023-03-15 19:54:00 | 2023-03-15 21:53:37 | 121.793103 | 2.0 | 14.633382 |
| 9 | 262d7b27183fa8d174ab8fc83ad5848f | Anaheim Resort TripUpdates | 2023-03-15 | 063c940e-ae42-4473-b9d8-c36083d3ec23 | 8f305689-4315-445e-abea-920dbbf0be5e | ba3f8c8d-99a9-41c0-9a3c-b4e1e26be7bb:4 | 1100 | 1.0 | 2023-03-15 20:06:00 | 2023-03-16 00:03:49 | 226.875 | 3.8 | 14.633382 |


In [20]:
cols = [
    "wait_time_min"
]

for i in by_trip_stop._gtfs_dataset_name.unique():
    display(
        chart_utils.describe_to_df(
            by_trip_stop,
            i,
            cols,
        )
    )

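| | Measure | Wait Time Min |
|---|---|---|
| 0 | Count | 1800.0 |
| 1 | Mean | 9.4 |
| 2 | Std | 16.3 |
| 3 | Min | -30.0 |
| 4 | 25% | 1.4 |
| 5 | 50% | 4.0 |
| 6 | 75% | 10.2 |
| 7 | Max | 215.0 |

| | Measure | Wait Time Min |
|---|---|---|
| 0 | Count | 1490.0 |
| 1 | Mean | 9.4 |
| 2 | Std | 14.3 |
| 3 | Min | -14.7 |
| 4 | 25% | 0.8 |
| 5 | 50% | 5.6 |
| 6 | 75% | 12.7 |
| 7 | Max | 84.4 |

| | Measure | Wait Time Min |
|---|---|---|
| 0 | Count | 514.0 |
| 1 | Mean | 1.2 |
| 2 | Std | 3.2 |
| 3 | Min | -4.9 |
| 4 | 25% | -0.5 |
| 5 | 50% | 0.7 |
| 6 | 75% | 2.4 |
| 7 | Max | 17.8 |

| | Measure | Wait Time Min |
|---|---|---|
| 0 | Count | 1289.0 |
| 1 | Mean | 2.0 |
| 2 | Std | 2.0 |
| 3 | Min | -4.5 |
| 4 | 25% | 0.5 |
| 5 | 50% | 1.6 |
| 6 | 75% | 3.0 |
| 7 | Max | 11.6 |

| | Measure | Wait Time Min |
|---|---|---|
| 0 | Count | 1201.0 |
| 1 | Mean | 2.0 |
| 2 | Std | 3.6 |
| 3 | Min | -8.1 |
| 4 | 25% | 0.0 |
| 5 | 50% | 1.6 |
| 6 | 75% | 3.6 |
| 7 | Max | 16.9 |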


In [21]:
metric_df = chart_utils.prep_df_for_chart(
    df = by_trip_stop,
    percentage_column = [],
    columns_to_round = [],
    columns_to_keep = [
        "_gtfs_dataset_name",
        "trip_id",
        "stop_id",
        "stop_sequence",
        "wait_time_min",
    ],
)

In [22]:
for i in metric_df['Gtfs Dataset Name'].unique():
    display(chart_utils.basic_scatter_plot(
    metric_df,
    operator = i,
    x_col="Stop Sequence",
    y_col="Wait Time Min",
    dropdown_col="Trip Id",
    dropdown_col_title="Trip ID",))

In [23]:
'''
# create date/min range like Google Doc?
# unwieldy to work with to look across columns of trips
idx = pd.date_range(start = test.trip_start_time.min(),
                    end = test.actual_stop_arrival_pacific.max(),
                    freq = "T"
                   )
ts = idx.to_frame(index=False, name="minute")
'''
