# Metric 3: Prediction Inconsistency

The "jitter" boils down to a slope: how much the predicted arrival at a stop changes per minute of elapsed time, measured over the entire prediction duration until the bus reaches that stop.

## Rabbit Hole
* If we drop predictions made before the `trip_start_time`, we lose a large fraction of the rows (in the run below, 1,189,614 rows drop to 506,694).
* Some trips barely update their predicted arrival, and since those predictions start well before the trip does, we'd be penalizing them simply because the clock is ticking (20 min before trip start, then 19, then 18) while the predicted arrival at this stop stays the same.
   * Maybe we handle the first stop differently, or exclude it completely.
   * For subsequent stops, especially ones in the middle of a trip, predictions are unlikely to change until the trip starts. Should we penalize them because the clock is ticking down before trip start while the prediction isn't changing (because the bus isn't moving)?
   * Right now, we keep only predictions made after the `trip_start_time`.
* The current implementation skips the per stop-minute calculation and just sums `abs(actual_change (minutes) - expected_change (minutes))` for each stop, but it is still going to be computationally intensive because of the groupby-shifts.
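
To make the concern about pre-trip-start predictions concrete, here is a minimal sketch on made-up data. The column names mirror the notebook, but the values and the single `stop_id` grouping column are hypothetical. It shows that a prediction that simply holds steady accrues one minute of "inconsistency" for every minute that passes.

```python
import pandas as pd

# Hypothetical toy data: one stop, one prediction per minute, starting
# 3 minutes before trip start (min_since_start = -3).
toy = pd.DataFrame({
    "stop_id": ["A"] * 6,
    "min_since_start": [-3, -2, -1, 0, 1, 2],
    # The predicted arrival does not move until after the trip starts.
    "predicted_pacific": pd.to_datetime([
        "2023-03-15 09:10:00", "2023-03-15 09:10:00",
        "2023-03-15 09:10:00", "2023-03-15 09:10:00",
        "2023-03-15 09:11:00", "2023-03-15 09:11:00",
    ]),
})

# Sort so that shift(1) refers to the previous minute within the stop.
grouped = toy.sort_values(["stop_id", "min_since_start"]).groupby("stop_id")

toy = toy.assign(
    # prior minute - current minute
    expected_change_min=(
        grouped["min_since_start"].shift(1) - toy["min_since_start"]
    ),
    # prior prediction - current prediction, in minutes
    actual_change_min=(
        (grouped["predicted_pacific"].shift(1) - toy["predicted_pacific"])
        .dt.total_seconds().divide(60)
    ),
)
toy["abs_diff"] = (toy["actual_change_min"] - toy["expected_change_min"]).abs()

total = toy["abs_diff"].sum()
pre_start = toy.loc[toy["min_since_start"] <= 0, "abs_diff"].sum()

print(toy[["min_since_start", "expected_change_min",
           "actual_change_min", "abs_diff"]])
print(f"total: {total}, accrued at or before trip start: {pre_start}")
```

In this toy case, three of the four minutes of total inconsistency come from rows at or before trip start, where the prediction was only holding steady. The one row where the prediction moved a minute later scores zero.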

## Summary Levels
* Cumulatively across an entire route, or
* Rolling average
* Route by stops
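
As a rough sketch of how these summary levels might be computed from stop-level results, the example below uses a made-up table. The `summarize` helper, the rolling window of 2 stops, and the choice to roll over `stop_sequence` within a trip are all assumptions for illustration, not decisions made in this notebook.

```python
import pandas as pd

# Hypothetical stop-level results, shaped like the per-stop metric output.
by_stop = pd.DataFrame({
    "route_id": ["1", "1", "1", "1", "2", "2"],
    "trip_id": ["t1", "t1", "t2", "t2", "t3", "t3"],
    "stop_sequence": [1, 2, 1, 2, 1, 2],
    "total_inconsistency": [4.0, 6.0, 2.0, 8.0, 3.0, 3.0],
    "prediction_duration": [4, 6, 4, 6, 3, 9],
})


def summarize(df: pd.DataFrame, group_cols: list) -> pd.DataFrame:
    """
    Sum the numerator and denominator first, then divide, so stops
    with longer prediction windows carry proportionally more weight.
    """
    out = (df.groupby(group_cols)
           [["total_inconsistency", "prediction_duration"]]
           .sum()
           .reset_index()
          )
    return out.assign(
        prediction_inconsistency=(
            out.total_inconsistency / out.prediction_duration)
    )


# Cumulative across an entire route
by_route = summarize(by_stop, ["route_id"])

# Route by stops
by_route_stop = summarize(by_stop, ["route_id", "stop_sequence"])

# Rolling average across consecutive stops within a trip (window is assumed)
by_stop = by_stop.assign(
    prediction_inconsistency=(
        by_stop.total_inconsistency / by_stop.prediction_duration)
)
by_stop = by_stop.assign(
    rolling_inconsistency=(
        by_stop.sort_values(["trip_id", "stop_sequence"])
        .groupby("trip_id")["prediction_inconsistency"]
        .transform(lambda s: s.rolling(window=2, min_periods=1).mean())
    )
)

print(by_route)
print(by_route_stop)
print(by_stop[["trip_id", "stop_sequence", "rolling_inconsistency"]])
```

Summing before dividing is one design choice among several. Averaging the per-stop ratios instead would weight every stop equally regardless of how long predictions were available for it.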

In [1]:
import pandas as pd

import utils
from segment_speed_utils.project_vars import (PREDICTIONS_GCS, 
                                              analysis_date)


import os
os.environ['USE_PYGEOS'] = '0'
import geopandas



In [2]:
df = pd.read_parquet(
    f"{PREDICTIONS_GCS}rt_sched_stop_times_{analysis_date}.parquet", 
)
df._gtfs_dataset_name.unique()

array(['Anaheim Resort TripUpdates',
       'Bay Area 511 Dumbarton Express TripUpdates',
       'Bay Area 511 Fairfield and Suisun Transit TripUpdates'],
      dtype=object)

In [3]:
df2 = utils.exclude_predictions_after_actual_stop_arrival(
    df, "_extract_ts_local")
df3 = utils.exclude_predictions_before_trip_start_time(df2)

In [4]:
print(f"rows to begin: {len(df)}")
print(f"rows post drop predictions after actual stop arrival: {len(df2)}")
print(f"rows post drop predictions before trip start time: {len(df3)}")

rows to begin: 1358328
rows post drop predictions after actual stop arrival: 1189614
rows post drop predictions before trip start time: 506694


In [5]:
def add_minutes_from_trip_start_and_aggregate_to_minute(
    df: pd.DataFrame,
    stop_cols: list,
    timestamp_col: str = "_extract_ts_local"
) -> pd.DataFrame:
    """
    Calculate the number of minutes from trip start, may not
    need to calculate minutes until arrival.
    Aggregate individual predictions (the 3 per minute) to just 1 prediction.
    
    Future TODO: confirm taking the minimum will yield the max difference?
    Mean is out, since this is datetime. Maybe max?
    """
    df2 = df.assign(
        # implicit in this set up is that we need to 
        # back out minutes until actual stop arrival...this is what each row is
        #min_until_arrival = (
        #    (df.actual_stop_arrival_pacific - df[timestamp_col])
        #    .dt.total_seconds().divide(60).round(0)
        #),
        min_since_start = (
            (df[timestamp_col] - df.trip_start_time)
            .dt.total_seconds().divide(60).round(0)
        ),
    )
    
    df3 = (df2.groupby(stop_cols + ["min_since_start"])
           .agg({"predicted_pacific": "min"})
           # we'll use min, but either min or max should yield
           # the most absolute_value(difference)
           .reset_index()
          )
    
    return df3
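
For intuition on the min-versus-max question in the docstring, here is a small sketch with made-up timestamps (a hypothetical trip start of 09:00:00 and one stop polled every 20 seconds). It applies the same round-to-nearest-minute binning and then keeps both the earliest and latest prediction per bin, so the two aggregations can be compared side by side.

```python
import pandas as pd

# Hypothetical trip start and polling times, roughly 3 polls per minute.
trip_start = pd.Timestamp("2023-03-15 09:00:00")

toy = pd.DataFrame({
    "stop_id": ["A"] * 6,
    "_extract_ts_local": pd.to_datetime([
        "2023-03-15 09:00:40", "2023-03-15 09:01:00", "2023-03-15 09:01:20",
        "2023-03-15 09:01:40", "2023-03-15 09:02:00", "2023-03-15 09:02:20",
    ]),
    "predicted_pacific": pd.to_datetime([
        "2023-03-15 09:10:00", "2023-03-15 09:10:30", "2023-03-15 09:10:10",
        "2023-03-15 09:11:00", "2023-03-15 09:10:50", "2023-03-15 09:11:20",
    ]),
})

# Same binning as the notebook: minutes since trip start, rounded.
toy = toy.assign(
    min_since_start=(
        (toy["_extract_ts_local"] - trip_start)
        .dt.total_seconds().divide(60).round(0)
    )
)

# Keep both the earliest and the latest prediction in each minute bin.
per_minute = (toy.groupby(["stop_id", "min_since_start"])
              .agg(earliest=("predicted_pacific", "min"),
                   latest=("predicted_pacific", "max"))
              .reset_index()
             )

print(per_minute)
```

Note that the 09:00:40 poll rounds up into minute 1, so the bins are centered on each minute mark rather than aligned to clock minutes.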

In [6]:
def calculate_expected_and_actual_change(
    df: pd.DataFrame,
    stop_cols: list
) -> pd.DataFrame:
    """
    Within a stop-minute, look to the previous row and 
    get the prediction and also the minute since trip started.
    Use this to get the deltas for slope calculation.
    """
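    # Sort by stop and minute so that shift(1) always picks up the
    # previous minute's row within the same stop. The shifted results
    # align back to df on the index.
    sorted_df = df.sort_values(stop_cols + ["min_since_start"])
    grouped = sorted_df.groupby(stop_cols)
    
    df2 = df.assign(
        prior_predicted = grouped["predicted_pacific"].shift(1),
        # prior minute - current minute (-1 when minutes are consecutive)
        expected_change_min = (grouped["min_since_start"].shift(1) - 
                               sorted_df.min_since_start),
    )
    
    df2 = df2.assign(
        # prior prediction - current prediction, in minutes,
        # same sign convention as expected_change_min
        actual_change_min = ((df2.prior_predicted - df2.predicted_pacific)
                             .dt.total_seconds().divide(60)
                            )
    )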
    
    df2 = df2.assign(
        actual_minus_expected_change = abs(df2.actual_change_min - 
                                           df2.expected_change_min)
    )
    
    return df2

In [7]:
def aggregate_by_stop(
    df: pd.DataFrame, 
    stop_cols: list
) -> pd.DataFrame:
    """
    Don't need to calculate cumulative within each stop-min.
    Just take the sum across the whole stop and calculate
    the sum(actual_minus_expected_change) / prediction_duration.
    """
    df2 = (df.groupby(stop_cols)
           .agg({
               "actual_minus_expected_change": "sum",
               "min_since_start": "size"})
           .reset_index()
           .rename(columns = {
               "actual_minus_expected_change": "total_inconsistency",
               "min_since_start": "prediction_duration"
           })
          )
    
    df2 = df2.assign(
        prediction_inconsistency = df2.total_inconsistency.divide(
            df2.prediction_duration)
    )
    
    return df2
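
Written out, the per-stop calculation in the cells above amounts to the following. The symbols are notation introduced here for illustration: for stop $s$, $m_{s,i}$ is the $i$-th value of `min_since_start`, $p_{s,i}$ is the aggregated `predicted_pacific` for that minute (in seconds), and $n_s$ is the number of stop-minute rows (`prediction_duration`).

$$
I_s = \frac{1}{n_s} \sum_{i=2}^{n_s} \left| \frac{p_{s,i-1} - p_{s,i}}{60} - \left(m_{s,i-1} - m_{s,i}\right) \right|
$$

The first row of each stop has no prior row, so it adds nothing to the sum but is still counted in $n_s$.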

In [8]:
def prediction_inconsistency_metric(df: pd.DataFrame) -> pd.DataFrame: 
    """
    Start with assembled RT stop_time_updates with 
    scheduled stop_times and also final_trip_updates columns.
    
    For a given stop, back out the number of minutes since 
    the trip start. 
    For each minute, keep the min(prediction).
    For each minute, calculate the expected change and 
    actual change in prediction, in minutes.
    Sum it up for a stop across all the minutes.
    """
    timestamp_col = "_extract_ts_local"
    
    all_stop_cols = [
        "gtfs_dataset_key", "_gtfs_dataset_name", 
        "service_date", 
        "shape_id", "route_id",
        "trip_id", 
        "stop_id", "stop_sequence",
        "scheduled_arrival", "actual_stop_arrival_pacific", 
    ]
    
    df2 = utils.exclude_predictions_after_actual_stop_arrival(
        df, timestamp_col)
    
    df3 = utils.exclude_predictions_before_trip_start_time(df2)
    
    df4 = add_minutes_from_trip_start_and_aggregate_to_minute(
        df3, all_stop_cols, timestamp_col)
    
    df5 = calculate_expected_and_actual_change(
        df4, 
        all_stop_cols
    )
    
    df6 = aggregate_by_stop(df5, all_stop_cols)
    
    return df6

In [9]:
by_trip_stop = prediction_inconsistency_metric(df)

In [10]:
by_trip_stop.prediction_inconsistency.describe()

count    4393.000000
mean        1.119749
std         1.090634
min         0.000000
25%         0.775000
50%         0.923563
75%         1.147163
max        17.260784
Name: prediction_inconsistency, dtype: float64

In [11]:
def quick_descriptives(df: pd.DataFrame, 
                       operator: str,
                       cols_to_describe: list):
    print(f"------------- {operator}-------------")
    subset_df = df[df._gtfs_dataset_name==operator] 
    
    for c in cols_to_describe:
        print(subset_df[c].describe())
        print("\n")

In [12]:
cols = [
    "total_inconsistency", 
    "prediction_duration",
    "prediction_inconsistency"]

for i in by_trip_stop._gtfs_dataset_name.unique():
    quick_descriptives(by_trip_stop, i, cols)


------------- Anaheim Resort TripUpdates-------------
count    1734.000000
mean       26.825048
std        67.870180
min         0.000000
25%         8.358333
50%        14.375000
75%        30.341667
max      1467.166667
Name: total_inconsistency, dtype: float64


count    1734.000000
mean       18.810265
std        18.486729
min         1.000000
25%         8.000000
50%        13.000000
75%        25.000000
max       167.000000
Name: prediction_duration, dtype: float64


count    1734.000000
mean        1.299016
std         1.458940
min         0.000000
25%         0.826771
50%         1.047096
75%         1.280556
max        17.260784
Name: prediction_inconsistency, dtype: float64


------------- Bay Area 511 Dumbarton Express TripUpdates-------------
count    1375.000000
mean       43.697939
std        30.265312
min         0.000000
25%        20.366667
50%        39.650000
75%        63.941667
max       140.000000
Name: total_inconsistency, dtype: float64


count    1375.000000
me

### Pick out an example where we stop asking before the actual arrival.

In [13]:
def compare_predictions_to_extract_to_actual(
    df, one_trip, one_stop
):
    subset = df[(df.trip_id==one_trip) & 
                (df.stop_sequence==one_stop)]
    
    print(f"Predictions for trip_id: {one_trip}, stop_sequence: {one_stop}")
    print(subset.predicted_pacific.value_counts())
    
    print("Actual stop arrival")
    print(subset.actual_stop_arrival_pacific.iloc[0])
    
    print("Last time we ask for predictions")
    print(subset._extract_ts_local.max())

In [14]:
one_trip = df[df._gtfs_dataset_name.str.contains("Dumbarton")].trip_id.unique()[10]
one_stop = 7
compare_predictions_to_extract_to_actual(df, one_trip, one_stop)

Predictions for trip_id: 9383970, stop_sequence: 7
2023-03-15 09:10:07    30
2023-03-15 09:07:32     4
2023-03-15 09:46:59     3
2023-03-15 09:51:35     2
2023-03-15 09:46:50     1
2023-03-15 09:44:38     1
2023-03-15 09:44:07     1
2023-03-15 09:44:32     1
2023-03-15 09:50:52     1
2023-03-15 09:00:24     1
2023-03-15 09:51:11     1
Name: predicted_pacific, dtype: int64
Actual stop arrival
2023-03-15 09:51:35
Last time we ask for predictions
2023-03-15 08:11:00


In [15]:
stop_times = pd.read_parquet(
    f"{PREDICTIONS_GCS}stop_time_updates_{analysis_date}.parquet",
    filters = [[("trip_id", "==", one_trip), 
                ("stop_sequence", "==", one_stop)]]
)

In [16]:
stop_times._extract_ts_local.max()

Timestamp('2023-03-15 08:11:00')

In [17]:
stop_times.arrival_time_pacific.max()

Timestamp('2023-03-15 09:51:35')
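
To find other cases like this one without checking trips by hand, something along these lines could flag stops where polling ended well before the actual arrival. This is a sketch on made-up rows: the 5-minute threshold and the `gap_min` column name are assumptions, not values used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical rows: t1 stops being polled long before arrival, t2 does not.
toy = pd.DataFrame({
    "trip_id": ["t1", "t1", "t2", "t2"],
    "stop_sequence": [7, 7, 3, 3],
    "_extract_ts_local": pd.to_datetime([
        "2023-03-15 08:10:00", "2023-03-15 08:11:00",
        "2023-03-15 09:29:00", "2023-03-15 09:30:00",
    ]),
    "actual_stop_arrival_pacific": pd.to_datetime([
        "2023-03-15 09:51:35", "2023-03-15 09:51:35",
        "2023-03-15 09:30:00", "2023-03-15 09:30:00",
    ]),
})

# Last poll and actual arrival for each trip-stop.
last_seen = (toy.groupby(["trip_id", "stop_sequence"])
             .agg(last_extract=("_extract_ts_local", "max"),
                  actual_arrival=("actual_stop_arrival_pacific", "first"))
             .reset_index()
            )

# Minutes between the last poll and the actual arrival.
last_seen = last_seen.assign(
    gap_min=(
        (last_seen.actual_arrival - last_seen.last_extract)
        .dt.total_seconds().divide(60)
    )
)

GAP_THRESHOLD_MIN = 5  # assumed cutoff, tune as needed
stopped_early = last_seen[last_seen.gap_min > GAP_THRESHOLD_MIN]

print(stopped_early)
```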