# Metric 1: Update Completeness


### Rabbit Hole
* `_extract_ts_local` doesn't always run up to the stop's actual arrival, or even up to the latest predicted arrival for the stop. If we stop asking for predictions, should we penalize the operator?
* For now, we only count trip updates over the period in which we are asking.
* If `_extract_ts` is not present, we are not asking at all, and that is a separate issue.
* There can be multiple predictions with the same `_extract_ts-trip_update_timestamp` combination. This can interfere with tagging the fresh trip updates. For now, `predicted_pacific` is included in the merge columns.
   * Newmark notes this too, but he randomly keeps one prediction (not necessarily the first or the last).
   * We can implement a step that does the same. For now, rows are unique on the `_extract_ts-trip_update_timestamp-predicted_pacific` combination, but duplicates appear once we reduce to `_extract_ts-trip_update_timestamp` for the same `stop_id-stop_sequence`.
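The duplicate issue described above could be checked and handled roughly as follows. This is only a sketch on made-up data: `keep_one_prediction_per_update` is a hypothetical helper (it is not part of `utils`), the dates are invented, and keeping a random row is just one possible reading of the approach attributed to Newmark.

```python
import pandas as pd


def keep_one_prediction_per_update(
    df: pd.DataFrame,
    stop_cols: list,
    timestamp_col: str = "_extract_ts_local",
    metric_timestamp_col: str = "trip_update_timestamp_local",
    random_state: int = 0,
) -> pd.DataFrame:
    """
    Hypothetical helper: for each stop, keep one randomly chosen row
    per extract timestamp-trip update timestamp combination.
    """
    key_cols = stop_cols + [timestamp_col, metric_timestamp_col]
    return (
        df.sample(frac=1, random_state=random_state)  # shuffle the rows
        .drop_duplicates(subset=key_cols)  # keep the first row after shuffling
        .sort_index()  # restore the original row order
        .reset_index(drop=True)
    )


stop_cols = ["stop_id", "stop_sequence"]
key_cols = stop_cols + ["_extract_ts_local", "trip_update_timestamp_local"]

# Toy data: stop A has two different predictions for the same
# extract timestamp-trip update timestamp combination.
toy = pd.DataFrame({
    "stop_id": ["A", "A", "A", "B"],
    "stop_sequence": [1, 1, 1, 2],
    "_extract_ts_local": pd.to_datetime(
        ["2023-01-01 08:00:00", "2023-01-01 08:00:00",
         "2023-01-01 08:01:00", "2023-01-01 08:00:00"]),
    "trip_update_timestamp_local": pd.to_datetime(
        ["2023-01-01 07:59:50", "2023-01-01 07:59:50",
         "2023-01-01 08:00:50", "2023-01-01 07:59:50"]),
    "predicted_pacific": pd.to_datetime(
        ["2023-01-01 08:10:00", "2023-01-01 08:12:00",
         "2023-01-01 08:11:00", "2023-01-01 08:20:00"]),
})

# Rows that share a key with at least one other row
dupes = toy[toy.duplicated(subset=key_cols, keep=False)]
deduped = keep_one_prediction_per_update(toy, stop_cols)

print(len(dupes), len(deduped))
```

Fixing `random_state` keeps the choice reproducible between runs, which matters if the metric is recomputed later.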

In [1]:
import pandas as pd

import utils
from segment_speed_utils.project_vars import (PREDICTIONS_GCS, 
                                              analysis_date)


import os
os.environ['USE_PYGEOS'] = '0'
import geopandas



In [2]:
import chart_utils

In [3]:
df = pd.read_parquet(
    f"{PREDICTIONS_GCS}rt_sched_stop_times_{analysis_date}.parquet", 
)
df._gtfs_dataset_name.unique()

array(['Anaheim Resort TripUpdates',
       'Bay Area 511 Dumbarton Express TripUpdates',
       'Bay Area 511 Fairfield and Suisun Transit TripUpdates'],
      dtype=object)

### Define Functions for Metrics

In [4]:
def flag_first_trip_update_prediction_for_stop(
    df: pd.DataFrame,
    stop_cols: list,
    timestamp_col: str = "_extract_ts_local",
    metric_timestamp_col: str = "trip_update_timestamp_local",
):
    """
    For every stop, tag a unique trip_update-prediction.
    Use this to track whether each minute has fresh trip updates.
    Since we are adding rows through _extract_ts, even if the operator
    does not provide fresh updates, we will generate a row. 
    Now, if it doesn't have a fresh update, that row will contain a 0, so 
    we won't count it as having fresh updates for that minute.
    
    Future TODO: decide whether prediction should be 
    included in this unique combination.
    """
    fresh_updates = (
        (df.sort_values(
            stop_cols + [timestamp_col, metric_timestamp_col])
         .drop_duplicates(subset = stop_cols + [metric_timestamp_col])
        )[stop_cols + [timestamp_col, metric_timestamp_col,
                       "predicted_pacific"]]
        .assign(fresh = 1)
    )
    
    df_with_fresh_flag = pd.merge(
        df,
        fresh_updates,
        on = stop_cols + [
            timestamp_col, metric_timestamp_col, 
            "predicted_pacific"],
        how = "left",
    )

    df_with_fresh_flag = df_with_fresh_flag.assign(
        fresh = df_with_fresh_flag.fresh.fillna(0).astype(int)
    )
    
    return df_with_fresh_flag
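To see what the fresh flag does, here is a small self-contained sketch that repeats the same sort, drop-duplicates, and left-merge steps on made-up rows. The trip, stop, and timestamps are invented for illustration, and only two stop columns are used instead of the full list.

```python
import pandas as pd

stop_cols = ["trip_id", "stop_id"]
ts_col = "_extract_ts_local"
tu_col = "trip_update_timestamp_local"
merge_cols = stop_cols + [ts_col, tu_col, "predicted_pacific"]

# Toy data: 4 extracts for one stop, but only 2 distinct trip updates.
toy = pd.DataFrame({
    "trip_id": ["t1"] * 4,
    "stop_id": ["A"] * 4,
    ts_col: pd.to_datetime(
        ["2023-01-01 08:00:00", "2023-01-01 08:00:20",
         "2023-01-01 08:00:40", "2023-01-01 08:01:00"]),
    tu_col: pd.to_datetime(
        ["2023-01-01 07:59:50", "2023-01-01 07:59:50",
         "2023-01-01 08:00:30", "2023-01-01 08:00:30"]),
    "predicted_pacific": pd.to_datetime(
        ["2023-01-01 08:10:00", "2023-01-01 08:10:00",
         "2023-01-01 08:11:00", "2023-01-01 08:11:00"]),
})

# First extract at which each trip update timestamp shows up for the stop
first_seen = (
    toy.sort_values(stop_cols + [ts_col, tu_col])
    .drop_duplicates(subset=stop_cols + [tu_col])
    [merge_cols]
    .assign(fresh=1)
)

# Left merge: rows that are not a first appearance get NaN, then 0
flagged = toy.merge(first_seen, on=merge_cols, how="left")
flagged["fresh"] = flagged["fresh"].fillna(0).astype(int)

print(flagged[[ts_col, tu_col, "fresh"]])
```

In this toy case the second and fourth extracts repeat a trip update that was already seen, so they are flagged 0 and do not count toward the minute's fresh updates.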

In [5]:
def atleast2_updates_by_trip_stop(
    df: pd.DataFrame,
    stop_cols: list,
    timestamp_col: str = "_extract_ts_local",
    metric_col: str = "fresh"
) -> pd.DataFrame: 
    """
    For every trip-stop-minute combination,
    count the number of unique trip_update_timestamps.
    (Checked that this is 3 max).
    If that minute has at least 2, flag that as passing.
    
    Sum up the number of passing minutes for that stop and 
    calculate the percent. The denominator is the number of 
    minutes elapsed for that trip-stop (trip_min_elapsed).
    
    Note: size here used to count number of rows as denominator.
    But, if we are not asking for predictions (`_extract_ts`), 
    we are also not going to penalize operator for not having predictions
    leading up to the stop.
    """
    minute_cols = [f"{timestamp_col}_hour", f"{timestamp_col}_min"]
    
    # Count for every stop-min, how many unique trip updates
    df2 = (df.groupby(stop_cols + minute_cols)
           .agg({metric_col: "sum"})
           .reset_index()
    )
    
    # 1 if it has at least 2 updates, 0 otherwise.
    # Easier to sum and calculate percent.
    df2 = df2.assign(
        atleast2_trip_updates = (df2[metric_col] >= 2).astype(int)
    )
    
    # Size: gets us number of rows for that stop
    df3 = (df2.groupby(stop_cols)
           .agg({
               f"{timestamp_col}_hour": "size",
               "atleast2_trip_updates": "sum"})
           .reset_index()
          ).rename(columns = {
            f"{timestamp_col}_hour": "trip_min_elapsed"
    })
    
    df3 = df3.assign(
        pct_update_complete = df3.atleast2_trip_updates.divide(
            df3.trip_min_elapsed)
    ) 
    
    return df3
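As a worked example of the percentage, the sketch below applies the same two-step aggregation to made-up rows for a single stop. The data is invented and the column names are simplified to one stop column; it is meant only to show the arithmetic, not to replace the function above.

```python
import pandas as pd

# Toy data: one stop observed over 3 minutes (08:00, 08:01, 08:02).
# fresh = 1 marks the first appearance of a trip update.
toy = pd.DataFrame({
    "stop_id": ["A"] * 7,
    "_extract_ts_local_hour": [8] * 7,
    "_extract_ts_local_min": [0, 0, 0, 1, 1, 2, 2],
    "fresh": [1, 1, 0, 1, 0, 1, 1],
})

minute_cols = ["_extract_ts_local_hour", "_extract_ts_local_min"]

# Step 1: number of fresh trip updates in each stop-minute
per_minute = (
    toy.groupby(["stop_id"] + minute_cols)
    .agg(fresh_updates=("fresh", "sum"))
    .reset_index()
)
per_minute["atleast2_trip_updates"] = (
    per_minute["fresh_updates"] >= 2
).astype(int)

# Step 2: minutes elapsed (row count) and passing minutes per stop
per_stop = (
    per_minute.groupby("stop_id")
    .agg(
        trip_min_elapsed=("atleast2_trip_updates", "size"),
        atleast2_trip_updates=("atleast2_trip_updates", "sum"),
    )
    .reset_index()
)
per_stop["pct_update_complete"] = (
    per_stop["atleast2_trip_updates"] / per_stop["trip_min_elapsed"]
)

print(per_stop)
```

Here minutes 08:00 and 08:02 each have 2 fresh updates and minute 08:01 has 1, so 2 of 3 minutes pass and `pct_update_complete` is 2/3.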

In [6]:
def update_completeness_metric(df: pd.DataFrame) -> pd.DataFrame:
    """
    Start with assembled RT stop_time_updates with 
    scheduled stop_times and also final_trip_updates columns.
    
    For a given stop, if there are predictions/rows present because
    of _extract_ts after the "actual stop arrival" (final_trip_updates), 
    exclude those.
    """
    # Set timestamp columns here, in case these are not correct
    # Row should be derived from _extract_ts (convert to minute combinations)
    # along with stop identifiers
    # For metric, we want to get # unique trip updates
    timestamp_col = "_extract_ts_local"
    metric_col = "trip_update_timestamp_local"
    
    # define all the columns needed for stop grouping
    # include columns for future aggregations
    all_stop_cols = [
        "gtfs_dataset_key", "_gtfs_dataset_name", 
        "service_date", 
        "shape_id", "route_id",
        "trip_id", 
        "stop_id", "stop_sequence",
        "scheduled_arrival", "actual_stop_arrival_pacific", 
    ]

    df2 = utils.exclude_predictions_after_actual_stop_arrival(
        df, timestamp_col)
    
    df3 = flag_first_trip_update_prediction_for_stop(
        df2,
        all_stop_cols,
        timestamp_col,
        metric_col
    )

    df4 = utils.parse_hour_min(df3, [timestamp_col])

    df5 = atleast2_updates_by_trip_stop(
        df4, 
        all_stop_cols,
        timestamp_col,
        "fresh"
    )
    
    return df5
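The two `utils` helpers called above are not shown in this notebook. The sketch below guesses at what they might do, based only on how their outputs are used here: the `_sketch` functions are hypothetical stand-ins, the "at or before arrival" cutoff is an assumption, and the real implementations may differ.

```python
import pandas as pd


def exclude_predictions_after_actual_stop_arrival_sketch(
    df: pd.DataFrame,
    timestamp_col: str,
    arrival_col: str = "actual_stop_arrival_pacific",
) -> pd.DataFrame:
    """
    Hypothetical stand-in: keep rows whose extract timestamp is
    at or before the actual stop arrival (assumed cutoff).
    """
    return df[df[timestamp_col] <= df[arrival_col]].reset_index(drop=True)


def parse_hour_min_sketch(
    df: pd.DataFrame,
    timestamp_cols: list,
) -> pd.DataFrame:
    """
    Hypothetical stand-in: add <col>_hour and <col>_min columns,
    matching the names that the minute grouping expects.
    """
    new_cols = {}
    for c in timestamp_cols:
        new_cols[f"{c}_hour"] = df[c].dt.hour
        new_cols[f"{c}_min"] = df[c].dt.minute
    return df.assign(**new_cols)


# Toy data: the last extract happens after the vehicle has arrived.
toy = pd.DataFrame({
    "stop_id": ["A"] * 3,
    "_extract_ts_local": pd.to_datetime(
        ["2023-01-01 08:00:00", "2023-01-01 08:05:00",
         "2023-01-01 08:12:00"]),
    "actual_stop_arrival_pacific": pd.to_datetime(
        ["2023-01-01 08:10:00"] * 3),
})

kept = exclude_predictions_after_actual_stop_arrival_sketch(
    toy, "_extract_ts_local")
parsed = parse_hour_min_sketch(kept, ["_extract_ts_local"])

print(parsed)
```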

### Calculate Metric and Quick Descriptives

In [7]:
by_trip_stop = update_completeness_metric(df)

In [8]:
def quick_descriptives(df: pd.DataFrame, 
                       operator: str,
                       cols_to_describe: list):
    print(f"------------- {operator}-------------")
    subset_df = df[df._gtfs_dataset_name==operator] 
    
    for c in cols_to_describe:
        print(subset_df[c].describe())
        print("\n")

In [9]:
cols = [
    "atleast2_trip_updates", 
    "trip_min_elapsed",
    "pct_update_complete"]

for i in by_trip_stop._gtfs_dataset_name.unique():
    quick_descriptives(by_trip_stop, i, cols)

------------- Anaheim Resort TripUpdates-------------
count    1800.000000
mean       20.965000
std        21.919591
min         0.000000
25%         8.000000
50%        13.000000
75%        29.000000
max       172.000000
Name: atleast2_trip_updates, dtype: float64


count    1800.000000
mean       21.678333
std        21.908481
min         1.000000
25%         9.000000
50%        13.000000
75%        30.000000
max       173.000000
Name: trip_min_elapsed, dtype: float64


count    1800.000000
mean        0.904271
std         0.188698
min         0.000000
25%         0.900000
50%         0.971429
75%         1.000000
max         1.000000
Name: pct_update_complete, dtype: float64


------------- Bay Area 511 Dumbarton Express TripUpdates-------------
count    1490.000000
mean       79.076510
std        32.132943
min         1.000000
25%        56.000000
50%        82.000000
75%       101.000000
max       161.000000
Name: atleast2_trip_updates, dtype: float64


count    1490.000000
mean  

### Tables/Charts

In [10]:
for i in by_trip_stop._gtfs_dataset_name.unique():
    display(
        chart_utils.describe_to_df(
            by_trip_stop,
            i,
            cols,
        )
    )

| Measure | Atleast2 Trip Updates | Trip Min Elapsed | Pct Update Complete |
|---|---|---|---|
| Count | 1800.0 | 1800.0 | 1800.0 |
| Mean | 21.0 | 21.7 | 0.9 |
| Std | 21.9 | 21.9 | 0.2 |
| Min | 0.0 | 1.0 | 0.0 |
| 25% | 8.0 | 9.0 | 0.9 |
| 50% | 13.0 | 13.0 | 1.0 |
| 75% | 29.0 | 30.0 | 1.0 |
| Max | 172.0 | 173.0 | 1.0 |


| Measure | Atleast2 Trip Updates | Trip Min Elapsed | Pct Update Complete |
|---|---|---|---|
| Count | 1490.0 | 1490.0 | 1490.0 |
| Mean | 79.1 | 80.2 | 1.0 |
| Std | 32.1 | 32.1 | 0.0 |
| Min | 1.0 | 3.0 | 0.2 |
| 25% | 56.0 | 58.0 | 1.0 |
| 50% | 82.0 | 83.0 | 1.0 |
| 75% | 101.0 | 103.0 | 1.0 |
| Max | 161.0 | 161.0 | 1.0 |


| Measure | Atleast2 Trip Updates | Trip Min Elapsed | Pct Update Complete |
|---|---|---|---|
| Count | 1289.0 | 1289.0 | 1289.0 |
| Mean | 49.5 | 50.2 | 1.0 |
| Std | 10.4 | 10.4 | 0.0 |
| Min | 2.0 | 3.0 | 0.7 |
| 25% | 45.0 | 46.0 | 1.0 |
| 50% | 47.0 | 47.0 | 1.0 |
| 75% | 50.0 | 51.0 | 1.0 |
| Max | 112.0 | 113.0 | 1.0 |


In [11]:
metric_1_df = chart_utils.prep_df_for_chart(
    df = by_trip_stop,
    percentage_column = "pct_update_complete",
    columns_to_round = ["pct_update_complete"],
    columns_to_keep = [
        "_gtfs_dataset_name",
        "trip_id",
        "stop_id",
        "stop_sequence",
        "pct_update_complete",
    ],
)

In [12]:
for i in metric_1_df['Gtfs Dataset Name'].unique():
    display(
        chart_utils.scatter_plot_operator(
            metric_1_df,
            operator = i,
            x_col = "Stop Sequence",
            y_col = "Pct Update Complete",
            color_col = "Rounded Pct Update Complete",
            dropdown_col = "Trip Id",
            dropdown_col_title = "Trip ID",
        )
    )

AttributeError: module 'chart_utils' has no attribute 'scatter_plot_operator'

In [None]:
# Look at Fairfield and Suisun for stop sequence
by_trip_stop[by_trip_stop['trip_id'] == 't_5525634_b_79892_tn_6']

In [None]:
metric_1_df[metric_1_df['Trip Id'] == 't_5525634_b_79892_tn_6']