# Metric 1: Update Completeness


### Rabbit Hole
* `_extract_ts_local` doesn't always extend up to the stop's actual arrival, or even to the latest predicted arrival for that stop. If we stop asking for predictions before then, should the operator be penalized?
* Right now, we only count trip updates for the minutes in which we were actually asking.
* If `_extract_ts` is not present, we were not asking at all, which is a different issue.
* Note that if we subset to prediction durations, we might lose a lot of rows.

In [1]:
import pandas as pd
import utils
from segment_speed_utils.project_vars import PREDICTIONS_GCS, analysis_date


import os
os.environ['USE_PYGEOS'] = '0'
import geopandas



In [2]:
import altair as alt
from shared_utils import calitp_color_palette as cp

In [3]:
def atleast2_updates_by_trip_stop(
    df: pd.DataFrame,
    timestamp_col: str = "_extract_ts_local",
    metric_timestamp_col: str = "trip_update_timestamp_local",
) -> pd.DataFrame:
    """
    For every trip-stop-minute combination,
    count the number of unique trip_update_timestamps.
    (Checked that this is 3 max).
    If that minute has at least 2, flag that as passing.

    Sum up the number of passing minutes for that stop and
    calculate the percent. The denominator is trip_min_elapsed,
    the number of minutes with rows for that stop.

    Note: size is used here to count the number of rows as the denominator.
    If we are not asking for predictions (no `_extract_ts`),
    we do not penalize the operator for not having predictions
    leading up to the stop.
    """
    all_stop_cols = [
        "gtfs_dataset_key",
        "_gtfs_dataset_name",
        "service_date",
        "shape_id",
        "route_id",
        "trip_id",
        "stop_id",
        "stop_sequence",
        "scheduled_arrival",
        "actual_stop_arrival_pacific",
    ]
    minute_cols = [f"{timestamp_col}_hour", f"{timestamp_col}_min"]

    # Count for every stop-min, how many unique trip updates
    df2 = (
        df.groupby(all_stop_cols + minute_cols)
        .agg({metric_timestamp_col: "nunique"})
        .reset_index()
    )

    # 1 if that minute has at least 2 updates, 0 otherwise.
    # Easier to sum and calculate percent.
    df2 = df2.assign(
        atleast2_trip_updates=(df2[metric_timestamp_col] >= 2).astype(int)
    )

    # Size: gets us number of rows for that stop
    df3 = (
        df2.groupby(all_stop_cols)
        .agg({f"{timestamp_col}_hour": "size", "atleast2_trip_updates": "sum"})
        .reset_index()
    ).rename(columns={f"{timestamp_col}_hour": "trip_min_elapsed"})

    df3 = df3.assign(
        pct_update_complete=df3.atleast2_trip_updates.divide(df3.trip_min_elapsed)
    )

    return df3
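
To make the two-step aggregation concrete, here is a minimal sketch on toy data. The column names `extract_hour` and `extract_min` and the timestamps are made up for illustration; only the logic (count unique trip update timestamps per minute, flag minutes with at least 2, then divide by the number of minutes) follows the function above.

```python
import pandas as pd

# Toy data (hypothetical): one trip-stop observed during three extract minutes.
toy = pd.DataFrame({
    "trip_id": ["t1"] * 5,
    "stop_id": ["s1"] * 5,
    "extract_hour": [6, 6, 6, 6, 6],
    "extract_min": [5, 5, 6, 7, 7],
    "trip_update_timestamp_local": pd.to_datetime([
        "2023-03-15 06:05:10", "2023-03-15 06:05:40",  # 2 unique updates in 06:05
        "2023-03-15 06:06:20",                          # 1 unique update in 06:06
        "2023-03-15 06:07:05", "2023-03-15 06:07:35",  # 2 unique updates in 06:07
    ]),
})

stop_cols = ["trip_id", "stop_id"]
minute_cols = ["extract_hour", "extract_min"]

# Step 1: unique trip updates per trip-stop-minute, flagged 1 if at least 2.
per_minute = (
    toy.groupby(stop_cols + minute_cols)
    .agg(n_updates=("trip_update_timestamp_local", "nunique"))
    .reset_index()
)
per_minute["atleast2_trip_updates"] = (per_minute["n_updates"] >= 2).astype(int)

# Step 2: per trip-stop, count the minutes and sum the passing minutes.
per_stop = (
    per_minute.groupby(stop_cols)
    .agg(
        trip_min_elapsed=("extract_min", "size"),
        atleast2_trip_updates=("atleast2_trip_updates", "sum"),
    )
    .reset_index()
)
per_stop["pct_update_complete"] = (
    per_stop["atleast2_trip_updates"] / per_stop["trip_min_elapsed"]
)
print(per_stop)
```

In this toy case, 2 of the 3 observed minutes pass, so `pct_update_complete` is 2/3.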

In [4]:
def update_completeness_metric(df: pd.DataFrame) -> pd.DataFrame:
    """
    Start with assembled RT stop_time_updates with
    scheduled stop_times and also final_trip_updates columns.

    For a given stop, if there are predictions/rows present because
    of _extract_ts after the "actual stop arrival" (final_trip_updates),
    exclude those.
    """
    # Set timestamp columns here, in case these are not correct
    # Row should be derived from _extract_ts (convert to minute combinations)
    # along with stop identifiers
    # For metric, we want to get # unique trip updates
    timestamp_col = "_extract_ts_local"
    metric_col = "trip_update_timestamp_local"

    df2 = utils.exclude_predictions_after_actual_stop_arrival(df, timestamp_col)
    df3 = utils.parse_hour_min(df2, [timestamp_col])

    df4 = atleast2_updates_by_trip_stop(df3, timestamp_col, metric_col)

    return df4
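
The two `utils` helpers are not shown in this notebook. The sketch below is only a guess at what they might do, based on the docstring and on the `{timestamp_col}_hour` / `{timestamp_col}_min` columns that `atleast2_updates_by_trip_stop` expects. The function names ending in `_sketch`, the `<=` comparison, and the `arrival_col` parameter are all assumptions, not the actual implementation.

```python
import pandas as pd


def exclude_after_arrival_sketch(
    df: pd.DataFrame,
    timestamp_col: str,
    arrival_col: str = "actual_stop_arrival_pacific",
) -> pd.DataFrame:
    """Hypothetical: keep rows extracted at or before the actual stop arrival."""
    return df[df[timestamp_col] <= df[arrival_col]].reset_index(drop=True)


def parse_hour_min_sketch(df: pd.DataFrame, timestamp_cols: list) -> pd.DataFrame:
    """Hypothetical: add <col>_hour and <col>_min for each timestamp column."""
    out = df.copy()
    for c in timestamp_cols:
        out[f"{c}_hour"] = out[c].dt.hour
        out[f"{c}_min"] = out[c].dt.minute
    return out


# Toy rows: the second extract happens after the bus has already arrived.
toy = pd.DataFrame({
    "_extract_ts_local": pd.to_datetime(
        ["2023-03-15 06:05:40", "2023-03-15 06:40:00"]
    ),
    "actual_stop_arrival_pacific": pd.to_datetime(["2023-03-15 06:35:19"] * 2),
})

kept = exclude_after_arrival_sketch(toy, "_extract_ts_local")
parsed = parse_hour_min_sketch(kept, ["_extract_ts_local"])
print(parsed)
```

Only the first toy row survives the filter, and it gains hour and minute columns that can serve as grouping keys.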

In [5]:
df = pd.read_parquet(
    f"{PREDICTIONS_GCS}rt_sched_stop_times_{analysis_date}.parquet",
)
df._gtfs_dataset_name.unique()

array(['Anaheim Resort TripUpdates',
       'Bay Area 511 Dumbarton Express TripUpdates',
       'Bay Area 511 Fairfield and Suisun Transit TripUpdates'],
      dtype=object)

In [6]:
df.shape

(700420, 17)

In [7]:
df.sample()

Unnamed: 0,gtfs_dataset_key,_gtfs_dataset_name,service_date,trip_id,trip_start_time,_trip_update_message_age,stop_id,stop_sequence,schedule_relationship,_extract_ts_local,trip_update_timestamp_local,shape_id,route_id,scheduled_arrival,actual_stop_arrival_pacific,predicted_pacific,prior_stop_arrival_pacific
2828,262d7b27183fa8d174ab8fc83ad5848f,Anaheim Resort TripUpdates,2023-03-15,3dd09dac-c3b9-4d60-b511-6e6395a8c7d0,2023-03-15 06:28:00,5,4019,3.0,SCHEDULED,2023-03-15 06:05:40,2023-03-15 06:05:35,320b27dd-bf84-4fa0-9261-ffa1b596c037,bf59bdaf-2bdf-45ad-ba93-adc1d37f5cd7,2023-03-15 06:35:00,2023-03-15 06:35:19,2023-03-15 06:35:00,2023-03-15 06:33:05


In [8]:
by_trip_stop = update_completeness_metric(df)

In [9]:
def quick_descriptives(df: pd.DataFrame, operator: str, cols_to_describe: list):
    print(f"------------- {operator}-------------")
    subset_df = df[df._gtfs_dataset_name == operator]

    for c in cols_to_describe:
        print(subset_df[c].describe())
        print("\n")

In [10]:
#cols = ["atleast2_trip_updates", "trip_min_elapsed", "pct_update_complete"]

#for i in by_trip_stop._gtfs_dataset_name.unique():
 #   quick_descriptives(by_trip_stop, i, cols)

In [11]:
by_trip_stop.columns

Index(['gtfs_dataset_key', '_gtfs_dataset_name', 'service_date', 'shape_id',
       'route_id', 'trip_id', 'stop_id', 'stop_sequence', 'scheduled_arrival',
       'actual_stop_arrival_pacific', 'trip_min_elapsed',
       'atleast2_trip_updates', 'pct_update_complete'],
      dtype='object')

### Charts

In [12]:
def trip_duration_categories(row):
    if row.trip_min_elapsed < 31:
        return "0 - 30 minutes"
    elif 30 < row.trip_min_elapsed < 61:
        return "31-60 minutes"
    elif 60 < row.trip_min_elapsed < 91:
        return "61-90 minutes"
    else:
        return "90+ minutes"

In [13]:
def pct_update_complete_categories(row):
    # Expects pct_update_complete on a 0-100 scale
    if row.pct_update_complete < 21:
        return "0-20%"
    elif 20 < row.pct_update_complete < 41:
        return "21-40%"
    elif 40 < row.pct_update_complete < 61:
        return "41-60%"
    elif 60 < row.pct_update_complete < 81:
        return "61-80%"
    else:
        return "81-100%"
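
As a possible alternative to the row-wise functions above, the same kind of binning can be sketched with `pd.cut`. This is only an illustration on made-up values. Note that the bin edges here are closed on the right, so fractional values such as 20.5 land in the next bin up, which differs slightly from the `if`/`elif` version.

```python
import pandas as pd

# Made-up percentages on a 0-100 scale.
pct = pd.Series([0, 20, 20.5, 80, 98.36])

pct_category = pd.cut(
    pct,
    bins=[0, 20, 40, 60, 80, 100],
    labels=["0-20%", "21-40%", "41-60%", "61-80%", "81-100%"],
    include_lowest=True,  # so that exactly 0 falls in the first bin
)
print(pct_category.tolist())
```

This avoids `apply(axis=1)`, which can be slow on large DataFrames, and keeps all bin edges in one place.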

In [16]:
def altair_dropdown(df, column_for_dropdown:str, title_of_dropdown:str):
    
    dropdown_list = df[column_for_dropdown].unique().tolist()
    initialize_first_op = sorted(dropdown_list)[0]
    input_dropdown = alt.binding_select(options=sorted(dropdown_list), name=title_of_dropdown)
    
    selection = alt.selection_single(
        name=title_of_dropdown,
        fields=[column_for_dropdown],
        bind=input_dropdown,
        init={column_for_dropdown: initialize_first_op},
    )
    
    return selection

In [19]:
def prep_metric_1(df, percentage_column:str, columns_to_round:list):
    # Work on a copy so rerunning this cell does not rescale the input again
    df = df.copy()

    df["trip_category"] = df.apply(trip_duration_categories, axis=1)

    # Convert the share (0-1) to a percentage (0-100)
    df[percentage_column] = df[percentage_column] * 100

    df["pct_update_category"] = df.apply(pct_update_complete_categories, axis=1)

    # Rounds down to the nearest 10. 96 becomes 90.
    for i in columns_to_round:
        df[f"rounded_{i}"] = ((df[i] / 100) * 10).astype(int) * 10

    # Count trip-stop rows per operator
    # (a stop served by several trips is counted once per trip)
    total_stops_ops = (
        df.groupby(["gtfs_dataset_key"])
        .agg({"stop_id": "count"})
        .rename(columns={"stop_id": "total_stops_for_operator"})
        .reset_index()
    )

    # Merge the per-operator totals back onto each row
    m1 = pd.merge(df, total_stops_ops, how="inner", on="gtfs_dataset_key")

    # Title-case the column names for chart labels, e.g. "Pct Update Complete"
    m1.columns = m1.columns.str.replace("_", " ").str.strip().str.title()

    return m1
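
The rounding step can be checked on a few values. The sketch below compares the expression used in `prep_metric_1` with floor division, which should give the same result for non-negative inputs. The values are illustrative, taken partly from the sample output in this notebook.

```python
import pandas as pd

values = pd.Series([7, 61, 96, 98.360656, 100])

# Expression used in prep_metric_1: scale to tens, truncate, scale back.
rounded_original = ((values / 100) * 10).astype(int) * 10

# Equivalent floor-division form for non-negative values.
rounded_floor = (values // 10 * 10).astype(int)

print(pd.DataFrame({
    "value": values,
    "rounded_original": rounded_original,
    "rounded_floor": rounded_floor,
}))
```

Both columns round down to the nearest 10, so 96 becomes 90 and only an exact 100 stays at 100.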

In [20]:
by_trip_stop_cleaned = prep_metric_1(
    by_trip_stop,
    "pct_update_complete",
    ["trip_min_elapsed", "atleast2_trip_updates", "pct_update_complete"],
)

In [21]:
selection = altair_dropdown(by_trip_stop_cleaned, "Gtfs Dataset Name", "Operator")

In [22]:
def chart_size(chart: alt.Chart) -> alt.Chart:
    chart = chart.properties(width= 500, height=400)
    return chart

In [55]:
agg1 = (
    by_trip_stop_cleaned.groupby(["Gtfs Dataset Name", "Trip Category"])
    .agg({"Pct Update Complete": "median"})
    .reset_index()
    .rename(columns={"Pct Update Complete": "Median Pct Update Complete"})
)

In [68]:
def bar_chart(df, x_col:str, y_col:str, chart_title:str):
    chart = (alt.Chart(df)
    .mark_bar()
    .encode(
        x=alt.X(x_col),
        y=alt.Y(y_col, scale=alt.Scale(domain=[0, 100])),
        color=alt.Color(
            x_col,
            scale=alt.Scale(range=cp.CALITP_CATEGORY_BRIGHT_COLORS),
            legend=None,
        ),
        tooltip=df.columns.tolist(),
    )
    .properties(title=chart_title)
    .interactive())
    
    return chart

In [77]:
chart1 = (
    bar_chart(
        agg1,
        "Trip Category",
        "Median Pct Update Complete",
        "Median Pct Update Completeness by Trip Duration",
    )
    .add_selection(selection)
    .transform_filter(selection)
)

In [70]:
chart_size(chart1)

In [71]:
agg2 = (
    by_trip_stop_cleaned.groupby(["Gtfs Dataset Name", "Rounded Pct Update Complete"])
    .agg({"Stop Id": "count"})
    .reset_index()
    .rename(columns={"Stop Id": "Total Stops"})
)

In [72]:
chart2 = alt.Chart(agg2).mark_arc(innerRadius=0).encode(
    theta="Total Stops",
    color=alt.Color(
        "Rounded Pct Update Complete:N",
        scale=alt.Scale(
            range=cp.CALITP_DIVERGING_COLORS,
            domain=agg2["Rounded Pct Update Complete"]
            .unique()
            .tolist(),
        ),
    ),
    tooltip=agg2.columns.tolist(),
).properties(title="Total Stops by Pct Update Complete").add_selection(selection).transform_filter(selection)

In [73]:
chart_size(chart2)

#### To do
* Aggregate rounded `atleast2_trip_updates` against `pct_update_complete` to see whether more updates are associated with a higher completeness percentage.
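
One possible way to approach this to-do is a simple groupby, sketched here on a handful of rows copied from the sample output in this notebook. The DataFrame `toy_cleaned` and the output column name `Median Pct Update Complete` are placeholders for illustration; the real analysis would run on `by_trip_stop_cleaned`.

```python
import pandas as pd

# A few rows shaped like by_trip_stop_cleaned (values from the sample output).
toy_cleaned = pd.DataFrame({
    "Rounded Atleast2 Trip Updates": [0, 60, 60, 40, 20],
    "Pct Update Complete": [100.0, 98.360656, 98.360656, 97.826087, 95.833333],
})

# Median completeness for each rounded bucket of passing minutes.
relation = (
    toy_cleaned.groupby("Rounded Atleast2 Trip Updates")
    .agg({"Pct Update Complete": "median"})
    .reset_index()
    .rename(columns={"Pct Update Complete": "Median Pct Update Complete"})
)
print(relation)
```

With so few rows this says nothing about the real relationship; it only shows the shape of the aggregation.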

In [23]:
by_trip_stop_cleaned.sample(5)

Unnamed: 0,Gtfs Dataset Key,Gtfs Dataset Name,Service Date,Shape Id,Route Id,Trip Id,Stop Id,Stop Sequence,Scheduled Arrival,Actual Stop Arrival Pacific,Trip Min Elapsed,Atleast2 Trip Updates,Pct Update Complete,Trip Category,Pct Update Category,Rounded Trip Min Elapsed,Rounded Atleast2 Trip Updates,Rounded Pct Update Complete,Total Stops For Operator
14,262d7b27183fa8d174ab8fc83ad5848f,Anaheim Resort TripUpdates,2023-03-15,17f36b7a-1454-48cd-9ffa-09f0cda56a63,8726c033-2d11-4351-b4c4-bcd15d90316f,2e92f515-8219-4697-bea1-43355057454a:1,3008,4.0,2023-03-15 06:35:53,2023-03-15 06:39:56,7,7,100.0,0 - 30 minutes,81-100%,0,0,100,265
42,262d7b27183fa8d174ab8fc83ad5848f,Anaheim Resort TripUpdates,2023-03-15,186cf336-ab6f-468a-abce-250c51b17d14,bc404235-c139-4efb-90fb-798fbbddc35c,3ae839d8-c5fd-4919-a453-c130eb088343:3,6018,2.0,2023-03-15 07:35:00,2023-03-15 07:46:54,61,60,98.360656,61-90 minutes,81-100%,60,60,90,265
680,5c3e65766dda65958cf4da845286c0d5,Bay Area 511 Dumbarton Express TripUpdates,2023-03-15,DB0086,DB,9383973,55814,15.0,2023-03-15 09:56:00,2023-03-15 10:27:01,61,60,98.360656,61-90 minutes,81-100%,60,60,90,1424
2934,9255cb4744d73d4a39f512180a7cf63a,Bay Area 511 Fairfield and Suisun Transit Trip...,2023-03-15,p_2689,7,t_5525675_b_79892_tn_3,75268,1.0,2023-03-15 10:30:00,2023-03-15 10:33:26,46,45,97.826087,31-60 minutes,81-100%,40,40,90,1278
184,262d7b27183fa8d174ab8fc83ad5848f,Anaheim Resort TripUpdates,2023-03-15,a4149e18-434f-4452-b250-1f3941cdcc59,ddab055c-8523-472b-b29f-58f7bebd2f56,02be0282-4900-4c20-beb2-26301da4c0c3,19,1.0,2023-03-15 16:29:00,2023-03-15 16:26:03,24,23,95.833333,0 - 30 minutes,81-100%,20,20,90,265


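* atleast2_trip_updates: the number of minutes with at least 2 unique trip updates. Each such minute is flagged as 1, and the flags are summed per trip-stop.
* trip_min_elapsed: the number of minutes observed for that trip-stop, used here as the trip duration.
* pct_update_complete: atleast2_trip_updates divided by trip_min_elapsed, expressed as a percentage.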
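
As a quick check of these definitions, the arithmetic for one of the sampled rows above (61 minutes elapsed, 60 of them with at least 2 trip updates) can be reproduced directly:

```python
# Values taken from a sampled row in the table above.
atleast2_trip_updates = 60
trip_min_elapsed = 61

pct_update_complete = atleast2_trip_updates / trip_min_elapsed * 100
print(round(pct_update_complete, 6))  # 98.360656
```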