# `stop_times_with_direction`: is `stop_pair` correctly constructed?

Is `stop_pair` constructed incorrectly simply because of sorting, or is something else going on?

Add `.sort_values` before the grouped shift and check whether the same behavior persists. Compare the result against the dataset already saved out and see where the two differ.
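To make the sorting concern concrete, here is a minimal sketch on toy data (the frame `toy` and its values are made up for illustration, not taken from the real `stop_times`). It shows that a grouped `shift` follows row order, so the "next stop" is only right when rows are sorted by `stop_sequence` first.

```python
import pandas as pd

# Toy stop_times with rows deliberately out of stop_sequence order (hypothetical values)
toy = pd.DataFrame({
    "trip_instance_key": ["a", "a", "a"],
    "stop_sequence": [2, 0, 1],
    "stop_id": ["s2", "s0", "s1"],
})

# Shift without sorting: the "next stop" follows row order, not stop_sequence
unsorted_next = toy.groupby("trip_instance_key").stop_id.shift(-1)

# Sort first, then shift: the "next stop" follows stop_sequence
sorted_toy = toy.sort_values(
    ["trip_instance_key", "stop_sequence"]
).reset_index(drop=True)
sorted_next = sorted_toy.groupby("trip_instance_key").stop_id.shift(-1)

print(toy.assign(subseq_stop_id=unsorted_next))
print(sorted_toy.assign(subseq_stop_id=sorted_next))
```

In the unsorted version, the row with `stop_sequence` 2 is paired with `s0` as its subsequent stop, which is the kind of error sorting would fix. If mismatches persist after sorting, the cause is likely elsewhere.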

Why does the same `trip_instance_key` show up with multiple `stop_sequence==0` rows that have different `stop_id` values? `stop_times` should be at trip-stop_sequence grain.
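One way to test the grain directly is to look for repeated trip-stop keys. The sketch below uses a small made-up frame (`toy` is an assumption for illustration); the same two checks could be pointed at the real scheduled stop times.

```python
import pandas as pd

# Toy table that violates the trip-stop grain for trip "a", stop_sequence 0
toy = pd.DataFrame({
    "trip_instance_key": ["a", "a", "a", "b"],
    "stop_sequence": [0, 0, 1, 0],
    "stop_id": ["s1", "s2", "s3", "s9"],
})
trip_stop_cols = ["trip_instance_key", "stop_sequence"]

# keep=False flags every row in a duplicated group, not just the repeats
dupes = toy[toy.duplicated(subset=trip_stop_cols, keep=False)]

# Distinct stop_id values per trip-stop key; more than 1 breaks the grain
n_stop_ids = toy.groupby(trip_stop_cols).stop_id.nunique()
bad_keys = n_stop_ids[n_stop_ids > 1]

print(dupes)
print(bad_keys)
```

An empty `bad_keys` would mean the table is at the expected grain; any entries point to the trips worth inspecting.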

In [1]:
import geopandas as gpd
import pandas as pd

from shared_utils import rt_dates
from segment_speed_utils import helpers
from update_vars import GTFS_DATA_DICT, RT_SCHED_GCS

In [2]:
analysis_date = rt_dates.DATES["feb2025"]
EXPORT_FILE = GTFS_DATA_DICT.rt_vs_schedule_tables.stop_times_direction

In [3]:
# Pared-down version of the function in `stop_times_with_direction`, keeping only the sorting and shifting logic
def find_prior_subseq_stop_info(
    stop_times: gpd.GeoDataFrame, 
    analysis_date: str,
    trip_cols: list = ["trip_instance_key"],
    trip_stop_cols: list = ["trip_instance_key", "stop_sequence"]
) -> gpd.GeoDataFrame:
    """
    For each trip-stop, find the prior and subsequent stop (using stop_sequence).
    This pared-down version skips attaching the prior stop's geometry
    and keeps only the sorting and shifting logic.
    
    Create columns pairing the current stop with the subsequent stop.
    - stop_pair (stop_id1__stop_id2)
    - stop_pair_name (stop_name1__stop_name2)
    """
    
    gdf = stop_times[
        trip_stop_cols + ["stop_id", "stop_name"]
    ].sort_values(trip_stop_cols).reset_index(drop=True) 

    gdf = gdf.assign(

        prior_stop_sequence = (gdf.sort_values(trip_stop_cols)
                               .groupby(trip_cols)
                               .stop_sequence
                               .shift(1)),
        # add subseq stop info here
        subseq_stop_sequence = (gdf.sort_values(trip_stop_cols)
                                .groupby(trip_cols)
                                .stop_sequence
                                .shift(-1)),
        subseq_stop_id = (gdf.sort_values(trip_stop_cols)
                          .groupby(trip_cols)
                          .stop_id
                          .shift(-1)),
        subseq_stop_name = (gdf.sort_values(trip_stop_cols)
                            .groupby(trip_cols)
                            .stop_name
                            .shift(-1)),
    ).fillna({
        **{c: "" for c in ["subseq_stop_id", "subseq_stop_name"]}
    })
    
    
    # Just keep subset of columns because we'll get other stop columns back when we merge with stop_times
    keep_cols = [
        "trip_instance_key", "stop_sequence",
        "prior_stop_sequence", "subseq_stop_sequence"
    ]
    
    # Join the stop pair with double underscores, since stop_id 
    # can contain hyphens
    gdf2 = gdf[keep_cols].assign(
        stop_pair = gdf.stop_id.astype(str).str.cat(
            gdf.subseq_stop_id.astype(str), sep = "__"),
        stop_pair_name = gdf.stop_name.astype(str).str.cat(
            gdf.subseq_stop_name.astype(str), sep = "__"),
    )
    
    stop_times_geom_direction = pd.merge(
        stop_times,
        gdf2,
        on = trip_stop_cols,
        how = "inner"
    )

    return stop_times_geom_direction 

In [4]:
import stop_times_with_direction

scheduled_stop_times = stop_times_with_direction.prep_scheduled_stop_times(analysis_date)

trip_cols = ["trip_instance_key"]
trip_stop_cols = ["trip_instance_key", "stop_sequence"]

df = find_prior_subseq_stop_info(
    scheduled_stop_times,
    analysis_date,
    trip_cols = trip_cols,
    trip_stop_cols = trip_stop_cols
).sort_values(
    trip_stop_cols
).reset_index(drop=True)

In [5]:
existing_df = pd.read_parquet(
    f"{RT_SCHED_GCS}{EXPORT_FILE}_{analysis_date}.parquet",
    columns = trip_stop_cols + ["stop_pair"]
)

In [6]:
# An earlier outer join showed that every row merged, so an inner join loses nothing
test = pd.merge(
    df,
    existing_df,
    on = trip_stop_cols,
    how = "inner",
)

In [7]:
def subset_to_trip(df: pd.DataFrame, one_trip: str):
    """
    Narrow down to 1 trip, see where differences in stop_pair are
    in the extra sorting vs unsorted version.
    Check the stop_sequence-stop_id to see what it should be.
    """
    cols = [
        "trip_instance_key", "stop_sequence", "stop_id", 
        "stop_pair_x", "stop_pair_y"
    ]
    
    df2 = df[
        (df.stop_pair_x != df.stop_pair_y) & 
        (df.trip_instance_key == one_trip)
    ][cols].sort_values(
        ["trip_instance_key", "stop_sequence"]
    ).drop_duplicates()
    
    display(df2.head(10))

The fact that multiple `stop_sequence==0` rows occur with different `stop_id` values shows that this is not the expected trip-stop grain. Look at the previous step, where scheduled stop times are prepared with a merge, to see where this could be introduced.

We want to use `trip_instance_key`, which is brought over from `trips`, so the duplication is definitely coming from the merge between `stop_times` and `trips`.
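As a sketch of how such a merge could fan out, the toy example below assumes one possible cause: `trip_id` repeating across feeds while the join uses `trip_id` alone. The `feed_key` column, the frames, and the cause itself are assumptions for illustration; the actual merge in `prep_scheduled_stop_times` may differ. The `validate` argument of `pd.merge` is a real pandas option that raises when the expected cardinality does not hold.

```python
import pandas as pd

# Toy stop_times: two feeds reuse the same trip_id (hypothetical scenario)
stop_times = pd.DataFrame({
    "feed_key": ["f1", "f1", "f2", "f2"],
    "trip_id": ["t1", "t1", "t1", "t1"],
    "stop_sequence": [0, 1, 0, 1],
    "stop_id": ["s1", "s2", "s8", "s9"],
})
trips = pd.DataFrame({
    "feed_key": ["f1", "f2"],
    "trip_id": ["t1", "t1"],
    "trip_instance_key": ["key_a", "key_b"],
})

# Joining on trip_id alone: every stop_times row matches both trips rows
loose = pd.merge(
    stop_times, trips[["trip_id", "trip_instance_key"]],
    on="trip_id", how="inner"
)
rows_per_key = loose.groupby(["trip_instance_key", "stop_sequence"]).size()

# validate="many_to_one" requires unique keys on the right side and raises otherwise
try:
    pd.merge(
        stop_times, trips[["trip_id", "trip_instance_key"]],
        on="trip_id", how="inner", validate="many_to_one"
    )
    fanout_detected = False
except pd.errors.MergeError:
    fanout_detected = True

# Joining on the full key keeps the trip-stop grain
strict = pd.merge(
    stop_times, trips,
    on=["feed_key", "trip_id"], how="inner", validate="many_to_one"
)
```

Here `loose` has 8 rows, with two different `stop_id` values per `trip_instance_key` and `stop_sequence`, while `strict` keeps the original 4 rows. Adding `validate` to the real merge would be a cheap way to confirm or rule out this kind of fan-out.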

In [8]:
subset_to_trip(test, "000ca2403969d49c6a12f0a788b2f68b")

Unnamed: 0,trip_instance_key,stop_sequence,stop_id,stop_pair_x,stop_pair_y
767,000ca2403969d49c6a12f0a788b2f68b,0,7698054,7698054__5859600,5859600__3151246
770,000ca2403969d49c6a12f0a788b2f68b,0,7698054,5859600__3151246,7698054__5859600
775,000ca2403969d49c6a12f0a788b2f68b,0,5859600,7698054__5859600,5859600__3151246
778,000ca2403969d49c6a12f0a788b2f68b,0,5859600,5859600__3151246,7698054__5859600
783,000ca2403969d49c6a12f0a788b2f68b,1,3151246,3151246__3151223,3151223__3151247
786,000ca2403969d49c6a12f0a788b2f68b,1,3151246,3151223__3151247,3151246__3151223
791,000ca2403969d49c6a12f0a788b2f68b,1,3151223,3151246__3151223,3151223__3151247
794,000ca2403969d49c6a12f0a788b2f68b,1,3151223,3151223__3151247,3151246__3151223
799,000ca2403969d49c6a12f0a788b2f68b,2,3151247,3151247__3151224,3151224__3151225
802,000ca2403969d49c6a12f0a788b2f68b,2,3151247,3151224__3151225,3151247__3151224
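
The output above, where the same key carries both orderings of `stop_pair_x` and `stop_pair_y`, is what a many-to-many merge would produce. The toy sketch below (made-up frames, not the real data) merges a table with duplicated keys against an identical copy of itself and still gets "mismatches", which suggests the differences may be an artifact of the duplicated keys instead of a real difference between the two versions.

```python
import pandas as pd

keys = ["trip_instance_key", "stop_sequence"]

# Two rows share the same trip-stop key (hypothetical values)
left = pd.DataFrame({
    "trip_instance_key": ["a", "a"],
    "stop_sequence": [0, 0],
    "stop_pair": ["s1__s2", "s2__s3"],
})
right = left.copy()  # identical content

# Each left row matches both right rows: 2 x 2 = 4 rows
merged = pd.merge(left, right, on=keys, how="inner")

# Half of the combinations pair different stop_pair values
mismatched = merged[merged.stop_pair_x != merged.stop_pair_y]

print(merged)
```

Even though `left` and `right` are identical, 2 of the 4 merged rows differ in `stop_pair`.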


In [10]:
# Mismatches are rare (816 of ~4.3M rows), but still present with the extra sorting
df.shape, test[test.stop_pair_x != test.stop_pair_y].shape

((4327233, 12), (816, 13))
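
A possible follow-up is to check whether every mismatched key is also a duplicated key, and how many trips are involved. The sketch below runs on a made-up stand-in for `test` (values are hypothetical); the same logic could be applied to the real merged frame.

```python
import pandas as pd

keys = ["trip_instance_key", "stop_sequence"]

# Toy stand-in for the merged `test` frame
test = pd.DataFrame({
    "trip_instance_key": ["a", "a", "a", "a", "b"],
    "stop_sequence": [0, 0, 0, 0, 0],
    "stop_pair_x": ["s1__s2", "s1__s2", "s2__s3", "s2__s3", "s5__s6"],
    "stop_pair_y": ["s1__s2", "s2__s3", "s1__s2", "s2__s3", "s5__s6"],
})

# Keys with at least one stop_pair mismatch
mismatch_keys = test.loc[
    test.stop_pair_x != test.stop_pair_y, keys
].drop_duplicates()

# Keys that appear more than once
dup_keys = test.loc[
    test.duplicated(subset=keys, keep=False), keys
].drop_duplicates()

# indicator=True adds a _merge column showing where each key was found
check = mismatch_keys.merge(dup_keys, on=keys, how="left", indicator=True)
all_explained = (check["_merge"] == "both").all()
n_trips = mismatch_keys.trip_instance_key.nunique()
```

If `all_explained` is true on the real data, the mismatches are confined to keys that break the trip-stop grain, and `n_trips` gives the number of trips to trace back through the `stop_times` and `trips` merge.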