# Feb 2025 `rt_stop_times` checks for Foothill Transit

An upstream table changed. To make sure that `mart_gtfs.fct_vehicle_locations_grouped` is doing what we want, let's check one operator and move through each stage of the pipeline to see if the old and new outputs are similar enough.

Use the `rt_stop_times` segment type because it is the most straightforward: it is at trip-stop grain. If this one works, the other segment types are either aggregations or more granular grains.

## `rt_segment_speeds`
* `stage2`: the 2 nearest vp selected for each stop
* `stage3`: stop arrivals calculated
* `stage4`: speeds for segments

## `gtfs_funnel`
* `raw_vp`: either our new `fct_vehicle_locations_grouped` or `fct_vehicle_locations`
* `vp_dwell`: the vp after grouping by location and removing trips that are too short; has `vp_idx` and `vp_primary_direction`
* `vp_condensed`: vp path

In [1]:
import geopandas as gpd
import pandas as pd

from shared_utils import rt_dates
from segment_speed_utils.project_vars import GTFS_DATA_DICT, SEGMENT_GCS, RT_SCHED_GCS
from shared_utils import geo_utils

analysis_date =  rt_dates.DATES["jan2025"]
segment_type = "rt_stop_times"
dict_inputs = GTFS_DATA_DICT.rt_stop_times

## `gtfs_funnel` outputs

Select just Foothill trips

In [2]:
MART_VP = GTFS_DATA_DICT.speeds_tables.raw_vp
VP_DWELL = GTFS_DATA_DICT.speeds_tables.vp_dwell

subset_trips = pd.read_parquet(
    f"{SEGMENT_GCS}{MART_VP}_{analysis_date}.parquet",
    filters = [[("gtfs_dataset_name", "==", "Foothill Vehicle Positions")]],
    columns = ["trip_instance_key"]
).trip_instance_key.unique().tolist()

## stage2 nearest vp

Check that all rows merge in -- this is ok.

Check that all rows have similar results -- this is not ok.
* for the rows that differ, why is it selecting way wider bounds for each stop than before?
* nothing to do with nearest neighbor, it has to do with what vp is going into nearest neighbor.

In [3]:
STAGE2_FILE = dict_inputs["stage2"]

nearest1 = pd.read_parquet(
    f"{SEGMENT_GCS}{STAGE2_FILE}_{analysis_date}.parquet",
    filters = [[("trip_instance_key", "in", subset_trips)]]
)

nearest2 = pd.read_parquet(
    f"{SEGMENT_GCS}{STAGE2_FILE}_{analysis_date}_new.parquet",
    filters = [[("trip_instance_key", "in", subset_trips)]]
)

In [4]:
pd.merge(
    nearest1,
    nearest2,
    on = ["trip_instance_key", "stop_sequence", "shape_array_key", "stop_meters"],
    how = "outer",
    indicator=True
)._merge.value_counts()

both          85126
left_only         0
right_only        0
Name: _merge, dtype: int64

In [5]:
# outer join showed everything merged
nearest = pd.merge(
    nearest1,
    nearest2,
    on = ["trip_instance_key", "stop_sequence", "shape_array_key", "stop_meters"],
    how = "inner",
)

In [6]:
one_trip = "0017fe7b1802eafc4328a9c5bce307a9"

In [7]:
nearest[nearest.trip_instance_key==one_trip].sort_values("stop_sequence").head(10)

| | trip_instance_key | stop_sequence | shape_array_key | stop_meters | prior_vp_idx_x | subseq_vp_idx_x | prior_vp_meters_x | subseq_vp_meters_x | prior_vp_idx_y | subseq_vp_idx_y | prior_vp_meters_y | subseq_vp_meters_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0017fe7b1802eafc4328a9c5bce307a9 | 0 | 973dd94584238465f00537af4f12a18c | 5.779608 | 14330179 | 14330180 | 0.0 | 159 | 5144 | 5145 | 0.0 | 159 |
| 1 | 0017fe7b1802eafc4328a9c5bce307a9 | 320 | 973dd94584238465f00537af4f12a18c | 1590.082823 | 14330189 | 14330192 | 1315.890091 | 1595 | 5144 | 5154 | 0.0 | 1595 |
| 2 | 0017fe7b1802eafc4328a9c5bce307a9 | 456 | 973dd94584238465f00537af4f12a18c | 2254.21347 | 14330191 | 14330194 | 1595.668796 | 2357 | 5153 | 5156 | 1585.538802 | 2357 |
| 3 | 0017fe7b1802eafc4328a9c5bce307a9 | 604 | 973dd94584238465f00537af4f12a18c | 2979.885552 | 14330195 | 14330196 | 2725.992529 | 2987 | 5157 | 5158 | 2725.992529 | 2987 |
| 4 | 0017fe7b1802eafc4328a9c5bce307a9 | 682 | 973dd94584238465f00537af4f12a18c | 3360.166537 | 14330198 | 14330200 | 3343.692017 | 3430 | 5160 | 5162 | 3343.692017 | 3430 |
| 5 | 0017fe7b1802eafc4328a9c5bce307a9 | 716 | 973dd94584238465f00537af4f12a18c | 3526.200195 | 14330198 | 14330202 | 3343.692017 | 3582 | 5160 | 5164 | 3343.692017 | 3582 |
| 6 | 0017fe7b1802eafc4328a9c5bce307a9 | 768 | 973dd94584238465f00537af4f12a18c | 3778.09423 | 14330199 | 14330203 | 3353.877204 | 3819 | 5161 | 5165 | 3353.877204 | 3819 |
| 7 | 0017fe7b1802eafc4328a9c5bce307a9 | 856 | 973dd94584238465f00537af4f12a18c | 4209.194224 | 14330203 | 14330204 | 3819.644371 | 4288 | 5165 | 5166 | 3819.644371 | 4288 |
| 8 | 0017fe7b1802eafc4328a9c5bce307a9 | 924 | 973dd94584238465f00537af4f12a18c | 4543.175175 | 14330204 | 14330206 | 4288.759516 | 4626 | 5166 | 5168 | 4288.759516 | 4626 |
| 9 | 0017fe7b1802eafc4328a9c5bce307a9 | 1020 | 973dd94584238465f00537af4f12a18c | 5014.681295 | 14330206 | -1 | 4626.128743 | 0 | 5169 | 5172 | 4816.520697 | 5053 |


In [8]:
nearest[(nearest.prior_vp_meters_x != nearest.prior_vp_meters_y) & 
       (nearest.subseq_vp_meters_x != nearest.subseq_vp_meters_y) & 
       (nearest.trip_instance_key==one_trip)][
    ["trip_instance_key", "stop_sequence", "stop_meters", 
     "prior_vp_meters_x", "prior_vp_meters_y", 
     "subseq_vp_meters_x", "subseq_vp_meters_y"
    ]]

| | trip_instance_key | stop_sequence | stop_meters | prior_vp_meters_x | prior_vp_meters_y | subseq_vp_meters_x | subseq_vp_meters_y |
|---|---|---|---|---|---|---|---|
| 9 | 0017fe7b1802eafc4328a9c5bce307a9 | 1020 | 5014.681295 | 4626.128743 | 4816.520697 | 0 | 5053 |
| 10 | 0017fe7b1802eafc4328a9c5bce307a9 | 1058 | 5200.647276 | 4626.128743 | 4816.520697 | 0 | 5311 |
| 22 | 0017fe7b1802eafc4328a9c5bce307a9 | 2277 | 11136.298364 | 10432.690629 | 10521.833162 | 0 | 11256 |
| 23 | 0017fe7b1802eafc4328a9c5bce307a9 | 2354 | 11511.205733 | 10449.900874 | 10896.041813 | 11950 | 11599 |
| 34 | 0017fe7b1802eafc4328a9c5bce307a9 | 3420 | 17165.925011 | 16712.023993 | 17138.727041 | 0 | 17182 |
| 35 | 0017fe7b1802eafc4328a9c5bce307a9 | 3469 | 17435.074659 | 16857.072406 | 17182.363291 | 17916 | 17441 |
| 53 | 0017fe7b1802eafc4328a9c5bce307a9 | 5388 | 29048.332982 | 28144.635666 | 28872.443739 | 0 | 29102 |
| 55 | 0017fe7b1802eafc4328a9c5bce307a9 | 5475 | 29534.76264 | 28144.635666 | 29043.723521 | 0 | 29591 |
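
To put a number on how much the bounds differ, one option is to compare the distance between the prior and subsequent vp in the old (`_x`) and new (`_y`) outputs. The sketch below is only an illustration: the three rows are copied from the one-trip outputs above, `add_bound_widths` is a hypothetical helper, and it assumes that a `subseq_vp_meters` smaller than `prior_vp_meters` (the 0 that appears alongside `subseq_vp_idx_x == -1`) means no subsequent vp was found.

```python
import pandas as pd

# Rows copied from the one-trip outputs above (stop_sequence 320, 1020, 2354).
toy = pd.DataFrame({
    "stop_sequence": [320, 1020, 2354],
    "stop_meters": [1590.082823, 5014.681295, 11511.205733],
    "prior_vp_meters_x": [1315.890091, 4626.128743, 10449.900874],
    "subseq_vp_meters_x": [1595, 0, 11950],
    "prior_vp_meters_y": [0.0, 4816.520697, 10896.041813],
    "subseq_vp_meters_y": [1595, 5053, 11599],
})


def add_bound_widths(df: pd.DataFrame) -> pd.DataFrame:
    """Add the prior-to-subsequent distance for old (_x) and new (_y) outputs."""
    out = df.copy()
    for suffix in ["x", "y"]:
        prior = out[f"prior_vp_meters_{suffix}"]
        subseq = out[f"subseq_vp_meters_{suffix}"]
        # Assumption: subseq < prior is a "no subsequent vp" sentinel, so the
        # width is left as NaN rather than reported as a negative number.
        out[f"bound_width_{suffix}"] = (subseq - prior).where(subseq >= prior)
    # Positive means the new bounds are wider, negative means narrower.
    out["width_change"] = out["bound_width_y"] - out["bound_width_x"]
    return out


widths = add_bound_widths(toy)
```

On the full merged frame, `add_bound_widths(nearest).width_change.describe()` would show whether the new bounds are wider or narrower overall, and how many rows had no usable old bound.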


## `fct_vehicle_locations_grouped`

Check that `vp_primary_direction` is correctly constructed -- this is ok.

Check that `vp_idx` is increasing and follows the direction of travel on the map -- this seems ok.

Check `n_vp`, where are the points getting grouped? -- can't tell at the endpoints yet.

In [9]:
vp = pd.read_parquet(
    f"{SEGMENT_GCS}{VP_DWELL}_{analysis_date}.parquet",
    columns = ["trip_instance_key", "vp_idx", "n_vp", "vp_primary_direction", "x", "y"],
    filters = [[("trip_instance_key", "==", one_trip)]]
).pipe(geo_utils.vp_as_gdf)

In [10]:
vp.explore(
    "vp_idx", tiles = "CartoDB Positron"
)

In [11]:
vp.explore(
    "vp_primary_direction", tiles = "CartoDB Positron"
)

In [12]:
vp.explore(
    "n_vp", tiles = "CartoDB Positron", categorical=True
)
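
Since the endpoints are hard to read off the map, a tabular look at `n_vp` for the first and last few points of the trip may help. This is a minimal sketch: `endpoint_n_vp` is a hypothetical helper, the toy values are made up for illustration, and it assumes that sorting by `vp_idx` gives the travel order.

```python
import pandas as pd


def endpoint_n_vp(vp: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Return n_vp for the first and last n points of a trip, ordered by vp_idx."""
    ordered = vp.sort_values("vp_idx")
    start = ordered.head(n).assign(position="start")
    end = ordered.tail(n).assign(position="end")
    return pd.concat([start, end], ignore_index=True)[["position", "vp_idx", "n_vp"]]


# Made-up values, only to show the shape of the result.
toy_vp = pd.DataFrame({
    "vp_idx": [5150, 5144, 5145, 5146, 5147, 5148, 5149],
    "n_vp": [6, 4, 1, 1, 2, 1, 1],
})

endpoints = endpoint_n_vp(toy_vp)
```

With the real data this would be `endpoint_n_vp(vp)`; large `n_vp` values at the start or end would suggest the vehicle was dwelling at a terminal.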

## `stage3`: stop arrivals

In [13]:
STAGE3_FILE = dict_inputs["stage3"]

arrivals1 = pd.read_parquet(
    f"{SEGMENT_GCS}{STAGE3_FILE}_{analysis_date}.parquet",
    filters = [[("trip_instance_key", "in", subset_trips)]]
)

arrivals2 = pd.read_parquet(
    f"{SEGMENT_GCS}{STAGE3_FILE}_{analysis_date}_new.parquet",
    filters = [[("trip_instance_key", "in", subset_trips)]]
)

The arrivals datasets do not fully merge, even though we used the bare minimum of merge columns.

Each dataset has rows that are missing from the other.

Check `stop_times_direction` to make sure that these stops are actually valid for the trip -- this seems ok. We do want all of these stops, so the fact that they are not merging means we should check why.

In [14]:
pd.merge(
    arrivals1,
    arrivals2,
    on = ["trip_instance_key", "stop_sequence", "shape_array_key"],
    how = "outer",
    indicator=True
)._merge.value_counts()

both          50388
left_only     14907
right_only    10302
Name: _merge, dtype: int64
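
To see whether the unmerged rows are concentrated in a few trips, the same outer merge can be summarized per trip. The sketch below uses toy frames with made-up trip keys; `merge_status_by_trip` is a hypothetical helper, and it assumes `trip_instance_key` is one of the merge keys.

```python
import pandas as pd


def merge_status_by_trip(left: pd.DataFrame, right: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Count both / left_only / right_only rows per trip for an outer merge on keys."""
    merged = pd.merge(left[keys], right[keys], on=keys, how="outer", indicator=True)
    # _merge is categorical; cast to str so the groupby only counts observed values.
    merged["_merge"] = merged["_merge"].astype(str)
    counts = (
        merged.groupby(["trip_instance_key", "_merge"])
        .size()
        .unstack(fill_value=0)
        .reindex(columns=["both", "left_only", "right_only"], fill_value=0)
    )
    # Trips with the most unmerged rows come first.
    return counts.sort_values(["left_only", "right_only"], ascending=False)


keys = ["trip_instance_key", "stop_sequence"]

# Made-up trips: trip_a differs between old and new, trip_b matches exactly.
old = pd.DataFrame({
    "trip_instance_key": ["trip_a"] * 3 + ["trip_b"] * 2,
    "stop_sequence": [0, 10, 20, 0, 10],
})
new = pd.DataFrame({
    "trip_instance_key": ["trip_a"] * 3 + ["trip_b"] * 2,
    "stop_sequence": [0, 10, 30, 0, 10],
})

status = merge_status_by_trip(old, new, keys)
```

For the real data the call would be `merge_status_by_trip(arrivals1, arrivals2, ["trip_instance_key", "stop_sequence", "shape_array_key"])`.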

In [15]:
arrivals_unmerged = pd.merge(
    arrivals1,
    arrivals2,
    on = ["trip_instance_key", "stop_sequence", "shape_array_key"],
    how = "outer",
    indicator=True
).query('_merge!="both"')

In [16]:
left_arrivals = arrivals_unmerged[arrivals_unmerged._merge=="left_only"][
    ["trip_instance_key", "stop_sequence", "shape_array_key"]]

right_arrivals = arrivals_unmerged[arrivals_unmerged._merge=="right_only"][
    ["trip_instance_key", "stop_sequence", "shape_array_key"]]

In [17]:
list(set(left_arrivals.trip_instance_key).intersection(
    set(right_arrivals.trip_instance_key)))[:10]

['fbde984626e4441fe13c3b6723f15f8f',
 'e3689f71912881aba3194cb7f0c7a207',
 '4611d526351e1afe382c095d53cd790f',
 '4caaf9662535c2bf48dff56ff1fdf87b',
 'c3198b0f38654dbdae222f9de3e7d60a',
 'e103b3b5fe66c4280176e8599e017abe',
 'b03dc4c5e8083bb9ea040f4901349260',
 'bed447f9298ffe4cf5fd6e0137f081b1',
 '8ded30d0067c85570b2064daae1eacd7',
 'e3e48ae76b52aff5bb28e4f92fe4a2f1']

## `stop_times_direction`

Which stops are present, and why are they not all found in the stage3 output?

In [18]:
ST_FILE = GTFS_DATA_DICT.rt_vs_schedule_tables.stop_times_direction

stop_times = gpd.read_parquet(
    f"{RT_SCHED_GCS}{ST_FILE}_{analysis_date}.parquet",
    filters = [[("trip_instance_key", "in", subset_trips)]]
)

In [19]:
stop_times[stop_times.trip_instance_key==one_trip].stop_sequence.unique()

array([   0,  320,  456,  604,  682,  716,  768,  856,  924, 1020, 1058,
       1203, 1273, 1390, 1524, 1601, 1692, 1773, 1904, 2018, 2130, 2204,
       2277, 2354, 2460, 2572, 2683, 2779, 2861, 2990, 3065, 3140, 3221,
       3328, 3420, 3469, 3601, 3729, 3917, 4022, 4126, 4200, 4417, 4476,
       4532, 4818, 4888, 4966, 5033, 5100, 5164, 5213, 5297, 5388, 5418,
       5475, 5582, 5635, 5709, 5754, 5812, 6000])

In [20]:
left_arrivals[
    (left_arrivals.trip_instance_key==one_trip) 
].stop_sequence.unique()

array([], dtype=int64)

In [21]:
right_arrivals[right_arrivals.trip_instance_key==one_trip].stop_sequence.unique()

array([1020, 1058, 2204, 2277, 3420, 5388, 5475])
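
For this trip, the `right_only` stops overlap heavily with the stage2 rows shown earlier where the old output had `subseq_vp_meters_x` equal to 0. A quick set comparison, using only values copied from the outputs in this notebook, makes the overlap explicit. This is a check for one trip with illustrative variable names; it does not show that the same pattern holds for other trips.

```python
# stop_sequence values copied from the stage2 comparison for one_trip,
# limited to rows where the old output had subseq_vp_meters_x == 0.
stage2_old_missing_subseq = {1020, 1058, 2277, 3420, 5388, 5475}

# stop_sequence values copied from the right_only arrivals for one_trip.
stage3_right_only = {1020, 1058, 2204, 2277, 3420, 5388, 5475}

# Stops that are only in the new arrivals AND had no old subsequent vp.
explained = sorted(stage3_right_only & stage2_old_missing_subseq)

# Stops only in the new arrivals that the stage2 rows shown do not account for.
unexplained = sorted(stage3_right_only - stage2_old_missing_subseq)

print("explained:", explained)
print("unexplained:", unexplained)
```

Stop `2204` is not among the stage2 rows shown, but that filter only kept rows where both the prior and subsequent meters differ, so it would need a separate look.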