# Stop groupings

**Goal:** Find a reliable grouping of GTFS columns that identifies a stop across trips, but within a route-direction.

* How many combinations should we expect?
* `stop_id` is associated with a point geometry, but a given stop can have multiple `stop_sequence` values within the same trip (loops, first and last stops both at a transit center, etc.).
* The `gtfs_segments` package uses `stop_id1` and `stop_id2`, which are combined into `stop_pair` and `stop_pair_name`. This lets us see a stop from the perspective of a segment.
   * A segment runs between 2 stops, so instead of saying Stop A, we say the segment between Stop A-B (the bus goes from A to B).
   * Stop A-C (on a different trip, the bus goes from Stop A to C and doesn't stop at B) would also need to be uniquely identified.
   * For loopy trips, if a bus starts and ends at a transit center, we would have one segment tagged as Stop A-B and another as Stop Z-A (from Stop Z back to A).
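
To make the segment idea concrete, here is a minimal sketch in plain Python. The `make_stop_pairs` helper and the toy trips are hypothetical and are not the `gtfs_segments` implementation. The `__` separator mirrors the `stop_pair` values that appear in the outputs of this notebook, and pairing the last stop with an empty string is an assumption based on values like `497__`.

```python
def make_stop_pairs(stop_ids: list[str], sep: str = "__") -> list[str]:
    """Pair each stop with the next stop visited on the same trip.

    Hypothetical helper for illustration only.
    Assumption: the last stop has no next stop, so it pairs with "".
    """
    next_stops = stop_ids[1:] + [""]
    return [f"{a}{sep}{b}" for a, b in zip(stop_ids, next_stops)]


# Trip 1 serves every stop, trip 2 skips B, trip 3 loops back to A.
local_trip = make_stop_pairs(["A", "B", "C"])
express_trip = make_stop_pairs(["A", "C"])
loop_trip = make_stop_pairs(["A", "B", "Z", "A"])

print(local_trip)    # ['A__B', 'B__C', 'C__']
print(express_trip)  # ['A__C', 'C__']
print(loop_trip)     # ['A__B', 'B__Z', 'Z__A', 'A__']
```

Across these three toy trips, Stop A shows up as `A__B`, `A__C`, and `A__`: one `stop_id`, three different segment contexts.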

## Related notebooks
* In [route id changes](./08_route_id_changes.ipynb), even something as aggregated as `route_id` changes within roughly 6 months.
* Intra-day shape-stop variation is present. Across shapes, if we try to cut segments that are stop-to-stop, we end up with more segments at the route-direction level. One way to handle this variation is to select a common shape for that route-direction and ask where the vehicle position (vp) was at each stop position along it.
   * [shape-stop segments](../rt_segment_speeds/16_stop_combos_for_segments.ipynb)
   * [select stop segments](../rt_segment_speeds/scripts/select_stop_segments.py)
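
One plausible reading of "select a common shape" is to keep, for each route-direction, the `shape_id` used by the most trips. The sketch below assumes that reading. The toy records and the `most_common_shape` helper are hypothetical and are not taken from `select_stop_segments.py`.

```python
from collections import Counter, defaultdict

# Toy trips as (route_id, direction_id, shape_id). Values are made up.
toy_trips = [
    ("1", 0, "shape_a"),
    ("1", 0, "shape_a"),
    ("1", 0, "shape_b"),
    ("1", 1, "shape_c"),
]


def most_common_shape(trip_rows):
    """Return {(route_id, direction_id): shape_id used by the most trips}."""
    counts = defaultdict(Counter)
    for route_id, direction_id, shape_id in trip_rows:
        counts[(route_id, direction_id)][shape_id] += 1
    # Counter.most_common(1) returns [(shape_id, n_trips)]
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


common_shapes = most_common_shape(toy_trips)
print(common_shapes)  # {('1', 0): 'shape_a', ('1', 1): 'shape_c'}
```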

In [1]:
import pandas as pd

from segment_speed_utils import helpers
from shared_utils import rt_dates

analysis_date = rt_dates.DATES["apr2024"]

In [2]:
trips = helpers.import_scheduled_trips(
    analysis_date,
    columns = ["name", "trip_instance_key", "route_id", "direction_id"],
    get_pandas = True
)
    
stop_times_with_direction = helpers.import_scheduled_stop_times(
    analysis_date,
    get_pandas = True,
    with_direction=True
).drop(columns = ["stop_meters", "stop_primary_direction"])

In [3]:
df = pd.merge(
    stop_times_with_direction,
    trips,
    on = "trip_instance_key"
)

In [4]:
route_dir_cols = ["schedule_gtfs_dataset_key", "name", 
                  "route_id", "direction_id"]

stop_cols = ["schedule_gtfs_dataset_key", "stop_id"]

# Count distinct stop_ids, stop_sequences, and stop_pairs per route-direction
df2 = (df
       .groupby(route_dir_cols)
       .agg({
           "stop_id": "nunique",
           "stop_sequence": "nunique",
           "stop_pair": "nunique",
       }).reset_index()
      )

In [5]:
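# Route-directions where the number of stop_sequences and stop_pairs differ
subset = df2[df2.stop_sequence != df2.stop_pair]

print(subset.shape[0])
# Fewer stop_sequences than stop_pairs
print(subset[subset.stop_sequence < subset.stop_pair].shape[0])
# More stop_sequences than stop_pairs
print(subset[subset.stop_sequence > subset.stop_pair].shape[0])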

1309
1169
140


In [6]:
# Route-directions where the number of stop_sequences and stop_pairs differ
route_dir_differences = df2[
    df2.stop_sequence != df2.stop_pair
][route_dir_cols]

# Within those route-directions, keep the stops whose counts also differ
stop_differences = (
    pd.merge(
        df,
        route_dir_differences,
        on = route_dir_cols
    ).groupby(stop_cols)
    .agg({
        "stop_sequence": "nunique",
        "stop_pair": "nunique"
    }).reset_index()
    .query('stop_sequence != stop_pair')
)[stop_cols]

In [7]:
stop_cols2 = ["stop_id", "stop_sequence", "stop_pair", "stop_pair_name"]

# One row per route-direction, stop_id, stop_sequence, and stop_pair combination
df_differences = (
    pd.merge(
        df,
        stop_differences,
        on = stop_cols
    ).drop_duplicates(
        subset = route_dir_cols + stop_cols2
    )[route_dir_cols + stop_cols2]
    .sort_values(["name", "route_id", "direction_id", "stop_id"])
    .reset_index(drop=True)
)

**Finding: in about 23% of the route-direction-`stop_id` combinations examined here, too many stop sequences (3 or more) are present.**

The counts below show how many `stop_sequence` values are present for a given `stop_id`. If we assume that a stop can be revisited within the same trip, then values of 1 and 2 are ok. Values of 3 or more need a closer look.

In [8]:
(df_differences.groupby(["name", "route_id", "direction_id", "stop_id"])
 .agg({"stop_sequence": "count"})
 .reset_index()
 .stop_sequence
 .value_counts()
)

2     15326
1     14478
3      4588
4      1520
5       498
6       322
12      281
7       279
10      209
8       175
9       172
11      159
14      142
13      137
16       65
15       49
19       47
25       40
23       36
20       28
18       23
29       17
17       12
33        9
21        7
28        5
27        1
26        1
24        1
Name: stop_sequence, dtype: int64

In [9]:
(df_differences.groupby(["name", "route_id", "direction_id", "stop_id"])
 .agg({"stop_sequence": "count"})
 .reset_index()
 .stop_sequence
 .value_counts(normalize=True)
)

2     0.396769
1     0.374816
3     0.118777
4     0.039351
5     0.012893
6     0.008336
12    0.007275
7     0.007223
10    0.005411
8     0.004531
9     0.004453
11    0.004116
14    0.003676
13    0.003547
16    0.001683
15    0.001269
19    0.001217
25    0.001036
23    0.000932
20    0.000725
18    0.000595
29    0.000440
17    0.000311
33    0.000233
21    0.000181
28    0.000129
27    0.000026
26    0.000026
24    0.000026
Name: stop_sequence, dtype: float64
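
The share of groups with 3 or more rows can be recomputed directly from the raw counts printed above. This is only arithmetic on those printed values, and the variable names are illustrative.

```python
# Counts copied from the value_counts() output above:
# {rows per route-direction-stop_id group: number of groups}
counts = {
    2: 15326, 1: 14478, 3: 4588, 4: 1520, 5: 498, 6: 322, 12: 281,
    7: 279, 10: 209, 8: 175, 9: 172, 11: 159, 14: 142, 13: 137,
    16: 65, 15: 49, 19: 47, 25: 40, 23: 36, 20: 28, 18: 23, 29: 17,
    17: 12, 33: 9, 21: 7, 28: 5, 27: 1, 26: 1, 24: 1,
}

total = sum(counts.values())
n_3_plus = sum(n for rows, n in counts.items() if rows >= 3)
share_3_plus = n_3_plus / total

print(total)                   # 38627
print(n_3_plus)                # 8823
print(round(share_3_plus, 3))  # 0.228
```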

In [10]:
def select_examples(df: pd.DataFrame, dupes: int) -> pd.DataFrame:
    """
    Keep the rows for route-direction-stop_id groups
    that have exactly `dupes` rows in df.
    """
    stop_cols_list = ["name", "route_id", "direction_id", "stop_id"]
    
    # Groups with exactly `dupes` rows
    keep_stops = (df.groupby(stop_cols_list)
         .agg({"stop_sequence": "count"})
         .reset_index()
         .query(f'stop_sequence=={dupes}')
        )[stop_cols_list]
    
    examples = pd.merge(
        df,
        keep_stops,
        on = stop_cols_list
    ).sort_values(stop_cols_list).reset_index(drop=True)
    
    return examples

In [11]:
select_examples(df_differences, 3)

| | schedule_gtfs_dataset_key | name | route_id | direction_id | stop_id | stop_sequence | stop_pair | stop_pair_name |
|---|---|---|---|---|---|---|---|---|
| 0 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 0.0 | 255 | 420 | 255__256 | 10th St. W. & Ave. J__10th St. W. & Ave. J-4 |
| 1 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 0.0 | 255 | 480 | 255__256 | 10th St. W. & Ave. J__10th St. W. & Ave. J-4 |
| 2 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 0.0 | 255 | 540 | 255__256 | 10th St. W. & Ave. J__10th St. W. & Ave. J-4 |
| 3 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 1.0 | 224 | 600 | 224__225 | Palmdale Blvd. & 20th St. E.__Palmdale Blvd. &... |
| 4 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 1.0 | 224 | 720 | 224__225 | Palmdale Blvd. & 20th St. E.__Palmdale Blvd. &... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13759 | 0bcba4ddc5c10546f2e957a74f58b8ac | Yuba-Sutter Schedule | 473 | 1.0 | 14925 | 10 | 14925__14926 | Bogue Rd P&R__Walton Ave at Sunsweet |
| 13760 | 0bcba4ddc5c10546f2e957a74f58b8ac | Yuba-Sutter Schedule | 473 | 1.0 | 14925 | 3 | 14925__12404 | Bogue Rd P&R__J & 4th St |
| 13761 | 0bcba4ddc5c10546f2e957a74f58b8ac | Yuba-Sutter Schedule | 6463 | 1.0 | 12172 | 2 | 12172__2371244 | Alturas & Shasta St.__Ash St. & Hwy 99 |
| 13762 | 0bcba4ddc5c10546f2e957a74f58b8ac | Yuba-Sutter Schedule | 6463 | 1.0 | 12172 | 9 | 12172__758939 | Alturas & Shasta St.__Gov't Center I St & 9th St |


In [12]:
select_examples(df_differences, 4)

| | schedule_gtfs_dataset_key | name | route_id | direction_id | stop_id | stop_sequence | stop_pair | stop_pair_name |
|---|---|---|---|---|---|---|---|---|
| 0 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 0.0 | 1009 | 960 | 1009__262 | Owen Memorial Park__10th St. W. & Ave. L |
| 1 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 0.0 | 1009 | 1140 | 1009__262 | Owen Memorial Park__10th St. W. & Ave. L |
| 2 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 0.0 | 1009 | 1020 | 1009__262 | Owen Memorial Park__10th St. W. & Ave. L |
| 3 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 0.0 | 1009 | 840 | 1009__262 | Owen Memorial Park__10th St. W. & Ave. L |
| 4 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 0.0 | 256 | 480 | 256__257 | 10th St. W. & Ave. J-4__10th St. W. & Ave. J-10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6075 | 1770249a5a2e770ca90628434d4934b1 | VCTC GMV Schedule | 4931 | 0.0 | 3496149 | 27 | 3496149__3492061 | Hillcrest Dr & Ventu Park Rd__Hillcrest Dr & L... |
| 6076 | 17712ec68e3869e3c53525426e38cadd | Yuma Schedule | 11 | 0.0 | 497 | 57 | 497__497 | William Brook Ave @ B Street__William Brook Av... |
| 6077 | 17712ec68e3869e3c53525426e38cadd | Yuma Schedule | 11 | 0.0 | 497 | 59 | 497__ | William Brook Ave @ B Street__ |
| 6078 | 17712ec68e3869e3c53525426e38cadd | Yuma Schedule | 11 | 0.0 | 497 | 59 | 497__173 | William Brook Ave @ B Street__Juan Sanchez Bou... |


In [13]:
select_examples(df_differences, 10)

| | schedule_gtfs_dataset_key | name | route_id | direction_id | stop_id | stop_sequence | stop_pair | stop_pair_name |
|---|---|---|---|---|---|---|---|---|
| 0 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 0.0 | 199 | 2580 | 199__200 | Palmdale Blvd. & 17th St. E.__Palmdale Blvd. &... |
| 1 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 0.0 | 199 | 2940 | 199__200 | Palmdale Blvd. & 17th St. E.__Palmdale Blvd. &... |
| 2 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 0.0 | 199 | 2700 | 199__200 | Palmdale Blvd. & 17th St. E.__Palmdale Blvd. &... |
| 3 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 0.0 | 199 | 3180 | 199__200 | Palmdale Blvd. & 17th St. E.__Palmdale Blvd. &... |
| 4 | e681c3a8dafa2c80e5b8e2cdd01f917a | Antelope Valley Transit Authority Schedule | 1 | 0.0 | 199 | 3120 | 199__200 | Palmdale Blvd. & 17th St. E.__Palmdale Blvd. &... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2085 | 40ead758629da2ad8a74dbc687652e5a | Norwalk Avail Schedule | 7 | 1.0 | 1432 | 2849 | 1432__1170 | Washington Boulevard and Putnam Street__Lamber... |
| 2086 | 40ead758629da2ad8a74dbc687652e5a | Norwalk Avail Schedule | 7 | 1.0 | 1432 | 2549 | 1432__1170 | Washington Boulevard and Putnam Street__Lamber... |
| 2087 | 40ead758629da2ad8a74dbc687652e5a | Norwalk Avail Schedule | 7 | 1.0 | 1432 | 2279 | 1432__1170 | Washington Boulevard and Putnam Street__Lamber... |
| 2088 | 40ead758629da2ad8a74dbc687652e5a | Norwalk Avail Schedule | 7 | 1.0 | 1432 | 2774 | 1432__1170 | Washington Boulevard and Putnam Street__Lamber... |


In [14]:
def get_combos(df: pd.DataFrame, route_dir_cols: list):
    """
    Print how many unique route-direction combinations exist for
    stop_id-stop_sequence, stop_id-stop_pair, and stop_id alone.
    """
    stop_id_seq = df[route_dir_cols + [
        "stop_id", "stop_sequence"]].drop_duplicates().shape[0]
    stop_id_pair = df[route_dir_cols + [
        "stop_id", "stop_pair"]].drop_duplicates().shape[0]
    
    n_stop_ids = df[route_dir_cols + ["stop_id"]].drop_duplicates().shape[0]
    
    print(f"stop_id-stop_sequence combos: {stop_id_seq}")
    print(f"stop_id-stop_pair combos: {stop_id_pair}")
    print(f"n unique stop_ids: {n_stop_ids}")


In [15]:
get_combos(df, route_dir_cols)

stop_id-stop_sequence combos: 194537
stop_id-stop_pair combos: 133261
n unique stop_ids: 130190
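
One way to compare the two groupings is to divide each combination count by the number of route-direction `stop_id`s. The sketch below is arithmetic on the printed totals only, and the variable names are illustrative.

```python
# Totals copied from the get_combos() output above
stop_id_seq = 194537
stop_id_pair = 133261
n_stop_ids = 130190

seq_per_stop = round(stop_id_seq / n_stop_ids, 2)
pair_per_stop = round(stop_id_pair / n_stop_ids, 2)

print(seq_per_stop)   # 1.49
print(pair_per_stop)  # 1.02
```

On these totals, grouping by `stop_pair` produces close to one combination per `stop_id`, while grouping by `stop_sequence` produces about one and a half.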
