# Between-trip variation in stop_ids and stop_sequences for the same shape on rare occasions

When segments are missing, there may be 2 complicating factors we haven't yet considered in the workflow:
1. Stops on a looping route use a start position and a `distance` to capture the full length of the segment. This distance is a straight-line (as the crow flies) distance, not the distance traveled along the shape. When it goes wrong, it most likely cuts *too little* of the segment (ex: the roads form an L-shape, but the straight-line distance is the hypotenuse).
   * potential fix: add an extra check of the distance from the segment's endpoint to the next stop; if that distance is > 0, extend the segment to cover the gap.
1. Too many stops are used as cut-points for a shape, so there are some zero-length segments in between.
   * in `17_debug_empty_stop_segments`, there's a between-trip variation we're not accounting for. Within a trip, revisiting the same stop_id twice is ok. But between trips, it's possible for stop B to show up as sequence 1 in one trip and sequence 2 in another.
   Trip A: (A1, B2, C3). Trip B: (B1, C2). Stop B, in the df assembled for segmenting, shows up as both B1 and B2, and now we're cutting a zero-length segment in between.
   * the above example is a simple shift of where the bus began its trip. But if the variation happens in the middle of a trip, these switches could confound how the segment is cut, since we factor in the previous stop's location when we cut the segment.
   * potential fix: figure out a least-common-denominator approach for cutting segments, maybe by picking the trip with the most stops to stand in for that shape, and cutting segments from that trip only. Keeping it to 1 trip means stop_sequence increases monotonically.
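
For the first complicating factor above, the gap between the two distances can be shown with a toy L-shaped path. This is a minimal sketch in plain Python, not the project's segment-cutting code; the coordinates and the helper `along_path_length` are made up for illustration.

```python
import math

# Hypothetical L-shaped path: 3 units east, then 4 units north
path = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]


def along_path_length(points):
    """Sum the lengths of the straight pieces between consecutive points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


straight_line = math.dist(path[0], path[-1])  # crow-flies distance (the hypotenuse)
traveled = along_path_length(path)            # distance traveled along the shape
shortfall = traveled - straight_line          # length a straight-line cut would leave out

print(straight_line, traveled, shortfall)
```

Whenever the path bends, `shortfall` is positive, which is the case where the proposed endpoint-to-next-stop check would find a distance > 0.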
   
**Problem**: By taking into account all the trips for a shape, we're muddling what `stop_sequence` means: the same stop can carry a different `stop_sequence` in different trips.
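
A minimal sketch of the problem, using made-up rows rather than real `stop_times`: Trip A visits stops A, B, C and a hypothetical Trip B starts one stop later at B. Pooling both trips for the shape gives some stops more than one `stop_sequence`.

```python
from collections import defaultdict

# Hypothetical stop_times rows for two trips sharing one shape:
# (trip_id, stop_id, stop_sequence)
stop_times = [
    ("trip_A", "A", 1), ("trip_A", "B", 2), ("trip_A", "C", 3),
    ("trip_B", "B", 1), ("trip_B", "C", 2),
]

# Pool every trip's stop_sequence values by stop_id,
# as happens when all trips for a shape are combined
sequences_by_stop = defaultdict(set)
for trip_id, stop_id, stop_sequence in stop_times:
    sequences_by_stop[stop_id].add(stop_sequence)

# Stops with more than one stop_sequence become duplicate cut-points
ambiguous = {
    stop_id: sorted(seqs)
    for stop_id, seqs in sequences_by_stop.items()
    if len(seqs) > 1
}
print(ambiguous)  # {'B': [1, 2], 'C': [2, 3]}
```

Each stop listed in `ambiguous` would be used as a cut-point twice at the same location, which is where a zero-length segment comes from.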

Use this heuristic: for each shape, pick the trip with the most stops, breaking ties by taking the first `trip_id` in alphabetical order.
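
The heuristic can be sketched in plain Python as below. The shape and trip identifiers are hypothetical, and ties on the number of stops are assumed to be broken by ascending `trip_id`.

```python
from collections import defaultdict

# Hypothetical (shape_id, trip_id, n_stops) rows; all identifiers are made up
rows = [
    ("shape_1", "t_300", 12),
    ("shape_1", "t_100", 14),
    ("shape_1", "t_200", 14),
    ("shape_2", "t_050", 9),
]

trips_by_shape = defaultdict(list)
for shape_id, trip_id, n_stops in rows:
    trips_by_shape[shape_id].append((trip_id, n_stops))

# Most stops first (negated count), then trip_id ascending as the tie-breaker
representative = {
    shape_id: min(trips, key=lambda t: (-t[1], t[0]))[0]
    for shape_id, trips in trips_by_shape.items()
}
print(representative)  # {'shape_1': 't_100', 'shape_2': 't_050'}
```

Because only one trip stands in for each shape, its `stop_sequence` values increase monotonically along the shape.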

**Finding 1**: Between 96% and 99% of shapes have trips that all make exactly the same number of stops (looking at one service date per month, Mar-Jul 2023). The remaining shapes have trips with the same shape_id but a varying number of stops, usually 2 or 3 variants on the number of stops. It can go up to 6 variants. Hopefully, picking the trip with the most stops will cut the segments in a least-common-denominator fashion.

**Finding 2**: The stops of a trip with fewer stops are not always a subset of the stops of the trip with the most stops. But among the shapes with more than one stop count, under 1% of the trip variants have at least 1 stop that we'd be missing. This is infrequent enough that we should just go with the trip that contains the most stops, and use that trip's stops for segmenting the shape. As long as vehicle positions can be contained (somewhere on that shape), they should fall onto a segment, even if our segment becomes longer.
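
The subset check behind this finding can be sketched with toy data. The trips and stops below are hypothetical; the set difference has the same direction as the `missing_stops` column computed later in this notebook (stops in a trip that the trip with the most stops does not have).

```python
# Hypothetical stop lists for three trips on one shape
stops_by_trip = {
    "trip_long": ["A", "B", "C", "D", "E"],
    "trip_short": ["B", "C", "D"],
    "trip_detour": ["A", "B", "X", "E"],
}

# Representative trip: most stops, ties broken by trip_id
representative = min(
    stops_by_trip, key=lambda t: (-len(stops_by_trip[t]), t)
)
reference_stops = set(stops_by_trip[representative])

# Stops each other trip serves that the representative trip does not
not_covered = {
    trip_id: sorted(set(stops) - reference_stops)
    for trip_id, stops in stops_by_trip.items()
    if trip_id != representative
}
print(not_covered)  # {'trip_short': [], 'trip_detour': ['X']}
```

Here `trip_short` is a pure subset, while `trip_detour` has one stop (`X`) that would not be used as a cut-point.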

In [1]:
import dask.dataframe as dd
import numpy as np
import pandas as pd

from dask import delayed, compute

from shared_utils import rt_dates
from segment_speed_utils import helpers, gtfs_schedule_wrangling
from segment_speed_utils.project_vars import SEGMENT_GCS




In [2]:
months = ["mar", "apr", "may", "jun", "jul"]

dates = [
    rt_dates.DATES[f"{m}2023"] for m in months
]
dates

['2023-03-15', '2023-04-12', '2023-05-17', '2023-06-14', '2023-07-12']

In [3]:
def get_stops_per_shape(date: str) -> pd.DataFrame:
    """
    Import stop_times, merge in trips, and count
    number of stops for trips grouped by shape.
    """
    
    stop_times = helpers.import_scheduled_stop_times(
        date,
        columns = ["feed_key", "trip_id", "stop_sequence", "stop_id"]
    )

    stops_per_trip = (stop_times.groupby(["feed_key", "trip_id"], 
                                         observed=True, group_keys=False)
                      .agg({"stop_id": "count"})
                      .reset_index()
                      .rename(columns = {"stop_id": "n_stops"})
    ).compute()
    
    
    trips = helpers.import_scheduled_trips(
        date,
        columns = ["feed_key", "name", "trip_id", "shape_array_key"],
        get_pandas = True
    )
    
    # we exclude Amtrak and Flex in speeds
    trips = gtfs_schedule_wrangling.exclude_scheduled_operators(trips)
    
    df = pd.merge(
        trips,
        stops_per_trip,
        on = ["feed_key", "trip_id"],
        how = "inner"
    )
    
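    # For each shape, collect the distinct stop counts seen across its trips
    df2 = (df.groupby("shape_array_key")
           .agg({"n_stops": lambda x: list(set(x))})
           .reset_index()
      )
    
    # multiple = number of distinct stop counts for the shape
    # (1 means every trip on the shape has the same number of stops)
    df2 = df2.assign(
        multiple = df2.n_stops.str.len(),
        service_date = date
    )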
    
    return df2

In [4]:
delayed_dfs = [delayed(get_stops_per_shape)(d) for d in dates]
# A single compute call lets dask schedule all dates together
results = list(compute(*delayed_dfs))

df = pd.concat(results, axis=0)

In [5]:
for d in dates:
    print(d)
    subset_df = df[df.service_date == d]
    print(subset_df.multiple.value_counts(normalize=True))
    print(subset_df.multiple.value_counts())

2023-03-15
1    0.963241
2    0.026530
3    0.007991
4    0.001119
5    0.000959
6    0.000160
Name: multiple, dtype: float64
1    6027
2     166
3      50
4       7
5       6
6       1
Name: multiple, dtype: int64
2023-04-12
1    0.966117
2    0.024267
3    0.007479
4    0.001068
5    0.000916
6    0.000153
Name: multiple, dtype: float64
1    6330
2     159
3      49
4       7
5       6
6       1
Name: multiple, dtype: int64
2023-05-17
1    0.989540
2    0.006437
3    0.002575
5    0.000805
4    0.000644
Name: multiple, dtype: float64
1    6149
2      40
3      16
5       5
4       4
Name: multiple, dtype: int64
2023-06-14
1    0.986443
2    0.008694
3    0.003242
5    0.000884
4    0.000589
6    0.000147
Name: multiple, dtype: float64
1    6694
2      59
3      22
5       6
4       4
6       1
Name: multiple, dtype: int64
2023-07-12
1    0.986163
2    0.008630
3    0.003273
5    0.001042
4    0.000744
6    0.000149
Name: multiple, dtype: float64
1    6628
2      58
3      22
5       

In [6]:
# Define the shapes we need to filter for by service_date
shapes_needed_dict =  {
    d: df[(df.service_date == d) & 
          (df.multiple > 1)].shape_array_key.unique().tolist()
    for d in dates
}

In [7]:
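def full_stop_info(date: str, shapes_needed: dict) -> pd.DataFrame:
    """
    For shapes with more than one stop count on this date,
    list the stop_ids present for each trip, and attach
    the maximum number of stops seen for that shape.
    """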
    shapes_list = shapes_needed[date]
    
    trips = helpers.import_scheduled_trips(
        date,
        filters = [[("shape_array_key", "in", shapes_list)]],
        columns = ["feed_key", "name", "trip_id", "shape_array_key"],
        get_pandas = True
    )
    
    operators_list = trips.feed_key.unique().tolist()
    trips_list = trips.trip_id.unique().tolist()

    stop_times = helpers.import_scheduled_stop_times(
        date,
        filters = [[("feed_key", "in", operators_list),
                    ("trip_id", "in", trips_list)]],
        columns = ["feed_key", "trip_id", "stop_sequence", "stop_id"]
    )
    
    st_with_shape = dd.merge(
        stop_times,
        trips,
        on = ["feed_key", "trip_id"],
        how = "inner"
    ).compute()
    
    df = (st_with_shape.groupby(["feed_key", "name",
                                 "shape_array_key", "trip_id"], 
                                observed=True, group_keys=False)
          .agg({"stop_id": lambda x: list(x)})
          .reset_index()
          .rename(columns = {"stop_id": "stops_present"})
         )
    
    df = df.assign(
        num_stops = df.apply(lambda x: len(x.stops_present), axis=1),
        service_date = date
    )
    
    df = df.assign(
        max_stops = df.groupby("shape_array_key").num_stops.transform("max")
    )
    
    
    return df

In [8]:
full_stop_results = [
    delayed(full_stop_info)(d, shapes_needed_dict) 
    for d in dates
]

In [9]:
# A single compute call lets dask schedule all dates together
full_stop_results2 = list(compute(*full_stop_results))

full_df = (pd.concat(full_stop_results2, axis=0)
           .sort_values(["service_date", "shape_array_key", 
                         "num_stops", "trip_id"], 
                        ascending=[True, True, False, True])
           .reset_index(drop=True)
          )

In [10]:
# Let's keep the unique variations of num_stops
full_df2 = full_df.drop_duplicates(
    subset=["service_date", "feed_key", "shape_array_key", 
    "num_stops", "max_stops"]
).reset_index(drop=True)

In [11]:
full_df2 = full_df2.assign(
    list_max_stops = full_df2.apply(
        lambda x: x.stops_present 
        if x.num_stops == x.max_stops else np.nan, axis=1)
)

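# Rows are sorted with the most stops first within each shape,
# so forward-filling spreads that trip's stop list to the other variants
full_df2 = full_df2.assign(
    list_max_stops = full_df2.groupby(
        ["feed_key", "shape_array_key", "service_date"]
    ).list_max_stops.ffill()
)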

In [12]:
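# missing_stops = stops in this trip that the trip with the most stops
# does not have, i.e. stops we'd lose as cut-points
full_df2 = full_df2.assign(
    missing_stops = full_df2.apply(
        lambda x: 
        list(
            set(x.stops_present).difference(set(x.list_max_stops))
        ), axis=1)
)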

In [13]:
full_df2.missing_stops.value_counts(normalize=True)

[]                                                   0.993035
[2436343]                                            0.003482
[80403, 80406, 80407, 80401, 80402, 80404, 80405]    0.001161
[515, 503]                                           0.001161
[510, 535, 513]                                      0.000580
[510, 513, 535]                                      0.000580
Name: missing_stops, dtype: float64
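
In the output above, `[510, 535, 513]` and `[510, 513, 535]` hold the same three stop_ids and are counted as separate values, presumably because the lists were built from sets, whose iteration order is not guaranteed. A minimal sketch of normalizing the order before counting, using made-up values and assuming the stop_ids are strings:

```python
from collections import Counter

# Hypothetical missing_stops values; the last two hold the same
# stop_ids in a different order
missing_stops = [
    [],
    [],
    ["515", "503"],
    ["510", "535", "513"],
    ["510", "513", "535"],
]

# Counting as-is treats differently ordered lists as different values
raw_counts = Counter(tuple(m) for m in missing_stops)

# Sorting each list first makes the same set of stops count once
normalized_counts = Counter(tuple(sorted(m)) for m in missing_stops)

print(len(raw_counts), len(normalized_counts))  # 4 3
```

Applied to the notebook, this would amount to sorting each `missing_stops` list before calling `value_counts`, which would merge the last two rows of the output.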

In [14]:
# https://stackoverflow.com/questions/56903912/how-to-check-if-an-element-is-an-empty-list-in-pandas
full_df2[full_df2.missing_stops.str.len() != 0].feed_key.value_counts()

7d12db085da6cd5cff99680c147cba9a    6
49edd787a8f56c4e96b6b3c128e91a6e    2
a05e032fbc7f677fdc9afaa6c1b20f3c    2
b9e5620a5f48b1104b87195858c893b0    2
Name: feed_key, dtype: int64

In [15]:
check_operators = full_df2[full_df2.missing_stops.str.len() != 0
                          ].feed_key.unique().tolist()

In [16]:
full_df2[full_df2.feed_key.isin(check_operators)].service_date.value_counts()

2023-03-15    28
2023-04-12    10
2023-05-17     6
2023-06-14     6
Name: service_date, dtype: int64

In [17]:
check_trips = full_df2[full_df2.feed_key.isin(check_operators)
                      ].trip_id.unique().tolist()

In [18]:
helpers.import_scheduled_trips(
    dates[0],
    filters = [[("feed_key", "in", check_operators)]],
    columns = ["feed_key", "name"],
    get_pandas = True
).drop_duplicates()

| | feed_key | name |
|---|---|---|
| 0 | 7d12db085da6cd5cff99680c147cba9a | Amador Schedule |
| 1 | b9e5620a5f48b1104b87195858c893b0 | Tuolumne Schedule |
| 2 | 49edd787a8f56c4e96b6b3c128e91a6e | LA Metro Rail Schedule |


In [19]:
trips2 = helpers.import_scheduled_trips(
    dates[0],
    filters = [[("feed_key", "in", check_operators), 
                ("trip_id", "in", check_trips)
               ]],
    columns = ["feed_key", "name",
               "shape_array_key", "trip_id"],
    get_pandas = True
).drop_duplicates()

In [20]:
check_me = helpers.import_scheduled_stop_times(
    dates[0],
    filters = [[("feed_key", "in", check_operators),
        ("trip_id", "in", trips2.trip_id.unique().tolist())]],
    columns = ["feed_key", "trip_id", "stop_id", "stop_sequence"]
).compute()

In [21]:
check_me2 = pd.merge(
    check_me,
    trips2,
    on = ["feed_key", "trip_id"],
    how = "inner"
)

In [22]:
check_me2.sort_values(
    ["shape_array_key", "trip_id", "stop_sequence"]
).groupby(
    ["shape_array_key", "trip_id", "name"]
).agg({
    "stop_id": lambda x: list(x),
    "stop_sequence": lambda x: list(x),
}).reset_index().head()

| | shape_array_key | trip_id | name | stop_id | stop_sequence |
|---|---|---|---|---|---|
| 0 | 111a979d3a4d17bb74e18488c470a544 | 57706684 | LA Metro Rail Schedule | [80109, 80110, 80111, 80112, 80113, 80114, 801... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] |
| 1 | 111a979d3a4d17bb74e18488c470a544 | 57706689 | LA Metro Rail Schedule | [80101, 80102, 80105, 80106, 80107, 80108, 801... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... |
| 2 | 1b9cfa59b70c557cc833ef560e80b9de | 57706683 | LA Metro Rail Schedule | [80122, 80121, 80120, 80119, 80118, 80117, 801... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... |
| 3 | 1b9cfa59b70c557cc833ef560e80b9de | 57706686 | LA Metro Rail Schedule | [80118, 80117, 80116, 80115, 80114, 80113, 801... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... |
| 4 | 1b9cfa59b70c557cc833ef560e80b9de | 57706703 | LA Metro Rail Schedule | [80108, 80107, 80106, 80105, 80154, 80153, 80101] | [1, 2, 3, 4, 5, 6, 7] |
