# Comprehensive review of stop segments from March - July 2023

### Issues
Changes implemented for better segments:
* pick 1 trip with the most stops to cut the segments for that shape.
* include an extra distance check while subsetting by prior stop and current stop.

#### Multiple stop_sequences for different stops present in a shape

If we look across all trips for a shape, the same stop sequence can be present for different stops.

Ex: stop sequence 2 shows up for stop A and stop B.

**Challenge**: this prevents segments from being cut correctly, because segmenting uses arrays of stop sequences and stop geometry.
**Solution**: for each shape, sort trips by number of stops and keep the 1 trip with the most stops. Use this trip and its stops to cut segments. The entire shape is still used, but we avoid cutting too-short segments, and vehicle positions (vp) have a better chance of being joined to longer segments.
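
A minimal sketch of that trip selection, assuming a stop_times-like table. The table, its values, and the names `stop_times`, `trip_id`, `n_stops`, and `stops_for_cutting` are hypothetical and only illustrate the idea; this is not the pipeline's actual code.

```python
import pandas as pd

# Hypothetical stop_times-like table (names and values are assumptions).
# Note stop_sequence 2 is stop C on trip t1 but stop B on trip t2.
stop_times = pd.DataFrame({
    "shape_array_key": ["s1"] * 5,
    "trip_id": ["t1", "t1", "t2", "t2", "t2"],
    "stop_id": ["A", "C", "A", "B", "C"],
    "stop_sequence": [1, 2, 1, 2, 3],
})

# Count the stops each trip serves within a shape.
stops_per_trip = (
    stop_times.groupby(["shape_array_key", "trip_id"])
    .agg(n_stops=("stop_id", "nunique"))
    .reset_index()
)

# Sort so the trip with the most stops comes first for each shape
# (trip_id breaks ties), then keep 1 trip per shape.
trip_to_keep = (
    stops_per_trip
    .sort_values(
        ["shape_array_key", "n_stops", "trip_id"],
        ascending=[True, False, True],
    )
    .drop_duplicates(subset="shape_array_key")
    [["shape_array_key", "trip_id"]]
)

# Only the selected trip's stops are used to cut segments,
# so each stop_sequence now maps to exactly one stop.
stops_for_cutting = (
    stop_times.merge(trip_to_keep, on=["shape_array_key", "trip_id"], how="inner")
    .sort_values(["shape_array_key", "stop_sequence"])
    .reset_index(drop=True)
)
```

With 1 trip per shape, the stop sequence and stop geometry arrays line up one-to-one.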

#### Gaps remain in between segments

When plotting segments on a map, we can sometimes see small gaps between them.

**Challenge**: `super_project()` uses straight-line distance between points, since using `shapely.project` wasn't always correct. Straight-line distance can still underestimate the distance along the shape (hypotenuse vs. sum of 2 sides).
**Solution**: include an extra check where the proposed end point is compared to the original end point, and take the leftover distance. This could cut some segments too long, but that's less of a problem if it plugs some of the gaps.

The extra check performs marginally better, but some gaps still remain.
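
To illustrate the hypotenuse problem, here is a small sketch on a toy L-shaped line. It is one reading of the leftover-distance check, not the actual `super_project()` code, and the variable names are assumptions.

```python
from shapely.geometry import LineString, Point

# Toy shape with one right-angle turn: 3 units east, then 4 units north.
shape = LineString([(0, 0), (3, 0), (3, 4)])
start = Point(shape.coords[0])
end = Point(shape.coords[-1])

# Straight-line distance is the hypotenuse of a 3-4-5 triangle.
straight_line = start.distance(end)

# Distance actually travelled along the shape is the sum of both sides.
along_shape = shape.length

# Cutting at the straight-line distance stops short of the true end point.
proposed_end = shape.interpolate(straight_line)

# Assumed version of the extra check: measure how far the proposed end
# point is from the original end point and add that leftover distance.
leftover = proposed_end.distance(end)
adjusted = straight_line + leftover
```

Here the straight-line distance is 5 while the path is 7, so the first cut ends 2 units early; adding the leftover closes the gap in this simple case. On shapes with more turns the leftover is itself a straight-line measure, which is consistent with some gaps remaining.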

#### Difficulty in segmenting loops near origin

If we pull apart a shape geometry's coordinates and project them, we'll see that a loopy route's points travel back toward the origin.

**Challenge**: these segments are few. Implementing a second check for `super_project`, where a shape's line geometry coords are monotonically cast and a cumulative distance array is generated, creates more errors than it solves.
**Solution**: forgo this option. A second `super_project` improved a couple of segments, but at the expense of creating more issues elsewhere. In a random check of 50-100 shapes, only a couple of shapes had improved segments, while many more had segments with new errors. For the majority, there were no visible differences in the segments.
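
A small sketch of why loops are hard, using plain numpy on a toy loop. Distance from the origin stands in for the projected distance here, and the cumulative array is an assumed illustration of the monotonic idea rather than the code that was tested.

```python
import numpy as np

# Toy loop: leaves the origin, circles around, and ends 1 unit from it.
coords = np.array(
    [(0, 0), (10, 0), (10, 10), (0, 10), (0, 1)], dtype=float
)

# Straight-line distance of every point from the first point.
# It rises and then falls as the route heads back toward the origin.
dist_from_origin = np.hypot(*(coords - coords[0]).T)

# Cumulative distance travelled between consecutive points.
# This only ever increases, so it preserves the order of travel.
step_lengths = np.hypot(*np.diff(coords, axis=0).T)
cumulative = np.concatenate([[0.0], np.cumsum(step_lengths)])

origin_dist_is_monotonic = bool(np.all(np.diff(dist_from_origin) >= 0))
cumulative_is_monotonic = bool(np.all(np.diff(cumulative) > 0))
```

The last point is only 1 unit from the origin but 39 units along the route, which is the ambiguity that makes segments near the origin of a loop hard to cut.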


In [None]:
import os
os.environ['USE_PYGEOS'] = '0'

import dask.dataframe as dd
import dask_geopandas as dg
import geopandas as gpd
import numpy as np
import pandas as pd

from shared_utils import rt_dates
from segment_speed_utils import helpers
from segment_speed_utils.project_vars import SEGMENT_GCS

In [None]:
# Define the shape lists first, since the read below filters on them
improved_shapes = [
    "0badc8e8e7c3e15eaef3feddd38b5eaf",
    "6bee2519e137efd0d445736b8128f32d",
]

got_worse = [
    "6316ca1a41a3696ea80c09abc40d4df3",
    "21bfcb9dc9f1ab2e1ee152b84ece7667", # mixed
    "3c26deafa5cbf15bb7b613c61581214b", # no change, but we want it to
    "011ac48604c84ff6d314563d8e583c3e",
    "0a6cc7ee3f0709e04e94ec887bf854fe"
]

In [None]:
analysis_date = "2023-05-17"

df = gpd.read_parquet(
    f"{SEGMENT_GCS}stop_segments_{analysis_date}.parquet",
    columns = ["shape_array_key", "stop_sequence", "geometry"],
    filters = [[("shape_array_key", "in", improved_shapes)]]
)#.set_geometry("stop_segment_geometry")

In [None]:
df[df.shape_array_key.isin(improved_shapes)].explore(
    "stop_sequence", 
    tiles='CartoDB Positron', categorical=True, legend=False
)

In [None]:
months = [
    #"mar", 
    "apr", 
    "may", 
    #"jun", 
    #"jul"
]

dates = [
    rt_dates.DATES[f"{m}2023"] for m in months
]
dates

In [None]:
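def import_segments(date: str, **kwargs) -> gpd.GeoDataFrame:
    """
    Import stop segments for one service date and attach
    each segment's length and the service date.
    Extra kwargs (e.g. `filters`) are passed to `gpd.read_parquet`.
    """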
    gdf = gpd.read_parquet(
        f"{SEGMENT_GCS}stop_segments_{date}.parquet",
        columns = ["gtfs_dataset_key", "feed_key", 
                   "shape_array_key", "stop_id", "stop_sequence",
                   "loop_or_inlining", "geometry"
                  ],
        **kwargs
        
    )
    gdf = gdf.assign(
        segment_length = gdf.geometry.length,
        service_date = date
    )
    
    return gdf

## Apr vs May 

* Mar / Apr cut segments using 1 `super_project`
* May / Jun / Jul cut segments using 2 `super_project` rounds: after the first round, loop/inlining segments and any segments whose representative point doesn't fall on the shape go through another attempt at cutting
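
A rough sketch of the representative-point check that sends a segment to the second round. The helper `needs_second_cut` and its `tolerance` parameter are hypothetical, and the toy geometries only illustrate the idea.

```python
from shapely.geometry import LineString

# Toy shape with one right-angle turn.
shape = LineString([(0, 0), (10, 0), (10, 10)])

# A segment that follows the shape, and one that cuts the corner.
good_segment = LineString([(2, 0), (4, 0), (6, 0)])
bad_segment = LineString([(2, 0), (6, 5), (10, 10)])


def needs_second_cut(
    segment: LineString, shape: LineString, tolerance: float = 1e-6
) -> bool:
    """
    Hypothetical check: flag a segment when its representative
    point is farther than `tolerance` from the shape.
    """
    return segment.representative_point().distance(shape) > tolerance


flag_good = needs_second_cut(good_segment, shape)
flag_bad = needs_second_cut(bad_segment, shape)
```

A nonzero tolerance leaves room for floating point error when testing whether a point lies on a line.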

In [None]:
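def get_segment_length(date: str, **kwargs) -> gpd.GeoDataFrame:
    """
    Compare the summed segment length for each shape against
    the length of the scheduled shape itself.
    After the merge, segment geometry is `geometry_x` and
    shape geometry is `geometry_y`.
    """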
    gdf = import_segments(date, **kwargs)

    gdf = gdf.assign(
        sum_segment_length = (gdf.groupby(
            ["shape_array_key", "service_date"])
                              .segment_length
                              .transform("sum")
                             )
    )
    
    shape_keys_present = gdf.shape_array_key.unique().tolist()
    
    shapes = helpers.import_scheduled_shapes(
        date,
        filters = [[("shape_array_key", "in", shape_keys_present)]],
        get_pandas = True, 
    )
    
    shapes = shapes.assign(
        shape_length = shapes.geometry.length
    )

    gdf2 = pd.merge(
        gdf,
        shapes,
        on = "shape_array_key",
        how = "inner",
    )
    
    gdf2 = gdf2.assign(
        difference_meters = (gdf2.sum_segment_length - 
                             gdf2.shape_length).round(3)
    )
    
    return gdf2

In [None]:
shape_filtering = [[("loop_or_inlining", "==", 1), 
                    ("district", "==", 11)]]
apr = get_segment_length(
    dates[0], filters = shape_filtering
)

In [None]:
may = get_segment_length(
    dates[1], filters = shape_filtering
)

In [None]:
apr_shapes = apr.shape_array_key.unique()
may_shapes = may.shape_array_key.unique()

shapes_in_common = np.intersect1d(apr_shapes, may_shapes)

In [None]:
len(shapes_in_common)

In [None]:
# Use separate names so the `apr` / `may` results from get_segment_length()
# (with geometry_x / geometry_y) stay available for the loop further down
apr_worse = import_segments(
    dates[0], 
    filters = [[("shape_array_key", "in", got_worse)]]
)

may_worse = import_segments(
    dates[1],
    filters = [[("shape_array_key", "in", got_worse)]]    
)

In [None]:
MAP_KWARGS = {
    "tiles": "CartoDB Positron",
    "categorical": True,
    "legend": False
}

In [None]:
apr_worse.explore("stop_sequence", **MAP_KWARGS)

In [None]:
may_worse.explore("stop_sequence", **MAP_KWARGS)

In [None]:
random_idx = [
    #10, 110, 210, 310,
    #410, 510, 610, 710,
    #810, 910, 1010, 1110,
    #1210, 1310,1410, 1510, 
    #1610, 1710, 1810, 1910,
    #2010, 2110,
    #4, 100, 496, 493,
    #483, 124, 312, 298,
    #32, 349, 850, 756, 933, 
    #282, 482, 485, 300, 209, 540, 
    #392, 678, 695, 2109, 1335, 
    #1294, 2102, 2004, 2019
]

In [None]:
sd_key = "a4f6fd5552107e05fe9743ac7cce2c55"
apr_sd = apr[apr.gtfs_dataset_key==sd_key].shape_array_key.unique()
may_sd = may[may.gtfs_dataset_key==sd_key].shape_array_key.unique()

sd_shapes_in_common = np.intersect1d(apr_sd, may_sd)

In [None]:
for i in sd_shapes_in_common:
    #one_shape = shapes_in_common[i]
    one_shape = i
    print(one_shape)
    drop = ["geometry_y"]
    apr_map = (apr[apr.shape_array_key==one_shape]
               .drop(columns = drop)
               .set_geometry("geometry_x")
              ).explore(
        "stop_sequence", **MAP_KWARGS
    )
    
    may_map = (may[may.shape_array_key==one_shape]
               .drop(columns = drop)
               .set_geometry("geometry_x")
              ).explore(
        "stop_sequence", **MAP_KWARGS
    )
    
    display(apr_map)
    display(may_map)