# Average speeds on segments as time-series

## Exploratory: aggregate to route-direction-stop columns 

**Metrics**
* count the number of speed values by stop and compare it against the number of service dates, to see how many segments we track correctly vs how many have dupes
   * if `n_speed_values = n_dates`, we are tracking the same segment over time
   * if `n_speed_values > n_dates`, there are duplicate segments getting counted on at least one date
* how many geometry values are present for a segment?
   * anything > 1 means there's a near-duplicate segment, or a segment cut slightly differently. Why does this happen, and do the segments differ significantly or not really?
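The metrics above can be sketched as a single groupby. This is a minimal illustration on toy data, not the actual implementation of `ts_utils.count_time_series_values_by_route_direction_stop`; the grouping column names, the dates, and the WKT strings standing in for geometries are all assumptions made for the example.

```python
import pandas as pd

# Toy data: column names and values are hypothetical stand-ins.
# Geometries are WKT strings so the sketch only needs pandas.
toy = pd.DataFrame({
    "route_id": ["A"] * 5,
    "direction_id": [0] * 5,
    "stop_pair": ["1__2", "1__2", "2__3", "2__3", "2__3"],
    "service_date": [
        "2023-01-18", "2023-02-15",
        "2023-01-18", "2023-02-15", "2023-02-15",
    ],
    "p50_mph": [12.0, 11.5, 9.0, 9.5, 8.0],
    "geometry": [
        "LINESTRING (0 0, 1 0)", "LINESTRING (0 0, 1 0)",
        "LINESTRING (1 0, 2 0)", "LINESTRING (1 0, 2 0)",
        "LINESTRING (1 0, 2 0.1)",
    ],
})

group_cols = ["route_id", "direction_id", "stop_pair"]

counts = (
    toy.groupby(group_cols)
    .agg(
        n_speed_values=("p50_mph", "count"),    # rows with a speed
        n_dates=("service_date", "nunique"),    # distinct service dates
        n_geometry=("geometry", "nunique"),     # distinct segment cuts
    )
    .reset_index()
)

# Segments with more speed values than dates have dupes on some date.
has_dupes = counts[counts.n_speed_values > counts.n_dates]
```

In this toy data, stop pair `2__3` has two rows on the same date with slightly different geometries, so it is the one flagged in `has_dupes`.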

With time-series, we can now track how well `route-dir-stop_pair` captures the same geographic entity over time. Aggregations are done for each day, so we don't really know yet how consistently they perform across time.

The goal is to say: these are the average speeds on this segment over this period of time.

Route-direction is a more stable grain than shape, and definitely more stable than trip. With GTFS Digest, we saw that around the 6-month mark, even `route_id` starts changing. We do harmonize route names across time, so let's see how well we're doing now at tracking the same segment over time.

In [1]:
import folium
import geopandas as gpd
import pandas as pd

import segment_time_series_utils as ts_utils

In [3]:
df = ts_utils.route_segment_speeds_ts(
    filters = [[("time_period", "==", "all_day")]],
    columns = ts_utils.route_dir_stop_cols + ["service_date", "p50_mph", "geometry"]
)

df2 = ts_utils.count_time_series_values_by_route_direction_stop(
    df, ts_utils.route_dir_stop_cols)

In [4]:
df.service_date.nunique()

17

In [5]:
df2[df2.n_speed_values == df2.n_dates].shape[0] / df2.shape[0]

0.9545535172478178

In [6]:
df2[df2.n_speed_values > df2.n_dates].shape[0] / df2.shape[0]

0.045446482752182236

* about 95% of rows get it right and about 5% have dupes (more speed values than dates, which means some segments are counted more than once)
* about 55% of segments keep a single stable geometry across 17 months, and 85% have 1 or 2 geometries over this period
* geometries may or may not need to be harmonized. If they differ only slightly, we may not care
* the different segment cuts likely reflect shape changes. `stop_segments` already uses 1 shape for each route-direction, so intra-date variation is already at a minimum. Yet over 17 months, this aggregation still needs its variation reduced further.
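One way to decide whether two cuts of a segment "differ only slightly" is to measure how far apart they are. The sketch below is an illustration rather than anything already in this notebook: it computes a discrete Hausdorff distance over vertices with numpy, assumes coordinates are in a projected CRS (meters), and uses a hypothetical `TOLERANCE`. On actual geometries, shapely's `hausdorff_distance` method could play the same role.

```python
import numpy as np

def discrete_hausdorff(a, b) -> float:
    """
    Largest distance from any vertex in one line to the
    nearest vertex in the other (checked in both directions).
    Only vertices are compared, so this approximates the true
    Hausdorff distance between the lines.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise vertex distances, shape (len(a), len(b))
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

# Toy segment variations, in assumed meters
seg_a = [(0, 0), (50, 0), (100, 0)]
seg_b = [(0, 1), (50, 1), (100, 1)]   # shifted by 1 m
seg_c = [(0, 0), (50, 40), (100, 0)]  # detours 40 m off the line

TOLERANCE = 5  # hypothetical threshold, in meters

near = discrete_hausdorff(seg_a, seg_b)
far = discrete_hausdorff(seg_a, seg_c)

# Variations under the tolerance could be treated as the same segment
same_segment_ab = near <= TOLERANCE
same_segment_ac = far <= TOLERANCE
```

Here `seg_b` would be treated as the same segment as `seg_a`, while `seg_c` would be flagged for a closer look on the map.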

In [None]:
df2.n_geometry.value_counts(normalize=True)

In [None]:
dupes = df2[df2.n_geometry>=3].reset_index(drop=True)
dupes.shape

In [None]:
def plot_segment_variations(
    df: gpd.GeoDataFrame, 
    dupes_df: pd.DataFrame,
    row_value: int
):
    """
    Make a map of the variations of segments so we can see 
    just how similar or different they look.
    """
    subset_gdf = pd.merge(
        df,
        dupes_df.iloc[row_value : row_value + 1],
        on = ts_utils.route_dir_stop_cols,
        how = "inner"
    )[ts_utils.route_dir_stop_cols + ["geometry"]
    ].drop_duplicates().reset_index(drop=True)
    
    display(subset_gdf)
    
    subset_gdf = subset_gdf.assign(
        obs = subset_gdf.index
    )
    
    sample_colors = [
        "blue", "green", "red", "orange", 
        "purple", "black", "gray"
    ]
    
    m = subset_gdf[subset_gdf.obs==0].explore(
        "obs",
        tiles = "CartoDB Positron",
        color = sample_colors[0], name = "obs 0"
    )
    
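    # Pick the color by observation number so obs 1 does not reuse
    # obs 0's color; wrap around if variations outnumber colors.
    for value in subset_gdf[subset_gdf.obs > 0].obs.unique():
        m = subset_gdf[subset_gdf.obs == value].explore(
            m=m, color = sample_colors[value % len(sample_colors)], 
            name = f"obs {value}"
        )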

    folium.LayerControl().add_to(m)
    
    display(m)

In [None]:
for v in range(0, dupes.shape[0], 150):
    plot_segment_variations(df, dupes, v)