# Route Identification Over Time, Approach 1

Recent observations show small changes in routes over time, specifically in the following fields:
* route ID
* route short name
* route long name
* route desc

These changes need to be tracked so that future analyses can account for them.

## Objective
1. Query data from `fct_monthly_routes` for a couple of months in 2023 to help identify variances in routes.
2. Save the data to the GCS `gtfs_schedule` bucket.
3. Filter the data down to `Sacramento Regional Transit` and check its routes for any variances.

* https://github.com/cal-itp/data-analyses/issues/924
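
As a minimal sketch of step 3, assume a table with one row per route per month and columns named `route_id`, `route_short_name`, `route_long_name`, and `route_desc`. These column names and the toy values are assumptions for illustration, not taken from `fct_monthly_routes`. The code counts how many distinct values each descriptive field takes per route; a count above 1 flags a field that changed. Because it groups on `route_id`, it would not catch a change in the ID itself.

```python
import pandas as pd

# Toy data: hypothetical routes and values, for illustration only.
routes = pd.DataFrame(
    {
        "month": [4, 5, 6, 4, 5, 6],
        "route_id": ["001", "001", "001", "002", "002", "002"],
        "route_short_name": ["1", "1", "1", "2", "2", "2"],
        "route_long_name": [
            "Main St", "Main St", "Main Street",
            "Oak Ave", "Oak Ave", "Oak Ave",
        ],
        "route_desc": ["Local"] * 6,
    }
)

fields = ["route_short_name", "route_long_name", "route_desc"]

# Number of distinct values each field takes per route across months.
n_values = routes.groupby("route_id")[fields].nunique()

# Keep routes where at least one field has more than one distinct value.
changed = n_values.loc[(n_values > 1).any(axis=1)]
print(changed)
```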

In [None]:
import geopandas as gpd
import pandas as pd
from shared_utils import rt_dates
from calitp_data_analysis import geography_utils

In [None]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [None]:
gdf = gpd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/gtfs_schedule/route_identification_2023_m04_m09.parquet"
)

In [None]:
gdf.shape

In [None]:
gdf = gdf.drop(columns=["year", "key"])

In [None]:
gdf.sample().drop(columns=["geometry"])

### LA Metro

In [None]:
# gdf.name.unique()

In [None]:
la_metro = gdf.loc[gdf.name == "LA Metro Bus Schedule"].reset_index(drop=True)

In [None]:
la_metro.shape

In [None]:
la_metro.month.unique()

In [None]:
# Split into one DataFrame per month, keyed by month number, so months can be compared
# https://stackoverflow.com/questions/47769453/pandas-split-dataframe-to-multiple-by-unique-values-rows
la_dfs = dict(tuple(la_metro.groupby("month")))
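
To illustrate the split on a small scale: iterating over a `groupby` yields `(key, sub-frame)` pairs, so `dict(tuple(...))` maps each month to its rows. The frame below is toy data, not the LA Metro feed.

```python
import pandas as pd

# Toy data with hypothetical route IDs.
toy = pd.DataFrame(
    {"month": [4, 4, 5, 6], "route_id": ["a", "b", "a", "a"]}
)

# Each groupby item is a (group key, sub-DataFrame) pair.
by_month = dict(tuple(toy.groupby("month")))

print(sorted(by_month))  # the month keys
print(by_month[4])       # rows for month 4 only
```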

In [None]:
la_metro_df = la_metro.drop(columns=["geometry"])

In [None]:
# Routes with the most distinct shapes across all months
(
    la_metro_df.groupby(["route_id"])
    .agg({"shape_id": "nunique"})
    .sort_values(["shape_id"], ascending=False)
    .head()
)

In [None]:
april = la_dfs[4]

In [None]:
april.columns

In [None]:
may = la_dfs[5]

In [None]:
june = la_dfs[6]

In [None]:
pd.merge(
    april,
    may,
    on=["name", "source_record_id", "route_id", "shape_id"],
    how="outer",
    indicator=True,
)[["_merge"]].value_counts()
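
The outer merge with `indicator=True` can be seen on a small scale in the sketch below. The two frames are toy stand-ins for monthly snapshots with hypothetical IDs. A row is labeled `both` when the key combination appears in both months, `left_only` when it appears only in the first, and `right_only` when it appears only in the second.

```python
import pandas as pd

# Toy stand-ins for two monthly snapshots (hypothetical IDs).
left = pd.DataFrame(
    {"route_id": ["10", "10", "20"], "shape_id": ["s1", "s2", "s3"]}
)
right = pd.DataFrame(
    {"route_id": ["10", "10", "20"], "shape_id": ["s1", "s9", "s3"]}
)

merged = pd.merge(
    left, right, on=["route_id", "shape_id"], how="outer", indicator=True
)

# Count how many key combinations fall into each merge category.
counts = merged["_merge"].value_counts()
print(counts)
```

Here route `10` keeps shape `s1`, loses `s2`, and gains `s9`, so one row lands in each of `left_only` and `right_only`.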

In [None]:
m1 = (
    pd.merge(
        april,
        may,
        on=["name", "source_record_id", "route_id", "shape_id"],
        how="outer",
        indicator=True,
    )
    .rename(columns={"_merge": "april_v_may"})
    .drop(columns=["month_x", "month_y"])
)

In [None]:
m1.columns

In [None]:
preview = ["source_record_id", "name", "route_id", "shape_id", "april_v_may"]

In [None]:
m1.loc[m1.april_v_may == "left_only"].set_geometry("geometry_x").explore(
    "route_id", style_kwds={"weight": 10}
)

In [None]:
m1.loc[m1.april_v_may == "right_only"].set_geometry("geometry_y").explore(
    "route_id", style_kwds={"weight": 10}
)

In [None]:
# Reproject the April geometry to a projected CRS so that .length is not measured in degrees
m1 = m1.set_geometry("geometry_x").to_crs(geography_utils.CA_StatePlane)

In [None]:
m1['april_geo_len'] = m1.geometry_x.length

In [None]:
# Repeat for the May geometry; to_crs only reprojects the active geometry column
m1 = m1.set_geometry("geometry_y").to_crs(geography_utils.CA_StatePlane)

In [None]:
m1['may_len'] = m1.geometry_y.length

In [None]:
m1.loc[m1.april_v_may != "both"][preview]

In [None]:
may.shape, april.shape

In [None]:
april.geometry.geom_type.unique()

In [None]:
may.geometry.geom_type.unique()

In [None]:
group1 = (
    m1.loc[m1.april_v_may != "both"]
    .groupby(["route_id"])
    .agg({"shape_id": "nunique", "april_geo_len": "max", "may_len": "max"})
    .reset_index()
)

In [None]:
# Longest April shape as a percentage of the longest May shape, per route
group1.april_geo_len / group1.may_len * 100
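
A small sketch of the same aggregation and ratio on toy numbers may help when reading the result. The lengths below are hypothetical values, not measurements from the feed. Unmatched rows carry a length for only one month, so a route with an unmatched shape in just one of the two months ends up with `NaN` for the ratio.

```python
import pandas as pd

# Hypothetical lengths; None marks the month in which the shape is absent.
unmatched = pd.DataFrame(
    {
        "route_id": ["10", "10", "20"],
        "shape_id": ["s2", "s9", "s4"],
        "april_geo_len": [1000.0, None, 500.0],
        "may_len": [None, 1250.0, None],
    }
)

lengths = (
    unmatched.groupby("route_id")
    .agg({"shape_id": "nunique", "april_geo_len": "max", "may_len": "max"})
    .reset_index()
)

# April length as a percentage of May length; NaN when one month has no shape.
lengths["april_pct_of_may"] = lengths["april_geo_len"] / lengths["may_len"] * 100
print(lengths)
```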