# Route Identification Over Time, Approach 1

Recent observations show small changes in routes over time, specifically in the following fields:
* route ID
* route short name
* route long name
* route desc

We need to track these route changes so that future analyses can account for them.

## Objective
1. Query data from `fct_monthly_routes` for a few months of 2023 to help identify variances in routes.
2. Save the data to the GCS `gtfs_schedule` bucket.
3. Filter the data down to `Sacramento Regional Transit`, then identify and observe any variances in its routes.

* https://github.com/cal-itp/data-analyses/issues/924

In [1]:
import geopandas as gpd
import pandas as pd
from calitp_data_analysis import geography_utils
from shared_utils import rt_dates

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [3]:
gdf = gpd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/gtfs_schedule/route_identification_2023_m04_m09.parquet"
)

In [4]:
gdf.shape

(24325, 8)

In [7]:
gdf = gdf.drop(columns=["year", "key"])

In [8]:
gdf.sample().drop(columns=["geometry"])

Unnamed: 0,source_record_id,name,route_id,shape_id,month
18773,recCNNGH8SHfXBKvv,Bay Area 511 Golden Gate Transit Schedule,580,5800038,8


### LA Metro Example



In [None]:
# gdf.name.unique()

In [9]:
la_metro = gdf.loc[gdf.name == "LA Metro Bus Schedule"].reset_index(drop=True)

In [10]:
la_metro.shape

(802, 6)

In [11]:
la_metro.month.unique()

array([9, 7, 4, 6, 8, 5])

In [12]:
# Compare each month
# https://stackoverflow.com/questions/47769453/pandas-split-dataframe-to-multiple-by-unique-values-rows
la_dfs = dict(tuple(la_metro.groupby("month")))

In [13]:
la_metro_df = la_metro.drop(columns=["geometry"])

In [26]:
la_metro_summary = (
    la_metro_df.groupby(["route_id"])
    .agg({"shape_id": "nunique",
         "month":"nunique"})
    .sort_values(["shape_id"], ascending=False)
    .reset_index()
    .rename(columns = {'shape_id':'total_unique_shapes', 'month':'total_unique_months'})
)

In [27]:
la_metro_summary.shape

(229, 3)

#### The same route_id can have 3 unique shape_ids associated with it across 4 different months.

In [28]:
la_metro_summary.loc[la_metro_summary.total_unique_shapes > 1].shape

(44, 3)

In [29]:
la_metro_summary.loc[la_metro_summary.total_unique_shapes > 1]

Unnamed: 0,route_id,total_unique_shapes,total_unique_months
0,242-13168,3,4
1,265-13168,2,4
2,81-13168,2,4
3,237-13168,2,4
4,237-13167,2,3
5,154-13168,2,4
6,155-13167,2,3
7,155-13168,2,4
8,577-13167,2,3
9,161-13168,2,4
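
To see which `shape_id` a route used in each month, rather than only the count, one option is to pivot the months into columns. The sketch below uses a small made-up frame (`toy`, with hypothetical route and shape IDs) in place of `la_metro_df`; it is an illustration, not a step from this notebook.

```python
import pandas as pd

# Made-up rows standing in for la_metro_df; IDs are hypothetical.
toy = pd.DataFrame(
    {
        "route_id": ["route_A", "route_A", "route_A", "route_B", "route_B"],
        "shape_id": ["a_v1", "a_v2", "a_v2", "b_v1", "b_v1"],
        "month": [4, 5, 6, 4, 5],
    }
)

# One row per route, one column per month, cells list the shape_ids seen.
shapes_by_month = (
    toy.groupby(["route_id", "month"])["shape_id"]
    .agg(lambda s: ", ".join(sorted(s.unique())))
    .unstack("month")
)
```

Routes whose cells differ across columns are the ones where the shape changed while the `route_id` stayed the same; a missing cell means the route was not present that month.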


#### Evaluate each month using merges

In [30]:
april = la_dfs[4]

In [31]:
may = la_dfs[5]

In [32]:
june = la_dfs[6]

In [34]:
m1 = (
    pd.merge(
        april,
        may,
        on=["name", "source_record_id", "route_id", "shape_id"],
        how="outer",
        indicator=True,
    )
    .rename(columns={"_merge": "april_v_may"})
    .drop(columns=["month_x", "month_y"])
)

In [35]:
preview = ["source_record_id", "name", "route_id", "shape_id", "april_v_may"]

* The pattern appears again: route_id 901-13167 has a different shape_id in April vs. May.

In [75]:
m1.loc[m1.april_v_may != "both"].shape

(20, 10)

In [76]:
m1.shape

(125, 10)

In [36]:
m1.loc[m1.april_v_may != "both"][preview]

Unnamed: 0,source_record_id,name,route_id,shape_id,april_v_may
0,recX8JOPmBQM9aWLC,LA Metro Bus Schedule,901-13167,9010055_DEC22,left_only
8,recX8JOPmBQM9aWLC,LA Metro Bus Schedule,237-13167,2370031_DEC22,left_only
21,recX8JOPmBQM9aWLC,LA Metro Bus Schedule,125-13167,1250150_DEC22,left_only
38,recX8JOPmBQM9aWLC,LA Metro Bus Schedule,611-13167,6110027_DEC22,left_only
39,recX8JOPmBQM9aWLC,LA Metro Bus Schedule,665-13167,6650030_DEC22,left_only
70,recX8JOPmBQM9aWLC,LA Metro Bus Schedule,910-13167,9100211_DEC22,left_only
85,recX8JOPmBQM9aWLC,LA Metro Bus Schedule,265-13167,2650016_DEC22,left_only
88,recX8JOPmBQM9aWLC,LA Metro Bus Schedule,686-13167,6860004_DEC22,left_only
107,recX8JOPmBQM9aWLC,LA Metro Bus Schedule,110-13167,1100289_DEC22,left_only
114,recX8JOPmBQM9aWLC,LA Metro Bus Schedule,155-13167,1550041_DEC22,left_only
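
For reference, here is how the outer merge with `indicator=True` labels rows, shown on a toy example. The frames `april_toy` and `may_toy` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the April and May frames; values are made up.
april_toy = pd.DataFrame(
    {"route_id": ["1", "2", "3"], "shape_id": ["a_old", "b", "c"]}
)
may_toy = pd.DataFrame(
    {"route_id": ["1", "2", "3"], "shape_id": ["a_new", "b", "c"]}
)

# Rows match only when both route_id and shape_id are identical.
toy_merge = pd.merge(
    april_toy, may_toy, on=["route_id", "shape_id"], how="outer", indicator=True
)

# "left_only" = pair seen only in April, "right_only" = only in May.
changed = toy_merge.loc[toy_merge["_merge"] != "both"]
```

Route `1` shows up twice in `changed`: once as `left_only` with its April shape and once as `right_only` with its May shape. A route that appears on both sides this way kept its `route_id` but changed its `shape_id`.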


#### Eyeball the maps
* 237-13167

In [37]:
m1.loc[m1.april_v_may == "left_only"].set_geometry("geometry_x").explore(
    "route_id", style_kwds={"weight": 10}
)

In [38]:
m1.loc[m1.april_v_may == "right_only"].set_geometry("geometry_y").explore(
    "route_id", style_kwds={"weight": 10}
)

#### How to check if it's the same route under different names?
* Reproject to a projected CRS so lengths are in linear units
* Compare the route length for each month?

In [58]:
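# to_crs only reprojects the active geometry column, so each geometry column is set and reprojected in turn
m1 = m1.set_geometry("geometry_x").to_crs(geography_utils.CA_StatePlane)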

In [59]:
m1["april_len"] = m1.geometry_x.length

In [60]:
m1 = m1.set_geometry("geometry_y").to_crs(geography_utils.CA_StatePlane)

In [61]:
m1["may_len"] = m1.geometry_y.length

In [62]:
may.shape, april.shape

((115, 6), (115, 6))

In [71]:
group1 = (
    m1.loc[m1.april_v_may != "both"]
    .groupby(
        [
            "route_id",
        ]
    )
    .agg({"april_len": "max", "may_len": "max"})
    .reset_index()
)

In [72]:
# April length as a percentage of May length, truncated to an integer
group1['length_pct'] = (group1.april_len / group1.may_len * 100).astype('int64')

In [73]:
# Difference in length (April minus May), truncated to an integer
group1['length'] = (group1.april_len - group1.may_len).astype('int64')

In [74]:
group1

Unnamed: 0,route_id,april_len,may_len,length_pct,length
0,110-13167,97156.96,107558.41,90,-10401
1,125-13167,109516.75,111384.3,98,-1867
2,155-13167,110056.33,109549.91,100,506
3,237-13167,134826.9,131970.98,102,2855
4,265-13167,86727.99,90611.46,95,-3883
5,611-13167,73450.32,80099.68,91,-6649
6,665-13167,28466.68,25277.14,112,3189
7,686-13167,27980.85,24335.95,114,3644
8,901-13167,92919.28,93694.06,99,-774
9,910-13167,140609.0,139713.27,100,895
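
One possible way to turn these lengths into a same-route check is to flag pairs whose lengths differ by less than some tolerance. The sketch below reuses three rows from the table above; the 5% tolerance (`LENGTH_TOLERANCE_PCT`) is a placeholder assumption, not a value settled on in this notebook.

```python
import pandas as pd

# Three rows copied from the group1 output above.
lengths = pd.DataFrame(
    {
        "route_id": ["110-13167", "155-13167", "686-13167"],
        "april_len": [97156.96, 110056.33, 27980.85],
        "may_len": [107558.41, 109549.91, 24335.95],
    }
)

# Hypothetical tolerance; the right threshold is still an open question.
LENGTH_TOLERANCE_PCT = 5

# Absolute percent difference relative to the May length.
lengths["pct_diff"] = (
    (lengths.april_len - lengths.may_len).abs() / lengths.may_len * 100
)
lengths["likely_same_route"] = lengths.pct_diff < LENGTH_TOLERANCE_PCT
```

With this placeholder tolerance only 155-13167 would be treated as unchanged. Length alone cannot tell a rerouted line from one that happens to be equally long, so it is best read as a first-pass filter.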


## To do
* Find other examples besides LA Metro in which the `route_id` and `shape_id` change
* Compare all the months that were queried
* Check whether `shape_id` changes from one month to another while `route_id` doesn't, and vice versa
* What's the threshold for the length to be considered the same route?
* Find steps that are repeated and turn them into functions
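
As a starting point for the last item, the month-to-month merge could be wrapped in a function and run over every consecutive pair of months. This is a minimal sketch: `compare_months` and `compare_all_months` are hypothetical helper names, the toy data is made up, and geometry columns are left out for simplicity.

```python
import pandas as pd

MERGE_COLS = ["name", "source_record_id", "route_id", "shape_id"]


def compare_months(df, first_month, second_month):
    """Outer-merge two months and keep the merge indicator."""
    first = df.loc[df["month"] == first_month, MERGE_COLS].drop_duplicates()
    second = df.loc[df["month"] == second_month, MERGE_COLS].drop_duplicates()
    merged = pd.merge(first, second, on=MERGE_COLS, how="outer", indicator=True)
    # Record which pair of months this comparison covers.
    merged["first_month"] = first_month
    merged["second_month"] = second_month
    return merged


def compare_all_months(df):
    """Return rows that differ between consecutive months (needs 2+ months)."""
    months = sorted(df["month"].unique())
    pairs = zip(months[:-1], months[1:])
    merged = pd.concat(
        [compare_months(df, a, b) for a, b in pairs], ignore_index=True
    )
    return merged.loc[merged["_merge"] != "both"].reset_index(drop=True)


# Made-up data: route "1" changes shape between months 4 and 5 only.
toy = pd.DataFrame(
    {
        "name": ["Toy Feed"] * 6,
        "source_record_id": ["rec0"] * 6,
        "route_id": ["1", "2", "1", "2", "1", "2"],
        "shape_id": ["a_old", "b", "a_new", "b", "a_new", "b"],
        "month": [4, 4, 5, 5, 6, 6],
    }
)
result = compare_all_months(toy)
```

Here `result` holds two rows, both for route `1` in the month 4 vs. month 5 comparison: the old shape as `left_only` and the new shape as `right_only`.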