# Route Identification Over Time, Approach 1

Recent observations show small changes in routes over time, specifically in the following fields:
* route ID
* route short name
* route long name
* route desc

These route changes need to be observed so that future analyses can account for them.

## Objective
1. Query data from `fct_monthly_routes` to help identify variances in routes. Query a couple of months in 2023.
2. Save data to the GCS `gtfs_schedule` bucket
3. Filter down data to `Sacramento Regional Transit`, then identify and observe routes for any variances

* https://github.com/cal-itp/data-analyses/issues/924

In [7]:
import geopandas as gpd
import pandas as pd
from calitp_data_analysis import geography_utils
from shared_utils import rt_dates
from calitp_data_analysis.tables import tbls
from siuba import *
import datetime

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

### Tables
* https://dbt-docs.calitp.org/#!/source/source.calitp_warehouse.external_gtfs_schedule.routes

In [8]:
analysis_date = datetime.date(2023, 11, 15)

In [10]:
def external_gtfs(date: datetime.date) -> pd.DataFrame:
    """Return the rows of the external GTFS schedule routes table for one date."""
    df = (
        tbls.external_gtfs_schedule.routes()
        >> filter(_.dt == date)
        >> collect()
    )
    return df

In [11]:
route_info = external_gtfs(analysis_date)

In [13]:
route_info.head()

Unnamed: 0,_line_number,agency_id,route_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color,route_text_color,route_sort_order,continuous_pickup,continuous_drop_off,network_id,dt,ts,base64_url
0,1,,601,601,ROUTE 601,,3,https://countyconnection.com/routes/601/,7a99ac,ffffff,,,,,2023-11-01,2023-11-01 03:00:27.975483+00:00,aHR0cDovL2NjY3RhLm9yZy9HVEZTL2dvb2dsZV90cmFuc2l0LnppcA==
1,2,,602,602,ROUTE 602,,3,https://countyconnection.com/routes/602/,7a99ac,ffffff,,,,,2023-11-01,2023-11-01 03:00:27.975483+00:00,aHR0cDovL2NjY3RhLm9yZy9HVEZTL2dvb2dsZV90cmFuc2l0LnppcA==
2,3,,603,603,ROUTE 603,,3,https://countyconnection.com/routes/603/,7a99ac,ffffff,,,,,2023-11-01,2023-11-01 03:00:27.975483+00:00,aHR0cDovL2NjY3RhLm9yZy9HVEZTL2dvb2dsZV90cmFuc2l0LnppcA==
3,4,,605,605,ROUTE 605,,3,https://countyconnection.com/routes/605/,7a99ac,ffffff,,,,,2023-11-01,2023-11-01 03:00:27.975483+00:00,aHR0cDovL2NjY3RhLm9yZy9HVEZTL2dvb2dsZV90cmFuc2l0LnppcA==
4,5,,606,606,ROUTE 606,,3,https://countyconnection.com/routes/606/,7a99ac,ffffff,,,,,2023-11-01,2023-11-01 03:00:27.975483+00:00,aHR0cDovL2NjY3RhLm9yZy9HVEZTL2dvb2dsZV90cmFuc2l0LnppcA==


In [None]:
gdf = gpd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/gtfs_schedule/route_identification_2023_m04_m09.parquet"
)

In [None]:
gdf.shape

In [None]:
gdf = gdf.drop(columns=["year", "key"])

In [None]:
gdf.sample().drop(columns=["geometry"])

### LA Metro Example



In [None]:
# gdf.name.unique()

In [None]:
la_metro = gdf.loc[gdf.name == "LA Metro Bus Schedule"].reset_index(drop=True)

In [None]:
la_metro.shape

In [None]:
la_metro.month.unique()

In [None]:
# Compare each month
# https://stackoverflow.com/questions/47769453/pandas-split-dataframe-to-multiple-by-unique-values-rows
la_dfs = dict(tuple(la_metro.groupby("month")))

In [None]:
la_metro_df = la_metro.drop(columns=["geometry"])

In [None]:
la_metro_summary = (
    la_metro_df.groupby(["route_id"])
    .agg({"shape_id": "nunique",
         "month":"nunique"})
    .sort_values(["shape_id"], ascending=False)
    .reset_index()
    .rename(columns = {'shape_id':'total_unique_shapes', 'month':'total_unique_months'})
)

In [None]:
la_metro_summary.shape

#### The same route ID can have 3 unique shape IDs associated with it across 4 different months.

In [None]:
la_metro_summary.loc[la_metro_summary.total_unique_shapes > 1].shape

In [None]:
la_metro_summary.loc[la_metro_summary.total_unique_shapes > 1]
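
The same groupby-and-`nunique` pattern could be pointed at the descriptive fields listed at the top of this notebook. The following is a minimal sketch on toy data; `count_field_variants` is a hypothetical helper, and it assumes the data carries `route_short_name`, `route_long_name`, and `route_desc` columns alongside `route_id`.

```python
import pandas as pd


def count_field_variants(
    df,
    key="route_id",
    fields=("route_short_name", "route_long_name", "route_desc"),
):
    """Count distinct values of each field per key; keep keys where any field varies.

    Hypothetical helper, not part of the notebook.
    """
    cols = list(fields)
    counts = df.groupby(key)[cols].nunique().reset_index()
    # A count above 1 means the field took more than one value for that key.
    changed = counts[cols].gt(1).any(axis=1)
    return counts.loc[changed].reset_index(drop=True)


# Toy data (made up): route "1" changes its long name between months.
toy_routes = pd.DataFrame(
    {
        "route_id": ["1", "1", "2", "2"],
        "month": [4, 5, 4, 5],
        "route_short_name": ["1", "1", "2", "2"],
        "route_long_name": ["Main St", "Main Street", "Oak Ave", "Oak Ave"],
        "route_desc": ["", "", "", ""],
    }
)
variants = count_field_variants(toy_routes)
print(variants)
```

Only routes with more than one distinct value in at least one field are returned, which keeps the list short enough to review by hand.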

#### Evaluate each month using merges

In [None]:
april = la_dfs[4]

In [None]:
may = la_dfs[5]

In [None]:
june = la_dfs[6]

In [None]:
m1 = (
    pd.merge(
        april,
        may,
        on=["name", "source_record_id", "route_id", "shape_id"],
        how="outer",
        indicator=True,
    )
    .rename(columns={"_merge": "april_v_may"})
    .drop(columns=["month_x", "month_y"])
)

In [None]:
preview = ["source_record_id", "name", "route_id", "shape_id", "april_v_may"]

* The pattern appears again: route_id 901-13167 has a different shape_id in April vs. May

In [None]:
m1.loc[m1.april_v_may != "both"].shape

In [None]:
m1.shape

In [None]:
m1.loc[m1.april_v_may != "both"][preview]
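
The April-versus-May merge above could be wrapped in a function so that any pair of months is compared the same way, with indicator values that name the months instead of `left_only`/`right_only`. This is a minimal sketch on plain pandas with toy data; `compare_months`, `MERGE_KEYS`, and the label strings are hypothetical names, and the geometry columns are left out for brevity.

```python
import pandas as pd

MERGE_KEYS = ["name", "source_record_id", "route_id", "shape_id"]


def compare_months(df, month_a, month_b, label_a, label_b):
    """Outer-merge two months of route rows and label where each row appears.

    Hypothetical helper: assumes df has a `month` column plus MERGE_KEYS.
    """
    left = df.loc[df.month == month_a, MERGE_KEYS].drop_duplicates()
    right = df.loc[df.month == month_b, MERGE_KEYS].drop_duplicates()
    merged = pd.merge(left, right, on=MERGE_KEYS, how="outer", indicator=True)
    # Replace left_only/right_only with labels that name the months.
    labels = {
        "left_only": f"{label_a}_only",
        "right_only": f"{label_b}_only",
        "both": "both",
    }
    col = f"{label_a}_v_{label_b}"
    merged[col] = merged["_merge"].astype(str).map(labels)
    return merged.drop(columns=["_merge"])


# Toy data (made up): route "20" switches shape between month 4 and month 5.
toy = pd.DataFrame(
    {
        "name": ["Toy Feed"] * 4,
        "source_record_id": ["rec1"] * 4,
        "route_id": ["10", "20", "10", "20"],
        "shape_id": ["10_a", "20_a", "10_a", "20_b"],
        "month": [4, 4, 5, 5],
    }
)
result = compare_months(toy, 4, 5, "april", "may")
print(result[["route_id", "shape_id", "april_v_may"]])
```

Calling the same function for each consecutive pair of months would give one comparison table per pair without repeating the merge code.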

#### Eyeball the maps
* 237-13167

In [None]:
m1.loc[m1.april_v_may == "left_only"].set_geometry("geometry_x").explore(
    "route_id", style_kwds={"weight": 10}
)

In [None]:
m1.loc[m1.april_v_may == "right_only"].set_geometry("geometry_y").explore(
    "route_id", style_kwds={"weight": 10}
)

#### How to check if it's the same route under different names?
* Change the CRS
* Take the length for each month? 

In [None]:
m1 = m1.set_geometry("geometry_x").to_crs(geography_utils.CA_StatePlane)

In [None]:
m1["april_len"] = m1.geometry_x.length

In [None]:
m1 = m1.set_geometry("geometry_y").to_crs(geography_utils.CA_StatePlane)

In [None]:
m1["may_len"] = m1.geometry_y.length

In [None]:
may.shape, april.shape

In [None]:
group1 = (
    m1.loc[m1.april_v_may != "both"]
    .groupby(
        [
            "route_id",
        ]
    )
    .agg({"april_len": "max", "may_len": "max"})
    .reset_index()
)

In [None]:
# April length as a percentage of May length
group1['length_pct'] = (group1.april_len / group1.may_len * 100).astype('int64')

In [None]:
# Signed difference in length (April minus May), in the units of the projected CRS
group1['length'] = (group1.april_len - group1.may_len).astype('int64')

In [None]:
group1
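
One caveat with the `astype('int64')` calls above: they would raise an error if any route in `group1` had a length for only one of the two months, because NaN cannot be cast to an integer. A hedged sketch of a more tolerant comparison follows. `flag_similar_length` is a hypothetical helper, and the 5% tolerance is an arbitrary placeholder, not a threshold established by this analysis.

```python
import numpy as np
import pandas as pd


def flag_similar_length(df, col_a, col_b, tolerance_pct=5.0):
    """Add a percent length difference and a flag for lengths within tolerance.

    Hypothetical helper; tolerance_pct is a placeholder value.
    """
    out = df.copy()
    out["length_diff"] = out[col_a] - out[col_b]
    out["length_diff_pct"] = (out["length_diff"].abs() / out[col_b] * 100).round(1)
    # Comparisons against NaN evaluate to False, so a route that is missing
    # a length in either month is never flagged as similar.
    out["similar_length"] = out["length_diff_pct"] <= tolerance_pct
    return out


# Toy lengths (made up): A barely changes, B doubles, C has no May length.
toy_lengths = pd.DataFrame(
    {
        "route_id": ["A", "B", "C"],
        "april_len": [1000.0, 1000.0, 500.0],
        "may_len": [1010.0, 2000.0, np.nan],
    }
)
flagged = flag_similar_length(toy_lengths, "april_len", "may_len")
print(flagged)
```

Keeping the lengths as floats avoids the integer cast entirely, and the tolerance can be tuned after eyeballing a few routes with `explore()`.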

## To do
* Find other examples besides LA Metro in which the route ID, description, shape ID, etc. changed
    * Can use various combinations of `groupby`
* Compare all the months you've queried 
    * Can use `pd.merge` and turn on the indicator.
    * Can also change the values of the indicator so it isn't just left_only/right_only/both
* What's the threshold for the length to be considered the same route?
    * Can also eyeball routes using `explore()`
* Find steps that are repeated month after month and turn them into functions
* Clean up string columns
    * "Route ABC", "route abc", "   Route abc", and "Route-ABC" are all considered different strings.
    * Make sure to apply `lstrip` and `rstrip`, remove punctuation, and pick one case (lowercase, title case, or uppercase) for the columns you want to compare, so that the same string presented in different formats becomes as uniform as possible.
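
The string clean-up described in the last item could look something like the sketch below. `clean_text_column` is a hypothetical helper; it assumes that replacing punctuation with a space (rather than deleting it) is the desired behavior, so that "Route-ABC" and "Route ABC" end up identical.

```python
import string

import pandas as pd


def clean_text_column(series):
    """Lowercase, turn punctuation into spaces, and collapse whitespace.

    Hypothetical helper for making route names comparable.
    """
    # Map every ASCII punctuation character to a space.
    punct_to_space = str.maketrans(
        string.punctuation, " " * len(string.punctuation)
    )
    return (
        series.str.translate(punct_to_space)
        .str.lower()
        .str.split()  # splits on any run of whitespace, dropping leading/trailing
        .str.join(" ")
    )


names = pd.Series(["Route ABC", "route abc", "   Route abc", "Route-ABC"])
cleaned = clean_text_column(names)
print(cleaned.unique())
```

All four variants from the list above collapse to `route abc`. Cleaning into a new column rather than overwriting the original would keep the raw values available for reporting.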