# Are route categories stable quarter to quarter?

If a `route_id` is categorized as `parallel` in one quarter, could it switch to `on_shn` in another? Categories should be fairly stable, since a bus route rarely deviates drastically from its original path.

Freeways don't change quarter to quarter.

Check whether there are large shifts in categories between the prior quarter and the current quarter.

In [1]:
import geopandas as gpd
import pandas as pd

from shared_utils import rt_dates
from update_vars import BUS_SERVICE_GCS



In [2]:
# Only look at v2 warehouse
dfs = {}
for key, date in rt_dates.PMAC.items():
    if "2023" in date:
        df = gpd.read_parquet(f"{BUS_SERVICE_GCS}routes_categorized_{date}.parquet")
        dfs[key] = df

In [3]:
keep_cols = [
    "feed_key", "name",
    "category", "route_id", 
    "district"
]

df1 = dfs["Q1_2023"][keep_cols]
df2 = dfs["Q2_2023"][keep_cols]

In [4]:
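def compare_col(df1, df2, col):
    """Print counts, then shares, of `col` for df1 followed by df2."""
    # Raw counts per value
    print(df1[col].value_counts())
    print(df2[col].value_counts())
    # Shares of the total (normalize=True)
    print(df1[col].value_counts(normalize=True))
    print(df2[col].value_counts(normalize=True))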

In [5]:
compare_col(df1, df2, "category")

intersects_shn    1229
on_shn             548
other              492
Name: category, dtype: int64
intersects_shn    1248
on_shn             566
other              519
Name: category, dtype: int64
intersects_shn    0.541648
on_shn            0.241516
other             0.216836
Name: category, dtype: float64
intersects_shn    0.534934
on_shn            0.242606
other             0.222460
Name: category, dtype: float64


In [6]:
compare_col(df1, df2, "district")

4.0     664
7.0     518
3.0     216
8.0     184
11.0    151
5.0     142
10.0     93
6.0      86
12.0     71
2.0      41
1.0      39
9.0      16
Name: district, dtype: int64
4.0     647
7.0     529
3.0     214
8.0     181
11.0    163
5.0     144
10.0    129
6.0     116
12.0     71
1.0      40
2.0      35
9.0      16
Name: district, dtype: int64
4.0     0.298964
7.0     0.233228
3.0     0.097253
8.0     0.082846
11.0    0.067987
5.0     0.063935
10.0    0.041873
6.0     0.038721
12.0    0.031968
2.0     0.018460
1.0     0.017560
9.0     0.007204
Name: district, dtype: float64
4.0     0.283151
7.0     0.231510
3.0     0.093654
8.0     0.079212
11.0    0.071335
5.0     0.063020
10.0    0.056455
6.0     0.050766
12.0    0.031072
1.0     0.017505
2.0     0.015317
9.0     0.007002
Name: district, dtype: float64
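The stacked printouts from `compare_col` are hard to scan by eye. As a minimal sketch (the helper name `compare_col_side_by_side`, its `labels` parameter, and the toy data are hypothetical, not part of this notebook), the same counts and shares can be aligned in one table with the change in share added as a column:

```python
import pandas as pd


def compare_col_side_by_side(df1, df2, col, labels=("q1", "q2")):
    """Return counts and shares of `col` for two frames in one table."""
    a, b = labels
    # Counts per value, aligned on the union of values seen in either frame
    counts = pd.concat(
        [df1[col].value_counts(), df2[col].value_counts()],
        axis=1,
        keys=[f"n_{a}", f"n_{b}"],
    ).fillna(0).astype(int)
    # Shares of the total for each frame
    shares = pd.concat(
        [
            df1[col].value_counts(normalize=True),
            df2[col].value_counts(normalize=True),
        ],
        axis=1,
        keys=[f"share_{a}", f"share_{b}"],
    ).fillna(0)
    out = counts.join(shares)
    # Positive values mean the category grew as a share of all routes
    out["share_diff"] = out[f"share_{b}"] - out[f"share_{a}"]
    return out


# Toy data for illustration only (not the real route tables)
toy_q1 = pd.DataFrame({"category": ["on_shn", "on_shn", "other", "intersects_shn"]})
toy_q2 = pd.DataFrame({"category": ["on_shn", "other", "other", "intersects_shn"]})

summary = compare_col_side_by_side(toy_q1, toy_q2, "category")
print(summary)
```

Applied to the real tables, this would be called as `compare_col_side_by_side(df1, df2, "category")` or with `"district"`.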


In [7]:
# Outer merge so routes present in only one quarter are flagged by the indicator
m1 = pd.merge(
    df1,
    df2,
    on=["feed_key", "name", "route_id"],
    how="outer",
    validate="1:1",
    indicator="compare_categories",
)

In [8]:
m1.compare_categories.value_counts()

right_only    2061
left_only     1997
both           272
Name: compare_categories, dtype: int64
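Only 272 rows match in both quarters, which is a small share of either table. One possible explanation, which is an assumption and is not verified in this notebook, is that `feed_key` differs between quarters for the same operator, so identical routes fail to match. The sketch below uses made-up data and a hypothetical helper, `match_counts`, to show how match rates can be compared across different sets of merge keys:

```python
import pandas as pd


def match_counts(left, right, keys):
    """Count left_only / right_only / both rows for an outer merge on `keys`."""
    merged = pd.merge(
        left[keys].drop_duplicates(),
        right[keys].drop_duplicates(),
        on=keys,
        how="outer",
        indicator=True,
    )
    return merged["_merge"].value_counts().to_dict()


# Toy data: "Agency A" has a different feed_key in the second quarter
# (assumed behavior for illustration, not confirmed for the real data)
toy_q1 = pd.DataFrame({
    "feed_key": ["a1", "a1", "b1"],
    "name": ["Agency A", "Agency A", "Agency B"],
    "route_id": ["1", "2", "1"],
})
toy_q2 = pd.DataFrame({
    "feed_key": ["a2", "a2", "b1"],
    "name": ["Agency A", "Agency A", "Agency B"],
    "route_id": ["1", "2", "1"],
})

with_feed_key = match_counts(toy_q1, toy_q2, ["feed_key", "name", "route_id"])
without_feed_key = match_counts(toy_q1, toy_q2, ["name", "route_id"])
print(with_feed_key)
print(without_feed_key)
```

If dropping `feed_key` from the keys raises the `both` count substantially on the real tables, the category comparison would cover many more routes. In that case, uniqueness of `name` plus `route_id` would need to be checked before relying on `validate="1:1"`.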

In [9]:
in_both = m1[(m1.compare_categories=="both")]

In [10]:
in_both.shape

(272, 8)

In [11]:
in_both[(in_both.category_x != in_both.category_y)].shape

(0, 8)
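None of the 272 routes found in both quarters changed category. When some routes do change, a transition table shows which categories they moved between. The following is a minimal sketch on made-up data (the frame `toy_both` is hypothetical and stands in for `in_both`), using `pd.crosstab` on the `category_x` and `category_y` columns produced by the merge:

```python
import pandas as pd

# Toy stand-in for `in_both`: one route moves from `other` to `on_shn`
toy_both = pd.DataFrame({
    "route_id": ["1", "2", "3", "4"],
    "category_x": ["on_shn", "other", "intersects_shn", "other"],
    "category_y": ["on_shn", "other", "intersects_shn", "on_shn"],
})

# Rows are the prior quarter's category, columns are the current quarter's;
# off-diagonal cells count routes that changed category
transitions = pd.crosstab(toy_both["category_x"], toy_both["category_y"])

# Routes whose category differs between the two quarters
changed = toy_both[toy_both["category_x"] != toy_both["category_y"]]
share_changed = len(changed) / len(toy_both)

print(transitions)
print(f"{share_changed:.1%} of matched routes changed category")
```

On the real data, the equivalent call would be `pd.crosstab(in_both.category_x, in_both.category_y)`, which should be fully diagonal given the result above.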