# Route Identification Over Time, Approach 1

Recent observations show small changes in routes over time, specifically in the following fields:
* route ID
* route short name
* route long name
* route desc

These route changes need to be tracked so that future analyses can account for them.
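
One way to surface these changes, sketched here with made-up rows, is to count the distinct values of each descriptive field per `route_id` across months. The column names follow the fields listed above; the sample data and the `changed_fields` helper are hypothetical and are not part of the warehouse or `shared_utils`.

```python
import pandas as pd

# Hypothetical sample: one row per route per month (not real warehouse data).
routes = pd.DataFrame(
    {
        "route_id": ["001", "001", "011", "011"],
        "month": [4, 5, 4, 5],
        "route_short_name": ["1", "1", "11", "11"],
        "route_long_name": ["Long A", "Long A", "Long B", "Long B (revised)"],
        "route_desc": [None, None, None, None],
    }
)

FIELDS = ["route_short_name", "route_long_name", "route_desc"]


def changed_fields(df: pd.DataFrame) -> pd.DataFrame:
    """Return routes with more than one distinct value in any field."""
    # dropna=False counts a switch between a value and a missing value as a change
    counts = df.groupby("route_id")[FIELDS].nunique(dropna=False)
    return counts[(counts > 1).any(axis=1)]


result = changed_fields(routes)
print(result)
```

Any route that appears in the result has at least one field that took more than one value over the months queried.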

## Objective
1. Query data from `fct_monthly_routes` to help identify variances in routes. Query a few months of 2023.
2. Save the data to the GCS `gtfs_schedule` bucket.
3. Filter the data down to `Sacramento Regional Transit`, then identify and observe routes for any variances.


In [1]:
#imports copied from download_vehicle_position.py script

import datetime
import gcsfs
import geopandas as gpd
import pandas as pd
import shapely
import sys

from calitp_data_analysis.tables import tbls
from calitp_data_analysis import utils
from loguru import logger
from siuba import *

from shared_utils import schedule_rt_utils
from shared_utils import geography_utils



# test to query fct_monthly_routes. DO NOT RUN

def get_monthly_routes(
    year: int,
    months: list
) -> pd.DataFrame:
    # pull selected columns for the given year and list of month numbers
    df = (tbls.mart_gtfs.fct_monthly_routes()
          >> filter(_.year == year)
          >> filter(_.month.isin(months))
          >> select(_.key, _.source_record_id,
                    _.name,
                    _.route_id, _.shape_id,
                    _.month,
                    _.year,
                    _.pt_array)
          >> collect()
         )
    return df

In [None]:
# df_pt = get_monthly_routes(2023, [4, 5, 6, 7, 8, 9])

In [None]:
#from Metabase: joined fct_monthly_routes with dim_providers_gtfs_data and saved results to GCS
#test reading in data for Sacramento schedule, m4 to m9

#df = pd.read_csv('gs://calitp-analytics-data/data-analyses/gtfs_schedule/sacrament_schedule_route_id_2023_m4_m9.csv')

In [None]:
#df.shape

## cleaning metabase data of fct_monthly_routes and dim_providers_gtfs_data

In [None]:
#looking at columns name
#df.columns

#function that replaces spaces with _ and makes column names lower case
#effectively making them snake_case
def cleaner(df):
    df.columns = df.columns.str.replace(' ','_')
    df.columns = df.columns.str.lower()
    return df

In [None]:
#cleaner(df)

In [None]:
#removing dim providers string
#df.columns = df.columns.str.replace('dim_provider_gtfs_data_→_','')


In [None]:
#df.columns

#need a way to get point array back
#merge df with regular fct_monthly_routes on key

merge = pd.merge(df, df_pt, on=['key', 
                                'name', 
                                'month', 
                                'source_record_id', 
                                'route_id',
                                'year',
                               ])

display(merge.shape)
display(type(merge))
list(merge.columns)
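
Because the merge above uses the default inner join, rows that fail to match on any key silently disappear. A minimal sketch, using hypothetical stand-in frames rather than the real `df` and `df_pt`, of how `indicator=True` on an outer merge can show which side the unmatched rows came from:

```python
import pandas as pd

# Hypothetical stand-ins for the Metabase export (left) and the warehouse query (right).
left = pd.DataFrame({"key": ["a", "b", "c"], "route_id": ["001", "011", "013"]})
right = pd.DataFrame({"key": ["a", "b", "d"], "pt_array": [["p1"], ["p2"], ["p4"]]})

# An outer merge keeps every row; the _merge column records where each row was found.
check = pd.merge(left, right, on="key", how="outer", indicator=True)

# Rows not present in both frames would be dropped by an inner merge.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["key", "_merge"]])
```

If `unmatched` is empty, the inner merge loses nothing; otherwise the `_merge` column says whether each row is `left_only` or `right_only`.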


In [None]:
#merge.head(3)

merge2=merge.drop(columns=['distinct_values_of_key',
 'shape_id_y'])
merge2.columns

In [None]:
#making merge into a gdf now that I have pt geom back

#sac_m4m9= geography_utils.make_routes_gdf(merge2, 'EPSG:4326')

type(sac_m4m9)
display(sac_m4m9.geometry.name)
display(list(sac_m4m9.columns))

In [None]:
#sac_m4m9.plot()

In [None]:
#uploading as geoparquet to GCS
#utils.geoparquet_gcs_export(sac_m4m9, "gs://calitp-analytics-data/data-analyses/gtfs_schedule/", "sac_route_variance_m4m9_geo")

In [8]:
#attempt at reading in geoparquet from GCS
sac_m4m9 = pd.read_parquet('gs://calitp-analytics-data/data-analyses/gtfs_schedule/sac_route_variance_m4m9_geo.parquet')

In [9]:
# attempt to read in dataframe from approach 2
sac_trip_desc = pd.read_parquet('gs://calitp-analytics-data/data-analyses/gtfs_schedule/sac_trips_route_identification_2023_m04_m09.parquet')

In [10]:
display(type(sac_trip_desc))
display(sac_trip_desc.shape)
list(sac_trip_desc.columns)

pandas.core.frame.DataFrame

(326, 8)

['feed_key',
 'name',
 'schedule_gtfs_dataset_key',
 'route_id',
 'route_short_name',
 'route_long_name',
 'route_desc',
 'month']

In [11]:
display(type(sac_m4m9))
display(sac_m4m9.shape)
list(sac_m4m9.columns)

pandas.core.frame.DataFrame

(1179, 18)

['key',
 'schedule_gtfs_dataset_key',
 'name',
 'month',
 'year',
 'month_last_day',
 'valid_from',
 'valid_to',
 'source_record_id',
 'route_id',
 'shape_id_x',
 'organization_name',
 'service_name',
 'schedule_gtfs_dataset_name',
 'organization_source_record_id',
 'service_source_record_id',
 'schedule_source_record_id',
 'geometry']

## Dataframe Comparison
Comparing the data from `fct_monthly_routes`/`dim_providers_gtfs_data` (aka `sac_m4m9`) with the data from `helpers.import_scheduled_trips` (aka `sac_trip_desc`).

In [12]:
#comparing gtfs dataset keys
display(sac_m4m9.schedule_gtfs_dataset_key.value_counts())
display(sac_trip_desc.schedule_gtfs_dataset_key.value_counts())

#observed 1 dataset key in `sac_m4m9`, but 2 dataset keys in `sac_trip_desc`
#`sac_trip_desc` pulls data from `helpers.import_scheduled_trips`


cb3074eb8b423dfc5acfeeb0de95eb82    1179
Name: schedule_gtfs_dataset_key, dtype: int64

43a1e46d592a1ee647bce8422c68460c    260
cb3074eb8b423dfc5acfeeb0de95eb82     66
Name: schedule_gtfs_dataset_key, dtype: int64

In [13]:
#comparing months
display(sac_m4m9.month.value_counts())
display(sac_trip_desc.month.value_counts())
#missing June 2023 from sac_trip_desc??

8    201
9    198
4    195
5    195
6    195
7    195
Name: month, dtype: int64

sep      66
april    65
may      65
july     65
aug      65
Name: month, dtype: int64
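
The two frames label months differently (integers in `sac_m4m9`, short names in `sac_trip_desc`), so they cannot be compared directly. A small sketch, assuming the five labels shown in the output above are the only ones present; the `MONTH_NUMBERS` mapping is introduced here for illustration only:

```python
import pandas as pd

# Month labels as they appear in the two outputs above.
routes_months = pd.Series([4, 5, 6, 7, 8, 9])
trips_months = pd.Series(["april", "may", "july", "aug", "sep"])

# Hypothetical mapping from the text labels to month numbers.
MONTH_NUMBERS = {"april": 4, "may": 5, "july": 7, "aug": 8, "sep": 9}

trips_as_numbers = trips_months.map(MONTH_NUMBERS)

# Months in the routes data with no counterpart in the trips data.
missing = sorted(set(routes_months.tolist()) - set(trips_as_numbers.tolist()))
print(missing)
```

With the labels above, the only month left over is June, which matches the observation in the cell comment.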

In [14]:
sac_trip_desc.pivot_table(index = ['schedule_gtfs_dataset_key','month'],
                      values = ['feed_key'],
                     aggfunc='count'
                             ).reset_index()
#only Sept has the cb307... dataset key that matches sac_m4m9
#April through Aug (June is missing) have 65 routes with the 43a1e... dataset key, but Sept has 66 routes.
#What is the difference between these months? What routes are in each of these months?

| | schedule_gtfs_dataset_key | month | feed_key |
|---|---|---|---|
| 0 | 43a1e46d592a1ee647bce8422c68460c | april | 65 |
| 1 | 43a1e46d592a1ee647bce8422c68460c | aug | 65 |
| 2 | 43a1e46d592a1ee647bce8422c68460c | july | 65 |
| 3 | 43a1e46d592a1ee647bce8422c68460c | may | 65 |
| 4 | cb3074eb8b423dfc5acfeeb0de95eb82 | sep | 66 |
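
To answer which routes are in each month, one option is a presence matrix with one row per route and one column per month. The sketch below uses a hypothetical trips table with the same `route_id` and `month` column names as `sac_trip_desc`; the route IDs, including `142`, are invented for illustration.

```python
import pandas as pd

# Hypothetical trips table with the same column names as sac_trip_desc.
trips = pd.DataFrame(
    {
        "route_id": ["001", "011", "001", "011", "142"],
        "month": ["april", "april", "sep", "sep", "sep"],
    }
)

# One row per route, one column per month; cells hold the number of rows found.
presence = pd.crosstab(trips["route_id"], trips["month"])

# Routes absent from at least one month.
partial = presence[(presence == 0).any(axis=1)]
print(partial)
```

A route that shows a zero in any column is the kind of difference that could explain 65 routes in one month and 66 in another.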


In [76]:
april = sac_trip_desc[sac_trip_desc.month=='april'].route_id.unique()
sep = sac_trip_desc[sac_trip_desc.month=='sep'].route_id.unique()

In [82]:
month_list = [
    'april',
    'aug',
    'july',
    'may',
    'sep']

In [136]:
#function that returns the sorted unique route IDs for one month in the helpers.import_scheduled_trips dataset
#the month has to be passed in manually
def route(month):
    array = sac_trip_desc[sac_trip_desc.month==month].route_id.unique()
    series = pd.Series(array).sort_values()
    return series

In [137]:
#attempt to compare the routes in each month from the helpers.import_scheduled_trips dataframe
#which routes exist / do not exist compared to other months.
april = route('april')
may = route('may')
july = route('july')
aug = route('aug')
sep = route('sep')

In [141]:
april.isin(may)

0     True
1     True
2     True
3     True
4     True
      ... 
54    True
55    True
56    True
57    True
58    True
Length: 65, dtype: bool
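
`Series.isin` returns one boolean per element, which is hard to scan when the output is truncated. A sketch of reducing that to the routes that actually differ, with hypothetical route IDs standing in for the output of `route()`:

```python
import pandas as pd

# Hypothetical route ID lists for two months.
april = pd.Series(["001", "011", "013"])
sep = pd.Series(["001", "011", "013", "142"])

# True only if every April route also appears in September.
all_april_in_sep = bool(april.isin(sep).all())

# Invert the mask to list routes present in one month but not the other.
only_in_sep = sep[~sep.isin(april)].tolist()
only_in_april = april[~april.isin(sep)].tolist()

print(all_april_in_sep, only_in_sep, only_in_april)
```

Running the comparison in both directions matters, since a route can be added in one month and a different one dropped.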

In [139]:
may

0     001
1     011
2     013
3     015
4     019
     ... 
57    507
58    519
59    533
60    F10
61    F20
Length: 65, dtype: object

# HALL OF SHAME

---


---


In [None]:
#peeking into df to make sure everything looks good

#shape shows 11,927 rows and 8 columns
#display(df.shape)

#type shows data is in df
#display(type(df))

#columns return all the columns we listed in the function
#display(list(df.columns))

#value_counts confirm df only has rows from 2023 March to May
#display(df.value_counts(subset=['year','month']))

### via Tiffany, convert df to a GeoDataFrame
* <b>COMPLETE</b> Query data from warehouse
* then use this snippet from `make_routes_gdf` from `_shared_utils/shared_utils/geography_utils.py`.
    * `ddf["geometry"] = ddf.pt_array.apply(make_linestring)`
<br>  
* <b>COMPLETE</b> then save out as geo parquet to the `gtfs_schedule` folder in GCS (so versioning and history stays) using 
    * `utils.geoparquet_gcs_export(vp_gdf, SEGMENT_GCS, f"vp_{analysis_date}")`
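
The `pt_array.apply(make_linestring)` step can be illustrated with a rough sketch. This is not the `shared_utils` implementation: `make_linestring_sketch` is a hypothetical helper, and it assumes each `pt_array` entry is a list of WKT point strings, which should be checked against the actual warehouse column.

```python
import pandas as pd
from shapely import wkt
from shapely.geometry import LineString


def make_linestring_sketch(points):
    """Build a LineString from a list of WKT point strings (assumed format)."""
    # A line needs at least two points; return None otherwise.
    if points is None or len(points) < 2:
        return None
    parsed = [wkt.loads(p) for p in points]
    return LineString([(pt.x, pt.y) for pt in parsed])


# Hypothetical rows; the second has too few points to form a line.
df = pd.DataFrame(
    {
        "route_id": ["001", "011"],
        "pt_array": [["POINT (0 0)", "POINT (3 4)"], ["POINT (1 1)"]],
    }
)
df["geometry"] = df["pt_array"].apply(make_linestring_sketch)
print(df[["route_id", "geometry"]])
```

Rows that come back without a geometry are worth counting before export, since they would otherwise vanish from any plot.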

In [None]:
#test of make_routes_gdf. DO NOT RUN

# aprl_sept_2023_routes = geography_utils.make_routes_gdf(df, "EPSG:4326")

In [None]:
#display(type(aprl_sept_2023_routes))
#list(aprl_sept_2023_routes.columns)
#display(aprl_sept_2023_routes.geometry.name)

---

In [None]:
#creating sub-df for 'Sacramento Schedule'. DO NOT RUN

# sac = aprl_sept_2023_routes[aprl_sept_2023_routes['name'] == 'Sacramento Schedule']

In [None]:
# type(sac)

In [None]:
#writing sac filtered gdf to gcs as geoparquet. DO NOT RUN AGAIN

# utils.geoparquet_gcs_export(sac, "gs://calitp-analytics-data/data-analyses/gtfs_schedule/", "sac_route_identification_2023_m04_m09_geo")

In [None]:
#Now we can read in the parquet file from gcs without having to remake everything again.

sac_gdf = gpd.read_parquet('gs://calitp-analytics-data/data-analyses/gtfs_schedule/sac_route_identification_2023_m04_m09_geo.parquet')

In [None]:
type(sac_gdf)

In [None]:
display(sac_gdf.shape)
display(sac_gdf.head())

In [None]:
sac_gdf.plot()

In [None]:
#attempt at creating a sub df for each route to plot

In [None]:
sac_routes = sac_gdf['route_id'].unique()


In [None]:
#empty dictionary
sac_sub_route_ids = {}

#loop over each unique route ID in sac_routes, one at a time.
#for each route, select the rows for that route
#and store the resulting dataframe in the dictionary sac_sub_route_ids
for route in sac_routes:
    sub_df = sac_gdf[sac_gdf['route_id'] == route]
    sac_sub_route_ids[route] = sub_df

In [None]:
len(sac_sub_route_ids)

In [None]:
#testing dictionary with route 23 and 88
sac_sub_route_ids['023']

In [None]:
#function to plot routes and to check for any variation in route geometry
def route_checker(route_id):
    display(sac_sub_route_ids[route_id].plot())
    display(f'geometry checker...{sac_sub_route_ids[route_id].geometry.value_counts()}')
    

In [None]:
route_checker('088')

At this point we determined that we need more data than what `fct_monthly_routes` has. See approach 2.

## Next Steps

For every `name` and `route_id` in routes, we need to see whether the rows are the same or not, in order to identify any variation in the routes.
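
A possible way to do this without inspecting each route by hand is to compare each row with the previous month's row for the same `name` and `route_id`. The rows below are hypothetical, and `shape_id` is used only as an example of a column to compare.

```python
import pandas as pd

# Hypothetical rows: one per feed name, route and month.
routes = pd.DataFrame(
    {
        "name": ["Sacramento Schedule"] * 4,
        "route_id": ["023", "023", "023", "088"],
        "month": [4, 5, 6, 4],
        "shape_id": ["s1", "s1", "s2", "s9"],
    }
).sort_values(["name", "route_id", "month"])

# Previous month's value within each (name, route_id) group.
previous = routes.groupby(["name", "route_id"])["shape_id"].shift()

# A change is a row whose value differs from an existing previous value.
routes["changed"] = previous.notna() & (routes["shape_id"] != previous)
print(routes[routes["changed"]])
```

The first month of each route has nothing to compare against, so it is never flagged; only later months that differ from their predecessor are.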



---

Trying to use a loop that creates a df for every route_id in `sac_routes`, but this would eventually need to be done for every `name` in the `fct_monthly_routes` df?
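
One alternative to nesting a loop over `name` inside a loop over `route_id` is to let `groupby` produce the pairs directly. A sketch with hypothetical rows (the Auburn route ID is invented), keyed by `(name, route_id)` tuples:

```python
import pandas as pd

# Hypothetical rows covering two feeds.
df = pd.DataFrame(
    {
        "name": ["Sacramento Schedule", "Sacramento Schedule", "Auburn Schedule"],
        "route_id": ["023", "088", "10"],
        "month": [4, 4, 4],
    }
)

# Iterating a groupby yields ((name, route_id), sub-DataFrame) pairs.
sub_routes = {key: sub_df for key, sub_df in df.groupby(["name", "route_id"])}

print(list(sub_routes.keys()))
```

A single dictionary keyed by tuple avoids keeping separate per-feed and per-route dictionaries in sync.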


---

In [None]:
#list of unique route names from initial df
route_names = df['name'].unique()


In [None]:
#new loop that creates a dictionary of each unique schedule name with all its routes.
sub_route_name = {}

for name in route_names:
    sub_df2 = df[df['name'] == name]
    sub_route_name[name] = sub_df2
    

In [None]:
#test to see if new dictionary works
sub_route_name['Auburn Schedule']

## Now I have `sac_sub_route_ids` and `sub_route_name` dictionaries

Examples of some notable routes with slight variations over time.

In [None]:
#General observations for Sacramento Schedule: 
#shape_id changes every month. 
#pt_array changes every month, however, did get a warning upon initial query of data so may need to review query to account for geodata
#Month 4 has the point geom data

display(sac_sub_route_ids['088'])
display(sac_sub_route_ids['023'])
display(sac_sub_route_ids['105'])
display(sac_sub_route_ids['F20'])
display(sac_sub_route_ids['215'])

In [None]:
#test to see other route names
#other route names have more point geometry than Sacramento.
display(sub_route_name['Santa Cruz Schedule'])
display(sub_route_name['Merced Schedule'])
display(sub_route_name['San Diego Schedule'])
display(sub_route_name['Roseville Schedule'])

In [None]:
import importlib

#the module has to be imported before it can be reloaded
import segment_speed_utils

importlib.reload(segment_speed_utils)
from segment_speed_utils.project_vars import SCHED_GCS, analysis_date