# Route Identification Over Time

Recent observations show small changes in routes over time, specifically in the following fields:
* route ID
* route short name
* route long name
* route desc

We need to track these route changes so that future analyses can account for them.

## Objective
1. Query data from `fct_monthly_routes` to help identify variances in routes. Query a few months of 2023.
2. Save the data to the GCS `gtfs_schedule` bucket.
3. Filter the data down to `Sacramento Regional Transit`, then identify and observe any variances in its routes.


## Function from `open_data/download_vehicle_position.py`
    
    import datetime
    import gcsfs
    import geopandas as gpd
    import pandas as pd
    import shapely
    import sys

    from calitp_data_analysis.tables import tbls
    from calitp_data_analysis import utils
    from loguru import logger
    from siuba import *

    from shared_utils import schedule_rt_utils
    
    def download_vehicle_positions(
        date: str,
        operator_names: list
    ) -> pd.DataFrame:    
    
        df = (tbls.mart_gtfs.fct_vehicle_locations()
              >> filter(_.service_date == date)
              >> filter(_.gtfs_dataset_name.isin(operator_names))
              >> select(_.gtfs_dataset_key, _.gtfs_dataset_name,
                        _.schedule_gtfs_dataset_key,
                        _.trip_id, _.trip_instance_key,
                        _.location_timestamp,
                        _.location)
                  >> collect()
             )

In [1]:
#imports

import datetime
import gcsfs
import geopandas as gpd
import pandas as pd
import shapely
import sys

from calitp_data_analysis.tables import tbls
from calitp_data_analysis import utils
from loguru import logger
from siuba import *

from shared_utils import schedule_rt_utils



ModuleNotFoundError: No module named 'shared_utils'

In [2]:
# test query against fct_monthly_routes
def get_monthly_routes(
    year: int,
    months: list
) -> pd.DataFrame:
    """Return fct_monthly_routes rows for the given year and list of months."""
    df = (tbls.mart_gtfs.fct_monthly_routes()
          >> filter(_.year == year)
          >> filter(_.month.isin(months))
          >> select(_.key, _.source_record_id,
                    _.name,
                    _.route_id, _.shape_id,
                    _.month,
                    _.year,
                    _.pt_array)
          >> collect()
         )
    return df

In [3]:
df = get_monthly_routes(2023, [3, 4, 5])

  sqlalchemy.util.warn(


In [None]:
#testing export to GCS > csuyat_folder
# 'gs://calitp-analytics-data/data-analyses/csuyat_folder/##FILENAME##.parquet'
# df.to_parquet()

#successfully written to GCS

#df.to_parquet('gs://calitp-analytics-data/data-analyses/csuyat_folder/route_identification_2023_m03_m05.parquet')

---

In [4]:
#peeking into df to make sure everything looks good

#shape shows 11,927 rows and 8 columns
display(df.shape)

#type confirms the data is a DataFrame
display(type(df))

#columns returns all the columns we listed in the function
display(list(df.columns))

#value_counts confirms df only has rows from March to May 2023
display(df.value_counts(subset=['year','month']))

(11927, 8)

pandas.core.frame.DataFrame

['key',
 'source_record_id',
 'name',
 'route_id',
 'shape_id',
 'month',
 'year',
 'pt_array']

year  month
2023  5        4180
      3        3899
      4        3848
dtype: int64

In [5]:
df.head()

| | key | source_record_id | name | route_id | shape_id | month | year | pt_array |
|---|---|---|---|---|---|---|---|---|
| 0 | 5203ce136ca42bd0dedcacd1dc9333af | reca8NS1B4WihN0UT | AC Transit Schedule | 706 | shp-706-04 | 5 | 2023 | [POINT(-122.125885 37.696773), POINT(-122.1256... |
| 1 | b8d8702da47d7c09600f595fe7883fde | recS9dL7UTYLgg4r9 | Sonoma Schedule | 1079 | p_8814 | 4 | 2023 | [POINT(-122.873369 38.609732), POINT(-122.8718... |
| 2 | a828ca6eb47f20a8d80a1ca8909f0bf2 | rec5AIu50d5GUeSFW | Plumas Schedule | 126 | p_1426806 | 3 | 2023 | [POINT(-120.902916 39.93456), POINT(-120.90307... |
| 3 | 1ca08f0eee7e2b274208197b8e7d6a8f | recSitwCIHxMr06TX | Imperial Valley Transit Schedule | 11 | 18 | 4 | 2023 | [POINT(-115.55911 32.7918), POINT(-115.56138 3... |
| 4 | ccaa77907ea4779e7a7aa54cc9a812cf | recS9dL7UTYLgg4r9 | Sonoma Schedule | 1026 | p_2773 | 4 | 2023 | [POINT(-123.011852 38.460851), POINT(-123.0119... |


In [6]:
#creating sub-df for 'Sacramento Schedule'
#195 rows, 8 columns
sac = df[df['name'] == 'Sacramento Schedule']

In [7]:
display(sac.shape)
display(sac.head(3))

(195, 8)

| | key | source_record_id | name | route_id | shape_id | month | year | pt_array |
|---|---|---|---|---|---|---|---|---|
| 13 | 7d5bc4199721d12c326e68b76ed4e40b | recbzZQUIdMmFvm1r | Sacramento Schedule | 62 | 45316 | 4 | 2023 | [POINT(-121.53569 38.48545), POINT(-121.535756... |
| 20 | d63a9754f51a8834d558e84020e10f34 | recbzZQUIdMmFvm1r | Sacramento Schedule | 138 | 45389 | 4 | 2023 | [POINT(-121.753874 38.539257), POINT(-121.7536... |
| 268 | 274a0e075046066ef749de889c204785 | recbzZQUIdMmFvm1r | Sacramento Schedule | 248 | 45451 | 5 | 2023 | [] |


In [8]:
sac_routes = sac['route_id'].unique()

In [9]:
sac_routes

array(['062', '138', '248', 'F10', '227', '013', '142', '124', '226',
       '033', '001', '067', '088', '211', '113', '093', '011', '023',
       '210', '177', '084', '026', '081', '087', '206', 'F20', '025',
       '056', '175', '102', '213', '015', '019', '30', '533', '075',
       '252', '030', '205', '109', '228', '068', '105', '176', '021',
       '507', '255', '082', '247', '061', '106', '051', '519', '134',
       '214', '086', '103', '161', '246', '072', '215', '129', '038',
       '212', '078'], dtype=object)
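
The array above contains both `'30'` and `'030'`, which may be the same route written with and without zero padding. A minimal sketch of one way to surface such pairs is shown here; the helper name `find_zero_padding_variants` and the sample list are illustrative and not part of the notebook, and treating IDs that differ only by leading zeros as the same route is an assumption to confirm against the feed.

```python
from collections import defaultdict


def find_zero_padding_variants(route_ids):
    """Group route IDs that become identical once leading zeros are stripped.

    Hypothetical helper: returns {stripped_id: [raw ids]} only for
    stripped IDs that map to more than one raw ID.
    """
    groups = defaultdict(set)
    for rid in route_ids:
        raw = str(rid)
        groups[raw.lstrip("0")].add(raw)
    return {key: sorted(raws) for key, raws in groups.items() if len(raws) > 1}


# small sample taken from the values printed above
sample_ids = ["062", "138", "30", "030", "F10", "001"]
variants = find_zero_padding_variants(sample_ids)
print(variants)
```

In the notebook this could be called as `find_zero_padding_variants(sac_routes)`.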

## Next Steps

For every `name` and `route_id` combination in the routes data, we need to check whether the rows are the same or not, and identify any variation in the routes.
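
One possible starting point, sketched under assumptions: build a presence table of each (`name`, `route_id`) pair by month and keep the pairs that do not appear in every month. The function name `routes_missing_months` and the toy rows are made up for illustration; only the column names come from the query above.

```python
import pandas as pd


def routes_missing_months(df, months):
    """Return a boolean month-presence table for (name, route_id) pairs
    that are absent from at least one of the requested months."""
    # count rows per (name, route_id) and month, then convert to True/False
    presence = pd.crosstab([df["name"], df["route_id"]], df["month"]) > 0
    # make sure every requested month has a column, even if no rows exist
    presence = presence.reindex(columns=months, fill_value=False)
    return presence[~presence.all(axis=1)]


# toy data (illustrative only, not real query output)
toy = pd.DataFrame({
    "name": ["A", "A", "A", "A", "A"],
    "route_id": ["030", "030", "030", "30", "088"],
    "month": [3, 4, 5, 5, 3],
})
missing = routes_missing_months(toy, [3, 4, 5])
print(missing)
```

A route ID that shows up in only some months is a candidate for a renamed or re-numbered route and is worth inspecting by hand.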



---

Trying a loop that creates a df for every `route_id` in `sac_routes`. Eventually, would this need to be done for every `name` in the `fct_monthly_routes` df?


In [None]:
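#test loop that will create df for every unique route on `sac_routes` list
#note: this first attempt filters `name` against the whole `sac_routes` array
#and never uses the loop variable; v2 below filters on `route_id` instead

#declare empty list
sub_sacdf = []

for routes in sac_routes:
        sub_df = sac[sac['name'] == sac_routes]
        sub_sacdf.append(sub_df)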

In [15]:
#v2 of above

#empty dictionary
sub_dataframes = {}

#each element in sac_routes will be called route.
#for each route in sac_routes, query each row related to that route.
#then, create a dataframe for each route and place it into the dictionary sub_dataframes
for route in sac_routes:
    sub_df = sac[sac['route_id'] == route]
    sub_dataframes[route] = sub_df

In [24]:
#testing dictionary with route 088
sub_dataframes['088']

| | key | source_record_id | name | route_id | shape_id | month | year | pt_array |
|---|---|---|---|---|---|---|---|---|
| 1173 | fc4aecdbb2e90697aaca21929881da0c | recbzZQUIdMmFvm1r | Sacramento Schedule | 88 | 45350 | 4 | 2023 | [POINT(-121.491718 38.579866), POINT(-121.4915... |
| 4271 | 79872e3f47d4f8c9579aeb7c141649ad | recbzZQUIdMmFvm1r | Sacramento Schedule | 88 | 44974 | 3 | 2023 | [] |
| 10792 | 559586f993036a5d2aba23bbddb5645f | recbzZQUIdMmFvm1r | Sacramento Schedule | 88 | 45350 | 5 | 2023 | [] |


---

Trying a loop that creates a df for every `name` in routes, then groups by `route_id`, then checks for matching rows.

1. Need a list of unique names from the initial df.
    `df.name.unique()` >> returns array

2. Then run that list of names through `new_df = df[df['name'] == unique name]` to get a new df for each unique name.
    Use a for loop.
    
3. Then check all rows in each df to see whether they all match each other.
    Use a function.
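
Step 3 could be sketched without an explicit loop by counting distinct values per group. This is only one option: the helper name `varying_columns` is hypothetical, and the toy rows are illustrative, loosely modeled on the route 088 output above where `shape_id` differs between March and April.

```python
import pandas as pd


def varying_columns(df, keys, check_cols):
    """Count distinct values of `check_cols` within each `keys` group and
    keep only the groups where at least one column has more than one value."""
    counts = df.groupby(keys)[check_cols].nunique(dropna=False)
    return counts[(counts > 1).any(axis=1)]


# toy data (illustrative only)
toy = pd.DataFrame({
    "name": ["Sacramento Schedule"] * 5,
    "route_id": ["088", "088", "088", "062", "062"],
    "shape_id": ["45350", "44974", "45350", "45316", "45316"],
    "month": [4, 3, 5, 4, 5],
})
changed = varying_columns(toy, ["name", "route_id"], ["shape_id"])
print(changed)
```

Columns holding lists or arrays, such as `pt_array`, may not be hashable and would likely need converting (for example to strings) before being passed in `check_cols`.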

In [10]:
name_list = df['name'].unique()

In [12]:
name_list2 = pd.DataFrame(name_list, columns=['name'])

In [13]:
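sub_dataframes = []

#note: iterating over a DataFrame yields its column labels, so this loop runs once,
#and the filter compares against the whole `name_list2` DataFrame, not the loop variable
for name in name_list2:
    sub_df = df[df['name'] == name_list2]
    sub_dataframes.append(sub_df)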
    

  sub_df = df[df['name'] == name_list2]


In [14]:
sub_dataframes

[       key source_record_id name route_id shape_id  month  year pt_array
 0      NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 1      NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 2      NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 3      NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 4      NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 ...    ...              ...  ...      ...      ...    ...   ...      ...
 11922  NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 11923  NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 11924  NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 11925  NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 11926  NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 
 [11927 rows x 8 columns]]
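
The all-NaN result above is consistent with the filter mask being a DataFrame rather than a boolean Series. A minimal sketch of one way to get a sub-DataFrame per `name` is to let `groupby` do the splitting; `split_by_column` and the toy rows below are illustrative and not part of the notebook.

```python
import pandas as pd


def split_by_column(df, column):
    """Return {value: sub-DataFrame} for each unique value in `column`."""
    return {value: group for value, group in df.groupby(column)}


# toy data (illustrative only)
toy = pd.DataFrame({
    "name": ["Sacramento Schedule", "AC Transit Schedule", "Sacramento Schedule"],
    "route_id": ["088", "706", "062"],
    "month": [3, 5, 4],
})
by_name = split_by_column(toy, "name")
print(sorted(by_name))
print(by_name["Sacramento Schedule"])
```

In the notebook this could be called as `split_by_column(df, 'name')`, with each value then checked per `route_id`.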