# Route Identification Over Time

Recent observations show small changes in routes over time, specifically in the following fields:
* route ID
* route short name
* route long name
* route desc

These route changes need to be observed so that future analyses can account for them.

## Objective
1. Query data from `fct_monthly_routes` to help identify variances in routes. Query a couple of months in 2023.
2. Save data to GCS `gtfs_schedule` bucket
3. Filter down data to `Sacramento Regional Transit`, then identify and observe routes for any variances


## Function from `open_data/download_vehicle_position.py`
    
    import datetime
    import gcsfs
    import geopandas as gpd
    import pandas as pd
    import shapely
    import sys

    from calitp_data_analysis.tables import tbls
    from calitp_data_analysis import utils
    from loguru import logger
    from siuba import *

    from shared_utils import schedule_rt_utils
    
    def download_vehicle_positions(
        date: str,
        operator_names: list
    ) -> pd.DataFrame:    
    
        df = (tbls.mart_gtfs.fct_vehicle_locations()
              >> filter(_.service_date == date)
              >> filter(_.gtfs_dataset_name.isin(operator_names))
              >> select(_.gtfs_dataset_key, _.gtfs_dataset_name,
                        _.schedule_gtfs_dataset_key,
                        _.trip_id, _.trip_instance_key,
                        _.location_timestamp,
                        _.location)
                  >> collect()
             )

In [1]:
#imports

import datetime
import gcsfs
import geopandas as gpd
import pandas as pd
import shapely
import sys

from calitp_data_analysis.tables import tbls
from calitp_data_analysis import utils
from loguru import logger
from siuba import *

from shared_utils import schedule_rt_utils



In [2]:
# Test query against fct_monthly_routes
def get_monthly_routes(
    year: int,
    months: list
) -> pd.DataFrame:
    """Return fct_monthly_routes rows for the given year and list of months."""
    df = (tbls.mart_gtfs.fct_monthly_routes()
          >> filter(_.year == year)
          >> filter(_.month.isin(months))
          >> select(_.key, _.source_record_id,
                    _.name,
                    _.route_id, _.shape_id,
                    _.month,
                    _.year,
                    _.pt_array)
          >> collect()
         )
    return df

In [3]:
df = get_monthly_routes(2023, [3, 4, 5])



In [None]:
# Testing export to GCS > csuyat_folder
# 'gs://calitp-analytics-data/data-analyses/csuyat_folder/##FILENAME##.parquet'
# df.to_parquet()

# Successfully written to GCS

#df.to_parquet('gs://calitp-analytics-data/data-analyses/csuyat_folder/route_identification_2023_m03_m05.parquet')
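The file name above encodes the year and the first and last month of the query. A minimal sketch of building that path from the query arguments, assuming this naming pattern is kept for later exports; `build_parquet_path` and `GCS_FOLDER` are hypothetical names, not part of the existing notebook or `shared_utils`:

```python
def build_parquet_path(year: int, months: list, folder: str) -> str:
    """Build a path like <folder>/route_identification_2023_m03_m05.parquet."""
    # Use the earliest and latest month to label the file (assumed convention).
    first, last = min(months), max(months)
    filename = f"route_identification_{year}_m{first:02d}_m{last:02d}.parquet"
    return f"{folder.rstrip('/')}/{filename}"


GCS_FOLDER = "gs://calitp-analytics-data/data-analyses/csuyat_folder"
path = build_parquet_path(2023, [3, 4, 5], GCS_FOLDER)
print(path)

# Writing to a gs:// path relies on gcsfs being installed and authenticated:
# df.to_parquet(path)
```

Deriving the name from the same `year` and `months` passed to `get_monthly_routes` keeps the file name and its contents in step.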

---

In [4]:
# Peeking into df to make sure everything looks good

# shape shows 11,927 rows and 8 columns
display(df.shape)

# type confirms the data is a DataFrame
display(type(df))

# columns returns all the columns we listed in the function
display(list(df.columns))

# value_counts confirms df only has rows from March to May 2023
display(df.value_counts(subset=['year','month']))

(11927, 8)

pandas.core.frame.DataFrame

['key',
 'source_record_id',
 'name',
 'route_id',
 'shape_id',
 'month',
 'year',
 'pt_array']

year  month
2023  5        4180
      3        3899
      4        3848
dtype: int64

In [58]:
df.head()

| | key | source_record_id | name | route_id | shape_id | month | year | pt_array |
|---|---|---|---|---|---|---|---|---|
| 0 | 8f81340f4949669bf89a962038249112 | rec5HrdtVmO2dPK0h | Palos Verdes PTA Schedule | 840 | p_2203 | 3 | 2023 | [POINT(-118.4073 33.74868), POINT(-118.40898 3... |
| 1 | 0ed4e44064781ebb11168abfbf93b09d | rec9AyXUSMUHFnLsH | Bay Area 511 Regional Schedule | SF:35 | SF:3501 | 5 | 2023 | [POINT(-122.435254 37.762362), POINT(-122.4351... |
| 2 | 3987c4e3d673a91b2d6a1a060215444b | reciKWkJ953NSPTtj | G Trans Schedule | 19607 | p_1277378 | 5 | 2023 | [POINT(-118.287054 33.869095), POINT(-118.2877... |
| 3 | 04f100967e2ef9ebbe637ec63ebaf410 | rec9AyXUSMUHFnLsH | Bay Area 511 Regional Schedule | AC:W | AC:shp-W-06 | 5 | 2023 | [POINT(-122.396553 37.789249), POINT(-122.3966... |
| 4 | 6dd26cfe3807522aff9d39a3880a7c05 | recvAztAtQDpjBkL2 | San Joaquin Flex | 19696 | | 4 | 2023 | [] |


---

## Next Steps

For every `name` and `route_id` in routes, check whether the rows are the same across months, and identify any variation in the routes.
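One way to approach this is a grouped summary rather than a row-by-row comparison. The sketch below uses made-up toy rows with the same column names as the query result, and counts distinct months and `shape_id` values per `name` and `route_id`. Treating "more than one `shape_id`" as a variation is an assumption; some routes may legitimately carry several shapes.

```python
import pandas as pd

# Toy rows shaped like the fct_monthly_routes extract (values are made up).
toy = pd.DataFrame({
    "name": ["Sacramento Schedule"] * 5,
    "route_id": ["001", "001", "001", "011", "011"],
    "shape_id": ["45278", "45278", "45278", "45391", "45999"],
    "month": [3, 4, 5, 3, 4],
    "year": [2023] * 5,
})

# Count distinct months and shapes for each feed name / route_id pair.
summary = (
    toy.groupby(["name", "route_id"])
    .agg(
        n_months=("month", "nunique"),
        n_shapes=("shape_id", "nunique"),
    )
    .reset_index()
)

# Flag routes whose shape changes, or that are absent in some queried month.
summary["shape_varies"] = summary["n_shapes"] > 1
summary["missing_months"] = summary["n_months"] < toy["month"].nunique()
print(summary)
```

Routes flagged in either column are the ones worth inspecting by hand.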



---

Trying to use a loop that will create a df for every `route_id` in `sac_routes`. Would this eventually need to be done for every `name` in the `fct_monthly_routes` df?


In [5]:
#creating sub-df for 'Sacramento Schedule'
#195 rows, 8 columns
sac = df[df['name'] == 'Sacramento Schedule']

In [62]:
display(sac.shape)
display(sac.head(3))

(195, 8)

| | key | source_record_id | name | route_id | shape_id | month | year | pt_array |
|---|---|---|---|---|---|---|---|---|
| 144 | 5ee089aa97106d962f283dd7d77fa8eb | recbzZQUIdMmFvm1r | Sacramento Schedule | 13 | 45278 | 4 | 2023 | [POINT(-121.38163 38.6067), POINT(-121.38188 3... |
| 178 | f2b22790026a410c8fff0a9213058ec8 | recbzZQUIdMmFvm1r | Sacramento Schedule | 142 | 45391 | 4 | 2023 | [POINT(-121.488841 38.57662), POINT(-121.49016... |
| 180 | f6e440870fca27e0a16e5ce174b6cbe3 | recbzZQUIdMmFvm1r | Sacramento Schedule | 227 | 45445 | 5 | 2023 | [] |


In [67]:
sac_routes = sac['route_id'].unique()

In [68]:
sac_routes

array(['013', '142', '227', '026', '061', '113', '134', '103', '533',
       '252', '138', '086', '051', '023', '078', '129', '001', '011',
       '210', '019', '033', '519', '248', '084', '213', '067', '205',
       '068', '056', '109', '246', '176', '030', '124', '161', 'F20',
       '106', '247', '087', '025', '214', '021', '507', '072', '228',
       'F10', '255', '226', '062', '211', '30', '175', '093', '206',
       '015', '075', '088', '105', '081', '212', '102', '082', '177',
       '038', '215'], dtype=object)
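The array above contains both `'030'` and `'30'`, which may be the same route written with and without zero padding. A minimal sketch for surfacing such pairs follows. Stripping leading zeros is an assumed comparison rule, not a confirmed property of the feed, and the sample IDs are a small subset of the array.

```python
import pandas as pd

# A few route IDs taken from the array above.
route_ids = pd.Series(["013", "142", "030", "30", "F20", "001"])

# Strip leading zeros to build a comparison key; keep "0" if an ID is all zeros.
normalized = route_ids.str.lstrip("0").replace("", "0")

# Raw IDs that share a normalized key are candidates for being the same route.
pairs = pd.DataFrame({"route_id": route_ids, "normalized": normalized})
collisions = pairs[pairs.duplicated("normalized", keep=False)]
print(collisions)
```

Any pair this surfaces would still need checking against the feed before being treated as one route.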

In [69]:
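# Test for loop that creates a df for every unique route in the `sac_routes` array

sub_sacdf = []

# Use a separate loop variable so `sac_routes` is not overwritten,
# and filter on `route_id` (every row in `sac` has the same `name`).
for route_id in sac_routes:
    sub_df = sac[sac['route_id'] == route_id]
    sub_sacdf.append(sub_df)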
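As an alternative to filtering once per route, `groupby` can do the split in a single pass. This sketch uses a made-up stand-in for `sac` (the name `toy_sac` is hypothetical) and stores each sub-DataFrame in a dict keyed by `route_id`:

```python
import pandas as pd

# Made-up stand-in for the `sac` DataFrame.
toy_sac = pd.DataFrame({
    "name": ["Sacramento Schedule"] * 4,
    "route_id": ["013", "013", "142", "227"],
    "month": [3, 4, 4, 5],
})

# groupby yields (key, sub-DataFrame) pairs, so a dict comprehension keeps
# each sub-DataFrame addressable by its route_id.
by_route = {route_id: sub_df for route_id, sub_df in toy_sac.groupby("route_id")}

print(sorted(by_route))
print(by_route["013"])
```

A dict makes it easier to look up one route later than a plain list, where the route has to be recovered from the rows.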

---

Trying to use a loop that will create a df for every `name` in routes, then group by `route_id`, then check for matching rows.

1. Need a list of unique names from the initial df.
    `df.name.unique()` returns an array.

2. Then run that list of names through `new_df = df[df['name'] == unique_name]` to get a new df for each unique name.
    Use a for loop.
    
3. Then check all rows in each df to see whether they match each other.
    Use a function.
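The three steps above can be sketched as follows, on made-up toy data. `rows_match` is a hypothetical helper, and comparing only `shape_id` is an assumption. Columns holding lists or arrays, as `pt_array` appears to, would likely need converting to strings or tuples before `drop_duplicates` can compare them.

```python
import pandas as pd


def rows_match(sub_df: pd.DataFrame, compare_cols: list) -> bool:
    """Return True when every row has the same values in compare_cols."""
    return len(sub_df[compare_cols].drop_duplicates()) <= 1


toy = pd.DataFrame({
    "name": ["Feed A", "Feed A", "Feed A", "Feed B", "Feed B"],
    "route_id": ["1", "1", "2", "9", "9"],
    "shape_id": ["s1", "s1", "s2", "s9", "s10"],
    "month": [3, 4, 3, 3, 4],
})

# Step 1: unique feed names.
names = toy["name"].unique()

# Step 2: one sub-DataFrame per name.
sub_dfs = {name: toy[toy["name"] == name] for name in names}

# Step 3: within each name, group by route_id and check whether rows match.
report = {}
for name, sub_df in sub_dfs.items():
    for route_id, group in sub_df.groupby("route_id"):
        report[(name, route_id)] = rows_match(group, ["shape_id"])

print(report)
```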

In [52]:
name_list = df['name'].unique()

In [54]:
name_list2 = pd.DataFrame(name_list, columns=['name'])

In [56]:
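# NOTE: this attempt does not split `df` by name. Looping over a DataFrame
# yields its column labels, and the loop variable `name` is never used;
# comparing against the whole `name_list2` DataFrame gives no row mask,
# so the result is the single all-NaN frame shown in the output.
sub_dataframes = []

for name in name_list2:
    sub_df = df[df['name'] == name_list2]
    sub_dataframes.append(sub_df)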
    



In [57]:
sub_dataframes

[       key source_record_id name route_id shape_id  month  year pt_array
 0      NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 1      NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 2      NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 3      NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 4      NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 ...    ...              ...  ...      ...      ...    ...   ...      ...
 11922  NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 11923  NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 11924  NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 11925  NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 11926  NaN              NaN  NaN      NaN      NaN    NaN   NaN      NaN
 
 [11927 rows x 8 columns]]
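The single all-NaN frame is consistent with two details of the loop. Iterating over a DataFrame yields its column labels (here just `'name'`), so the loop runs once. The comparison is then made against the whole `name_list2` DataFrame instead of a single name, and indexing with a boolean DataFrame masks cells rather than selecting rows. A minimal sketch of a corrected loop on made-up toy data, iterating over the array from `unique()` instead:

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Feed A", "Feed B", "Feed A"],
    "route_id": ["1", "9", "2"],
})

name_list = toy["name"].unique()  # NumPy array of unique names
name_frame = pd.DataFrame(name_list, columns=["name"])

# Iterating over a DataFrame yields its column labels, not its rows.
print(list(name_frame))

# Iterating over the array yields each name; comparing the column to that
# single value gives a boolean Series that selects the matching rows.
sub_dataframes = []
for name in name_list:
    sub_dataframes.append(toy[toy["name"] == name])

print([len(sub_df) for sub_df in sub_dataframes])
```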