# Confirming the operators under `El Dorado County Transportation Commission` and `Placer County Transportation Planning Agency`

El Dorado County and Placer County asked us where to find the "published data" that Caltrans promised to deliver to RTPAs under SB 125. As of 10/17, the monthly ridership site did not have a separate tab for either RTPA. Further investigation showed that, although these RTPAs had no tabs of their own, some (if not all) of their transit operators were listed under the SACOG tab (for example, County of Placer and El Dorado County Transit Authority).

Before splitting these operators and RTPAs out into separate tabs, we need to confirm whether any other transit operators fall under these RTPAs, then update the RTPA-to-NTD ID crosswalk file.

Per Tiffany
>This crosswalk you use is `ntd_id to RTPA`. You can combine that with several warehouse tables:
>
>- `dim_organizations` and `dim_gtfs_datasets` and `dim_provider_gtfs_data`
>
>- Since the above is a fairly complicated traversal, there's a function in `shared_utils` that wraps this and gets you from a starting point of a `schedule_gtfs_dataset_key` (operator) to `ntd_id`. It is used and created here.
>
>- Operators (based on GTFS schedule) are subject to a date. If you know which date you want, you can use the crosswalk created and saved out here; in `gtfs_analytics_data.yml`, find the GCS path. This will get you the operator's `schedule_gtfs_dataset_key` + `ntd_id` + other ntd columns and you can connect that to your `ntd_id - RTPA` crosswalk.
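
The last step Tiffany describes, connecting an operator crosswalk to the `ntd_id - RTPA` crosswalk, can be sketched with a plain pandas merge. This is a minimal sketch on toy data: the two frames below are stand-ins built from values that appear in this notebook's outputs, and the column name `ntd_id` on the operator side is an assumption (the warehouse column used below is `ntd_id_2022`).

```python
import pandas as pd

# Toy stand-in for the dated operator crosswalk.
# Assumption: the NTD column has already been renamed to "ntd_id".
operator_crosswalk = pd.DataFrame({
    "schedule_gtfs_dataset_key": [
        "7228eba069f2a0fad0ed8552410a544d",
        "8de1f1a3b9ae172c6b8255b1c82c340f",
    ],
    "organization_name": ["El Dorado County Transit Authority", "Placer County"],
    "ntd_id": ["90229", "90196"],
})

# Toy stand-in for the ntd_id -> RTPA crosswalk.
rtpa_crosswalk = pd.DataFrame({
    "ntd_id": ["90229", "90196"],
    "RTPA": [
        "El Dorado County Transportation Commission",
        "Placer County Transportation Planning Agency",
    ],
})

# validate="m:1" raises if the RTPA crosswalk repeats an ntd_id,
# which would otherwise silently duplicate operator rows.
operator_to_rtpa = operator_crosswalk.merge(
    rtpa_crosswalk, on="ntd_id", how="left", validate="m:1"
)
```

Keeping `ntd_id` as a string on both sides avoids dtype mismatches at merge time.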

In [1]:
import pandas as pd
import shutil
import sys
import os
import gcsfs
from calitp_data_analysis.tables import tbls
from siuba import _, collect, count, filter, show_query
from shared_utils.rt_dates import MONTH_DICT
from segment_speed_utils import helpers

from update_vars import NTD_MODES, NTD_TOS, YEAR, MONTH

In [2]:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [3]:
from shared_utils.schedule_rt_utils import sample_gtfs_dataset_key_to_organization_crosswalk

---

## Testing the `create_gtfs_dataset_key_to_organization_crosswalk` function
This function uses `sample_gtfs_dataset_key_to_organization_crosswalk`. It should give us the GTFS `schedule feed`, `org name`, and `ntd ID`.



In [4]:
def create_gtfs_dataset_key_to_organization_crosswalk(
    analysis_date: str
) -> pd.DataFrame:
    """
    For every operator that appears in schedule data, 
    create a crosswalk that links to organization_source_record_id.
    For all our downstream outputs, at various aggregations,
    we need to attach these over and over again.
    """
    df = helpers.import_scheduled_trips(
        analysis_date,
        columns = ["gtfs_dataset_key", "name"],
        get_pandas = True
    ).rename(columns = {"schedule_gtfs_dataset_key": "gtfs_dataset_key"})
    # rename columns because we must use simply gtfs_dataset_key in schedule_rt_utils function
    
    # Get base64_url, organization_source_record_id and organization_name
    crosswalk = sample_gtfs_dataset_key_to_organization_crosswalk(
        df,
        analysis_date,
        quartet_data = "schedule",
        dim_gtfs_dataset_cols = ["key", "source_record_id", "base64_url"],
        dim_organization_cols = ["source_record_id", "name", 
                                 "itp_id", "caltrans_district",
                                  "ntd_id_2022"]
    )

    df_with_org = pd.merge(
        df.rename(columns = {"gtfs_dataset_key": "schedule_gtfs_dataset_key"}),
        crosswalk,
        on = "schedule_gtfs_dataset_key",
        how = "inner"
    )
    
    return df_with_org

In [5]:
# can get dates from GCS `calitp-analytics-data/data-analyses/gtfs_schedule/crosswalk`
gtfs_to_org = create_gtfs_dataset_key_to_organization_crosswalk(
            "2024-08-14"
        )

In [6]:
gtfs_to_org.shape

(206, 10)

In [7]:
gtfs_to_org.head()

Unnamed: 0,schedule_gtfs_dataset_key,name,schedule_source_record_id,base64_url,organization_source_record_id,organization_name,itp_id,caltrans_district_x,ntd_id_2022,caltrans_district_y
0,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,reckQmUdXUzHFmlVf,City of Ojai,231.0,07 - Los Angeles,91058,07 - Los Angeles / Ventura
1,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,rec7EN71rsZxDFxZd,Ventura County Transportation Commission,380.0,07 - Los Angeles,90164,07 - Los Angeles / Ventura
2,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,recS7GnKTcQVX20HE,Gold Coast Transit District,123.0,07 - Los Angeles,90035,07 - Los Angeles / Ventura
3,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,rec1ErIn9gG1Isk5W,City of Simi Valley,308.0,07 - Los Angeles,90050,07 - Los Angeles / Ventura
4,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,recojKzQsBzE1hjVu,City of Moorpark,210.0,07 - Los Angeles,90227,07 - Los Angeles / Ventura


---

## Using the saved crosswalk for a known date (GCS path in `gtfs_analytics_data.yml`)

This should also give us `gtfs schedule feed`, `org name` and `ntd id`.

In [8]:
aug_crosswalk = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/gtfs_schedule/crosswalk/gtfs_key_organization_2024-08-14.parquet"
)

In [9]:
display(
    aug_crosswalk.shape,
    list(aug_crosswalk.columns.sort_values())

)

(206, 30)

['base64_url',
 'caltrans_district',
 'counties_served',
 'density',
 'funding_sources',
 'hq_city',
 'hq_county',
 'is_public_entity',
 'is_publicly_operating',
 'name',
 'number_of_counties_with_service',
 'number_of_state_counties',
 'on_demand_vehicles_at_max_service',
 'organization_name',
 'organization_source_record_id',
 'organization_type',
 'population',
 'primary_uza_code',
 'primary_uza_name',
 'reporter_type',
 'schedule_gtfs_dataset_key',
 'schedule_source_record_id',
 'service_area_pop',
 'service_area_sq_miles',
 'state_admin_funds_expended',
 'subrecipient_type',
 'vehicles_at_max_service',
 'voms_do',
 'voms_pt',
 'year']

In [10]:
aug_crosswalk.head()

Unnamed: 0,schedule_gtfs_dataset_key,name,schedule_source_record_id,base64_url,organization_source_record_id,organization_name,caltrans_district,counties_served,hq_city,hq_county,is_public_entity,is_publicly_operating,funding_sources,on_demand_vehicles_at_max_service,vehicles_at_max_service,number_of_state_counties,primary_uza_name,density,number_of_counties_with_service,state_admin_funds_expended,service_area_sq_miles,population,service_area_pop,subrecipient_type,primary_uza_code,reporter_type,organization_type,voms_pt,voms_do,year
0,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,reckQmUdXUzHFmlVf,City of Ojai,07 - Los Angeles,,Ojai,,,,,2.0,2,,,,,,,,,Rural General Public Transit,,Rural Reporter,"City, County or Local Government Unit or Depar...",,2.0,2022
1,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,rec7EN71rsZxDFxZd,Ventura County Transportation Commission,07 - Los Angeles,Santa Clara;Ventura,Camarillo,Ventura,True,True,5307;5311;5339,,45,,"Oxnard--San Buenaventura (Ventura), CA",4910.0,,,28.0,376117.0,209877.0,,,Full Reporter,Independent Public Agency or Authority of Tran...,45.0,,2022
2,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,recS7GnKTcQVX20HE,Gold Coast Transit District,07 - Los Angeles,Ventura,Oxnard,Ventura,True,True,5307;5310;5339,49.0,72,,"Oxnard--San Buenaventura (Ventura), CA",4910.0,,,84.0,376117.0,374827.0,,,Full Reporter,Independent Public Agency or Authority of Tran...,22.0,49.0,2022
3,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,rec1ErIn9gG1Isk5W,City of Simi Valley,07 - Los Angeles,Ventura,Simi Valley,Ventura,True,True,5307,15.0,15,,"Simi Valley, CA",4027.0,,,50.0,127364.0,126356.0,,,Reduced Reporter,"City, County or Local Government Unit or Depar...",,17.0,2022
4,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,recojKzQsBzE1hjVu,City of Moorpark,07 - Los Angeles,Ventura,Moorpark,Ventura,True,True,5307,,2,,"Thousand Oaks, CA",2668.0,,,12.0,213986.0,35975.0,,,Reduced Reporter,"City, County or Local Government Unit or Depar...",2.0,,2022


In [11]:
aug_crosswalk["name"].value_counts()

VCTC GMV Schedule                                       7
San Diego Schedule                                      4
Humboldt Schedule                                       3
North County Schedule                                   2
TART, North Lake Tahoe Schedule                         2
Tehama Schedule                                         2
Foothill Schedule                                       2
Redding Schedule                                        2
Bay Area 511 SolTrans Schedule                          2
Bay Area 511 Sonoma County Transit Schedule             2
Bay Area 511 Muni Schedule                              2
Bay Area 511 Santa Clara Transit Schedule               2
Bay Area 511 Commute.org Schedule                       2
Flixbus Schedule                                        2
UCSC Schedule                                           2
Metrolink Schedule                                      2
Roseville Transit GMV Schedule                          1
eTrans Schedul

---

Compare the dataframes from both methods. They may hold the same values, but only the initial DF (`gtfs_to_org`) has the NTD ID.

In [12]:
display(
    gtfs_to_org.columns,
    aug_crosswalk.columns
)

Index(['schedule_gtfs_dataset_key', 'name', 'schedule_source_record_id',
       'base64_url', 'organization_source_record_id', 'organization_name',
       'itp_id', 'caltrans_district_x', 'ntd_id_2022', 'caltrans_district_y'],
      dtype='object')

Index(['schedule_gtfs_dataset_key', 'name', 'schedule_source_record_id',
       'base64_url', 'organization_source_record_id', 'organization_name',
       'caltrans_district', 'counties_served', 'hq_city', 'hq_county',
       'is_public_entity', 'is_publicly_operating', 'funding_sources',
       'on_demand_vehicles_at_max_service', 'vehicles_at_max_service',
       'number_of_state_counties', 'primary_uza_name', 'density',
       'number_of_counties_with_service', 'state_admin_funds_expended',
       'service_area_sq_miles', 'population', 'service_area_pop',
       'subrecipient_type', 'primary_uza_code', 'reporter_type',
       'organization_type', 'voms_pt', 'voms_do', 'year'],
      dtype='object')

In [13]:
col = [
    "schedule_gtfs_dataset_key",
    "name",
    "schedule_source_record_id",
    "organization_source_record_id",
    #"itp_id"
]

for i in col:
    print(
        gtfs_to_org[i].unique() == aug_crosswalk[i].unique()
    )

# looks good! move forward with gtfs_to_org

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  T
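
The `==` check above compares the two `.unique()` arrays position by position, so it relies on both frames listing their values in the same order and with the same length. A minimal sketch of an order-independent alternative, using toy frames in place of `gtfs_to_org` and `aug_crosswalk` (the helper name `same_unique_values` is hypothetical):

```python
import pandas as pd

# Toy frames standing in for gtfs_to_org and aug_crosswalk.
left = pd.DataFrame({"name": ["A Schedule", "B Schedule", "A Schedule"]})
right = pd.DataFrame({"name": ["B Schedule", "A Schedule"]})


def same_unique_values(a: pd.Series, b: pd.Series) -> bool:
    """True when both Series hold the same set of non-null values, in any order."""
    return set(a.dropna()) == set(b.dropna())


names_match = same_unique_values(left["name"], right["name"])
```

This returns a single boolean per column, which is easier to scan than a long array of `True` values.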

---

- analyze the `gtfs_to_org` df
- what agencies are in it?

In [14]:
list(gtfs_to_org.columns)

['schedule_gtfs_dataset_key',
 'name',
 'schedule_source_record_id',
 'base64_url',
 'organization_source_record_id',
 'organization_name',
 'itp_id',
 'caltrans_district_x',
 'ntd_id_2022',
 'caltrans_district_y']

In [15]:
gtfs_to_org.describe(include=object)

Unnamed: 0,schedule_gtfs_dataset_key,name,schedule_source_record_id,base64_url,organization_source_record_id,organization_name,caltrans_district_x,ntd_id_2022,caltrans_district_y
count,206,206,206,206,206,206,204,168,206
unique,182,182,182,182,189,189,12,153,12
top,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,recg58MziBRsVfavn,Yosemite National Park,07 - Los Angeles,90030,07 - Los Angeles / Ventura
freq,7,7,7,7,2,2,59,2,60


In [16]:
gtfs_to_org[gtfs_to_org["organization_name"].str.contains("Roseville")]

Unnamed: 0,schedule_gtfs_dataset_key,name,schedule_source_record_id,base64_url,organization_source_record_id,organization_name,itp_id,caltrans_district_x,ntd_id_2022,caltrans_district_y
22,bef2e7553d6d7fb3789f3b081e66365a,Roseville Transit GMV Schedule,recuSnC10vPfhAAs9,aHR0cHM6Ly9yb3NldmlsbGVidXN0cmFja2VyLmNvbS9ndGZz,recUdTq5QiUjJRiAe,City of Roseville,271.0,03 - Marysville,90168,03 - Marysville
114,13ff8c918cc62f49169d93a04864d8e7,Roseville Schedule,rec90jC43naXJz9lr,aHR0cHM6Ly9pcG9ydGFsLnNhY3J0LmNvbS9HVEZTL1Jvc2...,recUdTq5QiUjJRiAe,City of Roseville,271.0,03 - Marysville,90168,03 - Marysville


In [17]:
# see which schedule feeds these operators are in
city=[
    "Roseville",
    "Auburn",
    "Placer",
    "El Dorado",
    "Shingle Springs",
    "Placerville",
    "Cameron Park",
    "Pollock Pines",
    "Colfax",
    "Lincoln",
    "Rocklin",
    "alpine"
]

city_2 = "|".join(city)

display(
    gtfs_to_org[gtfs_to_org["organization_name"].str.contains(city_2, case=False)].sort_values(by="organization_name"),

    aug_crosswalk[
        aug_crosswalk["counties_served"].str.contains(city_2, case=False, na=False)].sort_values(by="counties_served"),
    
    aug_crosswalk[
        aug_crosswalk["organization_name"].str.contains("auburn", case=False, na=False)].sort_values(by="organization_name"),
    aug_crosswalk[
        aug_crosswalk["counties_served"].str.contains("none", case=False, na=False)].sort_values(by="counties_served")
)

Unnamed: 0,schedule_gtfs_dataset_key,name,schedule_source_record_id,base64_url,organization_source_record_id,organization_name,itp_id,caltrans_district_x,ntd_id_2022,caltrans_district_y
46,020467a276c12a9fe4b0a2332e393f2c,Auburn Schedule,recPN6fJseWncWhDZ,aHR0cHM6Ly93d3cuYXVidXJuLmNhLmdvdi9Eb2N1bWVudE...,recbW86Xrtuw8PhiU,City of Auburn,23.0,03 - Marysville,91032,03 - Marysville
22,bef2e7553d6d7fb3789f3b081e66365a,Roseville Transit GMV Schedule,recuSnC10vPfhAAs9,aHR0cHM6Ly9yb3NldmlsbGVidXN0cmFja2VyLmNvbS9ndGZz,recUdTq5QiUjJRiAe,City of Roseville,271.0,03 - Marysville,90168,03 - Marysville
114,13ff8c918cc62f49169d93a04864d8e7,Roseville Schedule,rec90jC43naXJz9lr,aHR0cHM6Ly9pcG9ydGFsLnNhY3J0LmNvbS9HVEZTL1Jvc2...,recUdTq5QiUjJRiAe,City of Roseville,271.0,03 - Marysville,90168,03 - Marysville
186,7228eba069f2a0fad0ed8552410a544d,El Dorado Schedule,recX4LuyZsMBVngah,aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3...,recEDVdKwYUSkGBRd,El Dorado County Transit Authority,101.0,03 - Marysville,90229,03 - Marysville
175,8de1f1a3b9ae172c6b8255b1c82c340f,Placer Schedule,reclUxmuws84qZ0n7,aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3...,recDD2rnkl2m7IV8u,Placer County,251.0,03 - Marysville,90196,03 - Marysville


Unnamed: 0,schedule_gtfs_dataset_key,name,schedule_source_record_id,base64_url,organization_source_record_id,organization_name,caltrans_district,counties_served,hq_city,hq_county,is_public_entity,is_publicly_operating,funding_sources,on_demand_vehicles_at_max_service,vehicles_at_max_service,number_of_state_counties,primary_uza_name,density,number_of_counties_with_service,state_admin_funds_expended,service_area_sq_miles,population,service_area_pop,subrecipient_type,primary_uza_code,reporter_type,organization_type,voms_pt,voms_do,year
186,7228eba069f2a0fad0ed8552410a544d,El Dorado Schedule,recX4LuyZsMBVngah,aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3...,recEDVdKwYUSkGBRd,El Dorado County Transit Authority,03 - Marysville,El Dorado,Diamond Springs,El Dorado,True,True,5307;5310;5311;5339,17.0,17,,"Sacramento, CA",4163.0,,,1719,1946618.0,148614,,,Reduced Reporter,Independent Public Agency or Authority of Tran...,,16.0,2022
22,bef2e7553d6d7fb3789f3b081e66365a,Roseville Transit GMV Schedule,recuSnC10vPfhAAs9,aHR0cHM6Ly9yb3NldmlsbGVidXN0cmFja2VyLmNvbS9ndGZz,recUdTq5QiUjJRiAe,City of Roseville,03 - Marysville,Placer,Roseville,Placer,True,True,5307,,24,,"Sacramento, CA",4163.0,,,43,1946618.0,153300,,,Reduced Reporter,"City, County or Local Government Unit or Depar...",24.0,,2022
114,13ff8c918cc62f49169d93a04864d8e7,Roseville Schedule,rec90jC43naXJz9lr,aHR0cHM6Ly9pcG9ydGFsLnNhY3J0LmNvbS9HVEZTL1Jvc2...,recUdTq5QiUjJRiAe,City of Roseville,03 - Marysville,Placer,Roseville,Placer,True,True,5307,,24,,"Sacramento, CA",4163.0,,,43,1946618.0,153300,,,Reduced Reporter,"City, County or Local Government Unit or Depar...",24.0,,2022
175,8de1f1a3b9ae172c6b8255b1c82c340f,Placer Schedule,reclUxmuws84qZ0n7,aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3...,recDD2rnkl2m7IV8u,Placer County,03 - Marysville,Placer,Auburn,Placer,True,True,5307;5311;5339,21.0,47,,"Sacramento, CA",4163.0,,,169,1946618.0,392258,,,Full Reporter,"City, County or Local Government Unit or Depar...",28.0,17.0,2022


Unnamed: 0,schedule_gtfs_dataset_key,name,schedule_source_record_id,base64_url,organization_source_record_id,organization_name,caltrans_district,counties_served,hq_city,hq_county,is_public_entity,is_publicly_operating,funding_sources,on_demand_vehicles_at_max_service,vehicles_at_max_service,number_of_state_counties,primary_uza_name,density,number_of_counties_with_service,state_admin_funds_expended,service_area_sq_miles,population,service_area_pop,subrecipient_type,primary_uza_code,reporter_type,organization_type,voms_pt,voms_do,year
46,020467a276c12a9fe4b0a2332e393f2c,Auburn Schedule,recPN6fJseWncWhDZ,aHR0cHM6Ly93d3cuYXVidXJuLmNhLmdvdi9Eb2N1bWVudE...,recbW86Xrtuw8PhiU,City of Auburn,03 - Marysville,,Auburn,,,,,6,6,,,,,,,,,Rural General Public Transit,,Rural Reporter,"City, County or Local Government Unit or Depar...",,6,2022


Unnamed: 0,schedule_gtfs_dataset_key,name,schedule_source_record_id,base64_url,organization_source_record_id,organization_name,caltrans_district,counties_served,hq_city,hq_county,is_public_entity,is_publicly_operating,funding_sources,on_demand_vehicles_at_max_service,vehicles_at_max_service,number_of_state_counties,primary_uza_name,density,number_of_counties_with_service,state_admin_funds_expended,service_area_sq_miles,population,service_area_pop,subrecipient_type,primary_uza_code,reporter_type,organization_type,voms_pt,voms_do,year


In [18]:
# see if there are any other operators in these schedule feeds
# looks like this is the only one
sched_list = [
    "Roseville Transit GMV Schedule",
    "Roseville Schedule",
    "Placer Schedule",
    "El Dorado Schedule",
    "Auburn Schedule",
]
sched_list_2 = "|".join(sched_list)

gtfs_to_org[gtfs_to_org["name"].str.contains(sched_list_2, case=False)].sort_values(by="name")

Unnamed: 0,schedule_gtfs_dataset_key,name,schedule_source_record_id,base64_url,organization_source_record_id,organization_name,itp_id,caltrans_district_x,ntd_id_2022,caltrans_district_y
46,020467a276c12a9fe4b0a2332e393f2c,Auburn Schedule,recPN6fJseWncWhDZ,aHR0cHM6Ly93d3cuYXVidXJuLmNhLmdvdi9Eb2N1bWVudE...,recbW86Xrtuw8PhiU,City of Auburn,23.0,03 - Marysville,91032,03 - Marysville
186,7228eba069f2a0fad0ed8552410a544d,El Dorado Schedule,recX4LuyZsMBVngah,aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3...,recEDVdKwYUSkGBRd,El Dorado County Transit Authority,101.0,03 - Marysville,90229,03 - Marysville
175,8de1f1a3b9ae172c6b8255b1c82c340f,Placer Schedule,reclUxmuws84qZ0n7,aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3...,recDD2rnkl2m7IV8u,Placer County,251.0,03 - Marysville,90196,03 - Marysville
114,13ff8c918cc62f49169d93a04864d8e7,Roseville Schedule,rec90jC43naXJz9lr,aHR0cHM6Ly9pcG9ydGFsLnNhY3J0LmNvbS9HVEZTL1Jvc2...,recUdTq5QiUjJRiAe,City of Roseville,271.0,03 - Marysville,90168,03 - Marysville
22,bef2e7553d6d7fb3789f3b081e66365a,Roseville Transit GMV Schedule,recuSnC10vPfhAAs9,aHR0cHM6Ly9yb3NldmlsbGVidXN0cmFja2VyLmNvbS9ndGZz,recUdTq5QiUjJRiAe,City of Roseville,271.0,03 - Marysville,90168,03 - Marysville


In [19]:
# sanity check
# example of a schedule feed key with multiple operators
display(
    gtfs_to_org[gtfs_to_org["organization_name"].str.contains("Ojai")],
    gtfs_to_org[gtfs_to_org["name"].str.contains("VCTC GMV Schedule")]
)

Unnamed: 0,schedule_gtfs_dataset_key,name,schedule_source_record_id,base64_url,organization_source_record_id,organization_name,itp_id,caltrans_district_x,ntd_id_2022,caltrans_district_y
0,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,reckQmUdXUzHFmlVf,City of Ojai,231.0,07 - Los Angeles,91058,07 - Los Angeles / Ventura


Unnamed: 0,schedule_gtfs_dataset_key,name,schedule_source_record_id,base64_url,organization_source_record_id,organization_name,itp_id,caltrans_district_x,ntd_id_2022,caltrans_district_y
0,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,reckQmUdXUzHFmlVf,City of Ojai,231.0,07 - Los Angeles,91058,07 - Los Angeles / Ventura
1,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,rec7EN71rsZxDFxZd,Ventura County Transportation Commission,380.0,07 - Los Angeles,90164,07 - Los Angeles / Ventura
2,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,recS7GnKTcQVX20HE,Gold Coast Transit District,123.0,07 - Los Angeles,90035,07 - Los Angeles / Ventura
3,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,rec1ErIn9gG1Isk5W,City of Simi Valley,308.0,07 - Los Angeles,90050,07 - Los Angeles / Ventura
4,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,recojKzQsBzE1hjVu,City of Moorpark,210.0,07 - Los Angeles,90227,07 - Los Angeles / Ventura
5,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,recPJULRJk1Yn824N,City of Thousand Oaks,337.0,07 - Los Angeles,90165,07 - Los Angeles / Ventura
6,1770249a5a2e770ca90628434d4934b1,VCTC GMV Schedule,recrAG7e0oOiR6FiP,aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz,recD4Vzt0EDC3VY7I,City of Camarillo,54.0,07 - Los Angeles,90163,07 - Los Angeles / Ventura


## Read in ntd_id_rtpa_crosswalk

In [20]:
crosswalk = pd.read_csv(
    "gs://calitp-analytics-data/data-analyses/ntd/ntd_id_rtpa_crosswalk.csv",
    dtype={"NTD ID": "str"},
# rename the NTD ID column to match the dim table
).rename(columns={"NTD ID": "ntd_id"})

In [21]:
display(
    crosswalk["RTPA"].nunique(),
    crosswalk["RTPA_open_data"].nunique()
)

26

20
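
The two counts differ because some agencies carry one RTPA in `RTPA_open_data` and another in `RTPA`. A minimal sketch for listing those rows, using toy rows copied from the crosswalk output in this notebook:

```python
import pandas as pd

# Toy rows shaped like the ntd_id_rtpa crosswalk.
toy_crosswalk = pd.DataFrame({
    "Agency": ["El Dorado County Transit Authority", "City of Auburn"],
    "RTPA_open_data": [
        "Sacramento Area Council of Governments",
        "Placer County Transportation Planning Agency",
    ],
    "RTPA": [
        "El Dorado County Transportation Commission",
        "Placer County Transportation Planning Agency",
    ],
})

# Rows where the two RTPA columns disagree.
mismatched = toy_crosswalk[toy_crosswalk["RTPA"] != toy_crosswalk["RTPA_open_data"]]
```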

In [22]:
## do the operators found in the previous section exist in the crosswalk?

In [23]:
crosswalk[crosswalk["Agency"].str.contains(city_2)]

Unnamed: 0,ntd_id,Legacy NTD ID,Agency,UZA Name,RTPA_open_data,RTPA
60,90229,9229.0,El Dorado County Transit Authority,"Sacramento, CA",Sacramento Area Council of Governments,El Dorado County Transportation Commission
61,90168,9168.0,City of Roseville,"Sacramento, CA",Sacramento Area Council of Governments,Placer County Transportation Planning Agency
62,90196,9196.0,County of Placer,"Sacramento, CA",Sacramento Area Council of Governments,Placer County Transportation Planning Agency
70,91032,91032.0,City of Auburn,"Sacramento, CA",Placer County Transportation Planning Agency,Placer County Transportation Planning Agency
121,90168,,City of Roseville,"Sacramento, CA",Placer County Transportation Planning Agency,Placer County Transportation Planning Agency
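
City of Roseville appears twice above with the same `ntd_id`. A repeated key in the crosswalk duplicates rows in any merge on `ntd_id`, so it may be worth flagging repeats before merging. A minimal sketch on toy rows:

```python
import pandas as pd

# Toy rows shaped like the crosswalk, with one repeated ntd_id.
toy_crosswalk = pd.DataFrame({
    "ntd_id": ["90229", "90168", "90196", "90168"],
    "Agency": [
        "El Dorado County Transit Authority",
        "City of Roseville",
        "County of Placer",
        "City of Roseville",
    ],
})

# keep=False flags every row of a duplicated ntd_id, not only the repeats.
dupes = toy_crosswalk[toy_crosswalk["ntd_id"].duplicated(keep=False)]
```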


--- 

## Read in ntd ridership data

In [26]:
# query the warehouse
full_upt = (
    tbls.mart_ntd.dim_monthly_ridership_with_adjustments() 
    >> filter(#_.uza_name.str.contains(" ,CA"),
              #_.period_year_month.str.contains("2024-")
             )
    >> collect()
).rename(columns = {"mode_type_of_service_status": "Status"}
       )

In [28]:
full_upt.columns

Index(['key', 'ntd_id', 'legacy_ntd_id', 'agency', 'reporter_type',
       'period_year_month', 'period_year', 'period_month', 'primary_uza_name',
       'primary_uza_code', '_3_mode', 'mode', 'mode_name', 'service_type',
       'Status', 'tos', 'upt', 'vrm', 'vrh', 'voms', '_dt', 'execution_ts'],
      dtype='object')

In [29]:
ca = full_upt[full_upt["primary_uza_name"].str.contains(", CA") & full_upt["period_year_month"].str.contains("2024-")]

## do any of the operators found in the previous section exist in the NTD data?

In [30]:
ca[ca["agency"].str.contains(city_2)]["agency"].value_counts()

County of Placer                      88
El Dorado County Transit Authority    33
City of Roseville                     22
Name: agency, dtype: int64

In [31]:
## do all the operators in the crosswalk exist in the ntd data?

crosswalk_agency = crosswalk["Agency"].unique().tolist()
crosswalk_agency_2 = "|".join(crosswalk_agency)

display(
    len(ca[ca["agency"].str.contains(crosswalk_agency_2)]["agency"].unique()),
    len(crosswalk_agency)
)

# most do: 118 of the 121 crosswalk agencies matched

118

121
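
Joining agency names with `|` turns each name into a regular expression, so characters such as parentheses change the pattern and a short name can match inside a longer one. A minimal sketch of an exact-match alternative that also shows which agencies are missing; the names below are hypothetical and only illustrate the technique:

```python
import pandas as pd

# Hypothetical names for illustration only.
crosswalk_names = pd.Series(["City of A", "City of B (Transit)", "County of C"])
ridership_names = pd.Series(["City of A", "County of C", "County of C"])

# Exact-match membership avoids regex metacharacters and partial matches.
missing = crosswalk_names[~crosswalk_names.isin(ridership_names)]
```

On the real data, `crosswalk["Agency"]` and `ca["agency"]` would take the place of the two toy Series.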

---

## Conclusion

**Update the following rows in the `ntd_id_rtpa` crosswalk**

County of Placer
- RTPA: Placer County Transportation Planning Agency
    
City of Roseville
- RTPA: Placer County Transportation Planning Agency

El Dorado County Transit Authority
- RTPA: El Dorado County Transportation Commission

**Add to the ntd_id_rtpa crosswalk**

City of Auburn
- RTPA: Placer County Transportation Planning Agency
- NTD ID: 91032
- UZA Name: Sacramento, CA

In [None]:
display(
    crosswalk[crosswalk["Agency"].str.contains("Roseville")],
    crosswalk[crosswalk["Agency"].str.contains("El Dorado")]
)

In [None]:
crosswalk.loc[crosswalk["Agency"].str.contains("Placer"), "RTPA"] = "Placer County Transportation Planning Agency"

In [None]:
crosswalk.loc[crosswalk["Agency"].str.contains("Roseville"), "RTPA"] = "Placer County Transportation Planning Agency"
crosswalk.loc[crosswalk["Agency"].str.contains("El Dorado"), "RTPA"] = "El Dorado County Transportation Commission"

In [None]:
display(
    crosswalk[crosswalk["Agency"].str.contains("Placer")],
    crosswalk[crosswalk["Agency"].str.contains("Roseville")],
    crosswalk[crosswalk["Agency"].str.contains("El Dorado")]
)

In [None]:
columns = crosswalk.columns

auburn = pd.DataFrame(
    [[
        "91032",
        "91032",
        "City of Auburn",
        "Sacramento, CA",  # match the "Sacramento, CA" spelling used elsewhere in the crosswalk
        "Placer County Transportation Planning Agency",
        "Placer County Transportation Planning Agency",
    ]],
    columns=columns,
)

crosswalk_2 = pd.concat([crosswalk, auburn], ignore_index=True)

display(
    crosswalk.shape,
    crosswalk_2.shape
)

In [None]:
crosswalk_2["ntd_id"] = crosswalk_2["ntd_id"].astype("str")

In [None]:
crosswalk_2.info()
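
The cast to `str` matters because a CSV does not store dtypes: reading the file back without a `dtype` argument lets pandas infer `ntd_id` as an integer. A minimal sketch using an in-memory CSV in place of the GCS file:

```python
import io

import pandas as pd

# In-memory stand-in for the saved crosswalk CSV.
csv_text = "ntd_id,Agency\n90229,El Dorado County Transit Authority\n"

# Default read: pandas infers an integer column.
as_default = pd.read_csv(io.StringIO(csv_text))

# Explicit dtype: ntd_id stays a string and merges cleanly with string keys.
as_string = pd.read_csv(io.StringIO(csv_text), dtype={"ntd_id": "str"})
```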

In [None]:
# save updated crosswalk back to GCS
#crosswalk_2.to_csv(f"gs://calitp-analytics-data/data-analyses/ntd/ntd_id_rtpa_crosswalk.csv", index=False)


In [None]:
#crosswalk_2.to_parquet(f"gs://calitp-analytics-data/data-analyses/ntd/ntd_id_rtpa_crosswalk.parquet")

---

In [None]:
# addressing merge issue from monthly_ridership_by_rtpa.py script

full_upt = (tbls.mart_ntd.dim_monthly_ntd_ridership_with_adjustments() >> collect()).rename(columns = {"mode_type_of_service_status": "Status"})
    
full_upt = full_upt[full_upt.agency.notna()].reset_index(drop=True)
    
#full_upt.to_parquet(
#        f"{GCS_FILE_PATH}ntd_monthly_ridership_{year}_{month}.parquet"
#    )
    
ca = full_upt[(full_upt["uza_name"].str.contains(", CA")) & 
            (full_upt.agency.notna())].reset_index(drop=True)
    
crosswalk = pd.read_csv(
        f"gs://calitp-analytics-data/data-analyses/ntd/ntd_id_rtpa_crosswalk.csv", 
        dtype = {"ntd_id": "str"}
    #have to rename NTD ID col to match the dim table
    )#.rename(columns={"NTD ID": "ntd_id"})
    
df = pd.merge(
        ca,
        # Merging on too many columns can create problems 
        # because csvs and dtypes aren't stable / consistent 
        # for NTD ID, Legacy NTD ID, and UZA
        crosswalk[["ntd_id", "RTPA"]],
        on = "ntd_id",
        how = "left",
        indicator = True
    )
    
print(df._merge.value_counts())
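
Beyond the counts, the `_merge` column can be filtered to see exactly which ridership rows found no RTPA. A minimal sketch on toy data; the `upt` values and the `99999` ID are hypothetical:

```python
import pandas as pd

# Toy ridership rows; "99999" is a hypothetical ID with no crosswalk entry.
ridership = pd.DataFrame({
    "ntd_id": ["90229", "90168", "99999"],
    "upt": [100, 200, 300],
})
rtpa = pd.DataFrame({
    "ntd_id": ["90229", "90168"],
    "RTPA": [
        "El Dorado County Transportation Commission",
        "Placer County Transportation Planning Agency",
    ],
})

merged = ridership.merge(rtpa, on="ntd_id", how="left", indicator=True)

# Rows present in the ridership data but absent from the crosswalk.
unmatched = merged[merged["_merge"] == "left_only"]
```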


In [None]:
# na=False treats rows with no RTPA (NaN after the left merge) as non-matches
df[df["RTPA"].str.contains("El Dor", na=False)]