In [1]:
import pandas as pd
from calitp_data_analysis.tables import tbls
from siuba import _, collect, distinct, filter, select

pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

# QC `ntd_id_rtpa_crosswalk` against `dim_organizations`

**6/2/2025**
`dim_organizations.rtpa_name` data was updated in Airtable based on the findings from 5/9/2025. Tested integrating `dim_orgs` into existing analyses. The tests on the `Annual NTD Ridership Report` and `New Transit Performance Metrics` both left the `Los Angeles County Public Works` ntd_ids without RTPAs. Determined that manually adding an RTPA value to these agencies will suffice.

For the `Monthly NTD Ridership Report`, some ntd_ids did not have RTPAs. All of these agencies have a `last_report_year` value before 2018. Determined that these agencies can be excluded from the report because they do not meet the 2018 minimum year requirement. A helper function was developed to filter on `last_report_year`.
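
The helper itself is not shown in this notebook. A minimal sketch of what such a filter could look like in pandas is below; the function name, its parameters, and the toy data are hypothetical, and the inclusive cutoff is an assumption (the notebook's own warehouse filter uses `last_report_year > 2018`).

```python
import pandas as pd


def filter_by_last_report_year(df, min_year=2018, col="last_report_year"):
    """Keep rows whose `col` is at least `min_year` (inclusive cutoff is an assumption)."""
    years = pd.to_numeric(df[col], errors="coerce")  # unparseable values become NaN
    return df[years >= min_year].copy()  # NaN compares False, so those rows are dropped


# Toy data with made-up IDs, only to show the behavior.
toy = pd.DataFrame(
    {
        "ntd_id": ["id_a", "id_b", "id_c"],
        "last_report_year": [1993, 2018, 2023],
    }
)
kept = filter_by_last_report_year(toy)
print(kept)
```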


---

As of 4/21/2025, `dim_orgs` has an RTPA column. RTPA values were set from a previous `rtpa/mpo` column. The `ntd_id_rtpa_crosswalk`, derived from `mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_upt`,
was created by pulling all the agencies in NTD and assigning an RTPA based on their city location.

## Tasks
- see if the ntd_ids in the xwalk appear in dim_orgs and vice versa.
    - are all ntd_id from xwalk in dim_orgs? NO
    - are all ntd_id from dim_orgs in xwalk? NO!

- do the ntd_id/rtpa pairs from xwalk match dim_orgs?
    - SCAG accounts for 6 counties in SoCal

- refactor analyses that use the old xwalk
    - remove old xwalk
    - replace with rtpa data from dim_orgs
    - make manual adjustments to SCAG agencies >> separate out to the SoCal CTCs (Orange, Imperial, San Bernardino, etc.)
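
A rough sketch of the SCAG adjustment in the last task, assuming the data has a county column to key on. The `county` column, the function name, and the exact label strings are hypothetical; the labels stored in `dim_orgs.rtpa_name` may be spelled differently.

```python
import pandas as pd

# The six SCAG counties mapped to their county transportation commissions.
# The label strings are illustrative and may not match dim_orgs exactly.
SCAG_COUNTY_TO_CTC = {
    "Imperial": "Imperial County Transportation Commission",
    "Los Angeles": "Los Angeles County Metropolitan Transportation Authority",
    "Orange": "Orange County Transportation Authority",
    "Riverside": "Riverside County Transportation Commission",
    "San Bernardino": "San Bernardino County Transportation Authority",
    "Ventura": "Ventura County Transportation Commission",
}
SCAG_LABEL = "Southern California Association of Governments"  # assumed label


def split_scag_to_ctc(df, rtpa_col="rtpa_name", county_col="county"):
    """Replace the SCAG label with the county's CTC; leave all other rows alone."""
    out = df.copy()
    is_scag = out[rtpa_col] == SCAG_LABEL
    mapped = out.loc[is_scag, county_col].map(SCAG_COUNTY_TO_CTC)
    # Counties missing from the mapping keep the original SCAG label.
    out.loc[is_scag, rtpa_col] = mapped.fillna(out.loc[is_scag, rtpa_col])
    return out


toy = pd.DataFrame(
    {
        "rtpa_name": [SCAG_LABEL, SCAG_LABEL, "Some Other RTPA"],
        "county": ["Orange", "Unknown", "Orange"],
    }
)
adjusted = split_scag_to_ctc(toy)
print(adjusted)
```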


---
## Findings
- there are some ntd_ids unique to dim_orgs that are not in my xwalk
- vice versa, there are some ntd_ids unique to the xwalk that are not in dim_orgs
- `Agency Status`: Designates agencies as active or inactive based on whether or not they submitted a report in the most recent report year.					
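
The two-way comparison behind these findings can be sketched with plain set differences. The frames and IDs below are made-up toy data, not the real crosswalk or `dim_orgs`.

```python
import pandas as pd

# Toy stand-ins for the crosswalk and dim_orgs; IDs are invented.
xwalk_toy = pd.DataFrame({"ntd_id": ["00001", "00002", "00003"]})
dim_orgs_toy = pd.DataFrame({"ntd_id_2022": ["00002", "00003", "00004", None]})

# Drop missing IDs before comparing so None/NaN never counts as a mismatch.
xwalk_ids = set(xwalk_toy["ntd_id"].dropna())
dim_org_ids = set(dim_orgs_toy["ntd_id_2022"].dropna())

only_in_xwalk = sorted(xwalk_ids - dim_org_ids)
only_in_dim_orgs = sorted(dim_org_ids - xwalk_ids)

print("only in xwalk:", only_in_xwalk)
print("only in dim_orgs:", only_in_dim_orgs)
```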


## Work plan
- fill in the missing RTPA columns in dim_orgs via airtable
- what is the process of adding new orgs to dim_orgs? sounds like a lot of work (filling in all the columns)
- find out which agencies are unique to the crosswalk and do not appear in dim_orgs. Assign ntd_ids and RTPA values to the agencies we can fix.


## 5/9/2025: Observed Scenarios
The initial ntd_id to RTPA crosswalk was derived from `mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_upt`. We were later informed that `dim_organizations` now contains RTPA data. However, when compared on ntd_id, 38 agencies were unique to mart_ntd and were not part of dim_organizations. The agency name and ntd_id variables were analyzed further.

Options for each variable
- name
    - match
    - semi-match
    - different
    - none
- dim_org.ntd_id
    - match
    - different
    - none

Of the 12 possible combinations, 6 were observed. The scenarios below compare the agency name and ntd_id in mart_ntd against dim_orgs, and note the resulting action, if any.
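
As a quick check of the count, the 12 combinations can be enumerated from the two option lists above and compared against the six scenario labels used in this notebook. This is only an illustration; the variable names are hypothetical.

```python
from itertools import product

name_options = ["match", "semi-match", "different", "none"]
ntd_id_options = ["match", "different", "none"]

# Build labels in the same format as the `Scenario` column.
all_scenarios = [
    f"name: {name} | dim_org.ntd_ID: {ntd_id}"
    for name, ntd_id in product(name_options, ntd_id_options)
]

# The six scenarios observed in this notebook.
observed = {
    "name: match | dim_org.ntd_ID: none",
    "name: match | dim_org.ntd_ID: different",
    "name: semi-match | dim_org.ntd_ID: none",
    "name: none | dim_org.ntd_ID: none",
    "name: different | dim_org.ntd_ID: different",
    "name: different | dim_org.ntd_ID: none",
}
not_observed = [s for s in all_scenarios if s not in observed]

print(len(all_scenarios), "possible;", len(not_observed), "not observed")
```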

A CSV of all 38 agencies is in GCS: `gs://calitp-analytics-data/data-analyses/ntd/mart_ntd_orgs_compared_dim_orgs.csv`

In [2]:
# the final csv comparing the unique mart_ntd agencies to dim_org.
# found by merging the ntd_id crosswalk to dim_orgs, then selecting the left-only unmerged rows to investigate further.
mart_ntd_vs_dim_orgs = pd.read_csv("gs://calitp-analytics-data/data-analyses/ntd/mart_ntd_orgs_compared_dim_orgs.csv")

In [3]:
display(
    mart_ntd_vs_dim_orgs["ntd_id_2022 updated in airtable"].value_counts(dropna=False)
)

NaN     25
True    13
Name: ntd_id_2022 updated in airtable, dtype: int64

### 1. **name: match | dim_org.ntd_ID: none**
- Scenario where the mart_ntd agency name and the dim_org agency name are the same, or very obviously the same, but there was no ntd_id in dim_org.
- example
    - "City of Benicia (Benicia Breeze) 90174" vs "City of Benicia n/a"
    - "City of Merced Transit System 90143" vs "City of Merced n/a"
- Solution: ntd_id from mart_ntd copied to ntd_id_2022 in airtable

In [4]:
mart_ntd_vs_dim_orgs[mart_ntd_vs_dim_orgs["Scenario"]=="name: match | dim_org.ntd_ID: none"]

Unnamed: 0,mart_ntd_funding_and_expenses...ntd_id,mart_ntd_funding_and_expenses...agency_name,last_report_year,city,agency_status,Scenario,dim_org.name equivilent,dim_org.ntd_id equivilent,result,ntd_id_2022 updated in airtable,Notes
2,90054,"San Diego Trolley, Inc. (MTS)",2006,San Diego,Inactive,name: match | dim_org.ntd_ID: none,"San Diego Trolley, Inc.",,copied ntd_id to dim_org.ntd_id_2022,True,
6,90143,City of Merced Transit System,1993,Merced,Inactive,name: match | dim_org.ntd_ID: none,City of Merced,,copied ntd_id to dim_org.ntd_id_2022,True,
12,90174,City of Benicia (Benicia Breeze),2011,Benicia,Inactive,name: match | dim_org.ntd_ID: none,City of Benicia,,copied ntd_id to dim_org.ntd_id_2022,True,
17,90187,"San Gabriel Transit, Inc. (SGT)",2000,Rosemead,Inactive,name: match | dim_org.ntd_ID: none,San Gabriel Transit Inc.,,copied ntd_id to dim_org.ntd_id_2022,True,
25,90231,City of Irvine (COI),2016,Irvine,Inactive,name: match | dim_org.ntd_ID: none,City of Irvine,,copied ntd_id to dim_org.ntd_id_2022,True,
36,90311,Stanislaus Council of Governments (StanCOG) - Mobility Programs,2023,Modesto,Active,name: match | dim_org.ntd_ID: none,Stanislaus Council of Governments,,copied ntd_id to dim_org.ntd_id_2022,True,


### 2. **name: match | dim_org.ntd_ID: different**
- When the mart_ntd and dim_org agency names are the same, but the dim_org ntd_id is different.
- example
    - "Los Angeles County Metropolitan Transportation Authority (LACMTA) 90021" vs "Los Angeles County Metropolitan Transportation Authority 90154"
    - "Paratransit, Inc. CTSA 90224" vs "Paratransit Inc. 90223"
- No changes were made because dim_org already has an ntd_id.
- Guessing this will still cause some unmerged rows between mart_ntd and dim_orgs, since the ntd_ids differ.
- Can the mart_ntd org names be added as an alias in airtable?
 

In [5]:
mart_ntd_vs_dim_orgs[mart_ntd_vs_dim_orgs["Scenario"]=="name: match | dim_org.ntd_ID: different"]

Unnamed: 0,mart_ntd_funding_and_expenses...ntd_id,mart_ntd_funding_and_expenses...agency_name,last_report_year,city,agency_status,Scenario,dim_org.name equivilent,dim_org.ntd_id equivilent,result,ntd_id_2022 updated in airtable,Notes
0,90021,Los Angeles County Metropolitan Transportation Authority (LACMTA),1993,Los Angeles,Inactive,name: match | dim_org.ntd_ID: different,Los Angeles County Metropolitan Transportation Authority,90154,"no changes, dim_org.ntd_id_2022 already establised",,put as alias on 90271?
24,90224,"Paratransit, Inc. CTSA",2013,Sacramento,Inactive,name: match | dim_org.ntd_ID: different,Paratransit Inc.,90223,"no changes, dim_org.ntd_id_2022 already establised",,
26,90269,"Los Angeles County - Department of Public Works, Transit Operations, Athens MB",2023,Alhambra,Active,name: match | dim_org.ntd_ID: different,"Los Angeles County (alias, LA DPW)?",90271,no changes. Add as alias?,,
27,90270,"Los Angeles County (LACDPW) - Department of Public Works, Transit Operations – Avocado Heights",2023,Alhambra,Active,name: match | dim_org.ntd_ID: different,"Los Angeles County (alias, LA DPW)?",90271,no changes. Add as alias?,,
28,90272,"Los Angeles County (LACDPW) - Department of Public Works, Transit Operations – East Valinda",2023,Alhambra,Active,name: match | dim_org.ntd_ID: different,"Los Angeles County (alias, LA DPW)?",90271,no changes. Add as alias?,,
29,90273,"Los Angeles County (LACDPW) - Department of Public Works, Transit Operations - Florence Firestone",2023,Alhambra,Active,name: match | dim_org.ntd_ID: different,"Los Angeles County (alias, LA DPW)?",90271,no changes. Add as alias?,,
30,90274,"Los Angeles County - Department of Public Works, Transit Operations, King Medical Center MB",2023,Alhambra,Active,name: match | dim_org.ntd_ID: different,"Los Angeles County (alias, LA DPW)?",90271,no changes. Add as alias?,,
31,90275,"Los Angeles County - Department of Public Works, Transit Operations, Lennox MB",2023,Alhambra,Active,name: match | dim_org.ntd_ID: different,"Los Angeles County (alias, LA DPW)?",90271,no changes. Add as alias?,,
32,90276,"Los Angeles County (LACDPW) - Department of Public Works, Transit Operations - South Whittier",2023,Alhambra,Active,name: match | dim_org.ntd_ID: different,"Los Angeles County (alias, LA DPW)?",90271,no changes. Add as alias?,,
33,90277,"Los Angeles County - Department of Public Works, Transit Operations, Whittier Et Al DR",2023,Alhambra,Active,name: match | dim_org.ntd_ID: different,"Los Angeles County (alias, LA DPW)?",90271,no changes. Add as alias?,,


### 3. **name: semi-match | dim_org.ntd_ID: none**
- When the mart_ntd and dim_org names are similar (but not exact), and there is no ntd_id in dim_org.
- example
    - "National City Transit (NCT) 90189" vs "City of National City n/a". 
    - "City of Vallejo Transportation Program (Vallejo Transit, Baylink) 90028" vs "City of Vallejo n/a". 
- Solution: ntd_id from mart_ntd copied to ntd_id_2022 in airtable
- felt pretty confident these were all the same agency, just different service/brand names
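
The match / semi-match / different calls in this notebook were made by hand. As a rough first pass only, a token-overlap comparison like the sketch below could pre-sort candidate pairs before manual review. The function names and the 0.5 threshold are hypothetical, and the labels it produces will not always agree with the manual classification.

```python
import re


def name_tokens(name):
    """Lowercase, drop parenthetical brand names, and split into word tokens."""
    without_parens = re.sub(r"\(.*?\)", " ", name.lower())
    return set(re.findall(r"[a-z0-9]+", without_parens))


def compare_names(mart_name, dim_org_name, threshold=0.5):
    """Return 'match', 'semi-match', 'different', or 'none' from token overlap."""
    a, b = name_tokens(mart_name), name_tokens(dim_org_name)
    if not a or not b:
        return "none"  # nothing to compare against
    if a == b:
        return "match"
    jaccard = len(a & b) / len(a | b)  # shared tokens over all tokens
    return "semi-match" if jaccard >= threshold else "different"


print(compare_names("San Diego Trolley, Inc. (MTS)", "San Diego Trolley, Inc."))
print(compare_names("National City Transit (NCT)", "City of National City"))
print(compare_names("ATC / Vancom", "City of Benicia"))
```

Anything labeled `semi-match` or `different` would still need a person to confirm whether the two names refer to the same agency.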

In [6]:
mart_ntd_vs_dim_orgs[mart_ntd_vs_dim_orgs["Scenario"]=="name: semi-match | dim_org.ntd_ID: none"]

Unnamed: 0,mart_ntd_funding_and_expenses...ntd_id,mart_ntd_funding_and_expenses...agency_name,last_report_year,city,agency_status,Scenario,dim_org.name equivilent,dim_org.ntd_id equivilent,result,ntd_id_2022 updated in airtable,Notes
1,90028,"City of Vallejo Transportation Program (Vallejo Transit, Baylink)",2012,Vallejo,Inactive,name: semi-match | dim_org.ntd_ID: none,City of Vallejo,recUwbn2HNnI2rpxY,copied ntd_id to dim_org.ntd_id_2022,True,"SolTrans services Vallejo now, need to talk this one out"
7,90150,City of Alameda Ferry Services,2010,Alameda,Inactive,name: semi-match | dim_org.ntd_ID: none,City of Alameda,,copied ntd_id to dim_org.ntd_id_2022,True,
9,90160,Outreach & Escort dba OUTREACH,2001,San Jose,Inactive,name: semi-match | dim_org.ntd_ID: none,OUTREACH,,copied ntd_id to dim_org.ntd_id_2022,True,probably wrapped into BCAG
18,90188,County of San Diego Transit System (CTS),2003,San Diego,Inactive,name: semi-match | dim_org.ntd_ID: none,San Diego County,,copied ntd_id to dim_org.ntd_id_2022,True,
19,90189,National City Transit (NCT),2006,National City,Inactive,name: semi-match | dim_org.ntd_ID: none,City of National City,rec1S4pn6HGRfSekT,copied ntd_id to dim_org.ntd_id_2022,True,probably is City of Natnioal City. National city wiki said It was absored into MTS


### 4. **name: none | dim_org.ntd_ID: none**
- Where the mart_ntd name has no similar or equivalent name in dim_org, and thus no dim_org ntd_id either.
- example
    - "ATC / Vancom 90170" vs "no name, no id"
    - "Laidlaw Transit Services 90178" vs "no name, no id"
    - "DAVE Transportation Services, Inc. 90158" vs "no name, no id"
- Should these agencies be included in airtable?

In [7]:
mart_ntd_vs_dim_orgs[mart_ntd_vs_dim_orgs["Scenario"]=="name: none | dim_org.ntd_ID: none"]

Unnamed: 0,mart_ntd_funding_and_expenses...ntd_id,mart_ntd_funding_and_expenses...agency_name,last_report_year,city,agency_status,Scenario,dim_org.name equivilent,dim_org.ntd_id equivilent,result,ntd_id_2022 updated in airtable,Notes
8,90158,"DAVE Transportation Services, Inc.",2001,Sherman Oaks,Inactive,name: none | dim_org.ntd_ID: none,none,,no changes. Add org to airtable?,,
11,90170,ATC / Vancom,2006,Oakland,Inactive,name: none | dim_org.ntd_ID: none,none,,no changes. Add org to airtable?,,
13,90178,Laidlaw Transit Services,2001,El Monte,Inactive,name: none | dim_org.ntd_ID: none,none,,no changes. Add org to airtable?,,out of business
14,90179,Ryder/ATE,2001,West Covina,Inactive,name: none | dim_org.ntd_ID: none,none,,no changes. Add org to airtable?,,
20,90190,"Laidlaw Transit Services, Inc.",2001,San Jose,Inactive,name: none | dim_org.ntd_ID: none,none,,no changes. Add org to airtable?,,out of business
37,99280,Reservation Transportation Authority,2015,none,,name: none | dim_org.ntd_ID: none,,,no changes. Add org to airtable?,,


### 5. **name: different | dim_org.ntd_ID: different**
- When the mart_ntd and dim_org names are different, but slightly similar, and the ntd_ids are also different.
- example
    - "LACMTA - Small Operators (LACMTA) 90166" vs "Los Angeles County Metropolitan Transportation Authority 90154"
    - "MTS Contract Services (MCS) 90185" vs, probably, "San Diego Metropolitan Transit System 90026"
- No changes were made. Less confident these names match.
- Need guidance.

In [8]:
mart_ntd_vs_dim_orgs[mart_ntd_vs_dim_orgs["Scenario"]=="name: different | dim_org.ntd_ID: different"]

Unnamed: 0,mart_ntd_funding_and_expenses...ntd_id,mart_ntd_funding_and_expenses...agency_name,last_report_year,city,agency_status,Scenario,dim_org.name equivilent,dim_org.ntd_id equivilent,result,ntd_id_2022 updated in airtable,Notes
10,90166,LACMTA - Small Operators (LACMTA),2015,Los Angeles,Inactive,name: different | dim_org.ntd_ID: different,Los Angeles County Metropolitan Transportation Authority,90154,no changes. Add as alias?,,put as alias on 90154?
15,90185,MTS Contract Services (MCS),2006,San Diego,Inactive,name: different | dim_org.ntd_ID: different,"probably, San Diego Metropolitan Transit System",90026,no changes. Add as alias?,,


### 6. **name: different | dim_org.ntd_ID: none**
- When the mart_ntd and dim_org names are different, but slightly similar, and there is no ntd_id in dim_orgs.
- example
    - "Chico Area Transit System City of Chico (CATS) 90127" vs "City of Chico n/a"
    - "Paso Robles Transit Services (PE) 90195" vs "City of Paso Robles n/a"
- Some name matches were more confident than others.
- Some ntd_ids were copied over; some were left alone.

In [9]:
mart_ntd_vs_dim_orgs[mart_ntd_vs_dim_orgs["Scenario"]=="name: different | dim_org.ntd_ID: none"]

Unnamed: 0,mart_ntd_funding_and_expenses...ntd_id,mart_ntd_funding_and_expenses...agency_name,last_report_year,city,agency_status,Scenario,dim_org.name equivilent,dim_org.ntd_id equivilent,result,ntd_id_2022 updated in airtable,Notes
3,90055,Monterey County RIDES,1996,Salinas,Inactive,name: different | dim_org.ntd_ID: none,Monterey County?,recPCdSwksmlKiK4k,no changes. Should ntd_id be copied over?,,"There is Monterey County, but also ADA Paratransit RIDES in MST"
4,90077,Los Angeles County Transportation Commission / MTA,1993,Los Angeles,Inactive,name: different | dim_org.ntd_ID: none,Los Angeles County? (not CTC),,no changes. Should ntd_id be copied over?,,
5,90127,Chico Area Transit System City of Chico (CATS),2005,Chico,Inactive,name: different | dim_org.ntd_ID: none,City of Chico,recNjaGALypWtjGar,copied ntd_id to dim_org.ntd_id_2022,True,
16,90186,San Francisco Paratransit (ATC),2007,San Francisco,Inactive,name: different | dim_org.ntd_ID: none,"probably City and County of San Francisco, via the sfmta webite",,no changes. Should ntd_id be copied over?,,https://www.sfmta.com/getting-around/accessibility/paratransit
21,90193,Chula Vista Transit (CVT),2015,Chula Vista,Inactive,name: different | dim_org.ntd_ID: none,City of Chula Vista,recOT4QO6t6mRhUEu,copied ntd_id to dim_org.ntd_id_2022,True,? absorbed to MTS ?
22,90195,Paso Robles Transit Services (PE),2014,Paso Robles,Inactive,name: different | dim_org.ntd_ID: none,City of Paso Robles,recTsBhbc04OTbbe3,no changes. Should ntd_id be copied over?,,"probably is City of Paso Robles, there is a ""Paso Express"" picture on the SLORTA website for Paso Robles Routes A & B services"
23,90212,Imperial Valley Transit (IVT),2010,El Centro,Inactive,name: different | dim_org.ntd_ID: none,Imperial County Transportation Commission,rec38PbjPbEy2Tvdu,no changes. Should ntd_id be copied over?,,"IVT is listed as a service under ICTC, but IVT has its own website https://www.ivtransit.com/"


## Full Table

In [10]:
mart_ntd_vs_dim_orgs.sort_values(by="Scenario")

Unnamed: 0,mart_ntd_funding_and_expenses...ntd_id,mart_ntd_funding_and_expenses...agency_name,last_report_year,city,agency_status,Scenario,dim_org.name equivilent,dim_org.ntd_id equivilent,result,ntd_id_2022 updated in airtable,Notes
10,90166,LACMTA - Small Operators (LACMTA),2015,Los Angeles,Inactive,name: different | dim_org.ntd_ID: different,Los Angeles County Metropolitan Transportation Authority,90154,no changes. Add as alias?,,put as alias on 90154?
15,90185,MTS Contract Services (MCS),2006,San Diego,Inactive,name: different | dim_org.ntd_ID: different,"probably, San Diego Metropolitan Transit System",90026,no changes. Add as alias?,,
23,90212,Imperial Valley Transit (IVT),2010,El Centro,Inactive,name: different | dim_org.ntd_ID: none,Imperial County Transportation Commission,rec38PbjPbEy2Tvdu,no changes. Should ntd_id be copied over?,,"IVT is listed as a service under ICTC, but IVT has its own website https://www.ivtransit.com/"
3,90055,Monterey County RIDES,1996,Salinas,Inactive,name: different | dim_org.ntd_ID: none,Monterey County?,recPCdSwksmlKiK4k,no changes. Should ntd_id be copied over?,,"There is Monterey County, but also ADA Paratransit RIDES in MST"
4,90077,Los Angeles County Transportation Commission / MTA,1993,Los Angeles,Inactive,name: different | dim_org.ntd_ID: none,Los Angeles County? (not CTC),,no changes. Should ntd_id be copied over?,,
5,90127,Chico Area Transit System City of Chico (CATS),2005,Chico,Inactive,name: different | dim_org.ntd_ID: none,City of Chico,recNjaGALypWtjGar,copied ntd_id to dim_org.ntd_id_2022,True,
22,90195,Paso Robles Transit Services (PE),2014,Paso Robles,Inactive,name: different | dim_org.ntd_ID: none,City of Paso Robles,recTsBhbc04OTbbe3,no changes. Should ntd_id be copied over?,,"probably is City of Paso Robles, there is a ""Paso Express"" picture on the SLORTA website for Paso Robles Routes A & B services"
21,90193,Chula Vista Transit (CVT),2015,Chula Vista,Inactive,name: different | dim_org.ntd_ID: none,City of Chula Vista,recOT4QO6t6mRhUEu,copied ntd_id to dim_org.ntd_id_2022,True,? absorbed to MTS ?
16,90186,San Francisco Paratransit (ATC),2007,San Francisco,Inactive,name: different | dim_org.ntd_ID: none,"probably City and County of San Francisco, via the sfmta webite",,no changes. Should ntd_id be copied over?,,https://www.sfmta.com/getting-around/accessibility/paratransit
0,90021,Los Angeles County Metropolitan Transportation Authority (LACMTA),1993,Los Angeles,Inactive,name: match | dim_org.ntd_ID: different,Los Angeles County Metropolitan Transportation Authority,90154,"no changes, dim_org.ntd_id_2022 already establised",,put as alias on 90271?


### What are the agencies with last report year >2018 that still need to be resolved?

In [11]:
mart_ntd_vs_dim_orgs[
    (mart_ntd_vs_dim_orgs["last_report_year"]>2018) &
    (mart_ntd_vs_dim_orgs["ntd_id_2022 updated in airtable"].isna())
][[
    "mart_ntd_funding_and_expenses...ntd_id",
    "mart_ntd_funding_and_expenses...agency_name",
]]

# just LA County

Unnamed: 0,mart_ntd_funding_and_expenses...ntd_id,mart_ntd_funding_and_expenses...agency_name
26,90269,"Los Angeles County - Department of Public Works, Transit Operations, Athens MB"
27,90270,"Los Angeles County (LACDPW) - Department of Public Works, Transit Operations – Avocado Heights"
28,90272,"Los Angeles County (LACDPW) - Department of Public Works, Transit Operations – East Valinda"
29,90273,"Los Angeles County (LACDPW) - Department of Public Works, Transit Operations - Florence Firestone"
30,90274,"Los Angeles County - Department of Public Works, Transit Operations, King Medical Center MB"
31,90275,"Los Angeles County - Department of Public Works, Transit Operations, Lennox MB"
32,90276,"Los Angeles County (LACDPW) - Department of Public Works, Transit Operations - South Whittier"
33,90277,"Los Angeles County - Department of Public Works, Transit Operations, Whittier Et Al DR"
34,90278,"Los Angeles County - Department of Public Works, Transit Operations, Willowbrook MB"
35,90279,"Los Angeles County - Department of Public Works, Transit Operations, Willowbrook et. al. DR"


Ran these ntd_ids through `mart_ntd.dim_annual_agency_information` to see if they exist; the results returned `Los Angeles County` for agency_name.

This means two things:
1. these ntd_ids do not exist in dim_orgs
2. these ntd_ids do exist in dim_annual_agency_information, and it looks like the naming has already been fixed there?

![image.png](attachment:41cb6948-a6ca-458a-9aa7-6961d01411b7.png)



## In summary

Is everything accounted for now?

If we ensure future NTD analyses cover 2018 to present, then the ntd_id/RTPA data in dim_orgs should be good to use, with the exception of the multiple `Los Angeles County` ntd_ids. However, I believe these can be easily accounted for.


## Next Steps
Test replacing the initial ntd_id to RTPA xwalk with the updated dim_organizations.

Affected analyses:
- Annual NTD Ridership Report
- New Transit Performance Metrics
- Monthly NTD Ridership Report

In each analysis, find the cell that merges data, then replace the xwalk with dim_orgs.
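
A minimal sketch of that swap on toy data, assuming `dim_orgs` has at most one row per `ntd_id_2022`. Passing `validate="many_to_one"` makes pandas raise an error if that assumption is wrong, which guards against row fanout, and `indicator=True` exposes the rows that found no RTPA. The frames and IDs here are made up.

```python
import pandas as pd

# Toy stand-ins; the real inputs are the NTD fact table and dim_orgs.
ntd_toy = pd.DataFrame(
    {"ntd_id": ["A", "A", "B", "C"], "upt": [10, 20, 30, 40]}
)
dim_orgs_toy = pd.DataFrame(
    {"ntd_id_2022": ["A", "B"], "rtpa_name": ["RTPA 1", "RTPA 2"]}
)

merged = ntd_toy.merge(
    dim_orgs_toy,
    how="left",
    left_on="ntd_id",
    right_on="ntd_id_2022",
    validate="many_to_one",  # raises MergeError if ntd_id_2022 is duplicated
    indicator=True,
)

# A left merge with a unique right key keeps the row count unchanged.
assert len(merged) == len(ntd_toy)

unmatched = merged.loc[merged["_merge"] == "left_only", "ntd_id"].unique().tolist()
print("ntd_ids without an RTPA:", unmatched)
```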

In [35]:
dim_orgs = (
    tbls.mart_transit_database.dim_organizations()
    >> filter(
        _._is_current == True,
        _.ntd_id_2022.notna()
    )
    >> select(
        _.name,
        _.ntd_id_2022,
        _.rtpa_name,
        _.mpo_name
    )
    >> collect()
)


In [36]:
dim_orgs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245 entries, 0 to 244
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         245 non-null    object
 1   ntd_id_2022  245 non-null    object
 2   rtpa_name    244 non-null    object
 3   mpo_name     161 non-null    object
dtypes: object(4)
memory usage: 7.8+ KB


In [37]:
year_list=[
    "2018",
    "2019",
    "2020",
    "2021",
    "2022",
    "2023",
    "2024",
    "2025"
]

## Test: Annual NTD Ridership Report, data prep

In [38]:
ntd_service = (
    tbls.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_upt()
    >> filter(_.state.str.contains("CA") | 
              _.state.str.contains("NV"), # to get lake Tahoe Transportation back
              _.year.isin(year_list),
              _.last_report_year>2018, # NEW FILTER
              _.city != None,
              _.primary_uza_name.str.contains(", CA") | 
              _.primary_uza_name.str.contains("CA-NV") |
              _.primary_uza_name.str.contains("California Non-UZA") | 
              _.primary_uza_name.str.contains("El Paso, TX--NM") # something about Paso 
             )
    >> select(
        'agency_name',
        'agency_status',
        'city',
        'legacy_ntd_id',
        "last_report_year", # NEW COLUMN
        'mode',
        'ntd_id',
        'reporter_type',
        'reporting_module',
        'service',
        'state',
        'uace_code',
        'primary_uza_name',
        'uza_population',
        'year',
        'upt',
    )
    >> collect())

In [39]:
    
ntd_service = ntd_service.groupby(
    [
        "agency_name",
        "agency_status",
        "city",
        "state",
        "ntd_id",
        "primary_uza_name",
        "reporter_type",
        "mode",
        "service",
        "last_report_year",  # NEW COLUMN
        "year",
    ]
).agg({
    "upt": "sum"
}).sort_values(by="ntd_id").reset_index()

# Previous approach, replaced by dim_orgs below:
# print("read in new `ntd_id_to_rtpa_all_reporter_types` crosswalk")
# ntd_to_rtpa_crosswalk = pd.read_parquet(f"{GCS_FILE_PATH}ntd_id_rtpa_crosswalk_all_reporter_types.parquet")
# print("merge ntd data to crosswalk")

ntd_data_by_rtpa = ntd_service.merge(
    # ntd_to_rtpa_crosswalk,
    dim_orgs,
    how="left",
    left_on=[
        "ntd_id",
        # "agency", "reporter_type", "city"  # sometimes agency name, reporter type and city name change or are inconsistent, causing possible fanout
    ],
    right_on="ntd_id_2022",
    indicator=True,
).rename(
    columns={
        "actual_vehicles_passenger_car_revenue_hours": "vrh",
        "actual_vehicles_passenger_car_revenue_miles": "vrm",
        "unlinked_passenger_trips_upt": "upt",
        "agency_name_x": "agency_name",
        "agency_status_x": "agency_status",
        "city_x": "city",
        "state_x": "state",
        "reporter_type_x": "reporter_type",
        "agency_name_y": "xwalk_agency_name",
        "reporter_type_y": "xwalk_reporter_type",
        "agency_status_y": "xwalk_agency_status",
        "city_y": "xwalk_city",
        "state_y": "xwalk_state",
    }
)

ntd_data_by_rtpa.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3762 entries, 0 to 3761
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   agency_name       3762 non-null   object  
 1   agency_status     3762 non-null   object  
 2   city              3762 non-null   object  
 3   state             3762 non-null   object  
 4   ntd_id            3762 non-null   object  
 5   primary_uza_name  3762 non-null   object  
 6   reporter_type     3762 non-null   object  
 7   mode              3762 non-null   object  
 8   service           3762 non-null   object  
 9   last_report_year  3762 non-null   int64   
 10  year              3762 non-null   object  
 11  upt               3762 non-null   float64 
 12  name              3702 non-null   object  
 13  ntd_id_2022       3702 non-null   object  
 14  rtpa_name         3702 non-null   object  
 15  mpo_name          2874 non-null   object  
 16  _merge            3762 n

In [40]:
ntd_data_by_rtpa["_merge"].value_counts()

both          3702
left_only       60
right_only       0
Name: _merge, dtype: int64

In [18]:
list(ntd_data_by_rtpa[ntd_data_by_rtpa["_merge"]=="left_only"]["agency_name"].unique())

['Los Angeles County - Department of Public Works, Transit Operations, Athens MB',
 'Los Angeles County (LACDPW) - Department of Public Works, Transit Operations – Avocado Heights',
 'Los Angeles County (LACDPW) - Department of Public Works, Transit Operations – East Valinda',
 'Los Angeles County (LACDPW) - Department of Public Works, Transit Operations - Florence Firestone',
 'Los Angeles County - Department of Public Works, Transit Operations, King Medical Center MB',
 'Los Angeles County - Department of Public Works, Transit Operations, Lennox MB',
 'Los Angeles County (LACDPW) - Department of Public Works, Transit Operations - South Whittier',
 'Los Angeles County - Department of Public Works, Transit Operations, Whittier Et Al DR',
 'Los Angeles County - Department of Public Works, Transit Operations, Willowbrook MB',
 'Los Angeles County - Department of Public Works, Transit Operations, Willowbrook et. al. DR']

Can just manually add the RTPA name to these, since they will all be LA Metro for the report.
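
A minimal sketch of that manual fix on toy data. The `"LA Metro"` string is a placeholder for whatever label the report actually uses, and matching on "Department of Public Works" in the agency name is an assumption based on the ten names listed above.

```python
import pandas as pd

LA_COUNTY_RTPA = "LA Metro"  # placeholder label, not confirmed against dim_orgs

# Toy rows: one unmatched LA County DPW service, one matched agency,
# and one unrelated agency that is also missing an RTPA.
toy = pd.DataFrame(
    {
        "agency_name": [
            "Los Angeles County - Department of Public Works, Transit Operations, Athens MB",
            "Some Matched Agency",
            "Some Unmatched Agency",
        ],
        "rtpa_name": [None, "Some RTPA", None],
    }
)

# Only fill rows that are both missing an RTPA and are LA County DPW services.
needs_fix = toy["rtpa_name"].isna() & toy["agency_name"].str.contains(
    "Department of Public Works", regex=False
)
toy.loc[needs_fix, "rtpa_name"] = LA_COUNTY_RTPA

print(toy)
```

Restricting the fill to rows that are still missing an RTPA keeps the fix from overwriting values that came from `dim_orgs`.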

## Test: New Transit Performance Metrics

In [19]:
col_list=['agency_name',
          'agency_status',
          'city','ntd_id',
          'reporter_type',
          'reporting_module',
          'state',
          'mode',
          'service',
          'primary_uza_name',
          'year',]
    
    # get opex data
op_total = (
        tbls.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_opexp_total()
        >> select(
            _.agency_name,
            _.agency_status,
            _.city,
            _.mode,
            _.service,
            _.ntd_id,
            _.reporter_type,
            _.reporting_module,
            _.state,
            _.primary_uza_name,
            _.year,
            _.opexp_total,
        )
        >> filter(
            _.state == "CA",
            _.primary_uza_name.str.contains(", CA"),
            _.year.isin(year_list),
            _.opexp_total.notna(),
        )
        >> collect()
    )
    
    #get mode data
mode_upt = (
        tbls.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_upt()
        >> select(
            _.agency_name,
            _.agency_status,
            _.city,
            _.mode,
            _.service,
            _.ntd_id,
            _.reporter_type,
            _.reporting_module,
            _.state,
            _.primary_uza_name,
            _.year,
            _.upt,
        )
        >> filter(_.state == "CA",
                  _.primary_uza_name.str.contains(", CA"),
                  _.year.isin(year_list),
                  _.upt.notna()
                 )
        >> collect()
    )
    
    #get vrh data
mode_vrh = (
        tbls.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_vrh()
        >> select(
            _.agency_name,
            _.agency_status,
            _.city,
            _.mode,
            _.service,
            _.ntd_id,
            _.reporter_type,
            _.reporting_module,
            _.state,
            _.primary_uza_name,
            _.year,
            _.vrh,
        )
        >> filter(_.state == "CA",
                  _.primary_uza_name.str.contains(", CA"),
                  _.year.isin(year_list),
                  _.vrh.notna()
                 )
        >> collect()
    )
    
    # get vrm data
mode_vrm = (
        tbls.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_vrm()
        >> select(
            _.agency_name,
            _.agency_status,
            _.city,
            _.mode,
            _.service,
            _.ntd_id,
            _.reporter_type,
            _.reporting_module,
            _.state,
            _.primary_uza_name,
            _.year,
            _.vrm,
        )
        >> filter(_.state == "CA",
                  _.primary_uza_name.str.contains(", CA"),
                  _.year.isin(year_list),
                  _.vrm.notna()
                 )
        >> collect()
    )
    
    # merge upt to vrh
merge_upt_vrh = mode_upt.merge(
        mode_vrh,
        on= col_list,
        how="left",
        indicator=True,
    )
    
    #then merge vrm to previous
merge_upt_vrh_vrm = merge_upt_vrh.drop(columns="_merge").merge(
        mode_vrm,
        on = col_list,
        how = "left",
        indicator=True
    )
    
    # merge in opex total
merge_opex_upt_vrm_vrh = merge_upt_vrh_vrm.drop(columns="_merge").merge(
        op_total,
        on= col_list,
        how="left",
        indicator=True
    )
merge_opex_upt_vrm_vrh["opexp_total"] = merge_opex_upt_vrm_vrh["opexp_total"].fillna(0)
 

In [20]:
merge_opex_upt_vrm_vrh.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2091 entries, 0 to 2090
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   agency_name       2091 non-null   object  
 1   agency_status     2091 non-null   object  
 2   city              2091 non-null   object  
 3   mode              2091 non-null   object  
 4   service           2091 non-null   object  
 5   ntd_id            2091 non-null   object  
 6   reporter_type     2091 non-null   object  
 7   reporting_module  2091 non-null   object  
 8   state             2091 non-null   object  
 9   primary_uza_name  2091 non-null   object  
 10  year              2091 non-null   object  
 11  upt               2091 non-null   int64   
 12  vrh               2091 non-null   int64   
 13  vrm               2091 non-null   int64   
 14  opexp_total       2091 non-null   float64 
 15  _merge            2091 non-null   category
dtypes: category(1), float64(

In [21]:
dim_orgs.columns

Index(['name', 'ntd_id_2022', 'rtpa_name'], dtype='object')

In [22]:
   
# Previous approach: read in ntd_id-to-RTPA xwalk
# xwalk_path = "ntd_id_rtpa_crosswalk_all_reporter_types.parquet"
# rtpa_ntd_xwalk = pd.read_parquet(f"{GCS_FILE_PATH}{xwalk_path}")

# merge in dim_orgs in place of the xwalk
merge_metrics_rtpa = merge_opex_upt_vrm_vrh.drop(columns="_merge").merge(
    # rtpa_ntd_xwalk,
    dim_orgs,
    # on=[
    #     "ntd_id",
    #     "city",
    #     "state",
    #     "agency_name",
    #     "reporter_type",
    #     "agency_status"
    # ],
    left_on="ntd_id",
    right_on="ntd_id_2022",
    how="left",
    indicator=True,
)

merge_metrics_rtpa["opexp_total"] = merge_metrics_rtpa["opexp_total"].astype("int64")
# merge_metrics_rtpa["mode"] = merge_metrics_rtpa["mode"].map(NTD_MODES)
# merge_metrics_rtpa["service"] = merge_metrics_rtpa["service"].map(NTD_TOS)
merge_metrics_rtpa.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2091 entries, 0 to 2090
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   agency_name       2091 non-null   object  
 1   agency_status     2091 non-null   object  
 2   city              2091 non-null   object  
 3   mode              2091 non-null   object  
 4   service           2091 non-null   object  
 5   ntd_id            2091 non-null   object  
 6   reporter_type     2091 non-null   object  
 7   reporting_module  2091 non-null   object  
 8   state             2091 non-null   object  
 9   primary_uza_name  2091 non-null   object  
 10  year              2091 non-null   object  
 11  upt               2091 non-null   int64   
 12  vrh               2091 non-null   int64   
 13  vrm               2091 non-null   int64   
 14  opexp_total       2091 non-null   int64   
 15  name              2031 non-null   object  
 16  ntd_id_2022       2031 n

In [23]:
merge_metrics_rtpa[merge_metrics_rtpa["_merge"]=="left_only"]["agency_name"].unique().tolist()

# same results as previous; these exceptions are easy to handle (manually assigning LA Metro as the RTPA for analyses)

['Los Angeles County (LACDPW) - Department of Public Works, Transit Operations - Florence Firestone',
 'Los Angeles County (LACDPW) - Department of Public Works, Transit Operations - South Whittier',
 'Los Angeles County (LACDPW) - Department of Public Works, Transit Operations – Avocado Heights',
 'Los Angeles County (LACDPW) - Department of Public Works, Transit Operations – East Valinda',
 'Los Angeles County - Department of Public Works, Transit Operations, Athens MB',
 'Los Angeles County - Department of Public Works, Transit Operations, King Medical Center MB',
 'Los Angeles County - Department of Public Works, Transit Operations, Lennox MB',
 'Los Angeles County - Department of Public Works, Transit Operations, Whittier Et Al DR',
 'Los Angeles County - Department of Public Works, Transit Operations, Willowbrook MB',
 'Los Angeles County - Department of Public Works, Transit Operations, Willowbrook et. al. DR']
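
The manual fix mentioned in the comment above could look roughly like this. It is a minimal sketch on toy data, not the production code: the toy rows, the `LACDPW_PATTERN` and `MANUAL_RTPA` names, and the RTPA value chosen are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for merge_metrics_rtpa (values are made up for illustration).
toy = pd.DataFrame(
    {
        "agency_name": [
            "Los Angeles County - Department of Public Works, Transit Operations, Athens MB",
            "Some Matched Agency",
        ],
        "rtpa_name": [None, "Example RTPA"],
        "_merge": ["left_only", "both"],
    }
)

# Hypothetical pattern and RTPA value; adjust both to the analysis at hand.
LACDPW_PATTERN = "Department of Public Works, Transit Operations"
MANUAL_RTPA = "Los Angeles County Metropolitan Transportation Authority"

# Only touch rows that failed to match AND look like the LA County Public Works entries.
needs_fix = (toy["_merge"] == "left_only") & toy["agency_name"].str.contains(
    LACDPW_PATTERN, regex=False
)
toy.loc[needs_fix, "rtpa_name"] = MANUAL_RTPA
```

Restricting the assignment to `left_only` rows keeps any RTPA already supplied by `dim_orgs` untouched.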

## Test: Monthly NTD Ridership Report

In [41]:
full_upt = (
    tbls.mart_ntd.dim_monthly_ridership_with_adjustments() 
    >> filter(
        _.period_year.isin(year_list) # NEW FILTER
        #_.mode_type_of_service_status == "Active", # NEW FILTER, but it's possible for modes to go from active to inactive during the analysis year
    )
    >> collect()).rename(
        columns = {
            "mode_type_of_service_status": "Status",
            "primary_uza_name":"uza_name"
        })
    
full_upt = full_upt[full_upt.agency.notna()].reset_index(drop=True)
    
    # full_upt.to_parquet(
    #     f"{GCS_FILE_PATH}ntd_monthly_ridership_{year}_{month}.parquet"
    # )
    
ca = full_upt[(full_upt["uza_name"].str.contains(", CA")) & 
            (full_upt.agency.notna())].reset_index(drop=True)
    
    # crosswalk = pd.read_csv(
    #     f"gs://calitp-analytics-data/data-analyses/ntd/ntd_id_rtpa_crosswalk.csv", 
    #     dtype = {"ntd_id": "str"}
    # #have to rename NTD ID col to match the dim table
    # )#.rename(columns={"NTD ID": "ntd_id"})

In [42]:
ca.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30537 entries, 0 to 30536
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   key                30537 non-null  object             
 1   ntd_id             30537 non-null  object             
 2   legacy_ntd_id      28710 non-null  object             
 3   agency             30537 non-null  object             
 4   reporter_type      30537 non-null  object             
 5   period_year_month  30537 non-null  object             
 6   period_year        30537 non-null  object             
 7   period_month       30537 non-null  object             
 8   uza_name           30537 non-null  object             
 9   primary_uza_code   30537 non-null  object             
 10  _3_mode            30537 non-null  object             
 11  mode               30537 non-null  object             
 12  mode_name          30537 non-null  object     

In [43]:
def check_last_report_year(df:pd.DataFrame) -> pd.DataFrame:
    """
    Input the monthly ridership data. Will be merged to `mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_upt`
    to attach `last_report_year` column. 
    
    df will be filtered to exclude ntd_ids with `last_report_year` BEFORE 2018.
    """
    
    last_report_year = (
        tbls.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_upt()
        >> filter(_.state == "CA",
                  _.primary_uza_name.str.contains(", CA"),
                  _.year.isin(year_list),
                  _.upt.notna()
                 )
        >> distinct(
                _.agency_name,
                #_.agency_status,
                _.ntd_id,
                _.last_report_year,
                 #_.state,
                 #_.primary_uza_name,
                 #_.year,
                 #_.upt,
            )
        >> collect()
    )
    
    merge = df.merge(
        last_report_year,
        on = "ntd_id",
        how = "left",
    )
    merge_2 = merge[merge["last_report_year"]>=2018]
    
    return merge_2
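
The filtering step in `check_last_report_year` can be checked without the warehouse. Below is a small sketch on toy data (the IDs and values are made up). One thing it shows: because the merge is a left join, any `ntd_id` missing from the lookup gets `NaN` for `last_report_year`, and `NaN >= 2018` is `False`, so those rows are dropped along with the pre-2018 ones.

```python
import pandas as pd

# Toy monthly ridership rows (made-up IDs).
ridership = pd.DataFrame(
    {"ntd_id": ["90001", "90001", "90002", "90003"], "upt": [10, 12, 5, 7]}
)

# Toy lookup of last_report_year; "90003" is intentionally absent.
lookup = pd.DataFrame({"ntd_id": ["90001", "90002"], "last_report_year": [2023, 2015]})

merged = ridership.merge(lookup, on="ntd_id", how="left")

# Same comparison as the helper: keep 2018 and later.
kept = merged[merged["last_report_year"] >= 2018]

# IDs removed by the filter, whether pre-2018 or unmatched in the lookup.
dropped_ids = sorted(set(merged["ntd_id"]) - set(kept["ntd_id"]))
```

If unmatched IDs should be reviewed rather than silently dropped, inspecting `dropped_ids` before returning is one option.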
    

In [44]:
ca_2 = check_last_report_year(ca)


In [45]:
ca_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27753 entries, 0 to 30536
Data columns (total 24 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   key                27753 non-null  object             
 1   ntd_id             27753 non-null  object             
 2   legacy_ntd_id      26100 non-null  object             
 3   agency             27753 non-null  object             
 4   reporter_type      27753 non-null  object             
 5   period_year_month  27753 non-null  object             
 6   period_year        27753 non-null  object             
 7   period_month       27753 non-null  object             
 8   uza_name           27753 non-null  object             
 9   primary_uza_code   27753 non-null  object             
 10  _3_mode            27753 non-null  object             
 11  mode               27753 non-null  object             
 12  mode_name          27753 non-null  object     

In [46]:
df = ca_2.merge(
        # Merging on too many columns can create problems 
        # because csvs and dtypes aren't stable / consistent 
        # for NTD ID, Legacy NTD ID, and UZA
        dim_orgs,
        left_on = "ntd_id",
        right_on = "ntd_id_2022",
        how = "left",
        indicator = True
    )
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27753 entries, 0 to 27752
Data columns (total 29 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   key                27753 non-null  object             
 1   ntd_id             27753 non-null  object             
 2   legacy_ntd_id      26100 non-null  object             
 3   agency             27753 non-null  object             
 4   reporter_type      27753 non-null  object             
 5   period_year_month  27753 non-null  object             
 6   period_year        27753 non-null  object             
 7   period_month       27753 non-null  object             
 8   uza_name           27753 non-null  object             
 9   primary_uza_code   27753 non-null  object             
 10  _3_mode            27753 non-null  object             
 11  mode               27753 non-null  object             
 12  mode_name          27753 non-null  object     

In [47]:
df["_merge"].value_counts()

# worked! every row matched (`both`); no `left_only` rows remain

both          27753
left_only         0
right_only        0
Name: _merge, dtype: int64
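
The row count is the same before and after the merge (27,753), which suggests `ntd_id_2022` is unique in `dim_orgs`. A hedged sketch of how that assumption could be enforced with the `validate` argument of `DataFrame.merge`, using toy frames with made-up values:

```python
import pandas as pd

left = pd.DataFrame({"ntd_id": ["90001", "90001", "90002"], "upt": [1, 2, 3]})

# Toy dim_orgs stand-in with a duplicated ID to trigger the check.
orgs = pd.DataFrame(
    {"ntd_id_2022": ["90001", "90002", "90002"], "rtpa_name": ["A", "B", "B-alt"]}
)

try:
    # "m:1" asserts that the right-hand keys are unique.
    left.merge(orgs, left_on="ntd_id", right_on="ntd_id_2022", how="left", validate="m:1")
    duplicate_keys_found = False
except pd.errors.MergeError:
    duplicate_keys_found = True

# After de-duplicating the right side, the merge keeps the left row count.
orgs_unique = orgs.drop_duplicates(subset="ntd_id_2022")
merged = left.merge(
    orgs_unique, left_on="ntd_id", right_on="ntd_id_2022", how="left", validate="m:1"
)
```

Which duplicate to keep is a data decision; `drop_duplicates` here simply keeps the first occurrence.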

In [31]:
left_overs = df[df["_merge"]=="left_only"][["ntd_id","agency","ntd_id_2022","rtpa_name"]].drop_duplicates()
left_overs

Unnamed: 0,ntd_id,agency,ntd_id_2022,rtpa_name


In [32]:
### what is the last report year for these left over agencies
left_overs["ntd_id"].unique().tolist()

[]

NTD time series

![image.png](attachment:e646f6ba-9782-46b1-8312-d12d17b30270.png)

monthly ntd ridership report page for MTC

![image.png](attachment:096a5110-9454-47b6-936c-9cd03dabc1b7.png)

annual ntd ridership report for MTC

![image.png](attachment:17adf8e7-ff57-4ebe-9abb-7efbe8969550.png)

**MOVE FORWARD WITH EXCLUDING THESE AGENCIES FROM ANY REPORT, THEY DO NOT MEET THE 2018 MIN YEAR**

---
# Explore work

In [None]:
xwalk = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/ntd/ntd_id_rtpa_crosswalk_all_reporter_types.parquet"
)

In [None]:
# crosswalk initially derived from this mart_ntd_funding_and_expenses table
mart_ntd = (
    tbls.mart_ntd_funding_and_expenses.fct_service_data_and_operating_expenses_time_series_by_mode_upt()
    >> filter(
        _.state =="CA",
        _.ntd_id.notna()
    )
    >> distinct(
        _.agency_name,
        _.ntd_id,
        _.last_report_year,
        _.agency_status,
        _.city 
    )
    
    >> collect()
)

mart_ntd.info()

In [None]:
dim_orgs = (
    tbls.mart_transit_database.dim_organizations()
    >> filter(
        _._is_current == True,
        # _.public_currently_operating == True,
        _.ntd_id_2022 != "",
    )
    >> select(
        _.name,
        _.ntd_id_2022,
        _.rtpa_name,
        _.key,
        # _.alias # alias is a list that is messing up merges down the line
    )
    >> collect()
)

dim_orgs.info()

In [None]:
currently_operating = (
    tbls.mart_transit_database.dim_organizations()
    >> filter(
        _._is_current == True, _.public_currently_operating == True, _.ntd_id_2022 != ""
    )
    >> select(
        _.name,
        _.ntd_id_2022,
        _.rtpa_name,
        _.key,
        # _.alias
    )
    >> collect()
)

currently_operating.info()

In [None]:
currently_operating[currently_operating["ntd_id_2022"].isna()]

In [None]:
# need to get county info?
county_bridge = (
    tbls.mart_transit_database.bridge_organizations_x_headquarters_county_geography()
    >> filter(
        _._is_current == True,
    )
    >> select(
        _.organization_key,
        _.organization_name,
        _.county_geography_name,
        _.county_geography_key,
    )
    >> collect()
)

county_bridge.info()

In [None]:
dim_org_county = dim_orgs.merge(
    county_bridge, how="left", left_on="key", right_on="organization_key"
)

display(
    dim_org_county.info(),
    # dim_org_county.head()
)

In [None]:
dim_org_county_2 = currently_operating.merge(
    county_bridge, how="left", left_on="key", right_on="organization_key"
)

display(
    dim_org_county_2.info(),
    # dim_org_county.head()
)

## What are the differences in NTD IDs between dim_org_county and dim_org_county_2?


In [None]:
id_check = dim_org_county.merge(dim_org_county_2, how="outer", indicator=True)

In [None]:
display(
    id_check["_merge"].value_counts(),
)

### Decision

It is possible for agencies to switch between currently operating and not operating. Therefore, RTPA values should be assigned to agencies regardless of operating status.

## What are the unique pairs of RTPA names and geography names?

In [None]:
rtpa_county = (
    dim_org_county[dim_org_county["rtpa_name"].notna()][
        ["county_geography_name", "rtpa_name"]
    ]
    .drop_duplicates()
    .sort_values(by="rtpa_name")
)
rtpa_county[rtpa_county["county_geography_name"].str.contains("Plumas")]

## Create dictionary of county name : RTPA name

In [None]:
county_rtpa_dict = rtpa_county.set_index("county_geography_name")["rtpa_name"].to_dict()
county_rtpa_dict.update(
    {
        "Plumas": "Plumas County Transportation Commission",
        "Sierra": "Sierra County Transportation Commission",
        # "Imperial":"Imperial County Transportation Commission", NO! Imperial is under SCAG
        # "Los Angeles":"Los Angeles County Metropolitan Transportation Authority", NO! LA is SCAG
    }
)
county_rtpa_dict["Imperial"]
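
One caveat with building the dictionary through `set_index(...).to_dict()`: if a county is paired with more than one RTPA, only the last pair survives. A small sketch of a pre-check, using placeholder names rather than real `dim_orgs` values:

```python
import pandas as pd

# Toy county/RTPA pairs; names are placeholders.
pairs = pd.DataFrame(
    {
        "county_geography_name": ["County A", "County A", "County B"],
        "rtpa_name": ["RTPA 1", "RTPA 2", "RTPA 3"],
    }
)

# Count distinct RTPAs per county and flag counties with more than one.
rtpa_per_county = pairs.groupby("county_geography_name")["rtpa_name"].nunique()
ambiguous = rtpa_per_county[rtpa_per_county > 1].index.tolist()

# to_dict() keeps the last value seen for a repeated key.
as_dict = pairs.set_index("county_geography_name")["rtpa_name"].to_dict()
```

Any county listed in `ambiguous` would need a manual decision before the dictionary is used for mapping.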

## Which agencies are missing RTPA names but have a county name?

In [None]:
dim_org_county[
    (dim_org_county["rtpa_name"].isna())
    & (dim_org_county["county_geography_name"].isna())
].drop_duplicates()

# added rtpa names in airtable

In [None]:
no_rtpa = dim_org_county[
    (dim_org_county["rtpa_name"].isna())
    & (dim_org_county["county_geography_name"].notna())
].drop_duplicates()

len(no_rtpa)

## !!THESE ORGS NEED RTPAs IN dim_orgs!!

In [None]:
no_rtpa["update_dim_org_rtpa_to"] = no_rtpa["county_geography_name"].map(
    county_rtpa_dict
)

In [None]:
display(
    no_rtpa.info(),
    no_rtpa[
        [
            "name",
            "ntd_id_2022",
            "rtpa_name",
            # "alias",
            "county_geography_name",
            "update_dim_org_rtpa_to",
        ]
    ].sort_values(by="county_geography_name"),
)

4/28/2025 - added RTPA to those agencies. 
4/28/2025 - changes took effect.


## What are the unique RTPAs in dim_orgs?
- are any of the SoCal CTCs in there?


Some County Transportation Commissions appear in the RTPA list:
- Ventura County Transportation Commission

Could not find:
- Los Angeles County Metropolitan Transportation Authority
- San Bernardino Associated Governments
- Riverside County Transportation Commission
- Imperial County Transportation Commission
- Orange County Transportation Authority

In [None]:
just_rtpa_name = (
    dim_orgs[dim_orgs["rtpa_name"].notna()]["rtpa_name"]
    .drop_duplicates()
    .reset_index(drop=True)
)

just_rtpa_name.info()

In [None]:
check_ctc = [
    "Los Angeles County Metropolitan Transportation Authority",
    "San Bernardino Associated Governments",
    "Riverside County Transportation Commission",
    "Imperial County Transportation Commission",
    "Orange County Transportation Authority",
    "Ventura County Transportation Commission",
]
just_rtpa_name[just_rtpa_name.isin(check_ctc)]

In [None]:
# substring search on RTPA names (simple containment, not true fuzzy matching)

ctc_substring = [
    "Los Angeles",
    "Bernardino ",
    "Riverside",
    "Imperial",
    "Orange",
    "Ventura",
]

for i in ctc_substring:
    print(just_rtpa_name[just_rtpa_name.str.contains(i)])

In [None]:
# check ctc list against agency names col
dim_org_county[dim_org_county["organization_name"].isin(check_ctc)]

In [None]:
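# San Bernardino was missing from the previous cell, likely because check_ctc lists
# "San Bernardino Associated Governments", which may not be the name used in dim_orgs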
dim_org_county[dim_org_county["name"].str.contains("San Bernardino")]

## SoCal CTC dictionary
To be used in other notebooks to reassign SoCal counties from SCAG to their county CTC.


In [None]:
dim_org_county[
    (dim_org_county["name"].isin(check_ctc))
    | (dim_org_county["name"].str.contains("San Bernardino"))
][["county_geography_name", "name"]]

In [None]:
# save this to a module for later imports
scag_to_ctc = {
    "Riverside": "Riverside County Transportation Commission",
    "Imperial": "Imperial County Transportation Commission",
    "Los Angeles": "Los Angeles County Metropolitan Transportation Authority",
    "Orange": "Orange County Transportation Authority",
    "San Bernardino": "San Bernardino County Transportation Authority",
    "Ventura": "Ventura County Transportation Commission",
}

scag_to_ctc
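
A possible way to apply this dictionary downstream is sketched below. The `SCAG_NAME` string and the toy rows are assumptions; the exact SCAG label in `dim_orgs` may differ. Rows whose county is not in the dictionary keep their original RTPA.

```python
import pandas as pd

# Hypothetical RTPA label for SCAG rows; check the actual value in dim_orgs.
SCAG_NAME = "Southern California Association of Governments"

# Subset of the scag_to_ctc dictionary, repeated here so the sketch is self-contained.
scag_to_ctc = {
    "Los Angeles": "Los Angeles County Metropolitan Transportation Authority",
    "Orange": "Orange County Transportation Authority",
}

toy = pd.DataFrame(
    {
        "county_geography_name": ["Los Angeles", "Orange", "Plumas"],
        "rtpa_name": [SCAG_NAME, SCAG_NAME, "Plumas County Transportation Commission"],
    }
)

is_scag = toy["rtpa_name"] == SCAG_NAME
mapped = toy["county_geography_name"].map(scag_to_ctc)

# Keep the original value unless the row is SCAG and its county has a CTC mapping.
toy["rtpa_name"] = toy["rtpa_name"].where(~is_scag | mapped.isna(), mapped)
```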

## Compare ntd ID from dim_org_county to ntd id in xwalk

In [None]:
xwalk_compare = xwalk.merge(
    dim_org_county,
    how="outer",
    left_on="ntd_id",
    right_on="ntd_id_2022",
    indicator=True,
)
xwalk_compare["_merge"].value_counts()

In [None]:
mart_ntd_compare = mart_ntd.merge(
    dim_org_county,
    how="outer",
    left_on="ntd_id",
    right_on="ntd_id_2022",
    indicator=True,
)
mart_ntd_compare["_merge"].value_counts()

These agencies do not appear in dim_orgs.

They only appear in the xwalk (the NTD IDs were pulled from an NTD table).

In [None]:
# unique agencies in the xwalk
xwalk_compare[xwalk_compare["_merge"] == "left_only"]

left_only_ntd_id = (
    xwalk_compare[xwalk_compare["_merge"] == "left_only"]["ntd_id"].unique().tolist()
)
display(len(left_only_ntd_id), type(left_only_ntd_id))

In [None]:
# unique agencies in dim_orgs
xwalk_compare[xwalk_compare["_merge"] == "right_only"]

right_only_ntd_id = (
    xwalk_compare[xwalk_compare["_merge"] == "right_only"]["ntd_id_2022"]
    .unique()
    .tolist()
)
display(len(right_only_ntd_id), type(right_only_ntd_id))
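
The same left-only / right-only ID lists can be cross-checked with plain set arithmetic, which avoids the merge entirely. A minimal sketch with made-up IDs (the variable names are illustrative):

```python
import pandas as pd

# Toy ID columns standing in for xwalk["ntd_id"] and dim_org_county["ntd_id_2022"].
xwalk_ids = pd.Series(["90001", "90002", "90003"], name="ntd_id")
dim_org_ids = pd.Series(["90002", "90003", "90004", None], name="ntd_id_2022")

# Drop nulls first so a missing ID is not counted as a distinct value.
xwalk_set = set(xwalk_ids.dropna())
dim_org_set = set(dim_org_ids.dropna())

only_in_xwalk = sorted(xwalk_set - dim_org_set)
only_in_dim_orgs = sorted(dim_org_set - xwalk_set)
in_both = sorted(xwalk_set & dim_org_set)
```

If the lengths here disagree with the `_merge` counts, null or duplicated IDs on either side are a likely cause.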

### checking if unique xwalk ntd_id appear anywhere in unfiltered dim_orgs

In [None]:
left_only_dim_orgs = (
    tbls.mart_transit_database.dim_organizations()
    >> filter(
        # _.ntd_id_2022.isin(left_only_ntd_id), # ntd_ids from xwalk, returned 0 matches
        _.ntd_id.isin(left_only_ntd_id)  # returned 0 matches
    )
    >> distinct(
        _.name,
        _.ntd_id,
        _.ntd_id_2022,
        _.rtpa_name,
        _.key,
        _._is_current,
        _.public_currently_operating,
    )
    >> collect()
)

len(left_only_dim_orgs)

# so the ntd_ids from the xwalk (initially pulled from mart_ntd) DO NOT APPEAR anywhere in dim_orgs
# what can we conclude?

In [None]:
right_only_dim_orgs = (
    tbls.mart_transit_database.dim_organizations()
    >> filter(
        _.ntd_id_2022.isin(
            right_only_ntd_id
        ),  # ntd_id from dim_org. returned 31 matches
        # _.ntd_id.isin(right_only_ntd_id) # returned 43 matches
    )
    >> distinct(
        _.name,
        _.ntd_id,
        _.ntd_id_2022,
        _.rtpa_name,
        _.key,
        _._is_current,
        _.public_currently_operating,
    )
    >> collect()
)

len(right_only_dim_orgs)

In [None]:
### further investigate the unique xwalk ntd_id (these ids come from mart_ntd.)
# are there any similar ntd names in dim orgs?

xwalk_compare[xwalk_compare["_merge"] == "left_only"].sort_values(by="agency_name").reset_index(drop=True).value_counts("agency_name")

In [None]:
### further investigate the unique xwalk ntd_id (these ids come from mart_ntd.)
# are there any similar ntd names in dim orgs?

mart_ntd_compare[mart_ntd_compare["_merge"] == "left_only"].sort_values(by="agency_name").reset_index(drop=True).value_counts("agency_name")

In [None]:
mart_ntd_compare[mart_ntd_compare["_merge"] == "left_only"].sort_values(by="agency_name")

A lot of these agency names include acronyms, or are sub-departments of a parent agency (e.g. public works).



In [None]:
check_names = [
    "ATC",
    "Vancom",
    "Chico",
    "Chula Vista",
    "Alameda Ferry",
    "Benicia",
    "Irvine",
    "Merced",
    "Vallejo",
    "San Diego",
    "DAVE",
    "Imperial",
    "LACMTA",
    "Laidlaw",
    "Department of Public Works",
    "MTA",
    "MTS",
    "Monterey",
    "National City",
    "Outreach",
    "Paratransit",
    "Paso Robles",
    "Ryder",
    "San Diego Trolley",
    "San Gabriel",
    "Stanislaus",
]

In [None]:
all_dim_orgs = (
    tbls.mart_transit_database.dim_organizations()
    >> filter(
        # _._is_current == True,
        # _.name.LIKE("%San%")
    )
    # >> distinct(
    #   _.name,
    #  _.ntd_id,
    # _.ntd_id_2022,
    # _.rtpa_name,
    # _.key,
    # _._is_current,
    # _.public_currently_operating
    # )
    >> collect()
)
all_dim_orgs.info()

In [None]:
for i in check_names:
    display(
        f"""Results for agency_name containing: {i}""",
        all_dim_orgs[all_dim_orgs["name"].str.contains(i, na=False)]["name"]
        .unique()
        .tolist(),
        # all_dim_orgs[all_dim_orgs["name"].str.contains(i, na=False)]#.value_counts("name")
    )
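
The loop above prints results term by term. A variant that collects the matches into a dictionary makes it easier to see which search terms found nothing. This is a sketch on toy names (all values made up); `regex=False` treats each term as a literal string.

```python
import pandas as pd

# Toy organization names standing in for all_dim_orgs["name"].
names = pd.Series(["City of Chico", "San Diego Trolley, Inc.", "Example Transit", None])
terms = ["Chico", "San Diego", "MTS"]

# Map each search term to the unique names that contain it.
matches = {
    term: names[names.str.contains(term, regex=False, na=False)].unique().tolist()
    for term in terms
}

# Terms with no hit at all are the ones that need a closer look.
unmatched_terms = [term for term, found in matches.items() if not found]
```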