# QC `ntd_id_rtpa_crosswalk` against `dim_organizations`

As of 4/21/2025, `dim_orgs` has an RTPA column. Its values were set from a previous `rtpa/mpo` column. The `ntd_id_rtpa_crosswalk`
was created by pulling all the agencies in NTD and assigning an RTPA based on their city location.

## Tasks
- see if the ntd_id values in the xwalk appear in dim_orgs and vice versa.
    - are all ntd_id from xwalk in dim_orgs? NO
    - are all ntd_id from dim_orgs in xwalk? NO!
    
- do the ntd_id/rtpa pairs from xwalk match dim_orgs?
    - SCAG accounts for 6 counties in SoCal (Imperial, Los Angeles, Orange, Riverside, San Bernardino, Ventura)

- refactor analyses that use the old xwalk
    - remove old xwalk
    - replace with rtpa data from dim_orgs
    - make manual adjustments to SCAG agencies >> separate out to the SoCal CTCs (Orange, Imperial, San Bernardino, etc.)
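
The first task (do the IDs on each side appear on the other?) can be sketched with plain set differences. This is a minimal illustration on made-up IDs, not the real tables; `xwalk_demo` and `dim_orgs_demo` are hypothetical stand-ins that only borrow the column names `ntd_id` and `ntd_id_2022`.

```python
import pandas as pd

# Toy stand-ins for the real tables; the IDs are invented for illustration.
xwalk_demo = pd.DataFrame({"ntd_id": ["90001", "90002", "90003"]})
dim_orgs_demo = pd.DataFrame({"ntd_id_2022": ["90002", "90003", "90004"]})

# Drop missing values so NaN does not show up as an "ID".
xwalk_ids = set(xwalk_demo["ntd_id"].dropna())
dim_orgs_ids = set(dim_orgs_demo["ntd_id_2022"].dropna())

# IDs present on one side only, in each direction.
only_in_xwalk = sorted(xwalk_ids - dim_orgs_ids)
only_in_dim_orgs = sorted(dim_orgs_ids - xwalk_ids)

print("only in xwalk:", only_in_xwalk)
print("only in dim_orgs:", only_in_dim_orgs)
```

If both lists are empty, the two ID sets are identical; otherwise each list is the follow-up worklist for that side.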


---
Findings
- 15 dim_orgs rows have an ntd_id that is not in my xwalk (`right_only` in the outer merge below)
- vice versa, 37 xwalk rows have an ntd_id that is not in dim_orgs (`left_only`)

work-plan
- fill in the missing RTPA columns in dim_orgs
- what is the process of adding new orgs to dim_orgs? sounds like a lot of work (filling in all the columns)

In [1]:
import pandas as pd
from calitp_data_analysis.tables import tbls
from siuba import _, collect, count, filter, show_query, select

pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", None)

In [2]:
xwalk = pd.read_parquet("gs://calitp-analytics-data/data-analyses/ntd/ntd_id_rtpa_crosswalk_all_reporter_types.parquet")

In [3]:
dim_orgs = (
    tbls.mart_transit_database.dim_organizations()
    >> filter(
        _._is_current == True,
        #_.public_currently_operating == True,
        _.ntd_id_2022 != ""
    )
    >> select(
        _.name,
        _.ntd_id_2022,
        _.rtpa_name,
        _.key
    )
    >> collect()
)

dim_orgs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232 entries, 0 to 231
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         232 non-null    object
 1   ntd_id_2022  232 non-null    object
 2   rtpa_name    203 non-null    object
 3   key          232 non-null    object
dtypes: object(4)
memory usage: 7.4+ KB


In [4]:
currently_operating = (
    tbls.mart_transit_database.dim_organizations()
    >> filter(
        _._is_current == True,
        _.public_currently_operating == True,
        _.ntd_id_2022 != ""
    )
    >> select(
        _.name,
        _.ntd_id_2022,
        _.rtpa_name,
        _.key
    )
    >> collect()
)

currently_operating.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         180 non-null    object
 1   ntd_id_2022  180 non-null    object
 2   rtpa_name    180 non-null    object
 3   key          180 non-null    object
dtypes: object(4)
memory usage: 5.8+ KB


In [5]:
currently_operating[currently_operating["ntd_id_2022"].isna()]

Unnamed: 0,name,ntd_id_2022,rtpa_name,key


In [6]:
# need to get county info? 
county_bridge = (
    tbls.mart_transit_database.bridge_organizations_x_headquarters_county_geography()
    >> filter(
        _._is_current == True,

    )
    >> select(
        _.organization_key,
        _.organization_name,
        _.county_geography_name,
        _.county_geography_key
    )
    >> collect()
)

county_bridge.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1294 entries, 0 to 1293
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   organization_key       1294 non-null   object
 1   organization_name      1294 non-null   object
 2   county_geography_name  1294 non-null   object
 3   county_geography_key   1294 non-null   object
dtypes: object(4)
memory usage: 40.6+ KB


In [7]:
dim_org_county = dim_orgs.merge(
    county_bridge,
    how="left",
    left_on="key",
    right_on="organization_key"
)

display(
    dim_org_county.info(),
    #dim_org_county.head()
)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 232 entries, 0 to 231
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   name                   232 non-null    object
 1   ntd_id_2022            232 non-null    object
 2   rtpa_name              203 non-null    object
 3   key                    232 non-null    object
 4   organization_key       232 non-null    object
 5   organization_name      232 non-null    object
 6   county_geography_name  232 non-null    object
 7   county_geography_key   232 non-null    object
dtypes: object(8)
memory usage: 16.3+ KB


None

In [8]:
dim_org_county_2 = currently_operating.merge(
    county_bridge,
    how="left",
    left_on="key",
    right_on="organization_key"
)

display(
    dim_org_county_2.info(),
    #dim_org_county.head()
)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 180 entries, 0 to 179
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   name                   180 non-null    object
 1   ntd_id_2022            180 non-null    object
 2   rtpa_name              180 non-null    object
 3   key                    180 non-null    object
 4   organization_key       180 non-null    object
 5   organization_name      180 non-null    object
 6   county_geography_name  180 non-null    object
 7   county_geography_key   180 non-null    object
dtypes: object(8)
memory usage: 12.7+ KB


None

## What are the differences in NTD IDs between dim_org_county and dim_org_county_2?


In [9]:
id_check = dim_org_county.merge(
    dim_org_county_2, 
    how="outer", 
    indicator=True
)

In [10]:
display(
    id_check["_merge"].value_counts(),
)

both          180
left_only      52
right_only      0
Name: _merge, dtype: int64

### Decision

It is possible for agencies to switch between currently operating and not operating. Therefore, RTPA values should be assigned to agencies regardless of operating status.
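
To see which orgs are affected by this decision, the `left_only` rows of the outer merge can be pulled out by name. A minimal sketch on toy data; `all_orgs` and `operating` are hypothetical frames standing in for `dim_org_county` and `dim_org_county_2`.

```python
import pandas as pd

# Toy stand-ins: every org, and the subset flagged as currently operating.
all_orgs = pd.DataFrame({"key": ["a", "b", "c"], "name": ["Org A", "Org B", "Org C"]})
operating = pd.DataFrame({"key": ["a", "c"], "name": ["Org A", "Org C"]})

# indicator=True adds a "_merge" column: both / left_only / right_only.
check = all_orgs.merge(operating, how="outer", on=["key", "name"], indicator=True)

# left_only = orgs that exist but are not currently operating.
not_operating = check.loc[check["_merge"] == "left_only", ["key", "name"]]
print(not_operating)
```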

## What are the unique pairs of RTPA names and county geography names?

In [31]:
rtpa_county = (
    dim_org_county[dim_org_county["rtpa_name"].notna()][["county_geography_name", "rtpa_name"]]
    .drop_duplicates()
    .sort_values(by="rtpa_name")
)
# Plumas has no county/RTPA pair in dim_orgs (empty result), so it is added to the dictionary by hand
rtpa_county[rtpa_county["county_geography_name"].str.contains("Plumas")]

Unnamed: 0,county_geography_name,rtpa_name


## Create a dictionary of county name : RTPA name

In [32]:
county_rtpa_dict = rtpa_county.set_index("county_geography_name")["rtpa_name"].to_dict()
county_rtpa_dict.update(
    {
        'Plumas':'Plumas County Transportation Commission',
        'Sierra':'Sierra County Transportation Commission'
    }
)
county_rtpa_dict

{'Alpine': 'Alpine County Local Transportation Commission',
 'Amador': 'Amador County Transportation Commission',
 'Butte': 'Butte County Association of Governments',
 'Calaveras': 'Calaveras Council of Governments',
 'Colusa': 'Colusa County Transportation Commission',
 'San Benito': 'Council of San Benito County Governments',
 'Del Norte': 'Del Norte Local Transportation Commission',
 'El Dorado': 'Tahoe Regional Planning Agency',
 'Fresno': 'Fresno Council of Governments',
 'Glenn': 'Glenn County Transportation Commission',
 'Humboldt': 'Humboldt County Association of Governments',
 'Inyo': 'Inyo County Local Transportation Commission',
 'Kern': 'Kern Council of Governments',
 'Kings': 'Kings County Association of Governments',
 'Lake': 'Lake County/City Area Planning Council',
 'Lassen': 'Lassen County Transportation Commission',
 'Madera': 'Madera County Transportation Commission',
 'Mariposa': 'Mariposa County Local Transportation Commission',
 'Mendocino': 'Mendocino Council of 
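
One caveat with building the dictionary this way: if a county is paired with more than one RTPA, `to_dict()` silently keeps only the last pair. The `El Dorado` entry above may or may not be such a case; the notebook does not check. A minimal sketch of that check, on invented pairs (the RTPA names here are placeholders):

```python
import pandas as pd

# Hypothetical pairs; the repeated county imitates one split across two RTPAs.
pairs = pd.DataFrame(
    {
        "county_geography_name": ["El Dorado", "El Dorado", "Kern"],
        "rtpa_name": ["RTPA one", "RTPA two", "RTPA three"],
    }
)

# Count distinct RTPAs per county and keep the counties with more than one.
rtpa_per_county = pairs.groupby("county_geography_name")["rtpa_name"].nunique()
ambiguous = rtpa_per_county[rtpa_per_county > 1].index.tolist()
print("counties with more than one RTPA:", ambiguous)

# For a repeated index label, to_dict() keeps the last value it sees.
as_dict = pairs.set_index("county_geography_name")["rtpa_name"].to_dict()
print(as_dict)
```

Any county listed in `ambiguous` would need a manual decision rather than a plain county lookup.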

## Which agencies are missing an RTPA name but have a county name?

In [14]:
dim_org_county[(dim_org_county["rtpa_name"].isna()) & (dim_org_county["county_geography_name"].isna())].drop_duplicates()

Unnamed: 0,name,ntd_id_2022,rtpa_name,key,organization_key,organization_name,county_geography_name,county_geography_key


In [13]:
no_rtpa = dim_org_county[
    (dim_org_county["rtpa_name"].isna()) & 
    (dim_org_county["county_geography_name"].notna())
].drop_duplicates()

len(no_rtpa)

29

## THESE ORGS NEED RTPAs IN dim_orgs!

In [33]:
no_rtpa["new_rtpa"] = no_rtpa["county_geography_name"].map(county_rtpa_dict)

In [35]:
display(
    no_rtpa.info(),
    no_rtpa[[
        "name",
        "ntd_id_2022",
        "rtpa_name",
        "county_geography_name",
        "new_rtpa"
    ]]
)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 29 entries, 2 to 73
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   name                   29 non-null     object
 1   ntd_id_2022            29 non-null     object
 2   rtpa_name              0 non-null      object
 3   key                    29 non-null     object
 4   organization_key       29 non-null     object
 5   organization_name      29 non-null     object
 6   county_geography_name  29 non-null     object
 7   county_geography_key   29 non-null     object
 8   new_rtpa               29 non-null     object
dtypes: object(9)
memory usage: 2.3+ KB


None

Unnamed: 0,name,ntd_id_2022,rtpa_name,county_geography_name,new_rtpa
2,Bishop Paiute Tribe,99268,,Inyo,Inyo County Local Transportation Commission
3,Calaveras County,91063,,Calaveras,Calaveras Council of Governments
4,California Department of Transportation,9R02,,Sacramento,Sacramento Area Council of Governments
5,California Vanpool Authority,90230,,Kings,Kings County Association of Governments
13,City of Claremont,90296,,Los Angeles,Southern California Association of Governments
16,City of Davis,90167,,Yolo,Sacramento Area Council of Governments
19,City of Folsom,90220,,Sacramento,Sacramento Area Council of Governments
21,City of La Habra Heights,99445,,Los Angeles,Southern California Association of Governments
23,City of Lakewood,90301,,Los Angeles,Southern California Association of Governments
24,City of Lincoln,90235,,Placer,Placer County Transportation Planning Agency
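
The same county lookup could fill `rtpa_name` in place while leaving existing values untouched. A minimal sketch, assuming a county-to-RTPA dictionary like `county_rtpa_dict`; the org names and the `Nowhere` county are invented, and `rtpa_filled` is a hypothetical column name.

```python
import pandas as pd

# One real pair taken from the output above; the orgs below are made up.
county_to_rtpa = {"Inyo": "Inyo County Local Transportation Commission"}

orgs = pd.DataFrame(
    {
        "name": ["Org with RTPA", "Org without RTPA", "Org in unmapped county"],
        "county_geography_name": ["Inyo", "Inyo", "Nowhere"],
        "rtpa_name": ["Existing RTPA", None, None],
    }
)

# Look up an RTPA for every row, then use it only where rtpa_name is missing.
mapped = orgs["county_geography_name"].map(county_to_rtpa)
orgs["rtpa_filled"] = orgs["rtpa_name"].fillna(mapped)

# Rows whose county is not in the dictionary stay missing and need manual review.
still_missing = orgs[orgs["rtpa_filled"].isna()]
print(orgs[["name", "rtpa_filled"]])
```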


## What are the unique RTPAs in dim_orgs?
- are any of the SOCAL CTCs in there?


Some County Transportation Commissions are in the RTPA list
- Ventura County Transportation Commission

Could not find
- Los Angeles County Metropolitan Transportation Authority
- San Bernardino Associated Governments
- Riverside County Transportation Commission
- Imperial County Transportation Commission
- Orange County Transportation Authority

In [15]:
just_rtpa_name = dim_orgs[dim_orgs["rtpa_name"].notna()]["rtpa_name"].drop_duplicates().reset_index(drop=True)

just_rtpa_name.info()

<class 'pandas.core.series.Series'>
RangeIndex: 40 entries, 0 to 39
Series name: rtpa_name
Non-Null Count  Dtype 
--------------  ----- 
40 non-null     object
dtypes: object(1)
memory usage: 452.0+ bytes


In [16]:
check_ctc = [
    "Los Angeles County Metropolitan Transportation Authority",
    "San Bernardino Associated Governments",
    "Riverside County Transportation Commission",
    "Imperial County Transportation Commission",
    "Orange County Transportation Authority",
    "Ventura County Transportation Commission",
]
just_rtpa_name[just_rtpa_name.isin(check_ctc)]

18    Ventura County Transportation Commission
Name: rtpa_name, dtype: object

In [17]:
# substring search with str.contains (not true fuzzy matching)

ctc_substring=[
    "Los Angeles",
    "San Bernardino",
    "Riverside",
    "Imperial",
    "Orange",
]

for i in ctc_substring:
    print(just_rtpa_name[just_rtpa_name.str.contains(i)])

Series([], Name: rtpa_name, dtype: object)
Series([], Name: rtpa_name, dtype: object)
Series([], Name: rtpa_name, dtype: object)
Series([], Name: rtpa_name, dtype: object)
Series([], Name: rtpa_name, dtype: object)
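
The loop above could also be done in one pass by joining the substrings into a single pattern. A minimal sketch on a three-name toy series; `Ventura` is added to the substring list here only so the example returns a hit. `re.escape` guards against regex metacharacters in a substring, and `case=False` makes the match case-insensitive.

```python
import re

import pandas as pd

# Toy series standing in for just_rtpa_name.
names = pd.Series(
    [
        "Ventura County Transportation Commission",
        "Southern California Association of Governments",
        "Kern Council of Governments",
    ],
    name="rtpa_name",
)

substrings = ["Los Angeles", "San Bernardino", "Riverside", "Imperial", "Orange", "Ventura"]

# Build one alternation pattern: "Los\ Angeles|San\ Bernardino|..."
pattern = "|".join(re.escape(s) for s in substrings)

# na=False treats missing names as non-matches instead of propagating NaN.
hits = names[names.str.contains(pattern, case=False, regex=True, na=False)]
print(hits)
```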


In [18]:
# check ctc list against agency names col
dim_org_county[dim_org_county["name"].isin(check_ctc)]

# 5 of the 6 names in check_ctc appear as orgs in dim_orgs; "San Bernardino Associated Governments" has no exact match

Unnamed: 0,name,ntd_id_2022,rtpa_name,key,organization_key,organization_name,county_geography_name,county_geography_key
56,Riverside County Transportation Commission,90218,,8d13f55a09eeb4e37328ee8481d332d4,8d13f55a09eeb4e37328ee8481d332d4,Riverside County Transportation Commission,Riverside,17a6841cb057ea751e22785fb9229596
168,Imperial County Transportation Commission,90226,Southern California Association of Governments,621fbb855822846119c106fd5bfa56b3,621fbb855822846119c106fd5bfa56b3,Imperial County Transportation Commission,Imperial,56a0304a6b0ccdc0dd0dbfae798a45e2
176,Los Angeles County Metropolitan Transportation Authority,90154,Southern California Association of Governments,9e96bde610e80d71f500eea119c4723c,9e96bde610e80d71f500eea119c4723c,Los Angeles County Metropolitan Transportation Authority,Los Angeles,8a8da539caf4f046025b97a5b4b9564b
187,Orange County Transportation Authority,90036,Southern California Association of Governments,47552ed0c038e35ee6f21ec8eb2cb5d8,47552ed0c038e35ee6f21ec8eb2cb5d8,Orange County Transportation Authority,Orange,aef4652aba294c8fd5ee922f9741ff6a
225,Ventura County Transportation Commission,90164,Ventura County Transportation Commission,4f7fa398d9c1c8c75310e13df4818015,4f7fa398d9c1c8c75310e13df4818015,Ventura County Transportation Commission,Ventura,ac21d3cfb432219540f51c7658df90e9
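
For the task of separating SCAG agencies out to the SoCal CTCs, one option is to override `rtpa_name` by county for SCAG rows only. This is a sketch under assumptions: the CTC names come from the `check_ctc` list, using them as RTPA replacements is my assumption, and the orgs and `rtpa_adjusted` column are invented for illustration.

```python
import pandas as pd

SCAG = "Southern California Association of Governments"

# County -> CTC, using the names from check_ctc (assumed to be the desired labels).
county_to_ctc = {
    "Los Angeles": "Los Angeles County Metropolitan Transportation Authority",
    "San Bernardino": "San Bernardino Associated Governments",
    "Riverside": "Riverside County Transportation Commission",
    "Imperial": "Imperial County Transportation Commission",
    "Orange": "Orange County Transportation Authority",
    "Ventura": "Ventura County Transportation Commission",
}

# Made-up orgs: two under SCAG, one under a different RTPA.
orgs = pd.DataFrame(
    {
        "name": ["Org A", "Org B", "Org C"],
        "county_geography_name": ["Los Angeles", "Orange", "Yolo"],
        "rtpa_name": [SCAG, SCAG, "Sacramento Area Council of Governments"],
    }
)

is_scag = orgs["rtpa_name"] == SCAG
ctc = orgs["county_geography_name"].map(county_to_ctc)

# Keep the original value unless the row is SCAG and its county has a CTC.
orgs["rtpa_adjusted"] = orgs["rtpa_name"].where(~is_scag | ctc.isna(), ctc)
print(orgs[["name", "rtpa_adjusted"]])
```

Writing to a new column keeps the original `rtpa_name` available for comparison.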


## Compare NTD IDs from dim_org_county to NTD IDs in xwalk

In [22]:
xwalk_compare = xwalk.merge(
    dim_org_county,
    how="outer",
    left_on="ntd_id",
    right_on="ntd_id_2022",
    indicator=True
)
xwalk_compare["_merge"].value_counts()

both          217
left_only      37
right_only     15
Name: _merge, dtype: int64

These agencies do not appear in dim_orgs.

They only appear in the xwalk (NTD IDs were pulled from an NTD table).
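
Before treating unmatched IDs as truly missing, it may be worth ruling out formatting differences in the merge keys. The notebook does not show that this is a problem here. A minimal sketch with a hypothetical helper `clean_ids` and invented IDs:

```python
import pandas as pd


def clean_ids(ids: pd.Series) -> pd.Series:
    """Cast to pandas string dtype and strip surrounding whitespace."""
    return ids.astype("string").str.strip()


# Invented IDs with stray whitespace on each side.
left = pd.DataFrame({"ntd_id": [" 90001", "90002"]})
right = pd.DataFrame({"ntd_id_2022": ["90001", "90002 "]})

# Merging the raw keys finds no matches because of the whitespace.
raw = left.merge(right, how="outer", left_on="ntd_id", right_on="ntd_id_2022", indicator=True)

left["ntd_id"] = clean_ids(left["ntd_id"])
right["ntd_id_2022"] = clean_ids(right["ntd_id_2022"])

# After cleaning, both IDs match.
cleaned = left.merge(right, how="outer", left_on="ntd_id", right_on="ntd_id_2022", indicator=True)

print(raw["_merge"].value_counts())
print(cleaned["_merge"].value_counts())
```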

In [23]:
xwalk_compare[xwalk_compare["_merge"]=="left_only"]

Unnamed: 0,ntd_id,agency_name,reporter_type,agency_status,city,state,RTPA,name,ntd_id_2022,rtpa_name,key,organization_key,organization_name,county_geography_name,county_geography_key,_merge
15,90021,Los Angeles County Metropolitan Transportation Authority (LACMTA),Full Reporter,Inactive,Los Angeles,CA,Southern California Association of Governments,,,,,,,,,left_only
21,90028,"City of Vallejo Transportation Program (Vallejo Transit, Baylink)",Full Reporter,Inactive,Vallejo,CA,Metropolitan Transportation Commission,,,,,,,,,left_only
34,90054,"San Diego Trolley, Inc. (MTS)",Full Reporter,Inactive,San Diego,CA,San Diego Association of Governments,,,,,,,,,left_only
35,90055,Monterey County RIDES,Full Reporter,Inactive,Salinas,CA,Transportation Agency for Monterey County,,,,,,,,,left_only
38,90077,Los Angeles County Transportation Commission / MTA,Full Reporter,Inactive,Los Angeles,CA,Southern California Association of Governments,,,,,,,,,left_only
53,90127,Chico Area Transit System City of Chico (CATS),Full Reporter,Inactive,Chico,CA,Butte County Association of Governments,,,,,,,,,left_only
56,90143,City of Merced Transit System,Full Reporter,Inactive,Merced,CA,Merced County Association of Governments,,,,,,,,,left_only
62,90150,City of Alameda Ferry Services,Full Reporter,Inactive,Alameda,CA,Metropolitan Transportation Commission,,,,,,,,,left_only
68,90158,"DAVE Transportation Services, Inc.",Full Reporter,Inactive,Sherman Oaks,CA,Southern California Association of Governments,,,,,,,,,left_only
70,90160,Outreach & Escort dba OUTREACH,Full Reporter,Inactive,San Jose,CA,Metropolitan Transportation Commission,,,,,,,,,left_only
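
The ten rows shown are all `Inactive`, but only part of the 37 `left_only` rows is visible. A tally by `agency_status` would show whether that holds for all of them. A minimal sketch on a toy frame that stands in for `xwalk_compare`; the rows are invented.

```python
import pandas as pd

# Toy stand-in for xwalk_compare with made-up rows.
compare = pd.DataFrame(
    {
        "ntd_id": ["1", "2", "3", "4"],
        "agency_status": ["Inactive", "Inactive", "Active", "Active"],
        "_merge": ["left_only", "left_only", "left_only", "both"],
    }
)

# Keep xwalk-only rows, then count them by status.
left_only = compare[compare["_merge"] == "left_only"]
status_counts = left_only["agency_status"].value_counts()
print(status_counts)
```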


In [24]:
xwalk_compare[xwalk_compare["_merge"]=="right_only"]

Unnamed: 0,ntd_id,agency_name,reporter_type,agency_status,city,state,RTPA,name,ntd_id_2022,rtpa_name,key,organization_key,organization_name,county_geography_name,county_geography_key,_merge
254,,,,,,,,Bishop Paiute Tribe,99268,,7416dba335568df67c5bcb5444fef5b7,7416dba335568df67c5bcb5444fef5b7,Bishop Paiute Tribe,Inyo,2043f1e3cb85e2f1651696047250ef0b,right_only
255,,,,,,,,California Department of Transportation,9R02,,a1e59256e9f14aed58b1bde2bd7fdc09,a1e59256e9f14aed58b1bde2bd7fdc09,California Department of Transportation,Sacramento,bd7bba0b0cb7727b2d5d6509e14104ae,right_only
256,,,,,,,,City of Hawaiian Gardens,99450,Southern California Association of Governments,42089bcb6206e2a1f9d089fa27ee97c4,42089bcb6206e2a1f9d089fa27ee97c4,City of Hawaiian Gardens,Los Angeles,8a8da539caf4f046025b97a5b4b9564b,right_only
257,,,,,,,,City of La Habra Heights,99445,,e806f59dc0303e3c345e799932fcdc68,e806f59dc0303e3c345e799932fcdc68,City of La Habra Heights,Los Angeles,8a8da539caf4f046025b97a5b4b9564b,right_only
258,,,,,,,,City of Palmdale,99448,,bef4aec3f78557a4487f1b968813d679,bef4aec3f78557a4487f1b968813d679,City of Palmdale,Los Angeles,8a8da539caf4f046025b97a5b4b9564b,right_only
259,,,,,,,,City of South El Monte,99443,Southern California Association of Governments,b3ca49e9e653678dd7ae04c6fc594df9,b3ca49e9e653678dd7ae04c6fc594df9,City of South El Monte,Los Angeles,8a8da539caf4f046025b97a5b4b9564b,right_only
260,,,,,,,,Hollywood Burbank Airport,99444,,6b172451cb8d65c8aeb4c20481e7c8f5,6b172451cb8d65c8aeb4c20481e7c8f5,Hollywood Burbank Airport,Los Angeles,8a8da539caf4f046025b97a5b4b9564b,right_only
261,,,,,,,,Plumas County Transportation Commission,2-005,,126e922ab5a5720d43535ad44d5f52f7,126e922ab5a5720d43535ad44d5f52f7,Plumas County Transportation Commission,Plumas,f6a66352382d6cf7e1e4487ebe978aa5,right_only
262,,,,,,,,"Quechan Tribe of the Fort Yuma Indian Reservation, California & Arizona",99310,,5ae1452aa49ea8fb0ed6f10be9ea7b11,5ae1452aa49ea8fb0ed6f10be9ea7b11,"Quechan Tribe of the Fort Yuma Indian Reservation, California & Arizona",Imperial,56a0304a6b0ccdc0dd0dbfae798a45e2,right_only
263,,,,,,,,Sacramento Area Council of Governments,90308,,730bbe38e9025b8b1caf479a33ebd8aa,730bbe38e9025b8b1caf479a33ebd8aa,Sacramento Area Council of Governments,Sacramento,bd7bba0b0cb7727b2d5d6509e14104ae,right_only
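
The remaining task, checking whether the ntd_id/RTPA pairs agree, applies to the `both` rows. A minimal sketch on a toy frame that reuses the column names `RTPA` (from the xwalk) and `rtpa_name` (from dim_orgs); the values are placeholders.

```python
import pandas as pd

# Toy stand-in for xwalk_compare: one agreeing pair, one disagreeing, one missing.
compare = pd.DataFrame(
    {
        "ntd_id": ["1", "2", "3"],
        "RTPA": ["RTPA X", "RTPA Y", "RTPA Z"],
        "rtpa_name": ["RTPA X", "RTPA W", None],
        "_merge": ["both", "both", "both"],
    }
)

matched = compare[compare["_merge"] == "both"]

# Only compare rows where both sides have a value; missing values are reported separately.
has_both = matched["RTPA"].notna() & matched["rtpa_name"].notna()
disagree = matched[has_both & (matched["RTPA"] != matched["rtpa_name"])]
missing_in_dim_orgs = matched[matched["rtpa_name"].isna()]

print(disagree[["ntd_id", "RTPA", "rtpa_name"]])
print(missing_in_dim_orgs[["ntd_id", "RTPA"]])
```

Separating disagreements from missing values avoids counting every null `rtpa_name` as a mismatch.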
