# QC `ntd_id_rtpa_crosswalk` against `dim_organizations`

As of 4/21/2025, `dim_orgs` has an RTPA column. Its values were carried over from a previous `rtpa/mpo` column. The `ntd_id_rtpa_crosswalk`
was created by pulling all the agencies in NTD and assigning each an RTPA based on its city location.

## Tasks
- see if the ntd_ids in the xwalk appear in dim_orgs and vice versa.
    - are all ntd_ids from xwalk in dim_orgs?
    - are there any ntd_ids from dim_orgs not in xwalk?

- do the ntd_id/rtpa pairs from xwalk match dim_orgs?
    - SCAG accounts for about 6 counties in SoCal

- refactor analyses that use the old xwalk
    - remove old xwalk
    - replace with rtpa data from dim_orgs
    - make manual adjustments to SCAG agencies: separate them out to the SoCal CTCs (Orange, Imperial, San Bernardino, etc.)
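
The first two tasks can be sketched as below, on toy data. The crosswalk column names `ntd_id` and `rtpa` are assumptions (the parquet schema is not shown in this notebook); `ntd_id_2022` and `rtpa_name` are the `dim_orgs` columns selected further down. All IDs and RTPA labels here are made up.

```python
import pandas as pd

# Toy stand-ins; the real crosswalk column names are assumed, not confirmed.
xwalk_toy = pd.DataFrame({
    "ntd_id": ["90001", "90002", "90003"],
    "rtpa": ["RTPA A", "RTPA B", "RTPA C"],
})
dim_orgs_toy = pd.DataFrame({
    "ntd_id_2022": ["90002", "90003", "90004"],
    "rtpa_name": ["RTPA B", "RTPA X", "RTPA D"],
})

xwalk_ids = set(xwalk_toy["ntd_id"])
dim_ids = set(dim_orgs_toy["ntd_id_2022"])

# IDs present on one side only
only_in_xwalk = sorted(xwalk_ids - dim_ids)
only_in_dim_orgs = sorted(dim_ids - xwalk_ids)

# For IDs found on both sides, keep rows where the RTPA assignments disagree
pairs = xwalk_toy.merge(
    dim_orgs_toy, left_on="ntd_id", right_on="ntd_id_2022", how="inner"
)
mismatched = pairs[pairs["rtpa"] != pairs["rtpa_name"]]

print(only_in_xwalk, only_in_dim_orgs)
print(mismatched[["ntd_id", "rtpa", "rtpa_name"]])
```

Rows with a null `rtpa_name` would also land in `mismatched`, since a missing value never compares equal to a string.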
    

In [1]:
import pandas as pd
from calitp_data_analysis.tables import tbls
from siuba import _, collect, count, filter, show_query, select

pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", None)

In [2]:
xwalk = pd.read_parquet("gs://calitp-analytics-data/data-analyses/ntd/ntd_id_rtpa_crosswalk_all_reporter_types.parquet")

In [7]:
dim_orgs = (
    tbls.mart_transit_database.dim_organizations()
    >> filter(
        _._is_current == True,
        #_.public_currently_operating == True,
        _.ntd_id_2022 != ""
    )
    >> select(
        _.name,
        _.ntd_id_2022,
        _.rtpa_name,
        _.key
    )
    >> collect()
)

dim_orgs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232 entries, 0 to 231
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         232 non-null    object
 1   ntd_id_2022  232 non-null    object
 2   rtpa_name    203 non-null    object
 3   key          232 non-null    object
dtypes: object(4)
memory usage: 7.4+ KB


In [8]:
currently_operating = (
    tbls.mart_transit_database.dim_organizations()
    >> filter(
        _._is_current == True,
        _.public_currently_operating == True,
        _.ntd_id_2022 != ""
    )
    >> select(
        _.name,
        _.ntd_id_2022,
        _.rtpa_name,
        _.key
    )
    >> collect()
)

currently_operating.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         180 non-null    object
 1   ntd_id_2022  180 non-null    object
 2   rtpa_name    180 non-null    object
 3   key          180 non-null    object
dtypes: object(4)
memory usage: 5.8+ KB


In [15]:
currently_operating[currently_operating["ntd_id_2022"].isna()]

Unnamed: 0,name,ntd_id_2022,rtpa_name,key


In [5]:
# get each organization's headquarters county from the bridge table
county_bridge = (
    tbls.mart_transit_database.bridge_organizations_x_headquarters_county_geography()
    >> filter(
        _._is_current == True,

    )
    >> select(
        _.organization_key,
        _.organization_name,
        _.county_geography_name,
        _.county_geography_key
    )
    >> collect()
)

county_bridge.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1294 entries, 0 to 1293
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   organization_key       1294 non-null   object
 1   organization_name      1294 non-null   object
 2   county_geography_name  1294 non-null   object
 3   county_geography_key   1294 non-null   object
dtypes: object(4)
memory usage: 40.6+ KB


In [16]:
dim_org_county = dim_orgs.merge(
    county_bridge,
    how="left",
    left_on="key",
    right_on="organization_key"
)

# info() prints its summary and returns None, so display() also shows "None" after it
display(
    dim_org_county.info(),
    #dim_org_county.head()
)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 232 entries, 0 to 231
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   name                   232 non-null    object
 1   ntd_id_2022            232 non-null    object
 2   rtpa_name              203 non-null    object
 3   key                    232 non-null    object
 4   organization_key       232 non-null    object
 5   organization_name      232 non-null    object
 6   county_geography_name  232 non-null    object
 7   county_geography_key   232 non-null    object
dtypes: object(8)
memory usage: 16.3+ KB


None

In [18]:
dim_org_county_2 = currently_operating.merge(
    county_bridge,
    how="left",
    left_on="key",
    right_on="organization_key"
)

# info() prints its summary and returns None, so display() also shows "None" after it
display(
    dim_org_county_2.info(),
    #dim_org_county_2.head()
)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 180 entries, 0 to 179
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   name                   180 non-null    object
 1   ntd_id_2022            180 non-null    object
 2   rtpa_name              180 non-null    object
 3   key                    180 non-null    object
 4   organization_key       180 non-null    object
 5   organization_name      180 non-null    object
 6   county_geography_name  180 non-null    object
 7   county_geography_key   180 non-null    object
dtypes: object(8)
memory usage: 12.7+ KB


None

## What are the differences in NTD IDs between dim_org_county and dim_org_county_2?


In [35]:
id_check = dim_org_county.merge(
    dim_org_county_2, 
    how="outer", 
    indicator=True
)

In [39]:
display(
    id_check["_merge"].value_counts(),
)

both          180
left_only      52
right_only      0
Name: _merge, dtype: int64
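
To see which agencies make up the `left_only` rows, the indicator column can be used as a filter. A minimal sketch on toy frames (agency names and IDs are made up):

```python
import pandas as pd

all_orgs = pd.DataFrame({
    "name": ["Agency A", "Agency B", "Agency C"],
    "ntd_id_2022": ["90001", "90002", "90003"],
})
# Pretend Agency B is not currently operating
operating = all_orgs[all_orgs["name"] != "Agency B"]

# With no `on=` argument, merge joins on every column the frames share
check = all_orgs.merge(operating, how="outer", indicator=True)

# Rows that exist only in the full list
not_operating = check.loc[check["_merge"] == "left_only", ["name", "ntd_id_2022"]]
print(not_operating)
```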

### Decision

It is possible for agencies to switch between currently operating and not operating. Therefore, RTPA values should be assigned to agencies regardless of operating status.

## What are the unique pairs of RTPA names and geography names?

In [75]:
rtpa_county = (
    dim_org_county[dim_org_county["rtpa_name"].notna()][["county_geography_name", "rtpa_name"]]
    .drop_duplicates()
    .sort_values(by="rtpa_name")
)

rtpa_county

Unnamed: 0,county_geography_name,rtpa_name
1,Alpine,Alpine County Local Transportation Commission
69,Amador,Amador County Transportation Commission
74,Butte,Butte County Association of Governments
75,Calaveras,Calaveras Council of Governments
157,Colusa,Colusa County Transportation Commission
197,San Benito,Council of San Benito County Governments
67,Del Norte,Del Norte Local Transportation Commission
160,El Dorado,El Dorado County Transportation Commission
108,Fresno,Fresno Council of Governments
163,Glenn,Glenn County Transportation Commission


## Create dictionary of county name : RTPA name

In [76]:
county_rtpa_dict = rtpa_county.set_index("county_geography_name")["rtpa_name"].to_dict()

county_rtpa_dict

{'Alpine': 'Alpine County Local Transportation Commission',
 'Amador': 'Amador County Transportation Commission',
 'Butte': 'Butte County Association of Governments',
 'Calaveras': 'Calaveras Council of Governments',
 'Colusa': 'Colusa County Transportation Commission',
 'San Benito': 'Council of San Benito County Governments',
 'Del Norte': 'Del Norte Local Transportation Commission',
 'El Dorado': 'Tahoe Regional Planning Agency',
 'Fresno': 'Fresno Council of Governments',
 'Glenn': 'Glenn County Transportation Commission',
 'Humboldt': 'Humboldt County Association of Governments',
 'Inyo': 'Inyo County Local Transportation Commission',
 'Kern': 'Kern Council of Governments',
 'Kings': 'Kings County Association of Governments',
 'Lake': 'Lake County/City Area Planning Council',
 'Lassen': 'Lassen County Transportation Commission',
 'Madera': 'Madera County Transportation Commission',
 'Mariposa': 'Mariposa County Local Transportation Commission',
 'Mendocino': 'Mendocino Council of 
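
`to_dict()` keeps only one value per key, so a county that appears with more than one RTPA is reduced to the last pairing in the frame. That may be why `El Dorado` is paired with `El Dorado County Transportation Commission` in the table but with `Tahoe Regional Planning Agency` in the dictionary, though the full table is not shown here to confirm it. A sketch for flagging such counties before building the dictionary, on toy rows:

```python
import pandas as pd

rtpa_county_toy = pd.DataFrame({
    "county_geography_name": ["El Dorado", "El Dorado", "Fresno"],
    "rtpa_name": [
        "El Dorado County Transportation Commission",
        "Tahoe Regional Planning Agency",
        "Fresno Council of Governments",
    ],
})

# Count distinct RTPAs per county; anything above 1 gets collapsed by to_dict()
rtpa_per_county = rtpa_county_toy.groupby("county_geography_name")["rtpa_name"].nunique()
multi_rtpa_counties = rtpa_per_county[rtpa_per_county > 1].index.tolist()

# The later row wins when the index has duplicate keys
as_dict = rtpa_county_toy.set_index("county_geography_name")["rtpa_name"].to_dict()

print(multi_rtpa_counties)
print(as_dict["El Dorado"])
```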

## Which agencies are missing an RTPA name but have a county name?

In [46]:
dim_org_county[(dim_org_county["rtpa_name"].isna()) & (dim_org_county["county_geography_name"].notna())].drop_duplicates()

Unnamed: 0,name,ntd_id_2022,rtpa_name,key,organization_key,organization_name,county_geography_name,county_geography_key
2,Bishop Paiute Tribe,99268,,7416dba335568df67c5bcb5444fef5b7,7416dba335568df67c5bcb5444fef5b7,Bishop Paiute Tribe,Inyo,2043f1e3cb85e2f1651696047250ef0b
3,Calaveras County,91063,,4a749b39ccec696bd5a7283359febc12,4a749b39ccec696bd5a7283359febc12,Calaveras County,Calaveras,3562e9723b46a1db706bd516af79e143
4,California Department of Transportation,9R02,,a1e59256e9f14aed58b1bde2bd7fdc09,a1e59256e9f14aed58b1bde2bd7fdc09,California Department of Transportation,Sacramento,bd7bba0b0cb7727b2d5d6509e14104ae
5,California Vanpool Authority,90230,,b8bf7f2b3f96f422a859a3afd4156a07,b8bf7f2b3f96f422a859a3afd4156a07,California Vanpool Authority,Kings,9d62e13427ed184bf8a59bea53949a0d
13,City of Claremont,90296,,3ceb6b2eba6b0c7960e7fffebaa5d940,3ceb6b2eba6b0c7960e7fffebaa5d940,City of Claremont,Los Angeles,8a8da539caf4f046025b97a5b4b9564b
16,City of Davis,90167,,80d9c11c21af7e411acc64549c17bc1b,80d9c11c21af7e411acc64549c17bc1b,City of Davis,Yolo,e7df60ba9736e8b7132a24b46f05a8ce
19,City of Folsom,90220,,4fe0165cac31b8facf93383f576acb29,4fe0165cac31b8facf93383f576acb29,City of Folsom,Sacramento,bd7bba0b0cb7727b2d5d6509e14104ae
21,City of La Habra Heights,99445,,e806f59dc0303e3c345e799932fcdc68,e806f59dc0303e3c345e799932fcdc68,City of La Habra Heights,Los Angeles,8a8da539caf4f046025b97a5b4b9564b
23,City of Lakewood,90301,,a48b1992ed0b2b37884ee95b6cc7a0c4,a48b1992ed0b2b37884ee95b6cc7a0c4,City of Lakewood,Los Angeles,8a8da539caf4f046025b97a5b4b9564b
24,City of Lincoln,90235,,4843615c5fdce38e0f6c6de27a43312d,4843615c5fdce38e0f6c6de27a43312d,City of Lincoln,Placer,a51c620df563e3afebb26c95ee66944a
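
One possible use of `county_rtpa_dict` is to backfill the missing `rtpa_name` values from each agency's county. The notebook does not show this step, so the following is only a sketch on toy data; `rtpa_name_filled` and "Some Agency" are hypothetical names, and the result is only as reliable as the county-to-RTPA dictionary itself.

```python
import pandas as pd

county_rtpa_toy = {
    "Inyo": "Inyo County Local Transportation Commission",
    "Calaveras": "Calaveras Council of Governments",
}
orgs_toy = pd.DataFrame({
    "name": ["Bishop Paiute Tribe", "Calaveras County", "Some Agency"],
    "rtpa_name": [None, None, "Kern Council of Governments"],
    "county_geography_name": ["Inyo", "Calaveras", "Kern"],
})

# Look up an RTPA by county, then use it only where rtpa_name is missing
from_county = orgs_toy["county_geography_name"].map(county_rtpa_toy)
orgs_toy["rtpa_name_filled"] = orgs_toy["rtpa_name"].fillna(from_county)

print(orgs_toy[["name", "rtpa_name_filled"]])
```

Writing to a new column keeps the original `rtpa_name` available for comparison.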


## What are the unique RTPAs in dim_orgs?
- are any of the SoCal CTCs in there?


One of the SoCal County Transportation Commissions is in the RTPA list:
- Ventura County Transportation Commission

Could not find in the RTPA list:
- Los Angeles County Metropolitan Transportation Authority
- San Bernardino Associated Governments
- Riverside County Transportation Commission
- Imperial County Transportation Commission
- Orange County Transportation Authority

In [41]:
just_rtpa_name = dim_orgs[dim_orgs["rtpa_name"].notna()]["rtpa_name"].drop_duplicates().reset_index(drop=True)

just_rtpa_name.info()

<class 'pandas.core.series.Series'>
RangeIndex: 40 entries, 0 to 39
Series name: rtpa_name
Non-Null Count  Dtype 
--------------  ----- 
40 non-null     object
dtypes: object(1)
memory usage: 452.0+ bytes


In [42]:
check_ctc = [
    "Los Angeles County Metropolitan Transportation Authority",
    "San Bernardino Associated Governments",
    "Riverside County Transportation Commission",
    "Imperial County Transportation Commission",
    "Orange County Transportation Authority",
    "Ventura County Transportation Commission",
]
just_rtpa_name[just_rtpa_name.isin(check_ctc)]

18    Ventura County Transportation Commission
Name: rtpa_name, dtype: object

In [43]:
# substring search, in case the CTC names are spelled differently in rtpa_name

ctc_substring=[
    "Los Angeles",
    "San Bernardino",
    "Riverside",
    "Imperial",
    "Orange",
]

for i in ctc_substring:
    print(just_rtpa_name[just_rtpa_name.str.contains(i)])

Series([], Name: rtpa_name, dtype: object)
Series([], Name: rtpa_name, dtype: object)
Series([], Name: rtpa_name, dtype: object)
Series([], Name: rtpa_name, dtype: object)
Series([], Name: rtpa_name, dtype: object)


In [44]:
# check ctc list against agency names col
dim_org_county[dim_org_county["name"].isin(check_ctc)]

Unnamed: 0,name,ntd_id_2022,rtpa_name,key,organization_key,organization_name,county_geography_name,county_geography_key
56,Riverside County Transportation Commission,90218,,8d13f55a09eeb4e37328ee8481d332d4,8d13f55a09eeb4e37328ee8481d332d4,Riverside County Transportation Commission,Riverside,17a6841cb057ea751e22785fb9229596
168,Imperial County Transportation Commission,90226,Southern California Association of Governments,621fbb855822846119c106fd5bfa56b3,621fbb855822846119c106fd5bfa56b3,Imperial County Transportation Commission,Imperial,56a0304a6b0ccdc0dd0dbfae798a45e2
176,Los Angeles County Metropolitan Transportation Authority,90154,Southern California Association of Governments,9e96bde610e80d71f500eea119c4723c,9e96bde610e80d71f500eea119c4723c,Los Angeles County Metropolitan Transportation Authority,Los Angeles,8a8da539caf4f046025b97a5b4b9564b
187,Orange County Transportation Authority,90036,Southern California Association of Governments,47552ed0c038e35ee6f21ec8eb2cb5d8,47552ed0c038e35ee6f21ec8eb2cb5d8,Orange County Transportation Authority,Orange,aef4652aba294c8fd5ee922f9741ff6a
225,Ventura County Transportation Commission,90164,Ventura County Transportation Commission,4f7fa398d9c1c8c75310e13df4818015,4f7fa398d9c1c8c75310e13df4818015,Ventura County Transportation Commission,Ventura,ac21d3cfb432219540f51c7658df90e9
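
The last task, separating SCAG agencies out to their county transportation commissions, could look something like the sketch below. The county-to-CTC mapping is an assumption assembled from the `check_ctc` names and should be verified before use; `rtpa_adjusted` and "Agency outside SCAG" are hypothetical names.

```python
import pandas as pd

SCAG = "Southern California Association of Governments"

# Assumed mapping: headquarters county -> CTC name taken from check_ctc
county_to_ctc = {
    "Los Angeles": "Los Angeles County Metropolitan Transportation Authority",
    "San Bernardino": "San Bernardino Associated Governments",
    "Riverside": "Riverside County Transportation Commission",
    "Imperial": "Imperial County Transportation Commission",
    "Orange": "Orange County Transportation Authority",
    "Ventura": "Ventura County Transportation Commission",
}

orgs_toy = pd.DataFrame({
    "name": ["Orange County Transportation Authority", "Agency outside SCAG"],
    "rtpa_name": [SCAG, "Fresno Council of Governments"],
    "county_geography_name": ["Orange", "Fresno"],
})

# Only rows currently labeled SCAG are reassigned; unmapped counties keep SCAG
is_scag = orgs_toy["rtpa_name"] == SCAG
ctc = orgs_toy["county_geography_name"].map(county_to_ctc)
replacement = ctc.fillna(orgs_toy["rtpa_name"])
orgs_toy["rtpa_adjusted"] = orgs_toy["rtpa_name"].where(~is_scag, replacement)

print(orgs_toy[["name", "rtpa_adjusted"]])
```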
