# QC `ntd_id_rtpa_crosswalk` against `dim_organizations`

As of 4/21/2025, `dim_orgs` has an RTPA column. RTPA values were set from a previous `rtpa/mpo` column. The `ntd_id_rtpa_crosswalk`
was created by pulling all the agencies in NTD and assigning an RTPA based on their city location.

## Tasks
- see if the ntd_ids in the xwalk appear in dim_orgs and vice versa.
    - are all ntd_id from xwalk in dim_orgs?
    - are there any ntd_ids from dim_orgs not in xwalk?
    
- do the ntd_id/rtpa pairs from xwalk match dim_orgs?
    - SCAG accounts for six counties in SoCal

- refactor analyses that use the old xwalk
    - remove old xwalk
    - replace with rtpa data from dim_orgs
    - make manual adjustments to SCAG agencies >> separate out to the SoCal CTCs (Orange, Imperial, San Bernardino, etc.)
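
The first two tasks can be sketched with an outer merge and `indicator=True`. This is a minimal sketch on toy data: the IDs and RTPA names are made up, and the xwalk column names `ntd_id` and `rtpa_xwalk` are assumptions, since the crosswalk schema is not shown in this notebook.

```python
import pandas as pd

# Toy stand-ins for the real tables; IDs and RTPA names are made up.
# `ntd_id` and `rtpa_xwalk` are assumed column names for the crosswalk.
xwalk_demo = pd.DataFrame({
    "ntd_id": ["90001", "90002", "90003"],
    "rtpa_xwalk": ["RTPA A", "RTPA B", "RTPA C"],
})
dim_orgs_demo = pd.DataFrame({
    "ntd_id_2022": ["90002", "90003", "90004"],
    "rtpa_name": ["RTPA B", "RTPA X", "RTPA D"],
})

# An outer merge keeps every ID from both sides; `_merge` records the source.
merged = xwalk_demo.merge(
    dim_orgs_demo,
    how="outer",
    left_on="ntd_id",
    right_on="ntd_id_2022",
    indicator=True,
)

# IDs that appear on one side only
only_in_xwalk = merged.loc[merged["_merge"] == "left_only", "ntd_id"].tolist()
only_in_dim_orgs = merged.loc[merged["_merge"] == "right_only", "ntd_id_2022"].tolist()

# Among IDs found in both tables, flag ntd_id/rtpa pairs that disagree
both = merged[merged["_merge"] == "both"]
mismatched = both.loc[both["rtpa_xwalk"] != both["rtpa_name"], "ntd_id"].tolist()
```

On the real tables, null `rtpa_name` values in dim_orgs would show up as mismatches under this comparison, so they may need to be reviewed separately.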
    

In [1]:
import pandas as pd
from calitp_data_analysis.tables import tbls
from siuba import _, collect, count, filter, show_query, select

In [2]:
xwalk = pd.read_parquet("gs://calitp-analytics-data/data-analyses/ntd/ntd_id_rtpa_crosswalk_all_reporter_types.parquet")

In [16]:
dim_orgs = (
    tbls.mart_transit_database.dim_organizations()
    >> filter(
        _._is_current == True,
        #_.public_currently_operating == True,
        _.ntd_id_2022 != ""
    )
    >> select(
        _.name,
        _.ntd_id_2022,
        _.rtpa_name,
        _.key
    )
    >> collect()
)

dim_orgs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232 entries, 0 to 231
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         232 non-null    object
 1   ntd_id_2022  232 non-null    object
 2   rtpa_name    203 non-null    object
 3   key          232 non-null    object
dtypes: object(4)
memory usage: 7.4+ KB


In [17]:
dim_orgs[dim_orgs["rtpa_name"].isna()]

Unnamed: 0,name,ntd_id_2022,rtpa_name,key
2,Bishop Paiute Tribe,99268,,7416dba335568df67c5bcb5444fef5b7
3,Calaveras County,91063,,4a749b39ccec696bd5a7283359febc12
4,California Department of Transportation,9R02,,a1e59256e9f14aed58b1bde2bd7fdc09
5,California Vanpool Authority,90230,,b8bf7f2b3f96f422a859a3afd4156a07
13,City of Claremont,90296,,3ceb6b2eba6b0c7960e7fffebaa5d940
16,City of Davis,90167,,80d9c11c21af7e411acc64549c17bc1b
19,City of Folsom,90220,,4fe0165cac31b8facf93383f576acb29
21,City of La Habra Heights,99445,,e806f59dc0303e3c345e799932fcdc68
23,City of Lakewood,90301,,a48b1992ed0b2b37884ee95b6cc7a0c4
24,City of Lincoln,90235,,4843615c5fdce38e0f6c6de27a43312d


In [19]:
# need to get county info? 
county_bridge = (
    tbls.mart_transit_database.bridge_organizations_x_headquarters_county_geography()
    >> filter(
        _._is_current == True,
        #_.public_currently_operating == True,
        #_.ntd_id_2022 != ""
    )
    >> select(
        _.organization_key,
        _.organization_name,
        _.county_geography_name,
        _.county_geography_key
    )
    >> collect()
)

county_bridge.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1294 entries, 0 to 1293
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   organization_key       1294 non-null   object
 1   organization_name      1294 non-null   object
 2   county_geography_name  1294 non-null   object
 3   county_geography_key   1294 non-null   object
dtypes: object(4)
memory usage: 40.6+ KB


In [21]:
dim_org_county = dim_orgs.merge(
    county_bridge,
    how="left",
    left_on="key",
    right_on="organization_key"
)

# info() prints its summary and returns None, so call it on its own
# instead of passing it to display()
dim_org_county.info()
display(dim_org_county.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 232 entries, 0 to 231
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   name                   232 non-null    object
 1   ntd_id_2022            232 non-null    object
 2   rtpa_name              203 non-null    object
 3   key                    232 non-null    object
 4   organization_key       232 non-null    object
 5   organization_name      232 non-null    object
 6   county_geography_name  232 non-null    object
 7   county_geography_key   232 non-null    object
dtypes: object(8)
memory usage: 16.3+ KB


Unnamed: 0,name,ntd_id_2022,rtpa_name,key,organization_key,organization_name,county_geography_name,county_geography_key
0,Access Services,90157,Southern California Association of Governments,d84a961daa618c733f9d9c3bd49c322f,d84a961daa618c733f9d9c3bd49c322f,Access Services,Los Angeles,8a8da539caf4f046025b97a5b4b9564b
1,Alpine County,91116,Alpine County Local Transportation Commission,9b5971d16d58e4fcafa694ee7fa33b12,9b5971d16d58e4fcafa694ee7fa33b12,Alpine County,Alpine,d313dacae39cc4867f7d187e092f4f33
2,Bishop Paiute Tribe,99268,,7416dba335568df67c5bcb5444fef5b7,7416dba335568df67c5bcb5444fef5b7,Bishop Paiute Tribe,Inyo,2043f1e3cb85e2f1651696047250ef0b
3,Calaveras County,91063,,4a749b39ccec696bd5a7283359febc12,4a749b39ccec696bd5a7283359febc12,Calaveras County,Calaveras,3562e9723b46a1db706bd516af79e143
4,California Department of Transportation,9R02,,a1e59256e9f14aed58b1bde2bd7fdc09,a1e59256e9f14aed58b1bde2bd7fdc09,California Department of Transportation,Sacramento,bd7bba0b0cb7727b2d5d6509e14104ae
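
With the headquarters county attached to each org, the SCAG adjustment from the task list could be sketched as a county-based override. This is only a sketch under assumptions: the county-to-CTC lookup below is written by hand for illustration and does not come from dim_orgs, and `rtpa_adjusted` is a hypothetical column name. The demo rows reuse three rows from the output above.

```python
import pandas as pd

SCAG = "Southern California Association of Governments"

# Assumed lookup from SCAG county to its county transportation commission.
# Hand-written for illustration; verify names against the source of record.
scag_county_to_ctc = {
    "Imperial": "Imperial County Transportation Commission",
    "Los Angeles": "Los Angeles County Metropolitan Transportation Authority",
    "Orange": "Orange County Transportation Authority",
    "Riverside": "Riverside County Transportation Commission",
    "San Bernardino": "San Bernardino County Transportation Authority",
    "Ventura": "Ventura County Transportation Commission",
}

# Small stand-in for dim_org_county
demo = pd.DataFrame({
    "name": ["Access Services", "Alpine County", "Bishop Paiute Tribe"],
    "rtpa_name": [SCAG, "Alpine County Local Transportation Commission", None],
    "county_geography_name": ["Los Angeles", "Alpine", "Inyo"],
})

# Only rows currently assigned to SCAG are overridden; .eq() treats nulls as False
is_scag = demo["rtpa_name"].eq(SCAG)

# `rtpa_adjusted` is a hypothetical column so the original value stays intact
demo["rtpa_adjusted"] = demo["rtpa_name"]
demo.loc[is_scag, "rtpa_adjusted"] = (
    demo.loc[is_scag, "county_geography_name"]
    .map(scag_county_to_ctc)
    .fillna(SCAG)  # keep SCAG if the county is not in the lookup
)
```

Writing to a separate column keeps the dim_orgs value available for comparison, which helps when checking the override against the old xwalk.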


# What are the unique pairs of RTPA names and geography names?

In [32]:
dim_org_county[["rtpa_name","county_geography_name"]].drop_duplicates().reset_index(drop=True)

Unnamed: 0,rtpa_name,county_geography_name
0,Southern California Association of Governments,Los Angeles
1,Alpine County Local Transportation Commission,Alpine
2,,Inyo
3,,Calaveras
4,,Sacramento
...,...,...
71,Merced County Association of Governments,Merced
72,Trinity County Transportation Commission,Trinity
73,Tuolumne County Transportation Council,Tuolumne
74,Sacramento Area Council of Governments,Yolo


# What are the unique RTPAs in dim_orgs?
- are any of the SoCal CTCs in there?
- Ventura, LA Metro, San Bernardino, Riverside, Imperial

In [37]:
dim_orgs["rtpa_name"].sort_values().unique().tolist()

['Alpine County Local Transportation Commission',
 'Amador County Transportation Commission',
 'Butte County Association of Governments',
 'Calaveras Council of Governments',
 'Colusa County Transportation Commission',
 'Council of San Benito County Governments',
 'Del Norte Local Transportation Commission',
 'El Dorado County Transportation Commission',
 'Fresno Council of Governments',
 'Glenn County Transportation Commission',
 'Humboldt County Association of Governments',
 'Inyo County Local Transportation Commission',
 'Kern Council of Governments',
 'Kings County Association of Governments',
 'Lake County/City Area Planning Council',
 'Lassen County Transportation Commission',
 'Madera County Transportation Commission',
 'Mariposa County Local Transportation Commission',
 'Mendocino Council of Governments',
 'Merced County Association of Governments',
 'Metropolitan Transportation Commission',
 'Modoc County Transportation Commission',
 'Nevada County Transportation Commission',


In [40]:
dim_orgs[dim_orgs["rtpa_name"].str.contains("Ventura")]

ValueError: Cannot mask with non-boolean array containing NA / NaN values
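
The error comes from the null `rtpa_name` rows seen earlier: `str.contains` returns NaN for them, and a mask containing NaN cannot be used for boolean indexing. One way around it, sketched here on a two-row stand-in for `dim_orgs`, is to pass `na=False` so null values count as non-matches.

```python
import pandas as pd

# Two-row stand-in for dim_orgs: one org with an RTPA, one with a null RTPA
demo = pd.DataFrame({
    "name": ["Access Services", "Bishop Paiute Tribe"],
    "rtpa_name": ["Southern California Association of Governments", None],
})

# na=False turns the NaN results into False, so the mask is purely boolean
mask = demo["rtpa_name"].str.contains("Ventura", na=False)
ventura_rows = demo[mask]

# The same pattern works for any other keyword
scag_rows = demo[demo["rtpa_name"].str.contains("Southern California", na=False)]
```

Applied to the real table, this would be `dim_orgs[dim_orgs["rtpa_name"].str.contains("Ventura", na=False)]`.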