# QC `ntd_id_rtpa_crosswalk` against `dim_organizations`

As of 4/21/2025, `dim_orgs` has an RTPA column. RTPA values were set from a previous `rtpa/mpo` column. The `ntd_id_rtpa_crosswalk`
was created by pulling all the agencies in NTD and assigning each an RTPA based on its city location.

## Tasks
- see if the ntd_ids in the xwalk appear in dim_orgs and vice versa.
    - are all ntd_ids from xwalk in dim_orgs?
    - are there any ntd_ids from dim_orgs not in xwalk?

- do the ntd_id/rtpa pairs from xwalk match dim_orgs?
    - SCAG covers 6 counties in SoCal (Imperial, Los Angeles, Orange, Riverside, San Bernardino, Ventura)

- refactor analyses that use the old xwalk
    - remove old xwalk
    - replace with rtpa data from dim_orgs
    - make manual adjustments to SCAG agencies >> separate out to the SoCal CTCs (Orange, Imperial, San Bernardino, etc.)
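
The first two tasks above can be sketched with an outer merge and `indicator=True`. This is a minimal sketch on toy data, not the actual QC: the crosswalk column names (`ntd_id`, `rtpa_name`) and all values below are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the real tables. The xwalk column names are assumed.
xwalk_toy = pd.DataFrame({
    "ntd_id": ["90001", "90002", "90003"],
    "rtpa_name": ["RTPA A", "RTPA B", "RTPA C"],
})
dim_orgs_toy = pd.DataFrame({
    "ntd_id_2022": ["90002", "90003", "90004"],
    "rtpa_name": ["RTPA B", "RTPA X", "RTPA D"],
})

# indicator=True adds a `_merge` column: left_only, right_only, or both.
merged = xwalk_toy.merge(
    dim_orgs_toy,
    how="outer",
    left_on="ntd_id",
    right_on="ntd_id_2022",
    suffixes=("_xwalk", "_dim_orgs"),
    indicator=True,
)

# Task 1: ntd_ids that appear on only one side.
only_in_xwalk = merged.loc[merged["_merge"] == "left_only", "ntd_id"].tolist()
only_in_dim_orgs = merged.loc[merged["_merge"] == "right_only", "ntd_id_2022"].tolist()

# Task 2: ntd_ids on both sides whose RTPA differs.
# Note: a missing rtpa_name on either side also counts as a mismatch here.
both = merged[merged["_merge"] == "both"]
mismatched = both[both["rtpa_name_xwalk"] != both["rtpa_name_dim_orgs"]]
```

Rows flagged in `mismatched` for SCAG-area agencies may be expected differences rather than errors, since the two sources can label those agencies at different levels (SCAG vs. a county CTC).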
    

In [44]:
import pandas as pd
from calitp_data_analysis.tables import tbls
from siuba import _, collect, count, filter, show_query, select

pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", None)

In [3]:
xwalk = pd.read_parquet("gs://calitp-analytics-data/data-analyses/ntd/ntd_id_rtpa_crosswalk_all_reporter_types.parquet")

In [4]:
dim_orgs = (
    tbls.mart_transit_database.dim_organizations()
    >> filter(
        _._is_current == True,
        #_.public_currently_operating == True,
        _.ntd_id_2022 != ""
    )
    >> select(
        _.name,
        _.ntd_id_2022,
        _.rtpa_name,
        _.key
    )
    >> collect()
)

dim_orgs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232 entries, 0 to 231
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         232 non-null    object
 1   ntd_id_2022  232 non-null    object
 2   rtpa_name    203 non-null    object
 3   key          232 non-null    object
dtypes: object(4)
memory usage: 7.4+ KB


In [11]:
# need to get county info? 
county_bridge = (
    tbls.mart_transit_database.bridge_organizations_x_headquarters_county_geography()
    >> filter(
        _._is_current == True,
        #_.public_currently_operating == True,
        #_.ntd_id_2022 != ""
    )
    >> select(
        _.organization_key,
        _.organization_name,
        _.county_geography_name,
        _.county_geography_key
    )
    >> collect()
)

county_bridge.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1294 entries, 0 to 1293
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   organization_key       1294 non-null   object
 1   organization_name      1294 non-null   object
 2   county_geography_name  1294 non-null   object
 3   county_geography_key   1294 non-null   object
dtypes: object(4)
memory usage: 40.6+ KB


In [39]:
dim_org_county = dim_orgs.merge(
    county_bridge,
    how="left",
    left_on="key",
    right_on="organization_key"
)

# info() prints its summary and returns None, so call it directly instead of wrapping it in display()
dim_org_county.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 232 entries, 0 to 231
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   name                   232 non-null    object
 1   ntd_id_2022            232 non-null    object
 2   rtpa_name              203 non-null    object
 3   key                    232 non-null    object
 4   organization_key       232 non-null    object
 5   organization_name      232 non-null    object
 6   county_geography_name  232 non-null    object
 7   county_geography_key   232 non-null    object
dtypes: object(8)
memory usage: 16.3+ KB

## What are the unique pairs of RTPA names and geography names?

In [45]:
dim_org_county.groupby("rtpa_name").agg({
    "county_geography_name":"unique"
})

rtpa_name,county_geography_name
Alpine County Local Transportation Commission,[Alpine]
Amador County Transportation Commission,[Amador]
Butte County Association of Governments,[Butte]
Calaveras Council of Governments,[Calaveras]
Colusa County Transportation Commission,[Colusa]
Council of San Benito County Governments,[San Benito]
Del Norte Local Transportation Commission,[Del Norte]
El Dorado County Transportation Commission,[El Dorado]
Fresno Council of Governments,[Fresno]
Glenn County Transportation Commission,[Glenn]


## Which agencies are missing an RTPA name but have a county name?

In [16]:
(
    dim_org_county[dim_org_county["rtpa_name"].isna()]
    [["name", "ntd_id_2022", "rtpa_name", "county_geography_name"]]
    .sort_values(by="county_geography_name")
)

,name,ntd_id_2022,rtpa_name,county_geography_name
3,Calaveras County,91063,,Calaveras
44,Elk Valley Rancheria,99452,,Del Norte
73,Blue Lake Rancheria,99292,,Humboldt
55,Quechan Tribe of the Fort Yuma Indian Reservat...,99310,,Imperial
2,Bishop Paiute Tribe,99268,,Inyo
38,City of Tehachapi,91074,,Kern
5,California Vanpool Authority,90230,,Kings
46,Hollywood Burbank Airport,99444,,Los Angeles
21,City of La Habra Heights,99445,,Los Angeles
23,City of Lakewood,90301,,Los Angeles


## What are the unique RTPAs in dim_orgs?
- are any of the SoCal CTCs in there?


Of the SoCal County Transportation Commissions, only one is in the RTPA list:
- Ventura County Transportation Commission

Could not find:
- Los Angeles County Metropolitan Transportation Authority
- San Bernardino Associated Governments
- Riverside County Transportation Commission
- Imperial County Transportation Commission
- Orange County Transportation Authority

In [29]:
just_rtpa_name = dim_orgs[dim_orgs["rtpa_name"].notna()]["rtpa_name"].drop_duplicates().reset_index(drop=True)

just_rtpa_name.info()

<class 'pandas.core.series.Series'>
RangeIndex: 40 entries, 0 to 39
Series name: rtpa_name
Non-Null Count  Dtype 
--------------  ----- 
40 non-null     object
dtypes: object(1)
memory usage: 452.0+ bytes


In [38]:
check_ctc = [
    "Los Angeles County Metropolitan Transportation Authority",
    "San Bernardino Associated Governments",
    "Riverside County Transportation Commission",
    "Imperial County Transportation Commission",
    "Orange County Transportation Authority",
    "Ventura County Transportation Commission",
]
just_rtpa_name[just_rtpa_name.isin(check_ctc)]

18    Ventura County Transportation Commission
Name: rtpa_name, dtype: object
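
Since only Ventura shows up, the manual SCAG adjustment from the task list could be sketched as a county-to-CTC remap. This is a sketch under assumptions: the `SCAG_NAME` string, the county spellings, and the `rtpa_adjusted` column are hypothetical and should be checked against the actual values in `dim_orgs` and the county bridge table. The CTC names are the ones in `check_ctc`.

```python
import pandas as pd

# Assumed label for SCAG in rtpa_name. Verify the real string in dim_orgs.
SCAG_NAME = "Southern California Association of Governments"

# County spellings are assumed to match county_geography_name.
county_to_ctc = {
    "Los Angeles": "Los Angeles County Metropolitan Transportation Authority",
    "San Bernardino": "San Bernardino Associated Governments",
    "Riverside": "Riverside County Transportation Commission",
    "Imperial": "Imperial County Transportation Commission",
    "Orange": "Orange County Transportation Authority",
    "Ventura": "Ventura County Transportation Commission",
}

# Toy stand-in for dim_org_county. Organization names are hypothetical.
toy = pd.DataFrame({
    "name": ["Org 1", "Org 2", "Org 3"],
    "rtpa_name": [SCAG_NAME, SCAG_NAME, "Fresno Council of Governments"],
    "county_geography_name": ["Orange", "Imperial", "Fresno"],
})

# Keep non-SCAG rows as-is; replace SCAG rows with the CTC for their county.
is_scag = toy["rtpa_name"] == SCAG_NAME
toy["rtpa_adjusted"] = toy["rtpa_name"].where(
    ~is_scag,
    toy["county_geography_name"].map(county_to_ctc),
)
```

A SCAG row whose county is not in `county_to_ctc` would end up with a missing `rtpa_adjusted`, which makes unexpected counties easy to spot.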