# Run Functions to Add Information to Projects

To run the data through the script, update the `my_file` path to the most recent FMIS and QMRS export uploaded to GCS, then run the function in the `Export Data` section with your dataframe and the current date. The aggregated data will then be available in GCS.

In [1]:
import _script_utils
import _data_utils
import pandas as pd
from calitp_data_analysis.sql import to_snakecase
from siuba import *

In [2]:
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", None)

## Read in Data and Function Development / Test Function

For the `run_script` function (called in the `Run Script` section):
* the first argument is `my_file`; update it to the file name of the most recent FMIS & QMRS export
* the second argument is the name of the unique recipient identifier column; it should stay the same for subsequent exports
* the third argument is the aggregation level for the data. Unless otherwise specified, use `agg`, which returns one row per project
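Picking the most recent export by hand is easy to get wrong once several files sit in the bucket. The sketch below is one possible way to choose it from a list of file names. It assumes every export follows the `FMIS_IIJA_YYYYMMDD.xlsx` pattern seen in this notebook, and `latest_export` is a hypothetical helper, not part of `_script_utils`.

```python
import re
from datetime import datetime

# Assumed naming pattern, inferred from "FMIS_IIJA_20250709.xlsx"
EXPORT_PATTERN = re.compile(r"^FMIS_IIJA_(\d{8})\.xlsx$")


def latest_export(file_names):
    """Return the file name with the most recent YYYYMMDD date stamp."""
    dated = []
    for name in file_names:
        match = EXPORT_PATTERN.match(name)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y%m%d").date()
            dated.append((stamp, name))
    if not dated:
        raise ValueError("no file names match the assumed export pattern")
    # Tuples compare by date first, so max() picks the newest export
    return max(dated)[1]


candidates = ["FMIS_IIJA_20240115.xlsx", "FMIS_IIJA_20250709.xlsx", "2b Output.xlsx"]
my_file = latest_export(candidates)
```

Files that do not match the pattern, such as `2b Output.xlsx`, are skipped rather than treated as errors.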

In [3]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/dla/dla-iija"

In [4]:
my_file = "FMIS_IIJA_20250709.xlsx"

### Check data

In [5]:
check_data = to_snakecase(pd.read_excel(f"{GCS_FILE_PATH}/{my_file}"))

In [6]:
check_data.head(1)

Unnamed: 0,fmis_transaction_date,program_code,program_code_description,project_number,recipient_project_number,project_title,county_code,congressional_district,project_status_description,project_description,improvement_type,improvement_type_description,total_cost_amount,obligations_amount,summary_recipient
0,2022-03-15,Y230,STBG-URBANIZED >200K IIJA,6084275,0422000280L,"FREMONT, RICHMOND, AND MARIN AND SONOMA COUNTIES, ALONG THE SMART CORRIDOR. BIKE SHARE CAPITAL PROGRAM (TC)",13,Cong Dist 15,Active,"FREMONT, RICHMOND, AND MARIN AND SONOMA COUNTIES, ALONG THE SMART CORRIDOR. BIKE SHARE CAPITAL PROGRAM (TC)",44,Other,700000.0,700000.0,L6084MTC


In [7]:
check_data[["summary_recipient"]].sample(3)

Unnamed: 0,summary_recipient
2507,L5379MTC
2060,L5957SANDAG
5009,S MTC


In [8]:
check_data.project_number.nunique()

2310

In [9]:
check_data.columns

Index(['fmis_transaction_date', 'program_code', 'program_code_description',
       'project_number', 'recipient_project_number', 'project_title',
       'county_code', 'congressional_district', 'project_status_description',
       'project_description', 'improvement_type',
       'improvement_type_description', 'total_cost_amount',
       'obligations_amount', 'summary_recipient'],
      dtype='object')

In [10]:
import intake

In [11]:
previous_df = to_snakecase(pd.read_excel(
    "gs://calitp-analytics-data/data-analyses/dla/dla-iija/2b Output.xlsx"
))

In [12]:
set(previous_df.columns) - set(check_data.columns)

{'_3c_agency_name',
 '_3c_county',
 '_3c_iija_obligated',
 '_3c_phase',
 '_3c_status',
 '_3c_total_cost',
 'action_type',
 'district',
 'fads_location',
 'iija_codes',
 'lp2000_location',
 'rk_locode',
 'state_local',
 'summary_recipient_defined_text_field_1_value'}
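The set difference above only shows columns that exist in the previous output and are missing from the new export. A two-way comparison can also surface columns that are new in the export. The following is a minimal sketch on toy dataframes; `compare_columns` is a hypothetical helper introduced for illustration.

```python
import pandas as pd


def compare_columns(old, new):
    """List the columns found in only one of the two dataframes."""
    old_cols, new_cols = set(old.columns), set(new.columns)
    return {
        "only_in_old": sorted(old_cols - new_cols),
        "only_in_new": sorted(new_cols - old_cols),
    }


# Toy stand-ins for previous_df and check_data
old_df = pd.DataFrame(columns=["project_number", "district", "action_type"])
new_df = pd.DataFrame(columns=["project_number", "summary_recipient"])

column_diff = compare_columns(old_df, new_df)
```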

In [13]:
catalog = intake.open_catalog(
    "../../_shared_utils/shared_utils/shared_data_catalog.yml"
)

In [14]:
counties = to_snakecase((catalog.ca_counties.read()))[["name", "cnty_fips"]]

In [15]:
counties.cnty_fips = counties.cnty_fips.astype(int)

In [16]:
check_data.county_code.unique()

array([ 13,  83,  67, 999, 103,  23,  81,  73,  15,   1,  37,  79,  19,
        77,  59,  71,  11,  85,  65, 113,  61, 107,  89,  75, 115,  45,
         7,  87,  99,  97,  93,  29,  21,  41,  95,  57,  17,  33, 101,
        53,  39, 105,  35,  43,  47,  25,   3,  31,   5, 111,  69,   9,
        55,  27,  63,  51, 109,  91,  49])
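The export includes the code `999`, which is not a county FIPS code. A quick way to list codes that have no match in the county lookup is sketched below on toy data; the column names mirror `check_data` and `counties`, but the rows are illustrative only.

```python
import pandas as pd

# Toy stand-ins for check_data and counties
export = pd.DataFrame({"county_code": [13, 1, 999, 13]})
lookup = pd.DataFrame(
    {"name": ["Alameda County", "Contra Costa County"], "cnty_fips": [1, 13]}
)

# Codes in the export that are absent from the lookup
is_unmatched = ~export["county_code"].isin(lookup["cnty_fips"])
unmatched_codes = export.loc[is_unmatched, "county_code"].unique().tolist()
```

Any code listed here will not pick up a county name from a join on `cnty_fips`, so it needs separate handling.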

In [17]:
counties.sort_values(by = ["name"])

Unnamed: 0,name,cnty_fips
0,Alameda County,1
1,Alpine County,3
2,Amador County,5
3,Butte County,7
4,Calaveras County,9
5,Colusa County,11
6,Contra Costa County,13
7,Del Norte County,15
8,El Dorado County,17
9,Fresno County,19


In [18]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/dla/dla-iija"

In [19]:
county_codes = to_snakecase(
    (pd.read_excel(f"{GCS_FILE_PATH}/CountyNameToCodeLookUp.xlsx"))
)

In [20]:
county_codes.head(2)

Unnamed: 0,county_name,rca_county_code
0,Alameda County,ALA
1,Alpine County,ALP


In [21]:
county_mapping = pd.merge(
    county_codes, counties, left_on="county_name", right_on="name", how="inner"
)
county_mapping = county_mapping.drop(columns=["name"])
county_mapping = county_mapping.rename(
    columns={"county_name": "COUNTY", "rca_county_code": "COUNTY_ABBREV"}
)

In [22]:
county_mapping.sample()

Unnamed: 0,COUNTY,COUNTY_ABBREV,cnty_fips
21,Mariposa County,MPA,43
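An inner merge silently drops county names that appear in only one of the two tables. One way to audit this is an outer merge with `indicator=True`, which adds a `_merge` column marking where each row came from. This is a sketch on toy data; "Example County" is a made-up row used to show a mismatch.

```python
import pandas as pd

# Toy stand-ins for county_codes and counties
county_codes = pd.DataFrame(
    {
        "county_name": ["Alameda County", "Alpine County", "Example County"],
        "rca_county_code": ["ALA", "ALP", "EXA"],
    }
)
counties = pd.DataFrame(
    {"name": ["Alameda County", "Alpine County"], "cnty_fips": [1, 3]}
)

audit = pd.merge(
    county_codes,
    counties,
    left_on="county_name",
    right_on="name",
    how="outer",
    indicator=True,
)

# Rows present in only one table would be lost by how="inner"
dropped_names = audit.loc[audit["_merge"] != "both", "county_name"].tolist()
```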


In [23]:
check_data = check_data.rename(columns = {"summary_recipient": "summary_recipient_defined_text_field_1_value"})

In [24]:
check_data2 = _data_utils.add_new_codes(check_data)

In [25]:
check_data2.columns

Index(['fmis_transaction_date', 'program_code', 'program_code_description',
       'project_number', 'recipient_project_number', 'project_title',
       'county_code', 'congressional_district', 'project_status_description',
       'project_description', 'improvement_type',
       'improvement_type_description', 'total_cost_amount',
       'obligations_amount', 'summary_recipient_defined_text_field_1_value',
       'iija_program_code', 'funding_type_code'],
      dtype='object')

In [26]:
check_data3 = _script_utils.identify_agency(check_data2, "summary_recipient_defined_text_field_1_value")

In [27]:
check_data3.columns

Index(['fmis_transaction_date', 'program_code', 'program_code_description',
       'project_number', 'recipient_project_number', 'project_title',
       'county_code', 'congressional_district', 'project_status_description',
       'project_description', 'improvement_type',
       'improvement_type_description', 'total_cost_amount',
       'obligations_amount', 'summary_recipient_defined_text_field_1_value',
       'iija_program_code', 'funding_type_code', 'implementing_agency_locode',
       'implementing_agency', 'district', 'county_name', 'rtpa_name',
       'mpo_name'],
      dtype='object')

In [28]:
check_data3 = _script_utils.format_congressional_district(check_data3, "congressional_district")

In [29]:
check_data3.columns

Index(['fmis_transaction_date', 'program_code', 'program_code_description',
       'project_number', 'recipient_project_number', 'project_title',
       'county_code', 'congressional_district', 'project_status_description',
       'project_description', 'improvement_type',
       'improvement_type_description', 'total_cost_amount',
       'obligations_amount', 'summary_recipient_defined_text_field_1_value',
       'iija_program_code', 'funding_type_code', 'implementing_agency_locode',
       'implementing_agency', 'district', 'county_name', 'rtpa_name',
       'mpo_name'],
      dtype='object')

In [30]:
check_data3 = _script_utils.change_district_format(check_data3, "district")

In [31]:
check_data.columns

Index(['fmis_transaction_date', 'program_code', 'program_code_description',
       'project_number', 'recipient_project_number', 'project_title',
       'county_code', 'congressional_district', 'project_status_description',
       'project_description', 'improvement_type',
       'improvement_type_description', 'total_cost_amount',
       'obligations_amount', 'summary_recipient_defined_text_field_1_value'],
      dtype='object')

In [32]:
check_data3.columns

Index(['fmis_transaction_date', 'program_code', 'program_code_description',
       'project_number', 'recipient_project_number', 'project_title',
       'county_code', 'congressional_district', 'project_status_description',
       'project_description', 'improvement_type',
       'improvement_type_description', 'total_cost_amount',
       'obligations_amount', 'summary_recipient_defined_text_field_1_value',
       'iija_program_code', 'funding_type_code', 'implementing_agency_locode',
       'implementing_agency', 'district', 'county_name', 'rtpa_name',
       'mpo_name'],
      dtype='object')

In [33]:
check_data4 = _script_utils.add_county_abbrev(check_data3, "county_name")

In [34]:
check_data4.columns

Index(['fmis_transaction_date', 'program_code', 'program_code_description',
       'project_number', 'recipient_project_number', 'project_title',
       'county_code', 'congressional_district', 'project_status_description',
       'project_description', 'improvement_type',
       'improvement_type_description', 'total_cost_amount',
       'obligations_amount', 'summary_recipient_defined_text_field_1_value',
       'iija_program_code', 'funding_type_code', 'implementing_agency_locode',
       'implementing_agency', 'district', 'county_name', 'rtpa_name',
       'mpo_name', 'county_name_abbrev'],
      dtype='object')

In [35]:
check_data4[['county_name','county_code', "county_name_abbrev"]].sample(10)

Unnamed: 0,county_name,county_code,county_name_abbrev
2637,San Mateo County,81,SM
2907,Marin County,41,MRN
4566,Los Angeles County,37,LA
4191,Siskiyou County,93,SIS
1146,Calaveras County,9,CAL
1272,Imperial County,73,IMP
1103,Sacramento County,67,SAC
4289,Statewide,999,
3569,San Bernardino County,71,SBD
1328,Fresno County,19,FRE


### Run Script

In [36]:
df2 = _script_utils.run_script(
    my_file, "summary_recipient_defined_text_field_1_value", "agg"
)

Index(['fmis_transaction_date', 'project_number', 'implementing_agency',
       'summary_recipient_defined_text_field_1_value', 'funding_type_code',
       'program_code', 'program_code_description', 'recipient_project_number',
       'improvement_type', 'improvement_type_description',
       'program_code_description_for_description', 'project_title',
       'obligations_amount', 'total_cost_amount', 'congressional_district',
       'district', 'county_code', 'county_name', 'county_name_abbrev',
       'county_name_title', 'implementing_agency_locode', 'rtpa_name',
       'mpo_name'],
      dtype='object')

  df['implementing_agency_locode'] = df['implementing_agency_locode'].str.replace('.0', '')


True
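The `str.replace('.0', '')` line echoed in the output above looks like the tail of a pandas warning. If it is the pandas 1.x `FutureWarning` about the default of the `regex` argument, the pattern `.0` is being read as a regular expression, where `.` matches any character. The sketch below shows the difference on toy locode strings; the actual fix would belong in `_script_utils`, whose source is not shown here.

```python
import pandas as pd

# Toy locodes that were floats before being cast to strings
locodes = pd.Series(["5004.0", "100.0", "6084.0"])

# As a regex, ".0" matches ANY character followed by "0",
# so "5004.0" loses its leading "50" as well as the trailing ".0"
as_regex = locodes.str.replace(".0", "", regex=True)

# An escaped, anchored pattern removes only a trailing ".0"
anchored = locodes.str.replace(r"\.0$", "", regex=True)
```

Passing `regex` explicitly also makes the behavior independent of the installed pandas version.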

In [37]:
df2.columns

Index(['fmis_transaction_date', 'project_number', 'implementing_agency',
       'summary_recipient_defined_text_field_1_value', 'funding_type_code',
       'program_code', 'program_code_description', 'recipient_project_number',
       'improvement_type', 'improvement_type_description',
       'old_project_title_desc', 'obligations_amount', 'total_cost_amount',
       'congressional_district', 'district', 'county_code', 'county_name',
       'county_name_abbrev', 'implementing_agency_locode', 'rtpa_name',
       'mpo_name', 'new_project_title', 'new_description_col'],
      dtype='object')

### Testing the data

In [38]:
assert len(df2) == check_data.project_number.nunique()
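The row-count assertion confirms one row per project but not that the dollar amounts survived aggregation. A few extra checks are sketched below on toy data. The sketch assumes amounts are summed per project, which is not confirmed by this notebook since the body of `run_script` is not shown.

```python
import pandas as pd

# Toy transaction-level data: project "A" has two rows
raw = pd.DataFrame(
    {
        "project_number": ["A", "A", "B"],
        "obligations_amount": [100.0, 50.0, 25.0],
    }
)

# Assumed aggregation: sum the amounts within each project
agg = raw.groupby("project_number", as_index=False)["obligations_amount"].sum()

# One row per project, with no duplicated project numbers
assert len(agg) == raw["project_number"].nunique()
assert agg["project_number"].is_unique

# Totals should match before and after aggregation
total_gap = abs(agg["obligations_amount"].sum() - raw["obligations_amount"].sum())
assert total_gap < 1e-6
```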

In [41]:
check_data.columns

Index(['fmis_transaction_date', 'program_code', 'program_code_description',
       'project_number', 'recipient_project_number', 'project_title',
       'county_code', 'congressional_district', 'project_status_description',
       'project_description', 'improvement_type',
       'improvement_type_description', 'total_cost_amount',
       'obligations_amount', 'summary_recipient_defined_text_field_1_value'],
      dtype='object')

In [42]:
df2 >> filter(_.project_number == "5004049")

Unnamed: 0,fmis_transaction_date,project_number,implementing_agency,summary_recipient_defined_text_field_1_value,funding_type_code,program_code,program_code_description,recipient_project_number,improvement_type,improvement_type_description,old_project_title_desc,obligations_amount,total_cost_amount,congressional_district,district,county_code,county_name,county_name_abbrev,implementing_agency_locode,rtpa_name,mpo_name,new_project_title,new_description_col
785,2024-04-15,5004049,San Diego,L5004SANDAG,IIJA-F,Y001|Y110|Y908|Y909,National Highway Performance Program (NHPP)|Bridge Formula Program|Bridge Replacement and Rehabilitation Program,11955780L,10|17,Bridge Replacement - Added Capacity|Construction Engineering,"WEST MISSION BAY DRIVE OVER THE SAN DIEGO RIVER BRIDGE REPLACEMENT, BR. NO. 57C-0023",80036838,90928327,|52|,|11|,73,San Diego County,|SD|,4,San Diego Association of Governments,San Diego Association Of Governments,Replace Bridge in San Diego,"Replace Bridge in San Diego, part of the National Highway Performance Program (NHPP), and the Bridge Formula Program, and the Bridge Replacement and Rehabilitation Program. (Federal Project ID: 5004049)."
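In the aggregated row, multi-valued fields are joined with `|`, and some, such as `congressional_district` and `district`, also carry leading and trailing pipes. When individual values are needed again, they can be split back out as sketched here; `split_pipes` is a hypothetical helper, not a function from `_script_utils`.

```python
def split_pipes(value):
    """Split a pipe-delimited string, ignoring empty leading/trailing parts."""
    return [part for part in str(value).split("|") if part]


program_codes = split_pipes("Y001|Y110|Y908|Y909")
districts = split_pipes("|52|")
```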


## Export Data

In [None]:
### Name the export file for GCS using the run date,
### e.g. "07102025_agg" (the date followed by the aggregation level)

In [43]:
_script_utils.export_to_gcs(df2, "07102025_agg")
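Rather than typing the date by hand, the export name could be built from the run date. This sketch assumes the `MMDDYYYY_agg` convention implied by `"07102025_agg"`; `export_name` is a hypothetical helper.

```python
from datetime import date


def export_name(run_date, level="agg"):
    """Build an export name like '07102025_agg' from a date and a level."""
    return f"{run_date.strftime('%m%d%Y')}_{level}"


name = export_name(date(2025, 7, 10))
```

Calling `export_name(date.today())` would then give the name for the current run.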