# Run Functions to Add Information to Projects

To run the data through the script, update `my_file` to the file name of the most recent FMIS and QMRS export uploaded to GCS, then run the function in the `Export Data` section with your dataframe and the current date. The aggregated data will then be available in GCS.

In [1]:
import _data_utils
import _script_utils
import pandas as pd
from calitp_data_analysis.sql import to_snakecase

In [2]:
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", None)

In [3]:
locodes = to_snakecase(
    pd.read_excel(
        "gs://calitp-analytics-data/data-analyses/dla/e-76Obligated/locodes_updated7122021.xlsx"
    )
)

## Read in Data and function development / Test Function

For the following function:
* Update `my_file` to the file name of the most recent FMIS & QMRS export.
* The second keyword argument, `recipient_column`, is the unique recipient identifier. It should stay the same across subsequent exports.
* The third keyword argument, `df_agg_level`, is the aggregation level for the data. Unless otherwise specified, it should be `agg`, which is one row per project.

In [4]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/dla/dla-iija"

In [5]:
# update this path to the latest IIJA data
my_file = "IIJA_ToDDS_20260113.xlsx"

### Check data
* July 2025 Notes
    * `summary_recipient_defined_text_field_1_value` changed to `summary_recipient` in `_script_utils.run_script_original` and `_script_utils.run_script_2025`.
    * `rk_locode` is missing, so I used `run_script_original` instead.
    * Updated `_script_utils.add_county_abbrev()` because the values in the counties geojson in `shared_data_catalog.yml` changed.

In [6]:
check_data = to_snakecase(pd.read_excel(f"{GCS_FILE_PATH}/{my_file}"))

In [8]:
check_data.columns

Index(['fmis_transaction_date', 'program_code', 'program_code_description',
       'project_number', 'recipient_project_number', '_10_id', 'project_title',
       'county_code', 'congressional_district', 'project_status_description',
       'project_description', 'improvement_type',
       'improvement_type_description', 'total_cost_amount',
       'obligations_amount', 'summary_recipient_defined_text_field_1_value'],
      dtype='object')

In [7]:
check_data.head(1)

Unnamed: 0,fmis_transaction_date,program_code,program_code_description,project_number,recipient_project_number,_10_id,project_title,county_code,congressional_district,project_status_description,project_description,improvement_type,improvement_type_description,total_cost_amount,obligations_amount,summary_recipient_defined_text_field_1_value
0,2023-05-09,Y240,SURFAC TRNSP BLK GRTS-FLX IIJA,Q101398,0100000193S,100000193,IN DEL NORTE COUNTY NEAR CRESCENT CITY FROM 0.3 MILE SOUTH OF SMITH RIVER BRIDGE TO 0.4 MILE NORTH OF SMITH RIVER BRIDGE REPLACE BRIDGE,15,Cong Dist 2,Active,ON STATE ROUTE: 101. IN DEL NORTE COUNTY NEAR CRESCENT CITY FROM 0.3 MILE SOUTH OF SMITH RIVER BRIDGE TO 0.4 MILE NORTH OF SMITH RIVER BRIDGE REPLACE BRIDGE,11,Bridge Replacement - No Added Capacity,86547400.0,75963900.0,S NON-MPO


### Run Script
* Choose between `run_script_original` or `run_script_2025` depending on the dataframe you receive.

In [9]:
df = _script_utils.run_script_original(
    file_name = my_file, 
    recipient_column = "summary_recipient_defined_text_field_1_value", 
    df_agg_level = "agg"
)

Index(['fmis_transaction_date', 'project_number', 'implementing_agency',
       'summary_recipient_defined_text_field_1_value', 'funding_type_code',
       'program_code', 'program_code_description', 'recipient_project_number',
       'improvement_type', 'improvement_type_description',
       'program_code_description_for_description', 'project_title',
       'obligations_amount', 'total_cost_amount', 'congressional_district',
       'district', 'county_code', 'county_name', 'county_name_abbrev',
       'county_name_title', 'implementing_agency_locode', 'rtpa_name',
       'mpo_name'],
      dtype='object')
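
The script cleans `implementing_agency_locode` with `.str.replace('.0', '')`. The sketch below, on made-up locode strings, shows why that pattern is worth checking: when pandas treats the pattern as a regular expression (the default before pandas 2.0), `.` matches any character, so digits can be removed along with the suffix. The sample values and the anchored pattern are assumptions for illustration, not the code in `_script_utils`.

```python
import pandas as pd

# Hypothetical locodes that were read as floats and cast to strings.
raw = pd.Series(["5012.0", "10.0", "6066.0"])

# As a regex, "." is a wildcard: "50" and "10" match as well as ".0".
wildcard = raw.str.replace(".0", "", regex=True)

# As a literal string, only the characters ".0" are removed.
literal = raw.str.replace(".0", "", regex=False)

# Anchoring the escaped pattern removes only a trailing ".0".
anchored = raw.str.replace(r"\.0$", "", regex=True)

print(wildcard.tolist())
print(literal.tolist())
print(anchored.tolist())
```

Passing `regex` explicitly keeps the result the same across pandas versions.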


### Testing the data

In [11]:
len(df) == check_data.project_number.nunique()

True
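
The check above confirms that the aggregated dataframe has one row per project. A small sketch of how that check could be extended is shown here on toy data; the helper name `check_one_row_per_project` and the sample values are hypothetical, and the `groupby` only stands in for whatever aggregation `_script_utils` performs.

```python
import pandas as pd


def check_one_row_per_project(raw, agg, key="project_number"):
    """Return simple pass/fail checks comparing raw and aggregated data."""
    return {
        # Same test as in the notebook: one output row per unique project.
        "row_count_matches": len(agg) == raw[key].nunique(),
        # No project appears twice after aggregation.
        "no_duplicate_keys": not agg[key].duplicated().any(),
        # No project was dropped or invented along the way.
        "same_projects": set(agg[key]) == set(raw[key]),
    }


# Toy stand-in for the raw export: project A1 has two transactions.
raw_toy = pd.DataFrame(
    {
        "project_number": ["A1", "A1", "B2"],
        "obligations_amount": [100.0, 50.0, 25.0],
    }
)
agg_toy = raw_toy.groupby("project_number", as_index=False)["obligations_amount"].sum()

checks = check_one_row_per_project(raw_toy, agg_toy)
print(checks)
```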

In [12]:
check_data.columns

Index(['fmis_transaction_date', 'program_code', 'program_code_description',
       'project_number', 'recipient_project_number', '_10_id', 'project_title',
       'county_code', 'congressional_district', 'project_status_description',
       'project_description', 'improvement_type',
       'improvement_type_description', 'total_cost_amount',
       'obligations_amount', 'summary_recipient_defined_text_field_1_value'],
      dtype='object')

In [14]:
df.columns

Index(['fmis_transaction_date', 'project_number', 'implementing_agency',
       'summary_recipient_defined_text_field_1_value', 'funding_type_code',
       'program_code', 'program_code_description', 'recipient_project_number',
       'improvement_type', 'improvement_type_description',
       'old_project_title_desc', 'obligations_amount', 'total_cost_amount',
       'congressional_district', 'district', 'county_code', 'county_name',
       'county_name_abbrev', 'implementing_agency_locode', 'rtpa_name',
       'mpo_name', 'new_project_title', 'new_description_col'],
      dtype='object')

## Export Data

In [None]:
# Name the export using the run date (MMDDYYYY) followed by the aggregation level,
# for example "01142026_agg".

In [16]:
_script_utils.export_to_gcs(df, "01142026_agg")
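
The export names used in this notebook (`01142026_agg`, `10202025_agg`) follow a month-day-year pattern plus the aggregation level. A minimal sketch of building that suffix from a date, assuming the pattern holds; `export_suffix` is a hypothetical helper, not part of `_script_utils`.

```python
from datetime import date


def export_suffix(run_date, agg_level="agg"):
    """Build a suffix such as '01142026_agg' from a date (MMDDYYYY)."""
    return f"{run_date.strftime('%m%d%Y')}_{agg_level}"


suffix = export_suffix(date(2026, 1, 14))
print(suffix)  # 01142026_agg
```

Passing `date.today()` instead of a fixed date would produce the suffix for the current run.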

## Comparing latest cleaned data with older version

In [17]:
jan_2026 = "gs://calitp-analytics-data/data-analyses/dla/dla-iija/FMIS_Projects_Universe_IIJA_Reporting_01142026_agg.csv"

In [19]:
jan_df = pd.read_csv(jan_2026)

In [25]:
len(jan_df)

3197

In [23]:
jan_df.head(1)

Unnamed: 0.1,Unnamed: 0,Fmis Transaction Date,Project Number,Implementing Agency,Summary Recipient Defined Text Field 1 Value,Funding Type Code,Program Code,Program Code Description,Recipient Project Number,Improvement Type,Improvement Type Description,Old Project Title Desc,Obligations Amount,Total Cost Amount,Congressional District,District,County Code,County Name,County Name Abbrev,Implementing Agency Locode,Rtpa Name,Mpo Name,New Project Title,New Description Col
0,0,2022-01-20,31RA002,California,S AMBAG,IIJA-A,ER01,Emergency Supplement Funding,0518000118S,16|43,Right of Way|Utilities,MONTEREY COUNTY NEAR BIG SUR 2.3 MILES NORTH OF CASTRO CANYON BRIDGE TO 0.8 MILE SOUTH OF BIG SUR RIVER BRIDGE. EMERGENCY PROJECT - PERMANENT RESTORA,2983400,3370100,|20|,|05|,53,Monterey County,|MON|,,,,Right of Way Project in Monterey County,"Right of Way Project in Monterey County, part of the Emergency Supplement Funding. (Federal Project ID: 31RA002)."


In [18]:
oct_2025 = "gs://calitp-analytics-data/data-analyses/dla/dla-iija/FMIS_Projects_Universe_IIJA_Reporting_10202025_agg.csv"

In [20]:
oct_df = pd.read_csv(oct_2025)

In [26]:
len(oct_df)

3194

In [24]:
oct_df.head(1)

Unnamed: 0.1,Unnamed: 0,Fmis Transaction Date,Project Number,Implementing Agency,Summary Recipient Defined Text Field 1 Value,Funding Type Code,Program Code,Program Code Description,Recipient Project Number,Improvement Type,Improvement Type Description,Old Project Title Desc,Obligations Amount,Total Cost Amount,Congressional District,District,County Code,County Name,County Name Abbrev,Implementing Agency Locode,Rtpa Name,Mpo Name,New Project Title,New Description Col
0,0,2022-01-20,31RA002,California,S AMBAG,IIJA-A,ER01,Emergency Supplement Funding,0518000118S,16|43,Right of Way|Utilities,MONTEREY COUNTY NEAR BIG SUR 2.3 MILES NORTH OF CASTRO CANYON BRIDGE TO 0.8 MILE SOUTH OF BIG SUR RIVER BRIDGE. EMERGENCY PROJECT - PERMANENT RESTORA,2983400,3370100,|20|,|05|,53,Monterey County,|MON|,,,,Right of Way Project in Monterey County,"Right of Way Project in Monterey County, part of the Emergency Supplement Funding. (Federal Project ID: 31RA002)."
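
The two exports differ by three rows (3197 versus 3194). One way to see which projects account for the difference is an outer merge on the project identifier with `indicator=True`. The sketch below uses toy frames; `compare_exports` is a hypothetical helper and `X000001` is a made-up project number.

```python
import pandas as pd


def compare_exports(old, new, key="Project Number"):
    """Return (added, dropped) project keys between two exports."""
    merged = pd.merge(
        old[[key]].drop_duplicates(),
        new[[key]].drop_duplicates(),
        on=key,
        how="outer",
        indicator=True,
    )
    # right_only: present only in the newer export.
    added = sorted(merged.loc[merged["_merge"] == "right_only", key])
    # left_only: present only in the older export.
    dropped = sorted(merged.loc[merged["_merge"] == "left_only", key])
    return added, dropped


old_toy = pd.DataFrame({"Project Number": ["31RA002", "Q101398"]})
new_toy = pd.DataFrame({"Project Number": ["31RA002", "Q101398", "X000001"]})

added, dropped = compare_exports(old_toy, new_toy)
print(added, dropped)
```

With the real data, the call would be `compare_exports(oct_df, jan_df)`.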


## Removing S***ba
### `data_utils`

In [None]:
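def update_program_code_list2():
    """
    Merge the updated and original IIJA program code lists,
    preferring the updated description where one exists.
    """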
    updated_codes = to_snakecase(
        pd.read_excel(
            f"{GCS_FILE_PATH}/program_codes/FY21-22ProgramCodesAsOf5-25-2022.v2_expanded090823.xlsx"
        )
    )[["iija_program_code", "new_description"]]
    original_codes = to_snakecase(
        pd.read_excel(
            f"{GCS_FILE_PATH}/program_codes/Copy of lst_IIJA_Code_20230908.xlsx"
        )
    )[["iija_program_code", "description", "program_name"]]

    program_codes = pd.merge(
        updated_codes,
        original_codes,
        on="iija_program_code",
        how="outer",
        indicator=True,
    )
    program_codes["new_description"] = program_codes["new_description"].str.strip()

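    # Fall back to the original description where no updated one exists.
    # Assign the result back: chained `fillna(..., inplace=True)` is unreliable in recent pandas.
    program_codes["new_description"] = program_codes["new_description"].fillna(
        program_codes["description"]
    )

    program_codes = program_codes.drop(columns=["description", "_merge"])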

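    def add_program_to_row(row):
        # Leave missing names untouched; the outer merge can produce NaN here.
        if pd.isna(row["program_name"]):
            return row["program_name"]
        # Append " Program" only when the name does not already contain it.
        if "Program" not in row["program_name"]:
            return row["program_name"] + " Program"
        return row["program_name"]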

    program_codes["program_name"] = program_codes.apply(add_program_to_row, axis=1)

    return program_codes
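
The function above can be illustrated on toy data. This sketch mirrors its steps (outer merge, description fallback, `" Program"` suffix) using vectorized string methods instead of `apply`. The descriptions and program names are made up, and the vectorized form is an alternative for illustration, not the code in `_data_utils`.

```python
import pandas as pd

# Hypothetical stand-ins for the two program code spreadsheets.
updated = pd.DataFrame(
    {
        "iija_program_code": ["Y240", "ER01"],
        "new_description": [" Updated text ", None],
    }
)
original = pd.DataFrame(
    {
        "iija_program_code": ["Y240", "ER01"],
        "description": ["Old Y240 text", "Old ER01 text"],
        "program_name": ["Surface Transportation Block Grant", "Emergency Relief Program"],
    }
)

codes = updated.merge(original, on="iija_program_code", how="outer")

# Strip whitespace, then fall back to the original description when missing.
codes["new_description"] = codes["new_description"].str.strip().fillna(codes["description"])

# Append " Program" where the name lacks it; na=True leaves missing names alone.
needs_suffix = ~codes["program_name"].str.contains("Program", na=True)
codes.loc[needs_suffix, "program_name"] = codes.loc[needs_suffix, "program_name"] + " Program"

result = codes.drop(columns=["description"]).set_index("iija_program_code")
print(result)
```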

### `script_utils`

In [None]:
def county_district_crosswalk() -> pd.DataFrame:
    """
    Aggregate locodes dataset to find which
    districts a county lies in.
    """
    # Load locodes
    locodes_df = _script_utils.load_locodes()

    # Load counties
    county_base = _script_utils.load_county()

    county_district = locodes_df[["district", "county_name"]].drop_duplicates()

    county_info = pd.merge(
        county_base,
        county_district,
        how="left",
        left_on="county_description",
        right_on="county_name",
    ).drop(columns=["county_name"])
    return county_info

In [None]:
test1 = county_district_crosswalk()

In [None]:
test1.head()
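
To show what the crosswalk returns, here is the same merge logic on toy frames. The county names, districts, and the `agency_name` column are invented for illustration; only the column names `district`, `county_name`, and `county_description` come from the function above.

```python
import pandas as pd

# Toy stand-in for load_county(): one row per county.
county_base = pd.DataFrame({"county_description": ["County A", "County B", "County C"]})

# Toy stand-in for load_locodes(): one row per agency.
locodes_toy = pd.DataFrame(
    {
        "agency_name": ["Agency 1", "Agency 2", "Agency 3", "Agency 4"],
        "district": [1, 1, 2, 2],
        "county_name": ["County A", "County A", "County A", "County B"],
    }
)

# Reduce the agency-level table to unique (district, county) pairs.
county_district = locodes_toy[["district", "county_name"]].drop_duplicates()

county_info = pd.merge(
    county_base,
    county_district,
    how="left",
    left_on="county_description",
    right_on="county_name",
).drop(columns=["county_name"])

print(county_info)
```

In this toy case, a county that appears in two districts yields two rows, and a county with no matching locode keeps a single row with a missing `district`, because the merge is a left join.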