# Run Functions to Add Information to Projects

To run the data through the script, update `my_file` to the file name of the most recent FMIS and QMRS export uploaded to GCS, then run the function in the `Export Data` section with your dataframe and the current date. The aggregated data will then be available in GCS.

In [1]:
import pandas as pd
from siuba import *

import _script_utils

from calitp_data_analysis.sql import to_snakecase


In [2]:
pd.set_option("display.max_columns", 100)
pd.set_option('display.max_colwidth', None)

## Read in Data and function development / Test Function

For the script function (`_script_utils.run_script`, called under `Run Script`):
* the first argument is `my_file`; update it to the file name of the most recent FMIS & QMRS export
* the second argument is the unique recipient identifier column; it should stay the same for subsequent exports
* the third argument is the aggregation level for the data. Unless otherwise specified, use `agg`, which gives one row per project
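
To make the `agg` level concrete, here is a minimal sketch of what "one row per project" could look like in pandas. It is not the implementation inside `_script_utils`: the helper name `aggregate_to_project`, the sample program codes, and the choice to sum amounts and join program codes with `|` are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical mini-export: project 5004049 appears under two funding programs.
# The program codes Y001 and Y230 are made up for this example.
raw = pd.DataFrame(
    {
        "project_number": ["5004049", "5004049", "31RA002"],
        "program_code": ["Y001", "Y230", "ER01"],
        "obligations_amount": [100000.0, 250000.0, 531100.0],
        "total_cost_amount": [120000.0, 300000.0, 600000.0],
    }
)


def aggregate_to_project(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse funding-program rows to one row per project (illustrative only)."""
    return df.groupby("project_number", as_index=False).agg(
        # Keep every distinct program code, joined into a single string.
        program_code=("program_code", lambda s: "|".join(sorted(set(s)))),
        # Assumption: amounts are summed across funding programs.
        obligations_amount=("obligations_amount", "sum"),
        total_cost_amount=("total_cost_amount", "sum"),
    )


agg_example = aggregate_to_project(raw)
```

After aggregation, `len(agg_example)` equals `raw.project_number.nunique()`, which is the same property the assertions under `Testing the data` check on the real output.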

In [3]:
GCS_FILE_PATH = 'gs://calitp-analytics-data/data-analyses/dla/dla-iija'

In [4]:
my_file = "2b Output.xlsx"

### Check data

In [5]:
check_data = to_snakecase(pd.read_excel(f"{GCS_FILE_PATH}/{my_file}"))

In [6]:
check_data.head(1)

Unnamed: 0,state_local,fmis_transaction_date,program_code,program_code_description,project_number,recipient_project_number,project_title,county_code,congressional_district,project_status_description,project_description,improvement_type,improvement_type_description,total_cost_amount,obligations_amount,summary_recipient_defined_text_field_1_value,rk_locode,action_type,_3c_phase,_3c_status,iija_codes,_3c_iija_obligated,_3c_total_cost,_3c_agency_name,_3c_county,district,lp2000_location,fads_location
0,S,2022-01-20,ER01,EMERGENCY REL 2022 SUPPLEMENT,31RA002,518000118,MONTEREY COUNTY NEAR BIG SUR 2.3 MILES NORTH OF CASTRO CANYON BRIDGE TO 0.8 MILE SOUTH OF BIG SUR RIVER BRIDGE. EMERGENCY PROJECT - PERMANENT RESTORA,53,Cong Dist 20,Active,MONTEREY COUNTY NEAR BIG SUR 2.3 MILES NORTH OF CASTRO CANYON BRIDGE TO 0.8 MILE SOUTH OF BIG SUR RIVER BRIDGE. EMERGENCY PROJECT - PERMANENT RESTORATION. COMPLETE COASTAL DEVELOPMENT PERMIT REQUIREMENTS AT PFEIFFER CANYON BRIDGE.,16,Right of Way,600000.0,531100.0,S AMBAG,,,,,,,,,,,,


In [7]:
check_data.project_number.nunique()

2601

In [8]:
check_data.columns

Index(['state_local', 'fmis_transaction_date', 'program_code',
       'program_code_description', 'project_number',
       'recipient_project_number', 'project_title', 'county_code',
       'congressional_district', 'project_status_description',
       'project_description', 'improvement_type',
       'improvement_type_description', 'total_cost_amount',
       'obligations_amount', 'summary_recipient_defined_text_field_1_value',
       'rk_locode', 'action_type', '_3c_phase', '_3c_status', 'iija_codes',
       '_3c_iija_obligated', '_3c_total_cost', '_3c_agency_name', '_3c_county',
       'district', 'lp2000_location', 'fads_location'],
      dtype='object')

### Run Script

In [9]:
df = _script_utils.run_script(my_file, 'summary_recipient_defined_text_field_1_value', 'agg')

Index(['fmis_transaction_date', 'project_number', 'implementing_agency',
       'summary_recipient_defined_text_field_1_value', 'funding_type_code',
       'program_code', 'program_code_description', 'recipient_project_number',
       'improvement_type', 'improvement_type_description',
       'program_code_description_for_description', 'project_title',
       'obligations_amount', 'total_cost_amount', 'congressional_district',
       'district', 'county_code', 'county_name', 'county_name_abbrev',
       'county_name_title', 'implementing_agency_locode', 'rtpa_name',
       'mpo_name'],
      dtype='object')

  df['implementing_agency_locode'] = df['implementing_agency_locode'].str.replace('.0', '')


True
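
The `str.replace('.0', '')` line echoed in the output above appears to be the tail of a pandas warning about the `regex` default, although the warning text itself is not shown. The sketch below uses made-up locode strings to show why it matters: when `'.0'` is treated as a regular expression, the dot matches any character. It is an illustration only, not a change to `_script_utils`.

```python
import pandas as pd

# Made-up locode strings, including float-like values with a trailing ".0".
locodes = pd.Series(["5010.0", "6.0", "5904"])

# As a regex, "." matches any character, so "50", "10" and "90" are removed too.
as_regex = locodes.str.replace(".0", "", regex=True)

# As a literal string, only the characters ".0" are removed.
as_literal = locodes.str.replace(".0", "", regex=False)

# An anchored, escaped pattern removes ".0" only at the end of the string.
anchored = locodes.str.replace(r"\.0$", "", regex=True)

print(as_regex.tolist())    # ['', '6', '54']
print(as_literal.tolist())  # ['5010', '6', '5904']
print(anchored.tolist())    # ['5010', '6', '5904']
```

Passing `regex` explicitly makes the behavior independent of the installed pandas version.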

In [10]:
df.county_code.describe()

count     2601
unique     100
top         37
freq       198
Name: county_code, dtype: object

In [11]:
df2 = _script_utils.run_script2(my_file, 'summary_recipient_defined_text_field_1_value', 'agg')

'Rows with locodes filled'

both          3290
left_only        4
right_only       0
Name: _merge, dtype: int64

'Do the # of rows match?'

True

Index(['fmis_transaction_date', 'project_number', 'implementing_agency',
       'summary_recipient_defined_text_field_1_value', 'funding_type_code',
       'program_code', 'program_code_description', 'recipient_project_number',
       'improvement_type', 'improvement_type_description',
       'program_code_description_for_description', 'project_title',
       'obligations_amount', 'total_cost_amount', 'congressional_district',
       'district', 'county_code', 'county_name', 'county_name_abbrev',
       'county_name_title', 'implementing_agency_locode', 'rtpa_name',
       'mpo_name'],
      dtype='object')

  df['implementing_agency_locode'] = df['implementing_agency_locode'].str.replace('.0', '')


True
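
The `both` / `left_only` / `right_only` counts printed above have the shape of a pandas merge indicator. As a rough sketch of how such a check can be produced, the example below joins a made-up project table to a made-up locode lookup; the table contents and the join key are assumptions, and the real logic lives in `_script_utils`.

```python
import pandas as pd

# Hypothetical project rows and a hypothetical agency-to-locode lookup.
projects = pd.DataFrame(
    {
        "project_number": ["A1", "A2", "A3"],
        "implementing_agency": ["Alpha City", "Beta County", "Gamma Town"],
    }
)
locode_lookup = pd.DataFrame(
    {
        "implementing_agency": ["Alpha City", "Beta County"],
        "implementing_agency_locode": ["5001", "5902"],
    }
)

# indicator=True adds a `_merge` column recording where each row was found.
merged = projects.merge(
    locode_lookup, on="implementing_agency", how="left", indicator=True
)

# 'both' rows received a locode; 'left_only' rows had no match in the lookup.
merge_counts = merged["_merge"].value_counts()

# A left join on a unique lookup key should not change the number of rows.
rows_match = len(merged) == len(projects)
```

Rows counted as `left_only` are the ones still missing a locode after the join, so they are the ones worth inspecting by hand.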

In [12]:
df2.columns

Index(['fmis_transaction_date', 'project_number', 'implementing_agency',
       'summary_recipient_defined_text_field_1_value', 'funding_type_code',
       'program_code', 'program_code_description', 'recipient_project_number',
       'improvement_type', 'improvement_type_description',
       'old_project_title_desc', 'obligations_amount', 'total_cost_amount',
       'congressional_district', 'district', 'county_code', 'county_name',
       'county_name_abbrev', 'implementing_agency_locode', 'rtpa_name',
       'mpo_name', 'new_project_title', 'new_description_col'],
      dtype='object')

### Testing the data

In [None]:
## when grouping by funding program (one project can have multiple rows), len is 1612 for the 2023 version of the data
## assert that the length of the df equals the number of unique projects
assert len(df) == check_data.project_number.nunique()

In [None]:
assert len(df2) == check_data.project_number.nunique()

In [None]:
len(df2), len(df)
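
If one of the assertions above fails, it helps to know which projects are responsible. The helper below is a hypothetical sketch (the name `check_one_row_per_project` is not part of `_script_utils`) that lists project numbers that are duplicated in the aggregated output or missing from it.

```python
import pandas as pd


def check_one_row_per_project(agg_df, raw_df, key="project_number"):
    """Return project numbers that break the one-row-per-project expectation."""
    # Projects that appear on more than one row of the aggregated output.
    duplicated = agg_df.loc[agg_df[key].duplicated(keep=False), key].unique().tolist()
    # Projects present in the raw export but absent from the aggregated output.
    missing = sorted(set(raw_df[key]) - set(agg_df[key]))
    return {"duplicated": duplicated, "missing": missing}


# Made-up example: "B2" is duplicated and "C3" was dropped.
raw_example = pd.DataFrame({"project_number": ["A1", "B2", "B2", "C3"]})
agg_bad = pd.DataFrame({"project_number": ["A1", "B2", "B2"]})
report = check_one_row_per_project(agg_bad, raw_example)
```

With the notebook's own dataframes, the call would be `check_one_row_per_project(df, check_data)`; empty lists in both entries mean the aggregation behaved as expected.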

In [None]:
## check one project with multiple funding codes
df >> filter(_.project_number == '5004049')

In [None]:
df2 >> filter(_.project_number == '5004049')

In [None]:
df.implementing_agency.value_counts().head()

In [None]:
df2.implementing_agency.value_counts().head()

In [None]:
df2.columns

## Export Data

In [None]:
### name the export file before sending it to GCS
### use the current date (MMDDYYYY) followed by the aggregation level, e.g. "04242025_agg"

In [None]:
# _script_utils.export_to_gcs(df, "01302025_agg")

In [None]:
_script_utils.export_to_gcs(df2, "04242025_agg")
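
The export names used above follow an `MMDDYYYY_agg` pattern. To avoid typing the date by hand, the name can be built from a date object. This is a small sketch: `export_name` is a hypothetical helper, not a function in `_script_utils`.

```python
from datetime import date


def export_name(run_date: date, level: str = "agg") -> str:
    """Build an export name such as '04242025_agg' (MMDDYYYY plus level)."""
    return f"{run_date.strftime('%m%d%Y')}_{level}"


example_name = export_name(date(2025, 4, 24))  # '04242025_agg'
todays_name = export_name(date.today())
```

The result can then be passed as the second argument, for example `_script_utils.export_to_gcs(df2, export_name(date.today()))`.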