# Run Functions to Add Information to Projects

To run the data through the script, update the `my_file` path to the most recent FMIS and QMRS export uploaded to GCS, then run the function in the `Export Data` section with your dataframe and the current date. The aggregated data will then be available in GCS.

In [1]:
import pandas as pd
from siuba import *

import _script_utils

from calitp_data_analysis.sql import to_snakecase


In [2]:
pd.set_option("display.max_columns", 100)
pd.set_option('display.max_colwidth', None)

## Read in Data and function development / Test Function

For the following function:
* update the file path for `my_file` to the most recent file name of the FMIS & QMRS export
* the second argument is the column holding the unique recipient identifier; it should stay the same for subsequent exports
* the third argument is the aggregation level you want for the data. Unless otherwise specified, it should be `agg`, which gives one row per project

In [3]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/dla/dla-iija"

In [4]:
my_file = "Copy of 2b to DLA Output.xlsx"

In [5]:
og_file = "IIJA Project List 01_2025.xlsx"



### Check data

In [6]:
check_data = to_snakecase(pd.read_excel(f"{GCS_FILE_PATH}/{my_file}"))

In [7]:
og_file = to_snakecase(pd.read_excel(f"{GCS_FILE_PATH}/{og_file}"))

In [8]:
set(check_data.columns) - set(og_file.columns)

{'state_local'}

In [9]:
set(og_file.columns) - set(check_data.columns)

{'comp', 'efis_id', 'pid_check1', 'pid_check2', 'pid_district', 'rk_locode'}
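
The two set differences above can be wrapped in a small helper that reports both directions at once. This is a minimal sketch: `compare_columns` is a hypothetical name, not part of `_script_utils`, and it accepts any iterable of column names.

```python
def compare_columns(new_cols, old_cols):
    """Return the column names unique to each side, sorted for stable output."""
    new_set, old_set = set(new_cols), set(old_cols)
    return {
        "only_in_new": sorted(new_set - old_set),
        "only_in_old": sorted(old_set - new_set),
    }


# Illustrative subset of the column names shown above
new_export = ["state_local", "project_number", "county_code"]
old_export = ["project_number", "county_code", "rk_locode", "efis_id"]

diff = compare_columns(new_export, old_export)
print(diff)
```

With the dataframes in this notebook, the call would be `compare_columns(check_data.columns, og_file.columns)`.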

In [16]:
og_file.columns

Index(['fmis_transaction_date', 'program_code', 'program_code_description',
       'pid_district', 'project_number', 'recipient_project_number',
       'pid_check1', 'efis_id', 'pid_check2', 'project_title', 'rk_locode',
       'county_code', 'congressional_district', 'project_status_description',
       'project_description', 'improvement_type',
       'improvement_type_description', 'total_cost_amount',
       'obligations_amount', 'summary_recipient_defined_text_field_1_value',
       'comp'],
      dtype='object')

In [15]:
check_data.columns

Index(['state_local', 'fmis_transaction_date', 'program_code',
       'program_code_description', 'project_number',
       'recipient_project_number', 'project_title', 'county_code',
       'congressional_district', 'project_status_description',
       'project_description', 'improvement_type',
       'improvement_type_description', 'total_cost_amount',
       'obligations_amount', 'summary_recipient_defined_text_field_1_value'],
      dtype='object')

In [11]:
check_data.project_number.nunique()

2601

### Run Script

In [12]:
df = _script_utils.run_script(my_file, 'summary_recipient_defined_text_field_1_value', 'agg')

  df['implementing_agency_locode'] = df['implementing_agency_locode'].str.replace('.0', '')


True
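
The indented line in the output above looks like the source line echoed by a pandas warning about `str.replace`. That is an assumption, since the warning text itself is not shown. If the pattern `'.0'` is interpreted as a regular expression, `.` matches any character, so a value such as `100.0` would lose more than its trailing `.0`. A minimal sketch of a stricter alternative that removes only a trailing literal `.0` (the sample values are made up):

```python
import pandas as pd

# Hypothetical locode values, as they might look after a float-to-string cast
locodes = pd.Series(["5004.0", "100.0", "10"])

# Escape the dot and anchor at the end of the string so that only a
# trailing ".0" is removed; regex=True makes the interpretation explicit.
cleaned = locodes.str.replace(r"\.0$", "", regex=True)

print(cleaned.tolist())
```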

In [13]:
df.county_code.describe()

count     2601
unique     100
top         37
freq       198
Name: county_code, dtype: object

In [14]:
df2 = _script_utils.run_script2(my_file, 'summary_recipient_defined_text_field_1_value', 'agg')

AttributeError: 'DataFrame' object has no attribute 'rk_locode'
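
The error above is consistent with the column comparison earlier in the notebook: `rk_locode` appears in the older file but not in the new export. A minimal sketch of a pre-check that fails with a clearer message follows. `require_columns` is a hypothetical helper, and the required list is only illustrative, since the columns `run_script2` actually needs are not shown here.

```python
def require_columns(columns, required):
    """Raise a ValueError naming every required column that is absent."""
    missing = sorted(set(required) - set(columns))
    if missing:
        raise ValueError(f"Missing required columns: {missing}")


# Illustrative: the new export lacks 'rk_locode'
export_columns = ["state_local", "project_number", "county_code"]

try:
    require_columns(export_columns, ["project_number", "rk_locode"])
except ValueError as err:
    message = str(err)
    print(message)
```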

### Testing the data

In [None]:
## when grouping by funding program (one project can have multiple rows), len is 1612 for the 2023 version of the data
## assert that the length of the df equals the number of unique projects
assert len(df) == check_data.project_number.nunique()

In [None]:
assert len(df2) == check_data.project_number.nunique()
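
If either assertion fails, it helps to see which projects occupy more than one row. A minimal sketch using a small made-up dataframe; only the `project_number` and `program_code` column names come from the notebook, and all values are illustrative.

```python
import pandas as pd

# Made-up data: project 5004049 appears twice, once per funding program
sample = pd.DataFrame(
    {
        "project_number": ["5004049", "5004049", "5006001"],
        "program_code": ["Y001", "Y240", "Y001"],
    }
)

# keep=False marks every row of a duplicated project, not just the repeats
dupes = sample[sample.duplicated("project_number", keep=False)]
dupe_ids = sorted(dupes["project_number"].unique())

print(dupe_ids)
```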

In [None]:
len(df2), len(df)

In [None]:
## check one project with multiple funding codes
df >> filter(_.project_number == '5004049')

In [None]:
df2 >> filter(_.project_number == '5004049')

In [None]:
df.implementing_agency.value_counts().head()

In [None]:
df2.implementing_agency.value_counts().head()

## Export Data

In [None]:
### rename the file for export to GCS
### use date to rename
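
The export names used in the cells that follow appear to use an `MMDDYYYY_agg` pattern (an inference from names such as `01302025_agg`). Under that assumption, the name can be built from a date instead of typed by hand. `export_name` is a hypothetical helper, not part of `_script_utils`.

```python
from datetime import date


def export_name(day, suffix="agg"):
    """Format a date as MMDDYYYY and append the aggregation suffix."""
    return f"{day.strftime('%m%d%Y')}_{suffix}"


name = export_name(date(2025, 2, 10))
print(name)
```

Passing `date.today()` would stamp the export with the current date.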

In [None]:
# _script_utils.export_to_gcs(df, "01302025_agg")

In [None]:
_script_utils.export_to_gcs(df2, "02102025_agg")