# Exploring and Cleaning Data

In the following code, I explore and clean the data provided in three Excel spreadsheets. These spreadsheets contain budget information for the City of Missoula and the corresponding Program Inventory information about the programs that the City funds. 

The data is messy: it comes both from the City's financial software and from a separate tool that collects survey responses. This notebook loads the files into the environment and applies cleaning functions to prepare the data for use in Power BI. In Power BI, I will create a dashboard that covers all of the currently funded programs and provides granular breakdowns of financial and other important information for each. 

Once the data is cleaned, I will run some exploratory analysis on it to see how it behaves. 

In [3]:
### Importing Libraries
# Standard Libraries
import os

# Text Cleaning
import re

# Data Handling
import pandas as pd
import numpy as np

# Excel Handling
import openpyxl

# Additional Cleaning Utilities; each labeled below
from tqdm import tqdm  # Adds progress bars to loops
import chardet  # Detects encoding issues
import janitor  # Chaining and cleaning helpers (pyjanitor)

# Visualization
import missingno as msno
import matplotlib.pyplot as plt

In [None]:
## define the file paths 
data_path = "data/"

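## Load files: all sheets of the revenue/expense workbook (sheet_name=None returns a dict of DataFrames keyed by sheet name),
## and only the first sheet of the expenditure status and program inventory workbooks.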
revenue_expense = pd.read_excel(
    os.path.join(data_path, "FY24_Revenue_Expense_Data.xlsx"), 
    sheet_name=None, engine="openpyxl")
# expenditure_status = pd.read_excel(
#     os.path.join(data_path, "FY24_Expenditure_Status.xlsx"), 
#     sheet_name=0, skiprows=6, engine="openpyxl")


program_inventory = pd.read_excel(
    os.path.join(data_path, "Program_Inventory_Internal_Data_Collection.xlsx"), 
    sheet_name=0, engine="openpyxl")

In [24]:
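# Reload the first sheet without skiprows to inspect the raw layout.
# Note: this overwrites expenditure_status; the cleaning cells further down
# expect the version read with skiprows=6 (see the commented-out call above).
expenditure_status = pd.read_excel(
    os.path.join(data_path, "FY24_Expenditure_Status.xlsx"), 
    sheet_name=0, engine="openpyxl")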

In [25]:
expenditure_status

,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,
4,1000,GENERAL FUND,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14133,Total DEPRECIATION,0,,,,,,,,,,,,,,,
14134,,,,,,,,,,,,,,,,,
14135,Total GENERAL FIXED ASSETS ACCOUNT GROUP,0,,,,,,,,,,,,,,,
14136,Grand Total,338487828.71,,,,,,,,,,,,,,,
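
The raw preview shows that the sheet has no usable header row: every column is `Unnamed: N`, and the report's own header cell (`Account Number`) sits several rows down. Rather than hard-coding `skiprows=6`, the header row could be located programmatically. The following is a minimal sketch on a small made-up frame; `find_header_row` and the sample values are hypothetical and not part of the notebook.

```python
import pandas as pd

def find_header_row(df, marker="Account Number", column=0):
    """Return the position of the first row whose cell in `column` contains `marker`, or None."""
    cells = df.iloc[:, column].astype(str)
    hits = cells.str.contains(marker, regex=False)
    return int(hits.to_numpy().argmax()) if hits.any() else None

# Made-up stand-in for the raw sheet (assumption: same layout as the real export)
raw = pd.DataFrame({
    "Unnamed: 0": [None, "1000", "210", "\n_x000D_Account Number",
                   "1000.210.410100.110.000   SALARIES AND WAGES"],
    "Unnamed: 1": [None, "GENERAL FUND", "CITY COUNCIL", "Adjusted Appropriation", 214440],
})

header_idx = find_header_row(raw)
print(header_idx)
```

The returned position could then be used to slice the frame or to decide how many rows to skip on a second read.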


In [27]:
### working code with prof
# Condition 1: keep rows where "Unnamed: 0" contains at least one digit.
# Text-only cells (e.g. totals) and non-string cells evaluate to False.
condition_1 = expenditure_status["Unnamed: 0"].str.contains(r'\d', na=False)

# Condition 2: keep rows where "Unnamed: 1" has a value (is not NaN)
condition_2 = expenditure_status["Unnamed: 1"].notna()

# Condition 3: drop rows where "Unnamed: 0" is purely numeric AND "Unnamed: 1" is purely alphabetic.
# Note: str.isalpha() is False for strings that contain spaces, so a row such as
# "1000" / "GENERAL FUND" is not dropped by this test.
condition_3 = ~((expenditure_status["Unnamed: 0"].str.isnumeric()) & expenditure_status["Unnamed: 1"].str.isalpha())

# Condition 4: drop rows where "Unnamed: 0" and "Unnamed: 1" are both purely alphabetic
condition_4 = ~((expenditure_status["Unnamed: 0"].str.isalpha()) & expenditure_status["Unnamed: 1"].str.isalpha())

# Apply all conditions to subset the DataFrame
filtered_df = expenditure_status[condition_1 & condition_2 & condition_3 & condition_4]

# Keep only the first two columns
filtered_df = filtered_df.iloc[:, :2]

# Display the resulting DataFrame
print(filtered_df)

                                              Unnamed: 0  \
4                                                   1000   
5                                                    210   
6                                \n_x000D_Account Number   
9           1000.210.410100.110.000   SALARIES AND WAGES   
10      1000.210.410100.140.000   EMPLOYER CONTRIBUTIONS   
...                                                  ...   
14111                                               9000   
14113                            \n_x000D_Account Number   
14116             9000.900.510000.830.000   DEPRECIATION   
14120   9000.900.510000.900.000   ASSET DISPOSAL EXPENSE   
14127  9000.900.510360.810.000   LOSS ON SALE OF FIXE...   

                               Unnamed: 1  
4                            GENERAL FUND  
5                            CITY COUNCIL  
6                  Adjusted Appropriation  
9                                  214440  
10                                 199915  
...            
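
The printed rows suggest that the first column mixes several kinds of lines: a bare fund code (`1000`), a bare department code (`210`), repeated header cells (`\n_x000D_Account Number`), and account lines where the dotted account number and its description share one cell. As an alternative to stacking boolean conditions, each cell could be labeled with a regular expression. The sketch below is only an illustration: the `classify_cell` helper, the label names, and the assumption that fund codes have four digits and department codes three are inferred from the rows shown above, not confirmed by the source data.

```python
import re

# Dotted account number (3 to 5 segments), whitespace, then the description
ACCOUNT_RE = re.compile(r"^\d{4}\.\d{3}\.\d{6}(?:\.\d{3}){0,2}\s+.*$", re.DOTALL)

def classify_cell(value):
    """Label one cell from the first column of the raw expenditure sheet."""
    if value is None or (isinstance(value, float) and value != value):  # None or NaN
        return "blank"
    text = str(value).strip()
    if re.fullmatch(r"\d{4}", text):
        return "fund"
    if re.fullmatch(r"\d{3}", text):
        return "department"
    if "Account Number" in text:
        return "header"
    if ACCOUNT_RE.match(text):
        return "account"
    return "other"

samples = [
    "1000",
    "210",
    "\n_x000D_Account Number",
    "1000.210.410100.110.000   SALARIES AND WAGES",
    "Total DEPRECIATION",
    None,
]
labels = [classify_cell(s) for s in samples]
print(labels)
```

Applied with `expenditure_status["Unnamed: 0"].map(classify_cell)`, the labels could then drive the filtering, for example by keeping only rows labeled `account`.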

In [5]:
## Display initial previews
print("Revenue Expense Loaded:", revenue_expense.keys())
print("\n Expenditure Status Preview:")
print(expenditure_status.head())

print("\n Program Inventory Preview:")
print(program_inventory.head())

Revenue Expense Loaded: dict_keys(['Rev', 'Exp', 'Status'])

 Expenditure Status Preview:
                                                 210            CITY COUNCIL  \
0                            \n_x000D_Account Number  Adjusted Appropriation   
1  1000.210.410100                LEGISLATION SER...                     NaN   
2       1000.210.410100.100        PERSONAL SERVICES                     NaN   
3       1000.210.410100.110.000   SALARIES AND WAGES                  214440   
4   1000.210.410100.140.000   EMPLOYER CONTRIBUTIONS                  199915   

   Unnamed: 2  Unnamed: 3  Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  \
0         NaN         NaN         NaN         NaN         NaN         NaN   
1         NaN         NaN         NaN         NaN         NaN         NaN   
2         NaN         NaN         NaN         NaN         NaN         NaN   
3         NaN         NaN         NaN         NaN         NaN         NaN   
4         NaN         NaN         NaN       

In [6]:
## Define Cleaning Functions
def drop_unnamed_columns(df):
    """
    Drop columns with headers like 'Unnamed: XX'.
    These are usually trailing blanks from Excel exports.
    """
    # astype(str) keeps this working when some column labels are not strings
    unnamed = df.columns.astype(str).str.startswith("Unnamed")
    return df.loc[:, ~unnamed]

def clean_numeric_column(column, zfill=None):
    """
    Remove trailing '.0' from numeric-like strings and optionally zero-pad.
    Missing values stay missing instead of becoming the string 'nan'.
    """
    cleaned = column.astype(str).str.replace(r"\.0$", "", regex=True)
    cleaned = cleaned.where(column.notna())
    return cleaned.str.zfill(zfill) if zfill else cleaned

def clean_identifiers(df):
    """
    Standardize identifiers: Fund #, Dept #, Activity Code.
    Pad and clean these for consistency across datasets.
    """
    if "Fund #" in df.columns:
        df["Fund #"] = clean_numeric_column(df["Fund #"])
    if "Dept #" in df.columns:
        df["Dept #"] = clean_numeric_column(df["Dept #"], zfill=3)
    if "Activity Code" in df.columns:
        df["Activity Code"] = clean_numeric_column(df["Activity Code"], zfill=6)
    return df

def expand_multicolumn_headers(df, column_expansion_map):
    """
    Expand wide compound headers into real sub-columns.
    Assumes compound headers are followed by unnamed columns in Excel.
    """
    columns = list(df.columns)
    new_columns = []
    i = 0

    while i < len(columns):
        col = columns[i]
        if col in column_expansion_map:
            expanded = column_expansion_map[col]
            new_columns.extend(expanded)
            # Skip the compound header and the unnamed placeholders after it
            i += len(expanded)
        else:
            new_columns.append(col)
            i += 1

    df.columns = new_columns
    return df

def apply_department_and_fund_mappings(df, dept_map, fund_map):
    """
    Apply lookup maps for human-readable department and fund names.
    Fallback to 'REDACTED' where no match is found.
    """
    df["Department"] = df["Dept #"].map(dept_map).fillna("REDACTED")
    if "Fund" in df.columns:
        df["Fund Name"] = df["Fund"].map(lambda f: fund_map.get(f, "REDACTED"))
    return df

def clean_program_inventory(df, column_expansion_map, dept_map, fund_map):
    """
    Full cleaning pipeline for the Program Inventory dataset.
    Applies identifier cleanup, expands compound headers, drops empty cols,
    and maps department/fund names.
    """
    df = df.rename(columns={"Org": "Dept #"})
    df = clean_identifiers(df)
    df = expand_multicolumn_headers(df, column_expansion_map)
    df = drop_unnamed_columns(df)
    df = apply_department_and_fund_mappings(df, dept_map, fund_map)
    return df

def remove_leading_trailing_whitespace(df):
    """
    Trim whitespace from all object (text-like) columns.
    Note: astype(str) converts every value in those columns to a string,
    so numbers become text and missing values become the literal 'nan'.
    """
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype(str).str.strip()
    return df

def standardize_case(df, cols):
    """
    Convert selected columns to title case.
    Useful for labels like 'department' and 'fund name'.
    """
    for col in cols:
        if col in df.columns:
            df[col] = df[col].astype(str).str.title()
    return df

def strip_whitespace_and_standardize(df):
    """
    Combo function to:
    - Remove leading/trailing whitespace from text
    - Title-case key label fields
    """
    df = remove_leading_trailing_whitespace(df)
    df = standardize_case(df, cols=["Department", "Department (H6)", "Fund Name"])
    return df
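
One caveat with `remove_leading_trailing_whitespace`: because it calls `astype(str)`, missing values in object columns turn into the text `nan` (this shows up in the `CITY COUNCIL` column of the cleaned sample further down) and numbers turn into text. A minimal sketch of a variant that strips only genuine strings and leaves everything else untouched follows; `strip_strings_only` is a hypothetical name and not part of the notebook.

```python
import numpy as np
import pandas as pd

def strip_strings_only(df):
    """Strip whitespace from str values in object columns; leave NaN and numbers as they are."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return out

# Made-up column mixing text, a missing value, and a number
demo = pd.DataFrame({"label": ["  City Council ", np.nan, 214440]})
cleaned = strip_strings_only(demo)
print(cleaned["label"].tolist())
```

Returning a copy also avoids modifying the caller's DataFrame in place, which is a design choice rather than a requirement.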

In [7]:
## Writing the docstrings inside each function makes them easy to look up later with help(function_name), e.g.:
## help(expand_multicolumn_headers)

In [8]:
### Potential cleaning functions bank: not currently used, but I might add these to polish things up or remove them altogether. 

# def validate_numeric_ranges(df, columns, min_val=0):
#     """
#     Remove rows with negative values in numeric columns.
#     Helps with cleaning FTE, cost, budget fields before visualization.
#     """
#     for col in columns:
#         if col in df.columns:
#             df = df[df[col] >= min_val]
#     return df

# def remove_duplicates(df, subset=None):
#     """
#     Drop duplicates from a DataFrame.
#     'subset' can be used to define which columns to check.
#     """
#     return df.drop_duplicates(subset=subset)

In [9]:
## Data Cleaning Process for Expenditure Status
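# Remove Unnamed columns and rename the first column
df_expenditure_status = drop_unnamed_columns(expenditure_status).copy()
df_expenditure_status = df_expenditure_status.rename(
    columns={df_expenditure_status.columns[0]: "Account Number"})

# Split account structure
# Note: each cell holds the account number followed by its description, so the last
# numeric segment of a row keeps the description text and is coerced to <NA> below.
split_cols = ["Fund #", "Dept #", "Activity Code", "Object Code", "Sub-object Code"]
split_data = df_expenditure_status["Account Number"].astype(str).str.split(".", expand=True, n=4)
split_data.columns = split_cols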

# Merge into main DataFrame
df_expenditure_status = pd.concat([df_expenditure_status, split_data], axis=1)

# Convert identifiers to proper numeric types
for col in split_cols:
    df_expenditure_status[col] = pd.to_numeric(df_expenditure_status[col], errors="coerce").astype("Int64")

# Fill missing fund and department info
df_expenditure_status[["Fund #", "Dept #"]] = df_expenditure_status[["Fund #", "Dept #"]].ffill()

# Drop rows missing activity code
df_expenditure_status = df_expenditure_status.dropna(subset=["Activity Code"]).reset_index(drop=True)
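
Because each `Account Number` cell also carries the line's description, splitting on `.` leaves text attached to the last numeric segment, and `pd.to_numeric(..., errors="coerce")` then turns that segment into `<NA>`. One possible approach is to separate the dotted code from the description before splitting. The sketch below does this with `str.extract` on two sample values taken from the previews above; the regular expression and the `account_code` / `description` names are assumptions for illustration.

```python
import pandas as pd

accounts = pd.Series([
    "1000.210.410100.110.000   SALARIES AND WAGES",
    "1000.210.410100.100        PERSONAL SERVICES",
])

# Leading dotted code, then whitespace, then the free-text description
parts = accounts.str.extract(
    r"^\s*(?P<account_code>\d+(?:\.\d+)*)\s+(?P<description>.*)$")

split_cols = ["Fund #", "Dept #", "Activity Code", "Object Code", "Sub-object Code"]
codes = parts["account_code"].str.split(".", expand=True)
codes = codes.reindex(columns=range(len(split_cols)))  # pad if no row has all five parts
codes.columns = split_cols
codes = codes.apply(pd.to_numeric, errors="coerce").astype("Int64")

result = pd.concat([parts["description"], codes], axis=1)
print(result)
```

Note that converting to `Int64` drops zero padding (`000` becomes `0`); if the padded form matters downstream, the segments could be kept as strings instead.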

In [None]:
# Extract unique Dept and Fund codes
dept_mapping_df = df_expenditure_status[["Dept #"]].drop_duplicates().dropna()
fund_mapping_df = df_expenditure_status[["Fund #"]].drop_duplicates().dropna()

# Manually assigned readable department names
dept_name_lookup = {
    210: "City Council", 220: "Mayor", 230: "Finance", 240: "Human Resources",
    250: "Legal", 260: "Police", 270: "Fire", 280: "Public Works",
    290: "Parks & Recreation", 300: "Planning & Development", 310: "Library",
    320: "IT Services", 330: "Community Development", 340: "Housing Services"
}

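# Build mapping dictionaries keyed by code (not by row index).
# Program Inventory stores Dept # as zero-padded strings after clean_identifiers().
dept_map = {
    str(int(code)).zfill(3): dept_name_lookup.get(int(code), "REDACTED")
    for code in dept_mapping_df["Dept #"]
}
# No fund names are available yet, so each fund code maps to itself as a placeholder label.
# Assumes the Program Inventory "Fund" column holds integer codes.
fund_map = {int(code): str(int(code)) for code in fund_mapping_df["Fund #"]}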

# Map department names to expenditure
df_expenditure_status["Department"] = df_expenditure_status["Dept #"].map(dept_name_lookup).fillna("REDACTED")

#### add rows to extract and remove to sub df for depts, etc. look into this & reorder cleaning tasks
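
Since the department names are assigned by hand, any code that appears in the data but not in `dept_name_lookup` silently falls back to `REDACTED`. A small sketch of listing those codes before mapping is shown here; the `observed_codes` values are made up for illustration, and in the notebook they would come from `dept_mapping_df["Dept #"]`.

```python
# Shortened copy of the lookup, for illustration only
dept_name_lookup = {210: "City Council", 220: "Mayor", 230: "Finance"}

# Hypothetical codes; in practice: dept_mapping_df["Dept #"].astype(int).tolist()
observed_codes = [210, 230, 900, 210]

# Codes present in the data that have no readable name yet
unmapped = sorted(set(observed_codes) - set(dept_name_lookup))
print("Codes without a department name:", unmapped)
```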

In [11]:
# Preview column names to see where compound headers are followed by Unnamed cols
print("\n🧪 Raw Program Inventory Columns:")
for i, col in enumerate(program_inventory.columns):
    print(f"{i}: {col}")


🧪 Raw Program Inventory Columns:
0: Fund
1: Org
2: Activity
3: Program Title (H8)
4: Requested Title Change (I9)
5: Department (H6)
6: FTEs (H36)
7: Personnel (G27)
8: O&M (G28)
9: Debt (G29)
10: Grant (G30)
11: Transfers (G31)
12: Capital (G32)
13: Total Expenditures (G33)
14: Cost Recovery (E58, P24)
15: Unnamed: 15
16: Description (E12)
17: Additional Activities (E20)
18: Mandate (E41, H41, E43)
19: Unnamed: 19
20: Unnamed: 20
21: Service Level (E47, H47, E49)
22: Unnamed: 22
23: Unnamed: 23
24: Reliance & Interdependencies (E53, E55)
25: Unnamed: 25
26: Strategic Goal (E64, E66, E68, E74, E80)
27: Unnamed: 27
28: Unnamed: 28
29: Unnamed: 29
30: Unnamed: 30
31: Trend (Demand) (E87, E89)
32: Unnamed: 32
33: Risk (E93, E95)
34: Unnamed: 34


In [12]:
## Define Column Expansion Map for Program Inventory
### Placeholder names for now, to test that the code works; will update once the final column names are ready.
column_expansion_map = {
    "Cost Recovery (E58, P24)": ["cost_recovery_e58", "cost_recovery_p24"],
    "Mandate (E41, H41, E43)": ["mandate_e41", "mandate_h41", "mandate_e43"],
    "Service Level (E47, H47, E49)": ["service_level_e47", "service_level_h47", "service_level_e49"],
    "Reliance & Interdependencies (E53, E55)": ["reliance_e53", "reliance_e55"],
    "Strategic Goal (E64, E66, E68, E74, E80)": [
        "strategic_goal_e64", "strategic_goal_e66", "strategic_goal_e68",
        "strategic_goal_e74", "strategic_goal_e80"
    ],
    "Trend (Demand) (E87, E89)": ["trend_demand_e87", "trend_demand_e89"],
    "Risk (E93, E95)": ["risk_e93", "risk_e95"]
}
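
`expand_multicolumn_headers` assumes that each compound header is followed by exactly `len(names) - 1` `Unnamed` columns. If the workbook layout changes, the renaming would shift without raising an obvious error. A minimal sketch of a pre-check over the raw column list follows; `check_expansion_map` is a hypothetical helper, and the sample columns are a shortened version of the listing above in which the `Risk` header is deliberately left without its trailing column.

```python
def check_expansion_map(columns, expansion_map):
    """Return messages for headers whose trailing 'Unnamed' count does not match the map."""
    columns = [str(c) for c in columns]
    problems = []
    for header, names in expansion_map.items():
        if header not in columns:
            problems.append(f"missing header: {header}")
            continue
        start = columns.index(header) + 1
        trailing = 0
        # Count consecutive 'Unnamed' columns right after the compound header
        while start + trailing < len(columns) and columns[start + trailing].startswith("Unnamed"):
            trailing += 1
        if trailing != len(names) - 1:
            problems.append(
                f"{header}: expected {len(names) - 1} unnamed columns, found {trailing}")
    return problems

raw_columns = [
    "Cost Recovery (E58, P24)", "Unnamed: 15", "Description (E12)",
    "Mandate (E41, H41, E43)", "Unnamed: 19", "Unnamed: 20",
    "Risk (E93, E95)",
]
sample_map = {
    "Cost Recovery (E58, P24)": ["cost_recovery_e58", "cost_recovery_p24"],
    "Mandate (E41, H41, E43)": ["mandate_e41", "mandate_h41", "mandate_e43"],
    "Risk (E93, E95)": ["risk_e93", "risk_e95"],
}
problems = check_expansion_map(raw_columns, sample_map)
print(problems)
```

In the notebook, this could be called as `check_expansion_map(program_inventory.columns, column_expansion_map)` before running the cleaning pipeline, with an empty list meaning the layout matches the map.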

In [13]:
## Follow through with program inventory clean
df_program_inventory = clean_program_inventory(
    program_inventory,
    column_expansion_map,
    dept_map,
    fund_map
)

In [14]:
# Final polish before export
df_expenditure_status = strip_whitespace_and_standardize(df_expenditure_status)
df_program_inventory = strip_whitespace_and_standardize(df_program_inventory)

In [15]:
## Preview cleaned column structure
print("\n🧪 Cleaned Program Inventory Columns:")
for i, col in enumerate(df_program_inventory.columns):
    print(f"{i}: {col}")


🧪 Cleaned Program Inventory Columns:
0: Fund
1: Dept #
2: Activity
3: Program Title (H8)
4: Requested Title Change (I9)
5: Department (H6)
6: FTEs (H36)
7: Personnel (G27)
8: O&M (G28)
9: Debt (G29)
10: Grant (G30)
11: Transfers (G31)
12: Capital (G32)
13: Total Expenditures (G33)
14: cost_recovery_e58
15: cost_recovery_p24
16: Description (E12)
17: Additional Activities (E20)
18: mandate_e41
19: mandate_h41
20: mandate_e43
21: service_level_e47
22: service_level_h47
23: service_level_e49
24: reliance_e53
25: reliance_e55
26: strategic_goal_e64
27: strategic_goal_e66
28: strategic_goal_e68
29: strategic_goal_e74
30: strategic_goal_e80
31: trend_demand_e87
32: trend_demand_e89
33: risk_e93
34: risk_e95
35: Department
36: Fund Name


In [16]:
# Display the final cleaned datasets
print("\n✅ Final Cleaned Expenditure Status (Sample):")
print(df_expenditure_status.head(5))

print("\n✅ Final Cleaned Program Inventory (Sample):")
print(df_program_inventory.head(5))


✅ Final Cleaned Expenditure Status (Sample):
                                      Account Number CITY COUNCIL  Fund #  \
0       1000.210.410100.100        PERSONAL SERVICES          nan    1000   
1       1000.210.410100.110.000   SALARIES AND WAGES       214440    1000   
2   1000.210.410100.140.000   EMPLOYER CONTRIBUTIONS       199915    1000   
3  1000.210.410100.141.000   STATE RETIREMENT CON...          217    1000   
4                1000.210.410100.200        SUPPLIES          nan    1000   

   Dept #  Activity Code  Object Code  Sub-object Code    Department  
0     210         410100         <NA>             <NA>  City Council  
1     210         410100          110             <NA>  City Council  
2     210         410100          140             <NA>  City Council  
3     210         410100          141             <NA>  City Council  
4     210         410100         <NA>             <NA>  City Council  

✅ Final Cleaned Program Inventory (Sample):
   Fund Dept #  Acti

In [17]:
### Coming back to this
## Drop leftover Excel headers and rename suffix-heavy columns

# Drop possible header artifact in expenditure data
# df_expenditure_status = df_expenditure_status.drop(columns=["city_council"], errors="ignore")

# Optionally rename long-form suffix columns for clarity in Power BI visuals
# df_program_inventory = df_program_inventory.rename(columns={
#     "program_title_h8_": "program_title",
#     "requested_title_change_i9_": "requested_title_change",
#     "department_h6_": "department_label"
# })

In [18]:
## Normalize all column headers using pyjanitor
df_expenditure_status = df_expenditure_status.clean_names()
df_program_inventory = df_program_inventory.clean_names()

for sheet_name, df in revenue_expense.items():
    revenue_expense[sheet_name] = df.clean_names()

In [19]:
## Preview normalized column names
print("\n📌 Expenditure Status Columns:")
print(df_expenditure_status.columns.tolist())

print("\n📌 Program Inventory Columns:")
print(df_program_inventory.columns.tolist())


📌 Expenditure Status Columns:
['account_number', 'city_council', 'fund_#', 'dept_#', 'activity_code', 'object_code', 'sub_object_code', 'department']

📌 Program Inventory Columns:
['fund', 'dept_#', 'activity', 'program_title_h8_', 'requested_title_change_i9_', 'department_h6_', 'ftes_h36_', 'personnel_g27_', 'o&m_g28_', 'debt_g29_', 'grant_g30_', 'transfers_g31_', 'capital_g32_', 'total_expenditures_g33_', 'cost_recovery_e58', 'cost_recovery_p24', 'description_e12_', 'additional_activities_e20_', 'mandate_e41', 'mandate_h41', 'mandate_e43', 'service_level_e47', 'service_level_h47', 'service_level_e49', 'reliance_e53', 'reliance_e55', 'strategic_goal_e64', 'strategic_goal_e66', 'strategic_goal_e68', 'strategic_goal_e74', 'strategic_goal_e80', 'trend_demand_e87', 'trend_demand_e89', 'risk_e93', 'risk_e95', 'department', 'fund_name']
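
The normalized names still contain characters such as `#` and `&` and some trailing underscores (`fund_#`, `o&m_g28_`), which can be awkward as field names downstream. A minimal sketch of a stricter normalization in plain Python follows; `normalize_header` is a hypothetical helper written for this illustration and is not a pyjanitor function.

```python
import re

def normalize_header(name):
    """Lowercase, collapse each run of non-alphanumeric characters to '_', trim underscores."""
    text = str(name).strip().lower()
    text = re.sub(r"[^a-z0-9]+", "_", text)
    return text.strip("_")

headers = ["Fund #", "O&M (G28)", "Program Title (H8)", "Sub-object Code"]
normalized = [normalize_header(h) for h in headers]
print(normalized)
```

It could be applied with `df.columns = [normalize_header(c) for c in df.columns]`. Because different headers can collapse to the same name (for example `Fund` and `Fund #`), it would be worth checking the result for duplicates before assigning it.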


In [20]:
### Export cleaned files 
### Not ready to do this just yet. 

# Export to CSV
df_expenditure_status.to_csv("cleaned_expenditure_status.csv", index=False)
df_program_inventory.to_csv("cleaned_program_inventory.csv", index=False)

# Export to Excel (one sheet per dataset)
with pd.ExcelWriter("cleaned_missoula_budget_data.xlsx", engine="openpyxl") as writer:
    df_expenditure_status.to_excel(writer, sheet_name="Expenditure Status", index=False)
    df_program_inventory.to_excel(writer, sheet_name="Program Inventory", index=False)

    # Write the revenue/expense sheets while the writer is still open
    for sheet_name, df in revenue_expense.items():
        df.to_excel(writer, sheet_name=sheet_name, index=False)

In [28]:
filtered_df.to_csv("testing_expend_status.csv", index=False)