# Cleaning Data

In this notebook, I clean two datasets provided by the City of Missoula:
- FY24_Expenditure_Status.xlsx
- Program_Inventory_Internal_Data_Collection.xlsx

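The raw data is exported from Tyler Edens (financial software) and Workiva (survey data). Both exports are messy and inconsistent: they contain blank rows, subtotal lines, and compound headers spread across several columns. The cleaning process removes these artifacts, standardizes the structure, and prepares the data for Power BI.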

Once the data is cleaned, I will run some exploratory analysis on it, both to check how well the cleaning worked and to surface patterns in the data that are easy to digest at a glance.

## Step 1: Importing The Libraries
The cell below loads all the necessary Python libraries for data handling and cleaning. 

In [56]:
# Standard Libraries
import os

# Text Cleaning
import re

# Data Handling
import pandas as pd
import numpy as np

# Excel Handling
import openpyxl

# Additional Cleaning Utilities; each labeled below
from tqdm import tqdm  # Adds progress bars to loops
import chardet  # Detects encoding issues
import janitor  # Chaining and cleaning helpers (provides clean_names)

# Visualization
import missingno as msno
import matplotlib.pyplot as plt

## Step 2: Define Cleaning Functions
To reduce repetition and improve clarity, we define a set of helper functions.

In [57]:
# Defined Cleaning Functions
def drop_unnamed_columns(df):
    """
    Drops columns with headers like 'Unnamed: X'.
    These are usually trailing blank columns from Excel exports.
    """
    return df.loc[:, ~df.columns.str.contains("^Unnamed", na=False)]


def clean_numeric_column(column, zfill=None):
    """
    Remove trailing '.0' from numeric-like strings.
    Optionally left-pad with zeros if a zfill length is provided.
    """
    cleaned = column.astype(str).str.replace(r"\.0$", "", regex=True)
    return cleaned.str.zfill(zfill) if zfill else cleaned


def clean_identifiers(df):
    """
    Standardize key identifiers like 'Fund #', 'Dept #' and 'Activity Code'.
    Applies formatting and zero-padding where appropriate.
    """
    if "fund_no" in df.columns:
        df["fund_no"] = clean_numeric_column(df["fund_no"], zfill=4)
    if "dept_no" in df.columns:
        df["dept_no"] = clean_numeric_column(df["dept_no"], zfill=3)
    if "activity" in df.columns:
        df["activity"] = clean_numeric_column(df["activity"], zfill=6)
    return df


def expand_multicolumn_headers(df, column_expansion_map):
    """
    Expands wide compound headers into multiple real columns.
    For example, "Mandate (E41, H41, E43)" → [mandate_e41, mandate_h41, mandate_e43].
    Assumes the next N columns are 'Unnamed'.
    """
    new_columns = []
    col_iter = iter(df.columns)
    while True:
        try:
            col = next(col_iter)
            if col in column_expansion_map:
                expanded = column_expansion_map[col]
                new_columns.extend(expanded)
                for _ in range(len(expanded) - 1):
                    next(col_iter)
            else:
                new_columns.append(col)
        except StopIteration:
            break
    df.columns = new_columns
    return df


def apply_department_and_fund_mappings(df, dept_map, fund_map):
    """
    Applies department and fund mapping dictionaries.
    Any unmapped codes will be labeled 'Unmapped' for clarity.
    """
    df["department"] = df["dept_no"].map(dept_map).fillna("Unmapped")
    if "fund_no" in df.columns:
        df["fund_name"] = df["fund_no"].map(lambda f: fund_map.get(f, "Unmapped"))
    return df


def clean_program_inventory(df, column_expansion_map, dept_map, fund_map):
    """
    Full cleaning pipeline for the Program Inventory dataset.
    Renames key columns, applies identifier formatting,
    expands compound headers, drops filler columns,
    and applies department/fund mapping.
    """
    df = df.rename(columns={"Org": "dept_no"})
    df = clean_identifiers(df)
    df = expand_multicolumn_headers(df, column_expansion_map)
    df = drop_unnamed_columns(df)
    df = apply_department_and_fund_mappings(df, dept_map, fund_map)
    return df


def remove_leading_trailing_whitespace(df):
    """
    Trim whitespace from string values in object columns.
    Non-string values (numbers, NaN) are left unchanged.
    """
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return df

def standardize_case(df, cols):
    """
    Convert selected columns to title case.
    Useful for labels like 'department' and 'fund name'.
    """
    for col in cols:
        if col in df.columns:
            df[col] = df[col].astype(str).str.title()
    return df

def strip_whitespace_and_standardize(df):
    """
    Combo function to:
    - Remove leading/trailing whitespace from text
    - Title-case key label fields
    """
    df = remove_leading_trailing_whitespace(df)
    df = standardize_case(df, cols=["department", "department_h6", "fund_name"])
    return df


def remove_trailing_underscores(df):
    """ 
    Removes trailing underscores in program inventory
    """
    df.columns = df.columns.str.rstrip("_")
    return df
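
To show what the identifier helpers do, here is a small sketch on made-up values (the sample codes are assumptions, not taken from the City's files). Codes read from Excel often arrive as floats such as `210.0`, so `clean_numeric_column` strips the trailing `.0` and left-pads with zeros. The helper is repeated here only so the snippet runs on its own.

```python
import pandas as pd


def clean_numeric_column(column, zfill=None):
    """Same logic as the helper defined above, repeated so this sketch is self-contained."""
    cleaned = column.astype(str).str.replace(r"\.0$", "", regex=True)
    return cleaned.str.zfill(zfill) if zfill else cleaned


# Hypothetical department codes as Excel might deliver them (stored as floats)
sample = pd.Series([210.0, 45.0, 7])
padded = clean_numeric_column(sample, zfill=3)

print(padded.tolist())  # ['210', '045', '007']
```

Padding matters because the codes are later used as text keys: `45` and `045` would not match in a lookup.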

In [58]:
# Because each helper has a docstring, you can look it up later with help(function), for example:
# help(expand_multicolumn_headers)

## Step 3: Loading The Files
The code below loads the two Excel files that are cleaned in this project.

In [59]:
# define file path
data_path = "data/"

# Load files; only the first sheet of each workbook is read.
expenditure_status = pd.read_excel(
    os.path.join(data_path, "FY24_Expenditure_Status.xlsx"),
    sheet_name=0, engine="openpyxl")

program_inventory = pd.read_excel(
    os.path.join(data_path, "Program_Inventory_Internal_Data_Collection.xlsx"), 
    sheet_name=0, engine="openpyxl")

In [60]:
# Preview Expenditure Status
expenditure_status

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,
4,1000,GENERAL FUND,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14133,Total DEPRECIATION,0,,,,,,,,,,,,,,,
14134,,,,,,,,,,,,,,,,,
14135,Total GENERAL FIXED ASSETS ACCOUNT GROUP,0,,,,,,,,,,,,,,,
14136,Grand Total,338487828.71,,,,,,,,,,,,,,,


## Step 4: Filter Raw Expenditure Status Data
This code removes blank rows, summary totals, and non-transactional records. 

In [61]:
## Logical Conditions
# Condition 1: Keep rows where "Unnamed: 0" has at least one digit
condition_1 = expenditure_status["Unnamed: 0"].astype(str).str.contains(r"\d", na=False)

# Condition 2: "Unnamed: 1" must not be null
condition_2 = expenditure_status["Unnamed: 1"].notna()

# Condition 3: Remove rows where 'Unnamed: 0' is numeric and 'Unnamed: 1' is alphabetic
condition_3 = ~(
    expenditure_status["Unnamed: 0"].astype(str).str.isnumeric() &
    expenditure_status["Unnamed: 1"].astype(str).str.isalpha()
)

# Condition 4: Remove rows where both 'Unnamed: 0' and 'Unnamed: 1' are only alphabetic
condition_4 = ~(
    expenditure_status["Unnamed: 0"].astype(str).str.isalpha() &
    expenditure_status["Unnamed: 1"].astype(str).str.isalpha()
)

# Condition 5: Keep only rows where 'Unnamed: 1' contains a number (ensures a dollar amount)
condition_5 = expenditure_status["Unnamed: 1"].astype(str).str.contains(r"\d", na=False)

# Condition 6: Exclude rows where 'Unnamed: 0' contains the word "Total"
condition_6 = ~expenditure_status["Unnamed: 0"].astype(str).str.contains("Total", case=False, na=False)

# Combine all logic conditions into a single filter
all_conditions = condition_1 & condition_2 & condition_3 & condition_4 & condition_5 & condition_6

# Apply the filter to the raw DataFrame and keep only the first two columns
expenditure_filtered_df = expenditure_status[all_conditions].iloc[:, :2].copy()
expenditure_filtered_df.columns = ["account_number", "adjusted_appropriation"]

# Checkpoint: Preview the cleaned result
expenditure_filtered_df.head()

Unnamed: 0,account_number,adjusted_appropriation
9,1000.210.410100.110.000 SALARIES AND WAGES,214440
10,1000.210.410100.140.000 EMPLOYER CONTRIBUTIONS,199915
11,1000.210.410100.141.000 STATE RETIREMENT CON...,217
15,1000.210.410100.210.000 OFFICE SUPPLIES,200
16,1000.210.410100.230.000 REPAIR/MAINTENANCE,750


In [62]:
print("📌 Expenditure Data Shape:", expenditure_filtered_df.shape)

📌 Expenditure Data Shape: (2351, 2)
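
The filter logic can be tried on a toy frame. The sketch below uses made-up rows that imitate the export layout (a blank row, a fund header, a detail line, and a subtotal) and applies a simplified subset of the conditions above. The row values are illustrative assumptions, not real City data.

```python
import pandas as pd

# Made-up rows imitating the export layout
raw = pd.DataFrame({
    "Unnamed: 0": [
        None,                                          # blank spacer row
        "1000",                                        # fund header row
        "1000.210.410100.110.000 SALARIES AND WAGES",  # detail row
        "Total 1000 GENERAL FUND",                     # subtotal row
    ],
    "Unnamed: 1": [None, "GENERAL FUND", 214440, 414572],
})

col0 = raw["Unnamed: 0"].astype(str)
col1 = raw["Unnamed: 1"].astype(str)

keep = (
    col0.str.contains(r"\d", na=False)                     # first column holds a code
    & raw["Unnamed: 1"].notna()                            # second column is populated
    & col1.str.contains(r"\d", na=False)                   # second column looks like an amount
    & ~col0.str.contains("Total", case=False, na=False)    # drop subtotal rows
)

filtered = raw[keep]
print(filtered)
```

Only the detail row survives: the fund header fails the amount check and the subtotal row is caught by the "Total" check.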


## Step 5: Restructuring Expenditure Data
This code splits each account number into its fund, department, activity, object, and sub-object codes, extracts the account description, and removes transfer rows (object code 820).

In [63]:
# Define column names for account number breakdown
split_cols = ["fund_no", "dept_no", "activity_code", "object_code", "sub_object_code"]

# Extract components using regex (some rows may lack the sub_object_code)
split_data = expenditure_filtered_df["account_number"].astype(str).str.extract(
    r"^(\d{4})\.(\d{3})\.(\d{6})\.(\d{3})(?:\.(\d{1,3}))?"
)
split_data.columns = split_cols

# Merge split components back into the main DataFrame
expend_structured = pd.concat([expenditure_filtered_df, split_data], axis=1)

# Extract the account description (text after the numeric code)
expend_structured["account_description"] = (
    expenditure_filtered_df["account_number"]
    .astype(str)
    .str.extract(r"(?:\d{3,}(?:\.\d{3,}){1,4})\s+(.*)", expand=False)
    .str.strip()
)

# Convert each segment to string and strip whitespace
for col in split_cols:
    expend_structured[col] = expend_structured[col].astype("string").str.strip()

# Left-pad sub_object_code to 3 digits; missing values stay <NA>
expend_structured["sub_object_code"] = expend_structured["sub_object_code"].str.zfill(3)

# Forward fill fund and department codes
expend_structured[["fund_no", "dept_no"]] = expend_structured[["fund_no", "dept_no"]].ffill()

# Drop any rows without an activity code — these are incomplete
expend_structured = expend_structured.dropna(subset=["activity_code"]).reset_index(drop=True)

# Remove original raw account_number column
expend_structured = expend_structured.drop(columns=["account_number"])

# Final cleanup: apply text stripping and title-casing
df_expenditure_status = expend_structured.copy()
df_expenditure_status = strip_whitespace_and_standardize(df_expenditure_status)

# Remove Transfers Object Code 820 from data to ensure budget accuracy. 
df_expenditure_status = df_expenditure_status[df_expenditure_status["object_code"] != "820"]

# Preview cleaned and structured expenditure data
print("\n✅ Structured Expenditure Data Preview:")
display(df_expenditure_status.head(10))


✅ Structured Expenditure Data Preview:


Unnamed: 0,adjusted_appropriation,fund_no,dept_no,activity_code,object_code,sub_object_code,account_description
0,214440,1000,210,410100,110,0,SALARIES AND WAGES
1,199915,1000,210,410100,140,0,EMPLOYER CONTRIBUTIONS
2,217,1000,210,410100,141,0,STATE RETIREMENT CONTRIBUTIONS
3,200,1000,210,410100,210,0,OFFICE SUPPLIES
4,750,1000,210,410100,230,0,REPAIR/MAINTENANCE
5,350,1000,210,410100,240,0,OTHER SUPPLIES
6,20,1000,210,410100,310,0,COMMUNICATIONS
7,250,1000,210,410100,320,0,PRINTING & DUPLICATING
8,2955,1000,210,410100,330,0,PUBLICITY SUBSCRIPTIONS & DUES
9,7150,1000,210,410100,350,0,PROFESSIONAL SERVICES
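
To make the regex step easier to follow, here is a minimal sketch that runs the same two patterns on two made-up account strings, one of them without a sub-object segment. The sample strings are assumptions modeled on the preview above.

```python
import pandas as pd

# Made-up account strings in the "fund.dept.activity.object.sub" layout
accounts = pd.Series([
    "1000.210.410100.110.000 SALARIES AND WAGES",
    "1000.210.410100.350 PROFESSIONAL SERVICES",   # no sub-object segment
])

# Same pattern as above: the last group is optional
pattern = r"^(\d{4})\.(\d{3})\.(\d{6})\.(\d{3})(?:\.(\d{1,3}))?"
parts = accounts.str.extract(pattern)
parts.columns = ["fund_no", "dept_no", "activity_code", "object_code", "sub_object_code"]

# Everything after the dotted code is the description
parts["account_description"] = accounts.str.extract(
    r"(?:\d{3,}(?:\.\d{3,}){1,4})\s+(.*)", expand=False
).str.strip()

print(parts)
```

The second row comes back with a missing `sub_object_code`, which is why the cell above treats that column separately when padding.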


In [64]:
# Update df_expenditure_status for further use
df_expenditure_status = strip_whitespace_and_standardize(df_expenditure_status)

In [65]:
# Preview column names to see where compound headers are followed by Unnamed cols
print("\nRaw Program Inventory Columns:")
for i, col in enumerate(program_inventory.columns):
    print(f"{i}: {col}")


Raw Program Inventory Columns:
0: Fund
1: Org
2: Activity
3: Program Title (H8)
4: Requested Title Change (I9)
5: Department (H6)
6: FTEs (H36)
7: Personnel (G27)
8: O&M (G28)
9: Debt (G29)
10: Grant (G30)
11: Transfers (G31)
12: Capital (G32)
13: Total Expenditures (G33)
14: Cost Recovery (E58, P24)
15: Unnamed: 15
16: Description (E12)
17: Additional Activities (E20)
18: Mandate (E41, H41, E43)
19: Unnamed: 19
20: Unnamed: 20
21: Service Level (E47, H47, E49)
22: Unnamed: 22
23: Unnamed: 23
24: Reliance & Interdependencies (E53, E55)
25: Unnamed: 25
26: Strategic Goal (E64, E66, E68, E74, E80)
27: Unnamed: 27
28: Unnamed: 28
29: Unnamed: 29
30: Unnamed: 30
31: Trend (Demand) (E87, E89)
32: Unnamed: 32
33: Risk (E93, E95)
34: Unnamed: 34


## Step 6: Expand Headers in Program Inventory

In [66]:
column_expansion_map = {
    "Cost Recovery (E58, P24)": ["cost_recovery_e58_YN", "cost_recovery_p24_percent"],
    "Mandate (E41, H41, E43)": ["mandate_e41_YN", "mandate_h41_entity", "mandate_e43_descript"],
    "Service Level (E47, H47, E49)": ["service_level_e47_YN", "service_level_h47_entity", "service_level_e49_descript"],
    "Reliance & Interdependencies (E53, E55)": ["reliance_e53_level", "reliance_e55_high_descript"],
    "Strategic Goal (E64, E66, E68, E74, E80)": [
        "strategic_goal_e64_YN", "strategic_goal_e66_name", "strategic_goal_e68_action_descript",
        "strategic_goal_e74_additional_actions", "strategic_goal_e80_2nd_additional_actions"
    ],
    "Trend (Demand) (E87, E89)": ["trend_demand_e87_level", "trend_demand_e89_descript"],
    "Risk (E93, E95)": ["risk_e93_type", "risk_e95_descript"]
}
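
As a quick illustration of how this map is consumed, the sketch below runs the header expansion on a toy frame with one compound header followed by one `Unnamed` placeholder. The helper is a compact restatement of the Step 2 function so the snippet runs on its own, and the toy row is made up.

```python
import pandas as pd


def expand_multicolumn_headers(df, column_expansion_map):
    """Compact restatement of the Step 2 helper so this sketch is self-contained."""
    new_columns = []
    col_iter = iter(df.columns)
    for col in col_iter:
        if col in column_expansion_map:
            expanded = column_expansion_map[col]
            new_columns.extend(expanded)
            # Skip the placeholder columns that follow the compound header
            for _ in range(len(expanded) - 1):
                next(col_iter, None)
        else:
            new_columns.append(col)
    df.columns = new_columns
    return df


# Toy frame: one compound header followed by one 'Unnamed' placeholder
toy = pd.DataFrame(
    [["1000", "Low/No Risk", "Program risk is low"]],
    columns=["Fund", "Risk (E93, E95)", "Unnamed: 2"],
)
toy = expand_multicolumn_headers(
    toy, {"Risk (E93, E95)": ["risk_e93_type", "risk_e95_descript"]}
)

print(toy.columns.tolist())  # ['Fund', 'risk_e93_type', 'risk_e95_descript']
```

The helper relies on position only: it does not check that the skipped columns are actually named `Unnamed`, so the map has to list exactly as many names as the compound header spans.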

## Step 7: Clean Program Inventory

In [67]:
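# Cleaning pipeline for program inventory.
# Note: this redefines the Step 2 version of clean_program_inventory without the department/fund mapping step.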
def clean_program_inventory(df, column_expansion_map):
    df = df.rename(columns={"Org": "dept_no"})
    df = clean_identifiers(df)
    df = expand_multicolumn_headers(df, column_expansion_map)
    df = drop_unnamed_columns(df)
    return df

# Apply cleaning pipeline
df_program_inventory = clean_program_inventory(program_inventory, column_expansion_map)

# Fill blank text fields only (leave numeric NaNs intact)
for col in df_program_inventory.select_dtypes(include="object").columns:
    df_program_inventory[col] = df_program_inventory[col].fillna("blank")

# Strip and standardize casing for labels
df_program_inventory = strip_whitespace_and_standardize(df_program_inventory)

# ✅ Preview cleaned Program Inventory
print("\n✅ Cleaned Program Inventory Preview:")
display(df_program_inventory.head())


✅ Cleaned Program Inventory Preview:


Unnamed: 0,Fund,dept_no,Activity,Program Title (H8),Requested Title Change (I9),Department (H6),FTEs (H36),Personnel (G27),O&M (G28),Debt (G29),...,reliance_e55_high_descript,strategic_goal_e64_YN,strategic_goal_e66_name,strategic_goal_e68_action_descript,strategic_goal_e74_additional_actions,strategic_goal_e80_2nd_additional_actions,trend_demand_e87_level,trend_demand_e89_descript,risk_e93_type,risk_e95_descript
0,1000,210,410100,LEGISLATION SERVICES,0,Council,12.0,414572.0,23605.0,0,...,Internal Organizational (City) Dependence,No,0,0,0,0,Demand for service is increasing,Increase in council time for committees and in...,Low/No Risk,0
1,1000,210,419000,GENERAL GOV'T ONETIME EXPENDITURES,One Time Expense - Neighborhood Traffic Manage...,Council,0.0,0.0,20000.0,0,...,0,No,0,0,0,0,Demand for service is stable,0,Low/No Risk,0
2,1000,220,410210,ADMINISTRATION,Mayor's Office,Administration,3.0,459928.0,69072.0,0,...,Internal Organizational (City) Dependence,No,0,0,0,0,Demand for service is increasing,As the City grows and issues become more compl...,Low/No Risk,Program risk is low
3,1000,220,410250,ADMINISTRATION,Communications Office,Administration,1.0,148046.0,24289.0,0,...,Internal Organizational (City) Dependence,Yes,Organizational Excellence and Resilience,"Drive consistency, effectiveness and excellenc...",Overhaul and implement social media policies a...,0,Demand for service is increasing,"Demand for accurate, timely, appropriate commu...",Low/No Risk,Program risk is considered low
4,1000,220,419000,GENERAL GOV'T ONETIME EXPENDITURES,0,Administration,0.0,0.0,38105.0,0,...,0,No,0,0,0,0,Demand for service is stable,0,Low/No Risk,0


## Step 8: Normalize Column Headers for All Files

In [68]:
df_expenditure_status = df_expenditure_status.clean_names()
df_program_inventory = df_program_inventory.clean_names()
df_expenditure_status = remove_trailing_underscores(df_expenditure_status)
df_program_inventory = remove_trailing_underscores(df_program_inventory)
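
To give a feel for what this normalization does to the headers, here is a rough standard-library approximation. It is an assumption-based sketch, not pyjanitor's actual implementation, and `approx_clean_names` is a hypothetical name. It lowercases each header, turns spaces, brackets, and common punctuation into underscores, collapses repeated underscores, and strips trailing ones.

```python
import re


def approx_clean_names(columns):
    """Approximate the header normalization; not pyjanitor's real code."""
    cleaned = []
    for name in columns:
        name = str(name).lower()
        name = re.sub(r"[ /:,?()\.\-]", "_", name)  # separators and brackets -> underscore
        name = re.sub(r"_+", "_", name)             # collapse runs of underscores
        cleaned.append(name.rstrip("_"))            # same effect as remove_trailing_underscores
    return cleaned


headers = ["Program Title (H8)", "O&M (G28)", "Cost Recovery (E58, P24)"]
result = approx_clean_names(headers)

print(result)  # ['program_title_h8', 'o&m_g28', 'cost_recovery_e58_p24']
```

The ampersand in `o&m_g28` is left in place here, which matches the column name that shows up in the validation output.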

## Step 9: Validations

In [69]:
print("🔍 Nulls in Expenditure Status:\n", df_expenditure_status.isnull().sum())
print("\n🔍 Nulls in Program Inventory:\n", df_program_inventory.isnull().sum())

required_keys = ["fund_no", "dept_no", "activity_code"]
missing_keys = df_expenditure_status[df_expenditure_status[required_keys].isnull().any(axis=1)]
dupes = df_expenditure_status[df_expenditure_status.duplicated(subset="account_description", keep=False)]

print(f"\n🔎 Missing key identifiers: {missing_keys.shape[0]} rows")

🔍 Nulls in Expenditure Status:
 adjusted_appropriation    0
fund_no                   0
dept_no                   0
activity_code             0
object_code               0
sub_object_code           0
account_description       0
dtype: int64

🔍 Nulls in Program Inventory:
 fund                                         0
dept_no                                      0
activity                                     0
program_title_h8                             0
requested_title_change_i9                    0
department_h6                                0
ftes_h36                                     0
personnel_g27                                0
o&m_g28                                      0
debt_g29                                     0
grant_g30                                    0
transfers_g31                                0
capital_g32                                  0
total_expenditures_g33                       0
cost_recovery_e58_yn                         0
cost_recovery_p24_perc
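
The `dupes` frame in the cell above is built on `account_description` alone, where repeats are expected (most departments have a salaries line). A stricter check might use the full account key instead. The sketch below assumes the five code columns together should identify a row uniquely. That assumption is not confirmed by the source data and should be checked against the chart of accounts. The rows are made up.

```python
import pandas as pd

# Made-up rows; the third repeats the full account key of the first
df = pd.DataFrame({
    "fund_no": ["1000", "1000", "1000"],
    "dept_no": ["210", "220", "210"],
    "activity_code": ["410100", "410210", "410100"],
    "object_code": ["110", "110", "110"],
    "sub_object_code": ["000", "000", "000"],
    "account_description": ["SALARIES AND WAGES"] * 3,
})

key_cols = ["fund_no", "dept_no", "activity_code", "object_code", "sub_object_code"]

# keep=False flags every member of a duplicate group, not just the later ones
by_description = df.duplicated(subset="account_description", keep=False).sum()
by_full_key = df.duplicated(subset=key_cols, keep=False).sum()

print(f"Rows sharing a description: {by_description}")
print(f"Rows sharing a full account key: {by_full_key}")
```

Here all three rows share a description, but only two share a full key, so the key-based count is the one that points at a possible double entry.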

## Step 10: Department Mapping

In [70]:
# Create a clean department map from program inventory (no duplicates, no nulls)
dept_map_df = df_program_inventory[["dept_no", "department_h6"]].drop_duplicates().dropna()
dept_map = dept_map_df.set_index("dept_no")["department_h6"].to_dict()

# Make a safe copy of the expenditure DataFrame to avoid SettingWithCopyWarning
df_expenditure_status = df_expenditure_status.copy()

# Apply the mapping using dept_no → department_h6
df_expenditure_status["department"] = (
    df_expenditure_status["dept_no"]
    .astype(str)
    .map(dept_map)
    .fillna("Unmapped")  # Ensures any missing mappings are labeled clearly
)
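
One thing worth checking with this approach: if a `dept_no` appears with more than one department name, `to_dict()` keeps only the last pair. The sketch below uses made-up rows to show both the mapping and a simple way to list ambiguous codes before building the dictionary.

```python
import pandas as pd

# Made-up inventory rows: dept 220 appears under two names
inventory = pd.DataFrame({
    "dept_no": ["210", "220", "220"],
    "department_h6": ["Council", "Administration", "Mayor's Office"],
})
expenditure = pd.DataFrame({"dept_no": ["210", "220", "999"]})

pairs = inventory[["dept_no", "department_h6"]].drop_duplicates().dropna()

# Codes that map to more than one name
ambiguous = pairs[pairs.duplicated(subset="dept_no", keep=False)]

# With duplicate index values, the last pair wins in the resulting dict
dept_map = pairs.set_index("dept_no")["department_h6"].to_dict()
expenditure["department"] = expenditure["dept_no"].map(dept_map).fillna("Unmapped")

print(ambiguous)
print(expenditure)
```

In this toy case, code 220 ends up labeled "Mayor's Office" and code 999 is labeled "Unmapped". Whether a real code carries several names depends on the inventory file, so the `ambiguous` frame is only a diagnostic.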

## Step 11: Export Cleaned Files

In [71]:
os.makedirs("cleaned_outputs", exist_ok=True)

df_expenditure_status.to_csv("cleaned_outputs/cleaned_expenditure_status.csv", index=False)
df_program_inventory.to_csv("cleaned_outputs/cleaned_program_inventory.csv", index=False)

print("✅ Cleaned datasets exported to 'cleaned_outputs/'")

✅ Cleaned datasets exported to 'cleaned_outputs/'
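
A caveat on the CSV format: it does not store column types, so zero-padded codes can lose their padding when the file is read back with default type inference. The sketch below shows this with pandas on a made-up frame held in memory. Power BI applies its own type detection, so the code columns may need to be set to text there as well.

```python
import io

import pandas as pd

# Made-up frame with a zero-padded code stored as text
df = pd.DataFrame({
    "sub_object_code": ["000", "010"],
    "adjusted_appropriation": [214440, 750],
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default read: the codes are inferred as integers and the padding is lost
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Reading the code column as text keeps the padding
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"sub_object_code": str})

print(inferred["sub_object_code"].tolist())  # [0, 10]
print(as_text["sub_object_code"].tolist())   # ['000', '010']
```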
