### Cleaning Data

In this notebook, I clean two datasets provided by the City of Missoula:
- FY24_Expenditure_Status.xlsx
- Program_Inventory_Internal_Data_Collection.xlsx

The original data is exported from Tyler Edens (financial software) and Workiva (survey data), and is messy and inconsistent. This process removes artifacts, standardizes structure, and prepares the data for Power BI.

Once the data is cleaned, I run some exploratory analysis on it, both to check how well the cleaning performed and to surface patterns in the data that are easy to digest at a glance.

### Step 1: Importing The Libraries
The cell below loads all the necessary Python libraries for data handling and cleaning. 

In [1]:
# Standard Libraries
import os

# Text Cleaning
import re

# Data Handling
import pandas as pd
import numpy as np

# Excel Handling
import openpyxl

# Additional Cleaning Utilities; each labeled below
from tqdm import tqdm  # Adds progress bars to loops
import chardet  # Detects encoding issues
import janitor  # Chaining and cleaning helpers (provides df.clean_names())

# Visualization
import missingno as msno
import matplotlib.pyplot as plt

### Step 2: Loading The Files
The code below loads the two Excel files to be cleaned, along with the FY24 revenue/expense workbook, whose sheets only have their column names standardized later on.

In [2]:
# define file path
data_path = "data/"

# Load files: all sheets of the revenue/expense workbook (sheet_name=None returns a dict of DataFrames keyed by sheet name),
# and only the first sheet of the expenditure and program inventory workbooks.
revenue_expense = pd.read_excel(
    os.path.join(data_path, "FY24_Revenue_Expense_Data.xlsx"), 
    sheet_name=None, engine="openpyxl")

expenditure_status = pd.read_excel(
    os.path.join(data_path, "FY24_Expenditure_Status.xlsx"),
    sheet_name=0, engine="openpyxl")

program_inventory = pd.read_excel(
    os.path.join(data_path, "Program_Inventory_Internal_Data_Collection.xlsx"), 
    sheet_name=0, engine="openpyxl")

In [3]:
# Preview Expenditure Status
expenditure_status

,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,
4,1000,GENERAL FUND,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14133,Total DEPRECIATION,0,,,,,,,,,,,,,,,
14134,,,,,,,,,,,,,,,,,
14135,Total GENERAL FIXED ASSETS ACCOUNT GROUP,0,,,,,,,,,,,,,,,
14136,Grand Total,338487828.71,,,,,,,,,,,,,,,


### Step 3: Filter Raw Expenditure Status Data
This code removes blank rows, summary totals, and non-transactional records. 

In [4]:
## Logical Conditions
# Condition 1: Keep rows where "Unnamed: 0" has at least one digit
condition_1 = expenditure_status["Unnamed: 0"].astype(str).str.contains(r"\d", na=False)

# Condition 2: "Unnamed: 1" must not be empty, null
condition_2 = expenditure_status["Unnamed: 1"].notna()

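# Condition 3: Remove rows where 'Unnamed: 0' is numeric and 'Unnamed: 1' is alphabetic
# (str.isalpha() is False for text containing spaces, so multi-word labels such as "GENERAL FUND" are caught by Condition 6 instead)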
condition_3 = ~(
    expenditure_status["Unnamed: 0"].astype(str).str.isnumeric() &
    expenditure_status["Unnamed: 1"].astype(str).str.isalpha()
)

# Condition 4: Remove rows where both 'Unnamed: 0' and 'Unnamed: 1' are only alphabetic
condition_4 = ~(
    expenditure_status["Unnamed: 0"].astype(str).str.isalpha() &
    expenditure_status["Unnamed: 1"].astype(str).str.isalpha()
)

# Condition 5: Exclude rows where 'Unnamed: 0' contains the word "Total"
condition_5 = ~expenditure_status["Unnamed: 0"].astype(str).str.contains("Total", case=False, na=False)

# Condition 6: Keep only rows where 'Unnamed: 1' contains a number (ensures a dollar amount)
condition_6 = expenditure_status["Unnamed: 1"].astype(str).str.contains(r"\d", na=False)

# Combine all logic conditions into a single filter
all_conditions = condition_1 & condition_2 & condition_3 & condition_4 & condition_5 & condition_6

# Subset the filtered data 
expenditure_filtered_df = expenditure_status[all_conditions].copy()

# Only keep relevant first two columns
expenditure_filtered_df = expenditure_filtered_df.iloc[:, :2]

# Rename columns for clarity
expenditure_filtered_df.columns = ["account_number", "adjusted_appropriation"]

# Checkpoint: Preview the cleaned result
expenditure_filtered_df.head()

,account_number,adjusted_appropriation
9,1000.210.410100.110.000 SALARIES AND WAGES,214440
10,1000.210.410100.140.000 EMPLOYER CONTRIBUTIONS,199915
11,1000.210.410100.141.000 STATE RETIREMENT CON...,217
15,1000.210.410100.210.000 OFFICE SUPPLIES,200
16,1000.210.410100.230.000 REPAIR/MAINTENANCE,750
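
To make the effect of the combined mask concrete, here is a minimal sketch on a made-up four-row frame (the rows and amounts are illustrative, not taken from the City's file). It applies only the digit, "Total", and non-null checks, which are already enough to drop a blank row, a fund header row, and a subtotal row:

```python
import pandas as pd

# Toy stand-in for the raw export; values are invented for illustration
raw = pd.DataFrame({
    "Unnamed: 0": [None, 1000, "1000.210.410100.110.000   SALARIES AND WAGES", "Total GENERAL FUND"],
    "Unnamed: 1": [None, "GENERAL FUND", 214440, 500000],
})

col0 = raw["Unnamed: 0"].astype(str)
col1 = raw["Unnamed: 1"].astype(str)

has_digit_0 = col0.str.contains(r"\d", na=False)               # drops the blank row
not_total = ~col0.str.contains("Total", case=False, na=False)  # drops subtotal / grand total rows
has_digit_1 = col1.str.contains(r"\d", na=False)               # drops fund header rows like "GENERAL FUND"
not_null_1 = raw["Unnamed: 1"].notna()

toy_filtered = raw[has_digit_0 & not_total & has_digit_1 & not_null_1]
print(toy_filtered)
```

Only the account line survives, which is the behavior the six conditions aim for on the real export.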


### Step 4: Preview Raw Files After Filter Applied

In [5]:
print("Revenue Expense Loaded:", revenue_expense.keys())
print("\nFiltered Expenditure Status Preview:")
print(expenditure_filtered_df.head())

print("\nProgram Inventory Preview:")
print(program_inventory.head())

Revenue Expense Loaded: dict_keys(['Rev', 'Exp', 'Status'])

Filtered Expenditure Status Preview:
                                       account_number adjusted_appropriation
9        1000.210.410100.110.000   SALARIES AND WAGES                 214440
10   1000.210.410100.140.000   EMPLOYER CONTRIBUTIONS                 199915
11  1000.210.410100.141.000   STATE RETIREMENT CON...                    217
15          1000.210.410100.210.000   OFFICE SUPPLIES                    200
16       1000.210.410100.230.000   REPAIR/MAINTENANCE                    750

Program Inventory Preview:
   Fund  Org  Activity                  Program Title (H8)  \
0  1000  210    410100                LEGISLATION SERVICES   
1  1000  210    419000  GENERAL GOV'T ONETIME EXPENDITURES   
2  1000  220    410210                      ADMINISTRATION   
3  1000  220    410250                      ADMINISTRATION   
4  1000  220    419000  GENERAL GOV'T ONETIME EXPENDITURES   

                         Requested Titl

### Step 5: Define Cleaning Functions
To reduce repetition and improve clarity, we define a set of helper functions.

In [6]:
# Defined Cleaning Functions
def drop_unnamed_columns(df):
    """
    Drops columns with headers like 'Unnamed: X'.
    These are usually trailing blank columns from Excel exports.
    """
    return df.loc[:, ~df.columns.str.contains("^Unnamed", na=False)]


def clean_numeric_column(column, zfill=None):
    """
    Remove trailing '.0' from numeric-like strings.
    Optionally left-pad with zeros if a zfill length is provided.
    """
    cleaned = column.astype(str).str.replace(r"\.0$", "", regex=True)
    return cleaned.str.zfill(zfill) if zfill else cleaned


def clean_identifiers(df):
    """
    Standardize the key identifier columns 'fund_no', 'dept_no' and 'activity_code' when present.
    Strips a trailing '.0' and zero-pads dept_no to 3 and activity_code to 6 characters.
    """
    if "fund_no" in df.columns:
        df["fund_no"] = clean_numeric_column(df["fund_no"])
    if "dept_no" in df.columns:
        df["dept_no"] = clean_numeric_column(df["dept_no"], zfill=3)
    if "activity_code" in df.columns:
        df["activity_code"] = clean_numeric_column(df["activity_code"], zfill=6)
    return df


def expand_multicolumn_headers(df, column_expansion_map):
    """
    Expands wide compound headers into multiple real columns.
    For example, "Mandate (E41, H41, E43)" → [mandate_e41, mandate_h41, mandate_e43].
    Assumes the next N columns are 'Unnamed'.
    """
    new_columns = []
    col_iter = iter(df.columns)

    while True:
        try:
            col = next(col_iter)
            if col in column_expansion_map:
                expanded = column_expansion_map[col]
                new_columns.extend(expanded)
                for _ in range(len(expanded) - 1):
                    next(col_iter)
            else:
                new_columns.append(col)
        except StopIteration:
            break

    df.columns = new_columns
    return df


def apply_department_and_fund_mappings(df, dept_map, fund_map):
    """
    Applies department and fund mapping dictionaries.
    Any unmapped codes will be labeled 'Unmapped' for clarity.
    """
    df["department"] = df["dept_no"].map(dept_map).fillna("Unmapped")
    if "fund" in df.columns:
        df["fund_name"] = df["fund"].map(lambda f: fund_map.get(f, "Unmapped"))
    return df


def clean_program_inventory(df, column_expansion_map, dept_map, fund_map):
    """
    Full cleaning pipeline for the Program Inventory dataset.
    Renames key columns, applies identifier formatting,
    expands compound headers, drops filler columns,
    and applies department/fund mapping.
    """
    df = df.rename(columns={"Org": "dept_no"})
    df = clean_identifiers(df)
    df = expand_multicolumn_headers(df, column_expansion_map)
    df = drop_unnamed_columns(df)
    df = apply_department_and_fund_mappings(df, dept_map, fund_map)
    return df


def remove_leading_trailing_whitespace(df):
    """
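    Trim leading/trailing whitespace in every object-dtype column.
    Values are cast to str first, so non-string values are converted to text
    (in pandas 2.x this includes missing values, which become the text 'nan').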
    """
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype(str).str.strip()
    return df

def standardize_case(df, cols):
    """
    Convert selected columns to title case.
    Useful for labels like 'department' and 'fund name'.
    """
    for col in cols:
        if col in df.columns:
            df[col] = df[col].astype(str).str.title()
    return df

def strip_whitespace_and_standardize(df):
    """
    Combo function to:
    - Remove leading/trailing whitespace from text
    - Title-case key label fields
    """
    df = remove_leading_trailing_whitespace(df)
    df = standardize_case(df, cols=["department", "department (h6)", "fund_name"])
    return df
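
Because the helpers above cast values with `astype(str)`, missing values in text columns can end up as the literal text `nan` (or `Nan` after title-casing) in pandas 2.x, and later null counts on those columns then read as zero. If missing values should stay missing, one possible variant strips only values that are actual strings. This is a sketch; `strip_strings_keep_nulls` is a hypothetical name and not part of the original notebook:

```python
import numpy as np
import pandas as pd

def strip_strings_keep_nulls(df):
    """Strip whitespace from str values in object columns; leave NaN and non-strings untouched."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return out

# Invented sample with one missing label
demo = pd.DataFrame({"label": ["  Mayor ", np.nan, "Finance  "]}, dtype=object)
cleaned = strip_strings_keep_nulls(demo)
print(cleaned["label"].tolist())
print(cleaned["label"].isna().sum())
```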

In [None]:
## Docstrings written inside each function can be looked up later with help(function)
## help(expand_multicolumn_headers)

### Step 6: Restructuring Expenditure Data
This cell splits each account number into its component codes, forward-fills fund and department, and extracts the account description.

In [7]:
df_expenditure_status = expenditure_filtered_df.copy()
df_expenditure_status.columns = ["account_number", "adjusted_appropriation"]

# Split structured account number
split_cols = ["fund_no", "dept_no", "activity_code", "object_code", "sub_object_code"]
split_data = df_expenditure_status["account_number"].astype(str).str.split(".", expand=True, n=4)
split_data.columns = split_cols

# Merge splits into main frame
df_expenditure_status = pd.concat([df_expenditure_status, split_data], axis=1)

# Convert to Int64
for col in split_cols:
    df_expenditure_status[col] = pd.to_numeric(df_expenditure_status[col], errors="coerce").astype("Int64")

# Forward-fill fund/dept
df_expenditure_status[["fund_no", "dept_no"]] = df_expenditure_status[["fund_no", "dept_no"]].ffill()

# Remove rows with missing activity_code
df_expenditure_status = df_expenditure_status.dropna(subset=["activity_code"]).reset_index(drop=True)

# Extract account description from string
df_expenditure_status["account_description"] = df_expenditure_status["account_number"].astype(str)\
    .str.extract(r"(?:\d{3,}\.\d{3,}(?:\.\d{3,}){1,3}\s+)(.*)", expand=False)\
    .str.strip()

# Preview
print("\n✅ Structured Expenditure Data Preview:")
display(df_expenditure_status.head())


✅ Structured Expenditure Data Preview:


,account_number,adjusted_appropriation,fund_no,dept_no,activity_code,object_code,sub_object_code,account_description
0,1000.210.410100.110.000 SALARIES AND WAGES,214440,1000,210,410100,110,,SALARIES AND WAGES
1,1000.210.410100.140.000 EMPLOYER CONTRIBUTIONS,199915,1000,210,410100,140,,EMPLOYER CONTRIBUTIONS
2,1000.210.410100.141.000 STATE RETIREMENT CON...,217,1000,210,410100,141,,STATE RETIREMENT CONTRIBUTIONS
3,1000.210.410100.210.000 OFFICE SUPPLIES,200,1000,210,410100,210,,OFFICE SUPPLIES
4,1000.210.410100.230.000 REPAIR/MAINTENANCE,750,1000,210,410100,230,,REPAIR/MAINTENANCE
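
In the preview above `sub_object_code` is empty. With `n=4`, the fifth piece of the split still carries the description (for example `000 SALARIES AND WAGES`), and `pd.to_numeric(..., errors="coerce")` turns that into a missing value. A possible alternative, sketched here with only the standard library and a hypothetical helper `parse_account`, captures the five dotted segments and the description in one regular expression. The pattern assumes every account string has exactly five dot-separated numeric segments followed by whitespace and a description, which should be confirmed against the full file:

```python
import re

ACCOUNT_PATTERN = re.compile(
    r"^\s*(?P<fund_no>\d+)\.(?P<dept_no>\d+)\.(?P<activity_code>\d+)"
    r"\.(?P<object_code>\d+)\.(?P<sub_object_code>\d+)\s+(?P<account_description>.+?)\s*$"
)

def parse_account(text):
    """Return a dict of the five codes (as strings) plus the description, or None if no match."""
    match = ACCOUNT_PATTERN.match(text)
    return match.groupdict() if match else None

parsed = parse_account("1000.210.410100.110.000   SALARIES AND WAGES")
print(parsed)
```

Keeping the codes as strings preserves leading zeros such as `000`. In pandas, the same pattern can be passed to `Series.str.extract`, which returns one column per named group.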


In [8]:
# Preview column names to see where compound headers are followed by Unnamed cols
print("\nRaw Program Inventory Columns:")
for i, col in enumerate(program_inventory.columns):
    print(f"{i}: {col}")


Raw Program Inventory Columns:
0: Fund
1: Org
2: Activity
3: Program Title (H8)
4: Requested Title Change (I9)
5: Department (H6)
6: FTEs (H36)
7: Personnel (G27)
8: O&M (G28)
9: Debt (G29)
10: Grant (G30)
11: Transfers (G31)
12: Capital (G32)
13: Total Expenditures (G33)
14: Cost Recovery (E58, P24)
15: Unnamed: 15
16: Description (E12)
17: Additional Activities (E20)
18: Mandate (E41, H41, E43)
19: Unnamed: 19
20: Unnamed: 20
21: Service Level (E47, H47, E49)
22: Unnamed: 22
23: Unnamed: 23
24: Reliance & Interdependencies (E53, E55)
25: Unnamed: 25
26: Strategic Goal (E64, E66, E68, E74, E80)
27: Unnamed: 27
28: Unnamed: 28
29: Unnamed: 29
30: Unnamed: 30
31: Trend (Demand) (E87, E89)
32: Unnamed: 32
33: Risk (E93, E95)
34: Unnamed: 34
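
The listing shows the pattern the header expansion relies on: a header that cites several worksheet cells, such as `Mandate (E41, H41, E43)`, is followed by one `Unnamed` column per extra cell. A minimal sketch of checking that assumption before renaming anything; the helper name `count_header_spans` is hypothetical and the column list is a shortened copy of the output above:

```python
import re

def count_header_spans(columns):
    """Map each named header to (cells cited in parentheses, number of columns it spans)."""
    spans = {}
    current = None
    for col in columns:
        name = str(col)
        if name.startswith("Unnamed"):
            if current is not None:
                cited, width = spans[current]
                spans[current] = (cited, width + 1)
        else:
            current = name
            groups = re.findall(r"\(([^)]*)\)", name)
            # Cell references sit in the last parenthesised group, e.g. "Trend (Demand) (E87, E89)"
            cited = len(groups[-1].split(",")) if groups else 1
            spans[current] = (cited, 1)
    return spans

columns = [
    "Description (E12)",
    "Mandate (E41, H41, E43)", "Unnamed: 19", "Unnamed: 20",
    "Trend (Demand) (E87, E89)", "Unnamed: 32",
]
spans = count_header_spans(columns)
mismatched = [name for name, (cited, width) in spans.items() if cited != width]
print(spans)
print(mismatched)
```

An empty `mismatched` list suggests each compound header is followed by the expected number of filler columns, so a positional rename is safe for those headers.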


### Step 7: Generate Department and Fund Mappings

In [10]:
# Extract unique Dept and Fund codes from cleaned expenditure data
dept_mapping_df = df_expenditure_status[["dept_no"]].drop_duplicates().dropna()
fund_mapping_df = df_expenditure_status[["fund_no"]].drop_duplicates().dropna()

# Manually create department name lookup dictionary
dept_name_lookup = {
    210: "City Council", 220: "Mayor", 230: "Finance", 240: "Human Resources",
    250: "Legal", 260: "Police", 270: "Fire", 280: "Public Works",
    290: "Parks & Recreation", 300: "Planning & Development", 310: "Library",
    320: "IT Services", 330: "Community Development", 340: "Housing Services"
}

# Build final dictionaries from available codes; these map numeric fund/department values to names
dept_map = dept_mapping_df["dept_no"].astype(int).map(dept_name_lookup).to_dict()
fund_map = fund_mapping_df["fund_no"].dropna().astype(str).to_dict()

# Apply human-readable department names to main DataFrame
df_expenditure_status["department"] = df_expenditure_status["dept_no"].map(dept_map).fillna("Unmapped")
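
One detail worth checking in the cell above: `Series.to_dict()` uses the Series index (the row labels) as dictionary keys, so `dept_map` and `fund_map` end up keyed by row position rather than by code, and a later lookup by `dept_no` can fall through to "Unmapped". A minimal sketch of the difference on invented data, with a code-keyed dictionary built by a comprehension instead:

```python
import pandas as pd

lookup = {210: "City Council", 220: "Mayor"}  # same shape as dept_name_lookup
# Index values stand in for the rows where each code first appears (invented)
codes = pd.Series([210, 220], index=[0, 57], name="dept_no")

index_keyed = codes.map(lookup).to_dict()  # keys are row labels: 0 and 57
code_keyed = {int(c): lookup[int(c)] for c in codes if int(c) in lookup}  # keys are codes

frame = pd.DataFrame({"dept_no": [210, 210, 220]})
frame["via_index_keys"] = frame["dept_no"].map(lambda c: index_keyed.get(c, "Unmapped"))
frame["via_code_keys"] = frame["dept_no"].map(lambda c: code_keyed.get(c, "Unmapped"))
print(index_keyed)
print(frame)
```

The key type matters as well: `clean_identifiers` turns `dept_no` into zero-padded strings, so a dictionary used on that column would need string keys of the same form (for example `str(code).zfill(3)`).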

### Step 8: Whitespace and Case Standardization

In [13]:
df_expenditure_status = strip_whitespace_and_standardize(df_expenditure_status)
df_expenditure_status = df_expenditure_status.clean_names()

# Clean revenue_expense sheets
for sheet_name, df in revenue_expense.items():
    revenue_expense[sheet_name] = df.clean_names()

print("\n✅ Final Cleaned Expenditure Status Preview:")
display(df_expenditure_status.head())


✅ Final Cleaned Expenditure Status Preview:


,account_number,adjusted_appropriation,fund_no,dept_no,activity_code,object_code,sub_object_code,account_description,department
0,1000.210.410100.110.000 SALARIES AND WAGES,214440,1000,210,410100,110,,SALARIES AND WAGES,Unmapped
1,1000.210.410100.140.000 EMPLOYER CONTRIBUTIONS,199915,1000,210,410100,140,,EMPLOYER CONTRIBUTIONS,Unmapped
2,1000.210.410100.141.000 STATE RETIREMENT CON...,217,1000,210,410100,141,,STATE RETIREMENT CONTRIBUTIONS,Unmapped
3,1000.210.410100.210.000 OFFICE SUPPLIES,200,1000,210,410100,210,,OFFICE SUPPLIES,Unmapped
4,1000.210.410100.230.000 REPAIR/MAINTENANCE,750,1000,210,410100,230,,REPAIR/MAINTENANCE,Unmapped


### Step 9: Clean Program Inventory

In [14]:
# Define Multi Header expansion mapping for program inventory
column_expansion_map = {
    "Cost Recovery (E58, P24)": ["cost_recovery_e58", "cost_recovery_p24"],
    "Mandate (E41, H41, E43)": ["mandate_e41", "mandate_h41", "mandate_e43"],
    "Service Level (E47, H47, E49)": ["service_level_e47", "service_level_h47", "service_level_e49"],
    "Reliance & Interdependencies (E53, E55)": ["reliance_e53", "reliance_e55"],
    "Strategic Goal (E64, E66, E68, E74, E80)": [
        "strategic_goal_e64", "strategic_goal_e66", "strategic_goal_e68",
        "strategic_goal_e74", "strategic_goal_e80"
    ],
    "Trend (Demand) (E87, E89)": ["trend_demand_e87", "trend_demand_e89"],
    "Risk (E93, E95)": ["risk_e93", "risk_e95"]
}

# Run the full cleaning pipeline: rename, expand compound headers, drop filler columns, map department names
df_program_inventory = clean_program_inventory(
    program_inventory,
    column_expansion_map,
    dept_map,
    fund_map
)

# Final standardization for labels and whitespace
df_program_inventory = strip_whitespace_and_standardize(df_program_inventory)
df_program_inventory = df_program_inventory.clean_names()

# Preview final column structure
print("\n✅ Cleaned Program Inventory Columns:")
for i, col in enumerate(df_program_inventory.columns):
    print(f"{i}: {col}")


✅ Cleaned Program Inventory Columns:
0: fund
1: dept_no
2: activity
3: program_title_h8_
4: requested_title_change_i9_
5: department_h6_
6: ftes_h36_
7: personnel_g27_
8: o&m_g28_
9: debt_g29_
10: grant_g30_
11: transfers_g31_
12: capital_g32_
13: total_expenditures_g33_
14: cost_recovery_e58
15: cost_recovery_p24
16: description_e12_
17: additional_activities_e20_
18: mandate_e41
19: mandate_h41
20: mandate_e43
21: service_level_e47
22: service_level_h47
23: service_level_e49
24: reliance_e53
25: reliance_e55
26: strategic_goal_e64
27: strategic_goal_e66
28: strategic_goal_e68
29: strategic_goal_e74
30: strategic_goal_e80
31: trend_demand_e87
32: trend_demand_e89
33: risk_e93
34: risk_e95
35: department
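
Some of the cleaned names still carry a trailing underscore (`program_title_h8_`) or a special character (`o&m_g28_`), which can be awkward in downstream tools. A minimal standard-library sketch of one way to normalize them further; the helper `to_snake` is hypothetical and is not a description of what `clean_names()` does internally:

```python
import re

def to_snake(name):
    """Lower-case a header and collapse every run of non-alphanumerics into one underscore."""
    snake = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).lower())
    return snake.strip("_")

for header in ["Program Title (H8)", "O&M (G28)", "Total Expenditures (G33)"]:
    print(header, "->", to_snake(header))
```

It could be applied with `df.columns = [to_snake(c) for c in df.columns]`, after checking that no two headers collapse to the same name.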


### Step 10: Validations

In [15]:
print("\n🔎 Nulls in Expenditure Status:")
print(df_expenditure_status.isnull().sum())

print("\n🔎 Nulls in Program Inventory:")
print(df_program_inventory.isnull().sum())

print("\n📌 Data Types – Expenditure Status:")
print(df_expenditure_status.dtypes)

print("\n📌 Data Types – Program Inventory:")
print(df_program_inventory.dtypes)

# Key identifier check
missing_keys = df_expenditure_status[
    df_expenditure_status[["fund_no", "dept_no", "activity_code"]].isnull().any(axis=1)
]
print(f"\n❗Missing key budget identifiers: {missing_keys.shape[0]} rows")

# Duplicate account numbers
duplicates = df_expenditure_status[df_expenditure_status.duplicated(subset="account_number", keep=False)]
print(f"\n❗Duplicate account numbers: {duplicates.shape[0]}")


🔎 Nulls in Expenditure Status:
account_number               0
adjusted_appropriation       0
fund_no                      0
dept_no                      0
activity_code                0
object_code                  0
sub_object_code           2258
account_description          0
department                   0
dtype: int64

🔎 Nulls in Program Inventory:
fund                          0
dept_no                       0
activity                      0
program_title_h8_             0
requested_title_change_i9_    0
department_h6_                0
ftes_h36_                     0
personnel_g27_                0
o&m_g28_                      0
debt_g29_                     0
grant_g30_                    0
transfers_g31_                0
capital_g32_                  0
total_expenditures_g33_       0
cost_recovery_e58             0
cost_recovery_p24             0
description_e12_              0
additional_activities_e20_    0
mandate_e41                   0
mandate_h41                   0
manda
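
A further check that would fit here is reconciling the cleaned rows against the `Grand Total` line of the raw export (338487828.71 in the raw preview). The sketch below shows the comparison on invented figures. `reconcile` and its tolerance are assumptions, and whether the real data reconciles depends on which rows the Step 3 filter keeps:

```python
import math

def reconcile(amounts, reported_total, tolerance=0.01):
    """Return (matches, difference) comparing the sum of line items with a reported total."""
    computed = math.fsum(amounts)
    difference = computed - reported_total
    return math.isclose(computed, reported_total, abs_tol=tolerance), difference

# Invented sample, not the full dataset
line_items = [214440.00, 199915.00, 217.00, 200.00, 750.00]
ok, diff = reconcile(line_items, reported_total=415522.00)
print(ok, diff)
```

A non-zero difference would point to rows that were dropped or double-counted during filtering.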

### Step 11: Export Cleaned Files

In [16]:
# Export each cleaned dataset as its own CSV
df_expenditure_status.to_csv("cleaned_expenditure_status.csv", index=False)
df_program_inventory.to_csv("cleaned_program_inventory.csv", index=False)

print("✅ All cleaned files saved individually.")

✅ All cleaned files saved individually.
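
One thing to keep in mind when the CSVs are read back: zero-padded identifiers stored as text are re-inferred as numbers by default, which drops the padding. A minimal sketch with invented codes, using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Invented codes with leading zeros
frame = pd.DataFrame({"dept_no": ["045", "210"], "activity_code": ["010100", "410100"]})
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)            # dtypes inferred: codes become integers
buffer.seek(0)
text_read = pd.read_csv(buffer, dtype=str)    # codes kept as text, padding preserved
print(default_read["dept_no"].tolist())
print(text_read["dept_no"].tolist())
```

Other tools that infer column types on import may behave similarly, so it can be worth setting the identifier columns to text explicitly after loading.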
