# Exploring and Cleaning Data

In the following code I will be performing both exploratory analysis and cleaning the data provided within three excel spreadsheets. These spreadsheets contain budget information for the City of Missoula and aligining Program Inventory information surrounding different programs that are funded by the City of Missoula. 

The data provided is messy and comes from both the financial software that is used, but also from a different software that collects survey responses. The data in the excel files is messy and unclean. This process will load the files into the environment, perform data cleaning functions to prepare the data for manipulation in Power Bi. In Power BI, I will create a dashboard that provides information about all of the different programs currently funded and provide granular breakdowns for financial and other important information therein. 

Once the data is cleaned I will conduct some exploratory analysis on the newly cleaned data to see how it performs. 

In [17]:
### Importing Libraries
# Standard Libraries
import os

# Text Cleaning
import re

# Data Handling
import pandas as pd
import numpy as np

# Excel Handling
import openpyxl

# Additional Cleaning Utilities; each labeled below
import chardet  # Detects encoding issues
import janitor  # Simplifies df cleaning
from tqdm import tqdm  # Adds progress bars to loops

# Visualization
import missingno as msno
import matplotlib.pyplot as plt

In [18]:
## define the file paths 
data_path = "data/"

## Load Files; only the first sheet in expenditure and program inventory, all of the sheets in revenue expense. 
revenue_expense = pd.read_excel(
    os.path.join(data_path, "FY24_Revenue_Expense_Data.xlsx"), sheet_name=None, engine="openpyxl")
expenditure_status = pd.read_excel(
    os.path.join(data_path, "FY24_Expenditure_Status.xlsx"), sheet_name=0, skiprows=6, engine="openpyxl")
program_inventory = pd.read_excel(
    os.path.join(data_path, "Program_Inventory_Internal_Data_Collection.xlsx"), sheet_name=0, engine="openpyxl")

In [19]:
## Display initial previews
print("Revenue Expense Loaded:", revenue_expense.keys())
print("\n Expenditure Status Preview:")
print(expenditure_status.head())

print("\n Program Inventory Preview:")
print(program_inventory.head())

Revenue Expense Loaded: dict_keys(['Rev', 'Exp', 'Status'])

 Expenditure Status Preview:
                                                 210            CITY COUNCIL  \
0                            \n_x000D_Account Number  Adjusted Appropriation   
1  1000.210.410100                LEGISLATION SER...                     NaN   
2       1000.210.410100.100        PERSONAL SERVICES                     NaN   
3       1000.210.410100.110.000   SALARIES AND WAGES                  214440   
4   1000.210.410100.140.000   EMPLOYER CONTRIBUTIONS                  199915   

   Unnamed: 2  Unnamed: 3  Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  \
0         NaN         NaN         NaN         NaN         NaN         NaN   
1         NaN         NaN         NaN         NaN         NaN         NaN   
2         NaN         NaN         NaN         NaN         NaN         NaN   
3         NaN         NaN         NaN         NaN         NaN         NaN   
4         NaN         NaN         NaN       

In [20]:
# Print actual column names in expenditure_status
print("\n Available columns in expenditure_status:")
print(list(expenditure_status.columns))


 Available columns in expenditure_status:
['210', 'CITY COUNCIL', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15', 'Unnamed: 16']


In [21]:
## Define Cleaning Functions

def drop_unnamed_columns(df):
    """
    Removes fully empty 'Unnamed' columns from a DataFrame.
    
    Parameters:
        df (pd.DataFrame): The input DataFrame.

    Returns:
        pd.DataFrame: Cleaned DataFrame without unnecessary columns.
    """
    return df.loc[:, ~df.columns.str.contains("^Unnamed", na=False)]


def clean_numeric_column(column):
    """
    Cleans a numeric column by removing unnecessary '.0' from values.
    
    Parameters:
        column (pd.Series): The numeric column to be cleaned.
        
    Returns:
        pd.Series: Cleaned column with standardized formatting.
    """
    return column.astype(str).str.replace(r"\.0$", "", regex=True)


def clean_identifiers(df):
    """
    Cleans and standardizes financial identifiers: Fund #, Activity Code, Dept #.
    - Splits at '.' to remove decimals.
    - Zero-pads Activity Code (6 digits) and Dept # (3 digits).
    
    Parameters:
        df (pd.DataFrame): DataFrame containing financial identifiers.
        
    Returns:
        pd.DataFrame: Updated DataFrame with standardized formatting.
    """
    if "Fund #" in df.columns:
        df["Fund #"] = df["Fund #"].astype(str).str.split(".").str[0]

    if "Activity Code" in df.columns:
        df["Activity Code"] = df["Activity Code"].astype(str).str.split(".").str[0].str.zfill(6)

    if "Dept #" in df.columns:
        df["Dept #"] = df["Dept #"].astype(str).str.split(".").str[0].str.zfill(3)

    return df


def rename_multiline_headers(df):
    """
    Renames unnamed columns in Program Inventory to follow a structured approach.
    - Expands column names where appropriate.

    Parameters:
        df (pd.DataFrame): The input DataFrame.

    Returns:
        pd.DataFrame: DataFrame with renamed columns.
    """
    column_mappings = {
        "Cost Recovery": ["Cost Recovery E58", "Cost Recovery P24"],
        "Mandate": ["Mandate E41", "Mandate H41", "Mandate E43"],
        "Service Level": ["Service Level E47", "Service Level H47", "Service Level E49"],
        "Reliance & Interdependencies": ["Reliance E53", "Reliance E55"],
        "Strategic Goal": ["Strategic Goal E64", "Strategic Goal E66", "Strategic Goal E68", "Strategic Goal E74", "Strategic Goal E80"],
        "Trend (Demand)": ["Trend Demand E87", "Trend Demand E89"],
        "Risk": ["Risk E93", "Risk E95"]
    }

    new_columns = []
    idx = 0

    for col in df.columns:
        if col in column_mappings:
            new_columns.extend(column_mappings[col])
            idx += len(column_mappings[col]) - 1  # Adjust for multiple columns
        else:
            new_columns.append(col)

    df.columns = new_columns[: len(df.columns)]  # Ensure the number of columns matches

    return df

In [24]:
## Data Cleaning Process for Expenditure Status
# Remove unnamed columns
df_expenditure_status = drop_unnamed_columns(expenditure_status)

# Splitting Account Number into structured columns
split_cols = ["Fund #", "Dept #", "Activity Code", "Object Code", "Sub-object Code"]
df_expenditure_status[split_cols] = df_expenditure_status.iloc[:, 0].astype(str).str.split(".", expand=True, n=4)

# Convert numeric fields
for col in split_cols:
    df_expenditure_status[col] = pd.to_numeric(df_expenditure_status[col], errors="coerce").astype("Int64")

# Forward-fill Fund # and Dept #
df_expenditure_status[["Fund #", "Dept #"]] = df_expenditure_status[["Fund #", "Dept #"]].ffill()

# Drop missing Activity Code rows
df_expenditure_status = df_expenditure_status.dropna(subset=["Activity Code"])

# Reset index
df_expenditure_status = df_expenditure_status.reset_index(drop=True)

# Display preview
print("\n Cleaned Expenditure Status:")
print(df_expenditure_status.head(10))


 Cleaned Expenditure Status:
                                                 210 CITY COUNCIL  Fund #  \
0       1000.210.410100.100        PERSONAL SERVICES          NaN    1000   
1       1000.210.410100.110.000   SALARIES AND WAGES       214440    1000   
2   1000.210.410100.140.000   EMPLOYER CONTRIBUTIONS       199915    1000   
3  1000.210.410100.141.000   STATE RETIREMENT CON...          217    1000   
4                1000.210.410100.200        SUPPLIES          NaN    1000   
5          1000.210.410100.210.000   OFFICE SUPPLIES          200    1000   
6       1000.210.410100.230.000   REPAIR/MAINTENANCE          750    1000   
7           1000.210.410100.240.000   OTHER SUPPLIES          350    1000   
8      1000.210.410100.300        PURCHASED SERVICES          NaN    1000   
9           1000.210.410100.310.000   COMMUNICATIONS           20    1000   

   Dept #  Activity Code  Object Code  Sub-object Code  
0     210         410100         <NA>             <NA>  
1     21

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_expenditure_status[split_cols] = df_expenditure_status.iloc[:, 0].astype(str).str.split(".", expand=True, n=4)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_expenditure_status[split_cols] = df_expenditure_status.iloc[:, 0].astype(str).str.split(".", expand=True, n=4)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-

In [25]:
## Define Mapping
# Extract mappings
fund_mapping = df_expenditure_status[["Fund #", "Dept #"]].drop_duplicates().dropna()
dept_mapping = df_expenditure_status[["Dept #", "Department"]].drop_duplicates().dropna()

# Convert to dictionary format
fund_mapping_dict = dict(zip(fund_mapping["Fund #"], fund_mapping["Dept #"]))
dept_mapping_dict = dict(zip(dept_mapping["Dept #"], dept_mapping["Department"]))

# Apply mapping
df_expenditure_status["Department"] = df_expenditure_status["Dept #"].map(dept_mapping_dict).fillna("REDACTED")

# Checkpoint
print("\nFinal Mapped Expenditure Status:")
print(df_expenditure_status.head(10))

KeyError: "['Department'] not in index"

In [27]:
df_expenditure_status

Unnamed: 0,210,CITY COUNCIL,Fund #,Dept #,Activity Code,Object Code,Sub-object Code
0,1000.210.410100.100 PERSONAL SERVICES,,1000,210,410100,,
1,1000.210.410100.110.000 SALARIES AND WAGES,214440,1000,210,410100,110,
2,1000.210.410100.140.000 EMPLOYER CONTRIBUTIONS,199915,1000,210,410100,140,
3,1000.210.410100.141.000 STATE RETIREMENT CON...,217,1000,210,410100,141,
4,1000.210.410100.200 SUPPLIES,,1000,210,410100,,
...,...,...,...,...,...,...,...
4470,9000.900.510000.830.000 DEPRECIATION,0,9000,900,510000,830,
4471,9000.900.510000.900 CAPITAL OUTLAY,,9000,900,510000,,
4472,9000.900.510000.900.000 ASSET DISPOSAL EXPENSE,0,9000,900,510000,900,
4473,9000.900.510360.800 OTHER OBJECTS,,9000,900,510360,,


In [None]:
## Clean and Process Program Inventory

# Rename "Org" to "Dept #" for consistency
df_program_inventory = df_program_inventory.rename(columns={"Org": "Dept #"})

# Apply department mapping
df_program_inventory["Department"] = df_program_inventory["Dept #"].map(dept_mapping)

# Assign Fund Name using mapping
df_program_inventory["Fund Name"] = df_program_inventory["Fund"].map(fund_mapping)

# Rename multi-line headers
df_program_inventory = rename_multiline_headers(df_program_inventory)

# Final cleanup
df_program_inventory = drop_unnamed_columns(df_program_inventory)

In [None]:
# Display the final cleaned datasets
import ace_tools as tools  # Ensure ace_tools is installed for visualization
tools.display_dataframe_to_user(name="Updated Expenditure Status", dataframe=df_expenditure_status)
tools.display_dataframe_to_user(name="Updated Program Inventory", dataframe=df_program_inventory)

# Preview data for verification
print("\nFinal Cleaned Expenditure Status Preview:")
print(df_expenditure_status.head())

print("\nFinal Cleaned Program Inventory Preview:")
print(df_program_inventory.head())

In [None]:
## When using the comments in line, it allows you to easily reference them later by using help(function)
## help(fix_multiline_headers)