# Exploring and Cleaning Data

In the following code I will be performing both exploratory analysis and cleaning the data provided within three excel spreadsheets. These spreadsheets contain budget information for the City of Missoula and aligining Program Inventory information surrounding different programs that are funded by the City of Missoula. 

The data provided is messy and comes from both the financial software that is used, but also from a different software that collects survey responses. The data in the excel files is messy and unclean. This process will load the files into the environment, perform data cleaning functions to prepare the data for manipulation in Power Bi. In Power BI, I will create a dashboard that provides information about all of the different programs currently funded and provide granular breakdowns for financial and other important information therein. 

Once the data is cleaned I will conduct some exploratory analysis on the newly cleaned data to see how it performs. 

In [1]:
### Importing Libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

In [2]:
## Load Files; only the first sheet in expenditure and program inventory, all of the sheets in revenue expense. 
revenue_expense = pd.read_excel("data/FY24_Revenue_Expense_Data.xlsx", sheet_name=None)
expenditure_status = pd.read_excel("data/FY24_Expenditure_Status.xlsx", sheet_name=0, skiprows=5)
program_inventory = pd.read_excel("data/Program_Inventory_Internal_Data_Collection.xlsx", sheet_name=0)

In [3]:
## Previews load 
print("Revenue Expense Loaded:", revenue_expense.keys())
print("Expenditure Status Preview:")
print(expenditure_status.head())
print("Program Inventory Preview:")
print(program_inventory.head())

Revenue Expense Loaded: dict_keys(['Rev', 'Exp', 'Status'])
Expenditure Status Preview:
                                                1000            GENERAL FUND  \
0                                                210            CITY COUNCIL   
1                            \n_x000D_Account Number  Adjusted Appropriation   
2  1000.210.410100                LEGISLATION SER...                     NaN   
3       1000.210.410100.100        PERSONAL SERVICES                     NaN   
4       1000.210.410100.110.000   SALARIES AND WAGES                  214440   

   Unnamed: 2  Unnamed: 3  Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  \
0         NaN         NaN         NaN         NaN         NaN         NaN   
1         NaN         NaN         NaN         NaN         NaN         NaN   
2         NaN         NaN         NaN         NaN         NaN         NaN   
3         NaN         NaN         NaN         NaN         NaN         NaN   
4         NaN         NaN         NaN         

In [4]:
## Defining Cleaning Data Functions
def drop_unnamed_columns(df):
    """Removes fully empty 'Unnamed' columns from a DataFrame."""
    return df.loc[:, ~df.columns.str.contains("^Unnamed")]

def clean_numeric_column(column):
    """Removes trailing .0 from numeric values for better formatting."""
    return column.astype(str).str.replace(r'\.0$', '', regex=True)

def clean_identifiers(df):
    """Cleans and standardizes Fund #, Activity Code, and Dept # formatting."""
    if "Fund #" in df.columns:
        df["Fund #"] = df["Fund #"].astype(str).str.split('.').str[0]
    if "Activity code" in df.columns:
        df["Activity code"] = df["Activity code"].astype(str).str.split('.').str[0].str.zfill(6)
    if "Dept #" in df.columns:
        df["Dept #"] = df["Dept #"].astype(str).str.split('.').str[0].str.zfill(3)
    return df

def clean_dataframe(df):
    """Removes empty rows and columns, trims column names, and applies identifier cleaning."""
    df = df.dropna(axis=1, how='all')
    df.columns = df.columns.str.strip()
    df = df.dropna(how='all')
    return clean_identifiers(df)

In [5]:
## Data Cleaning Process
df_expenditure_status = drop_unnamed_columns(expenditure_status)
df_program_inventory = drop_unnamed_columns(program_inventory)

In [6]:
# Ensure the "Account Number" column is properly formatted before splitting, then splitting account number into different pieces
df_expenditure_status = df_expenditure_status[df_expenditure_status.iloc[:, 0].notna()]

split_columns = ['Fund #', 'Dept #', 'Activity Code', 'Object Code', 'Sub-object Code']
df_expenditure_status[split_columns] = df_expenditure_status.iloc[:, 0].str.split('.', expand=True, n=4)

In [7]:
# Convert newly created columns to numeric where applicable
for col in split_columns:
    df_expenditure_status[col] = pd.to_numeric(df_expenditure_status[col], errors="coerce")

In [8]:
# Forward-fill missing Fund # and Dept # values correctly
df_expenditure_status[['Fund #', 'Dept #']] = df_expenditure_status[['Fund #', 'Dept #']].ffill()

In [9]:
# Drop any remaining rows where Activity Code is missing, ensuring no metadata rows remain
df_expenditure_status = df_expenditure_status.dropna(subset=['Activity Code'])

In [10]:
# Drop the original "Account Number" column as we now have separate columns
df_expenditure_status = df_expenditure_status.drop(columns=[df_expenditure_status.columns[0]])

In [11]:
# Reorder columns for better readability
column_order = split_columns + list(df_expenditure_status.columns[len(split_columns):])
df_expenditure_status = df_expenditure_status[column_order]

In [12]:
# Drop remaining unnamed columns that might have persisted
df_expenditure_status = drop_unnamed_columns(df_expenditure_status)

In [13]:
#### Stuck on Sub-Object code. 

In [14]:
# Fix multi-line headers in Program Inventory by extracting meaningful column names
def fix_multiline_headers(columns):
    new_headers = []
    for col in columns:
        if "(" in col and ")" in col:
            base_name = col.split("(")[0].strip()  # Extract base name before "("
            sub_names = col[col.find("(") + 1:col.find(")")].split(",")  # Extract sub-column names
            cleaned_sub_names = [base_name + " " + sub.strip() for sub in sub_names]
            
            # If the number of sub-columns matches the number of actual columns, replace them
            if len(cleaned_sub_names) == 1:
                new_headers.append(cleaned_sub_names[0])
            else:
                new_headers.append(base_name)  # Keep base name if mismatch occurs
        else:
            new_headers.append(col)
    
    # Ensure the length of new headers matches the number of columns
    if len(new_headers) != len(columns):
        print(f"⚠️ Warning: Column count mismatch! Expected {len(columns)}, but got {len(new_headers)}. Using original names.")
        return list(columns)  # Return the original names if there's a mismatch
    
    return new_headers

df_program_inventory.columns = fix_multiline_headers(df_program_inventory.columns)

In [15]:
# Remove special characters from column names
df_program_inventory.columns = df_program_inventory.columns.str.replace(r'\n|\r|_x000D_', '', regex=True)

In [16]:
# Preview cleaned data before further processing
print("Cleaned Revenue Expense Sheets:", revenue_expense.keys())
print("Final Cleaned Expenditure Status Preview:")
print(df_expenditure_status.head())
print("Final Cleaned Program Inventory Preview:")
print(df_program_inventory.head())

Cleaned Revenue Expense Sheets: dict_keys(['Rev', 'Exp', 'Status'])
Final Cleaned Expenditure Status Preview:
   Fund #  Dept #  Activity Code  Object Code  Sub-object Code  \
3  1000.0   210.0       410100.0          NaN              NaN   
4  1000.0   210.0       410100.0        110.0              NaN   
5  1000.0   210.0       410100.0        140.0              NaN   
6  1000.0   210.0       410100.0        141.0              NaN   
9  1000.0   210.0       410100.0          NaN              NaN   

   Sub-object Code  
3              NaN  
4              NaN  
5              NaN  
6              NaN  
9              NaN  
Final Cleaned Program Inventory Preview:
   Fund  Org  Activity                    Program Title H8  \
0  1000  210    410100                LEGISLATION SERVICES   
1  1000  210    419000  GENERAL GOV'T ONETIME EXPENDITURES   
2  1000  220    410210                      ADMINISTRATION   
3  1000  220    410250                      ADMINISTRATION   
4  1000  220    