<a href="https://colab.research.google.com/github/Wanhar-Aziz/Personal-Finance-ML-Project/blob/main/ML_Proposal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Personal Finance Machine Learning (UPEI-CS-4120) Project Proposal**

---



## **Installing Requirements**

---



In [None]:
!pip install kagglehub pandas numpy



## **Download the Personal Finance Dataset from Kaggle**

---

In [None]:
import kagglehub

# Download latest version of dataset
path = kagglehub.dataset_download("miadul/personal-finance-ml-dataset")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'personal-finance-ml-dataset' dataset.
Path to dataset files: /kaggle/input/personal-finance-ml-dataset


## **Load the Dataset into Pandas**
____

The Personal Finance ML Dataset was downloaded from Kaggle with the kagglehub library, which fetches the latest published version. After confirming the download path, we located the CSV file (synthetic_personal_finance_dataset.csv) and loaded it into a Pandas DataFrame. The dataset contains 32,424 rows and 20 columns, and the first few records confirm the presence of the demographic, financial, and loan-related attributes needed for analysis.

In [None]:
import os
import pandas as pd

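# Look for CSV files inside the downloaded folder (sorted for a deterministic order)
csv_files = sorted(f for f in os.listdir(path) if f.endswith(".csv"))
print("CSV files found:", csv_files)

# Fail early with a clear message if the folder holds no CSV
if not csv_files:
    raise FileNotFoundError(f"No CSV files found in {path}")

# Pick the first CSV (adjust if there are multiple)
csv_path = os.path.join(path, csv_files[0])
df = pd.read_csv(csv_path)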

# Preview
print(df.shape)
df.head()

CSV files found: ['synthetic_personal_finance_dataset.csv']
(32424, 20)


Unnamed: 0,user_id,age,gender,education_level,employment_status,job_title,monthly_income_usd,monthly_expenses_usd,savings_usd,has_loan,loan_type,loan_amount_usd,loan_term_months,monthly_emi_usd,loan_interest_rate_pct,debt_to_income_ratio,credit_score,savings_to_income_ratio,region,record_date
0,U00001,56,Female,High School,Self-employed,Salesperson,3531.69,1182.59,367655.03,No,,0.0,0,0.0,0.0,0.0,430,8.68,Other,2024-01-09
1,U00002,19,Female,PhD,Employed,Salesperson,3531.73,2367.99,260869.1,Yes,Education,146323.34,36,4953.5,13.33,1.4,543,6.16,North America,2022-02-13
2,U00003,20,Female,Master,Employed,Teacher,2799.49,1003.91,230921.21,No,,0.0,0,0.0,0.0,0.0,754,6.87,Africa,2022-05-12
3,U00004,25,Male,PhD,Employed,Manager,5894.88,4440.12,304815.51,Yes,Business,93242.37,24,4926.57,23.93,0.84,461,4.31,Europe,2023-10-02
4,U00005,53,Female,PhD,Employed,Student,5128.93,4137.61,461509.48,No,,0.0,0,0.0,0.0,0.0,516,7.5,Africa,2021-08-07
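
The two derived ratio columns can be sanity-checked against the preview rows. The sketch below assumes (this is inferred from the five rows shown, not taken from the dataset's documentation) that `debt_to_income_ratio` is the monthly EMI divided by monthly income, and that `savings_to_income_ratio` is savings divided by annual income (12 × monthly income). The `preview` frame simply retypes the relevant values from the rows above.

```python
import pandas as pd

# Values retyped from the five preview rows above (only the columns needed here).
preview = pd.DataFrame({
    "monthly_income_usd": [3531.69, 3531.73, 2799.49, 5894.88, 5128.93],
    "savings_usd": [367655.03, 260869.10, 230921.21, 304815.51, 461509.48],
    "monthly_emi_usd": [0.0, 4953.50, 0.0, 4926.57, 0.0],
    "debt_to_income_ratio": [0.0, 1.40, 0.0, 0.84, 0.0],
    "savings_to_income_ratio": [8.68, 6.16, 6.87, 4.31, 7.50],
})

# Assumed definitions (inferred from the preview, not documented by the dataset).
dti_check = preview["monthly_emi_usd"] / preview["monthly_income_usd"]
sti_check = preview["savings_usd"] / (12 * preview["monthly_income_usd"])

# Compare against the reported columns, allowing for two-decimal rounding.
tolerance = 0.005 + 1e-9
dti_matches = (dti_check - preview["debt_to_income_ratio"]).abs() <= tolerance
sti_matches = (sti_check - preview["savings_to_income_ratio"]).abs() <= tolerance

print(pd.DataFrame({
    "dti_check": dti_check.round(4), "dti_ok": dti_matches,
    "sti_check": sti_check.round(4), "sti_ok": sti_matches,
}))
```

A match on five rows is only suggestive; running the same comparison on the full `df` would show whether these definitions hold throughout. Under this reading, a `debt_to_income_ratio` above 1 (as in the second row) means the monthly installment exceeds monthly income.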


## **Exploratory Data Analysis**
____

We began by examining the dataset’s structure, confirming that it contains 32,424 rows and 20 columns. A missing-data check showed that every feature is complete except loan_type, which is missing in approximately 59.92% of rows. That figure matches the share of individuals without a loan, which suggests the values are absent because no loan exists rather than because data were lost. Finally, we assessed the target variable, has_loan: about 59.92% of individuals do not have a loan, while 40.08% do. This indicates a moderate class imbalance that should be considered in later modeling steps.

In [None]:
# Basic info
rows, cols = df.shape
print(f"Rows: {rows}, Columns: {cols}")

# Missing data %
missing_data = (df.isnull().sum() / len(df) * 100).round(2)
print("\nMissing Data (%):\n", missing_data)

# Class distribution for has_loan
class_distribution = (df['has_loan'].value_counts(normalize=True) * 100).round(2)
print("\nClass Distribution (has_loan):\n", class_distribution)

Rows: 32424, Columns: 20

Missing Data (%):
 user_id                     0.00
age                         0.00
gender                      0.00
education_level             0.00
employment_status           0.00
job_title                   0.00
monthly_income_usd          0.00
monthly_expenses_usd        0.00
savings_usd                 0.00
has_loan                    0.00
loan_type                  59.92
loan_amount_usd             0.00
loan_term_months            0.00
monthly_emi_usd             0.00
loan_interest_rate_pct      0.00
debt_to_income_ratio        0.00
credit_score                0.00
savings_to_income_ratio     0.00
region                      0.00
record_date                 0.00
dtype: float64

Class Distribution (has_loan):
 has_loan
No     59.92
Yes    40.08
Name: proportion, dtype: float64
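
The identical 59.92% figures suggest that `loan_type` is missing exactly where `has_loan` is `No`. A minimal sketch of how that could be checked with a cross-tabulation follows; the small `demo` frame is a hypothetical stand-in for `df` so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for df, mimicking the pattern seen in the preview rows.
demo = pd.DataFrame({
    "has_loan": ["No", "Yes", "No", "Yes", "No"],
    "loan_type": [None, "Education", None, "Business", None],
})

# Cross-tabulate the target against a "loan_type is missing" flag.
missing_flag = demo["loan_type"].isna().rename("loan_type_missing")
overlap = pd.crosstab(demo["has_loan"], missing_flag)
print(overlap)

# True when loan_type is missing for every "No" row and present for every "Yes" row.
structural = bool((missing_flag == (demo["has_loan"] == "No")).all())
print("Missingness coincides with has_loan == 'No':", structural)
```

On the real data this would be run with `df` in place of `demo`. If the check holds, the gaps reflect the absence of a loan, so filling them with an explicit category may be more appropriate than dropping rows. It would also mean `loan_type` reveals the target and should probably be kept out of the feature set when predicting `has_loan`.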


## **1. Dataset Snapshot**
____

In [None]:
# Feature descriptions, looked up by lower-cased column name.
# Only "age", "gender", "loan_type", and "has_loan" match columns in this dataset;
# every other column falls back to a title-cased version of its name below.
DESCRIPTIONS = {
    "age": "Age of the individual (years)",
    "gender": "Gender identity",
    "marital_status": "Marital status",
    "dependents": "Number of dependents",
    "education": "Highest education level",
    "employment": "Employment status",
    "annual_income": "Annual income",
    "monthly_income": "Monthly income",
    "monthly_expenses": "Monthly household expenses",
    "credit_card_debt": "Total credit card debt",
    "mortgage_debt": "Mortgage debt amount",
    "other_debt": "Other outstanding debt",
    "loan_type": "Type of loan (if any)",
    "loan_amount": "Loan amount",
    "has_loan": "Target: Yes = has loan, No = no loan"
}

# Build the table
table = pd.DataFrame({
    "Feature": df.columns,
    "Description": [DESCRIPTIONS.get(c.lower(), c.replace("_"," ").title()) for c in df.columns],
    "Missing (%)": [f"{missing_data[c]:.2f}" for c in df.columns]
})

# Add rows and columns info at the top
summary = pd.DataFrame({
    "Feature": ["Rows (samples)", "Columns (features + target)"],
    "Description": [rows, cols],
    "Missing (%)": ["—", "—"]
})
table1 = pd.concat([summary, table], ignore_index=True)

# Show and save
display(table1)
table1.to_csv("Table_1_Dataset_Snapshot.csv", index=False)

Unnamed: 0,Feature,Description,Missing (%)
0,Rows (samples),32424,—
1,Columns (features + target),20,—
2,user_id,User Id,0.00
3,age,Age of the individual (years),0.00
4,gender,Gender identity,0.00
5,education_level,Education Level,0.00
6,employment_status,Employment Status,0.00
7,job_title,Job Title,0.00
8,monthly_income_usd,Monthly Income Usd,0.00
9,monthly_expenses_usd,Monthly Expenses Usd,0.00
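
Most keys in `DESCRIPTIONS` (for example `education`, `monthly_income`, `loan_amount`) do not match the column names in the loaded frame, which is why the table falls back to title-cased names such as "Education Level". A sketch of a map keyed on the actual column names follows. The wording of each description is an assumption based on the column names and preview values, not official dataset documentation, and `COLUMN_DESCRIPTIONS` and `describe` are hypothetical names.

```python
# Column names copied from the loaded frame; descriptions are assumptions
# inferred from the names and preview rows, not official documentation.
COLUMN_DESCRIPTIONS = {
    "user_id": "Unique user identifier",
    "age": "Age of the individual (years)",
    "gender": "Gender identity",
    "education_level": "Highest education level",
    "employment_status": "Employment status",
    "job_title": "Job title",
    "monthly_income_usd": "Monthly income (USD)",
    "monthly_expenses_usd": "Monthly expenses (USD)",
    "savings_usd": "Total savings (USD)",
    "has_loan": "Target: Yes = has loan, No = no loan",
    "loan_type": "Type of loan (if any)",
    "loan_amount_usd": "Loan amount (USD)",
    "loan_term_months": "Loan term (months)",
    "monthly_emi_usd": "Monthly loan installment, EMI (USD)",
    "loan_interest_rate_pct": "Loan interest rate (%)",
    "debt_to_income_ratio": "Debt-to-income ratio",
    "credit_score": "Credit score",
    "savings_to_income_ratio": "Savings-to-income ratio",
    "region": "Geographic region",
    "record_date": "Date of the record",
}


def describe(column: str) -> str:
    """Return a description, falling back to a title-cased column name."""
    return COLUMN_DESCRIPTIONS.get(column, column.replace("_", " ").title())


print(describe("education_level"))
print(describe("some_new_column"))
```

Keeping the title-case fallback means a column added in a later dataset version still gets a readable label instead of raising a `KeyError`.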


## **2. Dataset Description**
___

The dataset used in this study is the *Personal Finance ML Dataset*, publicly available on Kaggle (https://www.kaggle.com/datasets/miadul/personal-finance-ml-dataset).
It is a **synthetic but realistic dataset** created to simulate financial behavior for research and machine learning purposes.
The dataset consists of **32,424 individual records** and **20 attributes**, which include a user identifier and record date, demographic features
(e.g., age, gender, education level, employment status, job title, region), financial indicators
(e.g., monthly income, monthly expenses, savings, credit score, debt-to-income ratio), and loan-related details (loan type, amount, term, monthly EMI, interest rate).
The target variable, **has_loan**, is binary and stored as Yes/No (Yes = has loan, No = no loan).

Because the dataset is synthetic, it does not contain personally identifiable information and avoids privacy concerns.
It is provided under Kaggle’s open-access data-sharing framework, making it freely available for academic,
educational, and prototyping uses. The dataset is well-suited for exploratory data analysis, classification,
regression, and clustering tasks in the domain of personal finance and credit risk.
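
For a classification setup, the `Yes`/`No` target seen in the preview rows would typically be encoded as 1/0, and the roughly 60/40 split can be offset with class weights. The sketch below uses a hypothetical 10,000-row stand-in built from the reported percentages (59.92% `No`, 40.08% `Yes`) and the common "balanced" heuristic, weight = n_samples / (n_classes × class count).

```python
import pandas as pd

# Hypothetical stand-in for df["has_loan"], scaled from the reported percentages.
has_loan = pd.Series(["No"] * 5992 + ["Yes"] * 4008, name="has_loan")

# Encode the string labels as integers: 1 = has a loan, 0 = no loan.
y = has_loan.map({"No": 0, "Yes": 1})

# "Balanced" heuristic: n_samples / (n_classes * count of each class).
counts = y.value_counts().sort_index()
class_weights = (len(y) / (len(counts) * counts)).to_dict()
print(class_weights)
```

A dictionary like this can be passed to estimators that accept per-class weights; the same formula is what scikit-learn applies when `class_weight="balanced"` is set. The minority class (`Yes`) receives the larger weight, so errors on it count more during training.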

## **Citation**
________
miadul, “Personal Finance ML Dataset,” Kaggle. [Online]. Available: https://www.kaggle.com/datasets/miadul/personal-finance-ml-dataset/data.