# IncomeInsight: Global Higher-Education Cost Analytics & Planning
## Phase 2: Model Development

**Project:** EduSpend - Global Higher-Education Cost Analytics & Planning  
**Author:** yan-cotta  
**Date:** June 7, 2025  
**Phase:** 2 - Model Development  

### Project Overview
This notebook builds on our EDA findings to develop a predictive model for income level. We'll create a classification model that predicts whether a person's income exceeds 50K, using the cleaned data in `adult_cleaned.csv`.

### Notebook Goals
1. Prepare data and engineer relevant features
2. Develop a baseline classification model
3. Evaluate model performance and identify key predictive features
4. Refine the model for improved prediction accuracy

In [2]:
# Import data manipulation libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import pickle

# Import modeling libraries
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import xgboost as xgb

# Import MLflow for experiment tracking
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Set plotting style
plt.style.use('default')
sns.set_palette("viridis")

### MLflow Setup

Configure MLflow experiment tracking to monitor model training and to log parameters, metrics, and artifacts.

In [None]:
# Set MLflow tracking URI (local for now)
EXPERIMENT_NAME = "IncomeInsight_>50K_Prediction"
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment(EXPERIMENT_NAME)

print("MLflow Configuration:")
print(f"✓ Tracking URI: {mlflow.get_tracking_uri()}")
print(f"✓ Experiment: {mlflow.get_experiment_by_name(EXPERIMENT_NAME).name}")

In [None]:
# Load the cleaned dataset
df = pd.read_csv('../data/adult_cleaned.csv')

# Apply binary mapping
# NOTE: map_binary_columns and build_preprocessor are not defined in the cells
# shown here; they must be defined or imported before this cell runs
df = map_binary_columns(df)

# Separate features and target
X = df.drop(columns='income')
y = df['income']

# Build preprocessor
preprocessor = build_preprocessor()

# Use it in a full modeling pipeline (Pipeline is already imported above)
from sklearn.linear_model import LogisticRegression

model_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])
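
The cell above calls `map_binary_columns` and `build_preprocessor` without showing them. The sketch below is one possible shape for those helpers, not the project's actual implementation: the column names, the binary mappings, and the tiny synthetic frame are all hypothetical stand-ins for `adult_cleaned.csv`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumption: these column names and mappings are hypothetical
BINARY_MAPS = {
    "sex": {"Male": 1, "Female": 0},
    "income": {">50K": 1, "<=50K": 0},
}
NUMERIC_COLS = ["age", "hours_per_week"]
BINARY_FEATURE_COLS = ["sex"]
CATEGORICAL_COLS = ["workclass"]


def map_binary_columns(df):
    """Return a copy of df with two-valued string columns mapped to 0/1."""
    df = df.copy()
    for col, mapping in BINARY_MAPS.items():
        if col in df.columns:
            df[col] = df[col].map(mapping)
    return df


def build_preprocessor():
    """Scale numeric columns, pass binary ones through, one-hot encode the rest."""
    return ColumnTransformer([
        ("num", StandardScaler(), NUMERIC_COLS),
        ("bin", "passthrough", BINARY_FEATURE_COLS),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_COLS),
    ])


# Tiny synthetic frame (not real data) standing in for the cleaned dataset
toy = pd.DataFrame({
    "age": [25, 47, 38, 52, 29, 61],
    "hours_per_week": [40, 50, 45, 60, 20, 35],
    "sex": ["Male", "Female", "Male", "Female", "Female", "Male"],
    "workclass": ["Private", "State-gov", "Private", "Self-emp", "Private", "State-gov"],
    "income": ["<=50K", ">50K", "<=50K", ">50K", "<=50K", ">50K"],
})

toy = map_binary_columns(toy)
X_toy = toy.drop(columns="income")
y_toy = toy["income"]

toy_pipeline = Pipeline([
    ("preprocessing", build_preprocessor()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(X_toy, y_toy)
predictions = toy_pipeline.predict(X_toy)
print(predictions)
```

Keeping the preprocessor inside the pipeline means the scaler and encoder are fitted only on the data passed to `fit`, which avoids leaking statistics from held-out rows once a train/test split is introduced.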

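In [None]:
# Preprocessing-only pipeline (defined before it is used below)
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])

In [None]:
# Fit and transform the features only; X excludes the 'income' target
df_transformed = pipeline.fit_transform(X)

In [None]:
from joblib import dump

# Persist the fitted preprocessing pipeline
os.makedirs('../models', exist_ok=True)
dump(pipeline, '../models/pipeline.pkl')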
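
A saved pipeline is only useful if it can be reloaded and applied to new rows. The following sketch shows a round trip with `joblib`, using a hypothetical two-column frame and a `StandardScaler` in place of the notebook's preprocessor, and a temporary directory instead of `../models`.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the notebook's preprocessing pipeline
demo_pipeline = Pipeline(steps=[("preprocessor", StandardScaler())])
train = pd.DataFrame({"age": [25.0, 47.0, 38.0], "hours": [40.0, 50.0, 45.0]})
demo_pipeline.fit(train)

# Save to a temporary directory rather than ../models
path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
dump(demo_pipeline, path)

# Reload and apply to unseen rows with transform(), not fit_transform(),
# so the statistics learned on the training data are reused
restored = load(path)
new_rows = pd.DataFrame({"age": [30.0], "hours": [42.0]})
same_output = np.allclose(
    demo_pipeline.transform(new_rows), restored.transform(new_rows)
)
print(same_output)
```

Files written by `joblib` are pickle-based, so they should be loaded only from trusted sources and, ideally, with the same scikit-learn version that produced them.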