# 🎓 MSc Thesis: Project Walkthrough

This notebook demonstrates the core components of the thesis project interactively. 
You can run each cell to see the code in action!

## 📂 Steps Covered:
1. **Setup**: Verifying imports.
2. **Data Generation**: Creating 'Oracle' data and injecting missingness.
3. **Imputation**: Fixing the missing values using Baseline and MICE methods.

In [None]:
# 1. Setup & Imports
import sys
import os
import numpy as np
import pandas as pd

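# Add the parent of the current working directory to the import path
# so that the `src` package resolves when the notebook runs from its own folder
sys.path.append(os.path.abspath('..'))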

from src.data.data_generator import DataGenerator
from src.utils import load_config
from src.imputation.baseline_imputer import BaselineImputer
from src.imputation.mice_imputer import MiceImputer

print("Libraries loaded successfully! ✅")

## 2. Data Simulation
We use the `DataGenerator` class to create synthetic credit scoring data.
We then apply a **missingness mechanism** (e.g., MNAR, Missing Not At Random, via rejection) that hides the outcome label of rejected applicants.
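To make the idea of MNAR rejection concrete, here is a minimal NumPy sketch. It is not the `DataGenerator` implementation: the function name `mask_labels_mnar`, the rejection probabilities, and the 30% default rate are all assumptions chosen for illustration. The point it shows is that when the chance of rejection depends on the true outcome, the default rate among accepted applicants is biased downward.

```python
import numpy as np

def mask_labels_mnar(y, reject_if_bad=0.6, reject_if_good=0.2, seed=0):
    """Hide labels with a probability that depends on the label itself (MNAR).

    Hypothetical helper for illustration; probabilities are assumed values.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    # Rejection probability depends on the true outcome, which is then hidden.
    p_reject = np.where(y == 1, reject_if_bad, reject_if_good)
    rejected = rng.random(y.shape[0]) < p_reject
    y_observed = y.copy()
    y_observed[rejected] = np.nan
    return y_observed, rejected

demo_rng = np.random.default_rng(42)
y_true_demo = demo_rng.binomial(1, 0.3, size=1000)  # 1 = default (assumed coding)
y_obs_demo, rejected_demo = mask_labels_mnar(y_true_demo)

print(f"Rejected: {rejected_demo.mean():.1%}")
print(f"Default rate, all applicants: {y_true_demo.mean():.3f}")
print(f"Default rate, accepted only:  {np.nanmean(y_obs_demo):.3f}")
```

Because bad applicants are rejected more often in this sketch, a model trained only on the accepted rows would see an overly optimistic default rate, which is the problem the imputation step tries to address.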

In [None]:
# Load Configuration
config = load_config("../configs/experiment_config.yaml")

# Initialize Generator
gen = DataGenerator(config)

# 1. Generate Oracle Data (Ground Truth)
data = gen.generate_oracle_data()
print(f"Generated {len(data['y'])} applicants with {data['X'].shape[1]} features.")

# 2. Apply Missingness (Simulate Rejection)
data_miss = gen.introduce_missingness(data)

# Calculate Stats
n_missing = np.isnan(data_miss['y_observed']).sum()
print(f"\nMissing Labels (Rejected): {n_missing} ({n_missing/len(data['y']):.1%})")

## 3. Imputation (Fixing the Data)
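Now we estimate the missing labels using two strategies. Imputation fills the gaps with plausible values; it does not guarantee that the true values are recovered.
*   **Baseline**: Mode imputation (every missing value is replaced with the most frequent observed value).
*   **MICE**: Multiple Imputation by Chained Equations (each incomplete variable is modeled from the other variables, so the imputed values reflect correlations in the data).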
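The difference between the two strategies can be sketched in plain NumPy. This is not the code behind `BaselineImputer` or `MiceImputer`: the helpers `mode_impute` and `chained_regression_impute` are hypothetical, and the second one is a deterministic, single-imputation simplification of chained equations using linear least squares. Full MICE draws imputations from a predictive distribution and produces several completed datasets. The demo data are masked completely at random purely to keep the example short.

```python
import numpy as np

def mode_impute(col):
    """Fill NaNs in a 1-D array with its most frequent observed value."""
    col = np.asarray(col, dtype=float)
    observed = col[~np.isnan(col)]
    values, counts = np.unique(observed, return_counts=True)
    filled = col.copy()
    filled[np.isnan(col)] = values[np.argmax(counts)]
    return filled

def chained_regression_impute(data, n_iter=5):
    """Deterministic chained-equations sketch using linear least squares."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    # Start from column means, then re-estimate each incomplete column in turn.
    filled = np.where(missing, np.nanmean(data, axis=0), data)
    for _ in range(n_iter):
        for j in np.flatnonzero(missing.any(axis=0)):
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            obs = ~missing[:, j]
            # Fit on rows where column j is observed, predict where it is missing.
            coef, *_ = np.linalg.lstsq(design[obs], filled[obs, j], rcond=None)
            filled[~obs, j] = design[~obs] @ coef
    return filled

sketch_rng = np.random.default_rng(0)
X_demo = sketch_rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] + 0.5 * sketch_rng.normal(size=500) > 0).astype(float)
y_masked = y_demo.copy()
y_masked[sketch_rng.random(500) < 0.3] = np.nan
was_missing = np.isnan(y_masked)

y_mode = mode_impute(y_masked)
y_chain = chained_regression_impute(np.column_stack([X_demo, y_masked]))[:, -1]

print("Mode fill, distinct imputed values:", np.unique(y_mode[was_missing]).size)
print("Chained fill, distinct imputed values:", np.unique(y_chain[was_missing]).size)
```

Mode imputation assigns the same value to every missing row, whereas the regression-based fill varies with the features. Note that the regression output is continuous, so a binary label would still need thresholding or a classification model in practice.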

In [None]:
y_obs = data_miss['y_observed']
X = data_miss['X']

# --- Baseline Imputation ---
print("Applying Baseline Imputation...")
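base_imputer = BaselineImputer(strategy='mode')
# NOTE: base_imputer is instantiated but not used below; the labels are
# filled directly with pandas, replacing NaNs with the most frequent observed label.
y_series = pd.Series(y_obs)
y_base = y_series.fillna(y_series.mode()[0]).values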
print(f"Baseline - Remaining NaNs: {np.isnan(y_base).sum()}")

# --- MICE Imputation ---
print("\nApplying MICE Imputation...")
# Stack X and y into one matrix so MICE can use the features to impute the missing labels
data_matrix = np.column_stack((X, y_obs))
mice_imputer = MiceImputer(max_iter=5)
data_mice = mice_imputer.fit(data_matrix).transform(data_matrix)
y_mice = data_mice[:, -1] # Last column is y

print(f"MICE - Remaining NaNs: {np.isnan(y_mice).sum()}")