<a href="https://colab.research.google.com/github/dornercr/INFO371/blob/main/Week%203%20Credit%20Risk%20Modeling%20%26%20Distance-Based%20Classification%20Design.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 371: Data Mining Applications
___

### Week 3: Credit Risk Modeling & Distance-Based Classification Design

### Charles Dorner, EdD (Candidate)
### College of Computing and Informatics, Drexel University

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    precision_recall_fscore_support,
)


In [None]:
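# Simulate a synthetic, imbalanced binary dataset (roughly 70% repaid, 30% defaulted).
# The column names below are illustrative labels only; the values are not real credit data.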
X, y = make_classification(
    n_samples=200,
    n_features=4,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    weights=[0.7, 0.3],
    class_sep=1.5,
    random_state=0
)

df = pd.DataFrame(X, columns=[
    'credit_score', 'income_level', 'debt_to_income', 'loan_amount'
])
df['default'] = y  # 0 = repaid, 1 = defaulted

In [None]:
# Train-test split (80% train, 20% test)
X = df.drop('default', axis=1)
y = df['default']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [None]:
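# Apply two types of feature scaling.
# Each scaler is fit on the training split only, then applied to the test split,
# so no test-set statistics leak into training.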

scaler_std = StandardScaler()
X_train_std = scaler_std.fit_transform(X_train)
X_test_std = scaler_std.transform(X_test)

scaler_mm = MinMaxScaler()
X_train_mm = scaler_mm.fit_transform(X_train)
X_test_mm = scaler_mm.transform(X_test)


In [None]:
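# Train a KNN model on the standardized features.
# Note: the OneVsRestClassifier wrapper is not required for a binary target.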
knn = KNeighborsClassifier(n_neighbors=5)
model = OneVsRestClassifier(knn)
model.fit(X_train_std, y_train)


In [None]:
# Evaluate on the test set
y_pred = model.predict(X_test_std)
print("Test Accuracy:", model.score(X_test_std, y_test))


Test Accuracy: 0.95


In [None]:
print("F1 Macro:", f1_score(y_test, y_pred, average='macro'))
print("F1 Micro:", f1_score(y_test, y_pred, average='micro'))
print("F1 Weighted:", f1_score(y_test, y_pred, average='weighted'))


F1 Macro: 0.9373040752351097
F1 Micro: 0.95
F1 Weighted: 0.94858934169279


In [None]:
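# Per-class precision, recall, F1, and support.
# Each array is ordered by label (0 = repaid, 1 = defaulted).
precision, recall, f1, support = precision_recall_fscore_support(y_test, y_pred)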


In [None]:
# scikit-learn estimator design: inspect the fitted KNN inside the wrapper
print("Model parameters:", model.estimators_[0].get_params())
print("Training accuracy:", model.estimators_[0].score(X_train_std, y_train))


Model parameters: {'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}
Training accuracy: 0.95625


In [None]:
# Compare KNN on raw, min-max scaled, and standard scaled features
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_raw.fit(X_train, y_train)
print("Unscaled Accuracy:", knn_raw.score(X_test, y_test))

knn_mm = KNeighborsClassifier(n_neighbors=5)
knn_mm.fit(X_train_mm, y_train)
print("Min-Max Scaled Accuracy:", knn_mm.score(X_test_mm, y_test))

knn_std = KNeighborsClassifier(n_neighbors=5)
knn_std.fit(X_train_std, y_train)
print("Standard Scaled Accuracy:", knn_std.score(X_test_std, y_test))


Unscaled Accuracy: 0.95
Min-Max Scaled Accuracy: 0.95
Standard Scaled Accuracy: 0.95


# 🎓 When to Use `StandardScaler` vs `MinMaxScaler` (and Whether to Scale at All)

---

## ✅ 1. Based on the Model Type

| Model                         | Requires Scaling? | Recommended Scaler     |
|------------------------------|--------------------|-------------------------|
| **K-Nearest Neighbors (KNN)**| ✅ Yes             | `StandardScaler` or `MinMaxScaler` |
| **Support Vector Machine (SVM)** | ✅ Yes         | `StandardScaler`        |
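| **Logistic Regression**       | ✅ Recommended (faster convergence, even regularization) | `StandardScaler` |
| **Linear Regression**         | Optional (plain least-squares predictions do not change; helps compare coefficients and matters with Ridge/Lasso) | `StandardScaler` |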
| **Decision Trees**            | ❌ No              | N/A                     |
| **Random Forests**            | ❌ No              | N/A                     |
| **XGBoost**                   | ❌ No              | N/A                     |
| **K-Means Clustering**        | ✅ Yes             | `StandardScaler`        |
| **Neural Networks**           | ✅ Yes             | `MinMaxScaler` or `StandardScaler` |
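
The contrast in the table between tree-based and distance-based models can be checked directly. The following is a minimal sketch, not part of the original notebook: the synthetic data, the per-column multipliers (hypothetical "units"), and the helper `agreement` are all assumptions made for illustration. It fits the same model on raw and standardized inputs and reports how often the two sets of test predictions match.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; the multipliers are hypothetical units that put the
# columns on very different scales.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X = X * np.array([1.0, 1000.0, 10.0, 50.0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the scaler on the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)


def agreement(make_model):
    """Fraction of test predictions that match between raw and scaled inputs."""
    raw = make_model().fit(X_train, y_train).predict(X_test)
    scaled = make_model().fit(X_train_s, y_train).predict(X_test_s)
    return float(np.mean(raw == scaled))


tree_agreement = agreement(lambda: DecisionTreeClassifier(random_state=0))
knn_agreement = agreement(lambda: KNeighborsClassifier(n_neighbors=5))

print("Decision tree, raw vs scaled agreement:", tree_agreement)
print("KNN, raw vs scaled agreement:", knn_agreement)
```

A tree splits on one feature at a time, so a per-feature rescaling should leave its predictions essentially unchanged, while KNN predictions typically shift once no single column dominates the distance.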

---

## 🔍 2. Based on Feature Range / Units

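### Use `MinMaxScaler` when:
- Features are in **different units or magnitudes** and have few extreme outliers (either scaler handles mixed units)  
  _e.g., income in thousands vs interest rate in decimals_
- You want values scaled to a fixed range such as `[0, 1]` (good for neural nets)
- You want to keep the original **distribution shape** inside that fixed range (both scalers are linear rescalings, so neither one changes the shape)
- You're preparing **image or pixel data** (e.g., mapping intensities from `[0, 255]` to `[0, 1]`)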

### Use `StandardScaler` when:
- Features are **approximately normally distributed**
- You want features to have **zero mean and unit variance**
- You're using **distance- or margin-based models** (KNN, SVM, K-Means)
- You're applying **PCA** (which is variance-based rather than distance-based) or **logistic regression**
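
To make the two options concrete, here is a small sketch that reproduces each scaler by hand and checks the result against scikit-learn. The single-column array `x` is a hypothetical feature invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature (e.g., income in thousands), as a column vector.
x = np.array([[20.0], [35.0], [50.0], [65.0], [80.0]])

# Min-max scaling: (x - min) / (max - min), which maps the data onto [0, 1].
x_mm_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Standardization: (x - mean) / std, using the population standard
# deviation (ddof=0), which is what StandardScaler uses.
x_std_manual = (x - x.mean(axis=0)) / x.std(axis=0)

x_mm = MinMaxScaler().fit_transform(x)
x_std = StandardScaler().fit_transform(x)

print(np.allclose(x_mm, x_mm_manual))    # True
print(np.allclose(x_std, x_std_manual))  # True
print(x_mm.ravel())                      # evenly spaced values from 0 to 1
```

Both transforms only shift and rescale each column, which is why neither changes the shape of a feature's distribution.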

---

## 📏 3. Based on Evaluation Metric Sensitivity

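- **Accuracy**: can drop when a magnitude-sensitive model is trained on unscaled features
- **F1 Score**: worth checking alongside accuracy when classes are imbalanced, since accuracy alone can hide errors on the minority class
- **Convergence Speed**: gradient-based models (NN, logistic regression) usually train faster on scaled inputs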

Scaling helps avoid misleading distances or weight dominance across features.
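
The "misleading distances" point can be illustrated with a toy example. This is a sketch under stated assumptions: the five applicants and the helper `nearest_to_first` are hypothetical and chosen so that the income column dominates the raw Euclidean distance. In the notebook's own comparison all three variants scored 0.95, plausibly because the simulated features already sit on similar scales, so the effect does not show up there.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical applicants: [annual income in dollars, interest rate as a decimal].
applicants = np.array([
    [50_000.0, 0.05],   # row 0: query applicant
    [50_500.0, 0.25],   # row 1: similar income, very different rate
    [52_000.0, 0.05],   # row 2: slightly higher income, identical rate
    [80_000.0, 0.15],
    [30_000.0, 0.10],
])


def nearest_to_first(points):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(dists)) + 1


raw_neighbor = nearest_to_first(applicants)
scaled_neighbor = nearest_to_first(StandardScaler().fit_transform(applicants))

print("Nearest neighbor on raw features:", raw_neighbor)              # 1
print("Nearest neighbor on standardized features:", scaled_neighbor)  # 2
```

On the raw data the rate column contributes almost nothing to the distance, so row 1 wins on income alone; after standardization both columns count, and row 2 becomes the nearest neighbor.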

---

## 📊 4. Based on Data Distribution

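- Use `StandardScaler` if data is roughly bell-shaped (Gaussian)
- Use `MinMaxScaler` if data is bounded or needs to stay between 0 and 1; note that a single extreme outlier compresses the remaining values into a narrow band
- Try both and compare performance on validation data using accuracy, F1, and recall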
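
One way to "try both and compare" is to wrap each scaler and the model in a pipeline and cross-validate. The sketch below is an assumption about how that comparison might be set up, not the notebook's method: it reuses the same `make_classification` settings as the simulated dataset, and the names `candidates` and `results` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data with the same generator settings as the notebook.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=2, weights=[0.7, 0.3], class_sep=1.5, random_state=0,
)

candidates = {
    "unscaled": make_pipeline(KNeighborsClassifier(n_neighbors=5)),
    "min-max": make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5)),
    "standard": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

metrics = ["accuracy", "f1", "recall"]
results = {}
for name, pipeline in candidates.items():
    # The pipeline refits the scaler inside each fold, so validation rows
    # never influence the scaling statistics.
    scores = cross_validate(pipeline, X, y, cv=5, scoring=metrics)
    results[name] = {m: float(scores[f"test_{m}"].mean()) for m in metrics}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

Averaging over five folds gives a steadier comparison than a single 40-row test split, where one or two predictions can move accuracy by several points.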

---

## ✅ Summary

| Situation                        | Recommended Scaler     |
|----------------------------------|-------------------------|
| Features on vastly different scales | `MinMaxScaler` or `StandardScaler` |
| Model depends on distance (KNN, SVM) | `StandardScaler` (`MinMaxScaler` is also reasonable for KNN) |
| Feature distributions are bounded and non-normal | `MinMaxScaler`        |
| Neural networks / image data      | `MinMaxScaler`         |
| You're unsure                    | Try both and compare   |
