# KNN for Credit Decisions and Profit Maximization

## 0. Executive Summary (objective + deliverables)
**Objective.** Train and validate a supervised KNN classifier to estimate applicants' default probability and make approve/decline decisions that **maximize expected profit** under a business cost/benefit matrix.  
**Deliverables.** 1) KNN model encapsulated in a reproducible pipeline, 2) operating threshold optimized for utility, 3) holdout test evaluation with technical metrics and **business utility**, 4) deployment and monitoring guidelines.  
**Success criterion.** Beat baselines (approve all / approve none) in net utility; keep risk metrics aligned with policy (e.g., minimum TPR on a priority segment).  
**Key assumptions.** Use case: credit decision for existing customers (historical information available) with target "default next period." Costs and benefits are provided by the business (or scenario-based here).  
**Limitations.** KNN is sensitive to feature scaling and dimensionality, and inference is slow on large training sets; these are mitigated via preprocessing, k selection, and production controls.

## 1. Context and Sources
### 1.1. Problem statement
At application time, the bank wants a system that **predicts default risk** and decides approve/decline to maximize financial utility. The decision depends on: estimated risk, operating threshold, and the cost/benefit matrix.

### 1.2. Dataset and data dictionary
"Default of Credit Card Clients" (~30k rows). Demographics, credit limit, recent payment history, billed amounts, and previous payments. Binary label: `default_payment_next_month` (1=default, 0=no default).  
Variable groups:  
- Demographics/profile: `LIMIT_BAL`, `SEX`, `EDUCATION`, `MARRIAGE`, `AGE`.  
- Payment history: `PAY_0`, `PAY_2` ... `PAY_6` (recent monthly status; the source file has no `PAY_1`).  
- Billed amounts: `BILL_AMT1` ... `BILL_AMT6`.  
- Paid amounts: `PAY_AMT1` ... `PAY_AMT6`.  
- Target: `default_payment_next_month`.

### 1.3. Link to the course
Apply Similarity and Neighbors (KNN) concepts: effect of scaling, distance metric choice, k selection, probability calibration, and decision thresholds.

In [None]:
import warnings
from pathlib import Path

from IPython.display import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.base import clone
from sklearn.calibration import CalibrationDisplay
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    confusion_matrix,
    make_scorer,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    cross_val_predict,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

plt.style.use("seaborn-v0_8")
pd.set_option("display.float_format", lambda v: f"{v:,.2f}")
warnings.filterwarnings("ignore", category=FutureWarning)

PROJECT_ROOT = Path.cwd()
DATA_DIR = PROJECT_ROOT / "Code" / "Final Project" / "Loan Defaults"
DATA_FILE = DATA_DIR / "default of credit card clients.xls"
TARGET = "default_payment_next_month"
ID_COLS = ["ID"]
SEED = 42
rng = np.random.default_rng(SEED)

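# The source .xls has an extra label row above the column names, hence header=1.
# Reading .xls files with pandas requires the xlrd package.
raw_df = (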
    pd.read_excel(DATA_FILE, header=1)
    .rename(columns={"default payment next month": TARGET})
)
raw_df[TARGET] = raw_df[TARGET].astype(int)

print(f"Rows: {raw_df.shape[0]:,} | Columns: {raw_df.shape[1]}")
raw_df.head()

## 2. Business Metric Design
### 2.1. Cost/benefit matrix (TP, FP, TN, FN)
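Definitions (the positive decision here is **approve**; this is the reverse of the target label, where 1 = default):  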
- **TP**: approve a non-defaulter → benefit = expected margin (interest − cost of funds − provisions) minus operating costs.  
- **FP**: approve a defaulter → cost = expected loss (principal × LGD − recoveries) + expenses.  
- **TN**: reject a defaulter → benefit/cost ≈ 0, or benefit from avoided losses.  
- **FN**: reject a good customer → opportunity cost (foregone margin).

### 2.2. Expected value function and optimal decision threshold
For an estimated default probability $\hat{p}$, the expected **utility** of approving is:
$$
U(\text{approve}) = (1-\hat{p}) \cdot B_{\text{TP}} - \hat{p} \cdot C_{\text{FP}}
$$
and of **rejecting**:
$$
U(\text{reject}) = (1-\hat{p}) \cdot (-C_{\text{FN}}) + \hat{p} \cdot B_{\text{TN}}
$$
Approve when $U(\text{approve}) \ge U(\text{reject})$. The **optimal threshold** $\tau^*$ is the value of $\hat{p}$ at which the two utilities are equal, so applicants with $\hat{p} \le \tau^*$ are approved. This threshold is generally not 0.5; it depends on the cost/benefit matrix.
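
Setting $U(\text{approve}) = U(\text{reject})$ and solving for $\hat{p}$ gives a closed-form break-even point, $\tau^* = \frac{B_{\text{TP}} + C_{\text{FN}}}{B_{\text{TP}} + C_{\text{FN}} + C_{\text{FP}} + B_{\text{TN}}}$. The sketch below evaluates it for the scenario values used in this notebook. It assumes $\hat{p}$ is well calibrated, so an empirical threshold search remains a useful cross-check.

```python
# Scenario-based values (assumptions, not official business inputs).
cost_matrix = {"tp_benefit": 1600, "fp_cost": 6000, "fn_cost": 1200, "tn_benefit": 200}

def breakeven_threshold(m):
    """Default probability at which approving and rejecting have equal expected utility."""
    gain_if_good = m["tp_benefit"] + m["fn_cost"]   # approve vs reject when the customer repays
    loss_if_bad = m["fp_cost"] + m["tn_benefit"]    # approve vs reject when the customer defaults
    return gain_if_good / (gain_if_good + loss_if_bad)

def expected_utilities(p_default, m):
    """Expected utility of approving and of rejecting one applicant."""
    approve = (1 - p_default) * m["tp_benefit"] - p_default * m["fp_cost"]
    reject = (1 - p_default) * (-m["fn_cost"]) + p_default * m["tn_benefit"]
    return approve, reject

tau_star = breakeven_threshold(cost_matrix)
print(f"tau* = {tau_star:.4f}")  # 2800 / 9000 -> 0.3111
```

With these values an applicant is worth approving only when the estimated default probability is below roughly 0.31, well under the default 0.5 cutoff.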

### 2.3. Reference baselines
- **Approve all**: utility if nobody is rejected.  
- **Reject all**: utility if nobody is approved.  
The model must **outperform** both in utility.
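
Both baselines can be written per applicant in terms of the default prevalence $\pi$: approve all earns $(1-\pi) B_{\text{TP}} - \pi C_{\text{FP}}$, and reject all earns $\pi B_{\text{TN}} - (1-\pi) C_{\text{FN}}$. A minimal sketch follows; the prevalence of 0.22 is an assumed value for illustration and should be replaced by the rate measured on the training data.

```python
# Scenario-based values (assumptions, not official business inputs).
cost_matrix = {"tp_benefit": 1600, "fp_cost": 6000, "fn_cost": 1200, "tn_benefit": 200}

def baseline_utilities(prevalence, m):
    """Expected utility per applicant for the two trivial policies."""
    good = 1.0 - prevalence
    approve_all = good * m["tp_benefit"] - prevalence * m["fp_cost"]
    reject_all = prevalence * m["tn_benefit"] - good * m["fn_cost"]
    return {"approve_all": approve_all, "reject_all": reject_all}

# Assumed prevalence, for illustration only.
baselines = baseline_utilities(0.22, cost_matrix)
for name, value in baselines.items():
    print(f"{name}: {value:,.0f} per applicant")
```

Under these assumptions both trivial policies lose money per applicant, so the bar the model has to clear is the less negative of the two.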

In [None]:
COST_MATRIX = {
    "tp_benefit": 1600,
    "fp_cost": 6000,
    "fn_cost": 1200,
    "tn_benefit": 200,
}
THRESHOLD_GRID = np.linspace(0.05, 0.95, 181)

def confusion_counts(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {
        "tp": int(np.sum((y_true == 1) & (y_pred == 1))),
        "fp": int(np.sum((y_true == 0) & (y_pred == 1))),
        "tn": int(np.sum((y_true == 0) & (y_pred == 0))),
        "fn": int(np.sum((y_true == 1) & (y_pred == 0))),
    }

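def cost_sensitive_utility(counts, matrix=COST_MATRIX, normalize=False):
    # `counts` come from confusion_counts, where the positive class is default
    # (label 1) and a positive prediction means "decline". COST_MATRIX follows
    # Section 2.1, where positive means "approve", so the cells are mapped:
    #   tn = good customer approved -> business TP (margin earned)
    #   fn = defaulter approved     -> business FP (credit loss)
    #   fp = good customer declined -> business FN (foregone margin)
    #   tp = defaulter declined     -> business TN (loss avoided)
    utility = (
        counts["tn"] * matrix["tp_benefit"]
        + counts["tp"] * matrix["tn_benefit"]
        - counts["fn"] * matrix["fp_cost"]
        - counts["fp"] * matrix["fn_cost"]
    )
    if normalize:
        total = sum(counts.values())
        return utility / max(total, 1)
    return utility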

def evaluate_predictions(y_true, y_prob, threshold, matrix=COST_MATRIX):
    preds = (y_prob >= threshold).astype(int)
    counts = confusion_counts(y_true, preds)
    utility = cost_sensitive_utility(counts, matrix=matrix, normalize=False)
    return {
        "threshold": threshold,
        "utility": utility,
        "normalized_utility": utility / len(y_true),
        **counts,
    }

def search_best_threshold(y_true, y_prob, matrix=COST_MATRIX, grid=THRESHOLD_GRID):
    rows = [evaluate_predictions(y_true, y_prob, tau, matrix) for tau in grid]
    results = pd.DataFrame(rows)
    best_row = results.loc[results["utility"].idxmax()].to_dict()
    return results, best_row

def normalized_best_utility(y_true, y_prob):
    _, best_row = search_best_threshold(y_true, y_prob)
    return best_row["normalized_utility"]

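# Score on predicted default probabilities. `response_method` replaces the
# deprecated `needs_proba=True` (scikit-learn >= 1.4; keep needs_proba=True on older versions).
# The threshold is re-optimized on each validation fold, so this CV score is optimistic.
utility_scorer = make_scorer(normalized_best_utility, response_method="predict_proba", greater_is_better=True)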

print("Cost/Benefit matrix:", COST_MATRIX)

## 3. Data Understanding (targeted EDA)
### 3.1. Target distribution
Default is typically imbalanced. Document prevalence and implications for metrics (accuracy can mislead; prefer ROC/PR and utility).

### 3.2. Variable types
Identify: continuous numeric (limits, amounts), ordinal (payment status), encoded categorical (sex, education, marital status). Justify treatment given the distance metric.

### 3.3. Missing values and outliers
State policies: minimal imputation if needed; winsorization or robust scaling to reduce outlier impact on distances.

### 3.4. Correlations and scales
KNN requires **scaling** so large-range variables don't dominate distance. Document StandardScaler/RobustScaler choice and rationale.
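
To make the scaling point concrete, the sketch below uses five hypothetical applicants described only by `LIMIT_BAL` and `AGE` and finds the nearest neighbor of the first one, before and after standardization. The values are invented for illustration.

```python
import numpy as np

# Hypothetical applicants: columns are [LIMIT_BAL, AGE].
X = np.array([
    [50_000.0, 25.0],    # row 0: query applicant
    [52_000.0, 60.0],    # row 1: similar limit, very different age
    [80_000.0, 26.0],    # row 2: different limit, similar age
    [200_000.0, 40.0],   # row 3
    [20_000.0, 33.0],    # row 4
])

def nearest_to_first(data):
    """Row index of the applicant closest to row 0 (row 0 itself excluded)."""
    dists = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(dists)) + 1

# Standardize each column to zero mean and unit variance, as StandardScaler does.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print("nearest (raw):   ", nearest_to_first(X))         # 1
print("nearest (scaled):", nearest_to_first(X_scaled))  # 2
```

On raw values the 35-year age gap is invisible next to a 2,000-unit limit difference, so row 1 wins; after standardization both features contribute on comparable scales and row 2 becomes the nearest neighbor.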

In [None]:
target_counts = (
    raw_df[TARGET]
    .value_counts()
    .rename_axis("default")
    .reset_index(name="count")
    .assign(pct=lambda df: df["count"] / df["count"].sum())
)

missing = raw_df.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)

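# Payment-status columns only; a bare "PAY_" prefix would also match PAY_AMT1..PAY_AMT6.
pay_cols = [c for c in raw_df.columns if c.startswith("PAY_") and not c.startswith("PAY_AMT")]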
corr_cols = pay_cols + ["LIMIT_BAL", TARGET]
corr_matrix = raw_df[corr_cols].corr(method="spearman")

display(target_counts)
if not missing.empty:
    display(missing.to_frame("missing_values"))

display(raw_df.describe().T[["mean", "std", "min", "max"]].round(2).head(10))

fig, axes = plt.subplots(1, 3, figsize=(18, 4))
sns.barplot(data=target_counts, x="default", y="pct", ax=axes[0])
axes[0].set_title("Target distribution")
axes[0].set_ylabel("Prevalence")

sns.boxplot(data=raw_df, y="LIMIT_BAL", ax=axes[1])
axes[1].set_title("LIMIT_BAL spread")

sns.heatmap(corr_matrix, cmap="coolwarm", center=0, ax=axes[2])
axes[2].set_title("Spearman correlation snapshot")
plt.tight_layout()

## 4. Avoid Leakage and Define the Feature Set
### 4.1. Inclusion/exclusion criteria
Use only variables **available at decision time**. Exclude any that anticipate the outcome beyond the defined horizon.

### 4.2. Transformations/encoding
- Categoricals: consistent encoding for KNN (one-hot if needed).  
- Ordinal variables: preserve order when meaningful (payment status).  
- Amounts: consider log or robust transforms for heavy tails.
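
Amount columns can contain zeros and negative values (for example a credit balance after overpayment), so a plain logarithm is not defined for them. One option, sketched here as an assumption rather than the notebook's chosen transform, is a signed `log1p`, which compresses both tails and keeps the sign.

```python
import numpy as np

def signed_log1p(x):
    """Compress heavy tails symmetrically: sign(x) * log(1 + |x|)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

# Hypothetical billed amounts, including a credit balance and a large outlier.
amounts = np.array([-1_500.0, 0.0, 250.0, 12_000.0, 900_000.0])
transformed = signed_log1p(amounts)
print(np.round(transformed, 2))
```

The function is monotonic, so the ordering of customers is preserved while the outlier no longer dominates distances. It could be wrapped in `sklearn.preprocessing.FunctionTransformer` and placed before the scaler in the numeric pipeline.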

### 4.3. Candidate feature list
Demographics/profile + recent payment history + billed and paid amounts prior to the decision cutoff. Document exclusions due to leakage or low signal.

In [None]:
categorical_features = ["SEX", "EDUCATION", "MARRIAGE"]
numeric_features = [
    col for col in raw_df.columns
    if col not in ID_COLS + [TARGET] + categorical_features
]

feature_overview = pd.DataFrame({
    "feature": numeric_features + categorical_features,
    "type": ["numeric"] * len(numeric_features) + ["categorical"] * len(categorical_features),
})

display(feature_overview.head(12))
print(f"Total candidate features: {len(feature_overview)}")

## 5. Data Split and Validation Protocol
### 5.1. Train/validation/test
Stratified 80/20 train/test split to preserve target prevalence; validation uses 5-fold stratified CV on the training portion. The test set is held out for final evaluation only.

### 5.2. Validation metrics
Report ROC-AUC and PR-AUC for general performance; **optimize and report expected utility** per the business matrix.

### 5.3. Repeats/seed for stability
Use multiple seeds or stratified CV; report mean and dispersion for robustness.
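
One way to get both repeats and stratification is scikit-learn's `RepeatedStratifiedKFold`. The sketch below runs it on synthetic labels (the 22% positive rate and the placeholder features are assumptions) and reports the dispersion of the per-fold default rate; the same splitter can be passed as `cv=` to a search or to `cross_val_score`.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
y = (rng.random(1_000) < 0.22).astype(int)   # synthetic labels, ~22% positives (assumed)
X = rng.normal(size=(1_000, 3))              # placeholder features

# 5 folds repeated 3 times with different shuffles -> 15 train/validation splits.
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
fold_rates = np.array([y[test_idx].mean() for _, test_idx in rskf.split(X, y)])

print(f"folds evaluated: {len(fold_rates)}")
print(f"default rate per fold: mean={fold_rates.mean():.3f}, std={fold_rates.std():.3f}")
```

Reporting the mean and standard deviation of utility over these 15 folds gives a rough stability estimate that does not depend on a single seed.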

In [None]:
X = raw_df.drop(columns=ID_COLS + [TARGET])
y = raw_df[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=SEED,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

print(f"Train size: {X_train.shape[0]:,} | Test size: {X_test.shape[0]:,}")
print(f"Train default rate: {y_train.mean():.3f} | Test default rate: {y_test.mean():.3f}")

## 6. Preprocessing and Pipeline
### 6.1. Scaling
Include scaling inside the pipeline to avoid information leakage across splits.

### 6.2. Encoding
Apply categorical encoding inside the pipeline, consistent across train and test.

### 6.3. Pipeline structure
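`ColumnTransformer` (numeric: median imputation -> standard scaling; categorical: most-frequent imputation -> one-hot encoding) -> `KNeighborsClassifier`. Document each decision.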

In [None]:
numeric_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

categorical_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline, categorical_features),
    ]
)

preprocessor

## 7. KNN Modeling (supervised classification)
### 7.1. Hyperparameters to tune
- n_neighbors (k): controls boundary smoothness and variance.  
- weights: uniform vs distance.  
- p: 1 (Manhattan) vs 2 (Euclidean).  
- metric, leaf_size as needed.

### 7.2. Hyperparameter search
Grid/Random search with stratified CV. Primary optimization: **utility**; secondary metrics: ROC-AUC/PR-AUC.

### 7.3. Performance vs k curve
Document bias-variance trade-off and how utility changes with k.

In [None]:
knn_pipeline = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("knn", KNeighborsClassifier()),
    ]
)

param_grid = {
    "knn__n_neighbors": [11, 21, 31, 41],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}

grid = GridSearchCV(
    estimator=knn_pipeline,
    param_grid=param_grid,
    scoring={"utility": utility_scorer, "roc_auc": "roc_auc", "pr_auc": "average_precision"},
    refit="utility",
    cv=cv,
    n_jobs=-1,
    verbose=2,
)

grid.fit(X_train, y_train)

cv_results = (
    pd.DataFrame(grid.cv_results_)
    .sort_values(by="mean_test_utility", ascending=False)
    .loc[:, [
        "param_knn__n_neighbors",
        "param_knn__weights",
        "param_knn__p",
        "mean_test_utility",
        "mean_test_roc_auc",
        "mean_test_pr_auc",
    ]]
)

display(cv_results.head(8))
print("Best params:", grid.best_params_)
print(f"Best normalized utility (cv): {grid.best_score_:.4f}")
best_model = grid.best_estimator_

## 8. Probabilities, Calibration, and Utility-Based Threshold
### 8.1. Probability prediction and calibration
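KNN probabilities are the fraction of positive (default) labels among the k nearest neighbors, or a distance-weighted fraction when `weights="distance"`. With uniform weights they take at most k + 1 distinct values, so nearby thresholds can produce identical decisions. Assess **calibration** and, if necessary, apply calibration (Platt/Isotonic) fitted on a validation set or via cross-validation.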
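
A hedged sketch of isotonic calibration with `CalibratedClassifierCV` follows. It runs on synthetic, imbalanced data rather than the credit dataset, and k = 21 is an arbitrary choice; in the notebook the calibrator would wrap the full preprocessing pipeline instead of a bare classifier.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in data (not the credit dataset).
X, y = make_classification(n_samples=3_000, n_features=8, weights=[0.78], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

raw_knn = KNeighborsClassifier(n_neighbors=21).fit(X_tr, y_tr)
# Isotonic calibration fitted with internal 5-fold CV on the training split only.
calibrated_knn = CalibratedClassifierCV(KNeighborsClassifier(n_neighbors=21), method="isotonic", cv=5)
calibrated_knn.fit(X_tr, y_tr)

raw_p = raw_knn.predict_proba(X_te)[:, 1]
cal_p = calibrated_knn.predict_proba(X_te)[:, 1]

# With uniform weights, raw KNN scores are multiples of 1/k.
print("distinct raw scores:", len(np.unique(raw_p)), "(at most k + 1 = 22)")
print(f"Brier raw={brier_score_loss(y_te, raw_p):.4f} | calibrated={brier_score_loss(y_te, cal_p):.4f}")
```

Whether calibration helps is an empirical question: compare Brier scores and reliability curves on held-out data before adopting it.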

### 8.2. Utility vs threshold curve
Construct the utility curve by varying the decision threshold. Explain how the maximum is selected.

### 8.3. Operating threshold selection
Choose $\tau^*$ that maximizes utility. If business imposes constraints (e.g., minimum TPR), select the best threshold that satisfies them.
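
The constrained selection can be sketched as a filtered grid search. Two assumptions are made here: "TPR" is read as recall on the default class (the share of actual defaulters that are declined), and the scores are synthetic. The helper names are hypothetical.

```python
import numpy as np

# Scenario-based values (assumptions, not official business inputs).
M = {"tp_benefit": 1600, "fp_cost": 6000, "fn_cost": 1200, "tn_benefit": 200}

def utility_and_recall(y_default, p_default, tau):
    """Utility per applicant and default recall when declining p_default >= tau."""
    decline = p_default >= tau
    bad = y_default == 1
    utility = (
        np.sum(~decline & ~bad) * M["tp_benefit"]    # good customer approved
        - np.sum(~decline & bad) * M["fp_cost"]      # defaulter approved
        - np.sum(decline & ~bad) * M["fn_cost"]      # good customer declined
        + np.sum(decline & bad) * M["tn_benefit"]    # defaulter declined
    )
    recall = np.sum(decline & bad) / max(np.sum(bad), 1)
    return utility / len(y_default), recall

def constrained_threshold(y_default, p_default, grid, min_recall):
    """Best-utility (tau, utility, recall) among thresholds meeting the recall floor."""
    rows = [(tau, *utility_and_recall(y_default, p_default, tau)) for tau in grid]
    feasible = [r for r in rows if r[2] >= min_recall]
    return max(feasible, key=lambda r: r[1]) if feasible else None

# Synthetic scores for illustration only: defaulters tend to score higher.
rng = np.random.default_rng(42)
y = (rng.random(5_000) < 0.22).astype(int)
p = np.clip(rng.normal(0.25 + 0.30 * y, 0.15), 0, 1)
grid = np.linspace(0.05, 0.95, 181)

free = constrained_threshold(y, p, grid, min_recall=0.0)
bound = constrained_threshold(y, p, grid, min_recall=0.80)
print(f"unconstrained:  tau={free[0]:.3f}, utility={free[1]:.1f}, recall={free[2]:.2f}")
print(f"recall >= 0.80: tau={bound[0]:.3f}, utility={bound[1]:.1f}, recall={bound[2]:.2f}")
```

The gap between the two utilities is the price of the constraint, which is worth reporting to the business alongside the chosen threshold.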

In [None]:
oof_probs = cross_val_predict(
    best_model,
    X_train,
    y_train,
    cv=cv,
    method="predict_proba",
    n_jobs=-1,
)[:, 1]

threshold_curve, best_threshold_row = search_best_threshold(y_train, oof_probs)
best_threshold = float(best_threshold_row["threshold"])

print(f"Best threshold (utility-optimized): {best_threshold:.3f}")
print(f"Normalized utility at tau*: {best_threshold_row['normalized_utility']:.4f}")

fig, axes = plt.subplots(1, 3, figsize=(18, 4))
axes[0].plot(threshold_curve["threshold"], threshold_curve["normalized_utility"], label="Utility/customer")
axes[0].axvline(best_threshold, color="red", linestyle="--", label=f"tau*={best_threshold:.2f}")
axes[0].set_xlabel("Threshold")
axes[0].set_ylabel("Normalized utility")
axes[0].legend()

fpr, tpr, _ = roc_curve(y_train, oof_probs)
axes[1].plot(fpr, tpr, label=f"ROC AUC={roc_auc_score(y_train, oof_probs):.3f}")
axes[1].plot([0, 1], [0, 1], color="grey", linestyle="--")
axes[1].set_xlabel("False positive rate")
axes[1].set_ylabel("True positive rate")
axes[1].set_title("Training ROC (OOF)")
axes[1].legend()

CalibrationDisplay.from_predictions(
    y_train,
    oof_probs,
    n_bins=10,
    strategy="quantile",
    ax=axes[2],
)
axes[2].set_title("Calibration (OOF)")
plt.tight_layout()

## 9. Final Evaluation and Baselines
### 9.1. Primary metrics
ROC-AUC, PR-AUC, confusion matrix on test, and **total utility** vs baselines.

### 9.2. Utility by segments
Analyze utility by relevant subgroups (credit-limit quartiles here; tenure or internal score where available) to confirm consistency of benefit.

### 9.3. Stability
Variation across folds/seeds. Note if the model is sensitive to small perturbations.

In [None]:
test_probs = best_model.predict_proba(X_test)[:, 1]
test_preds = (test_probs >= best_threshold).astype(int)
test_counts = confusion_counts(y_test, test_preds)
test_utility = cost_sensitive_utility(test_counts, matrix=COST_MATRIX, normalize=False)

roc_auc = roc_auc_score(y_test, test_probs)
pr_auc = average_precision_score(y_test, test_probs)

print(f"Test ROC-AUC: {roc_auc:.3f} | PR-AUC: {pr_auc:.3f}")
print(f"Test utility (total): {test_utility:,.0f} | per customer: {test_utility / len(y_test):.2f}")
print(classification_report(y_test, test_preds, digits=3))

cm = pd.DataFrame(
    confusion_matrix(y_test, test_preds),
    index=pd.Index(["Actual 0", "Actual 1"], name="Actual"),
    columns=pd.Index(["Pred 0", "Pred 1"], name="Predicted"),
)
display(cm)

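# Predictions follow the model convention: 1 = predicted default (decline), 0 = approve.
baseline_preds = {
    "approve_all": np.zeros(len(y_test), dtype=int),
    "reject_all": np.ones(len(y_test), dtype=int),
    "threshold_0.50": (test_probs >= 0.5).astype(int),
    f"knn_tau_{best_threshold:.2f}": test_preds,  # tuned model, for side-by-side comparison
}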
baseline_rows = []
for name, preds in baseline_preds.items():
    counts = confusion_counts(y_test, preds)
    util = cost_sensitive_utility(counts, matrix=COST_MATRIX, normalize=True)
    baseline_rows.append({
        "strategy": name,
        "normalized_utility": util,
        "tp": counts["tp"],
        "fp": counts["fp"],
        "tn": counts["tn"],
        "fn": counts["fn"],
    })
baseline_df = pd.DataFrame(baseline_rows).sort_values("normalized_utility", ascending=False)
display(baseline_df)

test_results = X_test.copy()
test_results["y_true"] = y_test.values
test_results["prob_default"] = test_probs
test_results["y_pred"] = test_preds
test_results["limit_segment"] = pd.qcut(test_results["LIMIT_BAL"], q=4, duplicates="drop").astype(str)

segment_summary = []
for seg, df_seg in test_results.groupby("limit_segment"):
    counts = confusion_counts(df_seg["y_true"], df_seg["y_pred"])
    seg_util = cost_sensitive_utility(counts, normalize=True)
    segment_summary.append({
        "limit_segment": seg,
        "customers": len(df_seg),
        "default_rate": df_seg["y_true"].mean(),
        "utility_per_customer": seg_util,
    })
segment_summary = pd.DataFrame(segment_summary).sort_values("utility_per_customer", ascending=False)
display(segment_summary)

## 10. Error and Sensitivity Analysis
### 10.1. Sensitivity to k, metric, and scaling
Document how results change with k, p, and scaling type.

### 10.2. Dimensionality and noise
Explain the effect of irrelevant variables on KNN and why filtering or weighting helps.
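
The effect can be illustrated with purely random features: as dimensions are added, the nearest and farthest points from a query become almost equally distant, so "nearest" carries less information. The sketch below measures this relative contrast on uniform noise; it is a toy demonstration, not an analysis of the credit features.

```python
import numpy as np

def relative_contrast(n_dims, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from one query to a random sample."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

for n_dims in (2, 10, 100, 1000):
    print(f"{n_dims:>4} dimensions -> relative contrast {relative_contrast(n_dims):.2f}")
```

Dropping or down-weighting uninformative columns keeps the contrast between neighbors and non-neighbors meaningful, which is why feature filtering tends to help KNN more than it helps tree-based models.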

### 10.3. Business-critical errors
Identify the most financially costly FP/FN; propose complementary rules (e.g., exposure caps by segment).

In [None]:
cv_summary = (
    pd.DataFrame(grid.cv_results_)
    .groupby("param_knn__n_neighbors")[["mean_test_utility", "mean_test_roc_auc", "mean_test_pr_auc"]]
    .agg(['mean', 'std'])
)
cv_summary.columns = ['_'.join(col).strip() for col in cv_summary.columns]
cv_summary = cv_summary.reset_index().rename(columns={"param_knn__n_neighbors": "k"})

display(cv_summary)

plt.figure(figsize=(8, 4))
plt.plot(cv_summary["k"], cv_summary["mean_test_utility_mean"], marker="o")
plt.fill_between(
    cv_summary["k"],
    cv_summary["mean_test_utility_mean"] - cv_summary["mean_test_utility_std"],
    cv_summary["mean_test_utility_mean"] + cv_summary["mean_test_utility_std"],
    color="C0",
    alpha=0.2,
)
plt.title("Utility sensitivity vs k")
plt.xlabel("k (neighbors)")
plt.ylabel("Normalized utility")
plt.show()

## 11. Local Interpretability for KNN Decisions
### 11.1. Nearest neighbors
For one applicant, show key attributes and distances of the k neighbors. Explain the local reason for approval or rejection.

### 11.2. Traceability
Keep a record of consulted neighbors and key variables for auditability.

### 11.3. Limitations
KNN lacks global coefficients; explanations are **local** and neighborhood-dependent.

In [None]:
sample_idx = test_results["prob_default"].idxmax()
sample_x = X_test.loc[[sample_idx]]
sample_prob = test_results.loc[sample_idx, "prob_default"]
sample_true = test_results.loc[sample_idx, "y_true"]

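# Use the fitted preprocessing step from the tuned pipeline (avoids shadowing the unfitted `preprocessor`).
fitted_preprocessor = best_model.named_steps["preprocess"]
knn_estimator = best_model.named_steps["knn"]
X_train_transformed = fitted_preprocessor.transform(X_train)
sample_transformed = fitted_preprocessor.transform(sample_x)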

nn = NearestNeighbors(
    n_neighbors=min(5, knn_estimator.n_neighbors),
    metric=knn_estimator.metric,
    p=knn_estimator.p,
)
nn.fit(X_train_transformed)
distances, indices = nn.kneighbors(sample_transformed)

neighbor_records = (
    X_train.iloc[indices[0]]
    .assign(
        distance=distances[0],
        actual=y_train.iloc[indices[0]].values,
    )
)

print(f"Sample applicant idx {sample_idx} | true={sample_true} | p(default)={sample_prob:.3f}")
display(sample_x.assign(prob_default=sample_prob, actual=sample_true))
display(neighbor_records)

## 12. Risks, Fairness, and Compliance
### 12.1. Subgroup checks
Compare TPR/FPR/PPV across protected vs non-protected subgroups. Flag substantive differences.

### 12.2. Ethical and regulatory implications
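Avoid direct sensitive attributes and obvious proxies. The candidate feature set above includes `SEX`, `MARRIAGE`, `EDUCATION`, and `AGE`; where policy or regulation prohibits their use, remove them from the feature lists and re-run the search. Document data governance and decision governance.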

### 12.3. Monitoring plan
Drift alerts (data and performance); scheduled retraining; decision and human-override logs.
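
A data-drift alert can be built on a simple distribution distance such as the Population Stability Index (PSI). The sketch below is one possible implementation on synthetic samples; the function name, the bin count, and any alert cutoff are assumptions to be tuned. A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a material shift.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two samples of one feature, using quantile bins of the reference."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges from reference quantiles; the two outer bins are open-ended.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_share = np.bincount(np.searchsorted(edges, reference), minlength=n_bins) / len(reference)
    cur_share = np.bincount(np.searchsorted(edges, current), minlength=n_bins) / len(current)
    ref_share = np.clip(ref_share, eps, None)   # avoid log(0) for empty bins
    cur_share = np.clip(cur_share, eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)   # synthetic reference sample
stable_batch = rng.normal(0.0, 1.0, 2_000)     # same distribution
shifted_batch = rng.normal(0.7, 1.0, 2_000)    # simulated drift

psi_stable = population_stability_index(train_feature, stable_batch)
psi_shifted = population_stability_index(train_feature, shifted_batch)
print(f"PSI stable : {psi_stable:.3f}")
print(f"PSI shifted: {psi_shifted:.3f}")
```

The same function can be applied to the model's predicted probabilities to watch for score drift, alongside realized utility once outcomes are observed.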

In [None]:
def group_report(df, group_col):
    rows = []
    for value, subset in df.groupby(group_col):
        counts = confusion_counts(subset["y_true"], subset["y_pred"])
        tpr = counts["tp"] / max(counts["tp"] + counts["fn"], 1)
        fpr = counts["fp"] / max(counts["fp"] + counts["tn"], 1)
        ppv = counts["tp"] / max(counts["tp"] + counts["fp"], 1)
        util = cost_sensitive_utility(counts, normalize=True)
        rows.append({
            group_col: value,
            "customers": len(subset),
            "default_rate": subset["y_true"].mean(),
            "tpr": tpr,
            "fpr": fpr,
            "ppv": ppv,
            "utility_per_customer": util,
        })
    return pd.DataFrame(rows).sort_values("utility_per_customer", ascending=False)

print("Fairness/segment checks")
sex_report = group_report(test_results, "SEX")
education_report = group_report(test_results, "EDUCATION")
display(sex_report)
display(education_report)

## 13. Deployment Plan (MVP)
### 13.1. Export
Serialize the full pipeline, including preprocessing and scaling.
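
A minimal export sketch with `joblib` is shown here on a small stand-in pipeline trained on synthetic data. The file name and the stored threshold value are hypothetical; the point is that the fitted preprocessing, the KNN step, and the operating threshold travel together in one artifact.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline trained on synthetic data (not the tuned model).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
pipeline = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=11))])
pipeline.fit(X, y)

# Store the operating threshold next to the model so both are versioned together.
artifact = {"pipeline": pipeline, "threshold": 0.31}   # threshold value is illustrative

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "knn_credit_artifact.joblib"    # hypothetical file name
    joblib.dump(artifact, path)
    restored = joblib.load(path)

# Round-trip check: the restored pipeline must reproduce the original scores.
same = np.array_equal(
    pipeline.predict_proba(X)[:, 1], restored["pipeline"].predict_proba(X)[:, 1]
)
print("round-trip predictions identical:", same)
```

Because KNN stores its training data, the artifact grows with the training set. Load such files only from trusted sources and with the same library versions used for export.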

### 13.2. Operational requirements
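KNN defers all computation to prediction time, so latency grows with the size of the training set. Keep inference latency within the service budget, using tree indexes (`algorithm="kd_tree"` or `"ball_tree"`) or precomputed neighborhoods if needed. Define fallback policies if the service fails.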

### 13.3. Updates and retraining
Retraining cadence and triggers for drift or utility degradation.

## 14. Conclusions and Next Steps
### 14.1. Key findings
With proper scaling and a utility-optimized threshold, KNN can improve utility vs simple rules and baselines.

### 14.2. Improvement roadmap
- Features: more recent behavioral variables, robust aggregations.  
- Model: benchmark against more scalable methods (regularized logistic, trees, gradient boosting).  
- Business: refine costs/benefits with actual LGD/recovery data.

## Appendix A. Data Dictionary (operational summary)
- LIMIT_BAL: assigned credit limit.  
- SEX, EDUCATION, MARRIAGE, AGE: demographic characteristics.  
- PAY_0, PAY_2 ... PAY_6: monthly payment status (ordinal, indicates delays; the source file has no PAY_1).  
- BILL_AMT1 ... BILL_AMT6: monthly billed amounts.  
- PAY_AMT1 ... PAY_AMT6: monthly paid amounts.  
- default_payment_next_month: binary target (1=default).  
Note: use only information available at decision time; align feature time windows with the target horizon to prevent leakage.

## Appendix B. Reproducibility Checklist
- Fixed and recorded random seeds.  
- Documented stratified splits.  
- Pipeline with preprocessing inside CV.  
- Library and dataset versions.  
- Final hyperparameters and operating threshold $\tau^*$ saved.  
- Scripts/notebook with run instructions and artifact signatures.

## Appendix C. Business Scenario Definitions (if no official inputs)
To run threshold optimization without official costs:  
- Conservative scenario: FP very costly (high LGD), FN moderate (opportunity cost).  
- Balanced scenario: FP and FN similar magnitude; maximize global utility.  
- Growth-aggressive scenario: FN costly (missed growth), FP moderate; control losses via exposure caps.  
Report results per scenario and select the one meeting risk constraints.
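
The three scenarios can be encoded as alternative cost matrices and compared by their break-even default probability (approve when the estimated probability is at or below it). All numbers below are hypothetical placeholders chosen only to respect the qualitative descriptions above.

```python
# Hypothetical cost/benefit values per scenario (placeholders, not business inputs).
SCENARIOS = {
    "conservative": {"tp_benefit": 1600, "fp_cost": 9000, "fn_cost": 1200, "tn_benefit": 200},
    "balanced": {"tp_benefit": 1600, "fp_cost": 3000, "fn_cost": 2800, "tn_benefit": 200},
    "growth_aggressive": {"tp_benefit": 1600, "fp_cost": 3000, "fn_cost": 6000, "tn_benefit": 200},
}

def breakeven_threshold(m):
    """Default probability at which approving and rejecting have equal expected utility."""
    gain_if_good = m["tp_benefit"] + m["fn_cost"]
    loss_if_bad = m["fp_cost"] + m["tn_benefit"]
    return gain_if_good / (gain_if_good + loss_if_bad)

taus = {name: breakeven_threshold(m) for name, m in SCENARIOS.items()}
for name, tau in taus.items():
    print(f"{name:<18} approve if p(default) <= {tau:.3f}")
```

The ordering is the useful part: the costlier a bad approval is relative to a missed good customer, the lower the default probability the bank can tolerate. Rerunning the empirical threshold search with each matrix shows how much of that ordering survives imperfect calibration.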