# KNN for Credit Decisions and Profit Maximization

## 0. Executive Summary (objective + deliverables)
**Objective.** Train and validate a supervised KNN classifier to estimate applicants' default probability and make approve/decline decisions that **maximize expected profit** under a business cost/benefit matrix.  
**Deliverables.** 1) KNN model encapsulated in a reproducible pipeline, 2) operating threshold optimized for utility, 3) holdout test evaluation with technical metrics and **business utility**, 4) deployment and monitoring guidelines.  
**Success criterion.** Beat baselines (approve all / approve none) in net utility; keep risk metrics aligned with policy (e.g., minimum TPR on a priority segment).  
**Key assumptions.** Use case: credit decision for existing customers (historical information available) with target "default next period." Costs and benefits are provided by the business (or scenario-based here).  
**Limitations.** KNN is sensitive to feature scaling and dimensionality, and its inference latency grows with the size of the training set; these are mitigated via preprocessing, k selection, and production controls.

## 1. Context and Sources
### 1.1. Problem statement
At application time, the bank wants a system that **predicts default risk** and decides approve/decline to maximize financial utility. The decision depends on: estimated risk, operating threshold, and the cost/benefit matrix.

### 1.2. Dataset and data dictionary
"Default of Credit Card Clients" (~30k rows). Demographics, credit limit, recent payment history, billed amounts, and previous payments. Binary label: `default_payment_next_month` (1=default, 0=no default).  
Variable groups:  
- Demographics/profile: `LIMIT_BAL`, `SEX`, `EDUCATION`, `MARRIAGE`, `AGE`.  
- Payment history: `PAY_0` ... `PAY_6` (recent monthly repayment status; the public UCI file names these `PAY_0`, `PAY_2`, ..., `PAY_6`, with no `PAY_1`).  
- Billed amounts: `BILL_AMT1` ... `BILL_AMT6`.  
- Paid amounts: `PAY_AMT1` ... `PAY_AMT6`.  
- Target: `default_payment_next_month`.

### 1.3. Link to the course
Apply Similarity and Neighbors (KNN) concepts: effect of scaling, distance metric choice, k selection, probability calibration, and decision thresholds.

## 2. Business Metric Design
### 2.1. Cost/benefit matrix (TP, FP, TN, FN)
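Definitions (here "positive" means the approve decision, i.e. predicting a good customer; this is the reverse of the dataset label, where 1 = default):  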
- **TP**: approve a non-defaulter → benefit = expected margin (interest − cost of funds − provisions) minus operating costs.  
- **FP**: approve a defaulter → cost = expected loss (principal × LGD, where LGD is already net of recoveries) + expenses.  
- **TN**: reject a defaulter → benefit/cost ≈ 0, or benefit from avoided losses.  
- **FN**: reject a good customer → opportunity cost (foregone margin).

### 2.2. Expected value function and optimal decision threshold
For estimated default probability $\hat{p}$, expected **utility** of approval is:
$$
U(\text{approve}) = (1-\hat{p}) \cdot B_{\text{TP}} - \hat{p} \cdot C_{\text{FP}}
$$
and of **reject**:
$$
U(\text{reject}) = (1-\hat{p}) \cdot (-C_{\text{FN}}) + \hat{p} \cdot B_{\text{TN}}
$$
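Approve if $U(\text{approve}) \ge U(\text{reject})$. With all four quantities expressed as non-negative magnitudes, rearranging gives the rule "approve when $\hat{p} \le \tau^*$", where the **optimal threshold** is
$$
\tau^* = \frac{B_{\text{TP}} + C_{\text{FN}}}{B_{\text{TP}} + C_{\text{FN}} + C_{\text{FP}} + B_{\text{TN}}}
$$
This threshold is not necessarily 0.5 and depends on the cost/benefit matrix.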
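
The decision rule can be sketched in code by solving U(approve) = U(reject) for the default probability. This is a minimal sketch: the `CostMatrix` container and the numeric values are illustrative assumptions, not business inputs, and all four quantities are assumed to be non-negative magnitudes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostMatrix:
    """Per-applicant magnitudes, all >= 0 (hypothetical container)."""
    b_tp: float  # margin earned when a good customer is approved
    c_fp: float  # loss when a defaulter is approved
    b_tn: float  # benefit when a defaulter is rejected (often ~0)
    c_fn: float  # opportunity cost when a good customer is rejected


def utility_approve(p_default: float, m: CostMatrix) -> float:
    """Expected utility of approving an applicant with default probability p."""
    return (1.0 - p_default) * m.b_tp - p_default * m.c_fp


def utility_reject(p_default: float, m: CostMatrix) -> float:
    """Expected utility of rejecting the same applicant."""
    return (1.0 - p_default) * (-m.c_fn) + p_default * m.b_tn


def optimal_threshold(m: CostMatrix) -> float:
    """Largest p_default at which approving is still at least as good as rejecting."""
    good_side = m.b_tp + m.c_fn  # what is at stake if the applicant repays
    bad_side = m.c_fp + m.b_tn   # what is at stake if the applicant defaults
    return good_side / (good_side + bad_side)


# Illustrative numbers only (assumed, not business inputs).
matrix = CostMatrix(b_tp=100.0, c_fp=400.0, b_tn=0.0, c_fn=100.0)
tau_star = optimal_threshold(matrix)
decision = "approve" if 0.20 <= tau_star else "reject"  # applicant with p_default = 0.20
```

With these placeholder values the cutoff sits well below 0.5, which shows how a costly FP pushes the policy toward rejecting more applicants.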

### 2.3. Reference baselines
- **Approve all**: utility if nobody is rejected.  
- **Reject all**: utility if nobody is approved.  
The model must **outperform** both in utility.
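
The two baselines can be computed with the same utility accounting used for the model. The sketch below uses toy labels and assumed cost values; `total_utility` is a hypothetical helper name.

```python
def total_utility(y_default, approve, b_tp, c_fp, b_tn, c_fn):
    """Sum realized utility over applicants.

    y_default: 1 = defaulted, 0 = repaid. approve: True/False decision.
    """
    total = 0.0
    for defaulted, approved in zip(y_default, approve):
        if approved:
            total += -c_fp if defaulted else b_tp
        else:
            total += b_tn if defaulted else -c_fn
    return total


# Toy labels (20% default) and assumed costs, for illustration only.
y = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
costs = dict(b_tp=100.0, c_fp=400.0, b_tn=0.0, c_fn=100.0)

approve_all = total_utility(y, [True] * len(y), **costs)
reject_all = total_utility(y, [False] * len(y), **costs)
baseline_to_beat = max(approve_all, reject_all)
```

Any model-driven policy is then scored with the same function, and its total is compared against `baseline_to_beat`.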

## 3. Data Understanding (targeted EDA)
### 3.1. Target distribution
Defaults are the minority class (roughly 22% of rows in the public version of this dataset). Document the observed prevalence and its implications for metrics: accuracy can mislead, so prefer ROC-AUC/PR-AUC and utility.

### 3.2. Variable types
Identify: continuous numeric (limits, amounts), ordinal (payment status), encoded categorical (sex, education, marital status). Justify treatment given the distance metric.

### 3.3. Missing values and outliers
State policies: minimal imputation if needed; winsorization or robust scaling to reduce outlier impact on distances.

### 3.4. Correlations and scales
KNN requires **scaling** so large-range variables don't dominate distance. Document StandardScaler/RobustScaler choice and rationale.
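
A small worked example makes the scaling issue concrete. The rows below are made-up (`LIMIT_BAL`, `AGE`) pairs; the sketch standardizes by hand rather than with a library scaler so that the arithmetic is visible.

```python
import math
from statistics import mean, pstdev

# Toy rows: (LIMIT_BAL, AGE). Values are invented for illustration.
rows = [
    (20000.0, 25.0),   # query applicant
    (21000.0, 60.0),   # similar limit, very different age
    (25000.0, 26.0),   # slightly higher limit, same age band
    (100000.0, 40.0),  # high-limit customer
]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(data):
    """Column-wise z-scores (mean 0, population std 1)."""
    cols = list(zip(*data))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sds)) for r in data]


candidates = (1, 2, 3)
raw_nearest = min(candidates, key=lambda j: euclidean(rows[0], rows[j]))

scaled = standardize(rows)
scaled_nearest = min(candidates, key=lambda j: euclidean(scaled[0], scaled[j]))
```

On raw values the credit limit dominates and the 60-year-old is the nearest neighbor; after standardization the 26-year-old becomes nearest, because age now contributes on a comparable scale.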

## 4. Avoid Leakage and Define the Feature Set
### 4.1. Inclusion/exclusion criteria
Use only variables **available at decision time**. Exclude any that anticipate the outcome beyond the defined horizon.

### 4.2. Transformations/encoding
- Categoricals: consistent encoding for KNN (one-hot if needed).  
- Ordinal variables: preserve order when meaningful (payment status).  
- Amounts: consider log or robust transforms for heavy tails.
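
For the heavy-tailed amounts, one option is a signed log transform. This is a sketch under the assumption that some amounts may be negative (for example, a billed amount that reflects a credit balance), in which case a plain logarithm would fail; `signed_log1p` is a hypothetical helper name.

```python
import math


def signed_log1p(amount: float) -> float:
    """Compress heavy tails while keeping the sign: sign(x) * log(1 + |x|)."""
    return math.copysign(math.log1p(abs(amount)), amount)


# Invented amounts spanning several orders of magnitude.
amounts = [-1500.0, 0.0, 200.0, 50000.0, 900000.0]
compressed = [signed_log1p(a) for a in amounts]
```

The transform is monotonic, so the ordering of customers is preserved while extreme values are pulled much closer to the bulk of the data before scaling.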

### 4.3. Candidate feature list
Demographics/profile + recent payment history + billed and paid amounts prior to the decision cutoff. Document exclusions due to leakage or low signal.

## 5. Data Split and Validation Protocol
### 5.1. Train/validation/test
Stratified splitting to preserve target prevalence. Hold out test for final evaluation.
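
A stratified 60/20/20 split can be sketched with scikit-learn's `train_test_split`. The data here is synthetic and the split proportions are an assumption; the source does not fix them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and target (about 22% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.22).astype(int)

# First carve out the holdout test set, then split the rest into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

prevalence = {
    "all": y.mean(),
    "train": y_train.mean(),
    "val": y_val.mean(),
    "test": y_test.mean(),
}
```

The `prevalence` dictionary gives a quick check that each partition carries roughly the same default rate as the full dataset.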

### 5.2. Validation metrics
Report ROC-AUC and PR-AUC for general performance; **optimize and report expected utility** per the business matrix.

### 5.3. Repeats/seed for stability
Use multiple seeds or stratified CV; report mean and dispersion for robustness.
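
One way to obtain mean and dispersion is repeated stratified cross-validation. The sketch below uses synthetic data and an arbitrary k of 15; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the credit data.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.78, 0.22], random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)

summary = {"mean": scores.mean(), "std": scores.std(ddof=1), "n_runs": len(scores)}
```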

## 6. Preprocessing and Pipeline
### 6.1. Scaling
Include scaling inside the pipeline to avoid information leakage across splits.

### 6.2. Encoding
Apply categorical encoding inside the pipeline, consistent across train and test.

### 6.3. Pipeline structure
[coherent preprocessing] -> [scaling] -> [`KNeighborsClassifier`]. Document decisions.
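
The structure above might look as follows in scikit-learn. This is a sketch on synthetic data: the column positions, the category codes, and k = 15 are assumptions, and the real project would select columns by name.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic columns: 0 = limit-like amount, 1 = age-like, 2 = categorical code.
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.lognormal(mean=11.0, sigma=0.8, size=n),
    rng.integers(21, 70, size=n),
    rng.integers(1, 5, size=n),
])
y = (rng.random(n) < 0.22).astype(int)

numeric_cols = [0, 1]
categorical_cols = [2]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

knn_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=15)),
])

knn_pipeline.fit(X, y)
proba = knn_pipeline.predict_proba(X[:5])  # columns follow knn_pipeline.classes_
```

Because every transformation is a pipeline step, fitting on a training fold never sees statistics from validation or test rows.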

## 7. KNN Modeling (supervised classification)
### 7.1. Hyperparameters to tune
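- n_neighbors (k): small k gives a flexible, high-variance boundary; large k smooths the boundary at the cost of higher bias.  
- weights: uniform vs distance (closer neighbors count more).  
- p: Minkowski power parameter, 1 (Manhattan) vs 2 (Euclidean).  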
- metric, leaf_size as needed.

### 7.2. Hyperparameter search
Grid/Random search with stratified CV. Primary optimization: **utility**; secondary metrics: ROC-AUC/PR-AUC.
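
A utility-driven search can be sketched by passing a custom scorer callable to `GridSearchCV`. The cost values, the grid, and the synthetic data are assumptions; the scorer uses the `(estimator, X, y)` callable form that scikit-learn accepts for `scoring`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed per-applicant magnitudes (illustrative only).
B_TP, C_FP, B_TN, C_FN = 100.0, 400.0, 0.0, 100.0
TAU = (B_TP + C_FN) / (B_TP + C_FN + C_FP + B_TN)  # approve when p_default <= TAU


def utility_scorer(estimator, X, y):
    """Mean utility per applicant; y uses 1 = default, 0 = repaid."""
    y = np.asarray(y)
    p_default = estimator.predict_proba(X)[:, 1]
    approve = p_default <= TAU
    gain = np.where(
        approve,
        np.where(y == 1, -C_FP, B_TP),
        np.where(y == 1, B_TN, -C_FN),
    )
    return gain.mean()


X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.78, 0.22], random_state=0
)
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {
    "knn__n_neighbors": [5, 15, 31],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}
search = GridSearchCV(
    pipe,
    param_grid,
    scoring={"utility": utility_scorer, "roc_auc": "roc_auc"},
    refit="utility",  # primary criterion; ROC-AUC is reported alongside
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
best = search.best_params_
```

Here the threshold is fixed from the cost matrix during the search; an alternative design is to tune the threshold inside each fold, at the cost of a more complex scorer.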

### 7.3. Performance vs k curve
Document bias-variance trade-off and how utility changes with k.

## 8. Probabilities, Calibration, and Utility-Based Threshold
### 8.1. Probability prediction and calibration
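KNN probabilities are the fraction of positive labels among the k neighbors (distance-weighted if `weights="distance"`), so with uniform weights they take at most k+1 distinct values. Assess **calibration** and, if necessary, apply calibration (Platt/Isotonic) fitted on a validation set.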
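
Isotonic calibration on a separate validation set can be sketched as below. The data is synthetic and the split sizes are assumptions; the calibrator is fitted only on validation scores and then applied to test scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=1500, n_features=8, weights=[0.78, 0.22], random_state=0
)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
knn.fit(X_train, y_train)

p_val = knn.predict_proba(X_val)[:, 1]
p_test_raw = knn.predict_proba(X_test)[:, 1]

# Monotone map from raw neighbor fractions to calibrated probabilities.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(p_val, y_val)
p_test_cal = calibrator.predict(p_test_raw)

# Lower Brier score indicates better-calibrated, sharper probabilities.
brier_raw = brier_score_loss(y_test, p_test_raw)
brier_cal = brier_score_loss(y_test, p_test_cal)
```

Comparing `brier_raw` and `brier_cal` on the test set indicates whether the calibration step is worth keeping; improvement is not guaranteed on small validation sets.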

### 8.2. Utility vs threshold curve
Construct the utility curve by varying the decision threshold. Explain how the maximum is selected.

### 8.3. Operating threshold selection
Choose $\tau^*$ that maximizes utility. If business imposes constraints (e.g., minimum TPR), select the best threshold that satisfies them.
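
The utility curve and the constrained choice can be sketched with NumPy. The scores and labels are simulated, the cost values are placeholders, and the constraint (approve at least 80% of good customers) is an assumed example of a business rule.

```python
import numpy as np


def utility_curve(p_default, y_default, taus, b_tp, c_fp, b_tn, c_fn):
    """Total utility and good-customer approval rate for each threshold."""
    p = np.asarray(p_default)
    y = np.asarray(y_default)
    utilities, good_approval = [], []
    for tau in taus:
        approve = p <= tau  # approve when estimated default risk is low enough
        gain = np.where(
            approve,
            np.where(y == 1, -c_fp, b_tp),
            np.where(y == 1, b_tn, -c_fn),
        )
        utilities.append(gain.sum())
        good_approval.append(approve[y == 0].mean())
    return np.array(utilities), np.array(good_approval)


# Simulated validation scores: defaulters tend to receive higher p_default.
rng = np.random.default_rng(0)
y_val = (rng.random(2000) < 0.22).astype(int)
p_val = np.clip(0.2 + 0.4 * y_val + rng.normal(0.0, 0.15, size=2000), 0.0, 1.0)

taus = np.linspace(0.0, 1.0, 101)
utilities, good_approval = utility_curve(
    p_val, y_val, taus, b_tp=100.0, c_fp=400.0, b_tn=0.0, c_fn=100.0
)

# Unconstrained optimum: the threshold with the highest total utility.
tau_unconstrained = taus[np.argmax(utilities)]

# Constrained optimum: best utility among thresholds meeting the assumed rule.
feasible = good_approval >= 0.80
tau_constrained = taus[feasible][np.argmax(utilities[feasible])]
```

Plotting `utilities` against `taus` gives the curve; the threshold should be chosen on validation data and then frozen before the test evaluation.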

## 9. Final Evaluation and Baselines
### 9.1. Primary metrics
ROC-AUC, PR-AUC, confusion matrix on test, and **total utility** vs baselines.

### 9.2. Utility by segments
Analyze utility by relevant subgroups (limits, tenure, internal score) to confirm consistency of benefit.

### 9.3. Stability
Variation across folds/seeds. Note if the model is sensitive to small perturbations.

## 10. Error and Sensitivity Analysis
### 10.1. Sensitivity to k, metric, and scaling
Document how results change with k, p, and scaling type.

### 10.2. Dimensionality and noise
Explain the effect of irrelevant variables on KNN and why filtering or weighting helps.

### 10.3. Business-critical errors
Identify the most financially costly FP/FN; propose complementary rules (e.g., exposure caps by segment).

## 11. Local Interpretability for KNN Decisions
### 11.1. Nearest neighbors
For one applicant, show key attributes and distances of the k neighbors. Explain the local reason for approval or rejection.
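
Retrieving the neighbors behind a single decision can be sketched with `kneighbors`. The data is synthetic, and the `audit_record` layout is a hypothetical format for logging, not a prescribed one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=400, n_features=6, weights=[0.78, 0.22], random_state=1
)
X_train, y_train = X[:-1], y[:-1]
applicant = X[-1:].copy()  # one row, kept 2-D for scikit-learn

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)

# Distances must be computed in the same scaled space the model uses.
scaler = pipe.named_steps["scale"]
knn = pipe.named_steps["knn"]
distances, indices = knn.kneighbors(scaler.transform(applicant))

neighbor_ids = indices[0]
neighbor_labels = y_train[neighbor_ids]  # 1 = neighbor defaulted

audit_record = {
    "neighbor_ids": neighbor_ids.tolist(),
    "distances": distances[0].round(4).tolist(),
    "neighbor_defaulted": neighbor_labels.tolist(),
    "p_default": float(neighbor_labels.mean()),  # uniform weights
}
```

With uniform weights the estimated default probability is simply the share of defaulting neighbors, so the record fully explains the score for this applicant.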

### 11.2. Traceability
Keep a record of consulted neighbors and key variables for auditability.

### 11.3. Limitations
KNN lacks global coefficients; explanations are **local** and neighborhood-dependent.

## 12. Risks, Fairness, and Compliance
### 12.1. Subgroup checks
Compare TPR/FPR/PPV across protected vs non-protected subgroups. Flag substantive differences.
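
A per-group comparison can be sketched in plain Python. Following the convention of section 2.1, "positive" here means approval, so TPR is the share of good customers approved, FPR the share of defaulters approved, and PPV the share of approved customers who repay. Group names and records are invented.

```python
def approval_rates(y_default, approve):
    """TPR/FPR/PPV with 'positive' = approve and y_default 1 = default."""
    good_approved = sum(1 for d, a in zip(y_default, approve) if a and d == 0)
    bad_approved = sum(1 for d, a in zip(y_default, approve) if a and d == 1)
    n_good = sum(1 for d in y_default if d == 0)
    n_bad = len(y_default) - n_good
    n_approved = good_approved + bad_approved
    nan = float("nan")
    return {
        "TPR": good_approved / n_good if n_good else nan,
        "FPR": bad_approved / n_bad if n_bad else nan,
        "PPV": good_approved / n_approved if n_approved else nan,
    }


# Invented decisions for two subgroups.
groups = {
    "A": {"y_default": [0, 0, 0, 1, 1], "approve": [True, True, False, True, False]},
    "B": {"y_default": [0, 0, 1, 1], "approve": [True, True, False, False]},
}
report = {name: approval_rates(g["y_default"], g["approve"]) for name, g in groups.items()}

# Absolute gap per metric between the two groups.
gaps = {m: abs(report["A"][m] - report["B"][m]) for m in ("TPR", "FPR", "PPV")}
```

What counts as a substantive gap is a policy decision; the code only surfaces the differences.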

### 12.2. Ethical and regulatory implications
Avoid direct sensitive attributes and obvious proxies. Document data governance and decision governance.

### 12.3. Monitoring plan
Drift alerts (data and performance); scheduled retraining; decision and human-override logs.
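
For data drift, one commonly used statistic is the Population Stability Index (PSI). The source does not name a specific drift metric, so this is an assumed choice; the sketch bins a feature by reference quantiles and compares bin shares on simulated data.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one feature."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior bin edges from reference quantiles; outer bins are open-ended.
    inner_edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]
    ref_bins = np.searchsorted(inner_edges, reference, side="right")
    cur_bins = np.searchsorted(inner_edges, current, side="right")

    ref_share = np.bincount(ref_bins, minlength=n_bins) / reference.size
    cur_share = np.bincount(cur_bins, minlength=n_bins) / current.size

    # Avoid log(0) and division by zero for empty bins.
    ref_share = np.clip(ref_share, eps, None)
    cur_share = np.clip(cur_share, eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))


rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, size=5000)
stable_batch = rng.normal(0.0, 1.0, size=5000)
shifted_batch = rng.normal(1.0, 1.0, size=5000)  # simulated drift

psi_stable = population_stability_index(training_feature, stable_batch)
psi_shifted = population_stability_index(training_feature, shifted_batch)
```

Alert thresholds on PSI should be agreed with the risk team; for KNN, drift also matters because the stored training rows are the model itself.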

## 13. Deployment Plan (MVP)
### 13.1. Export
Serialize the full pipeline, including preprocessing and scaling.
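
Serialization might be done with `joblib`, bundling the fitted pipeline together with the operating threshold. The file name, the artifact layout, and the threshold value are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=300, n_features=6, weights=[0.78, 0.22], random_state=0
)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=15)),
]).fit(X, y)

# Keep the model and its decision threshold in one artifact.
artifact = {"pipeline": pipe, "tau_star": 1.0 / 3.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "knn_credit.joblib")  # hypothetical file name
    joblib.dump(artifact, path)
    restored = joblib.load(path)

# Round-trip check: the restored pipeline must reproduce the same scores.
same_scores = np.array_equal(
    pipe.predict_proba(X[:10]),
    restored["pipeline"].predict_proba(X[:10]),
)
```

Note that a KNN artifact contains the training rows, so its size and its data-protection handling differ from those of a parametric model.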

### 13.2. Operational requirements
Keep inference latency acceptable, using spatial indexes (KD-tree or ball tree) or precomputed neighborhoods if needed. Define fallback policies for when the service fails.

### 13.3. Updates and retraining
Retraining cadence and triggers for drift or utility degradation.

## 14. Conclusions and Next Steps
### 14.1. Key findings
With proper scaling and a utility-optimized threshold, KNN can improve utility vs simple rules and baselines.

### 14.2. Improvement roadmap
- Features: more recent behavioral variables, robust aggregations.  
- Model: benchmark against more scalable methods (regularized logistic, trees, gradient boosting).  
- Business: refine costs/benefits with actual LGD/recovery data.

## Appendix A. Data Dictionary (operational summary)
- LIMIT_BAL: assigned credit limit.  
- SEX, EDUCATION, MARRIAGE, AGE: demographic characteristics.  
- PAY_0 ... PAY_6: monthly payment status (ordinal, indicates delays); the public UCI file names these PAY_0, PAY_2, ..., PAY_6, with no PAY_1.  
- BILL_AMT1 ... BILL_AMT6: monthly billed amounts.  
- PAY_AMT1 ... PAY_AMT6: monthly paid amounts.  
- default_payment_next_month: binary target (1=default).  
Note: use only information available at decision time; align feature time windows with the target horizon to prevent leakage.

## Appendix B. Reproducibility Checklist
- Fixed and recorded random seeds.  
- Documented stratified splits.  
- Pipeline with preprocessing inside CV.  
- Library and dataset versions.  
- Final hyperparameters and operating threshold $\tau^*$ saved.  
- Scripts/notebook with run instructions and artifact signatures.

## Appendix C. Business Scenario Definitions (if no official inputs)
To run threshold optimization without official costs:  
- Conservative scenario: FP very costly (high LGD), FN moderate (opportunity cost).  
- Balanced scenario: FP and FN similar magnitude; maximize global utility.  
- Growth-aggressive scenario: FN costly (missed growth), FP moderate; control losses via exposure caps.  
Report results per scenario and select the one meeting risk constraints.
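
The three scenarios can be turned into thresholds by solving U(approve) = U(reject) for the default probability. All magnitudes below are invented placeholders chosen only to reflect the qualitative description of each scenario.

```python
def optimal_threshold(b_tp, c_fp, b_tn, c_fn):
    """Approve when p_default <= (b_tp + c_fn) / (b_tp + c_fn + c_fp + b_tn)."""
    return (b_tp + c_fn) / (b_tp + c_fn + c_fp + b_tn)


# Placeholder magnitudes per applicant (assumed, not official inputs).
scenarios = {
    "conservative": dict(b_tp=100.0, c_fp=900.0, b_tn=0.0, c_fn=100.0),
    "balanced": dict(b_tp=100.0, c_fp=150.0, b_tn=0.0, c_fn=150.0),
    "growth_aggressive": dict(b_tp=100.0, c_fp=150.0, b_tn=0.0, c_fn=300.0),
}

thresholds = {name: optimal_threshold(**costs) for name, costs in scenarios.items()}
```

The conservative scenario yields the lowest cutoff (fewest approvals) and the growth-aggressive one the highest, which is the expected ordering when FP costs fall relative to FN costs.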