# Homework: Week 2 - The Credit Default Challenge

## Dataset:
We will use the **Default** dataset. It contains data on credit card default, including student status, bank balance, and annual income.

**URL:** https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Default.xlsx

## Part 1: EDA & Balance
1.  **Load Data:** Load the XLSX. Encode 'default' and 'student' to 0/1.
2.  **The Imbalance:** Calculate the percentage of defaulters. Why would a "dumb" model that predicts 'No Default' for everyone have high accuracy?
3.  **Visualization:** Boxplot the 'balance' for Defaulters vs Non-Defaulters. Do defaulters tend to carry higher balances?

## Part 2: Logistic Regression & Interpretability
1.  **Split:** Train/Test split (70/30).
2.  **Fit:** Train a standard Logistic Regression.
3.  **Coefficients:** Print the model coefficients. Which variable (Balance or Income) creates more risk? Does being a Student increase or decrease risk (according to the model)?
4.  **Feature Selection (Lasso):** Fit a Logistic Regression with `penalty='l1'` and `solver='liblinear'`. Try a small `C` (e.g., 0.01). Did any coefficients drop to zero? What does this imply?

## Part 3: The Metric Battle (LogReg vs KNN)
1.  **Fit KNN:** Train a KNN model (try k=9).
2.  **Comprehensive Metrics:** Plot the confusion matrix for both models side-by-side. Instead of just Accuracy, print a DataFrame comparing Recall, Precision, F1-Score, and ROC-AUC Score for both models (focusing on the "Default=1" class).
3.  **The Trade-off Analysis:**
- **Recall:** Which model is safer (misses fewer defaults)?
- **Precision:** Which model is more efficient (falsely accuses fewer good customers)?
- **F1-Score:** Which model provides the best balance?
4. **Visualizing Performance:** Plot the ROC Curve for both models on the same graph. Which curve is closer to the top-left corner?
5. **The Verdict:** As a Risk Manager, you must choose one model to deploy.

- **Scenario A:** Your bank is conservative and fears losing principal (Needs high Recall). Which model do you pick?

- **Scenario B:** Your bank wants to grow and fears rejecting good customers (Needs high Precision). Which model do you pick?

- **Final Decision:** Considering Explainability (Part 2) and Performance (Part 3), which model is the most realistic choice for a regulated financial institution?
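
For reference, the metrics named in Part 3 can all be derived from the four cells of a confusion matrix. The sketch below uses made-up counts (they are not results from this dataset, and the variable names are illustrative only) to show how a model can score high accuracy while still having low recall on the "Default=1" class.

```python
# Hypothetical confusion-matrix counts for the "Default=1" class.
# These numbers are invented for illustration, not taken from the homework data.
tp = 30    # defaulters correctly flagged
fp = 10    # good customers wrongly flagged
fn = 70    # defaulters the model missed
tn = 2890  # good customers correctly cleared

total = tp + fp + fn + tn

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # of the flagged customers, how many really default
recall = tp / (tp + fn)        # of the real defaulters, how many were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")
```

With these counts, accuracy is above 97% even though the model misses 70 of the 100 defaulters, which is why the scenarios above ask about Recall and Precision separately.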


### 1 - Import and Inspect

In [2]:
import pandas as pd

url = "https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Default.xlsx"

df = pd.read_excel(url)

print(df.head())
print(df.info())
print(df.describe())


   Unnamed: 0 default student      balance        income
0           1      No      No   729.526495  44361.625074
1           2      No     Yes   817.180407  12106.134700
2           3      No      No  1073.549164  31767.138947
3           4      No      No   529.250605  35704.493935
4           5      No      No   785.655883  38463.495879
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  10000 non-null  int64  
 1   default     10000 non-null  object 
 2   student     10000 non-null  object 
 3   balance     10000 non-null  float64
 4   income      10000 non-null  float64
dtypes: float64(2), int64(1), object(2)
memory usage: 390.8+ KB
None
        Unnamed: 0       balance        income
count  10000.00000  10000.000000  10000.000000
mean    5000.50000    835.374886  33516.981876
std     2886.89568    483.714985  13336.639563
min    

### 1.1 - Encode the `default` and `student` columns as 0/1

In [4]:
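# Map the Yes/No labels to 1/0; map() sidesteps the downcasting FutureWarning
# that recent pandas versions emit when replace() changes a column's dtype
yes_no = {'Yes': 1, 'No': 0}
df['default'] = df['default'].map(yes_no)
df['student'] = df['student'].map(yes_no)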
print(df.head())

   Unnamed: 0  default  student      balance        income
0           1        0        0   729.526495  44361.625074
1           2        0        1   817.180407  12106.134700
2           3        0        0  1073.549164  31767.138947
3           4        0        0   529.250605  35704.493935
4           5        0        0   785.655883  38463.495879




### 1.2 - The Imbalance

**Q:** Why would a "dumb" model that predicts 'No Default' for everyone have high accuracy?

**A:** Only 3.33% of the observations are defaults, so a model that predicts 'No Default' for every customer is correct 96.67% of the time. That accuracy looks remarkably high, but the model never identifies a single defaulter; the score reflects the heavy class imbalance in the dataset, not any predictive ability.

In [7]:
pct_default = df['default'].sum() / len(df)
print('Percentage of default:', f"{pct_default*100:.2f}","%")

Percentage of default: 3.33 %
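
The answer above can be checked with a small, self-contained sketch. It does not use the real DataFrame: the labels are rebuilt from the reported figures, assuming 333 defaults out of 10,000 rows (3.33%). The always-"No Default" predictor is then scored on both accuracy and recall.

```python
# Labels reconstructed from the reported figures (assumption: 333 of 10,000 rows default).
n_rows = 10_000
n_defaults = 333
y_true = [1] * n_defaults + [0] * (n_rows - n_defaults)

# The "dumb" model: predict 0 (No Default) for every customer.
y_pred = [0] * n_rows

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_rows

# Recall on the default class: share of real defaulters that were flagged.
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / n_defaults

print(f"Accuracy: {accuracy:.2%}")
print(f"Recall on defaulters: {recall:.2%}")
```

The accuracy matches the 96.67% quoted in the answer, while recall on the default class is zero, which is the reason accuracy alone is a poor yardstick for this dataset.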
