# Credit Default Prediction with XGBoost and Hyperparameter Tuning

This notebook covers a full pipeline:
- Load the Lending Club dataset from Kaggle
- Data preprocessing
- Model selection: XGBoost for binary classification
- Bayesian Optimization for hyperparameter tuning
- Model evaluation using accuracy, precision, recall, F1-score, AUC-ROC
- Confusion matrix visualization

**Note:** The data is downloaded from Kaggle with the `kagglehub` package, so no local `data/` folder is needed.

In [None]:
# Install required packages (if running in a fresh environment)
!pip install xgboost scikit-learn bayesian-optimization kagglehub matplotlib seaborn --quiet

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix)
from xgboost import XGBClassifier
from bayes_opt import BayesianOptimization
import kagglehub

In [None]:
# Download the Lending Club dataset from Kaggle using kagglehub
import os

path = kagglehub.dataset_download("wordsforthewise/lending-club")
print("Dataset files downloaded at:", path)
print("Files:", os.listdir(path))

# Load the credit data CSV. The filename below is an example;
# adjust it to match one of the files listed above.
file_path = os.path.join(path, "lending_club_loan_two.csv")
df = pd.read_csv(file_path, low_memory=False)
print(f"Loaded dataset with {df.shape[0]} rows and {df.shape[1]} columns")

## Data Overview and Preprocessing
- Check for missing values
- Select relevant features
- Encode categorical variables
- Define target variable (default or not)
- Split data into train/test sets

In [None]:
# Basic EDA
print(df.head())
print(df.isnull().sum())

In [None]:
# For simplicity, let's select a few numerical and categorical features
features = ["loan_amnt", "int_rate", "installment", "annual_inc", "dti", "open_acc", "revol_bal", "total_acc", "emp_length"]
target = "loan_status"  # Assuming this contains 'Fully Paid' vs 'Charged Off' or similar

# Filter dataset
df = df[features + [target]].dropna()

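# Keep only loans with a final outcome, then convert the target to binary:
# 1 if charged off or defaulted, 0 if fully paid. Loans that are still open
# (e.g. "Current") are dropped because their outcome is not yet known.
# The status names below are assumptions; check df[target].value_counts().
status = df[target].str.strip().str.lower()
keep = status.isin(["fully paid", "charged off", "default"])
df = df[keep].copy()
df["target"] = (status[keep] != "fully paid").astype(int)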

# Encode employment length categorical feature (example encoding)
def emp_length_to_int(emp):
    if pd.isnull(emp):
        return 0
    if emp == '< 1 year':
        return 0
    if emp == '10+ years':
        return 10
    try:
        return int(emp.split()[0])
    except (ValueError, IndexError, AttributeError):
        return 0

df['emp_length'] = df['emp_length'].apply(emp_length_to_int)

# Define X, y (emp_length in df is already integer-encoded above)
X = df[features].copy()
y = df['target']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training samples: {X_train.shape[0]}, Testing samples: {X_test.shape[0]}")
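
The employment-length encoding above can be checked in isolation before it is applied to the full column. The following is a minimal sketch using only the standard library; `parse_emp_length` is a hypothetical helper, not part of the notebook, and the sample strings are assumed to resemble the values in `emp_length`.

```python
import re

def parse_emp_length(value):
    """Map strings such as '< 1 year', '3 years', '10+ years' to whole years."""
    text = str(value).strip()
    # '< 1 year' means less than one year of employment, so it maps to 0
    if text.startswith("<"):
        return 0
    match = re.search(r"\d+", text)
    # Missing or unrecognised values (None, NaN, 'n/a') fall back to 0
    if match is None:
        return 0
    return int(match.group())

# Assumed sample values, for illustration only
samples = ["< 1 year", "1 year", "3 years", "10+ years", None, "n/a"]
parsed = [parse_emp_length(s) for s in samples]
print(parsed)
```

Mapping missing values to 0 makes them indistinguishable from "less than one year"; whether that is acceptable depends on how many rows are affected.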

## Model Selection: XGBoost
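XGBoost, a gradient-boosted tree ensemble, is chosen for this binary classification task because boosted trees tend to perform well on tabular data and expose several hyperparameters (tree depth, learning rate, subsampling ratios) that can be tuned.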

In [None]:
# Objective for Bayesian optimization: mean cross-validated AUC for a given
# set of hyperparameters. Scoring on held-out folds, rather than on the rows
# the model was fit on, keeps the search from rewarding overfitting.
from sklearn.model_selection import cross_val_score

def xgb_evaluate(max_depth, learning_rate, n_estimators, gamma, min_child_weight, subsample, colsample_bytree):
    params = {
        'max_depth': int(max_depth),
        'learning_rate': learning_rate,
        'n_estimators': int(n_estimators),
        'gamma': gamma,
        'min_child_weight': min_child_weight,
        'subsample': subsample,
        'colsample_bytree': colsample_bytree,
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'random_state': 42
    }
    model = XGBClassifier(**params)
    # 3-fold stratified CV on the training set; the test set stays untouched
    scores = cross_val_score(model, X_train, y_train, cv=3, scoring='roc_auc')
    return scores.mean()

## Bayesian Optimization for Hyperparameter Tuning
Optimize for the highest mean cross-validated AUC on the training set. The test set is held out for the final evaluation.
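
Scoring a candidate on held-out folds is one way to make the tuning score less optimistic than a score computed on the rows the model was fit on. The following is a minimal plain-Python sketch of how stratified folds can be formed; `stratified_kfold_indices` is a hypothetical helper and the toy labels are made up. In practice scikit-learn's `StratifiedKFold` provides this.

```python
import random
from collections import defaultdict

def stratified_kfold_indices(labels, n_splits=3, seed=42):
    """Yield (train_idx, val_idx) pairs with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin so every fold gets a similar share
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, val_idx

# Toy labels: 24 non-defaults (0) and 6 defaults (1)
labels = [0] * 24 + [1] * 6
splits = list(stratified_kfold_indices(labels, n_splits=3))
for train_idx, val_idx in splits:
    n_default = sum(labels[i] for i in val_idx)
    print(len(train_idx), len(val_idx), n_default)
```

Each validation fold receives the same share of defaults, so the per-fold AUC is computed on a class mix similar to the full training set.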

In [None]:
pbounds = {
    'max_depth': (3, 10),
    'learning_rate': (0.01, 0.3),
    'n_estimators': (50, 300),
    'gamma': (0, 5),
    'min_child_weight': (1, 10),
    'subsample': (0.5, 1),
    'colsample_bytree': (0.5, 1)
}

optimizer = BayesianOptimization(
    f=xgb_evaluate,
    pbounds=pbounds,
    random_state=42,
    verbose=2
)

# Run optimization
optimizer.maximize(init_points=5, n_iter=15)

## Train Final Model with Best Hyperparameters

In [None]:
# Extract the best hyperparameters; bayes_opt returns floats, so cast the integer ones
best_params = dict(optimizer.max['params'])
best_params['max_depth'] = int(best_params['max_depth'])
best_params['n_estimators'] = int(best_params['n_estimators'])
best_params['objective'] = 'binary:logistic'
best_params['eval_metric'] = 'auc'
best_params['random_state'] = 42

# Train the final model on the full training set
final_model = XGBClassifier(**best_params)
final_model.fit(X_train, y_train)
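
Because the optimizer proposes continuous values, integer hyperparameters are cast both inside the objective and again before the final fit. A minimal sketch of keeping that cast in one place follows; `decode_params` is a hypothetical helper and the `raw` dictionary is made-up optimizer output, shown for illustration only.

```python
def decode_params(raw_params, int_keys=("max_depth", "n_estimators")):
    """Return a copy of raw_params with integer-valued hyperparameters cast to int."""
    decoded = dict(raw_params)  # copy so the optimizer's own record is not mutated
    for key in int_keys:
        if key in decoded:
            # int() truncates toward zero, matching the cast used in the notebook
            decoded[key] = int(decoded[key])
    return decoded

# Hypothetical optimizer output, not real results
raw = {"max_depth": 6.7, "n_estimators": 212.4, "learning_rate": 0.08}
decoded = decode_params(raw)
print(decoded)
```

Using the same helper in the objective and in the final fit ensures the model that is trained at the end is the one the optimizer actually scored.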

## Evaluate Model on Test Set with Multiple Metrics

In [None]:
# Predictions
y_pred = final_model.predict(X_test)
y_proba = final_model.predict_proba(X_test)[:,1]

# Metrics
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)

print(f"Accuracy: {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall: {rec:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"AUC-ROC: {auc:.4f}")
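
AUC-ROC can be read as the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. The following plain-Python sketch computes it by direct pair counting on made-up scores; `auc_from_scores` is a hypothetical helper, and the quadratic loop is for illustration only (`roc_auc_score` is the practical choice).

```python
def auc_from_scores(y_true, y_score):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores, for illustration only
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = auc_from_scores(y_true_toy, y_score_toy)
print(auc_toy)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

Unlike accuracy, precision, recall, and F1, this value does not depend on the 0.5 decision threshold, which is why it is computed from `y_proba` rather than `y_pred`.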

## Confusion Matrix Visualization

In [None]:
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=["No Default", "Default"], yticklabels=["No Default", "Default"])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()
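
The four cells of the confusion matrix determine the threshold-based metrics reported earlier. The sketch below recomputes them from a matrix laid out as scikit-learn returns it for labels 0 and 1, `[[TN, FP], [FN, TP]]`; `metrics_from_confusion` is a hypothetical helper and the counts are made up.

```python
def metrics_from_confusion(cm):
    """Compute accuracy, precision, recall and F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against empty denominators when no positives are predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts: 85 actual non-defaults, 15 actual defaults
toy_cm = [[80, 5], [10, 5]]
toy_metrics = metrics_from_confusion(toy_cm)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case accuracy is 0.85 while recall is only one third, which illustrates why accuracy alone can be misleading when defaults are the minority class.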

## Summary
- Data loaded directly from Kaggle dataset `wordsforthewise/lending-club`.
- Task: binary classification to predict credit default.
- Model: XGBoost with Bayesian hyperparameter optimization.
- Evaluation using multiple metrics: accuracy, precision, recall, F1, AUC-ROC.
- Confusion matrix plotted for error analysis.

## Next Steps
- Improve feature engineering.
- Use cross-validation for more robust evaluation.
- Experiment with other models and ensembles.
