# 6.2. Model Training: XGBoost Classifier

This notebook trains the **XGBoost (Extreme Gradient Boosting) Classifier**.

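XGBoost builds trees *sequentially*. Each new tree is fit to the errors left by the trees built so far, so the ensemble improves step by step. This often gives better accuracy than a single tree, although it is not guaranteed to beat a simpler model.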
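
To make the "each tree corrects the previous ones" idea concrete, here is a minimal sketch of plain gradient boosting with one-split trees (stumps) and squared error. It is a simplification for illustration, not XGBoost's actual algorithm (which also uses second-order gradients and regularization). The helper names (`fit_stump`, `predict_stump`, `boost`) and the toy sine data are assumptions made up for this example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Sequentially add stumps, each fit to the current residuals."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    errors = []
    for _ in range(n_rounds):
        residual = y - prediction            # what the ensemble still gets wrong
        stump = fit_stump(x, residual)       # the new tree targets those errors
        prediction = prediction + learning_rate * predict_stump(stump, x)
        errors.append(float(np.mean((y - prediction) ** 2)))
    return prediction, errors

# Toy data (hypothetical): a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 80)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

_, errors = boost(x, y)
print(f"Training MSE after round 1: {errors[0]:.4f}, after round 20: {errors[-1]:.4f}")
```

The training error shrinks as rounds are added, which is also why boosted models can overfit if the number of trees grows unchecked.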

**Goal:** To see whether XGBoost can find more complex patterns in the data and beat the ~66% F1-score baseline set by the Logistic Regression model.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    ConfusionMatrixDisplay
)
import os

## Step 1: Load the Tree-Specific Datasets

We will use the `_tree.csv` files.

In [None]:
# --- Define Output Directory ---
OUTPUT_DIR = "xgboost_results"
os.makedirs(OUTPUT_DIR, exist_ok=True)  # no error if the directory already exists
print(f"Results will be saved to '{OUTPUT_DIR}'.")

# --- Load Data ---
try:
    X_train = pd.read_csv('X_train_tree.csv')
    y_train = pd.read_csv('y_train_tree.csv')
    X_test = pd.read_csv('X_test_tree.csv')
    y_test = pd.read_csv('y_test_tree.csv')

    print("All tree-specific training and testing data loaded successfully.")
    print(f"X_train shape: {X_train.shape}")
    print(f"X_test shape: {X_test.shape}")
except FileNotFoundError:
    print("Error: Could not find all four _tree.csv files.")
    print("Please upload X_train_tree.csv, y_train_tree.csv, X_test_tree.csv, and y_test_tree.csv.")
    raise  # stop here: the cells below need these DataFrames

## Step 2: Initialize and Train the Model

We will initialize the `XGBClassifier`.
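- `n_estimators=100`: As with Random Forest, we'll build 100 trees.
- `n_jobs=-1`: Use all available CPU cores to speed up training.
- `random_state=42`: For reproducibility.
- `eval_metric='logloss'`: Use log loss (binary cross-entropy) as the evaluation metric.
- `use_label_encoder=False`: Silences a label-encoder warning in older XGBoost versions. Newer releases deprecate this argument, so depending on the installed version it may need to be removed.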

In [None]:
# Initialize the model
model = xgb.XGBClassifier(
    n_estimators=100,
    random_state=42,
    n_jobs=-1,
    use_label_encoder=False,
    eval_metric='logloss'
)

print("Training the XGBoost model...")

# Train the model
model.fit(X_train, y_train.values.ravel())

print("Model training complete.")
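
The cell above sets `eval_metric='logloss'`. As a minimal sketch of what that metric measures, the pure-Python function below computes binary log loss by hand. The function and the toy probabilities are illustrative assumptions, not output from the trained model.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy between 0/1 labels and predicted P(class 1)."""
    total = 0.0
    for label, prob in zip(y_true, y_prob):
        prob = min(max(prob, eps), 1 - eps)  # clip to avoid log(0)
        total += -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities.
labels = [0, 0, 1, 1]
confident_and_right = [0.1, 0.2, 0.8, 0.9]
confident_and_wrong = [0.9, 0.8, 0.2, 0.1]

print(f"Right: {log_loss(labels, confident_and_right):.3f}")
print(f"Wrong: {log_loss(labels, confident_and_wrong):.3f}")
```

Log loss punishes confident mistakes heavily, so it captures probability quality that accuracy alone does not. Note that the evaluation metric is typically only reported when validation data is supplied to `fit` (for example through `eval_set`), which this notebook does not do.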

## Step 3: Make Predictions on the Test Data

We'll use our new XGBoost model to make predictions on the unseen `X_test_tree` data.

In [None]:
predictions = model.predict(X_test)
print("Predictions made on the test set.")
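
For a binary problem, `predict` amounts to cutting the predicted probability of the positive class at 0.5. The sketch below shows that step on its own, and how a different cutoff changes the labels. The probabilities here are made-up values; in this notebook they would be assumed to come from something like `model.predict_proba(X_test)[:, 1]`.

```python
import numpy as np

def labels_from_probabilities(prob_positive, threshold=0.5):
    """Turn P(class 1) into 0/1 labels using a decision threshold."""
    return (np.asarray(prob_positive) > threshold).astype(int)

# Hypothetical predicted probabilities of the positive class.
probs = np.array([0.10, 0.35, 0.48, 0.62, 0.91])

print(labels_from_probabilities(probs))        # default 0.5 cutoff
print(labels_from_probabilities(probs, 0.3))   # lower cutoff flags more positives
```

Lowering the threshold trades precision for recall on the positive class, which can matter when one class is much rarer than the other.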

## Step 4: Evaluate Model Performance

We print the accuracy and the full classification report, then plot the confusion matrix, to see whether XGBoost performed any better than the 66% F1-score from Logistic Regression.

In [None]:
# Check Accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Check the full Classification Report
print("\n--- Classification Report ---")
report = classification_report(y_test, predictions, target_names=['Fully Paid (0)', 'Charged Off (1)'])
print(report)

# Check the Confusion Matrix
print("\n--- Confusion Matrix ---")
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_predictions(
    y_test,
    predictions,
    ax=ax,
    display_labels=['Fully Paid', 'Charged Off'],
    cmap='Purples'
)
plt.title("XGBoost Classifier Confusion Matrix")

# Save the plot
plt.savefig(os.path.join(OUTPUT_DIR, "xgb_confusion_matrix.png"))
plt.show()
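
The numbers in the classification report all derive from the four cells of the confusion matrix. As a minimal sketch, the function below recomputes accuracy, precision, recall, and F1 for the positive class by hand. The labels are a small made-up example, not results from this model.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one class from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels: 1 = Charged Off, 0 = Fully Paid.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0]

metrics = binary_metrics(y_true_toy, y_pred_toy)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case accuracy is higher than the F1-score for the positive class, which is why comparing one model's accuracy against another model's F1-score can be misleading.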

## Step 5: Interpret the Model (Feature Importance)

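XGBoost also provides `.feature_importances_`, with one score per feature. For tree-based boosters the scikit-learn wrapper reports gain-based importance by default (unless `importance_type` is set otherwise): roughly, how much splits on that feature improved the training objective, normalized so the scores sum to 1. Treat the ranking as a rough guide to what the model relied on, not as proof of a causal effect.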

In [None]:
# Get the feature importances from the trained model
importances = model.feature_importances_

# Get the feature names
feature_names = X_train.columns

# Create a DataFrame to see them clearly
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})

# Sort by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# --- Save and Display Top 15 ---
print("--- Top 15 Most Important Features (XGBoost) ---")
display(importance_df.head(15))

# Save the full list to a CSV
importance_csv_path = os.path.join(OUTPUT_DIR, "xgb_feature_importances.csv")
importance_df.to_csv(importance_csv_path, index=False)

# --- Plot the Top 15 ---
plt.figure(figsize=(10, 8))
sns.barplot(
    x='Importance',
    y='Feature',
    data=importance_df.head(15)
)
plt.title('Top 15 Features (XGBoost)')
plt.tight_layout()  # keep long feature names from being clipped in the saved image
plt.savefig(os.path.join(OUTPUT_DIR, "xgb_feature_importance_plot.png"))
plt.show()
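
Built-in importances depend on how the score is defined, so it can help to cross-check them with a model-agnostic method. The sketch below implements permutation importance, a different technique from `.feature_importances_`: shuffle one column at a time and measure how much accuracy drops. The toy data and `toy_predict` function are hypothetical stand-ins; with the real model, `model.predict` and the test set would take their place.

```python
import numpy as np

def permutation_importance(predict_fn, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict_fn(X) == y)
    drops = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Break the link between this feature and the target.
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            scores.append(np.mean(predict_fn(X_shuffled) == y))
        drops[col] = baseline - np.mean(scores)
    return drops

# Toy setup (hypothetical): only the first of three features matters.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(500, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

def toy_predict(X):
    """Stand-in for a trained model's predict method."""
    return (X[:, 0] > 0).astype(int)

drops = permutation_importance(toy_predict, X_toy, y_toy)
for i, drop in enumerate(drops):
    print(f"feature_{i}: accuracy drop = {drop:.3f}")
```

Features the model ignores show no drop at all, while shuffling a feature it depends on pushes accuracy toward chance. scikit-learn ships a fuller version of this idea as `sklearn.inspection.permutation_importance`.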