# 6.3. Model Training: CatBoost Classifier

This notebook trains a **CatBoost (Categorical Boosting) Classifier**.

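CatBoost is a gradient-boosting library designed to handle categorical features natively. It encodes them internally with ordered target statistics, which helps limit the overfitting that naive target encoding can cause.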

We will use the **"Full" dataset** (`X_train_tree_full.csv`), which includes `grade`, `sub_grade`, and `int_rate`. CatBoost applies its built-in categorical encoding to the columns we declare as categorical.
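
To give a feel for what that categorical encoding does, here is a simplified sketch of the ordered target statistic idea: each row's category is replaced by a smoothed mean of the targets of *earlier* rows with the same category, so a row never sees its own label. The function name, `prior`, and `weight` are illustrative assumptions rather than CatBoost's API, and the library's real implementation (random permutations, feature combinations) is more involved.

```python
def ordered_target_stats(categories, targets, prior=0.5, weight=1.0):
    """Encode each category value using only the targets of earlier rows."""
    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Smoothed mean of earlier targets; equals `prior` when n == 0
        encoded.append((s + prior * weight) / (n + weight))
        # Update the running stats only after encoding this row
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded


# Toy data (hypothetical): grade per loan and whether it was charged off
grades = ["A", "B", "A", "A", "B"]
charged_off = [0, 1, 1, 0, 1]
encoded = ordered_target_stats(grades, charged_off)
print(encoded)
```

The first occurrence of every category falls back to the prior, which is one reason the encoding depends on row order.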

In [None]:
# Install the CatBoost library
!pip install catboost

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    ConfusionMatrixDisplay
)
import os

## Step 1: Load the "Full" Datasets

We load the datasets that include the redundant features (`grade`, `int_rate`).

In [None]:
# --- Define Output Directory ---
OUTPUT_DIR = "catboost_results"
os.makedirs(OUTPUT_DIR, exist_ok=True)  # No error if the directory already exists

# --- Load Data ---
try:
    X_train = pd.read_csv('Inputs/X_train_tree_full.csv')
    y_train = pd.read_csv('Inputs/y_train_tree_full.csv')
    X_test = pd.read_csv('Inputs/X_test_tree_full.csv')
    y_test = pd.read_csv('Inputs/y_test_tree_full.csv')

    print("Full tree datasets loaded successfully.")
    print(f"X_train shape: {X_train.shape}")
except FileNotFoundError:
    print("Error: Could not find the _tree_full.csv files in 'Inputs/'.")
    print("Please upload the files generated in the 'Advanced Data Preparation' step.")
    raise  # Stop here: every later cell needs these DataFrames
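
Because features and labels are read from separate files, it can be worth confirming that they line up before training. The helper below is a minimal sketch of such a check on toy data. `check_aligned` and the `loan_status` column name are hypothetical and not part of the original pipeline.

```python
import pandas as pd


def check_aligned(X, y, name="train"):
    """Raise if features and labels differ in row count or y is not one column."""
    if len(X) != len(y):
        raise ValueError(f"{name}: X has {len(X)} rows but y has {len(y)}")
    if y.shape[1] != 1:
        raise ValueError(f"{name}: expected one label column, got {y.shape[1]}")
    return True


# Toy frames standing in for X_train / y_train (values are made up)
X_demo = pd.DataFrame({"loan_amnt": [10000, 5000], "int_rate": [12.5, 7.9]})
y_demo = pd.DataFrame({"loan_status": [0, 1]})
print(check_aligned(X_demo, y_demo))
```

In the notebook this would be called once for the train pair and once for the test pair, right after loading.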

## Step 2: Define Categorical Features

We tell the model which columns are categorical so that it encodes them as categories instead of treating their integer codes as ordered numbers.

Based on our data prep, the categorical features are: `term`, `grade`, `sub_grade`, `home_ownership`, `verification_status`, `purpose`.

In [None]:
# Names of the categorical columns (from the data prep step)
categorical_features_names = [
    'term',
    'grade',
    'sub_grade',
    'home_ownership',
    'verification_status',
    'purpose'
]

# Keep only the categorical columns that are actually present in the data
present_cat_features = [c for c in categorical_features_names if c in X_train.columns]

# Get integer indices (column positions) for the model
cat_features_indices = [X_train.columns.get_loc(c) for c in present_cat_features]

print(f"CatBoost will treat these {len(cat_features_indices)} columns as categorical features: {present_cat_features}")

In [None]:
# Convert categorical columns to integers
# (CatBoost accepts integer or string values for categorical features, not floats)
for col in categorical_features_names:
    if col in X_train.columns:
        X_train[col] = X_train[col].astype(int)
        X_test[col] = X_test[col].astype(int)

print("Successfully converted categorical features to integers.")
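
A plain `astype(int)` raises on missing values and silently truncates fractional ones (for example, `2.7` becomes `2`). The sketch below shows a stricter conversion that refuses both cases. `to_int_codes` is a hypothetical helper, and the toy frame only mimics the encoded columns.

```python
import pandas as pd


def to_int_codes(series):
    """Cast a column of integer-valued codes to int64, refusing lossy casts."""
    if series.isna().any():
        raise ValueError(f"{series.name}: contains missing values")
    as_int = series.astype("int64")
    # A fractional value would change under the cast, so compare before/after
    if not (as_int == series).all():
        raise ValueError(f"{series.name}: contains non-integer values")
    return as_int


# Toy frame: 'grade' was read as float codes, 'term' is already integer
demo = pd.DataFrame({"grade": [0.0, 2.0, 1.0], "term": [0, 1, 1]})
for col in ["grade", "term"]:
    demo[col] = to_int_codes(demo[col])
print(demo.dtypes.to_dict())
```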

## Step 3: Initialize and Train CatBoost

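We initialize the model with the following settings:
* `iterations=1000`: CatBoost builds up to 1000 trees.
* `learning_rate=0.1`: The shrinkage applied to each tree's contribution. Smaller values usually need more iterations.
* `depth=6`: The depth of each tree.
* `eval_metric='Accuracy'`: The metric reported in the training log.
* `random_seed=42`: Fixes the random seed so that runs are reproducible.
* `verbose=100`: Print an update every 100 iterations.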
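
To illustrate how `iterations` and `learning_rate` interact, here is a toy boosting loop in which every "tree" simply predicts the mean residual. It is only a sketch of the shrinkage idea under squared error, not how CatBoost builds its trees, and `boosted_constant_fit` is a made-up name.

```python
def boosted_constant_fit(targets, iterations, learning_rate):
    """Toy boosting where each weak learner outputs the mean residual."""
    prediction = 0.0
    for _ in range(iterations):
        residuals = [y - prediction for y in targets]
        step = sum(residuals) / len(residuals)  # the weak learner's output
        prediction += learning_rate * step      # shrink each step
    return prediction


targets = [1.0, 0.0, 1.0, 1.0]  # mean is 0.75
for lr in (0.1, 0.5):
    fitted = boosted_constant_fit(targets, iterations=10, learning_rate=lr)
    print(f"learning_rate={lr}: prediction after 10 iterations = {fitted:.4f}")
```

With the same number of iterations, the smaller learning rate ends further from the target mean, which is why a low `learning_rate` is typically paired with a higher `iterations` value.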

In [None]:
# Initialize the model
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    eval_metric='Accuracy',
    random_seed=42,
    verbose=100  # Log output every 100 trees
)

print("Training CatBoost model...")

# Train the model
# Pass the cat_features indices
model.fit(
    X_train,
    y_train.values.ravel(),
    cat_features=cat_features_indices
)

print("Training complete.")

## Step 4: Evaluate Performance

We generate the full classification report and plot the confusion matrix. The goal is to see whether using CatBoost with the full feature set (including `grade` and `int_rate`) improves on the 66% baseline.

In [None]:
# Make predictions
predictions = model.predict(X_test)

# Metrics
accuracy = accuracy_score(y_test, predictions)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

print("\n--- Classification Report ---")
print(classification_report(y_test, predictions, target_names=['Fully Paid', 'Charged Off']))

# Confusion Matrix
print("\n--- Confusion Matrix ---")
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_predictions(
    y_test,
    predictions,
    ax=ax,
    display_labels=['Fully Paid', 'Charged Off'],
    cmap='Oranges'
)
plt.title("CatBoost Confusion Matrix")
plt.savefig(os.path.join(OUTPUT_DIR, "catboost_confusion_matrix.png"))
plt.show()
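
The figures in the classification report can be traced back to the four cells of the confusion matrix. The sketch below recomputes accuracy, precision, and recall by hand, treating label `1` (Charged Off) as the positive class. `binary_metrics` and the toy labels are illustrative and are not output from this notebook's model.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion counts and basic metrics for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted/present
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels: 0 = Fully Paid, 1 = Charged Off
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

Looking at recall for the Charged Off class alongside accuracy matters here because a model that always predicts the majority class can still score a seemingly reasonable accuracy.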

## Step 5: Feature Importance

Since we included all features, we expect `sub_grade`, `grade`, and `int_rate` to dominate this list.

In [None]:
# Get feature importance
importances = model.get_feature_importance()
feature_names = X_train.columns

importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

print("--- Top 15 Most Important Features (CatBoost) ---")
display(importance_df.head(15))

importance_df.to_csv(os.path.join(OUTPUT_DIR, "catboost_feature_importances.csv"), index=False)

# Plot
plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=importance_df.head(15))
plt.title('Top 15 Features (CatBoost)')
plt.savefig(os.path.join(OUTPUT_DIR, "catboost_feature_importance_plot.png"))
plt.show()

## Step 6: Save, Load, and Predict
This section demonstrates how to save a trained CatBoost model, load it, and use it for live predictions (e.g., with user input).

In [None]:
# Save the model trained in Step 3. Refitting here without cat_features
# would treat the categorical codes as plain numbers and discard the
# hyperparameters that were evaluated above.
model.save_model('catboost_model.cbm')
print('Model saved as catboost_model.cbm')

In [None]:
from catboost import CatBoostClassifier
import pandas as pd

# Load the saved model
model = CatBoostClassifier()
model.load_model('catboost_model.cbm')

# Prepare user input with all features used in training
user_input = {}
for col in X_train.columns:
    # Example: set user values for key features, defaults for others
    if col == 'loan_amnt':
        user_input[col] = [10000]  # Example user value
    elif col == 'int_rate':
        user_input[col] = [12.5]   # Example user value
    elif col == 'term':
        user_input[col] = [1]      # Example user value
    elif col == 'grade':
        user_input[col] = [2]      # Example user value
    elif col == 'sub_grade':
        user_input[col] = [5]      # Example user value
    elif col == 'home_ownership':
        user_input[col] = [1]      # Example user value
    elif col == 'verification_status':
        user_input[col] = [0]      # Example user value
    elif col == 'purpose':
        user_input[col] = [3]      # Example user value
    else:
        # Use mean or default value for other features
        user_input[col] = [X_train[col].mean()]

user_data = pd.DataFrame(user_input)

# Predict (1 = Charged Off, 0 = Fully Paid, matching the report labels above)
prediction = model.predict(user_data)[0]
result = 'Charged Off' if prediction == 1 else 'Fully Paid'
print(f'Prediction: {result}')
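
The `if`/`elif` chain above can be condensed into a small helper that takes the user's values as a dictionary and fills in the remaining columns. The following is a sketch under a few assumptions: `build_input_row` is a hypothetical name, unspecified numeric columns default to the median, unspecified categorical columns default to the most frequent code, and the toy `reference` frame stands in for `X_train`.

```python
import pandas as pd


def build_input_row(reference, overrides, categorical_cols):
    """Build a one-row DataFrame with the same columns and order as `reference`."""
    unknown = set(overrides) - set(reference.columns)
    if unknown:
        raise KeyError(f"Unknown feature(s): {sorted(unknown)}")
    row = {}
    for col in reference.columns:
        if col in overrides:
            row[col] = overrides[col]
        elif col in categorical_cols:
            row[col] = reference[col].mode().iloc[0]  # most frequent category
        else:
            row[col] = reference[col].median()
    frame = pd.DataFrame([row], columns=reference.columns)
    # Categorical codes must stay integers for a model trained with cat_features
    for col in categorical_cols:
        if col in frame.columns:
            frame[col] = frame[col].astype("int64")
    return frame


# Toy stand-in for X_train (values are made up)
reference = pd.DataFrame({
    "loan_amnt": [5000, 10000, 20000],
    "int_rate": [7.9, 12.5, 18.0],
    "grade": [0, 2, 2],
})
row = build_input_row(reference, {"loan_amnt": 10000, "int_rate": 12.5}, ["grade"])
print(row)
```

Rejecting unknown keys catches misspelled feature names early, which a silent default would otherwise hide.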