# 5.3. Data Preparation for Categorical Boosting (CatBoost)

This notebook prepares a new version of our dataset for the CatBoost classifier.

In our previous modeling, we removed "redundant" features (like `grade` and `int_rate`) to avoid multicollinearity, which makes Logistic Regression coefficients unstable and hard to interpret.

However, the literature (Ko et al., 2022) suggests that **tree-based models** often perform better when they have access to *all* variations of a feature.
* `int_rate` gives the tree a precise, continuous split point (e.g., > 12.5%).
* `grade` gives the tree a broad, categorical bucket (e.g., "is Grade D?").

By keeping these highly correlated features, we allow the model to find both coarse and fine-grained patterns.
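
To make the coarse-versus-fine idea concrete, here is a minimal sketch on six made-up loans (the values are illustrative and not taken from our dataset). It shows that `int_rate` and `grade` are strongly correlated, yet each one gives a tree a different split candidate.

```python
import pandas as pd

# Synthetic illustration only: these six loans are invented for the example.
toy = pd.DataFrame({
    "grade": ["A", "A", "B", "C", "D", "D"],
    "int_rate": [6.5, 7.9, 10.2, 13.1, 16.8, 18.4],
})

# Map grades to their rank (A=0, B=1, ...) so a correlation can be computed.
toy["grade_rank"] = toy["grade"].map({"A": 0, "B": 1, "C": 2, "D": 3})

# The two features carry largely overlapping information ...
corr = toy["int_rate"].corr(toy["grade_rank"])

# ... yet they offer different split candidates to a tree.
fine_split = toy["int_rate"] > 12.5    # precise, continuous threshold
coarse_split = toy["grade"] == "D"     # broad, categorical bucket

print(f"Correlation between int_rate and grade rank: {corr:.2f}")
print(f"Loans with int_rate > 12.5: {fine_split.sum()}")
print(f"Loans in grade D: {coarse_split.sum()}")
```

The two masks select different subsets of loans, which is why keeping both columns can still help a tree-based model.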

**Goal:** Create a new set of training and testing datasets that include `int_rate`, `grade`, `sub_grade`, and `installment`.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
import os

## Configuration
We define the input file (our master sample). The output files written in Step 5 use a new suffix, `_tree_full`.

In [None]:
# --- Configuration ---
INPUT_FILE = 'lc_loans_master_sample.csv'

## Step 1: Load and Feature Engineering

We load the master sample and re-create our custom `loan_to_income_ratio` feature.

In [None]:
# Load the master sample dataset
try:
    df = pd.read_csv(INPUT_FILE)
    print(f"Loaded master sample with shape: {df.shape}")
except FileNotFoundError:
    print(f"Error: The file '{INPUT_FILE}' was not found.")
    print("Please upload the file to this Colab session.")
    raise  # Stop here: every later cell needs `df`

# Feature Engineering
df['loan_to_income_ratio'] = df['loan_amnt'] / (df['annual_inc'] + 1)
print("Created 'loan_to_income_ratio' feature.")
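
The `+ 1` in the denominator guards against division by zero when `annual_inc` is 0. The sketch below illustrates this on three made-up rows (not real records). Note that a zero-income applicant still ends up with a very large ratio, so it may be worth inspecting such rows separately.

```python
import numpy as np
import pandas as pd

# Made-up rows, including one zero-income applicant.
toy = pd.DataFrame({
    "loan_amnt": [10000, 5000, 12000],
    "annual_inc": [50000, 0, 39999],
})

# Naive ratio: the zero-income row divides by zero and becomes infinite.
naive_ratio = toy["loan_amnt"] / toy["annual_inc"]

# Ratio as used in this notebook: the +1 keeps every value finite.
toy["loan_to_income_ratio"] = toy["loan_amnt"] / (toy["annual_inc"] + 1)

print("Naive ratio has infinite values:", np.isinf(naive_ratio).any())
print(toy)
```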

## Step 2: Revised Feature Selection (Keeping Redundant Features)

**Crucial Change:** Unlike our previous preparation, we are **NOT** dropping `grade`, `sub_grade`, `int_rate`, or `installment`. We are only dropping the administrative columns that carry no predictive value or risk data leakage.

In [None]:
# Features to drop
features_to_drop = [
    'pymnt_plan',
    'initial_list_status',
    'application_type',
    'hardship_flag',
    'disbursement_method',
    'debt_settlement_flag'
]

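# Drop only the columns that are actually present, so the reported count is accurate
dropped = [col for col in features_to_drop if col in df.columns]
df = df.drop(columns=dropped)
print(f"Dropped {len(dropped)} administrative features.")
print(f"Current columns included: {len(df.columns)}")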

## Step 3: Train-Test Split

We split the data 80/20 into training and testing sets, stratifying on the target to preserve the 50/50 class balance.

In [None]:
y = df['target']
X = df.drop(columns='target')

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% for testing
    random_state=42,
    stratify=y         # Keep the 50/50 balance
)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
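
One way to confirm that stratification worked is to compare the positive-class share in each split. The sketch below does this on a synthetic, perfectly balanced target (the toy frame and its `feature` column are assumptions for illustration). The same check can be run on the real `y_train` and `y_test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic balanced data: 500 zeros and 500 ones (not the real dataset).
toy_X = pd.DataFrame({"feature": range(1000)})
toy_y = pd.Series([0, 1] * 500, name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.2,
    random_state=42,
    stratify=toy_y,   # keep class proportions equal in both splits
)

# For a 0/1 target, the mean is the share of the positive class.
train_share = y_tr.mean()
test_share = y_te.mean()

print(f"Positive share in train: {train_share:.3f}")
print(f"Positive share in test:  {test_share:.3f}")
```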

## Step 4: Ordinal Encoding (All Categorical Features)

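Since tree models handle integer codes well, we use `OrdinalEncoder` to convert *all* remaining text columns (including `grade`, `sub_grade`, `home_ownership`, etc.) into numeric codes (0, 1, 2, ...).

Note: by default, `OrdinalEncoder` sorts each column's categories in lexicographic order, so 'A' becomes 0, 'B' becomes 1, and so on. This preserves the rank for `grade` and `sub_grade`. For nominal columns such as `home_ownership`, the resulting order is arbitrary and carries no meaning. Categories that appear only in the test set are encoded as -1.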

In [None]:
print("--- Starting Ordinal Encoding ---")

# Automatically find all categorical (text) columns
categorical_cols = X_train.select_dtypes(include=['object']).columns

print(f"Encoding the following {len(categorical_cols)} columns: {list(categorical_cols)}")

# Initialize the OrdinalEncoder
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# Fit on training data
encoder.fit(X_train[categorical_cols])

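# Transform both sets (copy first to avoid pandas' SettingWithCopyWarning on split outputs)
X_train, X_test = X_train.copy(), X_test.copy()
X_train[categorical_cols] = encoder.transform(X_train[categorical_cols])
X_test[categorical_cols] = encoder.transform(X_test[categorical_cols])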

print("Ordinal encoding complete.")
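
The encoder's behavior can be seen on a small example. This is a sketch on made-up frames (the `train` and `test` values are invented). It shows the sorted category order learned from the training data and how an unseen test category falls back to -1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up training and test frames for illustration only.
train = pd.DataFrame({
    "grade": ["C", "A", "B"],
    "home_ownership": ["RENT", "OWN", "MORTGAGE"],
})
test = pd.DataFrame({
    "grade": ["B", "G"],                 # "G" never appears in train
    "home_ownership": ["OWN", "OTHER"],  # "OTHER" never appears in train
})

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train)

# Categories are stored in sorted order; a value's position is its code.
print("Learned categories:", [list(c) for c in enc.categories_])

# Unseen categories are mapped to the configured unknown_value.
encoded_test = enc.transform(test)
print(encoded_test)
```

Because the codes come back as floats, -1 for unseen categories is stored as -1.0.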

## Step 5: Save the "Full" Datasets

We save these new files with the `_tree_full` suffix. These will be the input for our CatBoost, Tuned XGBoost, and Stacking models.

In [None]:
# Save the final, processed data
X_train.to_csv('X_train_tree_full.csv', index=False)
y_train.to_csv('y_train_tree_full.csv', index=False)
X_test.to_csv('X_test_tree_full.csv', index=False)
y_test.to_csv('y_test_tree_full.csv', index=False)

print("Saved X_train_tree_full, y_train_tree_full, X_test_tree_full, and y_test_tree_full.")

# Display a sample to verify grade/int_rate are present
print("\n--- Verification: Checking for 'grade' and 'int_rate' ---")
cols_to_check = ['grade', 'sub_grade', 'int_rate', 'installment']
display(X_train[cols_to_check].head())
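
A downstream notebook could reload these files roughly as follows. This is a sketch, assuming the target column is named `target` as in Step 3. It writes toy data to a temporary directory so that it runs on its own, and the toy values are invented.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the processed data (values are invented).
toy_X = pd.DataFrame({"grade": [0.0, 3.0], "int_rate": [7.9, 16.8]})
toy_y = pd.Series([0, 1], name="target")

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X_train_tree_full.csv")
    y_path = os.path.join(tmp, "y_train_tree_full.csv")

    # Save exactly as in the cell above: no index column is written.
    toy_X.to_csv(x_path, index=False)
    toy_y.to_csv(y_path, index=False)

    # The y file has a single "target" column; select it to get a Series back.
    X_loaded = pd.read_csv(x_path)
    y_loaded = pd.read_csv(y_path)["target"]

print(X_loaded.shape, y_loaded.shape)
```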