# 5.2. Data Preparation for XGBoost

This notebook prepares the `lc_loans_master_sample.csv` file for use with XGBoost.

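This process differs from the one for Logistic Regression in two ways and keeps one step unchanged:
1.  **No Feature Scaling:** Tree models split on thresholds, so only the ordering of a feature's values matters, not their scale. We therefore do not need `StandardScaler`.
2.  **Ordinal Encoding:** We will convert all categorical columns into simple integer labels (e.g., A=0, B=1, C=2). This keeps one column per feature, which is more compact for tree models than one-hot encoding. The integers follow the sorted order of the labels, so they only carry a meaningful ranking where that order is meaningful.
3.  **Feature Selection:** We will drop the same redundant and unneeded columns as before.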
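
As a quick illustration of the first point, here is a minimal sketch on synthetic data. It uses scikit-learn's `DecisionTreeClassifier` as a lightweight stand-in for XGBoost (an assumption made to keep the example small), and the names `X_demo`, `y_demo`, `X_new`, and `SCALE` are hypothetical. Rescaling every feature by a positive constant should leave the tree's predictions unchanged.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 100, size=(200, 3)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 100).astype(int)
X_new = rng.integers(0, 100, size=(50, 3)).astype(float)

SCALE = 1024.0  # a positive rescaling preserves the ordering of values

# Fit one tree on the raw features and one on the rescaled features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo * SCALE, y_demo)

# Compare predictions on unseen rows (rescaled the same way for the second tree)
pred_raw = tree_raw.predict(X_new)
pred_scaled = tree_scaled.predict(X_new * SCALE)

same_predictions = bool((pred_raw == pred_scaled).all())
print("Identical predictions:", same_predictions)
```

A single decision tree is not XGBoost, but both choose splits by comparing feature values against thresholds, which is why scaling can be skipped here.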

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
import os

## Configuration

Define the input file (our master sample). The output files written in Step 5 carry a `_tree` suffix to distinguish them from the Logistic Regression files.

In [None]:
# --- Configuration ---
INPUT_FILE = 'lc_loans_master_sample.csv'

## Step 1: Load and Prepare Data

Load the master sample, create our `loan_to_income_ratio` feature, and drop all redundant or unneeded columns.

In [None]:
# Load the master sample dataset
try:
    df = pd.read_csv(INPUT_FILE)
    print(f"Loaded master sample with shape: {df.shape}")
except FileNotFoundError:
    print(f"Error: The file '{INPUT_FILE}' was not found.")
    print("Please upload the file to this Colab session.")
    raise  # Stop here: every later step needs `df`

# 1. Feature Engineering
df['loan_to_income_ratio'] = df['loan_amnt'] / (df['annual_inc'] + 1)
print("Created 'loan_to_income_ratio' feature.")

# 2. Feature Selection
# Drop all redundant, leaky, or unneeded text columns
features_to_drop = [
    # Redundant with sub_grade
    'grade',
    'int_rate',

    # Redundant with loan_amnt and term
    'installment',

    # Unneeded text columns
    'pymnt_plan',
    'initial_list_status',
    'application_type',
    'hardship_flag',
    'disbursement_method',
    'debt_settlement_flag'
]
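# Drop only the columns that are present, so the reported count is accurate
dropped = [col for col in features_to_drop if col in df.columns]
df = df.drop(columns=dropped)
print(f"Dropped {len(dropped)} redundant/unneeded features.")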
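
The `+ 1` in the denominator of `loan_to_income_ratio` presumably guards against division by zero when `annual_inc` is 0; that reading is an assumption, since the notebook does not say so. The sketch below uses a hypothetical two-row frame, `toy`, to show the effect.

```python
import pandas as pd

# Toy rows (hypothetical values); the second borrower reports zero income
toy = pd.DataFrame({
    "loan_amnt": [10000.0, 5000.0],
    "annual_inc": [49999.0, 0.0],
})

# Same formula as in the cell above
toy["loan_to_income_ratio"] = toy["loan_amnt"] / (toy["annual_inc"] + 1)

# Without the +1, the zero-income row divides by zero and becomes inf
naive_ratio = toy["loan_amnt"] / toy["annual_inc"]

print(toy)
print(naive_ratio)
```

For realistic incomes the extra 1 changes the ratio only negligibly, while zero-income rows stay finite.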

## Step 2: Define Features (X) and Target (y)

Separate the data into `X` (features) and `y` (target).

In [None]:
y = df['target']
X = df.drop(columns='target')

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

## Step 3: Train-Test Split

Split the data into training and testing sets, ensuring the 50/50 balance is preserved using `stratify=y`.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% for testing
    random_state=42,
    stratify=y         # Keep the 50/50 balance
)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
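
To see what `stratify=y` does, here is a minimal sketch on a toy balanced target. The names `X_toy` and `y_toy` are hypothetical and stand in for the real `X` and `y`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced target: 50 zeros and 50 ones (hypothetical data)
y_toy = pd.Series([0] * 50 + [1] * 50, name="target")
X_toy = pd.DataFrame({"feature": range(100)})

# Same split settings as in the cell above
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy,
)

# Share of positive labels in each split
train_balance = y_tr.mean()
test_balance = y_te.mean()
print(f"Train positive share: {train_balance:.2f}")
print(f"Test positive share:  {test_balance:.2f}")
```

The same check (`y_train.mean()` and `y_test.mean()`) can be run on the real split to confirm the balance carried over.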

## Step 4: Feature Encoding (Ordinal)

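We will find all remaining text (`object`) columns and convert them to integer codes using `OrdinalEncoder`. The codes follow the sorted order of each column's labels. The encoder is fitted on the training data only, and any category that appears only in the test data is encoded as `-1`.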

In [None]:
print("--- Starting Ordinal Encoding ---")

# Automatically find all categorical (text) columns
categorical_cols = X_train.select_dtypes(include=['object']).columns

print(f"Found {len(categorical_cols)} categorical columns to encode: {list(categorical_cols)}")

# Initialize the OrdinalEncoder
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# Fit the encoder on the training data
encoder.fit(X_train[categorical_cols])

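# Transform both the training and test data
# (work on explicit copies to avoid pandas' SettingWithCopyWarning)
X_train, X_test = X_train.copy(), X_test.copy()
X_train[categorical_cols] = encoder.transform(X_train[categorical_cols])
X_test[categorical_cols] = encoder.transform(X_test[categorical_cols])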

print("Ordinal encoding complete.")
print("\n--- Preprocessing for Tree Models Complete ---")
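
The behaviour of `handle_unknown='use_encoded_value'` is easiest to see on a toy column. This is a minimal sketch with hypothetical frames `train_toy` and `test_toy`; the label values are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): "G5" appears only in the test frame
train_toy = pd.DataFrame({"sub_grade": ["B2", "A1", "C3", "A1"]})
test_toy = pd.DataFrame({"sub_grade": ["A1", "G5"]})

# Same encoder settings as in the cell above
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train_toy)

# Categories are stored in sorted order; their positions become the codes
learned_categories = list(enc.categories_[0])
encoded_test = enc.transform(test_toy).ravel().tolist()

print("Learned categories:", learned_categories)
print("Encoded test values:", encoded_test)
```

The unseen label maps to `-1` instead of raising an error, so the test set can always be transformed.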

## Step 5: Save the Prepared Datasets

Save the new, tree-ready files.

In [None]:
# Save the final, processed data
X_train.to_csv('X_train_tree.csv', index=False)
y_train.to_csv('y_train_tree.csv', index=False)
X_test.to_csv('X_test_tree.csv', index=False)
y_test.to_csv('y_test_tree.csv', index=False)

print("Saved X_train_tree, y_train_tree, X_test_tree, and y_test_tree to CSV files.")

print("\n--- Final X_train_tree head (note: no scaling): ---")
# Display a sample of the final data. All values should be numeric.
display(X_train.head())
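
When these files are read back later, the target files load as one-column DataFrames rather than Series. The sketch below shows one way to recover a Series, assuming the column is named `target` as in Step 2; `y_toy` and the temporary file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy target (hypothetical values) saved the same way as y_train above
y_toy = pd.Series([1, 0, 1], name="target")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "y_toy_tree.csv")
    y_toy.to_csv(path, index=False)

    loaded = pd.read_csv(path)    # comes back as a one-column DataFrame
    y_loaded = loaded["target"]   # select the column to recover a Series

round_trip_ok = y_loaded.tolist() == y_toy.tolist()
print("Round trip preserved the labels:", round_trip_ok)
```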