# 5.1. Data Preparation for Logistic Regression

This notebook prepares the `lc_loans_master_sample.csv` file for a Logistic Regression model.

The key steps are:
1.  **Feature Selection:** Remove redundant/collinear features (like `grade`, `int_rate`).
2.  **Feature Engineering:** Create a `loan_to_income_ratio` feature.
3.  **Train-Test Split:** Split the data *before* any transformations to prevent data leakage.
4.  **Encoding:** Convert all categorical features (`sub_grade`, `term`, `purpose`, etc.) into a numeric format.
5.  **Scaling:** Apply `StandardScaler` to the numerical features. This matters for Logistic Regression because regularization and solver convergence are both sensitive to feature scale.
6.  **Save Output:** Save the final, model-ready `X_train`, `X_test`, `y_train`, and `y_test` files.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import numpy as np
import os

## Configuration
Define the input file (our master sample).

In [None]:
# --- Configuration ---
INPUT_FILE = 'lc_loans_master_sample.csv'

## Step 1: Load and Prepare Data
Load the master sample, create our new engineered feature, and drop the redundant columns to avoid multicollinearity.

In [None]:
# Upload 'lc_loans_master_sample.csv' to Colab
try:
    df = pd.read_csv(INPUT_FILE)
    print(f"Loaded master sample with shape: {df.shape}")
except FileNotFoundError:
    print(f"Error: The file '{INPUT_FILE}' was not found.")
    print("Please upload the file to this Colab session.")
    raise  # Stop here: every step below needs df

# 1. Feature Engineering
# Add 1 to annual_inc to prevent any divide-by-zero errors
df['loan_to_income_ratio'] = df['loan_amnt'] / (df['annual_inc'] + 1)
print("Created 'loan_to_income_ratio' feature.")

# 2. Feature Selection
# Drop redundant/collinear features
features_to_drop = [
    'grade',        # Redundant with sub_grade
    'int_rate',     # Redundant with sub_grade
    'installment',  # Redundant with loan_amnt and term

    # Drop these unhandled text columns
    'pymnt_plan',
    'initial_list_status',
    'application_type',
    'hardship_flag',
    'disbursement_method',
    'debt_settlement_flag'
]
# errors='ignore' silently skips missing columns, so count only those present
dropped = [col for col in features_to_drop if col in df.columns]
df = df.drop(columns=features_to_drop, errors='ignore')
print(f"Dropped {len(dropped)} redundant or unhandled features.")
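
To see what the `+ 1` offset in the ratio does, here is a small sketch on a toy frame. The values are invented for illustration and do not come from the master sample.

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration only.
toy = pd.DataFrame({
    "loan_amnt": [10_000, 5_000, 20_000],
    "annual_inc": [50_000, 0, 80_000],  # note the zero income
})

# Without the offset, the zero-income row divides by zero and becomes inf.
raw_ratio = toy["loan_amnt"] / toy["annual_inc"]

# With the +1 offset used in this notebook, every ratio stays finite.
safe_ratio = toy["loan_amnt"] / (toy["annual_inc"] + 1)

print(raw_ratio.tolist())
print(safe_ratio.round(4).tolist())
```

The zero-income row is now finite but still very large compared with the others, so such rows may deserve a closer look (for example clipping) before scaling.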

## Step 2: Define Features (X) and Target (y)
Separate the data into `X` (our features) and `y` (what we want to predict).

In [None]:
y = df['target']
X = df.drop(columns='target')

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

## Step 3: Train-Test Split
We split the data **before** any encoding or scaling. This is a critical step to prevent "data leakage," which would make our model's test results seem better than they actually are.

We will use `stratify=y` to ensure our 50/50 balance is preserved in both the training and testing sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% for testing, 80% for training
    random_state=42,   # Ensures the split is reproducible
    stratify=y         # Keeps the 50/50 balance in both sets
)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
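
The effect of `stratify` can be checked on a small synthetic example. The sketch below uses an invented 100-row dataset with a 50/50 target and the same split settings as above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with a 50/50 target, invented for illustration.
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series([0, 1] * 50, name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratify, both subsets keep the 50/50 class balance.
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```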

## Step 4: Feature Encoding
Now we convert our remaining text-based columns into numbers.

In [None]:
print("--- Starting Feature Encoding ---")

# --- 4.1 Binary Mapping for 'term' ---
# Assumes 'term' already holds the integers 36 and 60; any other value maps to NaN
term_map = {36: 0, 60: 1}
X_train['term'] = X_train['term'].map(term_map)
X_test['term'] = X_test['term'].map(term_map)
print("Mapped 'term' to 0 (36mo) and 1 (60mo).")

# --- 4.2 Ordinal Encoding for 'sub_grade' ---
# Build the ordered list of all 35 sub-grades from A1 to G5 explicitly, so the
# mapping does not depend on which sub-grades happen to appear in X_train
all_sub_grades = [f"{letter}{number}" for letter in "ABCDEFG" for number in range(1, 6)]
# Create the mapping dictionary (A1 -> 0, ..., G5 -> 34)
sub_grade_map = {grade: i for i, grade in enumerate(all_sub_grades)}

X_train['sub_grade'] = X_train['sub_grade'].map(sub_grade_map)
X_test['sub_grade'] = X_test['sub_grade'].map(sub_grade_map)
print("Mapped 'sub_grade' to ordinal integers (0-34).")

# --- 4.3 One-Hot Encoding for Nominal Features ---
nominal_cols = ['home_ownership', 'purpose', 'verification_status']

# Initialize the OneHotEncoder
# handle_unknown='ignore' tells it to ignore any new categories it might see in the test set
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit and transform the training data
ohe_train = ohe.fit_transform(X_train[nominal_cols])
# Only transform the test data (using the fit from training)
ohe_test = ohe.transform(X_test[nominal_cols])

# Create DataFrames from the OHE arrays, using the feature names
ohe_train_df = pd.DataFrame(ohe_train, columns=ohe.get_feature_names_out(), index=X_train.index)
ohe_test_df = pd.DataFrame(ohe_test, columns=ohe.get_feature_names_out(), index=X_test.index)

# Drop the original text columns and add the new one-hot columns
X_train = pd.concat([X_train.drop(columns=nominal_cols), ohe_train_df], axis=1)
X_test = pd.concat([X_test.drop(columns=nominal_cols), ohe_test_df], axis=1)

print(f"One-hot encoded '{nominal_cols}'. New X_train shape: {X_train.shape}")

## Step 5: Feature Scaling
Lastly, we scale every numeric column that was not one-hot encoded (including the ordinal `sub_grade` and the binary `term`) so that each has a mean of 0 and a standard deviation of 1 on the training set.

In [None]:
print("--- Starting Feature Scaling ---")

# Identify all columns that are NOT one-hot encoded
ohe_cols = ohe.get_feature_names_out()
cols_to_scale = [col for col in X_train.columns if col not in ohe_cols]

# Identify non-numeric columns that are still in cols_to_scale.
# They are left unscaled and stay in X_train/X_test, so they must be
# encoded or dropped before a model is fitted.
non_numeric_cols = X_train[cols_to_scale].select_dtypes(exclude=np.number).columns.tolist()
print(f"Identified non-numeric columns to exclude from scaling: {non_numeric_cols}")

# Exclude the non-numeric columns from cols_to_scale
cols_to_scale = [col for col in cols_to_scale if col not in non_numeric_cols]
print(f"Scaling the following columns: {cols_to_scale}")

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the training data
X_train[cols_to_scale] = scaler.fit_transform(X_train[cols_to_scale])
# Only transform the test data (using the scaler fit from the training data)
X_test[cols_to_scale] = scaler.transform(X_test[cols_to_scale])

print("Applied StandardScaler to all numerical and ordinal features.")
print("\n--- Data Preparation Complete ---")
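
The fit-on-train, transform-on-test pattern can be sketched with synthetic numbers. The values below are randomly generated for illustration; the point is that the test set is scaled with the training mean and standard deviation, so only the training set is guaranteed to come out with a mean of exactly 0 and a standard deviation of exactly 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic values, generated for illustration only.
rng = np.random.default_rng(0)
train_vals = rng.normal(loc=50.0, scale=10.0, size=(200, 1))
test_vals = rng.normal(loc=50.0, scale=10.0, size=(50, 1))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_vals)  # learns mean/std from train
test_scaled = scaler_demo.transform(test_vals)        # reuses the train statistics

print(train_scaled.mean(), train_scaled.std())
print(test_scaled.mean(), test_scaled.std())
```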

## Step 6: Save the Prepared Datasets
The data is now fully prepared for modeling. We'll export the four DataFrames to new CSV files.

In [None]:
# Save the final, processed data
X_train.to_csv('X_train.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

print("Saved X_train, y_train, X_test, and y_test to CSV files.")

print("\n--- Final X_train head: ---")
display(X_train.head())

print("\n--- Final X_train info: ---")
X_train.info()

Finally, we verify that the encoding and scaling were performed as expected.

In [None]:
import pandas as pd

# Load the file you want to check
X_train_check = pd.read_csv('X_train.csv')

# Get the summary statistics
summary_stats_transposed = X_train_check.describe().round(3).T

# Show every row of the summary instead of a truncated view
pd.set_option('display.max_rows', None)

# Display the summary
print("--- Summary Statistics for X_train (Transposed, All Variables) ---")
display(summary_stats_transposed)

# Reset the display option back to default
pd.reset_option('display.max_rows')
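
Reading the summary table by eye works for a handful of columns. As a complement, the check can be automated along the following lines. `check_prepared` is a hypothetical helper, not part of this notebook or of any library, and the demo frame is invented.

```python
import pandas as pd

def check_prepared(frame, scaled_cols, onehot_cols, tol=1e-6):
    """Hypothetical helper: return a list of problems found (empty if none)."""
    problems = []
    for col in scaled_cols:
        mean = frame[col].mean()
        std = frame[col].std(ddof=0)  # StandardScaler uses the population std
        if abs(mean) > tol or abs(std - 1.0) > tol:
            problems.append(f"{col}: mean={mean:.3f}, std={std:.3f}")
    for col in onehot_cols:
        if not frame[col].isin([0, 1]).all():
            problems.append(f"{col}: values other than 0/1")
    if frame.isna().any().any():
        problems.append("missing values present")
    return problems

# Invented demo frame: one correctly scaled column, one broken one-hot column.
demo = pd.DataFrame({
    "loan_amnt": [-1.0, 0.0, 1.0],
    "purpose_car": [1.0, 0.0, 2.0],
})
demo["loan_amnt"] = demo["loan_amnt"] / demo["loan_amnt"].std(ddof=0)

print(check_prepared(demo, ["loan_amnt"], ["purpose_car"]))
```

On the real data, the two column lists would presumably be `cols_to_scale` and `ohe.get_feature_names_out()` from the cells above. The mean and standard deviation checks only make sense for the training set, since the test set is scaled with training statistics.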