# 3. Creating the Master Balanced Sample (Stratified Undersampling)

This notebook creates a balanced "master sample" dataset. This dataset will be the foundation for training and testing all models.

## Strategy: Undersampling the Majority Class

The cleaned dataset (`lc_loans_cleaned.csv`) is imbalanced, with far more "Fully Paid" loans (target=0) than "Charged Off" loans (target=1).

To build a model that learns to recognize default, we will create a 1:1 balanced dataset by:
1.  Taking **all** observations from the minority class (every "Charged Off" loan).
2.  Randomly sampling an **equal number** of observations from the majority class ("Fully Paid" loans).
3.  Combining these two sets into a single, balanced DataFrame.

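This gives us the largest 1:1 balanced dataset that undersampling can produce, and it keeps every example of a default in the data. The trade-off is that the "Fully Paid" loans not drawn in step 2 are left out of the master sample.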
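
The three steps above can be sketched on a tiny made-up DataFrame before running them on the real data. The `toy` frame and its `loan_id` column are illustrative assumptions, not part of `lc_loans_cleaned.csv`; only the `target` column name comes from this notebook.

```python
import pandas as pd

# Toy data standing in for the real loans (assumed values):
# 8 "Fully Paid" rows (target=0) and 3 "Charged Off" rows (target=1).
toy = pd.DataFrame({
    "loan_id": range(11),
    "target": [0] * 8 + [1] * 3,
})

# Step 1: keep every minority-class row.
minority = toy[toy["target"] == 1]

# Step 2: draw an equal number of majority-class rows at random.
majority = toy[toy["target"] == 0].sample(n=len(minority), random_state=0)

# Step 3: combine the two sets into one balanced frame.
balanced = pd.concat([minority, majority])

print(balanced["target"].value_counts().to_dict())
print(f"Majority rows left out: {(toy['target'] == 0).sum() - len(majority)}")
```

In this toy case the balanced frame has 3 rows per class, and 5 of the 8 majority rows are left out. The same proportions apply, at scale, to the real dataset.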

In [None]:
import pandas as pd

## Configuration

Define the input file (cleaned dataset) and the new output file for the master sample.

In [None]:
# --- Configuration ---
# The cleaned dataset from preprocessing
INPUT_FILE = 'lc_loans_cleaned.csv'

# The new, large, and balanced master sample
OUTPUT_FILE = 'lc_loans_master_sample.csv'

## Step 1: Load the Cleaned Dataset

First, we load the `lc_loans_cleaned.csv` file.

In [None]:
print("1. Loading the cleaned dataset...")
try:
    df = pd.read_csv(INPUT_FILE)
except FileNotFoundError:
    print(f"Error: The file '{INPUT_FILE}' was not found.")
    print("Please make sure you've uploaded the file to this Colab session.")
    raise  # Stop here: every later cell depends on `df`

print(f"   - Shape of the cleaned dataset: {df.shape}")
print("\n   - Original class distribution:")
print(df['target'].value_counts())

## Step 2: Separate Data into Strata

Next, we split the DataFrame into two separate groups, or "strata," based on the target variable.
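
Splitting on `target == 0` and `target == 1` silently drops any row whose label is missing or has some other value. A minimal sketch of a check that could run before the split is shown here; `check_binary_target` is a hypothetical helper introduced for illustration, and the toy frame is not the real loans data.

```python
import pandas as pd

def check_binary_target(frame: pd.DataFrame, column: str = "target") -> None:
    """Raise ValueError if `column` has missing values or labels other than 0/1."""
    if frame[column].isna().any():
        raise ValueError(f"'{column}' contains missing values")
    unexpected = set(frame[column].unique()) - {0, 1}
    if unexpected:
        raise ValueError(f"'{column}' contains unexpected labels: {sorted(unexpected)}")

# Toy example (assumed values): passes silently because every label is 0 or 1.
toy = pd.DataFrame({"target": [0, 1, 0, 0]})
check_binary_target(toy)
```

If the check passes, the two strata together account for every row in the DataFrame.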

In [None]:
# target = 0 -> Fully Paid (majority class)
df_majority = df[df['target'] == 0]

# target = 1 -> Charged Off (minority class)
df_minority = df[df['target'] == 1]

print(f"   - Found {len(df_majority):,} majority class (target=0) loans.")
print(f"   - Found {len(df_minority):,} minority class (target=1) loans.")

## Step 3: Sample the Majority Class and Combine

Now we perform the undersampling. We'll find out how many minority-class observations we have (`N`), and then sample exactly `N` observations from the majority class.
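
As an alternative sketch, pandas can draw the same number of rows from every class in one call with `DataFrameGroupBy.sample` (available in pandas 1.1 and later). This is not the approach used in the cell below, and the toy frame with its `loan_id` column is assumed for illustration. Because the minority class has exactly `N` rows, sampling `N` without replacement returns all of them.

```python
import pandas as pd

# Toy data (assumed values): 7 majority rows and 3 minority rows.
toy = pd.DataFrame({
    "loan_id": range(10),
    "target": [0] * 7 + [1] * 3,
})

# Size of the smallest class, converted to a plain Python int.
n_minority = int(toy["target"].value_counts().min())

# Draw n_minority rows from each class; the minority class is kept whole.
balanced = toy.groupby("target").sample(n=n_minority, random_state=42)

print(balanced["target"].value_counts().to_dict())
```

The explicit two-DataFrame version used in this notebook is longer but makes each step easy to inspect, which can be preferable in a tutorial setting.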

In [None]:
# Get the size of the minority class
N = len(df_minority)

print(f"1. The minority class has {N:,} observations.")
print(f"2. Sampling {N:,} observations from the majority class...")

# Randomly sample the majority class to match the minority class size
df_majority_sampled = df_majority.sample(n=N, random_state=42)

print(f"   - Sampling complete. We now have {len(df_majority_sampled):,} sampled majority-class (Fully Paid) loans.")

# Combine the full minority DataFrame with the sampled majority DataFrame
df_master_sample = pd.concat([df_majority_sampled, df_minority])
print("\n3. Combined the two DataFrames.")

## Final Step: Shuffle, Save, and Verify

After concatenation, every sampled "Fully Paid" row sits ahead of every "Charged Off" row, so we shuffle the combined dataset to remove that ordering. Then we save it to its final file and check the class counts to confirm a 1:1 balance.
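
One way to make the verification stricter is to reload the saved file and assert on the counts instead of reading them by eye. The sketch below does this on a toy frame written to a temporary directory; the file name `toy_master_sample.csv` and the `loan_id` column are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Toy balanced frame standing in for df_master_sample (assumed values).
toy = pd.DataFrame({"loan_id": range(6), "target": [0, 0, 0, 1, 1, 1]})

# Shuffle the rows and rebuild a clean 0..n-1 index.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Write to a throwaway location, then read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_master_sample.csv")
    shuffled.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Fail loudly if the saved file is unbalanced or changed shape on disk.
counts = reloaded["target"].value_counts()
assert counts[0] == counts[1], "classes are not balanced"
assert reloaded.shape == shuffled.shape, "row or column count changed on disk"
```

The same two assertions could be applied to `OUTPUT_FILE` after the save cell has run.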

In [None]:
# Shuffle the final combined DataFrame
df_master_sample = df_master_sample.sample(frac=1, random_state=42).reset_index(drop=True)
print("1. Shuffled the master sample.")

# Save the final balanced sample to a new CSV file
df_master_sample.to_csv(OUTPUT_FILE, index=False)
print(f"\n2. Success! Master sample has been saved to '{OUTPUT_FILE}'")
print("   You can find this file in the Colab file browser on the left.")

# Verify the final shape and distribution
print(f"\n   - Final shape of the master sample: {df_master_sample.shape}")
print("\n   - Final distribution of the target variable:")
print(df_master_sample['target'].value_counts())