# 4. Exploratory Data Analysis (EDA)

This notebook performs Exploratory Data Analysis (EDA) on the sampled dataset. The goal is to visualize the relationships between key predictive features and the loan outcome ('Fully Paid' vs. 'Charged Off').

The analysis is divided into two main parts:
1.  **Bivariate Analysis:** We will plot each key feature (numerical and categorical) against the target variable to see how its distribution changes for good and bad loans. This helps identify the most influential predictors of default.
2.  **Multivariate Analysis:** We will create a correlation heatmap to understand the relationships amongst key numerical features.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

## Configuration and Key Variables

Here we define the input file (master sample), the directory where we'll save our plots, and the specific lists of key variables we want to investigate.

In [None]:
# --- Configuration ---
INPUT_FILE = 'lc_loans_master_sample.csv'
OUTPUT_DIR = 'eda_plots'
# Create the output directory if it does not already exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# --- Define Key Variables for Analysis ---
key_numerical_vars = [
    'loan_amnt',
    'int_rate',
    'installment',
    'annual_inc',
    'dti',
    'fico_range_low',
    'credit_history_length_days',
    'inq_last_6mths'
]

key_categorical_vars = [
    'term',
    'grade',
    'sub_grade',
    'home_ownership',
    'purpose'
]

## Step 1: Load the Master Sample Dataset

We'll load the `lc_loans_master_sample.csv` file created in the sampling step and take a quick look at the first few rows.

In [None]:
# Upload the 'lc_loans_master_sample.csv' file.
try:
    df = pd.read_csv(INPUT_FILE)
    print("Master sample dataset loaded successfully.")
    print(f"Shape of the dataset: {df.shape}")
    display(df.head())
except FileNotFoundError:
    print(f"Error: The file '{INPUT_FILE}' was not found.")
    print("Please make sure you've uploaded the file to this Colab session.")

# Set plot style for all visualizations
sns.set_style("whitegrid")
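
Before plotting, it can help to confirm that every key variable is actually present and to check how balanced the two outcomes are. The sketch below uses a tiny made-up DataFrame in place of the master sample; the helper name `check_columns` is hypothetical, and the `target` coding (0 = Fully Paid, 1 = Charged Off) is the one used in the bivariate plots.

```python
import pandas as pd

def check_columns(frame, required):
    """Return the required column names that are missing from `frame`."""
    return [c for c in required if c not in frame.columns]

# Tiny made-up stand-in for the master sample (values are illustrative only)
demo = pd.DataFrame({
    'loan_amnt': [5000, 12000, 8000, 20000],
    'int_rate': [7.5, 13.2, 10.1, 18.9],
    'target': [0, 0, 1, 0],
})

missing = check_columns(demo, ['loan_amnt', 'int_rate', 'dti', 'target'])
print(f"Missing columns: {missing}")

# Share of each outcome (0 = Fully Paid, 1 = Charged Off)
class_share = demo['target'].value_counts(normalize=True).sort_index()
print(class_share)
```

On the real data, the same check could be run with `key_numerical_vars + key_categorical_vars + ['target']` as the required list.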

## Step 2: Bivariate Analysis (Key Features vs. Target)

We will loop through our key variables and create plots to visualize their relationship with the loan outcome. This will help us form hypotheses about which factors are most predictive of default.

### Numerical Features vs. Target

We'll use box plots to compare the distributions of our key numerical features for 'Fully Paid' loans (target=0) versus 'Charged Off' loans (target=1).

In [None]:
print(f"--- Generating Bivariate Plots for {len(key_numerical_vars)} Key Numerical Variables ---")

for col in key_numerical_vars:
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=df, x='target', y=col)
    plt.title(f'{col.replace("_", " ").title()} vs. Loan Outcome', fontsize=16)
    plt.xticks([0, 1], ['Fully Paid', 'Charged Off'])

    # Save the plot to the directory
    plt.savefig(os.path.join(OUTPUT_DIR, f'bivariate_boxplot_{col}.png'))

    # Display the plot directly in the notebook
    plt.show()
    plt.close()

The `annual_inc` box plot above is unreadable because a few extreme outliers compress the boxes into a thin line. To get a better view of the distribution for the majority of borrowers, we create a new plot that filters out the top 1% of incomes.

In [None]:
# Calculate the 99th percentile
cutoff = df['annual_inc'].quantile(0.99)

# Create a temporary DataFrame for plotting
df_filtered = df[df['annual_inc'] < cutoff]

# Create the new, "zoomed-in" box plot excluding extreme outliers
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_filtered, x='target', y='annual_inc')
plt.title('Annual Income (Excluding Top 1%) vs. Loan Outcome')
plt.xticks([0, 1], ['Fully Paid', 'Charged Off'])
plt.show()
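
Trimming at a percentile discards real loans, so it is worth checking how many rows the filter removes from each outcome group. A minimal sketch on made-up data (the `demo` frame and its values are assumptions for illustration, not the loan sample):

```python
import pandas as pd

# Made-up incomes with one extreme value; not real loan data
demo = pd.DataFrame({
    'annual_inc': [40_000, 52_000, 61_000, 75_000, 48_000, 9_000_000],
    'target':     [0, 1, 0, 0, 1, 0],
})

cutoff = demo['annual_inc'].quantile(0.99)
kept = demo[demo['annual_inc'] < cutoff]

# Rows removed per outcome (0 = Fully Paid, 1 = Charged Off)
removed = (
    demo['target'].value_counts()
    .sub(kept['target'].value_counts(), fill_value=0)
    .astype(int)
    .sort_index()
)
print(f"Cutoff: {cutoff:,.0f}")
print(removed)
```

If the removed rows were concentrated in one outcome, the zoomed-in plot would give a skewed picture of that group.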

The `dti` box plot is also compressed by extreme outliers, including some unexpected negative values and a few very high values. We will create a "zoomed-in" plot by filtering the data to show the primary distribution (e.g., from 0 up to a DTI of 100).

In [None]:
# Filter for a reasonable DTI range.
# Set the lower limit to 0, as DTI cannot be negative.
# Set the upper limit to 100, as anything above that is treated as an anomaly.
cutoff_low = 0
cutoff_high = 100

# Create a temporary DataFrame for plotting, filtering out the extremes
df_filtered_dti = df[(df['dti'] >= cutoff_low) & (df['dti'] < cutoff_high)]

# Create the new, "zoomed-in" box plot excluding extreme outliers
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_filtered_dti, x='target', y='dti')
plt.title('DTI (Filtered 0-100) vs. Loan Outcome')
plt.xticks([0, 1], ['Fully Paid', 'Charged Off'])
plt.show()
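
Before filtering, it may be useful to count how many `dti` values fall outside the plotted range, since negative or very large ratios could point to data-entry issues rather than genuine borrowers. The sketch below runs on a small made-up Series; the bucket labels are illustrative.

```python
import pandas as pd

# Made-up DTI values, including a negative, a missing, and an extreme entry
dti = pd.Series([12.4, 18.0, -1.0, 35.2, 999.0, None, 27.5], name='dti')

# Count values in each bucket; NaN compares as False, so it is counted separately
summary = pd.Series({
    'negative': int((dti < 0).sum()),
    'in_range_0_to_100': int(((dti >= 0) & (dti < 100)).sum()),
    'at_or_above_100': int((dti >= 100).sum()),
    'missing': int(dti.isna().sum()),
})
print(summary)
```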

### Categorical Features vs. Target

Now we'll use count plots to see how the loan outcomes are distributed across our key categorical features.

In [None]:
print(f"\n--- Generating Bivariate Plots for {len(key_categorical_vars)} Key Categorical Variables ---")

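# Add a labeled copy of the target so the legend text always matches the bar colors;
# relabeling afterwards with plt.legend(labels=...) can pair labels with the wrong bars
plot_df = df.assign(Outcome=df['target'].map({0: 'Fully Paid', 1: 'Charged Off'}))

for col in key_categorical_vars:
    plt.figure(figsize=(12, 7))

    # Use a horizontal plot (y-axis) for categories with many or long labels
    axis = 'y' if df[col].nunique() > 5 else 'x'
    sns.countplot(
        data=plot_df,
        hue='Outcome',
        hue_order=['Fully Paid', 'Charged Off'],
        order=sorted(df[col].dropna().unique()),  # dropna keeps NaN out of the sort
        **{axis: col},
    )

    plt.title(f'Loan Outcome by {col.replace("_", " ").title()}', fontsize=16)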
    plt.tight_layout()

    # Save the plot
    plt.savefig(os.path.join(OUTPUT_DIR, f'bivariate_countplot_{col}.png'))

    # Display the plot
    plt.show()
    plt.close()
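
Count plots show volume, but categories of very different sizes can hide differences in default rates. Because `target` is coded 0/1, its mean within each category is the charge-off rate. A minimal sketch on made-up data (the grades and outcomes below are invented for illustration):

```python
import pandas as pd

# Made-up grades and outcomes (0 = Fully Paid, 1 = Charged Off)
demo = pd.DataFrame({
    'grade':  ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'],
    'target': [0, 0, 0, 1, 0, 1, 1, 1],
})

# Loan count and charge-off rate per category, highest rate first
rates = (
    demo.groupby('grade')['target']
    .agg(loans='count', charge_off_rate='mean')
    .sort_values('charge_off_rate', ascending=False)
)
print(rates)
```

Keeping the loan count next to the rate makes it easier to spot categories whose rate rests on only a handful of loans.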

## Step 3: Multivariate Analysis (Correlation Between Key Features)

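Finally, we'll create a correlation heatmap. This helps us understand the pairwise linear relationships between our key numerical features. A correlation close to 1.0 or -1.0 suggests that two features carry largely the same information (collinearity), which can make coefficient estimates unstable in linear models such as logistic regression; tree-based models are generally less sensitive to it.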

In [None]:
print("\n--- Generating Focused Correlation Heatmap ---")

plt.figure(figsize=(12, 10))
key_numeric_df = df[key_numerical_vars]
corr_matrix = key_numeric_df.corr()

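# A diverging colormap centered on 0 separates positive from negative correlations
sns.heatmap(corr_matrix, cmap='coolwarm', vmin=-1, vmax=1, center=0, annot=True, fmt=".2f")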
plt.title('Correlation Matrix of Key Numerical Features', fontsize=16)
plt.tight_layout()

# Save the plot
plt.savefig(os.path.join(OUTPUT_DIR, 'multivariate_heatmap_key_features.png'))

# Display the plot
plt.show()
plt.close()

print(f"\n--- Focused EDA Complete. All plots have been saved to the '{OUTPUT_DIR}' directory. ---")
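
A large heatmap can be hard to scan, so one option is to list only the feature pairs whose absolute correlation exceeds a threshold. The sketch below uses made-up data, and both the helper name `strong_pairs` and the 0.8 threshold are arbitrary choices for illustration rather than a rule from this analysis.

```python
import numpy as np
import pandas as pd

def strong_pairs(corr, threshold=0.8):
    """Return feature pairs with |correlation| >= threshold, strongest first."""
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # The comparison is False for NaN, so any masked entries are filtered out here too
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Made-up data: installment is built to track loan_amnt closely
demo = pd.DataFrame({
    'loan_amnt':   [5000, 10000, 15000, 20000, 25000],
    'installment': [160, 330, 480, 650, 800],
    'dti':         [30.0, 10.0, 25.0, 12.0, 20.0],
})

print(strong_pairs(demo.corr()))
```

In the notebook, the same helper could be applied to `corr_matrix` directly.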

## Grade, Sub-Grade, and Interest Rate Relationship

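Our hypothesis is that `int_rate` is largely determined by the `sub_grade` assigned to the loan. As a first check, the following script loads the full **cleaned** dataset, groups it by `grade` and `sub_grade`, and calculates the mean interest rate and the loan count for each group. A mean rate that rises steadily from one sub-grade to the next would be consistent with the hypothesis, although a mean by itself cannot prove that the sub-grade fixes the rate.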

In [None]:
# Upload 'lc_loans_cleaned.csv' to this Colab session
try:
    df_cleaned = pd.read_csv('lc_loans_cleaned.csv')
    print("Full cleaned dataset loaded.")

    # Group by grade and sub_grade, and calculate mean interest rate and the count
    grade_analysis = df_cleaned.groupby(['grade', 'sub_grade'])['int_rate'].agg(
        mean_int_rate='mean',
        count='count'
    ).reset_index()

    # Format the mean interest rate to two decimal places
    grade_analysis['mean_int_rate'] = grade_analysis['mean_int_rate'].map('{:.2f}'.format)

    # Set pandas options to display all 35 rows
    pd.set_option('display.max_rows', None)

    print("\nRelationship between Grade, Sub-Grade, Mean Interest Rate, and Count:")

    # Rename the columns for a cleaner final print
    grade_analysis = grade_analysis.rename(columns={
        'mean_int_rate': 'Mean Interest Rate',
        'count': 'Count'
    })

    print(grade_analysis.to_string(index=False))

    # Reset display options
    pd.reset_option('display.max_rows')

except FileNotFoundError:
    print("Error: 'lc_loans_cleaned.csv' not found.")
    print("Please upload the cleaned dataset to this Colab session to run this analysis.")
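
A mean alone cannot show whether `sub_grade` fixes the rate, because quite different rates can average to the same value. One way to probe the hypothesis is to look at the spread within each sub-grade: if the rate were fully determined, the minimum and maximum would coincide. The sketch below uses made-up rates; on real data, rates for a given sub-grade may also have shifted over time, so some spread would not by itself refute the hypothesis.

```python
import pandas as pd

# Made-up rates: A1 is constant, A2 varies
demo = pd.DataFrame({
    'sub_grade': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2'],
    'int_rate':  [5.32, 5.32, 5.32, 6.03, 6.49, 6.24],
})

# Min, max, and number of distinct rates within each sub-grade
spread = demo.groupby('sub_grade')['int_rate'].agg(['min', 'max', 'nunique'])
spread['range'] = spread['max'] - spread['min']
print(spread)
```

A `range` of zero (or `nunique` of 1) for every sub-grade would support a one-to-one mapping; wide ranges would suggest other factors, such as issue date, also play a role.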

In [None]:
!zip -r eda_plots.zip eda_plots