# 6.1. Model Training: Logistic Regression

This notebook uses the preprocessed, scaled, and balanced data from our previous step to train our first machine learning model: **Logistic Regression**.

The process will be:
1.  **Load** the `X_train`, `y_train`, `X_test`, and `y_test` files.
2.  **Initialize** the Logistic Regression model.
3.  **Train** the model on the `_train` data.
4.  **Evaluate** the model's performance on the unseen `_test` data to see how well it learned to predict loan defaults.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    ConfusionMatrixDisplay
)

## Step 1: Load the Prepared Datasets

First, we load our four pre-processed data files `X_train.csv`, `y_train.csv`, `X_test.csv`, and `y_test.csv`.

In [None]:
try:
    X_train = pd.read_csv('X_train.csv')
    y_train = pd.read_csv('y_train.csv')
    X_test = pd.read_csv('X_test.csv')
    y_test = pd.read_csv('y_test.csv')

    print("All training and testing data loaded successfully.")
    print(f"X_train shape: {X_train.shape}")
    print(f"X_test shape: {X_test.shape}")
except FileNotFoundError:
    print("Error: Could not find all four .csv files.")
    print("Please upload X_train, y_train, X_test, and y_test to this Colab session.")
    raise  # Stop here: the cells below cannot run without the data

## Step 2: Initialize and Train the Model

Next, we create an instance of `LogisticRegression` and train it by calling its `.fit()` method. During training, the model finds the feature weights (coefficients) that best map the rows of `X_train` to the labels in `y_train`.
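As a rough illustration of what `.fit()` is learning: logistic regression computes a weighted sum of the features, passes it through the sigmoid function, and reads the result as the probability of class 1. The sketch below uses made-up numbers; `weights`, `bias`, and `row` are hypothetical and are not values from our model.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(row, weights, bias):
    """Probability of class 1 for one row of (already scaled) features."""
    z = bias + sum(w * x for w, x in zip(weights, row))
    return sigmoid(z)

# Hypothetical values for illustration only
weights = [0.8, -0.5]
bias = -0.2
row = [1.0, 2.0]

probability = predict_probability(row, weights, bias)
print(f"Estimated probability of default: {probability:.3f}")
```

Training amounts to searching for the `weights` and `bias` that make these probabilities agree with the known labels as closely as possible.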

In [None]:
# Initialize the model
model = LogisticRegression(random_state=42, max_iter=1000)

# Check data types in X_train
print("Checking data types in X_train:")
print(X_train.dtypes)

# Identify and remove non-numeric columns, if any (the model needs numeric input)
non_numeric_cols = X_train.select_dtypes(exclude=[np.number]).columns
if not non_numeric_cols.empty:
    print(f"\nRemoving non-numeric columns from X_train: {list(non_numeric_cols)}")
    X_train = X_train.drop(columns=non_numeric_cols)
    # Apply the same change to X_test so both sets keep identical columns
    X_test = X_test.drop(columns=non_numeric_cols)
    print("Non-numeric columns removed.")
else:
    print("\nNo non-numeric columns found in X_train.")

# Train the model (ravel() flattens the single-column y_train into a 1-D array)
print("\nTraining the Logistic Regression model...")
model.fit(X_train, y_train.values.ravel())

print("Model training complete.")

## Step 3: Make Predictions on the Test Data

Now that the model is trained, we'll use it to make predictions on the unseen `X_test` data.
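For a binary problem, `model.predict()` returns whichever class has the higher predicted probability, which amounts to a 0.5 cutoff on the probability of class 1. A minimal sketch of that rule, using made-up probabilities (`to_labels` and `example_probabilities` are hypothetical names, not part of scikit-learn):

```python
def to_labels(probabilities, threshold=0.5):
    """Convert class-1 probabilities into 0/1 labels using a cutoff."""
    return [1 if p > threshold else 0 for p in probabilities]

# Hypothetical probabilities for illustration only
example_probabilities = [0.10, 0.45, 0.55, 0.90]
example_labels = to_labels(example_probabilities)
print(example_labels)  # prints [0, 0, 1, 1]
```

In this notebook, `model.predict_proba(X_test)[:, 1]` would give the class-1 probabilities, so a different cutoff could be applied the same way if the default 0.5 turns out to be a poor fit.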

In [None]:
# The model will look at every row in X_test and predict 0 (Fully Paid) or 1 (Charged Off)
predictions = model.predict(X_test)

print("Predictions made on the test set.")

## Step 4: Evaluate Model Performance - The "Report Card"

We then compare the model's `predictions` to the *true answers* in `y_test`.

### 1. Accuracy
First, we'll look at **Accuracy**, which is the percentage of the total predictions the model got right.

In [None]:
# Compare the model's predictions to the actual answers
accuracy = accuracy_score(y_test, predictions)

print(f"Model Accuracy: {accuracy * 100:.2f}%")

### 2. Classification Report (Precision, Recall, and F1-Score)

Accuracy alone can be misleading, especially in credit risk, where the two kinds of error have very different costs. We need to know what kinds of correct and incorrect predictions the model is making. The **Classification Report** gives us these crucial metrics.

* **Precision (Class 1):** Of all the loans the model *predicted* would default, what percentage actually did? **(High precision = The model is trustworthy when it flags a loan as bad).**
* **Recall (Class 1):** Of all the loans that *actually* defaulted, what percentage did the model successfully catch? **(High recall = The model is good at finding most of the bad loans).**
* **F1-Score:** The harmonic mean of Precision and Recall. It is high only when both are high.
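To see how these metrics can disagree with accuracy, here is a small worked example computed by hand. The counts are hypothetical and are not results from our model: accuracy comes out at 92% even though only 30% of the actual defaults are caught.

```python
# Hypothetical counts for illustration only (not results from our model)
tp = 30    # defaults correctly flagged
fp = 10    # good loans wrongly flagged as defaults
fn = 70    # defaults the model missed
tn = 890   # good loans correctly passed

example_accuracy = (tp + tn) / (tp + fp + fn + tn)
example_precision = tp / (tp + fp)
example_recall = tp / (tp + fn)
example_f1 = (
    2 * example_precision * example_recall
    / (example_precision + example_recall)
)

print(f"Accuracy:  {example_accuracy:.2%}")
print(f"Precision: {example_precision:.2%}")
print(f"Recall:    {example_recall:.2%}")
print(f"F1-Score:  {example_f1:.3f}")
```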

In [None]:
# Generate the full classification report
report = classification_report(y_test, predictions, target_names=['Fully Paid (0)', 'Charged Off (1)'])

print("--- Classification Report ---")
print(report)

### 3. The Confusion Matrix

The **Confusion Matrix** is a visual breakdown of all predictions. It shows us the *four types* of predictions the model made.

* **True Negatives (Top-Left):** Correctly predicted 'Fully Paid'.
* **True Positives (Bottom-Right):** Correctly predicted 'Charged Off'.
* **False Positives (Top-Right):** *Incorrectly* labeled a good loan as 'Charged Off'. (A "safe" error: the model denied a good applicant.)
* **False Negatives (Bottom-Left):** *Incorrectly* labeled a bad loan as 'Fully Paid'. (A "costly" error: the model approved a loan that will default.)
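The four cells can also be counted by hand, which makes the definitions concrete. This is a minimal sketch on toy labels; `confusion_counts`, `toy_true`, and `toy_pred` are hypothetical names and the data is made up.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = 'Charged Off'."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1  # good loan, predicted good
        elif actual == 0 and predicted == 1:
            fp += 1  # good loan, predicted default
        elif actual == 1 and predicted == 0:
            fn += 1  # default, predicted good
        else:
            tp += 1  # default, predicted default
    return tn, fp, fn, tp

# Toy labels for illustration only
toy_true = [0, 0, 0, 1, 1, 1, 0, 1]
toy_pred = [0, 1, 0, 1, 0, 1, 0, 0]

tn, fp, fn, tp = confusion_counts(toy_true, toy_pred)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # prints TN=3, FP=1, FN=2, TP=2
```

With scikit-learn, `confusion_matrix(y_test, predictions).ravel()` from `sklearn.metrics` returns the same four numbers in the order `tn, fp, fn, tp` for a binary problem.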

In [None]:
print("--- Confusion Matrix ---")

# Generate and display the Confusion Matrix
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_predictions(
    y_test,
    predictions,
    ax=ax,
    display_labels=['Fully Paid', 'Charged Off'],
    cmap='Blues'
)
plt.title("Confusion Matrix")
plt.show()

## Step 5: Interpret the Model (Feature Importance)

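Finally, we look at which features the Logistic Regression model found most predictive by pulling the coefficients (weights) it assigned to each feature. Because the features were scaled in the previous step, the coefficient magnitudes can be compared with one another.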

* **Large Positive Coefficient:** This feature strongly predicts a **Default (1)**.
* **Large Negative Coefficient:** This feature strongly predicts a **Full Repayment (0)**.
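One way to make a coefficient more concrete: each coefficient is the change in the log-odds of default for a one-unit increase in that feature, so `exp(coefficient)` is the factor by which the odds are multiplied. Since our features were scaled, "one unit" presumably corresponds to one standard deviation, which is an assumption about the earlier scaling step. The feature names and coefficient values below are made up.

```python
import math

# Hypothetical coefficients for illustration only (not from our model)
example_coefficients = {"feature_a": 0.7, "feature_b": -0.4, "feature_c": 0.0}

# exp(coefficient) turns a log-odds change into an odds multiplier
odds_ratios = {
    name: math.exp(coef) for name, coef in example_coefficients.items()
}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds of default multiplied by {ratio:.2f} per one-unit increase")
```

A ratio above 1 raises the odds of default, a ratio below 1 lowers them, and a ratio of exactly 1 means the feature has no effect.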

In [None]:
import os

# --- Define the Output Directory ---
OUTPUT_DIR = "logistic_regression_results"

# Create the output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Get the coefficients from the trained model
coefficients = model.coef_[0]

# Get the feature names from the X_train columns
feature_names = X_train.columns

# Create a DataFrame to see them clearly
importance_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Sort by coefficient to see the most influential features
importance_df = importance_df.sort_values(by='Coefficient', ascending=False)

print("--- Top 10 Features Predicting DEFAULT (High Risk) ---")
display(importance_df.head(10))

print("\n--- Top 10 Features Predicting FULL REPAYMENT (Low Risk) ---")
display(importance_df.tail(10).sort_values(by='Coefficient', ascending=True))

# --- Plot the Top 10 Features Predicting DEFAULT (High Risk) ---
plt.figure(figsize=(10, 8))
sns.barplot(
    x='Coefficient',
    y='Feature',
    data=importance_df.head(10)
)
plt.title('Top 10 Features Predicting Default (Logistic Regression)')
plt.tight_layout()  # Keep long feature names from being cut off in the saved image
plt.savefig(os.path.join(OUTPUT_DIR, "lr_feature_importance_plot.png"))
plt.show()