<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/example-notebooks/23_logistic_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Logistic Regression for Classification

This notebook contains code examples from the **Introduction to Logistic Regression for Classification** chapter (Chapter 23) of the BANA 4080 textbook. Follow along to practice building classification models using logistic regression with pandas, scikit-learn, and Python.

## 📚 Chapter Overview

This chapter introduces logistic regression, the foundational algorithm for classification problems in business. You'll learn how to predict categories (like Yes/No, Default/No Default) instead of continuous numbers, and understand how to interpret probability-based predictions for business decision-making.

## 🎯 What You'll Practice

- Understand why linear regression fails for classification and how logistic regression solves this
- Build and interpret simple and multiple logistic regression models using scikit-learn
- Work with probabilities, odds, and log-odds in business contexts
- Make probability-based predictions and apply the 0.5 classification threshold
- Use proper train/test splits to evaluate classification model performance
- Recognize practical considerations like class imbalance in real datasets

## 💡 How to Use This Notebook

1. **Read the chapter first** - This notebook supplements the textbook; it does not replace it
2. **Run cells sequentially** - Code builds on previous examples
3. **Experiment freely** - Modify code to test your understanding
4. **Practice variations** - Try different approaches to reinforce learning

## Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from ISLP import load_data

# Suppress numerical warnings for cleaner output
warnings.filterwarnings('ignore', category=RuntimeWarning)

# Load the Default dataset used in this chapter
Default = load_data('Default')
print("Default dataset shape:", Default.shape)
print("\nFirst few rows:")
Default.head()

## Why Linear Regression Fails for Classification

Let's first explore why we can't use linear regression for classification problems by creating some sample credit default data and seeing what happens.

In [None]:
# Create sample credit default data to demonstrate the problem
np.random.seed(42)
balances = np.linspace(0, 3000, 100)
# Higher balances increase default probability
probabilities = 1 / (1 + np.exp(-(balances - 1500) / 300))
defaults = np.random.binomial(1, probabilities)

default_data = pd.DataFrame({
    'balance': balances,
    'default': defaults
})

print("Sample of credit default data:")
print(default_data.head(10))

In [None]:
# Try linear regression on classification data
X = default_data[['balance']]
y = default_data['default']

linear_model = LinearRegression()
linear_model.fit(X, y)
linear_predictions = linear_model.predict(X)

# Visualize the problem
plt.figure(figsize=(10, 6))
plt.scatter(default_data['balance'], default_data['default'], alpha=0.6, label='Actual data')
plt.plot(default_data['balance'], linear_predictions, color='red', linewidth=2, label='Linear regression')
plt.xlabel('Credit Card Balance ($)')
plt.ylabel('Default (0=No, 1=Yes)')
plt.title('Why Linear Regression Fails for Classification')
plt.legend()
plt.grid(True, alpha=0.3)

# Highlight the problems
plt.axhline(y=0, color='black', linestyle='--', alpha=0.5)
plt.axhline(y=1, color='black', linestyle='--', alpha=0.5)
plt.show()

print(f"Linear regression predictions range from {linear_predictions.min():.2f} to {linear_predictions.max():.2f}")
print("But we need probabilities between 0 and 1!")

### 🏃‍♂️ Try It Yourself

Examine the linear regression results above. What specific problems do you see with using linear regression for classification? List at least 2 issues with the predictions.

In [None]:
# Your observations here
# Problem 1:
# Problem 2:
# Problem 3:

## Understanding the Logistic Function

The logistic function solves these problems by transforming any real number into a value between 0 and 1, creating the characteristic S-shaped curve.
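Written out, the transformation looks like this. It is the standard logistic (sigmoid) function; $z$ and the $\beta$ coefficients follow the axis label used in the plot in this section, and $p$ is shorthand introduced here for the predicted probability.

$$
p = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots
$$

When $z = 0$ the formula gives $p = 0.5$. Large negative values of $z$ push $p$ toward 0, and large positive values push it toward 1, which is where the S shape comes from.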

In [None]:
# Demonstrate the logistic function
z_values = np.linspace(-6, 6, 100)
probabilities = 1 / (1 + np.exp(-z_values))

plt.figure(figsize=(10, 6))
plt.plot(z_values, probabilities, linewidth=3, color='blue')
plt.xlabel('z (linear combination: β₀ + β₁x₁ + β₂x₂ + ...)')
plt.ylabel('Probability')
plt.title('The Logistic Function: Transforming Linear Predictions to Probabilities')
plt.grid(True, alpha=0.3)

# Highlight key points
plt.axhline(y=0.5, color='red', linestyle='--', alpha=0.7, label='50% probability threshold')
plt.axhline(y=0, color='black', linestyle='-', alpha=0.3)
plt.axhline(y=1, color='black', linestyle='-', alpha=0.3)
plt.legend()

# Add annotations
plt.annotate('Approaches 0\n(Very Low Probability)', xy=(-4, 0.02), xytext=(-5, 0.2),
            arrowprops=dict(arrowstyle='->', color='gray'), fontsize=10)
plt.annotate('Approaches 1\n(Very High Probability)', xy=(4, 0.98), xytext=(3, 0.8),
            arrowprops=dict(arrowstyle='->', color='gray'), fontsize=10)
plt.annotate('50% Decision\nBoundary', xy=(0, 0.5), xytext=(1, 0.6),
            arrowprops=dict(arrowstyle='->', color='red'), fontsize=10)
plt.show()

In [None]:
# Understanding Probability vs. Odds vs. Log-odds
probabilities = [0.1, 0.2, 0.5, 0.8, 0.9]
odds = [p/(1-p) for p in probabilities]
log_odds = [np.log(o) for o in odds]

comparison_df = pd.DataFrame({
    'Probability': probabilities,
    'Odds': odds,
    'Log-odds': log_odds,
    'Business_Interpretation': [
        'Very unlikely event (10% chance)',
        'Unlikely event (20% chance)',
        'Neutral/uncertain (50-50 chance)',
        'Likely event (80% chance)',
        'Very likely event (90% chance)'
    ]
})

print("Understanding Probability vs. Odds vs. Log-odds:")
print(comparison_df.round(3))
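The three scales in the table are different views of the same number, so you can move between them in either direction. Here is a minimal sketch of the round trip using the 80% row. The helper names (`to_odds`, `to_log_odds`, `from_log_odds`) are made up for this illustration and are not part of any library.

```python
import math

# Hypothetical helper functions, defined here only for illustration
def to_odds(p):
    """Convert a probability (0 < p < 1) to odds."""
    return p / (1 - p)

def to_log_odds(p):
    """Convert a probability to log-odds (the logit)."""
    return math.log(to_odds(p))

def from_log_odds(z):
    """Invert the logit with the logistic function."""
    return 1 / (1 + math.exp(-z))

p_example = 0.8                           # the 80% row from the table
odds_example = to_odds(p_example)         # 4-to-1 in favor of the event
logit_example = to_log_odds(p_example)    # natural log of the odds
p_recovered = from_log_odds(logit_example)

print(f"odds = {odds_example:.3f}, log-odds = {logit_example:.3f}, back to p = {p_recovered:.3f}")
```

A probability of 0.5 corresponds to odds of 1 and log-odds of 0, which is why zero on the log-odds scale marks the 50-50 point.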

### 🏃‍♂️ Try It Yourself

Calculate the odds and log-odds for a probability of 0.75 (75% chance). What does this mean in business terms?

In [None]:
# Your code here
p = 0.75
# Calculate odds: 
# Calculate log-odds:
# Business interpretation:

## Building Your First Logistic Regression Model

Now let's apply logistic regression to the Default dataset to predict customer default based on their credit card balance.

In [None]:
# Explore the Default dataset
print("Default distribution:")
print(Default['default'].value_counts())
print(f"\nDefault rate: {Default['default'].value_counts(normalize=True)['Yes']:.1%}")

# Summary statistics by default status
print("\nSummary by default status:")
print(Default.groupby('default', observed=False)[['balance', 'income']].mean().round(0))

In [None]:
# Prepare the data for modeling
# Convert categorical variables to numeric
Default_encoded = pd.get_dummies(Default, columns=['student'], drop_first=True)
Default_encoded['default_binary'] = (Default_encoded['default'] == 'Yes').astype(int)

print("Encoded dataset:")
print(Default_encoded.head())

# Define features and target
X = Default_encoded[['balance', 'income', 'student_Yes']]
y = Default_encoded['default_binary']

print(f"\nFeature matrix shape: {X.shape}")
print(f"Target variable shape: {y.shape}")
print(f"Default rate in our target: {y.mean():.1%}")
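If the encoding step feels opaque, it can help to see it on a tiny made-up table. The sketch below uses a three-row toy DataFrame (not the Default data) to show what `pd.get_dummies` with `drop_first=True` produces for a Yes/No column.

```python
import pandas as pd

# Toy data for illustration only (not the Default dataset)
toy = pd.DataFrame({'student': ['No', 'Yes', 'No']})

# drop_first=True drops the first category ('No'), leaving one indicator column
toy_encoded = pd.get_dummies(toy, columns=['student'], drop_first=True)

print(toy_encoded.columns.tolist())
print(toy_encoded['student_Yes'].astype(int).tolist())
```

One indicator column is enough for a two-level variable: a value of 0 (or False) in `student_Yes` already tells the model the customer is not a student.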

In [None]:
# Simple logistic regression with balance only
X_simple = Default_encoded[['balance']]

# Fit the logistic regression model
log_reg_simple = LogisticRegression(random_state=42)
log_reg_simple.fit(X_simple, y)

print("Simple logistic regression model fitted successfully!")

In [None]:
# Extract model components
intercept = log_reg_simple.intercept_[0]
balance_coef = log_reg_simple.coef_[0][0]

print("Simple Logistic Regression Results:")
print(f"Intercept: {intercept:.4f}")
print(f"Balance coefficient: {balance_coef:.6f}")

# Interpretation
print(f"\nModel interpretation:")
print(f"Log-odds equation: log-odds = {intercept:.4f} + {balance_coef:.6f} × balance")
print(f"\nFor each $1 increase in balance, log-odds increase by {balance_coef:.6f}")
print(f"For each $1,000 increase in balance, log-odds increase by {balance_coef*1000:.3f}")
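Log-odds are hard to picture, so a common next step is to exponentiate a coefficient and read it as a multiplier on the odds. The sketch below uses an assumed coefficient of 0.005 purely for illustration; it is not the value from the fitted model, so substitute `balance_coef` to get your own numbers.

```python
import math

# Hypothetical coefficient for illustration only; replace with balance_coef from your model
example_coef = 0.005   # assumed change in log-odds per $1 of balance

# Exponentiating a change in log-odds gives the factor by which the odds are multiplied
odds_multiplier_per_dollar = math.exp(example_coef)
odds_multiplier_per_100 = math.exp(example_coef * 100)

print(f"Odds multiplier per $1:   {odds_multiplier_per_dollar:.4f}")
print(f"Odds multiplier per $100: {odds_multiplier_per_100:.3f}")
```

With this assumed coefficient, each additional $100 of balance multiplies the odds of default by about 1.65. Note that the multiplier applies to the odds, not to the probability itself.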

### 🏃‍♂️ Try It Yourself

Create a visualization showing the S-shaped logistic regression curve fitted to the Default data. Compare this to the linear regression line we saw earlier.

In [None]:
# Your code here
# Hint: Use predict_proba to get probability predictions
# Create a range of balance values for smooth curve
# Plot both the actual data points and the fitted curve

## Making Predictions with Logistic Regression

One of the key advantages of logistic regression is that it provides both probability estimates and binary classifications.

In [None]:
# Make predictions for specific balance amounts to show the progression
example_balances = pd.DataFrame({'balance': [500, 1000, 1500, 2000, 2500, 3000]})

# Get probability predictions - returns probabilities for both classes
probabilities = log_reg_simple.predict_proba(example_balances)
print("predict_proba() output (columns: [No Default, Default]):")
print(probabilities.round(4))

# Get binary classifications - returns 0 or 1 based on 50% threshold
classifications = log_reg_simple.predict(example_balances)
print(f"\npredict() output (0=No Default, 1=Default):")
print(classifications)

In [None]:
# Organize this information into a clear table
prob_default = probabilities[:, 1]  # Extract just the default probabilities

prediction_results = pd.DataFrame({
    'Balance': example_balances['balance'],
    'Probability_of_Default': prob_default,
    'Predicted_Class': classifications,
    'Business_Interpretation': [
        'Very low risk - safe customer',
        'Low risk - monitor balance growth',
        'Moderate risk - consider credit limit review',
        'High risk - proactive intervention recommended',
        'Very high risk - immediate attention required',
        'Extremely high risk - consider account restrictions'
    ]
})

print("Prediction Examples:")
print(prediction_results.round(4))
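To see how the two outputs relate, the sketch below checks that cutting the class-1 probability at 0.5 by hand reproduces what `predict()` returns. It uses a small synthetic dataset with made-up names (`demo_x`, `demo_y`, `demo_model`) so it does not overwrite any variables in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, made-up data (not the Default dataset): one feature, binary outcome
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(200, 1))
demo_y = (demo_x[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

demo_model = LogisticRegression().fit(demo_x, demo_y)

# Column 1 of predict_proba holds the probability of class 1
demo_prob = demo_model.predict_proba(demo_x)[:, 1]

# Applying the 0.5 cutoff by hand should reproduce predict()
manual_labels = (demo_prob > 0.5).astype(int)
same = np.array_equal(manual_labels, demo_model.predict(demo_x))
print("Manual 0.5 threshold matches predict():", same)
```

Keeping the probabilities around, rather than only the 0/1 labels, is what lets you rank customers by risk or revisit the cutoff later.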

### 🏃‍♂️ Try It Yourself

Manually calculate the probability of default for a customer with a $2,000 balance using the logistic regression equation. Compare your result to scikit-learn's prediction.

In [None]:
# Your code here
# Manual calculation: probability = 1 / (1 + exp(-(intercept + coefficient * balance)))
balance_test = 2000

# Manual calculation:

# Scikit-learn prediction:

# Compare results:

## Multiple Predictor Logistic Regression

Now let's build a more comprehensive model using all available features: balance, income, and student status.

In [None]:
# Show the features we'll use in our multiple regression model
print("Features in our model:")
print(X.columns.tolist())
print(f"\nFeature matrix shape: {X.shape}")
print(f"Sample of feature data:")
print(X.head())

In [None]:
# Multiple logistic regression with all features
log_reg_multiple = LogisticRegression(random_state=42)
log_reg_multiple.fit(X, y)

# Extract coefficients
intercept_multi = log_reg_multiple.intercept_[0]
coefficients = log_reg_multiple.coef_[0]

print("Multiple Logistic Regression Results:")
print(f"Intercept: {intercept_multi:.6f}")
print(f"Balance coefficient: {coefficients[0]:.6f}")
print(f"Income coefficient: {coefficients[1]:.6f}")
print(f"Student coefficient: {coefficients[2]:.6f}")

In [None]:
# Interpret the coefficients in business context
print("Business Interpretation of Coefficients:")
print("\n• Balance coefficient ({:.6f}):".format(coefficients[0]))
print("  Higher balances increase default risk - customers with more debt struggle with payments")

print("\n• Income coefficient ({:.6f}):".format(coefficients[1]))
print("  Income has negligible effect once balance is accounted for")

print("\n• Student coefficient ({:.6f}):".format(coefficients[2]))
print("  A negative value means students are less likely to default than non-students with the same balance and income, possibly due to family support or careful spending")

### 🏃‍♂️ Try It Yourself

Create a model using only income as a predictor. How do the coefficients compare to the multiple regression model? What does this tell you about the relationship between income and default?

In [None]:
# Your code here
# Build income-only model
# Compare coefficients
# Interpret the differences

## Model Evaluation with Train/Test Split

To evaluate our models fairly, we hold out a test set and measure performance on data the models never saw during training.

In [None]:
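# Split the data into training and testing sets.
# train_test_split returns a (train, test) pair for each array, in the order the
# arrays are passed: X_simple, then X, then y. Splitting all three in one call
# keeps the same rows in the training and test sets for both models.
X_simple_train, X_simple_test, X_train, X_test, y_train, y_test = train_test_split(
    X_simple, X, y, test_size=0.3, random_state=42
)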

print(f"Training set size: {len(X_train)} observations")
print(f"Test set size: {len(X_test)} observations")
print(f"Training default rate: {y_train.mean():.1%}")
print(f"Test default rate: {y_test.mean():.1%}")
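When one class is rare, a purely random split can leave the training and test sets with slightly different default rates. One optional refinement, which the split above does not use, is the `stratify` argument of `train_test_split`. The sketch below shows it on made-up labels; the `demo_` names are placeholders chosen so the notebook's own `X_train` and `y_train` are left untouched.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels for illustration: 1,000 customers, 30 of whom default (3%)
demo_labels = np.array([1] * 30 + [0] * 970)
demo_features = np.arange(1000).reshape(-1, 1)   # placeholder feature column

# stratify=demo_labels asks for the same class proportions in both splits
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_features, demo_labels, test_size=0.3, random_state=42, stratify=demo_labels
)

print(f"Training default rate: {demo_y_train.mean():.1%}")
print(f"Test default rate:     {demo_y_test.mean():.1%}")
```

Whether stratifying matters in practice depends on how small the rare class is; with only a handful of positive cases it can noticeably stabilize the test-set default rate.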

In [None]:
# Retrain both models on training data only
log_reg_simple_new = LogisticRegression(random_state=42)
log_reg_multiple_new = LogisticRegression(random_state=42)

log_reg_simple_new.fit(X_simple_train, y_train)
log_reg_multiple_new.fit(X_train, y_train)

# Make predictions on test data
pred_simple_test = log_reg_simple_new.predict(X_simple_test)
pred_multiple_test = log_reg_multiple_new.predict(X_test)

# Calculate test accuracy
accuracy_simple_test = accuracy_score(y_test, pred_simple_test)
accuracy_multiple_test = accuracy_score(y_test, pred_multiple_test)

print(f"Model Performance on Test Data:")
print(f"Simple model (balance only): {accuracy_simple_test:.1%} accuracy")
print(f"Multiple model (all features): {accuracy_multiple_test:.1%} accuracy")

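# Only report a tie when the accuracies actually match at the precision printed above
if round(accuracy_simple_test, 3) == round(accuracy_multiple_test, 3):
    print("\nBoth models achieve the same accuracy!")
else:
    print(f"\nAccuracy difference (multiple - simple): {(accuracy_multiple_test - accuracy_simple_test) * 100:+.1f} percentage points")
print("Accuracy alone may not tell the full story...")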

In [None]:
# Compare individual predictions
sample_customers = X_test.head(10)
prob_simple = log_reg_simple_new.predict_proba(sample_customers[['balance']])[:, 1]
prob_multiple = log_reg_multiple_new.predict_proba(sample_customers)[:, 1]

comparison_df = pd.DataFrame({
    'Balance': sample_customers['balance'].values,
    'Income': sample_customers['income'].values,
    'Student': sample_customers['student_Yes'].values,
    'Actual_Default': y_test.head(10).values,
    'Simple_Model_Prob': prob_simple,
    'Multiple_Model_Prob': prob_multiple,
    'Probability_Difference': prob_multiple - prob_simple
})

print("Sample Predictions Comparison:")
print(comparison_df.round(4))

### 🏃‍♂️ Try It Yourself

Examine the probability differences between the two models. For which types of customers do the models give the most different predictions? What might this tell us about when additional features matter?

In [None]:
# Your analysis here
# Look at cases with largest probability differences
# What patterns do you notice?

## Understanding Class Imbalance

Our Default dataset is severely imbalanced: only about 3% of customers default. This can make accuracy misleading, because a model that never predicts a default would still be correct about 97% of the time.

In [None]:
# Analyze class imbalance in our dataset
class_counts = Default['default'].value_counts()
minority_percentage = class_counts.min() / class_counts.sum()

print(f"Class balance analysis:")
print(class_counts)
print(f"\nMinority class (defaults) represents {minority_percentage:.1%} of data")

# What would happen if we always predicted "No Default"?
# Score this baseline on the test set so it is comparable to the model's test accuracy
naive_accuracy = (y_test == 0).mean()
print(f"\nAccuracy of always predicting 'No Default' (test set): {naive_accuracy:.1%}")
print(f"Our model accuracy (test set): {accuracy_simple_test:.1%}")
print(f"\nModel gain over always guessing 'No': {(accuracy_simple_test - naive_accuracy) * 100:+.1f} percentage points")

In [None]:
# Let's see what our model actually predicts
test_predictions = log_reg_simple_new.predict(X_simple_test)
prediction_counts = pd.Series(test_predictions).value_counts()

print("What our simple model predicts on test data:")
print(f"Predicted No Default (0): {prediction_counts.get(0, 0)} customers")
print(f"Predicted Default (1): {prediction_counts.get(1, 0)} customers")

print(f"\nOur model predicts 'Default' for {prediction_counts.get(1, 0) / len(test_predictions):.1%} of customers")
print(f"Actual default rate in test set: {y_test.mean():.1%}")

## 🚀 Practice Challenges

Test your understanding with these additional exercises that combine multiple concepts from the chapter.

### Challenge 1: Custom Threshold Analysis

Instead of using the default 0.5 threshold, experiment with different probability thresholds (0.1, 0.3, 0.7) for classification. How does this affect the predictions? What threshold would you recommend for a conservative bank?

In [None]:
# Your solution here
# Get probability predictions
# Apply different thresholds
# Compare results and business implications

### Challenge 2: Feature Engineering

Create a new feature called `balance_to_income_ratio` and add it to your model. Does this improve the model's ability to distinguish between different customer types? What does the coefficient tell you?

In [None]:
# Your solution here
# Create the new feature
# Add it to the model
# Interpret the results

### Challenge 3: Business Scenario Analysis

Imagine you're advising a credit card company. Using your logistic regression model, analyze these three customer profiles and provide business recommendations:
- Customer A: $1,200 balance, $35,000 income, not a student
- Customer B: $2,800 balance, $25,000 income, student
- Customer C: $800 balance, $65,000 income, not a student

In [None]:
# Your solution here
# Create customer profiles
# Get probability predictions
# Provide business recommendations for each

## 📝 Chapter Summary

In this notebook, you practiced:

- ✅ Understanding why linear regression fails for classification problems
- ✅ Working with the logistic function and S-shaped probability curves
- ✅ Building simple and multiple logistic regression models with scikit-learn
- ✅ Interpreting coefficients in terms of log-odds and business impact
- ✅ Making probability-based predictions and applying classification thresholds
- ✅ Using proper train/test splits for model evaluation
- ✅ Recognizing class imbalance issues and their impact on accuracy

## 🔗 Connections to Other Chapters

- **Previous chapters**: This chapter builds on linear regression concepts from Chapter 21, extending them to classification problems
- **Upcoming chapters**: Chapter 24 will introduce more sophisticated evaluation metrics (precision, recall, F1-score) that better handle class imbalance

## 📚 Additional Resources

- [Scikit-learn Logistic Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- [Understanding the Logistic Function](https://en.wikipedia.org/wiki/Logistic_function)
- [Binary Classification Evaluation Metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)

## 🎯 Next Steps

1. **Review the chapter** to reinforce concepts about probability interpretation and business applications
2. **Complete the end-of-chapter exercises** in the textbook using different ISLP datasets
3. **Practice with your own datasets** to build confidence with real-world classification problems
4. **Move on to Classification Evaluation** when ready to learn about advanced metrics for imbalanced data