# Week 10 Lab: Logistic Regression and Classification Evaluation

<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/labs/10_wk10_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to Week 10! This lab serves as both your Thursday class session and your homework for the week. You'll apply logistic regression and classification evaluation techniques to two important business scenarios: credit risk assessment and medical diagnosis support.

In the business world, classification problems are everywhere—from determining loan approvals to medical screenings. Today you'll master the complete workflow from data preparation through model evaluation, learning to choose appropriate metrics that align with business objectives and costs.

## 🎯 Learning Objectives
By the end of this lab, you will be able to:
- Apply the complete logistic regression workflow: data preparation, model fitting, and interpretation
- Calculate and interpret baseline ratios for imbalanced classification problems
- Evaluate classification models using precision, recall, F1-score, and ROC-AUC metrics
- Select appropriate evaluation metrics based on business context and error costs

## 📚 This Lab Reinforces
- **Chapter 23: Introduction to Logistic Regression for Classification**
- **Chapter 24: Evaluating Classification Models**
- **Tuesday's Lecture: Classification Methods and Model Evaluation**

## 🕐 Estimated Time & Structure
**Total Time:** 75 minutes  
**Mode:** Individual work (this serves as your homework)

- **[0–10 min]** Review: Default dataset logistic regression workflow
- **[10–35 min]** Application: Breast Cancer Wisconsin dataset analysis
- **[35–70 min]** Independent challenges: Specific homework questions
- **[70–75 min]** Wrap-up and submission preparation

## 💡 Why This Matters
Classification problems drive critical business decisions across industries. Credit companies need to assess default risk, healthcare systems require diagnostic support, and marketing teams must identify likely customers. The ability to build, evaluate, and interpret classification models—while understanding the business implications of different types of errors—is essential for data-driven decision making. Today's lab prepares you to tackle these real-world challenges with confidence.

## Setup
We'll work with two datasets: the Default dataset from ISLP (for review) and the Breast Cancer Wisconsin dataset (for our main analysis). Both represent important classification scenarios in business and healthcare.

In [1]:
# Required imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, classification_report, 
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve
)
from ISLP import load_data
import warnings
warnings.filterwarnings('ignore')

# Set random state for reproducibility
RANDOM_STATE = 42

print("✅ All libraries imported successfully!")
print("🎯 Ready to dive into classification analysis!")

✅ All libraries imported successfully!
🎯 Ready to dive into classification analysis!


## Part 1 — Review: Default Dataset Logistic Regression (10 minutes)

Let's quickly review the complete logistic regression workflow using the Default dataset from Chapters 23-24. This will reinforce the key concepts before we tackle the main dataset.

### Quick Workflow Review

We'll walk through each step systematically:

**📋 Step-by-step process:**
1. Load data and compute baseline ratio
2. Prepare features with dummy encoding
3. Split data into training and test sets
4. Fit logistic regression model and interpret coefficients
5. Make predictions and evaluate using multiple metrics

In [2]:
# Step 1: Load Default dataset and examine baseline
Default = load_data('Default')

print("Default Dataset Overview:")
print(f"Shape: {Default.shape}")
print(f"\nColumns: {Default.columns.tolist()}")
print(f"\nFirst few rows:")
print(Default.head())

# Compute baseline ratio
baseline_default_rate = (Default['default'] == 'Yes').mean()
print(f"\n📊 Baseline Analysis:")
print(f"Default rate: {baseline_default_rate:.1%}")
print(f"Non-default rate: {1-baseline_default_rate:.1%}")

Default Dataset Overview:
Shape: (10000, 4)

Columns: ['default', 'student', 'balance', 'income']

First few rows:
  default student      balance        income
0      No      No   729.526495  44361.625074
1      No     Yes   817.180407  12106.134700
2      No      No  1073.549164  31767.138947
3      No      No   529.250605  35704.493935
4      No      No   785.655883  38463.495879

📊 Baseline Analysis:
Default rate: 3.3%
Non-default rate: 96.7%
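
Why does the baseline matter? With a default rate near 3.3%, a "model" that always predicts *No* is already about 96.7% accurate while catching zero defaulters. The sketch below illustrates this with scikit-learn's `DummyClassifier` on stand-in labels; `y_demo` and `X_demo` are hypothetical arrays built to mimic the class balance above, not the Default data itself.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical stand-in labels: 10,000 customers, 333 defaulters (about 3.3%)
y_demo = np.zeros(10_000, dtype=int)
y_demo[:333] = 1
X_demo = np.zeros((10_000, 1))  # the dummy model ignores the features

# Always predict the most frequent class ("no default")
baseline_model = DummyClassifier(strategy="most_frequent")
baseline_model.fit(X_demo, y_demo)
y_baseline_pred = baseline_model.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_baseline_pred)
baseline_recall = recall_score(y_demo, y_baseline_pred)

print(f"Baseline accuracy: {baseline_accuracy:.1%}")
print(f"Baseline recall:   {baseline_recall:.1%}")
```

Any real model should be judged against this bar, which is why accuracy alone can be misleading on imbalanced data.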


In [3]:
# Step 2: Prepare data with dummy encoding
# Convert categorical variables to numeric
Default_encoded = pd.get_dummies(Default, columns=['student'], drop_first=True)
Default_encoded['default_binary'] = (Default_encoded['default'] == 'Yes').astype(int)

# Define features and target
X = Default_encoded[['balance', 'income', 'student_Yes']]
y = Default_encoded['default_binary']

print("Data Preparation Complete:")
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns: {X.columns.tolist()}")

Data Preparation Complete:
Features shape: (10000, 3)
Target shape: (10000,)

Feature columns: ['balance', 'income', 'student_Yes']


In [4]:
# Step 3: Split data (70-30 split as specified for homework questions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

print(f"📊 Data Split Results:")
print(f"Training set: {len(X_train):,} observations")
print(f"Test set: {len(X_test):,} observations")
print(f"\nTraining set default rate: {y_train.mean():.1%}")
print(f"Test set default rate: {y_test.mean():.1%}")

📊 Data Split Results:
Training set: 7,000 observations
Test set: 3,000 observations

Training set default rate: 3.4%
Test set default rate: 3.1%
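
Notice that the training and test default rates above differ slightly (3.4% vs 3.1%). One common way to keep class proportions aligned is the `stratify` argument of `train_test_split`. The sketch below shows the idea on hypothetical stand-in arrays (`X_demo`, `y_demo`). It is an optional aside: for the homework, keep the unstratified split from the cell above so your numbers match everyone else's.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with roughly the same class balance as Default
rng = np.random.default_rng(42)
y_demo = np.zeros(10_000, dtype=int)
y_demo[:333] = 1
X_demo = rng.normal(size=(10_000, 3))

# stratify=y_demo asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"Training default rate: {y_tr.mean():.2%}")
print(f"Test default rate:     {y_te.mean():.2%}")
```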


In [5]:
# Step 4: Fit logistic regression model
model = LogisticRegression(random_state=RANDOM_STATE)
model.fit(X_train, y_train)

# Extract and interpret coefficients
print("🔍 Model Coefficients:")
print(f"Intercept: {model.intercept_[0]:.6f}")
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature}: {coef:.6f}")

print(f"\n💡 Interpretation:")
print(f"• Balance: Positive coefficient means higher balance increases default risk")
print(f"• Income: Very small coefficient suggests minimal impact after accounting for balance")
print(f"• Student: Negative coefficient means students have lower default risk (holding other factors constant)")

🔍 Model Coefficients:
Intercept: -11.108164
balance: 0.005789
income: 0.000006
student_Yes: -0.467459

💡 Interpretation:
• Balance: Positive coefficient means higher balance increases default risk
• Income: Very small coefficient suggests minimal impact after accounting for balance
• Student: Negative coefficient means students have lower default risk (holding other factors constant)
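
Coefficients are on the log-odds scale, so it often helps to convert them. The sketch below copies the coefficients printed above and turns them into odds ratios and a predicted probability. The helper `predicted_default_probability` and the example customer (balance 2,000, income 40,000, non-student) are hypothetical, chosen only to illustrate the arithmetic.

```python
import numpy as np

# Coefficients copied from the printed model output above
intercept = -11.108164
coef = {"balance": 0.005789, "income": 0.000006, "student_Yes": -0.467459}

# Odds ratio = exp(coefficient * change in the feature)
odds_ratio_balance_100 = np.exp(coef["balance"] * 100)  # per extra $100 of balance
odds_ratio_student = np.exp(coef["student_Yes"])        # student vs non-student

def predicted_default_probability(balance, income, student):
    """Hypothetical helper: apply the logistic function to the linear predictor."""
    log_odds = (
        intercept
        + coef["balance"] * balance
        + coef["income"] * income
        + coef["student_Yes"] * student
    )
    return 1 / (1 + np.exp(-log_odds))

# Hypothetical customer used only for illustration
p_example = predicted_default_probability(balance=2000, income=40000, student=0)

print(f"Odds ratio per +$100 balance: {odds_ratio_balance_100:.2f}")
print(f"Odds ratio, student vs not:   {odds_ratio_student:.2f}")
print(f"Predicted default probability: {p_example:.1%}")
```

An odds ratio above 1 means the odds of default rise with the feature, and one below 1 means they fall, holding the other features constant.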


In [None]:
# Step 5: Make predictions and evaluate comprehensively
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Calculate all key metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_proba)

print("📈 Model Performance Metrics:")
print(f"Accuracy:  {accuracy:.1%}")
print(f"Precision: {precision:.1%}")
print(f"Recall:    {recall:.1%}")
print(f"F1-Score:  {f1:.1%}")
print(f"ROC-AUC:   {auc:.3f}")

# Interpretation uses the computed values, so the text always matches this run
# (reference values for this split: accuracy 97.3%, precision 69.4%, recall 26.6%, F1 38.5%, ROC-AUC 0.947)
print(f"\n💡 What These Metrics Mean for Credit Risk:")
print(f"• Accuracy ({accuracy:.1%}): Overall correctness - {accuracy:.1%} of all predictions are correct")
print(f"• Precision ({precision:.1%}): Of customers flagged as 'will default', {precision:.1%} actually do")
print(f"  → The other {1 - precision:.1%} of flagged customers are false alarms")
print(f"• Recall ({recall:.1%}): Share of actual defaulters the model catches")
print(f"  → Misses {1 - recall:.1%} of customers who will default - a major business risk when this is high!")
print(f"• F1-Score ({f1:.1%}): Harmonic mean of precision and recall; low recall drags it down")
print(f"• ROC-AUC ({auc:.3f}): Ability to rank customers by default risk (0.5 = random, 1.0 = perfect)")
print(f"  → Strong ranking with weak recall suggests the default 0.5 threshold may need adjustment")

# Show confusion matrix (rows = actual class, columns = predicted class)
cm = confusion_matrix(y_test, y_pred)
print(f"\n🔍 Confusion Matrix:")
print(f"[[{cm[0,0]:4d}, {cm[0,1]:3d}],")
print(f" [{cm[1,0]:4d}, {cm[1,1]:3d}]]")
print(f"\nThis shows: TN={cm[0,0]}, FP={cm[0,1]}, FN={cm[1,0]}, TP={cm[1,1]}")
print(f"Business Impact: {cm[1,0]} defaulters missed (lost revenue), {cm[0,1]} customers wrongly rejected (lost business)")
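
The note above that the threshold "may need adjustment" can be made concrete. `model.predict()` labels a case positive when its predicted probability is at least 0.5, and lowering that cutoff trades precision for recall. The sketch below shows the pattern on synthetic imbalanced data from `make_classification`; the data, the variable names, and the candidate thresholds (0.5, 0.3, 0.1) are assumptions for illustration, not values from the Default model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives), purely for illustration
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = demo_model.predict_proba(X_te)[:, 1]  # probability of the positive class

# Apply several cutoffs to the same probabilities
threshold_results = {}
for threshold in (0.5, 0.3, 0.1):
    pred = (proba >= threshold).astype(int)
    threshold_results[threshold] = {
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred),
        "flagged": int(pred.sum()),
    }
    r = threshold_results[threshold]
    print(f"threshold={threshold:.1f}  flagged={r['flagged']:4d}  "
          f"precision={r['precision']:.1%}  recall={r['recall']:.1%}")
```

Lowering the threshold can only add flagged cases, so recall never decreases; whether the extra false alarms are worth it depends on the relative cost of each error type.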

## Part 2 — Main Analysis: Breast Cancer Wisconsin Dataset (25 minutes)

Now let's apply these skills to a new healthcare dataset. The **Breast Cancer Wisconsin (Diagnostic) dataset** contains features computed from digitized images of fine needle aspirate (FNA) of breast masses. Our goal is to predict whether a tumor is **malignant** (cancerous) or **benign** (non-cancerous).

### 🔬 About This Dataset

**Data Source**: Originally created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin-Madison. This dataset is widely used in machine learning research and medical informatics.

**Data Collection Process**: For each breast mass sample, a fine needle aspirate (FNA) was performed, then digitized images were analyzed to compute quantitative features describing the cell nuclei characteristics.

### 📊 Feature Categories

The dataset contains **30 quantitative features** organized into three groups for each characteristic:

1. **Mean values** (`_mean`): Average across all cells in the sample
2. **Standard error** (`_se`): Standard error of the measurements  
3. **Worst values** (`_worst`): Mean of the three largest (most severe) values

**The 10 core characteristics measured are:**

- **`radius`**: Mean of the distances from the center to points on the perimeter
- **`texture`**: Standard deviation of gray-scale values  
- **`perimeter`**: Total boundary length of the cell nucleus
- **`area`**: Total area enclosed by the cell nucleus boundary
- **`smoothness`**: Local variation in radius lengths
- **`compactness`**: (perimeter² / area) - 1.0, measuring shape regularity
- **`concavity`**: Severity of concave portions of the boundary
- **`concave_points`**: Number of concave portions of the boundary
- **`symmetry`**: Bilateral symmetry of the cell nucleus
- **`fractal_dimension`**: Fractal complexity using coastline approximation

### 🎯 Simplified Analysis Focus

For this part of the lab, we'll focus on the **10 `_mean` features** (one per characteristic) to keep our analysis manageable:
- `radius_mean`, `texture_mean`, `perimeter_mean`, `area_mean`, `smoothness_mean`, plus the `_mean` columns for the other five characteristics

These provide a representative sample of size, texture, and shape characteristics that are clinically relevant for distinguishing malignant from benign tumors.

**Business Context**: In medical diagnosis, the costs of different errors are dramatically different. Missing a malignant tumor (false negative) can be life-threatening, while incorrectly flagging a benign tumor as malignant (false positive) leads to unnecessary stress and additional testing costs.

### Exercise 2.1: Data Loading and Exploration

**Your Task**: Load the breast cancer dataset and perform initial exploratory analysis.

**Instructions**:
1. Load the dataset from the provided URL
2. Examine the dataset structure (shape, columns, first few rows)
3. Calculate the baseline ratio of malignant vs benign diagnoses
4. Check for any missing values in the dataset

**Questions to Answer**:
- How many observations and features does the dataset contain?
- What percentage of cases are malignant vs benign?
- Are there any missing values that need to be handled?
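
If you are unsure which pandas tools fit these tasks, here is a minimal sketch on a tiny made-up DataFrame. The frame `toy`, its values, and the `"M"`/`"B"` label coding are assumptions for illustration; check how `diagnosis` is actually coded in `cancer_data` before reusing the pattern.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real dataset
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M", "B", "B"],
    "radius_mean": [17.9, 12.4, 11.8, 20.1, np.nan, 13.0],
})

# Structure: (rows, columns) and a quick look at the data
print(toy.shape)
print(toy.head())

# Baseline ratio: share of each class
class_share = toy["diagnosis"].value_counts(normalize=True)
print(class_share)

# Missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```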

Write your code below to answer these questions:

In [None]:
# Exercise 2.1: Your code here

# URL for the breast cancer dataset
url = "https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/refs/heads/main/data/breast_cancer.csv"

# Task 1: Load the dataset (PROVIDED)
cancer_data = pd.read_csv(url)
print("✅ Breast Cancer Wisconsin dataset loaded successfully!")

# Task 2: Examine dataset structure (shape, columns, first few rows)
# Write your code here


# Task 3: Calculate baseline ratio of malignant vs benign diagnoses
# Write your code here


# Task 4: Check for missing values
# Write your code here

In [None]:
# Solution will be provided by TA during lab

# This cell will contain the solution code that the TA will walk through
# Students should attempt the exercise above before seeing the solution

print("✅ TA will provide solution during lab walkthrough")

### Exercise 2.2: Data Preparation and Modeling (Using Mean Features Only)

**Your Task**: Prepare the breast cancer data for logistic regression analysis using only the `_mean` features.

**Background**: For this exercise, we'll focus on a subset of features to keep the analysis manageable. You'll work with the 10 `_mean` features, which represent the average measurements across all cells in each sample.

**Instructions**:
1. Create a binary target variable (0=Benign, 1=Malignant) from the diagnosis column
2. Select only the features ending with `_mean` for your feature matrix
3. Split the data into training and test sets (70-30 split)
4. Fit a logistic regression model and examine the coefficients
5. Make predictions on the test set

**Important**: Use the `RANDOM_STATE` variable (defined at the beginning) for consistent results across all students.

**Questions to Answer**:
- How many `_mean` features are available in the dataset?
- What are the training and test set sizes after the split?
- Which `_mean` features have positive vs negative coefficients?
- What do the coefficient signs suggest about malignancy risk?
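
One possible pattern for the binary target is an explicit label map followed by a check that nothing was left unmapped. The sketch below uses a made-up frame; the `"M"`/`"B"` coding and the name `label_map` are assumptions, so confirm the actual values in your `diagnosis` column first.

```python
import pandas as pd

# Made-up frame; the "M"/"B" coding is an assumption about the label format
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.9, 12.4, 11.8, 20.1],
})

label_map = {"B": 0, "M": 1}  # 0 = benign, 1 = malignant
y_toy = toy["diagnosis"].map(label_map)

# .map() returns NaN for labels missing from the map, so check before casting
if y_toy.isna().any():
    raise ValueError("Found diagnosis labels that are not in label_map")
y_toy = y_toy.astype(int)

print(y_toy.value_counts())
```

Coding malignant as 1 makes it the positive class, so precision and recall in later steps describe how well malignant cases are detected.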

Write your code below to complete these tasks:

In [None]:
# Exercise 2.2: Your code here
# Assume the cancer_data DataFrame is available from Exercise 2.1

# Task 1: Create binary target variable (0=Benign, 1=Malignant)
# Write your code here


# Task 2: Select only the features ending with '_mean' (PROVIDED)
mean_features = [col for col in cancer_data.columns if col.endswith('_mean')]
X_cancer_mean = cancer_data[mean_features]
print(f"✅ Selected {len(mean_features)} mean features:")
print(f"Features: {mean_features}")

# Task 3: Split data into training and test sets (70-30 split using RANDOM_STATE)
# Write your code here


# Task 4: Fit logistic regression model and examine coefficients
# Write your code here


# Task 5: Make predictions on test set  
# Write your code here

In [None]:
# Solution will be provided by TA during lab

# This cell will contain the solution code that the TA will walk through
# Students should attempt Exercise 2.2 above before seeing the solution

print("✅ TA will provide solution during lab walkthrough")

### Exercise 2.3: Model Evaluation

**Your Task**: Evaluate the performance of your logistic regression model using multiple classification metrics.

**Instructions**:
1. Calculate accuracy, precision, recall, and F1-score on the test set
2. Calculate the ROC-AUC score
3. Create and interpret the confusion matrix
4. Discuss which metrics are most important for medical diagnosis

**Questions to Answer**:
- What is the model's performance across different metrics?
- In the context of cancer diagnosis, which type of error (false positive vs false negative) is more concerning?
- How does this model's performance compare to the baseline?
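
It can help to see how the metrics fall out of the four confusion-matrix cells. The sketch below computes them by hand on ten made-up labels and checks them against scikit-learn; `y_true_demo` and `y_pred_demo` are hypothetical arrays, not results from the cancer model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Ten made-up cases: 1 = malignant, 0 = benign
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)   # of predicted malignant, how many are malignant
recall_manual = tp / (tp + fn)      # of actual malignant, how many were caught
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"accuracy={accuracy_manual:.2f}, precision={precision_manual:.2f}, "
      f"recall={recall_manual:.2f}, F1={f1_manual:.2f}")

# The hand calculations agree with scikit-learn
assert np.isclose(precision_manual, precision_score(y_true_demo, y_pred_demo))
assert np.isclose(recall_manual, recall_score(y_true_demo, y_pred_demo))
assert np.isclose(f1_manual, f1_score(y_true_demo, y_pred_demo))
```

In a diagnostic setting, the false negatives (`fn`) are the missed malignant cases, which is why recall usually gets the closest attention.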

Write your code below to evaluate the model:

In [None]:
# Exercise 2.3: Your code here
# Assume you have y_test and predictions available from Exercise 2.2

# Task 1: Calculate classification metrics
# Write your code here for accuracy, precision, recall, F1-score


# Task 2: Calculate ROC-AUC score  
# Write your code here


# Task 3: Create and display confusion matrix
# Write your code here


# Task 4: Interpret results in medical context
# Write your analysis as comments:
# - Which metric is most important for cancer diagnosis and why?
# - What are the implications of false positives vs false negatives?

## Part 3 — Independent Analysis: Full Feature Model (35 minutes)

Now that you've worked through the logistic regression process with the `_mean` features, it's time to apply the same workflow using **all available features** in the dataset. This will give you experience with higher-dimensional data and allow you to compare model performance.

### 🎯 Your Challenge

Repeat the complete logistic regression analysis from Part 2, but this time use **all 30 quantitative features** (mean, standard error, and worst values for each of the 10 characteristics). This represents a more realistic scenario where you have access to the full feature set.

**Key Differences from Part 2**:
- Use ALL features except the `diagnosis` column (30 features total)
- Follow the same workflow: data prep → modeling → evaluation
- Compare results with your Part 2 model using only `_mean` features
- Work independently to write all the code

### 📋 Workflow Steps to Complete

1. **Data Preparation**
   - Create binary target variable 
   - Select all quantitative features (exclude 'diagnosis')
   - Split into 70-30 train/test (use `RANDOM_STATE` for consistency)

2. **Model Training**
   - Fit logistic regression model
   - Examine and interpret coefficients
   - Make predictions on test set

3. **Model Evaluation**
   - Calculate all classification metrics
   - Create confusion matrix
   - Compare performance to Part 2 model

4. **Analysis and Comparison**
   - Which model performs better and why?
   - Does using more features always improve performance?
   - Which features seem most important for prediction?

**Important Notes**:
- Work independently on this section
- Use the same `RANDOM_STATE` for consistent results
- Feel free to ask conceptual questions, but write your own code
- We'll review solutions together at the end

### Step 1: Data Preparation with All Features

**Task**: Prepare the data using all 30 quantitative features instead of just the `_mean` features.

Write your code below:

In [None]:
# Step 1: Data Preparation with All Features
# Assume cancer_data DataFrame is available from Part 2

# Create binary target variable (if not already done)
# Write your code here


# Select ALL quantitative features (exclude 'diagnosis' column)
# Hint: You can use cancer_data.drop() or select columns that aren't 'diagnosis'
# Write your code here


# Split into training and test sets (70-30 split using RANDOM_STATE)
# Write your code here


# Verify your data preparation
# Write code to check shapes and feature count

### Step 2: Model Training with All Features

**Task**: Train a logistic regression model using all 30 features and examine the results.

Write your code below:

In [None]:
# Step 2: Model Training with All Features

# Fit logistic regression model using RANDOM_STATE
# Write your code here


# Examine model coefficients
# Write your code here to display intercept and feature coefficients


# Make predictions on test set (both binary and probability predictions)
# Write your code here

### Step 3: Model Evaluation and Comparison

**Task**: Evaluate your full-feature model and compare it with the `_mean`-only model from Part 2.
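
One way to line up two models is to collect each model's metrics in a dictionary and build a small DataFrame from them. The sketch below does this on synthetic data, comparing a 4-column subset against all 12 columns. The helper `summarize_metrics`, the synthetic data, and the column counts are all hypothetical; swap in your own Part 2 and Part 3 predictions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)
from sklearn.model_selection import train_test_split

def summarize_metrics(y_true, y_pred, y_proba):
    """Hypothetical helper: gather the five metrics used in this lab."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_proba),
    }

# Synthetic stand-in data, purely for illustration
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

feature_sets = {"subset (4 features)": slice(0, 4), "all (12 features)": slice(None)}
rows = {}
for name, cols in feature_sets.items():
    m = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    rows[name] = summarize_metrics(
        y_te, m.predict(X_te[:, cols]), m.predict_proba(X_te[:, cols])[:, 1]
    )

# One column per model, one row per metric
comparison = pd.DataFrame(rows).round(3)
print(comparison)
```

Both models are scored on the same test rows, which is what makes the columns directly comparable.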

Write your code below:

In [None]:
# Step 3: Model Evaluation and Comparison

# Calculate all classification metrics for the full-feature model
# Write your code here for accuracy, precision, recall, F1-score, ROC-AUC


# Create and display confusion matrix
# Write your code here


# Compare with Part 2 results
# Write code to display metrics from both models side by side


# Analysis questions (answer in comments):
# 1. Which model performs better overall?
# 2. Does using more features improve performance? Why or why not?
# 3. Are there any trade-offs between the two models?
# 4. In a real medical setting, which model would you prefer and why?

### Step 4: Feature Importance Analysis

**Understanding Feature Importance**: 

While we haven't formally covered feature importance methods yet, we can gain insights about which features matter most in our logistic regression model by examining the **magnitude (absolute value) of the coefficients**.

**Key Concept**: In logistic regression, each coefficient is the change in the log-odds of malignancy for a one-unit increase in that feature, so features with **larger absolute coefficient values** move the prediction more per unit. Here's how to read them:

- **Large positive coefficient**: Higher values of this feature push the prediction strongly toward malignancy
- **Large negative coefficient**: Higher values of this feature push the prediction strongly toward benign
- **Small coefficient (near zero)**: A one-unit change in this feature barely moves the prediction

**For this analysis**, we'll assume that features with the largest absolute coefficient values represent the most influential features in our model. This gives us insight into which measurements are most important for distinguishing between malignant and benign tumors. Treat this as a rough heuristic: a coefficient's size depends on the feature's units, so magnitudes are directly comparable only when features are on similar scales (for example, after standardization).
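
To see why units matter when ranking features by coefficient size, the sketch below fits the same synthetic data twice, once with a size feature in millimetres and once in centimetres, and then repeats both fits after standardizing with `StandardScaler`. Everything here (the data, the feature names, the helper `fit_coefs`) is hypothetical and only illustrates the scaling effect; it is not part of the required workflow.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the outcome depends mostly on size, a little on texture
rng = np.random.default_rng(42)
n = 2000
size_mm = rng.normal(15, 3, n)   # hypothetical size measurement in millimetres
texture = rng.normal(20, 4, n)   # hypothetical second feature
log_odds = 0.8 * (size_mm - 15) + 0.1 * (texture - 20)
y_demo = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

X_mm = np.column_stack([size_mm, texture])
X_cm = np.column_stack([size_mm / 10, texture])  # same data, size in centimetres

def fit_coefs(X, standardize):
    """Hypothetical helper: fit logistic regression, optionally on standardized features."""
    steps = [StandardScaler()] if standardize else []
    model = make_pipeline(*steps, LogisticRegression(max_iter=5000))
    model.fit(X, y_demo)
    return model[-1].coef_[0]

raw_mm, raw_cm = fit_coefs(X_mm, False), fit_coefs(X_cm, False)
std_mm, std_cm = fit_coefs(X_mm, True), fit_coefs(X_cm, True)

print("Raw coefficients (size, texture)")
print("  size in mm:", raw_mm.round(3))
print("  size in cm:", raw_cm.round(3))
print("Standardized coefficients (size, texture)")
print("  size in mm:", std_mm.round(3))
print("  size in cm:", std_cm.round(3))
```

The raw size coefficient changes a lot with the unit even though the data are the same, while the standardized coefficients are essentially unchanged, which makes them a fairer basis for comparing features.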

**Your Task**: Identify which features have the strongest influence on predictions and interpret what this means clinically.

Write your code below:

In [None]:
# (Optional) Step 4: Feature Importance Analysis

# Find features with largest positive and negative coefficients
# Write your code here to identify most influential features


# Create a visualization of feature importance (optional)
# You could create a bar plot or horizontal bar plot of coefficients


# Interpretation questions (answer in comments):
# 1. Which features have the strongest positive coefficients (increase malignancy risk)?
# 2. Which features have the strongest negative coefficients (decrease malignancy risk)?
# 3. Do these results make biological/medical sense?

### Step 5: Business Cost Analysis

**Question**: Using your full-feature model from Part 3, calculate the business cost of classification errors using the following per-case costs:

- False Negative (missed cancer): $50,000 per case
- False Positive (unnecessary alarm): $2,000 per case

Compare this with the cost if you used the Part 2 model. Which model is more cost-effective?
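
The cost arithmetic is simply each error count multiplied by its per-case cost. The sketch below shows it on made-up labels, using the two costs listed above. The helper `total_error_cost` and the toy prediction arrays are hypothetical, so substitute your own `y_test` and predictions from each model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

FN_COST = 50_000  # missed cancer, per case (from the cost structure above)
FP_COST = 2_000   # unnecessary alarm, per case

def total_error_cost(y_true, y_pred, fn_cost=FN_COST, fp_cost=FP_COST):
    """Hypothetical helper: total cost = FN count * fn_cost + FP count * fp_cost."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(fn) * fn_cost + int(fp) * fp_cost

# Made-up labels and two made-up sets of predictions
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 0])
pred_model_a = np.array([1, 1, 0, 0, 0, 0, 0, 0])  # 1 false negative, 0 false positives
pred_model_b = np.array([1, 1, 1, 1, 1, 1, 0, 0])  # 0 false negatives, 3 false positives

cost_a = total_error_cost(y_true_demo, pred_model_a)
cost_b = total_error_cost(y_true_demo, pred_model_b)

print(f"Model A cost: ${cost_a:,}")
print(f"Model B cost: ${cost_b:,}")
```

In this toy case the model with more total errors is the cheaper one, because a single missed cancer outweighs several false alarms under this cost structure.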

In [None]:
# Step 5: Business Cost Analysis

# Calculate costs for full-feature model (Part 3)
# Write your code here


# Calculate costs for mean-only model (Part 2) 
# Write your code here


# Compare total costs and determine which model is more cost-effective
# Write your analysis here

## 🎓 Lab Summary & Wrap-Up

### ✅ What You Accomplished Today

Congratulations! You've completed a comprehensive analysis of classification models using real medical data. Here's what you mastered:

**Part 1 - Review**: 
- Complete logistic regression workflow with Default dataset
- Understanding baseline ratios and model evaluation metrics
- Interpreting results in business context (credit risk)

**Part 2 - Guided Practice**:
- Loading and exploring the Breast Cancer Wisconsin dataset
- Data preparation with feature selection (`_mean` features only)
- Model training and coefficient interpretation
- Classification evaluation in medical context

**Part 3 - Independent Analysis**:
- Building models with all 30 features
- Comparing model performance across different feature sets
- Understanding trade-offs between model complexity and performance

### 📊 Key Results to Save

**🚨 IMPORTANT: Save Your Results for Homework! 🚨**

Make sure you have calculated and recorded the following results from your analysis:

**From Part 3 (All Features Model)**:
- [ ] Training/test set sizes and malignant rates
- [ ] Model coefficients for each `_mean` feature
- [ ] Classification metrics: accuracy, precision, recall, F1-score, ROC-AUC
- [ ] Comparison of performance between mean-only vs full-feature models
- [ ] Feature importance insights (which features have strongest coefficients)
- [ ] Business cost analysis

### 💡 Key Learning Insights

**Model Performance**:
- How does adding more features affect model performance?
- Which evaluation metrics are most important for medical diagnosis?
- What are the trade-offs between false positives and false negatives in healthcare?

**Business Context**:
- Why might a model with high accuracy still be problematic for medical use?
- How do business costs influence model selection and threshold decisions?
- What factors beyond accuracy should influence model deployment decisions?

### 📋 Next Steps & Homework Preparation

**This Week's Homework**: 
Your homework will include specific questions about the models you built today. Make sure you can access:
- Your model performance metrics
- Specific coefficient values
- Predictions for individual observations
- Cost analysis results

**Study Tips**:
- Review Chapter 23 (Logistic Regression) and Chapter 24 (Classification Evaluation)
- Practice interpreting confusion matrices and ROC curves
- Understand the business implications of different error types

### 🔧 Before You Leave

**Save Your Work**:
1. **Save this notebook** with all your completed code and results
2. **Take screenshots** of key results (confusion matrices, metric summaries)
3. **Export your notebook** (File → Download as → HTML) as a backup
4. **Note key variable names** you used (e.g., model names, prediction arrays)

**Double-Check Your Results**:
- Did you use `RANDOM_STATE = 42` consistently?
- Are your train/test splits 70-30?
- Do you have both probability and binary predictions saved?
- Are your model performance metrics calculated correctly?

---

**🎯 Great work today!** You've gained hands-on experience with real-world classification problems and learned to evaluate models from both statistical and business perspectives. These skills are essential for data-driven decision making in healthcare, finance, and many other industries.

**Questions?** If you have any questions about your results or need clarification on concepts, reach out before the homework is due. Make sure you understand not just how to calculate the metrics, but what they mean in the context of medical diagnosis.