# Logistic Regression - Probabilities, Decision Boundaries, and Pipelines

<hr>

<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/main/notebooks/figures/mgmt_474_ai_logo_02-modified.png" width="200"/>
</div>
</center>

# <center><a class="tocSkip"></center>
# <center>MGMT47400 Predictive Analytics</center>
# <center>Professor: Davi Moreira </center>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/blob/main/notebooks/06_logistic_pipelines.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Fit logistic regression with preprocessing in a pipeline
2. Interpret probabilities vs classes (and why thresholds matter)
3. Use regularization in logistic regression for stability
4. Choose an appropriate baseline for classification
5. Document the classification objective and error costs

---

In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, log_loss, confusion_matrix, classification_report
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.precision', 4)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
RANDOM_SEED = 474
np.random.seed(RANDOM_SEED)
print("✓ Setup complete!")

## 1. Load Classification Dataset

In [None]:
# Load breast cancer dataset (binary classification)
data = load_breast_cancer(as_frame=True)
df = data.frame
X = data.data
y = data.target

print(f"Dataset: {data.DESCR.split('===')[0].strip()}")
print(f"\nShape: {X.shape}")
print(f"Target classes: {data.target_names}")
print("Class distribution:")
print(y.value_counts())
print(f"\nClass balance: {y.value_counts(normalize=True).round(3).to_dict()}")

# Split data
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=RANDOM_SEED, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=RANDOM_SEED, stratify=y_temp)

print(f"\nTrain: {len(X_train)} | Val: {len(X_val)} | Test: {len(X_test)} (locked)")
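The two-step split above gives roughly 60/20/20, because the second call takes 25% of the remaining 80%. A minimal sketch of that arithmetic follows. It uses a plain index array with the dataset's row count (569) instead of the real features, and `random_state=0` is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 569 rows of the breast cancer dataset
idx = np.arange(569)

# Step 1: hold out 20% as the test set
idx_temp, idx_test = train_test_split(idx, test_size=0.20, random_state=0)

# Step 2: 25% of the remaining 80% becomes validation (0.25 * 0.80 = 0.20)
idx_train, idx_val = train_test_split(idx_temp, test_size=0.25, random_state=0)

sizes = (len(idx_train), len(idx_val), len(idx_test))
fractions = tuple(round(n / len(idx), 2) for n in sizes)
print(sizes, fractions)
```

Adding `stratify` (as the notebook does) changes which rows land in each set, not how many.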

## 2. Classification Baselines

### Why Baselines Matter for Classification

**Common baselines:**
- **Most frequent class**: Always predict the majority class
- **Stratified random**: Predict classes proportional to training distribution
- **Domain heuristic**: Simple rule based on domain knowledge

**Key insight**: With imbalanced classes, even naive baselines can have high accuracy!
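To make the key insight concrete, here is a minimal sketch on synthetic labels. The 95/5 class split and the `_demo` variable names are assumptions chosen for illustration; they are not the course dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 950 negatives, 50 positives
y_demo = np.array([0] * 950 + [1] * 50)
X_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

majority = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred_majority = majority.predict(X_demo)

acc_majority = accuracy_score(y_demo, pred_majority)
recall_majority = recall_score(y_demo, pred_majority)  # recall for class 1

print(f"Accuracy: {acc_majority:.2f}")
print(f"Recall (class 1): {recall_majority:.2f}")
```

The baseline scores 95% accuracy while finding none of the positive cases, which is why accuracy alone is a weak yardstick under imbalance.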

In [None]:
# Most frequent class baseline
baseline_mf = DummyClassifier(strategy='most_frequent')
baseline_mf.fit(X_train, y_train)

y_pred_baseline = baseline_mf.predict(X_val)
baseline_acc = accuracy_score(y_val, y_pred_baseline)

print("=== BASELINE: MOST FREQUENT CLASS ===")
print(f"Validation Accuracy: {baseline_acc:.4f}")
print(f"\nThis baseline always predicts: {data.target_names[int(baseline_mf.predict(X_train.iloc[[0]])[0])]}")
print(f"\n⚠️ Accuracy can be misleading! We need better metrics.")

## 3. Logistic Regression: From Log-Odds to Probabilities

### The Math (Simplified)

**Linear combination:**
```
z = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ
```

**Logistic function (sigmoid):**
```
P(y=1|X) = 1 / (1 + e^(-z))
```

**Properties:**
- Output is always between 0 and 1 (valid probability)
- z = 0 → P = 0.5 (decision boundary)
- Large positive z → P ≈ 1
- Large negative z → P ≈ 0
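The linear combination `z` is the log-odds of class 1: `z = log(P / (1 - P))`. The sigmoid converts log-odds to a probability, and the logit converts it back. A small sketch in plain Python (the function names are illustrative helpers, not library calls):

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: map a probability to log-odds."""
    return math.log(p / (1.0 - p))

p_zero = sigmoid(0.0)    # the decision boundary
p_large = sigmoid(6.0)   # large positive z
p_small = sigmoid(-6.0)  # large negative z
round_trip = logit(sigmoid(1.5))  # recovers 1.5 up to floating-point error

print(p_zero, round(p_large, 4), round(p_small, 4), round(round_trip, 6))
```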

In [None]:
# Visualize sigmoid function
z = np.linspace(-10, 10, 200)
sigmoid = 1 / (1 + np.exp(-z))

plt.figure(figsize=(10, 6))
plt.plot(z, sigmoid, linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', label='Default threshold (0.5)')
plt.axvline(x=0, color='g', linestyle='--', alpha=0.5, label='Decision boundary (z=0)')
plt.xlabel('Linear Combination (z)')
plt.ylabel('Probability P(y=1|X)')
plt.title('Logistic (Sigmoid) Function')
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

print("💡 The sigmoid squashes any real number into the open interval (0, 1)")
print("💡 Default: if P > 0.5, predict class 1; else predict class 0")

## 4. Logistic Regression Pipeline

In [None]:
# Basic logistic regression pipeline
log_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=RANDOM_SEED, max_iter=1000))
])

log_pipeline.fit(X_train, y_train)

# Predictions
y_pred_train = log_pipeline.predict(X_train)
y_pred_val = log_pipeline.predict(X_val)

# Probabilities
y_proba_train = log_pipeline.predict_proba(X_train)
y_proba_val = log_pipeline.predict_proba(X_val)

print("=== LOGISTIC REGRESSION ===")
print(f"Train Accuracy: {accuracy_score(y_train, y_pred_train):.4f}")
print(f"Val Accuracy: {accuracy_score(y_val, y_pred_val):.4f}")
print(f"\nTrain Log Loss: {log_loss(y_train, y_proba_train):.4f}")
print(f"Val Log Loss: {log_loss(y_val, y_proba_val):.4f}")

print(f"\n✓ Improvement over baseline: {accuracy_score(y_val, y_pred_val) - baseline_acc:.4f}")
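The pipeline's outputs tie back to the math in Section 3. For a binary problem, `decision_function` returns the log-odds `z`, `predict_proba` applies the sigmoid to it, and `predict` thresholds the result at 0.5. The sketch below checks this on the full dataset. It refits a separate pipeline (`demo_pipe`, an illustrative name) so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

z_demo = demo_pipe.decision_function(X_demo)        # log-odds of class 1
proba_demo = demo_pipe.predict_proba(X_demo)[:, 1]  # P(y=1|X)
manual_proba = 1 / (1 + np.exp(-z_demo))            # sigmoid applied by hand

labels_from_proba = (proba_demo > 0.5).astype(int)
agreement = (labels_from_proba == demo_pipe.predict(X_demo)).mean()

print("Sigmoid of decision_function matches predict_proba:",
      np.allclose(manual_proba, proba_demo))
print(f"Agreement between thresholded probabilities and predict(): {agreement:.3f}")
```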

## 📝 PAUSE-AND-DO Exercise 1 (10 minutes)

**Task:** Build a logistic regression pipeline and compute validation accuracy and log loss.

The pipeline is already built above. Now:
1. Look at the probabilities for a few samples
2. Understand the difference between `.predict()` and `.predict_proba()`
3. Explain why log loss might be better than accuracy

---

In [None]:
# Examine predictions vs probabilities
sample_df = pd.DataFrame({
    'True_Label': y_val.iloc[:10].values,
    'Predicted_Class': y_pred_val[:10],
    'Prob_Class_0': y_proba_val[:10, 0],
    'Prob_Class_1': y_proba_val[:10, 1],
    'Correct': y_val.iloc[:10].values == y_pred_val[:10]
})

print("=== SAMPLE PREDICTIONS ===")
print(sample_df)

print("\n💡 Notice:")
print("  - Probabilities sum to 1.0 for each sample")
print("  - Predicted class = argmax(probabilities)")
print("  - Some predictions are confident (prob close to 0 or 1)")
print("  - Some predictions are uncertain (prob close to 0.5)")

### YOUR ANALYSIS:

**Question 1: What's the difference between `.predict()` and `.predict_proba()`?**  
[Your answer]

**Question 2: Why might log loss be better than accuracy?**  
[Your answer - hint: think about probability quality]

**Question 3: What does a probability of 0.51 vs 0.99 tell you?**  
[Your answer - both predict class 1, but...]

---
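One way to see what log loss adds: two models can make identical class predictions and still differ in how confident they are. The labels and probabilities below are hypothetical, made up for this illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = np.array([1, 1, 0, 0])

# Hypothetical P(y=1) from two models; both rank every sample correctly
proba_confident = np.array([0.90, 0.80, 0.20, 0.10])
proba_hesitant = np.array([0.60, 0.55, 0.45, 0.40])

acc_confident = accuracy_score(y_true, (proba_confident >= 0.5).astype(int))
acc_hesitant = accuracy_score(y_true, (proba_hesitant >= 0.5).astype(int))

ll_confident = log_loss(y_true, proba_confident)
ll_hesitant = log_loss(y_true, proba_hesitant)

print(f"Accuracy: {acc_confident:.2f} vs {acc_hesitant:.2f}")
print(f"Log loss: {ll_confident:.4f} vs {ll_hesitant:.4f}")
```

Accuracy cannot tell the two models apart, while log loss rewards the one whose probabilities sit closer to the true labels. It also penalizes confident mistakes heavily, which accuracy ignores.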

## 5. Thresholding Matters!

In [None]:
# Try different thresholds
thresholds = [0.3, 0.5, 0.7, 0.9]
threshold_results = []

for thresh in thresholds:
    y_pred_thresh = (y_proba_val[:, 1] >= thresh).astype(int)
    acc = accuracy_score(y_val, y_pred_thresh)
    cm = confusion_matrix(y_val, y_pred_thresh)
    
    threshold_results.append({
        'Threshold': thresh,
        'Accuracy': acc,
        'Predicted_Positive': y_pred_thresh.sum(),
        'Predicted_Negative': len(y_pred_thresh) - y_pred_thresh.sum()
    })

results_df = pd.DataFrame(threshold_results)
print("=== THRESHOLD SENSITIVITY ===")
print(results_df)

print("\n💡 Key insight: Changing the threshold changes predictions!")
print("💡 Default 0.5 is not always optimal")
print("💡 We'll explore this more in upcoming notebooks")

## 📝 PAUSE-AND-DO Exercise 2 (10 minutes)

**Task:** Move the threshold away from 0.5 and observe how the metrics shift.

The cell above already evaluates four thresholds. Use its output to answer the questions below.

---

### YOUR OBSERVATIONS:

**Observation 1: What happens when you lower the threshold?**  
[Hint: more/fewer positive predictions?]

**Observation 2: What happens when you raise the threshold?**  
[Hint: how does it affect prediction distribution?]

**Observation 3: When might you want a threshold other than 0.5?**  
[Think about business costs]

---
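As a sketch of how business costs can drive the threshold, the example below scores each candidate threshold by total misclassification cost. The labels, probabilities, and the two cost values are all hypothetical; in practice the costs come from the application.

```python
import numpy as np

# Hypothetical validation labels and predicted P(y=1); not the notebook's data
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.10, 0.35, 0.45, 0.60, 0.40, 0.55, 0.80, 0.95])

COST_FP = 1.0  # assumed cost of a false positive
COST_FN = 5.0  # assumed cost of a false negative

def total_cost(threshold):
    """Total cost of the errors made when predicting 1 for proba >= threshold."""
    pred = (proba >= threshold).astype(int)
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    return COST_FP * fp + COST_FN * fn

candidate_thresholds = [0.3, 0.5, 0.7, 0.9]
costs = {t: total_cost(t) for t in candidate_thresholds}
best_threshold = min(costs, key=costs.get)

print(costs)
print(f"Lowest-cost threshold: {best_threshold}")
```

Because a false negative is assumed to cost five times as much as a false positive, the lowest threshold wins here: it accepts extra false positives to avoid missing any positives.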

## 6. Regularized Logistic Regression

In [None]:
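# Regularized logistic regression (lower C = stronger regularization).
# Note: scikit-learn's LogisticRegression applies an L2 penalty with C=1.0 by default,
# so this pipeline is equivalent to the one in Section 4.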
log_reg_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(C=1.0, random_state=RANDOM_SEED, max_iter=1000))
])

log_reg_pipeline.fit(X_train, y_train)

# Compare different C values
C_values = [0.01, 0.1, 1.0, 10.0, 100.0]
reg_results = []

for C in C_values:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(C=C, random_state=RANDOM_SEED, max_iter=1000))
    ])
    pipe.fit(X_train, y_train)
    
    train_acc = pipe.score(X_train, y_train)
    val_acc = pipe.score(X_val, y_val)
    
    reg_results.append({
        'C': C,
        'Train_Acc': train_acc,
        'Val_Acc': val_acc,
        'Gap': train_acc - val_acc
    })

reg_df = pd.DataFrame(reg_results)
print("=== REGULARIZATION STRENGTH (C parameter) ===")
print(reg_df)
print("\n💡 Lower C = stronger regularization")
print("💡 Look for good validation performance without huge train-val gap")

best_C = reg_df.loc[reg_df['Val_Acc'].idxmax(), 'C']
print(f"\n✓ Best validation accuracy at C = {best_C}")
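Accuracy alone does not show what regularization does to the model. With the default L2 penalty, a smaller `C` shrinks the coefficients toward zero. A minimal sketch, fitted on the full dataset for simplicity (the `coef_norms` name and the three `C` values are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(C=C, max_iter=5000)),
    ]).fit(X_demo, y_demo)
    # L2 norm of the fitted coefficient vector
    coef_norms[C] = float(np.linalg.norm(pipe.named_steps["classifier"].coef_))

for C, norm in coef_norms.items():
    print(f"C = {C:>6}: ||coef|| = {norm:.3f}")
```

Smaller coefficients mean each feature moves the log-odds less, which tends to make the model more stable when features are correlated, as many are in this dataset.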

## 7. Confusion Matrix

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix
cm = confusion_matrix(y_val, y_pred_val)

fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=data.target_names)
disp.plot(ax=ax, cmap='Blues', values_format='d')
plt.title('Confusion Matrix - Validation Set')
plt.tight_layout()
plt.show()

print("\n=== CONFUSION MATRIX INTERPRETATION ===")
print(f"True Negatives (TN): {cm[0, 0]}")
print(f"False Positives (FP): {cm[0, 1]}")
print(f"False Negatives (FN): {cm[1, 0]}")
print(f"True Positives (TP): {cm[1, 1]}")

print("\n💡 In this dataset 0 = malignant and 1 = benign, so 'positive' means benign:")
print("   FP = Missed diagnosis (predicted benign, actually malignant)")
print("   FN = False alarm (predicted malignant, actually benign)")
print("\n⚠️ Which error is more costly? This drives threshold choice!")
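scikit-learn treats label 1 as the positive class by default, and in `load_breast_cancer` label 1 is benign. If you prefer to read the errors from the malignant perspective, one option is to reorder the labels. The sketch below uses a handful of hypothetical predictions to show the idea.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, recall_score

# Confirm the dataset's encoding: index 0 is malignant, index 1 is benign
names = list(load_breast_cancer().target_names)

# Hypothetical labels and predictions using that encoding
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0])

# labels=[1, 0] lists benign first, so malignant (0) takes the "positive" slot
cm_malignant = confusion_matrix(y_true, y_pred, labels=[1, 0])
tn, fp, fn, tp = (int(v) for v in cm_malignant.ravel())

# Recall for the malignant class: share of malignant cases that were caught
malignant_recall = recall_score(y_true, y_pred, pos_label=0)

print(names)
print(f"TN={tn}, FP={fp} (false alarms), FN={fn} (missed malignant), TP={tp}")
print(f"Malignant recall: {malignant_recall:.3f}")
```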

## 8. Wrap-Up: Key Takeaways

### What We Learned Today:

1. **Logistic Regression**: Maps linear combinations to probabilities via sigmoid
2. **Probabilities vs Classes**: `.predict_proba()` gives you more information than `.predict()`
3. **Thresholds Matter**: Default 0.5 is not always optimal
4. **Baselines**: With imbalanced classes, even naive strategies can reach high accuracy
5. **Regularization**: Lower C means stronger regularization; tune C to limit overfitting

### Critical Rules:

> **"Always look at probabilities, not just classes"**

> **"Accuracy is not enough - confusion matrix reveals errors"**

> **"Thresholds should be tuned to business costs"**

### Next Steps:

- Next notebook: Classification metrics (precision, recall, ROC, PR curves)
- We'll learn how to systematically choose thresholds
- Class imbalance handling strategies

---

## Bibliography

- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). *An Introduction to Statistical Learning with Applications in Python* - Classification chapter
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* - Logistic regression foundations
- scikit-learn API Reference: [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- scikit-learn User Guide: [Probability calibration](https://scikit-learn.org/stable/modules/calibration.html)

---



<center>

Thank you!

</center>