# Logistic Regression - Probabilities, Decision Boundaries, and Pipelines

<hr>

<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/main/notebooks/figures/mgmt_474_ai_logo_02-modified.png" width="200"/>
</div>
</center>

# <center>MGMT47400 Predictive Analytics</center>
# <center>Professor: Davi Moreira </center>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/blob/main/notebooks/06_logistic_pipelines.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Fit logistic regression with preprocessing in a pipeline
2. Interpret probabilities vs classes (and why thresholds matter)
3. Use regularization in logistic regression for stability
4. Choose an appropriate baseline for classification
5. Document the classification objective and error costs

---

In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, log_loss, confusion_matrix, classification_report
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.precision', 4)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
RANDOM_SEED = 474
np.random.seed(RANDOM_SEED)
print("✓ Setup complete!")

**Reading the output:**

The `Setup complete!` message confirms that all imports loaded. This notebook introduces several classification-specific tools: `LogisticRegression` for the model, `DummyClassifier` for baselines, `accuracy_score` and `log_loss` for evaluation, and `confusion_matrix` plus `classification_report` for detailed error analysis. We also import `load_breast_cancer` and `make_classification` from `sklearn.datasets`. The usual display settings and **RANDOM_SEED = 474** remain in effect.

**Why this matters:** Classification problems require a different toolkit than regression. Metrics like accuracy, log loss, and confusion matrices replace MAE, RMSE, and R². Recognizing which tools belong to which problem type is a foundational skill.

---

## 1. Load Classification Dataset

We switch from regression to classification using the **Breast Cancer Wisconsin** dataset, a classic binary-classification benchmark included in scikit-learn. Each of the **569 samples** represents a digitized image of a fine-needle aspirate of a breast mass, described by **30 numeric features** (mean, standard error, and worst of 10 measurements like radius, texture, and symmetry). The target is **0 = malignant** or **1 = benign**.
Splits are stratified to preserve the original class balance in every partition.

In [None]:
# Load breast cancer dataset (binary classification)
data = load_breast_cancer(as_frame=True)
df = data.frame
X = data.data
y = data.target

print(f"Dataset: {data.DESCR.split('===')[0].strip()}")
print(f"\nShape: {X.shape}")
print(f"Target classes: {data.target_names}")
print(f"Class distribution:")
print(y.value_counts())
print(f"\nClass balance: {y.value_counts(normalize=True).round(3).to_dict()}")

# Split data
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=RANDOM_SEED, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=RANDOM_SEED, stratify=y_temp)

print(f"\nTrain: {len(X_train)} | Val: {len(X_val)} | Test: {len(X_test)} (locked)")

**Reading the output:**

The output shows the dataset has **569 samples** and **30 features**. The target classes are `malignant` (0) and `benign` (1), with a class distribution of roughly **212 malignant** and **357 benign** samples -- approximately a **37%/63%** split. This mild imbalance is important: a model that always predicts "benign" would already achieve ~63% accuracy, so raw accuracy alone is not a reliable performance indicator. The stratified splits produce approximately **341 training**, **114 validation**, and **114 test** samples, each preserving the 37/63 class ratio.

**Key takeaway:** Always inspect class balance before modeling. When one class dominates, accuracy inflates and you need additional metrics (precision, recall, confusion matrix) to assess real performance.
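
One quick way to confirm that stratification did its job is to compare class proportions across partitions. The sketch below uses a small synthetic label vector (an assumption for illustration, not the notebook's data) with roughly the same 37/63 mix; `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the same 37/63 mix as the breast cancer data
y_demo = np.array([0] * 37 + [1] * 63)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

# Share of class 1 in each partition; stratification keeps these close
share_full = y_demo.mean()
share_train = y_tr.mean()
share_test = y_te.mean()
print(f"full={share_full:.3f}  train={share_train:.3f}  test={share_test:.3f}")
```

Without `stratify=`, the shares in small partitions can drift noticeably from the full-data share, which makes validation metrics harder to compare.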

---

## 2. Classification Baselines

### Why Baselines Matter for Classification

**Common baselines:**
- **Most frequent class**: Always predict the majority class
- **Stratified random**: Predict classes proportional to training distribution
- **Domain heuristic**: Simple rule based on domain knowledge

**Key insight**: With imbalanced classes, even naive baselines can have high accuracy!
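
As a rough illustration of the first two baselines, the sketch below compares `most_frequent` and `stratified` strategies on hypothetical labels that are 90% class 1. The toy arrays are assumptions made for this example only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels (90% class 1); dummies ignore the features
y_toy = np.array([0] * 10 + [1] * 90)
X_toy = np.zeros((len(y_toy), 1))

most_frequent = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
stratified = DummyClassifier(strategy="stratified", random_state=0).fit(X_toy, y_toy)

acc_mf = accuracy_score(y_toy, most_frequent.predict(X_toy))
acc_strat = accuracy_score(y_toy, stratified.predict(X_toy))
print(f"most_frequent accuracy: {acc_mf:.2f}")
print(f"stratified accuracy:    {acc_strat:.2f}")
```

The majority-class baseline scores 90% here without using a single feature, which is why accuracy needs a baseline next to it before it means anything.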

In [None]:
# Most frequent class baseline
baseline_mf = DummyClassifier(strategy='most_frequent')
baseline_mf.fit(X_train, y_train)

y_pred_baseline = baseline_mf.predict(X_val)
baseline_acc = accuracy_score(y_val, y_pred_baseline)

print("=== BASELINE: MOST FREQUENT CLASS ===")
print(f"Validation Accuracy: {baseline_acc:.4f}")
print(f"\nThis baseline always predicts: {data.target_names[int(baseline_mf.predict(X_train.iloc[[0]])[0])]}")
print(f"\n⚠️ Accuracy can be misleading! We need better metrics.")

**Reading the output:**

The `DummyClassifier(strategy='most_frequent')` always predicts the majority class (benign). Its validation accuracy is approximately **0.63**, which is simply the proportion of benign samples in the validation set. The warning below the score emphasizes that this number can be misleadingly high: the model has learned *nothing* about the data and would miss every single malignant case.

**Why this matters:** This baseline accuracy sets the floor. Any real classifier must beat **~63%** to demonstrate that it has learned something useful. More importantly, we need metrics beyond accuracy to ensure the model actually detects malignant tumors, not just defaults to the majority class.

---

## 3. Logistic Regression: From Log-Odds to Probabilities

### The Math (Simplified)

**Linear combination:**
```
z = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ
```

**Logistic function (sigmoid):**
```
P(y=1|X) = 1 / (1 + e^(-z))
```

**Properties:**
- Output is always between 0 and 1 (valid probability)
- z = 0 → P = 0.5 (decision boundary)
- Large positive z → P ≈ 1
- Large negative z → P ≈ 0
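
The properties above can be checked numerically. This is a minimal sketch using only the standard library; the helper names `sigmoid` and `logit` are chosen for this example.

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds value z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Inverse of the sigmoid: map a probability back to log-odds."""
    return math.log(p / (1.0 - p))

for z in (-5.0, 0.0, 5.0):
    print(f"z = {z:+.1f} -> P(y=1) = {sigmoid(z):.4f}")

# Round trip: logit undoes sigmoid, so z is the log-odds of class 1
z_back = logit(sigmoid(2.0))
print(f"logit(sigmoid(2.0)) = {z_back:.4f}")
```

The round trip is why the section title speaks of log-odds: the linear combination z is the log of P / (1 - P).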

In [None]:
# Visualize sigmoid function
z = np.linspace(-10, 10, 200)
sigmoid = 1 / (1 + np.exp(-z))

plt.figure(figsize=(10, 6))
plt.plot(z, sigmoid, linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', label='Default threshold (0.5)')
plt.axvline(x=0, color='g', linestyle='--', alpha=0.5, label='Decision boundary (z=0)')
plt.xlabel('Linear Combination (z)')
plt.ylabel('Probability P(y=1|X)')
plt.title('Logistic (Sigmoid) Function')
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

print("💡 The sigmoid squashes any real number into the open interval (0, 1)")
print("💡 Default: if P > 0.5, predict class 1; else predict class 0")

**Reading the output:**

The plot shows the classic **S-shaped sigmoid curve** mapping any real-valued linear combination z to a probability between 0 and 1. The red dashed line at **P = 0.5** marks the default decision threshold, and the green dashed line at **z = 0** marks the corresponding input value. When z is large and positive, the sigmoid saturates near 1 (high confidence in class 1); when z is large and negative, it saturates near 0 (high confidence in class 0). The transition region around z = 0 is where the model is most uncertain.

**Key takeaway:** The sigmoid function is what makes logistic regression a *probability* model rather than just a classifier. Understanding this curve helps you reason about confidence: a prediction of P = 0.99 is qualitatively different from P = 0.51, even though both produce the same class label.

---

## 4. Logistic Regression Pipeline

We now build a proper scikit-learn `Pipeline` that chains `StandardScaler` (zero-mean, unit-variance normalization) with `LogisticRegression`. Scaling is especially important for logistic regression because the model's convergence and regularization behavior depend on feature magnitudes.
The pipeline reports both **accuracy** (fraction of correct predictions) and **log loss** (a probability-quality metric that penalizes confident wrong predictions).
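
To see what the pipeline does with the scaler, the sketch below fits the same two-step structure on hypothetical data with features on very different scales, then checks that the scaler's stored mean matches the data passed to `fit()`. The synthetic arrays and their names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales
X_fit = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 200.0], size=(80, 2))
y_fit = (X_fit[:, 0] + (X_fit[:, 1] - 1000.0) / 200.0 > 0).astype(int)
X_new = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 200.0], size=(20, 2))

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_fit, y_fit)

# The scaler's statistics come only from the data passed to fit()
scaler_mean = pipe.named_steps["scaler"].mean_
print("scaler mean:  ", scaler_mean.round(2))
print("fit-data mean:", X_fit.mean(axis=0).round(2))

# predict_proba() reuses the stored statistics; it does not refit on X_new
proba_new = pipe.predict_proba(X_new)
```

Because the scaler is refit only when `fit()` is called, validation and test rows never influence the scaling parameters.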

In [None]:
# Basic logistic regression pipeline
log_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=RANDOM_SEED, max_iter=1000))
])

log_pipeline.fit(X_train, y_train)

# Predictions
y_pred_train = log_pipeline.predict(X_train)
y_pred_val = log_pipeline.predict(X_val)

# Probabilities
y_proba_train = log_pipeline.predict_proba(X_train)
y_proba_val = log_pipeline.predict_proba(X_val)

print("=== LOGISTIC REGRESSION ===")
print(f"Train Accuracy: {accuracy_score(y_train, y_pred_train):.4f}")
print(f"Val Accuracy: {accuracy_score(y_val, y_pred_val):.4f}")
print(f"\nTrain Log Loss: {log_loss(y_train, y_proba_train):.4f}")
print(f"Val Log Loss: {log_loss(y_val, y_proba_val):.4f}")

print(f"\n✓ Improvement over baseline: {accuracy_score(y_val, y_pred_val) - baseline_acc:.4f}")

**Reading the output:**

The logistic regression pipeline reports **Train Accuracy** and **Val Accuracy**, both likely above **0.96**, representing a massive jump from the ~63% baseline. **Log loss** is also printed for both sets; lower is better, and values around **0.08--0.12** indicate that the model assigns high probability to the correct class for most samples. A low log loss is consistent with good calibration but does not prove it; calibration needs its own check. The "Improvement over baseline" line quantifies the accuracy gain -- typically **+0.33** or more. The small gap between train and validation metrics suggests minimal overfitting.

**Why this matters:** Accuracy alone does not tell you *where* the model makes mistakes. Two models can have 97% accuracy but differ dramatically in which errors they make (false positives vs. false negatives). Log loss rewards well-calibrated probabilities, making it a more informative optimization target than raw accuracy.
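
A small hand computation may make the difference between accuracy and log loss concrete. The sketch below uses made-up labels and probabilities (assumptions for illustration): two hypothetical models make the same single mistake, but one is far more confident about it.

```python
import math

def binary_log_loss(y_true, p_class1):
    """Average negative log-likelihood for binary labels and P(class 1)."""
    total = 0.0
    for y, p in zip(y_true, p_class1):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 1, 0, 0]

# Both models misclassify only the last sample at the 0.5 threshold
cautious = [0.9, 0.8, 0.2, 0.6]
overconfident = [0.9, 0.8, 0.2, 0.99]

loss_cautious = binary_log_loss(y_true, cautious)
loss_overconfident = binary_log_loss(y_true, overconfident)
print(f"cautious:      {loss_cautious:.3f}")
print(f"overconfident: {loss_overconfident:.3f}")
```

Both models have 75% accuracy, yet the overconfident one is penalized much more heavily, because log loss grows without bound as the probability assigned to the true class approaches zero.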

---

## 📝 PAUSE-AND-DO Exercise 1 (5 minutes)

**Task:** Build a logistic pipeline and compute validation accuracy and log loss.

The pipeline is already built above. Now:
1. Look at the probabilities for a few samples
2. Understand the difference between `.predict()` and `.predict_proba()`
3. Explain why log loss might be better than accuracy

---

In [None]:
# Examine predictions vs probabilities
sample_df = pd.DataFrame({
    'True_Label': y_val.iloc[:10].values,
    'Predicted_Class': y_pred_val[:10],
    'Prob_Class_0': y_proba_val[:10, 0],
    'Prob_Class_1': y_proba_val[:10, 1],
    'Correct': y_val.iloc[:10].values == y_pred_val[:10]
})

print("=== SAMPLE PREDICTIONS ===")
print(sample_df)

print("\n💡 Notice:")
print("  - Probabilities sum to 1.0 for each sample")
print("  - Predicted class = argmax(probabilities)")
print("  - Some predictions are confident (prob close to 0 or 1)")
print("  - Some predictions are uncertain (prob close to 0.5)")

**Reading the output:**

The 10-row table shows, for each validation sample: the **True_Label**, the **Predicted_Class**, the probability assigned to **class 0** (malignant) and **class 1** (benign), and whether the prediction was **Correct**. Notice that the two probability columns always sum to **1.0** for each row. Some predictions are highly confident (e.g., Prob_Class_1 > 0.99), while others are closer to 0.5, indicating uncertainty. The `Predicted_Class` column is the class with the larger probability, which for two classes is the same as applying the default 0.5 threshold to Prob_Class_1.

**Key takeaway:** Probabilities carry more information than hard labels. A prediction of P(malignant) = 0.48 (just below the 0.5 threshold) is treated the same as P(malignant) = 0.01 in terms of the predicted class, but a clinician would want to know about the 0.48 case. This is why examining `predict_proba()` output is essential.
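
The relationship between the two methods can be verified directly. This is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration, not the notebook's dataset): it rebuilds the hard labels from the probability matrix and compares them with `.predict()`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used only to illustrate the two methods
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

proba = clf.predict_proba(X_syn)   # shape (n_samples, 2); columns follow clf.classes_
labels = clf.predict(X_syn)        # shape (n_samples,)

# Rebuild the hard labels by picking the most probable class per row
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
agreement = np.mean(labels == labels_from_proba)
print(f"classes_: {clf.classes_}, agreement: {agreement:.3f}")
```

The column order of `predict_proba()` follows `clf.classes_`, so it is worth checking that attribute before indexing column 0 or 1.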

---

### YOUR ANALYSIS:

**Question 1: What's the difference between `.predict()` and `.predict_proba()`?**  
[Your answer]

**Question 2: Why might log loss be better than accuracy?**  
[Your answer - hint: think about probability quality]

**Question 3: What does a probability of 0.51 vs 0.99 tell you?**  
[Your answer - both predict class 1, but...]

---

## 5. Thresholding Matters!

Logistic regression outputs a probability for each class, but the final prediction depends on a **threshold**: if P(class 1) >= threshold, predict class 1. The default is 0.5, but this is not always optimal. Lowering the threshold makes the model more eager to predict the positive class (higher recall, more false positives); raising it makes the model more conservative (higher precision, more false negatives).
The sweep below tests four thresholds to illustrate how this single number reshapes the entire prediction profile.

In [None]:
# Try different thresholds
thresholds = [0.3, 0.5, 0.7, 0.9]
threshold_results = []

for thresh in thresholds:
    y_pred_thresh = (y_proba_val[:, 1] >= thresh).astype(int)
    acc = accuracy_score(y_val, y_pred_thresh)
    cm = confusion_matrix(y_val, y_pred_thresh)
    
    threshold_results.append({
        'Threshold': thresh,
        'Accuracy': acc,
        'Predicted_Positive': y_pred_thresh.sum(),
        'Predicted_Negative': len(y_pred_thresh) - y_pred_thresh.sum()
    })

results_df = pd.DataFrame(threshold_results)
print("=== THRESHOLD SENSITIVITY ===")
print(results_df)

print("\n💡 Key insight: Changing the threshold changes predictions!")
print("💡 Default 0.5 is not always optimal")
print("💡 We'll explore this more in upcoming notebooks")

**Reading the output:**

The threshold sensitivity table shows four rows corresponding to thresholds **0.3, 0.5, 0.7, and 0.9**. As the threshold increases, the model becomes more conservative about predicting the positive class (benign): the **Predicted_Positive** count drops and **Predicted_Negative** rises. Accuracy may actually decrease at extreme thresholds because the model starts misclassifying clear benign cases. At a low threshold (0.3), more borderline samples are predicted positive, which raises recall for the benign class but increases the chance that a malignant case is labelled benign.

**Why this matters:** In medical diagnosis, the *cost* of each error type drives threshold choice. Keep the label encoding in mind: here class 1 (the positive class) is **benign**, so the threshold applies to P(benign). If missing a malignant tumor is catastrophic, you would *raise* the threshold so that a sample is called benign only when the model is very confident, accepting more benign cases flagged as malignant. If unnecessary biopsies are the bigger concern, you would lower it. There is no universally correct threshold -- it depends on the business or clinical context.
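
One hedged way to turn error costs into a threshold is to total the cost of the mistakes at each candidate and pick the cheapest. In the sketch below the labels, scores, and the cost values `COST_FP` and `COST_FN` are all hypothetical. "Positive" means class 1 in scikit-learn's sense, so which clinical mistake counts as FP or FN depends on how the labels are encoded.

```python
import numpy as np

# Hypothetical validation labels and P(class 1) scores, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
p_class1 = np.array([0.05, 0.20, 0.45, 0.70, 0.35, 0.55, 0.65, 0.80, 0.90, 0.95])

# Assumed error costs (not from the notebook): a false positive is 10x worse
COST_FP = 10.0
COST_FN = 1.0

def total_cost(threshold: float) -> float:
    """Total cost of the errors made when predicting class 1 at P >= threshold."""
    y_pred = (p_class1 >= threshold).astype(int)
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return COST_FP * fp + COST_FN * fn

candidates = [0.3, 0.5, 0.7, 0.9]
costs = {t: total_cost(t) for t in candidates}
best_threshold = min(costs, key=costs.get)
print(costs)
print("lowest-cost threshold:", best_threshold)
```

With false positives priced this high, the cheapest candidate is the strictest one; swapping the two costs would push the choice toward a lower threshold.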

---

## 📝 PAUSE-AND-DO Exercise 2 (5 minutes)

**Task:** Change the threshold from 0.5 and observe how the metrics shift.

Already done above. Now answer:

---

### YOUR OBSERVATIONS:

**Observation 1: What happens when you lower the threshold?**  
[Hint: more/fewer positive predictions?]

**Observation 2: What happens when you raise the threshold?**  
[Hint: how does it affect prediction distribution?]

**Observation 3: When might you want a threshold other than 0.5?**  
[Think about business costs]

---

## 6. Regularized Logistic Regression

Scikit-learn's `LogisticRegression` applies L2 regularization by default, controlled by the inverse-regularization parameter **C** (lower C = stronger penalty). This is the classification analogue of Ridge regression.
We sweep C across five values spanning four orders of magnitude (0.01 to 100) to observe how regularization strength affects training accuracy, validation accuracy, and the gap between them.

In [None]:
# Regularized logistic regression (lower C = stronger regularization)
log_reg_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(C=1.0, random_state=RANDOM_SEED, max_iter=1000))
])

log_reg_pipeline.fit(X_train, y_train)

# Compare different C values
C_values = [0.01, 0.1, 1.0, 10.0, 100.0]
reg_results = []

for C in C_values:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(C=C, random_state=RANDOM_SEED, max_iter=1000))
    ])
    pipe.fit(X_train, y_train)
    
    train_acc = pipe.score(X_train, y_train)
    val_acc = pipe.score(X_val, y_val)
    
    reg_results.append({
        'C': C,
        'Train_Acc': train_acc,
        'Val_Acc': val_acc,
        'Gap': train_acc - val_acc
    })

reg_df = pd.DataFrame(reg_results)
print("=== REGULARIZATION STRENGTH (C parameter) ===")
print(reg_df)
print("\n💡 Lower C = stronger regularization")
print("💡 Look for good validation performance without huge train-val gap")

best_C = reg_df.loc[reg_df['Val_Acc'].idxmax(), 'C']
print(f"\n✓ Best validation accuracy at C = {best_C}")

**Reading the output:**

The table shows five rows for **C = 0.01, 0.1, 1.0, 10.0, and 100.0**. At very low C (strong regularization), both train and validation accuracy may dip slightly because the model is overly constrained. As C increases (weaker regularization), accuracy improves and then plateaus. The **Gap** column (train minus validation accuracy) should remain small across all C values, indicating that logistic regression is not highly prone to overfitting on this 30-feature dataset. The "Best validation accuracy at C = ..." line identifies the sweet spot.

**Key takeaway:** The C parameter in logistic regression is the inverse of alpha in Ridge/Lasso: **small C = strong penalty**. In practice, you would use cross-validation (e.g., `LogisticRegressionCV`) to select C automatically, just as we used `RidgeCV` and `LassoCV` in the previous notebook.
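
A minimal sketch of cross-validated selection of C with `LogisticRegressionCV` follows. The grid of `Cs`, `cv=5`, and `max_iter=5000` are illustrative choices rather than values from the notebook, and the example fits on the full dataset only to stay self-contained; in the notebook's workflow you would pass `X_train, y_train` instead.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)

cv_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    # Cs and cv=5 are illustrative choices, not values from the notebook
    ("classifier", LogisticRegressionCV(Cs=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5, max_iter=5000)),
])
cv_pipeline.fit(X_bc, y_bc)

# C_ holds the selected value; ravel() copes with array-shaped output
chosen_C = float(np.ravel(cv_pipeline.named_steps["classifier"].C_)[0])
print(f"C selected by 5-fold CV: {chosen_C}")
```

One caveat with this layout: the scaler is fit once on everything passed to `fit()`, so the internal folds share scaling statistics. Wrapping the whole pipeline in `GridSearchCV` would re-estimate the scaler inside each fold.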

---

## 7. Confusion Matrix

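A confusion matrix tabulates all four outcomes of a binary classifier: true positives, true negatives, false positives, and false negatives. In a medical context like breast cancer diagnosis, missing a malignant tumor is far more costly than ordering an unnecessary biopsy. Note that scikit-learn treats class 1 (**benign**) as the positive class here, so a missed malignant tumor (predicted benign, actually malignant) is counted as a **false positive**, and a false alarm (predicted malignant, actually benign) as a **false negative**.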
The heatmap below makes these counts immediately visible, and the printed interpretation maps each cell to its clinical meaning.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix
cm = confusion_matrix(y_val, y_pred_val)

fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=data.target_names)
disp.plot(ax=ax, cmap='Blues', values_format='d')
plt.title('Confusion Matrix - Validation Set')
plt.tight_layout()
plt.show()

print("\n=== CONFUSION MATRIX INTERPRETATION ===")
print("Positive class = 1 (benign); negative class = 0 (malignant)")
print(f"True Negatives (TN): {cm[0, 0]}")
print(f"False Positives (FP): {cm[0, 1]}")
print(f"False Negatives (FN): {cm[1, 0]}")
print(f"True Positives (TP): {cm[1, 1]}")

print("\n💡 With this encoding:")
print("   FP = Missed diagnosis (predicted benign, actually malignant)")
print("   FN = False alarm (predicted malignant, actually benign)")
print("\n⚠️ Which error is more costly? This drives threshold choice!")

**Reading the output:**

The confusion matrix heatmap is a 2x2 grid with true labels on the y-axis and predicted labels on the x-axis. Because scikit-learn treats class 1 (benign) as the positive class, the diagonal cells (top-left and bottom-right) show correct predictions: **true negatives** (correctly identified malignant) and **true positives** (correctly identified benign). The off-diagonal cells show errors: **false positives** (top-right, predicted benign but actually malignant) and **false negatives** (bottom-left, predicted malignant but actually benign). The printed counts below the chart give exact numbers for each cell. In this medical context, missed malignant cases are the most dangerous error, and under this encoding they sit in the **false positive** cell (top-right), not the false negative cell.

**Why this matters:** The confusion matrix is the foundation for precision, recall, F1-score, and all other classification metrics you will encounter in the next notebook. Learning to read it fluently -- and to identify which cell represents the most costly error for your specific problem -- is a core skill in applied machine learning.
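
Which cell is called a "false positive" depends on which class is treated as positive. The sketch below uses made-up labels in the dataset's encoding (0 = malignant, 1 = benign) and shows how the `labels` argument of `confusion_matrix` changes the naming without changing the underlying errors.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels using the dataset's encoding: 0 = malignant, 1 = benign
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 1, 1, 1, 0])

# Default ordering sorts labels, so class 1 (benign) plays the "positive" role
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"benign as positive:    TN={tn} FP={fp} FN={fn} TP={tp}")

# Listing malignant last makes it the "positive" class instead
tn_m, fp_m, fn_m, tp_m = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()
print(f"malignant as positive: TN={tn_m} FP={fp_m} FN={fn_m} TP={tp_m}")
```

The two missed malignant cases are false positives under the default ordering and false negatives once malignant is treated as positive, so it helps to state the positive class explicitly whenever error counts are reported.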

---

## 8. Wrap-Up: Key Takeaways

### What We Learned Today:

1. **Logistic Regression**: Maps linear combinations to probabilities via sigmoid
2. **Probabilities vs Classes**: `.predict_proba()` gives you more information than `.predict()`
3. **Thresholds Matter**: Default 0.5 is not always optimal
4. **Baselines**: Even naive strategies can have decent accuracy with imbalance
5. **Regularization**: Control C to prevent overfitting

### Critical Rules:

> **"Always look at probabilities, not just classes"**

> **"Accuracy is not enough - confusion matrix reveals errors"**

> **"Thresholds should be tuned to business costs"**

### Next Steps:

- Next notebook: Classification metrics (precision, recall, ROC, PR curves)
- We'll learn how to systematically choose thresholds
- Class imbalance handling strategies

---

## Participation Assignment Submission Instructions

### To Submit This Notebook:

1. **Complete all exercises**: Fill in both PAUSE-AND-DO exercise cells with your findings
2. **Run All Cells**: Execute `Runtime → Run all` to ensure everything works
3. **Save a Copy**: `File → Save a copy in Drive`
4. **Submit**: Upload your `.ipynb` file in the participation assignment you find in the course Brightspace page.

### Before Submitting, Check:

- [ ] All cells execute without errors
- [ ] All outputs are visible
- [ ] Both exercise responses are complete
- [ ] Notebook is shared with correct permissions
- [ ] You can explain every line of code you wrote

### Next Step:

Complete the **Quiz** in Brightspace (auto-graded)

---

## Bibliography

- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). *An Introduction to Statistical Learning with Applications in Python* - Classification chapter
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* - Logistic regression foundations
- scikit-learn API Reference: [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- scikit-learn User Guide: [Probability calibration](https://scikit-learn.org/stable/modules/calibration.html)

---



<center>

Thank you!

</center>