# Introduction to Machine Learning: Supervised Learning

**Instructor:** Daniel Acuna, Ph.D.
**Position:** Associate Professor of Computer Science
**Institution:** University of Colorado Boulder

---

Lab 3: Classification Methods

---

In [None]:
# ## Setup (do not edit)
#
# This cell imports all necessary libraries for the assignment.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    roc_curve,
    auc,
)

# Set a random state for reproducibility
RANDOM_STATE = 42

## 1. Load and Describe the Data (10 points)

The dataset for this lab is the **Wisconsin Breast Cancer dataset**. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The target column, `diagnosis`, indicates whether a tumor is **M** (malignant) or **B** (benign).

In [None]:
# Grade Cell: Question 1
#
# Task: Load the dataset and display its first 5 rows.
#
# Instructions:
# 1. Load the 'wisconsin_breast_cancer.csv' file into a pandas DataFrame called `df`.
# 2. Use the `.head()` method to display the first 5 rows.

# your code here
#raise NotImplementedError
df = pd.read_csv('wisconsin_breast_cancer.csv')
df.head()

In [None]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Question 1
assert "df" in locals(), "DataFrame 'df' not found."
assert df.shape[0] > 0, "DataFrame 'df' is empty."
assert "diagnosis" in df.columns, "The 'diagnosis' column is missing."
print("DataFrame loaded successfully!")
df.head()

## 2. Prepare the Data (10 points)

Before training a model, the data needs to be preprocessed. This involves encoding the target variable to a numerical format, dropping unnecessary columns, and separating features from the target.
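
The encoding step can be sketched on a tiny stand-in table. The frame below is not the lab's CSV: the `id` column and the numeric values are hypothetical and only illustrate how a non-feature column would be dropped alongside the target.

```python
import pandas as pd

# Toy stand-in for the lab's CSV; column names and values are illustrative assumptions.
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],            # hypothetical identifier column
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.9, 11.4, 12.1, 20.6],
})

# Map labels to integers; any label missing from the mapping becomes NaN.
encoded = toy["diagnosis"].map({"M": 1, "B": 0})
assert encoded.notna().all(), "Unexpected label in 'diagnosis'"

toy_y = encoded.astype(int)
# Drop the target and any non-feature column such as an identifier.
toy_X = toy.drop(columns=["diagnosis", "id"])

print(toy_y.tolist())
print(list(toy_X.columns))
```

One practical consequence: running the same `.map` a second time on an already-encoded column turns every value into NaN, because `1` and `0` are not keys in the mapping. Reloading the data before re-running the cell avoids this.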

In [None]:
# Grade Cell: Question 2
#
# Task: Prepare the data for modeling.
#
# Instructions:
# 1. Map the 'diagnosis' column to binary values: 'M' (malignant) to 1 and 'B' (benign) to 0.
# 2. Create the feature matrix `X` by dropping the 'diagnosis' column.
# 3. Create the target vector `y` from the now-encoded 'diagnosis' column.

# your code here
#raise NotImplementedError
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

In [None]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Question 2
assert "X" in locals() and "y" in locals(), "X and/or y are not defined."
assert y.dtype == "int64" or y.dtype == "int32", "Target 'y' is not numeric."
assert X.shape[1] == 30, "Feature matrix 'X' should have 30 columns."
print("Data preparation successful!")

## 3. Data Splitting and Scaling (10 points)

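To evaluate the model's performance on unseen data, we must split the dataset into training and testing sets. We will also standardize the features. Scaling matters for Logistic Regression because its solver converges more reliably and its L2 penalty then treats all coefficients on a common scale. LDA's predictions are, in principle, largely unaffected by standardization, but we feed both models the same scaled inputs for consistency. The scaler is fit on the training set only, so no information from the test set leaks into preprocessing.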
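
As a rough illustration of what standardization does, the sketch below computes the mean and standard deviation on a training slice only and reuses them on the held-out slice. The random data and the names `X_demo`, `mu`, and `sigma` are assumptions for illustration. It mirrors the arithmetic of `StandardScaler`, which uses the population standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic features
train, test = X_demo[:80], X_demo[80:]

# Statistics come from the training rows only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # reuses the training statistics

print(train_scaled.mean(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3))
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The test columns generally do not, which is expected: they are transformed with parameters they did not contribute to.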

In [None]:
# Grade Cell: Question 3
#
# Task: Split and scale the data.
#
# Instructions:
# 1. Split `X` and `y` into `X_train`, `X_test`, `y_train`, and `y_test` with a `test_size` of 0.2 and `random_state=RANDOM_STATE`.
# 2. Initialize a `StandardScaler` and fit it on `X_train`.
# 3. Transform both `X_train` and `X_test` using the fitted scaler, naming them `X_train_scaled` and `X_test_scaled`.

# your code here
#raise NotImplementedError
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Question 3
assert "X_train_scaled" in locals(), "Scaled training data not found."
assert "X_test_scaled" in locals(), "Scaled test data not found."
assert X_train.shape[0] > 0 and X_test.shape[0] > 0, "Train/test sets are empty."
print("Data splitting and scaling successful!")

## 4. Train a Basic Logistic Regression Model (10 points)

Now it's time to train our first classification model. You will use `LogisticRegression` from Scikit-learn to build a baseline model.
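
Before fitting, it can help to check what "baseline" means in terms of defaults. The short check below, assuming a standard scikit-learn install, reads the default `C` from a freshly constructed estimator. Since `C` is the inverse regularization strength, a default of 1.0 means the baseline already carries a mild penalty rather than none.

```python
from sklearn.linear_model import LogisticRegression

# Inspect the default hyperparameters without fitting anything.
params = LogisticRegression().get_params()
print(params["C"])
```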

In [None]:
# Grade Cell: Question 4
#
# Task: Train a baseline logistic regression model.
#
# Instructions:
# 1. Initialize a `LogisticRegression` model, setting `random_state` to `RANDOM_STATE`.
# 2. Train the model using the scaled training data (`X_train_scaled`, `y_train`).
# 3. Store the trained model in a variable called `log_reg_baseline`.

# your code here
#raise NotImplementedError
log_reg_baseline = LogisticRegression(random_state=RANDOM_STATE)
log_reg_baseline.fit(X_train_scaled, y_train)


In [None]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Question 4
assert "log_reg_baseline" in locals(), "Baseline logistic regression model not found."
assert hasattr(log_reg_baseline, "coef_"), "Model does not appear to be trained."
print("Baseline model training successful!")

## 5. Make Predictions and Evaluate (10 points)

With a trained model, you can now make predictions and evaluate its performance using key classification metrics.
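
To make the four metrics concrete, here is a minimal sketch that derives them from confusion counts. The counts are made-up illustrative numbers, not results from this lab.

```python
# Hypothetical confusion counts (not results from this lab).
tp, fp, fn, tn = 40, 2, 3, 69

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # of predicted positives, how many are real
recall = tp / (tp + fn)                      # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

In a screening setting like this one, recall on the malignant class is often the number to watch, since a false negative corresponds to a missed malignant tumor.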

In [None]:
# Grade Cell: Question 5
#
# Task: Make predictions and calculate performance metrics.
#
# Instructions:
# 1. Use the baseline model to make predictions on the scaled test data. Store them in `y_pred_baseline`.
# 2. Calculate accuracy, precision, recall, and F1-score. Store them in `accuracy_baseline`, `precision_baseline`, `recall_baseline`, and `f1_baseline`.

# your code here
#raise NotImplementedError
y_pred_baseline = log_reg_baseline.predict(X_test_scaled)
accuracy_baseline = accuracy_score(y_test, y_pred_baseline, normalize=True)
precision_baseline = precision_score(y_test, y_pred_baseline)
recall_baseline = recall_score(y_test, y_pred_baseline)
f1_baseline = f1_score(y_test, y_pred_baseline)

In [None]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Question 5
assert "y_pred_baseline" in locals(), "Baseline predictions not found."
assert "accuracy_baseline" in locals(), "Baseline accuracy not calculated."
print(f"Baseline Accuracy: {accuracy_baseline:.4f}")
print(f"Baseline Precision: {precision_baseline:.4f}")
print(f"Baseline Recall: {recall_baseline:.4f}")
print(f"Baseline F1 Score: {f1_baseline:.4f}")

## 6. Compute and Visualize the Confusion Matrix (10 points)

The confusion matrix provides a detailed breakdown of correct and incorrect classifications, which is essential for understanding a model's performance beyond a single accuracy score.

**Note**: For autograding, only the computed `conf_matrix_baseline` variable will be checked, not the plot itself.
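
When reading the matrix, it helps to know its layout: in scikit-learn, rows are actual classes and columns are predicted classes, ordered by label. The sketch below uses small made-up label arrays (`y_true_demo` and `y_pred_demo` are illustrative names) to show how the four cells unpack for a binary problem.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true_demo, y_pred_demo)
# Row-major flattening of [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```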

In [None]:
# Grade Cell: Question 6
#
# Task: Compute and visualize the confusion matrix for the baseline model.
#
# Instructions:
# 1. Compute the confusion matrix using `y_test` and `y_pred_baseline`. Store it in `conf_matrix_baseline`.
# 2. Use `seaborn.heatmap` to visualize the confusion matrix.

# your code here
#raise NotImplementedError
conf_matrix_baseline = confusion_matrix(y_test, y_pred_baseline)
sns.heatmap(conf_matrix_baseline, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


In [None]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Question 6
assert "conf_matrix_baseline" in locals(), "Baseline confusion matrix not found."
assert conf_matrix_baseline.shape == (
    2,
    2,
), "Confusion matrix has incorrect dimensions."
print("Baseline confusion matrix computed successfully.")

## 7. Plot ROC Curve and Calculate AUC (10 points)

The ROC curve illustrates the diagnostic ability of a classifier as its discrimination threshold is varied. The Area Under the Curve (AUC) provides an aggregate measure of performance across all thresholds.

**Note**: For autograding, only the computed `roc_auc_baseline` variable will be checked, not the plot itself.
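
The threshold sweep behind the ROC curve can be sketched by hand. This is a from-scratch illustration on six made-up scores, not a replacement for `roc_curve`: at each threshold it counts true and false positives, then integrates the curve with the trapezoidal rule.

```python
import numpy as np

# Made-up labels and predicted probabilities for illustration.
labels = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])

# Sweep thresholds from high to low; predict positive when score >= threshold.
thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
n_pos = labels.sum()
n_neg = len(labels) - n_pos

tpr = np.array([((scores >= t) & (labels == 1)).sum() / n_pos for t in thresholds])
fpr = np.array([((scores >= t) & (labels == 0)).sum() / n_neg for t in thresholds])

# Area under the piecewise-linear curve (trapezoidal rule).
roc_auc_demo = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(round(roc_auc_demo, 4))
```

The result equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, which is one common way to interpret AUC.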

In [None]:
# Grade Cell: Question 7
#
# Task: Generate the ROC curve for the baseline model.
#
# Instructions:
# 1. Get the prediction probabilities for the positive class.
# 2. Compute the false positive rate (`fpr`), true positive rate (`tpr`), and thresholds.
# 3. Calculate the Area Under the ROC Curve (`roc_auc_baseline`).
# 4. Plot the ROC curve.

# your code here
#raise NotImplementedError
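y_pred_proba_baseline = log_reg_baseline.predict_proba(X_test_scaled)[:, 1]
fpr_baseline, tpr_baseline, thresholds_baseline = roc_curve(y_test, y_pred_proba_baseline)
roc_auc_baseline = auc(fpr_baseline, tpr_baseline)
# Plot the ROC curve against the chance diagonal (instruction 4)
plt.plot(fpr_baseline, tpr_baseline, label=f'ROC curve (AUC = {roc_auc_baseline:.3f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()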


In [None]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Question 7
assert "roc_auc_baseline" in locals(), "ROC AUC for baseline model not calculated."
assert isinstance(roc_auc_baseline, (float, np.floating)), "ROC AUC must be a float"
assert 0.0 <= roc_auc_baseline <= 1.0, "ROC AUC must be between 0 and 1"
assert roc_auc_baseline > 0.5, "ROC AUC should be better than random (>0.5)"
print(f"Baseline Model AUC: {roc_auc_baseline:.4f}")

## 8. Implement L2 Regularization (10 points)

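Regularization helps prevent overfitting by penalizing large model coefficients. You will now train a logistic regression model with an explicit L2 penalty and `C=0.1`. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values penalize the coefficients more heavily.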
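
The shrinking effect of an L2 penalty can be seen in a from-scratch sketch. This is not scikit-learn's solver: it is plain gradient descent on the mean log-loss plus a penalty of `||w||^2 / (2 * n * C)`, scaled so that `C` acts as an inverse strength. The function name `fit_logreg_l2` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def fit_logreg_l2(X, y, C, lr=0.5, n_iter=5000):
    """Gradient descent on mean log-loss + ||w||^2 / (2 * n * C).

    The intercept b is left unpenalized.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # predicted P(y = 1)
        grad_w = X.T @ (p - y) / n + w / (n * C)  # loss gradient + penalty gradient
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
# Noisy labels so the classes overlap and the weights stay finite.
logits = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]
y_toy = (logits + rng.normal(scale=1.0, size=200) > 0).astype(float)

w_weak, _ = fit_logreg_l2(X_toy, y_toy, C=1.0)    # weaker penalty
w_strong, _ = fit_logreg_l2(X_toy, y_toy, C=0.1)  # stronger penalty

print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

The weight vector fitted with `C=0.1` has the smaller norm, because each gradient step includes a term that pulls `w` toward zero in proportion to `1 / C`.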

In [None]:
# Grade Cell: Question 8
#
# Task: Train a regularized logistic regression model.
#
# Instructions:
# 1. Initialize a `LogisticRegression` model with `penalty='l2'`, `C=0.1`, and `random_state=RANDOM_STATE`.
# 2. Train the model on the scaled training data.
# 3. Store the trained model in `log_reg_l2`.
# 4. Make predictions on the scaled test data and calculate the accuracy, storing it in `accuracy_l2`.

# your code here
#raise NotImplementedError
log_reg_l2 = LogisticRegression(penalty='l2', C=0.1, random_state=RANDOM_STATE)
log_reg_l2.fit(X_train_scaled, y_train)
y_pred_l2 = log_reg_l2.predict(X_test_scaled)
accuracy_l2 = accuracy_score(y_test, y_pred_l2, normalize=True)


In [None]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Question 8
assert "log_reg_l2" in locals(), "L2 regularized model not found."
assert hasattr(log_reg_l2, "coef_"), "L2 model does not appear to be trained."
assert "accuracy_l2" in locals(), "Accuracy for L2 model not calculated."
print(f"L2 Regularized Model Accuracy: {accuracy_l2:.4f}")

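## 9. Compare the Baseline and Strongly Regularized Models (10 points)

Let's compare the coefficients of the baseline and regularized models to see the effect of regularization. Note that the baseline is not truly unregularized: scikit-learn's `LogisticRegression` applies an L2 penalty with `C=1.0` by default, so this comparison is between weaker (`C=1.0`) and stronger (`C=0.1`) regularization. The stronger penalty should "shrink" the coefficients toward zero.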

In [None]:
# Grade Cell: Question 9
#
# Task: Compare the magnitudes of the model coefficients.
#
# Instructions:
# 1. Calculate the average absolute value of the coefficients for the baseline model (`log_reg_baseline`) and store it in `avg_coef_baseline`.
# 2. Calculate the average absolute value of the coefficients for the L2 regularized model (`log_reg_l2`) and store it in `avg_coef_l2`.

# your code here
#raise NotImplementedError
avg_coef_baseline = np.mean(np.abs(log_reg_baseline.coef_))
avg_coef_l2 = np.mean(np.abs(log_reg_l2.coef_))


In [None]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Question 9
assert "avg_coef_baseline" in locals(), "Average baseline coefficient not found."
assert "avg_coef_l2" in locals(), "Average L2 coefficient not found."
print(f"Average Baseline Coefficient Magnitude: {avg_coef_baseline:.4f}")
print(f"Average L2 Regularized Coefficient Magnitude: {avg_coef_l2:.4f}")
assert (
    avg_coef_l2 < avg_coef_baseline
), "L2 coefficients should be smaller than baseline."
print(
    "Coefficient comparison successful. L2 regularization shrinks coefficients as expected."
)

## 10. Train a Linear Discriminant Analysis (LDA) Model (10 points)

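As an alternative approach, let's train a Linear Discriminant Analysis (LDA) model and compare its performance. Where logistic regression models the class probability directly, LDA models each class as a Gaussian with a shared covariance matrix and derives a linear decision boundary from that assumption.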
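
For intuition, the two-class LDA rule can be written out directly from class means and a pooled covariance estimate. The sketch below is a simplified from-scratch version on synthetic Gaussian data, with equal priors assumed. It is not how `LinearDiscriminantAnalysis` is implemented internally, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))  # class 0 samples
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2))  # class 1 samples
X_all = np.vstack([X0, X1])
y_all = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# Pooled within-class covariance (the shared-covariance assumption of LDA).
centered = np.vstack([X0 - mu0, X1 - mu1])
sigma = centered.T @ centered / (len(X_all) - 2)

# Discriminant direction and decision threshold.
w = np.linalg.solve(sigma, mu1 - mu0)
prior0, prior1 = 0.5, 0.5  # assumed equal class priors
threshold = w @ (mu0 + mu1) / 2 - np.log(prior1 / prior0)

# Predict class 1 when the projection exceeds the threshold.
y_hat = (X_all @ w > threshold).astype(int)
lda_acc_demo = float((y_hat == y_all).mean())  # accuracy on the same data used to fit
print(lda_acc_demo)
```

The decision boundary is linear because the quadratic terms cancel when both classes share one covariance matrix. The accuracy here is measured on the fitting data, so it is optimistic compared with a held-out estimate.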

In [None]:
# Grade Cell: Question 10
#
# Task: Train an LDA model and evaluate its accuracy.
#
# Instructions:
# 1. Initialize a `LinearDiscriminantAnalysis` model.
# 2. Train the model on the scaled training data.
# 3. Make predictions on the scaled test data.
# 4. Calculate the accuracy and store it in `accuracy_lda`.

# your code here
#raise NotImplementedError
lda = LinearDiscriminantAnalysis()
lda.fit(X_train_scaled, y_train)
y_pred_lda = lda.predict(X_test_scaled)
accuracy_lda = accuracy_score(y_test, y_pred_lda, normalize=True)

In [None]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Question 10
assert "accuracy_lda" in locals(), "LDA accuracy not found."
assert isinstance(accuracy_lda, (float, np.floating)), "Accuracy must be a float"
assert 0.0 <= accuracy_lda <= 1.0, "Accuracy must be between 0 and 1"
assert accuracy_lda > 0.5, "Accuracy should be better than random guessing"
print(f"LDA Model Accuracy: {accuracy_lda:.4f}")
print(f"Baseline Logistic Regression Accuracy: {accuracy_baseline:.4f}")

## Next Steps

Congratulations on completing the assignment! Before submitting:

1. Make sure all your cells run without errors.
2. Ensure you've answered all parts of each question.
3. If any autograder tests fail, revisit your answers.
