### **State University of Campinas - UNICAMP** <br>
**Course**: MC886A <br>
**Professor**: Marcelo da Silva Reis <br>
**TA (PED)**: Marcos Vinicius Souza Freire

---

### **Hands-On: Model Selection**
##### Notebook: 01 Model Selection

> Dataset from scikit-learn: [load_breast_cancer](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html), based on [Breast Cancer Wisconsin (Diagnostic)](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) (1993) [1]
---

**This notebook covers the following topics:**

- **Model Selection and Regularization:** Subset selection with Recursive Feature Elimination (RFE), and Ridge (L2) and Lasso (L1) penalties.
- **Advanced Model Selection:** Cross-validating a regularized logistic regression in PyTorch, and tuning k-Nearest Neighbors and Random Forest with grid search.

Throughout the notebook we illustrate the methods with interactive Plotly graphs of accuracy and sparsity against the regularization strength, and with commented code cells.

Based on the Jurafsky & Martin (2025) lectures [2].

---


In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd

# Replace Matplotlib with Plotly for interactive plotting
import plotly.graph_objects as go
import plotly.express as px

from sklearn.datasets import make_classification, load_breast_cancer
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset

import warnings
warnings.filterwarnings('ignore')

# Set seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)


#### **Basic exploration of the dataset**

In [None]:
# Let's load the Breast Cancer Dataset from Scikit-Learn
cancer_dataset = load_breast_cancer()

In [None]:
# Keys in dataset
cancer_dataset.keys()

In [None]:
# Target labels: 0 = malignant, 1 = benign
cancer_dataset['target']

In [None]:
# Class names, indexed by target value
cancer_dataset['target_names']

In [None]:
# Description of data
print(cancer_dataset['DESCR'])

In [None]:
# Name of features
print(cancer_dataset['feature_names'])

In [None]:
# Create a DataFrame with the feature columns plus a 'target' column
cancer_df = pd.DataFrame(np.c_[cancer_dataset['data'], cancer_dataset['target']],
                         columns=np.append(cancer_dataset['feature_names'], ['target']))

In [None]:
# Head of cancer DataFrame
cancer_df.head(6)

In [None]:
# Tail of cancer DataFrame
cancer_df.tail(6)

In [None]:
# Column dtypes and non-null counts of the cancer DataFrame
cancer_df.info()

In [None]:
# Summary statistics (count, mean, std, quartiles) for each column
cancer_df.describe()
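
Before fitting any model, it helps to know how the two classes are balanced, since accuracy is only meaningful relative to the majority-class baseline. The following is a minimal sketch under that framing. The names `class_counts` and `majority_share` are introduced here for illustration and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the class name for target value i (0 = malignant, 1 = benign)
counts = np.bincount(data.target)
class_counts = {str(name): int(n) for name, n in zip(data.target_names, counts)}
print(class_counts)

# Accuracy obtained by always predicting the most frequent class
majority_share = counts.max() / counts.sum()
print(f"Majority-class baseline accuracy: {majority_share:.3f}")
```

Any classifier evaluated later should be compared against this baseline rather than against 50%.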

---

### **Helper Function**

`evaluate_classifier`, borrowed from Notebook 00 (Logistic Regression and Classification and Resampling methods), prints the accuracy, confusion matrix, and classification report for a set of predictions.


In [None]:
def evaluate_classifier(y_true, y_pred):
    """Print evaluation metrics for a classifier."""
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_true, y_pred))
    print("\nClassification Report:")
    print(classification_report(y_true, y_pred))
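
To make the helper's output easier to read, here is a small sketch of the same scikit-learn metrics on toy labels. The arrays `y_true_toy` and `y_pred_toy` are invented for illustration and are not taken from the dataset. In scikit-learn's confusion matrix, rows are true classes and columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels, invented for illustration only
y_true_toy = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_pred_toy = np.array([0, 1, 1, 1, 0, 0, 1, 1])

cm = confusion_matrix(y_true_toy, y_pred_toy)
# For binary labels {0, 1}, ravel() yields (TN, FP, FN, TP)
tn, fp, fn, tp = cm.ravel()
acc = accuracy_score(y_true_toy, y_pred_toy)

print(cm)                        # [[2 1]
                                 #  [1 4]]
print("Accuracy:", acc)          # 0.75
print("Precision (class 1):", tp / (tp + fp))  # 0.8
print("Recall (class 1):", tp / (tp + fn))     # 0.8
```

Because this dataset encodes malignant as 0 and benign as 1, the top-right entry counts malignant tumors predicted as benign.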


### **Part 1: Model Selection and Regularization I**

In this part we explore:

- **Subset Selection:** using Recursive Feature Elimination (RFE)
- **Ridge Regression (L2 Regularization) and Lasso Regression (L1 Regularization)**

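These methods help control overfitting: RFE by discarding features, and Ridge and Lasso by penalizing large weights. Because the target is binary, both penalties are added here to a logistic regression loss (sigmoid output with binary cross-entropy) rather than to a least-squares loss.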
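
For reference, the objectives minimized in the code cell that follows can be written as below. This is a sketch read off from that code, not a formula given in the lectures: the notation ($\theta$ for all trainable parameters, $n$ for the number of training samples) is an assumption introduced here. Note that $\theta$ includes the bias $b$ as well as the weights $w$, because the code sums the penalty over every model parameter. Many textbook formulations leave the bias unpenalized.

```latex
\begin{aligned}
\hat{y}_i &= \sigma\!\left(w^\top x_i + b\right), \qquad \theta = (w, b) \\
\mathcal{L}_{\text{BCE}}(\theta) &= -\frac{1}{n}\sum_{i=1}^{n}
  \Big[\, y_i \log \hat{y}_i + (1 - y_i)\log\!\left(1 - \hat{y}_i\right) \Big] \\
\mathcal{L}_{\text{ridge}}(\theta) &= \mathcal{L}_{\text{BCE}}(\theta) + \lambda \sum_{j} \theta_j^{2} \\
\mathcal{L}_{\text{lasso}}(\theta) &= \mathcal{L}_{\text{BCE}}(\theta) + \lambda \sum_{j} \lvert \theta_j \rvert
\end{aligned}
```

With $\lambda = 0$ both objectives reduce to plain logistic regression.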


In [None]:
# Load the Breast Cancer dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Full dataset shape:", X.shape)

# Subset Selection using Recursive Feature Elimination (RFE)
lda = LinearDiscriminantAnalysis()
rfe = RFE(estimator=lda, n_features_to_select=5)
X_train_rfe = rfe.fit_transform(X_train, y_train)
X_test_rfe = rfe.transform(X_test)
print("Selected features (indices):", np.where(rfe.support_)[0])
lda.fit(X_train_rfe, y_train)
y_pred_rfe = lda.predict(X_test_rfe)
print("\nSubset Selection (RFE) Evaluation:")
evaluate_classifier(y_test, y_pred_rfe)

# Ridge (L2 regularization) applied to a logistic regression model
class RidgeRegression(nn.Module):
    def __init__(self, input_dim, lambda_reg=0.1):
        super(RidgeRegression, self).__init__()
        self.linear = nn.Linear(input_dim, 1)
        self.lambda_reg = lambda_reg

    def forward(self, x):
        # Sigmoid output: predicted probability of class 1
        return torch.sigmoid(self.linear(x))

    def ridge_penalty(self):
        # L2 penalty: lambda * sum of squared parameters (weights and bias)
        return self.lambda_reg * sum(torch.sum(param ** 2) for param in self.parameters())

# Standardize features for Ridge regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_tensor = torch.FloatTensor(X_train_scaled)
y_train_tensor = torch.FloatTensor(y_train.reshape(-1, 1))
X_test_tensor = torch.FloatTensor(X_test_scaled)

lambda_values = [0.0, 0.01, 0.1, 1.0]
ridge_accuracies = []

for lam in lambda_values:
    model_ridge = RidgeRegression(X_train_scaled.shape[1], lambda_reg=lam)
    criterion = nn.BCELoss()
    optimizer = optim.SGD(model_ridge.parameters(), lr=0.01)
    epochs = 1000
    # Full-batch gradient descent on BCE loss + L2 penalty
    for epoch in range(epochs):
        outputs = model_ridge(X_train_tensor)
        loss = criterion(outputs, y_train_tensor) + model_ridge.ridge_penalty()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Evaluate on the held-out test split
    model_ridge.eval()
    with torch.no_grad():
        y_pred_probs = model_ridge(X_test_tensor)
        y_pred = (y_pred_probs > 0.5).float().numpy().flatten()
        acc = accuracy_score(y_test, y_pred)
        ridge_accuracies.append(acc)
    print(f"Ridge: Lambda = {lam}, Test Accuracy = {acc:.4f}")

# Plot Ridge results using Plotly
# A categorical x-axis is used because lambda = 0 cannot be placed on a log scale
fig = go.Figure()
fig.add_trace(go.Scatter(x=[str(v) for v in lambda_values], y=ridge_accuracies, mode='lines+markers'))
fig.update_layout(
    title="Effect of Ridge Regularization",
    xaxis=dict(title="Regularization Strength (λ)", type="category"),
    yaxis_title="Test Accuracy"
)
fig.show()

# Lasso (L1 regularization) applied to a logistic regression model
class LassoRegression(nn.Module):
    def __init__(self, input_dim, lambda_reg=0.1):
        super(LassoRegression, self).__init__()
        self.linear = nn.Linear(input_dim, 1)
        self.lambda_reg = lambda_reg

    def forward(self, x):
        # Sigmoid output: predicted probability of class 1
        return torch.sigmoid(self.linear(x))

    def lasso_penalty(self):
        # L1 penalty: lambda * sum of absolute parameter values (weights and bias)
        return self.lambda_reg * sum(torch.sum(torch.abs(param)) for param in self.parameters())

lambda_values = [0.0, 0.01, 0.1, 1.0]
lasso_accuracies = []
nonzero_coeffs = []

for lam in lambda_values:
    model_lasso = LassoRegression(X_train_scaled.shape[1], lambda_reg=lam)
    criterion = nn.BCELoss()
    optimizer = optim.SGD(model_lasso.parameters(), lr=0.01)
    epochs = 1000
    # Full-batch gradient descent on BCE loss + L1 penalty
    for epoch in range(epochs):
        outputs = model_lasso(X_train_tensor)
        loss = criterion(outputs, y_train_tensor) + model_lasso.lasso_penalty()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Evaluate on the held-out test split
    model_lasso.eval()
    with torch.no_grad():
        y_pred_probs = model_lasso(X_test_tensor)
        y_pred = (y_pred_probs > 0.5).float().numpy().flatten()
        acc = accuracy_score(y_test, y_pred)
        lasso_accuracies.append(acc)
        # Count weights with |w| > 0.01. SGD on an L1 penalty rarely lands on
        # exact zeros, so a small threshold stands in for "non-zero".
        weight = model_lasso.linear.weight.data.numpy().flatten()
        nonzeros = np.sum(np.abs(weight) > 0.01)
        nonzero_coeffs.append(nonzeros)
    print(f"Lasso: Lambda = {lam}, Test Accuracy = {acc:.4f}, Non-zero Coeffs = {nonzeros}/{len(weight)}")

# Plot Lasso results using Plotly (accuracy and sparsity)
# A categorical x-axis is used because lambda = 0 cannot be placed on a log scale
lambda_labels = [str(v) for v in lambda_values]

fig = go.Figure()
fig.add_trace(go.Scatter(x=lambda_labels, y=lasso_accuracies, mode='lines+markers', name="Accuracy"))
fig.update_layout(
    title="Effect of Lasso Regularization on Accuracy",
    xaxis=dict(title="Regularization Strength (λ)", type="category"),
    yaxis_title="Test Accuracy"
)
fig.show()

fig = go.Figure()
fig.add_trace(go.Scatter(x=lambda_labels, y=nonzero_coeffs, mode='lines+markers', name="Non-zero Coefficients"))
fig.update_layout(
    title="Lasso Regularization: Sparsity",
    xaxis=dict(title="Regularization Strength (λ)", type="category"),
    yaxis_title="Number of Non-zero Coefficients"
)
fig.show()
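
As a cross-check on the sparsity result, the sketch below fits scikit-learn's `LogisticRegression` with an L1 and an L2 penalty and counts coefficients that are exactly zero. It is not part of the original pipeline. The solver (`liblinear`), the value `C=0.1`, and the `*_chk` variable names are assumptions chosen for illustration. `C` is the inverse of the regularization strength, so its scale is not directly comparable to the λ values used above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_chk, y_chk = load_breast_cancer(return_X_y=True)
X_tr_chk, X_te_chk, y_tr_chk, y_te_chk = train_test_split(
    X_chk, y_chk, test_size=0.3, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits
scaler_chk = StandardScaler().fit(X_tr_chk)
X_tr_s, X_te_s = scaler_chk.transform(X_tr_chk), scaler_chk.transform(X_te_chk)

zero_counts = {}
for penalty in ("l1", "l2"):
    # liblinear supports both penalties; smaller C means stronger regularization
    clf = LogisticRegression(penalty=penalty, C=0.1, solver="liblinear", random_state=42)
    clf.fit(X_tr_s, y_tr_chk)
    zero_counts[penalty] = int(np.sum(clf.coef_ == 0))
    print(f"{penalty}: test accuracy = {clf.score(X_te_s, y_te_chk):.4f}, "
          f"exact zeros = {zero_counts[penalty]}/{clf.coef_.size}")
```

A solver built for the L1 penalty can drive coefficients to exactly zero, while the L2 penalty only shrinks them. This is why the PyTorch loop, which uses plain SGD, needs a threshold to measure sparsity.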


### **Part 2: Model Selection and Regularization II (PyTorch)**

This part scores a regularized logistic regression implemented in PyTorch with k-fold cross-validation, and tunes two scikit-learn classifiers with `GridSearchCV`. The models covered are:

- **Logistic Regression with Regularization**
- **k-Nearest Neighbors (kNN)**
- **Random Forest**
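
One design choice worth noting when tuning with cross-validation: if the scaler is fitted once on the whole training split, each validation fold has already influenced the scaling statistics. The sketch below is a variation on the approach used in this notebook, not a copy of it. It wraps the scaler and the classifier in a `Pipeline` so that `GridSearchCV` refits the scaler inside every fold. The step names `"scale"` and `"knn"` and the `*_pipe` variable names are arbitrary choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_pipe, y_pipe = load_breast_cancer(return_X_y=True)
X_tr_pipe, X_te_pipe, y_tr_pipe, y_te_pipe = train_test_split(
    X_pipe, y_pipe, test_size=0.3, random_state=42
)

# The scaler is a pipeline step, so it is refitted on each training fold
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>
search = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7, 9]}, cv=5)
search.fit(X_tr_pipe, y_tr_pipe)

best_k = search.best_params_["knn__n_neighbors"]
cv_acc = search.best_score_
test_acc_pipe = search.score(X_te_pipe, y_te_pipe)  # uses the refitted best pipeline
print(f"Best k: {best_k}, CV accuracy: {cv_acc:.4f}, test accuracy: {test_acc_pipe:.4f}")
```

On a dataset of this size the difference from scaling once is usually small, but the pipeline form keeps the cross-validation estimate free of that leakage.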


In [None]:
# Load and preprocess the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_tensor = torch.FloatTensor(X_train_scaled)
y_train_tensor = torch.FloatTensor(y_train.reshape(-1, 1))
X_test_tensor = torch.FloatTensor(X_test_scaled)
y_test_tensor = torch.FloatTensor(y_test.reshape(-1, 1))

# Logistic Regression with Regularization in PyTorch
class LogisticRegressionWithReg(nn.Module):
    def __init__(self, input_dim, l1_lambda=0.0, l2_lambda=0.0):
        super(LogisticRegressionWithReg, self).__init__()
        self.linear = nn.Linear(input_dim, 1)
        self.l1_lambda = l1_lambda
        self.l2_lambda = l2_lambda

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

    def regularization_loss(self):
        l1 = self.l1_lambda * sum(torch.sum(torch.abs(param)) for param in self.parameters())
        l2 = self.l2_lambda * sum(torch.sum(param ** 2) for param in self.parameters())
        return l1 + l2

# Cross-validation function for PyTorch models
def cross_validate_model(X, y, model_class, model_params, cv=5, epochs=500, batch_size=32, lr=0.01):
    """Return the mean and standard deviation of validation accuracy over `cv` folds."""
    kf = KFold(n_splits=cv, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in kf.split(X):
        X_train_fold, X_val_fold = X[train_idx], X[val_idx]
        y_train_fold, y_val_fold = y[train_idx], y[val_idx]
        X_train_tensor = torch.FloatTensor(X_train_fold)
        y_train_tensor = torch.FloatTensor(y_train_fold).reshape(-1, 1)
        X_val_tensor = torch.FloatTensor(X_val_fold)

        train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
        loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

        model = model_class(**model_params)
        criterion = nn.BCELoss()
        optimizer = optim.SGD(model.parameters(), lr=lr)

        for epoch in range(epochs):
            for inputs, labels in loader:
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                if hasattr(model, 'regularization_loss'):
                    loss += model.regularization_loss()
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        model.eval()
        with torch.no_grad():
            outputs = model(X_val_tensor)
            y_pred = (outputs > 0.5).float().numpy().flatten()
            acc = accuracy_score(y_val_fold, y_pred)
            scores.append(acc)
    return np.mean(scores), np.std(scores)

input_dim = X_train_scaled.shape[1]
model_params = {'input_dim': input_dim, 'l1_lambda': 0.01, 'l2_lambda': 0.01}
mean_acc, std_acc = cross_validate_model(X_train_scaled, y_train, LogisticRegressionWithReg, model_params, cv=5, epochs=500)
print(f"Logistic Regression with Regularization: {mean_acc:.4f} ± {std_acc:.4f}")

# Train the final regularized logistic regression on the full training set
# (same fixed lambdas as in the cross-validation above; no search over lambda is run)
best_model = LogisticRegressionWithReg(input_dim, l1_lambda=0.01, l2_lambda=0.01)
criterion = nn.BCELoss()
optimizer = optim.SGD(best_model.parameters(), lr=0.01)
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
for epoch in range(500):
    for inputs, labels in loader:
        outputs = best_model(inputs)
        loss = criterion(outputs, labels) + best_model.regularization_loss()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

best_model.eval()
with torch.no_grad():
    outputs = best_model(X_test_tensor)
    y_pred_logreg = (outputs > 0.5).float().numpy().flatten()
    logreg_test_acc = accuracy_score(y_test, y_pred_logreg)
print(f"Final Logistic Regression with Reg - Test Accuracy: {logreg_test_acc:.4f}")

# k-Nearest Neighbors (kNN)
param_grid = {'n_neighbors': [3, 5, 7, 9]}
grid_search_knn = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search_knn.fit(X_train_scaled, y_train)
print(f"Best k for kNN: {grid_search_knn.best_params_['n_neighbors']}, CV Accuracy: {grid_search_knn.best_score_:.4f}")
best_knn = KNeighborsClassifier(n_neighbors=grid_search_knn.best_params_['n_neighbors'])
best_knn.fit(X_train_scaled, y_train)
y_pred_knn = best_knn.predict(X_test_scaled)
knn_test_acc = accuracy_score(y_test, y_pred_knn)
print(f"kNN Test Accuracy: {knn_test_acc:.4f}")

# Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30]
}
grid_search_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search_rf.fit(X_train_scaled, y_train)
print(f"Best Params for Random Forest: {grid_search_rf.best_params_}, CV Accuracy: {grid_search_rf.best_score_:.4f}")
best_rf = RandomForestClassifier(**grid_search_rf.best_params_, random_state=42)
best_rf.fit(X_train_scaled, y_train)
y_pred_rf = best_rf.predict(X_test_scaled)
rf_test_acc = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Test Accuracy: {rf_test_acc:.4f}")

print("\nFinal Model Comparison on Test Set:")
print(f"Logistic Regression with Reg: {logreg_test_acc:.4f}")
print(f"kNN: {knn_test_acc:.4f}")
print(f"Random Forest: {rf_test_acc:.4f}")
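
Comparing models on the test set is fine for reporting, but choosing the winner by test accuracy would turn the test set into a selection tool. The sketch below shows one common alternative, offered under stated assumptions rather than as the method of this notebook: pick the model by cross-validation accuracy on the training split, then evaluate only that model on the test split. The candidate hyperparameters (`n_neighbors=5`, `n_estimators=100`, `max_iter=1000`) are fixed placeholders, and scikit-learn's `LogisticRegression` stands in for the PyTorch model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_sel, y_sel = load_breast_cancer(return_X_y=True)
X_tr_sel, X_te_sel, y_tr_sel, y_te_sel = train_test_split(
    X_sel, y_sel, test_size=0.3, random_state=42
)

# Placeholder candidates; scale-sensitive models get a scaler in their pipeline
candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "rf": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Step 1: select using 5-fold CV accuracy on the training split only
cv_means = {
    name: cross_val_score(est, X_tr_sel, y_tr_sel, cv=5).mean()
    for name, est in candidates.items()
}
chosen = max(cv_means, key=cv_means.get)

# Step 2: refit the chosen model on the full training split, test it once
final_model = candidates[chosen].fit(X_tr_sel, y_tr_sel)
chosen_test_acc = final_model.score(X_te_sel, y_te_sel)

for name, score in cv_means.items():
    print(f"{name}: CV accuracy = {score:.4f}")
print(f"Chosen: {chosen}, test accuracy = {chosen_test_acc:.4f}")
```

The test accuracy printed at the end is then an estimate for a model that was chosen without looking at the test data.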


---

## **REFERENCES**

[1] Wolberg, W., Mangasarian, O., Street, N., & Street, W. (1993). Breast Cancer Wisconsin (Diagnostic) [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B.

[2] Jurafsky, D., & Martin, J. H. (2025). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models (3rd ed.), Ch. 5, Logistic Regression. Online manuscript released January 12, 2025. https://web.stanford.edu/~jurafsky/slp3.