Math156 - Machine Learning
Assignment 3

Exercise 3:
Implement a program to train a binary logistic regression model using mini-batch SGD. Use the logistic regression model we derived in class, corresponding to equation (4.90) from the textbook, with the identity function as the feature transformation.

The program should include the following hyperparameters:
Batch size;
Fixed learning rate;
Maximum number of iterations.

In [148]:
import numpy as np

In [149]:
### For Exercise 4 (d) ###
def initialize_weights(n_features):
    # Sample the initial weights from a standard Gaussian distribution.
    return np.random.randn(n_features)
### Initializing weights for 4 (d) ###

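class LRSGD:
    """Binary logistic regression trained with mini-batch SGD (no bias term)."""

    def __init__(self, learning_rate, batch_size, max_iterations):
        self.learning_rate = learning_rate
        self.batch_size = batch_size
        # Each iteration in fit() is one full pass over the shuffled data.
        self.max_iterations = max_iterations
        self.weights = None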

    def sigmoid(self, z):
        return 1 / (1+ np.exp(-z))
    
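    def cross_entropy_l(self, y_true, y_pred):
        # Summed cross-entropy loss. Not called by fit(); returns inf or nan
        # if any prediction is exactly 0 or 1.
        return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))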
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = initialize_weights(n_features)

        for iteration in range(self.max_iterations):
            # Reshuffle the data at the start of every pass.
            indices = np.arange(n_samples)
            np.random.shuffle(indices)
            X = X[indices]
            y = y[indices]

            for start_idx in range(0, n_samples, self.batch_size):
                end_idx = min(start_idx + self.batch_size, n_samples)
                X_batch = X[start_idx:end_idx]
                y_batch = y[start_idx:end_idx]

                # Forward pass: linear score, then sigmoid probability.
                l_output = np.dot(X_batch, self.weights)
                y_pred = self.sigmoid(l_output)

                # Gradient of the cross-entropy loss, averaged over the batch.
                gradients = np.dot(X_batch.T, (y_pred - y_batch)) / len(y_batch)

                # Gradient descent step with the fixed learning rate.
                self.weights -= self.learning_rate * gradients
    
    def pred_probability(self, X):
        l_output = np.dot(X, self.weights)
        return self.sigmoid(l_output)

    def predict(self, X):
        return (self.pred_probability(X) >= 0.5).astype(int)
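
With unscaled features, the linear score `np.dot(X, self.weights)` can be large and negative, in which case `np.exp(-z)` overflows and NumPy emits a RuntimeWarning (the result still evaluates to 0). A minimal sketch of a numerically stable alternative follows. The names `stable_sigmoid` and `safe_cross_entropy` are hypothetical helpers, not part of the class above, and the clipping constant `eps` is an assumed value.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid for array inputs that never exponentiates a positive number."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the usual form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)); exp(z) < 1 here.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def safe_cross_entropy(y_true, y_pred, eps=1e-12):
    """Summed cross-entropy with predictions clipped away from 0 and 1."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

z = np.array([-1000.0, 0.0, 1000.0])
probs = stable_sigmoid(z)
loss = safe_cross_entropy(np.array([0, 1, 1]), probs)
print(probs, loss)
```

Swapping these in would change only how extreme scores are handled; the gradient update in `fit` stays the same.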

Exercise 4:

(a)

Download the Wisconsin Breast Cancer dataset from the UCI Machine Learning Repository or scikit-learn’s built-in datasets.

In [150]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

(b)

Split the dataset into train, validation, and test sets.

In [151]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size = 0.5, random_state = 42)
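
The two calls above first hold out 30% of the data and then halve that hold-out, giving roughly a 70/15/15 split. The sketch below checks the resulting sizes on placeholder arrays with the dataset's 569 rows and class counts. The `stratify` argument is an optional addition that the original split does not use; it keeps the class proportions similar across the subsets. All `*_demo` and shortened variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape and class counts as the dataset (569 x 30).
X_demo = np.zeros((569, 30))
y_demo = np.array([0] * 212 + [1] * 357)

# Stage 1: 70% train, 30% held out.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Stage 2: split the held-out part evenly into validation and test.
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold)

print(len(X_tr), len(X_va), len(X_te))
```

For 569 samples this gives 398 training, 85 validation, and 86 test rows.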

(c)

Report the size of each class in your training (+ validation) set.

In [152]:
X_train_val = np.concatenate((X_train, X_val))
y_train_val = np.concatenate((y_train, y_val))

unique, counts = np.unique(y_train_val, return_counts = True)
class_distribution = dict(zip(unique, counts))
print(class_distribution)

{0: 186, 1: 297}


(d)

Train a binary logistic regression model using your implementation from problem 3. Initialize
the model weights randomly, sampling from a standard Gaussian distribution. Experiment with
different choices of fixed learning rate and batch size.

In [153]:
### The weights are drawn from a standard Gaussian by initialize_weights, defined in Exercise 3 ###

learning_rate = 0.0001
batch_size = 16
max_iterations = 1000
model = LRSGD(learning_rate = learning_rate, batch_size = batch_size, max_iterations = max_iterations)

model.fit(X_train, y_train)
model.predict(X_train)

  return 1 / (1+ np.exp(-z))


array([0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,
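
Part (d) asks for experiments with several learning rates and batch sizes, while the cell above shows a single configuration. One way to organize such a sweep is sketched below. It is a self-contained illustration on synthetic data, not the notebook's actual experiment: `train_lr_sgd` and `accuracy` are hypothetical helpers that mirror the update rule of `LRSGD`, the grid values are arbitrary, and choosing the configuration by validation accuracy is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_lr_sgd(X, y, learning_rate, batch_size, n_epochs, rng):
    """Mini-batch SGD for logistic regression without a bias term."""
    w = rng.standard_normal(X.shape[1])  # standard Gaussian initialization
    for _ in range(n_epochs):
        order = rng.permutation(len(y))  # reshuffle every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (sigmoid(X[idx] @ w) - y[idx]) / len(idx)
            w -= learning_rate * grad
    return w

def accuracy(w, X, y):
    return np.mean((sigmoid(X @ w) >= 0.5) == y)

# Synthetic stand-in data: two Gaussian blobs separable through the origin.
rng = np.random.default_rng(0)
X_all = np.vstack([rng.normal(loc=2.0, size=(200, 2)),
                   rng.normal(loc=-2.0, size=(200, 2))])
y_all = np.concatenate([np.ones(200), np.zeros(200)])
perm = rng.permutation(400)
train_idx, val_idx = perm[:300], perm[300:]

# Train one model per (learning rate, batch size) pair and score it on validation data.
results = {}
for lr in (0.001, 0.01, 0.1):
    for bs in (16, 64):
        w = train_lr_sgd(X_all[train_idx], y_all[train_idx], lr, bs, 50, rng)
        results[(lr, bs)] = accuracy(w, X_all[val_idx], y_all[val_idx])

best = max(results, key=results.get)
print("best (learning_rate, batch_size):", best, "val accuracy:", results[best])
```

Selecting hyperparameters on the validation set keeps the test set untouched until the final evaluation in part (e).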

(e)

Use the trained model to report the performance of the model on the test set. For evaluation
metrics, use accuracy, precision, recall, and F1-score.

In [154]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_test_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)

print("Test Set Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

Test Set Performance:
Accuracy: 0.7558
Precision: 1.0000
Recall: 0.6500
F1-Score: 0.7879
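
As a consistency check, the four metrics can be recomputed by hand from confusion-matrix counts. The counts below are not printed by the notebook. They are inferred from the reported scores under the assumption that the test set holds 86 samples (569 in the dataset minus the 483 in training plus validation).

```python
# Confusion-matrix counts inferred from the reported scores (an assumption, see above).
tp, fp, fn, tn = 39, 0, 21, 26

accuracy = (tp + tn) / (tp + fp + fn + tn)   # fraction of correct predictions
precision = tp / (tp + fp)                   # predicted positives that are correct
recall = tp / (tp + fn)                      # actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
```

Under these inferred counts, the model makes no false-positive predictions on the test set but misses 21 of the 60 positive cases, which is why precision is perfect while recall is low.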
