## Exercise 3: Implement minibatch SGD to train a binary logistic regression model

We implement the model as a class that stores the hyperparameters, the weight vector, and a feature scaler.
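For reference, the loss and gradient used by this kind of model can be written out as follows. The notation is assumed here for illustration and does not appear in the original exercise: $X \in \mathbb{R}^{m \times d}$ is a batch of $m$ samples, $y \in \{0, 1\}^m$ the labels, $w$ the weight vector, $\eta$ the learning rate, and $\sigma$ the sigmoid function. There is no separate bias term, matching the class below.

$$
\hat{y} = \sigma(Xw), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
L(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
$$

$$
\nabla_w L(w) = \frac{1}{m} X^\top (\hat{y} - y), \qquad w \leftarrow w - \eta \, \nabla_w L(w)
$$

In minibatch SGD the update is applied once per batch, with $X$ and $y$ restricted to the samples in that batch.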

In [66]:
import numpy as np
from sklearn.preprocessing import StandardScaler

class LogisticRegressionSGD:
    def __init__(self, learning_rate=0.01, batch_size=32, max_iters=1000):
        self.learning_rate = learning_rate
        self.batch_size = batch_size
        self.max_iters = max_iters
        self.weights = None
        # scales the data to 0 mean and unit variance to avoid overflow
        self.scaler = StandardScaler()

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def compute_loss(self, X, y):
        y_pred = self.sigmoid(X.dot(self.weights))
        # clip predictions away from 0 and 1 to avoid log(0)
        epsilon = 1e-10
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        # mean binary cross-entropy over all samples
        loss = -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
        return loss

    def gradient(self, X, y):
        m = X.shape[0]
        y_pred = self.sigmoid(X.dot(self.weights))
        return (1 / m) * X.T.dot(y_pred - y)

    def train(self, X, y):
        X = self.scaler.fit_transform(X)
        n_samples, n_features = X.shape
        # alternative: initialize the weights as the all-zero vector
        # self.weights = np.zeros(n_features)
        # initialize the weights from a standard Gaussian
        self.weights = np.random.randn(n_features)

        losses = []

        # each iteration is one full pass (epoch) over the shuffled training data
        for i in range(self.max_iters):
            indices = np.random.permutation(n_samples)
            X_shuffled = X[indices]
            y_shuffled = y[indices]

            for start in range(0, n_samples, self.batch_size):
                end = min(start + self.batch_size, n_samples)
                X_batch = X_shuffled[start:end]
                y_batch = y_shuffled[start:end]

                # Compute gradient and update weights
                grad = self.gradient(X_batch, y_batch)
                self.weights -= self.learning_rate * grad

            # monitor loss
            loss = self.compute_loss(X, y)
            losses.append(loss)
            if i % 100 == 0:
                print(f"Iteration {i}, Loss: {loss:.4f}")
        print(f'\nfinal loss: {losses[-1]:.4f}')

        return losses

    def predict(self, X):
        X = self.scaler.transform(X)
        probabilities = self.sigmoid(X.dot(self.weights))
        return (probabilities >= 0.5).astype(int)
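The class relies on standardized inputs to keep `np.exp(-z)` from overflowing. As a complementary safeguard, the sigmoid can be written so that it never exponentiates a large positive number. The sketch below is one common way to do this; `stable_sigmoid` is a hypothetical helper name and is not part of the class above.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid evaluated without exponentiating large positive numbers."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # for z >= 0, exp(-z) <= 1, so the usual formula is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # for z < 0, use the equivalent form exp(z) / (1 + exp(z)), where exp(z) < 1
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)
```

Both branches compute the same function; only the algebraic form changes so that the argument of `np.exp` is never positive.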



## Exercise 4: Model implementation: classification on a breast cancer dataset

#### 1. Import the dataset

In [29]:
# the following code is adapted from https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17) 
  
# data (as pandas dataframes) 
X0 = breast_cancer_wisconsin_diagnostic.data.features 
y0 = breast_cancer_wisconsin_diagnostic.data.targets 
  
# metadata 
#print(breast_cancer_wisconsin_diagnostic.metadata) 
  
# variable information 
#print(breast_cancer_wisconsin_diagnostic.variables) 


In [30]:
# convert the dataframe to numpy array
X = X0.to_numpy()
y = y0.to_numpy()

#### 2. Split the dataset into train, validation and test sets

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
# split the data into train:validation:test = 8:1:1
X_train, X_temp, y_train, y_temp = train_test_split(X, y, train_size = 0.8, random_state=23)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, train_size = 0.5, random_state=23)

# check the result
print(f'number of training data: {len(X_train)}')
print(f'number of validation data: {len(X_val)}')
print(f'number of test data: {len(X_test)}')

# check if the data and target length fits
assert len(X_train) == len(y_train)
assert len(X_val) == len(y_val)
assert len(X_test) == len(y_test)

#### 3. Report the size of each class in training and validation set

In [25]:
from collections import Counter
def count(y):
    # change shape: (n, 1) --> (n,)
    y_flat = y.flatten()
    category_counts = Counter(y_flat)
    return category_counts

print('size of each class in training set:')
print(count(y_train))
print('\nsize of each class in validation set:')
print(count(y_val))


size of each class in training set:
Counter({'B': 282, 'M': 173})

size of each class in validation set:
Counter({'B': 41, 'M': 16})


Both the training set and the validation set are imbalanced: benign (B) samples outnumber malignant (M) samples, 282 to 173 in the training set and 41 to 16 in the validation set.
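One common way to keep the class proportions similar across splits of an imbalanced dataset is the `stratify` argument of `train_test_split`. The following is a minimal sketch on synthetic labels; the arrays `X_demo` and `y_demo` are made up for illustration, and the split used in this notebook does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in data: 125 'B' and 75 'M' labels (37.5% 'M')
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array(['B'] * 125 + ['M'] * 75)

# 8:1:1 split, stratifying on the labels at each step
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=23)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, train_size=0.5, stratify=y_tmp, random_state=23)

for name, labels in [('train', y_tr), ('val', y_va), ('test', y_te)]:
    print(f"{name}: n = {len(labels)}, fraction of 'M' = {(labels == 'M').mean():.3f}")
```

With stratification, each split keeps roughly the same fraction of `'M'` labels as the full dataset, up to rounding.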

#### 4. Train our minibatch-SGD model (from exercise 3)

In [79]:
from sklearn.preprocessing import LabelEncoder
# preprocessing: change the label to 0/1
# change shape: (n, 1) --> (n,)
y_train = y_train.flatten()
y_val = y_val.flatten()
y_test = y_test.flatten()
# encode y: LabelEncoder sorts the classes, so B --> 0, M --> 1
label_encoder = LabelEncoder()
# fit the encoder on the training labels only, then reuse the same mapping
y_train_encoded = label_encoder.fit_transform(y_train)
y_val_encoded = label_encoder.transform(y_val)
y_test_encoded = label_encoder.transform(y_test)


Iteration 0, Loss: 1.3285
Iteration 100, Loss: 0.0844
Iteration 200, Loss: 0.0697
Iteration 300, Loss: 0.0633
Iteration 400, Loss: 0.0595
Iteration 500, Loss: 0.0569
Iteration 600, Loss: 0.0551
Iteration 700, Loss: 0.0537
Iteration 800, Loss: 0.0526
Iteration 900, Loss: 0.0517
final loss: 0.0509
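A note on the label encoding step: `LabelEncoder` assigns integer codes in sorted order of the class names, and refitting it on a different label array can change the mapping. The toy arrays below are made up to illustrate this behaviour.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
enc.fit(np.array(['M', 'B', 'B', 'M']))
print(enc.classes_)            # classes are stored in sorted order: ['B' 'M']

# reusing the fitted encoder keeps the mapping B -> 0, M -> 1
only_m = np.array(['M', 'M'])
reused = enc.transform(only_m)
print(reused)                  # [1 1]

# refitting on a split that happens to contain only 'M' maps M -> 0 instead
refit = LabelEncoder().fit_transform(only_m)
print(refit)                   # [0 0]
```

Fitting once on the training labels and calling `transform` on the other splits keeps the codes consistent even if a split is missing a class.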


In [97]:
# train the model
model = LogisticRegressionSGD(learning_rate=0.05, batch_size=8, max_iters=1000)
losses = model.train(X_train, y_train_encoded)
# do prediction on training and validation set
train_predictions = model.predict(X_train)
val_predictions = model.predict(X_val)


Iteration 0, Loss: 0.3402
Iteration 100, Loss: 0.0518
Iteration 200, Loss: 0.0471
Iteration 300, Loss: 0.0446
Iteration 400, Loss: 0.0429
Iteration 500, Loss: 0.0416
Iteration 600, Loss: 0.0406
Iteration 700, Loss: 0.0398
Iteration 800, Loss: 0.0392
Iteration 900, Loss: 0.0386
final loss: 0.0381


Next we tune the hyperparameters and pick the best values. We evaluate the model using the **macro F1 score** on the validation set, since the validation data is also strongly imbalanced.

In [98]:
from sklearn.metrics import f1_score
# present macro f1 score on validation set
print(f"the Macro F1 score on validation set is {f1_score(y_val_encoded, val_predictions, average = 'macro'):.4f}")

the Macro F1 score on validation set is 1.0000
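To make the metric concrete, the macro F1 score is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class. The sketch below recomputes it by hand on made-up labels and compares the result with `f1_score`; `per_class_f1` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# toy labels: six samples of class 0, two of class 1
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0])

def per_class_f1(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = np.sum((y_pred == label) & (y_true == label))
    fp = np.sum((y_pred == label) & (y_true != label))
    fn = np.sum((y_pred != label) & (y_true == label))
    return 2 * tp / (2 * tp + fp + fn)

f1_per_class = [per_class_f1(y_true, y_pred, c) for c in (0, 1)]
macro_manual = float(np.mean(f1_per_class))
macro_sklearn = f1_score(y_true, y_pred, average='macro')
accuracy = float(np.mean(y_true == y_pred))

print(f"per-class F1: {f1_per_class}")
print(f"macro F1 (manual): {macro_manual:.4f}, macro F1 (sklearn): {macro_sklearn:.4f}")
print(f"accuracy: {accuracy:.4f}")
```

In this toy case the accuracy is higher than the macro F1 score because the errors on the small class are averaged in with equal weight.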


Macro F1 score on the validation set and final training loss for different hyperparameters:

| Learning rate | Max iterations | Batch size | Macro F1 (validation) | Final train loss |
|---------------|----------------|------------|-----------------------|------------------|
| 0.01          | 200            | 16         | 0.9787                | 0.0738           |
| 0.01          | 500            | 16         | 0.9787                | 0.0609           |
| 0.01          | 1000           | 16         | 0.9787                | 0.0509           |
| 0.005         | 1000           | 16         | 0.9787                | 0.0556           |
| 0.005         | 1000           | 8          | 1.0000                | 0.0375           |
| 0.05          | 1000           | 16         | 1.0000                | 0.0418           |
| 0.05          | 1000           | 32         | 0.9787                | 0.0570           |




Since the validation set is rather small, the macro F1 scores are all very close. We therefore also take the training loss into account and choose **learning rate = 0.005, max iterations = 1000, batch size = 8** as the final hyperparameters.
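A comparison like the one in the table can also be automated with a small loop over a hyperparameter grid. The sketch below is self-contained and runs on synthetic data; `sgd_logreg`, the grid values, and the data are assumptions made for illustration and are not the experiments reported above. Fixing the random seed makes each run repeatable, which helps when differences between configurations are small.

```python
import itertools
import numpy as np
from sklearn.metrics import f1_score

def sgd_logreg(X, y, lr, batch_size, epochs, seed=0):
    """Minibatch SGD for logistic regression without a bias term."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-X[b] @ w))
            w -= lr * X[b].T @ (p - y[b]) / len(b)
    return w

# synthetic, roughly linearly separable data (illustration only)
rng = np.random.default_rng(0)
X_all = rng.normal(size=(300, 4))
noise = 0.3 * rng.normal(size=300)
y_all = (X_all[:, 0] + 0.5 * X_all[:, 1] + noise > 0).astype(int)
X_tr, y_tr = X_all[:240], y_all[:240]
X_va, y_va = X_all[240:], y_all[240:]

results = []
for lr, bs in itertools.product([0.005, 0.05], [8, 32]):
    w = sgd_logreg(X_tr, y_tr, lr=lr, batch_size=bs, epochs=50)
    # sigmoid(z) >= 0.5 is equivalent to z >= 0
    val_pred = (X_va @ w >= 0).astype(int)
    results.append((lr, bs, f1_score(y_va, val_pred, average='macro')))

for lr, bs, score in results:
    print(f"lr = {lr}, batch size = {bs}: macro F1 = {score:.4f}")
best = max(results, key=lambda r: r[2])
print("best configuration:", best)
```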

#### 5. Report the performance of the model on the test set

In [100]:
from sklearn.metrics import classification_report
# run the model on test set to see its performance
test_predictions = model.predict(X_test)

print(classification_report(y_test_encoded, test_predictions))

              precision    recall  f1-score   support

           0       0.94      0.97      0.96        34
           1       0.95      0.91      0.93        23

    accuracy                           0.95        57
   macro avg       0.95      0.94      0.94        57
weighted avg       0.95      0.95      0.95        57



#### 6. Summary

Overall, the model performs very well. The F1 scores on the training and validation sets are both close to 1; on the test set the macro F1 score is slightly lower (0.94) but still acceptable.

From the class counts in (c) we can see that the dataset is imbalanced, so the macro F1 score is a better evaluation metric than plain accuracy.

This is a classification problem and the two classes are easy to separate, so the predicted probabilities are sometimes extremely close to 0 or 1. The loss then evaluates $\log 0$, which blows up and can produce NaN values. To avoid this, the predictions are clipped to $[\epsilon, 1 - \epsilon]$ with a small $\epsilon$ before taking the logarithm.
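The effect of the clipping can be seen on a tiny example. The sketch below uses made-up labels and predictions; `bce` is a hypothetical helper that mirrors the binary cross-entropy formula, with the clipping made optional so both cases can be compared.

```python
import numpy as np

def bce(y, p, eps=None):
    """Mean binary cross-entropy; clips p to [eps, 1 - eps] if eps is given."""
    if eps is not None:
        p = np.clip(p, eps, 1 - eps)
    # silence the warnings NumPy would raise for log(0) and 0 * inf
    with np.errstate(divide='ignore', invalid='ignore'):
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])
p = np.array([1.0, 0.0, 0.0])  # saturated predictions; the last one is wrong

loss_raw = bce(y, p)
loss_clipped = bce(y, p, eps=1e-10)
print("without clipping:", loss_raw)
print("with clipping:   ", loss_clipped)
```

Without clipping, terms such as `0 * log(0)` evaluate to NaN and contaminate the mean; with clipping, the loss stays finite and is dominated by the one confidently wrong prediction.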

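The training loss decreases steadily, which shows that the optimization is converging. A decreasing training loss alone does not rule out overfitting; however, the macro F1 scores on the validation and test sets stay close to the training performance, which suggests that the model does not overfit badly.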

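When choosing the hyperparameters, I ran several experiments and found that more iterations and a smaller batch size lead to a lower training loss. In the table above, the larger learning rate of 0.05 also reached a lower training loss than 0.01 or 0.005 at the same batch size and number of iterations. However, more iterations and smaller batches also increase the training time, so the choice of hyperparameters has to balance performance against cost.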

Because the dataset is small and the validation set contains only 57 samples, changing the hyperparameters makes little visible difference in validation performance. We therefore also look at the loss on the training set when comparing configurations.

Additionally, we could re-split the training and validation sets so that the macro F1 score responds more clearly when tuning the hyperparameters.