<a href="https://colab.research.google.com/github/uscmlsystems/ml-systems-hw1-madhavdanturthi/blob/main/2_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EE 508 HW 1 Part 2: Classification

Your task in this Colab notebook is to fill out the sections that are specified by **TODO** (please search the keyword `TODO` to make sure you do not miss any).

## Cross Validation, Bias-Variance trade-off, Overfitting

In this section, we will demonstrate data splitting and the validation process in machine learning paradigms. We will use the Iris dataset from the `sklearn` library.

Objective:
- Train a Fully-Connected Network (FCN) for classification.  
- Partition the data using three-fold cross-validation and report the training, validation, and testing accuracy.  
- Train the model using cross-entropy loss and evaluate it with 0/1 loss.  

In [None]:
# import required libraries and dataset
import numpy as np
# load sklearn for ML functions
from sklearn.datasets import load_iris
# load torch for training NNs
import torch
import torch.nn as nn
import torch.optim as optim
# plotting library
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use(['ggplot'])

### **TODO 1**: Implement the cross validation function
In this function, the dataset is first shuffled. Then, we need to implement a loop that iterates through each fold, selecting a subset of samples as the validation set while assigning the remaining samples to the training set, and stores these partitions in the `folds` list.

In [None]:
def cross_validation(x: np.ndarray, y: np.ndarray, n_folds: int=3):
    """
    Split the dataset into the given number of folds
    Parameters:
    - x: Features of the dataset, with shape (n_samples, n_features)
    - y: Class labels of the dataset, with shape (n_samples,)
    - n_folds: the number of partitions
        For instance, 5-fold CV over 100% of the shuffled data:
        fold_1: training on 20~99, validation on 0~19(%)
        fold_2: training on 0~19 and 40~99, validation on 20~39(%)
        fold_3: training on 0~39 and 60~99, validation on 40~59(%)
        fold_4: training on 0~59 and 80~99, validation on 60~79(%)
        fold_5: training on 0~79, validation on 80~99(%)

    Returns:
    - folds (list): In the format with len(folds) == n_folds
        [
            (x_train_fold1, y_train_fold1, x_valid_fold1, y_valid_fold1),
            (x_train_fold2, y_train_fold2, x_valid_fold2, y_valid_fold2),
            (x_train_fold3, y_train_fold3, x_valid_fold3, y_valid_fold3),
            ...
        ]
    """

    folds = []
    n_data = x.shape[0]
    index = np.arange(n_data)
    # shuffle the sample indices with np.random.shuffle
    np.random.shuffle(index)
    # find the fold boundaries with numpy.linspace
    partitions = np.linspace(0, n_data, num=n_folds+1, endpoint=True)
    partitions = partitions.astype(int)

    # Finish the code here
    # each fold validates on one contiguous slice of the shuffled indices
    # and trains on all remaining samples
    for i in range(n_folds):
        valid_idx = index[partitions[i]:partitions[i+1]]
        train_idx = np.concatenate([index[:partitions[i]], index[partitions[i+1]:]])
        folds.append((x[train_idx], y[train_idx], x[valid_idx], y[valid_idx]))

    print("The Partitions:")
    for idx, (_, train_y, _, valid_y) in enumerate(folds):
        print(f"[Fold-{idx+1}] #Training: {train_y.shape[0]:d}; #Validation: {valid_y.shape[0]:d}")
        # you can check the label distribution with collections.Counter
        # from collections import Counter
        # print(Counter(train_y))
        # print(Counter(valid_y))

    return folds

In [None]:
# fix the random seed
np.random.seed(42)
# Load Iris dataset
iris = load_iris()
x, y = iris.data, iris.target
# Split into training and validation folds
three_folds = cross_validation(x, y)

The Partitions:
[Fold-1] #Training: 100; #Validation: 50
[Fold-2] #Training: 100; #Validation: 50
[Fold-3] #Training: 100; #Validation: 50
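
As a quick sanity check on the partitioning logic, the sketch below recomputes the fold boundaries with `np.linspace` for 150 samples and 3 folds, then confirms that every sample lands in exactly one validation fold. The variable names and the seed are illustrative assumptions and are not part of the assignment code.

```python
import numpy as np

n_data, n_folds = 150, 3  # Iris has 150 samples; three folds as above

# Fold boundaries: n_folds + 1 evenly spaced cut points, truncated to integers
bounds = np.linspace(0, n_data, num=n_folds + 1, endpoint=True).astype(int)

rng = np.random.default_rng(0)  # illustrative seed, not the notebook's
index = rng.permutation(n_data)

# The validation indices of each fold are consecutive slices of the shuffled index
valid_sets = [index[bounds[i]:bounds[i + 1]] for i in range(n_folds)]

# Every sample should appear in exactly one validation fold
all_valid = np.sort(np.concatenate(valid_sets))
covers_all = bool(np.array_equal(all_valid, np.arange(n_data)))
fold_sizes = [len(v) for v in valid_sets]
print(bounds.tolist(), fold_sizes, covers_all)
```

Because the validation slices are disjoint and together cover the whole index, each sample is used for validation once and for training in the other two folds.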


### **TODO 2**: Build a Fully-Connected Network with PyTorch
In this section, we build simple FCN models with different numbers of hidden units for the classification task.

- **Training:** Use cross-entropy for optimization.  
- **Inference:** Evaluate with 0/1 loss.  
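
To make the two losses concrete, here is a minimal NumPy sketch that computes both on a handful of made-up logits. The numbers are invented for illustration, and the code is a plain re-derivation of the formulas rather than the PyTorch calls used in this notebook.

```python
import numpy as np

# Toy logits for 3 samples and 3 classes (made-up numbers for illustration)
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.3, 2.5],
                   [1.0, 1.2, 0.9]])
labels = np.array([0, 2, 0])

# Cross-entropy: mean negative log-softmax probability of the true class
shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
cross_entropy = float(-log_probs[np.arange(len(labels)), labels].mean())

# 0/1 loss: fraction of samples whose argmax prediction is wrong
preds = logits.argmax(axis=1)
zero_one = float((preds != labels).mean())

print(f"cross-entropy: {cross_entropy:.4f}, 0/1 loss: {zero_one:.4f}")
```

Cross-entropy changes smoothly with the logits, so it provides useful gradients for training. The 0/1 loss is piecewise constant, so its gradient is zero almost everywhere, which is why it is used only for evaluation.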

In [None]:
# define the FCN model
class FCN_model(nn.Module):
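    # take the number of hidden units as an argument; the defaults for
    # n_features and n_classes match Iris (4 features, 3 classes), so the
    # model no longer depends on the global x and y
    def __init__(self, n_hidden=32, n_features=4, n_classes=3):
        # Finish the code here
        super().__init__()
        self.inputSize = n_features
        self.classes = n_classes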
        self.hidden1 = nn.Linear(self.inputSize, n_hidden)
        self.hidden2 = nn.Linear(n_hidden, n_hidden)
        self.relu = nn.ReLU()
        self.output = nn.Linear(n_hidden, self.classes)

    def forward(self, x):
        # Finish the code here
        x = self.hidden1(x)
        x = self.relu(x)
        x = self.hidden2(x)
        x = self.relu(x)
        x = self.output(x)

        return x

Set up the evaluation and training functions for the FCN models.

In [None]:
def eval(model:nn.Module,
         x:torch.Tensor,
         y:torch.Tensor) -> float:
    """Evaluate the model with the 0/1 loss
    The predicted label is the class with the maximum logit
    (note that this function shadows Python's built-in eval)

    Parameters:
    - model: the FCN model
    - x: input features
    - y: ground truth labels, dtype=long

    Returns:
    - loss: the average 0/1 loss value
    """
    # Evaluate the model
    model.eval()
    with torch.no_grad():
        preds = torch.argmax(model(x), dim=1)

    # Finish the code here
    # count the misclassified samples with one vectorized comparison
    n_errors = (preds != y).sum().item()
    loss = n_errors / preds.shape[0]

    print(f"Averaging 0/1 loss: {loss:.4f}")
    return loss

In [None]:
def train(model:nn.Module,
          x_train:torch.Tensor,
          y_train:torch.Tensor,
          x_valid:torch.Tensor,
          y_valid:torch.Tensor,
          epochs:int=300):
    """Training process
    Parameters:
    - model: the FCN model
    - x_train, y_train: training features and labels (dtype=long)
    - x_valid, y_valid: validation features and labels (dtype=long)
    - epochs: number of epochs for training
    """
    # To simplify the process
    # we do not take batches but use all the training samples
    # set up the objective function and the optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=1e-2)
    # training loop
    for epoch in range(epochs):
        model.train()
        # Forward pass
        outputs = model(x_train)
        loss = criterion(outputs, y_train)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (epoch + 1) % 100 == 0:
            print(f"Epoch [{epoch + 1}/{epochs}], Cross Entropy Loss: {loss.item():.4f}")
            print(f"[Train] ", end="")
            eval(model, x_train, y_train)
            print(f"[Valid] ", end="")
            eval(model, x_valid, y_valid)


### **TODO 3**: Conduct the training/validation process in each fold
We will use three-fold cross-validation, meaning you need to train three models, one per fold, and report the training and validation loss for each.  

First, instantiate an FCN model with 32 hidden units.  
Then, call the `train` function, which takes the training and validation folds created by the `cross_validation()` function, along with the model, as input. Set `epochs` to `500`.  


In [None]:
train_losses, valid_losses = [], []

for idx, (x_train, y_train, x_valid, y_valid) in enumerate(three_folds):
    print(f"===== Training Fold {idx} =====")
    x_train = torch.tensor(x_train, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.long)
    x_valid = torch.tensor(x_valid, dtype=torch.float32)
    y_valid = torch.tensor(y_valid, dtype=torch.long)

    # Finish the code here
    model = FCN_model(n_hidden=32)
    train(model, x_train, y_train, x_valid, y_valid, epochs=500)

    train_losses.append(eval(model, x_train, y_train))
    valid_losses.append(eval(model, x_valid, y_valid))

===== Training Fold 0 =====
Epoch [100/500], Cross Entropy Loss: 0.7262
[Train] Averaging 0/1 loss: 0.2800
[Valid] Averaging 0/1 loss: 0.2800
Epoch [200/500], Cross Entropy Loss: 0.4748
[Train] Averaging 0/1 loss: 0.0800
[Valid] Averaging 0/1 loss: 0.0800
Epoch [300/500], Cross Entropy Loss: 0.3668
[Train] Averaging 0/1 loss: 0.0400
[Valid] Averaging 0/1 loss: 0.0200
Epoch [400/500], Cross Entropy Loss: 0.2840
[Train] Averaging 0/1 loss: 0.0400
[Valid] Averaging 0/1 loss: 0.0000
Epoch [500/500], Cross Entropy Loss: 0.2161
[Train] Averaging 0/1 loss: 0.0300
[Valid] Averaging 0/1 loss: 0.0000
Averaging 0/1 loss: 0.0300
Averaging 0/1 loss: 0.0000
===== Training Fold 1 =====
Epoch [100/500], Cross Entropy Loss: 0.7441
[Train] Averaging 0/1 loss: 0.3300
[Valid] Averaging 0/1 loss: 0.3400
Epoch [200/500], Cross Entropy Loss: 0.4805
[Train] Averaging 0/1 loss: 0.1300
[Valid] Averaging 0/1 loss: 0.1600
Epoch [300/500], Cross Entropy Loss: 0.3527
[Train] Averaging 0/1 loss: 0.0200
[Valid] Averaging

In [None]:
print(f"#Fold, training loss, validation loss")
for idx, (train_loss, valid_loss) in enumerate(zip(train_losses, valid_losses)):
    print(f"{idx:>5d},          {train_loss:.2f},            {valid_loss:.2f}")

#Fold, training loss, validation loss
    0,          0.03,            0.00
    1,          0.02,            0.06
    2,          0.02,            0.02
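
A common way to summarize cross-validation is to average the per-fold losses. The short sketch below does this for the numbers in the table above. The values are copied by hand and already rounded to two decimals, so the summary is only approximate, and the names `train_table` and `valid_table` are illustrative.

```python
import numpy as np

# 0/1 losses copied from the table above (rounded to two decimals)
train_table = [0.03, 0.02, 0.02]
valid_table = [0.00, 0.06, 0.02]

mean_train, std_train = float(np.mean(train_table)), float(np.std(train_table))
mean_valid, std_valid = float(np.mean(valid_table)), float(np.std(valid_table))

print(f"train: mean = {mean_train:.3f}, std = {std_train:.3f}")
print(f"valid: mean = {mean_valid:.3f}, std = {std_valid:.3f}")
```

With only 50 validation samples per fold, a single misclassified sample moves the fold's 0/1 loss by 0.02, so the spread across folds should be read with that granularity in mind.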


### **TODO 4**: Check overfitting with a complex model
We can follow the same procedure with a more complex FCN model.  
Now, set the `number of hidden units` to `2048` and repeat the process for three-fold validation with `epochs = 500`.  
The gap between the training and validation performance should increase.  
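
To get a feel for how much more capacity the larger model has, the sketch below counts the trainable parameters of a network with two hidden layers of equal width, as in the `FCN_model` cell, for Iris-sized inputs (4 features, 3 classes). The helper `count_fcn_params` is a hypothetical name introduced here for illustration.

```python
def count_fcn_params(n_hidden: int, n_features: int = 4, n_classes: int = 3) -> int:
    """Weights plus biases for input -> hidden -> hidden -> output."""
    layer1 = n_features * n_hidden + n_hidden   # first hidden layer
    layer2 = n_hidden * n_hidden + n_hidden     # second hidden layer
    output = n_hidden * n_classes + n_classes   # output layer
    return layer1 + layer2 + output

small = count_fcn_params(32)
large = count_fcn_params(2048)
print(f"32 hidden units:   {small:,} parameters")
print(f"2048 hidden units: {large:,} parameters")
```

Each fold has only 100 training samples, so the 2048-unit model has several orders of magnitude more parameters than data points, which is the setting where overfitting is a concern.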

In [None]:
train_overfit, valid_overfit = [], []

for idx, (x_train, y_train, x_valid, y_valid) in enumerate(three_folds):
    print(f"===== Training Fold {idx} =====")
    x_train = torch.tensor(x_train, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.long)
    x_valid = torch.tensor(x_valid, dtype=torch.float32)
    y_valid = torch.tensor(y_valid, dtype=torch.long)

    # Finish the code here
    model = FCN_model(n_hidden=2048)
    train(model, x_train, y_train, x_valid, y_valid, epochs=500)

    train_overfit.append(eval(model, x_train, y_train))
    valid_overfit.append(eval(model, x_valid, y_valid))

===== Training Fold 0 =====
Epoch [100/500], Cross Entropy Loss: 0.3343
[Train] Averaging 0/1 loss: 0.2200
[Valid] Averaging 0/1 loss: 0.1800
Epoch [200/500], Cross Entropy Loss: 0.2298
[Train] Averaging 0/1 loss: 0.1300
[Valid] Averaging 0/1 loss: 0.0800
Epoch [300/500], Cross Entropy Loss: 0.1309
[Train] Averaging 0/1 loss: 0.0400
[Valid] Averaging 0/1 loss: 0.0200
Epoch [400/500], Cross Entropy Loss: 0.0885
[Train] Averaging 0/1 loss: 0.0200
[Valid] Averaging 0/1 loss: 0.0000
Epoch [500/500], Cross Entropy Loss: 0.0805
[Train] Averaging 0/1 loss: 0.0200
[Valid] Averaging 0/1 loss: 0.0000
Averaging 0/1 loss: 0.0200
Averaging 0/1 loss: 0.0000
===== Training Fold 1 =====
Epoch [100/500], Cross Entropy Loss: 0.3159
[Train] Averaging 0/1 loss: 0.2200
[Valid] Averaging 0/1 loss: 0.2400
Epoch [200/500], Cross Entropy Loss: 0.2000
[Train] Averaging 0/1 loss: 0.1200
[Valid] Averaging 0/1 loss: 0.1600
Epoch [300/500], Cross Entropy Loss: 0.0913
[Train] Averaging 0/1 loss: 0.0300
[Valid] Averaging

In [None]:
print(f"#Fold, training loss, validation loss")
for idx, (train_loss, valid_loss) in enumerate(zip(train_overfit, valid_overfit)):
    print(f"{idx:>5d},          {train_loss:.2f},            {valid_loss:.2f}")

#Fold, training loss, validation loss
    0,          0.02,            0.00
    1,          0.01,            0.06
    2,          0.04,            0.02


### **TODO 5**: Compare the FCN with statistical ML models
Here, we will use the Gaussian Naive Bayes model from the `sklearn` library and perform three-fold cross-validation on the same folds.  

In [None]:
# Load the Naive Bayes classifier from the library
from sklearn.naive_bayes import GaussianNB

train_nb, valid_nb = [], []
for idx, (x_train, y_train, x_valid, y_valid) in enumerate(three_folds):

    # Finish the code here
    nb = GaussianNB()
    nb.fit(x_train, y_train)
    train_acc = nb.score(x_train, y_train)
    valid_acc = nb.score(x_valid, y_valid)


    train_nb.append(1 - train_acc)
    valid_nb.append(1 - valid_acc)

In [None]:
print(f"#Fold, training loss, validation loss")
for idx, (train_loss, valid_loss) in enumerate(zip(train_nb, valid_nb)):
    print(f"{idx:>5d},          {train_loss:.2f},            {valid_loss:.2f}")

#Fold, training loss, validation loss
    0,          0.05,            0.04
    1,          0.02,            0.06
    2,          0.04,            0.04


### **TODO 6**:
Answer the following questions in the next cell.  
1. What is the bias-variance trade-off in machine learning?
2. How can overfitting and underfitting be reduced?
3. How do the training and inference processes differ between the Naive Bayes model and a fully connected neural network?

Your answer:
1. The bias-variance trade-off describes the tension between two sources of generalization error. Bias is the error caused by overly simple modeling assumptions, and high bias leads to underfitting. Variance is the error caused by sensitivity to the particular training set, and high variance leads to overfitting. Making a model more flexible typically lowers bias but raises variance, so the two must be balanced for the model to generalize well to unseen data.

2. Overfitting can be reduced by using a simpler model or by increasing the training set size, both of which make it harder for the model to fit noise in the training data. The most common remedy, however, is regularization: adding a penalty to the loss function discourages large model weights and improves generalization to unseen data. By limiting the magnitude of the model parameters, the model becomes less sensitive to small amounts of noise, which reduces variance and therefore overfitting. Underfitting can be reduced by increasing the model complexity, training the model for longer, or weakening the regularization if the penalty is too strong.

3. For Naive Bayes, training estimates the class priors and the per-feature class-conditional distributions (a mean and a variance per class for Gaussian Naive Bayes) directly from the data, assuming the features are conditionally independent given the class. Inference applies Bayes' theorem to these estimates and assigns new data to the most probable class. A fully connected neural network, on the other hand, is trained iteratively: backpropagation and gradient descent update the weights over many epochs, which makes training far more computationally expensive. Inference is a forward pass through the hidden layers and activation functions, which is typically also slower than Naive Bayes for large networks.
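
The contrast in answer 3 can be illustrated with a minimal NumPy sketch of Gaussian Naive Bayes. It is a simplified re-derivation under stated assumptions, not scikit-learn's implementation: the helper names, the fixed `eps` variance floor, and the toy data are all made up for illustration.

```python
import numpy as np

def fit_gaussian_nb(x, y, eps=1e-9):
    """Training: estimate priors and per-class feature means/variances in one pass."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([x[y == c].mean(axis=0) for c in classes])
    # eps is an assumed small floor that keeps the variances strictly positive
    variances = np.array([x[y == c].var(axis=0) for c in classes]) + eps
    return classes, priors, means, variances

def predict_gaussian_nb(x, classes, priors, means, variances):
    """Inference: pick the class maximizing log prior + summed feature log-likelihoods."""
    diff = x[:, None, :] - means[None, :, :]            # (n_samples, n_classes, n_features)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]

# Tiny made-up dataset: two well-separated classes in 2-D
x_toy = np.array([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0],
                  [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(x_toy, y_toy)
preds = predict_gaussian_nb(np.array([[0.1, 0.0], [3.1, 3.0]]), *params)
print(preds)
```

Training here is a single pass of counting and averaging with no iterative optimization, whereas the FCN above needs hundreds of gradient steps to reach a comparable 0/1 loss.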
