# Homework-2: 

`ANN training in Keras or Pytorch & Hyper-parameter tuning`

## Overview 

* Classification is one of the most common forms of supervised machine learning 
* In this homework we will explore "model tuning" for the case of a multi-class classification problem, as applied to the MNIST data set
* `You can do this assignment in either Keras OR PyTorch` (or both), it is your choice.

## Submission 

* You need to upload TWO documents to Canvas when you are done
  * (1) A PDF (or HTML) of the completed form of the `HW-2.ipynb` document 
* The final uploaded version should NOT have any code-errors present 
* All outputs must be visible in the uploaded version, including code-cell outputs, images, graphs, etc

`IMPORTANT`: THERE ARE MANY WAYS TO DO THIS, SO FEEL FREE TO DEVIATE SLIGHTLY FROM THE EXACT DETAILS, BUT THE OVERALL RESULT AND FLOW SHOULD MATCH WHAT IS OUTLINED BELOW. 

**IF YOU ARE BUMPING UP AGAINST COMPUTE LIMITS (e.g., LONG RUNTIMES), CONSIDER USING GOOGLE COLAB OR A COLAB PROFESSIONAL ACCOUNT TO RUN THE CODE. IF THIS IS STILL NOT SUFFICIENT, YOU CAN DOWNSAMPLE THE DATA AND PERFORM A LESS COMPREHENSIVE HYPER-PARAMETER SEARCH.**

In [None]:
# Import
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset
from torch.utils.data import random_split
from torchvision import datasets
from torchvision.transforms import ToTensor
from torchvision import transforms

from sklearn.model_selection import KFold
import numpy as np

# Setting seed
seed = 6600
torch.manual_seed(seed)


<torch._C.Generator at 0x7b517c1ed130>

## HW-2.1: Data preparation

* Normalize the data as needed
* Partition data into training, validation, and test (i.e. leave one out CV)
  * One option to do this is to give these arrays global scope so they are seen inside the training function (so they don't need to be passed to functions)
* **Optional but recommended:** Create a K-fold cross validation data set, rather than just doing leave one out
* Do any other preprocessing you feel is needed
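
One way to carve a validation set out of the training data is `torch.utils.data.random_split` with a seeded generator. The sketch below uses a small synthetic `TensorDataset` in place of the real images. The 80/20 ratio and the helper name `holdout_split` are illustrative assumptions, not requirements of the assignment.

```python
import torch
from torch.utils.data import TensorDataset, random_split

def holdout_split(dataset, val_fraction=0.2, seed=6600):
    """Split a dataset into training and validation subsets (hypothetical helper)."""
    n_val = int(len(dataset) * val_fraction)
    n_train = len(dataset) - n_val
    generator = torch.Generator().manual_seed(seed)  # reproducible split
    return random_split(dataset, [n_train, n_val], generator=generator)

# Synthetic stand-in for image data: 100 samples shaped like 1x28x28 images
images = torch.randn(100, 1, 28, 28)
labels = torch.randint(0, 10, (100,))
train_subset, val_subset = holdout_split(TensorDataset(images, labels))
print(len(train_subset), len(val_subset))  # 80 20
```

The official test split stays untouched until the final fit; only the training portion is divided here.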

In [2]:
### Download training and test data from open datasets.
## Transform: ToTensor scales pixels to [0,1], then Normalize with mean 0.5 and std 0.5 maps them to [-1,1]
transform = transforms.Compose([
    transforms.ToTensor(), 
    transforms.Normalize((0.5,), (0.5,))  
])

## Importing Training and Test data
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=False,        # Switch to True if downloading data instead of importing
    transform=transform,
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=False,
    transform=transform,
)

## Defining folds for k-fold cross validation in training loop 
kfold = KFold(n_splits=5, shuffle=True, random_state=seed)

# explore data-set object
print("\n TYPE:\n",type(training_data))
print("\n SUMMARY:\n",training_data)
print("\n SUMMARY:\n",test_data)
print("\n ATTRIBUTES:\n",training_data.__dict__.keys())
print("\n FIRST DATA POINT:\n",)
img, label = training_data[0]
print(img.shape,label)
print(len(training_data))


 TYPE:
 <class 'torchvision.datasets.mnist.FashionMNIST'>

 SUMMARY:
 Dataset FashionMNIST
    Number of datapoints: 60000
    Root location: data
    Split: Train
    StandardTransform
Transform: Compose(
               ToTensor()
               Normalize(mean=(0.5,), std=(0.5,))
           )

 SUMMARY:
 Dataset FashionMNIST
    Number of datapoints: 10000
    Root location: data
    Split: Test
    StandardTransform
Transform: Compose(
               ToTensor()
               Normalize(mean=(0.5,), std=(0.5,))
           )

 ATTRIBUTES:
 dict_keys(['root', 'transform', 'target_transform', 'transforms', 'train', 'data', 'targets'])

 FIRST DATA POINT:

torch.Size([1, 28, 28]) 9
60000


## HW-2.2: Generalized model

* Create a `General` model function (or class) that takes hyper-parameters and evaluates the model
  * The function should work with a set of hyper-parameters that can easily be controlled and varied by the user (for later parameter tuning)
  * This should work for the training, test, and validation sets 
* Feel free to recycle code from the lab assignments and demos  
* Use the deep learning best practices that we discussed in class. 
* Document what is going on in the code, as needed, with narrative markdown text between cells.

In [None]:
# Get cpu or gpu device for training.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

# Create general model class
class GeneralModel(nn.Module):
    def __init__(self, image_size=(28,28), num_classes=10, layer_sizes=(32,32),activation="relu",dropout_rate=0):
        super().__init__()
        self.flatten = nn.Flatten()

        # Building layers dynamically based on layer sizes
        layers = []
        input_size = image_size[0] * image_size[1]

        for hidden_size in layer_sizes:
            # Linear layer
            layers.append(nn.Linear(input_size, hidden_size))

            # Activation (ReLU, Sigmoid, or Tanh)
            if activation == "relu":
                layers.append(nn.ReLU())
            elif activation == "sigmoid":
                layers.append(nn.Sigmoid())
            elif activation == "tanh":
                layers.append(nn.Tanh())
            else:
                raise ValueError(f"Invalid activation {activation}. Activation must be either relu, sigmoid, or tanh")
            
            # Dropout rate
            if dropout_rate > 0:
                layers.append(nn.Dropout(dropout_rate))

            # Set input size for next layer
            input_size = hidden_size

        # Final output layer
        layers.append(nn.Linear(input_size,num_classes))

        self.linear_relu_stack = nn.Sequential(*layers)

    def forward(self,x):
        x = self.flatten(x)
        x = self.linear_relu_stack(x)
        return x 

Using cuda device


## HW-2.3: Model training function

* You can do this in either a function (or python class), or however you think is best. 
* **Create a training function** (or class) that takes hyper-parameter choices and trains the model
  * If you are doing "leave one out", your training function only needs to do one training per hyper-parameter choice
  * If you are doing K-fold cross validation, you should train the model K times for each hyper-parameter choice, and report the average result across the training runs at the end (this is technically a better practice but requires more computation). 
  * Use a dense feed forward ANN model, with the correct output layer activation, and correct loss function
  * `You MUST use early stopping` inside the function, otherwise it defeats the point
  * **Have at least the following hyper-parameters as inputs to this function** 
    * L1 regularization constant, L2 regularization constant, dropout rate 
    * Learning rate
    * Weight Initialization: Fully random vs Xavier Weight Initialization
    * Hidden layer activation function choice (use relu, sigmoid, or tanh)
    * Number and size of ANN hidden layers 
    * Optimizer choice, have at least three included (Adam, SGD, or RmsProp)
    * You can wrap all of the hyper-parameter arguments into a dictionary, or do it however you want  
  * **Visualization**
    * Include a boolean parameter as a function input that controls whether visualization is created or not
    * If `True`, monitor training and validation performance throughout training by plotting
    * Report a confusion matrix 
  * Return the final training and validation error (averaged if using K-fold)
    * again, you must use early stopping to report the best training/validation loss without over-fitting
* Depending on how you do this, it can take a lot of computation. Start small, scale up, and consider using Colab. 
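
The weight-initialization hyper-parameter listed above can be handled with a small helper that walks the model's linear layers. This is a minimal sketch: the helper name `init_weights`, the two-layer example network, and the reading of "fully random" as a zero-mean Gaussian with `std=0.05` are all assumptions made for illustration.

```python
import torch
from torch import nn

def init_weights(model, scheme="xavier", std=0.05):
    """Re-initialize every nn.Linear layer in `model` (hypothetical helper)."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            if scheme == "xavier":
                nn.init.xavier_uniform_(module.weight)
            elif scheme == "random":
                # Assumption: "fully random" means a plain zero-mean Gaussian
                nn.init.normal_(module.weight, mean=0.0, std=std)
            else:
                raise ValueError(f"Unknown weight_init: {scheme}")
            nn.init.zeros_(module.bias)
    return model

torch.manual_seed(0)
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 10))
init_weights(net, "xavier")

# Xavier uniform draws from U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
bound = (6.0 / (784 + 32)) ** 0.5
print(net[1].weight.abs().max().item(), "should not exceed", bound)
```

Calling the helper right after the model is built, and again whenever the model is reset between folds, keeps the initialization consistent across runs.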
  

In [None]:
class TrainModel:
    def __init__(self, learning_rate=0.01, layer_sizes=(32,32), optimizer="adam", activation = "relu",
                 weight_init="random", l1_lambda=0, l2_lambda=0, dropout_rate=0.0, device="cpu"):
        
        # Store hyper-parameters
        self.device         = device
        self.dropout_rate   = dropout_rate
        self.learning_rate  = learning_rate
        self.weight_init    = weight_init
        self.optimizer      = optimizer
        self.layer_sizes    = layer_sizes
        self.activation     = activation
        self.l1_lambda      = l1_lambda
        self.l2_lambda      = l2_lambda

        # Initialize loss function (expects raw logits; softmax is applied inside the loss)
        self.loss_fn = nn.CrossEntropyLoss()

        # For k-fold CV results
        self.kfold_results = []

        # Build model, optimizer, history, and best-model tracking
        self.reset_model()

    ##################################
    ######## Set up optimizer ########
    ##################################
    def _setup_optimizer(self):
        if self.optimizer == "adam":
            self.opt = torch.optim.Adam(
                self.model.parameters(),
                lr=self.learning_rate,
                weight_decay=self.l2_lambda
            )
        elif self.optimizer == "sgd":
            self.opt = torch.optim.SGD(
                self.model.parameters(),
                lr=self.learning_rate,
                momentum=0.9,
                weight_decay=self.l2_lambda  
            )
        elif self.optimizer == "rmsprop":
            self.opt = torch.optim.RMSprop(
                self.model.parameters(),
                lr=self.learning_rate,
                alpha=0.99,
                momentum=0.0,
                weight_decay=self.l2_lambda
            )
        else:
            raise ValueError(f"Unknown optimizer: {self.optimizer}")

    ##########################################  
    ### Reset model to original parameters ###
    ##########################################
    def reset_model(self):
        # Fresh model, moved to the same device the batches are sent to
        self.model = GeneralModel(
            image_size=(28, 28),
            num_classes=10,
            layer_sizes = self.layer_sizes,
            activation  = self.activation,
            dropout_rate= self.dropout_rate
        ).to(self.device)

        # Apply the weight_init choice; "random" is taken to mean PyTorch's default init
        if self.weight_init == "xavier":
            for module in self.model.modules():
                if isinstance(module, nn.Linear):
                    nn.init.xavier_uniform_(module.weight)
                    nn.init.zeros_(module.bias)
        elif self.weight_init != "random":
            raise ValueError(f"Unknown weight_init: {self.weight_init}")

        # Initialize optimizer
        self._setup_optimizer()

        # Initialize history tracking
        self.history = {
            'train_loss': [],
            'train_acc': [],
            'val_loss': [],
            'val_acc': []
        }

        # Reset best-model tracking so results do not leak from one fold into the next
        self.best_model_state = None
        self.best_val_acc = 0.0

    #################################
    ######## Train one epoch ########
    #################################
    def train_epoch(self, train_loader):
        """Training one epoch"""
        self.model.train()
        train_loss = 0
        correct    = 0
        total      = 0

        for images, labels in train_loader:
            images, labels = images.to(self.device), labels.to(self.device)

            # Forward pass
            outputs = self.model(images)
            loss = self.loss_fn(outputs, labels)

            # L1 Regularization 
            if self.l1_lambda > 0:
                l1_penalty = 0
                for param in self.model.parameters():
                    l1_penalty += torch.sum(torch.abs(param))
                loss = loss + self.l1_lambda * l1_penalty

            # Backward pass 
            self.opt.zero_grad() 
            loss.backward()
            self.opt.step()

            # Track metrics 
            train_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1) 
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        # Calculate average loss and accuracy for the epoch 
        avg_loss = train_loss / len(train_loader)
        accuracy = 100 * correct / total

        return avg_loss, accuracy
    
    ######################################
    ##### Evaluate model performance #####
    ######################################
    def evaluate(self, data_loader):
        """Evaluate model on validation or test set"""
        # Set to evaluation mode 
        self.model.eval()  
        eval_loss = 0
        correct = 0
        total = 0
        
        # Don't compute gradients
        with torch.no_grad():  
            for images, labels in data_loader:
                images, labels = images.to(self.device), labels.to(self.device)
                
                # Forward pass
                outputs = self.model(images)
                loss = self.loss_fn(outputs, labels)
                
                # Track metrics
                eval_loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        
        avg_loss = eval_loss / len(data_loader)
        accuracy = 100 * correct / total
        return avg_loss, accuracy

    #########################
    ##### Fit the model #####
    #########################
    def fit(self, train_loader, val_loader, epochs, patience=None):
        best_val_loss = float('inf')
        patience_counter = 0

        for epoch in range(epochs):
            # Train model
            train_loss, train_acc = self.train_epoch(train_loader)
            self.history['train_loss'].append(train_loss)
            self.history['train_acc'].append(train_acc)

            # Validate 
            val_loss, val_acc = self.evaluate(val_loader)
            self.history['val_loss'].append(val_loss)
            self.history['val_acc'].append(val_acc)

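            # Track best model. Clone the tensors: state_dict().copy() is a shallow
            # copy whose tensors would be overwritten by later epochs.
            if val_acc > self.best_val_acc:
                self.best_val_acc = val_acc 
                self.best_model_state = {k: v.detach().clone() for k, v in self.model.state_dict().items()}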

            # Print progress
            print(f"Epoch {epoch+1}/{epochs} - "
                f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}% - "
                f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%")
            
            # Early stopping
            if patience is not None:
                if val_loss < best_val_loss:
                    best_val_loss = val_loss
                    patience_counter = 0
                else:
                    patience_counter += 1

                if patience_counter >= patience:
                    print(f"\nEarly stopping triggered after {epoch+1} epochs!")
                    print(f"Best validation loss: {best_val_loss:.4f}")
                    break

        # Load best model weights
        if self.best_model_state is not None:
            self.model.load_state_dict(self.best_model_state)
            print(f"\nLoaded best model with validation accuracy: {self.best_val_acc:.2f}%")
            
        return self.history
    
    ##################################
    ### Run with K-Fold Validation ###
    ##################################
    def run_kfold_cv(self, train_data, epochs, batch_size=64, patience=None):
        fold_results = []

        for fold, (train_ids, val_ids) in enumerate(kfold.split(train_data)):
            print(f"\n{'*'*50}")
            print(f"FOLD {fold + 1}/{kfold.n_splits}")
            print(f"{'*'*50}")

            # Reset model for this fold
            self.reset_model()

            # Create data loaders for this fold
            train_subset = Subset(train_data, train_ids)
            val_subset   = Subset(train_data, val_ids)

            train_loader = DataLoader(train_subset, batch_size=batch_size, shuffle=True)
            val_loader = DataLoader(val_subset, batch_size=batch_size, shuffle=False)

            # Train on this fold with early stopping
            history = self.fit(train_loader, val_loader, epochs, patience=patience)

            # Store results
            fold_results.append({
                'fold': fold + 1,
                'final_train_acc': history['train_acc'][-1],
                'final_val_acc': history['val_acc'][-1],
                'best_val_acc': self.best_val_acc,
                'epochs_trained': len(history['train_loss']),
                'history': history
            })

        # Store all fold results
        self.kfold_results = fold_results

        # Print summary
        print(f"\n{'='*50}")
        print("K-FOLD CROSS VALIDATION SUMMARY")
        print(f"{'='*50}")
        
        val_accs = [r['final_val_acc'] for r in fold_results]
        epochs_trained = [r['epochs_trained'] for r in fold_results]
        
        print(f"Validation Accuracies: {val_accs}")
        print(f"Mean Validation Accuracy: {np.mean(val_accs):.2f}%")
        print(f"Std Validation Accuracy: {np.std(val_accs):.2f}%")
        print(f"Epochs per fold: {epochs_trained}")
        print(f"Average epochs: {np.mean(epochs_trained):.1f}")
        
        return fold_results
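
The patience logic used in `fit` can also be isolated and checked without training anything. The sketch below is a standalone illustration: the class name `EarlyStopper`, the `min_delta` option, and the list of validation losses are made up for the example.

```python
class EarlyStopper:
    """Patience-based early stopping on validation loss (hypothetical helper)."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improvement stalls after epoch 2
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.50]
stopper = EarlyStopper(patience=3)
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        break
print(epoch, stopper.best_epoch, stopper.best_loss)  # 5 2 0.6
```

Training stops at epoch 5 and never sees the later 0.50, which is the usual trade-off of a small patience value. In a real loop the model weights would be snapshotted whenever `best_epoch` is updated.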

    


In [None]:
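# NOTE: these helpers are defined outside the TrainModel class body. Call them as
# plot_history(trainer), or attach them with TrainModel.plot_history = plot_history.
def plot_history(self):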
    """Plot training history"""
    import matplotlib.pyplot as plt
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    
    # Plot loss
    ax1.plot(self.history['train_loss'], label='Train Loss')
    ax1.plot(self.history['val_loss'], label='Val Loss')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    ax1.set_title('Training and Validation Loss')
    ax1.legend()
    ax1.grid(True)
    
    # Plot accuracy
    ax2.plot(self.history['train_acc'], label='Train Acc')
    ax2.plot(self.history['val_acc'], label='Val Acc')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy (%)')
    ax2.set_title('Training and Validation Accuracy')
    ax2.legend()
    ax2.grid(True)
    
    plt.tight_layout()
    plt.show()

def save_model(self, filepath):
    """Save model weights"""
    torch.save(self.model.state_dict(), filepath)
    print(f"Model saved to {filepath}")

def load_model(self, filepath):
    """Load model weights"""
    self.model.load_state_dict(torch.load(filepath))
    print(f"Model loaded from {filepath}")
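
The helpers above cover loss and accuracy curves; the confusion matrix that HW-2.3 also asks for can be built from predicted and true labels. This is a minimal sketch with hand-written labels, and the function name `confusion_matrix` is chosen here for illustration (it is not the scikit-learn function of the same name).

```python
import torch

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows index the true class, columns the predicted class."""
    # Encode each (true, predicted) pair as a single integer, then count occurrences
    flat_index = true_labels * num_classes + predicted_labels
    counts = torch.bincount(flat_index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

y_true = torch.tensor([0, 0, 1, 1, 2, 2, 2])
y_pred = torch.tensor([0, 1, 1, 1, 2, 0, 2])
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)
```

With a trained model, `y_pred` would come from the `argmax` over the output logits, collected across all batches of the validation or test loader.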

## HW-2.4: Hyper-parameter tuning

* Keep detailed records of hyper-parameter choices and associated training & validation errors
* Think critically and visualize the results of the search as needed

* **Do each of these in a different sub-section of your notebook**
  
* **Explore hyper-parameter choice-0**
  * for hidden activation=relu, hidden layers = [32,32], optimizer=adam
  * Vary the learning rate via a grid search pattern
  * Plot training and validation error as a function of the learning rate
  * Repeat this exercise for both random and Xavier weight initialization 
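
A learning-rate grid is usually spaced on a log scale, with one record kept per run. The sketch below shows only the bookkeeping: `mock_validation_loss` is a placeholder for a real training run, and the range 1e-4 to 1e-1 is an assumption.

```python
import numpy as np

# Seven learning rates evenly spaced on a log scale from 1e-4 to 1e-1
learning_rates = np.logspace(-4, -1, num=7)

def mock_validation_loss(lr):
    """Placeholder for a real training run; its minimum sits at lr = 1e-2 by construction."""
    return (np.log10(lr) + 2.0) ** 2 + 0.3

records = []
for weight_init in ("random", "xavier"):
    for lr in learning_rates:
        records.append({
            "weight_init": weight_init,
            "lr": float(lr),
            "val_loss": float(mock_validation_loss(lr)),
        })

best = min(records, key=lambda r: r["val_loss"])
print(f"{len(records)} runs, best lr = {best['lr']:.0e}")
```

Plotting `val_loss` against `lr` with a logarithmic x-axis, one curve per `weight_init`, gives the comparison requested for this sub-section.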

* **Explore hyper-parameter choice-1**
  * for hidden activation=relu, hidden layers = [64,64], optimizer=adam
  * Vary L1 and L2 in a 10x10 grid search (without dropout) 
  * Plot validation and training error as a function of L1 and L2 regularization in a 2D heatmap 
  * Plot the ratio (or difference) of validation to training error as a function of L1 and L2 regularization in a 2D heatmap 
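
For the 10x10 search, the results fit naturally into two 2D arrays, from which the gap and ratio follow element-wise. In this sketch `mock_errors` stands in for an actual training run, and the log-spaced range 1e-6 to 1e-2 is an assumption.

```python
import numpy as np

# 10 values per axis, log-spaced (the range is an illustrative assumption)
l1_values = np.logspace(-6, -2, num=10)
l2_values = np.logspace(-6, -2, num=10)

def mock_errors(l1, l2):
    """Placeholder for one training run; returns (train_error, val_error)."""
    strength = 1e3 * l1 + 1e2 * l2                              # made-up regularization strength
    train_error = 0.05 + strength                               # rises with regularization
    val_error = train_error + 0.10 / (1.0 + 50.0 * strength)    # gap shrinks with regularization
    return train_error, val_error

train_err = np.zeros((len(l1_values), len(l2_values)))
val_err = np.zeros_like(train_err)
for i, l1 in enumerate(l1_values):
    for j, l2 in enumerate(l2_values):
        train_err[i, j], val_err[i, j] = mock_errors(l1, l2)

gap = val_err - train_err      # difference form
ratio = val_err / train_err    # ratio form
print(gap.shape, ratio.shape)
```

Each of `train_err`, `val_err`, `gap`, and `ratio` can then be passed to a heatmap routine such as matplotlib's `imshow`, with the L1 and L2 values as tick labels.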

* **Explore hyper-parameter choice-2**
  * for hidden activation=sigmoid, hidden layers = [96,96,96], optimizer=**rmsprop**
  * Vary drop-out parameter in a 1x10 grid search (without L1 or L2 regularization) 
  * Plot training and validation error as a function of dropout rate  
  * Plot the ratio (or difference) of validation to training error as a function of dropout rate  

* **Explore hyper-parameter choice-3:**
  * for hidden activation=relu, hidden layers = [96,96,96], optimizer=**adam**
  * Vary drop-out parameter in a 1x10 grid search (without L1 or L2 regularization) 
  * Plot training and validation error as a function of dropout rate  
  * Plot the ratio (or difference) of validation to training error as a function of dropout rate  

* `Optional` Systematically search for the best regularization parameters choice (3D search) using random search algorithm 
  * [https://en.wikipedia.org/wiki/Random_search](https://en.wikipedia.org/wiki/Random_search)
  * Try to see how deep you can get the ANN (max hidden layers) without suffering from the vanishing gradient effect  
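
For the optional 3D search, a simple approach is to draw each configuration independently at random, which is the variant commonly used for hyper-parameter tuning. The sketch below assumes log-uniform ranges for L1 and L2 and a uniform range for dropout; `mock_validation_loss` is a placeholder for a real training run.

```python
import math
import random

def sample_config(rng):
    """Draw one (l1, l2, dropout) triple; the ranges are illustrative assumptions."""
    return {
        "l1_lambda": 10 ** rng.uniform(-6, -2),   # log-uniform
        "l2_lambda": 10 ** rng.uniform(-6, -2),   # log-uniform
        "dropout_rate": rng.uniform(0.0, 0.5),    # uniform
    }

def mock_validation_loss(cfg):
    """Placeholder for a real training run (made-up smooth objective)."""
    return ((math.log10(cfg["l1_lambda"]) + 5) ** 2
            + (math.log10(cfg["l2_lambda"]) + 4) ** 2
            + (cfg["dropout_rate"] - 0.2) ** 2)

rng = random.Random(6600)
trials = []
for _ in range(50):
    cfg = sample_config(rng)
    trials.append((mock_validation_loss(cfg), cfg))

best_loss, best_cfg = min(trials, key=lambda t: t[0])
print(f"best loss {best_loss:.3f} with {best_cfg}")
```

Keeping every trial in `trials`, rather than only the best one, preserves the record of hyper-parameter choices that this section asks for.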
  
* `Final fit`
  * At the very end, select a best-fit model and report its training, validation, and test errors
  * Make sure your plotting flag is set to `True` for the final training run
  

# Bonus assignment 

`+5 bonus points`

`You DO NOT need to do this if you don't want to`

* Once the data is collected, this HW should be quite easy, since most of the code can be recycled from the labs & textbook. 

* Do this in a file called `bonus.ipynb`, have it save its results to a folder "data"

`Data collection`

* Develop a text-based classification data-set:
* Use the Wikipedia API to search for articles to generate the data-set
* Select a set of highly different topics (i.e. labels), for example,
  * multi-class case: y=(pizza, oak_trees, basketball, ... , etc)=(0,1,2, ... , N-1)
  * You don't have to use these, you can use whatever labels you want
  * `Have AT LEAST 10 labels.` 
  * The more different the topics, the easier the classification task should be 
* Search for Wikipedia pages about these topics and harvest the text from the pages. 
* Do some basic text cleaning as needed. 
  * e.g. use the NLTK sentence tokenizer to break the text into sentences. 
  * Then form chunks of text that are five sentences long as your "inputs".
* The "label" for these chunks will be the search label used to find the text. 
* The data set will not be perfect. 
  * There will be chunks of text that are not related to the topic (i.e. noise). 
  * However that is just something we have to live with.
* **Important**: Always start small when writing & debugging THEN scale up. 
* The more chunks of text you have the better.
  * Save the text and labels to the same format used by the textbook, that way you can recycle your lab code seamlessly. 
* `Optional practice`: You can also "tag" each chunk of text with an associated "compound" sentiment score computed using the NLTK sentiment analysis. From this you can train a regression model in part-2. This is somewhat silly, and is just for educational purposes, since you are using a model output to train another model. 
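
The chunking step described above (split into sentences, then group five at a time) can be prototyped before any Wikipedia text is collected. In this sketch a naive regex splitter stands in for the NLTK sentence tokenizer, and the function names and the choice to drop a short trailing remainder are assumptions.

```python
import re

def split_sentences(text):
    """Naive regex splitter, a stand-in for NLTK's sentence tokenizer."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def make_chunks(text, label, sentences_per_chunk=5):
    """Group consecutive sentences into fixed-size chunks paired with a label."""
    sentences = split_sentences(text)
    chunks = []
    # Step through full groups only; a short trailing remainder is dropped
    for start in range(0, len(sentences) - sentences_per_chunk + 1, sentences_per_chunk):
        chunk = " ".join(sentences[start:start + sentences_per_chunk])
        chunks.append((chunk, label))
    return chunks

text = " ".join(f"Sentence number {i}." for i in range(1, 13))  # 12 sentences
data = make_chunks(text, label=0)
print(len(data))  # 2
```

Twelve sentences give two full chunks, and the last two sentences are discarded. Swapping `split_sentences` for the NLTK tokenizer leaves `make_chunks` unchanged.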

`Model training`

* Repeat the model training and hyper-parameter tuning exercise for MNIST, but with your text.