## Milestone 2: Neural Network Baseline and Hyperparameter Optimization

LIS 640 - Introduction to Applied Deep Learning

Due 3/7/25

## **Overview**
In Milestone 1 you have:
1. **Defined a deep learning problem** where AI can make a meaningful impact.
2. **Identified three datasets** that fit your topic and justified their relevance.
3. **Explored and visualized** the datasets to understand their structure.
4. **Implemented a PyTorch Dataset class** to prepare data for deep learning.

In Milestone 2 we will take the next step and implement a neural network baseline based on what we have learned in class! For this milestone, please use one of the datasets you picked in the last milestone. If you pick a new one, make sure to do Steps 2 - 4 again. 


## **Step 1: Define Your Deep Learning Problem**

The first step is to be clear about what you want your model to predict. Is your goal a classification or a regression task? What are the input features, and what are your prediction targets y? Make sure that your DataLoader uses a sensible choice of features and a sensible choice of prediction targets y.

**Write down one paragraph of justification for how you set up your DataLoader below. If it makes sense to change the DataLoader from Milestone 1, describe what you changed and why:**

This project is a regression task: the goal is to use time-series analog sensor data from a manufacturing assembly line to predict a continuous numerical target related to system performance, anomalies, or equipment failure. The input features (X) are sensor readings from actuators, transducers, and control mechanisms that capture critical operational parameters over time; after preprocessing, each sample has 8 feature columns. The prediction target (y) is the last column of `R01_Data.csv`, selected as a key metric that indicates system health, efficiency, or potential faults.

In setting up the DataLoader, I made the following changes:

- Selecting only numerical features relevant to predicting system behavior.
- Normalizing the inputs and the target with MinMaxScaler, fitted on the training split only, to stabilize training and improve convergence.
- Using separate train, validation, and test splits to evaluate the model’s generalization.
- Reshaping the data so that the input shape matches what the model expects.
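
To make the split and scaling choices above concrete, here is a minimal plain-Python sketch. The helpers `split_sizes`, `fit_min_max`, and `apply_min_max` are hypothetical stand-ins written for illustration; the notebook itself uses scikit-learn's `train_test_split` and `MinMaxScaler`, whose exact rounding may differ slightly. The sketch shows the overall split proportions and why validation values can fall outside [0, 1] when the scaler is fitted on training data only.

```python
# Illustrative only: plain-Python stand-ins, not the scikit-learn implementations.

def split_sizes(n_rows, test_frac=0.2, val_frac=0.2):
    """Row counts after a test split followed by a validation split of the remainder."""
    n_test = round(n_rows * test_frac)
    n_val = round((n_rows - n_test) * val_frac)
    n_train = n_rows - n_test - n_val
    return n_train, n_val, n_test


def fit_min_max(values):
    """Return (min, max) of the training values: the scaler's fitted state."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values using statistics that were fitted on the training set."""
    return [(v - lo) / (hi - lo) for v in values]


# Two successive 80/20 splits give 64% / 16% / 20% overall.
n_train, n_val, n_test = split_sizes(1000)
print(n_train, n_val, n_test)  # 640 160 200

# Toy values (not from the dataset): validation data can exceed the training range.
train = [10.0, 20.0, 30.0]
val = [5.0, 25.0, 40.0]
lo, hi = fit_min_max(train)
scaled_val = apply_min_max(val, lo, hi)
print(scaled_val)  # [-0.25, 0.75, 1.5]
```

Fitting on the training split alone keeps validation and test statistics out of preprocessing, at the cost of occasionally producing scaled values below 0 or above 1.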

## **Step 2: Train a Neural Network in PyTorch**

We learned in class how to implement and train a feedforward neural network in PyTorch. You can find reference implementations [here](https://github.com/mariru/Intro2ADL/blob/main/Week5/Week5_Lab_Example.ipynb) and [here](https://www.kaggle.com/code/girlboss/mmlm2025-pytorch-lb-0-00000). Tip: Try to implement the neural network by yourself from scratch before looking at the reference.


In [1]:
# imports
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.optim as optim
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms
from torchvision.transforms import ToTensor

In [2]:
device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"Using {device} device")

Using mps device


In [3]:
# define dataloaders: make sure to have a train, validation and a test loader

class AnalogSensorDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels.reshape(-1, 1)  # Ensure y is (batch_size, 1)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        x = torch.tensor(self.features[idx], dtype=torch.float32)
        y = torch.tensor(self.labels[idx], dtype=torch.float32)
        return x, y

    
# Load pandas dataframe
df = pd.read_csv("R01_Data.csv")

# Convert `_time` to a numerical feature (elapsed seconds since start)
df["_time"] = pd.to_datetime(df["_time"])  # Convert to datetime
df["_time"] = (df["_time"] - df["_time"].min()).dt.total_seconds()  # Convert to elapsed time

# Drop non-numeric columns (if `Description` is not needed)
df = df.drop(columns=["Description"])

# Separate features and target (assuming last column is the target)
X = df.iloc[:, :-1].values  # Features (all but last column)
y = df.iloc[:, -1].values   # Target (last column)

# Train-Test Split (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train-Validation Split (80/20 of the remaining data: 64% train, 16% validation overall)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Normalize Features using Training Set Statistics
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

scaler_y = MinMaxScaler()
y_train = scaler_y.fit_transform(y_train.reshape(-1, 1))
y_val = scaler_y.transform(y_val.reshape(-1, 1))
y_test = scaler_y.transform(y_test.reshape(-1, 1))

# Create dataset instances
train_data = AnalogSensorDataset(X_train, y_train)
val_data = AnalogSensorDataset(X_val, y_val)
test_data = AnalogSensorDataset(X_test, y_test)

# Create data loaders
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
val_loader = DataLoader(val_data, batch_size=64, shuffle=False)
test_loader = DataLoader(test_data, batch_size=64, shuffle=False)

for batch in train_loader:
    features, labels = batch
    print("Feature Sample:", features.shape)
    print("Label Sample:", labels.shape)
    break



Feature Sample: torch.Size([64, 8])
Label Sample: torch.Size([64, 1])
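
The batch above has labels of shape `(64, 1)`. The sketch below, which uses toy tensors rather than the sensor data, illustrates why predictions and labels should have the same shape before they reach `nn.MSELoss`: if only one side is squeezed, PyTorch broadcasts `(N,)` against `(N, 1)` to `(N, N)` (typically with a warning) and the loss compares every prediction with every label.

```python
import torch
from torch import nn

loss_fn = nn.MSELoss()

# Toy values: a "perfect" model whose predictions equal the labels.
preds = torch.tensor([[1.0], [2.0], [3.0]])   # model output, shape (3, 1)
labels = torch.tensor([[1.0], [2.0], [3.0]])  # DataLoader labels, shape (3, 1)

# Same shape on both sides: element-wise comparison, so the loss is zero.
matched = loss_fn(preds, labels)

# Squeezing only the predictions gives (3,) vs (3, 1), which broadcasts to (3, 3).
mismatched = loss_fn(preds.squeeze(-1), labels)

print(matched.item())     # 0.0
print(mismatched.item())  # nonzero, although every prediction is correct
```

Either squeezing both tensors or squeezing neither keeps the comparison element-wise.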


In [13]:
# define the model

class NeuralNetwork(nn.Module):
    def __init__(self, d_in, d_out, d_hidden, n_layers=2):
        super().__init__()
        # Input block: project d_in features to d_hidden units
        layers = [
            nn.Linear(d_in, d_hidden),
            nn.BatchNorm1d(d_hidden, momentum=0.1),
            nn.ReLU(),
        ]
        # n_layers hidden blocks, each followed by dropout
        for _ in range(n_layers):
            layers += [
                nn.Linear(d_hidden, d_hidden),
                nn.BatchNorm1d(d_hidden, momentum=0.1),
                nn.ReLU(),
                nn.Dropout(p=0.3),
            ]
        # Output layer: plain linear map to d_out (no activation for regression)
        layers += [nn.Linear(d_hidden, d_out)]
        self.linear_relu_stack = nn.Sequential(*layers)

    def forward(self, x):
        # For regression the output is the predicted (scaled) target
        return self.linear_relu_stack(x)

In [14]:
model = NeuralNetwork(8, 1, 100).to(device)
print(model)

NeuralNetwork(
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=8, out_features=100, bias=True)
    (1): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Linear(in_features=100, out_features=100, bias=True)
    (4): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU()
    (6): Dropout(p=0.3, inplace=False)
    (7): Linear(in_features=100, out_features=100, bias=True)
    (8): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): ReLU()
    (10): Dropout(p=0.3, inplace=False)
    (11): Linear(in_features=100, out_features=1, bias=True)
  )
)


In [15]:
# define the loss function and the optimizer

batch_size = 64

loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=3e-3)  # Adjust learning rate as needed

In [7]:
# train the model

class EarlyStopping:
    def __init__(self, patience=5, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.counter = 0

    def __call__(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0  # Reset counter if improvement
        else:
            self.counter += 1  # Increment if no improvement
            if self.counter >= self.patience:
                return True  # Stop training
        return False
        

# Training function
def train(model, train_loader, loss_fn, optimizer, device):
    model.train()  # Set model to training mode
    total_loss = 0
    
    for batch_X, batch_y in train_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)  # Move data to device

        optimizer.zero_grad()  # Reset gradients
        outputs = model(batch_X).squeeze()  # Forward pass
        loss = loss_fn(outputs, batch_y.squeeze())  # Compute loss (batch_y is already on the device)
        loss.backward()  # Backpropagation
        optimizer.step()  # Update weights
        
        total_loss += loss.item()  # Accumulate loss
    
    return total_loss / len(train_loader)  # Return average loss per batch

# Evaluation function
def evaluate(model, val_loader, loss_fn, device):
    model.eval()  # Set model to evaluation mode
    total_loss = 0
    
    with torch.no_grad():  # Disable gradient computation
        for batch_X, batch_y in val_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            outputs = model(batch_X).squeeze()  # Ensure output shape matches labels
            loss = loss_fn(outputs, batch_y.squeeze())
            total_loss += loss.item()

    avg_loss = total_loss / len(val_loader)
    return avg_loss  # Return MSE loss (lower is better)

    
# Train the model for 3 different hyperparameter settings (e.g. different learning rates,
# different loss functions that make sense for your data, etc.)

num_epochs = 6
early_stopping = EarlyStopping(patience=3, min_delta=1e-4)  # Stop if no improvement for 3 epochs

for epoch in range(num_epochs):
    train_loss = train(model, train_loader, loss_fn, optimizer, device)  # Get training loss
    val_loss = evaluate(model, val_loader, loss_fn, device)  # Get validation loss

    print(f"Epoch {epoch+1}/{num_epochs}: Train Loss = {train_loss:.4f}, Validation Loss = {val_loss:.4f}")

    # Check early stopping
    if early_stopping(val_loss):
        print("Early stopping triggered. Stopping training.")
        break


Epoch 1/6: Train Loss = 0.0024, Validation Loss = 0.0002
Epoch 2/6: Train Loss = 0.0009, Validation Loss = 0.0002
Epoch 3/6: Train Loss = 0.0008, Validation Loss = 0.0002
Epoch 4/6: Train Loss = 0.0008, Validation Loss = 0.0001
Epoch 5/6: Train Loss = 0.0008, Validation Loss = 0.0001
Epoch 6/6: Train Loss = 0.0007, Validation Loss = 0.0004
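
The validation losses above are on the order of 1e-4, the same order as `min_delta=1e-4`. The sketch below repeats the `EarlyStopping` logic from the cell above and runs it on hypothetical loss values (not results from this notebook) through a hypothetical helper `epochs_run`. It suggests that when `min_delta` is as large as the losses themselves, steady but small improvements may not count as improvements at all.

```python
class EarlyStopping:
    """Same logic as the EarlyStopping class defined in the training cell."""

    def __init__(self, patience=5, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.counter = 0

    def __call__(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                return True
        return False


def epochs_run(val_losses, patience, min_delta):
    """Return the 1-based epoch at which training stops (hypothetical helper)."""
    stopper = EarlyStopping(patience=patience, min_delta=min_delta)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper(loss):
            return epoch
    return len(val_losses)


# Hypothetical validation losses that improve by 2e-5 every epoch.
val_losses = [0.00030, 0.00028, 0.00026, 0.00024, 0.00022, 0.00020]

strict = epochs_run(val_losses, patience=3, min_delta=1e-4)
lenient = epochs_run(val_losses, patience=3, min_delta=0.0)
print(strict)   # 4: each 2e-5 gain is below min_delta, so patience runs out
print(lenient)  # 6: every epoch counts as an improvement
```

A `min_delta` chosen relative to the loss scale (for example, a small fraction of the current best loss) is one possible alternative; whether it helps here would need to be checked on the actual data.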


In [8]:
# test the model

test_loss = evaluate(model, test_loader, loss_fn, device)
print(f"Final Test Loss (MSE): {test_loss:.4f}")

Final Test Loss (MSE): 0.0004


## **Step 2 continued: Try Stuff**

Use your code above to try different architectures. Make sure to use early stopping! Try adding Dropout and BatchNorm, try different learning rates. How do they affect training and validation performance? 

 **Summarize your observations in a paragraph below:**

I experimented with different architectures by adjusting Dropout, Batch Normalization, and Learning Rate, while using early stopping to prevent overfitting. Here’s what I observed:

1. Decreasing Dropout (p=0.4 → 0.3)

With Dropout = 0.4, the model showed signs of over-regularization: validation loss plateaued and did not improve further.
Reducing Dropout to 0.3 improved training efficiency, allowing the model to learn more meaningful features.
The final test loss (MSE) was 0.0002 on the MinMax-scaled target, indicating a model that generalizes well without excessive regularization.

2. Decreasing Learning Rate (0.01 → 3e-3)

At an intermediate, lower learning rate (1e-3), the training loss stayed constant (0.0016) for multiple epochs, suggesting that this value was too low.
Raising it to 3e-3 allowed more effective weight updates, and the training loss decreased more smoothly (0.0024 → 0.0009).
Validation loss remained stable at 0.0002, meaning the model converged efficiently without overshooting.

3. Modifying BatchNorm (Momentum Adjusted to 0.1)

Before adding BatchNorm, the model overfitted: training loss decreased while validation loss increased.
After adding BatchNorm (momentum = 0.1), training stabilized and the loss no longer fluctuated rapidly.
Lowering momentum (0.99 → 0.1) made the running statistics used at evaluation time change more gradually from batch to batch, which coincided with a steadier validation loss.
Overall performance improved, with both train and validation loss decreasing smoothly.
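
As a side note on the momentum observation: according to the PyTorch documentation, `momentum` in `nn.BatchNorm1d` is the weight given to the newest batch statistic in the running estimates, `running = (1 - momentum) * running + momentum * batch_stat`, and those running estimates are what the layer uses in `eval()` mode. The sketch below applies that rule in plain Python with a hypothetical helper `running_mean_trace` and made-up batch means, to show how differently 0.1 and 0.99 track the incoming batches.

```python
def running_mean_trace(batch_means, momentum, start=0.0):
    """Apply the update rule running = (1 - momentum) * running + momentum * batch."""
    running = start
    trace = []
    for batch_mean in batch_means:
        running = (1 - momentum) * running + momentum * batch_mean
        trace.append(running)
    return trace


# Made-up batch means: every batch has mean 1.0, the running mean starts at 0.0.
batch_means = [1.0] * 5

slow = running_mean_trace(batch_means, momentum=0.1)
fast = running_mean_trace(batch_means, momentum=0.99)

print([round(v, 4) for v in slow])  # [0.1, 0.19, 0.271, 0.3439, 0.4095]
print([round(v, 4) for v in fast])  # [0.99, 0.9999, 1.0, 1.0, 1.0]
```

With 0.99 the running estimate is essentially the most recent batch, so it inherits that batch's noise; with 0.1 it averages over many batches. Note that this convention is the reverse of libraries where a momentum of 0.99 means "keep 99% of the old value".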

## **Step 3: Hyperparameter Optimization with Optuna**

As you can see, hyperparameter optimization can be tedious. In class we used [optuna](https://optuna.org/#code_examples) to automate the process. Your next task is to wrap your code from Step 2 into an objective, which you can then optimize with optuna. Under the [code examples](https://optuna.org/#code_examples) there is a tab *PyTorch*, which should be helpful as it provides a minimal example of how to wrap PyTorch code inside an objective.

**Important: Make sure the model is evaluated on a validation set, not the training data!!**


In [10]:
!pip install optuna


In [None]:
import optuna  # torch, optim, and the data loaders are defined in the cells above

# Define an objective function to be minimized.
def objective(trial, train_loader=train_loader, val_loader=val_loader):
    # Define hyperparameters to optimize
    d_hidden = trial.suggest_int("d_hidden", 64, 256, step=32)  # Hidden layer size
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)  # Learning rate (log scale)
    n_layers = trial.suggest_int("n_layers", 2, 6, step=1)
    
    # Define model
    model = NeuralNetwork(d_in=8, d_out=1, d_hidden=d_hidden, n_layers=n_layers).to(device)
    loss_fn = torch.nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    # Implement Early Stopping
    early_stopping = EarlyStopping(patience=3, min_delta=1e-4)

    # Train model
    num_epochs = 10  # Keep small for quick tuning
    for epoch in range(num_epochs):
        model.train()
        total_train_loss = 0

        for batch_X, batch_y in train_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            
            optimizer.zero_grad()
            preds = model(batch_X).squeeze()  # Shape (batch,)
            loss = loss_fn(preds, batch_y.squeeze())  # Squeeze labels too so shapes match
            loss.backward()
            optimizer.step()
            total_train_loss += loss.item()

        avg_train_loss = total_train_loss / len(train_loader)

        # Evaluate on validation set
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch_X, batch_y in val_loader:
                batch_X, batch_y = batch_X.to(device), batch_y.to(device)
                preds = model(batch_X).squeeze()
                val_loss += loss_fn(preds, batch_y.squeeze()).item()  # Match label shape to preds

        avg_val_loss = val_loss / len(val_loader)

        # Report intermediate results to Optuna
        trial.report(avg_val_loss, epoch)
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()

        # Check early stopping
        if early_stopping(avg_val_loss):
            break

    return avg_val_loss  # Return average validation loss
    

# Create a study object
study = optuna.create_study(direction="minimize")  # Minimize MSE
study.optimize(objective, n_trials=6)  # Run 6 trials

# Print out the best parameters.
print("Best hyperparameters:", study.best_params)

d_hidden = study.best_params["d_hidden"]
lr = study.best_params["lr"]
n_layers = study.best_params["n_layers"]


## **Step 3 continued: Insights**

Did you find the hyperparameter search helpful? Does it help to increase the number of trials in the optimization? Note that so far we have used the simplest version of optuna which has many nice features. Can you discover more useful features by browsing the optuna website? (Hint: try pruning)
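
On the pruning hint: the objective in Step 3 already calls `trial.report` and `trial.should_prune`, and as far as the Optuna documentation describes, a study created without an explicit `pruner` uses `optuna.pruners.MedianPruner`; a configured one can be passed as `optuna.create_study(direction="minimize", pruner=optuna.pruners.MedianPruner(...))`. The sketch below is a simplified plain-Python illustration of the median idea, not Optuna's implementation. The function `should_prune`, its parameters, and the loss curves are hypothetical.

```python
from statistics import median


def should_prune(current_value, step, finished_curves, n_startup_trials=2):
    """Median rule for a minimized metric (simplified, hypothetical helper).

    Prune when the current trial's value at `step` is worse than the median
    of what earlier trials reported at the same step.
    """
    comparable = [curve[step] for curve in finished_curves if len(curve) > step]
    if len(comparable) < n_startup_trials:
        return False  # not enough history to judge
    return current_value > median(comparable)


# Hypothetical validation-loss curves from three finished trials.
finished = [
    [0.010, 0.004, 0.002],
    [0.020, 0.006, 0.003],
    [0.015, 0.005, 0.0025],
]

bad_trial = should_prune(0.030, step=1, finished_curves=finished)
good_trial = should_prune(0.003, step=1, finished_curves=finished)
no_history = should_prune(0.030, step=1, finished_curves=[])

print(bad_trial)   # True: 0.030 is worse than the median 0.005
print(good_trial)  # False: 0.003 beats the median
print(no_history)  # False: nothing to compare against yet
```

Because a rule like this needs several finished trials before it can judge, a study with only a handful of trials may see little or no pruning; increasing `n_trials` is one way to let it take effect.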

## **Step 4: Final Training**

Now that you have found a good hyperparameter setting, the validation set is no longer needed. The last step is to merge the training and validation sets into a single training set and retrain the model under the best parameter setting found. Report your final loss on your test data.
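
One possible way to merge the two splits is `torch.utils.data.ConcatDataset`. The sketch below uses random toy tensors in place of the real `train_data` and `val_data` objects (8 features and 1 target, matching the shapes printed in Step 2); the name `full_train_loader` is an assumption made for illustration.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Toy stand-ins for the train and validation datasets (8 features, 1 target).
train_data = TensorDataset(torch.randn(100, 8), torch.randn(100, 1))
val_data = TensorDataset(torch.randn(25, 8), torch.randn(25, 1))

# Merge the two splits into one training set for the final fit.
full_train_data = ConcatDataset([train_data, val_data])
full_train_loader = DataLoader(full_train_data, batch_size=64, shuffle=True)

print(len(full_train_data))  # 125

# Batches have the same shapes as before, so the train() function can be reused.
features, labels = next(iter(full_train_loader))
print(features.shape, labels.shape)
```

With the validation set folded into training, early stopping on validation loss is no longer available, so one option is to train for a fixed number of epochs chosen from the earlier runs. The test loader stays untouched and is used once, for the final report.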

## **Final Submission**
Upload your submission for Milestone 2 to Canvas. 
Happy Deep Learning! 🚀