# Praktikum 3

Authors: Ahmed Khalil and Fiona Lublow

Research question: How effective are different optimization strategies for improving accuracy and training speed of fully connected neural nets?


## Theory


### Data and model choices

We trained a fully connected neural network (FCNN) on the MNIST dataset, a benchmark in machine learning.

To make a decision on the architecture of our neural net we consulted multiple sources:

An, S., Lee, M., Park, S., Yang, H., & So, J. (2020). [An ensemble of simple convolutional neural network models for mnist digit recognition](https://arxiv.org/pdf/2008.10400). arXiv preprint arXiv:2008.10400.

Montgomery, R. M. (2024). [Exploring Neural Networks: A Walk Through the MNIST Dataset Classification](https://www.researchgate.net/profile/Richard-Murdoch-Montgomery/publication/382678573_Exploring_Neural_Networks_A_Walk_Through_the_MNIST_Dataset_Classification/links/66cab445c2eaa5002314d834/Exploring-Neural-Networks-A-Walk-Through-the-MNIST-Dataset-Classification.pdf). ESS Open Archive eprints, 116, 11620168.

Tabik, S., Peralta, D., Herrera-Poyatos, A., & Herrera, F. (2017). [A snapshot of image pre-processing for convolutional neural networks: case study of MNIST](https://link.springer.com/content/pdf/10.2991/ijcis.2017.10.1.38.pdf). International Journal of Computational Intelligence Systems, 10(1), 555-568.

While convolutional neural networks (CNNs) are the modern standard due to their efficiency and performance, our focus was on basic optimization strategies, which makes the simpler FCNN architecture more suitable. In fully connected neural networks, a maximum of around three layers is advisable, since more layers lead to vanishing gradients and slow learning. Within that limit, FCNNs still allow us to explore overfitting, a common issue when training larger models with many parameters.

Modern approaches to MNIST reach higher accuracies; for example, a single CNN reached 99.75% accuracy in 2016.

Hasanpour, S. H., Rouhani, M., Fayyaz, M., & Sabokrou, M. (2023). Let's keep it simple: Using simple architectures to outperform deeper and more complex architectures. arXiv. https://arxiv.org/abs/1608.06037

Deeper fully connected networks with skip connections are being explored to reduce the work necessary for feature engineering in CNNs: Wang, R., Fu, B., Fu, G., & Wang, M. (2017). Deep & Cross Network for Ad Click Predictions. arXiv. https://arxiv.org/abs/1708.05123

### Why optimize?

Hahnloser, R. L. (1998). [On the piecewise analysis of networks of linear threshold neurons](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=a09ef3a49654fdbc7c446884369f8c9ca542b012). Neural Networks, 11(4), 691-697.

Ying, X. (2019, February). [An overview of overfitting and its solutions](https://iopscience.iop.org/article/10.1088/1742-6596/1168/2/022022/pdf). In Journal of physics: Conference series (Vol. 1168, p. 022022). IOP Publishing.

Hochreiter, S. (1998). [The vanishing gradient problem during learning recurrent neural nets and problem solutions](http://www.bioinf.jku.at/publications/older/2304.pdf). International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02), 107-116.

Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). [Gradient flow in recurrent nets: the difficulty of learning long-term dependencies](https://www.researchgate.net/profile/Y-Bengio/publication/2839938_Gradient_Flow_in_Recurrent_Nets_the_Difficulty_of_Learning_Long-Term_Dependencies/links/546cd26e0cf2193b94c577c2/Gradient-Flow-in-Recurrent-Nets-the-Difficulty-of-Learning-Long-Term-Dependencies.pdf).

When training neural nets, we have to deal with the problem of overfitting.

Overfitting occurs when a model becomes overly specific to the training data, leading to poor generalization, often indicated by rising validation loss. Any FCNN will eventually overfit if the network is complex enough. Optimization strategies not only combat this, but can also accelerate training.

### Strategies:

Marti, K. (2008). [Stochastic optimization methods](https://www.academia.edu/download/80852723/978-3-662-46214-0.pdf) (Vol. 2). Berlin: Springer.

- Early stopping

Bai, Y., Yang, E., Han, B., Yang, Y., Li, J., Mao, Y., ... & Liu, T. (2021). [Understanding and improving early stopping for learning with noisy labels. Advances in Neural Information Processing Systems](https://proceedings.neurips.cc/paper/2021/file/cc7e2b878868cbae992d1fb743995d8f-Paper.pdf), 34, 24392-24403.

A simple way to deal with overfitting is to track the validation loss during training and stop as soon as it starts to rise. To improve the stopping point, we can keep backups of the NN at regular intervals, so that we can go back to a state from before it started to overfit. This can be expensive in memory.

In our experiments, significant overfitting was not observed, likely due to our architecture and training parameters, so early stopping was not tested extensively.
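The backup idea described above can be sketched in plain Python. This is a minimal illustration, not code from our experiments: the class name `EarlyStopper`, the `patience` parameter, and the toy validation losses are assumptions made for the example, and the small dictionary stands in for a real model state.

```python
import copy


class EarlyStopper:
    """Hypothetical helper: stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = 0
        self.best_state = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss, state):
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best_loss:
            # New best: keep a backup of the state so we can roll back later
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.best_state = copy.deepcopy(state)
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation curve: improves for three epochs, then rises
val_losses = [0.50, 0.40, 0.36, 0.37, 0.38, 0.39, 0.41]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    state = {"epoch": epoch}  # stand-in for a model's saved weights
    if stopper.step(epoch, loss, state):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch was {stopper.best_epoch}")
```

Keeping only the single best state, rather than one backup per interval, limits the memory cost mentioned above to one extra copy of the weights.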

- Regularization

Girosi, F., Jones, M., & Poggio, T. (1995). [Regularization theory and neural networks architectures](https://www.researchgate.net/profile/Michael-Jones-66/publication/2246342_Regularization_Theory_and_Neural_Networks_Architectures/links/02bfe50d33d1a45e52000000/Regularization-Theory-and-Neural-Networks-Architectures.pdf). Neural computation, 7(2), 219-269.

Regularization helps control weight magnitudes, with L2 regularization being the most common form.

It is applied to the weights of each layer: a penalty proportional to the squared weights is added to the loss, which pulls all weights toward zero and keeps them in a similar, small range.
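To make the effect on the weights concrete, here is a small sketch of an L2-regularized gradient step in plain Python. The function names and the toy numbers are assumptions for illustration; the update rule mirrors how a weight decay term is commonly added to the gradient in SGD.

```python
def l2_penalty(weights, weight_decay):
    """Penalty term (weight_decay / 2) * sum(w^2) that is added to the loss."""
    return 0.5 * weight_decay * sum(w * w for w in weights)


def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """One SGD step where weight_decay * w is added to each gradient.

    weight_decay * w is exactly the gradient of the penalty above.
    """
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


weights = [2.0, -1.0, 0.0]
zero_grads = [0.0, 0.0, 0.0]  # no data gradient, so only the decay acts

penalty = l2_penalty(weights, weight_decay=0.5)
decayed = sgd_step_with_weight_decay(weights, zero_grads, lr=0.1, weight_decay=0.5)
print(penalty, decayed)
```

Even with a zero data gradient, every non-zero weight shrinks by the same relative amount, which is why large weights are penalized most in absolute terms.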

- Dropout

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). [Dropout: a simple way to prevent neural networks from overfitting](https://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf). The journal of machine learning research, 15(1), 1929-1958.

Introduced by Srivastava et al. (2014), dropout reduces overfitting by randomly disabling a subset of nodes during training, effectively applying noise to the net. Disabled nodes do not contribute to the forward pass or to backpropagation. During evaluation and application of the model, all nodes are active. Without compensation, the following layers would then receive larger inputs than during training and be overexcited, so the activations have to be rescaled according to the dropout ratio. PyTorch handles this automatically: its dropout layer scales the surviving activations by 1/(1 - p) during training, and switching the model to .eval() mode turns dropout off. For our experiments, dropout was set to 50% to observe its impact.
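The rescaling can be illustrated with a short plain-Python sketch of "inverted" dropout, the variant in which the scaling happens during training. The function name, the seed, and the toy activations are assumptions for this example; it is not the PyTorch implementation itself.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p while training,
    and scale the survivors by 1 / (1 - p). In evaluation mode it is the identity."""
    if not training or p == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


rng = random.Random(0)          # fixed seed so the run is repeatable
activations = [1.0] * 10000     # toy activations, all equal to 1

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

mean_train = sum(train_out) / len(train_out)
print(f"mean activation while training: {mean_train:.3f}")
print(f"mean activation in evaluation:  {sum(eval_out) / len(eval_out):.3f}")
```

With p = 0.5 each surviving value is doubled, so the average activation during training stays close to the evaluation value and the next layer sees inputs of the same scale in both modes.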

- ADAM

Kingma, D.P., & Ba, J. (2014). [Adam: A Method for Stochastic Optimization](https://arxiv.org/pdf/1412.6980). CoRR, abs/1412.6980.

Adam is one of the modern, established optimizers. It keeps running averages of the gradient (momentum) and of the squared gradient over multiple backpropagation steps and uses them to scale the update of each parameter. It is robust and generally requires little hyperparameter tuning, making it a popular choice today.
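As a rough sketch of the update rule from Kingma & Ba (2014), the following plain-Python function performs one Adam step for a single scalar parameter. The function name and the toy gradients are assumptions for illustration; the default values for beta1, beta2 and eps are the ones suggested in the paper.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from x = 1.0 with a small and with a 100x larger gradient
x_small, _, _ = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1, lr=0.1)
x_large, _, _ = adam_step(1.0, 200.0, m=0.0, v=0.0, t=1, lr=0.1)
print(x_small, x_large)
```

Both calls move the parameter by roughly the learning rate, regardless of the gradient's scale. This normalization is one reason Adam tends to need little tuning.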

## Description of Experiment

We used PyTorch to train and evaluate our neural network.

The MNIST dataset was loaded via PyTorch and randomly split 80/20 into training and validation sets. Training data was split into batches of 64.

Our aim was to examine strategies that mainly combat overfitting, so we needed an architecture that gives rise to overfitting. This is typically observed in larger neural networks with higher parameter counts.

The input layer has a fixed size of 784, the output layer a size of 10.

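We applied ReLU activation for the hidden layers and softmax for the output layer. The softmax is not a separate layer in the model; it is included in PyTorch's CrossEntropyLoss.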
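The combination of softmax and cross-entropy on raw network outputs (logits) can be sketched in plain Python as follows. The function names and the toy logits are assumptions for this example; the code only illustrates the computation and is not PyTorch's implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax: subtract the maximum before exponentiating."""
    shift = max(logits)
    log_sum = math.log(sum(math.exp(z - shift) for z in logits))
    return [z - shift - log_sum for z in logits]


def cross_entropy(logits, target):
    """Negative log-probability that softmax assigns to the target class."""
    return -log_softmax(logits)[target]


# Ten equal logits: the model has no preference among the 10 digits
uniform_loss = cross_entropy([0.0] * 10, target=3)

# One dominant logit for the correct class: the loss is close to zero
confident_loss = cross_entropy([10.0] + [0.0] * 9, target=0)

print(f"uniform: {uniform_loss:.4f}, confident: {confident_loss:.6f}")
```

The uniform case equals ln(10), about 2.30, which is the loss to expect from a 10-class model that has not learned anything yet.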

As a starting point for the number of parameters, Srivastava et al. (2014) tested dropout on a neural network with two hidden layers of 8192 nodes each.
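To get a feeling for the parameter counts involved, the following sketch counts weights and biases of a fully connected network from its layer sizes. The helper name is an assumption for this example; the layer sizes are the 784 inputs and 10 outputs stated above, combined with the two hidden-layer configurations mentioned in this report (two layers of 8192 nodes, and the single layer of 2024 nodes used in the code cells).

```python
def count_parameters(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each pair of consecutive layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


single_hidden = count_parameters([784, 2024, 10])
two_hidden = count_parameters([784, 8192, 8192, 10])

print(f"784-2024-10:      {single_hidden:,} parameters")
print(f"784-8192-8192-10: {two_hidden:,} parameters")
```

The larger network has roughly 45 times as many parameters, most of them in the 8192 x 8192 connection between the two hidden layers.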

We used PyTorch's implementation of stochastic gradient descent (SGD). More advanced optimizers supplied by torch.optim, such as Adam, are too "good" for us to see the effect of the strategies we wanted to test.

Experiments were carried out with and without momentum, a setting of the SGD optimizer that can help it avoid getting stuck in low-gradient regions and local minima.
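Why momentum helps in low-gradient regions can be seen in a small plain-Python sketch. The function name and the constant toy gradient are assumptions for this example; the update rule (velocity = momentum * velocity + gradient) follows the common formulation of SGD with momentum.

```python
def sgd_momentum_steps(grad_fn, w, lr, momentum, steps):
    """Run SGD with momentum on a scalar parameter and return the final value."""
    velocity = 0.0
    for _ in range(steps):
        grad = grad_fn(w)
        velocity = momentum * velocity + grad  # accumulate past gradients
        w = w - lr * velocity
    return w


def constant_slope(w):
    """Toy gradient: a gentle, constant slope (gradient 1.0 everywhere)."""
    return 1.0


w_plain = sgd_momentum_steps(constant_slope, 0.0, lr=0.1, momentum=0.0, steps=3)
w_momentum = sgd_momentum_steps(constant_slope, 0.0, lr=0.1, momentum=0.9, steps=3)

print(f"without momentum: {w_plain:.3f}, with momentum 0.9: {w_momentum:.3f}")
```

After three steps on the same slope, the run with momentum has travelled almost twice as far, because each step also carries part of the previous updates.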

PyTorch's SGD optimizer includes L2 regularization via its weight decay parameter, which we set to 1e-4 in our experiments.

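Dropout was applied via the dropout attribute of the model class. The dropout probability was set to 0.5, i.e. in each forward pass during training every hidden node is dropped with probability 0.5, so on average half of all hidden nodes are dropped. PyTorch takes care of the matching rescaling automatically, and the .eval() setting disables dropout for evaluation.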

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split
import time
import csv
import pandas as pd
import plotly.express as px
import glob

In [None]:
# Define the transforms: converts to tensor and normalizes
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Load the full training dataset
full_train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)

# Use a smaller subset for training (2000 entries)
subset_size = 2000
subset_indices = torch.randperm(len(full_train_dataset))[:subset_size]  # Randomly select 2000 indices
train_dataset = torch.utils.data.Subset(full_train_dataset, subset_indices)

# Use the remaining data for validation
remaining_indices = list(set(range(len(full_train_dataset))) - set(subset_indices.tolist()))
val_dataset = torch.utils.data.Subset(full_train_dataset, remaining_indices)

# Load the test dataset
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

In [None]:
# Get the specific activation function for the hidden layers
def get_activation_function(activation):
    if activation == 'relu':
        return F.relu
    elif activation == 'sigmoid':
        return torch.sigmoid
    elif activation == 'tanh':
        return torch.tanh
    else:
        raise ValueError(f"Unknown activation function: {activation}")

In [None]:
# Fully connected network with a configurable stack of hidden layers.
# hidden_layers is a list with one entry per hidden layer (the number of neurons in that layer);
# activation selects the activation function used after each hidden layer.
class FullyConnectedNN(nn.Module):
    def __init__(self, input_size, output_size, hidden_layers, activation, dropout):
        super(FullyConnectedNN, self).__init__()
        self.dropout = nn.Dropout(p=(0.5 if dropout else 0))

        # Empty list that will hold the hidden layers and the output layer
        self.layers = nn.ModuleList()

        current_size = input_size

        # Creates the hidden layers and the specific amount of neurons at each iteration
        for hidden_size in hidden_layers:
            self.layers.append(nn.Linear(current_size, hidden_size))
            # connect to the next layer
            current_size = hidden_size

        # output layer
        self.layers.append(nn.Linear(current_size, output_size))

        # Sets the activation function
        self.activation = get_activation_function(activation)

    def forward(self, x):
        for layer in self.layers[:-1]:
            x = self.activation(layer(x))
            if self.dropout.p > 0:
                x = self.dropout(x)
        x = self.layers[-1](x)
        return x

    def get_accuracy(self, loader, device) -> float:
        self.eval()
        correct = 0
        with torch.no_grad():
            for data, target in loader:
                data, target = data.to(device), target.to(device)
                data = data.view(data.size(0), -1)
                output = self(data)
                pred = output.argmax(dim=1, keepdim=True)
                correct += pred.eq(target.view_as(pred)).sum().item()
        return correct / len(loader.dataset)

    def get_loss(self, loader, device) -> float:
        self.eval()
        total_loss = 0.0
        criterion = nn.CrossEntropyLoss()

        with torch.no_grad():
            for data, target in loader:
                data, target = data.to(device), target.to(device)
                data = data.view(data.size(0), -1)
                output = self(data)
                loss = criterion(output, target)
                total_loss += loss.item() * data.size(0)

        avg_loss = total_loss / len(loader.dataset)
        return avg_loss


def train_and_evaluate(model, train_loader, val_loader, test_loader, epochs, lr, L2_reg, optimizer_type):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)

    optimizer = \
        (optim.SGD(model.parameters(), lr=lr,
                   weight_decay=(1e-4 if L2_reg else 0)) if optimizer_type == "SGD"
         else optim.Adam(model.parameters(), lr=lr))
    criterion = nn.CrossEntropyLoss()  # includes softmax

    training_history = []

    start_time = time.perf_counter()

    for epoch in range(epochs):
        # Training phase
        model.train()

        for data, target in train_loader:
            data, target = data.to(device), target.to(device)
            data = data.view(data.size(0), -1)

            optimizer.zero_grad()  # Clear gradients
            output = model(data)  # Forward pass
            loss = criterion(output, target)  # Compute loss
            loss.backward()  # Backpropagation
            optimizer.step()  # Update weights

        train_loss = model.get_loss(train_loader, device)

        # Validation phase
        val_loss = model.get_loss(val_loader, device)

        accuracy = model.get_accuracy(val_loader, device)

        training_history.append({
            "Epoch": epoch + 1,
            "Train Loss": train_loss,
            "Validation Loss": val_loss,
            "Accuracy": accuracy,
        })

        # Print epoch summary
        print(f"Epoch {epoch + 1}/{epochs} | Train Loss: {train_loss:.4f} | Validation Loss: {val_loss:.4f} | Accuracy: {accuracy:.2%}")

    end_time = time.perf_counter()

    # Final accuracy on test set
    accuracy = model.get_accuracy(test_loader, device)

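    # Derive the run description from the model itself instead of relying on notebook globals
    hidden_sizes = [layer.out_features for layer in model.layers[:-1]]
    uses_dropout = model.dropout.p > 0
    output_csv = ("./data/"
                  + str(hidden_sizes)
                  + optimizer_type
                  + "noMomentum"
                  + "_dropout_" + str(uses_dropout) + "_"
                  + "L2_" + str(L2_reg)
                  + ".csv")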

    with open(output_csv, mode='w', newline='') as csvfile:
        csv_writer = csv.DictWriter(csvfile,
                                    fieldnames=["Epoch", "Train Loss", "Validation Loss", "Accuracy"],)
        csv_writer.writeheader()
        csv_writer.writerows(training_history)

    return end_time - start_time, accuracy

In [None]:
input_size = 28 * 28
output_size = 10
hidden_layers = [2024]
activation = 'relu'
optimizer_type = "SGD"
epochs = 500
learning_rate = 0.005

In [None]:
dropout = True
L2_reg = True


model1 = FullyConnectedNN(input_size, output_size, hidden_layers, activation, dropout)
print(f"Model loaded, starting training...")
time1, accuracy1 = train_and_evaluate(model1, train_loader, val_loader, test_loader, epochs, learning_rate, L2_reg,
                                      optimizer_type)
print(f"Model 1: Time={time1:.2f}s, test accuracy={accuracy1:.2%}")

In [None]:
dropout = True
L2_reg = False


model2 = FullyConnectedNN(input_size, output_size, hidden_layers, activation, dropout)
print(f"Model loaded, starting training...")
time2, accuracy2 = train_and_evaluate(model2, train_loader, val_loader, test_loader, epochs, learning_rate, L2_reg,
                                      optimizer_type)
print(f"Model 2: Time={time2:.2f}s, test accuracy={accuracy2:.2%}")

In [None]:
dropout = False
L2_reg = True


model3 = FullyConnectedNN(input_size, output_size, hidden_layers, activation, dropout)
print(f"Model loaded, starting training...")
time3, accuracy3 = train_and_evaluate(model3, train_loader, val_loader, test_loader, epochs, learning_rate, L2_reg,
                                      optimizer_type)
print(f"Model 3: Time={time3:.2f}s, test accuracy={accuracy3:.2%}")

In [None]:
dropout = False
L2_reg = False


model4 = FullyConnectedNN(input_size, output_size, hidden_layers, activation, dropout)
print(f"Model loaded, starting training...")
time4, accuracy4 = train_and_evaluate(model4, train_loader, val_loader, test_loader, epochs, learning_rate, L2_reg,
                                      optimizer_type)
print(f"Model 4: Time={time4:.2f}s, test accuracy={accuracy4:.2%}")

## Result

Using the full dataset of 60,000 images, we were not able to observe clear overfitting in the validation loss, even with larger networks (two hidden layers of 8192 nodes).

Using a smaller subset of 2,000 images for training did lead to rising validation loss.

With a network of one hidden layer of 2024 nodes, after 500 epochs we reached 91.44% test accuracy with both dropout and L2 regularization, and 90.59% without either.

L2 regularization alone reached 90.65% test accuracy, dropout alone reached 91.67%.

Overfitting is also indicated by a large gap between training and validation loss, which can be observed in our collected data.

With the same hyperparameters, the run without dropout and L2 ended with a training loss of \~0.0332 and a validation loss of \~0.385, while the run with both dropout and L2 ended with a training loss of \~0.028 and a validation loss of \~0.331.


> none:

> Epoch 500/500 | Train Loss: 0.0332 | Validation Loss: 0.3854 | Accuracy: 89.81%
Model 1: Time=2603.10s, test accuracy=90.59%

> just dropout:

> Epoch 500/500 | Train Loss: 0.0268 | Validation Loss: 0.3265 | Accuracy: 90.92%
Model 1: Time=2363.71s, test accuracy=91.67%

> just L2:

> Epoch 500/500 | Train Loss: 0.0331 | Validation Loss: 0.3809 | Accuracy: 89.90%
Model 1: Time=2236.66s, test accuracy=90.65%

> both:

> Epoch 500/500 | Train Loss: 0.0280 | Validation Loss: 0.3305 | Accuracy: 90.84%
Model 1: Time=2516.10s, test accuracy=91.44%
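The gap between training and validation loss can be computed directly from the final-epoch values in the logs above. The snippet below is only a small arithmetic helper; the dictionary name and layout are assumptions for this example, while the numbers are copied from the four log lines.

```python
# (train loss, validation loss) at epoch 500, copied from the logs above
final_losses = {
    "none": (0.0332, 0.3854),
    "just dropout": (0.0268, 0.3265),
    "just L2": (0.0331, 0.3809),
    "both": (0.0280, 0.3305),
}

# Generalization gap = validation loss - training loss
gaps = {name: round(val - train, 4) for name, (train, val) in final_losses.items()}

# Print the runs from the smallest to the largest gap
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:>12}: gap = {gap:.4f}")
```

The two runs with dropout end with a gap of about 0.30, the two runs without dropout with a gap of about 0.35.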


In [None]:
df = pd.read_csv("data/[2024]SGDnoMomentum_dropout_True_L2_True.csv")

fig = px.line(df, x="Epoch", y=["Train Loss", "Validation Loss"],
              labels={"Epoch": "Epoch", "value": "Value"},
              title="Train and Validation Loss with L2 Regularization and Dropout")

fig.show()

In [None]:
df = pd.read_csv("data/[2024]SGDnoMomentum_dropout_False_L2_True.csv")

fig = px.line(df, x="Epoch", y=["Train Loss", "Validation Loss"],
              labels={"Epoch": "Epoch", "value": "Value"},
              title="Train and Validation Loss only L2 Regularization")

fig.show()

In [None]:
df = pd.read_csv("data/[2024]SGDnoMomentum_dropout_True_L2_False.csv")

fig = px.line(df, x="Epoch", y=["Train Loss", "Validation Loss"],
              labels={"Epoch": "Epoch", "value": "Value"},
              title="Train and Validation Loss only Dropout")

fig.show()

In [None]:
df = pd.read_csv("data/[2024]SGDnoMomentum_dropout_False_L2_False.csv")

fig = px.line(df, x="Epoch", y=["Train Loss", "Validation Loss"],
              labels={"Epoch": "Epoch", "value": "Value"},
              title="Train and Validation Loss without optimization")

fig.show()

In [None]:
path_to_csvs = "./data/*.csv"

# Read all CSV files and add an identifier column
all_files = glob.glob(path_to_csvs)
df_list = []

for file in all_files:
    if "2024" not in file:
        continue
    temp_df = pd.read_csv(file)

    # Add a column to identify which file/experiment this is
    temp_df['Experiment'] = file.split('/')[-1].replace('.csv', '')  # Use filename as experiment name

    df_list.append(temp_df)

combined_df = pd.concat(df_list, ignore_index=True)

# Melt the dataframe for easier plotting
# We want 'Epoch' on the x-axis and the values of 'Train Loss' and 'Validation Loss' on the y-axis
melted_df = combined_df.melt(
    id_vars=['Epoch', 'Experiment'],
    value_vars=['Train Loss', 'Validation Loss'],
    var_name='Metric',
    value_name='Value'
)

fig = px.line(
    melted_df,
    x='Epoch',
    y='Value',
    color='Experiment',  # Different lines for each experiment
    line_dash='Metric',  # Different line styles for Train Loss and Validation Loss
    title='Training Progress Across Experiments',
    labels={'Value': 'Metric Value', 'Epoch': 'Epoch'},
    template='plotly',
    log_y= True
)

fig.show()


The figure above shows that L2 had only a small effect, but dropout was effective.

We observe that without dropout the validation loss starts to rise from \~0.355 after about 170 epochs to \~0.385 at 500 epochs. With dropout it steadily descends to \~0.33.

## Reflection

Implementations of NNs in modern libraries are highly optimized, and it was hard to find information on building a "bad" NN, so observing heavy overfitting was difficult. The "worst" optimizer in PyTorch, the stochastic gradient descent optimizer, is quite advanced compared to what was available to researchers historically, when less advanced architectures (FCNNs) were the focus.

Doing more research on a problem that is known to lead to overfitting faster might have enabled us to start in a better direction.



