**Challenge: Implement a Multiclass Classification Neural Network using PyTorch**

Objective:
Build a feedforward neural network using PyTorch to classify handwritten digits in a multiclass classification problem. The dataset used for this challenge is the MNIST dataset, which consists of 28x28 grayscale images of the digits 0 to 9.

Steps:

1. **Data Preparation**: Load the MNIST dataset using ```torchvision.datasets.MNIST```. Standardize/normalize the features. Split the dataset into training and testing sets using, for example, ```sklearn.model_selection.train_test_split()```. **Bonus scores**: *use PyTorch's built-in* ```DataLoader``` *to split the dataset*.

2. **Neural Network Architecture**: Define a simple feedforward neural network using PyTorch's ```nn.Module```. Design the input layer to match the number of features in the MNIST dataset and the output layer to have as many neurons as there are classes (10). You can experiment with the number of hidden layers and neurons to optimize the performance. **Bonus scores**: *Make your architecture flexible enough to have as many hidden layers as the user wants, and use hyperparameter optimization to select the best number of hidden layers.*

3. **Loss Function and Optimizer**: Choose an appropriate loss function for multiclass classification. Select an optimizer, like SGD (Stochastic Gradient Descent) or Adam.

4. **Training**: Write a training loop to iterate over the dataset.
Forward pass the input through the network, calculate the loss, and perform backpropagation. Update the weights of the network using the chosen optimizer.

5. **Testing**: Evaluate the trained model on the test set. Calculate the accuracy of the model.

6. **Optimization**: Experiment with hyperparameters (learning rate, number of epochs, etc.) to optimize the model's performance. Consider adjusting the neural network architecture for better results. **Notice that you can't use the optimization algorithms from scikit-learn that we saw in lab1: e.g.,** ```GridSearchCV```.


1. **Data Preparation**

In [92]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset

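# Load the MNIST training split. Note: `transform` is only applied when indexing
# the dataset; `data.data` holds the raw uint8 pixel values (0-255).
data = torchvision.datasets.MNIST(root="./data", train=True, download=True, transform=torchvision.transforms.ToTensor())
# Flatten each 28x28 image into a 784-dimensional vector for NN input
og_features = data.data.view(data.data.size(0), -1).numpy().astype(np.float32)
labels = data.targets.numpy()

# train/test split
train_features, test_features, labels_train, labels_test = train_test_split(
    og_features, labels, test_size=0.2, random_state=42
)

# Standardize pixel values; fit on the training set only so that
# no test-set statistics leak into the preprocessing
scaler = StandardScaler()
train_features = scaler.fit_transform(train_features)
test_features = scaler.transform(test_features)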

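# DataLoader for batches: float32 features, int64 class labels (required by nn.CrossEntropyLoss)
train_dataset = TensorDataset(torch.as_tensor(train_features, dtype=torch.float32), torch.as_tensor(labels_train, dtype=torch.long))
test_dataset = TensorDataset(torch.as_tensor(test_features, dtype=torch.float32), torch.as_tensor(labels_test, dtype=torch.long))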

batch_size = 512 # will be optimized
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
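
The bonus task asks for a PyTorch-only split. `DataLoader` itself only batches and shuffles, so one possible approach is `torch.utils.data.random_split`. The following is a minimal sketch under assumptions: the synthetic tensors, the 1,000-sample size, and the `_rs` names are placeholders for illustration and are not part of the cell above. With this approach the standardization statistics would still have to be computed from the training subset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Synthetic stand-in for the flattened MNIST tensors (assumption: 1,000 samples)
features = torch.randn(1000, 784)
labels = torch.randint(0, 10, (1000,))
full_dataset = TensorDataset(features, labels)

# 80/20 split with a seeded generator so the split is reproducible
generator = torch.Generator().manual_seed(42)
n_train = int(0.8 * len(full_dataset))
n_test = len(full_dataset) - n_train
train_subset, test_subset = random_split(
    full_dataset, [n_train, n_test], generator=generator
)

# The subsets can be wrapped in DataLoaders exactly like a TensorDataset
train_loader_rs = DataLoader(train_subset, batch_size=512, shuffle=True)
test_loader_rs = DataLoader(test_subset, batch_size=512, shuffle=False)

print(len(train_subset), len(test_subset))  # 800 200
```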




2. **Neural Network Architecture**

In [93]:
import torch.nn.functional as F

class MyNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_additional_hidden_layers):
        super(MyNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.additional_hidden_layers = nn.ModuleList([nn.Linear(hidden_size, hidden_size) for _ in range(num_additional_hidden_layers)])
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        for layer in self.additional_hidden_layers:
            x = F.relu(layer(x))
        x = self.fc2(x)
        return x

input_size = 28*28  # flattened image size
output_size = 10  # 10 digits
# These two parameters control the architecture and are tuned in step 6
hidden_size = 128  # width of every hidden layer
num_additional_hidden_layers = 0  # 0 means a single hidden layer (fc1)


# Create model
model = MyNN(input_size, hidden_size, output_size, num_additional_hidden_layers)

# Print architecture
print(model)


MyNN(
  (fc1): Linear(in_features=784, out_features=128, bias=True)
  (additional_hidden_layers): ModuleList()
  (fc2): Linear(in_features=128, out_features=10, bias=True)
)
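
A quick sanity check of the architecture is to push a dummy batch through the network and count the trainable parameters. The sketch below rebuilds the same layer layout with `nn.Sequential`; `build_mlp` and `sketch_model` are hypothetical names used only here, and the dummy batch is random data.

```python
import torch
import torch.nn as nn

def build_mlp(input_size, hidden_size, output_size, num_additional_hidden_layers):
    """Hypothetical helper mirroring MyNN's layer layout with nn.Sequential."""
    layers = [nn.Linear(input_size, hidden_size), nn.ReLU()]
    for _ in range(num_additional_hidden_layers):
        layers += [nn.Linear(hidden_size, hidden_size), nn.ReLU()]
    layers.append(nn.Linear(hidden_size, output_size))
    return nn.Sequential(*layers)

sketch_model = build_mlp(784, 128, 10, num_additional_hidden_layers=0)

# A dummy batch of 4 flattened images should yield 4 rows of 10 logits
dummy_batch = torch.randn(4, 784)
logits = sketch_model(dummy_batch)
print(logits.shape)  # torch.Size([4, 10])

# Trainable parameters: (784*128 + 128) + (128*10 + 10) = 101770
n_params = sum(p.numel() for p in sketch_model.parameters() if p.requires_grad)
print(n_params)  # 101770
```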


3. **Loss Function and Optimizer**

In [94]:
import torch.optim as optim

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Print the loss function and optimizer
print("Loss Function:", loss_fn)
print("Optimizer:", optimizer)


Loss Function: CrossEntropyLoss()
Optimizer: Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.001
    maximize: False
    weight_decay: 0
)
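
`nn.CrossEntropyLoss` expects raw logits and integer class indices, which is why `MyNN.forward` returns the output of `fc2` without a softmax. As a small illustration on random data (the tensors below are made up for this sketch), the loss matches a log-softmax followed by the negative log-likelihood loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)              # raw, unnormalized scores for 5 samples
targets = torch.tensor([3, 0, 9, 1, 7])  # class indices, dtype int64

ce = nn.CrossEntropyLoss()(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(ce, manual))  # True
```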


4. **Training**

In [95]:
epochs = 10  # the batch size (512) is already fixed by train_loader

# Training loop
for epoch in range(epochs):
    model.train()  #train mode
    
    for inputs, labels in train_loader:
        
        optimizer.zero_grad()

        outputs = model(inputs)

        loss = loss_fn(outputs, labels)

        loss.backward()

        optimizer.step()

    print(f'Finished epoch {epoch}, latest loss {loss}')


Finished epoch 0, latest loss 0.3141784071922302
Finished epoch 1, latest loss 0.17846040427684784
Finished epoch 2, latest loss 0.13639381527900696
Finished epoch 3, latest loss 0.06876114010810852
Finished epoch 4, latest loss 0.10731310397386551
Finished epoch 5, latest loss 0.09498702734708786
Finished epoch 6, latest loss 0.055537376552820206
Finished epoch 7, latest loss 0.05380956456065178
Finished epoch 8, latest loss 0.03768095746636391
Finished epoch 9, latest loss 0.04853275790810585
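
The printed value is the loss of the last batch only, which is noisy (it rises between epochs 3 and 4 above). A possible alternative is to report the mean loss over the whole epoch. The sketch below assumes a small synthetic dataset and a toy network (`net`, `loader`, `criterion`, and `opt` are placeholder names), so it runs without MNIST:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Synthetic stand-in data (assumption): 256 samples, 20 features, 3 classes
X = torch.randn(256, 20)
y = torch.randint(0, 3, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

net = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 3))
criterion = nn.CrossEntropyLoss()
opt = optim.Adam(net.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(3):
    net.train()
    running_loss, n_samples = 0.0, 0
    for inputs, targets in loader:
        opt.zero_grad()
        loss = criterion(net(inputs), targets)
        loss.backward()
        opt.step()
        # Weight each batch loss by its size so a smaller final batch counts correctly
        running_loss += loss.item() * inputs.size(0)
        n_samples += inputs.size(0)
    epoch_losses.append(running_loss / n_samples)
    print(f"Finished epoch {epoch}, mean loss {epoch_losses[-1]:.4f}")
```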


5. **Testing**

In [96]:
model.eval()  #evaluation mode

correct = 0
total = 0

with torch.no_grad():
    for inputs, labels in test_loader: # because we also use batches for testing
        outputs = model(inputs)
        predicted = outputs.argmax(dim=1)  # index of the largest logit = predicted class
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = correct / total

print(f'Test Accuracy: {accuracy:.4f}')

Test Accuracy: 0.9724
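
Overall accuracy hides which digits are confused with each other. One way to look deeper is a confusion matrix and per-class accuracy. The sketch below uses a handful of made-up labels and predictions (`labels_all` and `preds_all` are hypothetical stand-ins for values collected over `test_loader`):

```python
import torch

num_classes = 10
# Hypothetical labels and predictions, standing in for one pass over the test set
labels_all = torch.tensor([0, 1, 1, 2, 2, 2, 9, 9])
preds_all = torch.tensor([0, 1, 2, 2, 2, 1, 9, 9])

# Confusion matrix: rows are true classes, columns are predicted classes
flat_index = labels_all * num_classes + preds_all
confusion = torch.bincount(flat_index, minlength=num_classes ** 2).reshape(
    num_classes, num_classes
)

per_class_total = confusion.sum(dim=1)
per_class_correct = confusion.diag()
for c in range(num_classes):
    if per_class_total[c] > 0:  # skip classes that do not occur in the labels
        acc = per_class_correct[c].item() / per_class_total[c].item()
        print(f"class {c}: {acc:.2f} ({per_class_total[c].item()} samples)")

overall = per_class_correct.sum().item() / per_class_total.sum().item()
print(f"overall: {overall:.2f}")  # overall: 0.75
```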


6. **Optimization**

In [99]:
import optuna

def objective(trial):
    # hyperparameters to be optimized
    learning_rate = trial.suggest_categorical('learning_rate', [0.001, 0.01, 0.1])
    num_epochs = trial.suggest_categorical('num_epochs', [10, 15])
    hidden_size = trial.suggest_categorical('hidden_size', [128, 256, 512, 1028]) # not ideal: the same width is used for every hidden layer
    num_additional_hidden_layers = trial.suggest_categorical('num_additional_hidden_layers', [1, 2, 3])
    batch_size = trial.suggest_categorical('batch_size', [256, 512, 1024])

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True) # for batch size 
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False) # for batch size 
    model = MyNN(input_size, hidden_size, output_size, num_additional_hidden_layers) # for hidden size and additional layers
    optimizer = optim.Adam(model.parameters(), lr=learning_rate) # for learning rate

    # Training loop
    for epoch in range(num_epochs):
        model.train()
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()

    # evaluation
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = model(inputs)
            predicted = outputs.argmax(dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = correct / total

    return accuracy

# Create a study object and optimize
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)

# Get the best hyperparameters
best_params = study.best_params
best_learning_rate = best_params['learning_rate']
best_num_epochs = best_params['num_epochs']
best_hidden_size = best_params['hidden_size']
best_num_additional_hidden_layers = best_params['num_additional_hidden_layers']
best_batch_size = best_params['batch_size']


print("Best Hyperparameters:")
print(f"Learning Rate: {best_learning_rate}")
print(f"Number of Epochs: {best_num_epochs}")
print(f"Hidden Size: {best_hidden_size}")
print(f"Number of Hidden Layers: {best_num_additional_hidden_layers+1}") # +1 because we started already with one hidden layer
print(f"Batch Size: {best_batch_size}")


[I 2023-12-20 21:52:00,160] A new study created in memory with name: no-name-ee9ac213-945f-4674-ad3a-41c030774137
[I 2023-12-20 21:52:15,670] Trial 0 finished with value: 0.9744166666666667 and parameters: {'learning_rate': 0.001, 'num_epochs': 10, 'hidden_size': 128, 'num_additional_hidden_layers': 2, 'batch_size': 256}. Best is trial 0 with value: 0.9744166666666667.
[I 2023-12-20 21:53:51,736] Trial 1 finished with value: 0.15391666666666667 and parameters: {'learning_rate': 0.1, 'num_epochs': 10, 'hidden_size': 1028, 'num_additional_hidden_layers': 1, 'batch_size': 256}. Best is trial 0 with value: 0.9744166666666667.
[I 2023-12-20 21:54:12,922] Trial 2 finished with value: 0.18458333333333332 and parameters: {'learning_rate': 0.1, 'num_epochs': 10, 'hidden_size': 512, 'num_additional_hidden_layers': 1, 'batch_size': 1024}. Best is trial 0 with value: 0.9744166666666667.
[I 2023-12-20 21:55:47,344] Trial 3 finished with value: 0.9783333333333334 and parameters: {'learning_rate': 0.

Best Hyperparameters:
Learning Rate: 0.001
Number of Epochs: 15
Hidden Size: 1028
Number of Hidden Layers: 3


The search took about 9 minutes for me.

The best model reaches an accuracy of 0.9783333333333334, which is slightly better than my initial model (0.9724).  
My initial model was probably already quite good because I had manually tried different learning rates, batch sizes, and numbers of epochs beforehand.  

Hyperparameters:  
Learning Rate: 0.001  
Number of Epochs: 15  
Hidden Size: 1028  
Number of Hidden Layers: 3  
Batch Size: 1024
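
The `objective` function above scores every trial on `test_loader`, so the reported best accuracy is selected on the same data it is measured on and may be optimistic. A common remedy is to tune on a separate validation split and touch the test set only once at the end. The following is a minimal sketch of that idea as a plain-Python grid search (no Optuna, no `GridSearchCV`); the synthetic data, the tiny grid, and the helper `train_and_score` are assumptions for illustration only.

```python
import itertools
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
# Synthetic stand-in data (assumption), split into train and validation parts
X = torch.randn(600, 20)
y = (X[:, 0] > 0).long()
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

def train_and_score(lr, hidden, epochs=20):
    """Hypothetical helper: train a small MLP full-batch, return validation accuracy."""
    net = nn.Sequential(nn.Linear(20, hidden), nn.ReLU(), nn.Linear(hidden, 2))
    opt = optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    net.train()
    for _ in range(epochs):
        opt.zero_grad()
        criterion(net(X_train), y_train).backward()
        opt.step()
    net.eval()
    with torch.no_grad():
        return (net(X_val).argmax(dim=1) == y_val).float().mean().item()

# Evaluate every combination in the grid on the validation split
grid = {"lr": [0.001, 0.01], "hidden": [16, 32]}
results = {
    (lr, hidden): train_and_score(lr, hidden)
    for lr, hidden in itertools.product(grid["lr"], grid["hidden"])
}
best_config = max(results, key=results.get)
print("Best (lr, hidden):", best_config, "validation accuracy:", results[best_config])
```

With Optuna, the same idea would amount to building the evaluation loader inside `objective` from a validation subset instead of `test_dataset`.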
