# **Deep Learning Course**

## **Loss Functions and Multilayer Perceptrons (MLP)**

---

### **Student Information:**

- **Name:** *(Amirparsa Bahrami)*
- **Student Number:** *(401101332)*

---

### **Assignment Overview**

In this notebook, we will explore various loss functions used in neural networks, with a specific focus on their role in training **Multilayer Perceptrons (MLPs)**. By the end of this notebook, you will have a deeper understanding of:
- Types of loss functions
- How loss functions affect the training process
- The relationship between loss functions and model optimization in MLPs

---

### **Table of Contents**

1. Introduction to Loss Functions
2. Types of Loss Functions
3. Multilayer Perceptrons (MLP)
4. Implementing Loss Functions in MLP
5. Conclusion

---



# 1. Introduction to Loss Functions

In deep learning, **loss functions** play a crucial role in training models by quantifying the difference between the predicted outputs and the actual targets. Selecting an appropriate loss function is essential for the success of your model. In this notebook, we will explore various loss functions available in PyTorch, understand their theoretical backgrounds, and provide you with a scaffolded class to experiment with these loss functions.

Before beginning, let's train a simple MLP model using the **L1Loss** function. We'll return to this model later to experiment with different loss functions. We'll start by importing the necessary libraries and defining the model architecture.

First things first, let's talk about **L1Loss**.

### 1. L1Loss (`torch.nn.L1Loss`)
- **Description:** Also known as Mean Absolute Error (MAE), L1Loss computes the average absolute difference between the predicted values and the target values.
- **Use Case:** Suitable for regression tasks where robustness to outliers is desired.

Here is the mathematical formulation of L1Loss:
\begin{equation}
\text{L1Loss} = \frac{1}{n} \sum_{i=1}^{n} |y_{\text{pred}_i} - y_{\text{true}_i}|
\end{equation}
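As a quick sanity check of the formula above, the following minimal sketch computes the mean absolute error by hand and compares it with `torch.nn.L1Loss`. The tensors `y_pred` and `y_true` hold made-up illustrative values, not Titanic data.

```python
import torch
import torch.nn as nn

# Toy predictions and targets (illustrative values only)
y_pred = torch.tensor([0.5, 2.0, -1.0, 3.0])
y_true = torch.tensor([1.0, 2.0, 0.0, 1.0])

# Manual mean absolute error, following the formula above
manual_l1 = (y_pred - y_true).abs().mean()

# Built-in loss with the default reduction='mean'
builtin_l1 = nn.L1Loss()(y_pred, y_true)

# Absolute errors are 0.5, 0, 1, 2, so both values are 3.5 / 4 = 0.875
print(manual_l1.item(), builtin_l1.item())
```

Both numbers agree because `nn.L1Loss` averages the absolute errors over all elements by default.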

Let's implement a simple MLP model using the L1Loss function.

In [61]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from torch.optim import Adam
# Don't worry about Adam for now; it's just a fancy name for a fancy optimization algorithm
from tqdm import tqdm
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

Here, we'll define a class called `SimpleMLP` that inherits from `nn.Module`. This class can have multiple layers, and we'll use the `nn.Sequential` module to define the layers of the model. The model will have the following architecture:

In [62]:
class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_hidden_layers=1, last_layer_activation_fn=nn.ReLU):
        super(SimpleMLP, self).__init__()
        layers = []
        # Input layer
        layers.append(nn.Linear(input_dim, hidden_dim))
        layers.append(nn.ReLU())
        # Hidden layers
        for _ in range(num_hidden_layers - 1):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.ReLU())
        # Output layer
        layers.append(nn.Linear(hidden_dim, output_dim))
        if last_layer_activation_fn is not None:
            layers.append(last_layer_activation_fn())
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)

Now, let's define a class called `SimpleMLPTrainer` that wraps the model, the loss function (criterion), and the optimizer, and provides `train` and `evaluate` methods:

In [63]:
class SimpleMLPTrainer:
    def __init__(self, model, criterion, optimizer):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer

    def train(self, train_loader, num_epochs):
        self.model.train()
        train_losses = []
        for epoch in range(num_epochs):
            epoch_loss = 0
            for inputs, targets in tqdm(train_loader, desc=f'Epoch {epoch+1}/{num_epochs}'):
                outputs = self.model(inputs)
                loss = self.criterion(outputs, targets)
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()
                epoch_loss += loss.item()
            avg_loss = epoch_loss / len(train_loader)
            train_losses.append(avg_loss)
            print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}')
        return train_losses

    def evaluate(self, val_loader):
        self.model.eval()
        total_loss = 0
        correct = 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                outputs = self.model(inputs)
                outputs = outputs.view(-1)  # Reshape outputs to match targets
                loss = self.criterion(outputs, targets)
                total_loss += loss.item()
                if isinstance(self.criterion, (nn.CrossEntropyLoss, nn.NLLLoss)):
                    _, predicted = torch.max(outputs.data, 1)
                    correct += (predicted == targets).sum().item()
                else:
                    predicted = outputs.round()
                    correct += (predicted == targets).sum().item()
        avg_loss = total_loss / len(val_loader)
        accuracy = 100 * correct / (len(val_loader.dataset))
        print(f'Validation Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')
        return avg_loss, accuracy
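One pitfall worth checking in the trainer above: with `output_dim = 1` the model returns a tensor of shape `(batch, 1)`, while the targets have shape `(batch,)`. `evaluate` flattens the outputs with `view(-1)`, but `train` does not appear to. The sketch below uses made-up values to show what `nn.L1Loss` does in each case. PyTorch broadcasts the mismatched shapes (and emits a warning), so the loss is averaged over every output/target pair rather than over matching pairs.

```python
import warnings

import torch
import torch.nn as nn

outputs = torch.tensor([[0.0], [1.0], [2.0]])  # shape (3, 1), like a model with output_dim=1
targets = torch.tensor([0.0, 1.0, 2.0])        # shape (3,), like the label tensor

criterion = nn.L1Loss()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the size-mismatch warning for this demo
    # (3, 1) vs (3,) broadcasts to a (3, 3) grid of |output_i - target_j|
    broadcast_loss = criterion(outputs, targets)

# Flattening the outputs compares each prediction with its own target
matched_loss = criterion(outputs.view(-1), targets)

print(broadcast_loss.item())  # 8/9: average over all 9 pairs
print(matched_loss.item())    # 0.0: predictions equal their targets
```

If the training loss in a regression setup looks suspicious, flattening the outputs (or reshaping the targets to `(batch, 1)`) before calling the criterion is one way to rule this out.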

Next, let's test our model using the L1Loss function. You'll use the <span style="color:red">*Titanic Dataset*</span> to train the model.


In [64]:
# Load dataset
train_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(train_url)

# Preprocessing
data = data[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].dropna()
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

# Features and target
X = data[['Pclass', 'Sex', 'Age', 'Fare']].values.astype('float32')
y = data['Survived'].values.astype('float32')

# Convert to tensors
X_tensor = torch.tensor(X)
y_tensor = torch.tensor(y)

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X_tensor, y_tensor, test_size=0.2, random_state=42)

# Create DataLoaders
train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)

batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)


<div style="text-align: center;"> <span style="color:red; font-size: 26px; font-weight: bold;">Let's train!</span> </div>

In [65]:
from torch.nn import L1Loss

# Define the model
input_dim = X_train.shape[1]
hidden_dim = 16
output_dim = 1  # Regression output for L1Loss
num_hidden_layers = 2
model = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers, last_layer_activation_fn=None)

# Define the criterion and optimizer
criterion = nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Create trainer instance
trainer = SimpleMLPTrainer(model, criterion, optimizer)

# Train the model
num_epochs = 20
trainer.train(train_loader, num_epochs)

# Evaluate the model
trainer.evaluate(val_loader)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 393.87it/s]


Epoch [1/20], Loss: 1.0195


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 557.94it/s]


Epoch [2/20], Loss: 0.5878


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 435.91it/s]


Epoch [3/20], Loss: 0.4827


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 322.82it/s]


Epoch [4/20], Loss: 0.4719


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 544.51it/s]


Epoch [5/20], Loss: 0.4614


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 538.83it/s]


Epoch [6/20], Loss: 0.4618


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 534.05it/s]


Epoch [7/20], Loss: 0.4548


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 513.81it/s]


Epoch [8/20], Loss: 0.4508


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 522.44it/s]


Epoch [9/20], Loss: 0.4428


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 495.59it/s]


Epoch [10/20], Loss: 0.4362


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 440.03it/s]


Epoch [11/20], Loss: 0.4327


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 450.85it/s]


Epoch [12/20], Loss: 0.4289


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 500.46it/s]


Epoch [13/20], Loss: 0.4261


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 481.51it/s]


Epoch [14/20], Loss: 0.4260


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 464.39it/s]


Epoch [15/20], Loss: 0.4224


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 447.58it/s]


Epoch [16/20], Loss: 0.4210


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 510.76it/s]


Epoch [17/20], Loss: 0.4182


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 467.01it/s]


Epoch [18/20], Loss: 0.4180


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 433.27it/s]


Epoch [19/20], Loss: 0.4166


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 489.17it/s]

Epoch [20/20], Loss: 0.4167
Validation Loss: 0.3994, Accuracy: 60.84%





(0.3994342446327209, 60.83916083916084)

---
# 2. Types of Loss Functions

PyTorch offers a variety of built-in loss functions tailored for different types of problems, such as regression, classification, and more. Below, we discuss several commonly used loss functions, their theoretical foundations, and typical use cases.

### 2. MSELoss (`torch.nn.MSELoss`)
- **Description:** Mean Squared Error (MSE) calculates the average of the squares of the differences between predicted and target values.
- **Use Case:** Commonly used in regression problems where larger errors are significantly penalized.

Here is the mathematical formulation of MSE:
\begin{equation}
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}
\end{equation}
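To make "larger errors are significantly penalized" concrete, the following sketch compares `nn.L1Loss` and `nn.MSELoss` on two made-up error patterns with the same total absolute error: four small errors versus a single large one.

```python
import torch
import torch.nn as nn

y_true = torch.zeros(4)
small_errors = torch.tensor([1.0, 1.0, 1.0, 1.0])  # every prediction is off by 1
one_outlier = torch.tensor([0.0, 0.0, 0.0, 4.0])   # a single prediction is off by 4

l1, mse = nn.L1Loss(), nn.MSELoss()

l1_small, l1_outlier = l1(small_errors, y_true), l1(one_outlier, y_true)
mse_small, mse_outlier = mse(small_errors, y_true), mse(one_outlier, y_true)

print(l1_small.item(), l1_outlier.item())    # 1.0 1.0  (L1 cannot tell them apart)
print(mse_small.item(), mse_outlier.item())  # 1.0 4.0  (MSE punishes the outlier)
```

Squaring makes the single error of 4 contribute 16, so MSE is four times larger for the outlier pattern while L1 is unchanged.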

<span style="color:red; font-size: 18px; font-weight: bold;">Warning:</span> Don't forget to reinitialize the model before experimenting with different loss functions.

In [68]:
from torch.nn import MSELoss

# Reinitialize the model
model = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers, last_layer_activation_fn=None)

# Define the criterion and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

trainer = SimpleMLPTrainer(model, criterion, optimizer)

# Train the model
trainer.train(train_loader, num_epochs)

# Evaluate the model
trainer.evaluate(val_loader)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 549.19it/s]


Epoch [1/20], Loss: 5.8768


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 504.01it/s]


Epoch [2/20], Loss: 0.5782


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 528.69it/s]


Epoch [3/20], Loss: 0.3223


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 419.21it/s]


Epoch [4/20], Loss: 0.2965


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 512.45it/s]


Epoch [5/20], Loss: 0.2874


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 388.48it/s]


Epoch [6/20], Loss: 0.2750


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 542.96it/s]


Epoch [7/20], Loss: 0.2722


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 449.80it/s]


Epoch [8/20], Loss: 0.2704


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 361.30it/s]


Epoch [9/20], Loss: 0.2578


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 420.68it/s]


Epoch [10/20], Loss: 0.2565


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 367.32it/s]


Epoch [11/20], Loss: 0.2532


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 392.11it/s]


Epoch [12/20], Loss: 0.2534


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 407.90it/s]


Epoch [13/20], Loss: 0.2520


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 510.24it/s]


Epoch [14/20], Loss: 0.2502


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 378.13it/s]


Epoch [15/20], Loss: 0.2490


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 487.23it/s]


Epoch [16/20], Loss: 0.2497


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 508.24it/s]


Epoch [17/20], Loss: 0.2477


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 531.63it/s]


Epoch [18/20], Loss: 0.2500


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 544.53it/s]


Epoch [19/20], Loss: 0.2475


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 435.91it/s]

Epoch [20/20], Loss: 0.2490
Validation Loss: 0.2346, Accuracy: 66.43%





(0.23455969989299774, 66.43356643356644)

### 3. NLLLoss (`torch.nn.NLLLoss`)
- **Description:** Negative Log-Likelihood Loss measures the likelihood of the target class under the predicted probability distribution.
- **Use Case:** Typically used in multi-class classification tasks, especially when combined with `log_softmax` activation.

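Here is the mathematical formulation of NLLLoss, where \( \hat{p}_{i, y_i} \) is the probability the model assigns to the true class \( y_i \) of sample \( i \):
\begin{equation}
\text{NLLLoss} = -\frac{1}{n} \sum_{i=1}^{n} \log\left(\hat{p}_{i, y_i}\right)
\end{equation}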

I hope you noticed the logarithm in the formula. It's important!

Why?

In this part, run your training with ReLU at the last layer. <span style="color:red; font-weight: bold;">Discuss </span> and explain the difference between the results of the two models. Find a proper solution to the problem.


**Note:** The line `outputs = outputs.view(-1)` is removed from the `evaluate` method below, so the model's output shape is not altered unintentionally.

In [69]:
class SimpleMLPTrainer:
    def __init__(self, model, criterion, optimizer):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer

    def train(self, train_loader, num_epochs):
        self.model.train()
        train_losses = []
        for epoch in range(num_epochs):
            epoch_loss = 0
            for inputs, targets in tqdm(train_loader, desc=f'Epoch {epoch+1}/{num_epochs}'):
                outputs = self.model(inputs)
                loss = self.criterion(outputs, targets)
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()
                epoch_loss += loss.item()
            avg_loss = epoch_loss / len(train_loader)
            train_losses.append(avg_loss)
            print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}')
        return train_losses

    def evaluate(self, val_loader):
        self.model.eval()
        total_loss = 0
        correct = 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                outputs = self.model(inputs)
                loss = self.criterion(outputs, targets)
                total_loss += loss.item()
                if isinstance(self.criterion, (nn.CrossEntropyLoss, nn.NLLLoss)):
                    _, predicted = torch.max(outputs.data, 1)
                    correct += (predicted == targets).sum().item()
                else:
                    predicted = outputs.round()
                    correct += (predicted == targets).sum().item()
        avg_loss = total_loss / len(val_loader)
        accuracy = 100 * correct / (len(val_loader.dataset))
        print(f'Validation Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')
        return avg_loss, accuracy

In [70]:
from torch.nn import NLLLoss

# Modify the model for classification
output_dim = 2  # Number of classes
model = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers, last_layer_activation_fn=nn.ReLU)

# Define the criterion and optimizer
criterion = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Update the labels to be LongTensor for classification
y_train_long = y_train.long()
y_val_long = y_val.long()

# Update datasets and loaders
train_dataset = TensorDataset(X_train, y_train_long)
val_dataset = TensorDataset(X_val, y_val_long)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

# Create trainer instance
trainer = SimpleMLPTrainer(model, criterion, optimizer)

# Train the model
trainer.train(train_loader, num_epochs)

# Evaluate the model
trainer.evaluate(val_loader)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 486.23it/s]


Epoch [1/20], Loss: -0.8004


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 350.53it/s]


Epoch [2/20], Loss: -2.0685


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 328.09it/s]


Epoch [3/20], Loss: -3.8088


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 317.40it/s]


Epoch [4/20], Loss: -6.6676


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 472.54it/s]


Epoch [5/20], Loss: -10.8292


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 410.98it/s]


Epoch [6/20], Loss: -16.9392


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 481.20it/s]


Epoch [7/20], Loss: -25.7110


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 460.98it/s]


Epoch [8/20], Loss: -37.7600


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 395.21it/s]


Epoch [9/20], Loss: -53.8497


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 451.43it/s]


Epoch [10/20], Loss: -75.0161


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 443.07it/s]


Epoch [11/20], Loss: -101.4194


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 577.90it/s]


Epoch [12/20], Loss: -134.1188


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 465.37it/s]


Epoch [13/20], Loss: -176.4697


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 478.05it/s]


Epoch [14/20], Loss: -224.9504


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 437.53it/s]


Epoch [15/20], Loss: -286.6676


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 477.75it/s]


Epoch [16/20], Loss: -358.9829


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 474.18it/s]


Epoch [17/20], Loss: -441.4628


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 566.98it/s]


Epoch [18/20], Loss: -538.9040


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 477.61it/s]


Epoch [19/20], Loss: -648.6813


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 506.08it/s]

Epoch [20/20], Loss: -780.1832
Validation Loss: -812.7282, Accuracy: 39.16%





(-812.7281860351562, 39.16083916083916)

# Using LogSoftmax Activation Function

In [72]:
# Run with LogSoftmax activation function
from torch.nn import NLLLoss

# Modify the model to use LogSoftmax
class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_hidden_layers=1):
        super(SimpleMLP, self).__init__()
        layers = []
        # Input layer
        layers.append(nn.Linear(input_dim, hidden_dim))
        layers.append(nn.ReLU())
        # Hidden layers
        for _ in range(num_hidden_layers - 1):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.ReLU())
        # Output layer without activation
        layers.append(nn.Linear(hidden_dim, output_dim))
        self.hidden_layers = nn.Sequential(*layers)
        self.output_activation = nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = self.hidden_layers(x)
        x = self.output_activation(x)
        return x

# Reinitialize the model
model = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers)

# Define the criterion and optimizer
criterion = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Create trainer instance
trainer = SimpleMLPTrainer(model, criterion, optimizer)

# Train the model
trainer.train(train_loader, num_epochs)

# Evaluate the model
trainer.evaluate(val_loader)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 578.30it/s]


Epoch [1/20], Loss: 0.8284


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 518.93it/s]


Epoch [2/20], Loss: 0.6095


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 526.16it/s]


Epoch [3/20], Loss: 0.6097


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 560.57it/s]


Epoch [4/20], Loss: 0.5983


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 526.42it/s]


Epoch [5/20], Loss: 0.5961


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 357.98it/s]


Epoch [6/20], Loss: 0.6003


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 375.59it/s]


Epoch [7/20], Loss: 0.5902


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 383.24it/s]


Epoch [8/20], Loss: 0.5894


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 462.99it/s]


Epoch [9/20], Loss: 0.5960


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 464.78it/s]


Epoch [10/20], Loss: 0.5994


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 519.31it/s]


Epoch [11/20], Loss: 0.5896


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 544.24it/s]


Epoch [12/20], Loss: 0.5886


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 561.56it/s]


Epoch [13/20], Loss: 0.5950


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 556.57it/s]


Epoch [14/20], Loss: 0.5825


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 587.40it/s]


Epoch [15/20], Loss: 0.5806


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 378.89it/s]


Epoch [16/20], Loss: 0.5822


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 521.37it/s]


Epoch [17/20], Loss: 0.5854


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 558.76it/s]


Epoch [18/20], Loss: 0.5896


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 529.47it/s]


Epoch [19/20], Loss: 0.5838


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 501.10it/s]

Epoch [20/20], Loss: 0.5816
Validation Loss: 0.6686, Accuracy: 62.24%





(0.6686157703399658, 62.23776223776224)

Your reason for your choice:
<div>
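**By using LogSoftmax as the activation function for the last layer, we ensure that the outputs are log-probabilities, which is what NLLLoss expects. With a ReLU output, NLLLoss simply negates the non-negative, unbounded score of the target class, so the optimizer can push the loss toward negative infinity without learning meaningful probabilities, as the run above shows. LogSoftmax aligns the model's output with the requirements of the loss function, allowing for proper training.**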
</div>
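The behaviour discussed above can be reproduced on a tiny example. `nn.NLLLoss` does not take a logarithm itself; it just picks the entry of the target class and negates it. The sketch below uses made-up scores of the kind a ReLU output layer could produce.

```python
import torch
import torch.nn as nn

targets = torch.tensor([0, 1])
criterion = nn.NLLLoss()

# Non-negative, unbounded scores, as a ReLU output layer would produce
relu_scores = torch.tensor([[5.0, 0.0],
                            [0.0, 7.0]])

# NLLLoss averages -score[target]: -(5 + 7) / 2 = -6
loss_relu = criterion(relu_scores, targets)

# Converting the same scores to log-probabilities keeps the loss non-negative
log_probs = torch.log_softmax(relu_scores, dim=1)
loss_log = criterion(log_probs, targets)

print(loss_relu.item())  # -6.0
print(loss_log.item())   # small positive number
```

Because log-probabilities are never positive, the loss computed from `log_probs` is bounded below by zero, whereas the ReLU scores can grow without limit.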



### 4. CrossEntropyLoss (`torch.nn.CrossEntropyLoss`)
- **Description:** Combines `LogSoftmax` and `NLLLoss` in one single class. It computes the cross-entropy loss between the target and the output logits.
- **Use Case:** Widely used for multi-class classification problems.

The mathematical formulation of CrossEntropyLoss is as follows:
\begin{equation}
  \text{CrossEntropy}(y, \hat{y}) = - \sum_{i=1}^{C} y_i \log\left(\frac{e^{\hat{y}_i}}{\sum_{j=1}^{C} e^{\hat{y}_j}}\right)
\end{equation}
  where:
  - \( C \) is the number of classes,
  - \( y_i \) is the \( i \)-th entry of the one-hot encoded target vector (in PyTorch the target is usually passed as an integer class index instead),
  - \( \hat{y}_i \) represents the logits (unnormalized model outputs) for each class.
  
  In practice, `torch.nn.CrossEntropyLoss` expects raw logits as input and internally applies `log_softmax` to convert the logits into log-probabilities, followed by the negative log-likelihood computation.

- **Background:** Cross-entropy measures the difference between the true distribution \( y \) and the predicted distribution \( \text{softmax}(\hat{y}) \). Minimizing it minimizes the negative log-probability assigned to the correct class, penalizing predictions that put little probability on the true class, which makes it a standard choice for classification tasks in deep learning.
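The claim that `CrossEntropyLoss` combines `LogSoftmax` and `NLLLoss` can be checked numerically. The following sketch uses random logits and arbitrary labels (both purely illustrative) and compares three ways of computing the same quantity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 2)             # raw, unnormalized scores for 2 classes
targets = torch.tensor([0, 1, 1, 0])   # integer class labels

# 1) Built-in cross-entropy on raw logits
ce = nn.CrossEntropyLoss()(logits, targets)

# 2) LogSoftmax followed by NLLLoss
log_probs = torch.log_softmax(logits, dim=1)
nll = nn.NLLLoss()(log_probs, targets)

# 3) Manual formula: mean negative log-probability of the true class
manual = -log_probs[torch.arange(len(targets)), targets].mean()

print(ce.item(), nll.item(), manual.item())
```

All three values match up to floating-point error, which is why a model trained with `CrossEntropyLoss` should output raw logits without a final activation.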

Now, let's redefine `SimpleMLP` so that it outputs raw logits (no activation after the last layer) and train it with `CrossEntropyLoss`:


In [73]:
from torch.nn import CrossEntropyLoss

# Modify the model to output raw logits
class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_hidden_layers=1):
        super(SimpleMLP, self).__init__()
        layers = []
        # Input layer
        layers.append(nn.Linear(input_dim, hidden_dim))
        layers.append(nn.ReLU())
        # Hidden layers
        for _ in range(num_hidden_layers - 1):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.ReLU())
        # Output layer without activation
        layers.append(nn.Linear(hidden_dim, output_dim))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)

# Reinitialize the model
model = SimpleMLP(input_dim, hidden_dim, 2, num_hidden_layers)

# Define the criterion and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Create trainer instance
trainer = SimpleMLPTrainer(model, criterion, optimizer)

# Train the model
trainer.train(train_loader, num_epochs)

# Evaluate the model
trainer.evaluate(val_loader)



Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 537.45it/s]


Epoch [1/20], Loss: 1.4005


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 547.02it/s]


Epoch [2/20], Loss: 0.7790


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 542.22it/s]


Epoch [3/20], Loss: 0.6791


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 565.45it/s]


Epoch [4/20], Loss: 0.6499


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 547.69it/s]


Epoch [5/20], Loss: 0.6275


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 426.44it/s]


Epoch [6/20], Loss: 0.6149


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 400.50it/s]


Epoch [7/20], Loss: 0.6031


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 382.52it/s]


Epoch [8/20], Loss: 0.5988


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 402.17it/s]


Epoch [9/20], Loss: 0.5923


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 501.43it/s]


Epoch [10/20], Loss: 0.5886


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 491.57it/s]


Epoch [11/20], Loss: 0.5878


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 548.82it/s]


Epoch [12/20], Loss: 0.5837


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 553.27it/s]


Epoch [13/20], Loss: 0.5872


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 455.46it/s]


Epoch [14/20], Loss: 0.5866


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 494.83it/s]


Epoch [15/20], Loss: 0.5817


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 437.59it/s]


Epoch [16/20], Loss: 0.5840


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 520.78it/s]


Epoch [17/20], Loss: 0.6023


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 496.26it/s]


Epoch [18/20], Loss: 0.5836


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 451.89it/s]


Epoch [19/20], Loss: 0.5938


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 407.30it/s]


Epoch [20/20], Loss: 0.5879
Validation Loss: 0.6777, Accuracy: 62.24%


(0.677659022808075, 62.23776223776224)


### 5. KLDivLoss (`torch.nn.KLDivLoss`)
- **Description:** Kullback-Leibler Divergence Loss measures how one probability distribution diverges from a second, reference distribution. Unlike other loss functions that focus on classification, KL divergence specifically compares the relative entropy between two distributions. It quantifies the information loss when using the predicted distribution to approximate the true distribution.

- **Mathematical Function:**
\begin{equation}
  \text{KL}(P \parallel Q) = \sum_{i=1}^{C} P(i) \left( \log P(i) - \log Q(i) \right)
\end{equation}
  where:
  - \( P \) is the target (true) probability distribution,
  - \( Q \) is the predicted distribution (`torch.nn.KLDivLoss` expects it as log-probabilities \( \log Q \), often the output of `log_softmax`),
  - \( C \) is the number of classes.

  KL divergence is always non-negative, and it equals zero if the two distributions are identical. The loss function expects the model's output to be in the form of log-probabilities (using `log_softmax`) and compares this against a target probability distribution, which is typically a normalized distribution (using softmax).

- **Use Case:** KLDivLoss is frequently used in:
  - **Variational Autoencoders (VAEs):** In VAEs, KL divergence is used to measure how much the learned latent space distribution deviates from a prior distribution (often Gaussian).
  - **Knowledge Distillation:** In teacher-student models, KL divergence is used to transfer the "soft" knowledge from a teacher model to a student model by comparing their output probability distributions.
  - **Reinforcement Learning:** It can be used to update policies while minimizing the divergence from a previous policy.

- **Background:** Kullback-Leibler divergence, a core concept in information theory, measures the inefficiency of assuming the predicted distribution \( Q \) when the true distribution is \( P \). It is asymmetric, meaning that \( KL(P \parallel Q) \neq KL(Q \parallel P) \), so the direction of the comparison matters.
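The following sketch illustrates the input convention described above: `nn.KLDivLoss` receives log-probabilities as its first argument and probabilities as its target. The logits and the target distribution `p` are made-up values. The second half checks that, for one-hot targets, the KL divergence coincides with cross-entropy, since a one-hot distribution has zero entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 2)
log_q = F.log_softmax(logits, dim=1)   # model output as log-probabilities

# Soft target distribution (each row sums to 1)
p = torch.tensor([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5]])

criterion = nn.KLDivLoss(reduction='batchmean')
kl = criterion(log_q, p)

# Manual formula: sum_i P(i) * (log P(i) - log Q(i)), averaged over the batch
manual = (p * (p.log() - log_q)).sum(dim=1).mean()

# With one-hot targets the KL divergence equals the cross-entropy
targets = torch.tensor([0, 1, 1])
one_hot = F.one_hot(targets, num_classes=2).float()
kl_one_hot = criterion(log_q, one_hot)
ce = F.cross_entropy(logits, targets)

print(kl.item(), manual.item())
print(kl_one_hot.item(), ce.item())
```

Passing raw or ReLU-activated scores instead of `log_q` breaks the assumption that the input holds log-probabilities, and the result is then no longer a valid, non-negative divergence.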

Again, in this part, run your training with ReLU at the last layer. <span style="color:red; font-weight: bold;">Discuss </span> and explain the difference between the results of the two models. Find a proper solution to the problem.


In [75]:
# Run with ReLU activation function at the output layer
from torch.nn import KLDivLoss

# Model whose last layer uses ReLU (so its outputs are not log-probabilities)
class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_hidden_layers=1):
        super(SimpleMLP, self).__init__()
        layers = []
        # Input layer
        layers.append(nn.Linear(input_dim, hidden_dim))
        layers.append(nn.ReLU())
        # Hidden layers
        for _ in range(num_hidden_layers - 1):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.ReLU())
        # Output layer
        layers.append(nn.Linear(hidden_dim, output_dim))
        layers.append(nn.ReLU())  # Using ReLU activation
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)

# Define the prepare_targets function
def prepare_targets(targets, num_classes):
    # Ensure targets are LongTensors for indexing
    targets = targets.long()
    # Create a zero tensor of floats
    targets_one_hot = torch.zeros(targets.size(0), num_classes, device=targets.device, dtype=torch.float)
    # Use scatter_ to create one-hot encoding
    targets_one_hot.scatter_(1, targets.unsqueeze(1), 1.0)
    # No need to apply softmax to targets
    return targets_one_hot

# Define the model
input_dim = X_train.shape[1]
hidden_dim = 16
output_dim = 2  # Number of classes
num_hidden_layers = 2
model = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers)

# Define the criterion and optimizer
criterion = nn.KLDivLoss(reduction='batchmean')
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Update the trainer class
class SimpleMLPTrainer:
    def __init__(self, model, criterion, optimizer):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer

    def train(self, train_loader, num_epochs):
        self.model.train()
        train_losses = []
        for epoch in range(num_epochs):
            epoch_loss = 0
            for inputs, targets in tqdm(train_loader, desc=f'Epoch {epoch+1}/{num_epochs}'):
                outputs = self.model(inputs)
                # Prepare targets as one-hot encoded probabilities
                targets_prob = prepare_targets(targets, outputs.size(1))
                loss = self.criterion(outputs, targets_prob)
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()
                epoch_loss += loss.item()
            avg_loss = epoch_loss / len(train_loader)
            train_losses.append(avg_loss)
            print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}')
        return train_losses

    def evaluate(self, val_loader):
        self.model.eval()
        total_loss = 0
        correct = 0
        total_samples = 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                outputs = self.model(inputs)
                # Prepare targets as one-hot encoded probabilities
                targets_prob = prepare_targets(targets, outputs.size(1))
                loss = self.criterion(outputs, targets_prob)
                total_loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                correct += (predicted == targets).sum().item()
                total_samples += targets.size(0)
        avg_loss = total_loss / len(val_loader)
        accuracy = 100 * correct / total_samples
        print(f'Validation Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')
        return avg_loss, accuracy

# Create trainer instance
trainer = SimpleMLPTrainer(model, criterion, optimizer)

# Train the model
num_epochs = 20
trainer.train(train_loader, num_epochs)

# Evaluate the model
trainer.evaluate(val_loader)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 542.48it/s]


Epoch [1/20], Loss: -0.7262


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 536.13it/s]


Epoch [2/20], Loss: -2.2893


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 516.88it/s]


Epoch [3/20], Loss: -4.5896


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 536.01it/s]


Epoch [4/20], Loss: -8.0171


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 551.99it/s]


Epoch [5/20], Loss: -14.1495


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 490.83it/s]


Epoch [6/20], Loss: -23.5059


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 480.05it/s]


Epoch [7/20], Loss: -37.4686


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 478.01it/s]


Epoch [8/20], Loss: -56.5854


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 421.07it/s]


Epoch [9/20], Loss: -89.5649


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 493.40it/s]


Epoch [10/20], Loss: -140.3739


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 459.24it/s]


Epoch [11/20], Loss: -209.9956


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 490.14it/s]


Epoch [12/20], Loss: -299.4946


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 395.17it/s]


Epoch [13/20], Loss: -415.4422


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 463.19it/s]


Epoch [14/20], Loss: -562.4066


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 507.22it/s]


Epoch [15/20], Loss: -740.0564


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 462.60it/s]


Epoch [16/20], Loss: -959.0925


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 550.83it/s]


Epoch [17/20], Loss: -1220.6294


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 471.42it/s]


Epoch [18/20], Loss: -1520.2678


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 397.35it/s]


Epoch [19/20], Loss: -1877.7461


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 395.96it/s]


Epoch [20/20], Loss: -2269.9305
Validation Loss: -2315.3808, Accuracy: 39.16%


(-2315.3808349609376, 39.16083916083916)

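# Issue: `KLDivLoss` expects the model output to be log-probabilities. With ReLU at the last layer the outputs are unnormalized, non-negative scores, so with one-hot targets the loss reduces to the negative score of the true class and can decrease without bound. This is why the losses above are negative.

# Solution: Use LogSoftmax at the model output so that the input to `KLDivLoss` is a valid log-probability distribution, and pass the one-hot targets as the target distribution.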
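
The negative losses in the output above can be reproduced by hand. The following is a minimal pure-Python sketch of the pointwise formula `target * (log(target) - input)` that `KLDivLoss` is documented to use, with `input` treated as log-probabilities. The helper names `log_softmax` and `kl_div` are illustrative and are not PyTorch functions; batching and the `batchmean` reduction are left out.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax of a list of raw scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def kl_div(inputs, target):
    """Sum of target * (log(target) - input), taking 0 * log(0) as 0.

    `inputs` is assumed to hold log-probabilities, as KLDivLoss expects.
    """
    return sum(t * (math.log(t) - x) for x, t in zip(inputs, target) if t > 0)

target = [0.0, 1.0]       # one-hot target for class 1
relu_like = [0.0, 5.0]    # non-negative scores, NOT log-probabilities

# Feeding raw scores: the loss is just minus the score of the true class.
loss_raw = kl_div(relu_like, target)

# Feeding log-probabilities: the loss is a proper, non-negative KL divergence.
loss_log = kl_div(log_softmax(relu_like), target)

print(loss_raw, loss_log)
```

With raw scores, the optimizer can keep lowering the "loss" simply by inflating the score of the true class, which matches the steadily more negative values in the log. With log-probabilities the same formula is bounded below by zero.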

In [76]:
# Run with the LogSoftmax activation function at the output

# Modify the model to use LogSoftmax
class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_hidden_layers=1):
        super(SimpleMLP, self).__init__()
        layers = []
        # Input layer
        layers.append(nn.Linear(input_dim, hidden_dim))
        layers.append(nn.ReLU())
        # Hidden layers
        for _ in range(num_hidden_layers - 1):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.ReLU())
        # Output layer (LogSoftmax is applied separately in forward)
        layers.append(nn.Linear(hidden_dim, output_dim))
        self.hidden_layers = nn.Sequential(*layers)
        self.output_activation = nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = self.hidden_layers(x)
        x = self.output_activation(x)
        return x

# Define the prepare_targets function
def prepare_targets(targets, num_classes):
    # Ensure targets are LongTensors for indexing
    targets = targets.long()
    # Create a zero tensor of floats
    targets_one_hot = torch.zeros(targets.size(0), num_classes, device=targets.device, dtype=torch.float)
    # Use scatter_ to create one-hot encoding
    targets_one_hot.scatter_(1, targets.unsqueeze(1), 1.0)
    # No need to apply softmax to targets
    return targets_one_hot

# Define the model
input_dim = X_train.shape[1]
hidden_dim = 16
output_dim = 2  # Number of classes
num_hidden_layers = 2
model = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers)

# Define the criterion and optimizer
criterion = nn.KLDivLoss(reduction='batchmean')
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Update the trainer class
class SimpleMLPTrainer:
    def __init__(self, model, criterion, optimizer):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer

    def train(self, train_loader, num_epochs):
        self.model.train()
        train_losses = []
        for epoch in range(num_epochs):
            epoch_loss = 0
            for inputs, targets in tqdm(train_loader, desc=f'Epoch {epoch+1}/{num_epochs}'):
                outputs = self.model(inputs)
                # Prepare targets as one-hot encoded probabilities
                targets_prob = prepare_targets(targets, outputs.size(1))
                loss = self.criterion(outputs, targets_prob)
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()
                epoch_loss += loss.item()
            avg_loss = epoch_loss / len(train_loader)
            train_losses.append(avg_loss)
            print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}')
        return train_losses

    def evaluate(self, val_loader):
        self.model.eval()
        total_loss = 0
        correct = 0
        total_samples = 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                outputs = self.model(inputs)
                # Prepare targets as one-hot encoded probabilities
                targets_prob = prepare_targets(targets, outputs.size(1))
                loss = self.criterion(outputs, targets_prob)
                total_loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                correct += (predicted == targets).sum().item()
                total_samples += targets.size(0)
        avg_loss = total_loss / len(val_loader)
        accuracy = 100 * correct / total_samples
        print(f'Validation Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')
        return avg_loss, accuracy

# Create trainer instance
trainer = SimpleMLPTrainer(model, criterion, optimizer)

# Train the model
num_epochs = 20
trainer.train(train_loader, num_epochs)

# Evaluate the model
trainer.evaluate(val_loader)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 482.68it/s]


Epoch [1/20], Loss: 1.6738


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 531.76it/s]


Epoch [2/20], Loss: 0.7565


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 470.93it/s]


Epoch [3/20], Loss: 0.6342


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 505.17it/s]


Epoch [4/20], Loss: 0.6209


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 521.55it/s]


Epoch [5/20], Loss: 0.6147


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 531.67it/s]


Epoch [6/20], Loss: 0.6170


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 487.61it/s]


Epoch [7/20], Loss: 0.6120


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 481.11it/s]


Epoch [8/20], Loss: 0.6089


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 518.03it/s]


Epoch [9/20], Loss: 0.6031


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 498.13it/s]


Epoch [10/20], Loss: 0.6035


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 488.33it/s]


Epoch [11/20], Loss: 0.6047


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 425.60it/s]


Epoch [12/20], Loss: 0.5991


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 472.85it/s]


Epoch [13/20], Loss: 0.5995


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 361.90it/s]


Epoch [14/20], Loss: 0.6003


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 544.59it/s]


Epoch [15/20], Loss: 0.5933


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 458.08it/s]


Epoch [16/20], Loss: 0.5967


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 409.35it/s]


Epoch [17/20], Loss: 0.5976


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 355.78it/s]


Epoch [18/20], Loss: 0.5884


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 362.64it/s]


Epoch [19/20], Loss: 0.5966


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 381.67it/s]

Epoch [20/20], Loss: 0.5920
Validation Loss: 0.6278, Accuracy: 65.03%





(0.6278239727020264, 65.03496503496504)

Your reason for your choice:

<div>
**By using LogSoftmax at the output layer, the model emits log-probabilities, which is the input format that KLDivLoss expects. The targets are one-hot vectors, which are already valid probability distributions, so no Softmax is applied to them. Together these make the KL divergence well defined and non-negative.**
</div>

### 6. CosineEmbeddingLoss (`torch.nn.CosineEmbeddingLoss`)
- **Description:** Measures the cosine similarity between two input tensors, `x1` and `x2`, and computes the loss based on a label `y` that indicates whether the tensors should be similar (`y = 1`) or dissimilar (`y = -1`). Cosine similarity focuses on the angle between vectors, disregarding their magnitude.

- **Mathematical Function:**
\begin{equation}
  \text{CosineEmbeddingLoss}(x_1, x_2, y) =
  \begin{cases}
  1 - \cos(x_1, x_2), & \text{if } y = 1 \\
  \max(0, \cos(x_1, x_2) - \text{margin}), & \text{if } y = -1
  \end{cases}
\end{equation}
  where $\cos(x_1, x_2)$ is the cosine similarity between the two vectors, and `margin` is a threshold: a dissimilar pair contributes to the loss only when its cosine similarity exceeds the margin.

- **Use Case:** Commonly used in tasks like face verification, image similarity, and other scenarios where the relative orientation of vectors (angle) is more important than their length, such as in embeddings and metric learning.

- **Background:** Cosine similarity compares the directional alignment of vectors, making it ideal for high-dimensional data where the magnitude may not be as informative. This loss is particularly useful when training models to learn meaningful embeddings that capture semantic similarity.

You'll become more familiar with this loss function in the future.
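
As a quick illustration of the piecewise formula above, here is a minimal pure-Python sketch for a single pair of vectors. The function names are illustrative; PyTorch's `nn.CosineEmbeddingLoss` additionally handles batches and reduction, and this sketch assumes both vectors are non-zero.

```python
import math

def cosine_similarity(x1, x2):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return dot / (norm1 * norm2)

def cosine_embedding_loss(x1, x2, y, margin=0.0):
    """Loss for one pair: y = 1 means 'similar', y = -1 means 'dissimilar'."""
    cos = cosine_similarity(x1, x2)
    if y == 1:
        return 1.0 - cos
    return max(0.0, cos - margin)

a = [1.0, 0.0]
b = [2.0, 0.0]   # same direction as a, different length
c = [0.0, 3.0]   # orthogonal to a

# Pairs labelled as similar: aligned vectors cost nothing, orthogonal ones cost 1.
loss_similar_aligned = cosine_embedding_loss(a, b, y=1)
loss_similar_orthogonal = cosine_embedding_loss(a, c, y=1)

# Pairs labelled as dissimilar: only similarity above the margin is penalized.
loss_dissimilar_aligned = cosine_embedding_loss(a, b, y=-1, margin=0.5)
loss_dissimilar_orthogonal = cosine_embedding_loss(a, c, y=-1, margin=0.5)

print(loss_similar_aligned, loss_similar_orthogonal)
print(loss_dissimilar_aligned, loss_dissimilar_orthogonal)
```

Note that `a` and `b` differ in length but not in direction, so the loss treats them as identical. This is the sense in which the loss disregards magnitude.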

---

# Regularization in Machine Learning

## Introduction

Regularization is a fundamental technique in machine learning that helps prevent overfitting by adding a penalty to the loss function. This penalty discourages the model from becoming too complex, ensuring better generalization to unseen data. In this notebook, you will explore the concepts of regularization, understand different types of regularization techniques, and apply them using Python's popular libraries.

## What is Regularization?

Regularization involves adding a regularization term to the loss function used to train machine learning models. This term imposes a constraint on the model's coefficients, effectively reducing their magnitude. By doing so, regularization helps in:

- **Preventing Overfitting:** Ensures the model does not become too tailored to the training data.
- **Improving Generalization:** Enhances the model's performance on new, unseen data.
- **Feature Selection:** Especially in L1 regularization, it can drive some coefficients to zero, effectively selecting important features.
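
In symbols, the idea can be sketched as follows. The notation is introduced here for illustration only: $\mathcal{L}_{\text{data}}$ is the unregularized loss, $R(w)$ is the penalty on the weights $w$, and $\lambda \ge 0$ controls the penalty strength (it plays the role of `alpha` in scikit-learn).

\begin{equation}
\mathcal{L}_{\text{reg}}(w) = \mathcal{L}_{\text{data}}(w) + \lambda \, R(w)
\end{equation}

Setting $\lambda = 0$ recovers the unregularized loss, while larger values push the weights toward smaller magnitudes at the cost of a worse fit to the training data.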

## Types of Regularization

There are several types of regularization techniques, each imposing different constraints on the model's parameters:

### 1. L1 Regularization (Lasso)

L1 regularization adds the sum of the absolute values of the coefficients as a penalty term to the loss function. It can lead to sparse models where some feature coefficients are exactly zero.

### 2. L2 Regularization (Ridge)

L2 regularization adds the sum of the squared coefficients as a penalty term to the loss function. It shrinks all coefficients toward zero but generally does not set them exactly to zero.

### 3. Elastic Net

Elastic Net combines both L1 and L2 regularization penalties. It balances the benefits of both Lasso and Ridge methods, allowing for feature selection and coefficient shrinkage.
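
The three penalties can be compared on a small weight vector. This is a minimal sketch in plain Python; the function names, the example weights, and the `l1_ratio` mixing convention are assumptions made for illustration, and libraries differ in how they scale each term.

```python
def l1_penalty(weights):
    """Lasso penalty: sum of absolute values."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Ridge penalty: sum of squares."""
    return sum(w * w for w in weights)

def elastic_net_penalty(weights, l1_ratio=0.5):
    """Convex mix of the L1 and L2 penalties (one possible convention)."""
    return l1_ratio * l1_penalty(weights) + (1.0 - l1_ratio) * l2_penalty(weights)

weights = [0.5, -2.0, 0.0, 1.0]   # hypothetical model coefficients
alpha = 0.1                        # regularization strength
data_loss = 1.0                    # placeholder for the unregularized loss

l1 = l1_penalty(weights)
l2 = l2_penalty(weights)
en = elastic_net_penalty(weights, l1_ratio=0.5)

# The regularized objective adds the scaled penalty to the data loss.
regularized_loss = data_loss + alpha * l2

print(l1, l2, en, regularized_loss)
```

The large coefficient `-2.0` dominates the L2 penalty because it is squared, whereas the L1 penalty charges every unit of magnitude equally. That difference is one way to see why L1 tends to zero out small coefficients while L2 mainly reins in large ones.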

## Homework Time!
Import the Iris dataset from `sklearn.datasets` and apply ridge (L2) regularization with different alpha values. Then create a GIF that shows how the classification boundary changes with alpha.

Import the libs that you need and start coding!

In [54]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from PIL import Image
from io import BytesIO
import imageio
import warnings
from sklearn.neural_network import MLPClassifier

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

Load the Iris dataset and select Setosa and Versicolor classes

In [49]:
# 1. Load and Prepare the Iris Dataset
data = load_iris()
X, y = data.data, data.target
# Select only two classes for binary classification (Setosa and Versicolor)
X, y = X[y != 2], y[y != 2]
# Select two features for 2D visualization (Sepal Length and Petal Length)
X = X[:, [0, 2]]
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


Define Function to Plot Decision Boundary

In [50]:
def plot_decision_boundary(model, X, y, alpha):
    # Define the grid (use meshgrid)
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))

    # Predict over the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Create a figure
    fig, ax = plt.subplots(figsize=(6, 5))

    # Plot the decision boundary
    ax.contourf(xx, yy, Z, alpha=0.3, levels=[-0.1, 0.1, 1.1], colors=['blue', 'red'])

    # Scatter plot of the training data
    scatter = ax.scatter(
        X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k', s=50
    )

    # Title and labels
    ax.set_title(f'MLP Decision Boundary (alpha={alpha})')
    ax.set_xlabel('Sepal Length (standardized)')
    ax.set_ylabel('Petal Length (standardized)')

    # Remove tick marks for clarity
    ax.set_xticks([])
    ax.set_yticks([])

    # Tight layout
    plt.tight_layout()

    # Save the plot to a BytesIO object
    buf = BytesIO()
    plt.savefig(buf, format='png')
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf)

Train MLP with Varying Alpha Values and Collect Images

In [52]:
def create_decision_boundary_gif(alpha_values, X_train, y_train, n_neurons):

    # List to store images
    images = []

    for idx, alpha in enumerate(alpha_values):
        print(f"Processing alpha={alpha:.4f} ({idx + 1}/{len(alpha_values)})")

        # Create and train the MLP
        mlp = MLPClassifier(hidden_layer_sizes=(n_neurons,), alpha=alpha, max_iter=1000, random_state=42)
        mlp.fit(X_train, y_train)

        # Plot decision boundary and get the image
        img = plot_decision_boundary(mlp, X_train, y_train, alpha)
        images.append(img)

    # Save the images as a GIF
    gif_filename = 'mlp_classification_boundaries.gif'
    images[0].save(
        gif_filename,
        save_all=True,
        append_images=images[1:],
        duration=500,
        loop=0
    )

    print(f"GIF saved as '{gif_filename}'")

    # Return the GIF filename
    return gif_filename

## RUN

In [58]:

# Use np.logspace to generate alpha values, with at least 20 values
alpha_values = np.logspace(-3, 3, 20)
# Define the number of neurons in the hidden layer
n_neurons = 10

# Create the decision boundary GIF
gif_dir = create_decision_boundary_gif(alpha_values, X_train, y_train, n_neurons)

Processing alpha=0.0010 (1/20)
Processing alpha=0.0021 (2/20)
Processing alpha=0.0043 (3/20)
Processing alpha=0.0089 (4/20)
Processing alpha=0.0183 (5/20)
Processing alpha=0.0379 (6/20)
Processing alpha=0.0785 (7/20)
Processing alpha=0.1624 (8/20)
Processing alpha=0.3360 (9/20)
Processing alpha=0.6952 (10/20)
Processing alpha=1.4384 (11/20)
Processing alpha=2.9764 (12/20)
Processing alpha=6.1585 (13/20)
Processing alpha=12.7427 (14/20)
Processing alpha=26.3665 (15/20)
Processing alpha=54.5559 (16/20)
Processing alpha=112.8838 (17/20)
Processing alpha=233.5721 (18/20)
Processing alpha=483.2930 (19/20)
Processing alpha=1000.0000 (20/20)
GIF saved as 'mlp_classification_boundaries.gif'
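
The alpha values in the log above are evenly spaced on a logarithmic scale. The following pure-Python sketch reproduces that grid without NumPy; the helper name `logspace` is illustrative and only mirrors the `np.logspace(-3, 3, 20)` call for base 10.

```python
def logspace(start_exp, stop_exp, num):
    """Return `num` values from 10**start_exp to 10**stop_exp, evenly spaced in the exponent."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10.0 ** (start_exp + i * step) for i in range(num)]

alphas = logspace(-3, 3, 20)

# Consecutive values differ by a constant factor rather than a constant offset.
ratios = [alphas[i + 1] / alphas[i] for i in range(len(alphas) - 1)]

for a in alphas[:3]:
    print(f"{a:.4f}")
```

Each value is roughly 2.07 times the previous one, so the sweep covers six orders of magnitude of regularization strength in only 20 fits.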


Your gif should look like this:

<div style="text-align: center;">

### **Multilayer Perceptron Classification Boundaries**

![Classification Boundaries](mlp_classification_boundaries_example.gif)

*Figure 1: Demonstration of classification boundaries created by a Multilayer Perceptron (MLP) model.*

</div>

