In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# pip3 install torch torchvision torchaudio
import torch
import torch.nn as nn
from torchvision import datasets
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor

## 1. Activation functions

### ReLU

Activation functions are crucial components of neural networks because they introduce non-linear behavior. Without them, a network composed of stacked linear transformations would still behave like a single linear model, limiting its ability to capture complex patterns in the data.
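
The collapse of stacked linear maps can be checked numerically. The sketch below is only an illustration with arbitrarily chosen shapes and random weights (none of the variable names come from the exercise): two affine maps applied in sequence give the same result as one merged affine map, while placing a ReLU between them generally does not.

```python
import torch

torch.manual_seed(0)

# Two affine maps with arbitrary, illustrative shapes: 3 -> 4 -> 2
W1, b1 = torch.randn(4, 3), torch.randn(4)
W2, b2 = torch.randn(2, 4), torch.randn(2)
x = torch.randn(3)

# Applying both maps in sequence, with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The same computation expressed as a single affine map
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

collapsed = torch.allclose(two_layers, one_layer, atol=1e-5)
print("Stack equals single affine map:", collapsed)

# With a non-linearity in between, the merged map no longer applies in general
with_relu = W2 @ torch.relu(W1 @ x + b1) + b2
print("Output with ReLU in between:", with_relu)
```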

One of the most widely used activation functions today is the Rectified Linear Unit (ReLU), defined as

\begin{align}
\mathrm{ReLU}(x) = \begin{cases}
               0               & x<0\\
               x               & x\geq 0
               \end{cases}
\end{align}


**1.1 Find the derivative of the ReLU function, $\mathrm{ReLU}'(x)$. Carefully consider the two regions $x<0$ and $x\geq0$.** <br>

---

*Your answer here:*  

---
**1.2 We now want to implement the ReLU activation in code. Fill in the missing parts of the following class so that it computes both the activation and its derivative.**

In [None]:
class ReLU:
    @staticmethod
    def forward(x):
        return ...  # your code here

    @staticmethod
    def gradient(x):
        return ...  # your code here


### Sigmoid

One of the classical choices is the sigmoid function, which smoothly squashes any real-valued input into the range (0,1). This property made it popular in the early days of neural networks, particularly for binary classification problems, since its output can be interpreted as a probability.

\begin{align}
S(z) = \frac{1}{1 + e^{-z}}.
\end{align}

**1.3 Find the derivative of the Sigmoid function, $S'(z)$.** <br>

---

*Your answer here:*  

---
**1.4 We now want to implement the Sigmoid activation in code. Fill in the missing parts of the following class so that it computes both the activation and its derivative.**

In [None]:
class Sigmoid:
    @staticmethod
    def forward(x):
        return ...

    @staticmethod
    def gradient(x):
        return ...
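
A hand-derived derivative can be sanity-checked against a finite-difference approximation. The sketch below is one possible way to do this: `numerical_derivative` is a hypothetical helper (not part of the exercise or of PyTorch), and `tanh` is used purely as a stand-in function so that the answers to 1.1 and 1.3 are not given away.

```python
import math


def numerical_derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def tanh_derivative(x):
    """Analytic derivative of tanh, used here only as a stand-in example."""
    return 1.0 - math.tanh(x) ** 2


points = [-2.0, -0.5, 0.0, 1.0, 3.0]
max_error = max(
    abs(numerical_derivative(math.tanh, x) - tanh_derivative(x)) for x in points
)
print("Largest deviation:", max_error)
```

The same comparison works for your own `gradient` methods once they are filled in, as long as the test points stay away from inputs where the function is not differentiable.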

**1.5 For both activations, compute and plot the output and the derivative over the range defined below.**
- Hint: Use a **2 × 2 grid of subplots**.
- Hint: Label your axes, add **titles** for each subplot, and don’t forget **legends**.  

In [None]:
x = torch.linspace(-10, 10, 100)

fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# TODO: plot ReLU and Sigmoid functions and their derivatives

## 2. The Perceptron

The **perceptron** is the simplest model of a neuron.  
Given an input vector $x \in \mathbb{R}^d$, it computes:

\begin{align}
y = \phi\!\left(\sum_{i=1}^{d} w_i x_i + b\right),
\end{align}

where  
- $x_i$ are the inputs,  
- $w_i$ are the corresponding weights,  
- $b$ is the bias term,  
- $\phi(\cdot)$ is the activation function (e.g., sigmoid or ReLU).  
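
As a hand-worked illustration (the numbers are arbitrary, and ReLU is picked as $\phi$ purely for the example), take $d = 2$, $w = (1, -2)$, $x = (3, 1)$, and $b = 0.5$:

\begin{align}
y = \phi\big(1 \cdot 3 + (-2) \cdot 1 + 0.5\big) = \phi(1.5) = \max(0, 1.5) = 1.5
\end{align}

With $b = -2$ instead, the weighted sum becomes $-1$ and the ReLU output is $0$, which shows how the bias shifts the point at which the unit switches on.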

**2.1 Implement a `Perceptron` class with:**
   - a constructor that initializes weights and bias randomly,  
   - a `forward(x)` method that returns the activated output.

In [None]:
class Perceptron(torch.nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        # TODO: initialize weights and bias
        self.W = ...  
        self.b = ...

    def forward(self, x):
        # TODO: compute z = w·x + b
        # TODO: apply sigmoid activation
        return ...

**2.2 To verify your implementation, recreate the Perceptron using PyTorch's built-in `torch.nn` modules and check that both behave in the same way.**

PyTorch already provides layers for you:

- Look at `torch.nn.Linear` for the weight + bias computation.
- Don’t forget to apply an activation function afterwards — check out modules in `torch.nn`.
- You can combine them in sequence using `nn.Sequential`.

In [None]:
# Create both layers
in_features, out_features = 3, 1
my_layer = ...
TorchPerceptron = nn.Sequential(
    ...,
    ...
)

# Copy parameters from your layer to torch.nn.Linear
with torch.no_grad():
    TorchPerceptron[0].weight.copy_(my_layer.W)
    TorchPerceptron[0].bias.copy_(my_layer.b)

# Test input
x = torch.rand(in_features)

# Forward pass
# TODO: compute outputs from both layers

print("Difference:", ...)

## 3. The Linear Layer

A perceptron takes one input vector and produces a single output after applying a weighted sum, a bias, and an activation.  

A **linear layer** is simply a collection of multiple perceptrons stacked together.  
- Instead of one weight vector $w$, we now have a weight matrix $W \in \mathbb{R}^{m \times d}$.  
- Each row of $W$ corresponds to the weights of one perceptron.  
- The bias term becomes a vector $b \in \mathbb{R}^m$.  
- The output is a vector $y \in \mathbb{R}^m$:  

\begin{align}
y &= W x + b
\end{align}

where  
- $x$ is the input of dimension $d$,  
- $W$ applies $m$ linear combinations of the inputs,  
- $b$ shifts (translates) the result.

This is often called an **affine transformation**: a linear transformation plus a translation.

**3.1 Fill in the following code to implement your own Linear Layer class.**

In [None]:
class Linear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # TODO: initialize weight matrix W and bias vector b
        self.W = ...
        self.b = ...

    def forward(self, x):
        # TODO: implement y = Wx + b
        return ...

**3.2 To verify your implementation, recreate the Linear Layer using PyTorch's built-in `torch.nn` modules and check that both behave in the same way.**

In [None]:
# Create both layers
in_features, out_features = 3, 2
my_layer = ...
TorchLinearLayer = ...

# Copy parameters from your layer to torch.nn.Linear
with torch.no_grad():
    TorchLinearLayer.weight.copy_(my_layer.W)
    TorchLinearLayer.bias.copy_(my_layer.b)

# Test input
x = torch.rand(in_features)

# Forward pass
# TODO: compute outputs from both layers

print("Difference:", ...)

## 4. Multi-Layer Perceptron (MLP)
So far, you built:
- a Perceptron → single linear unit + nonlinearity
- a Linear layer → general affine transformation (x @ W^T + b)

These are the *building blocks* of neural networks.
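
The `x @ W^T + b` form in the recap is the batched version of $y = Wx + b$: when the inputs are stacked as rows, each row is multiplied by $W^T$. The sketch below illustrates this with `torch.nn.Linear`, which stores its weight with shape `(out_features, in_features)`; the sizes `d`, `m` and the batch size of 5 are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

d, m = 3, 2                      # illustrative input / output dimensions
layer = nn.Linear(d, m)

print(layer.weight.shape)        # torch.Size([2, 3]): one row per output unit
print(layer.bias.shape)          # torch.Size([2])

x = torch.rand(d)                # a single input vector
xB = torch.rand(5, d)            # a batch of 5 input vectors, one per row

y = layer(x)                     # shape (m,)
yB = layer(xB)                   # shape (5, m)

# Batched affine map written out by hand
manual = xB @ layer.weight.T + layer.bias
print(torch.allclose(yB, manual, atol=1e-6))
```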

A **Multi-Layer Perceptron (MLP)** is simply a stack of linear layers with nonlinear activations between them.

- The first layer transforms the input into a hidden representation.
- A nonlinear activation (e.g., ReLU, Sigmoid, Tanh) makes the model expressive.
- The next layer(s) take the hidden representation and produce outputs.

For layer $l$, the affine step is
\begin{align}
\mathbf{a}_{l} &= \mathbf{W}_{l} \mathbf{x}_{l-1} + \mathbf{b}_{l}
\end{align}

followed by the nonlinear activation
\begin{align}
\mathbf{x}_{l} &= f_{l}(\mathbf{a}_{l})
\end{align}

where:
- $f_l$: activation function of layer $l$.
- $\mathbf{a}_l$: pre-activation of layer $l$.
- $\mathbf{x}_l$: output of layer $l$ (with $\mathbf{x}_0$ the network input).
- $\mathbf{W}_{l}$: weights of layer $l$ (one row per unit, as in Section 3).
- $\mathbf{b}_{l}$: bias of layer $l$.

**4.1 Fill in the following code to implement your own MLP class.**

In [None]:
import torch
import torch.nn.functional as F

class MyMLP(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        D_in: input dimension
        H: hidden dimension
        D_out: output dimension
        """
        super().__init__()
        # TODO: initialize weights and biases
        return

    def forward(self, x):
        #TODO: implement the forward pass
        return ...

**4.2 To verify your implementation, recreate the MLP using PyTorch's built-in `torch.nn` modules and check that both behave in the same way.**

In [None]:
torch.manual_seed(0)
D_in, H, D_out = 3, 4, 1

manual = MyMLP(D_in, H, D_out)
TorchMLP = ...

# TODO: Copy weights so forward passes should match exactly


# Single example
x1 = torch.tensor([0.5, -1.0, 2.0])
y_manual_1 = manual(x1)
y_nn_1 = TorchMLP(x1)
print("single:", torch.allclose(y_manual_1, y_nn_1, atol=1e-7), y_manual_1, y_nn_1)

# Test input
xB = torch.randn(5, D_in)


# Forward pass
# TODO: compute outputs from both layers

print("All close:", torch.allclose(..., ..., atol=1e-7))

## 5. Mean Squared Error (MSE) Loss

The Mean Squared Error (MSE) measures how far predictions are from the true values by averaging the squared differences.

For $N$ predictions $\hat{y}_i$ and true labels $y_i$:

\begin{align}
\mathcal{L}_{\text{MSE}} &= \frac{1}{N} \sum_{i=1}^N (\hat{y}_i - y_i)^2
\end{align}
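
As a small hand-worked example (values chosen arbitrarily), take $N = 3$ with predictions $\hat{y} = (1, 2, 3)$ and labels $y = (1, 0, 5)$:

\begin{align}
\mathcal{L}_{\text{MSE}} &= \frac{1}{3}\left[(1-1)^2 + (2-0)^2 + (3-5)^2\right] = \frac{0 + 4 + 4}{3} = \frac{8}{3} \approx 2.67
\end{align}

The second prediction is too high by 2 and the third too low by 2, yet both contribute the same amount, because squaring removes the sign of the error.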

**5.1 Find the derivative of the MSE loss with respect to a prediction $\hat{y}_i$.** <br>

---

*Your answer here:*  

---
**5.2 We now want to implement the MSE loss in code. Fill in the missing parts of the following class so that it computes both the loss and its gradient with respect to the predictions.**

In [None]:
class MSE:
    @staticmethod
    def loss(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
        pass

    @staticmethod
    def gradient(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
        pass

**5.3 Compare against PyTorch’s built-in implementation. You can use `torch.nn.MSELoss` to compute the loss and gradients automatically.**

In [None]:
N, C = 5, 3
y_true = torch.rand(N, C)                      
y_pred = torch.rand(N, C, requires_grad=True)  

# --- Built-in PyTorch loss ---
loss_fn = ...
torch_loss = ...

# Backprop to get gradients
torch_loss.backward()
torch_grad = y_pred.grad.clone()

# --- Your implementation ---
manual_loss = ...
manual_grad = ...

# --- Compare ---
print("PyTorch loss:", torch_loss.item())
print("Manual loss:", manual_loss.item())

print("\nPyTorch gradient:\n", torch_grad)
print("Manual gradient:\n", manual_grad)

print("\nLoss close?  ", torch.allclose(torch_loss, manual_loss))
print("Grad close?  ", torch.allclose(torch_grad, manual_grad))

## 6. The Whole Pipeline: Training on FashionMNIST

### Loading the FashionMNIST Dataset

We will use **FashionMNIST**, a dataset of grayscale 28×28 images of clothing items (e.g., shirts, shoes, bags).  
It ships with `torchvision` and is downloaded automatically on first use.

#### Datasets and Dataloaders
- A **Dataset** object (like `datasets.FashionMNIST`) gives you access to the data samples and their labels.  
- A **DataLoader** wraps a dataset and helps you:
  - Load the data in **mini-batches** (instead of one sample at a time).  
  - **Shuffle** the data during training (good for generalization).  
  - Use multiple worker processes to speed up loading.  

In practice, you almost always combine a Dataset with a DataLoader when training models in PyTorch.
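
To see what a DataLoader actually yields, the sketch below wraps a tiny made-up dataset (10 samples with 4 features each, standing in for FashionMNIST) and collects the shape of every mini-batch. `TensorDataset` is used here only to keep the example free of downloads.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in data: 10 samples, 4 features each, labels 0..9
features = torch.arange(40, dtype=torch.float32).reshape(10, 4)
labels = torch.arange(10)
toy_dataset = TensorDataset(features, labels)

# Mini-batches of 4 samples, reshuffled every time the loader is iterated
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

batch_shapes = [(tuple(X.shape), tuple(y.shape)) for X, y in toy_loader]
for X_shape, y_shape in batch_shapes:
    print("features:", X_shape, "labels:", y_shape)
```

Since 10 is not a multiple of 4, the last batch is smaller than the others; passing `drop_last=True` would discard it instead.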


In [None]:
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

# Set up the DataLoaders; only the training data is shuffled
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64)

sample_idx = torch.randint(len(training_data), size=(1,)).item()
img, label = training_data[sample_idx]
plt.imshow(img.squeeze(), cmap="gray")
plt.title(f"Label: {label}")
plt.show()

**6.1 How many features does each sample have?**

**6.2 How many classes do we have in the dataset?**

---

*Your answers here:*  

---

### Choosing a Device (CPU or GPU)

Training deep learning models can be much faster on a GPU, if one is available.  
In PyTorch, we usually set a `device` variable so that both the model and the data can be placed consistently on either:

- **GPU (`"cuda"`)** → preferred for faster training when available  
- **CPU (`"cpu"`)** → always available, sufficient for small exercises  

For this exercise, using a GPU is **not required** — but it’s good practice to write code that supports both.


In [None]:
# Select GPU if available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Using {device} device")
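
Selecting a device does nothing by itself: the model and every input tensor still have to be moved there explicitly with `.to(device)`. A minimal sketch, using a small `nn.Linear` layer and random data as placeholders for the real model and batches:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

toy_model = nn.Linear(3, 2).to(device)   # moves the parameters to the device
x = torch.rand(5, 3).to(device)          # inputs must live on the same device

out = toy_model(x)
print(out.device)
```

If the model and its inputs sit on different devices, the forward pass raises a runtime error, so both moves are needed.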

### Defining a Neural Network in PyTorch

We now define a simple **feedforward neural network** for image classification.

- **`nn.Flatten()`** → flattens each 28×28 image in the batch into a 1D vector of length 784 (the batch dimension is kept).  
- **`nn.Sequential()`** → a container that runs layers in order. Here it includes:  
  1. A linear (fully connected) layer mapping from `28*28` inputs to a hidden dimension.  
  2. A **ReLU** activation function for nonlinearity.  
  3. Another linear layer mapping from the hidden dimension to 10 output classes.

The network returns **logits** (unnormalized scores for each class).
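
Logits can be turned into class probabilities with a softmax, and the predicted class is the index of the largest logit. The sketch below uses hand-picked logits for two samples and three classes (not actual model outputs) to show both steps.

```python
import torch

# Hand-picked logits: 2 samples, 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])

probs = torch.softmax(logits, dim=1)   # each row now sums to 1
predicted = logits.argmax(dim=1)       # index of the largest score per sample

print(probs)
print(predicted)                       # tensor([0, 2])
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits or of the probabilities gives the same prediction.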


In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self, in_features, hidden_features, out_features):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, out_features),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
    
# Initialize the model with the appropriate arguments based on our dataset and move it to the device.
# Hint: For hidden features you can use 128
model = ...

# Display the model architecture
print(model)

### Hyperparameters, Loss Function, and Optimizer

To train our neural network, we need to set a few key choices:

- **Learning rate**: controls how big each parameter update step is.  
- **Batch size**: number of samples processed together before updating weights.  
- **Epochs**: how many full passes we make over the training dataset.  

We also need:

- **Loss function**: measures how far the model’s predictions are from the true labels.  
  - Here we use **Cross-Entropy Loss**, the standard choice for multi-class classification.  
- **Optimizer**: updates model parameters using the gradients.  
  - Here we use **Stochastic Gradient Descent (SGD)** with the chosen learning rate.
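
`nn.CrossEntropyLoss` expects raw logits and integer class indices, and applies the log-softmax internally. The sketch below, with hand-picked logits and targets, compares it against the same quantity written out by hand: the mean negative log-probability of the correct class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hand-picked logits (2 samples, 3 classes) and their true class indices
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

builtin = nn.CrossEntropyLoss()(logits, targets)

# Same quantity by hand: log-softmax, pick the true class, negate, average
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(targets)), targets].mean()

print(builtin.item(), manual.item())
```

This is why the network above returns logits directly rather than applying a softmax in its `forward` method.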


In [None]:
# Hyperparameters
learning_rate = 1e-3
batch_size = 64
epochs = 10

# Loss function (for classification)
loss_fn = nn.CrossEntropyLoss()

# Optimizer (SGD with given learning rate)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

### Training and Evaluating a Neural Network

Once we have a dataset and a model, the next step is to **train** the model so that it can make accurate predictions.

#### Training
Training is the process of teaching the model to minimize a **loss function** by adjusting its parameters (weights and biases).  
This is done using an algorithm called **backpropagation** combined with an **optimizer** (such as SGD or Adam).
We will look more closely at backpropagation in upcoming sessions.

The training loop typically consists of:
1. **Forward pass** → feed the input through the model to get predictions.  
2. **Compute the loss** → measure how far predictions are from the true labels.  
3. **Backward pass** → compute gradients of the loss with respect to model parameters.  
4. **Update parameters** → use the optimizer to adjust weights and biases.  

Repeating this process over the dataset (for multiple **epochs**) gradually improves the model.
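
One detail of the backward pass is easy to miss: PyTorch adds newly computed gradients to whatever is already stored in `.grad`, which is why training loops reset the gradients before each backward pass. The sketch below shows this on a single scalar parameter with a toy loss $w^2$ (both chosen only for illustration).

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
(w ** 2).backward()
first = w.grad.item()

# Second backward pass without resetting: the new gradient is added on top
(w ** 2).backward()
accumulated = w.grad.item()

# Reset, then compute again: back to the correct value
w.grad.zero_()
(w ** 2).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)   # 4.0 8.0 4.0
```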

#### Evaluation
After training, we need to evaluate the model on **unseen data** (the test set).  
During evaluation:
- We disable gradient calculations (`torch.no_grad()`), since we are not training.  
- The model is set to **evaluation mode** (`model.eval()`), which is important for certain layers (e.g., dropout, batch normalization).  
- We measure **accuracy** and **average loss** to understand how well the model generalizes.
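
The effect of the two switches can be observed in isolation. The sketch below uses a standalone `nn.Dropout` layer and a throwaway `nn.Linear` layer (neither is part of the exercise model) to show that dropout is only active in training mode and that tensors computed under `torch.no_grad()` carry no gradient information.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()                 # training mode: entries are zeroed at random,
train_out = drop(x)          # the survivors are scaled by 1 / (1 - p) = 2

drop.eval()                  # evaluation mode: dropout is the identity
eval_out = drop(x)

print(train_out)
print(eval_out)

# Inside no_grad, results are not tracked for backpropagation
with torch.no_grad():
    y = nn.Linear(8, 1)(x)
print(y.requires_grad)       # False
```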


**6.3 Complete the training loop and train the model for 10 epochs.**

In [None]:
def train(dataloader, model, loss_fn, optimizer, losses=None):
    # Avoid a mutable default argument: a default list would be shared across calls
    if losses is None:
        losses = []
    model.train()  # set model to training mode

    for batch, (X, y) in enumerate(dataloader):
        # Move the batch to the same device as the model
        X, y = X.to(device), y.to(device)

        # Forward pass
        # TODO: perform the forward pass
        pred = ...

        # Compute the loss
        loss = loss_fn(pred, y)

        # Just for logging
        losses.append(loss.item())

        # Backpropagation
        optimizer.zero_grad() # 1. Reset gradients
        loss.backward()       # 2. Compute current gradients
        optimizer.step()      # 3. Update parameters

        if batch % 100 == 0:
            loss_val = loss.item()
            print(f"loss: {loss_val:>7f}")
    return losses

**6.4 Define a function to evaluate the model on the test dataset.**

In [None]:

def test(dataloader, model, loss_fn):
    """
    TODO:  Evaluate the model on the test dataset and return the accuracy and average loss.
    """
    ...

accuracy, loss = test(test_dataloader, model, loss_fn)

### Saving and Loading Models in PyTorch

After training a neural network, it’s important to save the learned parameters so we can reuse the model later without retraining from scratch.

In PyTorch, we typically save the model's **state dictionary** (`state_dict`), which maps each layer to its learnable parameters (weights and biases) and persistent buffers.  

```python
torch.save(model.state_dict(), "model_weights.pth")
```

To reuse a saved model:

1. Recreate the model architecture.
2. Load the saved state dictionary into it.
3. Evaluate or continue training as needed.

This way, the new model has the same parameters as the trained one.
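
The round trip can be tried on a very small scale. The sketch below saves the `state_dict` of a throwaway `nn.Linear` layer to an in-memory buffer (used instead of a file only to keep the example self-contained), loads it into a fresh layer of the same shape, and checks that both produce identical outputs.

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)

trained = nn.Linear(3, 2)            # stands in for a trained model

# Save the parameters to an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# A new instance has different random weights until the state is loaded
restored = nn.Linear(3, 2)
restored.load_state_dict(torch.load(buffer))

x = torch.rand(4, 3)
same = torch.equal(trained(x), restored(x))
print("Identical outputs:", same)
```

Loading only works when the new instance has the same architecture, since the keys and tensor shapes in the state dictionary have to match.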

In [None]:
# Save the trained model's parameters
torch.save(model.state_dict(), "model_weights.pth")


# Create a new (untrained) instance with the same constructor arguments as `model`
# (128 hidden features assumes you followed the hint above), then load the saved parameters
model_new = NeuralNetwork(28 * 28, 128, 10).to(device)
model_new.load_state_dict(torch.load("model_weights.pth", map_location=device))
test(test_dataloader, model_new, loss_fn)

## 7. MCQ

---

### 7.1 Activation Functions
Which of the following is the main purpose of using an activation function in a neural network?  

A. To increase the number of layers in the network  
B. To introduce non-linearity so the network can model complex functions  
C. To normalize the input data before training  
D. To reduce overfitting during training  

**Answer:** 

---

### 7.2 The Perceptron
A single perceptron can only represent:  

A. Any continuous function  
B. Non-linear decision boundaries  
C. Linear decision boundaries  
D. Polynomial functions  

**Answer:**

---

### 7.3 Linear Layer
In a linear (fully connected) layer with input dimension $d$ and output dimension $m$, the weight matrix $W$ has the shape:  

A. $(d \times m)$  
B. $(m \times d)$  
C. $(d \times d)$  
D. $(m \times m)$  

**Answer:** 

---

### 7.4 Loss Functions
The Mean Squared Error (MSE) loss between predictions $\hat{y}$ and targets $y$ is defined as:  

A. $\frac{1}{N}\sum_{i=1}^N |\hat{y}_i - y_i|$  
B. $\frac{1}{N}\sum_{i=1}^N (\hat{y}_i - y_i)^2$  
C. $\sum_{i=1}^N (\hat{y}_i - y_i)$  
D. $\max(\hat{y}_i, y_i)$  

**Answer:** 

---

### 7.5 Multi-Layer Perceptron (MLP)
Compared to a single perceptron, a multi-layer perceptron can:  

A. Only model linear functions  
B. Model more complex, non-linear functions  
C. Avoid the need for activation functions  
D. Train without using backpropagation  

**Answer:**

---

### 7.6 Training Procedure
Which of the following is the correct order of steps in one training iteration?  

A. Backward pass → Forward pass → Update weights  
B. Forward pass → Compute loss → Backward pass → Update weights  
C. Update weights → Forward pass → Compute loss → Backward pass  
D. Forward pass → Update weights → Compute loss → Backward pass  

**Answer:** 

---
