
## Exercise 20

---

Welcome to the 20th exercise, in which we will practice `CNN`s. First, we go over a couple of theoretical examples, and then we move on to actually training a `CNN` on the Fashion-MNIST dataset.

> Imports

In [4]:
import torch
from torchvision import datasets
from torchvision.transforms import ToTensor
from torch.utils.data import DataLoader
from torch import nn
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score

### 🏷 CNN Theory

---

> Quick recap

If you need a five-minute recap of the most important `CNN` concepts, I suggest you read my [note](https://www.notion.so/ludekcizinsky/NN-and-CNN-8c739d9486c142f28df7acb6d99ccccc#cf0004b507bd4f8da9ee5b1bcb1976fc).

> Convolutional layer (filters, padding, parameters), relation to FFNN

Let's consider a simple example where:
- as an input we take $32 \times 32$ grayscale image (i.e. we can represent it with a simple matrix)
- we run the image through three $5 \times 5$ filters
- we will use `no padding` around the image

Now, what would be the size of the outputted channels after we run the given filters on the grayscale image? That is fairly easy to compute:
- input dimension is $N = 32$
- filter dimension is $F = 5$
- we use no padding, therefore the input dimension $N$ will shrink by $F - 1$
- therefore, the output dimension $O$ will be:

$
O = N - (F - 1)
$

- more specifically in our case:

$
O = 32 - (5 - 1) = 28
$

Now, what if we want the output to have the same dimension as the input? Then we simply change perspective and solve for the input dimension $N$ that produces the desired output dimension $O$:

$$
\begin{aligned}
O &= N - (F - 1) \\
-N &= -O - (F - 1) \\
N &= O + (F - 1)
\end{aligned}
$$

So given the desired output size $O$ and filter size $F$, we can compute the needed input size $N$. In our particular case:

$
N = 32 + (5 - 1) = 36
$

Therefore, we need to add 4 pixels to each row and column of the input, i.e., 2 pixels on each side of the input image. Using this `padded` image, we obtain an output with the same dimension as the original input. **Note: I have assumed stride length to be 1.**
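
To double check the arithmetic above, here is a minimal sketch in plain Python. It uses the general formula $O = \lfloor (N + 2P - F)/S \rfloor + 1$, which reduces to $O = N - (F - 1)$ for padding $P = 0$ and stride $S = 1$. The helper names `conv_output_size` and `same_padding` are made up for this illustration and are not part of any library.

```python
def conv_output_size(n: int, f: int, padding: int = 0, stride: int = 1) -> int:
    """Output side length of a convolution along one dimension."""
    return (n + 2 * padding - f) // stride + 1


def same_padding(f: int) -> int:
    """Padding per side that keeps the size unchanged (stride 1, odd filter size)."""
    if f % 2 == 0:
        raise ValueError("this sketch only handles odd filter sizes")
    return (f - 1) // 2


print(conv_output_size(32, 5))                           # 28, no padding
print(same_padding(5))                                   # 2 pixels per side
print(conv_output_size(32, 5, padding=same_padding(5)))  # 32, same as the input
```

The sketch only covers odd filter sizes, since an even filter cannot be padded symmetrically to keep the size.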

Now, how many parameters do we actually have in this convolutional layer? Well, each filter has $25$ weights (recall we use $5 \times 5$ filters), and since we have three filters, we have 75 parameters in total. Here we ignore the bias terms; with one bias per filter the count would be 78.
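
The count above can be reproduced with a few lines of Python. This is only a sketch: `conv_params` is a hypothetical helper, and the `bias` flag reflects the assumption that a layer may add one bias per filter.

```python
def conv_params(in_channels: int, out_channels: int, kernel_size: int,
                bias: bool = False) -> int:
    """Number of trainable parameters of a square-kernel convolutional layer."""
    weights = in_channels * kernel_size * kernel_size * out_channels
    return weights + (out_channels if bias else 0)


# One grayscale input channel, three 5x5 filters
print(conv_params(1, 3, 5))             # 75
print(conv_params(1, 3, 5, bias=True))  # 78
```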

Last but not least, why do we actually need a convolutional layer? Well, the input image has 1024 pixels. If we just flattened these and fed them to our `FFNN`, we would lose a lot of information, especially about the spatial arrangement of the pixels. Therefore, the core idea behind the `convolutional layer` is to have the model learn some local patterns from the image and THEN feed these patterns as an input to our `FFNN`.

> CNN and memory consumption (point: we actually need pooling layers)

Now, we will go through a short thought experiment about a CNN's hardware utilization, more specifically its RAM usage. Let's assume the following for our `CNN`:
- for any convolutional layer described below, we always use $3 \times 3$ filters
- when convolving, we use a stride of 2 and the 'same' padding strategy, i.e., the output dimension equals the input dimension divided by the stride (rounded up)
- we have three convolutional layers where
    - first outputs 100 channels
    - second outputs 200 channels
    - and third outputs 400 channels
- as an input we take RGB images of resolution $200 \times 300$ pixels

First, we want to answer the following question:

> How much RAM will be required when computing a single sample pass through this `CNN` assuming we use 32-bit floats to represent any parameter/value?

This question can be broken down into two parts:
- how much RAM do we need to store the model's parameters?
- how much RAM do we need to store the input $x$ and then the outputted channels?

Let's start with the first subquestion by computing the number of parameters for each convolutional layer:
- **Convolutional layer 1:** Here we take the RGB input and transform it into 100 new channels. To obtain each of these channels, we need one $3 \times 3$ filter per input channel, i.e. 3 filters, so per output channel we have $3 \times 3 \times 3 = 27$ parameters. In total, we need to train **2'700 parameters** for the first convolutional layer.

- **Convolutional layer 2:** Here we take as an input $100$ channels, therefore for each output channel, we need $100 \times 3 \times 3 = 900$ parameters. Since we have $200$ output channels, in total we need to train **180'000** parameters.

- **Convolutional layer 3:** Here we take as an input $200$ channels, therefore for each output channel, we will need $200 \times 3 \times 3 = 1800$ parameters. So, in total, we will need to train **720'000 parameters**.

Therefore, summing over all layers, our network has **902'700 trainable parameters** (bias terms are not counted). Assuming each of these is represented as a $32$-bit float, i.e. 4 bytes, all these parameters take $3'610'800$ bytes, which is $3'610'800/1000/1000 \approx 3.61$ MB.
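
The parameter count and its memory footprint can be checked with a short script. It is a sketch under the same assumptions as the text (three $3 \times 3$ convolutional layers, no bias terms, 4 bytes per value); the variable names are my own.

```python
BYTES_PER_FLOAT32 = 4
KERNEL_SIZE = 3

# (input channels, output channels) for the three convolutional layers
layers = [(3, 100), (100, 200), (200, 400)]

params_per_layer = [c_in * KERNEL_SIZE * KERNEL_SIZE * c_out for c_in, c_out in layers]
total_params = sum(params_per_layer)
param_bytes = total_params * BYTES_PER_FLOAT32

print(params_per_layer)                    # [2700, 180000, 720000]
print(total_params)                        # 902700
print(param_bytes)                         # 3610800
print(param_bytes / 1000 ** 2)             # 3.6108 (MB, 1 MB = 1000^2 bytes)
print(round(param_bytes / 1024 ** 2, 2))   # 3.44 (MiB, 1 MiB = 1024^2 bytes)
```

The last two lines show that the result depends on whether we divide by $1000^2$ or $1024^2$, so it is worth stating which unit is meant.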

Now, we need to answer the second subquestion, which is how much memory do the actual channels take? Again, let's go layer by layer:

- **Convolutional layer 1:** we take as an input a 3D array with $3 \times 200 \times 300$ values in total. As an output, we obtain 100 channels, but of what dimension? On each of the image channels, we apply a $3 \times 3$ filter with stride 2 (i.e. we skip every other pixel in each direction, row- and column-wise). This means that for each input channel (RGB) we compute a $100 \times 150$ output and then sum these three outputs into a single output of the same size. Per output channel we therefore need $(3 + 1) \times 100 \times 150$ values, which means going through the first convolutional layer requires:

$
(3 + 1) \times 100 \times 150 \times 100 = 6'000'000 \text{ values}
$

- **Convolutional layer 2:** similarly for this layer:

$
(100 + 1) \times 50 \times 75 \times 200 = 75'750'000 \text{ values}
$

- **Convolutional layer 3:** and finally for the last layer, where the 75 columns shrink to $\lceil 75/2 \rceil = 38$ because 'same' padding with stride 2 rounds up:

$
(200 + 1) \times 25 \times 38 \times 400 = 76'380'000 \text{ values}
$

So in total, if we wanted to fit all values into main memory, we would need:

$
((6'000'000 + 75'750'000 + 76'380'000)\times 4)/1000/1000 \approx 632.5 \text{ MB}
$

Therefore, to answer our initial question completely, we add the $3'610'800$ bytes ($\approx 3.6$ MB) of parameters and find that we would roughly need **636 MB of memory**.

Recall that we have so far assumed a single sample/image as an input. Another question might be:

> how much memory will we actually need if we use 50 images as an input?

This is fairly straightforward, since the channel values are needed once per image while the parameters are shared:

$
632.5 \times 50 + 3.6 \approx 31'629 \text{ MB} \approx 31.6 \text{ GB}
$
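
The whole memory estimate can be sketched in plain Python. Two assumptions are baked in and labeled in the comments: 'same' padding with stride 2 rounds the size up (so 75 columns become 38), and the counting model is the one used in the text, i.e. one intermediate map per input channel plus the summed result for every output channel. The names `same_padding_output` and `total_memory_mb` are hypothetical.

```python
import math

BYTES_PER_FLOAT32 = 4
STRIDE = 2


def same_padding_output(n: int, stride: int) -> int:
    # Assumption: 'same' padding gives ceil(n / stride) outputs per dimension
    return math.ceil(n / stride)


height, width, in_channels = 200, 300, 3
out_channels_per_layer = [100, 200, 400]

values_per_layer = []
for out_channels in out_channels_per_layer:
    height = same_padding_output(height, STRIDE)
    width = same_padding_output(width, STRIDE)
    # Counting model from the text: per output channel we hold one map per
    # input channel plus the summed output map
    values_per_layer.append((in_channels + 1) * height * width * out_channels)
    in_channels = out_channels

activation_mb = sum(values_per_layer) * BYTES_PER_FLOAT32 / 1000 ** 2
param_mb = 902_700 * BYTES_PER_FLOAT32 / 1000 ** 2  # parameters are shared across images


def total_memory_mb(batch_size: int) -> float:
    return activation_mb * batch_size + param_mb


print(values_per_layer)                       # [6000000, 75750000, 76380000]
print(round(total_memory_mb(1), 1))           # 636.1 (MB)
print(round(total_memory_mb(50) / 1000, 1))   # 31.6 (GB)
```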

With all of this being said, it feels like our CNN is missing something: we want to reduce the memory consumption where possible, because otherwise it would be fairly difficult to train CNNs with such architectures. This is where pooling layers come into the picture. Pooling layers reduce the dimensions of the output channels by summarizing small neighbourhoods of values (e.g. by taking their maximum or average), which shrinks each channel by a constant factor.
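
To make the idea of summarizing concrete, here is a minimal sketch of $2 \times 2$ average pooling with stride 2 on a single channel stored as a nested list. The function `avg_pool_2x2` is written for this illustration only and assumes an even height and width.

```python
def avg_pool_2x2(channel):
    """2x2 average pooling with stride 2 on a 2D list (even height and width assumed)."""
    pooled = []
    for r in range(0, len(channel), 2):
        row = []
        for c in range(0, len(channel[0]), 2):
            window = [channel[r][c], channel[r][c + 1],
                      channel[r + 1][c], channel[r + 1][c + 1]]
            row.append(sum(window) / 4)
        pooled.append(row)
    return pooled


channel = [
    [1, 3, 2, 0],
    [5, 7, 4, 2],
    [0, 2, 8, 6],
    [4, 6, 2, 0],
]
pooled = avg_pool_2x2(channel)
print(pooled)  # [[4.0, 2.0], [3.0, 4.0]]
```

The $4 \times 4$ channel with 16 values becomes a $2 \times 2$ channel with 4 values, so the memory needed for this channel drops by a factor of 4.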

> Section summary

In this section, we went over the core CNN concepts:

- how to use **filters**
- how to use **padding**, depending on whether or not we want to keep the output dimension the same as the input dimension
- compute the **number of parameters** for given CNN architecture
- role of **stride length** in terms of output dimension
- role of **pooling layer**

You should be able to explain all of these after going through this section. As a bonus, it would be nice if you could also make a simple example showing memory consumption of the given CNN architecture.

### 🏷 Practical example

---

> Intro to the problem

The purpose of this exercise is to train a model that predicts one of the 10 clothing labels defined in the [Fashion-MNIST dataset](https://github.com/zalandoresearch/fashion-mnist). More specifically, we divide this section into two major parts:

- construct a model using `PyTorch` - we will implement [LeNet architecture](LeNet.pdf)
- construct a pipeline that will train the model and then show its performance on test data

> Constructing the model

The LeNet architecture is implemented in the `__init__` method. The rest of the methods help us define the pipeline down below.
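
Before reading the class, it can help to trace how the image size changes through the convolutional part. The sketch below assumes the layer settings used in the class that follows (two $5 \times 5$ convolutions, each followed by $2 \times 2$ average pooling) and the $28 \times 28$ grayscale Fashion-MNIST input; `conv_out` is a hypothetical helper, not a PyTorch function.

```python
def conv_out(n, kernel, padding=0, stride=1):
    """Output side length of a convolution or pooling step along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


size, channels = 28, 1  # Fashion-MNIST images are 28x28 grayscale

# (description, output channels or None to keep them, kernel, padding, stride)
steps = [
    ("conv 5x5, padding 2", 6, 5, 2, 1),
    ("avg pool 2x2", None, 2, 0, 2),
    ("conv 5x5, no padding", 16, 5, 0, 1),
    ("avg pool 2x2", None, 2, 0, 2),
]

for name, out_channels, kernel, padding, stride in steps:
    size = conv_out(size, kernel, padding, stride)
    channels = out_channels or channels
    print(f"{name:<22} -> {channels}@{size}x{size}")

flattened = channels * size * size
print(flattened)  # 400
```

The final value explains why the first dense layer takes 400 inputs: after the second pooling step we have 16 channels of size $5 \times 5$.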

In [5]:
class NeuralNetwork(nn.Module):
    def __init__(self, **INPUT):

        # Inherit from nn.Module
        super(NeuralNetwork, self).__init__()

        # Define activation function
        self.af = INPUT.get('af')

        # Define drop ratio
        self.dpr = INPUT.get('dpr')

        # Define neural network architecture
        self.nn = nn.Sequential(
            # C1: 6@28x28
            nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(6),
            self.af(),
            nn.AvgPool2d(kernel_size=2, stride=2),

            # C2: 16@10x10
            nn.Dropout(self.dpr),
            nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(16),
            self.af(),
            nn.AvgPool2d(kernel_size=2, stride=2),

            # Apply flattening on the output
            nn.Flatten(),

            # Dense part
            # * L1
            nn.Dropout(self.dpr),
            nn.Linear(400, 120),
            nn.BatchNorm1d(120),
            self.af(),

            # * L2
            nn.Dropout(self.dpr),
            nn.Linear(120, 84),
            nn.BatchNorm1d(84),
            self.af(),

            # * L3
            nn.Dropout(self.dpr),
            nn.Linear(84, 10)

        )

        # Define batch size
        self.batch_size = INPUT.get('batch_size')

        # Define datasets
        self.training = DataLoader(
            INPUT.get('trd'), batch_size=self.batch_size, shuffle=True)
        self.validation = DataLoader(
            INPUT.get('vd'), batch_size=self.batch_size, shuffle=True)
        self.test_data = DataLoader(
            INPUT.get('ted'), batch_size=len(INPUT.get('ted')), shuffle=True)

        # Define loss function
        self.loss_fn = INPUT.get('loss_fn')

        # Define learning rate
        self.lr = INPUT.get('lr')

        # Define number of epochs
        self.epochs = INPUT.get('epochs')

        # Define optimizer
        self.optimizer = INPUT.get('optim')(self.parameters(), lr=self.lr)

        # Save training progress
        self.loss_history = []
        self.acc_history = []

    def forward(self, x):
        logits = self.nn(x)
        return logits

    def train_loop(self):

        size = len(self.training.dataset)
        for batch, (X, y) in enumerate(self.training):

            # Compute prediction and loss
            pred = self.forward(X)
            loss = self.loss_fn(pred, y)

            # Backpropagation
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

            if batch % 100 == 0:
                loss, current = loss.item(), batch * len(X)
                print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

    def val_loop(self):
        size = len(self.validation.dataset)
        num_batches = len(self.validation)
        test_loss, correct = 0, 0

        with torch.no_grad():
            for X, y in self.validation:
                pred = self.forward(X)
                test_loss += self.loss_fn(pred, y).item()
                correct += (pred.argmax(1) == y).type(torch.float).sum().item()

        test_loss /= num_batches
        correct /= size
        print(
            f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

        # Save it to history
        self.acc_history.append(correct)
        self.loss_history.append(test_loss)

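    def visualize(self):
        # One point per completed epoch
        x = list(range(1, len(self.acc_history) + 1))
        plt.plot(x, self.acc_history, label="validation accuracy")
        plt.plot(x, self.loss_history, label="validation loss")
        plt.xlabel("epoch")
        plt.legend()
        plt.show()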

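    def fit(self):

        for t in range(self.epochs):
            print(f"Epoch {t+1}\n-------------------------------")
            # Dropout and batch norm behave differently in training and
            # evaluation. `self.training` holds a DataLoader in this class, so
            # we switch modes on the inner Sequential instead of self.train()
            self.nn.train()
            self.train_loop()
            self.nn.eval()
            self.val_loop()
        print("Done!")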

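    def predict(self, x):
        # No gradients are needed for inference
        with torch.no_grad():
            logits = self.forward(x)
        # Softmax is monotonic, so the argmax of the logits is the predicted class
        return logits.argmax(1)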

    def test(self):

        # Get data
        X, y = next(iter(self.test_data))

        # Predict values
        y_hat = self.predict(X)

        print("Accuracy score for test data")
        print("-"*60)
        print(f"Acc: {accuracy_score(y, y_hat)*100} %")
        print()
        print("Confusion matrix for test data")
        print("-"*60)
        print(confusion_matrix(y, y_hat))
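
To see what the two reported metrics mean, here is a small plain-Python sketch on made-up labels. It follows the convention that rows are true labels and columns are predicted labels; `confusion_counts` and `accuracy` are hypothetical helpers written for this illustration, not the `sklearn` functions used in `test`.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Count matrix with true labels as rows and predicted labels as columns."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels for three classes (made up for this example)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

m = confusion_counts(y_true, y_pred, 3)
print(m)                      # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(round(accuracy(m), 3))  # 0.667
```

Off-diagonal entries show which classes get confused with each other, which the single accuracy number hides.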

> Pipeline

Here, we simply get our training and test data. Then train the model and finally show its performance.

In [6]:
print("-"*60)
print("[0] Loading data")
print("-"*60)

# Get training
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

# Split training into training and validation
g_cpu = torch.Generator()
g_cpu.manual_seed(3)
# Pass the seeded generator so that the split is reproducible
training_data, val_data = torch.utils.data.random_split(
    training_data, [50000, 10000], generator=g_cpu)

# Get test data
test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=False,
    transform=ToTensor()
)
print("-"*60)
print("[1] Loading dataset done")
print("-"*60)

print("-"*60)
print("[2] Initialize model")
print("-"*60)
INPUT = {
    'batch_size': 100,
    'trd': training_data,
    'vd': val_data,
    'ted': test_data,
    'loss_fn': nn.CrossEntropyLoss(),
    'lr': 1e-1,
    'epochs': 5,
    'af': nn.Sigmoid,
    'optim': torch.optim.Adam,
    'dpr': 1e-3

}
model = NeuralNetwork(**INPUT)

print("-"*60)
print("[3] Model initialized and ready to be trained")
print("-"*60)

print("-"*60)
print("[4] Started to train model")
print("-"*60)
model.fit()
print("-"*60)
print("[5] Model trained successfully")
print("-"*60)

print("-"*60)
print("[6] Predicting values for test data")
print("-"*60)
model.test()

print("-"*60)
print("[7] End of the pipeline.")
print("-"*60)

------------------------------------------------------------
[0] Loading data
------------------------------------------------------------
------------------------------------------------------------
[1] Loading dataset done
------------------------------------------------------------
------------------------------------------------------------
[2] Initialize model
------------------------------------------------------------
------------------------------------------------------------
[3] Model initialized and ready to be trained
------------------------------------------------------------
------------------------------------------------------------
[4] Started to train model
------------------------------------------------------------
Epoch 1
-------------------------------
loss: 2.398207  [    0/50000]
loss: 0.575191  [10000/50000]
loss: 0.569845  [20000/50000]
loss: 0.622981  [30000/50000]
loss: 0.350341  [40000/50000]
Test Error: 
 Accuracy: 83.8%, Avg loss: 0.446889 

Epoch 2
----

> Section summary

After going through this section, you should be able to solve a more complex machine learning problem using `CNN` built with `PyTorch`. If you have done your work well enough, I believe this is a nice project to show to a potential employer when proving that you can work with `PyTorch`. 🙏
