### Chapter 3 - Computer Vision

**This week's exercise has 3 tasks, for a total of 10.5 points. Don't forget to submit your solutions to GitHub!**

In this chapter, we want you to become proficient at the following tasks:
- Building core components of modern PyTorch models
- Assembling modern PyTorch models from components
- Training a modern model on a real-world task and achieving passable results

**Note**: Since you have already proven that you are capable of creating the core components of a typical training loop yourself, we will provide some utility code for this section. This is done so that you can focus on the important parts of this lesson, and to help us debug your code in case you need help.

#### Chapter 3.1 - Data Augmentation

**What is data augmentation?**
Have you ever lost your glasses and then squinted, or tried to look through a rainy window? Or looked at a false color image, maybe a forest where the trees are blue and the sky green? You can usually make an educated guess what you are looking at, even though the image you see is different from usual. This is, essentially, what data augmentation is. It's the same data, still recognizable, but slightly altered in some way.

**What is that useful for?**
Let me begin with an anecdote that you've probably heard in the lectures. Say you have pictures of cats and dogs, and want your model to tell the two apart. How many people do you know who go to the park with their dogs? I imagine many. Hence, many images of dogs show them lying on the grass. The same is generally untrue for cats, at least I have never heard of anyone walking their cat to the park. At any rate, here is what happens when I train a neural network on these images: The model takes a shortcut. It sees a lot of green and the correct answer for these pictures is always "Dog". It learns "Green = Dog". This is what we call overfitting. We have overfitted to the existence of green in the background as a quintessential part of what makes a dog. Sometimes, we get away with this, if our data always has this correlation.

Now I get some new data. A bunch of people have taken pictures of their cats, sunbathing on the terrace. The garden is in the background. Lots of green. The model, in its infinite wisdom, will at first guess that these images are of dogs. Clearly, our model's ability to tell apart cats and dogs has not generalized to this new dataset.

So how can we prevent the model from taking shortcuts and encourage learning information that generalizes? We force these generalizations in training. If I gave you an image of a dog, but the grass was brown, and the dog green, you could still identify it as a dog, instead of a cat, right? And so should the model, if we can manage it. So let's also make it train using pictures of cats and dogs where the colors are different or removed. Suddenly, the shortcut solution is no longer useful, and the model must rely on shape, texture, or contextual information like a leash. The color change described here is a practical and useful data augmentation that is used in state-of-the-art image recognition.

In addition to color changes, there is a myriad of other techniques, such as cropping, image rotation or flipping, edge filters, solarization, random noise, and many, many more. Basically, anything that you believe may eventually show up in testing data and that you want the model to generalize to, can be made into a corresponding data augmentation.

**How do we use data augmentations in practice?**
There are two ways of adding data augmentation during training: you can either implement it inside your dataset, so that it only returns augmented image tensors, or apply it right before feeding your image tensors into your model. Both options are acceptable and come with advantages and disadvantages, although the more common way is to separate dataset and augmentations. We also showcase the native PyTorch way of augmenting data below.

If you are particularly eager, or want to try your hand at making image augmentation functions yourself, it can be fun and is definitely good practice. However, PyTorch comes with a large selection of image augmentations right out of the box, and in the following chapter, we will look at how to make use of them.

In [1]:
import torch
import torch.nn as nn
import torchvision.transforms as tt
import torchvision.transforms.functional as ttf

# Torchvision contains two ways of utilizing transforms:
# Functional and Composed.

# Functional does what it advertises - it is a function which
# you can use on your tensors. Here is an example which performs
# a center crop:

dummy_images = torch.rand((16, 1, 256, 256))
transformed_images = ttf.center_crop(img = dummy_images, output_size = (128, 128))
print(transformed_images.size())

# Functional transforms have the inherent advantage of giving the
# user very fine-grained control.

torch.Size([16, 1, 128, 128])


In [2]:
# The alternative is the so-called Composed form, which uses
# classes to achieve the same result. We make a Composed Transform
# like so:

dummy_images = torch.rand((16, 1, 256, 256))
transforms = tt.Compose([
    tt.RandomCrop(size = (128, 128)),
    tt.RandomHorizontalFlip()
])
transformed_images = transforms(dummy_images)
print(transformed_images.size())

# As you can see, Compose offers us the option of sequentially
# executing multiple transformations in a single line of code.
# We also get the option of using randomized augmentations,
# where the randomization is already done for us.

# In practice, either style of writing transformations is fine.
# In fact, they are equivalent: the transform classes call the
# functional versions under the hood, and Compose merely chains
# them. In the case of randomized augmentations, the class handles
# all the randomizing and then calls the functional transform with
# the random inputs.

torch.Size([16, 1, 128, 128])


**Task 1 (1+1 points)**: A complete list of Torchvision's available transforms can be found here: https://docs.pytorch.org/vision/0.9/transforms.html. Consider the task we are working on right now - working with CT images from the LiTS 2017 dataset. Which data augmentations strike you as a good idea to add to our training **(1 point)**? Which do you think are a bad idea or cannot work at all **(1 point)**? Are there any which are missing in Torchvision? If you don't know what they do, try them out and judge for yourselves. Can you think of other image types with other physics behind them? Are the rules for them going to be different?

There are no definitely correct or incorrect answers here. The goal for this task is for you to be able to argue your case convincingly (to us) and think closely about your dataset. You can test your assumptions when completing the other tasks.

Good augmentations:

- Small rotations and translations (shifts)
- ROI crop (not available in Torchvision)
- Zoom / change of the field of view (organ sizes vary slightly)
- Minimal intensity variation (not available in Torchvision) or simulated noise

Bad augmentations:

- Color jitter (there are no color channels, so it cannot work at all)
- Posterize (only small intensity differences are available to tell structures apart)
- Grayscale (the images are already grayscale)
- Flips (anatomically incorrect)
- Strong intensity variation (the intensities are needed to tell structures apart)
- Random crop (too many images without a lesion)
- Distortions/shear (anatomically incorrect)

Other physics:

- MRI (intensities are more arbitrary and depend on the pulse sequence / the model must be trained on intensity fluctuations)
- Ultrasound (images are always distorted / very different FoV / more noise)

#### Chapter 3.2 - Regularization Techniques

While there are multiple definitions or guidelines for what regularization is supposed to do (see lectures), in terms of practical concerns, all regularization techniques have the same aim, expressed through different means: Improving some aspect of the performance of your deep learning models. We differentiate them, broadly, from data augmentations, because regularization techniques generally concern themselves with the learning process, e.g. loss function modifications, learning rate optimizations, temporary model modifications, etc., and *not* the underlying data in our model training.

There are a number of different strategies, far too many to list all of them here, but a few particularly successful ones have made it into common use - so much so that they are more prevalent than regularization-less, "vanilla" deep learning. These fall into different groups, briefly discussed below.

#### Additional Loss Components

The loss function for any given modern optimization task is typically continuous, though not smooth everywhere, and the models we optimize are heavily overparameterized. As a consequence, there are many different parameter configurations in a model that result in the same train-time loss. Not all of these express the same behavior during training or testing, however. When we modify our loss to penalize certain training behaviors, we allow the training process to select for models and parameters that generalize better, converge to a solution faster, etc., despite often expressing the same training loss. Let's look at some examples that should be familiar from the lecture:

**L1 Loss** - Often also called LASSO, L1 Loss is a penalty term added to the normal loss during training, which is defined as:
$L_{LASSO} = \sum_{p=1}^{P} |\Theta_{p}|$. The penalty grows linearly with the magnitude of each parameter. This, in turn, forces the model to use fewer and smaller weights - relying on more weights than it needs, and thus probably overfitting, is disincentivized. Similarly, it keeps the weights of our model near zero, which we already know is generally a preferable area for model activations to stick to.

**L2 Loss** - L2 Loss is the more popular cousin of the L1 Loss, which does approximately the same thing, except the penalty is equal to the sum over all squared parameter magnitudes: $L_{L2} = \sum_{p=1}^{P} |\Theta_{p}|^2$. The reasoning behind it is similar, but it has seen far more practical adoption.
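
As a minimal sketch of how such penalty terms can be added to a loss by hand: the toy linear model, the random data, and the weighting factors `l1_lambda` and `l2_lambda` are assumptions made for illustration only and are not part of the exercise code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for a real model and a real batch (illustrative only).
model = nn.Linear(4, 3)
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))

# Hypothetical weighting factors that control the strength of each penalty.
l1_lambda = 1e-4
l2_lambda = 1e-4

data_loss = criterion(model(inputs), targets)

# Sum of absolute values (L1) and of squares (L2) over all parameters.
l1_penalty = sum(p.abs().sum() for p in model.parameters())
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())

total_loss = data_loss + l1_lambda * l1_penalty + l2_lambda * l2_penalty
total_loss.backward()

print(f"data loss: {data_loss.item():.4f}, total loss: {total_loss.item():.4f}")
```

In practice usually only one of the two penalties is used, and its weighting factor is treated as a hyperparameter.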

**Weight Decay** - An operation that effectively performs the same duty, weight decay reduces the magnitude of weights after each backward pass, for example by subtracting a small constant or multiplying with a factor. In essence, this eliminates parameters which are rarely "used" and were thus likely involved in overfitting on a small amount of data anyway. Parameters that are regularly updated (and therefore probably useful) are pulled back by their gradients and tend to remain near their optimal value despite weight decay. Interestingly, when implemented as a multiplication with a factor and combined with plain SGD, Weight Decay is mathematically equivalent to L2 Loss in terms of net parameter updates. For adaptive optimizers such as Adam the two differ, which is why PyTorch offers the decoupled variant `torch.optim.AdamW` alongside `torch.optim.Adam`.
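
For plain SGD, the relationship between weight decay and an L2 penalty can be checked numerically. The sketch below uses a toy linear model and random data (both assumptions for illustration) and compares one SGD step with `weight_decay` against one step with an explicit penalty of `0.5 * wd * sum(p**2)` added to the loss.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

model_a = nn.Linear(4, 3)
model_b = copy.deepcopy(model_a)  # identical starting weights
inputs = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))
criterion = nn.CrossEntropyLoss()
lr, wd = 0.1, 0.01

# Variant A: weight decay handled by the optimizer.
opt_a = torch.optim.SGD(model_a.parameters(), lr=lr, weight_decay=wd)
loss_a = criterion(model_a(inputs), targets)
loss_a.backward()
opt_a.step()

# Variant B: explicit L2 penalty in the loss, no weight decay in the optimizer.
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr)
penalty = sum(p.pow(2).sum() for p in model_b.parameters())
loss_b = criterion(model_b(inputs), targets) + 0.5 * wd * penalty
loss_b.backward()
opt_b.step()

same = all(
    torch.allclose(pa, pb, atol=1e-6)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```

The factor `0.5` is needed because the gradient of the squared term contributes a factor of 2. The same comparison is not expected to hold for adaptive optimizers, where the penalty gradient is rescaled together with the data gradient.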

#### Training Strategies
**Early Stopping**: For early stopping, we monitor the performance of the model on a validation set and stop training when the performance stops improving. This ensures that the model does not continue to train on the training data and potentially overfit, while also saving computational resources.
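
A minimal sketch of the stopping rule, assuming a list of per-epoch validation losses is available. The function name and the `patience` and `min_delta` parameters are hypothetical and chosen for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would be stopped."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Patience was never exhausted: training runs to the last epoch.
    return len(val_losses) - 1


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]
stop = early_stopping_epoch(losses, patience=3)
print(stop)
```

In this example the best loss is reached in epoch 2 and is not beaten in the three following epochs, so training stops at epoch 5 and never sees the later improvement. This is the trade-off controlled by the patience value. In a real training loop one would also keep a copy of the model weights from the best epoch.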

**Dropout**: Dropout is a regularization technique where, during training, a random subset of neurons in a layer is "dropped out" (set to zero) for each forward pass. This prevents the network from relying too heavily on specific neurons and encourages it to learn more robust and generalized features. During inference, all neurons are used. To keep the expected activations consistent between the two modes, the outputs are rescaled: the original formulation scales them at inference time, while PyTorch's `nn.Dropout` scales the surviving activations by $1/(1-p)$ during training and leaves inference untouched.
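
The difference between training and inference mode can be seen directly on a tensor of ones. This is a small sketch using PyTorch's `nn.Dropout`; the tensor size and the seed are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: roughly half of the entries are zeroed, and the
# survivors are scaled by 1 / (1 - p) = 2.
dropout.train()
y_train = dropout(x)

# Evaluation mode: dropout is a no-op.
dropout.eval()
y_eval = dropout(x)

print(y_train.unique(), y_train.mean().item(), torch.equal(y_eval, x))
```

This is also why forgetting `model.eval()` before validation leads to noisy, pessimistic metrics for a model that contains dropout layers.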

**Learning Rate Scheduling (LR Scheduling)**: Learning rate scheduling involves dynamically adjusting the learning rate during training. A high learning rate at the start helps the model converge quickly, while a lower learning rate later allows for fine-tuning. Common strategies include step decay, exponential decay, and cosine annealing. Proper learning rate scheduling can lead to faster convergence and better generalization.
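
As a minimal sketch of step decay, the following uses `torch.optim.lr_scheduler.StepLR` on a toy model. The model, the starting learning rate, and the schedule parameters are illustrative assumptions, and the actual training step is omitted.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Halve the learning rate every 2 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

lrs = []
for epoch in range(6):
    lrs.append(optimizer.param_groups[0]["lr"])
    # ... the forward pass, loss.backward() etc. would go here ...
    optimizer.step()   # the optimizer steps first,
    scheduler.step()   # then the scheduler advances once per epoch

print(lrs)
```

The scheduler is stepped once per epoch here; other schedules, such as cosine annealing, are used in the same way by swapping the scheduler class.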

**Task 1.5 (3 x 0.5 points)**: Let us implement and compare the effects of different regularization techniques on a simple neural network. To do so, follow these steps:

1. Create a small neural network (e.g., 2-3 layers) and train it on the LiTS dataset without any regularization. Record the training and validation accuracy/loss. (P.S.: You can use the model from last week's exercise as a starting point.)
2. Add L2 regularization to the model and observe how it affects the training and validation performance (Check the Adam optimizer documentation to find out how to add this regularization). Compare the results with the unregularized model.
3. Add dropout to the model and repeat the training process (check the PyTorch documentation to find out how to add this regularization - you do not need to implement it yourself). Compare the results with the previous models.

For each step, make sure to copy the relevant code snippets into a new cell, instead of modifying the existing code. This way, we can keep track of the different versions of the model and their performances.

You might want to look up the documentation for implementing these techniques in PyTorch.

For each regularization technique, explain how it impacts the model's performance and generalization. Which combination of techniques works best for this dataset? Why do you think that is the case?
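
One possible way to keep the three runs clearly separated is to make the regularization settings explicit arguments. The sketch below is only an illustration of that idea: `SmallNet`, `build_run`, and the `configs` dictionary are hypothetical names, and the tiny fully connected model merely stands in for the model used in this exercise.

```python
import torch
import torch.nn as nn


class SmallNet(nn.Module):
    """Hypothetical stand-in model with a configurable dropout rate."""

    def __init__(self, in_features=32, out_classes=3, dropout_p=0.0):
        super().__init__()
        self.fc1 = nn.Linear(in_features, 16)
        self.dropout = nn.Dropout(p=dropout_p)  # p=0.0 leaves activations unchanged
        self.fc2 = nn.Linear(16, out_classes)

    def forward(self, x):
        return self.fc2(self.dropout(torch.relu(self.fc1(x))))


def build_run(dropout_p=0.0, weight_decay=0.0):
    """Create a model and optimizer for one experimental configuration."""
    model = SmallNet(dropout_p=dropout_p)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=weight_decay)
    return model, optimizer


configs = {
    "baseline": dict(dropout_p=0.0, weight_decay=0.0),
    "l2": dict(dropout_p=0.0, weight_decay=1e-5),
    "dropout": dict(dropout_p=0.5, weight_decay=0.0),
}
runs = {name: build_run(**cfg) for name, cfg in configs.items()}

for name, (model, optimizer) in runs.items():
    print(name, model.dropout.p, optimizer.param_groups[0]["weight_decay"])
```

With a setup like this, each reported result can be traced back to exactly one configuration, which makes the comparison between the unregularized and regularized models meaningful.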

In [9]:
!gdown 1TItTaso19GFTPdDnynVnqJvHsCm_RGlI
!rm -rf ./sample_data/
!unzip -qq Clean_LiTS.zip
!rm ./Clean_LiTS.zip

Downloading...
From (original): https://drive.google.com/uc?id=1TItTaso19GFTPdDnynVnqJvHsCm_RGlI
From (redirected): https://drive.google.com/uc?id=1TItTaso19GFTPdDnynVnqJvHsCm_RGlI&confirm=t&uuid=f9e48ac9-fc55-44f3-bf8b-6f0ddfa0d9d7
To: /content/Clean_LiTS.zip
100% 2.56G/2.56G [00:36<00:00, 71.1MB/s]


In [10]:
import torchvision.transforms.functional as ttf
import pandas as pd
from torch.utils.data import Dataset, DataLoader
import PIL.Image # import the submodule explicitly so that PIL.Image.open is available

class LiTS_Dataset(Dataset):

    """
    For our sample solution, we go for the easier variant.

    In this specific dataset, we don't load the images until we need them - for a
    short training, or limited resources, this is good behavior. If you have the
    necessary RAM to pre-load all of your data, you don't have to load the data
    multiple times, and save compute costs in the long run. The downside is that
    when you are trying to debug, you wait for ages every time, and if you simply
    do not have the compute resources, you can't even do it.
    """

    def __init__(self, csv: str, mode: str):

        self.csv = csv
        self.data = pd.read_csv(self.csv)
        self.mode = mode
        assert mode in ["train", "val", "test"] # has to be train, val, or test data - if not, assert throws an error

    def __len__(self):

        return len(self.data)

    def __getitem__(self, idx):

        file = self.data.loc[idx, "filename"]
        with PIL.Image.open(f"./Clean_LiTS/{self.mode}/{file}") as f:
            f = f.convert("L")
            image = ttf.pil_to_tensor(f)

        image = image.to(dtype = torch.float32)
        image -= torch.min(image)
        # Guard against a division by zero for constant (e.g. empty) slices
        image /= torch.max(image).clamp(min = 1e-8)

        liver_visible = self.data.loc[idx, "liver_visible"]
        lesion_visible = self.data.loc[idx, "lesion_visible"]
        # Note that targets must have the data type torch.long - a 64-bit integer,
        # unlike the image tensor, which is usually a 32-bit float, the default
        # dtype for tensors when none is given
        if lesion_visible and liver_visible:
            target = torch.tensor(2, dtype = torch.long)
        elif not lesion_visible and liver_visible:
            target = torch.tensor(1, dtype = torch.long)
        elif not lesion_visible and not liver_visible:
            target = torch.tensor(0, dtype = torch.long)
        else:
            print(
                idx,
                lesion_visible,
                liver_visible,
                self.data.loc[idx, "liver_visible"],
                self.data.loc[idx, "lesion_visible"],
                self.data.loc[idx, "filename"]
                )
            raise ValueError("Invalid target")

        return image, target

train_dataset = LiTS_Dataset(csv = "./Clean_LiTS/train_classes.csv", mode="train")
val_dataset = LiTS_Dataset(csv = "./Clean_LiTS/val_classes.csv", mode="val")
test_dataset = LiTS_Dataset(csv = "./Clean_LiTS/test_classes.csv", mode="test")

batch_size = 16

train_dataloader = DataLoader(
    dataset = train_dataset,
    batch_size = batch_size,
    shuffle = True,
    drop_last = True
)

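val_dataloader = DataLoader(
    dataset = val_dataset,
    batch_size = batch_size,
    num_workers = 0,
    shuffle = False,
    drop_last = False # keep every sample: accuracy is divided by the full dataset size
)

test_dataloader = DataLoader(
    dataset = test_dataset,
    batch_size = batch_size,
    shuffle = False,
    drop_last = False # keep every sample: accuracy is divided by the full dataset size
)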

In [14]:
device = ("cuda" if torch.cuda.is_available() else "cpu")

# Insert your model here

class YourModel(nn.Module):
    def __init__(self, in_channels=1, out_classes=3):
        super(YourModel, self).__init__()

        self.conv1 = torch.nn.Conv2d(in_channels=in_channels, out_channels=16, kernel_size=(3, 3), padding=1)
        self.conv2 = torch.nn.Conv2d(in_channels = 16, out_channels = 32, kernel_size=(3, 3), padding=1)
        self.conv3 = torch.nn.Conv2d(in_channels = 32, out_channels = 64, kernel_size=(3, 3), padding=1)
        self.conv4 = torch.nn.Conv2d(in_channels = 64, out_channels = 128, kernel_size=(3, 3), padding=1)

        self.relu = torch.nn.ReLU()
        self.pool = torch.nn.MaxPool2d(2)

        self.fc1 = torch.nn.Linear(in_features = 128 * 16 * 16, out_features = 256)
        self.fc2 = torch.nn.Linear(in_features = 256, out_features = out_classes)

        self.dropout = torch.nn.Dropout(p=0.5)

    def forward(self, x: torch.Tensor):

        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = self.pool(self.relu(self.conv3(x)))
        x = self.pool(self.relu(self.conv4(x)))

        x = x.flatten(start_dim=1)

        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)

        return x

In [15]:
# Now for the training loop
loss_criterion = torch.nn.CrossEntropyLoss()

def train_loop(your_model):
    """
    This function runs your train loop and returns the trained model.
    """

    model = your_model(in_channels = 1, out_classes = 3)
    model = model.to(device = device)

    loss_criterion = torch.nn.CrossEntropyLoss()


    optimizer = torch.optim.Adam(
        params=model.parameters(),
        lr=1e-4,
        weight_decay=1e-5  # L2
    )

    num_epochs = 15

    for epoch in range(num_epochs):

        for step, (data, targets) in enumerate(train_dataloader):

            optimizer.zero_grad()
            data, targets = data.to(device), targets.to(device)
            predictions = model(data)
            loss = loss_criterion(predictions, targets)

            if step % 50 == 0:
                # This uses the length of the current set - "train"
                print(f"Epoch [{epoch+1}/{num_epochs}]\t Step [{step+1}/{len(train_dataloader.dataset)//batch_size}]\t Loss: {loss.item():.4f}")

            loss.backward()
            optimizer.step()

        # Validate every 2 epochs
        if epoch % 2 == 0:

            # Validation mode on
            model.eval()

            # Don't track gradients for validation
            with torch.no_grad():

                hits = 0
                losses = []
                batch_sizes = []

                for step, (data, targets) in enumerate(val_dataloader):

                    data, targets = data.to(device), targets.to(device)
                    predictions = model(data)
                    loss = loss_criterion(predictions, targets)
                    losses.append(loss.item())
                    batch_sizes.append(data.size()[0])

                    class_predictions = torch.argmax(predictions, dim = 1).flatten()
                    hits = hits + sum([1 if cp == t else 0 for cp, t in zip(class_predictions, targets)])

                accuracy = hits / len(val_dataloader.dataset)
                avg_loss = sum([l * bs for l, bs in zip(losses, batch_sizes)]) / sum(batch_sizes)
                print(f"Epoch: {epoch+1},\t Validation Loss: {avg_loss:.4f},\t Accuracy: {accuracy:.4f}")

            # After we are done validating, let's not forget to go back to storing gradients.
            model.train()

    return model

In [16]:
trained_model = train_loop(YourModel)

Epoch [1/15]	 Step [1/2217]	 Loss: 1.1075
Epoch [1/15]	 Step [51/2217]	 Loss: 0.8566
Epoch [1/15]	 Step [101/2217]	 Loss: 0.7830
Epoch [1/15]	 Step [151/2217]	 Loss: 0.7974
Epoch [1/15]	 Step [201/2217]	 Loss: 0.8077
Epoch [1/15]	 Step [251/2217]	 Loss: 0.4525
Epoch [1/15]	 Step [301/2217]	 Loss: 0.5593
Epoch [1/15]	 Step [351/2217]	 Loss: 0.6973
Epoch [1/15]	 Step [401/2217]	 Loss: 0.7749
Epoch [1/15]	 Step [451/2217]	 Loss: 0.5155
Epoch [1/15]	 Step [501/2217]	 Loss: 0.3345
Epoch [1/15]	 Step [551/2217]	 Loss: 0.5802
Epoch [1/15]	 Step [601/2217]	 Loss: 0.3502
Epoch [1/15]	 Step [651/2217]	 Loss: 0.3660
Epoch [1/15]	 Step [701/2217]	 Loss: 0.3699
Epoch [1/15]	 Step [751/2217]	 Loss: 0.2065
Epoch [1/15]	 Step [801/2217]	 Loss: 0.2946
Epoch [1/15]	 Step [851/2217]	 Loss: 0.6406
Epoch [1/15]	 Step [901/2217]	 Loss: 0.5193
Epoch [1/15]	 Step [951/2217]	 Loss: 0.5941
Epoch [1/15]	 Step [1001/2217]	 Loss: 0.2893
Epoch [1/15]	 Step [1051/2217]	 Loss: 0.3496
Epoch [1/15]	 Step [1101/2217]	 L

In [17]:
# Standard

trained_model.eval()

with torch.no_grad():

    hits = 0
    losses = []
    batch_sizes = []

    for step, (data, targets) in enumerate(test_dataloader):

        data, targets = data.to(device), targets.to(device)
        predictions = trained_model(data)
        loss = loss_criterion(predictions, targets)
        losses.append(loss.item())
        batch_sizes.append(data.size()[0])
        class_predictions = torch.argmax(predictions, dim = 1).flatten()
        hits = hits + sum([1 if cp == t else 0 for cp, t in zip(class_predictions, targets)])

    accuracy = hits / len(test_dataloader.dataset)
    avg_loss = sum([l * bs for l, bs in zip(losses, batch_sizes)]) / sum(batch_sizes)
    print(f"Test Loss: {avg_loss:.4f},\t Accuracy: {accuracy:.4f}")

Test Loss: 1.5275,	 Accuracy: 0.7574


### L2-Regularization

In [18]:

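# L2 regularization is enabled through the weight_decay argument of the
# Adam optimizer in train_loop:
#
# optimizer = torch.optim.Adam(
#     params=model.parameters(),
#     lr=1e-4,
#     weight_decay=1e-5  # L2
# )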

trained_model1 = train_loop(YourModel)

trained_model1.eval()

with torch.no_grad():

    hits = 0
    losses = []
    batch_sizes = []

    for step, (data, targets) in enumerate(test_dataloader):

        data, targets = data.to(device), targets.to(device)
        predictions = trained_model1(data)
        loss = loss_criterion(predictions, targets)
        losses.append(loss.item())
        batch_sizes.append(data.size()[0])
        class_predictions = torch.argmax(predictions, dim = 1).flatten()
        hits = hits + sum([1 if cp == t else 0 for cp, t in zip(class_predictions, targets)])

    accuracy = hits / len(test_dataloader.dataset)
    avg_loss = sum([l * bs for l, bs in zip(losses, batch_sizes)]) / sum(batch_sizes)
    print(f"Test Loss: {avg_loss:.4f},\t Accuracy: {accuracy:.4f}")

Epoch [1/15]	 Step [1/2217]	 Loss: 1.0972
Epoch [1/15]	 Step [51/2217]	 Loss: 0.8323
Epoch [1/15]	 Step [101/2217]	 Loss: 1.0817
Epoch [1/15]	 Step [151/2217]	 Loss: 0.6268
Epoch [1/15]	 Step [201/2217]	 Loss: 0.4279
Epoch [1/15]	 Step [251/2217]	 Loss: 0.3117
Epoch [1/15]	 Step [301/2217]	 Loss: 0.4290
Epoch [1/15]	 Step [351/2217]	 Loss: 0.2703
Epoch [1/15]	 Step [401/2217]	 Loss: 0.7447
Epoch [1/15]	 Step [451/2217]	 Loss: 0.3775
Epoch [1/15]	 Step [501/2217]	 Loss: 0.2129
Epoch [1/15]	 Step [551/2217]	 Loss: 0.1579
Epoch [1/15]	 Step [601/2217]	 Loss: 0.3333
Epoch [1/15]	 Step [651/2217]	 Loss: 0.1405
Epoch [1/15]	 Step [701/2217]	 Loss: 0.2533
Epoch [1/15]	 Step [751/2217]	 Loss: 0.4812
Epoch [1/15]	 Step [801/2217]	 Loss: 0.3356
Epoch [1/15]	 Step [851/2217]	 Loss: 0.4838
Epoch [1/15]	 Step [901/2217]	 Loss: 0.8975
Epoch [1/15]	 Step [951/2217]	 Loss: 0.4541
Epoch [1/15]	 Step [1001/2217]	 Loss: 0.1665
Epoch [1/15]	 Step [1051/2217]	 Loss: 0.4049
Epoch [1/15]	 Step [1101/2217]	 L

### Dropout

In [19]:
# Dropout is enabled in YourModel through the following two lines:
#
# self.dropout = torch.nn.Dropout(p=0.5)   (in __init__)
# x = self.dropout(x)                      (in forward)

trained_model2 = train_loop(YourModel)

trained_model2.eval()

with torch.no_grad():

    hits = 0
    losses = []
    batch_sizes = []

    for step, (data, targets) in enumerate(test_dataloader):

        data, targets = data.to(device), targets.to(device)
        predictions = trained_model2(data)
        loss = loss_criterion(predictions, targets)
        losses.append(loss.item())
        batch_sizes.append(data.size()[0])
        class_predictions = torch.argmax(predictions, dim = 1).flatten()
        hits = hits + sum([1 if cp == t else 0 for cp, t in zip(class_predictions, targets)])

    accuracy = hits / len(test_dataloader.dataset)
    avg_loss = sum([l * bs for l, bs in zip(losses, batch_sizes)]) / sum(batch_sizes)
    print(f"Test Loss: {avg_loss:.4f},\t Accuracy: {accuracy:.4f}")




Epoch [1/15]	 Step [1/2217]	 Loss: 1.1149
Epoch [1/15]	 Step [51/2217]	 Loss: 1.2823
Epoch [1/15]	 Step [101/2217]	 Loss: 0.8730
Epoch [1/15]	 Step [151/2217]	 Loss: 0.6028
Epoch [1/15]	 Step [201/2217]	 Loss: 0.5060
Epoch [1/15]	 Step [251/2217]	 Loss: 0.8361
Epoch [1/15]	 Step [301/2217]	 Loss: 0.4397
Epoch [1/15]	 Step [351/2217]	 Loss: 0.4814
Epoch [1/15]	 Step [401/2217]	 Loss: 0.3721
Epoch [1/15]	 Step [451/2217]	 Loss: 0.1336
Epoch [1/15]	 Step [501/2217]	 Loss: 0.2689
Epoch [1/15]	 Step [551/2217]	 Loss: 0.4442
Epoch [1/15]	 Step [601/2217]	 Loss: 0.2130
Epoch [1/15]	 Step [651/2217]	 Loss: 0.3680
Epoch [1/15]	 Step [701/2217]	 Loss: 0.2229
Epoch [1/15]	 Step [751/2217]	 Loss: 0.3230
Epoch [1/15]	 Step [801/2217]	 Loss: 0.2105
Epoch [1/15]	 Step [851/2217]	 Loss: 0.4599
Epoch [1/15]	 Step [901/2217]	 Loss: 0.4275
Epoch [1/15]	 Step [951/2217]	 Loss: 0.9152
Epoch [1/15]	 Step [1001/2217]	 Loss: 0.4587
Epoch [1/15]	 Step [1051/2217]	 Loss: 0.5262
Epoch [1/15]	 Step [1101/2217]	 L

L2: An extra loss term; it prevents the model from focusing on individual weights (reduces overfitting).
Dropout: A number of neurons are randomly deactivated during each forward pass. This prevents overfitting, and the model generalizes better.

There are many parameters per image, so the model can easily learn random details. Regularization prevents this.

#### Chapter 3.3 - Batch Normalization

Batch Normalization is a technique to improve the training of deep neural networks by normalizing the inputs to each layer. It was introduced to address the problem of internal covariate shift, which refers to the change in the distribution of layer inputs during training as the parameters of the previous layers change.

**Intuition:**
The idea behind Batch Normalization is to normalize the inputs to each layer so that they have a mean of 0 and a standard deviation of 1. This ensures that the inputs to each layer are on a similar scale, which helps the network learn faster and more effectively. By normalizing the inputs, Batch Normalization reduces the sensitivity of the network to the initialization of weights and allows for the use of higher learning rates.

**How it works:**
1. For each mini-batch, Batch Normalization computes the mean and variance of the inputs, separately for each feature (for `BatchNorm2d`, each channel).
2. The inputs are then normalized using these statistics:
    $\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$
    where $\mu$ is the mean, $\sigma^2$ is the variance, and $\epsilon$ is a small constant added for numerical stability.
3. To allow the network to learn the optimal scale and shift for the normalized inputs, two learnable parameters, $\gamma$ (scale) and $\beta$ (shift), are introduced:
    $y = \gamma \hat{x} + \beta$
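
The three steps above can be sketched by hand and compared against PyTorch's `nn.BatchNorm2d` in training mode. The input shape and seed are arbitrary; for 2D batch normalization, the statistics are computed per channel over the batch and both spatial dimensions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(8, 4, 5, 5)  # (batch, channels, height, width)
bn = nn.BatchNorm2d(num_features=4)
bn.train()
y_layer = bn(x)

# Step 1: per-channel mean and (biased) variance over batch and spatial dims.
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)

# Step 2: normalize.
x_hat = (x - mu) / torch.sqrt(var + bn.eps)

# Step 3: learnable scale (gamma) and shift (beta), one value per channel.
gamma = bn.weight.view(1, -1, 1, 1)
beta = bn.bias.view(1, -1, 1, 1)
y_manual = gamma * x_hat + beta

print(torch.allclose(y_layer, y_manual, atol=1e-5))
```

In evaluation mode the layer uses running averages of the statistics collected during training instead of the current batch statistics, so this comparison only holds in training mode.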

**Problems it solves:**
1. **Internal Covariate Shift:** By normalizing the inputs to each layer, Batch Normalization reduces the changes in the distribution of layer inputs during training, making the optimization process more stable.
2. **Faster Training:** Normalized inputs allow for the use of higher learning rates, leading to faster convergence.
3. **Regularization Effect:** Batch Normalization introduces some noise due to the mini-batch statistics, which acts as a form of regularization and reduces the need for other regularization techniques like Dropout.

In [None]:
# Here is how to add it into your model as a layer:

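num_features = 16 # number of channels of the incoming feature map, e.g. the out_channels of the preceding convolution
bn = torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)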

#### Chapter 3.4 - Modern Computer Vision Models

AlexNet (original paper: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf) is in many senses the grandfather of modern neural networks, being the first one to successfully combine multiple GPUs for training with a deep neural network. While it is no longer in use today, the lessons learned from AlexNet very much are, and multi-GPU setups and deep convolutional neural networks remain a staple of computer vision methods.

Modern Computer Vision uses a number of different models, but perhaps none is as prolific as the original ResNet, in particular the ResNet-50. Even though it is far from the strongest model available today, its flexibility, modest size, and robust performance across tasks make it a favorite, both in general computer vision and medical computer vision, where it is commonly used as the encoder in segmentation models (more on that later). The original paper (https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf) has garnered almost 300,000 citations, and its descendants have dominated challenges and paper submissions in the field for a significant amount of time.

**Task 2 (up to 7 points)**: Your task is to write one of these two modern models from scratch. The points are awarded for correctly implementing these models. You can choose your own difficulty here, and can earn fewer or more points, depending on which you feel more comfortable building. AlexNet requires only components that you have already seen last week - convolutions, pooling, and linear layers, while ResNet requires you to build skip connections and bottleneck blocks from scratch.

Option 1 - AlexNet **(4 points)**:
- Building the model **(2 points)**
- You do not have to implement the parts where multiple GPUs are required
- Add some type of data augmentation and regularization to the training script **(1 point)**
- Do a proper evaluation of the test set, including confusion matrix and precision-recall curves **(1 point)**

Option 2 - ResNet-50 **(7 points)**
- Correctly implementing Skip Connections **(1 point)**
- Correctly implementing Residual/Bottleneck Blocks **(3 points)**
- Correctly building the ResNet from these Blocks **(1 point)**
- For BatchNorm you are allowed to simply use the existing implementation
- Add some type of data augmentation and regularization to the training script **(1 point)**
- Do a proper evaluation of the test set, including confusion matrix and precision-recall curves **(1 point)**

You must verify that your model actually trains and is capable of solving the classification task on LiTS 2017. You should be able to explain every piece of code to the tutors that grade your solution, so if you use any help in building the model (e.g. ChatGPT, Cursor, etc.), be prepared to explain which code blocks do what, and why you implemented them in the specific way you did, and not any other. The points are awarded for programming *and* understanding!
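
For the evaluation part, a confusion matrix with per-class precision and recall can be sketched in plain PyTorch as follows. The function name and the toy targets and predictions are assumptions for illustration; in the exercise they would be collected over the test set.

```python
import torch


def confusion_matrix(targets, predictions, num_classes):
    """Rows are the true classes, columns are the predicted classes."""
    matrix = torch.zeros((num_classes, num_classes), dtype=torch.long)
    for t, p in zip(targets.tolist(), predictions.tolist()):
        matrix[t, p] += 1
    return matrix


# Toy labels standing in for the collected test set results.
targets = torch.tensor([0, 0, 1, 1, 2, 2, 2])
predictions = torch.tensor([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(targets, predictions, num_classes=3)

# Precision: correct predictions / all predictions of that class (column sums).
precision = cm.diag() / cm.sum(dim=0).clamp(min=1)
# Recall: correct predictions / all samples of that class (row sums).
recall = cm.diag() / cm.sum(dim=1).clamp(min=1)

print(cm)
print(precision, recall)
```

A precision-recall curve additionally needs the predicted class probabilities (for example the softmax of the model output), evaluated over a range of decision thresholds for each class.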

In [None]:
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Bottleneck block: 1x1 reduce -> 3x3 -> 1x1 expand (x4), plus a skip connection."""

    def __init__(self, in_channels, out_channels, stride = 1):
        super(ResidualBlock, self).__init__()

        # 1x1 convolution that sets the (reduced) channel count
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size = 1, stride = 1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU())

        # 3x3 convolution; the stride here performs the spatial downsampling
        self.conv2 = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, kernel_size = 3, stride = stride, padding = 1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU())

        # 1x1 convolution that expands the channels by a factor of 4. No padding
        # and no ReLU here: the activation follows after the skip connection.
        self.conv3 = nn.Sequential(
            nn.Conv2d(out_channels, out_channels * 4, kernel_size = 1, stride = 1),
            nn.BatchNorm2d(out_channels * 4))

        # Project the identity whenever its shape differs from the block output
        self.downsample = None
        if stride != 1 or in_channels != out_channels * 4:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels * 4, kernel_size = 1, stride = stride, bias = False),
                nn.BatchNorm2d(out_channels * 4)
            )

        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.conv2(out)
        out = self.conv3(out)

        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity # skip connection
        out = self.relu(out)
        return out


class ResNet(nn.Module):
    def __init__(self, in_channels=1, out_classes=3):
        super(ResNet, self).__init__()

        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64),
            nn.ReLU()
        )

        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))

        # Each block outputs 4x its out_channels, so the following block
        # (and the first block of the next stage) takes that many input channels.
        self.layer1 = nn.Sequential(
            ResidualBlock(64, 64),
            ResidualBlock(256, 64),
            ResidualBlock(256, 64)
        )

        self.layer2 = nn.Sequential(
            ResidualBlock(256, 128, stride=2),
            ResidualBlock(512, 128),
            ResidualBlock(512, 128),
            ResidualBlock(512, 128)
        )

        self.layer3 = nn.Sequential(
            ResidualBlock(512, 256, stride=2),
            ResidualBlock(1024, 256),
            ResidualBlock(1024, 256),
            ResidualBlock(1024, 256),
            ResidualBlock(1024, 256),
            ResidualBlock(1024, 256)
        )

        self.layer4 = nn.Sequential(
            ResidualBlock(1024, 512, stride=2),
            ResidualBlock(2048, 512),
            ResidualBlock(2048, 512)
        )

        # The last stage outputs 512 * 4 = 2048 channels
        self.fc = torch.nn.Linear(in_features = 2048, out_features = out_classes)

        self.dropout = torch.nn.Dropout(p=0.5)

    def forward(self, x):
        x = self.conv1(x)
        x = self.pool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.dropout(x)
        x = self.fc(x)

        return x

In [None]:
res_net = train_loop(ResNet)

res_net.eval()

with torch.no_grad():

    hits = 0
    losses = []
    batch_sizes = []

    for step, (data, targets) in enumerate(test_dataloader):

        data, targets = data.to(device), targets.to(device)
        predictions = res_net(data)
        loss = loss_criterion(predictions, targets)
        losses.append(loss.item())
        batch_sizes.append(data.size()[0])
        class_predictions = torch.argmax(predictions, dim = 1).flatten()
        hits = hits + sum([1 if cp == t else 0 for cp, t in zip(class_predictions, targets)])

    accuracy = hits / len(test_dataloader.dataset)
    avg_loss = sum([l * bs for l, bs in zip(losses, batch_sizes)]) / sum(batch_sizes)
    print(f"Test Loss: {avg_loss:.4f},\t Accuracy: {accuracy:.4f}")
