KUL H02A5a Computer Vision: Group Assignment 2
---------------------------------------------------------------
Student numbers: <span style="color:red">r1, r2, r0977144, r0974272, r0974660</span>.

In this group assignment your team will delve into some deep learning applications for computer vision. The assignment will be delivered in the same groups from *Group assignment 1* and you start from this template notebook. The notebook you submit for grading is the last notebook pinned as default and submitted to the [Kaggle competition](https://www.kaggle.com/t/d11be6a431b84198bc85f54ae7e2563f) prior to the deadline on **Wednesday 22 May 23:59**. Closely follow [these instructions](https://github.com/gourie/kaggle_inclass) for joining the competition, sharing your notebook with the TAs and making a valid notebook submission to the competition. A notebook submission not only produces a *submission.csv* file that is used to calculate your competition score, it also runs the entire notebook and saves its output as if it were a report. This way it becomes an all-in-one-place document for the TAs to review. As such, please make sure that your final submission notebook is self-contained and fully documented (e.g. provide strong arguments for the design choices that you make). Most likely, this notebook format is not appropriate to run all your experiments at submission time (e.g. the training of CNNs is a memory hungry and time consuming process; due to limited Kaggle resources). It can be a good idea to distribute your code otherwise and only summarize your findings, together with your final predictions, in the submission notebook. For example, you can substitute experiments with some text and figures that you have produced "offline" (e.g. learning curves and results on your internal validation set or even the test set for different architectures, pre-processing pipelines, etc). We advise you to first go through the PDF of this assignment entirely before you really start. Then, it can be a good idea to go through this notebook and use it as your first notebook submission to the competition. 
You can make use of the *Group assignment 2* forum/discussion board on Toledo if you have any questions. Good luck and have fun!

---------------------------------------------------------------
NOTES:
* This notebook is just a template. Please keep the five main sections, but feel free to adjust further in any way you please!
* Clearly indicate the improvements that you make! You can for instance use subsections like: *3.1. Improvement: applying loss function f instead of g*.


# 1. Overview
This assignment consists of *three main parts* for which we expect you to provide code and extensive documentation in the notebook:
* Image classification (Sect. 2)
* Semantic segmentation (Sect. 3)
* Adversarial attacks (Sect. 4)

In the first part, you will train an end-to-end neural network for image classification. In the second part, you will do the same for semantic segmentation. For these two tasks we expect you to put a significant effort into optimizing performance and as such competing with fellow students via the Kaggle competition. In the third part, you will try to find and exploit the weaknesses of your classification and/or segmentation network. For the latter there is no competition format, but we do expect you to put significant effort in achieving good performance on the self-posed goal for that part. Finally, we ask you to reflect and produce an overall discussion with links to the lectures and "real world" computer vision (Sect. 5). It is important to note that only a small part of the grade will reflect the actual performance of your networks. However, we do expect all things to work! In general, we will evaluate the correctness of your approach and your understanding of what you have done that you demonstrate in the descriptions and discussions in the final notebook.

## 1.1 Deep learning resources
If you did not yet explore this in *Group assignment 1 (Sect. 2)*, we recommend using the TensorFlow and/or Keras library for building deep learning models. You can find a nice crash course [here](https://colab.research.google.com/drive/1UCJt8EYjlzCs1H1d1X0iDGYJsHKwu-NO).

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
import numpy as np
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt
import tensorflow_hub as hub
from tensorflow.keras import layers, Model
from tqdm import tqdm
import cv2
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
from tensorflow.keras.optimizers import Adam
from PIL import Image, ImageOps
import scipy
from tensorflow.keras.layers import Input, Conv2D, Conv2DTranspose, Activation, MaxPooling2D, UpSampling2D, concatenate
!pip install -q git+https://github.com/tensorflow/examples.git
!pip install iterative-stratification



## 1.2 PASCAL VOC 2009
For this project you will be using the [PASCAL VOC 2009](http://host.robots.ox.ac.uk/pascal/VOC/voc2009/index.html) dataset. This dataset consists of colour images of various scenes with different object classes (e.g. animal: *bird, cat, ...*; vehicle: *aeroplane, bicycle, ...*), totalling 20 classes.

In [None]:
# Loading the training data
train_df = pd.read_csv('/kaggle/input/kul-h02a5a-computer-vision-ga2-2024/train/train_set.csv', index_col="Id")
labels = train_df.columns
train_df["img"] = [np.load('/kaggle/input/kul-h02a5a-computer-vision-ga2-2024/train/img/train_{}.npy'.format(idx)) for idx, _ in train_df.iterrows()]
train_df["seg"] = [np.load('/kaggle/input/kul-h02a5a-computer-vision-ga2-2024/train/seg/train_{}.npy'.format(idx)) for idx, _ in train_df.iterrows()]
print("The training set contains {} examples.".format(len(train_df)))

fig, axs = plt.subplots(2, 20, figsize=(10 * 20, 10 * 2))
for i, label in enumerate(labels):
    df = train_df.loc[train_df[label] == 1]
    axs[0, i].imshow(df.iloc[0]["img"], vmin=0, vmax=255)
    axs[0, i].set_title("\n".join(label for label in labels if df.iloc[0][label] == 1), fontsize=40)
    axs[0, i].axis("off")
    axs[1, i].imshow(df.iloc[0]["seg"], vmin=0, vmax=20)  # with the absolute color scale it will be clear that the arrays in the "seg" column are label maps (labels in [0, 20])
    axs[1, i].axis("off")
plt.show()

# The training dataframe contains, for each image, 20 columns with the ground truth classification labels, an "img" column with the image and a "seg" column with the ground truth segmentation label map
train_df.head(1)

The training set contains 749 examples.


In [3]:
# Loading the test data
test_df = pd.read_csv('/kaggle/input/kul-h02a5a-computer-vision-ga2-2024/test/test_set.csv', index_col="Id")
test_df["img"] = [np.load('/kaggle/input/kul-h02a5a-computer-vision-ga2-2024/test/img/test_{}.npy'.format(idx)) for idx, _ in test_df.iterrows()]
test_df["seg"] = [-1 * np.ones(img.shape[:2], dtype=np.int8) for img in test_df["img"]]
print("The test set contains {} examples.".format(len(test_df)))

# The test dataframe is similar to the training dataframe, but here the values are -1 --> your task is to fill these in as well as possible in Sect. 2 and Sect. 3; in Sect. 6 this dataframe is automatically transformed into the submission CSV!
test_df.head(1)

The test set contains 750 examples.


Id,aeroplane,bicycle,bird,boat,bottle,bus,car,cat,chair,cow,...,horse,motorbike,person,pottedplant,sheep,sofa,train,tvmonitor,img,seg
0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,"[[[139, 130, 115], [136, 127, 112], [112, 102,...","[[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ..."


## 1.3 Your Kaggle submission
The test dataframe that you fill in during Sect. 2 and Sect. 3 must be converted to a submission.csv with two rows per example (one for classification and one for segmentation) and a single prediction column (the multi-class/label predictions, run-length encoded). You don't need to edit this section. Just make sure to call `generate_submission` at the right position in this notebook.

In [4]:
def _rle_encode(img):
    """
    Kaggle requires RLE encoded predictions for computation of the Dice score (https://www.kaggle.com/lifa08/run-length-encode-and-decode)

    Parameters
    ----------
    img: np.ndarray - binary img array
    
    Returns
    -------
    rle: String - running length encoded version of img
    """
    pixels = img.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    rle = ' '.join(str(x) for x in runs)
    return rle

def generate_submission(df, fileName = "submission.csv"):
    """
    Make sure to call this function once after you completed Sect. 2 and Sect. 3! It transforms and writes your test dataframe into a submission.csv file.
    
    Parameters
    ----------
    df: pd.DataFrame - filled dataframe that needs to be converted
    
    Returns
    -------
    submission_df: pd.DataFrame - df in submission format.
    """
    df_dict = {"Id": [], "Predicted": []}
    for idx, _ in df.iterrows():
        df_dict["Id"].append(f"{idx}_classification")
        df_dict["Predicted"].append(_rle_encode(np.array(df.loc[idx, labels])))
        df_dict["Id"].append(f"{idx}_segmentation")
        df_dict["Predicted"].append(_rle_encode(np.array([df.loc[idx, "seg"] == j + 1 for j in range(len(labels))])))
    
    submission_df = pd.DataFrame(data=df_dict, dtype=str).set_index("Id")
    submission_df.to_csv(fileName)
    return submission_df
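
To make the encoding concrete, here is a minimal, self-contained sketch of the same run-length scheme on a five-pixel mask. `rle_encode` mirrors `_rle_encode` above; `rle_decode` is a hypothetical helper added only to illustrate the round trip and is not part of the competition code.

```python
import numpy as np


def rle_encode(mask):
    """Encode a binary array as 'start length' pairs with 1-indexed starts."""
    pixels = np.concatenate([[0], np.asarray(mask).flatten(), [0]])
    # Positions (1-indexed) where the value changes: run starts and run ends
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    # Turn every run end into a run length
    runs[1::2] -= runs[::2]
    return " ".join(str(x) for x in runs)


def rle_decode(rle, length):
    """Hypothetical inverse of rle_encode for a flat mask of the given length."""
    mask = np.zeros(length, dtype=np.uint8)
    values = [int(v) for v in rle.split()]
    for start, run in zip(values[::2], values[1::2]):
        mask[start - 1:start - 1 + run] = 1
    return mask


mask = np.array([0, 1, 1, 0, 1])
encoded = rle_encode(mask)
print(encoded)  # 2 2 5 1
decoded = rle_decode(encoded, len(mask))
```

The string reads as "a run of 2 pixels starting at position 2, then a run of 1 pixel starting at position 5".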

# 2. Image classification
The goal here is simple: implement a classification CNN and train it to recognise all 20 classes (and/or background) using the training set and compete on the test set (by filling in the classification columns in the test dataframe).

In [None]:
class RandomClassificationModel:
    """
    Random classification model: 
        - generates random labels for the inputs based on the class distribution observed during training
        - assumes an input can have multiple labels
    """
    def fit(self, X, y):
        """
        Adjusts the class ratio variable to the one observed in y. 

        Parameters
        ----------
        X: list of arrays - n x (height x width x 3)
        y: list of arrays - n x (nb_classes)

        Returns
        -------
        self
        """
        self.distribution = np.mean(y, axis=0)
        print("Setting class distribution to:\n{}".format("\n".join(f"{label}: {p}" for label, p in zip(labels, self.distribution))))
        return self
        
    def predict(self, X):
        """
        Predicts for each input a label.
        
        Parameters
        ----------
        X: list of arrays - n x (height x width x 3)
            
        Returns
        -------
        y_pred: list of arrays - n x (nb_classes)
        """
        np.random.seed(0)
        return [np.array([int(np.random.rand() < p) for p in self.distribution]) for _ in X]
    
    def __call__(self, X):
        return self.predict(X)
    
model = RandomClassificationModel()
model.fit(train_df["img"], train_df[labels])
test_df.loc[:, labels] = model.predict(test_df["img"])
test_df.head(1)

Setting class distribution to:
aeroplane: 0.06275033377837116
bicycle: 0.0520694259012016
bird: 0.07343124165554073
boat: 0.06408544726301736
bottle: 0.056074766355140186
bus: 0.050734312416555405
car: 0.08411214953271028
cat: 0.06008010680907877
chair: 0.09212283044058744
cow: 0.04005340453938585
diningtable: 0.06408544726301736
dog: 0.05740987983978638
horse: 0.056074766355140186
motorbike: 0.06275033377837116
person: 0.27636849132176233
pottedplant: 0.05740987983978638
sheep: 0.036048064085447265
sofa: 0.05874499332443257
train: 0.0534045393858478
tvmonitor: 0.06809078771695594


Before anything else, let's plot the distribution to get a better view of it.

In [None]:
plt.figure(figsize=(18, 6))
plt.bar(labels, model.distribution, color='skyblue', edgecolor='black', width=1)

plt.title('Label distribution')

Here we are dealing with a multi-label classification problem: the model must recognize every object category present in an input image, out of 20 possible categories.
The labeled part of the dataset contains only 749 images. More than 27% of them contain a person; the remaining categories are fairly balanced, each appearing in roughly 4% to 9% of the images. For most categories this leaves only about 27 to 69 training images, which is very few for training a network.
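
As a quick sanity check, the fractions printed above can be converted back into approximate image counts. The sketch below copies a few fractions by hand (rounded to four decimals) from the printed class distribution; it does not read the dataframe.

```python
# Fractions copied from the printed class distribution above (rounded)
n_images = 749
fractions = {
    "person": 0.2764,
    "chair": 0.0921,
    "aeroplane": 0.0628,
    "cow": 0.0401,
    "sheep": 0.0360,
}

# Approximate number of training images containing each class
counts = {label: round(p * n_images) for label, p in fractions.items()}
for label, count in counts.items():
    print(f"{label}: ~{count} images")
```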

## 2.1 Classification from scratch
### Data augmentation
Given the observations we made above, it is reasonable to do some data augmentation before we move forward. Data augmentation here aims to increase the number of images available and reduce overfitting. To achieve this we create a function which transforms the images in various ways and feeds them back to the training dataset along with their labels.

In [None]:
IMAGE_SIZE = 128


def augment(image, label, image_size):
    """Apply random colour and geometric transformations to one image; the label is unchanged."""
    image = tf.cast(image, tf.float32)
    image = tf.image.random_saturation(image, lower=1, upper=3)
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    # For inputs that are already image_size x image_size this crop keeps the full image
    image = tf.image.random_crop(image, size=[image_size, image_size, 3])
    image = tf.image.random_brightness(image, max_delta=0.5)
    return image, label


def augment_matrix(images, labels, image_size):
    """Augment every image in `images` and return the new images together with their labels."""
    assert len(images) == len(labels)

    augmented_images = []
    augmented_labels = []
    for i in range(len(images)):
        im, l = augment(np.copy(images[i]), np.copy(labels[i]), image_size)
        augmented_images.append(im)
        augmented_labels.append(l)

    return np.array(augmented_images), np.array(augmented_labels)

### Data preparation
Here we make sure that all the images have the same resolution. Smaller images reduce the computational load later on, but the smaller we make them, the more detail is lost and the less the network can learn from them. The average image resolution in the dataset is 471 x 383, so resizing to 256 x 256 or 128 x 128 does not result in a great loss of detail. After testing we decided to use 128 x 128 to save on computational resources.
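
To put the trade-off in numbers, the short sketch below compares the pixel count of an average-sized image (471 x 383, as stated above) with the two candidate target sizes. It is plain arithmetic and does not touch the dataset.

```python
avg_width, avg_height = 471, 383
original_pixels = avg_width * avg_height

# Reduction factor in pixel count for each candidate target size
reduction = {}
for side in (256, 128):
    reduction[side] = original_pixels / (side * side)
    print(f"{side} x {side}: {side * side} pixels, "
          f"about {reduction[side]:.1f}x fewer than the average image")
```

Going from 256 x 256 to 128 x 128 divides the number of input pixels by four, which is where most of the computational saving comes from.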

In [None]:
from skimage.transform import resize

num_classes = 20
input_shape = (IMAGE_SIZE, IMAGE_SIZE, 3)


def preprocess_image(img, target_size=(IMAGE_SIZE, IMAGE_SIZE)):
    # skimage's resize converts uint8 input to floats in [0, 1] (preserve_range defaults to False)
    return resize(img, target_size, anti_aliasing=True)


# Resize all training images, then augment the first 150 and append them to the training set
train_imgs = np.array(train_df['img'])
train_imgs = np.array([preprocess_image(img) for img in train_imgs])
augmented_imgs, augmented_labels = augment_matrix(train_imgs[0:150], np.array(train_df[labels])[0:150], IMAGE_SIZE)

train_imgs = np.concatenate((train_imgs, augmented_imgs), axis=0)
train_labels = np.concatenate((np.array(train_df[labels]), np.array(augmented_labels)), axis=0)

test_imgs = test_df['img']
test_imgs = np.array([preprocess_image(img) for img in test_imgs])

Let's visualize some of the augmented images to make sure our alterations are working as intended.

In [None]:
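fig, axs = plt.subplots(1, 10, figsize=(10 * 10, 10 * 1))
for i in range(10):
    # The images are floats; clip to [0, 1] because the brightness shift can push values out of range
    axs[i].imshow(np.clip(augmented_imgs[i], 0, 1))
    axs[i].axis("off")

plt.show()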

### Building the model
This is the part where we build the model to be trained. Through experimentation we settled on the following fairly small convolutional neural network, matched to the size of the dataset. A brief explanation of each layer type is given below:

- **Rescaling layer**. Multiplies the input by 1/255, mapping pixel values from 0-255 to 0-1.
- **Convolutional layer**. Applies learned 3 x 3 filters to extract local features, the core operation of every CNN. Here we have 3 such layers (32, 64 and 128 filters), each followed by a pooling layer.
- **Pooling layer**. Applies max pooling to each 2 by 2 block of pixels, reducing the size of the input but retaining the strongest activations.
- **Dropout layer**. Randomly sets part of its input to 0 during training, a regularization technique that aims to reduce overfitting. Here a single dropout layer with a rate of 0.25 follows the first dense layer.
- **Dense layer**. As is common practice with CNNs, the convolutional layers culminate in a series of dense layers (just two here) that map the recognized features to the labels we want to attribute to the input.


In [None]:
from tensorflow import keras

def create_cnn_model(input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3), num_classes=20):
    model = keras.models.Sequential()
    model.add(tf.keras.Input(shape=input_shape))
    # Scales inputs by 1/255 (intended for 0-255 pixel values; note that the images
    # resized with skimage in the preparation step are already floats in [0, 1])
    model.add(tf.keras.layers.Rescaling(1.0/255))

    # Three convolution + max-pooling stages with an increasing number of filters
    model.add(tf.keras.layers.Conv2D(32, (3, 3), activation='relu'))
    model.add(tf.keras.layers.MaxPooling2D((2, 2)))

    model.add(tf.keras.layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(tf.keras.layers.MaxPooling2D((2, 2)))

    model.add(tf.keras.layers.Conv2D(128, (3, 3), activation='relu'))
    model.add(tf.keras.layers.MaxPooling2D((2, 2)))

    model.add(tf.keras.layers.Flatten())

    # Classification head: one hidden dense layer with dropout
    model.add(tf.keras.layers.Dense(512, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.25))
    # Output layer: one independent sigmoid per class (multi-label)
    model.add(tf.keras.layers.Dense(num_classes, activation='sigmoid'))

    return model


model = create_cnn_model()
model.summary()
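
The figures reported by `model.summary()` can be cross-checked by hand. The sketch below recomputes the feature-map sizes and parameter counts in plain Python, assuming the Keras defaults used in `create_cnn_model` (3 x 3 convolutions with "valid" padding and stride 1, 2 x 2 max pooling that rounds down). The helper names are illustrative only.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (rounds down)."""
    return size // pool


size, channels = 128, 3
conv_params = 0
for filters in (32, 64, 128):
    # Each filter has 3*3*channels weights plus one bias
    conv_params += (3 * 3 * channels + 1) * filters
    size = pool_out(conv_out(size))
    channels = filters

flat_features = size * size * channels        # input size of the first dense layer
dense_params = (flat_features + 1) * 512      # weights + biases of Dense(512)
output_params = (512 + 1) * 20                # weights + biases of Dense(20)
total_params = conv_params + dense_params + output_params

print(f"final feature map: {size} x {size} x {channels}")
print(f"conv: {conv_params}, dense: {dense_params}, output: {output_params}")
print(f"total: {total_params}")
```

Under these assumptions almost all parameters sit in the first dense layer, which is worth keeping in mind given how few training images are available.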

### Training the model
For training the model we used multi-label *stratified* 10-fold cross-validation, so that each fold preserves the label distribution, and checkpointing, so that we keep the weights from the epoch with the best validation f1 score. For the loss we use binary crossentropy, which penalizes the model based on how well its predicted probabilities align with the true labels. Binary crossentropy is calculated for each category separately and then averaged, essentially treating the problem as 20 independent binary classification problems.
As for metrics, accuracy is not a great metric in this case because of the imbalances in the dataset and the overall low frequency of each class. A model that predicts every label as absent would have high accuracy but be completely useless. Thus, we also track recall, precision and the aforementioned f1 score.

Recall is the fraction of true positive predictions among all actual positives, precision is the fraction of true positive predictions among all positive predictions, and the f1 score is the harmonic mean of the two.
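
The definitions above can be illustrated on a toy multi-label example. The arrays below are made up for illustration (4 images, 3 labels). The sketch also shows the difference between micro averaging (pooling the counts over all labels) and macro averaging (averaging the per-label scores).

```python
import numpy as np

# Toy ground truth and predictions: rows are images, columns are labels
y_true = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]])
y_pred = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])

# Per-label counts of true positives, false positives and false negatives
tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)

precision_per_label = tp / (tp + fp)
recall_per_label = tp / (tp + fn)
f1_per_label = 2 * tp / (2 * tp + fp + fn)

macro_f1 = f1_per_label.mean()
micro_f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())

print("precision per label:", precision_per_label)
print("recall per label:   ", recall_per_label)
print("macro f1:", round(macro_f1, 3), "micro f1:", round(micro_f1, 3))
```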

To improve the performance of the network we also used class weights to increase the impact of the under-represented labels (so every label but "person") and thereby balance the scales. With the same goal, we also used threshold moving: for each label we searched for the decision threshold at which the model performs best. Cross-validation is important here to reduce the bias introduced by this technique.
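
To give a feeling for the size of these weights, the sketch below reproduces the "balanced" heuristic, n_samples / (n_classes * class_count), for a single binary label. The counts of 207 "person" images and 27 "sheep" images are derived from the class distribution printed earlier (fraction times 749) and ignore the augmented images, so the numbers are indicative only. `balanced_positive_weight` is a hypothetical helper written for this illustration.

```python
import numpy as np


def balanced_positive_weight(y_column):
    """Weight of the positive class under the 'balanced' heuristic for a 0/1 label column."""
    y_column = np.asarray(y_column)
    n_positive = int(y_column.sum())
    # n_samples / (n_classes * count), with n_classes = 2 (absent / present)
    return len(y_column) / (2 * n_positive)


n_images = 749
person = np.zeros(n_images, dtype=int)
person[:207] = 1   # approximate count derived from the printed distribution
sheep = np.zeros(n_images, dtype=int)
sheep[:27] = 1     # approximate count derived from the printed distribution

w_person = balanced_positive_weight(person)
w_sheep = balanced_positive_weight(sheep)
print(f"person: {w_person:.2f}, sheep: {w_sheep:.2f}")
```

The rarest label ends up with a positive-class weight several times larger than that of "person", which is the intended rebalancing effect.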

In [None]:
x_train, y_train = train_imgs, train_labels

In [None]:
from keras import metrics
from keras import callbacks as cb
from sklearn.utils import class_weight
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
from sklearn.metrics import f1_score


def find_best_thresholds(y_true, y_pred_probs, thresholds):
    """For each label, return the threshold with the highest F1 score on the given data."""
    # Fall back to 0.5 for labels where no threshold yields a positive F1 score
    best_thresholds = np.full(y_true.shape[1], 0.5)
    for i in range(y_true.shape[1]):
        best_f1 = 0
        for threshold in thresholds:
            y_pred = (y_pred_probs[:, i] >= threshold).astype(int)
            f1 = f1_score(y_true[:, i], y_pred, zero_division=0)
            if f1 > best_f1:
                best_f1 = f1
                best_thresholds[i] = threshold
    return best_thresholds

n_splits = 10
total_iterations = 1
skf = MultilabelStratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
thresholds = np.linspace(0.1, 0.9, 80)
best_thresholds = np.zeros((n_splits, num_classes))
histories = []
end_results = []
avg_results = []
for iteration in range(total_iterations):
    for j, (train_index, test_index) in enumerate(skf.split(x_train, y_train)):
        X_train, X_val = x_train[train_index], x_train[test_index]
        Y_train, Y_val = y_train[train_index], y_train[test_index]

        # Keep the weights of the epoch with the best validation F1 score
        checkpoint = cb.ModelCheckpoint(
            'best_model.weights.h5',
            monitor='val_f1_score',
            save_weights_only=True,
            save_best_only=True,
            mode='max',
            verbose=0
        )

        # Fresh model and fresh metric object for every fold
        model = create_cnn_model()
        metrics_f1 = metrics.F1Score(average='micro', threshold=0.5)
        model.compile(optimizer='adam', loss='binary_crossentropy',
                      metrics=['accuracy', 'recall', 'precision', metrics_f1])

        # Per-label weight of the positive class, computed on this fold's training labels
        class_weights = np.zeros(num_classes)
        for i in range(num_classes):
            class_weights[i] = class_weight.compute_class_weight(
                class_weight='balanced',
                classes=np.array([0, 1]),
                y=Y_train[:, i]
            )[1]

        # Weight each sample by the summed weights of its labels, normalized to mean 1
        sample_weights = np.dot(Y_train, class_weights)
        sample_weights /= np.mean(sample_weights)

        history = model.fit(X_train, Y_train, sample_weight=sample_weights,
                            validation_data=(X_val, Y_val),
                            epochs=90, batch_size=64, callbacks=[checkpoint], verbose=0)
        histories.append(history)

        # Restore the best epoch and search the per-label thresholds on the validation split
        model.load_weights('best_model.weights.h5')
        val_pred_probs = model.predict(X_val)

        best_thresholds[j] = find_best_thresholds(Y_val, val_pred_probs, thresholds)


Now let's look at the cross-validation results at the default threshold of 0.5, before threshold moving:

In [None]:
# Average the final-epoch validation metrics over all folds
avg_f1 = np.mean([h.history['val_f1_score'][-1] for h in histories])
avg_precision = np.mean([h.history['val_precision'][-1] for h in histories])
avg_recall = np.mean([h.history['val_recall'][-1] for h in histories])
print("f1 score: ", avg_f1)
print("precision: ", avg_precision)
print("recall: ", avg_recall)

In [None]:
from sklearn.metrics import precision_score, recall_score

# Apply the per-label thresholds (averaged over the folds) to the last fold's validation split
val_pred_probs = model.predict(X_val)
average_best_thresholds = np.mean(best_thresholds, axis=0)
val_pred_labels = (val_pred_probs >= average_best_thresholds).astype(int)
precision_scores = [precision_score(
    Y_val[:, i], val_pred_labels[:, i], zero_division=0.0) for i in range(num_classes)]
recall_scores = [recall_score(Y_val[:, i], val_pred_labels[:, i], zero_division=0.0)
                 for i in range(num_classes)]
f1_scores = [f1_score(Y_val[:, i], val_pred_labels[:, i], zero_division=0.0)
             for i in range(num_classes)]
precision = np.mean(precision_scores)
recall = np.mean(recall_scores)
# Harmonic mean of the macro-averaged precision and recall (used for comparison; this is not a micro f1)
print((2 * recall * precision) / (recall + precision))
print(np.mean(f1_scores))  # macro f1: mean of the per-label f1 scores

One of the better training runs produced the following validation results:

val_f1_score: 0.23
val_precision: 0.32
val_recall: 0.18

From this we can conclude that even in this better run the results are not good, especially compared to the results achieved later on via transfer learning. As initially suspected, the dataset is too small to train a model from scratch effectively.

In [None]:
def plot_metrics(history):
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))

    axes[0, 0].plot(history.history['val_precision'],label='Validation precision')
    axes[0, 0].plot(history.history['precision'], label='Precision')
    axes[0, 0].set_xlabel('Epochs')
    axes[0, 0].set_ylabel('Precision')
    axes[0, 0].legend()

    axes[0, 1].plot(history.history['val_recall'], label='Validation recall')
    axes[0, 1].plot(history.history['recall'], label='Recall')
    axes[0, 1].set_xlabel('Epochs')
    axes[0, 1].set_ylabel('Recall')
    axes[0, 1].legend()

    axes[1, 0].plot(history.history['val_loss'], label='Validation Loss')
    axes[1, 0].plot(history.history['loss'], label='Loss')
    axes[1, 0].set_xlabel('Epochs')
    axes[1, 0].set_ylabel('Loss')
    axes[1, 0].legend()

    axes[1, 1].plot((history.history['val_f1_score']),
                    label='Validation f1 score')
    axes[1, 1].plot((history.history['f1_score']), label='f1 score')
    axes[1, 1].set_xlabel('Epochs')
    axes[1, 1].set_ylabel('f1 score')
    axes[1, 1].legend()

    plt.tight_layout()
    plt.show()


plot_metrics(history)

## 2.2 Training a pre-trained Classifier

## 2.2.1 ResNet50V2

### Data Augmentation and Model Modification

To prevent overfitting, data augmentation techniques were employed. The top layer of each pre-trained model was replaced with a custom architecture that included:

- A global average pooling layer
- A dropout layer with a rate of 0.5
- A dense output layer with a softmax activation function

This custom architecture aimed to improve the generalization of the models.
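
The choice of output activation matters for a multi-label task. The sketch below uses made-up logits (not taken from the trained model) for an image containing two objects and contrasts softmax with sigmoid. Because softmax outputs sum to 1, two classes that are both present cannot both exceed 0.5, which is consistent with the low decision threshold of 0.18 used in the evaluation further down. A sigmoid output, as used in the from-scratch model of Sect. 2.1, scores each class independently and is the more common choice for multi-label problems.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Toy logits: the first two classes are present, the third is absent
logits = np.array([3.0, 3.0, -3.0])
p_softmax = softmax(logits)
p_sigmoid = sigmoid(logits)

print("softmax:", np.round(p_softmax, 3), "sum =", round(float(p_softmax.sum()), 3))
print("sigmoid:", np.round(p_sigmoid, 3))
print("classes above 0.5  (softmax):", int((p_softmax > 0.5).sum()))
print("classes above 0.18 (softmax):", int((p_softmax > 0.18).sum()))
print("classes above 0.5  (sigmoid):", int((p_sigmoid > 0.5).sum()))
```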

### Transfer Learning

We performed transfer learning over 30 epochs with a batch size of 32, using the Adam optimizer. This phase involved training the new top layers while keeping the pre-trained layers frozen. Major improvements in model performance were observed during this phase.

### Fine-tuning

Fine-tuning was conducted for an additional 20 epochs with a lower learning rate of 0.0001. During this phase, the first 185 layers of the pre-trained base model were kept frozen, and only the layers from index 185 onward were unfrozen and retrained. This approach was intended to refine the model's performance further while avoiding overfitting. However, only minor improvements were noticeable in this stage.

Overall, the combination of data augmentation, a custom top layer, and strategic transfer learning and fine-tuning allowed us to achieve significant performance improvements with the pre-trained models.

In [None]:
def resize_images(images, img_size):
    list_images = list()
    for img in images:
        img_resized = resize(img, img_size)
        list_images.append(img_resized)
    return np.array(list_images)

def get_y_train():
    # Label matrix of shape (n_images, 20): one row per image, one column per class
    return train_df[labels].to_numpy()

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import keras
import time
import os
from sklearn.model_selection import train_test_split
from tensorflow.keras.applications import ResNet50V2
from tensorflow.keras import layers
import tensorflow as tf
from skimage.transform import resize

### PARAMETERS

NUM_CLASSES = 20
IMAGE_SIZE = (224, 224, 3)

datagen = ImageDataGenerator(
    rotation_range=90,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True)

epochs_pretraining = 30
epochs_fine_tuning = 20 
tuning_index = 185
BATCH_SIZE = 32

### LOAD DATA

X_train = resize_images(train_df["img"], IMAGE_SIZE)
X_test = resize_images(test_df["img"], IMAGE_SIZE)
y_train = get_y_train()

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.2, random_state = 4)

### LOAD MODEL AND EDIT TOP CLF LAYER
    
resnet_model = ResNet50V2(weights='imagenet', include_top=False, pooling='avg', input_shape=IMAGE_SIZE)
resnet_model.trainable = False
        
clf_model = tf.keras.Sequential() 
clf_model.add(resnet_model)
clf_model.add(layers.Dropout(0.5))
clf_model.add(layers.Dense(NUM_CLASSES, activation='softmax')) #output layer with NUM_CLASSES
    
### PRETRAINING AND TRANSFER LEARNING
    
clf_model.compile(optimizer='Adam', 
                  loss='binary_crossentropy', 
                  metrics=[tf.keras.metrics.BinaryAccuracy()])
    
history_pre_train = clf_model.fit(datagen.flow(X_train, y_train, batch_size=BATCH_SIZE), 
                                  validation_data=(X_val, y_val), 
                                  epochs=epochs_pretraining)
    
    
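### FINE TUNING

# Re-enable training on the base model first: while resnet_model.trainable is False,
# none of its weights are updated, whatever the per-layer flags say
resnet_model.trainable = True
# Keep the first tuning_index layers frozen; the remaining layers are fine-tuned
for layer in resnet_model.layers[:tuning_index]:
    layer.trainable = False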


opt = tf.keras.optimizers.Adam(learning_rate=0.0001)
clf_model.compile(optimizer=opt,
                  loss='binary_crossentropy', 
                  metrics=[tf.keras.metrics.BinaryAccuracy()]) 


history_fine_tune = clf_model.fit(datagen.flow(X_train, y_train, batch_size=BATCH_SIZE),
                                  validation_data= (X_val, y_val),
                                  epochs=epochs_fine_tuning)

## Plotting Training Curves for Transfer Learning and Fine-Tuning

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

# Left: training and validation accuracy during transfer learning (frozen base model)
ax1.plot(history_pre_train.history['binary_accuracy'], 'r')
ax1.plot(history_pre_train.history['val_binary_accuracy'], 'b')
ax1.set_xlabel("Epochs")
ax1.set_ylabel("Binary accuracy")
ax1.set_title('Transfer learning')
ax1.set_xticks(np.arange(0, epochs_pretraining, 1.0))

# Right: training and validation accuracy during fine-tuning
ax2.plot(history_fine_tune.history['binary_accuracy'], 'r')
ax2.plot(history_fine_tune.history['val_binary_accuracy'], 'b')
ax2.set_xlabel("Epochs")
ax2.set_ylabel("Binary accuracy")
ax2.set_title('Fine-tuning')
ax2.set_xticks(np.arange(0, epochs_fine_tuning, 1.0))

# Add legends
ax1.legend(['Train', 'Validation'])
ax2.legend(['Train', 'Validation'])

plt.show()

## Evaluation of Precision, Recall and F-measure on Validation Set

In [None]:
### EVALUATE PRECISION, RECALL AND F-MEASURE ON VALIDATION SET

def calculate_f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0
    return 2 * (precision * recall) / (precision + recall)

threshold = 0.18

pred_y_val = clf_model.predict(X_val)
pred_y_val = np.where(pred_y_val > threshold, 1, 0)

# Micro-averaged counts: pooled over all validation images and all labels
TP = np.sum(np.logical_and(y_val == 1, pred_y_val == 1))
FP = np.sum(np.logical_and(y_val == 0, pred_y_val == 1))
FN = np.sum(np.logical_and(y_val == 1, pred_y_val == 0))

precision = TP / (TP + FP) if (TP + FP) > 0 else 0
recall = TP / (TP + FN) if (TP + FN) > 0 else 0
# Named f_measure so that sklearn's f1_score function imported earlier is not overwritten
f_measure = calculate_f1_score(precision, recall)

result_str = "F-measure:\t" + str(round(f_measure, 3)) + "\nPrecision:\t" + str(round(precision, 3)) + "\nRecall:\t\t" + str(round(recall, 3))
print(result_str)

### Results on Validation Data Set

Threshold: 0.18

- F-Measure: 0.74
- Precision: 0.79
- Recall:    0.70
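
As a consistency check, the reported F-measure follows from the reported precision and recall through the harmonic mean. This is plain arithmetic on the rounded values above.

```python
precision, recall = 0.79, 0.70

# Harmonic mean of precision and recall
f_measure = 2 * precision * recall / (precision + recall)
print(round(f_measure, 2))
```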

## 2.2.2 InceptionV3


We utilize the pre-trained InceptionV3 as the base model. The fully connected layers responsible for classification are removed from the pre-trained model. An image generator is created to perform data augmentation techniques, such as rotation, shifting, and flipping, on the training images. This helps to increase the diversity of the training data and improve the model's generalization ability.

In the initial training phase, the pre-trained layers are frozen, meaning their weights are not updated during this stage. Only the newly added dense layers are trained using the specified optimizer and binary cross-entropy loss function. This step allows the model to learn the task-specific features while preserving the general-purpose features learned by the pre-trained model.

After the initial training phase, a specified number of layers from the pre-trained InceptionV3 model are unfrozen, allowing their weights to be updated during the subsequent training phase. This fine-tuning process enables the model to adapt the pre-trained features to the specific task at hand. To facilitate this fine-tuning, a lower learning rate is employed, using the Stochastic Gradient Descent (SGD) optimizer with a momentum term.

To prevent overfitting and ensure optimal model performance, an early stopping mechanism is implemented. The training process is monitored based on the validation loss, and if the loss does not improve for a specified number of epochs (patience), the training is halted. The model weights corresponding to the best validation performance are restored and saved.
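
The patience rule described above can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not the Keras implementation: the function `early_stopping_epoch` is a hypothetical helper, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Simulate patience-based early stopping on a list of validation losses.

    Returns (stop_epoch, best_epoch) with 0-based epoch indices.
    stop_epoch is None if training would run to the end.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # Training halts here; the weights from best_epoch are the ones to restore
                return epoch, best_epoch
    return None, best_epoch

# Made-up validation losses: the best value is reached at epoch 2
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71]
print(early_stopping_epoch(losses, patience=3))  # (5, 2)
print(early_stopping_epoch(losses, patience=5))  # (None, 2)
```

A larger patience tolerates noisy validation curves at the cost of extra epochs, which is the trade-off behind choosing its value.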

If early stopping does not trigger during a training round, the epoch budget is doubled and training is repeated, as long as the budget stays within the maximum of 150 epochs (so 50 epochs first, then 100). The process ends as soon as early stopping occurs or the next budget would exceed the maximum, whichever comes first.

This model reached a score of 0.37210 when submitted for classification only.

In [None]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras import layers
import random
from skimage.transform import resize
import time

NUM_CLASSES = 20
IMAGE_SIZE = (224,224,3)

train_images = train_df["img"]
train_labels = train_df.drop(columns=["img", "seg"]).values
test_images = resize_images(test_df["img"], IMAGE_SIZE)
y_train = get_y_train()

# Print shapes of training data
train_labels_combined = train_labels.reshape(train_labels.shape[0], -1)
print("Shape of train_images:", train_images.shape)
print("Shape of train_labels_combined:", train_labels_combined.shape)

# Stratify data
stratify_column = train_labels_combined[:, 0]
X_train, X_val, y_train, y_val = train_test_split(train_images, train_labels_combined, test_size=0.2, random_state=42, stratify=stratify_column)

# Resize images
X_train_resized = resize_images(X_train, IMAGE_SIZE)
X_val_resized = resize_images(X_val, IMAGE_SIZE)

# Data augmentation
data_gen = ImageDataGenerator(rotation_range=90, width_shift_range=0.2, height_shift_range=0.2, horizontal_flip=True)

# Load pre-trained model
base_model = InceptionV3(weights='imagenet', include_top=False, input_shape=IMAGE_SIZE)

# Define and compile the model
tf_model = tf.keras.Sequential([
    base_model,
    layers.Flatten(),
    layers.Dense(1024, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(NUM_CLASSES, activation='sigmoid')
])

for layer in base_model.layers:
    layer.trainable = False

optimizer = 'rmsprop'
tf_model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=[tf.keras.metrics.BinaryAccuracy()])

# Define early stopping callback
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train the model
initial_epochs = 50
max_epochs = 150

while initial_epochs <= max_epochs:
    print(f"Training for {initial_epochs} epochs with optimizer {optimizer}")

    tf_model.fit(data_gen.flow(X_train_resized, y_train, batch_size=32),
                 validation_data=(X_val_resized, y_val),
                 epochs=initial_epochs,
                 callbacks=[early_stopping])

    for layer in base_model.layers[:200]:
        layer.trainable = False
    for layer in base_model.layers[200:]:
        layer.trainable = True

    tf_model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
                     loss='binary_crossentropy',
                     metrics=[tf.keras.metrics.BinaryAccuracy()])

    start_time = time.time()
    tf_model.fit(data_gen.flow(X_train_resized, y_train, batch_size=32),
                 validation_data=(X_val_resized, y_val),
                 epochs=initial_epochs,
                 callbacks=[early_stopping])
    elapsed_time = time.time() - start_time
    print("Elapsed time: %0.2f seconds." % elapsed_time)


    y_test_pred_tf = tf_model.predict(test_images)
    test_df.loc[:, labels] = np.where(y_test_pred_tf > 0.18, 1, 0)

    generate_submission(test_df, "classification.csv")

    if early_stopping.stopped_epoch != 0:
        break
    else:
        initial_epochs *= 2

In [None]:
del base_model
del tf_model
del clf_model
del model
del data_gen
del augmented_imgs 
del augmented_labels 

# 3. Semantic segmentation
The goal here is to implement a segmentation CNN that labels every pixel in the image as belonging to one of the 20 classes (and/or background). Use the training set to train your CNN and compete on the test set (by filling in the segmentation column in the test dataframe).

Our initial dataset does not contain enough observations for a task as complex as segmentation,
which is why we enriched it with augmented copies of the original data.
Some augmentations (e.g. rotations) must also be applied to the masks, while others (e.g. adding noise to the images) leave the masks unchanged.
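
The difference between the two kinds of augmentation can be checked on a toy example. The sketch below is illustrative only: `geometric_example` and `photometric_example` are hypothetical stand-ins for the augmentation functions defined in the following cells, and the 4x4 image is synthetic.

```python
import numpy as np

def geometric_example(img, mask):
    """Horizontal flip: pixels move, so the mask must move with them."""
    return np.fliplr(img), np.fliplr(mask)

def photometric_example(img, mask, delta=0.1):
    """Brightness shift: values change but positions do not, so the mask is reused."""
    return np.clip(img + delta, 0.0, 1.0), mask

# Toy 4x4 image with one "object" column on the left, labelled as class 3
img = np.zeros((4, 4, 3), dtype=np.float32)
img[:, 0] = 0.5
mask = np.zeros((4, 4), dtype=np.uint8)
mask[:, 0] = 3

flipped_img, flipped_mask = geometric_example(img, mask)
bright_img, bright_mask = photometric_example(img, mask)

# After the flip, the object and its label are both in the right-most column
print(flipped_mask[0])                    # [0 0 0 3]
# After the brightness shift, the label map is untouched
print(np.array_equal(bright_mask, mask))  # True
```

If a geometric transform were applied to the image alone, the labels would no longer line up with the pixels they describe, and the network would be trained on wrong targets.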

In [None]:
# GLOBAL VARIABLES FOR THE SEGMENTATION MODEL
# OOM ISSUES WHEN BIGGER SIZES TESTED
IMG_TRAIN_SIZE = (224, 224)
IMG_TRAIN_SHAPE= [224, 224, 3]
NORMALIZE_CONST = 255

In [None]:
# Shearing stretches an image along an axis
# relative to a reference point
def shear_image(img, mask, shear_factor=0.2):
    # transform img back to [0,255]
    img_pil = Image.fromarray((img *NORMALIZE_CONST).astype(np.uint8))
    mask_pil = Image.fromarray(mask)
    # affine coefficients (a, b, c, d, e, f): each output pixel (x, y)
    # is sampled from the input position (a*x + b*y + c, d*x + e*y + f),
    # so x is shifted in proportion to y (horizontal shear)
    # and the y coordinate is left unchanged
    transformation_matrix = [
        1, shear_factor, 0,
        0, 1, 0]
    # apply the shear
    # Bicubic interpolation for a smooth result
    # Nearest for preserving mask labels
    sheared_img = img_pil.transform(img_pil.size, Image.AFFINE, transformation_matrix, Image.BICUBIC)
    sheared_mask = mask_pil.transform(mask_pil.size, Image.AFFINE, transformation_matrix, Image.NEAREST)
    # transform imgs back to [0,1]
    sheared_img = np.array(sheared_img) / NORMALIZE_CONST
    sheared_mask = np.array(sheared_mask)
    return sheared_img, sheared_mask

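# rotate img 135 degrees to create irregular shapes
def rotate_image_135(img, mask):
    # Rotate the image and mask by 135 degrees
    # order=0 (nearest neighbour) keeps the mask labels valid;
    # mode='nearest' only controls how the borders are filled
    rotated_img = scipy.ndimage.rotate(img, 135, reshape=False, mode='nearest')
    rotated_mask = scipy.ndimage.rotate(mask, 135, reshape=False, mode='nearest', order=0)
    # spline interpolation can overshoot, so clip the image back to [0,1]
    rotated_img = np.clip(rotated_img, 0, 1)
    # transform to PIL Image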
    rotated_img_pil = Image.fromarray((rotated_img * NORMALIZE_CONST).astype(np.uint8))
    rotated_mask_pil = Image.fromarray(rotated_mask)
    # Return the numpy arrays normalized
    rotated_img_resized = np.array(rotated_img_pil.resize(IMG_TRAIN_SIZE, Image.NEAREST)) / NORMALIZE_CONST
    rotated_mask_resized = np.array(rotated_mask_pil.resize(IMG_TRAIN_SIZE, Image.NEAREST))
    return rotated_img_resized, rotated_mask_resized

# Add delta to [0,1] images
def brightness(img, mask, delta=0.1):
    # clip to return correct images
    img = np.clip(img + delta, 0, 1)
    return img, mask

# Add random noise of mean 0 and var 0.01
def random_noise(img, mask, mean=0, var=0.01, smooth=False, sigma_smooth=1):
    sigma = var ** 0.5
    gaussian = np.random.normal(mean, sigma, img.shape)
    noisy_img = img + gaussian
    # Clip to maintain [0,1] range
    noisy_img = np.clip(noisy_img, 0, 1)
    return noisy_img, mask

# Do gamma adjustment for augmentation
def gamma(img, mask, gamma=0.5):
    # Correct gamma adjustment for normalized images
    img = np.clip(img ** gamma, 0, 1)  # Ensure gamma output is within 0 and 1
    return img, mask

# Change hue
def hue(img, mask, hue_shift=-0.1):
    # convert to [0,255]
    img_pil = Image.fromarray((img * NORMALIZE_CONST).astype(np.uint8))
    # Convert to hsv to change it
    img_hsv = img_pil.convert('HSV')
    np_img = np.array(img_hsv)
    # Perform the hue shift
    np_img[..., 0] = (np_img[..., 0].astype(int) + int(hue_shift * NORMALIZE_CONST)) % 256
    # Convert back to rgb, normalize and return
    img_pil = Image.fromarray(np_img, 'HSV').convert('RGB')
    img = np.array(img_pil) / NORMALIZE_CONST
    return img, mask

# zoom
def crop(img, mask, crop_fraction=0.7):
    # Start/end coords of crop
    start_crop = int((1 - crop_fraction) * img.shape[0] / 2)
    end_crop = int(start_crop + crop_fraction * img.shape[0])
    # Crop
    img_cropped = img[start_crop:end_crop, start_crop:end_crop]
    mask_cropped = mask[start_crop:end_crop, start_crop:end_crop]
    # Resize back to correct dims
    img_resized = np.array(Image.fromarray((img_cropped * NORMALIZE_CONST).astype(np.uint8)).resize(IMG_TRAIN_SIZE, Image.LANCZOS)) / 255.0
    mask_resized = np.array(Image.fromarray(mask_cropped).resize(IMG_TRAIN_SIZE, Image.NEAREST))
    return img_resized, mask_resized

# Flip horizontally
def flip_hori(img, mask):
    return np.fliplr(img), np.fliplr(mask)

# Flip vertically
def flip_vert(img, mask):
    # Flipping does not alter the range or data type
    return np.flipud(img), np.flipud(mask)

# Rotate 90 degrees
# k is times to rotate
def rotate(img, mask, k=1):
    return np.rot90(img, k), np.rot90(mask, k)

In [None]:
ORIGINAL_DATASET_SIZE = train_df.shape[0]
NUM_OF_AUGMENTATIONS = 9
print(f"Original dataset size: {ORIGINAL_DATASET_SIZE}")

In [None]:
# Functions to perform the above augmentations
def create_aug_data(df, shape, train_run = False):
    # Initialize the numpy arrays that store the data
    train_imgs = np.zeros((ORIGINAL_DATASET_SIZE*(NUM_OF_AUGMENTATIONS+1), shape[0], shape[0], 3), dtype=np.float32)
    train_masks = np.zeros((ORIGINAL_DATASET_SIZE*(NUM_OF_AUGMENTATIONS+1), shape[0], shape[0]), dtype=np.uint8)
    # tqdm to show progress, zip to enumerate
    for i, (img, mask) in enumerate(tqdm(zip(df["img"], df["seg"]), total=len(df))):
        # Resize imgs/masks
        img = cv2.resize(img, shape)
        mask = cv2.resize(mask, shape, interpolation=cv2.INTER_NEAREST)
        # Normalize image to range [0, 1]
        img = img.astype(np.float32) / NORMALIZE_CONST
        train_imgs[i] = img
        train_masks[i] = mask
        # Augment only the training data, not the test data
        if train_run:
            # Apply each augmentation function
            for j, func in enumerate([gamma, brightness, hue, crop, flip_hori, flip_vert, random_noise, shear_image, rotate]):
                aug_img, aug_mask = func(img, mask)
                # Set it to the correct place in the array
                train_imgs[i + ORIGINAL_DATASET_SIZE * (j + 1)] = aug_img
                train_masks[i + ORIGINAL_DATASET_SIZE * (j + 1)] = aug_mask
    return train_imgs, train_masks

In [None]:
# Create imgs, masks
train_imgs, train_masks = create_aug_data(train_df, shape=IMG_TRAIN_SIZE, train_run=True)
# Keep only originals
test_imgs, _ = create_aug_data(test_df, shape=IMG_TRAIN_SIZE, train_run = False)
test_imgs = test_imgs[0:750]
# Check that it works
assert train_imgs.shape[0] == ORIGINAL_DATASET_SIZE*(NUM_OF_AUGMENTATIONS +1)

In [None]:
# CHECK MANUALLY THAT THE AUGMENTATIONS WORK
for i in range(1, 10):
    ind = i * ORIGINAL_DATASET_SIZE
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
    # Show first image
    axes[0].imshow(train_imgs[ind])
    axes[0].set_title("Augmentation")
    # Show second image
    axes[1].imshow(train_masks[ind])
    axes[1].set_title("Mask")
  
    plt.tight_layout()
    plt.show()

In [None]:
# Verify that the images are in [0,1] after augmentations
assert np.max(train_imgs) == 1
assert np.min(train_imgs) == 0
assert np.max(test_imgs) == 1
assert np.min(test_imgs) == 0

In [None]:
# Define Dice Coefficient
def dice_coefficient(y_true, y_pred, smooth=1e-6):
    # make masks one hot
    y_true = tf.one_hot(tf.cast(y_true, tf.int32), depth=tf.shape(y_pred)[-1])
    y_true = tf.keras.backend.flatten(y_true)
    y_pred = tf.keras.backend.flatten(y_pred)
    intersection = tf.keras.backend.sum(y_true * y_pred)
    return (2. * intersection + smooth) / (tf.keras.backend.sum(y_true) + tf.keras.backend.sum(y_pred) + smooth)

# Define Dice Loss as 1 - coeff
def dice_loss(y_true, y_pred):
    return 1 - dice_coefficient(y_true, y_pred)

# Update keras so it is serializable
tf.keras.utils.get_custom_objects().update({
    'dice_coefficient': dice_coefficient,
    'dice_loss': dice_loss})

In [None]:
# Function to calculate the dominant class, excluding background
def dominant_class_excluding_background(mask, background_class=0):
    # Filter out the background class
    filtered_mask = mask[mask != background_class]
    # If only background exists, return it
    if filtered_mask.size == 0:
        return background_class
    # else return the biggest mask class
    return np.argmax(np.bincount(filtered_mask))

# This will be needed to stratify training and validation
strat_labels = np.array([dominant_class_excluding_background(mask, 0) for mask in tqdm(train_df["seg"])])

In [None]:
# Calculate pixel weights for each img
def calculate_pixel_weights(masks, num_classes=21):
    n_samples, height, width = masks.shape
    pixel_weights = np.zeros_like(masks, dtype=float)
    for i in tqdm(range(n_samples)):
        mask = masks[i]
        class_weights = np.zeros(num_classes)
        # Calculate class frequencies
        class_counts = np.bincount(mask.flatten(), minlength=num_classes)
        # Set weight to 0 for classes not present in the mask
        # np.errstate is used to ignore division by zero and
        # invalid operations (when class_counts is zero).
        with np.errstate(divide='ignore', invalid='ignore'):
            class_weights = 1.0 / class_counts
            class_weights[class_counts == 0] = 0
        # Normalize weights to sum to 1
        total_weight = np.sum(class_weights[class_counts > 0])
        if total_weight > 0:
            class_weights[class_counts > 0] /= total_weight
        pixel_weights[i] = class_weights[mask]
    return pixel_weights

weights = calculate_pixel_weights(train_masks, num_classes=21)

As a starting point we used the model from https://www.tensorflow.org/tutorials/images/segmentation, which has a MobileNetV2 encoder, and added dropout to it.
We also trained this model with an EfficientNetB3 backbone, but the results were not optimal, probably because that backbone is too heavy. As a result, we tried different architectures.
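
Because the encoder halves the resolution at each stage, the sizes of the skip-connection feature maps depend on the input size. The short sketch below computes them under the assumption that the five encoder outputs used in the tutorial sit at cumulative strides 2, 4, 8, 16 and 32 (which matches the tutorial's 64x64 down to 4x4 for a 128x128 input). The helper `feature_map_sizes` is hypothetical, and the integer division is only exact when the input size is divisible by 32.

```python
def feature_map_sizes(input_size, strides=(2, 4, 8, 16, 32)):
    """Spatial size of each encoder output for a square input image."""
    return [input_size // s for s in strides]

# Our 224x224 training size
print(feature_map_sizes(224))  # [112, 56, 28, 14, 7]
# The tutorial's 128x128 input
print(feature_map_sizes(128))  # [64, 32, 16, 8, 4]
```

With a 224x224 input, the decoder therefore has to upsample from 7x7 back to 224x224 in five steps of factor 2.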

In [None]:
from tensorflow_examples.models.pix2pix import pix2pix
from tensorflow.keras.regularizers import l2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_TRAIN_SHAPE, include_top=False)

# We used the activations of these layers
# (feature-map sizes are for our 224x224 input)
layer_names = [
    'block_1_expand_relu',   # 112x112
    'block_3_expand_relu',   # 56x56
    'block_6_expand_relu',   # 28x28
    'block_13_expand_relu',  # 14x14
    'block_16_project',      # 7x7
]
base_model_outputs = [base_model.get_layer(name).output for name in layer_names]

# Create the feature extraction model
down_stack = tf.keras.Model(inputs=base_model.input, outputs=base_model_outputs)
# Freeze the encoder and keep the default ImageNet weights
down_stack.trainable = False

# upsampling list
up_stack = [
    pix2pix.upsample(512, 3),  # 7x7 -> 14x14
    pix2pix.upsample(256, 3),  # 14x14 -> 28x28
    pix2pix.upsample(128, 3),  # 28x28 -> 56x56
    pix2pix.upsample(64, 3),   # 56x56 -> 112x112
]

# Define model
def unet_model_tf(input_shape, num_classes:int, dropout_rate=0.1):
  inputs = tf.keras.layers.Input(shape=input_shape)
  # Downsampling through the model
  skips = down_stack(inputs)
  x = skips[-1]
  skips = reversed(skips[:-1])
  # Upsampling and establishing the skip connections
  for up, skip in zip(up_stack, skips):
    if dropout_rate > 0:
        x = tf.keras.layers.Dropout(dropout_rate)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation('relu')(x)
    x = up(x)
    concat = tf.keras.layers.Concatenate()
    x = concat([x, skip])
  # This is the last layer of the model
  # No softmax here: the model outputs logits
  last = tf.keras.layers.Conv2DTranspose(
      filters=num_classes, kernel_size=3, strides=2,
      padding='same')  # 112x112 -> 224x224
  x = last(x)
  return tf.keras.Model(inputs=inputs, outputs=x)

In [None]:
# Callback we created to display a train and a val img and mask every 7th epoch
class DisplayCallback(tf.keras.callbacks.Callback):
    def __init__(self, train_data, train_masks, val_data, val_masks, num_examples=2):
        super().__init__()
        # Get data
        self.train_data = train_data
        self.train_masks = train_masks
        self.val_data = val_data
        self.val_masks = val_masks
        # Number to show
        self.num_examples = num_examples

    # trigger
    def on_epoch_end(self, epoch, logs=None):
        # Display images
        self.display_images(epoch)

    # function to call n display
    def display_images(self, epoch):
        # Every 7th epoch
        if epoch % 7 == 0 :
            # Predict masks
            train_predictions = self.model.predict(self.train_data[:self.num_examples])
            val_predictions = self.model.predict(self.val_data[:self.num_examples])
            # Create plots
            fig, axs = plt.subplots(self.num_examples, 6, figsize=(10, 5 * self.num_examples))  # Adjust subplot size
            for i in range(self.num_examples):
                # Training image
                axs[i, 0].imshow(self.train_data[i])
                axs[i, 0].title.set_text(f'Train Image {i+1}')
                axs[i, 0].axis('off')
                # Train pred mask
                axs[i, 1].imshow(np.argmax(train_predictions[i], axis=-1), cmap='gray')
                axs[i, 1].title.set_text(f'Predicted Train Mask {i+1}')
                axs[i, 1].axis('off')
                # Train mask
                axs[i, 2].imshow(self.train_masks[i], cmap='gray')
                axs[i, 2].title.set_text(f'Actual Train Mask {i+1}')
                axs[i, 2].axis('off')

                # Same for val
                axs[i, 3].imshow(self.val_data[i])
                axs[i, 3].title.set_text(f'Val Image {i+1}')
                axs[i, 3].axis('off')

                axs[i, 4].imshow(np.argmax(val_predictions[i], axis=-1), cmap='gray')
                axs[i, 4].title.set_text(f'Predicted Val Mask {i+1}')
                axs[i, 4].axis('off')

                axs[i, 5].imshow(self.val_masks[i], cmap='gray')
                axs[i, 5].title.set_text(f'Actual Val Mask {i+1}')
                axs[i, 5].axis('off')
            # Set tight layout to conserve space
            plt.tight_layout()
            plt.show()

In [None]:
# Vanilla not pretrained Unet
# Got 77% training accuracy so the model underfitted
def build_unet(input_shape, num_classes=21):
    inputs = Input(input_shape)

    # Encoder
    conv1 = Conv2D(64, 3, activation='relu', padding='same')(inputs)
    conv1 = Conv2D(64, 3, activation='relu', padding='same')(conv1)
    pool1 = MaxPooling2D(pool_size=(2, 2))(conv1)

    conv2 = Conv2D(128, 3, activation='relu', padding='same')(pool1)
    conv2 = Conv2D(128, 3, activation='relu', padding='same')(conv2)
    pool2 = MaxPooling2D(pool_size=(2, 2))(conv2)

    # Decoder
    up6 = Conv2D(64, 2, activation='relu', padding='same')(UpSampling2D(size=(2, 2))(pool2))
    merge6 = concatenate([conv2, up6], axis=3)
    conv6 = Conv2D(64, 3, activation='relu', padding='same')(merge6)
    conv6 = Conv2D(64, 3, activation='relu', padding='same')(conv6)

    up7 = Conv2D(32, 2, activation='relu', padding='same')(UpSampling2D(size=(2, 2))(conv6))
    merge7 = concatenate([conv1, up7], axis=3)
    conv7 = Conv2D(32, 3, activation='relu', padding='same')(merge7)
    conv7 = Conv2D(32, 3, activation='relu', padding='same')(conv7)

    outputs = Conv2D(num_classes, 1, activation=None)(conv7)
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    return model

In [None]:
del train_df

In [None]:
# Initialize cross-validation
skf = StratifiedKFold(n_splits=5)
# Loop over each fold
for fold_idx, (train_idx, val_idx) in enumerate(skf.split(train_imgs[:ORIGINAL_DATASET_SIZE], strat_labels)):
    # set to use only first fold to prevent OOM
    if fold_idx == 0:
        # Save indices to extend from
        original_train_idx = train_idx.copy()
        original_val_idx = val_idx.copy()
        # Calculate new indices for augmented data and add them to the original indices
        for i in range(1, NUM_OF_AUGMENTATIONS + 1):
            augmented_train_idx = original_train_idx + i * ORIGINAL_DATASET_SIZE
            augmented_val_idx = original_val_idx + i * ORIGINAL_DATASET_SIZE
            train_idx = np.concatenate((train_idx, augmented_train_idx))
            val_idx = np.concatenate((val_idx, augmented_val_idx))
        # Start training
        print(f"Training fold {fold_idx + 1}")
        # Split into training and validation sets
        X_train, X_val = train_imgs[train_idx], train_imgs[val_idx]
        y_train, y_val = train_masks[train_idx], train_masks[val_idx]
        # Create and compile the model
        model = unet_model_tf(input_shape=IMG_TRAIN_SHAPE, num_classes=21)
        model.compile(
                  optimizer=Adam(learning_rate=1e-4),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
        # Define callbacks
        # Display one seen above
        display_callback = DisplayCallback(X_train[:2], y_train[:2], X_val[:2], y_val[:2], num_examples=2)
        # Model save checkpoint
        checkpoint_cb_dice = ModelCheckpoint(
            f"best_unet_acc_fold_{fold_idx + 1}.keras",
            save_best_only=True,
            monitor='val_accuracy',
            mode='max')  # Saves the model with the highest validation accuracy
        # Reduce LR on plateau
        reduce_lr_cb = ReduceLROnPlateau(
            monitor='val_accuracy',
            factor=1/np.sqrt(10),
            patience=5,
            mode='max',
            verbose=1)
        # Early stopping callback
        # we chose 5, 13 for the respective callbacks to enable some time
        # for the model to train, to care for noise and local minima oscillations
        early_stopping_cb = EarlyStopping(
            monitor='val_accuracy',
            patience=13,
            mode='max',
            restore_best_weights=True,
            verbose=1)

        # Train the model
        model.fit(
            X_train, y_train,
            validation_data=(X_val, y_val),
            epochs=100,
            sample_weight = weights[train_idx],
            callbacks=[checkpoint_cb_dice, reduce_lr_cb, early_stopping_cb, display_callback],
            batch_size=8)

In [None]:
# If the model was not trained in this session, rebuild it and load the saved weights
if "model" not in globals() or model is None:
    model = unet_model_tf(input_shape = (224,224,3), num_classes=21)
    model.load_weights("/kaggle/working/best_unet_acc_fold_1.keras")
# Predict test masks
predicted_masks = model.predict(test_imgs, batch_size = 5)

In [None]:
# Create new df to get to submission
# Takes original df and the masks
def create_new_df(df, masks):
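    # Resize masks back to the original image size
    for i, mask in enumerate(tqdm(masks)):
        shape1 = df["img"][i].shape[0]
        shape2 = df["img"][i].shape[1]
        # interpolation must be passed by keyword: the third positional
        # argument of cv2.resize is dst, not the interpolation flag
        temp_mask = cv2.resize(mask, (shape2, shape1), interpolation=cv2.INTER_NEAREST)
        # collapse the class channels into a single-channel label mask
        df.at[i,"seg"] = np.argmax(temp_mask, -1)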
    return df

In [None]:
# Careful: here we export only the segmentation results
test_sub = create_new_df(test_df, predicted_masks)
sub_df = generate_submission(test_sub, "seg.csv")

In [None]:
# This code block allows us to do classification
# based on the predicted masks. We keep the classes
# that cover more than `threshold` pixels of the mask
def find_values_above_threshold(mask, threshold, background_class=0):
    # Find unique pixel values and their counts
    unique_values, value_counts = np.unique(mask, return_counts=True)
    # Filter values above the threshold
    values_above_threshold = unique_values[value_counts > threshold]
    # Drop the background class explicitly (it is not always present)
    return values_above_threshold[values_above_threshold != background_class]


# update dataframe with classes from predicted masks
# from seg model
def update_with_classes(df, class_lists):
    for i, listt in enumerate(class_lists):
        # set all class cols to 0
        df.iloc[i,:20] = 0
        for element in listt:
            # if class in list set the respective column to 1
            df.iloc[i, element-1] = 1
    return df

In [None]:
# Run this block of code if you want to use the segm model as class predictor
'''
class_lists = []
for i, mask in enumerate(tqdm(test_try["seg"])):
  class_lists.append(find_values_above_threshold(mask, 1000))
test_try = update_with_classes(test_try, class_lists)
funny_sub_df = generate_submission(test_try)
'''

## Submit to competition
You don't need to edit this section. Just use it at the right position in the notebook. See the definition of this function in Sect. 1.3 for more details.

In [None]:
# function to export the .csv from the two csvs, one from each subteam
def merge_and_export(clasPath='/kaggle/working/classification.csv', segPath='/kaggle/working/seg.csv', exportedName="submission.csv"):
    df1 = pd.read_csv(clasPath, index_col=0)
    df2 = pd.read_csv(segPath,  index_col=0)
    merged_df = pd.merge(df1, df2, on='Id', how='inner')
    merged_df['Predicted_x'] = merged_df['Predicted_x'].fillna(merged_df['Predicted_y'])
    merged_df = merged_df.rename(columns={"Predicted_x": "Predicted"})
    merged_df = merged_df.drop(columns=['Predicted_y'])
    merged_df.to_csv(exportedName)

In [None]:
merge_and_export()

# 4. Adversarial attack
For this part, your goal is to fool your classification and/or segmentation CNN, using an *adversarial attack*. More specifically, the goal is to build a CNN to perturb test images in a way that (i) they look unperturbed to humans; but (ii) the CNN classifies/segments these images in line with the perturbations.
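
To make the objective concrete, the sketch below runs a targeted attack on a toy linear softmax classifier in plain NumPy. It is not the perturbation CNN used in this notebook: the weights `W`, the input `x`, the helper `targeted_perturbation` and all hyperparameters are assumptions chosen for illustration. The perturbation `delta` is found by gradient descent on a differentiable loss, the cross-entropy towards the target class plus an L2 penalty that keeps the perturbation small.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def targeted_perturbation(W, x, target, reg=1.0, lr=0.2, steps=200):
    """Minimise cross_entropy(softmax(W @ (x + delta)), target) + reg * ||delta||^2."""
    delta = np.zeros_like(x)
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    for _ in range(steps):
        p = softmax(W @ (x + delta))
        # Gradient of the cross-entropy w.r.t. the input, plus the L2 penalty gradient
        grad = W.T @ (p - onehot) + 2.0 * reg * delta
        delta -= lr * grad
    return delta

# Toy classifier with two classes and two input features
W = np.array([[2.0, 0.0], [0.0, 2.0]])
x = np.array([0.6, 0.4])  # classified as class 0

delta = targeted_perturbation(W, x, target=1)
print(np.argmax(W @ x), np.argmax(W @ (x + delta)))  # 0 1
```

The key point is that both terms of the loss are differentiable with respect to the perturbation. A loss built on hard predictions (an arg max followed by an accuracy) gives no gradient signal towards the target class.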

In [None]:
# get all bike imgs indexes
class_bike_images_indices = [i for i in range(len(train_masks[:749])) if np.isin(train_masks[i], 14).any()]
# The number to be added incrementally
increment = 749
# number of augs
times = 9
# list to put augmented indices
extended_bike_images_indices = class_bike_images_indices.copy()
# Generate new lists and append them
for i in tqdm(range(1, times + 1)):
    # Create a new list by adding the incremental number
    new_list = [x + i * increment for x in class_bike_images_indices]
    # Append the new list to the extended list
    extended_bike_images_indices += new_list

# Get filtered imgs and masks
filtered_images = train_imgs[extended_bike_images_indices]
filtered_masks = train_masks[extended_bike_images_indices]
# set every pixel where its not a bike to 0
filtered_masks = np.where(filtered_masks == 14, filtered_masks, 0)
# set every bike pixel to bicycle
filtered_masks[filtered_masks==14] = 2

In [None]:
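NUM_OF_AUGMENTATIONS = 9
# number of original (non-augmented) images that contain class 14
total_images = len(class_bike_images_indices)
# Similar code to before to stratify the imgs for training the adversarial model
# Use the stratification labels of the selected images, not of the first images in the dataset
bike_strat_labels = strat_labels[class_bike_images_indices]
skf = StratifiedKFold(n_splits=5)
for fold_idx, (train_idx, val_idx) in enumerate(skf.split(filtered_images[:total_images], bike_strat_labels)):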
    if fold_idx == 0:
        original_train_idx = train_idx.copy()
        original_val_idx = val_idx.copy()
        for i in range(1, NUM_OF_AUGMENTATIONS + 1):
            augmented_train_idx = original_train_idx + i * total_images
            augmented_val_idx = original_val_idx + i * total_images
            train_idx = np.concatenate((train_idx, augmented_train_idx))
            val_idx = np.concatenate((val_idx, augmented_val_idx))
        print(f"Training fold {fold_idx + 1}")
        # Split into training and validation sets
        X_train, X_val = filtered_images[train_idx], filtered_images[val_idx]
        y_train, y_val = filtered_masks[train_idx], filtered_masks[val_idx]

In [None]:
# Our adversarial model
def build_adversarial_model(input_shape):
    inputs = Input(shape=input_shape)
    # Encoder path
    x = Conv2D(8, (3, 3), activation='relu', padding='same')(inputs)
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
    # Decoder path
    x = Conv2DTranspose(32, (3, 3), activation='relu', padding='same')(x)
    x = Conv2DTranspose(64, (3, 3), activation='relu', padding='same')(x)
    # Output perturbation as the same size as input
    perturbations = Conv2D(3, (3, 3), padding='same')(x)
    model = Model(inputs=inputs, outputs=perturbations)
    return model

# Adversarial loss
def adversarial_loss(labels, perturbations, images, scale=0.1):
    # Apply perturbations to the original images
    perturbed_images = images + perturbations
    # Get predictions from the target model on the perturbed images
    predictions = target_model(perturbed_images, training=False)
    # Convert predictions from logits to class indices
    predicted_classes = tf.argmax(predictions, axis=-1)
    # Cast labels to match the predicted_classes dtype
    labels = tf.cast(labels, dtype=tf.int32)
    predicted_classes = tf.cast(predicted_classes, dtype=tf.int32)
    # Calculate correct preds and acc
    correct_predictions = tf.equal(labels, predicted_classes)
    accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32))
    # euclidean regularization to keep perturbations small
    loss_regularization = tf.norm(perturbations, ord=2)
    # Combine losses, adapted as a form of loss where lower accuracy increases the "loss"
    # to make the model learn the masks
    # NOTE: argmax and equal are not differentiable, so the accuracy term passes
    # no gradient to the adversarial model; only the regularization term does
    # also tried with reverse dice
    # and reverse sparse
    # but they all failed
    total_loss = (1 - accuracy) + scale * loss_regularization
    return total_loss

# get batches to train and val
# set it to 10 since it's the number of augs
# and will always be divisible
def get_batches(X, y, batch_size=10):
    num_samples = X.shape[0]
    for start_idx in range(0, num_samples, batch_size):
        end_idx = min(start_idx + batch_size, num_samples)
        # yield to be able to continue
        yield X[start_idx:end_idx], y[start_idx:end_idx]

# Instantiate models
adv_model = build_adversarial_model(input_shape=(224, 224, 3))
target_model = unet_model_tf(input_shape=(224, 224, 3), num_classes=21)
target_model.load_weights("/kaggle/working/best_unet_acc_fold_1.keras")
target_model.trainable = False
adv_model.compile(optimizer=Adam(learning_rate=1e-4))

# Set hyperparams
epochs = 10
batch_size = 2
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

# training step
@tf.function
def train_step(images, labels, model, optimizer):
    # initiate gradient
    with tf.GradientTape() as tape:
        # get perturbations and add em and calc loss
        perturbations = model(images, training=True)
        perturbed_images = images + perturbations
        loss = adversarial_loss(labels, perturbations, images)
    # calc gradients
    gradients = tape.gradient(loss, model.trainable_variables)
    # update weights
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss


# Training loop
for epoch in range(epochs):
    print(f"Epoch {epoch+1}/{epochs}")
    train_loss = 0
    train_steps = 0
    for images, labels in tqdm(get_batches(X_train, y_train, batch_size), desc="Training"):
        # Ensure labels are of type int32
        # Else we got errors
        labels = tf.cast(labels, tf.int32)
        loss = train_step(images, labels, adv_model, optimizer)
        train_loss += loss.numpy()
        train_steps += 1

        # Displaying one training image per epoch
        if train_steps == 1:
            perturbations = adv_model(images, training=False)
            perturbed_images = images + perturbations
            perturbed_images = tf.clip_by_value(perturbed_images, 0, 1)
            predictions = target_model(perturbed_images)
            predicted_classes = tf.argmax(predictions, axis=-1)

            plt.figure(figsize=(15, 10))
            plt.subplot(1, 4, 1)
            plt.imshow(images[0])
            plt.title(f'Original Image predicting class {np.max(labels[0])}')
            plt.axis('off')

            plt.subplot(1, 4, 2)
            plt.imshow(perturbations[0]*255, cmap='viridis')
            plt.title('Perturbation')
            plt.axis('off')

            plt.subplot(1, 4, 3)
            plt.imshow(perturbed_images[0])
            plt.title('Perturbed Image')
            plt.axis('off')

            plt.subplot(1, 4, 4)
            plt.imshow(predicted_classes[0], cmap='viridis')
            plt.title(f'Predicted Mask predicting class {np.max(predicted_classes[0][predicted_classes[0] != 0])}')
            plt.axis('off')

            plt.show()
    train_loss /= train_steps
    print(f"Training loss: {train_loss}")

    # Validation loop with tqdm progress bar
    val_loss = 0
    val_steps = 0
    for images, labels in tqdm(get_batches(X_val, y_val, batch_size), desc="Validation"):
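        # No training, just forward pass
        labels = tf.cast(labels, tf.int32)
        # Compute perturbations for every validation batch so the loss below
        # never reuses stale values from the training loop
        perturbations = adv_model(images, training=False)
        # Displaying one val image per epoch (first validation batch only)
        if val_steps == 0:
            perturbed_images = images + perturbations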
            # Clip values to [0, 1] range
            perturbed_images = tf.clip_by_value(perturbed_images, 0, 1)
            predictions = target_model(perturbed_images)
            predicted_classes = tf.argmax(predictions, axis=-1)

            plt.figure(figsize=(15, 10))
            plt.subplot(1, 4, 1)
            plt.imshow(images[0])
            plt.title(f'Original Image predicting class {np.max(labels[0])}')
            plt.axis('off')

            plt.subplot(1, 4, 2)
            plt.imshow(perturbations[0]*255, cmap='viridis')
            plt.title('Perturbation')
            plt.axis('off')

            plt.subplot(1, 4, 3)
            plt.imshow(perturbed_images[0])
            plt.title('Perturbed Image')
            plt.axis('off')

            plt.subplot(1, 4, 4)
            plt.imshow(predicted_classes[0], cmap='viridis')
            plt.title(f'Predicted Mask predicting class {np.max(predicted_classes[0][predicted_classes[0] != 0])}')
            plt.axis('off')
            plt.show()

        loss = adversarial_loss(labels, perturbations, images)
        val_loss += loss.numpy()
        val_steps += 1
    val_loss /= val_steps
    print(f"Validation loss: {val_loss}")

# 5. Discussion
Finally, take some time to reflect on what you have learned during this assignment. Reflect and produce an overall discussion with links to the lectures and "real world" computer vision.

### Segmentation
For segmentation, we tried only a vanilla U-Net for the model trained from scratch and a U-Net with a pretrained backbone for the transfer-learning model. After seeing the better results of the pretrained variant, we focused on building several versions with available backbones such as EfficientNetV2B3, EfficientNetB3 and several versions of MobileNet; of these, the V2 showed the best results. What surprised us at first was the relative accuracy the model could achieve in classification as well as segmentation simply by tuning the threshold parameter. Given more time, we would plot an accuracy curve over the different thresholds and coordinate with the classification team to compare results.
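
The accuracy curve over thresholds mentioned above could be produced along the following lines. This is only a minimal sketch, not the code we ran: it assumes the threshold is the minimum fraction of pixels that must be predicted as a class before the image is labelled with that class, and the function name `threshold_accuracy_curve`, its inputs and the toy numbers are all hypothetical.

```python
def threshold_accuracy_curve(class_fractions, true_labels, thresholds):
    """Return (threshold, accuracy) pairs for image-level label predictions.

    class_fractions: per image, the fraction of pixels the segmentation model
        assigned to each foreground class (hypothetical input).
    true_labels: per image, 0/1 flags of the same length.
    """
    curve = []
    for t in thresholds:
        correct = 0
        total = 0
        for fractions, labels in zip(class_fractions, true_labels):
            for fraction, label in zip(fractions, labels):
                # A class counts as present when enough pixels belong to it
                predicted = 1 if fraction >= t else 0
                correct += int(predicted == label)
                total += 1
        curve.append((t, correct / total))
    return curve


# Toy data: 3 images, 2 foreground classes (made-up numbers, not our results)
class_fractions = [[0.30, 0.02], [0.00, 0.12], [0.06, 0.00]]
true_labels = [[1, 0], [0, 1], [1, 0]]
thresholds = [0.01, 0.05, 0.10, 0.20]

curve = threshold_accuracy_curve(class_fractions, true_labels, thresholds)
# Pick the threshold with the highest accuracy on this (validation) data
best_threshold, best_accuracy = max(curve, key=lambda pair: pair[1])
print(curve)
print(best_threshold, best_accuracy)
```

Plotting the returned pairs with `plt.plot` would give the curve. The threshold should be selected on the validation split only, so that the test score stays unbiased.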

### Adversarial segmentation

No matter which class pairs we tried to swap, whether very similar ones such as bike and bicycle or ones that are similar and often appear alone such as cat and dog, the results were subpar. The adversarial model underfitted throughout, and the additional loss functions, scaling factors and slightly smaller or larger models we tried yielded similar results.
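
One way to put a number on how far such an attack gets is to measure the fraction of source-class pixels that the target model predicts as the intended class after perturbation. The sketch below is illustrative only and was not part of our experiments: `targeted_flip_rate` is a hypothetical helper, and the class ids 8 and 12 are arbitrary placeholders standing in for cat and dog.

```python
def targeted_flip_rate(true_mask, predicted_mask, source_class, target_class):
    """Fraction of source-class pixels predicted as target_class after the attack.

    Both masks are 2D lists of integer class ids with identical shapes.
    Returns None when the image contains no source-class pixels.
    """
    source_pixels = 0
    flipped_pixels = 0
    for true_row, pred_row in zip(true_mask, predicted_mask):
        for true_id, pred_id in zip(true_row, pred_row):
            if true_id == source_class:
                source_pixels += 1
                flipped_pixels += int(pred_id == target_class)
    if source_pixels == 0:
        return None
    return flipped_pixels / source_pixels


# Toy 3x4 masks (made-up values): 0 = background, 8 = source, 12 = target
true_mask = [[0, 8, 8, 0],
             [0, 8, 8, 0],
             [0, 0, 0, 0]]
predicted_mask = [[0, 12, 8, 0],
                  [0, 12, 0, 0],
                  [0, 0, 0, 0]]

rate = targeted_flip_rate(true_mask, predicted_mask, source_class=8, target_class=12)
print(rate)
```

Averaging this rate over the validation set, alongside the norm of the perturbations, would make the trade-off between attack strength and visibility explicit.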