KUL H02A5a Computer Vision: Group Assignment 2
---------------------------------------------------------------
Student numbers: <span style="color:red">r0810913, r0792692, r0795760, r0795985, r0799260</span>.

# 1. Overview
This assignment consists of *three main parts*:
* Image classification (Sect. 2)
* Semantic segmentation (Sect. 3)
* Adversarial attacks (Sect. 4)

In the first part, an end-to-end neural network is trained for image classification. In the second part, the same is done for semantic segmentation. In the third part, we try to find and exploit the weaknesses of our classification and/or segmentation network. Finally, a reflection and overall discussion with links to the lectures and "real world" computer vision is provided (Sect. 5).

## 1.1 Deep learning resources

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt
import os
import cv2
from urllib import request
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D, Dropout, Input, Add, ActivityRegularization
from tensorflow.keras import models, layers
from tensorflow.keras.optimizers import Adadelta
from tensorflow.keras.metrics import Precision, Recall
from keras import regularizers
from keras.layers import BatchNormalization
from sklearn.metrics import f1_score
from keras.callbacks import ModelCheckpoint, EarlyStopping
from skimage.transform import resize
from tensorflow.keras.optimizers import Adam
from IPython.display import clear_output

## 1.2 PASCAL VOC 2009
For this project we will be using the [PASCAL VOC 2009](http://host.robots.ox.ac.uk/pascal/VOC/voc2009/index.html) dataset. This dataset consists of colour images of various scenes with different object classes (e.g. animal: *bird, cat, ...*; vehicle: *aeroplane, bicycle, ...*), totalling 20 classes.

In [None]:
# Loading the training data
train_df = pd.read_csv('/kaggle/input/kul-h02a5a-computer-vision-ga2-2024/train/train_set.csv', index_col="Id")
labels = train_df.columns
train_df["img"] = [np.load('/kaggle/input/kul-h02a5a-computer-vision-ga2-2024/train/img/train_{}.npy'.format(idx)) for idx, _ in train_df.iterrows()]
train_df["seg"] = [np.load('/kaggle/input/kul-h02a5a-computer-vision-ga2-2024/train/seg/train_{}.npy'.format(idx)) for idx, _ in train_df.iterrows()]
print("The training set contains {} examples.".format(len(train_df)))

# Show some examples
fig, axs = plt.subplots(2, 20, figsize=(10 * 20, 10 * 2))
for i, label in enumerate(labels):
    df = train_df.loc[train_df[label] == 1]
    axs[0, i].imshow(df.iloc[0]["img"], vmin=0, vmax=255)
    axs[0, i].set_title("\n".join(label for label in labels if df.iloc[0][label] == 1), fontsize=40)
    axs[0, i].axis("off")
    axs[1, i].imshow(df.iloc[0]["seg"], vmin=0, vmax=20)  # with the absolute color scale it will be clear that the arrays in the "seg" column are label maps (labels in [0, 20])
    axs[1, i].axis("off")
    
plt.show()

# The training dataframe contains, for each image, 20 columns with the ground truth classification labels, plus an "img" column with the image and a "seg" column with the ground truth segmentation label map
train_df.head(1)

# Loading the test data
test_df = pd.read_csv('/kaggle/input/kul-h02a5a-computer-vision-ga2-2024/test/test_set.csv', index_col="Id")
test_df["img"] = [np.load('/kaggle/input/kul-h02a5a-computer-vision-ga2-2024/test/img/test_{}.npy'.format(idx)) for idx, _ in test_df.iterrows()]
test_df["seg"] = [-1 * np.ones(img.shape[:2], dtype=np.int8) for img in test_df["img"]]
print("The test set contains {} examples.".format(len(test_df)))

# The test dataframe is similar to the training dataframe, but here the values are -1 --> your task is to fill these in as well as possible in Sect. 2 and Sect. 3; in Sect. 6 this dataframe is automatically transformed into the submission CSV!
test_df.head(1)

## 1.3 Kaggle submission
The test dataframe filled in during Sect. 2 and Sect. 3 must be converted to a submission.csv with two rows per example (one for classification and one for segmentation) and a single prediction column containing the run-length encoded multi-class/multi-label predictions. The following functions generate this submission file.

In [None]:
def _rle_encode(img):
    """
    Kaggle requires RLE encoded predictions for computation of the Dice score (https://www.kaggle.com/lifa08/run-length-encode-and-decode)

    Parameters
    ----------
    img: np.ndarray - binary img array
    
    Returns
    -------
    rle: String - run-length encoded version of img
    """
    pixels = img.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    rle = ' '.join(str(x) for x in runs)
    return rle

def generate_submission(df):
    """
    Make sure to call this function once after you completed Sect. 2 and Sect. 3! It transforms and writes your test dataframe into a submission.csv file.
    
    Parameters
    ----------
    df: pd.DataFrame - filled dataframe that needs to be converted
    
    Returns
    -------
    submission_df: pd.DataFrame - df in submission format.
    """
    df_dict = {"Id": [], "Predicted": []}
    for idx, _ in df.iterrows():
        df_dict["Id"].append(f"{idx}_classification")
        df_dict["Predicted"].append(_rle_encode(np.array(df.loc[idx, labels])))
        df_dict["Id"].append(f"{idx}_segmentation")
        df_dict["Predicted"].append(_rle_encode(np.array([df.loc[idx, "seg"] == j + 1 for j in range(len(labels))])))
    
    submission_df = pd.DataFrame(data=df_dict, dtype=str).set_index("Id")
    submission_df.to_csv("submission.csv")
    return submission_df
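
As a quick sanity check of the encoding used above, the following sketch applies the same run-length logic to a tiny binary array. The `rle_decode` helper is hypothetical (it is not part of the assignment code) and is only included to show that the encoding can be inverted.

```python
import numpy as np

def rle_encode(mask):
    """Run-length encode a binary array as 'start length start length ...' (1-indexed starts)."""
    pixels = np.concatenate([[0], np.asarray(mask).flatten(), [0]])
    # Positions where the value changes mark the start and the end of each run of ones
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]  # turn end positions into run lengths
    return ' '.join(str(x) for x in runs)

def rle_decode(rle, size):
    """Hypothetical inverse helper: rebuild the flat binary array from its encoding."""
    mask = np.zeros(size, dtype=np.uint8)
    values = [int(v) for v in rle.split()]
    for start, length in zip(values[0::2], values[1::2]):
        mask[start - 1:start - 1 + length] = 1
    return mask

example = np.array([0, 1, 1, 0, 1])
encoded = rle_encode(example)
decoded = rle_decode(encoded, example.size)
print(encoded)  # 2 2 5 1
```

The string reads as "a run of 2 ones starting at pixel 2, and a run of 1 one starting at pixel 5".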

## 1.4 Preprocessing/General functions

This section contains general purpose functions that are used for example to preprocess 
the data sets before neural networks can be trained with them.

The training data undergoes the following preprocessing steps:
* Resizing of the images to 224 x 224.
* Data augmentation is applied to increase the number of data samples. New samples are generated by rotating, flipping, and adjusting the brightness of the existing training data. These augmented samples are then added to the dataset, enhancing the number of training samples and improving generalization.

The test data is only resized.
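
To make the augmentation idea concrete, here is a NumPy-only sketch of two of the transformations (horizontal flip and brightness scaling) on a made-up toy image. The notebook itself relies on `ImageDataGenerator`; the helper names below are illustrative assumptions and rotation is left out for brevity.

```python
import numpy as np

def horizontal_flip(img):
    """Mirror an H x W x C image left to right."""
    return img[:, ::-1, :]

def adjust_brightness(img, factor):
    """Scale pixel intensities by `factor` and clip to the valid [0, 255] range."""
    return np.clip(img.astype(np.float32) * factor, 0, 255)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3)).astype(np.uint8)  # made-up toy image

flipped = horizontal_flip(image)
darker = adjust_brightness(image, 0.5)

# Augmented copies are added next to the original sample, so the data set grows
augmented_set = np.stack([image.astype(np.float32), flipped.astype(np.float32), darker])
print(augmented_set.shape)  # (3, 4, 6, 3)
```

The labels of an augmented classification sample stay the same as those of its source image, which is why the labels are simply duplicated alongside the images.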


In [None]:
def resize_image(img, height, width):
    """Resize an image to (height, width) with bicubic interpolation."""
    # cv2.resize expects the target size as (width, height)
    resized = cv2.resize(img, (width, height), interpolation=cv2.INTER_CUBIC)
    return resized


def preprocess_data(X, y, shape, augment=True):
    """Resize and optionally augment the dataset
    Up to 50 batches of 50 augmented images are generated from the
    resized images and appended to the originals; no per-class filtering is applied.
    """
    height = shape[0]
    width = shape[1]
    X_res = []
    for i, img in enumerate(X):
        X_res.append(resize_image(img, height, width))

    X_res = np.array(X_res)

    if augment:
        to_augment = []
        to_augment_y = []
        for i, img in enumerate(X_res):
            to_augment.append(img)
            to_augment_y.append(y[i])


        to_augment = np.array(to_augment)
        to_augment_y = np.array(to_augment_y)
    
        datagen = ImageDataGenerator(
                    rotation_range=20,
                    brightness_range=[0.3,1.0],
                    horizontal_flip=True)

        d = datagen.flow(to_augment, to_augment_y, batch_size=50, shuffle=False, seed=None)
        it = 0

        new = []
        new_y = []
        for batch in d:
            it += 1
            for i, img in enumerate(batch[0]):
                new.append(img)
                new_y.append(batch[1][i])
            if it >= 50:
                break

        new = np.array(new)
        new_y = np.array(new_y)
        
        data = np.concatenate((new, to_augment))
        data_y = np.concatenate((new_y, to_augment_y))
    else:
        data = X_res
        data_y = y

        
    return data, data_y



The following functions will be used later to display the images with their predicted or actual labels and to show the training history of the model.

In [None]:
def show_image_with_label(data, labels):

    CLASSES = {'aeroplane': 0, 'bicycle': 1, 'bird': 2, 'boat': 3, 'bottle': 4, 'bus': 5,
            'car': 6, 'cat': 7, 'chair': 8, 'cow': 9, 'diningtable': 10, 'dog': 11,
            'horse': 12, 'motorbike':13, 'person': 14, 'pottedplant': 15, 'sheep': 16,
            'sofa': 17, 'train': 18, 'tvmonitor': 19}
    
    # Randomly select 10 indices
    indices = np.random.choice(len(labels), 10, replace=False)

    # Create a figure and axis objects
    fig, axes = plt.subplots(1, 10, figsize=(20, 2))  # Adjust figsize as needed

    # Show the selected 10 random images with their corresponding class names
    for i, idx in enumerate(indices):
        axes[i].imshow(data[idx])
        
        # Get the class names for the labels with value 1
        class_names = [class_name for class_name, label in CLASSES.items() if labels[idx][label] == 1]
        
        # Set title as class names separated by commas
        axes[i].set_title(', '.join(class_names))
        
        axes[i].axis('off')

    plt.show()
    return 

def plot_history(history):
    fig, axs = plt.subplots(1, 3, figsize=(15, 5))  # Adjusting the layout to fit 3 plots

    if "Precision" in history.history:
        # Summarize history for precision
        axs[0].plot(history.history['Precision'])  # Make sure the key matches your history dictionary
        axs[0].plot(history.history['val_Precision'])
        axs[0].set_title('Model Precision')
        axs[0].set_ylabel('Precision')
        axs[0].set_xlabel('Epoch')
        axs[0].legend(['Train', 'Validation'], loc='upper left')

    if "loss" in history.history:
        # Summarize history for loss
        axs[1].plot(history.history['loss'])
        axs[1].plot(history.history['val_loss'])
        axs[1].set_title('Model Loss')
        axs[1].set_ylabel('Loss')
        axs[1].set_xlabel('Epoch')
        axs[1].legend(['Train', 'Validation'], loc='upper left')

    if "Accuracy" in history.history:
        # Summarize history for accuracy
        axs[2].plot(history.history['Accuracy'])  # Again, make sure the key matches your history dictionary
        axs[2].plot(history.history['val_Accuracy'])
        axs[2].set_title('Model Accuracy')
        axs[2].set_ylabel('Accuracy')
        axs[2].set_xlabel('Epoch')
        axs[2].legend(['Train', 'Validation'], loc='upper left')

    plt.tight_layout()  # This helps to fit everything into the figure cleanly
    plt.show()
    return

# 2. Image classification
The goal here is simple: implement a classification CNN and train it to recognise all 20 classes (and/or background) using the training set and compete on the test set (by filling in the classification columns in the test dataframe). For the multilabel classification task, where each image can have multiple labels, multiple CNN architectures were tried. In what follows, we summarise the main findings and attempts.

We first preprocess our training set. This resizes all images to the same size and augments the data set with rotated, horizontally flipped, and brightness-adjusted copies. No separate normalization step is applied here; the ResNet50 model in Sect. 2.3 applies its own input preprocessing. The pretrained networks used for transfer learning later were trained on images of size 224x224, so we resize all images to this size.

In [None]:
train_images = train_df["img"].to_numpy()
train_labels = train_df[labels].to_numpy()

IMG_HEIGHT = 224
IMG_WIDTH = 224
IMG_SIZE = (IMG_HEIGHT,IMG_WIDTH)

X, y_train = preprocess_data(train_images, train_labels, IMG_SIZE, augment=True)

X_train = X.astype(np.float32)

X, y_test = preprocess_data(test_df["img"].to_numpy(), test_df[labels].to_numpy(), IMG_SIZE, augment=False)

X_test = X.astype(np.float32)

print("Training set:", X_train.shape, y_train.shape)
print("Test set:", X_test.shape, y_test.shape)

In [None]:
show_image_with_label(X_train,y_train)


Some images contain multiple objects. Let's see how many images contain more than 1 object.

In [None]:
def count_images_with_multiple_objects(labels, label_columns):
    """
    Function to count the number of images containing more than one object.

    Args:
    labels (pd.DataFrame): DataFrame containing the labels for each image.
    label_columns (list): List of columns to consider for counting objects.

    Returns:
    int: The number of images containing more than one object.
    """
    # Sum the number of objects (values of 1) in each row for the specified columns
    object_counts = labels[label_columns].sum(axis=1)

    # Count how many rows have more than one object
    images_with_multiple_objects = (object_counts > 1).sum()

    return images_with_multiple_objects


# Specify the label columns (all 20 classes, excluding 'img' and 'seg')
label_columns = [
    'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat',
    'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person',
    'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor'
]

# Print the number of images in the train data
total_images = len(train_df)
print(f"Total number of images in the training data: {total_images}")

# Count the number of images containing more than one object
num_images_with_multiple_objects = count_images_with_multiple_objects(train_df, label_columns)
print(f"Number of images containing more than one object: {num_images_with_multiple_objects}")


Let's also take a look at the most prevalent classes in the training set.

In [None]:
# Define the list of class names
CLASSES = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
           'car', 'cat', 'chair', 'cow', 'diningtable', 'dog',
           'horse', 'motorbike', 'person', 'pottedplant', 'sheep',
           'sofa', 'train', 'tvmonitor']

# Initialize the class counts dictionary
class_counts = {class_name: 0 for class_name in CLASSES}

# Count the occurrences of each class label
for idx, row in train_df.iterrows():
    for class_name in CLASSES:
        if row[class_name] == 1:
            class_counts[class_name] += 1

# Calculate relative class counts
total_images = len(train_df)
rel_class_counts_list = [class_counts[class_name] / total_images for class_name in CLASSES]

# Plot a histogram of class frequencies
plt.figure(figsize=(12, 8))
plt.bar(range(len(CLASSES)), rel_class_counts_list, tick_label=CLASSES)
plt.xticks(rotation=90)
plt.ylabel('Proportion of Images')
plt.title('Class Frequencies in the Training Data')

We see that "person" is the most frequent class, appearing in more than 25% of the images, while most other classes appear in only a small fraction of them. The training set is therefore imbalanced.
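
The two statistics computed above (relative class frequency and the number of images with more than one class) can also be obtained with vectorised pandas operations. The sketch below uses a made-up three-class table, not the real training data.

```python
import pandas as pd

# Made-up multi-label table: one row per image, one 0/1 column per class
toy_df = pd.DataFrame({
    "person": [1, 1, 0, 1],
    "dog":    [0, 1, 0, 0],
    "car":    [0, 0, 1, 1],
})
class_cols = ["person", "dog", "car"]

# Fraction of images that contain each class (the mean of a 0/1 column)
rel_freq = toy_df[class_cols].mean()

# Number of images labelled with more than one class
multi_label_count = int((toy_df[class_cols].sum(axis=1) > 1).sum())

print(rel_freq.to_dict())
print(multi_label_count)
```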

## 2.1 Basic CNN Model

Below is an explanation of our architectural decisions for constructing a CNN model:

**Convolutional Blocks (4)**:
- Each block contains a convolutional layer with increasing filter counts to capture increasingly complex features.

**ReLU Activation**:
- Applied after each Conv2D layer to introduce non-linearity and to mitigate the vanishing gradient problem.

**MaxPooling2D**:
- Downsamples feature maps, retaining the most salient features by selecting the maximum value in each window.

**Dropout**:
- Prevents overfitting by randomly dropping neurons during training to reduce reliance on specific neurons.

**Fully Connected Layers**:
- Flatten layer transforms the feature maps into 1D arrays.
- Dense layers learn non-linear combinations of these features.

**Sigmoid Activation**:
- Used in the output layer to produce probabilities for each class in a multi-label classification task.

**Binary Cross-Entropy Loss**:
- Suitable for multi-label classification, treating each class probability independently.

Our basic model consists of four convolutional layers, each followed by the ReLU activation function and a max-pooling layer. These layers extract features from the images. The end of the network is a fully connected (dense) neural network.
This network performs classification based on the extracted features.
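
To see how the four convolution and pooling stages shrink a 224x224 input, the following sketch tracks the spatial size by hand. It assumes 3x3 convolutions without padding, 2x2 max pooling, filter counts of 16, 32, 64 and 128, and a 512-unit dense layer, as in the model defined in the next cell.

```python
def conv_out(size, kernel=3):
    """Spatial size after a stride-1 convolution without padding ('valid')."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool

size = 224
sizes = []
filters = [16, 32, 64, 128]
for _ in filters:
    size = pool_out(conv_out(size))
    sizes.append(size)

flat_features = size * size * filters[-1]   # length of the flattened feature vector
dense_params = flat_features * 512 + 512    # weights + biases of the Dense(512) layer
print(sizes, flat_features, dense_params)
```

Under these assumptions most of the trainable parameters sit in the first dense layer, which is one reason dropout is applied there.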

The binary crossentropy loss function is used in combination with the sigmoid activation function in the output layer for multilabel classification.
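
A small NumPy sketch of this combination, using made-up logits for a single image and three classes: each output gets its own sigmoid, so the probabilities need not sum to 1, and the loss is the average of the per-class binary cross-entropies.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all samples and classes."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))))

logits = np.array([[2.0, -1.0, 0.5]])   # made-up network outputs before the sigmoid
y_true = np.array([[1.0, 0.0, 1.0]])    # the image contains classes 0 and 2

probs = sigmoid(logits)
loss = binary_cross_entropy(y_true, probs)
print(probs.round(3), round(loss, 4))
```

A softmax output would force the classes to compete with each other, which does not fit images that contain several objects at once.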

In [None]:
def get_basic_model(input_shape, nb_classes):

    input_shape = (input_shape[0], input_shape[1], 3)

    model = Sequential(name="OurModel")

    # Convolutional layers
    model.add(Conv2D(16, (3, 3), activation='relu', input_shape=input_shape))
    model.add(MaxPooling2D((2, 2)))

    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.25))

    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.25))

    model.add(Conv2D(128, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.25))

    # Flatten layer
    model.add(Flatten())

    # Fully connected layers
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.5))

    # Output layer
    model.add(Dense(nb_classes, activation='sigmoid'))

    return model

def compile_our_model():
    input_shape = IMG_SIZE
    nb_classes = 20

    model = get_basic_model(input_shape, nb_classes)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['Precision', 'Recall', 'Accuracy'])
    model.summary()  # prints the summary itself and returns None
    return model

# Compile the model
our_model = compile_our_model()

In [None]:
def get_callbacks(patience, monitor_metric='val_loss', checkpoint_path='best_model.keras'):
    """
    Create ModelCheckpoint and EarlyStopping callbacks.

    Args:
    patience (int): Number of epochs with no improvement after which training will be stopped.
    monitor_metric (str): Metric to monitor for EarlyStopping and ModelCheckpoint.
    checkpoint_path (str): Path to save the best model.

    Returns:
    list: List of callbacks.
    """
    # Losses should decrease; all other monitored metrics (accuracy, precision, ...) should increase
    mode = 'min' if 'loss' in monitor_metric else 'max'

    checkpoint_callback = ModelCheckpoint(
        checkpoint_path,              # Path where the model will be saved
        monitor=monitor_metric,       # Metric to monitor
        save_best_only=True,          # Save only the best model
        mode=mode,                    # Same direction as used for early stopping
        verbose=1                     # Log when a model is saved
    )

    early_stopping_callback = EarlyStopping(
        monitor=monitor_metric,       # Metric to monitor
        min_delta=0.005,              # Minimum change to qualify as an improvement
        patience=patience,            # Number of epochs with no improvement after which training will be stopped
        mode=mode,                    # 'min' for losses, 'max' otherwise
        verbose=1                     # Log when training is stopped
    )

    return [checkpoint_callback, early_stopping_callback]

In [None]:
# Assuming X_train and y_train are defined and preprocessed
# Define number of epochs
epochs = 50
# Get callbacks with patience based on the number of epochs
callbacks = get_callbacks(patience=epochs//5, monitor_metric='val_Accuracy')

history_basic_model = our_model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    epochs=epochs,
    batch_size=64,
    callbacks=callbacks  # Include the callbacks in the training process
)

In [None]:
plot_history(history_basic_model)

With a validation accuracy of only around 65%, the model does not perform well. Moreover, the validation accuracy levels off after a few training epochs, and the validation loss then starts to increase significantly, indicating overfitting.

A second architecture was also tried, which added batch normalization layers and multiple convolutional layers per convolutional block. The results were similar to the previously chosen model, except that convergence was more stable and quicker.

## 2.2 Known Model Architectures (AlexNet)

As a first step towards a better-performing model, this section tries the well-known AlexNet architecture to classify images.

The architecture consists of the following layers:
* 5 convolutional layers for feature detection
* 3 maxpooling layers for reducing the dimension of feature maps
* 3 Fully connected layers to learn from the extracted features of the convolutional layers
* 1 flatten layer to transition from the convolutional layers to the fully connected layers
* The binary crossentropy loss function is used in combination with the sigmoid activation function in the output layer for multilabel classification.

Reference: Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60, 84-90.

Since the full AlexNet architecture contains more than 50 million parameters, training it from scratch would be too expensive and would likely overfit the limited training data available. For this reason, a reduced version of the AlexNet architecture was trained from scratch. The reduced AlexNet model has the same structure, but with fewer filters in each convolutional layer and fewer units in each dense layer. This reduces the number of trainable parameters to 8.6 million.

In addition, batch normalization was added after each convolutional and fully connected layer. Normalizing the outputs of the previous layer stabilizes the distribution of each layer's inputs, which speeds up learning, makes training more stable, and allows a higher learning rate to be used.
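
The normalization step itself is simple. The sketch below shows a training-mode forward pass on made-up activations with NumPy; `gamma` and `beta` stand for the learnable scale and shift. It is a simplified illustration, since at inference time the Keras layer uses moving averages of the statistics instead of the batch statistics.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# Made-up activations: 64 samples, 4 features, far from zero mean and unit variance
activations = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

normalized = batch_norm_forward(activations, gamma=np.ones(4), beta=np.zeros(4))
print(normalized.mean(axis=0).round(3), normalized.std(axis=0).round(3))
```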

The model was trained for 100 epochs using the Adam optimizer. The resulting network attains a higher accuracy on both the training and validation sets; the validation accuracy increased to ~70%.

Reference: Saxena, S. (2024, February 20). Introduction to Batch Normalization. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/

In [None]:
from keras import regularizers
from keras.layers import BatchNormalization
from keras.callbacks import ModelCheckpoint


def get_alexnet_model_reduced(input_shape, nb_classes):
    input_shape = (input_shape[0], input_shape[1],3)
    model = models.Sequential(name="Alexnet")
    
    # Convolutional layers
    model.add(layers.Conv2D(filters=16, kernel_size=(11,11), input_shape=input_shape, strides=4, padding="valid", kernel_initializer="he_normal", use_bias=False))
    model.add(BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.MaxPool2D(pool_size=(3,3), strides=(2,2), padding="valid"))
    
    model.add(layers.Conv2D(filters=32, kernel_size=(5,5), strides=1, padding="same", kernel_initializer="he_normal", use_bias=False))
    model.add(BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.MaxPool2D(pool_size=(3,3), strides=(2,2), padding="valid"))
    
    model.add(layers.Conv2D(filters=64, kernel_size=(3,3), strides=1, padding="same", kernel_initializer="he_normal", use_bias=False))
    model.add(BatchNormalization())
    model.add(layers.Activation("relu"))
    
    model.add(layers.Conv2D(filters=64, kernel_size=(3,3), strides=1, padding="same", kernel_initializer="he_normal", use_bias=False))
    model.add(BatchNormalization())
    model.add(layers.Activation("relu"))
    
    model.add(layers.Conv2D(filters=64, kernel_size=(3,3), strides=1, padding="same", kernel_initializer="he_normal", use_bias=False))
    model.add(BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.MaxPool2D(pool_size=(3,3), strides=(2,2), padding="valid"))
    
    # Flatten layer
    model.add(layers.Flatten())
    
    # Fully connected layers
    model.add(layers.Dense(units=2048))
    model.add(BatchNormalization())
    model.add(layers.Activation("relu"))
    
    model.add(layers.Dense(units=2048))
    model.add(BatchNormalization())
    model.add(layers.Activation("relu"))
    
    model.add(layers.Dense(units=500))
    model.add(BatchNormalization())
    model.add(layers.Activation("relu"))
    
    # Output layer
    model.add(layers.Dense(units=nb_classes, activation="sigmoid"))
    
    return model

def compile_alexnet_model_reduced():
    alexnet_reduced = get_alexnet_model_reduced(IMG_SIZE, 20)

    # The checkpoint callback is created separately in the training cell below
    alexnet_reduced.compile(optimizer='adam', loss='binary_crossentropy', metrics=['Precision', 'Recall', 'Accuracy'])
    alexnet_reduced.summary()  # prints the summary itself and returns None
    return alexnet_reduced

alexnet_reduced=compile_alexnet_model_reduced()

In [None]:
checkpoint_callback = ModelCheckpoint(
    'alexnet_reduced_best.keras', # Path where the model will be saved
    monitor='val_Accuracy',       # Metric to monitor
    save_best_only=True,          # Save only the best model
    mode='max',                   # Mode 'max' because we aim to maximize the validation accuracy
    verbose=1                     # Logs out when models are being saved
)

history_alexnet_reduced=alexnet_reduced.fit(X_train, y_train, epochs=100, batch_size=16, validation_split=0.2,callbacks=[checkpoint_callback])

In [None]:
plot_history(history_alexnet_reduced)

In [None]:
y_pred_prob  = alexnet_reduced.predict(X_test)

threshold = 0.3

y_pred = np.array(y_pred_prob>threshold, int)

In [None]:
show_image_with_label(X_test,y_pred)

By plotting some test images along with the labels predicted by the reduced AlexNet, we can see that the model is still far from optimal. We continue to look for improvements, this time by applying transfer learning.

In [None]:
test_images = test_df["img"].to_numpy()
test_resized=[]
for im in test_images:
    resized = resize_image(im, IMG_HEIGHT, IMG_WIDTH)
    test_resized.append(resized)
test_resized=np.array(test_resized)
y_test_pred_tf = alexnet_reduced.predict(test_resized)
y_test_pred=np.where(y_test_pred_tf > 0.4, 1, 0)
test_df.loc[:, labels] = y_test_pred

show_image_with_label(test_resized,y_test_pred)

## 2.3 Transfer Learning
### 2.3.1. Resnet50

In a first transfer learning model, the ResNet50 model is used as an encoder. The classification is done by training two fully connected layers and an output layer.

In [None]:
# Define the directory to save the model weights ResNet50
weights_folder = './weights/'
if not os.path.exists(weights_folder):
    os.makedirs(weights_folder)
model_weights_path = os.path.join(weights_folder, "RESNET_2.weights.h5")

### 2.3.2 Fine-Tuned Models
Freezing Layers: Why It’s Useful

- **Preserve Learned Features**: Early ResNet50 layers, trained on ImageNet, extract generic features useful for various tasks. Freezing these layers keeps these features intact during training on a new dataset.
- **Reduce Training Time**: Training fewer layers decreases computational requirements, speeding up the process since fewer parameters are updated.
- **Avoid Overfitting**: It helps prevent overfitting, especially with smaller datasets. Early layers retain general features, while later layers fine-tune to the specific task.

Unfreezing Layers: Potential Benefits

- **Fine-Tune Specific Features**: By unfreezing some layers, the model can adapt more precisely to the new dataset.
- **Balance Preservation and Learning**: In ResNet50, unfreezing layers beyond the 143rd allows the network to fine-tune the last 32 layers. This balances maintaining useful low- and mid-level features while learning high-level features unique to the new data.

We'll unfreeze and train the last few layers to try to improve performance, and we'll also add dropout and batch normalization to the classification head.


In [None]:
def create_resnet_model_2(input_shape, nb_classes):
    """
    Build a ResNet50 model with additional fully connected layers.

    Args:
    input_shape (tuple): Shape of the input images.
    nb_classes (int): Number of classes.

    Returns:
    model: Keras model.
    """
    resnet_model_2 = Sequential(name="ResNetModel_2")

    # Start by importing the pre-defined ResNet50 model
    imported_model = tf.keras.applications.ResNet50(include_top=False,
                                                    input_shape=input_shape,
                                                    pooling='avg',
                                                    weights='imagenet')

    # Freeze the first 143 layers of the imported model
    for layer in imported_model.layers[:143]:
        layer.trainable = False

    # Unfreeze the remaining layers
    for layer in imported_model.layers[143:]:
        layer.trainable = True

    # Add preprocessing layer
    resnet_model_2.add(tf.keras.layers.Lambda(tf.keras.applications.resnet50.preprocess_input, input_shape=input_shape))
    resnet_model_2.add(imported_model)

    # Add additional layers
    
    # Complex:

    resnet_model_2.add(Flatten())
    resnet_model_2.add(Dropout(0.5))
    resnet_model_2.add(BatchNormalization())

    resnet_model_2.add(Dense(4096, activation='relu'))
    resnet_model_2.add(Dropout(0.5))
    resnet_model_2.add(BatchNormalization())
    resnet_model_2.add(Dense(4096, activation='relu'))
    resnet_model_2.add(Dropout(0.5))
    resnet_model_2.add(BatchNormalization())


    # Keep Simple:

    # resnet_model.add(Dense(4096, activation='relu'))
    # resnet_model.add(Dense(4096, activation='relu'))
    # resnet_model_2.add(BatchNormalization())
    # resnet_model_2.add(Dropout(0.5))


    resnet_model_2.add(Dense(nb_classes, activation='sigmoid'))

    return resnet_model_2

def compile_resnet_model_2():
    """
    Compile the ResNet model.

    Returns:
    model: Compiled Keras model.
    """
    input_shape = (224, 224, 3)
    nb_classes = 20

    model = create_resnet_model_2(input_shape, nb_classes)
    model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['Precision', 'Recall', 'Accuracy']) # Try changing the learning rate 
    model.summary()
    return model

# Compile the ResNet model
resnet_model_2 = compile_resnet_model_2()

We use early stopping during training to prevent overfitting.
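
The patience logic behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration for a metric that should increase, not the exact Keras implementation; the function name and the validation scores are made up.

```python
def early_stopping_epoch(val_scores, patience, min_delta=0.0):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score > best + min_delta:
            # Real improvement: remember it and reset the counter
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation accuracies: improvement stalls after the second epoch
scores = [0.50, 0.60, 0.605, 0.60, 0.59, 0.70]
stop_epoch = early_stopping_epoch(scores, patience=3, min_delta=0.01)
print(stop_epoch)  # 4
```

Note that training stops before the late jump to 0.70 is ever seen, which illustrates the trade-off in choosing `patience`.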

In [None]:
# Define number of epochs
epochs = 100

# Early stopping on the validation accuracy, which should increase, hence mode='max'
early_stopping = EarlyStopping(monitor='val_Accuracy', min_delta=0.01, patience=epochs//5, mode='max')

# Assuming X_train and y_train are defined and preprocessed
history_resnet_2 = resnet_model_2.fit(
    X_train,
    y_train,
    validation_split=0.2,
    epochs=epochs,
    batch_size=16,
    callbacks=[early_stopping]  # Include the callbacks in the training process
)

In [None]:
plot_history(history_resnet_2)

In [None]:
# Save the model weights
resnet_model_2.save_weights(model_weights_path)

In [None]:
# Load the model and weights for further use
# Rebuild and compile the ResNet50 model before loading the saved weights
input_shape = (224, 224, 3)
nb_classes = 20
resnet_model_2 = compile_resnet_model_2()
resnet_model_2.load_weights(model_weights_path)

In [None]:
from sklearn.preprocessing import LabelEncoder

# Assuming you have a list of class names
class_names = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 
               'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 
               'dog', 'horse', 'motorbike', 'person', 'pottedplant', 
               'sheep', 'sofa', 'train', 'tvmonitor']

# Create and fit the LabelEncoder
encoder = LabelEncoder()
encoder.fit(class_names)

In [None]:
from sklearn.metrics import classification_report, average_precision_score, accuracy_score
# Assuming X_train and y_train are defined and preprocessed

# Function to get classification metrics
def print_classification_metrics(model, X, y, encoder):
    probs = model.predict(X)
    preds = np.round(probs, 0)  # threshold the sigmoid outputs at 0.5
    classification_metrics = classification_report(y, preds, target_names=encoder.classes_)
    print(classification_metrics)
    # Average precision is a ranking metric, so it is computed from the probabilities
    print("mAP: {:.2f}%".format(average_precision_score(y, probs) * 100))

# Display the classification metrics for the (augmented) training set; no separate validation arrays are defined here
print_classification_metrics(resnet_model_2, X_train, y_train, encoder)

In [None]:
y_pred_prob  = resnet_model_2.predict(X_test)

# Decision threshold for turning probabilities into labels. Values around
# 0.3-0.5 gave the best F1 score in earlier experiments on a test split
# that is no longer used.
threshold = 0.3

y_pred = np.array(y_pred_prob>threshold, int)
test_df.loc[:, labels] = y_pred
test_df.head(1)
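
A threshold like this can be chosen on held-out validation data rather than by hand. The sketch below is an illustration, not the procedure used in this notebook: the helper names `f1_at_threshold` and `best_threshold` and the toy labels and probabilities are made up. It picks the candidate threshold with the highest micro-averaged F1 score.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """Micro-averaged F1 for multi-label data given as nested lists."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            pred = 1 if p > threshold else 0
            tp += pred == 1 and t == 1
            fp += pred == 1 and t == 0
            fn += pred == 0 and t == 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, candidates):
    """Return the candidate with the highest F1 (the first one wins ties)."""
    return max(candidates, key=lambda th: f1_at_threshold(y_true, y_prob, th))


# Made-up validation labels and predicted probabilities (3 images, 3 classes)
y_val_toy = [[1, 0, 0], [0, 1, 1], [1, 0, 1]]
p_val_toy = [[0.9, 0.2, 0.1], [0.1, 0.45, 0.8], [0.35, 0.05, 0.6]]
candidates = [0.3, 0.4, 0.5]
chosen = best_threshold(y_val_toy, p_val_toy, candidates)
print(chosen, f1_at_threshold(y_val_toy, p_val_toy, chosen))
```

In this toy data two true labels have probabilities below 0.5, so a lower threshold recovers them without adding false positives.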

We also tried VGG16 to see whether it would yield better results.

**Key differences:**

*  **Depth**: ResNet50 is deeper with 50 layers, whereas VGG16 consists of 16 layers.
*  **Architecture**: ResNet50 employs residual connections to facilitate the training of deeper networks, while VGG16 follows a straightforward, sequential architecture without shortcuts.

VGG16 did not improve the results. Unfreezing some layers did not help either.



# 3. Semantic segmentation
The goal here is to implement a segmentation CNN that labels every pixel in the image as belonging to one of the 20 classes (or background). We first define some functions to show predictions and the final submission classification/segmentation.

In [None]:
def show_predictions(X, Y_pred, indices, ground_truth=None):
    m = len(indices)
    f = 2 if ground_truth is None else 3
    for i in range(m):
        plt.subplot(m, f, i*f+1)
        plt.imshow(X[indices[i]])
        plt.axis('off')  # Hide the axes of every panel, not only the last one
        plt.subplot(m, f, i*f+2)
        plt.imshow(Y_pred[indices[i]], vmin=0, vmax=20)
        plt.axis('off')
        if ground_truth is not None:
            plt.subplot(m, f, i*f+3)
            plt.imshow(ground_truth[indices[i]], vmin=0, vmax=20)
            plt.axis('off')
    plt.show()

In [None]:
def show_submission(df,N):
    CLASSES = {'aeroplane': 0, 'bicycle': 1, 'bird': 2, 'boat': 3, 'bottle': 4, 'bus': 5,
            'car': 6, 'cat': 7, 'chair': 8, 'cow': 9, 'diningtable': 10, 'dog': 11,
            'horse': 12, 'motorbike':13, 'person': 14, 'pottedplant': 15, 'sheep': 16,
            'sofa': 17, 'train': 18, 'tvmonitor': 19}

    for i in range(N):
        plt.subplot(1,2,1)
        plt.imshow(df["img"][i])
        plt.subplot(1, 2, 2)
        mask = df["seg"][i]
        print(f"segmentation labels present in img {i}: {np.unique(mask)}")
        plt.imshow(mask, vmin=0, vmax=20)
        # Look up the classification columns by name, as in the postprocessing step
        class_names = [class_name for class_name in CLASSES if df.loc[i, class_name] == 1]
        plt.title(",".join(class_names))
        plt.show()
        
    

We can already inspect the class labels that the classifier assigned to the test data.

In [None]:
show_submission(test_df,10)

We resize and normalize the images, but only resize the masks. The mask values are class labels, so the resized masks must still contain valid labels. We therefore resize them with nearest-neighbour interpolation, which copies the value of a neighbouring pixel instead of averaging several pixels.

In [None]:
def resize_images(images,img_size):
    list_images = list()
    for img in images:
        normal_image = tf.cast(img, tf.float32) / 255.0
        img_resized = resize(normal_image,img_size, anti_aliasing=True)
        list_images.append(img_resized)
    return np.array(list_images) 

def resize_masks(masks,mask_size):
    list_masks = list()
    for mask in masks:
        mask_resized = tf.image.resize(mask[None, ..., None], mask_size, method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
        mask_resized = np.clip(mask_resized,0,len(labels))
        list_masks.append(mask_resized[0,:,:,0])
    return np.array(list_masks) 
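
The difference between the two resizing strategies can be seen on a toy mask. The following pure-Python sketch is only an illustration (the helpers `downsample_nearest` and `downsample_average` are hypothetical and much simpler than `tf.image.resize`). It assumes the label encoding used later in the postprocessing step, where 0 is background and class index `j` is stored as label `j + 1`.

```python
def downsample_nearest(mask, factor):
    """Keep every `factor`-th pixel: output values are always existing labels."""
    return [row[::factor] for row in mask[::factor]]


def downsample_average(mask, factor):
    """Average each factor x factor block: can create values that are not labels."""
    out = []
    for r in range(0, len(mask), factor):
        out_row = []
        for c in range(0, len(mask[0]), factor):
            block = [mask[r + dr][c + dc] for dr in range(factor) for dc in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# Toy 4x4 mask: label 0 is background, label 15 is 'person' (index 14 + 1)
toy_mask = [
    [0, 0, 15, 15],
    [0, 15, 15, 15],
    [0, 0, 0, 15],
    [0, 0, 0, 0],
]
nearest = downsample_nearest(toy_mask, 2)
averaged = downsample_average(toy_mask, 2)
print(nearest)
print(averaged)
```

The averaged mask contains values such as 3.75 along the object boundary. Rounded to a label, that value would point at an unrelated class, whereas the nearest-neighbour result only contains 0 and 15.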


We now resize the training set and split off 20% for validation purposes.

In [None]:
NUM_LABELS = len(labels) + 1

X_train_seg = resize_images(train_df["img"],(IMG_HEIGHT,IMG_WIDTH,3))
Y_train_seg = resize_masks(train_df["seg"],(IMG_HEIGHT,IMG_WIDTH))

X_train_seg, X_val_seg, Y_train_seg, Y_val_seg = train_test_split(X_train_seg,Y_train_seg,test_size=0.2,random_state=42)

We also augment the data set. We found that flipping the images alone yields better segmentation results than additionally changing the brightness, as was done for classification.

In [None]:
def augment(X, Y):
  flip = np.random.random(X.shape[0])
  X_new = np.empty((X.shape[0] + np.sum(flip > 0.5), X.shape[1],X.shape[2],X.shape[3]), dtype=X.dtype)
  Y_new = np.empty((Y.shape[0] + np.sum(flip > 0.5), Y.shape[1],Y.shape[2]),dtype=Y.dtype)
  ind = 0
  for i in range(X.shape[0]):
    X_new[ind] = X[i]
    Y_new[ind] = Y[i]
    ind += 1
    if flip[i]>0.5:
      X_new[ind] = np.fliplr(X[i])
      Y_new[ind] = np.fliplr(Y[i])
      ind += 1
  
  return X_new, Y_new
  
X_aug_seg, Y_aug_seg = augment(X_train_seg,Y_train_seg)
np.unique(Y_aug_seg)

TRAIN_LENGTH = X_aug_seg.shape[0]  # Size of the augmented set that is actually used for training
BATCH_SIZE = 32
BUFFER_SIZE = 1500
STEPS_PER_EPOCH = TRAIN_LENGTH // BATCH_SIZE

training_set = tf.data.Dataset.from_tensor_slices((X_aug_seg, Y_aug_seg))
val_set = tf.data.Dataset.from_tensor_slices((X_val_seg,Y_val_seg))
train_batches = (
    training_set
    .cache()
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE)
    .repeat()
    .prefetch(buffer_size=tf.data.AUTOTUNE))

val_batches = val_set.batch(BATCH_SIZE)
for images, masks in train_batches.take(4):
    sample_image, sample_mask = images[0], masks[0]
    display((sample_image,sample_mask))

We shall set up a display callback to see the changes in segmentation during the training.

In [None]:
EPOCHS = 30
VAL_SUBSPLITS = 5
VALIDATION_STEPS = X_val_seg.shape[0]//BATCH_SIZE//VAL_SUBSPLITS


class DisplayCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        clear_output(wait=True)
        indices = [6]  # Indices of the validation images to display
        for idx in indices:
            pred = self.model.predict(X_val_seg[idx:idx+1])
            pred = tf.math.argmax(pred, axis=-1)
            plt.subplot(1, 3, 1)  # Input image
            plt.imshow(X_val_seg[idx])
            plt.axis('off')
            plt.subplot(1, 3, 2)  # Ground-truth mask
            plt.imshow(Y_val_seg[idx], vmin=0, vmax=20)
            plt.axis('off')
            plt.subplot(1, 3, 3)  # Predicted mask
            plt.imshow(pred[0], vmin=0, vmax=20)
            plt.axis('off')
            plt.show()
        print('\nSample Prediction after epoch {}\n'.format(epoch+1))

We use the mean Dice coefficient to estimate how well the segmentation will perform on the test set, since the competition is also scored with a Dice metric.

In [None]:
def mean_dice_coefficient(y_true, y_pred, smooth=1.0):
    # Average the smoothed Dice score over all labels (background + object classes)
    num_classes = len(labels) + 1
    pred_labels = tf.math.argmax(y_pred, axis=-1)
    mean_dice = 0
    for i in range(num_classes):
        mask_true = y_true == i
        mask_pred = pred_labels == i
        intersection = tf.cast(tf.reduce_sum(tf.cast(tf.logical_and(mask_true, mask_pred), tf.int32)), tf.float32)
        total = tf.cast(tf.reduce_sum(tf.cast(mask_true, tf.int32)) + tf.reduce_sum(tf.cast(mask_pred, tf.int32)), tf.float32)
        # Use the same smoothing term in numerator and denominator
        mean_dice += (2.*intersection + smooth) / (total + smooth)
    mean_dice /= num_classes
    return mean_dice
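
For a single class $c$ with true pixel set $T_c$, predicted pixel set $P_c$ and smoothing term $s$, the smoothed Dice score is $(2|T_c \cap P_c| + s) / (|T_c| + |P_c| + s)$. The pure-Python sketch below works through this on a toy example. The helper names and the tiny flattened masks are made up for illustration and are not part of the training code.

```python
def dice_for_class(y_true, y_pred, cls, smooth=1.0):
    """Smoothed Dice score of one class for two flat lists of integer labels."""
    true_count = sum(1 for t in y_true if t == cls)
    pred_count = sum(1 for p in y_pred if p == cls)
    intersection = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    return (2.0 * intersection + smooth) / (true_count + pred_count + smooth)


def mean_dice(y_true, y_pred, num_classes, smooth=1.0):
    """Average the per-class Dice scores over labels 0..num_classes-1."""
    scores = [dice_for_class(y_true, y_pred, c, smooth) for c in range(num_classes)]
    return sum(scores) / num_classes


# Toy flattened masks with three labels (0 = background)
truth = [0, 0, 0, 1, 1, 2]
pred = [0, 0, 1, 1, 1, 0]
print(dice_for_class(truth, pred, 1))
print(mean_dice(truth, pred, num_classes=3))
```

Note that with smoothing, a class that is absent from both masks scores exactly 1, so images containing few classes tend to get a high mean Dice.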

We shall use a U-net for our segmentation.
A U-net consists of an encoder followed by a bottleneck and a decoder. The encoder and decoder are connected by skip connections.
### contraction
The image is first "contracted" by the encoder, which detects features at multiple scales by repeatedly convolving the image with small kernels. Each level consists of two 3x3 convolutions, each followed by a ReLU activation, and a downsampling step. The spatial resolution is thus halved at each level while the number of feature channels doubles.
### bottleneck
Here the image is in its most contracted state and features are learned at the most abstract level.
### expansion
After the bottleneck, the decoder upsamples the features learned at the coarsest level back to the original image resolution. It mirrors the contraction path and must recover the full detail at each larger scale.
### skip connections
Skip connections pass the output of each encoder level to the input of the corresponding decoder level. They supply the spatial detail that is lost during downsampling, which helps the decoder reconstruct the features.
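
The shape bookkeeping of such a network can be traced with a few lines of pure Python. This is a sketch of a plain U-net under stated assumptions ('same' padding, 2x down- and upsampling, channel doubling per level); the helper `unet_shapes` and the sizes 128 and 64 are illustrative and do not describe the MobileNetV2-based model used in this notebook.

```python
def unet_shapes(input_size, base_channels, depth):
    """Trace (spatial size, channels) through a plain U-net. Illustrative only."""
    size, channels = input_size, base_channels
    encoder = []
    for _ in range(depth):
        encoder.append((size, channels))  # Feature map kept for the skip connection
        size, channels = size // 2, channels * 2
    bottleneck = (size, channels)
    decoder = []
    for skip_size, skip_channels in reversed(encoder):
        # Upsample and halve the channels; the convolutions after each
        # concatenation are assumed to bring the channel count back to this value
        size, channels = size * 2, channels // 2
        assert size == skip_size  # Skip and upsampled map must align spatially
        decoder.append((size, channels + skip_channels))  # Channels right after concatenation
    return encoder, bottleneck, decoder


enc, mid, dec = unet_shapes(input_size=128, base_channels=64, depth=4)
print(enc)
print(mid)
print(dec)
```

The assertion inside the decoder loop expresses why the expansion path has to mirror the contraction path: a skip connection can only be concatenated if both feature maps have the same spatial size.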


We use transfer learning for our best model: a pretrained network serves as the encoder and we build a decoder on top of it. The encoder is MobileNetV2 with weights pretrained on the ImageNet dataset. Initially, we freeze the pretrained weights while we train our U-net.

In [None]:
base_model = tf.keras.applications.MobileNetV2(input_shape=[IMG_HEIGHT, IMG_WIDTH, 3],weights='imagenet', include_top=False)
layer_names = [
    'block_1_expand_relu', 
    'block_3_expand_relu',  
    'block_6_expand_relu', 
    'block_13_expand_relu', 
    'block_16_project',      
]
base_model_outputs = [base_model.get_layer(name).output for name in layer_names]

# Create the feature extraction model
down_stack = tf.keras.Model(inputs=base_model.input, outputs=base_model_outputs)


down_stack.trainable = False

def upsample(filters, size, apply_dropout=False):
  initializer = tf.random_normal_initializer(0., 0.02)

  result = tf.keras.Sequential()
  result.add(
    tf.keras.layers.Conv2DTranspose(filters, size, strides=2,
                                    padding='same',
                                    kernel_initializer=initializer,
                                    use_bias=False))

  result.add(tf.keras.layers.BatchNormalization())

  if apply_dropout:
      result.add(tf.keras.layers.Dropout(0.5))

  result.add(tf.keras.layers.ReLU())

  return result

up_stack = [
    upsample(512, 3),  # 4x4 -> 8x8
    upsample(256, 3),  # 8x8 -> 16x16
    upsample(128, 3),  # 16x16 -> 32x32
    upsample(64, 3),   # 32x32 -> 64x64
]

def unet_model(output_channels:int):
    inputs = tf.keras.layers.Input(shape=[IMG_HEIGHT, IMG_WIDTH, 3])

    # Downsampling through the model
    skips = down_stack(inputs)
    x = skips[-1]
    skips = reversed(skips[:-1])

    # Upsampling and establishing the skip connections
    for up, skip in zip(up_stack, skips):
        x = up(x)
        concat = tf.keras.layers.Concatenate()
        x = concat([x, skip])

    # This is the last layer of the model
    last = tf.keras.layers.Conv2DTranspose(
        filters=output_channels, kernel_size=3, strides=2,
        padding='same')  #64x64 -> 128x128

    x = last(x)

    return tf.keras.Model(inputs=inputs, outputs=x)

OUTPUT_CLASSES = len(labels)+1

model = unet_model(output_channels=OUTPUT_CLASSES)
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy', mean_dice_coefficient])

model.summary()

There is a large class imbalance in the training set: about 77% of the pixels are background. If left unattended, this can lead to the model only learning to segment background. We therefore weight each pixel inversely to the frequency of its class. Background pixels occur more frequently, but each of them then contributes less to the loss.

In [None]:
distribution = np.mean([[np.sum(Y_ == i) / Y_.size for i in range(len(labels) + 1)] for Y_ in Y_train_seg], axis=0)
def add_sample_weights(image, label):
  # The weights for each class, with the constraint that:
  #     sum(class_weights) == 1.0
  class_weights = tf.constant(1/distribution)
#   temp = 20
#   class_weights = np.array([1,temp, temp,temp, temp,temp, temp,temp, temp,temp, temp,temp, temp,temp, temp,temp, temp,temp, temp,temp, temp])
  class_weights = class_weights/tf.reduce_sum(class_weights)

  # Create an image of `sample_weights` by using the label at each pixel as an
  # index into the `class weights` .
  sample_weights = tf.gather(class_weights, indices=tf.cast(label, tf.int32))

  return image, label, sample_weights
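
The effect of this weighting can be checked on a small example. The sketch below is a pure-Python illustration: the helper `inverse_frequency_weights` is hypothetical, and only the 77% background share comes from the text above, while the split of the remaining pixels is made up.

```python
def inverse_frequency_weights(frequencies):
    """Normalised inverse-frequency weights: rarer classes get larger weights."""
    inverse = [1.0 / f for f in frequencies]
    total = sum(inverse)
    return [w / total for w in inverse]


# Made-up pixel frequencies for three labels (background first)
toy_distribution = [0.77, 0.20, 0.03]
weights = inverse_frequency_weights(toy_distribution)
print([round(w, 3) for w in weights])
```

The ratio between two weights equals the inverse ratio of the class frequencies, so every class contributes the same total weight to the loss. A class that never occurs in the training masks would lead to a division by zero, so it would need special handling.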

In [None]:
model_weights_path = os.path.join(weights_folder, "mobilenetv2.weights.h5")

In [None]:
model_history = model.fit(train_batches.map(add_sample_weights), epochs=EPOCHS,
                          steps_per_epoch=STEPS_PER_EPOCH,
                          validation_steps=VALIDATION_STEPS,
                          validation_data=val_batches,
                          callbacks=[DisplayCallback()])

In [None]:
loss = model_history.history['loss']
val_loss = model_history.history['val_loss']

plt.figure()
plt.plot(model_history.epoch, loss, 'r', label='Training loss')
plt.plot(model_history.epoch, val_loss, 'bo', label='Validation loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss Value')
plt.ylim([0, 1])
plt.legend()
plt.show()

While the sample weighting helps to prevent the background from taking over during training, we continue training for some more epochs on the non-weighted training set. This helps to get rid of blobs outside of the actual objects: without reweighting, the background has a higher importance, which leads to more pixels being predicted as background.

In [None]:
model_history = model.fit(train_batches, epochs=20,
                          steps_per_epoch=STEPS_PER_EPOCH,
                          validation_steps=VALIDATION_STEPS,
                          validation_data=val_batches,
                          callbacks=[DisplayCallback()])

Finally, we fine-tune the model by making the pretrained layers trainable and training some more with a very low learning rate. This adapts the pretrained encoder so that it fits better with our decoder.

In [None]:
down_stack.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),  # Very low learning rate
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model_history = model.fit(train_batches, epochs=20,
                          steps_per_epoch=STEPS_PER_EPOCH,
                          validation_steps=VALIDATION_STEPS,
                          validation_data=val_batches,
                          callbacks=[DisplayCallback()])

In [None]:
model.save_weights(model_weights_path)

In [None]:
model.load_weights(model_weights_path)

We now segment the test set and view the results.

In [None]:
X_test_seg = resize_images(test_df["img"],(IMG_HEIGHT,IMG_WIDTH, 3))
res = model.predict(X_test_seg)
masks = np.argmax(res,axis=-1).astype(np.uint8)
test_df["seg"] = [tf.reshape(tf.image.resize(masks[i][None,...,None],(img.shape[0],img.shape[1]), method=tf.image.ResizeMethod.NEAREST_NEIGHBOR),(img.shape[0],img.shape[1])) for i,img in enumerate(test_df["img"])]

In [None]:
show_submission(test_df,10)

There are some nicely segmented objects, although a few issues can be noticed:
* Some images have no classification.
* Some masks contain many segments of only a few pixels.
* Sometimes the classifier detects classes that are barely present in the segmentation.

We attempt to fix these issues by postprocessing the segmentation and classification:
* If there is no classification, we assign the classes that are well represented in the segmentation (>20% of the non-background pixels).
* If a segmented class covers only a small share of the non-background pixels (<5%), we set it to background. If exactly one class was classified, segments of other classes that cover less than 15% are instead relabelled as that class.
* If several classes were classified and one of them is not well represented in the segmentation (<5%), we remove that classification.

In [None]:
def postprocess(df):
    CLASSES = {'aeroplane': 0, 'bicycle': 1, 'bird': 2, 'boat': 3, 'bottle': 4, 'bus': 5,
            'car': 6, 'cat': 7, 'chair': 8, 'cow': 9, 'diningtable': 10, 'dog': 11,
            'horse': 12, 'motorbike':13, 'person': 14, 'pottedplant': 15, 'sheep': 16,
            'sofa': 17, 'train': 18, 'tvmonitor': 19}
    res = []
    for i in range(len(df)):
        class_names = [class_name for class_name, label in CLASSES.items() if df.loc[i,class_name] == 1]
        mask = df.loc[i, "seg"]
        tol = 0.05
        new_mask = tf.identity(mask)
        # remove small segmentations
        for j in range(len(CLASSES)):
            if (tf.reduce_sum(tf.cast(mask == (j + 1), tf.float32)) < tol * tf.reduce_sum(tf.cast(mask !=0, tf.float32))):
    #                     print(f"image only has {tf.reduce_sum(tf.cast(mask == (j + 1), tf.float32))} pixels of {list(CLASSES.keys())[j]} ({float(len(np.where(mask == (j+1)[0]))/tf.reduce_sum(tf.cast(mask !=0, tf.float32))}%)")
                new_mask = tf.where(mask == (j + 1), 0, new_mask)
        
        if (len(class_names) > 1):
            # if you detected multiple classes, but the segmentation did not, remove the classification 
            tol = 0.05
            j = 1        
            for class_name, label in CLASSES.items():
                if (df.loc[i,class_name] == 1) and( tf.reduce_sum(tf.cast(mask == (j), tf.float32)) < tol*tf.reduce_sum(tf.cast(mask !=0, tf.float32))):
#                     print(f"image of {tf.size(mask).numpy()} pixels, only has {len(np.where(mask == j)[0])} pixels of {class_name} detected ({float(len(np.where(mask == j)[0]))/tf.reduce_sum(tf.cast(mask !=0, tf.float32))}%)")
                    df.loc[i,class_name] = 0
                j +=1 
        # if you only detected one class, and there are small sections of others segmented, change them into the colour
        elif (len(class_names) == 1):
            tol = 0.15
            for j in range(len(CLASSES)):
                if (CLASSES[class_names[0]] != j) and (tf.reduce_sum(tf.cast(mask == (j + 1), tf.float32)) < tol * tf.reduce_sum(tf.cast(mask !=0, tf.float32))):
                    new_mask = tf.where(mask == (j + 1), CLASSES[class_names[0]] + 1, new_mask)
        else:
            # no class was detected: adopt the classes that dominate the segmentation (>20% of non-background)
            for j in range(len(CLASSES)):
                # no classification but obvious segmentation
                if (len(class_names) == 0) and (tf.reduce_sum(tf.cast(mask == (j + 1), tf.float32)) > 0.2 * tf.reduce_sum(tf.cast(mask !=0, tf.float32))):
#                     print(f"image {i} has nothing classified, but segmentation of {tf.reduce_sum(tf.cast(mask == (j + 1), tf.float32))/tf.reduce_sum(tf.cast(mask !=0, tf.float32))} for {list(CLASSES.keys())[j]}")
                    df.loc[i,list(CLASSES.keys())[j]] = 1
        res.append(new_mask)
    df["seg"] = res
    return df
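
The threshold rules can be illustrated on pixel counts alone, without any masks. The following pure-Python sketch is a simplification of the function above, not a replacement for it: the helpers `small_classes` and `dominant_classes` and the pixel counts are made up, and labels follow the mask encoding (0 = background, class index `j` stored as `j + 1`).

```python
def small_classes(pixel_counts, tol=0.05):
    """Return labels whose share of the non-background pixels is below `tol`.

    `pixel_counts` maps a mask label (0 = background) to its pixel count.
    """
    foreground = sum(n for label, n in pixel_counts.items() if label != 0)
    return sorted(label for label, n in pixel_counts.items()
                  if label != 0 and n < tol * foreground)


def dominant_classes(pixel_counts, min_share=0.2):
    """Return labels covering more than `min_share` of the non-background pixels."""
    foreground = sum(n for label, n in pixel_counts.items() if label != 0)
    return sorted(label for label, n in pixel_counts.items()
                  if label != 0 and n > min_share * foreground)


# Made-up pixel counts for one mask: background, cat (8), dog (12), sofa (18)
counts = {0: 9000, 8: 700, 12: 20, 18: 280}
print(small_classes(counts))     # Labels that would be reset to background
print(dominant_classes(counts))  # Labels that would be added to an empty classification
```

All shares are measured relative to the non-background pixels, so the size of the background does not influence which segments count as small.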

In [None]:
test_df = postprocess(test_df)

We see that these issues are mostly resolved.

In [None]:
show_submission(test_df,10)

# Submission

In [None]:
generate_submission(test_df)

# 4. Adversarial attack
For this part, your goal is to fool your classification and/or segmentation CNN, using an *adversarial attack*. More specifically, the goal is to build a CNN to perturb test images in a way that (i) they look unperturbed to humans; but (ii) the CNN classifies/segments these images in line with the perturbations.

This part demonstrates how to generate targeted adversarial examples using our classification model resnet 2. The goal is to perturb input images such that the model misclassifies them as a specified target class.

## 4.1.1. Define Functions

### targeted_adversarial_pattern

This function generates the adversarial perturbation for a given input image and target label. It uses the gradient of the loss with respect to the input image to create the perturbation.

In [None]:
def targeted_adversarial_pattern(input_image, target_label):
    input_image = tf.convert_to_tensor(input_image, dtype=tf.float32)
    target_label = tf.convert_to_tensor(target_label, dtype=tf.float32)
    target_label = tf.reshape(target_label, (1, -1))  # Ensure the target label shape matches the prediction
    with tf.GradientTape() as tape:
        tape.watch(input_image)
        prediction = resnet_model_2(input_image)
        loss = tf.keras.losses.binary_crossentropy(target_label, prediction)
    gradient = tape.gradient(loss, input_image)
    signed_grad = tf.sign(gradient)  # Get the sign of the gradient to create the perturbation
    return signed_grad.numpy()
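
The same idea can be followed by hand on a model small enough to differentiate analytically. The sketch below replaces the ResNet by a toy logistic regression, for which the gradient of the binary cross-entropy with respect to the input is `(p - target) * w`. All weights, inputs and the step size are made up, and inputs are clipped to [0, 1] rather than [0, 255].

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Toy one-output 'classifier': logistic regression on a flat input."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def signed_gradient(x, w, b, target):
    """Sign of d(BCE)/dx; for logistic regression the gradient is (p - target) * w."""
    p = predict(x, w, b)
    grad = [(p - target) * wi for wi in w]
    return [(g > 0) - (g < 0) for g in grad]


# Made-up weights and input; the model initially rejects the target class
w, b = [2.0, -1.0, 0.5], -1.0
x = [0.1, 0.8, 0.2]
epsilon, num_steps = 0.1, 10
before = predict(x, w, b)
for _ in range(num_steps):
    step = signed_gradient(x, w, b, target=1.0)
    # Subtract the signed gradient to lower the loss towards the target, then clip
    x = [min(1.0, max(0.0, xi - epsilon * si)) for xi, si in zip(x, step)]
after = predict(x, w, b)
print(round(before, 3), round(after, 3))
```

Subtracting the signed gradient decreases the loss with respect to the target label, which is what makes the attack targeted. Adding it would instead push the prediction away from the given label.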

### scale_perturbations

This function scales the perturbations for better visualization.

In [None]:
def scale_perturbations(perturbations):
    perturbations = perturbations - perturbations.min()
    if perturbations.max() != 0:
        perturbations = perturbations / perturbations.max()
    return perturbations

## 4.1.2. Set Parameters

Set the parameters for the adversarial attack.

In [None]:
c = 0  # Counter for successful attacks
target_class_index = 18  # The index of the target class to be set to 1
num_samples = 2  # Number of samples to attack
epsilon = 0.45  # Perturbation step size
num_steps = 10  # Number of iterative steps

## 4.1.3. Iterate Through Samples

For each sample, we begin by getting the original prediction from the model. We then create the desired label vector by copying the original multi-hot label and setting the target class to 1. Starting from the original image, we iteratively apply perturbations: in each step, we compute the gradient of the loss (between the model’s prediction and the desired label) with respect to the image, take its sign, and subtract it from the image multiplied by a factor epsilon. The result is clipped to the valid pixel range. This is repeated for a specified number of steps, after which we obtain the final adversarial prediction. We visualize the original image, the perturbations, and the adversarial image by plotting them side by side.
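
Written as an update rule, the procedure reads as follows. The notation is introduced here for illustration: $f$ is the classifier, $y_{\text{des}}$ the desired label vector, $\epsilon$ the step size and $t$ the step index.

```latex
x^{(0)} = x, \qquad
x^{(t+1)} = \operatorname{clip}\!\Big( x^{(t)} - \epsilon \, \operatorname{sign}\!\big( \nabla_{x} \mathcal{L}_{\mathrm{BCE}}\big( y_{\text{des}},\, f(x^{(t)}) \big) \big),\; 0,\; 255 \Big)
```

Since every entry of the sign is -1, 0 or 1, each pixel value can change by at most $\epsilon$ per step. With the parameters above (epsilon 0.45, 10 steps), the total change per pixel is bounded by 4.5 on the 0 to 255 scale.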

In [None]:
# Iterate through the samples
for i in range(num_samples):
    num = i
    labels = list(train_df.columns[:-2])
    image = X_train[num]
    image_label = y_train[num]

    # Get the original prediction
    scores_ori = resnet_model_2.predict(image.reshape((1, 224, 224, 3)))
    pred_ori = [labels[idx] for idx, val in enumerate(scores_ori[0]) if val > 0.5]
    
    # Create the desired label vector based on the original label but with high confidence in the target class
    des_label = image_label.copy()
    des_label[target_class_index] = 1  # Set high confidence in the target class

    # Ensure the desired label is reshaped correctly
    des_label = des_label.reshape((1, -1))  # Reshape to match the output shape

    # Start with the original image
    adversarial = image.copy()

    # Iteratively apply perturbations
    for step in range(num_steps):
        perturbations = targeted_adversarial_pattern(adversarial.reshape((1, 224, 224, 3)), des_label)
        adversarial = adversarial - epsilon * perturbations  # Subtraction to move towards the target class
        adversarial = np.clip(adversarial, 0, 255)  # Ensure pixel values are in the valid range

    # Get the final adversarial prediction
    scores_adv = resnet_model_2.predict(adversarial.reshape((1, 224, 224, 3)))
    pred_adv = [labels[idx] for idx, val in enumerate(scores_adv[0]) if val > 0.5]

    # Remove the batch dimension for visualization
    adversarial_img = adversarial.reshape((224, 224, 3))
    perturbation_image = (adversarial - image).reshape((224, 224, 3))
    
    # Scale perturbations for better visualization
    scaled_perturbations = scale_perturbations(perturbation_image)

    # Display original image, adversarial image, and perturbations
    fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))

    # Original image
    ax[0].imshow(image / 255.0)
    ax[0].set_title(f'Original Image\nPrediction: {pred_ori}')
    ax[0].axis('off')

    # Perturbations
    ax[1].imshow(scaled_perturbations)
    ax[1].set_title('Perturbations')
    ax[1].axis('off')
    
    # Adversarial image
    ax[2].imshow(adversarial_img / 255.0)
    ax[2].set_title(f'Adversarial Image\nPrediction: {pred_adv}')
    ax[2].axis('off')

    plt.show()

    # Check if the adversarial attack was successful
    if pred_ori != pred_adv and scores_adv[0][target_class_index] > 0.5:
        c += 1
        print(f'Successful attack! Original Prediction: {pred_ori}, Adversarial Prediction: {pred_adv}')
    else:
        print(f'Unsuccessful attack. Original Prediction: {pred_ori}, Adversarial Prediction: {pred_adv}')

print(f'Number of successful adversarial attacks: {c}')


## 4.1.4 Discussion results

We demonstrated how to generate targeted adversarial examples using our classification model resnet 2. We iteratively applied perturbations to the input images to misclassify them into a specified target class. By visualizing the original images, perturbations, and adversarial images, it is possible to see the impact of the adversarial attack. Even though the changes introduced by the perturbations are not visible to the human eye, they are sufficient to fool the classification model into misclassifying the images into the target class. 

## 4.2 Fooling the Classifier
### 4.2.1. Load Classification Model & Freeze Weights
We freeze the weights because we want to fool this specific model, not a retrained variant of it.

In [None]:
model_weights_path = os.path.join(weights_folder, "RESNET.weights.h5")

# Load the model and weights for further use
resnet_model = compile_resnet_model()
resnet_model.load_weights(model_weights_path)


In [None]:
# Freeze the weights
for layer in resnet_model.layers:
    layer.trainable = False
classification_model = resnet_model

### 4.2.2. Define Your Task

The classes to be fooled will not be chosen randomly. We need to select classes that already exhibit good predictive performance by the classifier. The rationale behind this is that if the classifier does not initially perform well on these classes, it becomes challenging to discern whether we are truly fooling the classifier or if it is merely making incorrect predictions as it has done previously.

Thus, we will look at classification metrics for all classes.

In [None]:
# Assuming X_train and y_train are already defined

# Calculate the split index for validation set
val_split_index = int(0.8 * len(X_train))

# Create the validation set from the training data
X_val = X_train[val_split_index:]
y_val = y_train[val_split_index:]

print("Validation set:", X_val.shape, y_val.shape)

In [None]:
from sklearn.preprocessing import LabelEncoder

# Assuming you have a list of class names
class_names = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 
               'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 
               'dog', 'horse', 'motorbike', 'person', 'pottedplant', 
               'sheep', 'sofa', 'train', 'tvmonitor']

# Create and fit the LabelEncoder
encoder = LabelEncoder()
encoder.fit(class_names)

In [None]:
from sklearn.metrics import classification_report, average_precision_score, accuracy_score
# Assuming X_train, y_train, X_val, y_val, X_test, y_test are defined and preprocessed

# Function to get classification metrics
def print_classification_metrics(model, X_test, y_test, encoder):
    probs = model.predict(X_test)
    preds = (probs > 0.5).astype(int)  # Hard labels for the per-class report
    classification_metrics = classification_report(y_test, preds, target_names=encoder.classes_)
    print(classification_metrics)
    # Average precision is a ranking metric, so it is computed on the probabilities
    print("mAP: {:.2f}%".format(average_precision_score(y_test, probs) * 100))

# Display the classification metrics for the validation set
print_classification_metrics(resnet_model, X_val, y_val, encoder)

Let's add noise to make our classifier classify **aeroplane** images as **cat**.

### 4.2.3. Create the Adversarial Model

Now, we define the adversarial model using the UNet architecture.

In [None]:
import segmentation_models as sm
# Set the framework for segmentation_models to use tensorflow.keras
sm.set_framework('tf.keras')
sm.framework()

# Define and compile the adversarial model
img_size = 224  # Assuming the image size is 224x224
adversarial_model = sm.Unet('mobilenetv2', classes=3, encoder_weights='imagenet', activation='tanh', input_shape=(img_size, img_size, 3))
adversarial_model.summary()

### 4.2.4 Combine the Adversarial Model and Classifier

In [None]:
inputs = Input(shape=(img_size, img_size, 3))
delta = adversarial_model(inputs)
delta = ActivityRegularization(l2=1e-3)(delta)
input_plus_delta = Add()([inputs, delta])
outputs = classification_model(input_plus_delta)
combined_model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Compile the combined model
combined_model.compile(loss='binary_crossentropy', optimizer=Adam(), 
                       metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
combined_model.summary()
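
The objective that the combined model optimises can be written out as follows. This is a sketch under the assumption that `ActivityRegularization(l2=...)` adds the given factor times the sum of squared activations to the loss; the exact scaling may differ between Keras versions. The symbols are notation introduced here: $g_{\theta}$ is the U-net noise generator, $f$ the frozen classifier, $\tilde{y}$ the deceptive label and $\lambda$ the regularization factor.

```latex
\min_{\theta} \; \mathcal{L}_{\mathrm{BCE}}\big( \tilde{y},\, f\big(x + g_{\theta}(x)\big) \big) \;+\; \lambda \, \lVert g_{\theta}(x) \rVert_2^2,
\qquad \lambda = 10^{-3}
```

The first term pushes the classifier output towards the deceptive label, while the second term keeps the noise small. Because of the `tanh` activation, every entry of $g_{\theta}(x)$ lies between -1 and 1.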

### 4.2.5. Train the Combined Model

In [None]:
# Function to get index of a class object
def get_index_of_object(class_name):
    return class_names.index(class_name)

In [None]:
# Define indices for the target and deceptive classes
target_class = "aeroplane"
deceptive_class = "cat"
target_class_index = get_index_of_object(target_class)
deceptive_class_index = get_index_of_object(deceptive_class)

# Extract images of the target class from the training set
X_train_target = np.array([X_train[i] for i in range(len(y_train)) if y_train[i][target_class_index] == 1])
y_train_target = np.array([y_train[i] for i in range(len(y_train)) if y_train[i][target_class_index] == 1])

# Modify the labels to deceive the classifier
y_train_target_deceptive = y_train_target.copy()
for i in range(len(y_train_target_deceptive)):
    y_train_target_deceptive[i][target_class_index] = 0
    y_train_target_deceptive[i][deceptive_class_index] = 1

# Train the combined model
batch_size = 32
epochs = 1000

combined_model.fit(X_train_target, y_train_target_deceptive, batch_size=batch_size, epochs=epochs,
                   callbacks=[EarlyStopping(monitor='loss', patience=epochs//10)], verbose=1)

In [None]:
# Save the model weights
model_weights_path = os.path.join(weights_folder, "adversarial_mobilenetv2.h5")
adversarial_model.save_weights(model_weights_path)

model_weights_path = os.path.join(weights_folder, "combined_model.h5")
combined_model.save_weights(model_weights_path)

In [None]:
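# Load the saved weights for further use
adversarial_model.load_weights(os.path.join(weights_folder, "adversarial_mobilenetv2.h5"))
combined_model.load_weights(os.path.join(weights_folder, "combined_model.h5"))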

### 4.2.6. Evaluate the Adversarial Model

Evaluate the model's performance and visualize the adversarial examples.

In [None]:
# Assuming combined_model and adversarial_model are defined and compiled


# Function to calculate per-class accuracy
def per_class_accuracy(y_true, y_pred, classes):
    accuracies = {}
    for i, class_name in enumerate(classes):
        class_true = y_true[:, i]
        class_pred = y_pred[:, i]
        class_acc = accuracy_score(class_true, class_pred)
        accuracies[class_name] = class_acc
    return accuracies

# Function to add adversarial noise
def add_adversarial_noise(adversarial_model, x):
    delta = adversarial_model.predict(np.array([x]))[0]
    x_plus_delta = x + delta
    return x_plus_delta, delta

# Evaluate the impact of adversarial noise on the validation set
def evaluate_adversarial_noise(classification_model, adversarial_model, X_val, y_val, encoder):
    X_val = np.asarray(X_val)

    # Keep the raw probabilities: ranking metrics such as average precision need them
    probs_clean = classification_model.predict(X_val)

    # Perturb the whole validation set in one batched call instead of image by image
    X_val_noisy = X_val + adversarial_model.predict(X_val)
    probs_noisy = classification_model.predict(X_val_noisy)

    # Threshold at 0.5 to obtain binary multilabel predictions
    preds_clean = np.round(probs_clean, 0)
    preds_noisy = np.round(probs_noisy, 0)

    # Single-label view: map the top-scoring class index back to its class name
    y_val_labels = encoder.inverse_transform(np.argmax(y_val, axis=1))
    preds_clean_labels = encoder.inverse_transform(np.argmax(probs_clean, axis=1))
    preds_noisy_labels = encoder.inverse_transform(np.argmax(probs_noisy, axis=1))

    # Compute classification metrics for clean and noisy data
    # (labels= keeps the report stable even if a class never occurs in the validation set)
    classification_metrics_clean = classification_report(y_val_labels, preds_clean_labels, labels=encoder.classes_, zero_division=0)
    classification_metrics_noisy = classification_report(y_val_labels, preds_noisy_labels, labels=encoder.classes_, zero_division=0)

    # Compute mean Average Precision (mAP) from the probabilities, not the rounded predictions
    mAP_clean = average_precision_score(y_val, probs_clean)
    mAP_noisy = average_precision_score(y_val, probs_noisy)

    # Compute per-class accuracy for clean and noisy data
    accuracies_clean = per_class_accuracy(y_val, preds_clean, encoder.classes_)
    accuracies_noisy = per_class_accuracy(y_val, preds_noisy, encoder.classes_)

    print("Classification metrics for clean validation set:")
    print(classification_metrics_clean)
    print("mAP (clean): {:.2f}%".format(mAP_clean * 100))
    print("\nPer-Class Accuracy (clean):")
    for class_name, acc in accuracies_clean.items():
        print(f"{class_name}: {acc * 100:.2f}%")

    print("\nClassification metrics for noisy validation set:")
    print(classification_metrics_noisy)
    print("mAP (noisy): {:.2f}%".format(mAP_noisy * 100))
    print("\nPer-Class Accuracy (noisy):")
    for class_name, acc in accuracies_noisy.items():
        print(f"{class_name}: {acc * 100:.2f}%")

# Display the classification metrics for the validation set with and without adversarial noise
evaluate_adversarial_noise(classification_model, adversarial_model, X_val, y_val, encoder)

# Visualize a few examples of the adversarial noise on the validation set
fig = plt.figure(figsize=(10, 10))
for i in range(5):
    x = X_val[i]
    x_plus_delta, delta_val = add_adversarial_noise(adversarial_model, x)

    fig.add_subplot(5, 3, i*3+1)
    plt.imshow(x.astype(np.uint8))
    plt.title("Input image")
    plt.axis('off')

    fig.add_subplot(5, 3, i*3+2)
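    # Rescale the perturbation to [0, 1] so that small or negative values remain visible
    plt.imshow((delta_val - delta_val.min()) / (delta_val.max() - delta_val.min() + 1e-8))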
    plt.title("Adversarial noise")
    plt.axis('off')

    fig.add_subplot(5, 3, i*3+3)
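    # Clip before casting: the uint8 conversion is not well defined for out-of-range values
    plt.imshow(np.clip(x_plus_delta, 0, 255).astype(np.uint8))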
    plt.title("Input for the classifier")
    plt.axis('off')

    y_pred_clean = classification_model.predict(np.array([x]))
    y_pred_noise = classification_model.predict(np.array([x_plus_delta]))
    print(f"Prediction without noise: {y_pred_clean}, with noise: {y_pred_noise}")

# 5. Discussion
## Classification
Multiple models were developed for the multilabel classification task, and the best one was selected based on validation accuracy. Early stopping was used to limit overfitting during training. Every model was trained with the binary cross-entropy loss and the Adam optimizer; the models differed in the number and type of layers, and therefore in the number of parameters.

In terms of preprocessing, each image was resized to 224 x 224. Data augmentation was applied to enlarge the training set: rotated, mirrored and brightness-adjusted copies of existing images were added to it.
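
The three augmentations mentioned above can be illustrated with a minimal NumPy sketch. It is not the notebook's actual pipeline: the function name `augment`, the fixed 90-degree rotation (a stand-in for arbitrary rotation angles) and the brightness offset of 30 are assumptions made purely for illustration. In the multilabel setting the labels of the copies stay identical to those of the source image.

```python
import numpy as np

def augment(image, brightness_delta=30):
    """Return mirrored, rotated and brightened copies of an RGB uint8 image.

    Hypothetical helper: the parameter values are illustrative assumptions.
    """
    # Horizontal mirror: reverse the column axis
    mirrored = image[:, ::-1, :]
    # 90-degree counter-clockwise rotation (keeps the shape of square images)
    rotated = np.rot90(image, k=1, axes=(0, 1))
    # Brightness shift: widen the dtype first so the addition cannot overflow
    brighter = np.clip(image.astype(np.int16) + brightness_delta, 0, 255).astype(np.uint8)
    return [mirrored, rotated, brighter]

# Usage: one 224 x 224 image yields three additional training images
image = np.zeros((224, 224, 3), dtype=np.uint8)
image[0, 0] = 250
copies = augment(image)
```

Each copy would be appended to the training set together with the unchanged label vector of the original image.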

The classification model built from scratch reached only a low validation accuracy and started overfitting quickly. The model based on the AlexNet architecture also underperformed, even on the training data. We did notice, however, that batch normalisation accelerated training and increased accuracy.

Finally, transfer learning was attempted. As expected, the ResNet-50 model initialised with pretrained weights gave the best validation accuracy. Additional layers were added on top of the pretrained network, and the cut-off value (decision threshold) was tuned, which eventually gave good classification performance. We conclude that transfer learning works best, but that the pretrained model still has to be adapted to the specific context of the task.
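
Tuning the cut-off value can be sketched as a simple sweep over candidate thresholds on the validation set. The snippet below is a minimal illustration, not the procedure used in the notebook: the helper names, the candidate grid and the choice of the micro-averaged F1 score as selection criterion are all assumptions.

```python
import numpy as np

def f1_at_threshold(y_true, y_prob, threshold):
    """Micro-averaged F1 of the multilabel predictions obtained at one cut-off."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def tune_cutoff(y_true, y_prob, candidates=None):
    """Return the candidate cut-off with the highest F1, and that F1 score."""
    if candidates is None:
        candidates = np.arange(0.05, 1.0, 0.05)  # assumed search grid
    scores = [f1_at_threshold(y_true, y_prob, t) for t in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), float(scores[best])

# Toy validation data: 3 images, 2 classes (made-up values)
y_true = np.array([[1, 0], [0, 1], [1, 1]])
y_prob = np.array([[0.42, 0.12], [0.22, 0.37], [0.47, 0.32]])
best_cutoff, best_f1 = tune_cutoff(y_true, y_prob)
```

In this toy example the default cut-off of 0.5 would predict no labels at all, whereas a lower cut-off separates the positives from the negatives. A cut-off selected this way should be confirmed on data that was not used for the selection.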

Possible improvements include training the model on a more extensive dataset, or adding further combinations of layers on top of the ResNet architecture. Furthermore, it could be interesting to test other architectures for transfer learning, such as GoogLeNet or MobileNetV2. Other improvements could come from careful hyperparameter tuning or from an ensemble model that combines the predictions of multiple models.

Given its high validation accuracy of around 90%, the model could in principle be used for real-life applications requiring multilabel classification, such as autonomous driving or classification of medical images. However, despite this high validation accuracy, the score in the Kaggle competition remained rather low, so its generalisation would have to be verified before any such use.

## Semantic segmentation

## Adversarial attacks


# Discussion

## Preprocessing
**Resize Images to Uniform Shape**: To ensure that all images in the dataset are of the same dimensions, several strategies were considered:
- Idea 1: Cropping
    - **Description**: Crop predefined shapes from the four corners and the center of the image, resulting in a 5x bigger dataset.
    - **Pros**: This method introduces data augmentation by generating multiple images from a single image, potentially increasing the dataset's size and variability.
    - **Cons**: Cropping can result in images that do not contain all the objects of the original image, leading to inaccuracies in labeling.

- Idea 2: Resizing
    - **Description**: Simply resize the images to a predefined shape, such as (224, 224).
    - **Pros**: Ensures uniformity across the dataset, avoiding mislabeling associated with cropping.
    - **Cons**: Direct resizing might distort the aspect ratio of the images.

- Idea 3: Resizing with Padding
    - **Description**: Resize images while adding padding for vertical images (~20% of the training set) to maintain the aspect ratio.
    - **Pros**: Preserves aspect ratio and ensures uniform dimensions.
    - **Cons**: Padding may introduce unwanted artifacts.

**Chosen Approach**: We opted for Idea 2: resizing the images to (224, 224). This approach ensures consistency and simplicity in the preprocessing step.
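
The three resizing strategies can be compared with a small NumPy sketch. The helper names are hypothetical, and `resize_nearest` is only a nearest-neighbour stand-in for a proper library resize (for instance `cv2.resize`); the 500 x 375 input size is an arbitrary example of a vertical image.

```python
import numpy as np

def five_crop(image, size):
    """Idea 1: crop the four corners and the centre (assumes size <= height and width)."""
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return [
        image[:size, :size],                      # top-left
        image[:size, w - size:],                  # top-right
        image[h - size:, :size],                  # bottom-left
        image[h - size:, w - size:],              # bottom-right
        image[top:top + size, left:left + size],  # centre
    ]

def resize_nearest(image, out_h, out_w):
    """Idea 2: direct resize (nearest neighbour); distorts the aspect ratio."""
    h, w = image.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return image[rows[:, None], cols]

def pad_to_square(image, fill=0):
    """Idea 3, first step: pad the shorter side so a later resize keeps the aspect ratio."""
    h, w = image.shape[:2]
    side = max(h, w)
    pad_h, pad_w = side - h, side - w
    pad = ((pad_h // 2, pad_h - pad_h // 2), (pad_w // 2, pad_w - pad_w // 2), (0, 0))
    return np.pad(image, pad, mode="constant", constant_values=fill)

# Usage on a vertical image (height 500, width 375)
image = np.ones((500, 375, 3), dtype=np.uint8)
crops = five_crop(image, 224)
resized = resize_nearest(image, 224, 224)
padded = resize_nearest(pad_to_square(image), 224, 224)
```

All three variants produce 224 x 224 inputs; they differ in whether objects can be cut off (cropping), stretched (direct resizing) or surrounded by constant borders (padding).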

## Image Classification
### Imbalanced Dataset

**Observation**: The dataset is imbalanced, with 27% of samples belonging to the 'person' class.

**Handling Imbalanced Datasets**
- Idea 1: Data Augmentation
    - **Description**: Augment the training dataset to balance the class distribution.
    - **Outcome**: This method was implemented by augmenting images, particularly focusing on the 'person' class.

- Idea 2: Evaluation Metrics
    - **Description**: Use metrics like the F1 score, considering both precision and recall.
    - **Suggested Metric**: F1 score was chosen for its balanced measure of performance across classes.
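
To illustrate why the F1 score is more informative than accuracy on an imbalanced multilabel dataset, the following sketch computes per-class F1 with NumPy. The function name `per_class_f1` and the toy labels are assumptions for illustration only; the notebook itself imports `f1_score` from scikit-learn.

```python
import numpy as np

def per_class_f1(y_true, y_pred):
    """F1 score per class for binary multilabel matrices of shape (n_samples, n_classes)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred, axis=0)
    fp = np.sum(~y_true & y_pred, axis=0)
    fn = np.sum(y_true & ~y_pred, axis=0)
    denom = 2 * tp + fp + fn
    # Classes without positives and without predictions get an F1 of 0
    return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

# Toy example: class 0 is frequent, class 1 is rare (made-up labels)
y_true = np.array([[1, 0], [1, 0], [1, 1], [1, 0]])
y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])  # always predicts the frequent class

f1 = per_class_f1(y_true, y_pred)
macro_f1 = float(f1.mean())
accuracy = float((y_true == y_pred).mean())
```

In this toy case the element-wise accuracy is 0.875 although the rare class is never detected, while the macro-averaged F1 drops to 0.5 and exposes the problem.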


**Chosen Approach**: [TO ADD]

### Model Development and Improvement

**Classification Approaches**:
- Building models from scratch: low validation accuracy and very prone to overfitting.
- Utilizing existing architectures: poor performance. Batch normalization improved accuracy and accelerated training, but the results remained insufficient.
- Transfer learning with ResNet-50: achieved, as expected, the highest validation accuracy. Layers were added to the pretrained model to adapt it to the specific context, and the cut-off value was tuned to improve performance.

**Possible Improvements**: 
- Explore other architectures like GoogLeNet or MobileNetV2 (already tried: VGG, Inception).
- Perform more extensive hyperparameter tuning (for example using grid search).
- Develop an ensemble model.

Despite the high validation accuracy (~90%) of the ResNet-50 model, its performance in the Kaggle competition remained subpar.

## Image Segmentation

[TO ADD]

## Adversarial Attack

### Simple Attack: Misclassification of All Images
The Fast Gradient Sign Method (FGSM) was applied to all images to induce misclassification.
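
FGSM perturbs an input by a step of size epsilon in the direction of the sign of the loss gradient with respect to that input. The sketch below shows the update on a toy logistic-regression classifier, for which the gradient has a closed form. The classifier, its weights and the value of epsilon are assumptions chosen for illustration; for the trained network the gradient would instead be obtained through automatic differentiation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, epsilon):
    """Untargeted FGSM step: x_adv = x + epsilon * sign(dL/dx).

    Toy model (assumption): p = sigmoid(w . x + b) with binary cross-entropy
    loss L, for which dL/dx = (p - y) * w.
    """
    p = sigmoid(x @ w + b)
    grad_x = (p - y) * w
    return x + epsilon * np.sign(grad_x)

# Toy example: an input that is correctly classified as class 1
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([0.5, 0.2]), 1.0
epsilon = 0.5

x_adv = fgsm(x, y, w, b, epsilon)
p_clean = sigmoid(x @ w + b)    # above 0.5: predicted class 1
p_adv = sigmoid(x_adv @ w + b)  # below 0.5: prediction flipped
```

Every input component changes by at most epsilon, so epsilon directly controls the trade-off between the visibility of the perturbation and the strength of the attack.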

### Advanced Attack: Targeting Specific Classifications
- **Attempt 1**: Utilizing XXX - unsuccessful. 

[TO ADD: Architecture description]
[TO ADD: Failure analysis]

- **Attempt 2**: Employing FGSM - successful. 

[TO ADD: Explanation of method]

[TO ADD: Results]

## Real-World Application and Reflection

### Classification: 
Given its high validation accuracy of around 90%, the model could in principle be used for real-life applications requiring multilabel classification, such as autonomous driving or classification of medical images. However, despite this high validation accuracy, the score in the Kaggle competition remained rather low, so its generalisation would have to be verified before any such use.
