## Setting up environment and importing libraries

In this segment, we install the required libraries and set up the environment for training the models. Please choose a GPU runtime in the Google Colab settings. We also recommend mounting your Google Drive to the notebook so that the static files are downloaded only once and can be reused if you need to restart your runtime.

In [None]:
# install required libraries
!pip install transformers timm

In [26]:
# Generic packages
import json
import os
from tqdm.notebook import tqdm, trange

# Deep-Learning packages
import torch
import torch.nn as nn
import torchvision.transforms as transforms
from transformers import AutoModel, AutoTokenizer, get_scheduler
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from torch.optim import AdamW

# Evaluation Packages
from sklearn.metrics import accuracy_score, classification_report
from time import perf_counter

# Data Packages
from PIL import Image
import pandas as pd

Common configurations to be used throughout the notebook

In [3]:
# use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [4]:
# set random seeds for reproducibility
import numpy as np
import random

def set_seed(seed_val):
    random.seed(seed_val)
    np.random.seed(seed_val)
    torch.manual_seed(seed_val)
    torch.cuda.manual_seed_all(seed_val)

In [5]:
seed_val = 0
set_seed(seed_val)
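To see why fixing the seed matters, the short sketch below uses only Python's built-in `random` module: re-seeding with the same value replays the same pseudo-random sequence. The helper `draw_numbers` is a hypothetical name used for illustration and is not part of the notebook; `set_seed` above applies the same idea to NumPy and PyTorch.

```python
import random

def draw_numbers(seed_val, count=3):
    """Seed Python's RNG and draw `count` pseudo-random integers."""
    random.seed(seed_val)
    return [random.randint(0, 99) for _ in range(count)]

first_run = draw_numbers(0)
second_run = draw_numbers(0)

# The same seed replays exactly the same sequence
print(first_run == second_run)
```

Note that running this sketch re-seeds Python's RNG, so call `set_seed(seed_val)` again afterwards if you execute it inside the notebook.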

## Data loading and training parameters

This segment downloads the data which we are going to use for the tutorial and defines the paths to read data from, as well as training parameters which we are going to use for all three models.

In [27]:
# HOME_FOLDER = '/content/drive/MyDrive/KDD/' # if mounted
HOME_FOLDER = '/content/KDD/' # if not mounted
WEBVISION_DATA_FOLDER = HOME_FOLDER + 'webvision_data/'
IMAGE_FOLDER = WEBVISION_DATA_FOLDER + 'images/'
RESULTS_FOLDER = HOME_FOLDER + 'results/'
TRAINED_MODELS_FOLDER = HOME_FOLDER + 'trained_models/'
os.makedirs(RESULTS_FOLDER, exist_ok=True)

In [None]:
!mkdir -p $WEBVISION_DATA_FOLDER
!wget "https://drive.google.com/uc?id=1r4aTTbLuYgGrgpZLOgUH9sQ33DBsbOFm&export=download" -O $WEBVISION_DATA_FOLDER/data.zip
!unzip $WEBVISION_DATA_FOLDER/data.zip -d $WEBVISION_DATA_FOLDER

In [8]:
df_train = pd.read_csv(WEBVISION_DATA_FOLDER + 'train.csv')
df_test = pd.read_csv(WEBVISION_DATA_FOLDER + 'test.csv')

Execute the cells below to see a random (label, text, image) triplet from the training dataset.

In [10]:
import matplotlib.pyplot as plt

def show_sample(row_num):
    """Displays an image at position `row_num` in the WebVision dataset."""
    sample_row = df_train.iloc[row_num]
    print('Index:', row_num)
    print('Label:', sample_row['label'])
    print('Text:', sample_row['text'])
    image_path = IMAGE_FOLDER + sample_row['img_path']
    im = Image.open(image_path)
    plt.imshow(im)

In [None]:
from random import randint
# randint is inclusive on both ends, so the upper bound must be the last valid row index
show_sample(randint(0, len(df_train) - 1))

We create the mapping table to map the string labels to integers to be used for the class labels and vice versa.

In [12]:
label_to_id = {lab:i for i, lab in enumerate(df_train['label'].sort_values().unique())}
id_to_label = {v:k for k,v in label_to_id.items()}

In [None]:
label_to_id

In [14]:
num_out_labels = len(label_to_id)
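As a small illustration of the mapping tables, the sketch below builds them for a handful of toy labels and round-trips the labels through integer ids. The label values and the `toy_` names are hypothetical and chosen so that they do not overwrite the notebook's own variables.

```python
# Toy labels standing in for df_train['label'] (hypothetical values)
toy_labels = ["dog", "cat", "bird", "cat", "dog"]

# Sorting the unique labels makes the mapping deterministic across runs
toy_label_to_id = {lab: i for i, lab in enumerate(sorted(set(toy_labels)))}
toy_id_to_label = {i: lab for lab, i in toy_label_to_id.items()}

toy_encoded = [toy_label_to_id[lab] for lab in toy_labels]
toy_decoded = [toy_id_to_label[i] for i in toy_encoded]

print(toy_label_to_id)  # {'bird': 0, 'cat': 1, 'dog': 2}
print(toy_encoded)      # [2, 1, 0, 1, 2]
```

The number of output classes is then simply the number of entries in the mapping.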

Training parameters that we will use throughout this tutorial.

In [15]:
## training parameters to be used for all models ##

# Number of passes through the whole training data
num_train_epochs = 20

# Number of (image, description) samples to be processed at once by the model
batch_size = 16

# Learning rate used for training all models
learning_rate = 1.0e-5

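# We will also use a learning rate scheduler: it lowers the learning rate after each optimizer step.
# `warmup_steps` is the number of initial steps over which the learning rate ramps up (0 means no warmup)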
warmup_steps = 0

# Weight decay, a regularization closely related to L2 regularization, applied by the optimizer
weight_decay = 0.01

# Maximum length in tokens when encoding the image descriptions
max_seq_length = 64
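To get a feeling for how these parameters interact, the sketch below computes the total number of optimizer steps and the learning rate at a few points of a cosine schedule. It is a plain-Python approximation written for illustration, assuming a half-cosine decay to zero after an optional linear warmup; the function `cosine_lr` and the dataset size of 800 samples are hypothetical and not taken from the notebook.

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup=0):
    """Learning rate at `step`: linear warmup followed by half-cosine decay to 0."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical dataset size; the other values mirror the parameters above
toy_num_samples = 800
toy_steps_per_epoch = math.ceil(toy_num_samples / 16)  # batches per epoch with batch_size = 16
toy_total_steps = toy_steps_per_epoch * 20             # 20 epochs

for step in (0, toy_total_steps // 2, toy_total_steps):
    print(step, cosine_lr(step, toy_total_steps, base_lr=1.0e-5))
```

With no warmup, the learning rate starts at its full value, reaches half of it midway through training, and decays to zero at the last step.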

## BERT
The first model which we are going to train is a BERT model which only uses the text from the data.

### Dataset
Since we are training a text only model, the dataset which we fit into the model only requires two attributes: **text** and **label**.

**Exercise:** Fill in the \_\_getitem__() function such that it returns the description of an image and its label.

In [None]:
class TextDataset(Dataset):
    """PyTorch dataset used to train a BERT model on the descriptions of the WebVision dataset"""
    def __init__(self, df, label_to_id, text_field="text", label_field="label"):
        self.df = df.reset_index(drop=True)
        self.label_to_id = label_to_id
        self.text_field = text_field
        self.label_field = label_field

    def __getitem__(self, index):
        """This default function needs to be defined such that it returns the image description and its label at position index"""
        text = ...
        # Do not forget to use self.label_to_id here
        label = ...

        return text, label

    def __len__(self):
        """This default function should return the length of the dataset"""

        return self.df.shape[0]

### Model
The model uses BERT to encode the text and feeds the encoding (a 768-dimensional vector) into a fully connected linear layer with 10 outputs (one for each class label).

![](https://drive.google.com/uc?export=view&id=1nlBu9P8saotjNg_nv_tfdnTxpxaFAhqq)

In [None]:
## We load the default BERT tokenizer using the transformers library
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

Let's take a look at what the tokenizer returns: it converts a sentence into a sequence of token ids along with their attention_mask (i.e. which tokens can be attended to in self-attention).

Note that there are special tokens inserted in each sequence:
- **101** is the default mapping for the **[CLS]** token, which is often used to represent full sentences.
- **102** is the default mapping for the **[SEP]** token, indicating the end of a sentence.

In [31]:
tokenized_example = bert_tokenizer("This is an example_text", truncation=True, max_length=max_seq_length,
            return_tensors="pt", padding=True
        )
print("Input IDS: ", tokenized_example.input_ids)
print("Attention Mask: ", tokenized_example.attention_mask)

print("Special Token Map: ", {i:j for i,j in zip(bert_tokenizer.all_special_ids, bert_tokenizer.all_special_tokens)})

Input IDS:  tensor([[ 101, 2023, 2003, 2019, 2742, 1035, 3793,  102]])
Attention Mask:  tensor([[1, 1, 1, 1, 1, 1, 1, 1]])
Special Token Map:  {100: '[UNK]', 102: '[SEP]', 0: '[PAD]', 101: '[CLS]', 103: '[MASK]'}
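In the example above every position of the attention mask is 1 because there is a single sentence. When a batch contains descriptions of different lengths, `padding=True` pads the shorter ones with **[PAD]** (id 0) and sets their mask to 0. The sketch below imitates this behaviour in plain Python; `pad_batch` is a hypothetical helper written for illustration, not the tokenizer's actual implementation, and it reuses token ids from the output above.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # special token ids shown in the output above

def pad_batch(sequences, max_length):
    """Wrap each id sequence in [CLS]/[SEP], truncate, and pad to the longest sequence."""
    wrapped = [[CLS_ID] + seq[: max_length - 2] + [SEP_ID] for seq in sequences]
    longest = max(len(seq) for seq in wrapped)
    input_ids = [seq + [PAD_ID] * (longest - len(seq)) for seq in wrapped]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in wrapped]
    return input_ids, attention_mask

# Two toy descriptions of different lengths, written directly as token ids
toy_ids, toy_mask = pad_batch([[2023, 2003, 2019, 2742], [2742]], max_length=64)
print(toy_ids)   # [[101, 2023, 2003, 2019, 2742, 102], [101, 2742, 102, 0, 0, 0]]
print(toy_mask)  # [[1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0]]
```

The zeros in the mask tell the model to ignore the padded positions during self-attention.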


In [32]:
class VLBertModel(nn.Module):
    """PyTorch model that will be used to train the BERT model on the WebVision dataset."""
    def __init__(self, num_labels, text_pretrained='bert-base-uncased'):
        super().__init__()

        self.num_labels = num_labels
        ## The text_encoder attribute is the default pretrained BERT model
        self.text_encoder = AutoModel.from_pretrained(text_pretrained)
        ## BERT maps a sequence of text to embeddings. We need to add a projection layer to map the embeddings to the number of classes

        # Exercise: Define the classifier attribute to be a linear layer that maps embeddings of BERT size to the number of labels.
        #   linear layer: nn.Linear
        #   BERT Embeddings size: self.text_encoder.config.hidden_size
        self.classifier = ...


    def forward(self, text):
        """The forward function takes data input and should return label likelihood, named `logits`."""
        # Exercise: fill the blanks with the right inputs
        output = self.text_encoder(input_ids=..., attention_mask=..., return_dict=True)
        # Here, we take the last hidden state of the output, which is defined by the embeddings returned by the last BERT layer
        logits = self.classifier(output.last_hidden_state[:, ..., :]) # CLS embedding
        return logits

In [33]:
# create the model
bert_model = VLBertModel(num_labels=num_out_labels, text_pretrained='bert-base-uncased')
# Map to GPU
bert_model = bert_model.to(device)


### Training
Load the data using the text dataset, feed it into a data loader for random sampling, and train the model
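As a reminder of how the pieces of a PyTorch training step fit together, here is a minimal sketch on a toy linear classifier with random data. It is not the BERT training loop itself: the model, the data, the learning rate, and the linear decay schedule (standing in for the cosine schedule used in this notebook) are all assumptions made for illustration, and every name is prefixed with `toy_` to avoid clashing with the notebook's variables.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

torch.manual_seed(0)

# Toy data: 64 samples with 4 features, label 1 when the first feature is positive
toy_x = torch.randn(64, 4)
toy_y = (toy_x[:, 0] > 0).long()

toy_model = nn.Linear(4, 2)
toy_optimizer = AdamW(toy_model.parameters(), lr=0.1, weight_decay=0.01)
toy_total_steps = 50
# Linear decay to zero, standing in for the cosine schedule used in the notebook
toy_scheduler = LambdaLR(toy_optimizer, lambda step: 1.0 - step / toy_total_steps)
toy_criterion = nn.CrossEntropyLoss()

toy_model.train()
toy_losses = []
for _ in range(toy_total_steps):
    toy_optimizer.zero_grad()                    # reset gradients from the previous step
    toy_logits = toy_model(toy_x)                # forward pass
    toy_loss = toy_criterion(toy_logits, toy_y)  # compare logits with labels
    toy_loss.backward()                          # compute gradients
    toy_optimizer.step()                         # update the parameters
    toy_scheduler.step()                         # update the learning rate
    toy_losses.append(toy_loss.item())

print(f"first loss: {toy_losses[0]:.3f}, last loss: {toy_losses[-1]:.3f}")
```

The order matters: gradients are reset before the forward pass, and the scheduler is stepped after the optimizer.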

In [None]:
set_seed(seed_val)

# Define the dataset and the DataLoader
train_dataset = TextDataset(df=df_train, label_to_id=label_to_id, text_field='text', label_field='label')
# The DataLoader will take care of separating the dataset in batches of size `batch_size`
# The RandomSampler will take care of drawing the samples in a random order at each epoch
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(dataset=train_dataset,
                    batch_size=batch_size,
                    sampler=train_sampler)

# Total number of steps
t_total = len(train_dataloader) * num_train_epochs

# We use the AdamW optimizer here
optimizer = AdamW(bert_model.parameters(), lr=learning_rate, weight_decay=weight_decay)
# We also use a cosine scheduler: it will take care of lowering the learning rate after each optimizer step.
# Learning rate scheduling is often used to stabilize training and can help reduce overfitting
scheduler = get_scheduler(name="cosine", optimizer=optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total)

# Here, we will use a cross entropy loss because it's a multiclass problem
criterion = nn.CrossEntropyLoss()

# Put the model in train mode: (enables special features such as dropout, batch normalization...)
bert_model.train()


start = perf_counter()

# Run over all epochs
for epoch_num in trange(num_train_epochs, desc='Epochs'):
    epoch_total_loss = 0
    # For each epoch, run over all batches of data
    for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader), desc='Batch'):
        b_text, b_labels = batch
        # Tokenize the image description
        b_inputs = bert_tokenizer(
            list(b_text), truncation=True, max_length=max_seq_length,
            return_tensors="pt", padding=True
        )
        # Put input data to GPU
        b_labels = b_labels.to(device)
        b_inputs = b_inputs.to(device)

        # PYTORCH TRAINING LOOP

        # Exercise: fill the pytorch training loop

        ## Step 1: gradient should be set to 0
        ...
        ## Step 2: pass the inputs to the model and get logits
        b_logits = ...
        ## Step 3: Calculate loss given the logits and the labels
        loss = ...

        ## accumulates all losses
        epoch_total_loss += loss.item()

        ## Perform a backward pass to calculate the gradients
        ...

        ## Perform the optimizer step: update the model parameters using the gradients
        ...
        ## Perform the scheduler step: lowers the learning rate according to the schedule
        ...

    avg_loss = epoch_total_loss/len(train_dataloader)


    print('epoch =', epoch_num)
    print('    epoch_loss =', epoch_total_loss)
    print('    avg_epoch_loss =', avg_loss)
    print('    learning rate =', optimizer.param_groups[0]["lr"])

end = perf_counter()
bert_training_time = end- start
print('Training completed in ', bert_training_time, 'seconds')

### Testing
Now that we have trained the model, we can predict labels for unseen examples in the test set.

In [None]:
bert_prediction_results = []

# Define the test dataset similarly as above
test_dataset = TextDataset(df=df_test, label_to_id=label_to_id, text_field='text', label_field='label')

# Since we just perform prediction, we don't need to randomly sample the dataset
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(dataset=test_dataset,
                            batch_size=batch_size,
                            sampler=test_sampler)


# Put model in eval mode: disables training-only behavior such as dropout
bert_model.eval()

for batch in tqdm(test_dataloader):
  b_text, b_labels = batch
  # Tokenize the description
  b_inputs = bert_tokenizer(list(b_text), truncation=True, max_length=max_seq_length, return_tensors="pt", padding=True)

  # Put to GPU
  b_labels = b_labels.to(device)
  b_inputs = b_inputs.to(device)

  # Tell PyTorch not to calculate gradients since we won't be performing an optimizer step
  with torch.no_grad():
      # Exercise: pass the inputs to the model and get logits
      b_logits = ...
      # Move the logits to CPU
      b_logits = b_logits.detach().cpu()

  # Exercise: calculate the most likely predicted class given the output logits
  # Tip: you can use torch.argmax()
  batch_prediction = ...
  bert_prediction_results.extend(batch_prediction.tolist())

bert_prediction_labels = [id_to_label[p] for p in bert_prediction_results]


Generate the classification report by comparing the predictions from the model with the true labels

In [None]:
bert_class_report = classification_report(df_test['label'], bert_prediction_labels, output_dict=True)
bert_class_report['training_time (seconds)'] = bert_training_time

with open(RESULTS_FOLDER + 'bert_class_report.json', 'w') as f:
  json.dump(bert_class_report, f)

print(bert_class_report['accuracy'])

0.87
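To make the numbers in the classification report more concrete, the sketch below recomputes accuracy and per-class precision and recall by hand on a few toy predictions. The helper functions and the labels are hypothetical and are not results from this notebook; `classification_report` computes the same kind of quantities for every class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single class `label`."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Hypothetical labels and predictions, not results from the notebook
toy_true = ["cat", "cat", "dog", "dog", "bird"]
toy_pred = ["cat", "dog", "dog", "dog", "bird"]

print(accuracy(toy_true, toy_pred))                 # 0.8
print(precision_recall(toy_true, toy_pred, "dog"))  # (0.6666666666666666, 1.0)
```

Here one "cat" was misclassified as "dog", which lowers the precision of "dog" and the recall of "cat" while overall accuracy stays at 4 out of 5.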


In [None]:
# while True:pass

## BERT + ResNet-50
The next model that we are training uses a combination of BERT and ResNet-50 to encode the text and images, respectively.

### Dataset
Unlike the previous Dataset used for BERT, we include images in this dataset by reading the image files and applying a series of transformations to them so that they can fit into the ResNet model.

In [None]:
class ImageDataset(Dataset):
    def __init__(self, df, label_to_id, train=False, text_field="text", label_field="label", image_path_field="img_path", img_size=224):
        self.df = df.reset_index(drop=True)
        self.label_to_id = label_to_id
        self.train = train
        self.text_field = text_field
        self.label_field = label_field
        self.image_path_field = image_path_field

        # Default image preprocessing settings. Do not change this
        self.img_size = img_size # Side length in pixels of the square input image
        self.mean, self.std = (
            0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711) # Per-channel (RGB) mean/std used to normalize the pixel values

        # We will perform some classic data augmentation for images during training
        self.train_transform_func = transforms.Compose(
                [transforms.RandomResizedCrop(self.img_size, scale=(0.5, 1.0)),  # Crops a random region (50% to 100% of the image area) and resizes it to the given size
                    transforms.RandomHorizontalFlip(), # Randomly flip the image horizontally
                    transforms.ToTensor(), # Convert the image to a tensor
                    transforms.Normalize(self.mean, self.std) # Normalize the pixel values
                    ])

        # During evaluation, we do not use data augmentation, but we make sure that the images have the same size as training images
        self.eval_transform_func = transforms.Compose(
                [transforms.Resize(256), # Resize the image so that its shorter side is 256 pixels, keeping the aspect ratio
                    transforms.CenterCrop(self.img_size), # Center-crop the image to the same size as the training images
                    transforms.ToTensor(), # Convert the image to a tensor
                    transforms.Normalize(self.mean, self.std) # Normalize the pixel values
                    ])


    def __getitem__(self, index):
        ## Exercise: fill the blank such that we get the right inputs
        text = ...
        # Do not forget to use self.label_to_id here
        label = ...
        # This should be the path to the desired image
        img_path = IMAGE_FOLDER + ...

        # Opens the desired image and forces 3 RGB channels (some files may be grayscale or have an alpha channel)
        image = Image.open(img_path).convert('RGB')

        if self.train:
          # Apply the training transformations
          img = ...
        else:
          # Apply the evaluation transformations
          img = ...

        return text, label, img

    def __len__(self):
        return self.df.shape[0]
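The evaluation transform first resizes the shorter side of the image to 256 pixels and then takes a centered square crop. The sketch below works through this geometry in plain Python for a hypothetical 640x480 image; the two helper functions are illustrative assumptions, and the exact rounding used by torchvision may differ by a pixel.

```python
def resize_shorter_side(width, height, size):
    """New (width, height) after scaling the shorter side to `size`, keeping the aspect ratio."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size

def center_crop_box(width, height, crop):
    """(left, top, right, bottom) of a centered crop of size crop x crop."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop

# Hypothetical 640x480 input image
new_w, new_h = resize_shorter_side(640, 480, 256)
box = center_crop_box(new_w, new_h, 224)
print((new_w, new_h))  # (341, 256)
print(box)             # (58, 16, 282, 240)
```

Whatever the original aspect ratio, the model always receives a square image of side `img_size`.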

### Model
The original ResNet model ends with a fully connected layer with 1000 outputs, which gives the score of the image for each of the 1000 classes. However, our output classes are different, and we want to use the image features computed before the fully connected layer instead of the 1000-class scores. Therefore, we "extract" a feature model from the original ResNet architecture by leaving out the fully connected layer.


After that, we pair the extracted ResNet model with a BERT model and add a 10-class linear layer on top of them, like we did for the previous BERT classifier.

![](https://drive.google.com/uc?export=view&id=1vFL3V1LdRlamLjkoI7ieoimxbwGnR7mU)


The ResNet-50 model is pretrained on ImageNet data to classify images into 1000 classes, which is why its last layer is a fully connected layer with 1000 output nodes. This output is not useful to us since our output classes are different, so we strip off this fully connected layer and use the features after the last average pooling layer. This can be done by copying the layers and weights to another network and leaving out the last layer.

![](https://drive.google.com/uc?export=view&id=1ivYlubrhvY00P7b2SYLfpRSF3XxJUbfh)

In [None]:
# Let's take a look at the ResNet 50 model
from torchvision.models.resnet import resnet50

pretrained_resnet = resnet50(pretrained=True)

In [None]:
# Print the names of the top-level layers of ResNet-50 (loaded in the previous cell)
for n, c in pretrained_resnet.named_children():
    print(n)

In [None]:
# extract layers of resnet-50 to build a new model

import torch.nn as nn
from torchvision.models.resnet import resnet50

class ResNetFeatureModel(nn.Module):
    """A PyTorch ResNet wrapper that outputs the features derived from the `output_layer` layer."""
    def __init__(self, output_layer):
        """Parameters:
        `output_layer: string` layer name of ResNet50 from which we want to derive features."""
        super().__init__()
        self.output_layer = output_layer
        pretrained_resnet = resnet50(pretrained=True)

        self.children_list = []

        # Exercise: build a loop that appends every layer up to `output_layer` (included) in self.children_list
        ...

        # The final network is just a sequential step of all layers in self.children_list
        self.net = nn.Sequential(*self.children_list)


    def forward(self,x):
        """Takes as input an image x, and outputs the image embeddings"""
        # x shape: [batch_size, num_channels, img_size, img_size]
        x = self.net(x)
        # Exercise: what is the shape of x before and after flatten?
        x = torch.flatten(x, 1)
        return x
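To follow how the spatial resolution shrinks through ResNet-50, the sketch below applies the standard convolution output-size formula to a 224x224 input. It is plain arithmetic rather than a call to torchvision; the kernel, stride, and padding values are the usual ResNet-50 ones, and `conv_out` is a hypothetical helper name.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

spatial = 224
trace = {}
spatial = conv_out(spatial, kernel=7, stride=2, padding=3)
trace["conv1"] = spatial
spatial = conv_out(spatial, kernel=3, stride=2, padding=1)
trace["maxpool"] = spatial
trace["layer1"] = spatial  # layer1 keeps the resolution
for name in ("layer2", "layer3", "layer4"):
    spatial = conv_out(spatial, kernel=3, stride=2, padding=1)  # each stage halves the resolution
    trace[name] = spatial
trace["avgpool"] = 1  # adaptive average pooling to 1x1

channels = 2048
print(trace)
print("layer4 features:", channels * trace["layer4"] ** 2)    # 100352
print("avgpool features:", channels * trace["avgpool"] ** 2)  # 2048
```

This is why features taken after the average pooling layer flatten to a compact 2048-dimensional vector, whereas features taken after `layer4` would be 49 times larger.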

In [None]:
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [18]:
# last output layer name for resnet is named 'layer4', dim 2048*7*7
# last layer name before fc is named 'avgpool', dim 2048*1*1 -> needs to be flattened

class BertResNetModel(nn.Module):
    """PyTorch model that takes as input an image and its descriptions. Both are processed with ResNet50 and BERT respectively, and then concatenated."""
    def __init__(self, num_labels, text_pretrained='bert-base-uncased'):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained(text_pretrained)
        # Fill the blank with the right output layer
        self.visual_encoder = ResNetFeatureModel(output_layer=...)
        self.image_hidden_size = 2048
        # Fill the blank with the right embedding size
        self.classifier = nn.Linear(..., num_labels)

    def forward(self, text, image):
        """Given an image and its description, process each of them independently, concatenate their embedding and returns class likelihood."""

        # Text tower
        text_output = self.text_encoder(input_ids=..., attention_mask=...)
        text_feature = text_output.last_hidden_state[:, ..., :]

        # Image tower
        img_feature = self.visual_encoder(image)

        # Final embeddings
        # Exercise: concatenate the text and image features on the embedding axis
        features = ...

        # Likelihood with the linear layer
        logits = self.classifier(features)

        return logits

In [None]:
resnet_model = BertResNetModel(num_labels=num_out_labels, text_pretrained='bert-base-uncased')
resnet_model = resnet_model.to(device)



### Training
Similar to BERT training, but we take in images as an additional input

In [None]:
## training loop
set_seed(seed_val)

# Define the dataset and the DataLoader
train_dataset = ImageDataset(df=df_train, label_to_id=label_to_id, train=True, text_field='text', label_field='label', image_path_field='img_path', img_size=224)
# The DataLoader will take care of separating the dataset in batches of size `batch_size`
# The RandomSampler will take care of drawing the samples in a random order at each epoch
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(dataset=train_dataset,
                    batch_size=batch_size,
                    sampler=train_sampler)


# Total number of steps
t_total = len(train_dataloader) * num_train_epochs

# We use the AdamW optimizer here, this time on the parameters of the BERT + ResNet model
optimizer = AdamW(resnet_model.parameters(), lr=learning_rate, weight_decay=weight_decay)
# We also use a cosine scheduler: it will take care of lowering the learning rate after each optimizer step.
# Learning rate scheduling is often used to stabilize training and can help reduce overfitting
scheduler = get_scheduler(name="cosine", optimizer=optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total)


# Here, we will use a cross entropy loss because it's a multiclass problem
criterion = nn.CrossEntropyLoss()

# Put the model in train mode: (enables special features such as dropout, batch normalization...)
resnet_model.train()

start = perf_counter()
# Run over all epochs
for epoch_num in trange(num_train_epochs, desc='Epochs'):
    epoch_total_loss = 0
    # For each epoch, run over all batches of data
    for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader), desc='Batch'):
        b_text, b_labels, b_imgs = batch
        # Tokenize the image description
        b_inputs = bert_tokenizer(
            list(b_text), truncation=True, max_length=max_seq_length,
            return_tensors="pt", padding=True
        )
        # Put input data to GPU
        b_labels = b_labels.to(device)
        b_imgs = b_imgs.to(device)
        b_inputs = b_inputs.to(device)

        # Exercise: fill the pytorch training loop

        ## Step 1: gradient should be set to 0
        ...
        ## Step 2: pass the inputs to the model and get logits
        b_logits = ...
        ## Step 3: Calculate loss given the logits and the labels
        loss = ...

        ## accumulates all losses
        epoch_total_loss += loss.item()

        ## Perform a backward pass to calculate the gradients
        ...

        ## Perform the optimizer step: update the model parameters using the gradients
        ...
        ## Perform the scheduler step: lowers the learning rate according to the schedule
        ...

    avg_loss = epoch_total_loss/len(train_dataloader)


    print('epoch =', epoch_num)
    print('    epoch_loss =', epoch_total_loss)
    print('    avg_epoch_loss =', avg_loss)
    print('    learning rate =', optimizer.param_groups[0]["lr"])
end = perf_counter()
resnet_training_time = end- start
print('Training completed in ', resnet_training_time, 'seconds')

### Testing

In [None]:
# testing loop

resnet_prediction_results = []

# Define the test dataset similarly as above
test_dataset = ImageDataset(df=df_test, label_to_id=label_to_id, train=False, text_field='text', label_field='label', image_path_field='img_path', img_size=224)
# Since we just perform prediction, we don't need to randomly sample the dataset
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(dataset=test_dataset,
                            batch_size=batch_size,
                            sampler=test_sampler)

# Put model in eval mode: disables training-only behavior such as dropout
resnet_model.eval()

for batch in tqdm(test_dataloader):

  b_text, b_labels, b_imgs = batch
  # Tokenize description
  b_inputs = bert_tokenizer(list(b_text), truncation=True, max_length=max_seq_length, return_tensors="pt", padding=True)
  # Put to GPU
  b_labels = b_labels.to(device)
  b_imgs = b_imgs.to(device)
  b_inputs = b_inputs.to(device)

  # Tell PyTorch not to calculate gradients since we won't be performing an optimizer step
  with torch.no_grad():
      # Exercise: pass the inputs to the model and get logits

      b_logits = ...
      b_logits = b_logits.detach().cpu()

  # Exercise: calculate the most likely predicted class given the output logits
  # Tip: you can use torch.argmax()
  batch_prediction = ...
  resnet_prediction_results += batch_prediction.tolist()

resnet_prediction_labels = [id_to_label[p] for p in resnet_prediction_results]

Generate the classification report

In [None]:
resnet_class_report = classification_report(df_test['label'], resnet_prediction_labels, output_dict=True)
resnet_class_report['training_time (seconds)'] = resnet_training_time

with open(RESULTS_FOLDER + 'resnet_class_report.json', 'w') as f:
  json.dump(resnet_class_report, f)

print(resnet_class_report['accuracy'])

In [None]:
# while True:pass

## ALBEF
The last model that we are training is the ALBEF joint-encoder model which aligns the text and image features.

### ALBEF-specific setup
This section creates the folder structure and downloads the files required to train an ALBEF model.

In [28]:
ALBEF_FOLDER = HOME_FOLDER + 'ALBEF/'
os.makedirs(ALBEF_FOLDER, exist_ok=True)

In [None]:
# download pre-trained ALBEF model and required ALBEF files from ALBEF's official repo (only need to do this once to save it in your gdrive)
!wget https://raw.githubusercontent.com/salesforce/ALBEF/main/models/vit.py -O $ALBEF_FOLDER/vit.py
!wget https://raw.githubusercontent.com/salesforce/ALBEF/main/models/tokenization_bert.py -O $ALBEF_FOLDER/tokenization_bert.py
!wget https://raw.githubusercontent.com/salesforce/ALBEF/main/models/xbert.py -O $ALBEF_FOLDER/xbert.py


In [32]:
# replace all occurrences of tokenizer_class with processor_class in xbert.py to make it compatible with newer transformers version
# if you don't do this step, you will need to install transformers==4.8.1 as specified by the requirements in the ALBEF repo

!sed -i 's/tokenizer_class/processor_class/g' $ALBEF_FOLDER/xbert.py

In [None]:
# add path to downloaded ALBEF files
import sys
sys.path.append(ALBEF_FOLDER)

#import libraries required for ALBEF
from vit import VisionTransformer
from xbert import BertConfig as AlbefBertConfig, BertModel as AlbefBertModel
from functools import partial

### Dataset
We reuse the BERT-ResNet dataset, which contains **text**, **images** and **labels**. The only difference here is the image size (224 for ResNet, 256 for ALBEF).

### Model
ALBEF also uses different encoders:
- **Text Encoder**: BERT
- **Image Encoder**: VisionTransformer

We use the joint text-image encoder to encode both the text and images, and as with the previous two models, add a linear fully connected layer to it.

![](https://drive.google.com/uc?export=view&id=1zcBBx08_7ujlH2RS2WZrmTZ--Icsk4NN)

In [None]:
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [None]:
class AlbefModel(nn.Module):

    def __init__(self, bert_config, num_labels, text_pretrained='bert-base-uncased'):
        super().__init__()

        self.num_labels = num_labels
        # Loads the pretrained ALBEF bert model
        self.text_encoder = AlbefBertModel.from_pretrained(
            text_pretrained, config=bert_config, add_pooling_layer=False)
        # Loads Vision transformer with some default parameters.
        # You can play with the following parameters to see how they affect the results: `embed_dim`, `depth`, `num_heads`, `mlp_ratio`.
        # You can check here for a description of hyperparameters: https://github.com/lucidrains/vit-pytorch?tab=readme-ov-file#vision-transformer---pytorch
        ### Exercise: what should be the embed_dim?
        ### Tip: look at the ALBEF diagram above
        self.visual_encoder = VisionTransformer(
            img_size=256, patch_size=16, embed_dim=..., depth=12, num_heads=12,
            mlp_ratio=4, qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6))

        # Exercise: Fill the blank with the right embedding size
        self.classifier = nn.Linear(
            ..., num_labels)


    def forward(self, text, image):
        # We start by processing the images with the Vision Transformer
        # Exercise: what shape will be the output?
        # answer: [batch_size, num_patches + 1, 768], with num_patches = (img_size / patch_size) ** 2 = 256 and one extra [CLS] token
        image_embeds = self.visual_encoder(image)

        # Builds the attention masks of image for cross-attention with text embeddings
        image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image_embeds.device)

        # ALBEF uses cross-attention to combine image embeddings and text embeddings
        # It's quite easy to implement it: just pass the image embeddings/masks in the encoder_ fields.
        output = self.text_encoder(text.input_ids, attention_mask=text.attention_mask,
                                   encoder_hidden_states=image_embeds, encoder_attention_mask=image_atts, return_dict=True
                                   )

        # As with BERT, use the representation of the [CLS] token to do classification
        logits = self.classifier(output.last_hidden_state[:, ..., :])
        return logits
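The sequence length produced by the Vision Transformer follows directly from the image and patch sizes. The sketch below computes it in plain Python; `vit_sequence_length` is a hypothetical helper written for illustration, assuming square images, square non-overlapping patches, and one additional [CLS] token.

```python
def vit_sequence_length(img_size, patch_size):
    """Number of tokens a ViT produces: one per patch plus one [CLS] token."""
    patches_per_side = img_size // patch_size
    return patches_per_side ** 2 + 1

print(vit_sequence_length(256, 16))  # 257
print(vit_sequence_length(224, 16))  # 197
```

The image attention mask built in `forward` has this same length for each image in the batch, with every position set to 1 because no image token is padding.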

Because ALBEF aligns the BERT and VisionTransformers features, it has its own BERT configuration. We download both this configuration and the pretrained model from Salesforce's GitHub and web pages in the function below which loads a pretrained model.

In [37]:
from urllib.request import urlretrieve

def load_albef_pretrained(num_out_labels):
    """Loads pretrained ALBEF by downloading the right configuration and pretrained weights."""
    tmp_directory = './tmp/albef'
    os.makedirs(tmp_directory, exist_ok=True)

    albef_bert_config_fp = os.path.join(tmp_directory, 'config_bert.json')
    albef_model_fp = os.path.join(tmp_directory, 'ALBEF.pth')

    if not os.path.exists(albef_bert_config_fp):
        urlretrieve("https://raw.githubusercontent.com/salesforce/ALBEF/main/configs/config_bert.json", albef_bert_config_fp)

    if not os.path.exists(albef_model_fp):
        urlretrieve("https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/ALBEF_4M.pth", albef_model_fp)

    albef_bert_config = AlbefBertConfig.from_json_file(albef_bert_config_fp)
    albef_model = AlbefModel(bert_config=albef_bert_config, num_labels=num_out_labels)

    albef_checkpoint = torch.load(albef_model_fp, map_location='cpu')
    albef_state_dict = albef_checkpoint['model']

    for key in list(albef_state_dict.keys()):
        if 'bert' in key:
            encoder_key = key.replace('bert.', '')
            albef_state_dict[encoder_key] = albef_state_dict[key]
            del albef_state_dict[key]

    msg = albef_model.load_state_dict(albef_state_dict, strict=False)
    print("ALBEF checkpoint loaded from ", albef_model_fp)
    print(msg)
    return albef_model
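The loop over the checkpoint keys renames the text encoder weights so that they match the attribute names of `AlbefModel`. The sketch below reproduces the renaming on a small dictionary; the two keys are illustrative examples and the strings stand in for weight tensors, so this is only an assumption about what the real checkpoint keys look like.

```python
def strip_bert_prefix(state_dict):
    """Return a copy of `state_dict` where 'bert.' is removed from keys that contain 'bert'."""
    renamed = {}
    for key, value in state_dict.items():
        new_key = key.replace('bert.', '') if 'bert' in key else key
        renamed[new_key] = value
    return renamed

# Illustrative checkpoint keys; strings stand in for weight tensors
toy_checkpoint = {
    'text_encoder.bert.embeddings.word_embeddings.weight': 'w0',
    'visual_encoder.cls_token': 'w1',
}
toy_renamed = strip_bert_prefix(toy_checkpoint)
print(toy_renamed)
```

Because the checkpoint is loaded with `strict=False`, any key that still does not match (for example the new `classifier` layer, which is absent from the pretrained weights) is reported in the printed message instead of raising an error.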

In [None]:
albef_model = load_albef_pretrained(num_out_labels=num_out_labels)
albef_model = albef_model.to(device)

### Training

In [None]:
## training loop
set_seed(seed_val)

train_dataset = ImageDataset(df=df_train, label_to_id=label_to_id, train=True, text_field='text', label_field='label', image_path_field='img_path', img_size=256)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(dataset=train_dataset,
                    batch_size=batch_size,
                    sampler=train_sampler)


t_total = len(train_dataloader) * num_train_epochs


optimizer = AdamW(albef_model.parameters(), lr=learning_rate, weight_decay=weight_decay)
scheduler = get_scheduler(name="cosine", optimizer=optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total)

criterion = nn.CrossEntropyLoss()

albef_model.train()

start = perf_counter()
for epoch_num in trange(num_train_epochs, desc='Epochs'):
    epoch_total_loss = 0

    for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader), desc='Batch'):
        b_text, b_labels, b_imgs = batch
        b_inputs = bert_tokenizer(
            list(b_text), truncation=True, max_length=max_seq_length,
            return_tensors="pt", padding=True
        )

        b_labels = b_labels.to(device)
        b_imgs = b_imgs.to(device)
        b_inputs = b_inputs.to(device)

        # Exercise: fill in the PyTorch training loop

        ## Step 1: gradient should be set to 0
        ...
        ## Step 2: pass the inputs to the model and get logits
        b_logits = ...
        ## Step 3: Calculate loss given the logits and the labels
        loss = ...

        ## accumulates all losses
        epoch_total_loss += loss.item()

        ## Perform a backward pass to calculate the gradients
        ...

        ## Perform the optimizer step: update the model parameters using the gradients
        ...
        ## Perform the scheduler step: adjusts the learning rate according to the schedule
        ...

    avg_loss = epoch_total_loss/len(train_dataloader)


    print('epoch =', epoch_num)
    print('    epoch_loss =', epoch_total_loss)
    print('    avg_epoch_loss =', avg_loss)
    print('    learning rate =', optimizer.param_groups[0]["lr"])
end = perf_counter()
albef_training_time = end - start
print('Training completed in ', albef_training_time, 'seconds')
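
To get a feel for how the learning rate printed after each epoch evolves, the sketch below computes a warmup-then-cosine multiplier in plain Python. It assumes linear warmup followed by a half-cosine decay to zero, which is our understanding of the "cosine" schedule requested above; the function name `cosine_lr_multiplier` and the step counts are illustrative.

```python
import math

def cosine_lr_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor multiplied with the base learning rate at a given step (assumed schedule shape)."""
    if step < num_warmup_steps:
        # linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    # fraction of the post-warmup steps already completed
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # half-cosine decay from 1 to 0
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Illustrative numbers: 10 warmup steps out of 100 total steps
for step in (0, 5, 10, 55, 100):
    print(step, round(cosine_lr_multiplier(step, 10, 100), 3))
```

Since the scheduler is stepped once per batch, the total number of steps is the number of batches per epoch times the number of epochs, as computed in `t_total`.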

### Testing

In [None]:
# testing loop

albef_prediction_results = []

test_dataset = ImageDataset(df=df_test, label_to_id=label_to_id, train=False, text_field='text', label_field='label', image_path_field='img_path')
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(dataset=test_dataset,
                            batch_size=batch_size,
                            sampler=test_sampler)


for batch in tqdm(test_dataloader):
  albef_model.eval()

  b_text, b_labels, b_imgs = batch

  b_inputs = bert_tokenizer(list(b_text), truncation=True, max_length=max_seq_length, return_tensors="pt", padding=True)

  b_labels = b_labels.to(device)
  b_imgs = b_imgs.to(device)
  b_inputs = b_inputs.to(device)

  with torch.no_grad():
      # Exercise: pass the inputs to the model and get logits

      b_logits = ...
      b_logits = b_logits.detach().cpu()

  # Exercise: calculate the most likely predicted class given the output logits
  # Tips: you can use torch.argmax()
  batch_prediction = ...
  albef_prediction_results += batch_prediction.tolist()

albef_prediction_labels = [id_to_label[p] for p in albef_prediction_results]
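
To make the prediction step concrete without giving away the exercise, the sketch below shows in plain Python what taking the argmax over each row of logits means. The logits and the label mapping are toy values, not outputs of the model, and `argmax_rows` is a hypothetical helper.

```python
def argmax_rows(logits):
    """Index of the largest value in each row of a 2D list."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Toy values for illustration: two examples, three classes
toy_logits = [[0.1, 2.3, -1.0],
              [1.5, 0.2, 0.9]]
toy_id_to_label = {0: 'cat', 1: 'dog', 2: 'bird'}

toy_predictions = argmax_rows(toy_logits)
toy_labels = [toy_id_to_label[p] for p in toy_predictions]
print(toy_predictions, toy_labels)
```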


Generate the classification report

In [None]:
albef_class_report = classification_report(df_test['label'], albef_prediction_labels, output_dict=True)
albef_class_report['training_time (seconds)'] = albef_training_time

with open(RESULTS_FOLDER + 'albef_class_report.json', 'w') as f:
  json.dump(albef_class_report, f)

print(albef_class_report['accuracy'])


## Predict on models trained with 20 epochs
In the previous segments, we trained each model for only 5 epochs because of the tutorial's time constraints, so the accuracies of the models do not yet differ much. Training for more epochs generally improves accuracy, so we have trained each model for 20 epochs and saved the results. In this segment, we load these models and predict on the test set to compare their accuracies.

The code in the previous segments has to be reused to load the models. Before executing this step, make sure the following cells have been executed:
- Setup and common config cells
- ALBEF-specific cells
- ALBEF-loading cells
- Cells containing model code for BERT, BERT-ResNet and ALBEF


In [None]:
# Download trained_models.zip into HOME_FOLDER and extract it there
!gdown 'https://drive.google.com/uc?id=1r-tHlbxIeajopWCNKMtnQVvvXdyuM8WC' -O $HOME_FOLDER/trained_models.zip
!unzip $HOME_FOLDER/trained_models.zip -d $HOME_FOLDER

This function loads the trained model, the tokenizer and the label mappings for any of the three model architectures.

In [None]:
def load_trained_models(load_directory, image_model_type):
    """Loads the trained model, tokenizer and label mappings for one of the three model architectures."""
    label_map_filepath = os.path.join(load_directory, "label_map.json")
    with open(label_map_filepath, 'r') as f:
        label_to_id = json.load(f)

    id_to_label = {v:k for k,v in label_to_id.items()}

    num_labels = len(label_to_id)


    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

    model_sd_filepath = os.path.join(load_directory, "state_dict.pt")
    model_sd = torch.load(model_sd_filepath, map_location='cpu')

    if image_model_type is None:
        model = VLBertModel(num_labels=num_labels)
    elif image_model_type.lower() == 'resnet':
        model = BertResNetModel(num_labels=num_labels)
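    elif image_model_type.lower() == 'albef':
        model = load_albef_pretrained(num_out_labels=num_labels)
    else:
        # fail early instead of hitting an undefined `model` below
        raise ValueError(f"Unknown image_model_type: {image_model_type!r}")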

    model.to('cpu') # load all models in cpu first
    model.load_state_dict(model_sd, strict=True)
    model.to(device)

    return model, tokenizer, label_to_id, id_to_label
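
The label-map handling at the top of the function can be checked in isolation. This sketch writes a toy `label_map.json` to a temporary directory and reads it back; the label names and the `toy_` variables are invented for illustration, and only the file name is taken from the function above.

```python
import json
import os
import tempfile

toy_label_to_id = {'cat': 0, 'dog': 1}  # invented labels for illustration

with tempfile.TemporaryDirectory() as toy_load_directory:
    toy_label_map_filepath = os.path.join(toy_load_directory, 'label_map.json')
    with open(toy_label_map_filepath, 'w') as f:
        json.dump(toy_label_to_id, f)

    with open(toy_label_map_filepath, 'r') as f:
        loaded_label_to_id = json.load(f)

# invert the mapping so that predicted ids can be turned back into labels
toy_id_to_label = {v: k for k, v in loaded_label_to_id.items()}
toy_num_labels = len(loaded_label_to_id)
print(toy_id_to_label, toy_num_labels)
```

The ids stay integers after the JSON round trip because they are stored as values; if a file stored the ids as keys instead, they would come back as strings.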


We merge the three dataset classes presented previously into one common VLDataset class that returns **text** and **labels**, plus **images** when an image model is used.

In [None]:
class VLDataset(Dataset):
    """Unified PyTorch Dataset that works for both text only and text + images"""
    def __init__(self, df, label_to_id, train=False, text_field="text", label_field="label", image_path_field=None, image_model_type=None):
        self.df = df.reset_index(drop=True)
        self.label_to_id = label_to_id
        self.train = train
        self.text_field = text_field
        self.label_field = label_field
        self.image_path_field = image_path_field
        self.image_model_type = image_model_type

        # image settings are only needed when an image model is used (skipped for text only)
        if image_model_type is not None:

            # ResNet-50 and ALBEF use different image sizes: fill in the blanks
            if image_model_type.lower() == "resnet":   # ResNet-50 settings
                self.img_size = ...
            elif image_model_type.lower() == "albef":   # ALBEF settings
                self.img_size = ...

            self.mean, self.std = (
                0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)


            self.train_transform_func = transforms.Compose(
                    [transforms.RandomResizedCrop(self.img_size, scale=(0.5, 1.0)),
                        transforms.RandomHorizontalFlip(),
                        transforms.ToTensor(),
                        transforms.Normalize(self.mean, self.std)
                        ])

            self.eval_transform_func = transforms.Compose(
                    [transforms.Resize(256),
                        transforms.CenterCrop(self.img_size),
                        transforms.ToTensor(),
                        transforms.Normalize(self.mean, self.std)
                        ])



    def __getitem__(self, index):
        ## Exercise: fill in this function so that it returns the right output depending on whether the dataset is text only or text + image
        ...

    def __len__(self):
        return self.df.shape[0]
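
The Normalize step in both transform pipelines rescales each colour channel as (value - mean) / std. The sketch below applies that formula to a single RGB pixel in plain Python, using the mean and std defined in the class; the pixel values are made up, and `normalize_pixel` is a hypothetical helper, not part of torchvision.

```python
MEAN = (0.48145466, 0.4578275, 0.40821073)
STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(pixel, mean=MEAN, std=STD):
    """Channel-wise (value - mean) / std for one RGB pixel with values in [0, 1]."""
    return tuple((value - m) / s for value, m, s in zip(pixel, mean, std))

# A made-up mid-grey pixel; ToTensor() has already scaled values to [0, 1] at this point
print(normalize_pixel((0.5, 0.5, 0.5)))
# A pixel equal to the mean maps to zero in every channel
print(normalize_pixel(MEAN))
```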

We also unify the predict function so that it can run predictions on the test set with a loaded model of any of the three architectures.

In [None]:
## testing loop
def predict(df_test, model, tokenizer, label_to_id, id_to_label, image_model_type):
    prediction_results = []

    test_dataset = VLDataset(df=df_test, label_to_id=label_to_id, train=False, text_field='text', label_field='label', image_path_field='img_path', image_model_type=image_model_type)
    test_sampler = SequentialSampler(test_dataset)
    test_dataloader = DataLoader(dataset=test_dataset,
                                batch_size=batch_size,
                                sampler=test_sampler)


    for batch in tqdm(test_dataloader):
        model.eval()

        if image_model_type is None:
          b_text, b_labels = batch
          b_imgs = None
        else:
          b_text, b_labels, b_imgs = batch

        b_inputs = tokenizer(list(b_text), truncation=True, max_length=max_seq_length, return_tensors="pt", padding=True)

        b_labels = b_labels.to(device)
        b_inputs = b_inputs.to(device)

        if b_imgs is not None:
          b_imgs = b_imgs.to(device)

        with torch.no_grad():
            # Exercise: pass the inputs to the model and get logits
            if b_imgs is not None:
              b_logits = ...
            else:
              b_logits = ...

            b_logits = b_logits.detach().cpu()

        # Exercise: calculate the most likely predicted class given the output logits
        # Tips: you can use torch.argmax()
        batch_prediction = ...
        prediction_results += batch_prediction.tolist()

    prediction_labels = [id_to_label[p] for p in prediction_results]

    print(accuracy_score(df_test['label'], prediction_labels))

    return prediction_labels

### Predict with loaded BERT model

In [None]:
bert_load_directory = TRAINED_MODELS_FOLDER + 'BERT'
bert_model, bert_tokenizer, label_to_id, id_to_label = load_trained_models(bert_load_directory, image_model_type=None)
bert_predictions = predict(df_test.copy(), bert_model, bert_tokenizer, label_to_id, id_to_label, image_model_type=None)


0.87


### Predict with loaded BERT-ResNet model

In [None]:
bert_resnet_load_directory = TRAINED_MODELS_FOLDER + 'BERT_ResNet'
bert_resnet_model, bert_resnet_tokenizer, label_to_id, id_to_label = load_trained_models(bert_resnet_load_directory, image_model_type='resnet')
bert_resnet_predictions = predict(df_test.copy(), bert_resnet_model, bert_resnet_tokenizer, label_to_id, id_to_label, image_model_type='resnet')


0.905


### Predict with loaded ALBEF model

In [None]:
albef_load_directory = TRAINED_MODELS_FOLDER + 'ALBEF'
albef_model, albef_tokenizer, label_to_id, id_to_label = load_trained_models(albef_load_directory, image_model_type='albef')
albef_predictions = predict(df_test.copy(), albef_model, albef_tokenizer, label_to_id, id_to_label, image_model_type='albef')


ALBEF checkpoint loaded from  ./tmp/albef/ALBEF.pth
_IncompatibleKeys(missing_keys=['classifier.weight', 'classifier.bias'], unexpected_keys=['temp', 'image_queue', 'text_queue', 'queue_ptr', 'vision_proj.weight', 'vision_proj.bias', 'text_proj.weight', 'text_proj.bias', 'itm_head.weight', 'itm_head.bias', 'visual_encoder_m.cls_token', 'visual_encoder_m.pos_embed', 'visual_encoder_m.patch_embed.proj.weight', 'visual_encoder_m.patch_embed.proj.bias', 'visual_encoder_m.blocks.0.norm1.weight', 'visual_encoder_m.blocks.0.norm1.bias', 'visual_encoder_m.blocks.0.attn.qkv.weight', 'visual_encoder_m.blocks.0.attn.qkv.bias', 'visual_encoder_m.blocks.0.attn.proj.weight', 'visual_encoder_m.blocks.0.attn.proj.bias', 'visual_encoder_m.blocks.0.norm2.weight', 'visual_encoder_m.blocks.0.norm2.bias', 'visual_encoder_m.blocks.0.mlp.fc1.weight', 'visual_encoder_m.blocks.0.mlp.fc1.bias', 'visual_encoder_m.blocks.0.mlp.fc2.weight', 'visual_encoder_m.blocks.0.mlp.fc2.bias', 'visual_encoder_m.blocks.1.norm1

0.855


### Save predictions

In [None]:
df_out = df_test.copy()
df_out['bert_predictions'] = bert_predictions
df_out['bert_resnet_predictions'] = bert_resnet_predictions
df_out['albef_predictions'] = albef_predictions
df_out.to_csv(RESULTS_FOLDER + 'predictions_with_pretrained_models.csv', index=False)
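
Once the predictions are saved, the per-model accuracy can be recomputed from the CSV without reloading any model. The sketch below uses the column names from the cell above on a tiny in-memory CSV; the rows are invented for illustration and are not results from the tutorial, and `column_accuracy` is a hypothetical helper.

```python
import csv
import io

# Invented rows standing in for predictions_with_pretrained_models.csv
toy_csv = io.StringIO(
    "label,bert_predictions,bert_resnet_predictions,albef_predictions\n"
    "cat,cat,cat,dog\n"
    "dog,dog,dog,dog\n"
    "bird,cat,bird,bird\n"
    "cat,cat,cat,cat\n"
)
toy_rows = list(csv.DictReader(toy_csv))

def column_accuracy(rows, prediction_column, label_column='label'):
    """Fraction of rows whose prediction matches the true label."""
    correct = sum(row[prediction_column] == row[label_column] for row in rows)
    return correct / len(rows)

toy_accuracies = {
    column: column_accuracy(toy_rows, column)
    for column in ('bert_predictions', 'bert_resnet_predictions', 'albef_predictions')
}
print(toy_accuracies)
```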
