# <a id='toc1_'></a>[Visual Question Answering](#toc0_)

Visual Question Answering (VQA) is the task of answering open-ended questions about an image. VQA has many applications, including medical VQA, education, and surveillance. In this project we use the [VizWiz](https://vizwiz.org/tasks-and-datasets/vqa/) dataset, which was constructed to train models that help visually impaired people. In the words of the creators of VizWiz: “we introduce the visual question answering (VQA) dataset coming from this population, which we call VizWiz-VQA. It originates from a natural visual question answering setting where blind people each took an image and recorded a spoken question about it, together with 10 crowdsourced answers per visual question.”

<!-- Center the following image: -->
<p align="center">
  <img src="Images/vizwiz_example.png" alt="vizwiz_example" width="500"/>
</p>

- **Note:** Visit the [GitHub Repo](https://github.com/yousefkotp/Visual-Question-Answering/tree/main).

- **Note:** This repository is an implementation of the [Less is More: Linear Layers on CLIP Features as Powerful VizWiz Model](https://arxiv.org/abs/2206.05281) paper.
- If you have the time, reading OpenAI's [CLIP](https://openai.com/blog/clip/) paper before going through this repository is strongly recommended.

**Table of contents**<a id='toc0_'></a>    
- [Visual Question Answering](#toc1_)    
  - [Installing Required Libraries](#toc1_1_)    
  - [Importing Libraries](#toc1_2_)    
  - [Configuring the Notebook](#toc1_3_)    
  - [Processing Data](#toc1_4_)    
  - [Creating Dataframes & Splitting](#toc1_5_)    
  - [Exploratory Data Analysis](#toc1_6_)    
    - [Training Dataframe](#toc1_6_1_)    
    - [Validation Dataframe](#toc1_6_2_)    
    - [Testing Dataframe](#toc1_6_3_)    
  - [Processing Images & Questions using CLIP model](#toc1_7_)    
  - [Creating Dataset Class](#toc1_8_)    
  - [Building Model's Architecture](#toc1_9_)    
  - [Loading Preprocessed Embeddings](#toc1_10_)    
  - [Preparing Data Loaders](#toc1_11_)    
  - [Training the Model](#toc1_12_)    
  - [Remarks](#toc1_13_)    
  - [Test your own image !](#toc1_14_)    
  - [Building Test Answers](#toc1_15_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Installing Required Libraries](#toc0_)

Before we start, we need to install the required libraries. We will use [PyTorch](https://pytorch.org/) to build our model. We will also use [OpenAI's CLIP](https://openai.com/research/clip) pretrained model for image and text embedding, which is open sourced on [GitHub](https://github.com/openai/CLIP). We will use [LaTeX](https://www.latex-project.org/) for writing our research [report](https://github.com/yousefkotp/Visual-Question-Answering/blob/9c27560e9c19a0981343fd5fce25861236ab939f/LaTeX_Paper/Visual_Question_Answering_Report.pdf).

%pip install ftfy regex tqdm --user
%pip install pandas --user
%pip install wordcloud --user
%pip install scikit-learn --user
%pip install Levenshtein --user
%pip install git+https://github.com/openai/CLIP.git --user

## <a id='toc1_2_'></a>[Importing Libraries](#toc0_)

In [None]:
# Importing os, numpy and pandas for data manipulation
import os
import numpy as np
import pandas as pd

# For data visualization, we will use matplotlib, wordcloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# For data preprocessing, we will use Counter, train_test_split, Levenshtein distance, Python Image Library and OneHotEncoder
from collections import Counter
import Levenshtein as lev
from PIL import Image
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# For saving and loading the preprocessed data, we will use pickle
import pickle

# For Building the model, we will use PyTorch and its functions
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import clip
from torch.utils.data import Dataset, DataLoader

# For taking the image from the URL, we will use requests
import requests

# For evaluation, we will need sklearn.metrics.average_precision_score
from sklearn.metrics import average_precision_score

# Importing json for results formatting which will be uploaded for evaluation
import json

## <a id='toc1_3_'></a>[Configuring the Notebook](#toc0_)

In [None]:
# Configuring the paths for the dataset
TRAIN_PATH = 'train'
VALIDATION_PATH = 'val'
ANNOTATIONS_TRAIN_PATH = 'train.json'
ANNOTATIONS_VAL_PATH = 'val.json'
OUTPUT_PATH = 'saida_clip/'
ANSWER_SPACE = 0 # Will be configured later when we build the vocab using the methodology described in the paper
MODEL_NAME = "ViT-L/14@336px" # This is the backbone of the CLIP model

# Using accelerated computing if available
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device: ", DEVICE)

## <a id='toc1_4_'></a>[Processing Data](#toc0_)

The following cell defines the functions used to load and preprocess the data:
- `read_dataframe` reads a JSON file and returns a dataframe with the required columns
- `split_train_test` splits the dataframe into train and test sets
- `process_images` processes the images in the dataframe and returns the image features computed with OpenAI's CLIP model
- `process_questions` processes the questions in the dataframe and returns the question features computed with OpenAI's CLIP model

In [None]:
def read_dataframe(path):
    """
    Reads the JSON file and returns a dataframe with the required columns (image, question, answers, answer_type, answerable)

    Parameters:
        path (str): Path to the JSON file

    Returns:
        df (pandas.DataFrame): Dataframe with the required columns
    """
    df = pd.read_json(path)
    df = df[['image', 'question', 'answers', 'answer_type', 'answerable']]
    return df

def split_train_test(dataframe, test_size = 0.05):
    """
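    Splits the dataframe into train and test sets, stratified by answer type and answerability

    Parameters:
        dataframe (pandas.DataFrame): Dataframe to be split
        test_size (float): Fraction of the rows to place in the test set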

    Returns:
        train (pandas.DataFrame): Train set
        test (pandas.DataFrame): Test set
    """
    train, test = train_test_split(dataframe, test_size=test_size, random_state=42, stratify=dataframe[['answer_type', 'answerable']])
    return train, test

def process_images(dataframe, image_path, clip_model, preprocessor, device):
    """
    Processes the images in the dataframe and returns the image features

    Parameters:
        dataframe (pandas.DataFrame): Dataframe containing the images
        image_path (str): Path to the input images
        clip_model (clip.model.CLIP): CLIP model
        preprocessor (callable): Image preprocessing transform returned by clip.load
        device (torch.device): Device to be used for processing

    Returns:
        images (list): List of image features
    """
    images = []
    for _, row in dataframe.iterrows():
        full_path = os.path.join(image_path, row['image'])
        image = Image.open(full_path)
        image = preprocessor(image).unsqueeze(0).to(device)
        image_features = clip_model.encode_image(image)
        image_features = torch.flatten(image_features, start_dim=1)
        images.append(image_features)
    return images

def process_questions(dataframe, clip_model, device):
    """
    Processes the questions in the dataframe and returns the question features

    Parameters:
        dataframe (pandas.DataFrame): Dataframe containing the questions
        clip_model (clip.model.CLIP): CLIP model
        device (torch.device): Device to be used for processing

    Returns:
        questions (list): List of question features
    """
    questions = []
    for _, row in dataframe.iterrows():
        question = row['question']
        question = clip.tokenize(question).to(device)
        text_features = clip_model.encode_text(question).float()
        text_features = torch.flatten(text_features, start_dim=1)
        questions.append(text_features)
    return questions
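
To make the expected input concrete, here is a minimal sketch of one annotation record in the shape that `read_dataframe` consumes. The field names come from the code in this notebook (`image`, `question`, `answers`, `answer_type`, `answerable`, plus the `answer` and `answer_confidence` keys used later). The values are invented for illustration and are not taken from VizWiz, and the answer list is shortened to three entries instead of ten.

```python
import json

# Hypothetical record: field names follow the notebook, values are made up
raw = """
[
  {
    "image": "example_0001.jpg",
    "question": "What color is this shirt?",
    "answers": [
      {"answer": "blue", "answer_confidence": "yes"},
      {"answer": "blue", "answer_confidence": "maybe"},
      {"answer": "navy", "answer_confidence": "no"}
    ],
    "answer_type": "other",
    "answerable": 1
  }
]
"""

records = json.loads(raw)
record = records[0]

# The five columns kept by read_dataframe
columns = ['image', 'question', 'answers', 'answer_type', 'answerable']
row = {column: record[column] for column in columns}

# Answers whose annotators were at least somewhat confident
confident = [a["answer"] for a in row["answers"] if a["answer_confidence"] in ("yes", "maybe")]
print(confident)
```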

## <a id='toc1_5_'></a>[Creating Dataframes & Splitting](#toc0_)

Now, let's use the previously defined functions to create the dataframes and carve a held-out test set out of the training data.

In [None]:
train_df = read_dataframe(ANNOTATIONS_TRAIN_PATH)
#train_df = train_df.groupby('answer_type').apply(lambda x: x.sample(25)).reset_index(drop = True)
validation_df = read_dataframe(ANNOTATIONS_VAL_PATH)
#validation_df = validation_df.groupby('answer_type').apply(lambda x: x.sample(25)).reset_index(drop = True)
train_df, test_df = split_train_test(train_df, test_size=0.05)
train_df, test_df = train_df.reset_index(drop = True), test_df.reset_index(drop = True)
train_df.head()

### Selecting the Answer Space and Filtering the Dataframes to Match It

In [None]:
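def get_most_common_answers(train_df, neurons_final_layer):
    """
    Builds the answer space from the most frequent confident answers in the training set

    Parameters:
        train_df (pandas.DataFrame): Training dataframe with an 'answers' column
        neurons_final_layer (int): Number of answers to keep (size of the model's output layer)

    Returns:
        answers (list): The most frequent answers, most common first
    """
    # Flattening the per-question answer lists into a single dataframe
    df_answers = pd.DataFrame([answer for answers in train_df["answers"] for answer in answers])

    # Keeping only the answers given with confidence "yes" or "maybe"
    df_answers = df_answers[df_answers["answer_confidence"].isin(["yes", "maybe"])]

    # value_counts sorts by frequency, so its index holds the answers in descending order
    return df_answers["answer"].value_counts().head(neurons_final_layer).index.tolist()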

In [None]:
answer_space = get_most_common_answers(train_df, 3000)
answer_space[:5]
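
The same idea can be sketched without pandas: count every confident answer across all questions and keep the most frequent ones. This is only an illustration under stated assumptions. The helper name `most_common_answers` and the toy annotations are hypothetical, and the real answer space above is built from the full training set with 3000 entries.

```python
from collections import Counter

def most_common_answers(answers_per_question, top_k):
    """Return the top_k most frequent confident answers (hypothetical helper)."""
    counts = Counter(
        a["answer"]
        for answers in answers_per_question
        for a in answers
        if a["answer_confidence"] in ("yes", "maybe")
    )
    return [answer for answer, _ in counts.most_common(top_k)]

# Toy annotations, invented for illustration
toy_annotations = [
    [{"answer": "blue", "answer_confidence": "yes"},
     {"answer": "blue", "answer_confidence": "maybe"},
     {"answer": "navy", "answer_confidence": "no"}],
    [{"answer": "unanswerable", "answer_confidence": "yes"},
     {"answer": "blue", "answer_confidence": "yes"},
     {"answer": "unanswerable", "answer_confidence": "yes"}],
]

toy_answer_space = most_common_answers(toy_annotations, top_k=2)
print(toy_answer_space)  # ['blue', 'unanswerable']
```

Here "navy" is dropped twice over: its only vote has confidence "no", and it would not make the top two anyway.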

In [None]:
# Defining the class encoder and decoder used during training/validation
encoder_label = {w: i for i, w in enumerate(answer_space)}
decoder_label = {i: w for i, w in enumerate(answer_space)}

# Making sure the output directory exists before writing to it
os.makedirs(OUTPUT_PATH, exist_ok=True)

with open(OUTPUT_PATH+'encoder_label.pkl', 'wb') as handle:
    pickle.dump(encoder_label, handle)

with open(OUTPUT_PATH+'decoder_label.pkl', 'wb') as handle:
    pickle.dump(decoder_label, handle)

In [None]:
def check_answers(answers, answer_space):
    """
    Returns 1 if at least one of the crowdsourced answers belongs to the answer space, otherwise 0
    """
    df_answers = pd.DataFrame(answers)
    common = set(answer_space) & set(df_answers.answer)
    return 1 if common else 0

In [None]:
train_df["proceed"] = train_df["answers"].apply(check_answers, answer_space = answer_space)
train_df = train_df[train_df["proceed"] == 1].reset_index(drop = True).drop(["proceed"], axis = 1)

validation_df["proceed"] = validation_df["answers"].apply(check_answers, answer_space = answer_space)
validation_df = validation_df[validation_df["proceed"] == 1].reset_index(drop = True).drop(["proceed"], axis = 1)

test_df["proceed"] = test_df["answers"].apply(check_answers, answer_space = answer_space)
test_df = test_df[test_df["proceed"] == 1].reset_index(drop = True).drop(["proceed"], axis = 1)
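
The filtering above keeps a question only when at least one of its crowdsourced answers is in the answer space, since otherwise the model has no valid target for it. A minimal sketch of that rule on plain Python data follows. The helper `has_answer_in_space` and the toy questions are hypothetical and exist only for illustration.

```python
def has_answer_in_space(answers, answer_space):
    """True if any crowdsourced answer is in the answer space (hypothetical helper)."""
    return not set(answer_space).isdisjoint(a["answer"] for a in answers)

# Toy data, invented for illustration
toy_space = ["blue", "unanswerable"]
toy_questions = {
    "q1": [{"answer": "blue"}, {"answer": "navy"}],
    "q2": [{"answer": "teal"}, {"answer": "turquoise"}],
}

kept = [name for name, answers in toy_questions.items() if has_answer_in_space(answers, toy_space)]
print(kept)  # ['q1']
```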

## <a id='toc1_7_'></a>[Processing Images & Questions using CLIP model](#toc0_)

Rather than computing the image and question embeddings lazily and recomputing them on every forward pass, we compute them once and save them to disk with pickle. Because the CLIP model is frozen, the embeddings never change, so this drastically reduces the time taken by each epoch.
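
One possible way to package this compute-once pattern is a small cache helper that loads a pickle if it exists and otherwise computes and saves the result. This is a sketch, not what the notebook does (the cells below always recompute and overwrite the files). The helper `load_or_compute` and the stand-in `fake_embeddings` function are hypothetical.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Load a pickled result if it exists; otherwise compute and save it (hypothetical helper)."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result

# Stand-in for an expensive CLIP encoding pass; counts how often it runs
calls = {"count": 0}

def fake_embeddings():
    calls["count"] += 1
    return [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = os.path.join(tmp_dir, 'toy_embeddings.pkl')
    first = load_or_compute(cache_file, fake_embeddings)   # computes and writes the cache
    second = load_or_compute(cache_file, fake_embeddings)  # reads the cache instead

print(calls["count"], first == second)
```

A cache like this must be invalidated by hand whenever the dataframe or the CLIP backbone changes, which is one reason to prefer the explicit recompute used in this notebook.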

In [None]:
clip_model, preprocessor = clip.load(MODEL_NAME, device = DEVICE)
clip_model.eval().requires_grad_(False)

training_images = process_images(train_df, TRAIN_PATH, clip_model, preprocessor, DEVICE)
training_questions = process_questions(train_df, clip_model, DEVICE)
with open(OUTPUT_PATH + 'training_images.pkl', 'wb') as f:
    pickle.dump(training_images, f)

with open(OUTPUT_PATH + 'training_questions.pkl', 'wb') as f:
    pickle.dump(training_questions, f)

validation_images = process_images(validation_df, VALIDATION_PATH, clip_model, preprocessor, DEVICE)
validation_questions = process_questions(validation_df, clip_model, DEVICE)
with open(OUTPUT_PATH + 'validation_images.pkl', 'wb') as f:
    pickle.dump(validation_images, f)
with open(OUTPUT_PATH + 'validation_questions.pkl', 'wb') as f:
    pickle.dump(validation_questions, f)

# The test split was carved out of the training set, so its images live in TRAIN_PATH
test_images = process_images(test_df, TRAIN_PATH, clip_model, preprocessor, DEVICE)
test_questions = process_questions(test_df, clip_model, DEVICE)
with open(OUTPUT_PATH + 'test_images.pkl', 'wb') as f:
    pickle.dump(test_images, f)
with open(OUTPUT_PATH + 'test_questions.pkl', 'wb') as f:
    pickle.dump(test_questions, f)

## <a id='toc1_8_'></a>[Creating Dataset Class](#toc0_)

PyTorch's `DataLoader` expects a `Dataset` object, so we create a class that serves the precomputed image and question embeddings together with the encoded answers during training.

In [None]:
class VizWizDataset(Dataset):
    def __init__(self, dataframe, encoder_label, images_features = torch.tensor([]), questions_features = torch.tensor([])):
        super(VizWizDataset, self).__init__()

        # Saving image & question embeddings
        self.images_features = images_features
        self.questions_features = questions_features
        self.answerable = dataframe['answerable'].to_numpy()
        self.encoder_answer = encoder_label
        self.encoder_answer_type = {"number": 0, "other": 1, "unanswerable": 2, "yes/no": 3}

        # Saving the dataframe
        self.dataframe = dataframe
        self.dataframe["label"] = self.dataframe["answers"].apply(self.transform_labels)
        self.dataframe["one_hot_encoding_answer"] = self.dataframe["label"].apply(self.one_hot_encoding_answer)
        self.dataframe["one_hot_encoding_answer_type"] = self.dataframe["answer_type"].apply(self.one_hot_encoding_answer_type)
        self.dataframe["label"] = self.dataframe["label"].apply(self.correcting_labels)

    def transform_labels(self, answers):

        answers = pd.DataFrame(answers)
        answers = answers.answer.tolist()
        answers = [self.encoder_answer[w] for w in answers if w in self.encoder_answer.keys()]
        return answers

    def one_hot_encoding_answer(self, label):
        size = len(self.encoder_answer)
        encoding_answer = np.zeros(size)
        encoding_answer[label] = 1
        return encoding_answer

    def one_hot_encoding_answer_type(self, answer_type):
        size = len(self.encoder_answer_type)
        encoding_answer = np.zeros(size)
        encoding_answer[self.encoder_answer_type[answer_type]] = 1
        return encoding_answer

    def correcting_labels(self, label):
        # Padding the label list with -1 up to the 10 answers per question, so that every sample has the same length
        padding = [-1] * (10 - len(label))
        return label + padding

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, index):
        answer = torch.tensor(self.dataframe["one_hot_encoding_answer"][index], dtype=torch.float32)
        answer_type = torch.tensor(self.dataframe["one_hot_encoding_answer_type"][index], dtype=torch.float32)
        answer_counter = torch.tensor(self.dataframe["label"][index], dtype=torch.long)
        answerable = torch.tensor(self.answerable[index], dtype=torch.float32)
        return self.images_features[index], self.questions_features[index], answer, answer_type, answer_counter, answerable
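
The label handling in `VizWizDataset` can be illustrated on plain lists: answers are mapped to class indices, turned into a multi-hot target (a 1 for every class that at least one annotator gave), and padded to ten entries with -1 so that samples can be batched. The function names and the toy encoder below are hypothetical and only mirror the logic of `transform_labels`, `one_hot_encoding_answer`, and `correcting_labels`.

```python
def encode_answers(answers, encoder):
    """Map answer strings to class indices, skipping answers outside the answer space."""
    return [encoder[a] for a in answers if a in encoder]

def multi_hot(indices, num_classes):
    """Vector with a 1 at every class index that appears at least once."""
    vector = [0.0] * num_classes
    for index in indices:
        vector[index] = 1.0
    return vector

def pad_labels(indices, length=10, fill=-1):
    """Pad the index list to a fixed length so it can be batched."""
    return indices + [fill] * (length - len(indices))

# Toy encoder and answers, invented for illustration
toy_encoder = {"blue": 0, "unanswerable": 1, "yes": 2}
toy_answers = ["blue", "blue", "navy", "unanswerable"]

indices = encode_answers(toy_answers, toy_encoder)
print(indices)                               # [0, 0, 1]
print(multi_hot(indices, len(toy_encoder)))  # [1.0, 1.0, 0.0]
print(pad_labels(indices))                   # [0, 0, 1, -1, -1, -1, -1, -1, -1, -1]
```

The padded index list keeps duplicates on purpose: the VizWiz accuracy later counts how many annotators gave the predicted answer.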


## <a id='toc1_9_'></a>[Building Model's Architecture](#toc0_)

Now, let's build the model's architecture according to the paper, using PyTorch as mentioned earlier.

In [None]:
class VQAModel(nn.Module):

    def __init__(self, num_classes, hidden_size, model_name = "ViT-L/14@336px", device = torch.device("cpu")):
        super(VQAModel, self).__init__()

        self.training_losses = []
        self.validation_losses = []

        self.training_accuracies = []
        self.validation_accuracies = []

        self.vizwiz_training_accuracies = []
        self.vizwiz_validation_accuracies = []

        self.device = device
        self.model_name = model_name

        # Loading the CLIP model
        self.clip_model, self.preprocess = clip.load(model_name, device = device)

        # Freezing the CLIP model
        for param in self.clip_model.parameters():
            param.requires_grad = False

        # First linear layer
        self.linear_layer1 = nn.Sequential(
            nn.LayerNorm(self.clip_model.visual.output_dim + self.clip_model.text_projection.shape[1]),
            nn.Dropout(p=0.5),
            nn.Linear(self.clip_model.visual.output_dim + self.clip_model.text_projection.shape[1], hidden_size)
        )

        # Second linear layer
        self.linear_layer2 = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Dropout(p=0.5),
            nn.Linear(hidden_size, num_classes)
        )

        self.answer_type_layer = nn.Linear(hidden_size, 4)
        self.answer_mask_layer = nn.Linear(4, num_classes)

        self.sigmoid = nn.Sigmoid()

    def forward(self, image, question):

        # Flattening and concatenating the image and question features
        image = torch.flatten(image, start_dim=1)
        question = torch.flatten(question, start_dim=1)
        features = torch.cat((image, question), dim=1)

        # Passing the features through the first linear layer
        features = self.linear_layer1(features)

        # Passing the features to get 4 answer types
        answer_type = self.answer_type_layer(features)

        # Expanding the answer mask to the same size as the number of classes (vocab size)
        answer_mask = self.answer_mask_layer(answer_type)

        # Applying sigmoid to get the answer mask
        answer_mask = self.sigmoid(answer_mask)

        # Passing the features through the second linear layer
        output = self.linear_layer2(features)

        # Applying the answer mask to the output
        output = output * answer_mask

        #return output, answer_type, answerability_score
        return output, answer_type

    def train_model(self, training_dataloader, validation_dataloader, test_dataloader, criterion, optimizer, epochs = 10, save_path = None, save_every = 1):
        for epoch in range(1,epochs+1):
            #training_loss, training_accuracy, training_vizwiz_accuracy, train_answerability_score = self.training_step(training_dataloader, criterion, optimizer, self.device)
            training_loss, training_vizwiz_accuracy = self.training_step(training_dataloader, criterion, optimizer, self.device)

            #validation_loss, validation_accuracy, validation_vizwiz_accuracy, validation_answerability_score = self.validation_step(validation_dataloader, criterion, self.device)
            validation_loss, validation_vizwiz_accuracy = self.validation_step(validation_dataloader, criterion, self.device)

            #test_accuracy, test_vizwiz_accuracy, test_answerability_score = self.test_step(test_dataloader)
            test_vizwiz_accuracy = self.test_step(test_dataloader)

            self.training_losses.append(training_loss)
            self.validation_losses.append(validation_loss)

            self.vizwiz_training_accuracies.append(training_vizwiz_accuracy)
            self.vizwiz_validation_accuracies.append(validation_vizwiz_accuracy)

            print("Epoch: {} | Training Loss: {:.3f} | Validation Loss: {:.3f}".format(epoch, training_loss, validation_loss))
            #print("Epoch: {} | Training Accuracy: {:.3f} | Validation Accuracy: {:.3f} | Test Accuracy: {:.3f}".format(epoch, training_accuracy, validation_accuracy, test_accuracy))
            print("Epoch: {} | Training VizWiz Accuracy: {:.3f} | Validation VizWiz Accuracy: {:.3f} | Test VizWiz Accuracy: {:.3f}".format(epoch, training_vizwiz_accuracy, validation_vizwiz_accuracy, test_vizwiz_accuracy))
            #print("Epoch: {} | Training Answerability Score: {:.3f} | Validation Answerability Score: {:.3f} | Test Answerability Score: {:.3f}\n".format(epoch, train_answerability_score, validation_answerability_score, test_answerability_score))

            # Logging and checkpointing only when a save path is provided
            if save_path is not None:
                log_epoch = f'Epoch {epoch} \t Training Loss: {training_loss} \t Training Acc: {training_vizwiz_accuracy} \t Validation Loss: {validation_loss} \t Validation Acc: {validation_vizwiz_accuracy} \t Test Acc: {test_vizwiz_accuracy}'+"\n"
                with open(save_path + 'logs.txt', 'a') as logs:
                    logs.write(log_epoch)

                if epoch % save_every == 0:
                    self.save_model(save_path + "epoch_{}.pth".format(epoch))
        return

    def training_step(self, dataloader, criterion, optimizer, device):
        #training_loss, training_accuracy, vizwiz_accuracy, total_sum = 0.0, 0.0, 0.0, 0
        training_loss, vizwiz_accuracy, total_sum = 0.0, 0.0, 0

        self.train()
        for _, batch in enumerate(dataloader):
            image, question, answer, answer_type, answers_for_questions, answerable = batch
            #image, question, answer, answer_type, answers_for_questions, answerable = image.to(device), question.to(device), answer.to(device), answer_type.to(device), answers_for_questions.to(device), answerable.to(device)
            image, question, answer, answer_type, answers_for_questions = image.to(device), question.to(device), answer.to(device), answer_type.to(device), answers_for_questions.to(device)

            optimizer.zero_grad()
            #output, answer_type_predicted, answerable_predict = self.forward(image, question)
            output, answer_type_predicted = self.forward(image, question)
            #answerable = 1 - answerable
            #answerable_predict = 1.0 - answerable_predict
            #loss = criterion(output, answer) + criterion(answer_type_predicted, answer_type) + self.answerability_loss_fn(answerable_predict, answerable)
            loss = criterion(output, answer) + criterion(answer_type_predicted, answer_type)

            loss.backward()
            optimizer.step()
            training_loss += loss.item()
            predicted_answer = torch.argmax(output, dim = 1)

            ### NOTE: argmax picks only one of the positions that contain a 1
            actual_answer = torch.argmax(answer, dim = 1)

            for i in range(len(answer)):
               
                total_sum +=1
                vizwiz_accuracy += min(1, torch.sum(torch.eq(predicted_answer[i], answers_for_questions[i])).item()/3)
                
        training_loss /= len(dataloader)
        #training_accuracy /= total_sum
        vizwiz_accuracy /= total_sum

        #return training_loss, training_accuracy, vizwiz_accuracy, average_precision_score(answerable_true, answerable_predicted, average = 'weighted')
        return training_loss, vizwiz_accuracy


    def validation_step(self, dataloader, criterion, device):
        #validation_loss, validation_accuracy, vizwiz_accuracy, total_sum = 0.0, 0.0, 0.0, 0
        validation_loss, vizwiz_accuracy, total_sum = 0.0, 0.0, 0

        #answerable_true = []
        #answerable_predicted = []
        self.eval()
        with torch.no_grad():
            for _, batch in enumerate(dataloader):
                image, question, answer, answer_type, answers_for_questions, answerable = batch
                #image, question, answer, answer_type, answers_for_questions, answerable = image.to(device), question.to(device), answer.to(device), answer_type.to(device), answers_for_questions.to(device), answerable.to(device)
                image, question, answer, answer_type, answers_for_questions = image.to(device), question.to(device), answer.to(device), answer_type.to(device), answers_for_questions.to(device)

                #output, answer_type_predicted, answerable_predict = self.forward(image, question)
                output, answer_type_predicted = self.forward(image, question)

                # Answerability is the confidence that the question is not answerable, so we have to subtract from 1
                #answerable = 1 - answerable
                #answerable_predict = 1.0 - answerable_predict
                #loss = criterion(output, answer) + criterion(answer_type_predicted, answer_type) + self.answerability_loss_fn(answerable_predict, answerable)
                loss = criterion(output, answer) + criterion(answer_type_predicted, answer_type)

                validation_loss += loss.item()
                predicted_answer = torch.argmax(output, dim = 1)
                actual_answer = torch.argmax(answer, dim = 1)
                for i in range(len(answer)):
                   
                    total_sum +=1
                    vizwiz_accuracy += min(1, torch.sum(torch.eq(predicted_answer[i], answers_for_questions[i])).item()/3)
                    #answerable_true.append(answerable[i].item())
                    #answerable_predicted.append(answerable_predict[i].item())

        #answerable_true = np.array(answerable_true)
        #answerable_predicted = np.array(answerable_predicted)

        validation_loss /= len(dataloader)
        #validation_accuracy /= total_sum
        vizwiz_accuracy /= total_sum

        # We will use a weighted average since answerability is imbalanced in the dataset, as displayed in the EDA section
        #return validation_loss, validation_accuracy, vizwiz_accuracy, average_precision_score(answerable_true, answerable_predicted, average = 'weighted')
        return validation_loss, vizwiz_accuracy

    def test_step(self, dataloader):
        self.eval()
        #accuracy, total_sum, vizwiz_accuracy = 0.0, 0, 0.0
        total_sum, vizwiz_accuracy = 0, 0.0
        #answerable_true = []
        #answerable_predicted = []
        with torch.no_grad():
            for _, batch in enumerate(dataloader):
                image, question, answer, answer_type, answers_for_questions, answerable = batch
                #image, question, answer, answer_type, answers_for_questions, answerable = image.to(self.device), question.to(self.device), answer.to(self.device), answer_type.to(self.device), answers_for_questions.to(self.device), answerable.to(self.device)
                image, question, answer, answer_type, answers_for_questions = image.to(self.device), question.to(self.device), answer.to(self.device), answer_type.to(self.device), answers_for_questions.to(self.device)

                #output, _, answerable_predict = self.forward(image, question)
                output, _ = self.forward(image, question)
                #answerable = 1 - answerable
                #answerable_predict = 1.0 - answerable_predict
                predicted_answer = torch.argmax(output, dim = 1)
                actual_answer = torch.argmax(answer, dim = 1)
                for i in range(len(answer)):
                   
                    vizwiz_accuracy += min(1, torch.sum(torch.eq(predicted_answer[i], answers_for_questions[i])).item()/3)
                    total_sum +=1
                    #answerable_true.append(answerable[i].item())
                    #answerable_predicted.append(answerable_predict[i].item())

        #answerable_true = np.array(answerable_true)
        #answerable_predicted = np.array(answerable_predicted)

        #accuracy /= total_sum
        vizwiz_accuracy /= total_sum
        #return accuracy, vizwiz_accuracy, average_precision_score(answerable_true, answerable_predicted, average = 'weighted')
        return vizwiz_accuracy

    def save_model(self, path):
        """
        Saves the model state dictionary to the given path.

        Args:
        - self: the model object
        - path (str): the path to save the model state dictionary

        Returns:
        - None
        """
        torch.save(self.state_dict(), path)

    def load_model(self, path):
        """
        Loads the model state dictionary from the given path.

        Args:
        - self: the model object
        - path (str): the path to load the model state dictionary

        Returns:
        - self: the loaded model object
        """
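        # map_location lets a checkpoint saved on GPU be loaded on a CPU-only machine
        self.load_state_dict(torch.load(path, map_location=self.device))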
        self.eval()
        return self

    def predict(self, image, question):
        """
        Predicts the output and answer type for the given image and question.

        Args:
        - self: the model object
        - image (tensor): the image tensor
        - question (tensor): the question tensor

        Returns:
        - output (tensor): the predicted scores over the answer space
        - answer_type (tensor): the predicted answer-type logits
        """
        #output, answer_type, answerability = self.forward(image, question)
        output, answer_type = self.forward(image, question)
        #answerability = 1.0 - answerability
        #return output, answer_type, answerability
        return output, answer_type

    def plot_loss(self):
        """
        Plots the training and validation losses.

        Args:
        - self: the model object

        Returns:
        - None
        """
        plt.plot(self.training_losses, label = "Training Loss")
        plt.plot(self.validation_losses, label = "Validation Loss")
        plt.legend()
        plt.show()

    def plot_vizwiz_accuracy(self):
        """
        Plots the VizWiz training and validation accuracies.

        Args:
        - self: the model object

        Returns:
        - None
        """
        plt.plot(self.vizwiz_training_accuracies, label = "VizWiz Training Accuracy")
        plt.plot(self.vizwiz_validation_accuracies, label = "VizWiz Validation Accuracy")
        plt.legend()
        plt.show()

    def plot_answerability(self):
        """
        Plots the training and validation answerabilities.

        Args:
        - self: the model object

        Returns:
        - None
        """
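        # The answerability head is disabled in this version, so these lists may not exist; fall back to empty lists
        plt.plot(getattr(self, "training_answerability", []), label = "Training Answerability")
        plt.plot(getattr(self, "validation_answerability", []), label = "Validation Answerability")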
        plt.legend()
        plt.show()

    def test_model(self, image_path, question):
        """
        Tests the model by predicting the answer and answer type for the given image and question.

        Args:
        - self: the model object
        - image_path (str): the path to the image file or URL
        - question (str): the question to be asked

        Returns:
        - predicted_answer (tensor): the predicted scores over the answer space
        - predicted_answer_type (tensor): the predicted answer-type logits
        """
        self.eval()
        if image_path.startswith("http"):
            image = Image.open(requests.get(image_path, stream = True).raw)
        else:
            image = Image.open(image_path)

        image = self.preprocess(image).unsqueeze(0).to(self.device)
        image_features = self.clip_model.encode_image(image)
        image_features = torch.flatten(image_features, start_dim=1)

        question = clip.tokenize(question).to(self.device)
        text_features = self.clip_model.encode_text(question).float()
        text_features = torch.flatten(text_features, start_dim=1)

        #predicted_answer, predicted_answer_type, answerability = self.predict(image_features, text_features)
        predicted_answer, predicted_answer_type = self.predict(image_features, text_features)

        #return predicted_answer, predicted_answer_type, answerability
        return predicted_answer, predicted_answer_type

    def print_CLIP_model(self):
        """
        Prints the details of the selected CLIP model.

        Args:
        - self: the model object

        Returns:
        - None
        """
        input_resolution = self.clip_model.visual.input_resolution
        context_length = self.clip_model.context_length
        vocab_size = self.clip_model.vocab_size

        print("Selected model:", self.model_name)
        print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in self.clip_model.parameters()]):,}")
        print("Input resolution:", input_resolution)
        print("Context length:", context_length)
        print("Vocab size:", vocab_size)
        print("")

## <a id='toc1_10_'></a>[Loading Preprocessed Embeddings](#toc0_)

In [None]:
with open(OUTPUT_PATH + 'training_images.pkl', 'rb') as f:
    training_images = pickle.load(f)
with open(OUTPUT_PATH + 'training_questions.pkl', 'rb') as f:
    training_questions = pickle.load(f)

with open(OUTPUT_PATH + 'validation_images.pkl', 'rb') as f:
    validation_images = pickle.load(f)
with open(OUTPUT_PATH + 'validation_questions.pkl', 'rb') as f:
    validation_questions = pickle.load(f)

with open(OUTPUT_PATH + 'test_images.pkl', 'rb') as f:
    test_images = pickle.load(f)
with open(OUTPUT_PATH + 'test_questions.pkl', 'rb') as f:
    test_questions = pickle.load(f)

## <a id='toc1_11_'></a>[Preparing Data Loaders](#toc0_)

In [None]:
# Constructing the training dataset
training_dataset = VizWizDataset(train_df, encoder_label, training_images, training_questions)

# Constructing the validation dataset
validation_dataset = VizWizDataset(validation_df, encoder_label, validation_images, validation_questions)

# Constructing the test dataset
test_dataset = VizWizDataset(test_df, encoder_label, test_images, test_questions)

# Configuring the data loaders
BATCH_SIZE = 32 # 64 also works well, but 32 is better (variance-wise)

# Constructing the training, validation and test data loaders
training_dataloader = DataLoader(training_dataset, batch_size=BATCH_SIZE, shuffle=True)
validation_dataloader = DataLoader(validation_dataset, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

## <a id='toc1_12_'></a>[Training the Model](#toc0_)

In [None]:
# Configuring training's hyperparameters
NUM_EPOCHS = 50
LR = 5e-4
WEIGHT_DECAY = 0
NUM_CLASSES = len(encoder_label)
SAVE_PATH = OUTPUT_PATH
SAVE_EVERY = 5

# Initializing the model
model = VQAModel(num_classes=NUM_CLASSES, device= DEVICE, hidden_size=512, model_name=MODEL_NAME).to(DEVICE)
model.print_CLIP_model()

# Initializing the loss function and optimizer
loss_function = nn.CrossEntropyLoss().to(DEVICE)
optimizer = optim.Adam(model.parameters(), lr=LR, weight_decay = WEIGHT_DECAY)

# Training the model and plotting the loss and accuracy
model.train_model(training_dataloader, validation_dataloader, test_dataloader, loss_function, optimizer, epochs=NUM_EPOCHS, save_path=SAVE_PATH, save_every=SAVE_EVERY)
model.plot_loss()
#model.plot_accuracy()
model.plot_vizwiz_accuracy()
#model.plot_answerability()

- The model at **epoch 45** outperforms the checkpoints from all other epochs, so we pick it as our final model.
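
The best checkpoint can also be picked programmatically rather than read off the plots. The following is a minimal sketch, not part of the notebook's `VQAModel` class: it assumes the per-epoch validation scores are available as a plain list and that a checkpoint is written every `save_every` epochs. The function name `best_saved_epoch` and the scores below are hypothetical and are not results from this run.

```python
def best_saved_epoch(validation_scores, save_every):
    """Return the 1-based epoch with the highest validation score
    among the epochs for which a checkpoint was saved."""
    # Only epochs that are multiples of save_every have a checkpoint on disk
    saved = [e for e in range(1, len(validation_scores) + 1) if e % save_every == 0]
    if not saved:
        raise ValueError("no checkpoint was saved for the given scores")
    # Epochs are 1-based, list indices are 0-based
    return max(saved, key=lambda e: validation_scores[e - 1])


# Hypothetical validation scores for 10 epochs (illustration only)
scores = [0.40, 0.48, 0.52, 0.55, 0.57, 0.58, 0.60, 0.59, 0.61, 0.60]
print(best_saved_epoch(scores, save_every=5))
```

Restricting the search to saved epochs matters because an epoch with a slightly higher score but no checkpoint cannot be reloaded later.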

## <a id='toc1_13_'></a>[Remarks](#toc0_)

- As you can see, the model is very lightweight and fast: one epoch takes about 1 minute. The model also converges quickly, needing at most 30 epochs to fully converge. This is because we build on the CLIP model, which is pretrained on a huge dataset.
- We can further improve the model by training more models with the same architecture but different CLIP backbones. We can also use different pretrained models for the image and text embeddings and ensemble them together.
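
One simple way to ensemble several such models is to average their class probabilities and take the most likely class. The sketch below is an illustration in plain Python, not code from this notebook: it assumes each model outputs one list of raw scores (logits) over the same answer classes, and the names `softmax` and `ensemble_predict` as well as the example numbers are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(logits_per_model):
    """Average class probabilities over models.

    Returns (index of the best class, averaged probabilities)."""
    if not logits_per_model:
        raise ValueError("at least one model output is required")
    probs = [softmax(logits) for logits in logits_per_model]
    num_classes = len(probs[0])
    averaged = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    best = max(range(num_classes), key=lambda c: averaged[c])
    return best, averaged


# Hypothetical logits from two models over three answer classes
best, averaged = ensemble_predict([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
print(best, [round(p, 3) for p in averaged])
```

Averaging probabilities rather than raw logits keeps models with different output scales from dominating the vote. Weighted averages or majority voting are alternative choices.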

## <a id='toc1_14_'></a>[Test your own image !](#toc0_)

The following cell lets you test the trained model on your own image. Set the `IMAGE_PATH` and `QUESTION` variables, then run the cell.

In [None]:
# Taking a sample image and question from the user
QUESTION = "What kind of food is this?"
IMAGE_PATH = "train/VizWiz_train_00000008.jpg"

# Loading the model from the disk
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MODEL_NAME = "ViT-L/14@336px"
NUM_CLASSES = len(encoder_label)
MODEL_PATH = OUTPUT_PATH+"/epoch_45.pth" # checkpoint of the best epoch (see "Training the Model"); alternatively OUTPUT_PATH + 'model.pt'
model = VQAModel(num_classes=NUM_CLASSES, device= DEVICE, hidden_size=512, model_name=MODEL_NAME).to(DEVICE)
model.load_model(MODEL_PATH)

# Predicting the answer and answer type
#predicted_answer, predicted_answer_type, answerability = model.test_model(image_path = IMAGE_PATH, question = QUESTION)
predicted_answer, predicted_answer_type = model.test_model(image_path = IMAGE_PATH, question = QUESTION)

answer = decoder_label[predicted_answer.cpu().detach().numpy()[0].argmax()]
#answer_type = ANSWER_TYPE_ONEHOTENCODER.inverse_transform(predicted_answer_type.cpu().detach().numpy())

# Printing the answer and answer type
#print("The Answer is: " + answer[0][0])
print("The Answer is: " + answer)
#print("The Answer Type is: " + answer_type[0][0])
#print("The confidence for being unanswerable: " + str(answerability.item()))
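
The cell above reports only the single most likely answer. To inspect the runner-up answers as well, the scores can be ranked. The sketch below is an illustration under assumptions: it treats the model output as raw scores (logits) for one question, passed as a plain list (for example via `predicted_answer[0].tolist()`), and it uses a toy `decoder_label` mapping. The function name `top_k_answers` is hypothetical.

```python
import math


def top_k_answers(logits, decoder_label, k=3):
    """Return the k most likely (answer, probability) pairs for one question."""
    # Softmax over the scores (assumes they are raw logits)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Class indices sorted from most to least likely
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(decoder_label[i], probs[i]) for i in ranked[:k]]


# Toy example: four classes with a hypothetical index-to-answer mapping
toy_decoder = {0: "unanswerable", 1: "pizza", 2: "pasta", 3: "yes"}
for answer, prob in top_k_answers([0.2, 3.0, 1.5, -1.0], toy_decoder, k=2):
    print(f"{answer}: {prob:.3f}")
```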

## <a id='toc1_15_'></a>[Building Test Answers](#toc0_)

In [None]:
df = pd.read_json("test.json")
df = df[['image', 'question']]

# Collect the model's answers (and, optionally, its answerability scores) so they can be written to JSON files
model_answers = []
#model_answerability = []

for i in range(len(df)):
    image_url = df['image'][i]
    question = df['question'][i]
    image_path = "test/" + image_url
    #predicted_answer, predicted_answer_type, answerability = model.test_model(image_path = image_path, question = question)
    predicted_answer, predicted_answer_type = model.test_model(image_path = image_path, question = question)

    #answer = ANSWER_ONEHOTENCODER.inverse_transform(predicted_answer.cpu().detach().numpy())
    answer = decoder_label[predicted_answer.cpu().detach().numpy()[0].argmax()]
    #answer_type = ANSWER_TYPE_ONEHOTENCODER.inverse_transform(predicted_answer_type.cpu().detach().numpy())
    #answer_result = {'image': image_url, 'answer': answer[0][0]}
    answer_result = {'image': image_url, 'answer': answer}
    #answerability_result = {'image': image_url, 'answerability': answerability.item()}
    model_answers.append(answer_result)
    #model_answerability.append(answerability_result)

# Writing the results to a JSON file
with open('answers_results.json', 'w') as file:
    json.dump(model_answers, file)
#with open('answerability_results.json', 'w') as file:
#    json.dump(model_answerability, file)
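
Before using the results file, it can be worth a quick sanity check of its structure. The sketch below only checks the shape produced by the loop above (a list of `{'image': ..., 'answer': ...}` entries). It does not restate the official requirements of any evaluation server, and the function name `find_problems` and the example entries are hypothetical.

```python
def find_problems(entries):
    """Return a list of human-readable problems found in the answer entries."""
    problems = []
    seen_images = set()
    for index, entry in enumerate(entries):
        # Each entry should be a dict with exactly the two expected keys
        if not isinstance(entry, dict) or set(entry) != {"image", "answer"}:
            problems.append(f"entry {index}: expected exactly the keys 'image' and 'answer'")
            continue
        if not isinstance(entry["image"], str) or not entry["image"]:
            problems.append(f"entry {index}: image must be a non-empty string")
            continue
        if not isinstance(entry["answer"], str):
            problems.append(f"entry {index}: answer must be a string")
        # Each image should be answered once
        if entry["image"] in seen_images:
            problems.append(f"entry {index}: duplicate image {entry['image']!r}")
        seen_images.add(entry["image"])
    return problems


# Hypothetical entries for illustration
example = [
    {"image": "a.jpg", "answer": "pizza"},
    {"image": "a.jpg", "answer": "pasta"},
    {"image": "b.jpg"},
]
for problem in find_problems(example):
    print(problem)
```

To check the file written above, load it with `json.load` and pass the resulting list to the function. An empty return value means no structural problems were found.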