# Overview

This notebook demonstrates how to run each experiment using FROMAGe. Specifically, it aims to show how in-context learning is applied to several tasks and how different types of prompting strategies (e.g. text or visual augmentations) can increase the model's performance on each downstream task.


# 1. Image Captioning

### Important note

The data used for inference are in the repo so when you clone it, you don't have to manually download anything to run the demo.

### Some useful functions
- The cos_sim function computes the cosine similarity. We need it to compare the text embeddings of the model's outputs.
- The mean_pooling function is used as a processing step for the embeddings.
- The compare_embeddings function uses the Mini-LM-L6 model to generate the text embeddings and then calls the mean_pooling function and the cos_sim function to provide a final score. The second image further below illustrates this procedure.
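
The pooling and scoring steps described above can be sketched without any dependencies. This is only an illustration under stated assumptions: the real compare_embeddings also runs the sentence encoder on the input text, whereas the version here takes token embeddings that are already computed, and the argument names and toy vectors are made up for the example.

```python
import math


def mean_pooling(token_embeddings, attention_mask):
    """Average the token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    # Guard against an all-padding input to avoid dividing by zero.
    return [t / max(count, 1) for t in total]


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compare_embeddings(tokens_a, mask_a, tokens_b, mask_b):
    """Pool each caption's token embeddings, then score their similarity."""
    return cos_sim(mean_pooling(tokens_a, mask_a), mean_pooling(tokens_b, mask_b))


# Toy token embeddings standing in for the sentence-encoder output (assumption).
caption_a = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last token is padding
caption_b = [[2.0, 2.0], [4.0, 4.0]]
score = compare_embeddings(caption_a, [1, 1, 0], caption_b, [1, 1])
```

Because the padded token is masked out, both captions pool to vectors pointing in the same direction, so the score is close to 1.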

### Instructions

If you execute image_captioning_demo.py through jobs/image_captioning_demo.job, you will get the results in a few minutes. What essentially happens in the main function is:
1. Load the model
2. Load some samples from the dataset
3. Run the inference loop

There are comments in the image_captioning_demo.py file for each section.

&nbsp;

<p align="center">
  <img src="./images_report/Visual_augmentation_of_prompt.png" width="920" height="280" />
</p>

&nbsp;

<p align="center">
  <img src="./images_report/embeds_cos_sim.png" width="700" height="200" />
</p>

# 2. Image Retrieval from Text (using the Flickr dataset)

The data used for inference are in the repo so when you clone it, you don't have to manually download anything to run the demo.

What happens in the image_retrieval_demo.py file is pretty much the same as above, but in this case we evaluate whether text augmentation can help the model perform better.

&nbsp;

<p align="center">
  <img src="./images_report/Text_augmentation_of_prompt.png" width="720" height="300" />
</p>

The two output images are compared to the target image using the cosine similarity of their visual embeddings, which are extracted with the CLIP encoder from FROMAGe.
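
A minimal sketch of this comparison, assuming the visual embeddings have already been extracted and are available as plain lists of floats. The helper names, the candidate labels, and the numbers are hypothetical placeholders, not values produced by the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_against_target(target, candidates):
    """Map each candidate name to its similarity with the target embedding."""
    return {name: cosine_similarity(target, emb) for name, emb in candidates.items()}


# Made-up 3-dimensional vectors; real CLIP embeddings are much larger.
target_embedding = [0.2, 0.9, 0.1]
candidates = {
    "candidate_a": [0.9, 0.1, 0.3],  # e.g. image obtained with the original prompt
    "candidate_b": [0.1, 0.8, 0.2],  # e.g. image obtained with the augmented prompt
}

scores = score_against_target(target_embedding, candidates)
closest = max(scores, key=scores.get)  # name of the candidate nearest the target
```

The candidate with the higher score is the one whose embedding points in a direction closer to the target's.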

# 3. Guided VQA

### Important note

The data used for inference are in the repo so when you clone it, you don't have to manually download anything to run the demo.

What happens in the guided_vqa_demo.py file is that we check whether visual augmentation (image segmentation) of the prompt helps the model perform better or not.

&nbsp;
<p align="center">
  <img src="./images_report/gvqa.png" width="1400" height="400" />
</p>



# 4. GIF Captioning

This task aims to explore the zero-shot learning capabilities of FROMAGe for captioning a number of frames that come from the same GIF, not individually but as a sequence.

The `gif_captioning_demo.py` script contains a few GIF URLs to choose from. It runs inference with FROMAGe on the selected GIF so that you can observe the output.
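
One way to turn a GIF into a sequence prompt is to pick a handful of evenly spaced frames and append a text instruction. The sketch below is an assumption about how this could be done, not a description of what `gif_captioning_demo.py` does: the function names, the even spacing, and the frame count are hypothetical, and strings stand in for the decoded frames. Only the instruction text is taken from the example in the table below.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick `num_samples` evenly spaced frame indices out of `num_frames`."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        return list(range(num_frames))
    if num_samples == 1:
        return [0]
    # Spread the samples from the first frame to the last one.
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def build_gif_prompt(frames, instruction="Give caption as video."):
    """Return the frames in order, followed by the text instruction."""
    return list(frames) + [instruction]


indices = sample_frame_indices(num_frames=21, num_samples=5)
# Placeholder strings stand in for the decoded frame images.
prompt = build_gif_prompt([f"frame_{i}" for i in indices])
```

Keeping the frames in temporal order is what lets the model treat them as one sequence rather than as unrelated images.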

&nbsp;

<p align="center">
  <table style="text-align: center">
    <tr>
      <th> Original GIF and Caption </th>
      <td rowspan="3"> &rarr; </td>
      <th> Prompt </th>
      <td rowspan="3"> &rarr; </td>
      <th> Predicted Caption </th>
    </tr>
    <tr>
      <td> <img src="images_report/skating.gif" height=150/> </td>
      <td> <img src="images_report/skating-5.png" height=150/> </td>
      <td rowspan="2"> skateboarder in the skateboarder jumps over a rail and lands on </td>
    </tr>
    <tr>
      <td> a skate boarder is doing trick on his skate board. </td> 
      <td> + "Give caption as video." </td> 
    </tr>
  </table>
</p>

# 5. Image Classification

Write here

# 6. Visual Dialog 

With this task, we aim to show that compressing complex text input (in this case a caption and a dialog) into a clearer and more compact form improves the model's ability to retrieve an image that suits the context. The image below illustrates the procedure.

### Important note

Make sure you download the data as described in the GitHub README.md and place it in the main directory of the project (the directory created when you clone the repo).

### Some useful functions
- The load_dialogs function loads the data.
- The gpt_prompt function generates the new captions using text-davinci-003.
- The get_image function shows an image by giving the image id.
- The get_prompt_list function creates prompts from a dataframe.
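
To make the compression step more concrete, here is a rough sketch of how a caption and its dialog could be combined into a single text prompt for a language model. Everything in it is an assumption for illustration: the function names, the prompt wording, and the record layout (a list of dicts instead of a dataframe) are hypothetical, and no request is sent to any model.

```python
def format_dialog(dialog):
    """Render (question, answer) pairs as one 'Q: ... A: ...' line each."""
    return "\n".join(f"Q: {question} A: {answer}" for question, answer in dialog)


def build_compression_prompt(caption, dialog):
    """Ask a language model to merge a caption and its dialog into one caption."""
    return (
        "Rewrite the caption and dialog below as one short, clear image caption.\n"
        f"Caption: {caption}\n"
        f"Dialog:\n{format_dialog(dialog)}\n"
        "New caption:"
    )


def build_prompt_list(records):
    """Create one compression prompt per record."""
    return [build_compression_prompt(r["caption"], r["dialog"]) for r in records]


# A made-up record; the real data comes from the downloaded dialog files.
records = [
    {
        "caption": "a man riding a bike on a street",
        "dialog": [("is it sunny?", "yes"), ("is he wearing a helmet?", "no")],
    }
]
prompts = build_prompt_list(records)
```

The text returned by the language model for each prompt would then replace the original caption and dialog as the retrieval query.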


&nbsp;

<img src="images_report/visualdialog_scheme.png" alt="Image" />