In [None]:
from IPython.display import display # display() shows images inline; the Image class is imported from PIL below

# Lab 1: Classic language and vision tasks
*This notebook is based on Niels Rogge's tutorial:*
https://github.com/NielsRogge/Transformers-Tutorials/blob/master/GIT/

If you have a GPU, set the runtime to GPU. On a CPU, the process will be very slow.

## GIT (GenerativeImage2Text)
GIT is a transformer decoder that builds on CLIP (https://openai.com/index/clip/). We will learn more about CLIP and GIT in the next class. GIT takes an image (or several images, to handle videos) and text as input, and generates text as output. The image is encoded as patch tokens with CLIP's image encoder, and the text is split into text tokens. The model then generates text iteratively: at each step, it predicts the next text token conditioned on the image tokens and the previously generated text tokens.

<img src="./git.jpeg" alt="drawing" width="500"/>

### Setup
*Install `transformers` if you don't have it yet.*

In [None]:
!pip install -q git+https://github.com/huggingface/transformers.git

*If torch is not installed yet, check how to install it on your device: (https://pytorch.org/get-started/locally)*
For example, you may use <br/>
`!pip3 install torch torchvision torchaudio` <br/>
or <br/>
`!conda install pytorch torchvision -c pytorch`

In [None]:
#!pip3 install torch torchvision torchaudio
import torch

In [None]:
from PIL import Image # to load images

In [None]:
import os # python built-in package

### Warm-up
#### Loading an image
Let's load an image and display it. We'll use images from the [IRFL dataset](https://irfl-dataset.github.io/), which is a collection of images used to study figurative language (we'll hear more about it later in the course). 

**To download the IRFL dataset, clone it into the lab folder: create a `data` folder (for the data used in the labs), go to that folder, and follow the steps below (this assumes that you have git installed).**
Make sure you have git-lfs installed (it handles large files, e.g., images). If not, see https://git-lfs.com (e.g., `brew install git-lfs` on a Mac).
Then, in the terminal, type
`git lfs install` and
`git clone https://huggingface.co/datasets/lampent/IRFL`
**You then only need to unzip the images, and you should be good to go.**

Assuming the images are stored at `data/IRFL/images`, we'll load the first image:

In [None]:
imgdir = "data/IRFL/images/"

In [None]:
imgname = os.listdir(imgdir)[0] # list all the files in imgdir, get the first file
print(imgname)

In [None]:
filepath = os.path.join(imgdir, imgname)
image = Image.open(filepath).convert("RGB") # convert the image to the red-green-blue colour scheme
image

In [None]:
# and another image at index 17
imgname = os.listdir(imgdir)[17]
filepath = os.path.join(imgdir, imgname)
image = Image.open(filepath).convert("RGB")
image

The image has 3 colour channels -- red, green and blue. We can also just display one component, e.g., green:

In [None]:
display(image.getchannel('G'))

#### Preparing the image for GIT
We use the `GitProcessor`, which includes an image processor (for the visual modality) and a tokeniser (for the linguistic modality). Feeding GIT an image will make it use the image processor.

In [None]:
from transformers import AutoProcessor

# the Auto API automatically loads a GitProcessor
processor = AutoProcessor.from_pretrained("microsoft/git-base-textcaps", clean_up_tokenization_spaces=True)

pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values.shape

*The loaded image is now a tensor of shape 1x3x224x224: a batch of 1 image with 3 colour channels (red, green, blue) and 224x224 pixels per colour channel.*

#### Loading a GIT model
We will load the base-sized GIT model from the [huggingface hub](https://huggingface.co/docs/hub/index), which was fine-tuned on the image captioning [TextCaps dataset](https://huggingface.co/datasets/lmms-lab/TextCaps). Loading it may take a while ...

**Remark:** To see all the GIT models that are available on the hub, see here: https://huggingface.co/models?search=microsoft/git

In [None]:
# load a model with a causal language modeling "head", such that we can generate text with the model
from transformers import AutoModelForCausalLM 

model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-textcaps")

#### Generate a caption
We call the `generate` method to generate a caption for the image. 
By default, *greedy decoding* is used: to generate token *t*, the model chooses the token with the highest probability. Token *t+1* is then generated in the same way, by choosing the token with the highest probability given the image and the tokens *1, ..., t*.

In [None]:
# run on the GPU if you have one
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
pixel_values = pixel_values.to(device)

generated_ids = model.generate(pixel_values=pixel_values, max_length=20)
print("Generated caption:", processor.batch_decode(generated_ids, skip_special_tokens=True))
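
To make the greedy decoding loop concrete, here is a minimal sketch on a toy model. The lookup table `NEXT_TOKEN_PROBS` and the function `greedy_decode` are hypothetical stand-ins with made-up numbers. GIT computes its probabilities with a neural network from the image and all previous tokens, whereas this toy table only looks at the last token. Only the selection rule (always take the most probable token) is the same.

```python
# Toy stand-in for a language model (assumption, not GIT): maps the last
# generated token to a probability distribution over the next token.
NEXT_TOKEN_PROBS = {
    "<start>": {"a": 0.6, "the": 0.3, "<end>": 0.1},
    "a": {"dog": 0.5, "cat": 0.4, "<end>": 0.1},
    "the": {"dog": 0.3, "cat": 0.6, "<end>": 0.1},
    "dog": {"runs": 0.7, "<end>": 0.3},
    "cat": {"sleeps": 0.8, "<end>": 0.2},
    "runs": {"<end>": 1.0},
    "sleeps": {"<end>": 1.0},
}

def greedy_decode(max_length=20):
    """Generate tokens by always picking the most probable next token."""
    tokens = ["<start>"]
    while len(tokens) < max_length:
        probs = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = max(probs, key=probs.get)  # greedy choice: the argmax
        if next_token == "<end>":  # stop once the end token is chosen
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token

print(greedy_decode())  # ['a', 'dog', 'runs']
```

Note that greedy decoding never revisits a choice: after picking "a", the continuation "the cat sleeps" is no longer reachable, even if it were more probable overall.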

#### Functionalities
Let's write a function to load an image to be input to GIT.

In [None]:
import random 

def load_image(imgdir="data/IRFL/images/", imgname=None):
    """ Loads an image and returns it in RGB format.
        If no image directory is specified, images from IRFL will be used. 
        If imgname is None, a random image will be sampled and returned.
    """
    if imgname is None:
        imgname = random.choice(os.listdir(imgdir)) # pick one file name at random
    image = Image.open(os.path.join(imgdir, imgname)).convert("RGB")
    return image, imgname

image, _ = load_image()
image

As an alternative, you can also use images from the web:

In [None]:
import requests

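def download_image(url, save_as):
    """ Downloads the image at url and saves it to the path save_as. """
    response = requests.get(url, timeout=30)
    response.raise_for_status() # fail loudly instead of silently skipping the download
    with open(save_as, 'wb') as file:
        file.write(response.content)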

image_url = 'https://upload.wikimedia.org/wikipedia/commons/4/47/Jackfruit.jpeg'
save_as = 'data/jackfruit.jpg'

download_image(image_url, save_as)

In [None]:
image,_ = load_image("data", "jackfruit.jpg")
image

### Image captioning

The code below is the same as above, just condensed into one block.

In [None]:
from transformers import AutoProcessor, AutoModelForCausalLM
import torch

processor = AutoProcessor.from_pretrained("microsoft/git-base-textcaps", clean_up_tokenization_spaces=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-textcaps")

# run on the GPU if you have one
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

In [None]:
image, imgname = load_image()

inputs = processor(images=image, return_tensors="pt")
pixel_values = inputs.pixel_values.to(device)

generated_ids = model.generate(pixel_values=pixel_values, max_length=50)

print("Generated caption:", processor.batch_decode(generated_ids, skip_special_tokens=True))
print("Image name: ", imgname)
display(image)

In [None]:
image, imgname = load_image("data", "jackfruit.jpg")

inputs = processor(images=image, return_tensors="pt")
pixel_values = inputs.pixel_values.to(device)

generated_ids = model.generate(pixel_values=pixel_values, max_length=50)

print("Generated caption:", processor.batch_decode(generated_ids, skip_special_tokens=True))
print("Image name: ", imgname)
display(image)

In [None]:
num_images = 20
img_names = []
for idx in range(num_images):
    img, imgname = load_image() # sample a random image
    img_names.append(imgname) # keep track of the sampled image names

#### Exercise
* Using the code above, sample 20 images and generate a caption for each of them with GIT.
Inspect the generated captions and evaluate them according to the following criteria:
   1. Faithfulness/Fidelity: Is the caption *consistent* with the content of the image? Is the sentence truthfully related to what is shown in the image? Does it have *hallucinations* or *non-factual* information?
   2. Adequacy: How much image gist does the caption contain? (That is, how exhaustively does it describe the image content?)
   3. Fluency: Is the sentence fluent, grammatical and coherent (irrespective of the image)?
   4. Informativeness: Is the content of the caption redundant or meaningless?

* Group the errors GIT made in the generated captions into classes, and try to find reasons for each class. One class could be, e.g., "incorrect object name" (i.e., the object was not identified correctly) or "wrong action mentioned" (if an action is mentioned at all).
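
One way to keep the evaluation organised is to record a score per criterion for every caption and then average per criterion. The sketch below assumes a rating scale from 1 (poor) to 5 (good) and uses made-up image names and scores. Both the scale and the helper `mean_scores` are illustrative choices, not part of the exercise.

```python
from statistics import mean

CRITERIA = ["faithfulness", "adequacy", "fluency", "informativeness"]

# One entry per inspected caption; names and scores are hypothetical.
ratings = [
    {"image": "example_1.jpeg", "faithfulness": 4, "adequacy": 3, "fluency": 5, "informativeness": 4},
    {"image": "example_2.jpeg", "faithfulness": 2, "adequacy": 2, "fluency": 5, "informativeness": 3},
]

def mean_scores(ratings):
    """Return the average score per criterion over all rated captions."""
    return {c: mean(r[c] for r in ratings) for c in CRITERIA}

print(mean_scores(ratings))
```

Averages per criterion make it easier to compare two models on the same set of images.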

**Remark:** *You can use the function `sample_imgnames()` below to first sample `num_imgs` image names, and then load them using `load_image()`. This way, you can keep track of the images you inspected, and also use them as input for another model (see below, VATEX).*

In [None]:
def sample_imgnames(imgdir="data/IRFL/images/", num_imgs=10):
    return random.sample(os.listdir(imgdir), num_imgs)
    
img_names = sample_imgnames(num_imgs=3) # sample 3 image names
img, _ = load_image(imgname=img_names[0]) # load the first of the 3 image names (index 0)
img
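
If you want to get the same sample again after restarting the notebook, you can sample with a fixed seed. The sketch below is one possible way to do this. The function `sample_names`, its `seed` parameter and the file names are hypothetical, and the list `all_names` stands in for the output of `os.listdir(imgdir)`.

```python
import random

def sample_names(all_names, num_imgs=10, seed=0):
    """Sample image names reproducibly: the same seed gives the same sample."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    # sort first, because os.listdir returns file names in arbitrary order
    return rng.sample(sorted(all_names), num_imgs)

# hypothetical file names standing in for os.listdir(imgdir)
all_names = [f"img_{i}.jpeg" for i in range(100)]
sampled = sample_names(all_names, num_imgs=3, seed=42)
print(sampled)
```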

You can also try another model: The code below loads the model that was trained on VATEX, a video captioning dataset. 
* How do the two models compare in terms of the evaluation criteria? Is one model, e.g., more faithful than the other?

In [None]:
processor = AutoProcessor.from_pretrained("microsoft/git-base-vatex")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-vatex")

# run on the GPU if you have one
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

In [None]:
inputs = processor(images=image, return_tensors="pt")
pixel_values = inputs.pixel_values.to(device)

generated_ids = model.generate(pixel_values=pixel_values, max_length=50)

print("Generated caption:", processor.batch_decode(generated_ids, skip_special_tokens=True))
print("Image name: ", imgname)
display(image)

### Visual Question Answering (VQA)

To load the model fine-tuned on [TextVQA](https://huggingface.co/datasets/facebook/textvqa), run the first block below.

In [None]:
processor = AutoProcessor.from_pretrained("microsoft/git-base-textvqa")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-textvqa").to(device) # move the new model to the GPU if available

To load the model fine-tuned on [VQAv2](https://huggingface.co/datasets/HuggingFaceM4/VQAv2) instead, run the block below.

In [None]:
processor = AutoProcessor.from_pretrained("microsoft/git-base-vqav2")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-vqav2").to(device) # move the new model to the GPU if available

In [None]:
image, imgname = load_image("data", "jackfruit.jpg")
#image, imgname = load_image() # uncomment to load random images from IRFL
image

In [None]:
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)

In [None]:
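#question = "what colour is the mirror?" # other ideas: is the bed white / black / pink?
question = "what is shown in the image?" # or, e.g.: is the mirror white?
# tokenise the question without special tokens, then prepend the [CLS] token
input_ids = processor(text=question, add_special_tokens=False).input_ids
input_ids = [processor.tokenizer.cls_token_id] + input_ids
input_ids = torch.tensor(input_ids).unsqueeze(0).to(device) # add a batch dimension, use the same device as pixel_values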

generated_ids = model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)

print("Image name: ", imgname)
print("Generated answer:", processor.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens=True))
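
The slice `generated_ids[:, input_ids.shape[1]:]` is needed because the sequence returned by `generate` starts with the question tokens that were passed in as `input_ids`, and the answer tokens follow after them. The following sketch shows the same idea on a plain Python list. The token ids are made up for illustration and do not come from the actual tokeniser.

```python
# Hypothetical token ids: the question occupies the first 5 positions,
# and the model appended 2 answer tokens after it.
question_ids = [101, 2054, 2003, 3491, 1029]
generated = question_ids + [6207, 102]

# Same idea as generated_ids[:, input_ids.shape[1]:], on a plain list:
# skip as many positions as the question has tokens.
answer_ids = generated[len(question_ids):]
print(answer_ids)  # [6207, 102]
```

Without the slice, the decoded output would repeat the question in front of the answer.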

#### Exercise
1. Sample ten images, and for each image, ask the model three questions. You could ask for *attributes* (e.g., colour, shape, activity), *objects*, or relations (e.g., actions, spatial relations between objects).
2. Inspect the answers, and analyse them according to the following criteria: 
  * **Accuracy:** Is the answer correct?
    * Q: What is the proportion of correct answers overall? Calculate it by dividing the number of correct answers by the total number of questions, i.e., 30.
  * **Consistency:**
    Does the model respond consistently across different questions? <br/>
    * You can test this, e.g., by paraphrasing a question, asking a reconfirming question, or negating the proposition. <br/>
      For example: Q: *what is the colour of the apple?* A: `green`
      * Reconfirming question: *is the apple green?* A: `yes` is consistent, A: `no` is inconsistent
      * Paraphrase: *what is the colour of the fruit?* (provided there is only one fruit shown in the image)
      * Negation of *is the apple green?* should yield the opposite answer to *is the apple not green?* (e.g., yes -> no).
    * Test for consistency with each of the three questions per image.
    * Q: For how many questions is the model inconsistent?
  * **Validity:**
    Is the answer in the scope of the question? <br/>
    For example, a number when asking a counting question, a colour when asking for a colour.
  * **Plausibility:**
    Is the answer reasonable or does it make sense given the question?<br/>
    For example, `red` is a plausible answer to a question about the colour of an apple, but it is implausible for the colour of a lion.

3. Inspect the errors the model made and try to find error classes. Possible error classes could be: mistake in visual *grounding*, *relation* between objects / entities not understood, image seems to have been disregarded (*hallucinated* answer).
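
The accuracy and the negation check described above can be computed with a few lines of code once you have judged the answers by hand. This is a minimal sketch: the records are made-up examples, and the helpers `accuracy` and `negation_consistent` are hypothetical names. The negation check only covers yes/no answers.

```python
# Hypothetical records: one per question, with a manual correctness judgement.
records = [
    {"question": "what is the colour of the apple?", "answer": "green", "correct": True},
    {"question": "is the apple green?", "answer": "yes", "correct": True},
    {"question": "is the apple not green?", "answer": "yes", "correct": False},
]

def accuracy(records):
    """Proportion of correct answers among all questions."""
    return sum(r["correct"] for r in records) / len(records)

def negation_consistent(answer, negated_answer):
    """A yes/no question and its negation should receive opposite answers."""
    return {answer, negated_answer} == {"yes", "no"}

print(f"accuracy: {accuracy(records):.2f}")  # accuracy: 0.67
print("consistent under negation:", negation_consistent("yes", "yes"))  # False
```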

### Further information
* [Measuring Faithful and Plausible Visual Grounding in VQA](https://aclanthology.org/2023.findings-emnlp.206.pdf)