## Assignment 5

This assignment is designed to help you understand, implement, and get hands-on experience with two important models in multi-modal learning: CLIP and BLIP.

Before starting to code, read the instructions and comments carefully. Make sure you understand the requirements and the expected output.

This assignment has two parts:
- Content based image retrieval (CBIR) based on CLIP
- Visual Question Answering (VQA) based on BLIP

## Part 1: Content Based Image Retrieval (CBIR)

Ever found yourself looking for a specific image in a load of photos on your phone?
Imagine you have a system that can retrieve images based on their content. In this assignment, you will implement exactly that.

### Task
1. I have provided a list of images in the form of Unsplash links. Feel free to explore and add more images.
2. You will implement a CBIR system that retrieves images based on the query you provide.
3. Use the CLIP model to extract image and text embeddings. Since CLIP projects both images and text into the same embedding space, an image and a text description of it will lie close to each other in that space.
4. Use a pretrained CLIP model from HuggingFace, because training CLIP from scratch is not feasible here.
5. Specifically, use [OpenAI's CLIP](https://huggingface.co/openai/clip-vit-base-patch32) model from HuggingFace. Refer to the model page, and any other resources you find useful, to learn how to use it.
6. Take an input query from the user and display the image that best matches it. (You can use cosine similarity between the query embedding and each image embedding, then display the image with the highest similarity score.)
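
The retrieval step in item 6 can be sketched without any model at all. The snippet below is a minimal illustration, assuming tiny hand-written 3-dimensional vectors in place of real CLIP embeddings; the names `cosine_similarity`, `rank_images`, and the toy vectors are hypothetical and not part of the assignment code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(query_embedding, image_embeddings):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [cosine_similarity(query_embedding, emb) for emb in image_embeddings]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; real CLIP embeddings have many more dimensions.
toy_image_embeddings = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.2, 0.9],
]
toy_query_embedding = [0.1, 0.9, 0.1]

ranking = rank_images(toy_query_embedding, toy_image_embeddings)
best_index, best_score = ranking[0]
print(f"Best match: image {best_index} (score {best_score:.4f})")
```

Returning the full ranking rather than only the top index makes it easy to show the top few matches later, if you want to extend the system.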

##### Load all the libraries

In [2]:
## Load libraries
!pip install transformers
!pip install torchvision
!pip install pillow

from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image
import requests
from torchvision import transforms


IMAGE_URLS = [
    "https://images.unsplash.com/photo-1750126833705-ba98013f16f3",
    "https://images.unsplash.com/photo-1749627995669-4d4dda3a9c1d",
    "https://images.unsplash.com/photo-1744294724362-3f5c404c771a",
    "https://images.unsplash.com/photo-1750075750236-3f8924fc0e35"
]
# Feel free to edit the above list with your own image URLs



##### Load the CLIP model

In [3]:
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


##### Loop through images, convert them to tensors, and store them in a list

In [4]:
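# Preprocessing that uses CLIP's normalization statistics. Note that
# Resize((224, 224)) squashes the image to a square, whereas CLIPProcessor
# resizes the shorter side and center-crops, so results may differ slightly.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),  # PIL image -> float tensor in [0, 1], shape (3, H, W)
    transforms.Normalize(   # per-channel mean and standard deviation used by CLIP
        mean=[0.48145466, 0.4578275, 0.40821073],
        std=[0.26862954, 0.26130258, 0.27577711]
    )
])
image_tensors = []  # preprocessed tensors fed to the model
images = []         # original PIL images, kept for display
for url in IMAGE_URLS:
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()  # fail early on a bad URL or HTTP error
    img = Image.open(response.raw).convert("RGB")
    images.append(img)
    tensor = transform(img)
    image_tensors.append(tensor)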

##### Get the embeddings for the images and store them in a list
Use the CLIP model linked above; its page also explains how to use it. If you run into any issues, feel free to use any other resources you find useful.

In [5]:
model.eval()  # inference mode (disables dropout)
image_embeddings = []
with torch.no_grad():  # no gradients are needed for retrieval
    for img_tensor in image_tensors:
        img_tensor = img_tensor.unsqueeze(0)  # add a batch dimension: (1, 3, 224, 224)
        inputs = {"pixel_values": img_tensor}
        outputs = model.get_image_features(**inputs)
        outputs = outputs / outputs.norm(p=2, dim=-1, keepdim=True)  # L2-normalize
        image_embeddings.append(outputs)

##### Get the query from user

In [6]:
query = input("Describe the image you are looking for: ")

##### Get the embeddings for the user query

In [7]:
inputs = processor(text=[query], return_tensors="pt")
with torch.no_grad():
    text_embedding = model.get_text_features(**inputs)
    text_embedding = text_embedding / text_embedding.norm(p=2, dim=-1, keepdim=True)  

##### Find the similarity between the query and all the images

In [8]:
import torch.nn.functional as F
image_embeddings_tensor = torch.cat(image_embeddings, dim=0)  
similarities = F.cosine_similarity(text_embedding, image_embeddings_tensor)
best_match_index = similarities.argmax().item()

##### Display the image with highest similarity score

In [9]:
print(f"\nMost similar image index: {best_match_index}")
print(f"Similarity score: {similarities[best_match_index]:.4f}")
images[best_match_index].show()


Most similar image index: 0
Similarity score: 0.2944


Congrats! You have implemented a simple Content Based Image Retrieval (CBIR) system using the CLIP model.

Moving on to the next part of the assignment.

## Part 2: Visual Question Answering (VQA)

This part is based on Visual Question Answering (VQA), where you will be given an image and a question about that image, and you will need to provide an answer.

You are provided with a dataset of 10 images and a JSON file containing questions and answers for those images.

### Format of JSON file
```json
[
    "what is the blue shape?",
    "rectangle",
    0
]
```
Each entry is a list of three items: the question, the answer, and the image index. Here the question is "what is the blue shape?", the answer is "rectangle", and the image index is 0.
Images are named by index from 0 to 9, so the image for index 0 is `0.png`.
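
As a rough sketch of how such entries map to image files, the snippet below parses a small inline sample instead of the real file. It assumes the file is a list of `[question, answer, image_index]` entries; `sample_json` and `parse_entries` are hypothetical names used only for illustration.

```python
import json

# Hypothetical two-entry sample in the [question, answer, image_index] layout.
sample_json = '''
[
    ["what is the blue shape?", "rectangle", 0],
    ["what color is the shape?", "blue", 0]
]
'''

def parse_entries(raw_text):
    """Turn [question, answer, image_index] entries into dictionaries."""
    records = []
    for question, answer, image_index in json.loads(raw_text):
        records.append({
            "question": question,
            "answer": answer,
            "image_file": f"{image_index}.png",  # index 0 -> "0.png"
        })
    return records

records = parse_entries(sample_json)
for record in records:
    print(record["image_file"], "|", record["question"], "->", record["answer"])
```

Keeping the file name next to each question avoids rebuilding it every time you need the image later.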

### Task
1. Load the JSON file and parse the questions and answers.
2. To get a feel for the data, print one or two questions with their answers and display the corresponding images.
3. Implement a function that takes an image and a question, and returns the answer.
4. Run the function on all questions and store the answers in a new JSON or CSV file. (You may use pandas for this.)
5. Now that you have the answers, evaluate your model by comparing them with the ground truth answers provided in the JSON file. (Refer to the note below.)

Note: Comparing answers directly is not that easy, because a correct answer may not match the ground truth exactly due to variations in phrasing (for example, "2" vs. "two", or "blue" vs. "The color is blue"). So check your answers manually and decide whether each one is correct. Do this for the first 30 answers and calculate the accuracy.
There are ways to automate this, but they would make the assignment too complicated, so we will not do that here.
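
Purely as an optional aid to the manual check (not a replacement for it), a light normalization pass can flag answers that probably match despite phrasing differences. This is a minimal sketch under stated assumptions: the `NUMBER_WORDS` mapping, `normalize`, and `probably_matches` are hypothetical helpers, and the heuristic will miss or mislabel some cases.

```python
import re

# Assumed, deliberately small mapping; extend it if your answers need more.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}

def normalize(answer):
    """Lowercase, drop punctuation, and map number words to digits."""
    tokens = re.findall(r"[a-z0-9]+", answer.lower())
    return [NUMBER_WORDS.get(token, token) for token in tokens]

def probably_matches(predicted, ground_truth):
    """True when every ground-truth token also appears in the prediction."""
    predicted_tokens = set(normalize(predicted))
    return all(token in predicted_tokens for token in normalize(ground_truth))

print(probably_matches("two", "2"))
print(probably_matches("The color is blue", "blue"))
print(probably_matches("black", "blue"))
```

Anything the heuristic flags as a mismatch still deserves a look by eye, since a differently worded answer can be correct.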

### Additional Points
- Refer to the BLIP model that was taught in the last class. It can be used to answer questions based on images.
- Use a pretrained BLIP model for this task, because training it from scratch is not feasible in this assignment.
- Specifically, use the [Salesforce BLIP model](https://huggingface.co/Salesforce/blip-vqa-base). Refer to the Usage section of its page for details on how to use it. If you feel stuck, you can refer to blogs available online, or use AI tools to get familiar with the model.

1. Load libraries, and the JSON file.

In [4]:
pip install transformers torchvision torchaudio





In [6]:
from transformers import BlipProcessor, BlipForQuestionAnswering
from PIL import Image
import torch
import json
with open('filtered_questions.json', 'r') as f:
    data = json.load(f)
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")


2. Display a few questions and answers along with images.

In [5]:
for i, item in enumerate(data[:2]):  # show only the first two entries
    question, answer, image_index = item
    image_filename = f"{image_index}.png"

    img = Image.open(image_filename)
    img.show()
    print(f"Example {i+1}")
    print(f"Image: {image_filename}")
    print(f"Q: {question}")
    print(f"A: {answer}")
    print('-' * 30)

Example 1
Image: 0.png
Q: what is the blue shape?
A: rectangle
------------------------------
Example 2
Image: 0.png
Q: what color is the shape?
A: blue
------------------------------


3. Implement the function to answer questions based on images.

In [7]:
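def vqa(question, image):
    """Answer `question` about a PIL `image` using the BLIP VQA model."""
    inputs = processor(image, question, return_tensors="pt")  # tokenize text, preprocess image
    output = model.generate(**inputs)                         # generate answer token ids
    answer = processor.decode(output[0], skip_special_tokens=True)
    return answer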

4. Run the function on all questions and store the answers in a file.

In [11]:
import pandas as pd
results = []
for item in data:
    question, ground_truth, image_index = item
    image_path = f"{image_index}.png"

    try:
        image = Image.open(image_path).convert("RGB")
        predicted_answer = vqa(question, image)

        results.append({
            "image": image_path,
            "question": question,
            "ground_truth": ground_truth,
            "predicted_answer": predicted_answer
        })

    except Exception as e:
        print(f"Error with image {image_path}: {e}")

df = pd.DataFrame(results)
df.to_csv("vqa_predictions.csv", index=False)


5. Evaluate the model by comparing answers with ground truth.
Do the comparison manually for the first 30 answers and print the accuracy.

In [13]:
print("First 30 Predictions:\n")

for i, r in enumerate(results[:30]):
    print(f"{i+1}. Q: {r['question']}")
    print(f"   Ground Truth:      {r['ground_truth']}")
    print(f"   Predicted Answer:  {r['predicted_answer']}")
    print("-" * 50)


First 30 Predictions:

1. Q: what is the blue shape?
   Ground Truth:      rectangle
   Predicted Answer:  rectangle
--------------------------------------------------
2. Q: what color is the shape?
   Ground Truth:      blue
   Predicted Answer:  black
--------------------------------------------------
3. Q: does the image contain a rectangle?
   Ground Truth:      yes
   Predicted Answer:  yes
--------------------------------------------------
4. Q: is there a triangle in the image?
   Ground Truth:      no
   Predicted Answer:  no
--------------------------------------------------
5. Q: is there a black shape?
   Ground Truth:      no
   Predicted Answer:  yes
--------------------------------------------------
6. Q: does the image not contain a gray shape?
   Ground Truth:      yes
   Predicted Answer:  yes
--------------------------------------------------
7. Q: is there a red shape in the image?
   Ground Truth:      no
   Predicted Answer:  no
------------------------------------

Accuracy = (correct × 100) / 30 = (22 × 100) / 30 = 73.33%
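
The same arithmetic can be done in code once each answer has been marked by hand. This is a small sketch, assuming you record one `True`/`False` mark per checked answer; `accuracy_percent` and `manual_marks` are hypothetical names, and the marks below only reproduce the count of 22 correct out of 30, not the actual per-question results.

```python
def accuracy_percent(marks):
    """Percentage of True entries in a list of manual correct/incorrect marks."""
    if not marks:
        raise ValueError("no marks to score")
    return 100 * sum(marks) / len(marks)

# Hypothetical marks: 22 correct and 8 incorrect, in no particular order.
manual_marks = [True] * 22 + [False] * 8
print(f"Accuracy: {accuracy_percent(manual_marks):.2f}%")  # prints "Accuracy: 73.33%"
```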


Bonus: These images were too simple. But guess what? Salesforce has pretrained its model on a wide variety of images, so you can use it on any image of your choice.

Try loading a real-world image of a natural scene, or anything else you like, and ask questions about it. You will be surprised by the results.

In [14]:
def ask_question_about_image(image_path, question):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, question, return_tensors="pt")
    output = model.generate(**inputs)
    answer = processor.decode(output[0], skip_special_tokens=True)
    return answer

image_path = "my_image.jpg"
question = "how many mountains are there?"

answer = ask_question_about_image(image_path, question)
print(f"Q: {question}")
print(f"A: {answer}")


Q: how many mountains are there?
A: 1


The End 🫡.

If you have any questions, try the following, in the given order:
- Google it!!
- ChatGPT
- Ask on the group