# 0. Define evaluation metrics

Some evaluation metrics for information retrieval systems like this semantic search engine are:
* __Mean Reciprocal Rank (MRR)__ is a ranking quality metric. It considers the position of the first relevant item in the ranked list.
You can calculate MRR as the mean of Reciprocal Ranks across all users or queries. 
A Reciprocal Rank is the inverse of the position of the first relevant item. If the first relevant item is in position 2, the reciprocal rank is 1/2. 
* __Normalized Discounted Cumulative Gain (NDCG)__ is a ranking quality metric. It compares rankings to an ideal order where all relevant items are at the top of the list.
NDCG at K is determined by dividing the Discounted Cumulative Gain (DCG) by the ideal DCG representing a perfect ranking. 
DCG measures the total item relevance in a list with a discount that helps address the diminishing value of items further down the list.
* __Recall at K__ measures the proportion of correctly identified relevant items in the top K recommendations out of the total number of relevant items in the dataset. In simpler terms, it indicates how many of the relevant items you could successfully find.
* __Precision at K__ is the ratio of correctly identified relevant items within the total recommended items inside the K-long list. Simply put, it shows how many recommended or retrieved items are genuinely relevant.
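
The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used later in the notebook: the function names and `toy_labels` are hypothetical, labels are assumed to be binary (1 = relevant, 0 = not relevant) and listed in ranked order, and the ideal DCG is computed from the retrieved labels only, which is an assumption forced by not knowing every relevant item in the dataset.

```python
import math


def reciprocal_rank(labels: list[int]) -> float:
    """Return 1/position of the first relevant item, or 0.0 if none is relevant."""
    for position, label in enumerate(labels, start=1):
        if label == 1:
            return 1.0 / position
    return 0.0


def precision_at_k(labels: list[int], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(labels[:k]) / k


def recall_at_k(labels: list[int], k: int, total_relevant: int) -> float:
    """Fraction of all relevant items in the dataset that appear in the top k."""
    return sum(labels[:k]) / total_relevant


def dcg_at_k(labels: list[int], k: int) -> float:
    """Sum of the labels, each discounted by log2(position + 1)."""
    return sum(label / math.log2(position + 1)
               for position, label in enumerate(labels[:k], start=1))


def ndcg_at_k(labels: list[int], k: int) -> float:
    """DCG divided by the DCG of the ideal ordering of the same labels."""
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal_dcg if ideal_dcg > 0 else 0.0


toy_labels = [0, 1, 0, 1, 0]  # hypothetical labels for one ranked list

print(reciprocal_rank(toy_labels))                      # 0.5
print(precision_at_k(toy_labels, k=5))                  # 0.4
print(recall_at_k(toy_labels, k=5, total_relevant=4))   # 0.5, assuming 4 relevant items exist
print(round(ndcg_at_k(toy_labels, k=5), 3))             # 0.651
```

Note that `recall_at_k` is the only function that needs `total_relevant`, a value that cannot be read off the retrieved list itself.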

As the definitions show, __Recall at K__ requires knowing the total number of relevant items for each query. The dataset used in this repository does not provide that information, so we skip this metric when evaluating the system on the set of test queries.
For the other metrics, we can label the retrieved results by hand and then compute the scores.

# 1. Implement the code for communicating with the API

In [8]:
import json
import requests
from sklearn.metrics import precision_score, ndcg_score

In [12]:
API_ENDPOINT = "http://0.0.0.0:5000/query"

In [17]:
def retrieve(api_endpoint: str,
             vector_to_search: str,
             k: int,
             text_query: str) -> list[str]:
    """
    Retrieve the most relevant images for a text query.
    :param api_endpoint: the API endpoint used to retrieve the relevant images.
    :param vector_to_search: if "text", retrieve the most relevant images based on the caption embeddings; if "image",
    retrieve them based on the image embeddings.
    :param k: the number of top-k most relevant images to retrieve.
    :param text_query: the user's query.
    :return: a list with the first caption of each retrieved image. Each caption is then labelled by hand as correctly retrieved (label 1) or not (label 0).
    """
    response = requests.post(api_endpoint,
                            data=json.dumps({
                                "text": text_query,
                                "k": k,
                                "vector_to_search": vector_to_search
                            })).json()
    return [metadata["captions"][0] for metadata in response]

To make labelling and evaluation easier, I retrieve only the first caption of each image. The following cell shows how the function works.

In [18]:
retrieve(api_endpoint=API_ENDPOINT,
         k=5,
         text_query="a cat playing alone",
         vector_to_search="image")

['A cat in between two cars in a parking lot.',
 'A cute kitten is sitting in a dish on a table.',
 'A black cat is inside a white toilet.',
 'a man sleeping with his cat next to him',
 'A cat eating a bird it has caught.']
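
The caption extraction can also be tried offline, without a running API. The sketch below assumes the response is a JSON list of objects that each carry a `"captions"` list, which is what the indexing in `retrieve` implies. The helper name `first_captions`, the sample payload, and the second caption in its first item are hypothetical. Unlike `retrieve`, it returns an empty string for an item without captions, so that rank positions stay aligned for labelling.

```python
def first_captions(response_items: list[dict]) -> list[str]:
    """Return the first caption of each item, or "" when an item has no captions."""
    captions = []
    for item in response_items:
        item_captions = item.get("captions") or []
        # Keep one entry per retrieved item so list positions still match ranks.
        captions.append(item_captions[0] if item_captions else "")
    return captions


# Hypothetical payload shaped like the API response assumed above.
sample_response = [
    {"captions": ["A black cat is inside a white toilet.",
                  "A cat sitting in a bathroom."]},
    {"captions": []},
    {"captions": ["A cat eating a bird it has caught."]},
]

print(first_captions(sample_response))
```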

# 2. Evaluation process

## 2.1 Define the test set

In [20]:
test_set = ["a cat playing alone",
            "photo of a car",
            "photo of a dog",
            "photo of a human with an animal"]

## 2.2 Retrieve the most relevant images

In [21]:
for query in test_set:
    print(retrieve(api_endpoint=API_ENDPOINT,
         k=5,
         text_query=query,
         vector_to_search="image"))

['A cat in between two cars in a parking lot.', 'A cute kitten is sitting in a dish on a table.', 'A black cat is inside a white toilet.', 'a man sleeping with his cat next to him', 'A cat eating a bird it has caught.']
['A small car is parked in front of a scooter', 'A car is stopped at a red light', 'Fog is in the air at an intersection with several traffic lights.', 'An old-fashioned green station wagon is parked on a shady driveway.', 'A cat in between two cars in a parking lot.']
['A door with a sticker of a cat door on it', "Two husky's hanging out of the car windows.", 'A black cat is inside a white toilet.', "A trio of dogs sitting in their owner's lap in a red convertible.", 'A fireplace with a fire built in it.']
['a man sleeping with his cat next to him', 'A man sits with a traditionally decorated cow', 'A shot of an elderly man inside a kitchen.', 'A black and white photo of an older man skiing.', 'A person holding a skateboard overlooks a dead field of crops.']


There is no need to store the retrieved captions in a variable. However, we do need to store the labels of the retrieved documents, because the metrics are computed from them.

## 2.3 Labelling retrieved results

In [22]:
true_labels = [
    [1, 1, 1, 1, 1],  # an image that shows a cat is labelled 1
    [1, 1, 0, 1, 1],  # an image that shows a car is labelled 1
    [0, 1, 0, 1, 0],  # an image that shows a dog is labelled 1
    [1, 1, 0, 0, 0],  # an image that shows a person with an animal is labelled 1
]  # hand-assigned relevance labels of the retrieved results, in ranked order

In [25]:
preds = [[1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1]
        ]  # every retrieved result counts as predicted relevant (label 1)

## 2.4 Evaluating 

In [41]:
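def mrr(y_preds: list[list[int]]) -> float:
    """
    Compute the Mean Reciprocal Rank (MRR).
    :param y_preds: one list of relevance labels (1 or 0) per query, in ranked order
    :return: the MRR score; a query with no relevant item contributes 0
    """
    reciprocal_rank_sum = 0.0
    for y_pred in y_preds:
        for index, pred in enumerate(y_pred):
            if pred == 1:
                # reciprocal rank of the first relevant item (ranks start at 1)
                reciprocal_rank_sum += 1 / (index + 1)
                break
    return reciprocal_rank_sum / len(y_preds)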

In [42]:
micro_precision_at_5 = precision_score(y_true=true_labels,
                                       y_pred=preds,
                                       average="micro")
ndcg_at_5 = ndcg_score(y_true=true_labels, y_score=preds)
mrr_at_5 = mrr(y_preds=true_labels)
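
One caveat about the `ndcg_score` call: `preds` gives every retrieved item the same score, so scikit-learn treats all five positions as tied and averages over the possible orderings. The resulting value depends only on how many items are relevant, not on where they appear in the list. The plain-Python sketch below, with hypothetical function names, contrasts that tie-averaged value with a rank-aware NDCG that uses the actual retrieval order. A possible alternative, under the same assumption that list order equals rank, is to pass descending scores such as `[5, 4, 3, 2, 1]` as `y_score`.

```python
import math


def discounts(n: int) -> list[float]:
    """Log2 rank discounts 1/log2(position + 1) for positions 1..n."""
    return [1.0 / math.log2(position + 1) for position in range(1, n + 1)]


def ideal_dcg(labels: list[int]) -> float:
    """DCG of the best possible ordering of the given labels."""
    ordered = sorted(labels, reverse=True)
    return sum(l * d for l, d in zip(ordered, discounts(len(labels))))


def tie_averaged_ndcg(labels: list[int]) -> float:
    """NDCG when all items share one score: every rank receives the mean label."""
    best = ideal_dcg(labels)
    if best == 0:
        return 0.0
    mean_label = sum(labels) / len(labels)
    return mean_label * sum(discounts(len(labels))) / best


def rank_aware_ndcg(labels: list[int]) -> float:
    """NDCG that uses the order in which the items were actually retrieved."""
    best = ideal_dcg(labels)
    if best == 0:
        return 0.0
    return sum(l * d for l, d in zip(labels, discounts(len(labels)))) / best


# The hand-assigned labels from section 2.3, in ranked order.
hand_labels = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
]

tied = sum(tie_averaged_ndcg(q) for q in hand_labels) / len(hand_labels)
ranked = sum(rank_aware_ndcg(q) for q in hand_labels) / len(hand_labels)
print("tie-averaged NDCG@5:", tied)
print("rank-aware NDCG@5:", ranked)

# Same labels, different order: only the rank-aware variant changes.
print(tie_averaged_ndcg([1, 1, 0, 0, 0]) == tie_averaged_ndcg([0, 0, 0, 1, 1]))  # True
print(rank_aware_ndcg([1, 1, 0, 0, 0]) > rank_aware_ndcg([0, 0, 0, 1, 1]))       # True
```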

In [45]:
print("Precision@5 is:", micro_precision_at_5)
print("NDCG@5 is:", ndcg_at_5)
print("MRR@5 is:", mrr_at_5)

Precision@5 is: 0.65
NDCG@5 is: 0.8417718099904842
MRR@5 is: 0.875
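
To see which queries drive these aggregate numbers, a per-query breakdown can help. The sketch below is an illustration in plain Python rather than part of the original evaluation: the helper `first_relevant_rank` and the variable names are hypothetical, while the queries and labels are the ones defined above. Because every query has the same K, the mean of the per-query precisions equals the micro-averaged precision.

```python
from typing import Optional

queries = ["a cat playing alone",
           "photo of a car",
           "photo of a dog",
           "photo of a human with an animal"]
hand_labels = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
]


def first_relevant_rank(labels: list[int]) -> Optional[int]:
    """Return the 1-based rank of the first relevant item, or None if there is none."""
    for rank, label in enumerate(labels, start=1):
        if label == 1:
            return rank
    return None


rows = []
for query, labels in zip(queries, hand_labels):
    precision = sum(labels) / len(labels)
    rank = first_relevant_rank(labels)
    reciprocal = 1 / rank if rank is not None else 0.0
    rows.append((query, precision, reciprocal))
    print(f"{query!r}: precision@5={precision:.2f}, reciprocal rank={reciprocal:.2f}")

mean_precision = sum(row[1] for row in rows) / len(rows)
mean_reciprocal_rank = sum(row[2] for row in rows) / len(rows)
print("mean precision@5:", round(mean_precision, 2))   # 0.65
print("MRR@5:", mean_reciprocal_rank)                  # 0.875
```

In this breakdown, the dog query and the human-with-animal query have the lowest precision (0.40 each), and the dog query is the only one whose first result is not relevant.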


# 3. Error Analysis

One error I noticed is that the CLIP model does not seem to understand the word "without". The following cells show some examples.

In [46]:
error_test_set = ["Image of a bird without a cat",
                 "Image of a people without an animal"]

## 3.1 Retrieving the images

In [48]:
for query in error_test_set:
    print(retrieve(api_endpoint=API_ENDPOINT,
         k=5,
         text_query=query,
         vector_to_search="image"))

['A cat in between two cars in a parking lot.', 'A cat eating a bird it has caught.', 'A door with a sticker of a cat door on it', 'A black cat is inside a white toilet.', 'A cute kitten is sitting in a dish on a table.']
['A man sits with a traditionally decorated cow', 'A man is sitting on a bench next to a bike.', 'A brown and black horse in the middle of the city eating grass.', 'A shot of an elderly man inside a kitchen.', 'A door with a sticker of a cat door on it']


## 3.2 Define labels

In [49]:
true_labels = [
    [0, 0, 0, 0, 0],  # an image that shows a bird and no cat is labelled 1
    [0, 1, 0, 1, 0],  # an image that shows a person and no animal is labelled 1
]  # hand-assigned relevance labels of the retrieved results, in ranked order

In [50]:
preds = [[1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1]
        ]  # every retrieved result counts as predicted relevant (label 1)

## 3.3 Compute metrics

In [51]:
micro_precision_at_5 = precision_score(y_true=true_labels,
                                       y_pred=preds,
                                       average="micro")
ndcg_at_5 = ndcg_score(y_true=true_labels, y_score=preds)
mrr_at_5 = mrr(y_preds=true_labels)

In [52]:
print("Precision@5 is:", micro_precision_at_5)
print("NDCG@5 is:", ndcg_at_5)
print("MRR@5 is:", mrr_at_5)

Precision@5 is: 0.2
NDCG@5 is: 0.36156788634492326
MRR@5 is: 0.25
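
A quick way to make the negation failure visible is to count how many retrieved captions still mention the excluded term. The sketch below is a crude keyword heuristic, offered only as an illustration: `mentions_term` is a hypothetical helper, and whole-word matching misses related words such as "kitten". The captions are the ones returned above for "Image of a bird without a cat".

```python
import re


def mentions_term(caption: str, term: str) -> bool:
    """Return True if the caption contains the term as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(term)}\b"
    return re.search(pattern, caption, flags=re.IGNORECASE) is not None


retrieved = [
    "A cat in between two cars in a parking lot.",
    "A cat eating a bird it has caught.",
    "A door with a sticker of a cat door on it",
    "A black cat is inside a white toilet.",
    "A cute kitten is sitting in a dish on a table.",
]
excluded_term = "cat"  # the term the query asked to exclude

violations = [c for c in retrieved if mentions_term(c, excluded_term)]
print(f"{len(violations)} of {len(retrieved)} captions mention '{excluded_term}'")
# prints: 4 of 5 captions mention 'cat'
```

The fifth caption shows a kitten, so every result for this query contains the animal the query asked to leave out.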


A possible way to fix this error is to fine-tune the pretrained CLIP model.