# Evaluation

In this notebook, we will walk through the whole pipeline of evaluating the performance of an embedding model.

## Step 0: Setup

Install the dependencies in the environment.

In [None]:
%pip install -U FlagEmbedding datasets faiss-cpu scikit-learn

## Step 1: Load Dataset

First, download the MS MARCO queries and corpus from the Hugging Face Hub:

In [4]:
from datasets import load_dataset

queries = load_dataset("namespace-Pt/msmarco", split="dev")
corpus = load_dataset("namespace-Pt/msmarco-corpus", split="train")

## Step 2: Text Embedding

In [None]:
from FlagEmbedding import FlagModel

# get the BGE embedding model
model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

# get the embedding of the corpus
corpus_embeddings = model.encode(corpus['content'])

print("shape of the corpus embeddings:", corpus_embeddings.shape)
print("data type of the embeddings: ", corpus_embeddings.dtype)

## Step 3: Indexing

In [None]:
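import faiss
import numpy as np

# Faiss index type. "Flat" (exact, exhaustive search) is an assumption here;
# any valid index_factory string can be substituted.
index_factory = "Flat"

# build an inner-product index whose dimension matches the embeddings
index = faiss.index_factory(corpus_embeddings.shape[-1], index_factory, faiss.METRIC_INNER_PRODUCT)

# optionally shard the index across all visible GPUs (requires a GPU build of faiss)
if faiss.get_num_gpus() > 0:
    co = faiss.GpuMultipleClonerOptions()
    co.useFloat16 = True
    index = faiss.index_cpu_to_all_gpus(index, co)

# faiss expects float32 vectors
corpus_embeddings = corpus_embeddings.astype(np.float32)

# train (a no-op for Flat indexes) and add the corpus embeddings
index.train(corpus_embeddings)
index.add(corpus_embeddings)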
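
For intuition, assuming the index is a flat (exhaustive) inner-product index, a search scores every corpus vector against the query and keeps the k highest scores. The pure-Python sketch below illustrates that idea on made-up 2-dimensional vectors; `brute_force_search` is a hypothetical helper written for this illustration, not part of Faiss.

```python
def brute_force_search(query, vectors, k):
    """Return (scores, ids) of the k vectors with the largest inner product with `query`."""
    # inner product between the query and every corpus vector
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    # ids of the k highest-scoring vectors, best first
    top_ids = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in top_ids], top_ids

# Hypothetical toy corpus of three unit-length vectors and one query.
toy_corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_query = [0.8, 0.6]

toy_scores, toy_ids = brute_force_search(toy_query, toy_corpus, k=2)
print(toy_ids, toy_scores)
```

The second corpus vector points in nearly the same direction as the query, so it ranks first. Faiss performs the same computation in optimized, batched form.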

## Step 4: Retrieval

In [None]:
query_embeddings = model.encode_queries(queries["query"])
ground_truths = [q["positive"] for q in queries]

In [None]:
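from tqdm import tqdm
import numpy as np

batch_size = 256  # assumed value; adjust to the available memory
k = 100           # retrieve as many results as the largest cut-off evaluated in Step 5

res_scores, res_ids, res_text = [], [], []
query_size = len(query_embeddings)

for i in tqdm(range(0, query_size, batch_size), desc="Searching"):
    q_embedding = query_embeddings[i: min(i+batch_size, query_size)].astype(np.float32)
    # search the top-k passages for each query in the batch
    score, idx = index.search(q_embedding, k=k)
    # store one entry per query rather than one per batch
    res_scores += list(score)
    res_ids += list(idx)
    for ids in idx:
        res_text.append(corpus[ids]["content"])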

## Step 5: Evaluate

### 5.1 Recall

Recall represents the model's capability of correctly predicting positive instances from all the actual positive samples in the dataset.

$$\textbf{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}$$

Recall is useful when the cost of false negatives is high. In other words, we are trying to find all objects of the positive class, even if this results in some false positives. This attribute makes recall a useful metric for text retrieval tasks.
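
As a minimal sketch of the formula for a single query, assume a ranked list `retrieved` and a set of relevant passages `relevant` (both made up for illustration, as is the helper `recall_at_k`). Recall@k counts how many relevant passages appear in the top k results and divides by the total number of relevant passages.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear among the top-k retrieved items."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))  # true positives within the top k
    return hits / len(relevant)                      # TP / (TP + FN)

# Hypothetical toy data: four ranked passages, two of which are relevant.
retrieved = ["p3", "p7", "p1", "p9"]
relevant = {"p1", "p7"}

toy_recalls = {k: recall_at_k(retrieved, relevant, k) for k in (1, 2, 3)}
print(toy_recalls)  # {1: 0.0, 2: 0.5, 3: 1.0}
```

Recall can only grow as k increases, which is why it is usually reported at several cut-offs.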

In [None]:
cut_offs = [1, 10, 100]

In [None]:
def calc_recall(preds, truths, cutoffs):
    recalls = np.zeros(len(cutoffs))
    for text, truth in zip(preds, truths):
        for i, c in enumerate(cutoffs):
            # relevant passages found among the top-c results
            hits = np.intersect1d(truth, text[:c])
            # at most c relevant passages fit in the top c, so cap the denominator at c
            recalls[i] += len(hits) / max(min(c, len(truth)), 1)
    recalls /= len(preds)
    return recalls

recalls = calc_recall(res_text, ground_truths, cut_offs)
for i, c in enumerate(cut_offs):
    print(f"recall@{c}: {recalls[i]}")

### 5.2 MRR

Mean Reciprocal Rank ([MRR](https://en.wikipedia.org/wiki/Mean_reciprocal_rank)) is a widely used metric in information retrieval to evaluate the effectiveness of a system. It measures the rank position of the first relevant result in a list of search results.

$$MRR=\frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{rank_i}$$

where 
- $|Q|$ is the total number of queries.
- $rank_i$ is the rank position of the first relevant document of the i-th query.
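
To make the formula concrete, here is a small worked example on made-up data for three queries. The helper `reciprocal_rank` is a hypothetical name used only for this illustration; a query with no relevant result contributes 0.

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical toy data: (ranked results, relevant items) for three queries.
toy_results = [
    (["a", "b", "c"], {"a"}),  # first relevant hit at rank 1 -> 1
    (["d", "e", "f"], {"f"}),  # first relevant hit at rank 3 -> 1/3
    (["g", "h", "i"], {"z"}),  # no relevant hit -> 0
]

toy_mrr = sum(reciprocal_rank(r, t) for r, t in toy_results) / len(toy_results)
print(f"MRR: {toy_mrr:.4f}")  # (1 + 1/3 + 0) / 3 = 0.4444
```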

In [None]:
def MRR(preds, truth, cutoffs):
    mrr = [0 for _ in range(len(cutoffs))]
    for pred, t in zip(preds, truth):
        for i, c in enumerate(cutoffs):
            for j, p in enumerate(pred):
                if j < c and p in t:
                    mrr[i] += 1/(j+1)
                    break
    mrr = [k/len(preds) for k in mrr]
    return mrr

In [None]:
mrr = MRR(res_text, ground_truths, cut_offs)
for i, c in enumerate(cut_offs):
    print(f"MRR@{c}: {mrr[i]}")

### 5.3 nDCG

Normalized Discounted Cumulative Gain (nDCG) measures the quality of a ranked list of search results by considering both the position of the relevant documents and their graded relevance scores. The calculation of nDCG involves two main steps:

1. Discounted Cumulative Gain (DCG) sums the gain of each result, discounted logarithmically by its rank position $i$, where $rel_i$ is the graded relevance of the result at position $i$:

$$DCG_p=\sum_{i=1}^p\frac{2^{rel_i}-1}{\log_2(i+1)}$$

2. DCG is then normalized by the ideal DCG (IDCG) to make it comparable across queries:

$$nDCG_p=\frac{DCG_p}{IDCG_p}$$

where $IDCG_p$ is the maximum possible DCG for a given set of documents, assuming they are perfectly ranked in order of relevance.
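
The two steps can be sketched by hand for a single ranked list. The labels below are made up, and the helpers `dcg` and `ndcg` are hypothetical names for this illustration. The ideal ordering here is obtained by sorting the labels of the retrieved results only, which is an assumption of this sketch.

```python
import math

def dcg(relevances, p):
    """DCG_p with gain 2^rel - 1 and discount log2(i + 1), where i is the 1-based rank."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:p], start=1))

def ndcg(relevances, p):
    """DCG_p divided by the DCG_p of the same labels sorted in descending order."""
    ideal = dcg(sorted(relevances, reverse=True), p)
    return dcg(relevances, p) / ideal if ideal > 0 else 0.0

# Hypothetical binary relevance labels of one ranked list (1 = relevant).
toy_relevances = [0, 1, 1, 0]

toy_ndcg = ndcg(toy_relevances, p=4)
print(f"nDCG@4: {toy_ndcg:.4f}")
```

With binary labels the gain $2^{rel_i}-1$ is either 0 or 1, so it coincides with a linear gain $rel_i$. A perfectly ordered list such as `[1, 1, 0, 0]` yields an nDCG of 1.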

In [None]:
pred_hard_encodings = []
for pred, label in zip(res_text, ground_truths):
    pred_hard_encoding = list(np.isin(pred, label).astype(int))
    pred_hard_encodings.append(pred_hard_encoding)

In [7]:
from sklearn.metrics import ndcg_score

for i, c in enumerate(cut_offs):
    nDCG = ndcg_score(pred_hard_encodings, res_scores, k=c)
    print(f"nDCG@{c}: {nDCG}")