# Quick Start

In this tutorial, we will show how to use BGE models on a text retrieval task in 5 minutes.

## Step 0: Preparation

First, install FlagEmbedding in the environment.

In [72]:
# !pip install -U FlagEmbedding

Below is a tiny corpus of only 10 sentences, which will serve as our dataset.

Each sentence is a description of a famous person in a specific domain.

In [73]:
corpus = [
    "Michael Jackson was a legendary pop icon known for his record-breaking music and dance innovations.",
    "Fei-Fei Li is a professor in Stanford University, revolutionized computer vision with the ImageNet project.",
    "Brad Pitt is a versatile actor and producer known for his roles in films like 'Fight Club' and 'Once Upon a Time in Hollywood.'",
    "Geoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.",
    "Eminem is a renowned rapper and one of the best-selling music artists of all time.",
    "Taylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.",
    "Sam Altman leads OpenAI as its CEO, with astonishing works of GPT series and pursuing safe and beneficial AI.",
    "Morgan Freeman is an acclaimed actor famous for his distinctive voice and diverse roles.",
    "Andrew Ng spread AI knowledge globally via public courses on Coursera and Stanford University.",
    "Robert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.",
]

We want to find out which of these people could be an expert on neural networks.

So we write the following query:

In [74]:
query = "Who could be an expert of neural network?"

## Step 1: Text -> Embedding

First, let's use a BGE embedding model to create sentence embeddings for the query and the corpus.

In [75]:
from FlagEmbedding import FlagModel

# get the BGE embedding model
model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

# get the embedding of the query and corpus
corpus_embeddings = model.encode(corpus)
query_embedding = model.encode(query)

The embedding of each sentence is a vector of length 768.

Run the following cell to check the shapes and to look at the first 10 elements of the query embedding vector.

In [76]:
print("shape of the query embedding:  ", query_embedding.shape)
print("shape of the corpus embeddings:", corpus_embeddings.shape)
print(query_embedding[:10])

shape of the query embedding:   (768,)
shape of the corpus embeddings: (10, 768)
[-0.00790005 -0.00683443 -0.00806659  0.00756918  0.04374858  0.02838556
  0.02357143 -0.02270943 -0.03611493 -0.03038301]


## Step 2: Calculate Similarity

Now we have the embeddings of the query and the corpus. The next step is to calculate the similarity between the query and each sentence in the corpus. Here we use the inner product of the query embedding with each corpus embedding.

In [77]:
sim_scores = query_embedding @ corpus_embeddings.T
print(sim_scores)

[0.39290053 0.6031525  0.32672375 0.6082418  0.39446455 0.35350388
 0.4626108  0.40196604 0.5284606  0.36792332]
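The `@` product above is an inner product, which coincides with cosine similarity only when the vectors have unit length. The sketch below illustrates this with small made-up vectors rather than real BGE embeddings. Whether `model.encode` returns unit-length vectors is an assumption here; printing `np.linalg.norm(corpus_embeddings, axis=1)` in the notebook would confirm it. The helper `l2_normalize` is a hypothetical name introduced for illustration.

```python
import numpy as np

# toy vectors, made up for illustration (not real BGE embeddings)
toy_query = np.array([1.0, 2.0, 2.0])
toy_corpus = np.array([[2.0, 4.0, 4.0],
                       [3.0, 0.0, 4.0],
                       [-1.0, -2.0, -2.0]])

def l2_normalize(x):
    """Scale vectors along the last axis to unit length (hypothetical helper)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# cosine similarity computed directly from its definition
cosine = (toy_corpus @ toy_query) / (
    np.linalg.norm(toy_corpus, axis=1) * np.linalg.norm(toy_query)
)

# inner product of the normalized vectors
inner = l2_normalize(toy_query) @ l2_normalize(toy_corpus).T

print(np.allclose(cosine, inner))
```

If the embeddings were not normalized, the inner product would also reward long vectors, so ranking by it could differ from ranking by cosine similarity.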


## Step 3: Ranking

Once we have the similarity score between the query and each sentence in the corpus, we can rank the sentences from the highest score to the lowest.

In [78]:
# get the indices in sorted order
sorted_indices = sorted(range(len(sim_scores)), key=lambda k: sim_scores[k], reverse=True)
print(sorted_indices)

[3, 1, 8, 6, 7, 4, 0, 9, 5, 2]
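When the scores are a NumPy array, the same ordering can be obtained with `np.argsort`. The following is a small sketch, not part of the original notebook: the scores are copied from the printed output of Step 2, and `top_k` is a hypothetical helper name.

```python
import numpy as np

# similarity scores copied from the printed output of Step 2
scores = np.array([0.39290053, 0.6031525, 0.32672375, 0.6082418, 0.39446455,
                   0.35350388, 0.4626108, 0.40196604, 0.5284606, 0.36792332])

def top_k(scores, k):
    """Return the indices of the k highest scores, best first (hypothetical helper)."""
    # argsort sorts ascending, so negate the scores to get descending order
    return np.argsort(-scores)[:k].tolist()

full_ranking = top_k(scores, len(scores))
top_3 = top_k(scores, 3)
print(full_ranking)
print(top_3)
```

In practice a retriever usually needs only the first few results, which is why the helper takes a cutoff `k` instead of always returning the full ranking.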


Now from the ranking, the sentence with index 3 is the best answer to our query "Who could be an expert of neural network?"

And that person is Geoffrey Hinton!

In [79]:
print(corpus[3])

Geoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.


Following the order of the indices, we can print the ranking of people that our little retriever produced.

In [80]:
# iteratively print the score and corresponding sentences in descending order

for i in sorted_indices:
    print(f"Score of {sim_scores[i]:.3f}: \"{corpus[i]}\"")

Score of 0.608: "Geoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning."
Score of 0.603: "Fei-Fei Li is a professor in Stanford University, revolutionized computer vision with the ImageNet project."
Score of 0.528: "Andrew Ng spread AI knowledge globally via public courses on Coursera and Stanford University."
Score of 0.463: "Sam Altman leads OpenAI as its CEO, with astonishing works of GPT series and pursuing safe and beneficial AI."
Score of 0.402: "Morgan Freeman is an acclaimed actor famous for his distinctive voice and diverse roles."
Score of 0.394: "Eminem is a renowned rapper and one of the best-selling music artists of all time."
Score of 0.393: "Michael Jackson was a legendary pop icon known for his record-breaking music and dance innovations."
Score of 0.368: "Robert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe."
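Score of 0.354: "Taylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music."
Score of 0.327: "Brad Pitt is a versatile actor and producer known for his roles in films like 'Fight Club' and 'Once Upon a Time in Hollywood.'"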

Not surprisingly, the similarity scores for the descriptions of Geoffrey Hinton (0.608) and Fei-Fei Li (0.603) are clearly higher than the others, followed by those of Andrew Ng (0.528) and Sam Altman (0.463).

The key phrase "neural network" from the query does not appear in any of these descriptions, which suggests that the BGE embedding model captures the semantic meaning of the query and the corpus well.
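To see why this matters, here is a minimal sketch of a naive keyword lookup over the same corpus. It is only an illustrative baseline (the variable names are introduced here), and it finds nothing, because none of the sentences contain the phrase or either of its words.

```python
corpus = [
    "Michael Jackson was a legendary pop icon known for his record-breaking music and dance innovations.",
    "Fei-Fei Li is a professor in Stanford University, revolutionized computer vision with the ImageNet project.",
    "Brad Pitt is a versatile actor and producer known for his roles in films like 'Fight Club' and 'Once Upon a Time in Hollywood.'",
    "Geoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.",
    "Eminem is a renowned rapper and one of the best-selling music artists of all time.",
    "Taylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.",
    "Sam Altman leads OpenAI as its CEO, with astonishing works of GPT series and pursuing safe and beneficial AI.",
    "Morgan Freeman is an acclaimed actor famous for his distinctive voice and diverse roles.",
    "Andrew Ng spread AI knowledge globally via public courses on Coursera and Stanford University.",
    "Robert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.",
]

key_phrase = "neural network"

# sentences that contain the exact phrase (case-insensitive)
phrase_hits = [s for s in corpus if key_phrase in s.lower()]

# sentences that contain at least one of the two words on its own
word_hits = [s for s in corpus if any(w in s.lower() for w in key_phrase.split())]

print(len(phrase_hits), len(word_hits))
```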

## Step 4: Evaluate

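We've seen that the embedding model performs well on the "neural network" query. What about its overall quality? Let's try a few more queries, each paired with a ground truth list holding the indices of the relevant sentences in the corpus.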

In [81]:
queries = [
    "Who could be an expert of neural network?",
    "Who had won Grammy 15 times?",
    "Who might had won Academy Awards?",
    "One of the most famous female singers.",
    "Inventor of AlexNet",
]

In [82]:
ground_truth = [
    [1, 3],
    [4],
    [2, 7, 9],
    [5],
    [3],
]

Here we will use the Mean Reciprocal Rank (MRR) to evaluate the performance.
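For reference, MRR averages, over all queries, the reciprocal of the rank at which the first relevant item appears. One common way to write it with a cutoff $k$ is shown below. The symbols $Q$ (the set of queries), $\mathrm{rank}_q$ (the 1-based position of the first relevant item for query $q$) and $\mathrm{RR}_k$ are notation introduced here for illustration.

```latex
\mathrm{MRR@}k = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{RR}_k(q),
\qquad
\mathrm{RR}_k(q) =
\begin{cases}
\dfrac{1}{\mathrm{rank}_q} & \text{if } \mathrm{rank}_q \le k, \\[6pt]
0 & \text{otherwise.}
\end{cases}
```

For example, a query whose first relevant sentence is ranked third contributes $1/3$ when $k = 5$ and $0$ when $k = 1$.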

In [83]:
def MRR(preds, labels, cutoffs):
    # accumulate, for each cutoff, the reciprocal rank of the first relevant hit
    mrr = [0 for _ in range(len(cutoffs))]
    for pred, label in zip(preds, labels):
        for i, c in enumerate(cutoffs):
            for j, index in enumerate(pred):
                if j < c and index in label:
                    mrr[i] += 1/(j+1)
                    break
    # average over all queries
    mrr = [k/len(preds) for k in mrr]
    return mrr

In [84]:
queries_embedding = model.encode(queries)
scores = queries_embedding @ corpus_embeddings.T
# rank the corpus indices for each query by descending similarity
rankings = [sorted(range(len(row)), key=lambda k: row[k], reverse=True) for row in scores]
rankings

[[3, 1, 8, 6, 7, 4, 0, 9, 5, 2],
 [5, 0, 4, 3, 1, 2, 7, 9, 8, 6],
 [3, 2, 5, 9, 0, 7, 1, 4, 6, 8],
 [5, 0, 4, 7, 1, 9, 2, 3, 6, 8],
 [3, 1, 8, 6, 0, 7, 5, 9, 4, 2]]

In [85]:
cutoffs = [1, 5]
mrrs = MRR(rankings, ground_truth, cutoffs)
for i, c in enumerate(cutoffs):
    print(f"MRR@{c}: {mrrs[i]}")

MRR@1: 0.6
MRR@5: 0.7666666666666666
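To see where these two averages come from, the sketch below recomputes the reciprocal rank of each query separately, using exact fractions. The rankings and ground truth are copied from the cells above; `reciprocal_rank` and `per_query` are hypothetical names introduced for this illustration.

```python
from fractions import Fraction

# rankings and ground truth copied from the cells above
rankings = [
    [3, 1, 8, 6, 7, 4, 0, 9, 5, 2],
    [5, 0, 4, 3, 1, 2, 7, 9, 8, 6],
    [3, 2, 5, 9, 0, 7, 1, 4, 6, 8],
    [5, 0, 4, 7, 1, 9, 2, 3, 6, 8],
    [3, 1, 8, 6, 0, 7, 5, 9, 4, 2],
]
ground_truth = [[1, 3], [4], [2, 7, 9], [5], [3]]

def reciprocal_rank(ranking, relevant, cutoff):
    """Reciprocal rank of the first relevant index in the top `cutoff`, else 0."""
    for position, index in enumerate(ranking[:cutoff], start=1):
        if index in relevant:
            return Fraction(1, position)
    return Fraction(0)

# one list of per-query reciprocal ranks for each cutoff
per_query = {
    c: [reciprocal_rank(r, g, c) for r, g in zip(rankings, ground_truth)]
    for c in (1, 5)
}

for c, rrs in per_query.items():
    mean = sum(rrs) / len(rrs)
    print(f"cutoff {c}: {[str(rr) for rr in rrs]} -> mean {float(mean):.4f}")
```

The breakdown shows that three of the five queries have a relevant sentence at rank 1. The Grammy query finds its relevant sentence (index 4, Eminem) only at rank 3, and the Academy Awards query finds its first relevant sentence (index 2, Brad Pitt) at rank 2, which is why the score rises from 3/5 at cutoff 1 to 23/30 at cutoff 5.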
