# Embedding

For text retrieval, pattern matching is the most intuitive approach: people search for specific characters, words, phrases, or sentence patterns. However, matching patterns between a query and a large collection of text files is extremely inefficient, for humans and computers alike.

Images and audio have natural numeric representations: RGB pixels and digital signals. Text needs a comparable representation before we can tackle more sophisticated natural language tasks such as retrieval, classification, clustering, or semantic search. This is where text embedding comes in.

## 1. Intro

Traditional text embedding methods like one-hot encoding and bag-of-words (BoW) represent words and sentences as sparse vectors based on their statistical features, such as word appearance and frequency within a document. More advanced methods like TF-IDF and BM25 improve on these by considering a word's importance across an entire corpus, while n-gram techniques capture word order in small groups. However, these approaches suffer from the "curse of dimensionality" and fail to capture semantic similarity, such as that between "cat" and "kitty", or differences in meaning that depend on word order, such as "play the watch" versus "watch the play".

In [1]:
# example of bag-of-words
sentence1 = "I love basketball"
sentence2 = "I have a basketball match"

words = ['I', 'love', 'basketball', 'have', 'a', 'match']
sen1_vec = [1, 1, 1, 0, 0, 0]
sen2_vec = [1, 0, 1, 1, 1, 1]
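The vectors above are written by hand. As a minimal sketch, they can also be computed. The helper names `build_vocabulary` and `bag_of_words` are hypothetical, and tokenization is assumed to be naive whitespace splitting. The last lines illustrate the word-order limitation: "play the watch" and "watch the play" receive the same vector.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Collect unique words in order of first appearance (naive whitespace tokenization)."""
    vocabulary = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

sentence1 = "I love basketball"
sentence2 = "I have a basketball match"

words = build_vocabulary([sentence1, sentence2])
sen1_vec = bag_of_words(sentence1, words)
sen2_vec = bag_of_words(sentence2, words)
print(words)
print(sen1_vec)
print(sen2_vec)

# Word order is lost: both phrases map to the same count vector.
order_vocab = build_vocabulary(["play the watch"])
same_vector = bag_of_words("play the watch", order_vocab) == bag_of_words("watch the play", order_vocab)
print(same_vector)
```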

To overcome these limitations, dense word embeddings were developed, mapping words to vectors in a low-dimensional space that captures semantic and relational information. Early models like Word2Vec demonstrated the power of dense embeddings using neural networks. Subsequent advancements with neural network architectures like RNNs, LSTMs, and Transformers have enabled more sophisticated models such as BERT, RoBERTa, and GPT to excel in capturing complex word relationships and contexts. **BAAI General Embedding (BGE)** provides a series of open-source models that cover a wide range of needs.

## 2. BAAI General Embedding

In this part, we will walk through the BGE series and show how to use these embedding models.

First, install FlagEmbedding in your environment.

In [None]:
%pip install -U FlagEmbedding

### 2.1 BGE

The first version of BGE comprises 6 models: 'large', 'base', and 'small' variants for each of English and Chinese.

| Model  | Language |   Parameters   |   Model Size   |    Description    |   Base Model     |
|:-------|:--------:|:--------------:|:--------------:|:-----------------:|:----------------:|
| [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en)   | English |    335M    |    1.34 GB   |              embedding model that maps text into vectors                            |  BERT  |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en)     | English |    109M    |    438 MB    |          a base-scale model but with similar ability to `bge-large-en`  |  BERT  |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en)   | English |    33.4M   |    133 MB    |          a small-scale model but with competitive performance                    |  BERT  |
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh)   | Chinese |    326M    |    1.3 GB    |              embedding model that maps text into vectors                            |  BERT  |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh)     | Chinese |    102M    |    409 MB    |           a base-scale model but with similar ability to `bge-large-zh`           |  BERT  |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh)   | Chinese |    24M     |    95.8 MB   |           a small-scale model but with competitive performance                    |  BERT  |

For inference, import FlagModel from FlagEmbedding and initialize the model.

In [10]:
from FlagEmbedding import FlagModel

# Load BGE model
model = FlagModel('BAAI/bge-base-en',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

queries = ["query 1", "query 2"]
corpus = ["passage 1", "passage 2"]

# encode the queries and corpus
q_embeddings = model.encode(queries)
p_embeddings = model.encode(corpus)

# compute the similarity scores
scores = q_embeddings @ p_embeddings.T
print(scores)

[[0.8888277  0.82843924]
 [0.80761224 0.8892383 ]]
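The matrix product above yields one score per query-passage pair. When the embeddings are L2-normalized, this inner product equals the cosine similarity. The sketch below illustrates the idea with hand-made 2-dimensional vectors, which are not real model outputs. `l2_normalize` and `similarity_matrix` are hypothetical helpers, and whether a particular model normalizes its output is an assumption worth checking for your setup.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def similarity_matrix(query_vecs, passage_vecs):
    """Inner product of every query vector with every passage vector."""
    return [[sum(a * b for a, b in zip(q, p)) for p in passage_vecs]
            for q in query_vecs]

# Toy vectors for illustration only.
toy_queries = [l2_normalize(v) for v in ([3.0, 4.0], [1.0, 0.0])]
toy_passages = [l2_normalize(v) for v in ([6.0, 8.0], [0.0, 2.0])]

toy_scores = similarity_matrix(toy_queries, toy_passages)
for row in toy_scores:
    print(row)
```

Vectors pointing in the same direction score close to 1 regardless of their original length, and orthogonal vectors score 0.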


To use `FlagModel`:
```
FlagModel.encode(sentences, batch_size=256, max_length=512, convert_to_numpy=True)
```
The *encode()* function directly encodes the input sentences into embedding vectors.
```
FlagModel.encode_queries(sentences, batch_size=256, max_length=512, convert_to_numpy=True)
```
The *encode_queries()* function prepends the `query_instruction_for_retrieval` to each input query, and then calls `encode()`.
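To make the difference between the two functions concrete, here is a minimal sketch of the text preparation step only. `add_query_instruction` is a hypothetical helper, not part of FlagEmbedding, and the exact concatenation format (for example, whether a separator is inserted) is an assumption.

```python
def add_query_instruction(queries, instruction):
    """Prepend the retrieval instruction to every query (direct concatenation assumed)."""
    return [instruction + query for query in queries]

instruction = "Represent this sentence for searching relevant passages:"
queries = ["query 1", "query 2"]
corpus = ["passage 1", "passage 2"]

prepared_queries = add_query_instruction(queries, instruction)
for text in prepared_queries:
    print(text)

# Passages are encoded as they are, without any instruction.
prepared_corpus = list(corpus)
```

Only the query side is modified. Passages carry no instruction, so a corpus can be embedded once and reused for every query.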

### 2.2 BGE 1.5

BGE 1.5 alleviates the similarity distribution issue and enhances retrieval ability without an instruction.

| Model  | Language |   Parameters   |   Model Size   |    Description    |   Base Model     |
|:-------|:--------:|:--------------:|:--------------:|:-----------------:|:----------------:|
| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5)   | English |    335M    |    1.34 GB   |     version 1.5 with more reasonable similarity distribution      |   BERT   |
| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5)     | English |    109M    |    438 MB    |     version 1.5 with more reasonable similarity distribution      |   BERT   |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)   | English |    33.4M   |    133 MB    |     version 1.5 with more reasonable similarity distribution      |   BERT   |
| [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5)   | Chinese |    326M    |    1.3 GB    |     version 1.5 with more reasonable similarity distribution      |   BERT   |
| [BAAI/bge-base-zh-v1.5](https://huggingface.co/BAAI/bge-base-zh-v1.5)     | Chinese |    102M    |    409 MB    |     version 1.5 with more reasonable similarity distribution      |   BERT   |
| [BAAI/bge-small-zh-v1.5](https://huggingface.co/BAAI/bge-small-zh-v1.5)   | Chinese |    24M     |    95.8 MB   |     version 1.5 with more reasonable similarity distribution      |   BERT   |

BGE 1.5 models share the same `FlagModel` API as the BGE models.

In [11]:
model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

queries = ["query 1", "query 2"]
corpus = ["passage 1", "passage 2"]

# encode the queries and corpus
q_embeddings = model.encode(queries)
p_embeddings = model.encode(corpus)

# compute the similarity scores
scores = q_embeddings @ p_embeddings.T
print(scores)

[[0.736794  0.5989914]
 [0.5684842 0.7461165]]
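Each row of the score matrix belongs to one query and each column to one passage, so retrieval reduces to sorting a row in descending order. The sketch below reuses the scores printed above. `rank_passages` is a hypothetical helper introduced for illustration.

```python
# Scores copied from the output above: rows are queries, columns are passages.
scores = [[0.736794, 0.5989914],
          [0.5684842, 0.7461165]]
queries = ["query 1", "query 2"]
corpus = ["passage 1", "passage 2"]

def rank_passages(score_row):
    """Return passage indices sorted from most to least similar."""
    return sorted(range(len(score_row)), key=lambda j: score_row[j], reverse=True)

rankings = [rank_passages(row) for row in scores]
for query, ranking in zip(queries, rankings):
    print(query, "->", [corpus[j] for j in ranking])
```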


### 2.3 LLM-Embedder

LLM-Embedder is a unified embedding model supporting diverse retrieval augmentation needs for LLMs. It is fine-tuned over 6 tasks:
- Question Answering (qa)
- Conversational Search (convsearch)
- Long Conversation (chat)
- Long-Range Language Modeling (lrlm)
- In-Context Learning (icl)
- Tool Learning (tool)

| Model  | Language |   Parameters   |   Model Size   |    Description    |   Base Model     |
|:-------|:--------:|:--------------:|:--------------:|:-----------------:|:----------------:|
| [BAAI/llm-embedder](https://huggingface.co/BAAI/llm-embedder)             |   English | 109M |  438 MB  |      a unified embedding model to support diverse retrieval augmentation needs for LLMs       | BERT |

To use `LLMEmbedder`:
```
LLMEmbedder.encode_queries(queries, batch_size=256, max_length=256, task='qa')
```
*encode_queries()* calls *_encode()* (similar to *encode()* in `FlagModel`) and prepends the query instruction for the given *task* to each of the input *queries*.
```
LLMEmbedder.encode_keys(keys, batch_size=256, max_length=512, task='qa')
```
Similarly, *encode_keys()* also calls *_encode()* and automatically adds the key instruction for the given *task*.
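The task-dependent behaviour can be pictured as a lookup from task name to a pair of instructions, one for queries and one for keys. The sketch below is an illustration under stated assumptions: the instruction strings are placeholders, not the ones shipped with LLM-Embedder, and `TASK_INSTRUCTIONS` and `with_instruction` are hypothetical names.

```python
# Placeholder instructions for illustration only; NOT the strings used by LLM-Embedder.
TASK_INSTRUCTIONS = {
    "qa": {"query": "<qa query instruction> ", "key": "<qa key instruction> "},
    "tool": {"query": "<tool query instruction> ", "key": "<tool key instruction> "},
}

def with_instruction(texts, task, side):
    """Prepend the instruction for the given task and side ('query' or 'key')."""
    if task not in TASK_INSTRUCTIONS:
        raise ValueError(f"unknown task: {task}")
    instruction = TASK_INSTRUCTIONS[task][side]
    return [instruction + text for text in texts]

prepared_queries = with_instruction(["test query 1"], task="qa", side="query")
prepared_keys = with_instruction(["test key 1"], task="qa", side="key")
print(prepared_queries)
print(prepared_keys)
```

Because the instruction differs per task, the same text yields a different embedding for each task. Queries and keys should therefore be encoded with the same `task` value.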

In [12]:
from FlagEmbedding import LLMEmbedder

# load the LLMEmbedder model
model = LLMEmbedder('BAAI/llm-embedder', use_fp16=False)

# Define queries and keys
queries = ["test query 1", "test query 2"]
keys = ["test key 1", "test key 2"]

# Encode for a specific task (qa, icl, chat, lrlm, tool, convsearch)
task = "qa"
query_embeddings = model.encode_queries(queries, task=task)
key_embeddings = model.encode_keys(keys, task=task)

# compute the similarity scores
similarity = query_embeddings @ key_embeddings.T
print(similarity)

[[0.89705944 0.85341793]
 [0.8462474  0.90914035]]


### 2.4 BGE M3

BGE-M3 is a newer generation of BGE models, distinguished by its versatility in:
- Multi-Functionality: Simultaneously performs the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
- Multi-Linguality: Supports more than 100 working languages.
- Multi-Granularity: Can process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.

| Model  | Language |   Parameters   |   Model Size   |    Description    |   Base Model     |
|:-------|:--------:|:--------------:|:--------------:|:-----------------:|:----------------:|
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)                   |    Multilingual     |   568M   |  2.27 GB  |  Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) | XLM-RoBERTa |

In [13]:
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]


```
BGEM3FlagModel.encode(
    sentences, 
    batch_size=12, 
    max_length=8192, 
    return_dense=True, 
    return_sparse=False, 
    return_colbert_vecs=False
)
```
It returns a dictionary like:
```
{
    'dense_vecs': array of dense embeddings if return_dense=True, otherwise None,
    'lexical_weights': array of dictionaries mapping token ids to their corresponding weights if return_sparse=True, otherwise None,
    'colbert_vecs': array of multi-vector embeddings if return_colbert_vecs=True, otherwise None
}
```

#### 2.4.1 Dense Retrieval

Using BGE M3 for dense embedding is almost the same as using the BGE or BGE 1.5 models.

Note that the dense embedding vector has 1024 dimensions, rather than the 768 dimensions of the base-scale models used above.

In [14]:
# If you don't need such a long length of 8192 input tokens, you can set max_length to a smaller value to speed up encoding.
embeddings_1 = model.encode(sentences_1, max_length=10)['dense_vecs']
embeddings_2 = model.encode(sentences_2, max_length=100)['dense_vecs']

# compute the similarity scores
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

[[0.6259035  0.34749585]
 [0.349868   0.6782462 ]]


#### 2.4.2 Sparse Retrieval

Set `return_sparse` to `True` to make the model return sparse vectors. If a token appears multiple times in a sentence, only its maximum weight is retained.

In [15]:
output_1 = model.encode(sentences_1, return_sparse=True)
output_2 = model.encode(sentences_2, return_sparse=True)

# you can see the weight for each token:
print(model.convert_id_to_token(output_1['lexical_weights']))

[{'What': 0.08362077, 'is': 0.081469566, 'B': 0.12964639, 'GE': 0.25186998, 'M': 0.17001738, '3': 0.26957875, '?': 0.040755156}, {'De': 0.050144322, 'fin': 0.13689369, 'ation': 0.045134712, 'of': 0.06342201, 'BM': 0.25167602, '25': 0.33353207}]


Based on the token weights of the query and the passage, the relevance score between them is computed from the joint importance of the terms that occur in both, where $w_{qt}$ and $w_{pt}$ are the weights of term $t$ in the query $q$ and the passage $p$:

$$s_{lex} = \sum_{t\in q\cap p}(w_{qt} * w_{pt})$$
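As a minimal sketch, the formula can be written in plain Python over token-to-weight dictionaries. `lexical_matching_score` is a hypothetical helper keyed by token strings for readability. The two query dictionaries are copied from the output above, while `toy_passage` is made up for illustration.

```python
def lexical_matching_score(query_weights, passage_weights):
    """Sum the products of weights over tokens shared by query and passage."""
    score = 0.0
    for token, query_weight in query_weights.items():
        if token in passage_weights:
            score += query_weight * passage_weights[token]
    return score

# Token weights copied from the printed output above.
query_1 = {'What': 0.08362077, 'is': 0.081469566, 'B': 0.12964639, 'GE': 0.25186998,
           'M': 0.17001738, '3': 0.26957875, '?': 0.040755156}
query_2 = {'De': 0.050144322, 'fin': 0.13689369, 'ation': 0.045134712,
           'of': 0.06342201, 'BM': 0.25167602, '25': 0.33353207}

# Made-up passage weights, for illustration only.
toy_passage = {'B': 0.2, 'GE': 0.3, 'is': 0.1, 'model': 0.4}

no_overlap = lexical_matching_score(query_1, query_2)
some_overlap = lexical_matching_score(query_1, toy_passage)
print(no_overlap)
print(some_overlap)
```

The two queries share no token, so their score is zero. Only the tokens `B`, `GE`, and `is` contribute to the score against `toy_passage`.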

In [16]:
# compute the scores via lexical matching
score_1 = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(score_1)

score_2 = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1])
print(score_2)

0.19554448500275612
0


#### 2.4.3 Multi-Vector

The multi-vector method utilizes the entire output embeddings for the representation of query and passage.

In [17]:
output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)

print(f"({len(output_1['colbert_vecs'][0])}, {len(output_1['colbert_vecs'][0][0])})")

(8, 1024)


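Following ColBERT, we use late interaction to compute a fine-grained relevance score, where $E_q$ and $E_p$ are the token-level embeddings of the query and the passage, and $N$ and $M$ are the numbers of query and passage tokens: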
$$s_{mul}=\frac{1}{N}\sum_{i=1}^N\max_{j=1}^M E_q[i]\cdot E_p^T[j]$$
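The late-interaction formula can be sketched in plain Python: every query token vector is matched with its best-scoring passage token vector, and the maxima are averaged. `late_interaction_score` is a hypothetical helper, and the 2-dimensional vectors are toy values rather than model outputs.

```python
def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

def late_interaction_score(query_vecs, passage_vecs):
    """Average, over query tokens, of the maximum similarity to any passage token."""
    total = 0.0
    for q in query_vecs:
        total += max(dot(q, p) for p in passage_vecs)
    return total / len(query_vecs)

# Toy token embeddings: N = 2 query tokens, M = 2 passage tokens.
E_q = [[1.0, 0.0], [0.0, 1.0]]
E_p = [[1.0, 0.0], [0.6, 0.8]]

s_mul = late_interaction_score(E_q, E_p)
print(s_mul)
```

The first query token matches the first passage token exactly (similarity 1.0), while the second is best matched by the second passage token (similarity 0.8), so the score is their mean.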

In [18]:
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]).item())
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]).item())

0.7796662449836731
0.4621177911758423
