# Embedding

For text retrieval, pattern matching is the most intuitive approach: people search for certain characters, words, phrases, or sentence patterns. However, matching patterns between a query and a large collection of text files is extremely inefficient, for humans and computers alike. Images are represented as RGB pixels and audio as digital signals; similarly, we need a numerical way to represent text. This is where text embedding comes onto the stage.

## 1. Background

Based on the statistical features of words in sentences/documents, techniques like *one-hot encoding* and *bag-of-words (BoW)* represent words and sentences as sparse vectors. These classical approaches only consider the appearance and frequency of words in a single document.

In [None]:
# example of one-hot encoding
words = ['code', 'phone', 'cup', 'apple']
code = [1, 0, 0, 0]
phone = [0, 1, 0, 0]
cup = [0, 0, 1, 0]
apple = [0, 0, 0, 1]

In [None]:
# example of bag-of-words
sentence1 = "I love basketball"
sentence2 = "I have a basketball match"

words = ['I', 'love', 'basketball', 'have', 'a', 'match']
sen1_vec = [1, 1, 1, 0, 0, 0]
sen2_vec = [1, 0, 1, 1, 1, 1]
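
The hand-written vectors above can also be derived programmatically. The following is a minimal sketch, assuming plain whitespace tokenization; the helper names `build_vocab` and `bag_of_words` are illustrative and not part of any library.

```python
from collections import Counter

def build_vocab(sentences):
    """Collect unique tokens in order of first appearance."""
    vocab = []
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab.append(token)
    return vocab

def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

sentences = ["I love basketball", "I have a basketball match"]
vocab = build_vocab(sentences)
vectors = [bag_of_words(s, vocab) for s in sentences]

print(vocab)
print(vectors)
```

Each vector has one slot per vocabulary word, so its length grows with the vocabulary while most entries stay zero.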


*TF-IDF* measures how important a word is to a document with respect to the whole corpus, which helps in applications such as text classification and filtering. *BM25* is a well-known ranking algorithm based on TF-IDF.
*N-gram* methods capture the order of words within a window of size $n$, taking a step forward from individual words to groups of words.

The TF-IDF score of a term $t$ in a document $d$ from a corpus $D$ is defined as:

$$\text{tf-idf}(t,d,D)=\text{tf}(t,d)\cdot\text{idf}(t,D)$$
where $f_{t,d}$ is the number of times $t$ occurs in $d$, $N$ is the total number of documents in $D$, and:
$$\text{tf}(t,d)=\frac{f_{t,d}}{\sum_{t'\in{d}} f_{t',d}}$$
$$\text{idf}(t,D)=\log{\frac{N}{|\{d:d\in{D}\text{ and }t\in{d}\}|}}$$
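
To make the TF-IDF definition concrete, here is a small sketch that follows the formulas term by term. The three-document corpus is made up for illustration, and the function names are our own rather than a library API.

```python
import math

def tf(term, doc):
    """Relative frequency of `term` in a tokenized document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(N / number of documents containing `term`).

    Assumes the term occurs in at least one document; otherwise the
    denominator would be zero.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Toy corpus, already lower-cased and split into tokens
corpus = [
    "i love basketball".split(),
    "i have a basketball match".split(),
    "i love pizza".split(),
]

print(tf_idf("i", corpus[0], corpus))      # "i" appears in every document
print(tf_idf("pizza", corpus[2], corpus))  # "pizza" appears in one document
```

A word that appears in every document gets an idf of $\log 1 = 0$, so it contributes nothing, while a rarer word receives a higher weight.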

However, there is still a long way to go in bridging computers and human natural language, and the shortcomings of these methods cannot be ignored. First, they suffer from the "curse of dimensionality": the vector size grows with the vocabulary, which makes them hard to scale as datasets grow while computing power stays limited.
Second, what about words like "cat", "kitty", and "feline", which share a similar meaning but have completely different lexical forms?
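
The second problem is easy to see with one-hot vectors. In the sketch below (a four-word vocabulary chosen for illustration), the similarity between two synonyms is exactly the same as between two unrelated words.

```python
import numpy as np

vocab = ["cat", "kitty", "feline", "car"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector of vocab[i]

cat, kitty, car = one_hot[0], one_hot[1], one_hot[3]

print(cat @ kitty)  # similarity between two synonyms
print(cat @ car)    # similarity between two unrelated words
```

Both dot products are zero because distinct one-hot vectors are always orthogonal, so the representation carries no information about meaning.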

To overcome these limitations, researchers came up with dense word embeddings. The key idea is to map each word to a vector in a low-dimensional space that captures the semantic and relational information of the words. The rise of neural networks provided a natural way to build such models: people can tune the network structure and the number of parameters to fit their available computing power and training time.

<center>
    <img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*OEmWDt4eztOcm5pr2QbxfA.png" width = 750>
    <figcaption>Fig.1 - Word2Vec (<a href="https://towardsdatascience.com/creating-word-embeddings-coding-the-word2vec-algorithm-in-python-using-deep-learning-b337d0ba17a8">source</a>)</figcaption>
</center>

*Word2Vec* showed the power of dense embeddings. It uses a relatively small neural network with only one hidden layer to learn word embeddings that carry hidden semantic information, as in the famous example:
$$king - man + woman \approx queen$$
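
The arithmetic behind this example can be illustrated with a toy sketch. The 2-D vectors below are hand-made for illustration (one axis loosely standing for "royalty", the other for "gender"); they are not vectors learned by Word2Vec, whose real embeddings have hundreds of dimensions and only approximately satisfy the relation.

```python
import numpy as np

# Hand-made toy vectors (NOT learned embeddings):
# axis 0 ~ "royalty", axis 1 ~ "gender" (+1 male, -1 female)
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = vectors["king"] - vectors["man"] + vectors["woman"]

# Rank the remaining words (excluding the three query words) by cosine similarity
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))

print(best)
```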

Following that, with the development of neural network architectures that perform well on NLP tasks (RNN, CNN, LSTM, etc.), further work on dense embeddings made good progress. Models like *CoVe*, *ULMFiT*, and *ELMo*, based on variants of LSTM, all showed strong results.

**Transformer**, the key architecture behind most current LLMs, improves on LSTM's ability to memorize context and the relationships between words. It also supports parallel training on GPUs with tensor operations, which makes it possible to train **larger** models on **larger** datasets. Pre-trained language models like *BERT*, *RoBERTa*, *T5*, and *GPT* have been used for encoding tasks.
Building on them, models and paradigms like *Sentence-BERT*, *Condenser*, and *RetroMAE* build bi-encoders and cross-encoders with SOTA performance on semantic textual similarity benchmarks.

Now, in the era of large language models, we have seen the potential of scaling up model size and training data to accomplish more sophisticated tasks. The same holds for embedding models. Starting from different base models of reasonable size, we can train embedding models that offer multi-functionality, multilinguality, multi-granularity, in-context learning ability, etc.

## 2. BGE Embedding Models

BGE stands for **B**AAI **G**eneral **E**mbedding.

Embedding Models:

| Model  | Language |   Parameters   |   Model Size   |    Description    |   Base Model     |
|:-------|:--------:|:--------------:|:--------------:|:-----------------:|:----------------:|
| [BAAI/bge-en-icl](https://huggingface.co/BAAI/bge-en-icl) | English | 7.11B |  28.5 GB  | An LLM-based embedding model with in-context learning capabilities, which can fully leverage the model's potential based on few-shot examples | Mistral-7B |
| [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2) |  Multilingual   |  9.24B  |  37 GB  | An LLM-based multilingual embedding model, trained on a diverse range of languages and tasks |    Gemma2-9B    |
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)                   |    Multilingual     |   568M   |  2.27 GB  |  Multi-Functionality (dense retrieval, sparse retrieval, multi-vector (ColBERT)), Multi-Linguality, and Multi-Granularity (8192 tokens) | XLM-RoBERTa |
| [BAAI/llm-embedder](https://huggingface.co/BAAI/llm-embedder)             |   English | 109M |  438 MB  |      A unified embedding model to support diverse retrieval augmentation needs for LLMs       | BERT |



BGE v1.5:

| Model  | Language |   Parameters   |   Model Size   |    Description    |   Base Model     |
|:-------|:--------:|:--------------:|:--------------:|:-----------------:|:----------------:|
| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5)   | English |    335M    |    1.34 GB   |     version 1.5 with more reasonable similarity distribution      |   BERT   |
| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5)     | English |    109M    |    438 MB    |     version 1.5 with more reasonable similarity distribution      |   BERT   |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)   | English |    33.4M   |    133 MB    |     version 1.5 with more reasonable similarity distribution      |   BERT   |
| [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5)   | Chinese |    326M    |    1.3 GB    |     version 1.5 with more reasonable similarity distribution      |   BERT   |
| [BAAI/bge-base-zh-v1.5](https://huggingface.co/BAAI/bge-base-zh-v1.5)     | Chinese |    102M    |    409 MB    |     version 1.5 with more reasonable similarity distribution      |   BERT   |
| [BAAI/bge-small-zh-v1.5](https://huggingface.co/BAAI/bge-small-zh-v1.5)   | Chinese |    24M     |    95.8 MB   |     version 1.5 with more reasonable similarity distribution      |   BERT   |

BGE v1.0:

| Model  | Language |   Parameters   |   Model Size   |    Description    |   Base Model     |
|:-------|:--------:|:--------------:|:--------------:|:-----------------:|:----------------:|
| [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en)   | English |    335M    |    1.34 GB   |              Embedding model which maps text into vectors                            |  BERT  |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en)     | English |    109M    |    438 MB    |          a base-scale model but with similar ability to `bge-large-en`  |  BERT  |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en)   | English |    33.4M   |    133 MB    |          a small-scale model but with competitive performance                    |  BERT  |
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh)   | Chinese |    326M    |    1.3 GB    |              Embedding model which maps text into vectors                            |  BERT  |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh)     | Chinese |    102M    |    409 MB    |           a base-scale model but with similar ability to `bge-large-zh`           |  BERT  |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh)   | Chinese |    24M     |    95.8 MB   |           a small-scale model but with competitive performance                    |  BERT  |


Now, install the FlagEmbedding package:

In [None]:
!pip install -U FlagEmbedding

In [None]:
# If you are unable to connect to Hugging Face, uncomment the following line to use the mirror

# import os
# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

The following blocks give very simple examples of how a BGE embedding model works. Note that downloading the model may take a while the first time you use it.

bge-base-en-v1.5 is about 438 MB. Feel free to play with other models mentioned above.

In [5]:
from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)
# Setting use_fp16 to True speeds up computation with a slight performance degradation

In [10]:
sentences_1 = ["A cat is playing with yarn", "Today's lunch lunch is pizza"]
sentences_2 = ["She has three kittens", "He orders Dominos takeout everyday"]

In [11]:
embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
print("shape of the embedding: ", embeddings_1.shape)

shape of the embedding:  (2, 768)


Here we use the dot product directly to compute similarity.

In [12]:
similarity = embeddings_1 @ embeddings_2.T
print("similarity: \n", similarity)

similarity: 
 [[0.58180463 0.34935987]
 [0.4080751  0.62310696]]
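
Using the dot product as a similarity score is equivalent to cosine similarity only when the vectors have unit length, so it is worth verifying that the embeddings of the model you use are normalized. The sketch below checks this relationship with random stand-in vectors (not real model outputs); `l2_normalize` and `cosine_similarity` are our own helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(2, 768))  # random stand-ins for embeddings_1
b = rng.normal(size=(2, 768))  # random stand-ins for embeddings_2

def l2_normalize(x):
    """Scale every row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cosine_similarity(x, y):
    """Pairwise cosine similarity between the rows of x and the rows of y."""
    return l2_normalize(x) @ l2_normalize(y).T

a_unit, b_unit = l2_normalize(a), l2_normalize(b)
dot = a_unit @ b_unit.T        # dot product of unit-length vectors
cos = cosine_similarity(a, b)  # cosine similarity of the raw vectors

print(np.allclose(dot, cos))
```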


For s2p (short query to long passage) retrieval tasks, we suggest using *encode_queries()*, which automatically adds the instruction to each query.

The corpus in a retrieval task can still use *encode()* or *encode_corpus()*, since passages don't need the instruction.
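
Conceptually, adding the instruction means that each query is turned into a longer string before it is encoded. The sketch below only illustrates that idea: `add_instruction` is a hypothetical helper, and the single-space separator is an assumption, so the exact string the library builds internally may differ.

```python
instruction = "Represent this sentence for searching relevant passages:"

def add_instruction(queries, instruction):
    """Hypothetical helper: prefix every query with the retrieval instruction.

    The single-space separator is an assumption made for readability.
    """
    return [f"{instruction} {query}" for query in queries]

prefixed = add_instruction(["What is panda?", "What is tiger?"], instruction)
for text in prefixed:
    print(text)
```

Passages are encoded as they are, which is why only the query side needs a dedicated method.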

In [4]:
queries = ['What is panda?', 'What is tiger?']
passages = [
    "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.", 
    "The tiger (Panthera tigris) is a member of the genus Panthera and the largest living cat species native to Asia."
    ]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
print("shape of queries embedding:   ", q_embeddings.shape)
print("shape of passages embedding:  ", p_embeddings.shape)

shape of queries embedding:    (2, 768)
shape of passages embedding:   (2, 768)


In [15]:
scores = q_embeddings @ p_embeddings.T
print(scores)

[[0.73488915 0.45358998]
 [0.4899818  0.69096196]]
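
Each row of the score matrix belongs to one query and each column to one passage, so retrieval comes down to sorting every row. A minimal sketch, with the scores copied from the output above and short labels of our own standing in for the passages:

```python
import numpy as np

queries = ["What is panda?", "What is tiger?"]
passage_labels = ["giant panda passage", "tiger passage"]  # stand-ins for the full texts

# Score matrix copied from the output above: rows are queries, columns are passages
scores = np.array([[0.73488915, 0.45358998],
                   [0.4899818,  0.69096196]])

# Passage indices per query, sorted from most to least similar
ranking = np.argsort(-scores, axis=1)

for query, order in zip(queries, ranking):
    print(query, "->", [passage_labels[i] for i in order])
```

As expected, the panda passage ranks first for the panda query and the tiger passage ranks first for the tiger query.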


## 3. Applications

The code blocks above show that the similarity of two embedding vectors can be obtained by simply computing their dot product. Embeddings are widely used in many other applications:

### 3.1 Search Engines

Search is one of the most important applications of information retrieval. Embeddings help match user queries with relevant documents, images, or videos by comparing the similarity between the query and the items in the dataset.

### 3.2 Classification

- **Topic classification**. Embeddings can help identify the topic or category of a given piece of text, such as news articles, academic papers, or social media posts.
- **Sentiment analysis**. Embeddings are used to classify sentences or texts as positive, negative, or neutral.
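
One simple way to classify with embeddings is a nearest-centroid classifier: average the embeddings of each class and assign new text to the closest average. The sketch below uses made-up 2-D vectors in place of real embeddings, and nearest-centroid is just one possible choice, picked here for brevity.

```python
import numpy as np

# Toy 2-D "embeddings" with labels (made-up values for illustration only)
train_vecs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
train_labels = ["sports", "sports", "food", "food"]

def fit_centroids(vecs, labels):
    """Average the vectors of each class into one centroid per label."""
    centroids = {}
    for label in sorted(set(labels)):
        mask = np.array([l == label for l in labels])
        centroids[label] = vecs[mask].mean(axis=0)
    return centroids

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict(vec, centroids):
    """Return the label whose centroid is most similar to `vec`."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

centroids = fit_centroids(train_vecs, train_labels)
print(predict(np.array([0.7, 0.3]), centroids))
```

In practice the vectors would come from an embedding model, and the centroid step is often replaced by a trained classifier such as logistic regression.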

### 3.3 Translation and Summarization

Sentence embeddings can help align sentences across languages by capturing their underlying semantic meaning and context. This alignment can support translation into other languages, as well as summarization and paraphrasing.

### 3.4 Clustering

Because embeddings map sentences into a high-dimensional space where semantically similar texts lie close together, they can be used to group similar sentences or documents.
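
As an illustration, the sketch below groups made-up 2-D vectors with a bare-bones k-means. Both the data and the farthest-point initialization are our own choices for a deterministic toy example; real embeddings have far more dimensions and would typically be clustered with a library implementation.

```python
import numpy as np

def init_centroids(points, k):
    """Start at points[0], then repeatedly add the point farthest from those chosen."""
    chosen = [0]
    while len(chosen) < k:
        # Distance from every point to its closest already-chosen centroid
        dists = np.linalg.norm(
            points[:, None, :] - points[chosen][None, :, :], axis=2
        ).min(axis=1)
        chosen.append(int(dists.argmax()))
    return points[chosen].copy()

def kmeans(points, k, n_iters=10):
    centroids = init_centroids(points, k)
    for _ in range(n_iters):
        # Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it
        for j in range(k):
            members = points[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assignments

# Toy "embeddings": two visibly separated groups
points = np.array([[1.0, 0.1], [0.9, 0.2], [0.95, 0.0],
                   [0.1, 1.0], [0.2, 0.9], [0.0, 0.95]])
assignments = kmeans(points, k=2)
print(assignments)
```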

### 3.5 Recommendation System

Embeddings can be used to recommend similar content based on its semantic similarity to the user’s past interactions or preferences. They can also help tailor search results to the user’s history and preferences.
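
A minimal sketch of this idea: represent the user by the mean embedding of the items they interacted with, then rank unseen items by similarity to that profile. The item names and 2-D vectors are made up for illustration, and averaging the history is only one simple way to build a user profile.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy item embeddings (made-up values), one row per item
item_names = ["nba recap", "football news", "pizza recipe", "pasta recipe"]
item_vecs = l2_normalize(np.array([
    [0.9, 0.1],
    [0.8, 0.3],
    [0.1, 0.9],
    [0.2, 0.8],
]))

def recommend(history, top_k=1):
    """Rank unseen items by similarity to the mean embedding of the user's history."""
    profile = l2_normalize(item_vecs[history].mean(axis=0))
    scores = item_vecs @ profile
    scores[history] = -np.inf  # never recommend items the user has already seen
    return [item_names[i] for i in np.argsort(-scores)[:top_k]]

print(recommend(history=[2]))  # the user has read the pizza recipe
```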

## References

- [Retrieve Anything To Augment Large Language Models](https://arxiv.org/abs/2310.07554)
- [M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation](https://arxiv.org/pdf/2402.03216)
- [A Survey of Text Representation and Embedding Techniques in NLP](https://ieeexplore.ieee.org/abstract/document/10098736)
- [Dense Text Retrieval Based on Pretrained Language Models: A Survey](https://dl.acm.org/doi/full/10.1145/3637870)
