# Text Embedding


## Goal

In this notebook, we will learn how to generate text embeddings using the `fastembed` library.

In [7]:
%load_ext jupyter_ai
%config AiMagics.default_language_model = "ollama:llama3.2"

The jupyter_ai extension is already loaded. To reload it, use:
  %reload_ext jupyter_ai


In [13]:
%%ai

"what is text embedding? keep it short and simple"

**What is Text Embedding?**
==========================

Text embedding is a technique to represent words or phrases as numerical vectors, called embeddings, that capture their semantic meaning.

**Key Features:**

*   **Numerical Representation**: Words are converted into dense, high-dimensional vectors.
*   **Similarity Measures**: Distance between word vectors can be used to measure semantic similarity.
*   **Contextual Understanding**: Embeddings can capture relationships between words in a sentence or document.

**Example:**
The word "dog" might be represented as an embedding that is closer to "cat" than to "house", because dogs and cats are semantically related (both are animals and common pets) while a house is not.

In [12]:
# "Create a wordcloud for the topic text embedding"
# "Create a simple syllabus for learning the following topic text embedding"

## Dense text embedding

In [8]:
%%ai

"Explain dense text embedding, keep it simple. Demonstrate some example. Give some real-world example"

## What are Dense Text Embeddings?

Dense text embeddings are a way to convert text into numerical representations that capture its meaning and context.

### How Do They Work?

1. **Vectorize Words:** Turn words into vectors (numbers) that represent their meaning.
2. **Learned Through Data:** The model learns from a massive amount of text data and generates these vectors based on the relationships between words.

### Example:

*   **“Dog”** -> [0.8, 0.2, 0.9]
*   **“Cat”** -> [0.7, 0.3, 0.8]
*   **“Car”** -> [0.1, 0.9, 0.2]
*   **“Apple”** -> [0.3, 0.5, 0.7]

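Notice that the vectors for “Dog” and “Cat” are almost the same, while “Car” points in a very different direction. These three-dimensional values are made up for illustration; real models produce vectors with hundreds of dimensions.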
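
The notion of "closer together" can be made concrete with cosine similarity. The following is a minimal sketch that reuses the toy vectors from the list above; the helper name `cosine_similarity` is defined here for illustration and is not part of any library used in this notebook.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors copied from the example above (not produced by a real model)
vectors = {
    "Dog": [0.8, 0.2, 0.9],
    "Cat": [0.7, 0.3, 0.8],
    "Car": [0.1, 0.9, 0.2],
    "Apple": [0.3, 0.5, 0.7],
}

dog_cat = cosine_similarity(vectors["Dog"], vectors["Cat"])
dog_car = cosine_similarity(vectors["Dog"], vectors["Car"])
print(f"Dog vs Cat: {dog_cat:.3f}")
print(f"Dog vs Car: {dog_car:.3f}")
```

With these values, the Dog/Cat pair scores far higher than the Dog/Car pair, which is what "semantically similar" means in embedding space.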

### Real-World Examples:

1.  **Search Engines:** Convert search queries into embeddings to find relevant documents.
2.  **Recommendation Systems:** Use embeddings to suggest movies or songs based on user preferences.
3.  **Sentiment Analysis:** Analyze text to determine its sentiment (positive, negative, neutral).
4.  **Chatbots & Conversational AI:** Embeddings help chatbots understand user input and respond accordingly.

### Key Benefits:

*   Capture complex relationships between words
*   Efficiently represent large amounts of text data
*   Enable applications like search, recommendation, and sentiment analysis

In [18]:
from fastembed import TextEmbedding

# Example list of documents
documents: list[str] = [
    "This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.",
    "fastembed is supported by and maintained by Qdrant.",
]

# This will trigger the model download and initialization
embedding_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
print(f"The model {embedding_model.model_name} is ready to use.")

# embed() returns a generator, so materialize it into a list of numpy arrays
# (np.array(embeddings_list) would stack them into a single 2D array)
embeddings_list = list(embedding_model.embed(documents))
len(embeddings_list[0])  # Vector of 384 dimensions

The model BAAI/bge-small-en-v1.5 is ready to use.


384

In [5]:
list(filter(lambda x: not x.startswith("__"), dir(embedding_model)))

['EMBEDDINGS_REGISTRY',
 'METADATA_FILE',
 '_get_model_description',
 '_list_supported_models',
 '_local_files_only',
 'add_custom_model',
 'cache_dir',
 'decompress_to_cache',
 'download_file_from_gcs',
 'download_files_from_huggingface',
 'download_model',
 'embed',
 'list_supported_models',
 'model',
 'model_name',
 'passage_embed',
 'query_embed',
 'retrieve_model_gcs',
 'threads']

In [6]:
embedding_model.model_name

'BAAI/bge-small-en-v1.5'

## Sparse text embeddings

In [25]:
%%ai

"Explain sparse text embedding, keep it simple. Demonstrate some example. Give some real-world example"

**Sparse Text Embeddings**
========================

### What is a Sparse Vector?

A sparse vector is a numerical representation of a piece of text where most values are 0.

### How Does it Work?

Each position in a sparse vector corresponds to one term in a fixed vocabulary. A text gets a non-zero weight only at the positions of the terms that matter for it, so every other position stays 0. Learned sparse models such as SPLADE can also give weight to related terms that do not appear in the text, for example "fruit" for a text about "apple".

### Example
--------

Let's say we have a piece of text: "The quick brown fox jumps over the lazy dog."

With a vocabulary of thousands of terms, the text is represented as a single sparse vector that is non-zero only at the positions of its own terms:

* Position of "quick": `0.1`
* Position of "brown": `0.4`
* Every position whose term does not appear in the text: `0`

Because almost all values are 0, the vector is usually stored as just the non-zero positions (indices) and their weights (values).

### Real-World Examples
--------------------

1. **Text Classification**: Sparse embeddings can be used as input features for text classification models, such as sentiment analysis or spam detection.
2. **Topic Modeling**: Sparse embeddings can help identify topics in large collections of text by capturing the underlying relationships between words.
3. **Keyword Search**: Sparse vectors such as TF-IDF or SPLADE weights power keyword-style retrieval, where exact term matches matter. (Word2Vec and GloVe, by contrast, are dense embeddings.)

### Code Example
--------------

```python
# Import necessary libraries
from collections import Counter

# Define a function to create a simple bag-of-words sparse vector
def create_sparse_vector(text, vocabulary):
    # Lowercase the text, split it into words, and strip punctuation
    words = [word.strip(".,!?").lower() for word in text.split()]

    # Count how often each word occurs
    counts = Counter(words)

    # Keep only words that are in the vocabulary, stored as {index: weight}
    sparse_vector = {
        vocabulary[word]: float(count)
        for word, count in counts.items()
        if word in vocabulary
    }

    return sparse_vector

# Build a toy vocabulary that maps each known word to a position
known_words = ["apple", "brown", "cat", "dog", "fox", "jumps", "lazy", "over", "quick", "the"]
vocabulary = {word: index for index, word in enumerate(known_words)}

# Test the function
text = "The quick brown fox jumps over the lazy dog."
sparse_vector = create_sparse_vector(text, vocabulary)
print(sparse_vector)  # "the" occurs twice, so its weight is 2.0
```

In this example, each vocabulary word owns one position in the vector. Only the positions of words that occur in the text are stored, with the word count as the weight; every other position is implicitly 0. Real sparse models such as SPLADE learn the weights instead of counting words.
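
Comparing two sparse vectors only requires the positions that are non-zero in both. Below is a minimal sketch, assuming each vector is stored as a plain `{index: weight}` dictionary; the helper name `sparse_dot` and the example weights are made up for illustration.

```python
def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {index: weight}."""
    # Iterate over the smaller vector and look up matches in the larger one
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[index] for index, weight in a.items() if index in b)


# Hypothetical query and document vectors (indices are vocabulary positions)
query_vec = {3: 1.0, 8: 0.5}
doc_vec = {1: 0.4, 3: 2.0, 8: 1.0, 9: 0.7}

score = sparse_dot(query_vec, doc_vec)
print(score)
```

Only indices 3 and 8 are shared, so the other entries never contribute to the score. This is why sparse vectors stay cheap to compare even when the vocabulary is very large.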

In [8]:
from fastembed import SparseTextEmbedding

model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
embeddings = list(model.embed(documents))
embeddings

[SparseEmbedding(values=array([0.46793732, 0.34634435, 0.82014424, 0.45307532, 0.98732066,
        0.80176616, 0.2087955 , 0.07078066, 0.15851103, 0.07413071,
        0.34253079, 0.88557774, 0.13234277, 0.23698376, 0.07734038,
        0.20083414, 1.3942709 , 0.57856292, 0.75639009, 0.12872015,
        0.12940496, 1.21411681, 0.3960413 , 0.38100156, 0.85480541,
        0.23132324, 0.61133695, 0.34899744, 0.15025412, 0.1130122 ,
        0.15241024, 0.36152679, 0.13700481, 0.7303589 , 1.39194822,
        0.04954698, 0.49473077, 0.30635571, 0.06034151, 1.13118982,
        0.01341425, 0.02633621, 0.10710741, 1.03937888, 0.05903498,
        0.33036089, 0.0278459 , 0.04743589, 1.68689609, 0.62101287,
        1.86998868, 0.71478194, 0.08071101, 1.26968515, 0.05093801,
        0.09553559, 1.57417607, 0.18500556, 0.0425379 , 0.24046306,
        1.08656394, 0.72864759, 0.1876028 , 0.85070795, 0.16575399,
        0.23869337, 0.52304912, 0.90775394, 0.02330356, 0.12363458,
        0.37557927, 1.934

## Late interaction models (aka ColBERT)

In [26]:
%%ai ollama:llama3.2

"Explain late interaction models (aka ColBERT), keep it simple. Demonstrate some example. Give some real-world example"

**Late Interaction Models (ColBERT)**
=====================================

### What is Late Interaction?

Late interaction models encode the query and the document separately, keep one embedding per token instead of a single vector per text, and compare the two sets of token embeddings only at scoring time.

### How Does it Work?

Imagine you're searching for documents related to a query. A regular dense model compresses each text into one vector, while a cross-encoder reads the query and the document together (early interaction). A late interaction model like ColBERT sits in between: document token embeddings are computed ahead of time, and at query time each query token is matched with its most similar document token. The sum of these best-match similarities (called MaxSim) is the relevance score.

### Example
--------

Let's say a first-stage retriever returned these five candidates for the query "computer science":

1. **A paper on computer vision**
2. **A blog post on machine learning**
3. **A Wikipedia page on artificial intelligence**
4. **A news article on data science**
5. **A book review on software engineering**

A late interaction model like ColBERT scores each candidate as follows:

* Embed every query token and every document token
* For each query token, keep the highest similarity to any document token
* Sum these maxima to get the document's score

### Real-World Examples
--------------------

1. **E-commerce Search**: Late interaction models can match query tokens against product descriptions and reviews token by token, which helps with long or very specific queries.
2. **Passage Retrieval for Question Answering**: Late interaction models can retrieve the passages most relevant to a question from a large text collection.
3. **Reranking**: Late interaction models can rescore the candidates returned by a faster first-stage retriever.

### Code Example
--------------

```python
# Import necessary libraries
import numpy as np

# Define a function to compute the ColBERT-style MaxSim score
def maxsim_score(query_embeddings, document_embeddings):
    # Similarity of every query token to every document token
    # Shapes: (query_tokens, dim) @ (dim, doc_tokens) -> (query_tokens, doc_tokens)
    similarities = query_embeddings @ document_embeddings.T

    # For each query token keep its best-matching document token, then sum
    return float(similarities.max(axis=1).sum())

# Define a function to rank documents by their MaxSim score
def rank_documents(query_embeddings, documents_embeddings):
    scores = [maxsim_score(query_embeddings, doc) for doc in documents_embeddings]
    # Indices of the documents, best match first
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy token embeddings (made-up values): 2 query tokens, 2 documents
query_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
documents_embeddings = [
    np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]),  # matches both query tokens
    np.array([[0.1, 0.2], [0.3, 0.1]]),  # matches neither query token well
]
print(rank_documents(query_embeddings, documents_embeddings))  # [0, 1]
```

In this example, `maxsim_score` compares every query token embedding with every document token embedding, keeps the best match per query token, and sums the results. `rank_documents` applies it to each document and orders the documents by score. The toy embeddings are made up; real ones come from a model such as ColBERT, which returns one matrix of token embeddings per text.

In [10]:
from fastembed import LateInteractionTextEmbedding

model = LateInteractionTextEmbedding(model_name="colbert-ir/colbertv2.0")
embeddings = list(model.embed(documents))
embeddings

[array([[-0.1351824 ,  0.12230334,  0.1269857 , ...,  0.17307524,
          0.11274203,  0.02880633],
        [-0.17495233,  0.08767531,  0.11352374, ...,  0.12433604,
          0.15752925,  0.08118125],
        [-0.10130584,  0.09613474,  0.13923067, ...,  0.12898032,
          0.16839182,  0.09858395],
        ...,
        [-0.10270972,  0.01041561,  0.04440113, ...,  0.0550529 ,
          0.08930317,  0.09720251],
        [-0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [-0.15476122,  0.06961455,  0.10665789, ...,  0.15388842,
          0.09050205,  0.00516431]], shape=(29, 128), dtype=float32),
 array([[ 0.12170535,  0.07871944,  0.12508287, ...,  0.08450251,
          0.01834184, -0.01686618],
        [-0.02659732, -0.12131035,  0.14012505, ..., -0.01885814,
          0.01064609, -0.05982119],
        [-0.03633325, -0.14667122,  0.14062028, ..., -0.052545  ,
          0.00967532, -0.08844125],
        ...,
        [-0.        , 

## Image embeddings

In [35]:
import numpy as np
from fastembed import ImageEmbedding

images = ["images/cat1.jpeg", "images/cat2.jpeg", "images/dog1.webp"]

model = ImageEmbedding(model_name="Qdrant/clip-ViT-B-32-vision")
embeddings = list(model.embed(images))

In [36]:
from qdrant_client.local.distances import cosine_similarity

In [37]:
cosine_similarity(np.array(embeddings), np.array(embeddings))

array([[1.0000001 , 0.83723867, 0.61121917],
       [0.83723867, 1.0000002 , 0.7055948 ],
       [0.61121917, 0.7055948 , 1.0000001 ]], dtype=float32)
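
To read a similarity matrix like the one above, it helps to ignore the diagonal (every image is identical to itself) and look up the best remaining match per row. The sketch below copies the values from the output above into a numpy array; the variable names are chosen for illustration.

```python
import numpy as np

images = ["images/cat1.jpeg", "images/cat2.jpeg", "images/dog1.webp"]

# Values copied from the cosine similarity output above
similarity = np.array(
    [
        [1.0000001, 0.83723867, 0.61121917],
        [0.83723867, 1.0000002, 0.7055948],
        [0.61121917, 0.7055948, 1.0000001],
    ],
    dtype=np.float32,
)

# Mask the diagonal so an image cannot be its own nearest neighbour
masked = similarity.copy()
np.fill_diagonal(masked, -np.inf)

# Index of the most similar other image for each row
nearest = masked.argmax(axis=1)
for name, index in zip(images, nearest):
    print(f"{name} -> {images[index]}")
```

The two cat images pick each other as nearest neighbours, and the dog image is closer to the second cat than to the first.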

## Late interaction multimodal models (ColPali)

In [27]:
%%ai

"Explain late interaction multimodal models (aka ColPali), keep it simple. Demonstrate some example. Give some real-world example"

**Late Interaction Multimodal Models (ColPali)**
=============================================

### What is Late Interaction?

Late interaction models encode the query and the document separately, keep many embeddings per item instead of a single vector, and compare the two sets of embeddings only at scoring time.

### How Does it Work?

Imagine you're searching for documents related to a query. In a text-only late interaction model like ColBERT, the query and the document each get one embedding per token, and each query token is matched with its most similar document token. The sum of these best-match similarities (called MaxSim) is the relevance score.

### Multimodal Models

ColPali applies the same idea to document pages stored as images. A vision language model turns each page image into a set of patch embeddings and the text query into a set of token embeddings, so pages can be searched without running OCR first.

### Example
--------

Let's say we have a search model that uses ColPali to rank documents for a query "computer science". The inputs are:

* **Query (text)**: The words of the query, with one embedding per token.
* **Pages (images)**: Screenshots or scans of document pages (e.g. a page with a diagram of a computer chip), with one embedding per image patch.

For each query token, the model finds the most similar patch on the page. The final score of a page is the sum of these best-match similarities.

### Real-World Examples
--------------------

1. **Search over PDFs and Slides**: Late interaction multimodal models can retrieve pages whose meaning depends on tables, charts, and layout, not just on the text.
2. **Scanned Document Retrieval**: ColPali can search scanned pages directly from their images, without an OCR step.
3. **Multimodal Question Answering**: Late interaction multimodal models can retrieve the most relevant pages, which are then passed to a vision language model to answer the question.

### Code Example
--------------

```python
# Import necessary libraries
import numpy as np

# Define a function to score one page image against a text query
def score_page(query_token_embeddings, page_patch_embeddings):
    # Similarity of every query token to every image patch
    similarities = query_token_embeddings @ page_patch_embeddings.T

    # Late interaction: best patch per query token, summed over the query
    return float(similarities.max(axis=1).sum())

# Define a function to find the best page for a query
def best_page(query_token_embeddings, pages_patch_embeddings):
    scores = [score_page(query_token_embeddings, page) for page in pages_patch_embeddings]
    return int(np.argmax(scores)), scores

# Random stand-ins for model output: 4 query tokens, 3 pages of 16 patches each
rng = np.random.default_rng(0)
query_token_embeddings = rng.normal(size=(4, 8))
pages_patch_embeddings = [rng.normal(size=(16, 8)) for _ in range(3)]

page_index, scores = best_page(query_token_embeddings, pages_patch_embeddings)
print(page_index, scores)
```

In this example, `score_page` compares every query token embedding with every patch embedding of a page, keeps the best patch per query token, and sums the results. `best_page` applies it to each page and returns the index of the highest-scoring one. The embeddings here are random stand-ins; real ones come from a model such as ColPali.

In [None]:
from fastembed import LateInteractionMultimodalEmbedding

doc_images = [
    "images/wiki_computer_science.png",
    "images/wiki_technology.png",
    "images/wiki_space.png",
]

query = "what is tech"

model = LateInteractionMultimodalEmbedding(model_name="Qdrant/colpali-v1.3-fp16")
doc_images_embeddings = list(model.embed_image(doc_images))
# embed_text also returns a generator; take the embedding of the single query
query_embedding = list(model.embed_text(query))[0]

In [None]:
from qdrant_client import models
from qdrant_client.local.multi_distances import calculate_multi_distance

# How to calculate distance?
# from qdrant_client.local.sparse_distances import calculate_distance_sparse
# calculate_multi_distance(
#     query_embedding,
#     doc_images_embeddings,
#     # distance_type=models.MultiVectorComparator.MAX_SIM,
#     distance_type=models.Distance.DOT,
# )

## Rerankers

In [None]:
from fastembed.rerank.cross_encoder import TextCrossEncoder

query = "Who is maintaining Qdrant?"
documents: list[str] = [
    "This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.",
    "fastembed is supported by and maintained by Qdrant.",
]
encoder = TextCrossEncoder(model_name="Xenova/ms-marco-MiniLM-L-6-v2")
scores = list(encoder.rerank(query, documents))
scores
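
A reranker returns one relevance score per document, in the same order as the input list, so the final step is usually to sort the documents by score. The sketch below shows that step only; the `scores` values are made-up stand-ins rather than real output of the cross-encoder above.

```python
query = "Who is maintaining Qdrant?"
documents = [
    "This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.",
    "fastembed is supported by and maintained by Qdrant.",
]

# Hypothetical scores for illustration only (higher means more relevant)
scores = [-11.2, 8.5]

# Pair each document with its score and sort, best match first
ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)

for document, score in ranked:
    print(f"{score:+.2f}  {document}")
```

Cross-encoder scores are typically raw logits, so only their relative order matters when comparing documents for the same query.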