# Notebook Information

This notebook demonstrates the concept of similarity search and the difference between dense and sparse vectors.


---

This notebook is maintained by:

**Name:** Ekaterina Antonova
**Email:** [ekaterina_antonova@epam.com](mailto:ekaterina_antonova@epam.com)

**Name:** Adam Krzysiek
**Email:** [adam_krzysiek@epam.com](mailto:adam_krzysiek@epam.com)

---

In [None]:
%pip install --upgrade pip > /dev/null
%pip install langchain altair peft pandas umap-learn transformers numba langchain-huggingface FlagEmbedding > /dev/null

In [None]:
import torch
import warnings
import altair as alt
import pandas as pd
import numpy as np
import umap.umap_ as umap
from IPython.display import display
from FlagEmbedding import BGEM3FlagModel
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from numba.core.errors import (
    NumbaDeprecationWarning,
    NumbaPendingDeprecationWarning,
)
from transformers import AutoModelForMaskedLM, AutoTokenizer

## Let's start with some simple examples. First we need to extract the text and split it into chunks.
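
To build some intuition for chunk size and overlap, here is a minimal sketch of a fixed-size sliding window. The helper `split_with_overlap` is hypothetical and is not part of LangChain. `RecursiveCharacterTextSplitter` also tries to break on the given separators, so the chunks it produces will generally differ from this simplified version.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    result = []
    for start in range(0, len(text), step):
        result.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return result


# Toy input: each chunk repeats the last 2 characters of the previous one
demo_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)
```

The overlap means that a sentence cut at a chunk boundary still appears intact in at least one neighbouring chunk, at the cost of storing some text twice.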

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # The maximum number of characters in a chunk: we selected this value arbitrarily
    chunk_overlap=100,  # The number of characters to overlap between chunks
    add_start_index=True,  # If `True`, includes chunk's start index in metadata
    strip_whitespace=True,  # If `True`, strips whitespace from the start and end of every document
    separators=["\n"]
)

In [None]:
text = ''
with open('data/example_article.txt', 'r', encoding='utf-8') as file:
    text += file.read()

with open('data/example_article_2.txt', 'r', encoding='utf-8') as file:
    text += file.read()

In [None]:
chunks = text_splitter.split_text(text)

In [None]:
for i, chunk in enumerate(chunks[:5]):
    print(f"Chunk {i + 1}:")
    print(chunk)
    print("-" * 50)

### Dense Vectors
Dense vectors are numerical representations (embeddings) generated by neural networks, specifically by text embedding models. These vectors usually have most or all of their elements as non-zero values.

The main goal of dense vectors is to capture the semantic meaning of a piece of text. In simpler terms, they represent the essence of the text in a specific numerical format. The length of these vectors, called their "dimension," varies depending on the model used (examples include dimensions like 256, 768, 1024, etc.).

A single dimension of a dense embedding has no interpretable meaning on its own; it is too abstract.
Taken together, however, the dimensions encode the semantic meaning of the input text.

If you want to dive deeper into __word representations__, have a look at this wonderful __[presentation](https://nlp.cs.princeton.edu/cos484-sp21/lectures/lec5.pdf)__.
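
Similarity search over dense vectors usually comes down to comparing vectors with a measure such as cosine similarity. The sketch below uses made-up 4-dimensional vectors (the names `cat`, `kitten`, and `invoice` and their values are illustrative assumptions, not outputs of a real model) to show that texts with similar meaning should end up with a higher score.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; real models output hundreds of dimensions
cat = [0.9, 0.8, 0.1, 0.0]
kitten = [0.8, 0.9, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9, 0.8]

print("cat vs kitten: ", round(cosine_similarity(cat, kitten), 3))
print("cat vs invoice:", round(cosine_similarity(cat, invoice), 3))
```

Because cosine similarity divides by the vector norms, it can also be applied to embeddings that have not been normalized.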

In [None]:
warnings.simplefilter("ignore", category=NumbaDeprecationWarning)
warnings.simplefilter("ignore", category=NumbaPendingDeprecationWarning)
warnings.filterwarnings("ignore")

In [None]:
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embedding_function = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [None]:
embedded_chunks = embedding_function.embed_documents(chunks)

#### Visualization
To visualize the embeddings we need a 2D plot, so we first have to reduce the dimensionality of our vectors.

In [None]:
embedded_chunks_np = np.array(embedded_chunks)

# Reduce dimensionality with UMAP
umap_model = umap.UMAP(n_components=2, random_state=42)
umap_embeddings = umap_model.fit_transform(embedded_chunks_np)

# Create a DataFrame for visualization
df = pd.DataFrame({
    'x': umap_embeddings[:, 0],
    'y': umap_embeddings[:, 1],
    'text': chunks
})

On the plot you can see two distinct clusters: one related to pets and domestic animals, the other to wild animals.
This shows that our embeddings capture these key differences and arrange the vector space accordingly.

If you are interested in __UMAP__ as a technique, we highly recommend this __[article](https://pair-code.github.io/understanding-umap/)__.
It is worth a look, at least for the cool visualizations.

In [None]:
# Create an interactive scatter plot with Altair
chart = alt.Chart(df).mark_circle().encode(
    x='x',
    y='y',
    tooltip=['text']
).interactive()

display(chart)

### Sparse vectors
Sparse vectors are a representation in which each dimension corresponds to a word or subword, which makes document rankings much easier to interpret. This interpretability is why sparse vectors remain essential in modern search and recommendation systems, complementing the meaning-rich dense embeddings.

Sparse Vectors shine in domains and scenarios where many rare keywords or specialized terms are present. For example, in the medical domain, many rare terms are not present in the general vocabulary, so general-purpose dense vectors cannot capture the nuances of the domain.

Where do sparse vectors fail though? They’re not great at capturing nuanced relationships between words. For example, they can’t capture the relationship between “king” and “queen” as well as dense vectors.
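
The "king" versus "queen" limitation can be sketched with sparse vectors stored as token-to-weight dictionaries. The helper `sparse_dot` and all weights below are hypothetical and only illustrate the idea: the score is built from overlapping tokens alone, so a related document with no shared token scores zero.

```python
def sparse_dot(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two sparse vectors; only tokens present in both contribute."""
    smaller, larger = (a, b) if len(a) <= len(b) else (b, a)
    return sum(weight * larger[token] for token, weight in smaller.items() if token in larger)


# Hypothetical lexical weights, chosen by hand for illustration
query = {"king": 1.2, "crown": 0.7}
doc_exact = {"king": 0.9, "castle": 0.5}
doc_related = {"queen": 1.1, "palace": 0.6}

print("exact match:  ", sparse_dot(query, doc_exact))
print("related only: ", sparse_dot(query, doc_related))
```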

Sparse embeddings are generated by algorithms like [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) or by models like [BGE-M3](https://huggingface.co/BAAI/bge-m3) and [SPLADE](https://arxiv.org/abs/2107.05720).

In most databases and search engines, such as Pinecone, Elasticsearch, and Milvus, there is a built-in implementation of BM25. Therefore, there is usually no need to implement it yourself; you can simply use the predefined capabilities they offer.
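
For intuition only, here is a minimal sketch of BM25 scoring in plain Python. It assumes whitespace tokenization and the commonly used defaults `k1=1.5` and `b=0.75`; the function name `bm25_scores` and the toy documents are made up for this example, and production engines add their own tokenization and tuning.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Number of documents that contain each term at least once
    doc_freq = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "cats are popular domestic pets",
    "wolves hunt in packs",
    "dogs and cats live with people",
]
toy_scores = bm25_scores("domestic cats", toy_docs)
for doc, score in zip(toy_docs, toy_scores):
    print(f"{score:.3f}  {doc}")
```

Rare terms receive a higher inverse document frequency, which is why exact matches on specialized vocabulary carry so much weight in sparse retrieval.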

Let's create sparse embeddings, sometimes called lexical weights, using the __BGE-M3__ model. This model can return both sparse and dense vectors.

In [None]:
text_example = chunks[0]

In [None]:
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
sentences = [text_example]

In [None]:
output = model.encode(sentences, return_dense=False, return_sparse=True)

In [None]:
sparse_weight_id = output['lexical_weights']
sparse_weight_id

In [None]:
model.convert_id_to_token(sparse_weight_id)

In [None]:
# Let's create sparse vectors using SPLADE
model_id = 'naver/splade-cocondenser-ensembledistil'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

In [None]:
def compute_vector(text, tokenizer, model):
    """
    Computes a SPLADE sparse vector for a text from the model's logits and the
    attention mask using ReLU, log, and max operations.

    Args:
    text (str): The input text to encode.
    tokenizer: The tokenizer that matches the model.
    model: A masked language model whose logits are pooled into the vector.

    Returns:
    tuple: The computed vector (torch.Tensor of vocabulary size) and the tokenized input.
    """
    tokens = tokenizer(text, return_tensors="pt")
    output = model(**tokens)
    logits, attention_mask = output.logits, tokens.attention_mask
    relu_log = torch.log(1 + torch.relu(logits))
    weighted_log = relu_log * attention_mask.unsqueeze(-1)
    max_val, _ = torch.max(weighted_log, dim=1)
    vec = max_val.squeeze()

    return vec, tokens

In [None]:
def extract_and_map_sparse_vector(vector, tokenizer):
    """
    Extracts non-zero elements from a given vector and maps these elements to their human-readable tokens using a tokenizer. The function creates and returns a sorted dictionary where keys are the tokens corresponding to non-zero elements in the vector, and values are the weights of these elements, sorted in descending order of weights.

    This function is useful in NLP tasks where you need to understand the significance of different tokens based on a model's output vector. It first identifies non-zero values in the vector, maps them to tokens, and sorts them by weight for better interpretability.

    Args:
    vector (torch.Tensor): A PyTorch tensor from which to extract non-zero elements.
    tokenizer: The tokenizer used for tokenization in the model, providing the mapping from tokens to indices.

    Returns:
    dict: A sorted dictionary mapping human-readable tokens to their corresponding non-zero weights.
    """

    # Extract indices and values of non-zero elements in the vector
    # squeeze(-1) keeps a 1-D list even when only one element is non-zero
    cols = vector.nonzero().squeeze(-1).cpu().tolist()
    weights = vector[cols].cpu().tolist()

    # Map indices to tokens and create a dictionary
    idx2token = {idx: token for token, idx in tokenizer.get_vocab().items()}
    token_weight_dict = {idx2token[idx]: round(weight, 2) for idx, weight in zip(cols, weights)}

    # Sort the dictionary by weights in descending order
    sorted_token_weight_dict = {k: v for k, v in sorted(token_weight_dict.items(), key=lambda item: item[1], reverse=True)}

    return sorted_token_weight_dict

In [None]:
vec, tokens = compute_vector(text_example, tokenizer=tokenizer, model=model)

# Usage example
sorted_tokens = extract_and_map_sparse_vector(vec, tokenizer)
len(sorted_tokens), sorted_tokens

The result contains 60 tokens in total, and it includes tokens that weren't in the original text.
Consider a query “solar energy advantages”. SPLADE might expand this to include terms like “renewable,” “sustainable,” and “photovoltaic,” which are contextually relevant but not explicitly mentioned. This process is called term expansion, and it’s a key component of SPLADE.

SPLADE learns to expand queries and documents with other relevant terms. This is a crucial advantage over other sparse methods, which match only the exact words and completely miss contextually relevant ones.
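
The pooling step that makes term expansion possible can be sketched without a model. The hypothetical helper `splade_pool` below applies the same ReLU, log, and max operations as `compute_vector`, but on hand-written logits over a four-word vocabulary (both are assumptions made for illustration). The term "renewable" never appears in the input, yet it receives a non-zero weight because the logits assign it a positive score.

```python
import math


def splade_pool(logits, attention_mask):
    """Max-pool log(1 + relu(logit)) over non-padding positions.

    logits: list of rows, one per input position, each of vocabulary size.
    attention_mask: list of 0/1 flags, one per input position.
    """
    vocab_size = len(logits[0])
    weights = [0.0] * vocab_size
    for position, row in enumerate(logits):
        if attention_mask[position] == 0:
            continue  # padding positions must not contribute
        for term_id, logit in enumerate(row):
            activation = math.log1p(max(logit, 0.0))
            weights[term_id] = max(weights[term_id], activation)
    return weights


toy_vocab = ["solar", "energy", "renewable", "banana"]
toy_logits = [
    [3.0, 0.5, 1.2, -2.0],  # position of the input token "solar"
    [0.4, 2.5, 0.9, -1.5],  # position of the input token "energy"
    [9.0, 9.0, 9.0, 9.0],   # padding position, ignored through the mask
]
toy_mask = [1, 1, 0]

toy_weights = splade_pool(toy_logits, toy_mask)
expanded = {term: round(w, 2) for term, w in zip(toy_vocab, toy_weights) if w > 0}
print(expanded)
```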

| Feature                  | Sparse Vectors                                                      | Dense Vectors                                       |
|--------------------------|---------------------------------------------------------------------|-----------------------------------------------------|
| Data Representation      | Majority of elements are zero                                       | Most or all elements are non-zero                   |
| Computational Efficiency | Generally higher, especially in operations involving zero elements  | Lower, as operations are performed on all elements  |
| Information Density      | Less dense, focuses on key features                                 | Highly dense, capturing nuanced relationships       |
| Example Applications     | Text search, Hybrid search                                          | RAG, many general machine learning tasks            |


Additional materials for those who want to dive deeper:

* [Understanding the differences between sparse and dense semantic vectors](https://www.adelean.com/en/blog/20240131_vectors_sparse_and_dense/)
* [Problem Motivation: Sparse Overcomplete Word Vector Representations](https://ar5iv.org/abs/1506.02004?utm_source=qdrant&utm_medium=website&utm_campaign=sparse-vectors&utm_content=article&utm_term=sparse-vectors)
* [SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval](https://ar5iv.org/abs/2109.10086?utm_source=qdrant&utm_medium=website&utm_campaign=sparse-vectors&utm_content=article&utm_term=sparse-vectors)
* [SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking](https://ar5iv.org/abs/2107.05720?utm_source=qdrant&utm_medium=website&utm_campaign=sparse-vectors&utm_content=article&utm_term=sparse-vectors)
* [Late Interaction - ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction](https://ar5iv.org/abs/2112.01488?utm_source=qdrant&utm_medium=website&utm_campaign=sparse-vectors&utm_content=article&utm_term=sparse-vectors)
* [SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval](https://research.google/pubs/pub52289/?utm_source=qdrant&utm_medium=website&utm_campaign=sparse-vectors&utm_content=article&utm_term=sparse-vectors)