# Document embeddings

In the context of NLP, document embedding refers to the process of converting textual documents into numerical vectors. These vectors capture the semantic meaning of the documents, so documents with similar meaning map to nearby vectors and can be compared numerically.  
  
  
### Steps to embed documents
1) Preparation of data:
    Clean and preprocess the data, e.g., by removing special characters, normalizing the text, and tokenizing it.
    Organize your documents into a format that is compatible with the model's input requirements, typically a list of strings or a dataset. You cannot input an entire large document as is, because embedding models have a maximum input token limit; split the documents into chunks first.  

2) Load a pretrained embedding model that is optimized for generating document embeddings.

3) Embedding process:

    Pass the prepared documents through the embedding model.
    The model will convert each document into a fixed-size numerical vector. These vectors are dense and capture the semantic meaning of the documents.

4) Postprocessing:

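    After obtaining the embeddings, consider normalizing the vectors if necessary, for example to unit length so that a dot product between two vectors equals their cosine similarity.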
    Store the embeddings in a suitable format, such as a database, for further use in downstream tasks.
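The normalization mentioned in the postprocessing step can be sketched as follows. This is a minimal illustration using only the standard library; the helper name `l2_normalize` and the toy 3-dimensional vectors are assumptions made for the example, not output from a real embedding model.

```python
import math

def l2_normalize(vector):
    """Return a copy of `vector` scaled to unit length.

    An all-zero vector is returned unchanged to avoid dividing by zero.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

# Toy vectors standing in for real embeddings (hypothetical values).
toy_embeddings = [[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]]
normalized = [l2_normalize(v) for v in toy_embeddings]
print(normalized)  # [[0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
```

Once vectors have unit length, ranking by dot product and ranking by cosine similarity give the same order, which is why many vector stores expect normalized input.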



### Applications of document embeddings
- Document clustering:  
Use the embeddings to group similar documents together. This is particularly useful in organizing large document collections or creating topic-based clusters.

- Semantic search:  
Implement a semantic search engine where queries are matched with documents based on their semantic similarity rather than just keyword matching.

- Text classification:  
Utilize the embeddings as input features for classification models to categorize documents into predefined labels.
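To make the semantic search idea concrete, here is a small sketch that ranks documents by cosine similarity to a query vector. The function names (`cosine_similarity`, `rank_documents`), the document titles, and the 3-dimensional vectors are all hypothetical; with a real model, the vectors would come from the embedding step.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs, top_k=2):
    """Return (document index, score) pairs for the top_k most similar documents."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]

# Hypothetical documents and hand-made vectors, for illustration only.
toy_docs = ["refund policy", "office dress code", "returns and refunds"]
toy_doc_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
toy_query_vec = [1.0, 0.2, 0.0]

results = rank_documents(toy_query_vec, toy_doc_vecs, top_k=2)
for idx, score in results:
    print(f"{score:.3f}  {toy_docs[idx]}")
```

In this toy setup the two refund-related documents rank above the dress-code document, even though no keyword matching is involved. The same similarity scores could serve as a basis for clustering, or the vectors themselves could be used as features for a classifier.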


## Which embedding model to use?
The MTEB (Massive Text Embedding Benchmark) leaderboard compares embedding models across tasks and is a good starting point:  
https://huggingface.co/spaces/mteb/leaderboard


## Hugging Face embedding model

We will use the all-mpnet-base-v2 embedding model.  
It is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. It starts from the pre-trained microsoft/mpnet-base model and was fine-tuned on a dataset of 1B sentence pairs.  
https://huggingface.co/sentence-transformers/all-mpnet-base-v2

In [None]:
# import necessary libraries
from langchain_huggingface import HuggingFaceEmbeddings
import urllib.request
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


In [10]:
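# let's get some data and split it into chunks
import os

url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/Ec5f3KYU1CpbKRp1whFLZw/new-Policies.txt"
filename = 'data/new-Policies.txt'
# create the target directory first; urlretrieve fails if it does not exist
os.makedirs(os.path.dirname(filename), exist_ok=True)
urllib.request.urlretrieve(url, filename)

# load the downloaded file as a list of LangChain documents
loader = TextLoader(filename)
data = loader.load()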

# split the text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)

chunks = text_splitter.split_text(data[0].page_content)
print(len(chunks))

92
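To build intuition for what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and newlines); the function `chunk_with_overlap` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def chunk_with_overlap(text, chunk_size=100, chunk_overlap=20):
    """Split `text` into fixed-size character windows that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.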


In [None]:
# defining the embedding model
model_name = "sentence-transformers/all-mpnet-base-v2"
huggingface_embedding = HuggingFaceEmbeddings(model_name=model_name)




In [12]:
# let's create a query and embed it

query = "How are you?"
query_result = huggingface_embedding.embed_query(query)
query_result[:5]

[0.0271061509847641,
 0.011331789195537567,
 -0.0019524155650287867,
 -0.03695131093263626,
 0.01776493526995182]

In [15]:
len(query_result)

768

In [13]:
# now let's embed the document chunks
doc_result = huggingface_embedding.embed_documents(chunks)
doc_result[0][:5]

[0.05780600383877754,
 0.04059649258852005,
 0.013996032066643238,
 0.009279176592826843,
 -0.03389701619744301]

In [14]:
len(doc_result[0])

768