In [None]:
from typing import List, Dict

# Document Embeddings and Vector Databases

A fundamental part of Retrieval Augmented Generation (RAG) systems is the ability to search for documents that are relevant to a given query. By retrieving these documents, we can inject further contextual information into our prompts, allowing the LLM to provide better, grounded answers, often without needing to fine-tune the LLM on the data.

In this notebook, we will demonstrate how to build the semantic search system that powers search and retrieval for RAG. Broadly, the steps will be:

1. Vector Embeddings Conceptually
    - Turning Documents into Vector Embeddings via Neural Networks
    - Retrieving Relevant Documents via Vector Similarity
    - Simple Vector Based Retrieval System
2. More Robust Solution using LangChain and PGVector
    - Creating a Vector Embedding Database using LangChain and PGVector
    - Query the Embedding Database using LangChain and PGVector

## 1. Vector Embeddings Conceptually

Document retrieval and search systems can be built from simple heuristics such as the occurrence of common words, word counts, dictionaries of synonyms, etc. Depending on the problem, this can work fine. For natural language, however, it is generally insufficient, because the literal words do not always carry the meaning or the concept. As an example, the words "pencil" and "eraser" are conceptually similar, both being relevant to writing, yet in a naive literal search system almost nothing links the two without additional context. This problem has led to the development of [semantic search](https://en.wikipedia.org/wiki/Semantic_search), which aims to build search systems for natural language that capture the "semantic" meaning of words.


One of the best solutions to pop up for capturing semantics is **Vector Embeddings**, which is the idea that any type of data can be mapped into a vector space, via some embedding function, where in the new space, similar data points are "close" to each other.  The problem then becomes, how do you learn this embedding function?  Without diving into too much detail, it turns out that Neural Networks and Deep Learning are able to learn pretty good embedding functions.
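
To make the idea of "close" concrete, here is a minimal sketch using hand-made 3-dimensional vectors. The numbers are invented purely for illustration (a real embedding model learns its own values and uses far more dimensions), and cosine similarity is used here as one common way to measure closeness.

```python
import math
from typing import Dict, List

# Hand-made toy "embeddings"; the values are assumptions chosen for illustration,
# not the output of any real model.
toy_embeddings: Dict[str, List[float]] = {
    "pencil": [0.9, 0.8, 0.1],
    "eraser": [0.8, 0.9, 0.2],
    "tractor": [0.1, 0.2, 0.9],
}

def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_writing = cosine_similarity(toy_embeddings["pencil"], toy_embeddings["eraser"])
sim_unrelated = cosine_similarity(toy_embeddings["pencil"], toy_embeddings["tractor"])

print(f"pencil vs eraser:  {sim_writing:.3f}")
print(f"pencil vs tractor: {sim_unrelated:.3f}")
```

In this toy space the two writing tools score much closer to 1 than the unrelated pair, which is the behaviour a learned embedding function is trained to produce.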

### Turning Documents into Vector Embeddings via Neural Networks

With the concept in mind that **Vector Embeddings** map text to a vector space, let's see what this looks like in practice. For this, our embedding function will be a Transformer-like model called `BAAI/bge-large-en-v1.5`, which has been trained to capture semantic meaning. More information on how different open source embedding models perform can be found at https://huggingface.co/spaces/mteb/leaderboard.

Furthermore, in this vector space, similar concepts should be mapped close to each other as measured by cosine-similarity (used for this specific model).
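
The embedding cells in this notebook hard-code `'device': 'cuda'`, which fails on a machine without an available GPU. As a minimal sketch, assuming PyTorch may or may not be installed, a small helper can fall back to the CPU. The name `pick_device` is hypothetical and not part of LangChain.

```python
def pick_device() -> str:
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch  # optional dependency; may be absent on CPU-only setups
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

# Dictionary in the same shape as the model_kwargs used in this notebook.
model_kwargs = {"device": pick_device()}
print(model_kwargs)
```

The resulting dictionary could be passed as `model_kwargs` in place of the hard-coded value. Note that a GPU that is present can still run out of memory, which this check does not detect.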

In [1]:
from typing import Dict, List

import numpy as np
import pprint
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

In [5]:
# Our sample documents to turn into Vector Embeddings
sample_docs = [
    "Pigs are stout-bodied, short-legged, omnivorous mammals, with thick skin usually sparsely coated with short bristles",
    "Cows are four-footed and have a large body. It has two horns, two eyes plus two ears and one nose and a mouth. Cows are herbivorous animals.",
    "Chickens are average-sized fowls, characterized by smaller heads, short beaks and wings, and a round body perched on featherless legs.",
    "NumPy (Numerical Python) is an open source Python library that's used in almost every field of science and engineering. It's the universal standard for working with numerical data in Python, and it's at the core of the scientific Python and PyData ecosystems."
]

# The Embedding function, which is a Neural Network taking in text and outputting vectors
embedding_function = SentenceTransformerEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs = {'device': 'cuda'},
    encode_kwargs = {'normalize_embeddings': True}
)

In [3]:
embeddings = np.array(embedding_function.embed_documents(texts = sample_docs))

print(embeddings)

In [None]:
for i, embedding in enumerate(embeddings[:1]):
    print(f"First 5 dimensions for embedding of {sample_docs[i]}:")
    print(f"\t {embeddings[i,:5]}") # Only printing the first 5 to shorten it
    print(f"Embedding Dimension: {embeddings[i].shape}")
    print("-" * 80)

### Retrieving Relevant Documents via Vector Similarity

A key property of these embeddings is that, in the vector space, the embeddings of semantically similar texts should be close together, while those of unrelated texts should be far apart (a cosine similarity near 0). The metric used to measure closeness often depends on how the model was trained. For this model, we will use cosine similarity.

With the embeddings we can compute how similar any two documents are by computing the cosine similarity between their vector embeddings.
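
For reference, the cosine similarity of two embedding vectors $u$ and $v$ is the standard definition:

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

Since the embedding function in this notebook is configured with `'normalize_embeddings': True`, each vector should already have unit length ($\lVert u \rVert = \lVert v \rVert = 1$), in which case the expression reduces to the dot product $u \cdot v$. Dividing by the norms anyway keeps the computation correct if that assumption does not hold.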

In [None]:
norms = np.linalg.norm(embeddings, axis = 1)
# entry (i, j) must be divided by norm_i * norm_j; the outer product builds that matrix
cosine_similarities = (embeddings @ embeddings.T) / np.outer(norms, norms)
for i in range(len(sample_docs)):
    for j in range(i):
        print(f"Similarity between {sample_docs[j][:20]}... and {sample_docs[i][:20]}...: {cosine_similarities[j][i]}")

As a simple eye test, we can see here that the first three documents have high cosine similarities with one another, while each one's similarity to the description of NumPy is lower.

### Simplest Vector Search System

With only these tools, we have enough to technically build a semantic search system.

In [None]:
embedding_function = SentenceTransformerEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs = {'device': 'cuda'},
    encode_kwargs = {'normalize_embeddings': True}
)
sample_docs = [
    "Pigs are stout-bodied, short-legged, omnivorous mammals, with thick skin usually sparsely coated with short bristles",
    "Cows are four-footed and have a large body. It has two horns, two eyes plus two ears and one nose and a mouth. Cows are herbivorous animals.",
    "Chickens are average-sized fowls, characterized by smaller heads, short beaks and wings, and a round body perched on featherless legs.",
    "NumPy (Numerical Python) is an open source Python library that's used in almost every field of science and engineering. It's the universal standard for working with numerical data in Python, and it's at the core of the scientific Python and PyData ecosystems."
]

def embed_documents(docs: List[str]) -> np.ndarray:
    """embed all of our documents, only done once"""
    return np.array(embedding_function.embed_documents(docs))

def embed_query(query: str) -> np.ndarray:
    """embed the query, done on demand"""
    return np.array(embedding_function.embed_query(query))

def retrieve_relevant_documents(doc_embeddings : np.ndarray, query_embedding : np.ndarray, k : int = 1) -> List[Dict[str, object]]:
    """compute cosine similarity between query and documents, return top k and their scores"""
    doc_norms = np.linalg.norm(doc_embeddings, axis = 1)
    cosine_similarities = (doc_embeddings @ query_embedding) / (doc_norms * np.linalg.norm(query_embedding))
    # argsort is ascending, so reverse it to rank the most similar documents first
    ranked_indices = np.argsort(cosine_similarities)[::-1][:k]
    return [{'document' : sample_docs[i], 'score' : float(cosine_similarities[i])} for i in ranked_indices]

First we embed our documents, which is typically done offline.

In [None]:
doc_embeddings = embed_documents(sample_docs)

Then for every query:
1. embed the query
2. compute the similarity score between the query and the documents
3. return the top k most similar documents

In [None]:
query_embedding = embed_query("What is a hog?")

relevant_docs = retrieve_relevant_documents(doc_embeddings, query_embedding, k = 2)
pprint.pprint(relevant_docs)

## 2. More Robust Solution using LangChain and PGVector

We built our very simple retrieval system, but in practice, there are better solutions for building production ready, scalable solutions.  Primarily, when we computed our vectors, we stored them as a simple numpy array and kept it in memory.  When computing distances, we calculated the cosine similarity against every document.  However, if we had millions of documents, this solution would no longer be sufficient, whether for memory or latency constraints.  
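
A back-of-the-envelope calculation shows where the in-memory approach starts to strain. This sketch assumes one million documents and a 1024-dimensional embedding (the output size of `BAAI/bge-large-en-v1.5`); both figures are assumptions for illustration, and the function name is hypothetical.

```python
def embedding_matrix_bytes(num_docs: int, dim: int, bytes_per_value: int) -> int:
    """Raw size of a dense (num_docs x dim) embedding matrix."""
    return num_docs * dim * bytes_per_value

NUM_DOCS = 1_000_000  # assumption: corpus size
DIM = 1024            # assumption: embedding dimension of the model used here

# float32 uses 4 bytes per value; float64 (NumPy's default when converting
# Python floats) uses 8 bytes per value.
float32_gb = embedding_matrix_bytes(NUM_DOCS, DIM, 4) / 1e9
float64_gb = embedding_matrix_bytes(NUM_DOCS, DIM, 8) / 1e9

# Brute-force search touches every stored value once per query.
multiply_adds_per_query = NUM_DOCS * DIM

print(f"float32 matrix: {float32_gb:.2f} GB")
print(f"float64 matrix: {float64_gb:.2f} GB")
print(f"multiply-adds per query: {multiply_adds_per_query:,}")
```

Several gigabytes of memory and roughly a billion multiply-adds per query are workable on one machine, but both grow linearly with the corpus, which is the motivation for a dedicated vector store.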

In this section, we will demonstrate the use of PGVector and LangChain to improve this, noting that this is one of the solutions available but not fully production grade.  

`PGVector` is an extension for `postgresql` which allows for the storage of vector embeddings. `LangChain` is a tool for working with LLMs, including building embeddings for storage within a PGVector database.

### Creating a Vector Embedding Database using LangChain and PGVector

First, we will connect LangChain to PGVector so that we can have a consistent pipeline that computes embeddings for text and stores the embeddings into the PGVector Vector Database.  This will enable faster query times and persistence of our embeddings.

In [None]:
from langchain_core.documents import Document

import glob
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.pgvector import PGVector

import sqlalchemy

We create a connection to the database and use LangChain to initialize the database from a set of documents.

In [None]:
# The connection to the database
CONNECTION_STRING = PGVector.connection_string_from_db_params(
    driver= "psycopg2",
    host = "localhost",
    port = "5432",
    database = "postgres",
    user= "username",
    password="password"
)

# The embedding function that will be used to store into the database
embedding_function = SentenceTransformerEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs = {'device': 'cuda'},
    encode_kwargs = {'normalize_embeddings': True}
)
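
The connection string used here is a SQLAlchemy-style database URL. As a rough sketch of its shape, the hypothetical function below assembles one by hand. It is not LangChain's implementation, and the URL-encoding of credentials is an addition made here so that characters such as `@` or `/` in a password do not break the URL.

```python
from urllib.parse import quote_plus

def build_connection_string(
    driver: str, host: str, port: int, database: str, user: str, password: str
) -> str:
    """Assemble a postgresql+<driver>://user:password@host:port/database URL."""
    return (
        f"postgresql+{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )

# Same placeholder parameters as in the cell above.
print(build_connection_string("psycopg2", "localhost", 5432, "postgres", "username", "password"))
```

In practice the credentials would usually come from environment variables or a secrets manager rather than being written into the notebook.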

For larger documents, a good practice is to chunk the document into smaller segments to allow for the search to be more precise. LangChain contains multiple methods to chunk documents into smaller parts. Below we implement a method to load a PDF document from a path and chunk it into blocks for embedding.
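
To illustrate the general idea behind separator-based chunking, here is a simplified standard-library sketch: split the text on a separator, then greedily merge neighbouring pieces while they fit within a size budget. The function `split_and_merge` is hypothetical and only approximates what a library splitter does; it is not LangChain's implementation.

```python
from typing import List

def split_and_merge(text: str, separator: str = "\n\n", chunk_size: int = 40) -> List[str]:
    """Split on `separator`, then merge adjacent pieces up to `chunk_size` characters."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Either this starts a new chunk, or the piece still fits in the current one.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

text = (
    "First paragraph.\n\n"
    "Second one.\n\n"
    "A third, somewhat longer paragraph that stands alone."
)
chunks = split_and_merge(text, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Note that the last chunk exceeds `chunk_size`: a piece that contains no separator cannot be split further by this approach, so the size budget is a target rather than a hard limit.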

In [None]:
def chunk_document(doc_path: str) -> List[Document]:
    """Chunk a document into smaller langchain Documents for embedding.

    :param doc_path: path to document
    :type doc_path: str
    :return: List of Document chunks
    :rtype: List[Document]
    """
    loader = PyPDFLoader(doc_path)
    documents = loader.load()

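    # CharacterTextSplitter splits on the "\n\n" separator and then merges the pieces into chunks of
    # up to chunk_size characters, so chunk_size is not a plain character-count cut (quite unintuitive), see
    # https://stackoverflow.com/questions/76633836/what-does-langchain-charactertextsplitters-chunk-size-param-even-do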
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    
    return text_splitter.split_documents(documents)

# load the document and split it into chunks
doc_chunks = []
for doc in glob.glob("docs/*.pdf"):
    doc_chunks += chunk_document(doc)

Now we can initialize and populate our vector database with these chunks. Behind the scenes, each text is embedded using the `embedding_function` above, which wraps the `BAAI/bge-large-en-v1.5` neural network model, and then a SQL command is run to store the embedding in `postgresql`. See the Appendix below for the precise schema that is created and managed by LangChain.

In [None]:
db = PGVector.from_documents(
    doc_chunks[:5],
    connection_string = CONNECTION_STRING,
    collection_name = "embeddings",
    embedding = embedding_function,
    pre_delete_collection = True, # deletes any existing collection with this name first; remove to keep existing data
)

If we want to add new document embeddings, we can do so as below:

In [None]:
new_doc_ids = db.add_documents(doc_chunks[5:])

### Query the Embedding Database using LangChain and PGVector

With the Vector DB built out and populated, we can now query it using LangChain and PGVector. We repeat some of the setup code below to demonstrate that querying can run independently of the cells above.

In [None]:
from langchain.vectorstores.pgvector import PGVector

from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

In [None]:
# The connection to the database
CONNECTION_STRING = PGVector.connection_string_from_db_params(
    driver= "psycopg2",
    host = "localhost",
    port = "5432",
    database = "postgres",
    user= "username",
    password="password"
)

# The embedding function used to embed queries; it must be the same model that was used to populate the database
embedding_function = SentenceTransformerEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs = {'device': 'cuda'},
    encode_kwargs = {'normalize_embeddings': True}
)

# Creates the database connection to our existing DB
db = PGVector(
    connection_string = CONNECTION_STRING,
    collection_name = "embeddings",
    embedding_function = embedding_function
)

In [None]:
# query it, note that the score here is a distance metric (lower is more related)
query = "What's the efficacy of NeuroGlyde?"
docs_with_scores = db.similarity_search_with_score(query, k = 3)

# print results
for doc, score in docs_with_scores:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)
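
Because the returned score is a distance, it can help to convert it back to a similarity when comparing against the cosine similarities computed earlier in this notebook. The sketch below assumes the vector store is using cosine distance, defined as one minus cosine similarity. The `(text, distance)` pairs are invented for illustration and are not real query results.

```python
from typing import List, Tuple

def cosine_distance_to_similarity(distance: float) -> float:
    """Invert cosine distance (1 - similarity) back to cosine similarity."""
    return 1.0 - distance

# Hypothetical (chunk text, cosine distance) pairs, standing in for query results.
results: List[Tuple[str, float]] = [
    ("chunk A", 0.18),
    ("chunk B", 0.41),
    ("chunk C", 0.27),
]

# Lower distance means more related, so sort ascending by distance.
ranked = sorted(results, key=lambda pair: pair[1])
similarities = [(text, cosine_distance_to_similarity(d)) for text, d in ranked]

for text, similarity in similarities:
    print(f"{text}: similarity {similarity:.2f}")
```

If a different distance strategy is configured, such as Euclidean distance, this conversion does not apply.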

# Appendix

## LangChain Documentation for PGVector

https://python.langchain.com/docs/integrations/vectorstores/pgvector

https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.pgvector.PGVector.html

## LangChain DB Schema for Embeddings

### Table Schema of Collection

```
Table "public.langchain_pg_collection"
  Column   |       Type        | Collation | Nullable | Default 
-----------+-------------------+-----------+----------+---------
 name      | character varying |           |          | 
 cmetadata | json              |           |          | 
 uuid      | uuid              |           | not null | 
Indexes:
    "langchain_pg_collection_pkey" PRIMARY KEY, btree (uuid)
Referenced by:
    TABLE "langchain_pg_embedding" CONSTRAINT "langchain_pg_embedding_collection_id_fkey" FOREIGN KEY (collection_id) REFERENCES langchain_pg_collection(uuid) ON DELETE CASCADE
```

### Table Schema of Embeddings

```
              Table "public.langchain_pg_embedding"
    Column     |       Type        | Collation | Nullable | Default 
---------------+-------------------+-----------+----------+---------
 collection_id | uuid              |           |          | 
 embedding     | vector            |           |          | 
 document      | character varying |           |          | 
 cmetadata     | json              |           |          | 
 custom_id     | character varying |           |          | 
 uuid          | uuid              |           | not null | 
Indexes:
    "langchain_pg_embedding_pkey" PRIMARY KEY, btree (uuid)
Foreign-key constraints:
    "langchain_pg_embedding_collection_id_fkey" FOREIGN KEY (collection_id) REFERENCES langchain_pg_collection(uuid) ON DELETE CASCADE
```