# Introduction
This notebook is an example of building and interacting with a chromadb embeddings database using an embedding model that has been downloaded and stored locally.

Specifically, we will be using the [instructor-xl embedding model](https://huggingface.co/hkunlp/instructor-xl), which aims to create embeddings that allow matching between prompts and relevant documents. This is a relatively flexible embedding model and is closely aligned to our end goal: a simple proof-of-concept of a Retrieval-Augmented Generation (RAG) approach to using Large Language Models (LLMs).

# Method
## Approach for embedding
To embed the files, we will be turning them into reasonably sized text chunks and storing them in a static, on-disk chromadb vector database which can be loaded and used for searching.

This notebook shows some simple walkthroughs of this process. The associated calculated_document_embeddings.py script turns it into more consistent, better documented functions for re-use.

The use-case we are considering here is largely an asymmetric embedding problem where the query text may be a very different size to the stored text for comparison.

For similarity measures in the embedding space, we can consider both the cosine similarity and the raw dot-product, noting that each is likely to favour different potential answer lengths.
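
To make the difference between the two measures concrete, here is a minimal sketch using toy 2-D vectors (hand-picked for illustration, not output from instructor-xl). It assumes that one candidate has a larger vector norm than the other. Whether norm actually tracks text length for a given embedding model is something to check empirically.

```python
import math

def dot(a, b):
    """Raw dot-product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: the dot-product of the two vectors after normalising each."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (assumptions for illustration only)
query = [1.0, 0.0]
candidates = {
    "short_doc": [0.9, 0.1],  # almost the same direction as the query, small norm
    "long_doc": [3.0, 3.0],   # 45 degrees away from the query, large norm
}

dot_scores = {name: dot(query, vector) for name, vector in candidates.items()}
cosine_scores = {name: cosine(query, vector) for name, vector in candidates.items()}

best_by_dot = max(dot_scores, key=dot_scores.get)
best_by_cosine = max(cosine_scores, key=cosine_scores.get)
```

With these toy values the dot-product ranks the large-norm vector first, while the cosine similarity ranks the better-aligned vector first.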

## Text chunking strategy
We will aim for context-aware text chunks. The splitting can be file / format specific (e.g. Markdown, HTML, code, etc.)
and will allow for some overlap between these text chunks.
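
As a rough illustration of what overlap between chunks means, here is a minimal fixed-width chunker. It is a deliberate simplification and not how langchain's `RecursiveCharacterTextSplitter` works (that splitter prefers to break on separators such as paragraphs and sentences). The function name and parameters are hypothetical.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Split text into fixed-width chunks where neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

example_chunks = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=1)
```

Each chunk starts with the last character of the previous one, so a sentence cut at a chunk boundary still appears partly in both neighbours.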

## Environment setup

Note, I'd recommend you pick a Python kernel built to match the requirements specified in pyproject.toml and use Poetry to manage the dependencies and virtual environment. I have added the pip magic command to install the relevant packages in this notebook, but it is unnecessary if you set up an appropriate virtual environment.

## Example data
I've downloaded various books that are out of copyright. These are:
* Moby Dick converted to five formats:
  * txt
  * docx
  * pdf (but with text encoded)
  * markdown
  * html
* Pride and Prejudice as a PDF
* Romeo and Juliet as a PDF

In [None]:
%pip install chromadb langchain PyYAML sentence-transformers pypdf pymupdf ipykernel unstructured markdown docx2txt tiktoken InstructorEmbedding

We first set up some constants with paths to our data.

Please note, these are hardcoded and need to be adjusted to match where you set up your embedding model and your document data.

In [67]:
import chromadb  # For creating, managing and interacting with the local vector store
import sentence_transformers  # Default embedding backend, imported for reference; the instructor model is used instead
import os  # For file-path operations
import uuid  # Used for generating unique IDs for document chunks

core_working_directory = r"C:\Users\Alex\Google Drive\projects\llama2_retrieval_augmented_generation"
document_data_dir_path = os.path.join(core_working_directory, "data", "documents")

# Note that the name is in the convention for huggingface.co and this model is apache 2.0 licensed.
reference_models_directory = r"F:\reference_models"
embedding_model_name_used = "hkunlp/instructor-xl"
embedding_model_path = os.path.join(reference_models_directory, "embedding_models", "instructor-xl")

# Chunk size is somewhat arbitrary, but chunks must stay small enough that a few of them fit in an LLM prompt.
# Llama 2 has a 4096-token context window unless context window extension methods are used.
# Note these sizes are measured in characters, because the text splitter below uses len as its length function.
n_size_in_doc_chunk = 512
n_size_in_chunk_overlap = 32
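
To sanity-check a chunk size against a prompt budget, a back-of-the-envelope estimate like the one below can help. It is only a sketch: the function name is hypothetical, the figure of roughly 4 characters per token is a crude assumption for English text (the real ratio depends on the tokenizer), and the 2048-token window and 512 reserved tokens are illustrative values.

```python
import math

def max_chunks_in_prompt(context_window_tokens, reserved_tokens,
                         chunk_size_chars, chars_per_token=4.0):
    """Estimate how many retrieved chunks fit alongside the prompt and the answer."""
    # Rough token cost of one chunk; chars_per_token is an assumed heuristic
    tokens_per_chunk = math.ceil(chunk_size_chars / chars_per_token)
    # Tokens left after reserving room for the question, instructions and answer
    available_tokens = context_window_tokens - reserved_tokens
    return max(available_tokens // tokens_per_chunk, 0)

estimated_chunk_budget = max_chunks_in_prompt(
    context_window_tokens=2048,  # illustrative window size
    reserved_tokens=512,         # illustrative reserve for prompt and answer
    chunk_size_chars=512,        # matches the chunk size used in this notebook
)
```

Under these assumptions a 512-character chunk costs about 128 tokens, which leaves room for around a dozen chunks.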

From here, the intent is to load the files, turn them into chunks of reasonably sized text, add some metadata to each chunk and then save them to the vector database.

In [34]:
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader
from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.document_loaders import UnstructuredMarkdownLoader
from langchain.document_loaders import PyMuPDFLoader
from langchain.document_loaders import Docx2txtLoader
# from langchain.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note, below is pieced together from langchain docs here:
# https://python.langchain.com/docs/modules/data_connection/document_transformers/
# from langchain.document_loaders import DocxLoader
file_endings_considered = {"markdown": "md",
                           "html": "html",
                           "text": "txt",
                           "pdf": "pdf",
                           "word_doc": "docx"}

file_globbers = {file_type: os.path.join("**", f"*.{file_ending}")
                 for file_type, file_ending in file_endings_considered.items()}

# Build a markdown file loader
markdown_file_loader = DirectoryLoader(document_data_dir_path,
                                       glob = file_globbers['markdown'],
                                       show_progress = True,
                                       loader_cls = UnstructuredMarkdownLoader)

# Build a text file loader handling different text encodings
text_loader_kwargs= {'autodetect_encoding': True}
text_file_loader = DirectoryLoader(document_data_dir_path,
                                   glob = file_globbers['text'],
                                   show_progress = True,
                                   loader_cls = TextLoader,
                                   loader_kwargs=text_loader_kwargs)

# Build an HTML file loader
html_file_loader = DirectoryLoader(document_data_dir_path,
                                   glob = file_globbers['html'],
                                   show_progress = True,
                                   loader_cls = UnstructuredHTMLLoader)

# Build a pdf file loader
pdf_file_loader = DirectoryLoader(document_data_dir_path,
                                  glob = file_globbers['pdf'],
                                  show_progress = True,
                                  loader_cls = PyMuPDFLoader)

# Build docx file loader
docx_file_loader = DirectoryLoader(document_data_dir_path,
                                  glob = file_globbers['word_doc'],
                                  show_progress = True,
                                  loader_cls = Docx2txtLoader)

# We also define the text splitter shared by all file types here
default_text_splitter = RecursiveCharacterTextSplitter(
    # Chunk size and overlap are measured in characters (see length_function below)
    chunk_size = n_size_in_doc_chunk,
    chunk_overlap = n_size_in_chunk_overlap,
    length_function = len,
    is_separator_regex = False,
)

# Note we'd also use rapidocr-onnxruntime if we downgraded the Python version to suit it, as it needs Python <3.12

In [26]:
loaded_markdown_files = markdown_file_loader.load()
loaded_text_files = text_file_loader.load()
loaded_html_files = html_file_loader.load()
loaded_pdf_files = pdf_file_loader.load()
loaded_docx_files = docx_file_loader.load()



100%|██████████| 1/1 [13:37<00:00, 817.74s/it]
100%|██████████| 1/1 [06:45<00:00, 405.55s/it]


100%|██████████| 1/1 [00:13<00:00, 13.36s/it]
100%|██████████| 1/1 [00:00<00:00, 111.01it/s]
100%|██████████| 1/1 [00:11<00:00, 11.79s/it]
100%|██████████| 3/3 [00:01<00:00,  1.65it/s]
100%|██████████| 1/1 [00:01<00:00,  1.43s/it]


After loading the documents, we split the text into chunks. The metadata of each source document is carried over to its chunks.

In [35]:
# Text chunking is built together from langchain documentation here:
# https://python.langchain.com/docs/modules/data_connection/document_transformers/
# Note that it does propagate the metadata
markdown_file_chunks = default_text_splitter.split_documents(loaded_markdown_files)
text_file_chunks = default_text_splitter.split_documents(loaded_text_files)
html_file_chunks = default_text_splitter.split_documents(loaded_html_files)
pdf_file_chunks = default_text_splitter.split_documents(loaded_pdf_files)
docx_file_chunks = default_text_splitter.split_documents(loaded_docx_files)
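
Because the metadata is propagated to every chunk, it can be used to check how many chunks each source file produced. The sketch below uses stand-in objects with the same `page_content` and `metadata` attributes as langchain documents, so it runs without the real data. The helper name and the example file names are hypothetical.

```python
from collections import Counter
from types import SimpleNamespace

def count_chunks_per_source(chunks):
    """Count chunks per source file using the 'source' entry in each chunk's metadata."""
    return Counter(chunk.metadata.get("source", "unknown") for chunk in chunks)

# Stand-ins for langchain Document chunks (illustrative values only)
example_file_chunks = [
    SimpleNamespace(page_content="Call me Ishmael.", metadata={"source": "moby_dick.txt"}),
    SimpleNamespace(page_content="Some years ago...", metadata={"source": "moby_dick.txt"}),
    SimpleNamespace(page_content="It is a truth...", metadata={"source": "pride_and_prejudice.pdf"}),
]

chunks_per_source = count_chunks_per_source(example_file_chunks)
```

In the notebook the same helper could be applied to, for example, `pdf_file_chunks` to confirm that every PDF contributed chunks.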

We load a local embedding model and get ready to embed the text and save it to a chromadb collection.

Some of the example interfaces are adapted from this [tutorial](https://realpython.com/chromadb-vector-database/#get-started-with-chromadb-an-open-source-vector-database).

In [50]:
persistent_chromadb_location = os.path.join(core_working_directory, 'doc_db')
default_embedding_model = "all-MiniLM-L6-v2"
document_collection_name = "simple_test_of_chunks"

local_vector_db_client = chromadb.PersistentClient(path = persistent_chromadb_location)

This is an older attempt that uses the generic Hugging Face embeddings wrapper. It predates the realisation that the instructor model needs its own modified package rather than the default sentence_transformers package.

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings

# Create a dictionary with model configuration options, specifying to use the CPU for computations
embedding_model_kwargs = {'device': 'cpu'}

# Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to False
embedding_encode_kwargs = {'normalize_embeddings': False}

# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
instructor_xl_embedding = HuggingFaceEmbeddings(
    model_name = embedding_model_path,     # Provide the pre-trained model's path
    model_kwargs = embedding_model_kwargs, # Pass the model configuration options
    encode_kwargs = embedding_encode_kwargs # Pass the encoding options
)

In [56]:
generic_instructor_document_task = "Represent the general document chunk for retrieval:"
generic_instructor_query_task = "Represent the general question for retrieving supporting documents:"
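
The instructor model embeds each text together with a task instruction, passed as an `[instruction, text]` pair, which is how documents and queries can be treated differently in an asymmetric retrieval setting. Below is a minimal sketch of building those pairs. The helper name and the example texts are hypothetical, and no model is loaded here.

```python
def build_instruction_pairs(instruction, texts):
    """Pair every text with the task instruction: [[instruction, text], ...]."""
    return [[instruction, text] for text in texts]

# Same instruction strings as in the cell above
document_task = "Represent the general document chunk for retrieval:"
query_task = "Represent the general question for retrieving supporting documents:"

document_pairs = build_instruction_pairs(
    document_task, ["Call me Ishmael.", "Some years ago, never mind how long precisely."]
)
query_pairs = build_instruction_pairs(query_task, ["Who is the narrator of Moby Dick?"])
```

A list shaped like `document_pairs` is what the `encode` method of an `INSTRUCTOR` model expects, returning one embedding row per pair.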

In [83]:
from chromadb.utils import embedding_functions
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings

# Model configuration options: run the embedding model on the CPU
embedding_model_kwargs = {'device': 'cpu'}

# Encoding options: keep raw (un-normalised) embeddings and embed one chunk at a time
embedding_encode_kwargs = {'normalize_embeddings': False,
                           'batch_size': 1}

# Initialize HuggingFaceInstructEmbeddings with separate instructions for documents and queries
instructor_xl_embedding = HuggingFaceInstructEmbeddings(
    model_name = embedding_model_path,
    embed_instruction = generic_instructor_document_task,  # used by embed_documents
    query_instruction = generic_instructor_query_task,     # used by embed_query
    model_kwargs = embedding_model_kwargs,
    encode_kwargs = embedding_encode_kwargs
)
# chromadb_instruct_xl_embedding = embedding_functions.InstructorEmbeddingFunction(
#     model_name = embedding_model_path,
#     device = "cpu"
# )

load INSTRUCTOR_Transformer
max_seq_length  512


We now define a document collection built off the local ChromaDB client and go through and embed all of our document chunks inside it.

In [84]:
doc_collection_for_rag = local_vector_db_client.get_or_create_collection(
    name = document_collection_name,
    metadata = {'purpose': 'embedding documents in a way that supports retrieval augmented generation.',
                'embedding_model': 'https://huggingface.co/hkunlp/instructor-xl'},
    embedding_function = instructor_xl_embedding
)

ValueError: Expected EmbeddingFunction.__call__ to have the following signature: odict_keys(['self', 'input']), got odict_keys(['args', 'kwargs'])
Please see https://docs.trychroma.com/embeddings for details of the EmbeddingFunction interface.
Please note the recent change to the EmbeddingFunction interface: https://docs.trychroma.com/migration#migration-to-0416---november-7-2023 
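
The error above says chromadb expects an embedding function whose `__call__` takes exactly `self` and `input`, while the langchain embedding object does not have that signature. One possible workaround is a thin adapter. The class below is a hypothetical sketch that has not been tested against chromadb. It assumes the wrapped object exposes `embed_documents`, as langchain embedding classes do, and depending on the chromadb version it may be preferable to subclass `chromadb.EmbeddingFunction` instead.

```python
import inspect

class LangchainEmbeddingAdapter:
    """Hypothetical adapter exposing a langchain-style embedder as __call__(self, input)."""

    def __init__(self, langchain_embedder):
        self._embedder = langchain_embedder

    def __call__(self, input):
        # chromadb passes a list of document strings and expects a list of vectors back
        return self._embedder.embed_documents(list(input))

class FakeEmbedder:
    """Stand-in for a real embedding model so the sketch runs without downloads."""

    def embed_documents(self, texts):
        # Toy embedding: text length and word count
        return [[float(len(text)), float(len(text.split()))] for text in texts]

adapter = LangchainEmbeddingAdapter(FakeEmbedder())
example_vectors = adapter(["Call me Ishmael.", "whale"])
call_parameters = list(inspect.signature(LangchainEmbeddingAdapter.__call__).parameters)
```

In the notebook the adapter would wrap `instructor_xl_embedding` and be passed as `embedding_function` when creating the collection. Note that queries sent through the collection would then be embedded with the document instruction as well.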


To embed the documents appropriately, we need to extract the text content from the document chunks we made and give each entry a unique ID.

In [80]:
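pdf_file_chunk = pdf_file_chunks[0]
chunk_id = str(uuid.uuid1())
chunk_metadata = pdf_file_chunk.metadata
chunk_content = pdf_file_chunk.page_content
# Note: a flat list is encoded as two separate texts (the instruction and the chunk),
# which is why the output below has two rows. To pair the instruction with the chunk,
# the INSTRUCTOR model expects a nested list: [[instruction, text]].
chunk_content_with_instruction = [generic_instructor_document_task, chunk_content]
chunk_context_embedding = instructor_xl_embedding.encode(chunk_content_with_instruction)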

In [81]:
chunk_context_embedding

array([[ 0.00265378, -0.01419186,  0.0533826 , ..., -0.03320131,
        -0.01322962,  0.09307595],
       [ 0.01149968,  0.01181622,  0.01291376, ..., -0.05892983,
         0.01926928,  0.02393644]], dtype=float32)

In [87]:
from langchain.vectorstores import Chroma

langchain_chroma_db = Chroma(
    client = local_vector_db_client,
    collection_name = document_collection_name,
    embedding_function = instructor_xl_embedding
)

In [88]:
langchain_chroma_db.add_documents(
    pdf_file_chunks
)

In [82]:
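# Note: this call raises the ValueError shown below because chunk_context_embedding is a
# NumPy array, while chromadb here expects a plain Python list (for example via .tolist()).
doc_collection_for_rag.add(
    ids = chunk_id,
    embeddings = chunk_context_embedding,
    metadatas = chunk_metadata,
    documents = chunk_content_with_instruction
)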

ValueError: Expected embeddings to be a list, got [[ 0.00265378 -0.01419186  0.0533826  ... -0.03320131 -0.01322962
   0.09307595]
 [ 0.01149968  0.01181622  0.01291376 ... -0.05892983  0.01926928
   0.02393644]]
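
The error above comes from passing an array where chromadb expects plain Python lists. A minimal sketch of preparing the arguments for a batch `add` call is shown below. The helper name is hypothetical and the chunks are stand-in objects, so nothing here touches a real collection or model.

```python
import uuid
from types import SimpleNamespace

def build_add_arguments(chunks, embeddings):
    """Build list-valued ids, embeddings, metadatas and documents, one entry per chunk."""
    # Convert array-like embeddings (e.g. NumPy arrays) into plain nested lists
    if hasattr(embeddings, "tolist"):
        embeddings = embeddings.tolist()
    embeddings = [list(vector) for vector in embeddings]
    if len(embeddings) != len(chunks):
        raise ValueError("expected exactly one embedding per chunk")
    return {
        "ids": [str(uuid.uuid1()) for _ in chunks],
        "embeddings": embeddings,
        "metadatas": [dict(chunk.metadata) for chunk in chunks],
        "documents": [chunk.page_content for chunk in chunks],
    }

# Stand-ins for langchain Document chunks and their embeddings (illustrative values)
stand_in_chunks = [
    SimpleNamespace(page_content="Call me Ishmael.", metadata={"source": "moby_dick.pdf", "page": 0}),
    SimpleNamespace(page_content="Some years ago...", metadata={"source": "moby_dick.pdf", "page": 0}),
]
stand_in_embeddings = [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)]

add_arguments = build_add_arguments(stand_in_chunks, stand_in_embeddings)
```

The resulting dictionary could then be unpacked into the collection call, for example `doc_collection_for_rag.add(**add_arguments)`, assuming the collection was created successfully.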

In [66]:
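for pdf_file_chunk in pdf_file_chunks:
    chunk_id = str(uuid.uuid1())
    chunk_metadata = pdf_file_chunk.metadata
    chunk_content = pdf_file_chunk.page_content
    # embed_documents applies the document instruction and returns a list of float vectors
    chunk_instructor_embedding = instructor_xl_embedding.embed_documents([chunk_content])

    # chromadb expects lists, with one entry per document being added
    doc_collection_for_rag.add(
        ids = [chunk_id],
        embeddings = chunk_instructor_embedding,
        metadatas = [chunk_metadata],
        documents = [chunk_content]
    )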

[Document(page_content='The Project Gutenberg eBook of Moby Dick; Or, The Whale \n     \nThis ebook is for the use of anyone anywhere in the United States and \nmost other parts of the world at no cost and with almost no restrictions \nwhatsoever. You may copy it, give it away or re-use it under the terms \nof the Project Gutenberg License included with this ebook or online \nat www.gutenberg.org. If you are not located in the United States, \nyou will have to check the laws of the country where you are located \nbefore using this eBook.', metadata={'source': 'C:\\Users\\Alex\\Google Drive\\projects\\llama2_retrieval_augmented_generation\\data\\documents\\moby_dick_3.pdf', 'file_path': 'C:\\Users\\Alex\\Google Drive\\projects\\llama2_retrieval_augmented_generation\\data\\documents\\moby_dick_3.pdf', 'page': 0, 'total_pages': 385, 'format': 'PDF 1.7', 'title': '', 'author': 'Alexander Baker', 'subject': '', 'keywords': '', 'creator': 'Microsoft® Word for Office 365', 'producer': 'Micros

In [None]:
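# Without explicit embeddings, chromadb embeds the documents with the collection's own embedding function
for doc in pdf_file_chunks:
    doc_collection_for_rag.add(
        ids=[str(uuid.uuid1())], metadatas=[doc.metadata], documents=[doc.page_content]
    )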

In [64]:
langchain_chroma_db._collection.count()

0