# Introduction
This notebook is an example of building and querying a chromadb embeddings database using an embedding model that has been downloaded and stored locally.

Specifically, we will be using the [instructor-xl embedding model](https://huggingface.co/hkunlp/instructor-xl), which aims to create embeddings that allow matching between prompts and relevant documents. This is a relatively flexible embedding model and is closely aligned with our end goal: a simple proof-of-concept of a Retrieval-Augmented Generation (RAG) approach to using Large Language Models (LLMs).

# Method
## Approach for embedding
To embed the files, we will be turning them into reasonably sized text chunks and storing them in a static, on-disk chromadb vector database which can be loaded and used for searching.

This notebook shows some simple walkthroughs of this, and the associated calculated_document_embeddings.py script turns them into more consistent, better documented functions for re-use.

The use-case we are considering here is largely an asymmetric embedding problem where the query text may be a very different size to the stored text for comparison.

For similarity measures in the embedding space, we can consider both cosine similarity and the raw dot-product, noting that each is likely to favour different potential answer lengths.
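
To make the difference between the two measures concrete, here is a minimal sketch using made-up two-dimensional vectors rather than real model output. It only illustrates that cosine similarity ignores vector magnitude while the raw dot-product rewards it; whether magnitude actually tracks text length depends on the embedding model, so treat that link as an assumption.

```python
import math

def dot(a, b):
    """Raw dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy 2-D "embeddings" (made-up values, not real model output)
query = [1.0, 0.0]
aligned_small = [0.9, 0.1]   # closely aligned with the query, small norm
skewed_large = [3.0, 3.0]    # less aligned with the query, much larger norm

print("cosine:", cosine_similarity(query, aligned_small), cosine_similarity(query, skewed_large))
print("dot:   ", dot(query, aligned_small), dot(query, skewed_large))
```

With these toy values the two measures rank the candidates in opposite orders, which is why the choice matters when embeddings are not normalised.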

## Text chunking strategy
We will aim for context-aware text chunks. This can be file / format specific (e.g. Markdown, HTML or code)
and will allow for some overlap between these text chunks.
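
As a rough illustration of what overlap means, the sketch below cuts a string into fixed-size character windows that share a few characters with their neighbours. This is a deliberately simplified stand-in: it is not context-aware and is not how the langchain splitter used later works internally. The function name and sample values are hypothetical.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Split text into fixed-size character windows sharing `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "Call me Ishmael. Some years ago, never mind how long precisely."
chunks = chunk_with_overlap(sample, chunk_size=24, overlap=8)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk starts with the last eight characters of the previous one, so a sentence cut at a boundary still appears whole in at least part of a neighbouring chunk.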

## Environment setup

Note, I'd recommend you pick a Python kernel built to match the requirements specified in pyproject.toml and use Poetry to manage the dependencies and virtual environment. I have added the pip magic command to install the relevant packages in this notebook, but it is unnecessary if you set up an appropriate virtual environment.
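
If you are unsure whether your kernel already has what it needs, a quick check along these lines can help. It is only a sketch using the standard library; the package list is a subset of the ones installed below and the helper name is hypothetical.

```python
from importlib import metadata

def installed_versions(distribution_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distribution_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# A subset of the packages this notebook installs
wanted = ["chromadb", "langchain", "sentence-transformers", "pypdf"]
for name, version in installed_versions(wanted).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```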

## Example data
I've downloaded various books that are out of copyright. These are:
* Moby Dick converted to five formats:
  * txt
  * docx
  * pdf (but with text encoded)
  * markdown
  * html
* Pride and Prejudice as a PDF
* Romeo and Juliet as a PDF

In [None]:
%pip install chromadb langchain PyYAML sentence-transformers pypdf ipykernel unstructured markdown docx2txt tiktoken InstructorEmbedding accelerate bitsandbytes

We first set up some constants with paths to our data.

Please note, these are hardcoded and need to be adjusted to match where you set up your embedding model and your document data.
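
One way to avoid editing the paths by hand is to read them from environment variables with a fallback. The sketch below is an optional alternative, not what this notebook does; the variable names `RAG_WORKING_DIR` and `RAG_MODELS_DIR` and the fallback locations are hypothetical.

```python
import os
from pathlib import Path

def resolve_dir(env_var, default):
    """Return the directory named by an environment variable, else a default."""
    return Path(os.environ.get(env_var, default)).expanduser()

# RAG_WORKING_DIR and RAG_MODELS_DIR are hypothetical variable names
working_dir = resolve_dir("RAG_WORKING_DIR", Path.cwd())
models_dir = resolve_dir("RAG_MODELS_DIR", Path.cwd() / "reference_models")

# Same sub-folder layout as the hardcoded paths in this notebook
documents_dir = working_dir / "data" / "documents"
embedding_dir = models_dir / "embedding_models" / "instructor-xl"
print(documents_dir)
print(embedding_dir)
```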

In [2]:
import chromadb  # For creating, managing and interacting with the local vector store
import sentence_transformers  # Nominal for default sentence_transformers but instructor model used instead
import os  # For file-path operators
import uuid  # Used for generating unique IDs for document chunks

core_working_directory = r"C:\Users\Alex\Google Drive\projects\llama2_retrieval_augmented_generation"
document_data_dir_path = os.path.join(core_working_directory, "data", "documents")

# Note that the name is in the convention for huggingface.co and this model is apache 2.0 licensed.
reference_models_directory = r"F:\reference_models"
embedding_model_name_used = "hkunlp/instructor-xl"
embedding_model_path = os.path.join(reference_models_directory, "embedding_models", "instructor-xl")

# Hugging Face local example in a Colab notebook
# https://colab.research.google.com/drive/12v2ZBIucDZ-MBTX4VGEMJR4Fxf-EOYN0#scrollTo=JCb7algHVxeI
# https://discuss.huggingface.co/t/using-hugging-face-models-with-private-company-data/56403
# Parameters
llm_model_name_used = "tiiuae/falcon-7b-instruct"
llm_model_path = os.path.join(reference_models_directory,
                              "large_language_models",
                              "llama2",
                              "falcon-7b-instruct")

# Somewhat arbitrary but shouldn't be too large, to avoid hitting issues with context window size on various LLMs.
# Without context window extension methods, the local LLM should have a 2048ish token context window,
# so we want to keep chunks small enough that a few chunks of context can be added to a prompt.
# Note these sizes are measured in characters, as the text splitter below uses len as its length function.
n_size_in_doc_chunk = 512
n_size_in_chunk_overlap = 32
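
To sanity-check that a few chunks really fit in a prompt, here is a back-of-the-envelope calculation. The characters-per-token ratio and the number of tokens reserved for the question and answer are assumptions for illustration only; real values depend on the tokenizer and the prompt template.

```python
# Values taken from the configuration cell above
chunk_size_chars = 512
context_window_tokens = 2048

# ASSUMPTION: roughly 4 characters per token for English prose; the true
# ratio depends on the tokenizer and the text.
assumed_chars_per_token = 4
# ASSUMPTION: tokens set aside for the question, prompt template and answer
assumed_reserved_tokens = 512

tokens_per_chunk = chunk_size_chars // assumed_chars_per_token
max_chunks_in_prompt = (context_window_tokens - assumed_reserved_tokens) // tokens_per_chunk

print(tokens_per_chunk)       # 128
print(max_chunks_in_prompt)   # 12
```

Under these assumptions there is comfortable room for the three chunks retrieved per query later in the notebook.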

From here, the intent is to load the files, turn them into reasonably sized chunks of text, add some metadata to each chunk and then save them to the vector database.

In [2]:
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader
from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.document_loaders import UnstructuredMarkdownLoader
from langchain.document_loaders import PyMuPDFLoader
from langchain.document_loaders import Docx2txtLoader
# from langchain.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note, below is pieced together from langchain docs here:
# https://python.langchain.com/docs/modules/data_connection/document_transformers/
# from langchain.document_loaders import DocxLoader
file_endings_considered = {"markdown": "md",
                           "html": "html",
                           "text": "txt",
                           "pdf": "pdf",
                           "word_doc": "docx"}

file_globbers = {file_type: os.path.join("**", f"*.{file_ending}")
                 for file_type, file_ending in file_endings_considered.items()}

# Build a markdown file loader
markdown_file_loader = DirectoryLoader(document_data_dir_path,
                                       glob = file_globbers['markdown'],
                                       show_progress = True,
                                       loader_cls = UnstructuredMarkdownLoader)

# Build a text file loader handling different text encodings
text_loader_kwargs= {'autodetect_encoding': True}
text_file_loader = DirectoryLoader(document_data_dir_path,
                                   glob = file_globbers['text'],
                                   show_progress = True,
                                   loader_cls = TextLoader,
                                   loader_kwargs=text_loader_kwargs)

# Build an HTML file loader
html_file_loader = DirectoryLoader(document_data_dir_path,
                                   glob = file_globbers['html'],
                                   show_progress = True,
                                   loader_cls = UnstructuredHTMLLoader)

# Build a pdf file loader
pdf_file_loader = DirectoryLoader(document_data_dir_path,
                                  glob = file_globbers['pdf'],
                                  show_progress = True,
                                  loader_cls = PyMuPDFLoader)

# Build docx file loader
docx_file_loader = DirectoryLoader(document_data_dir_path,
                                  glob = file_globbers['word_doc'],
                                  show_progress = True,
                                  loader_cls = Docx2txtLoader)

# We also define our text splitter here; the same one is used for every file type
default_text_splitter = RecursiveCharacterTextSplitter(
    # Chunk size and overlap are measured in characters because length_function is len
    chunk_size = n_size_in_doc_chunk,
    chunk_overlap = n_size_in_chunk_overlap,
    length_function = len,
    is_separator_regex = False,
)

# Note we'd use rapidocr-onnxruntime too if we downgraded the Python version to suit it, as it needs <3.12
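
The glob patterns built in the cell above have the form `**/*.<ending>`. The sketch below shows what such a pattern matches, assuming the loader resolves it the way `pathlib` does. The file names are hypothetical and created empty in a temporary folder.

```python
import os
import tempfile
from pathlib import Path

file_endings = {"markdown": "md", "pdf": "pdf"}
globbers = {kind: os.path.join("**", f"*.{ending}")
            for kind, ending in file_endings.items()}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical files, created empty just to exercise the patterns
    (root / "nested").mkdir()
    for relative in ["a.pdf", "nested/b.pdf", "notes.md", "skip.txt"]:
        (root / relative).touch()

    # "**" matches zero or more directories, so top-level files are included
    matches = {kind: sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))
               for kind, pattern in globbers.items()}

print(matches)
```

Files in the top-level folder and in sub-folders are both picked up, while files with other endings are ignored.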

In [3]:
loaded_markdown_files = markdown_file_loader.load()
loaded_text_files = text_file_loader.load()
loaded_html_files = html_file_loader.load()
loaded_pdf_files = pdf_file_loader.load()
loaded_docx_files = docx_file_loader.load()

100%|██████████| 1/1 [00:16<00:00, 16.29s/it]
100%|██████████| 1/1 [00:00<00:00, 71.42it/s]
100%|██████████| 1/1 [00:12<00:00, 12.36s/it]
100%|██████████| 3/3 [00:02<00:00,  1.46it/s]
100%|██████████| 1/1 [00:01<00:00,  1.67s/it]


After loading the documents, we split the text into chunks; the splitter carries each document's metadata over to its chunks.

In [4]:
# Text chunking is built together from langchain documentation here:
# https://python.langchain.com/docs/modules/data_connection/document_transformers/
# Note that it does propagate the metadata
markdown_file_chunks = default_text_splitter.split_documents(loaded_markdown_files)
text_file_chunks = default_text_splitter.split_documents(loaded_text_files)
html_file_chunks = default_text_splitter.split_documents(loaded_html_files)
pdf_file_chunks = default_text_splitter.split_documents(loaded_pdf_files)
docx_file_chunks = default_text_splitter.split_documents(loaded_docx_files)
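
The metadata propagation noted in the cell above can be pictured with a small stand-in. This is a simplified sketch, not the langchain implementation: `SimpleDocument` and `split_keeping_metadata` are hypothetical names, and the split is a plain fixed-size cut without overlap.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def split_keeping_metadata(documents, chunk_size):
    """Cut each document into chunks, copying the parent's metadata onto each."""
    chunks = []
    for document in documents:
        text = document.page_content
        for start in range(0, len(text), chunk_size):
            chunks.append(SimpleDocument(
                page_content=text[start:start + chunk_size],
                metadata=dict(document.metadata),  # copy so chunks don't share a dict
            ))
    return chunks

page = SimpleDocument("Call me Ishmael. " * 4, {"source": "moby_dick.txt", "page": 0})
pieces = split_keeping_metadata([page], chunk_size=30)
print(len(pieces), pieces[0].metadata)
```

Because every chunk carries its source and page, retrieved chunks can later be traced back to the file they came from.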

We load up a local embedding model and then get ready to embed the text and save it to a chromadb.

Some of the example interfaces were taken from this [tutorial](https://realpython.com/chromadb-vector-database/#get-started-with-chromadb-an-open-source-vector-database).

We also specifically referenced this [Google Colab example](https://colab.research.google.com/drive/17eByD88swEphf-1fvNOjf_C79k0h2DgF?usp=sharing#scrollTo=A-h1y_eAHmD-)
which is linked to this [YouTube video](https://www.youtube.com/watch?v=cFCGUjc33aU&t=242s) produced by Sam Witteveen.

In [5]:
persistent_chromadb_location = os.path.join(core_working_directory, 'doc_db')
default_embedding_model = "all-MiniLM-L6-v2"
document_collection_name = "simple_test_of_chunks"

local_vector_db_client = chromadb.PersistentClient(path = persistent_chromadb_location)

Note: an earlier version of this example tried the default sentence_transformers package through Hugging Face before realising that the Instructor model needs its own modified package (InstructorEmbedding).

In [6]:
generic_instructor_document_task = "Represent the general document chunk for retrieval:"
generic_instructor_query_task = "Represent the general question for retrieving supporting documents:"
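
Instructor-style models embed an instruction together with the text, with a different instruction for stored chunks and for queries. The sketch below shows only that pairing step, assuming the usual `[instruction, text]` pair format; it does not call the model, and the helper names are hypothetical.

```python
document_instruction = "Represent the general document chunk for retrieval:"
query_instruction = "Represent the general question for retrieving supporting documents:"

def build_document_pairs(chunks):
    """Pair every chunk with the document-side instruction."""
    return [[document_instruction, chunk] for chunk in chunks]

def build_query_pair(question):
    """Pair a single question with the query-side instruction."""
    return [query_instruction, question]

pairs = build_document_pairs(["Call me Ishmael.", "It is a truth universally acknowledged"])
print(pairs[0])
print(build_query_pair("Who narrates Moby Dick?"))
```

Using separate instructions for the two sides is what suits this model to the asymmetric query-versus-document setting described earlier.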

In [7]:
from langchain.embeddings import HuggingFaceInstructEmbeddings

# Model configuration options: run the embedding model on the CPU
embedding_model_kwargs = {'device': 'cpu'}

# Encoding options: leave embeddings unnormalised and embed one chunk at a time
embedding_encode_kwargs = {'normalize_embeddings': False,
                           'batch_size': 1}

# Initialise HuggingFaceInstructEmbeddings with the document and query instructions
instructor_xl_embedding = HuggingFaceInstructEmbeddings(
    model_name = embedding_model_path,
    embed_instruction = generic_instructor_document_task,
    query_instruction = generic_instructor_query_task,
    model_kwargs = embedding_model_kwargs,
    encode_kwargs = embedding_encode_kwargs
)
# chromadb_instruct_xl_embedding = embedding_functions.InstructorEmbeddingFunction(
#     model_name = embedding_model_path,
#     device = "cpu"
# )

load INSTRUCTOR_Transformer
max_seq_length  512


We now define a document collection built off the local ChromaDB client and go through and embed all of our document chunks inside it.

In [9]:
from langchain.vectorstores import Chroma

In [14]:
langchain_chroma_db = Chroma(
    client = local_vector_db_client,
    collection_name = document_collection_name,
    embedding_function = instructor_xl_embedding,
    persist_directory = persistent_chromadb_location
)

In [15]:
langchain_chroma_db.add_documents(
    pdf_file_chunks[0:100]
)

# Note, the langchain add_documents approach avoids the trouble of adding chunks manually, such as via a loop and
# doc_collection_for_rag.add(
#     ids = chunk_id,
#     embeddings = chunk_context_embedding,
#     metadatas = chunk_metadata,
#     documents = chunk_content_with_instruction
# )
# where I'd need to unpack those values from the langchain documents myself

['9c78402e-a7bc-11ee-a061-4ccc6af94b0a',
 '9c78402f-a7bc-11ee-8c73-4ccc6af94b0a',
 '9c784030-a7bc-11ee-a476-4ccc6af94b0a',
 '9c784031-a7bc-11ee-b5a2-4ccc6af94b0a',
 '9c784032-a7bc-11ee-83c5-4ccc6af94b0a',
 '9c784033-a7bc-11ee-a1ff-4ccc6af94b0a',
 '9c784034-a7bc-11ee-a405-4ccc6af94b0a',
 '9c784035-a7bc-11ee-91ba-4ccc6af94b0a',
 '9c784036-a7bc-11ee-ab94-4ccc6af94b0a',
 '9c784037-a7bc-11ee-9be7-4ccc6af94b0a',
 '9c784038-a7bc-11ee-b5fc-4ccc6af94b0a',
 '9c784039-a7bc-11ee-a284-4ccc6af94b0a',
 '9c78403a-a7bc-11ee-abf3-4ccc6af94b0a',
 '9c78403b-a7bc-11ee-8676-4ccc6af94b0a',
 '9c78403c-a7bc-11ee-8644-4ccc6af94b0a',
 '9c78403d-a7bc-11ee-b503-4ccc6af94b0a',
 '9c78403e-a7bc-11ee-9f05-4ccc6af94b0a',
 '9c78403f-a7bc-11ee-8fb3-4ccc6af94b0a',
 '9c784040-a7bc-11ee-ae6d-4ccc6af94b0a',
 '9c784041-a7bc-11ee-a9f6-4ccc6af94b0a',
 '9c784042-a7bc-11ee-9254-4ccc6af94b0a',
 '9c784043-a7bc-11ee-bbbc-4ccc6af94b0a',
 '9c784044-a7bc-11ee-b9da-4ccc6af94b0a',
 '9c784045-a7bc-11ee-9c98-4ccc6af94b0a',
 '9c784046-a7bc-
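
For comparison, the manual unpacking mentioned in the comments of the cell above might look roughly like this. It is a sketch under assumptions: `SimpleNamespace` objects stand in for langchain documents, the helper name is hypothetical, and the embeddings themselves are left out because they would come from the model.

```python
import uuid
from types import SimpleNamespace

def unpack_for_collection(chunks):
    """Turn chunk objects into the parallel lists a manual add would need."""
    ids = [str(uuid.uuid4()) for _ in chunks]
    documents = [chunk.page_content for chunk in chunks]
    metadatas = [chunk.metadata for chunk in chunks]
    return {"ids": ids, "documents": documents, "metadatas": metadatas}

# Hypothetical stand-ins for langchain Document objects
example_chunks = [
    SimpleNamespace(page_content="Call me Ishmael.",
                    metadata={"source": "moby_dick.pdf", "page": 0}),
    SimpleNamespace(page_content="Some years ago, never mind how long precisely.",
                    metadata={"source": "moby_dick.pdf", "page": 0}),
]
payload = unpack_for_collection(example_chunks)
print(payload["documents"])
```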

In [22]:
# Count all PDF chunks as a sense check on how long embedding all three PDF documents would take
len(pdf_file_chunks)

4898

We ask the langchain Chroma interface to persist the database to disk after adding documents, clear it from memory and then reload it from disk.

In [16]:
langchain_chroma_db.persist()

In [17]:
langchain_chroma_db = None

In [18]:
import gc

# Clean up document datasets to save memory
del loaded_markdown_files
del loaded_text_files
del loaded_html_files
del loaded_pdf_files
del loaded_docx_files
del markdown_file_chunks
del text_file_chunks
del html_file_chunks
del pdf_file_chunks
del docx_file_chunks

gc.collect()


495

In [30]:
# Now we can load the persisted database from disk, and use it as normal. 
langchain_chroma_db = Chroma(
    collection_name = document_collection_name,
    persist_directory = persistent_chromadb_location, 
    embedding_function = instructor_xl_embedding
)

In [31]:
langchain_chroma_db._collection.count()

205

With the persisted database re-established, we can switch the langchain Chroma interface into retriever mode, both to run a simple dummy query and to build a simple retrieval Q&A chain with a local LLM.
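
Conceptually, a retriever embeds the query and returns the k stored chunks closest to it. The toy sketch below does this by brute force with made-up two-dimensional vectors. Cosine similarity is used purely for illustration; the metric and index a real vector store uses depend on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vector, stored, k=3):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(stored,
                    key=lambda item: cosine_similarity(query_vector, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# Made-up 2-D vectors standing in for real embeddings
stored_chunks = [
    ("Project Gutenberg licence text", [0.9, 0.1]),
    ("Chapter 1. Loomings", [0.2, 0.8]),
    ("Table of contents", [0.6, 0.4]),
    ("Whale anatomy", [0.1, 0.9]),
]
print(top_k([1.0, 0.0], stored_chunks, k=3))
```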

In [32]:
document_retriever = langchain_chroma_db.as_retriever(search_kwargs = {"k": 3})

In [33]:
moby_dick_early_book_query = "What project has made the book Moby Dick available?"

# Note that the hugging face interface silently handles adding the query instruction
example_recovered_documents = document_retriever.get_relevant_documents(moby_dick_early_book_query)

In [34]:
example_recovered_documents

[Document(page_content='The Project Gutenberg eBook of Moby Dick; Or, The Whale \n     \nThis ebook is for the use of anyone anywhere in the United States and \nmost other parts of the world at no cost and with almost no restrictions \nwhatsoever. You may copy it, give it away or re-use it under the terms \nof the Project Gutenberg License included with this ebook or online \nat www.gutenberg.org. If you are not located in the United States, \nyou will have to check the laws of the country where you are located \nbefore using this eBook.', metadata={'author': 'Alexander Baker', 'creationDate': "D:20231202201435+10'00'", 'creator': 'Microsoft® Word for Office 365', 'file_path': 'C:\\Users\\Alex\\Google Drive\\projects\\llama2_retrieval_augmented_generation\\data\\documents\\moby_dick_3.pdf', 'format': 'PDF 1.7', 'keywords': '', 'modDate': "D:20231202201435+10'00'", 'page': 0, 'producer': 'Microsoft® Word for Office 365', 'source': 'C:\\Users\\Alex\\Google Drive\\projects\\llama2_retriev

Next, we build the RAG part of the example.

The tutorial followed for this was [this YouTube video](https://www.youtube.com/watch?v=9ISVjh8mdlA) with this linked [Google Colab document](https://colab.research.google.com/drive/1zG1R08TBikG05ecF8et4vi_1F9xutY-6?usp=sharing) by Sam Witteveen.

In [3]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
from langchain import LLMChain
from transformers import AutoTokenizer, pipeline, TextStreamer, AutoModelForCausalLM
import torch

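# Load a local LLM and wrap it for use in the chain
# (variables are named llama2_* but the model path configured earlier points at falcon-7b-instruct)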
llama2_model_tokeniser = AutoTokenizer.from_pretrained(llm_model_path)
llama2_model_streamer = TextStreamer(llama2_model_tokeniser)

llama2_local_model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path = llm_model_path,
    # load_in_8bit = True,
    # device_map = "cpu", # -1 should map to CPU "auto",
    torch_dtype = torch.bfloat16,
    low_cpu_mem_usage = False
)

llama2_model_pipeline = pipeline(
    "text-generation",
    model = llama2_local_model,
    tokenizer = llama2_model_tokeniser,
    trust_remote_code = False,
    max_length = 512,
    do_sample = True,
    top_k = 1,
    num_return_sequences = 1,
    eos_token_id = llama2_model_tokeniser.eos_token_id,
    pad_token_id = llama2_model_tokeniser.eos_token_id,
    streamer = llama2_model_streamer,
)

llama2_instruct_7b_llm = HuggingFacePipeline(pipeline = llama2_model_pipeline)

llama2_local_model = None
llama2_model_pipeline = None

In [None]:
# create the chain to answer questions 
llama2_llm_instructor_embed_qa_chain = RetrievalQA.from_chain_type(
    llm = llama2_instruct_7b_llm, 
    chain_type = "stuff",  # "RAG" is not a valid chain type; "stuff" places all retrieved chunks into one prompt
    retriever = document_retriever, 
    return_source_documents = True)
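
One common way for a retrieval Q&A chain to combine its retrieved chunks is simply to place them all into a single prompt ahead of the question. The sketch below illustrates that idea only; the template wording and function name are hypothetical and are not the prompt langchain uses.

```python
def build_stuffed_prompt(question, chunks):
    """Concatenate retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Shortened example chunks, standing in for retrieved documents
retrieved = [
    "The Project Gutenberg eBook of Moby Dick; Or, The Whale",
    "This ebook is for the use of anyone anywhere in the United States",
]
prompt = build_stuffed_prompt("What project has made the book Moby Dick available?", retrieved)
print(prompt)
```

Because everything goes into one prompt, the combined length of the chunks, question and answer has to stay within the model's context window.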

In [None]:
## Cite sources

import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])