# Document Search using Retrieval Augmented Generation (RAG)

In this project we demonstrate how to use a pre-trained Large Language Model (LLM) in Domino and how to augment it with Retrieval Augmented Generation (RAG) over our own documents to tailor it to our use case. We will use Meta's open source [Llama2 model](https://ai.meta.com/llama/), the [Qdrant vector database](https://qdrant.tech/) and the [LangChain framework](https://www.langchain.com/), so that the entire chain runs on Domino.

In this notebook we will:
1. Fetch and Process the Documents
2. Initialise the Vector Store
3. Fetch and Initialise the Llama2 Model
4. Create the QA chain and test it!

In [1]:
# Import all the dependencies
from qdrant_client import models, QdrantClient
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader
from langchain.vectorstores.qdrant import Qdrant
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from peft import PeftModel, PeftConfig
#
from tqdm.auto import tqdm
from uuid import uuid4
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from sentence_transformers import SentenceTransformer, util
import pandas as pd

#
import os
import random
import torch
import warnings
warnings.filterwarnings('ignore')



## Get Our Documents And Embeddings
In this example we will read the PDF documents in the sample_data folder of our repository.

**Note: You will need to customise this section to your specific use case**

Domino has many ways to access data. Please see our [documentation to find the method that suits your use case](https://docs.dominodatalab.com/en/latest/user_guide/16d9c1/work-with-data/).

In [2]:
# Load the document that you need to parse, please change the location to where the pdf resides

# Load 1 PDF file
# loader = PyPDFLoader("/mnt/code/sample_data/MLOps_whitepaper.pdf")
# or load an entire folder
loader = PyPDFDirectoryLoader("/mnt/code/sample_data/")
data = loader.load_and_split(RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0))
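
The `chunk_size` and `chunk_overlap` arguments control how long each chunk is and how many characters neighbouring chunks share. The sketch below illustrates the idea with a plain fixed-width window. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally, since that splitter also tries to break on separators such as paragraphs and line breaks. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list:
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0))  # ['abcd', 'efgh', 'ij']
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With `chunk_overlap=0`, as used in this notebook, no text is repeated between chunks; a small overlap can help when a relevant sentence straddles a chunk boundary.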

In [3]:
print(f"There are {len(data)} chunks in the documents")

There are 36 chunks in the documents


In [4]:
# Print a randomly chosen chunk
print(data[random.randint(0, len(data)-1)])

page_content='Why Scaling Data Science is So Hard\nData science is on an all-time tear. On an organizational level, 62 percent of ﬁrms have invested\nover $50 million in big data and AI, with 17 percent investing more than $500 million, according to\na recent survey from New Vantage Partners (NVP). Expectations are just as high as investment\nlevels, with a survey from Data IQ revealing that a quarter of companies expect data science to\nincrease revenue by 11 percent or more. This is a major leap for giant enterprises that are already\ngenerating huge cash ﬂows.\nYet while money is ﬂowing strong, results have not been so rosy. Consider the quotes shown\nbelow, which Domino has heard from data science stakeholders across large enterprises while in\nthe early phase of scaling data science. Their negativity is mirrored in\nﬁve conclusions\nfrom a\nrecent survey by Wakeﬁeld Research and Domino Data Lab about why initiatives are falling short\nof expectations:\n●\nShort-term investment thw

In [5]:
# Separate each chunk into its metadata and its text
metadatas = []
texts = []
for row in data:
  metadatas.append(row.metadata)
  texts.append(row.page_content)
print(len(metadatas),len(texts))

36 36


In [6]:
# Load the embedding model and cache it in our artifacts directory
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}
embedding_model_name = "BAAI/bge-small-en"
os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/mnt/artifacts/model_cache/'
embeddings = HuggingFaceBgeEmbeddings(model_name=embedding_model_name,
                                      model_kwargs=model_kwargs,
                                      encode_kwargs=encode_kwargs
                                     )


## Initialise The Vector Database
Now we can create the collection in the Qdrant Vector Database.

**Note: This step takes several minutes!**

In [7]:
# Persist the embeddings to disk in our artifacts directory
doc_store = Qdrant.from_texts(texts,
                              metadatas=metadatas,
                              embedding=embeddings,
                              path="/mnt/artifacts/local_qdrant/",
                              prefer_grpc=True,
                              collection_name="mlops")
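
At query time the vector store embeds the question and returns the stored chunks whose embeddings are closest to it. The following is a toy sketch of that lookup, assuming cosine similarity as the metric and using hand-written 2-dimensional vectors with a brute-force scan. It is not Qdrant's implementation, and `unit`, `cosine_top_k` and the sample vectors are made up for illustration.

```python
import math


def unit(vector):
    """Scale a vector to length 1 so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]


def cosine_top_k(query, stored, k):
    """Return the indices of the k stored vectors most similar to the query."""
    q = unit(query)
    scores = []
    for index, vector in enumerate(stored):
        similarity = sum(a * b for a, b in zip(q, unit(vector)))
        scores.append((similarity, index))
    scores.sort(reverse=True)  # highest similarity first
    return [index for _, index in scores[:k]]


stored_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(cosine_top_k([1.0, 0.1], stored_vectors, k=2))  # [0, 2]
```

Because the embeddings in this notebook are created with `normalize_embeddings` set to `True`, they already have unit length, so the dot product of two embeddings is their cosine similarity.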

## Initialise The Model

Now that we have the Vector Store and the Embedding Model, we need the Foundation Model that will generate the answers.
In this case we are using the open source Llama2 model Llama-2-7b-chat-hf. In contrast to third-party services like OpenAI, this open source model can be downloaded into your own cloud and run entirely within your enterprise's ecosystem, which gives you tighter control over security and governance.

We will:
1. Set up the prompt for this use case
2. Configure bitsandbytes for the quantisation we need
3. Download, configure and save the Llama2 model

### 1. Set up the prompt

In [8]:
# Setup the prompt template to use for the QA bot
prompt_template = """Use the following pieces of context to answer the question enclosed within  3 backticks at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
Please provide an answer which is factually correct and based on the information retrieved from the vector store.
Please also mention any quotes supporting the answer if any present in the context supplied within two double quotes "" .

{context}

QUESTION:```{question}```
ANSWER:
"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["context","question"])
#
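
When the chain runs, the `{context}` and `{question}` placeholders are replaced with the retrieved text and the user's question. The sketch below mimics that substitution with plain `str.format` on a shortened stand-in template. The stand-in wording, the `fill_prompt` helper and the sample strings are illustrative only.

```python
# Shortened stand-in for the notebook's template (hypothetical wording)
template = (
    "Use the following pieces of context to answer the question.\n\n"
    "{context}\n\n"
    "QUESTION: {question}\n"
    "ANSWER:\n"
)


def fill_prompt(context: str, question: str) -> str:
    """Substitute the retrieved context and the user question into the template."""
    return template.format(context=context, question=question)


filled = fill_prompt(
    context="MLOps covers the whole model lifecycle.",
    question="What does MLOps cover?",
)
print(filled)
```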


### 2. Configure bitsandbytes

In [9]:
# Keyword arguments that pass our prompt to the QA chain
chain_type_kwargs = {"prompt": PROMPT}


################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

# Resolve the dtype name to the corresponding torch dtype (e.g. torch.float16)
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
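
As a rough back-of-the-envelope check on why 4-bit loading helps, the model weights take about (number of parameters × bits per parameter) / 8 bytes. The sketch below assumes a nominal 7 billion parameters and ignores quantisation constants, activations and the attention cache, so real memory usage will be higher. The helper `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NOMINAL_PARAMS = 7e9  # assumption: "7b" taken as 7 billion parameters

fp16 = weight_memory_gib(NOMINAL_PARAMS, 16)
int4 = weight_memory_gib(NOMINAL_PARAMS, 4)
print(f"16-bit weights: ~{fp16:.1f} GiB")  # ~13.0 GiB
print(f"4-bit weights:  ~{int4:.1f} GiB")  # ~3.3 GiB
```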

### 3. Download and configure the model

**Note: This step can take several minutes**

In [10]:
model_id = "NousResearch/Llama-2-7b-chat-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    cache_dir=os.environ['DOMINO_DATASETS_DIR'] + '/' + os.environ['DOMINO_PROJECT_NAME'],
    quantization_config=bnb_config,
    device_map='auto'
)

# Load the Llama 2 tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Reuse the end-of-sequence token for padding. This avoids adding a new '[PAD]'
# token, which would grow the vocabulary without resizing the model embeddings.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


## Putting it all together!

Now that we have our Vector Database loaded with our documents and our configured model, we can create our RAG QA chain.

In [11]:
# Setup the QA chain
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=200)
rag_llm = HuggingFacePipeline(pipeline=pipe)
    
qa_chain = RetrievalQA.from_chain_type(llm=rag_llm,
                                       chain_type="stuff",
                                       chain_type_kwargs={"prompt": PROMPT},
                                       retriever=doc_store.as_retriever(search_kwargs={"k": 5}),
                                       return_source_documents=True
                                      )
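
The `"stuff"` chain type places all retrieved chunks into the prompt's context in a single call to the model. Below is a minimal sketch of that step, assuming the chunks are joined with a blank line between them; the helper `stuff_context` and the sample chunks are hypothetical. With `k=5` and chunks of roughly 1000 characters, the context can reach about 5000 characters, which has to fit in the model's context window together with the generated answer.

```python
def stuff_context(chunks, separator="\n\n"):
    """Join all retrieved chunk texts into one context string."""
    return separator.join(chunks)


retrieved = [
    "Chunk one about scaling data science.",
    "Chunk two about MLOps principles.",
]
context = stuff_context(retrieved)
print(context)

# Approximate upper bound on context length for k chunks of chunk_size characters
k, chunk_size = 5, 1000
max_context_chars = k * chunk_size + (k - 1) * len("\n\n")
print(max_context_chars)  # 5008
```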

Now we can test our model!

Run the following cell and ask a question based on the documents you have added to the vector store. You may want to experiment with the max_new_tokens parameter in the cell above, which caps the length of the generated answer.

In [12]:
# Ask a question
user_question = input("Please provide your question here :")
result = qa_chain(user_question)
print(result['result'])

Please provide your question here : why is scaling data science hard?


Scaling data science is hard because it involves managing a complex lifecycle of model development, deployment, and monitoring. This lifecycle includes various technical and cultural components, such as data management, model development, and collaboration between different stakeholders. Additionally, there are obstacles to achieving data science at scale, including limited product capabilities, unimproved models bringing higher risk, and organizational cultural components. To overcome these obstacles, successful model-driven businesses can take technical principles of MLOps and apply them to the entire data science lifecycle, as well as consider how the same efﬁciencies can apply to processes and people.

REferences:

* New Vantage Partners. (2020). The Data Science Revolution: A Survey of the State of the Art.
* Data IQ. (2020). The Data Science Revolution: A Survey of the State of the Art.
* Wake
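
Because the chain was created with `return_source_documents=True`, the result also includes the retrieved chunks under the `source_documents` key, which helps when checking an answer against the text it was drawn from. The sketch below formats those chunks. `FakeDocument`, `summarize_sources` and the sample values are stand-ins so the example runs on its own, and it assumes each chunk's metadata has `source` and `page` entries, as PDF loaders typically provide.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a retrieved document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(result: dict, preview_chars: int = 80) -> list:
    """Return one line per retrieved chunk: its source, page and a short preview."""
    lines = []
    for doc in result.get("source_documents", []):
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page", "?")
        # Collapse newlines and repeated spaces before truncating the preview
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"{source} (page {page}): {preview}")
    return lines


# Hypothetical result shaped like the chain output above
fake_result = {
    "result": "An answer generated by the model.",
    "source_documents": [
        FakeDocument(
            "Why Scaling Data Science is So Hard\nData science is on an all-time tear.",
            {"source": "example.pdf", "page": 2},
        )
    ],
}
for line in summarize_sources(fake_result):
    print(line)
```

In the notebook, the same helper could be called as `summarize_sources(result)` after the question cell has run.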
