# Document Search using Retrieval Augmented Generation (RAG)

In this project we demonstrate how to use a pre-trained Large Language Model (LLM) in Domino and how to augment it with our own documents using Retrieval Augmented Generation (RAG), tailoring it to our use case. We will use Meta's open-source [Llama2 model](https://ai.meta.com/llama/), the [Qdrant vector database](https://qdrant.tech/) and the [LangChain framework](https://www.langchain.com/) so that the entire chain runs on Domino.

In this notebook we will:
1. Fetch and Process the Documents
2. Initialise the Vector Store
3. Fetch and Initialise the Llama2 Model
4. Create the QA chain and test it!

In [1]:
# Import all the dependencies
from qdrant_client import models, QdrantClient
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader
from langchain.vectorstores.qdrant import Qdrant
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from peft import PeftModel, PeftConfig
#
from tqdm.auto import tqdm
from uuid import uuid4
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from sentence_transformers import SentenceTransformer, util
import pandas as pd

#
import os
import random
import torch
import warnings
warnings.filterwarnings('ignore')



## Get Our Documents And Embeddings
In this example we will read PDF documents from a folder in our repository. The commented-out line in the cell below shows how to load a single PDF from the sample_data folder instead.

**Note: You will need to customise this section to your specific use case**

Domino has many ways to access data. Please see our [documentation to find the method that suits your use case](https://docs.dominodatalab.com/en/latest/user_guide/16d9c1/work-with-data/).

In [2]:
# Load the document that you need to parse, please change the location to where the pdf resides

# Load 1 PDF file
# loader = PyPDFLoader("/mnt/code/sample_data/MLOps_whitepaper.pdf")
# or load an entire folder
loader = PyPDFDirectoryLoader("/mnt/code/data/tiaa")
data = loader.load_and_split(RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0))

In [3]:
print(f"There are {len(data)} chunks in the documents")

There are 63 chunks in the documents
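
The chunk count depends on the `chunk_size` and `chunk_overlap` values passed to the splitter. As a rough illustration only, the sketch below uses a hypothetical helper, `split_fixed`, that cuts text into fixed-size character windows. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunks will differ from these.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters.

    Hypothetical helper for illustration; consecutive windows share
    chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "a" * 3000  # made-up text, 3000 characters long
no_overlap = split_fixed(sample, chunk_size=1000, chunk_overlap=0)
with_overlap = split_fixed(sample, chunk_size=1000, chunk_overlap=200)
print(len(no_overlap), len(with_overlap))
```

A non-zero overlap produces more chunks for the same text, but it reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.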


In [4]:
# Print a randomly chosen chunk
print(data[random.randint(0, len(data)-1)])

page_content='Our interim target reduction is set against a 2019 baseline year.Public Corporate Debt \nThe 2025 interim target for the Public Corporate Bond portfolio will be measured against \nour 2019 baseline carbon intensity (Scope 1 and 2 emissions) expressed in metric tons of \nCO2e/$M sales. In order to arrive at the 2025 target reduction range, the Responsible Investing, \nRisk, General Account and Investment teams collaborated to model a variety of potential carbon \nintensity reduction pathways and their potential investment impact on the public corporate \ndebt portfolio.  As we move toward implementation, we anticipate relying on a combination of \nstrategies to achieve the 2025 target:  \n• Integrating climate-related guidelines in the selection of new investments\n• Roll-off of existing investments as bonds mature\n• Reductions in issuers’ greenhouse gas emissions over time\nASSET CLASS CARBON INTENSITY METRIC 2025 TARGET REDUCTION RANGE\nPublic corporate debt Tons CO2e/m

In [5]:
# Separate each chunk into its metadata and its text
metadatas = []
texts = []
for row in data:
    metadatas.append(row.metadata)
    texts.append(row.page_content)
print(len(metadatas), len(texts))

63 63


In [6]:
# Load the embedding model and cache it in our artifacts directory
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}
embedding_model_name = "BAAI/bge-small-en"
os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/mnt/artifacts/model_cache/'
embeddings = HuggingFaceBgeEmbeddings(model_name=embedding_model_name,
                                      model_kwargs=model_kwargs,
                                      encode_kwargs=encode_kwargs
                                     )
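
Setting `normalize_embeddings` to `True` scales every embedding to unit length, so the dot product of two embeddings equals their cosine similarity. The sketch below illustrates this with plain Python lists. The three-dimensional vectors are made up for the example and are not real model output.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed from unnormalised vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for embeddings (not real model output)
u = [3.0, 4.0, 0.0]
v = [1.0, 2.0, 2.0]

u_hat, v_hat = l2_normalize(u), l2_normalize(v)
# Both numbers agree: cosine of the raw vectors, dot product of the unit vectors
print(round(cosine(u, v), 6), round(dot(u_hat, v_hat), 6))
```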


## Initialise The Vector Database
Now we can create the collection in the Qdrant Vector Database.

**Note: This step takes several minutes!**

In [8]:
# Persist the embeddings to disk in our artifacts directory
doc_store = Qdrant.from_texts(texts,
                              metadatas=metadatas,
                              embedding=embeddings,
                              path="/mnt/artifacts/local_qdrant/",
                              prefer_grpc=True,
                              collection_name="mlops")

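RuntimeError: Storage folder /mnt/artifacts/local_qdrant/ is already accessed by another instance of Qdrant client. If you require concurrent access, use Qdrant server instead.

**Note: The error above is raised when another Qdrant client still holds the local storage folder, for example after re-running this cell in the same kernel. Restart the kernel, or reuse the existing `doc_store`, before running the cell again.**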

## Initialise The Model

Now that we have the Vector Store and the Embedding Model, we need the Foundation Model that we will be using.
In this case we are leveraging the open-source Llama2 model Llama-2-7b-chat-hf. In contrast to third-party services like OpenAI, this open-source model can be downloaded into your cloud and run entirely within your enterprise's ecosystem, which gives you tighter control over security and governance.

We will:
1. Set up the prompt for this use case
2. Configure bitsandbytes for the quantisation we need
3. Download, configure and save the Llama2 model

### 1. Set up the prompt

In [9]:
# Setup the prompt template to use for the QA bot
prompt_template = """Use the following pieces of context to answer the question enclosed within  3 backticks at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
Please provide an answer which is factually correct and based on the information retrieved from the vector store.
Please also mention any quotes supporting the answer if any present in the context supplied within two double quotes "" .

{context}

QUESTION:```{question}```
ANSWER:
"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["context","question"])
#
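
To see what the model actually receives, the sketch below fills a shortened stand-in for the template using Python's built-in `str.format`. It only illustrates how the `{context}` and `{question}` placeholders are substituted. The context chunks and the question are invented for the example, and in the chain itself LangChain's `PromptTemplate` performs the substitution.

```python
TICKS = "`" * 3  # three backticks, built at runtime to keep the example easy to embed

# Shortened stand-in for the template above (illustration only)
demo_template = (
    "Use the following pieces of context to answer the question.\n\n"
    "{context}\n\n"
    "QUESTION:" + TICKS + "{question}" + TICKS + "\n"
    "ANSWER:\n"
)

# Made-up context chunks and question
demo_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
filled = demo_template.format(
    context="\n\n".join(demo_chunks),
    question="What is covered?",
)
print(filled)
```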


### 2. Configure bitsandbytes

In [10]:
# Keyword arguments that pass our prompt to the QA chain
chain_type_kwargs = {"prompt": PROMPT}


################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

# Resolve the compute dtype name to a torch dtype and build the quantisation config
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
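
As a back-of-the-envelope check on why 4-bit loading matters, the sketch below estimates the memory needed for the model weights alone. It assumes roughly 7 billion parameters (taken from the "7b" in the model name) and ignores activations, the KV cache, and quantisation metadata, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: rounded from the "7b" in the model name

for label, bits in [("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, 4-bit weights take about a quarter of the space of float16 weights, which is what makes it feasible to load the model on a single modest GPU.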

### 3. Download and configure the model

**Note: This step can take several minutes**

In [11]:
model_id = "NousResearch/Llama-2-7b-chat-hf"

# Cache the downloaded weights in the project's Domino dataset directory
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    cache_dir=os.path.join(os.environ['DOMINO_DATASETS_DIR'], os.environ['DOMINO_PROJECT_NAME']),
    quantization_config=bnb_config,
    device_map='auto'
)

# Load the Llama 2 tokenizer and reuse the end-of-sequence token for padding.
# No separate '[PAD]' token is added, so the vocabulary stays the size the model expects.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.pad_token_id = model.config.eos_token_id


## Putting it all together!

Now that we have our Vector Database with our documents in it and our configured model, we can create our RAG QA chain.

In [12]:
# Setup the QA chain
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=200)
rag_llm = HuggingFacePipeline(pipeline=pipe)
    
qa_chain = RetrievalQA.from_chain_type(llm=rag_llm,
                                       chain_type="stuff",
                                       chain_type_kwargs={"prompt": PROMPT},
                                       retriever=doc_store.as_retriever(search_kwargs={"k": 5}),
                                       return_source_documents=True
                                      )
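
Conceptually, the retriever returns the `k` chunks whose embeddings are most similar to the question embedding, and the `stuff` chain type concatenates those chunks into the `{context}` slot of the prompt. The sketch below mimics that flow in plain Python. The two-dimensional vectors, the `top_k` helper, and the chunk texts are all hypothetical stand-ins; the real chain delegates the similarity search to Qdrant.

```python
def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors with the highest dot product.

    Assumes all vectors are unit length, so the dot product ranks
    candidates by cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Made-up unit vectors standing in for chunk and question embeddings
toy_chunks = ["chunk about targets", "chunk about bonds", "chunk about emissions"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_query = [0.8, 0.6]

best = top_k(toy_query, toy_vecs, k=2)
# "Stuffing": join the selected chunks into one context string
context = "\n\n".join(toy_chunks[i] for i in best)
print(context)
```

A larger `k` gives the model more context to draw on but also lengthens the prompt, so it has to fit within the model's context window alongside the generated answer.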

Now we can test our model!

Run the following cell and ask a question based on the documents you have added to the vector store. You may want to play with the max_new_tokens parameter in the cell above.

In [None]:
# Ask a question
user_question = input("Please provide your question here :")
result = qa_chain(user_question)
print(result['result'])

Please provide your question here : thast does TIAA feel about climate policy
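
Because the chain was created with `return_source_documents=True`, the result also carries the retrieved chunks under the `source_documents` key. The sketch below shows one way to list them. The `format_sources` helper is hypothetical, and the stand-in objects only imitate the `page_content` and `metadata` attributes of LangChain documents. The `source` and `page` metadata keys are assumed to be what the PDF loaders populate.

```python
from types import SimpleNamespace


def format_sources(result):
    """Build one summary line per retrieved source document."""
    lines = []
    for doc in result.get("source_documents", []):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        # Show only the start of the chunk, flattened onto one line
        snippet = doc.page_content[:60].replace("\n", " ")
        lines.append(f"{source} (page {page}): {snippet}")
    return lines


# Stand-in for a chain result (made-up values, illustration only)
fake_result = {
    "result": "An example answer.",
    "source_documents": [
        SimpleNamespace(
            page_content="Example chunk text\nacross lines.",
            metadata={"source": "example.pdf", "page": 3},
        ),
    ],
}

for line in format_sources(fake_result):
    print(line)
```

With the real chain, calling `format_sources(result)` after the cell above would list the chunks the answer was based on, which helps when checking whether an answer is grounded in the documents.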
