# Document Search using Retrieval Augmented Generation (RAG)

In this project we demonstrate how to use a pre-trained Large Language Model (LLM) in Domino and how to augment it with our own documents through Retrieval Augmented Generation (RAG) so that it fits our use case. We use Meta's open source [Llama2 model](https://ai.meta.com/llama/) and the [Qdrant vector database](https://qdrant.tech/), which lets us run the entire chain on Domino.

In this notebook we will:
1. Fetch and Process the Documents
2. Initialise the Vector Store
3. Fetch and Initialise the Llama2 Model
4. Create the QA chain and test it!

In [1]:
# Import all the dependencies
from qdrant_client import models, QdrantClient
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader
from langchain.vectorstores.qdrant import Qdrant
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain import PromptTemplate
from peft import PeftModel, PeftConfig
#
from tqdm.auto import tqdm
from uuid import uuid4
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from sentence_transformers import SentenceTransformer, util
import pandas as pd

#
import os
import random
import torch
import warnings
warnings.filterwarnings('ignore')



## Get Our Documents And Embeddings
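In this example we read every PDF from a folder in our project's data directory. A commented-out line in the next cell shows how to load a single PDF from the repository's sample_data folder instead.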

**Note: You will need to customise this section to your specific use case**

Domino has many ways to access data. Please see our [documentation to find the method that suits your use case](https://docs.dominodatalab.com/en/latest/user_guide/16d9c1/work-with-data/).

In [2]:
# Load the documents to parse. Change the location to wherever your PDFs reside.

# Load a single PDF file
# loader = PyPDFLoader("/mnt/code/sample_data/DominoDocumentation.pdf")
# or load an entire folder (both environment variables must be set)
pdf_dir = os.path.join("/mnt/data", os.environ['DOMINO_PROJECT_NAME'], os.environ['CUSTOMER_NAME'])
loader = PyPDFDirectoryLoader(pdf_dir)
# Split the pages into chunks of at most 1000 characters with no overlap
data = loader.load_and_split(RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0))
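To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of a fixed-window character splitter. It is a simplified stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter tries to break on separators such as paragraphs and newlines first); the function name `chunk_text` is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each window starts (chunk_size - chunk_overlap) characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
no_overlap = chunk_text(sample, chunk_size=10, chunk_overlap=0)
with_overlap = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(no_overlap)
print(with_overlap)
```

With an overlap of 0, as in the cell above, a sentence that straddles a chunk boundary is cut in two. A small overlap repeats the tail of one chunk at the head of the next, at the cost of more chunks to embed.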

In [3]:
print(f"There are {len(data)} chunks in the documents")

There are 1975 chunks in the documents


In [4]:
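# Print a random sample chunk (random.choice avoids the off-by-one IndexError
# that random.randint(0, len(data)) can raise, since randint includes its upper bound)
print(random.choice(data))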

page_content='Change Assist” (P.5-113) for conditions)\n.No vehicle is detected in the intended\nlane\n.Lane markers are detected\n.Vehicle speed is above approximately\n37 MPH (60 km/h)\n.Driver’s hands must be detected onthe steering wheel\n.Passing Assist is enabled\nThe vehicle will suggest to pass when theslower lead vehicle is detected traveling\nat the following speeds:\nSetting Lead vehicle speed\nSport 3 MPH + (5 km/h) slower\nStandard 6 MPH + (10 km/h) slower\nComfort 9 MPH + (15 km/h) slower\nNOTE:\nThe steering operation by the driver is\nalways prioritized.\nHow to enable/disable Passing Assist:\nVehicle information displaySettings ?Driver Assistance ?Lane\nChange Assist ?Passing Assist ?ON/\nOFF\nHow to change Passing Assist mode:\nSettings ?Driver Assistance ?Lane\nChange Assist ?Passing Setting ?\nSport/Standard/ComfortSystem operation\n1. When a slower vehicle is detected\nahead, the message “Slow Vehicle\nAhead\n Change Lanes Left” ap-\npears.\n2. Ensure that it is sa

In [5]:
# Separate each chunk into its text and its metadata
metadatas = []
texts = []
for row in data:
  metadatas.append(row.metadata)
  texts.append(row.page_content)
print(len(metadatas),len(texts))

1975 1975


In [6]:
# Load the embedding model and cache it in our artifacts directory
model_kwargs = {'device': 'cpu'}
# Normalise the embeddings to unit length
encode_kwargs = {'normalize_embeddings': True}
embedding_model_name = "BAAI/bge-small-en"
os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/mnt/artifacts/model_cache/'
embeddings = HuggingFaceBgeEmbeddings(model_name=embedding_model_name,
                                      model_kwargs=model_kwargs,
                                      encode_kwargs=encode_kwargs
                                     )
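The `normalize_embeddings` option scales each embedding to unit length. The sketch below, using two made-up 2-dimensional vectors and hypothetical helper names, illustrates why that is convenient: for unit vectors, the plain dot product equals the cosine similarity.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed from unnormalised vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, not real embeddings
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Both values agree up to floating point rounding
print(cosine(a, b), dot(a_unit, b_unit))
```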

## Initialise The Vector Database
Now we can create the collection in the Qdrant Vector Database.

**Note: This step takes several minutes!**

In [7]:
# Persist the embeddings to disk in our artifacts directory
doc_store = Qdrant.from_texts(texts,
                              metadatas=metadatas,
                              embedding=embeddings,
                              path="/mnt/artifacts/local_qdrant/",
                              prefer_grpc=True,
                              collection_name="nissan")

## Initialise The Model

Now that we have the vector store and the embedding model, we need the foundation model.
In this case we use the open source Llama2 model Llama-2-7b-chat-hf. In contrast to third-party services like OpenAI, this open source model can be downloaded into your own cloud and run entirely within your enterprise's ecosystem, which gives you tighter control over security and governance.

We will:
1. Set up the prompt for this use case
2. Configure bitsandbytes for the quantisation we need
3. Download, configure and save the Llama2 model

### 1. Set up the prompt

In [8]:
# Setup the prompt template to use for the QA bot
prompt_template = """Use the following pieces of context to answer the question enclosed within  3 backticks at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
Please provide an answer which is factually correct and based on the information retrieved from the vector store.
Please also mention any quotes supporting the answer if any present in the context supplied within two double quotes "" .

{context}

QUESTION:```{question}```
ANSWER:
"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["context","question"])
#
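To show what the model eventually receives, the sketch below fills a shortened, hypothetical version of the template with plain `str.format`. It assumes the retrieved chunks are joined with blank lines before being substituted for `{context}`; the chunk texts and the `build_prompt` helper are made up for illustration.

```python
# Shortened, hypothetical template; the notebook's PROMPT carries more instructions
template = (
    "Use the following pieces of context to answer the question.\n\n"
    "{context}\n\n"
    "QUESTION: {question}\n"
    "ANSWER:"
)


def build_prompt(chunks, question):
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


# Made-up chunks standing in for retrieved documents
chunks = [
    "Passing Assist can be switched on in the settings menu.",
    "Lane markers must be detected for the feature to work.",
]
prompt = build_prompt(chunks, "How do I enable Passing Assist?")
print(prompt)
```

Because every retrieved chunk is pasted into the prompt, the chunk size and the number of retrieved chunks together determine how much of the model's context window the prompt consumes.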


### 2. Configure bitsandbytes

In [9]:
# Prompt arguments for the QA chain that is built later
chain_type_kwargs = {"prompt": PROMPT}


################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

# Resolve the compute dtype name to a torch dtype, then build the quantisation config
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
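A rough back-of-the-envelope calculation shows why 4-bit loading matters. The sketch below assumes a nominal 7 billion parameters (taken from the "7b" in the model name) and counts only the weights. It ignores quantisation constants, activations, the KV cache, and any layers kept in higher precision, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # nominal parameter count, an assumption based on the model name

fp16_gb = weight_memory_gb(n_params, 16)
four_bit_gb = weight_memory_gb(n_params, 4)
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {four_bit_gb:.1f} GB")
# prints "fp16: 14.0 GB, 4-bit: 3.5 GB"
```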

### 3. Download and configure the model

**Note: This step can take several minutes**

In [10]:
model_id = "NousResearch/Llama-2-7b-chat-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    cache_dir="/mnt/artifacts/model_cache/",
    quantization_config=bnb_config,
    device_map='auto'
)

# Load the Llama tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Pad with the end-of-sequence token rather than adding a new '[PAD]' token,
# whose id the model's embedding matrix would not cover without resizing
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Downloading shards: 100%|██████████| 2/2 [00:00<00:00, 10.14it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Loading checkpoint shards: 100%|██████████| 2/2 [01:38<00:00, 49.31s/it]
tokenizer_config.json: 100%|██████████| 746/746 [00:00<00:00, 434kB/s]
tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 5.16MB/s]
tokenizer.json: 100%|██████████| 1.84M/1.84M [00:00<00:00, 9.02MB/s]
added_tokens.json: 100%|██████████| 21.0/21.0 [00:00<00:00, 12.7kB/s]
special_tokens_ma

## Putting it all together!

Now that we have our vector database loaded with our documents and our model configured, we can create our RAG QA chain.

In [11]:
# Setup the QA chain
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=200)
rag_llm = HuggingFacePipeline(pipeline=pipe)
    
qa_chain = RetrievalQA.from_chain_type(llm=rag_llm,
                                       chain_type="stuff",
                                       chain_type_kwargs=chain_type_kwargs,
                                       retriever=doc_store.as_retriever(search_kwargs={"k": 5}),
                                       return_source_documents=True
                                      )
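The retriever's `k` setting controls how many chunks are fetched for each question. As a minimal sketch of the idea, the code below ranks toy unit vectors by dot product with a brute-force scan and returns the indices of the `k` best matches. A vector database may use an index instead of scanning every vector; the function name `top_k` and the vectors are hypothetical.

```python
import heapq


def top_k(query_vec, doc_vecs, k=5):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(doc_vecs)
    ]
    # nlargest returns the highest scores first
    return [idx for _, idx in heapq.nlargest(k, scored)]


# Toy unit-length vectors standing in for chunk embeddings
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
query_vec = [1.0, 0.0]

best = top_k(query_vec, doc_vecs, k=2)
print(best)
```

A larger `k` gives the model more context to draw on but lengthens the prompt, so it trades recall against prompt size and generation time.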

Now we can test our model!

Run the following cell and ask a question based on the documents you have added to the vector store. You may want to adjust the max_new_tokens parameter in the cell above: it caps the length of the generated answer, so a low value can cut the response off mid-sentence.

In [12]:
# Ask a question
user_question = input("Please provide your question here :")
result = qa_chain(user_question)
print(result['result'])

Please provide your question here : how do I change the oil?


The owner's manual of the vehicle provides detailed instructions on how to change the oil. It is recommended to consult the manual for the specific steps and precautions to take when changing the oil in your vehicle.

BACKGROUND:
Changing the oil in a vehicle is an essential maintenance task that helps to keep the engine running smoothly and prolong its lifespan. It involves draining the old oil from the engine and replacing it with new oil. The frequency of oil changes varies depending on the vehicle make and model, as well as the driving conditions. It is typically recommended to change the oil every 5,000 to 7,500 miles, but it may be more frequent for some vehicles.

RELATED QUOTES:

* "Changing the oil in your vehicle is an essential maintenance task that helps to keep the engine running smoothly and prolong its lifespan."
* "It
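Because the chain is created with `return_source_documents=True`, the result also carries the retrieved chunks under the `source_documents` key, which helps when checking whether an answer is grounded in the documents. The sketch below is a hypothetical helper that summarises those sources. It assumes each document exposes `page_content` and a `metadata` dict with `source` and `page` keys, and it uses a stand-in object so that it runs without the model.

```python
from types import SimpleNamespace


def format_sources(result, max_chars=80):
    """Return one summary line per retrieved source document."""
    lines = []
    for doc in result.get("source_documents", []):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        # Collapse whitespace so the snippet fits on one line
        snippet = " ".join(doc.page_content.split())[:max_chars]
        lines.append(f"{source} (page {page}): {snippet}")
    return lines


# Stand-in for the dict returned by the QA chain; the values are made up
fake_result = {
    "result": "placeholder answer",
    "source_documents": [
        SimpleNamespace(
            page_content="Passing Assist\nis enabled",
            metadata={"source": "manual.pdf", "page": 112},
        )
    ],
}

for line in format_sources(fake_result):
    print(line)
```

In the notebook, the same helper could be called as `format_sources(result)` after the question cell.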
