# Build a RAG system with Llama 3 (`Llama-3-8B-Instruct`) for chatting with your PDF files!


In this quick tutorial, we'll build a simple RAG system with Meta's Llama 3, specifically the `Llama-3-8B-Instruct` version.

Key components: the Unstructured API for preprocessing PDF files, LangChain for RAG, FAISS for vector storage, and Hugging Face `transformers` to load the model. Let's go!

Step-1 : Install all the required libraries (e.g., LangChain and FAISS).

In [None]:
!pip install -q unstructured-client "unstructured[all-docs]" langchain transformers accelerate bitsandbytes sentence-transformers faiss-gpu

Step-2 : Get your [free Unstructured API key](https://unstructured.io/api-key-free), and instantiate the Unstructured client to preprocess your PDF file:

In [3]:
import os

os.environ["UNSTRUCTURED_API_KEY"] = "UNSTRUCTURED_API_KEY-GET AND REPLACE FROM THE ABOVE LINK"

In [4]:
from unstructured_client import UnstructuredClient

unstructured_api_key = os.environ.get("UNSTRUCTURED_API_KEY")

client = UnstructuredClient(
    api_key_auth=unstructured_api_key,
    # if using paid API, provide your unique API URL:
    # server_url="YOUR_API_URL",
)
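If the environment variable is missing or still empty, the client only fails later, at request time. A minimal sketch of an early check is shown below; `require_env` is a hypothetical helper (not part of `unstructured_client`), and the `env` argument exists only so the function can be tried without touching the real environment.

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    `require_env` is a hypothetical helper written for this tutorial.
    """
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before creating the client.")
    return value


# Demo with an explicit mapping instead of the real environment.
demo_key = require_env("UNSTRUCTURED_API_KEY", {"UNSTRUCTURED_API_KEY": "  demo-key  "})
print(demo_key)
```

In the notebook, the result of `require_env("UNSTRUCTURED_API_KEY")` could be passed as `api_key_auth` in place of `os.environ.get(...)`.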

Step-3 : Partition and chunk the custom knowledge base (the PDF file) with the Unstructured API, so that the logical structure of the document is preserved for better RAG results.

In [7]:
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError
from unstructured.staging.base import dict_to_elements

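path_to_pdf = "/content/drive/MyDrive/Colab Notebooks/pdf/Retrieval Augmented Generation.pdf"

# Read the PDF and wrap it in the request object the API expects.
with open(path_to_pdf, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=path_to_pdf,
    )

req = shared.PartitionParameters(
    files=files,
    chunking_strategy="by_title",  # keep each section's text together
    max_characters=512,  # upper bound on chunk size, in characters
)

try:
    resp = client.general.partition(req)
except SDKError as e:
    # Stop here: without a response there are no elements to index.
    print(e)
    raise

elements = dict_to_elements(resp.elements)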
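To build some intuition for what title-based chunking does, here is a toy sketch in plain Python. It is not the Unstructured implementation: `chunk_by_title` and the `(kind, text)` tuples are hypothetical stand-ins, and the real API handles many more element types and edge cases.

```python
from typing import List, Tuple


def chunk_by_title(elements: List[Tuple[str, str]], max_characters: int = 512) -> List[str]:
    """Group (kind, text) pairs into chunks that never span two titles.

    Toy illustration only. A single text longer than `max_characters`
    is kept whole here rather than split.
    """
    chunks: List[str] = []
    current = ""
    for kind, text in elements:
        candidate = f"{current}\n{text}" if current else text
        starts_new_section = kind == "Title"
        # A new title, or exceeding the size limit, closes the current chunk.
        if current and (starts_new_section or len(candidate) > max_characters):
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = [
    ("Title", "What is RAG?"),
    ("NarrativeText", "RAG adds retrieved context to the prompt."),
    ("Title", "Indexing pipeline"),
    ("NarrativeText", "Load, split, embed, and store the data."),
]
chunks = chunk_by_title(sample, max_characters=512)
for chunk in chunks:
    print(repr(chunk))
```

With the generous limit, each title stays attached to the paragraph under it, which is the property that helps retrieval: a chunk about the indexing pipeline does not bleed into the previous section.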

Step-4 : Hugging Face authentication. Access tokens programmatically authenticate your identity to the [Hugging Face Hub](https://huggingface.co/settings/tokens).

In [None]:
from huggingface_hub import notebook_login
notebook_login()

Step-5 : Create LangChain documents from the document chunks and their metadata, and ingest those documents into the FAISS vector store.

Set up the retriever.

In [11]:
from langchain_core.documents import Document
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

documents = []
for element in elements:
    metadata = element.metadata.to_dict()
    documents.append(Document(page_content=element.text, metadata=metadata))

db = FAISS.from_documents(documents, HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5"))
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 4})
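Conceptually, the retriever embeds the question and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that idea with hand-made three-dimensional vectors and cosine similarity. It is an illustration under assumptions: the vectors are made up, `top_k` is a hypothetical helper, and the metric FAISS actually uses depends on how the index is configured.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 4) -> List[Tuple[int, float]]:
    """Return (index, score) for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up "embeddings" for four chunks.
doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [0.8, 0.2, 0.0]
results = top_k(query, doc_vectors, k=2)
print([index for index, _ in results])
```

A real index does the same ranking over hundreds of dimensions and uses data structures that avoid comparing the query against every vector.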

Step-6 : Now, let's finally set up Llama 3 for text generation in the RAG system.

This is a gated model, which means you first need to go to the [model's page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), log in, review terms and conditions, and request access to it. To use the model in the notebook, you need to log in with your Hugging Face token (get it in your profile's settings).

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [17]:
from huggingface_hub.hf_api import HfFolder

HfFolder.save_token('YOUR_HF_TOKEN')

Step-7 : Quantization. To run this tutorial on the free Colab GPU, we need to quantize the model:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
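A back-of-the-envelope calculation shows why quantization matters here. The sketch below estimates the memory needed for the weights alone, assuming roughly 8 billion parameters; it ignores activations, the KV cache, and quantization overhead, so real usage is higher. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


approx_params = 8e9  # assumption: about 8 billion parameters
fp16_gib = weight_memory_gib(approx_params, 16)
nf4_gib = weight_memory_gib(approx_params, 4)
print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {nf4_gib:.1f} GiB")
```

At 16 bits the weights alone come to roughly 15 GiB, which leaves little or no headroom on a small GPU, while 4-bit weights need under 4 GiB.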

Step-8 : Set up Llama 3 and a simple RAG chain.
Make sure to follow the prompt format for best results:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>
```
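As a sketch, the format above can be assembled for a single-turn prompt with a small helper. `build_llama3_prompt` is a hypothetical function written for illustration; it ends the string with the assistant header so that the model continues with its answer.

```python
from typing import Optional


def build_llama3_prompt(user_msg: str, system_prompt: Optional[str] = None) -> str:
    """Assemble a single-turn prompt in the Llama 3 Instruct format."""
    parts = ["<|begin_of_text|>"]
    if system_prompt is not None:
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|>"
        )
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{user_msg}<|eot_id|>")
    # Leave the assistant turn open so generation starts with the answer.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example_prompt = build_llama3_prompt(
    "How does RAG help?",
    system_prompt="You are an assistant for answering questions about RAG.",
)
print(example_prompt)
```

Two caveats: a tokenizer that already prepends `<|begin_of_text|>` would duplicate that token, so you may want to drop it from the string, and `tokenizer.apply_chat_template` in `transformers` can build the same structure from a list of messages.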

In [18]:
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=200,
    eos_token_id=terminators,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """
<|start_header_id|>user<|end_header_id|>
You are an assistant for answering questions about RAG.
You are given the extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "I do not know." Don't make up an answer.
Question: {question}
Context: {context}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

llm_chain = prompt | llm | StrOutputParser()
rag_chain = {"context": retriever, "question": RunnablePassthrough()} | llm_chain
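The chain above does three things: it retrieves chunks for the question, fills the prompt template, and generates an answer. The sketch below walks through the same flow with plain-Python stand-ins, so it runs without a GPU. `Doc`, `format_docs`, `run_rag`, the fake retriever, and the echo "LLM" are all hypothetical and only illustrate the data flow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def format_docs(docs: List[Doc]) -> str:
    """Join only the chunk texts, leaving metadata out of the prompt."""
    return "\n\n".join(doc.page_content for doc in docs)


def run_rag(
    question: str,
    retrieve: Callable[[str], List[Doc]],
    generate: Callable[[str], str],
    template: str,
) -> str:
    """Retrieve context, fill the template, and generate an answer."""
    context = format_docs(retrieve(question))
    prompt = template.format(question=question, context=context)
    return generate(prompt)


def fake_retriever(question: str) -> List[Doc]:
    return [Doc("RAG retrieves documents."), Doc("RAG augments the prompt.")]


def echo_llm(prompt: str) -> str:
    # Returns the prompt unchanged so the filled template can be inspected.
    return prompt


template = "Question: {question}\nContext: {context}"
filled = run_rag("How does RAG help?", fake_retriever, echo_llm, template)
print(filled)
```

As written, the notebook's chain hands the retriever's raw list of documents to the prompt, so metadata may end up in the context string. If that turns out to be noisy, a helper like `format_docs` can be piped after the retriever, for example `{"context": retriever | format_docs, ...}`.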

Step-9 : Wow! Your RAG is ready to use. Pass a question, the retriever will add relevant context from your document, and Llama 3 will generate an answer.
Here, my document was a PDF on Retrieval Augmented Generation (RAG).

In [None]:
question = "How does RAG help?"
rag_chain.invoke(question)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


"According to the text, RAG helps by allowing the LLM to access external sources of information, such as proprietary documents and data or even the internet. This means that the LLM's responses are no longer limited to its internal knowledge, but can draw on a much broader range of information. Additionally, RAG enables the LLM to retrieve relevant information related to a user's query, which it can then use to generate a more accurate and informative response."

In [None]:
question = "What are some popular RAG use cases?"
rag_chain.invoke(question)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


"What are some popular RAG use cases?\n\nAccording to our documentation, there are several commercial applications of RAG, including:\n\n* Document Question Answering Systems: This involves using RAG to provide access to proprietary enterprise documents to a large language model (LLM). The LLM can then respond based on the information retrieved from the documents.\n\nThese are just a few examples of how RAG can be used. If you're interested in learning more, I'd be happy to help!"

In [None]:
question = "What are some popular RAG use cases? please give minimum 3 examples"
rag_chain.invoke(question)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


'What a great question!\n\nAccording to our documentation, there are several popular RAG use cases. Here are three examples:\n\n1. **Document Question Answering Systems**: This use case involves using RAG to provide access to proprietary enterprise documents to a Large Language Model (LLM). The retriever searches for the most relevant documents and provides the information to the LLM.\n\n2. **Content Generation**: In this scenario, RAG is used to generate content based on user input or prompts. The model retrieves relevant information from various sources and uses it to generate high-quality content.\n\n3. **Chatbots and Virtual Assistants**: RAG can be applied to chatbots and virtual assistants to enable more accurate and informative conversations with users. The model retrieves relevant information from various sources and uses it to respond to user queries.\n\nThese are just a few examples of the many potential use cases for RAG. I hope this helps!'

In [None]:
question = " In RAG Architecture - What are the five high level steps of an RAG enabled system"
rag_chain.invoke(question)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


'What are the five high-level steps of an RAG-enabled system?\n\nAccording to the RAG architecture, the five high-level steps are:\n\n1. **Search & Retrieval**: This step involves searching relevant information from the knowledge sources and retrieving the necessary data.\n\n2. **Relevant Context**: In this step, the retrieved data is analyzed to identify the relevant context that needs to be considered for the next step.\n\n3. **Augmentation**: Here, the relevant context is added to the prompt, depending on the specific use case.\n\n4. **Prompt + Context**: This step combines the original prompt with the added context to create a new prompt that takes into account the relevant information.\n\n5. **Generation**: Finally, the augmented prompt is used to generate a response or output, which is then returned as the final result.\n\nThese five steps form the core of the RAG architecture, enabling systems to efficiently retrieve and utilize relevant information to generate high-quality resp

In [None]:
question = "What are the Two pipelines become important in setting up the RAG system"
rag_chain.invoke(question)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


'Based on the provided documents, I found that the two pipelines that become important in setting up the RAG system are:\n\n1. **Indexing Pipeline**: This pipeline sets up the knowledge source for the RAG system. It involves four primary steps: loading, splitting, embedding, and storing data.\n\n2. **RAG Pipeline**: This pipeline involves three main steps:\n\t* Search & Retrieval: Searching for relevant information in the knowledge source.\n\t* Augmentation: Adding context to the prompt based on the use case.\n\t* Generation: Using the augmented prompt to generate a response.\n\nThese pipelines work together to enable the RAG system to retrieve relevant information, augment it with context, and generate a response.'

In [19]:
question = "What are the Two pipelines become important in setting up the RAG system"
rag_chain.invoke(question)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


'According to the documents provided, the two pipelines that become important in setting up the RAG system are:\n\n1. **Indexing Pipeline**: This pipeline sets up the knowledge source for the RAG system. It involves four primary steps: Loading, Splitting, Embedding, and Storage of data.\n\nThis pipeline creates the knowledge base by indexing the data from various sources, such as vector databases, and makes it available for querying.\n\n2. **RAG Pipeline**: This pipeline involves the actual RAG process, which takes the user query at runtime and retrieves the relevant data from the index, then passes that to the model for generation.\n\nThe RAG pipeline consists of three main steps: Search & Retrieval, Augmentation, and Generation. These steps involve searching the context from the source (e.g., vector database), adding the context to the prompt depending on the use case, and finally generating the response based on the retrieved information.\n\nThese two pipelines work together to enab