# RAG with Llama 2 and LangChain
Retrieval-Augmented Generation (RAG) is a technique that combines a retriever with a generative language model to deliver accurate responses. It retrieves relevant information from a large corpus and then generates contextually appropriate responses to queries. Here we use a quantized version of the Llama 2 13B LLM with LangChain to perform generative QA with RAG. The notebook has been tested in Google Colab with a T4 GPU. Please change the runtime type to T4 GPU before running the notebook.

## Install Packages

In [None]:
!pip install transformers==4.37.2 optimum==1.12.0 --quiet
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ --quiet
!pip install langchain==0.1.9 --quiet
# !pip install chromadb
!pip install sentence_transformers==2.4.0 --quiet
!pip install unstructured --quiet
!pip install pdf2image --quiet
!pip install pdfminer.six==20221105 --quiet
!pip install unstructured-inference --quiet
!pip install faiss-gpu==1.7.2 --quiet
!pip install pikepdf==8.13.0 --quiet
!pip install pypdf==4.0.2 --quiet
!pip install pillow_heif==0.15.0 --quiet


## Restart Runtime

## Load Llama 2
We will use a GPTQ-quantized version of the Llama 2 13B chat model from Hugging Face for our RAG task.

In [None]:
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline

model_name = "TheBloke/Llama-2-13b-Chat-GPTQ"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens = 512
gen_cfg.temperature = 0.0000001  # near-zero temperature makes sampling almost deterministic
gen_cfg.return_full_text = True
gen_cfg.do_sample = True
gen_cfg.repetition_penalty = 1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



#### Test LLM with Llama 2 prompt structure and LangChain PromptTemplate

In [None]:
from textwrap import fill
from langchain.prompts import PromptTemplate

template = """
<s>[INST] <<SYS>>
You are an AI assistant. You are truthful, unbiased and honest in your response.

If you are unsure about an answer, truthfully say "I don't know"
<</SYS>>

{text} [/INST]
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

text = "Explain artificial intelligence in a few lines"
result = llm.invoke(prompt.format(text=text))
print(fill(result.strip(), width=100))

<s>[INST] <<SYS>> You are an AI assistant. You are truthful, unbiased and honest in your response.
If you are unsure about an answer, truthfully say "I don't know" <</SYS>>  Explain artificial
intelligence in a few lines [/INST]   Sure! Here is my explanation of artificial intelligence:
Artificial intelligence (AI) refers to the development of computer systems that can perform tasks
that typically require human intelligence, such as learning, problem-solving, and decision-making.
These systems use algorithms and machine learning techniques to analyze data and make predictions or
take actions based on that data. Some examples of AI include natural language processing, image
recognition, and autonomous vehicles. Overall, the goal of AI is to create machines that can think
and act like humans, but with greater speed, accuracy, and consistency.
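
The output above echoes the whole prompt before the model's reply. If only the reply is wanted, one option is to keep the text after the closing `[/INST]` tag. The sketch below is a minimal illustration, assuming the reply always follows the last `[/INST]`; `extract_answer` is a hypothetical helper and not part of LangChain or Transformers.

```python
def extract_answer(generated: str, tag: str = "[/INST]") -> str:
    """Return the text that follows the last instruction-closing tag."""
    # rpartition returns ('', '', generated) when the tag is absent,
    # so the whole (stripped) string is returned in that case.
    _, _, answer = generated.rpartition(tag)
    return answer.strip()


sample = (
    "<s>[INST] <<SYS>> You are an AI assistant. <</SYS>> "
    "Explain AI [/INST]   Sure! Here is my explanation."
)
print(extract_answer(sample))
```

The same helper could be applied to `result` from the cell above before printing it.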


In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

## RAG from PDF Files
### A. Create a vector store for the context/external data
Here, we'll create embedding vectors of the unstructured data loaded from the source and store them in a vector store.

#### Download PDF files

In [None]:
!gdown "https://github.com/muntasirhsn/datasets/raw/main/Solar-System-Wikipedia.pdf" # this is just a pdf print of the Solar System page on Wikipedia!

Downloading...
From: https://github.com/muntasirhsn/datasets/raw/main/Solar-System-Wikipedia.pdf
To: /content/Solar-System-Wikipedia.pdf
100% 4.49M/4.49M [00:00<00:00, 14.0MB/s]


#### Load PDF Files
Depending on the type of the source data, we can use the appropriate data loader from LangChain to load the data.

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format

pdf_loader = UnstructuredPDFLoader("/content/Solar-System-Wikipedia.pdf")
pdf_doc = pdf_loader.load()
updated_pdf_doc = filter_complex_metadata(pdf_doc)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


#### Split the document into chunks
Because the context window of an LLM is limited, the data needs to be divided into smaller chunks with a text splitter such as CharacterTextSplitter or RecursiveCharacterTextSplitter. The smaller chunks can then be fed into the LLM.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)
chunked_pdf_doc = text_splitter.split_documents(updated_pdf_doc)
len(chunked_pdf_doc)

240
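
To give a feel for what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of overlapping chunking with a fixed-size sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper used only for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 128) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghij" * 3  # 30 characters
chunks = chunk_text(text, chunk_size=12, chunk_overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a chunk boundary still appears intact in at least one chunk when it is shorter than the overlap.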

#### Create a vector database of the chunked documents with HuggingFace embeddings

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()


We can use either FAISS or Chroma to create the [Vector Store](https://python.langchain.com/docs/modules/data_connection/vectorstores.html).

In [None]:
%%time
# Create the vectorized db with FAISS
from langchain.vectorstores import FAISS
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)

# Create the vectorized db with Chroma
# from langchain.vectorstores import Chroma
# db_pdf = Chroma.from_documents(chunked_pdf_doc, embeddings)

CPU times: user 4.84 s, sys: 19.7 ms, total: 4.86 s
Wall time: 5.13 s
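
Conceptually, a vector store ranks the stored chunks by how close their embeddings are to the embedding of the query. The sketch below illustrates the idea with cosine similarity over made-up 3-dimensional vectors. It is an illustration only: real embeddings have hundreds of dimensions, and FAISS relies on optimized indexes and may use a different distance metric. `cosine_similarity`, `top_k` and `toy_store` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          store: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the texts of the k entries most similar to the query vector."""
    ranked = sorted(store,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy "embeddings", invented for this example.
toy_store = [
    ("The Solar System formed about 4.6 billion years ago.", [0.9, 0.1, 0.0]),
    ("Jupiter and Saturn are gas giants.", [0.2, 0.9, 0.1]),
    ("Transformers use self-attention.", [0.0, 0.1, 0.9]),
]
query_vec = [0.8, 0.2, 0.0]  # stands in for the embedded query
results = top_k(query_vec, toy_store, k=2)
print(results)
```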


### B. Use RetrievalQA chain
We instantiate a RetrievalQA chain from LangChain, which takes a retriever, an LLM and a chain_type as input arguments. When the QA chain receives a query, the retriever retrieves information relevant to the query from the vector store. The `chain_type="stuff"` method stuffs all the retrieved information into the context and makes a call to the language model. The LLM then generates the response from the retrieved documents. [See information on the LangChain retriever](https://python.langchain.com/docs/use_cases/question_answering/vector_db_qa).
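
To make the "stuff" step concrete, the sketch below joins a few retrieved chunks into one context string and fills a prompt template with it. This is a plain-Python illustration of the idea, not LangChain's implementation; `TEMPLATE`, `stuff_prompt` and the blank-line separator are assumptions made for the example.

```python
from typing import List

# Hypothetical, shortened template in the Llama 2 chat style.
TEMPLATE = """<s>[INST] <<SYS>>
Use the following context to answer the question at the end.
<</SYS>>

{context}

Question: {question} [/INST]"""


def stuff_prompt(chunks: List[str], question: str, separator: str = "\n\n") -> str:
    """Concatenate all retrieved chunks and insert them into the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


retrieved = [
    "The Solar System formed 4.568 billion years ago.",
    "The gas giants are mostly made of hydrogen and helium.",
]
final_prompt = stuff_prompt(retrieved, "When was the solar system formed?")
print(final_prompt)
```

Because every retrieved chunk goes into a single prompt, the number and size of the chunks must fit within the model's context window.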

**LLM prompt structure**

We can also pass in the recommended prompt structure for Llama 2 for the QA. In this way, we can instruct our LLM to use only the available context to answer our question. If it cannot find information relevant to our query in the context, it should **NOT** make up an answer; instead, it should say that it is unable to find relevant information in the context.

In [None]:
%%time
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# use the recommended prompt style for the Llama 2 LLM
prompt_template = """
<s>[INST] <<SYS>>
Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

<</SYS>>

{context}

Question: {question} [/INST]
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    # Similarity search is the default way to retrieve documents relevant to a query; set search_type="mmr" to use MMR instead.
    # k defines how many documents are returned; it defaults to 4.
    # score_threshold sets a minimum relevance for returned documents when using the "similarity_score_threshold" search type, e.g.:
    # retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 5, 'score_threshold': 0.8}),
    # return_source_documents=True,  # optional: also return the source documents used to answer the question
    retriever=db_pdf.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)
query = "When was the solar system formed?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  The Solar
System formed 4.568 billion years ago from the gravitational collapse of a region within a large
molecular cloud.[e] This initial cloud was likely several light-years across and probably birthed
several stars.[10] As is typical of molecular clouds, this one consisted mostly of hydrogen, with
some helium, and small amounts of heavier elements fused by previous generations of stars. As the
region that would become the Solar System, known as the pre-solar nebula,[11] collapsed,
conservation of angular momentum caused it to rotate faster. The center, where most of the mass the
collected, became surrounding disc.[10] As the contracting nebula rotated faster, it began to
flatten into a protoplanetary disc
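
The retriever comments in the cell above mention MMR (maximal marginal relevance) as an alternative to plain similarity search. The sketch below shows the general idea on toy 2-dimensional vectors: each pick trades relevance to the query against similarity to documents already selected. It is a simplified illustration, not LangChain's or FAISS's implementation; `mmr`, `lambda_mult` as used here, and the vectors are assumptions made for the example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec: Sequence[float],
        doc_vecs: Sequence[Sequence[float]],
        k: int = 2,
        lambda_mult: float = 0.5) -> List[int]:
    """Return indices of k documents chosen by maximal marginal relevance."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already picked (0.0 if nothing is picked yet).
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


doc_vecs = [
    [1.0, 0.0],    # most relevant to the query
    [0.99, 0.05],  # near-duplicate of the first document
    [0.6, 0.8],    # less relevant, but different in content
]
query_vec = [1.0, 0.0]
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=0.3))  # favours diversity
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With a low `lambda_mult` the near-duplicate is skipped in favour of the more distinct document, while `lambda_mult=1.0` reduces to ordinary similarity ranking.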

In [None]:
%%time
query = "Explain in detail how the solar system was formed."
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  The Solar
System formed 4.568 billion years ago from the gravitational collapse of a region within a large
molecular cloud.[e] This initial cloud was likely several light-years across and probably birthed
several stars.[10] As is typical of molecular clouds, this one consisted mostly of hydrogen, with
some helium, and small amounts of heavier elements fused by previous generations of stars. As the
region that would become the Solar System, known as the pre-solar nebula,[11] collapsed,
conservation of angular momentum caused it to rotate faster. The center, where most of the mass the
collected, became surrounding disc.[10] As the contracting nebula rotated faster, it began to
flatten into a protoplanetary disc

In [None]:
%%time
query = "What are the planets of the solar system composed of? Give a detailed response."
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  11/8/23, 11:20
PM  Solar System  The Solar System[c] is the gravitationally bound system of the Sun and the objects
that orbit it. The largest eight objects, which form a planetary system, are, in order of distance
from the Sun: four terrestrial planets, Mercury, Venus, Earth and Mars; and four giant planets which
include two gas giants, Jupiter and Saturn, and two ice giants, Uranus and Neptune. The terrestrial
planets, which all have a definite surface, are mostly made of rock and metal. The gas giants are
mostly made of hydrogen and helium, while the ice giants are mostly made of volatile substances such
as water, ammonia, and methane. The terrestrial planets are also called the inner Solar System and
the 

### C. Hallucination Check
Hallucination in RAG refers to the generation of content by an LLM that is not based on the retrieved knowledge.

Let's test our LLM with a query that is not relevant to the context. The model should respond that it does not have enough information to respond to this query.

In [None]:
%%time
query = "How does the tranformers architecture work?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  regions such
as the asteroid belt, Kuiper belt, and Oort cloud.[15] The Nice model is an explanation for the
creation of these regions and how the outer planets could have formed in different positions and
migrated to their current orbits through various gravitational interactions.[17]  The expanding Sun
is expected to vaporize Mercury as well as Venus, and render Earth uninhabitable (possibly
destroying it as well). Eventually, the core will be hot enough for helium fusion; the Sun will burn
helium for a fraction of the time it burned hydrogen in the core. The Sun is not massive enough to
commence the fusion of heavier elements, and nuclear reactions in the core will dwindle. Its outer
layers will be ejected

The model responded as expected. The context provided to it does not contain any information on transformer architectures, so it cannot answer this question and does not hallucinate!
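
A manual check like this can be roughly automated by looking for refusal phrases in the model's reply. The sketch below is a crude keyword heuristic, not a rigorous hallucination detector; `REFUSAL_MARKERS` and `looks_like_refusal` are hypothetical names, and the phrase list is an assumption that would need tuning for a given prompt and model.

```python
# Phrases that suggest the model declined to answer (assumed list, lowercase).
REFUSAL_MARKERS = (
    "don't have enough information",
    "do not have enough information",
    "cannot provide",
    "i don't know",
)


def looks_like_refusal(answer: str) -> bool:
    """Return True if the answer contains any of the refusal phrases."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


off_topic_reply = "I'm sorry, but I cannot provide a detailed explanation based on the given context."
on_topic_reply = "The Solar System formed 4.568 billion years ago."
print(looks_like_refusal(off_topic_reply), looks_like_refusal(on_topic_reply))
```

The check should be run on the model's reply only, not on the echoed prompt, because the system prompt itself contains a phrase about not having enough information.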

## RAG from web pages

#### Load the document

In [None]:
from langchain.document_loaders import UnstructuredURLLoader

web_loader = UnstructuredURLLoader(
    urls=["https://en.wikipedia.org/wiki/Solar_System"], mode="elements", strategy="fast",
    )
web_doc = web_loader.load()
updated_web_doc = filter_complex_metadata(web_doc)

#### Split the documents into chunks

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)
chunked_web_doc = text_splitter.split_documents(updated_web_doc)
len(chunked_web_doc)

907

#### Create a vector database of the chunked documents with HuggingFace embeddings

In [None]:
%%time
# Create the vectorized db with FAISS
db_web = FAISS.from_documents(chunked_web_doc, embeddings)

CPU times: user 3.88 s, sys: 28.8 ms, total: 3.91 s
Wall time: 3.93 s


#### RAG with RetrievalQA

In [None]:
%%time
Chain_web = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_web.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)
query = "When was the solar system formed?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  Formation and
evolution of the Solar System  The Solar System formed 4.568 billion years ago from the
gravitational collapse of a region within a large molecular cloud.[f] This initial cloud was likely
several light-years across and probably birthed several stars.[8] As is typical of molecular clouds,
this one consisted mostly of hydrogen, with some helium, and small amounts of heavier elements fused
by previous generations of stars.[9]  The Solar System[c] is the gravitationally bound system of the
Sun and the objects that orbit it.[4] It was formed 4.6 billion years ago when a dense region of a
molecular cloud collapsed, forming the Sun and a protoplanetary disc. The Sun is an ordinary star
that maintains h

In [None]:
%%time
query = "Explain in detail how the solar system was formed."
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  Formation and
evolution of the Solar System  The Solar System formed 4.568 billion years ago from the
gravitational collapse of a region within a large molecular cloud.[f] This initial cloud was likely
several light-years across and probably birthed several stars.[8] As is typical of molecular clouds,
this one consisted mostly of hydrogen, with some helium, and small amounts of heavier elements fused
by previous generations of stars.[9]  The Solar System[c] is the gravitationally bound system of the
Sun and the objects that orbit it.[4] It was formed 4.6 billion years ago when a dense region of a
molecular cloud collapsed, forming the Sun and a protoplanetary disc. The Sun is an ordinary star
that maintains h

In [None]:
%%time
query = "What are the planets of the solar system composed of? Give a detailed response."
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  The four
terrestrial or inner planets have dense, rocky compositions, few or no moons, and no ring systems.
They are composed largely of refractory minerals such as silicates—which form their crusts and
mantles—and metals such as iron and nickel which form their cores. Three of the four inner planets
(Venus, Earth and Mars) have atmospheres substantial enough to generate weather; all have impact
craters and tectonic surface features, such as rift valleys and volcanoes. The term inner planet
should not be confused with inferior planet, which designates those planets that are closer to the
Sun than Earth (i.e. Mercury and Venus).[39]  The four outer planets, also called giant planets or
Jovian planets, collecti

#### Hallucination Check

In [None]:
%%time
query = "How does the tranformers architecture work?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  Composition
Portals:  orbiter.  Simple English  Question: How does the tranformers architecture work? [/INST]
I'm sorry, but I cannot provide a detailed explanation of how the Transformers architecture works
based on the given context. The context only mentions "portals" and "orbiter," which do not seem to
be related to the Transformers architecture. Additionally, the context is written in Simple English,
which may not provide enough information to explain a complex topic like the Transformers
architecture. Without more information or context, it would be difficult for me to provide a
comprehensive answer to your question.
CPU times: user 5.02 s, sys: 92.5 ms, total: 5.11 s
Wall time: 5.11 s


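The model does not suffer from hallucination! Next, we load the Wikipedia page on the Transformer architecture, rebuild the vector store and ask a similar question, so that relevant context is available this time.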

In [None]:
from langchain.document_loaders import UnstructuredURLLoader

web_loader = UnstructuredURLLoader(
    urls=["https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)"], mode="elements", strategy="fast",
    )
web_doc = web_loader.load()
updated_web_doc = filter_complex_metadata(web_doc)

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)
chunked_web_doc = text_splitter.split_documents(updated_web_doc)
len(chunked_web_doc)

384

In [None]:
%%time
# Create the vectorized db with FAISS
db_web = FAISS.from_documents(chunked_web_doc, embeddings)

CPU times: user 1.69 s, sys: 10.4 ms, total: 1.7 s
Wall time: 1.69 s


In [None]:
%%time
Chain_web = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_web.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)
query = "How does transformers work?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  Transformer.
All transformers have the same primary components:  "Transformer architecture" redirects here. For
the design of electrical transformers, see  ^ a b c Alammar, Jay. "The Illustrated Transformer".
jalammar.github.io. Archived from the original on 2020-10-18. Retrieved 2019-10-15.  Question: How
does transformers work? [/INST]  Based on the provided context, I can provide information on how
transformers work. Transformers are a type of neural network architecture that is commonly used for
natural language processing tasks such as machine translation, text classification, and language
modeling.  The primary components of a transformer include self-attention mechanisms and feedforward
networks. Self-
