<a href="https://colab.research.google.com/github/dylansaunders23/DeepLearning/blob/main/Saunders_Dylan_AIT_RAG_Assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Assessment
Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents or,
   * social media posts.

   Make sure that the documents are newer than the training dataset of the applied LLM. (20 points)

2. Create three relevant prompts to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


# Prompts:

**Relevant**:
1.   According to Luk Rombauts, how many babies were born via assisted reproduction technology (ART) in 2018 alone?
2.   According to 'DAM: A Universal Dual Attention Mechanism for Multimodal Timeseries Cryptocurrency Trend Forecasting', Bitcoin's ensued a price drop of what percentage decrease in the month after January 2018?
3.   In paragraph 1 of 'The value of non-traditional credentials in the labor market', how many learners were used for the experiment?

**Irrelevant**:


1.   On what day of the week was Mahdieh Tahmasebi born?
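For the test cells further down, it can help to keep the four prompts in one place. A minimal sketch; the names `RELEVANT_PROMPTS`, `IRRELEVANT_PROMPTS`, and `all_prompts` are illustrative and are not used elsewhere in this notebook:

```python
# Illustrative container for the prompts listed above (all names are hypothetical).
RELEVANT_PROMPTS = [
    "According to Luk Rombauts, how many babies were born via assisted "
    "reproduction technology (ART) in 2018 alone?",
    "According to 'DAM: A Universal Dual Attention Mechanism for Multimodal "
    "Timeseries Cryptocurrency Trend Forecasting', Bitcoin's ensued a price drop "
    "of what percentage decrease in the month after January 2018?",
    "In paragraph 1 of 'The value of non-traditional credentials in the labor "
    "market', how many learners were used for the experiment?",
]
IRRELEVANT_PROMPTS = ["On what day of the week was Mahdieh Tahmasebi born?"]


def all_prompts():
    """Yield (label, prompt) pairs, relevant prompts first."""
    for p in RELEVANT_PROMPTS:
        yield "relevant", p
    for p in IRRELEVANT_PROMPTS:
        yield "irrelevant", p
```

Iterating over `all_prompts()` would then let the same loop drive both the plain LLM test and the RAG test.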





## Installing dependencies



In [4]:
# Quote version specifiers so the shell does not treat '>' as a redirect,
# and send pip's output to /dev/null instead of a file named "null".
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > /dev/null
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > /dev/null
!pip install langchain > /dev/null
!pip install chromadb > /dev/null
!pip install sentence_transformers > /dev/null # ==2.2.2
!pip install unstructured > /dev/null
!pip install pdf2image > /dev/null
!pip install pdfminer.six > /dev/null
!pip install unstructured-pytesseract > /dev/null
!pip install unstructured-inference > /dev/null
!pip install faiss-gpu > /dev/null
!pip install pikepdf > /dev/null
!pip install pypdf > /dev/null
!pip install accelerate > /dev/null
!pip install pillow_heif > /dev/null
!pip install -i https://pypi.org/simple/ bitsandbytes > /dev/null
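Because the pip output is discarded, a failed install would go unnoticed. A small check like the following, run after the kernel restart, can confirm which packages are actually present. This is only a sketch; `installed_versions` is a hypothetical helper, and the package names passed to it are just examples from the list above:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Example call; any entry reported as None was not installed.
print(installed_versions(["transformers", "langchain", "bitsandbytes"]))
```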

# Restarting kernel

In [None]:
import os
os.kill(os.getpid(), 9)

# Imports
Next, we import the necessary Python libraries.

In [1]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

locale.getpreferredencoding = lambda: "UTF-8"

# You need a private Hugging Face User Access Token to access models with an
# accepted licence. Never hard-code the token in a shared notebook; this cell
# assumes it has been exported as the HF_TOKEN environment variable.
import os
HUGGINGFACE_UAT = os.environ["HF_TOKEN"]
login(HUGGINGFACE_UAT)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Next, we download the model from Hugging Face: Llama 3 8B Instruct, loaded with 8-bit quantization to reduce GPU memory use.

In [2]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct" # 8B language model from Meta AI

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # 0.0 # For RAG we would like to have deterministic answers
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
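The configuration above combines `do_sample=True` with a temperature of `1e-7`. The sketch below, in plain Python with made-up logits, illustrates why this behaves almost like greedy decoding: dividing the logits by a tiny temperature pushes nearly all probability mass onto the top-scoring token. `softmax_with_temperature` is an illustrative helper, not part of `transformers`:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
print(softmax_with_temperature(logits, 1.0))   # probability spread over all tokens
print(softmax_with_temperature(logits, 1e-7))  # mass collapses onto the first token
```

In `transformers`, setting `do_sample=False` requests greedy decoding directly, which may be a simpler way to get deterministic output.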


# Templates

In [31]:
template_llama3 = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

template=template_llama3

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)


# Testing (LLM)

In [32]:
text = "According to Luk Rombauts, how many babies were born via assisted reproduction technology (ART) in 2018 alone?"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  According to Luk Rombauts, how many
babies were born via assisted reproduction technology (ART) in 2018
alone?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I apologize, but I couldn't find any
information on a person named Luk Rombauts or their specific statement about the number of babies
born via assisted reproductive technology (ART) in 2018.  However, according to the World Health
Organization (WHO), an estimated 1.5 million babies are born each year through assisted reproductive
technologies (ART) worldwide. This includes in vitro fertilization (IVF), intracytoplasmic sperm
injection (ICSI), and other forms of ART.  In the United States specifically, the Centers for
Disease Control and Prevention (CDC) reported that in 2018, there were approximately 343,671 ART
cycles performed, resulting in 72,000 live births. This represents a birth rate of 11.4 per 1,000
women aged 15-44 years.  Please note that these numbers 

In [33]:
text = "According to 'DAM: A Universal Dual Attention Mechanism for Multimodal Timeseries Cryptocurrency Trend Forecasting', Bitcoin's ensued a price drop of what percentage decrease in the month after January 2018?"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  According to 'DAM: A Universal Dual
Attention Mechanism for Multimodal Timeseries Cryptocurrency Trend Forecasting', Bitcoin's ensued a
price drop of what percentage decrease in the month after January
2018?<|eot_id|><|start_header_id|>assistant<|end_header_id|> A very specific and technical question!
According to the paper "DAM: A Universal Dual Attention Mechanism for Multimodal Time Series
Cryptocurrency Trend Forecasting" (2020), which analyzed Bitcoin's price data from 2014 to 2019, the
authors report that:  "...the price of Bitcoin dropped by approximately 44.5% in the month following
January 2018."  This refers to the period between February 1st, 2018, and March 31st, 2018.  So, to
answer your question, Bitcoin's price decreased by around 44.5% in the month after January 2018.


In [34]:
text = "In paragraph 1 of 'The value of non-traditional credentials in the labor market', how many learners were used for the experiment?"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  In paragraph 1 of 'The value of non-
traditional credentials in the labor market', how many learners were used for the
experiment?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I apologize, but there is no
paragraph 1 in "The value of non-traditional credentials in the labor market" that mentions a
specific number of learners being used for an experiment. It seems you may have provided incorrect
information or context. If you meant to ask about a different text or topic, please feel free to
rephrase your question!


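Prompts 1 and 3: without the documents, the model does not answer the question.

Prompt 2: the model answers confidently, but the answer is incorrect.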
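The outputs above echo the full prompt before the model's reply. A minimal sketch for keeping only the reply, assuming the generated string contains the assistant header used in `template_llama3`; `extract_answer` is a hypothetical helper, not something the pipeline provides:

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"


def extract_answer(generated_text):
    """Return the text after the last assistant header, or the whole text if absent."""
    _, sep, tail = generated_text.rpartition(ASSISTANT_HEADER)
    return (tail if sep else generated_text).strip()


demo = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Hello?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    "Hi there."
)
print(extract_answer(demo))
```

Applied to `result` in the cells above, this would print only the part that follows the assistant header.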

# Loading embeddings

In [12]:
embeddings = HuggingFaceEmbeddings() # default model_name="sentence-transformers/all-mpnet-base-v2"



# Using LangChain RetrievalQA and establishing RAG template



In [13]:
%%time

prompt_template_llama3 = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt_template=prompt_template_llama3


CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
Wall time: 18.1 µs
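To make the role of the two placeholders concrete, here is a rough sketch of what a "stuff"-style chain does with such a template: it concatenates the retrieved chunk texts and substitutes them for `{context}`, with the user query going into `{question}`. The function `stuff_prompt` and the short `demo_template` are illustrative stand-ins, not LangChain code:

```python
def stuff_prompt(template, chunks, question):
    """Join retrieved chunk texts with blank lines and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


# A shortened stand-in for prompt_template_llama3, without the chat markers.
demo_template = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
print(stuff_prompt(demo_template, ["chunk A", "chunk B"], "What is in the chunks?"))
```

With an empty list of chunks the context slot is simply left blank, which is the situation the system instruction ("say you don't have enough information") is meant to handle.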


# Loading PDFs
----
As per the assignment details, I loaded 6 academic papers from arxiv.org. These papers were all published on May 1st, 2024, and thus were not included in Llama 3's training data (as Llama 3 was released on April 18th, 2024).
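The recency claim can be sanity-checked from the arXiv identifiers themselves, since new-style IDs start with a two-digit year and month (`YYMM`). The sketch below compares that month with the Llama 3 release date given above. The identifier month is only a coarse proxy for the actual submission date, and `arxiv_id_month` is a hypothetical helper:

```python
from datetime import date

LLAMA3_RELEASE = date(2024, 4, 18)  # release date stated in the text above


def arxiv_id_month(arxiv_id):
    """Return the first day of the month encoded in a new-style arXiv ID (YYMM.NNNNN)."""
    yymm = arxiv_id.split(".")[0]
    return date(2000 + int(yymm[:2]), int(yymm[2:4]), 1)


# The six identifiers downloaded in the next cell.
ids = ["2405.00234", "2405.00235", "2405.00247", "2405.00357", "2405.00473", "2405.00522"]
print(all(arxiv_id_month(i) > LLAMA3_RELEASE for i in ids))
```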



In [14]:
# All of these papers were released in May 2024
!wget -O document1.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00234"
!wget -O document2.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00235"
!wget -O document3.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00247"
!wget -O document4.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00357"
!wget -O document5.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00473"
!wget -O document6.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00522"

--2024-05-08 19:17:39--  https://arxiv.org/pdf/2405.00234
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.67.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2028478 (1.9M) [application/pdf]
Saving to: ‘document1.pdf’


2024-05-08 19:17:39 (25.1 MB/s) - ‘document1.pdf’ saved [2028478/2028478]

--2024-05-08 19:17:39--  https://arxiv.org/pdf/2405.00235
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.67.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 139328 (136K) [application/pdf]
Saving to: ‘document2.pdf’


2024-05-08 19:17:39 (5.75 MB/s) - ‘document2.pdf’ saved [139328/139328]

--2024-05-08 19:17:39--  https://arxiv.org/pdf/2405.00247
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.67.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... 

In [17]:
from langchain.document_loaders import UnstructuredPDFLoader

# One loader per downloaded PDF (document1.pdf ... document6.pdf)
pdf_paths = [f"/content/document{i}.pdf" for i in range(1, 7)]
loaders = [UnstructuredPDFLoader(fn) for fn in pdf_paths]


Next, we split each document into smaller, overlapping chunks.

In [18]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunked_pdf_doc = []

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)
    documents = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(documents)

len(chunked_pdf_doc)

Loading raw document.../content/document1.pdf
Splitting text...
Loading raw document.../content/document2.pdf
Splitting text...
Loading raw document.../content/document3.pdf
Splitting text...
Loading raw document.../content/document4.pdf
Splitting text...
Loading raw document.../content/document5.pdf
Splitting text...
Loading raw document.../content/document6.pdf
Splitting text...


454
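To illustrate what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which tries to break on natural separators such as paragraphs and sentences first; `sliding_window_chunks` is a hypothetical helper that only shows how consecutive chunks share characters:

```python
def sliding_window_chunks(text, chunk_size=1024, chunk_overlap=256):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Tiny example: windows of 4 characters, each sharing 2 with its predecessor.
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.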

Next, we embed each chunk and store the vectors in a FAISS index.

In [19]:
%%time
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)

CPU times: user 8.77 s, sys: 33.9 ms, total: 8.8 s
Wall time: 11.5 s
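Conceptually, the index answers the question "which chunk vectors are closest to the query vector?". The sketch below shows that idea with cosine similarity on toy two-dimensional vectors, including a top-k cut and a minimum score. The actual scoring inside FAISS and LangChain may differ; cosine similarity is used here only as a stand-in, and `cosine` and `top_k` are hypothetical helpers:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4, score_threshold=0.2):
    """Return (index, score) for the k most similar vectors scoring at least score_threshold."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored = [(i, s) for i, s in scored if s >= score_threshold]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy "chunk embeddings"
print(top_k([1.0, 0.0], docs, k=2))
```

A query that is dissimilar to every stored vector yields an empty result, which is the behaviour the irrelevant prompt relies on later.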


# Testing (LLM + RAG)

In [20]:
%%time
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)
query = "According to Luk Rombauts, how many babies were born via assisted reproduction technology (ART) in 2018 alone?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to Answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  4 2 0 2  r p A 0 3  ]  N G . n o c e [  1 v 4 3 2 0 0 . 5 0 4 2 : v i X r a  Conceiving
Naturally After IVF: the effect of assisted reproduction on obstetric interventions and child health
at birth.  Fabio I. Martinenghi, Xian Zhang, Luk Rombauts and Georgina M. Chambers  1 Introduction
It is estimated that 439,039 babies were born via assisted reproduction technology (ART) in 2018
alone (Chambers et al., 2021b). That same year, almost 5% of all the children born in Australia were
conceived via in-vitro fertilisation (IVF, a form of ART; Newman et al., 2020), while estimates for
the US and Europe are at 1.6% and 4.5%, respectively, and growing (De Geyter et al., 2018; Sunderam

In [29]:
%%time

query = "According to 'DAM: A Universal Dual Attention Mechanism for Multimodal Timeseries Cryptocurrency Trend Forecasting', Bitcoin's ensued a price drop of what percentage decrease in the month after January 2018?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to Answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  [65] Rathan, K., Sai, S. V., Manikanta, T. S. “Crypto-currency price pre- diction using
decision tree and regression techniques.” In 2019 3rd International Conference on Trends in
Electronics and Informatics (ICOEI) IEEE 190–194 (2019) .  [66] Li, Y., Dai, W. “Bitcoin price
forecasting method based on CNN-LSTM hybrid neural network model.” The journal of engineering
2020.13 344– 347 (2020).  [67] Serafini, G., et al. “Sentiment-driven price prediction of the
bitcoin based on statistical and deep learning approaches.” In 2020 International Joint Conference
on Neural Networks (IJCNN) IEEE 1–8 (2020) .  [68] Phaladisailoed, T., Numnonda, T. “Machine
learning models comparison for

In [24]:
%%time

query = "In paragraph 1 of 'The value of non-traditional credentials in the labor market', how many learners were used for the experiment?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to Answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  10  and less likely to be from a developing country (the difference is -0.048 p.p. with
S.E. 0.002). Recall  that gender and level of education are provided voluntarily through the
registration survey, similar to  the LinkedIn URLs. We observe that in the LinkedIn Matched Sample,
learners are more likely to report  their gender or level of education.  Panel B of Table 1 presents
the summary statistics of the outcome variables. In the LinkedIn Matched  Sample, 18% of the
learners found new jobs during the considered period, and 13% of all the learners  found jobs that
we consider in scope, indicating that most of the new jobs reported were related to  certificates.
However, mos

# Irrelevant prompt --> test for hallucinations

In [35]:
%%time

query = "On what day of the week was Mahdieh Tahmasebi born?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to Answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  <|eot_id|><|start_header_id|>user<|end_header_id|>  On what day of the week was Mahdieh
Tahmasebi born?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I don't have enough
information to answer the question.
CPU times: user 2.94 s, sys: 9.65 ms, total: 2.95 s
Wall time: 2.95 s
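In the output above the context slot is empty: no chunk passed the score threshold, and the model then declines to answer. If a literally empty answer is wanted for such prompts, one option is to skip generation whenever retrieval returns nothing. A minimal sketch, in which `answer_or_empty`, `retrieve`, and `generate` are hypothetical names; in this notebook they would have to be wrappers around the retriever and the chain:

```python
def answer_or_empty(question, retrieve, generate):
    """Return "" when retrieval finds nothing; otherwise delegate to generate."""
    docs = retrieve(question)
    if not docs:
        return ""
    return generate(question, docs)


# Stub callables standing in for the real retriever and LLM chain.
def no_hits(question):
    return []


def one_hit(question):
    return ["some retrieved chunk"]


def fake_generate(question, docs):
    return f"answer based on {len(docs)} chunk(s)"


print(repr(answer_or_empty("irrelevant question", no_hits, fake_generate)))
print(repr(answer_or_empty("relevant question", one_hit, fake_generate)))
```

This also avoids spending generation time on prompts the knowledge base cannot support.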
