<a href="https://colab.research.google.com/github/eliza-giane/dl-assessment7/blob/main/Copy_of_AIT_RAG_Assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training dataset of the applied LLM. (20 points)

2. Create three relevant prompts to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


# 0.1. INSTALL DEPENDENCIES:

In [1]:
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > null --use-deprecated=legacy-resolver
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > null --use-deprecated=legacy-resolver
!pip install langchain > null --use-deprecated=legacy-resolver
!pip install chromadb > null --use-deprecated=legacy-resolver
!pip install sentence_transformers > null --use-deprecated=legacy-resolver # ==2.2.2
!pip install unstructured > null --use-deprecated=legacy-resolver
!pip install pdf2image > null --use-deprecated=legacy-resolver
!pip install pdfminer.six > null --use-deprecated=legacy-resolver
!pip install unstructured-pytesseract > null --use-deprecated=legacy-resolver
!pip install unstructured-inference > null --use-deprecated=legacy-resolver
!pip install faiss-gpu > null --use-deprecated=legacy-resolver
!pip install pikepdf > null --use-deprecated=legacy-resolver
!pip install pypdf > null --use-deprecated=legacy-resolver
!pip install accelerate > null
!pip install pillow_heif > null --use-deprecated=legacy-resolver
!pip install -i https://pypi.org/simple/ bitsandbytes > null


In [None]:
# RESTART KERNEL:
import os
os.kill(os.getpid(), 9)

# 0.2. IMPORTS:

In [1]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

locale.getpreferredencoding = lambda: "UTF-8"

# Read your private Hugging Face User Access Token from the HF_TOKEN
# environment variable instead of hard-coding it in the notebook.
# The token is needed to access models with an accepted licence.
import os
HUGGINGFACE_UAT = os.environ["HF_TOKEN"]
login(HUGGINGFACE_UAT)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# 1. COLLECT DATASET

In [2]:
# TRAINING DATASET CUTOFF DATE: March 2023
# DATE OF COLLECTED DATASET: 2024

!wget -O document1.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04403"
!wget -O document2.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04345"
!wget -O document3.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04251"
!wget -O document4.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04097"
!wget -O document5.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04390"

--2024-05-08 20:08:03--  https://arxiv.org/pdf/2405.04403
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.131.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 861504 (841K) [application/pdf]
Saving to: ‘document1.pdf’


2024-05-08 20:08:03 (15.1 MB/s) - ‘document1.pdf’ saved [861504/861504]

--2024-05-08 20:08:03--  https://arxiv.org/pdf/2405.04345
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.131.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9154840 (8.7M) [application/pdf]
Saving to: ‘document2.pdf’


2024-05-08 20:08:03 (74.1 MB/s) - ‘document2.pdf’ saved [9154840/9154840]

--2024-05-08 20:08:03--  https://arxiv.org/pdf/2405.04251
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.131.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195

In [3]:
# LOAD:
from langchain.document_loaders import UnstructuredPDFLoader
# One loader per downloaded PDF (document1.pdf ... document5.pdf)
loaders = [UnstructuredPDFLoader(f"/content/document{n}.pdf") for n in range(1, 6)]

# 2. WRITE PROMPTS

In [4]:
# RELEVANT PROMPTS:
prompt_1 = "Summarize the paper written by scientists Chen Min about 4D Pre-trained Scene Understanding"
prompt_2 = "What were the findings of the paper written by Ammarah Hashmi, Sahibzada Adil Shahzad, et al, about understanding human perception of audiovisual Deepfakes"
prompt_3 = "Summarize the paper written by scientists Somayeh Malekshahi, Javad M. Kheyridoost, and Omid Fatemi on the implementation and evaluation of general models for Detecting Learner Engagement"


# IRRELEVANT PROMPT:
irr_prompt = 'What was Rick Riordan wearing when Brasil won the FIFA World Cup in 2002?'

# 3. LOAD LLM

In [5]:
# must have at least 5B parameters

model_name = "meta-llama/Meta-Llama-3-8B-Instruct" # 8B language model from Meta AI

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # 0.0 # For RAG we would like to have deterministic answers
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
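
The generation config above keeps sampling switched on but sets the temperature very close to zero. The sketch below is a toy illustration of why that behaves almost like greedy decoding: dividing the scores by a tiny temperature pushes nearly all probability mass onto the highest-scoring token. The function name and the three example scores are hypothetical and are not taken from the model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens
toy_logits = [2.0, 1.0, 0.5]

warm = softmax_with_temperature(toy_logits, 1.0)   # ordinary sampling
cold = softmax_with_temperature(toy_logits, 1e-7)  # value used in gen_cfg

print([round(p, 3) for p in warm])
print([round(p, 3) for p in cold])
```

With the near-zero temperature the first token receives essentially all of the probability, so repeated runs should pick the same token at each step.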


# 4. TEST MODEL WITH THE PROMPTS

In [6]:
# without the collected dataset, the model should fail at answering the questions
# change either the prompts or the dataset if it is able to answer them

template = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""


prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)
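
The hand-written template follows the Llama 3 chat layout: each turn is wrapped in header tokens and closed with `<|eot_id|>`, and the prompt ends with an open assistant header. As a rough sketch, the same string can be assembled by a small helper. `build_llama3_prompt` is a hypothetical name, the layout is inferred from the templates in this notebook, and the tokenizer's own chat template remains the authoritative source.

```python
def build_llama3_prompt(user_text, system_text=None):
    """Assemble a single-turn prompt in the Llama 3 chat layout (sketch)."""
    parts = ["<|begin_of_text|>"]
    if system_text is not None:
        parts.append(
            "<|start_header_id|>system<|end_header_id|>\n\n"
            + system_text + "<|eot_id|>"
        )
    parts.append(
        "<|start_header_id|>user<|end_header_id|>\n\n"
        + user_text + "<|eot_id|>"
    )
    # Leave the assistant turn open so the model writes the answer
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

plain = build_llama3_prompt("What is RAG?")
with_system = build_llama3_prompt("What is RAG?", system_text="Answer briefly.")
print(plain)
```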


In [7]:
prompts = [prompt_1, prompt_2, prompt_3, irr_prompt]
for i, text in enumerate(prompts, start=1):
  result = llm(prompt.format(text=text))
  print("\n\nPROMPT", str(i) + ":")
  print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




PROMPT 1:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>  Summarize the paper written by scientists
Chen Min about 4D Pre-trained Scene
Understanding<|eot_id|><|start_header_id|>assistant<|end_header_id|> I apologize, but I couldn't
find any information on a specific paper written by scientist Chen Min about "4D Pre-trained Scene
Understanding". It's possible that the paper may not be widely known or published, or it could be a
misspelling or incorrect name.  However, I can provide some general information on the topic of 4D
scene understanding and pre-training in computer vision.   Scene understanding is a fundamental task
in computer vision, which involves recognizing and interpreting the objects, actions, and
relationships within an image or video sequence. In recent years, there has been growing interest in
developing methods for 4D scene understanding, which refers to the ability to understand scenes
across multiple frames and dimensions (e.g., time, depth).  Pre-trai

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




PROMPT 2:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What were the findings of the paper
written by Ammarah Hashmi, Sahibzada Adil Shahzad, et al, about understanding human perception of
audiovisual Deepfakes<|eot_id|><|start_header_id|>assistant<|end_header_id|> I apologize, but I
couldn't find any information on a specific paper written by Ammarah Hashmi, Sahibzada Adil Shahzad,
and others about "understanding human perception of audiovisual Deepfakes". It's possible that the
paper may not exist or is not widely known.  However, there have been several studies published on
the topic of human perception of deepfakes, including audiovisual deepfakes. Here are some key
findings from recent research:  1. **Perception of authenticity**: Studies have shown that people
can be easily deceived by deepfake videos, with many believing them to be real (e.g., [1]). In one
study, 86% of participants believed a manipulated video was authentic [2]. 2. **Factors influencing
percepti

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




PROMPT 3:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>  Summarize the paper written by scientists
Somayeh Malekshahi, Javad M. Kheyridoost, and Omid Fatemi on the implementation and evaluation of
general models for Detecting Learner
Engagement<|eot_id|><|start_header_id|>assistant<|end_header_id|> I apologize, but I couldn't find
any information on a specific paper written by Somayeh Malekshahi, Javad M. Kheyridoost, and Omid
Fatemi on "Implementation and Evaluation of General Models for Detecting Learner Engagement". It's
possible that the paper may not be publicly available or may not have been published yet.  However,
I can provide you with some general information on detecting learner engagement in educational
settings. Learner engagement is an important aspect of learning, as it has been linked to improved
academic performance, increased motivation, and better retention rates. There are various methods
and techniques used to detect learner engagement, including:  1.

# 5. CREATE LangChain-BASED RAG SYSTEM

In [8]:
# set up a vector database from the documents

# split to smaller pieces:

chunked_pdf_doc = []

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)
    documents = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(documents)

len(chunked_pdf_doc)

Loading raw document.../content/document1.pdf
Splitting text...
Loading raw document.../content/document2.pdf
Splitting text...
Loading raw document.../content/document3.pdf
Splitting text...
Loading raw document.../content/document4.pdf
Splitting text...
Loading raw document.../content/document5.pdf
Splitting text...


384
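
To see what `chunk_size` and `chunk_overlap` do, here is a simplified stand-in that cuts text at fixed character offsets. It is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines, and spaces, so its chunks will not match these exactly. `sliding_chunks` is a hypothetical helper, not a LangChain function.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each chunk repeats the last four characters of the previous one, which is the role the 256-character overlap plays above: a sentence that straddles a boundary still appears whole in at least one chunk.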

In [9]:
%%time
# make vectors:
embeddings = HuggingFaceEmbeddings()
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)



CPU times: user 9.39 s, sys: 608 ms, total: 10 s
Wall time: 13.4 s
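
Once every chunk has an embedding, retrieval amounts to finding the chunk vectors closest to the query vector. The sketch below shows that idea with plain Python lists and cosine similarity. The two-dimensional vectors are made up for illustration, and the FAISS index may use a different distance measure internally, so treat this as a conceptual stand-in rather than a description of what the library does.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_chunks(query_vec, chunk_vecs, k):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Hypothetical 2-D "embeddings" for four chunks
toy_chunk_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
top_two = nearest_chunks([1.0, 0.0], toy_chunk_vecs, k=2)
print(top_two)
```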


In [10]:
template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)
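
The template has two slots: `{context}` receives the retrieved chunks and `{question}` receives the user query. A chain of type "stuff" places all retrieved chunks into a single prompt. The following sketch mimics that step with plain string formatting. The shortened template, the blank-line separator, and the helper name `stuff_prompt` are assumptions made for illustration.

```python
# Shortened, hypothetical template with the same two slots
RAG_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}"
)

def stuff_prompt(template, chunks, question, separator="\n\n"):
    """Join all retrieved chunks and drop them into the context slot."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)

example_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
filled = stuff_prompt(RAG_TEMPLATE, example_chunks, "What does the paper claim?")
print(filled)
```

Because every chunk ends up in one prompt, the number of retrieved chunks times the chunk size has to fit within the model's context window.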


In [11]:
%%time
# test:

Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 10, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)
# query = "What are the risk factors for NVIDIA?"
# result = Chain_pdf.invoke(query)
# print(fill(result['result'].strip(), width=100))

CPU times: user 1.38 ms, sys: 0 ns, total: 1.38 ms
Wall time: 3.24 ms


In [12]:
%%time
# retrieve information:

Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)

CPU times: user 666 µs, sys: 0 ns, total: 666 µs
Wall time: 683 µs
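
The retriever is configured with `k` and `score_threshold`, so chunks scoring below the threshold are dropped. One way to obtain an empty answer for an irrelevant prompt is to skip generation whenever nothing survives that filter. The sketch below shows the idea with hypothetical names (`answer_or_empty`, `fake_generate`) and made-up scores. It is not what `RetrievalQA` does by default, which may still call the LLM with an empty context.

```python
def answer_or_empty(scored_chunks, score_threshold, generate):
    """Return "" when no chunk is relevant enough; otherwise generate."""
    relevant = [text for text, score in scored_chunks
                if score >= score_threshold]
    if not relevant:
        return ""  # nothing retrieved: skip the LLM call entirely
    return generate("\n\n".join(relevant))

def fake_generate(context):
    """Stand-in for the real LLM call (hypothetical)."""
    return "ANSWER BASED ON: " + context

relevant_case = answer_or_empty(
    [("chunk about deepfakes", 0.71), ("chunk about driving", 0.12)],
    score_threshold=0.2,
    generate=fake_generate,
)
irrelevant_case = answer_or_empty(
    [("chunk about driving", 0.05)],
    score_threshold=0.2,
    generate=fake_generate,
)
print(repr(relevant_case))
print(repr(irrelevant_case))
```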


# 6. PROVIDE PROMPTS TO THE RAG SYSTEM

In [14]:
# the RAG system should return relevant answers to the relevant prompts
# and an empty answer to the irrelevant prompt

prompt_1 = "Give a brief summary of the paper DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving"
prompt_2 = "What were the findings of the paper written by Ammarah Hashmi, Sahibzada Adil Shahzad, et al, about understanding human perception of audiovisual Deepfakes"
prompt_3 = "Give a brief summary of the paper 'A General Model for Detecting Learner Engagement:Implementation and Evaluation' written by scientists Somayeh Malekshahi, Javad M. Kheyridoost, and Omid Fatemi on the implementation and evaluation of general models for Detecting Learner Engagement"
irr_prompt = 'What was Rick Riordan wearing when Brasil won the FIFA World Cup in 2002?'
prompts = [prompt_1, prompt_2, prompt_3, irr_prompt]
for i, query in enumerate(prompts, start=1):
  result = Chain_pdf.invoke(query)
  print("\n\nPROMPT", str(i) + ":")
  print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




PROMPT 1:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  4 2 0 2  y a M 7  ]  V C . s c [  1 v 0 9 3 4 0 . 5 0 4 2 : v i X r a  DriveWorld: 4D
Pre-trained Scene Understanding via World Models for Autonomous Driving  Chen Min1,2, Dawei Zhao2*,
Liang Xiao2*, Jian Zhao3, Xinli Xu4, Zheng Zhu5 Lei Jin6, Jianshu Li7, Yulan Guo8, Junliang Xing9,
Liping Jing10, Yiming Nie2, Bin Dai2 1School of Computer Science, Peking University 2Unmanned
Systems Technology Research Center, Defense Innovation Institute 3China Telecom Institute of AI &
NPU 4HKUST 5GigaAI 6BUPT 7Ant Group 8SYSU 9THU 10BJTU minchen@stu.pku.edu.cn, adamzdw@163.com,
xiaoliang@nudt.edu.cn  (a) Monocular 2D Pre-training 2D Pre-text Task  2D to 3D View Transform(c)
Mul

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




PROMPT 2:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  4 2 0 2  y a M 7  ]  V C . s c [  1 v 7 9 0 4 0 . 5 0 4 2 : v i X r a  Unmasking
Illusions: Understanding Human Perception of Audiovisual Deepfakes  Ammarah Hashmi, Sahibzada Adil
Shahzad, Chia-Wen Lin, Fellow, IEEE, and Yu Tsao, Senior Member, IEEE, Hsin-Min Wang, Senior Member,
IEEE  V. LIMITATIONS  We are the first to offer insights into the study of human perception of
audiovisual deepfakes. In addition to its contribu- tion, this study has several limitations that
can be mitigated in future studies to thoroughly explore this field. First, our choice to use a
sample size of 40 may not be sufficient to draw general conclusions about human perception in
audiovisu

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




PROMPT 3:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  4 2 0 2  y a M 7  ]  V C . s c [  1 v 1 5 2 4 0 . 5 0 4 2 : v i X r a  A General Model
for Detecting Learner Engagement: Implementation and Evaluation  Somayeh Malekshahi*
s.malekshahi@ut.ac.ir  Javad M. Kheyridoost javadkheyridoost@gmail.com  Omid Fatemi*
ofatemi@ut.ac.ir  ABSTRACT  2  A General Model for Detecting Learner Engagement: Implementation and
Evaluation  7  A General Model for Detecting Learner Engagement: Implementation and Evaluation
Table 3: Adapting labels of the DAiSEE dataset to two-level engagement detection  Affective State
Policy  Engagement  Confusion  Frustration  Engagement; Boredom  Engagement; Confusion  Engagement ≤
1 → Disengaged Engagem
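
The printed results start with the full prompt, including the system instructions and the retrieved context, so the actual answer only appears after the final assistant header. A small post-processing helper can keep just that part. This is a sketch with a hypothetical function name, and the example string is invented. Setting `return_full_text` to `False` on the text-generation pipeline may be an alternative way to avoid the echo.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"

def extract_answer(full_text):
    """Keep only the text generated after the last assistant header."""
    _, found, answer = full_text.rpartition(ASSISTANT_HEADER)
    if not found:
        # No header present: assume the text is already just the answer
        return full_text.strip()
    return answer.replace("<|eot_id|>", "").strip()

# Invented example of an echoed prompt followed by a generated answer
echoed = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Use the following context...<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "What is the paper about?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "It is about learner engagement.<|eot_id|>"
)
print(extract_answer(echoed))
```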