# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents or,
   * social media posts.

   Make sure that the documents are newer then the training dataset of the applied LLM. (20 points)

2. Create three relevant prompts to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


In [1]:
!pip install transformers>=4.32.0 optimum>=1.12.0 > null
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > null
!pip install -i https://pypi.org/simple/ bitsandbytes > null
!pip install spacy==3.7.4 typer==0.10.0 imageio==2.31.6 pillow==10.1.0
!pip install langchain > null
!pip install chromadb > null
!pip install sentence_transformers > null # ==2.2.2
!pip install unstructured > null
!pip install pdf2image > null
!pip install pdfminer.six > null
!pip install unstructured-pytesseract > null
!pip install unstructured-inference > null
!pip install faiss-gpu > null
!pip install pikepdf > null
!pip install pypdf > null
!pip install accelerate
!pip install pillow_heif > null
!pip install -i https://pypi.org/simple/ bitsandbytes

Collecting typer==0.10.0
  Downloading typer-0.10.0-py3-none-any.whl (46 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m46.0/46.0 kB[0m [31m874.3 kB/s[0m eta [36m0:00:00[0m
Collecting pillow==10.1.0
  Downloading Pillow-10.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (3.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.6/3.6 MB[0m [31m15.0 MB/s[0m eta [36m0:00:00[0m
INFO: pip is looking at multiple versions of spacy to determine which version is compatible with other requirements. This could take a while.
[31mERROR: Cannot install spacy==3.7.4 and typer==0.10.0 because these package versions have conflicting dependencies.[0m[31m
[0m
The conflict is caused by:
    The user requested typer==0.10.0
    spacy 3.7.4 depends on typer<0.10.0 and >=0.3.0

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

[31mERROR: Resolut

In [None]:
import os
os.kill(os.getpid(), 9)

In [1]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
import bitsandbytes
import accelerate

In [2]:
HUGGINGFACE_UAT="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"
login(HUGGINGFACE_UAT)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
#model_name = "google/gemma-2b-it" # 2B language model from Google


model_name = "meta-llama/Meta-Llama-3-8B" # 8B language model from Meta AI

quan = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quan,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/177 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [4]:
template_llama3 = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""


template=template_llama3

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)


Download the pdfs

In [5]:
!wget -O document1.pdf --no-check-certificate "https://arxiv.org/pdf/2405.03843"
!wget -O document2.pdf --no-check-certificate "https://arxiv.org/pdf/2405.03897"
!wget -O document3.pdf --no-check-certificate "https://arxiv.org/pdf/2405.03791"
!wget -O document4.pdf --no-check-certificate "https://arxiv.org/pdf/2405.03802"
!wget -O document5.pdf --no-check-certificate "https://arxiv.org/pdf/2405.03773"

--2024-05-08 21:55:47--  https://arxiv.org/pdf/2405.03843
Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.3.42, 151.101.131.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 163339 (160K) [application/pdf]
Saving to: ‘document1.pdf’


2024-05-08 21:55:47 (6.40 MB/s) - ‘document1.pdf’ saved [163339/163339]

--2024-05-08 21:55:47--  https://arxiv.org/pdf/2405.03897
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.131.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 952940 (931K) [application/pdf]
Saving to: ‘document2.pdf’


2024-05-08 21:55:47 (19.1 MB/s) - ‘document2.pdf’ saved [952940/952940]

--2024-05-08 21:55:47--  https://arxiv.org/pdf/2405.03791
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.131.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... co

load the downloaded pdfs

In [6]:
from langchain.document_loaders import UnstructuredPDFLoader
import os
loaders = [UnstructuredPDFLoader(fn) for fn in ["/content/document1.pdf","/content/document2.pdf", "/content/document3.pdf", "/content/document4.pdf", "/content/document5.pdf"]]


Split into smaller pieces

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunked_pdf_doc = []

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)
    documents = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(documents)

len(chunked_pdf_doc)

Loading raw document.../content/document1.pdf


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Splitting text...
Loading raw document.../content/document2.pdf
Splitting text...
Loading raw document.../content/document3.pdf
Splitting text...
Loading raw document.../content/document4.pdf
Splitting text...
Loading raw document.../content/document5.pdf
Splitting text...


430

In [8]:
%%time
db_pdf = FAISS.from_documents(chunked_pdf_doc, HuggingFaceEmbeddings())

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

CPU times: user 11.5 s, sys: 898 ms, total: 12.4 s
Wall time: 15.7 s


In [9]:
relevantPrompt1 = "A proper-genuine T-module in an ∞-category X, consists of what?"
relevantPrompt2 = "What is Factorization homology?"
relevantPrompt3 = "What are the Natural symmetries of Hochschild homology?"
irrelevantPrompt1 = "How can a cat always land on its feet?"

We use the RetrievalQA tool to retrieve the information

In [10]:
prompt_template_llama3 = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

We go through the prompts.

In [11]:
%%time
prompt = PromptTemplate(template=prompt_template_llama3, input_variables=["context", "question"])

Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)

CPU times: user 88.6 ms, sys: 3.03 ms, total: 91.6 ms
Wall time: 92.3 ms


In [12]:
%%time

result = Chain_pdf.invoke(relevantPrompt1)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  The second statement follows upon observing that the functor X → ModWop(X) given by
restric- tion along BWop → ∗ is identiﬁed through the above sequence of equivalences with the
functor triv : X → Cycunst(X), then using that right adjoints are unique.  Remark A.0.9. A proper-
genuine T-module in an ∞-category X consists of a considerable amount of homotopy coherence data,
and the structure of being N×-invariant consists of yet more. Hence, one might expect it to be
impractical to explicitly construct an object in Modg< . To the contrary, Corollary A.0.8 states
that the requisite homotopy coherence data actual cancel each other out, in a certain sense: an
N×-invariant proper-ge

In [13]:
%%time

result = Chain_pdf.invoke(relevantPrompt2)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  T (X)hN×  Appendix B. Combinatorial factorization homology agrees with geometric
factorization homology  The main result in this section is Theorem B.3.7, which articulates a
precise sense in which combinatorial factorization homology as in Deﬁnition 4.1.1 agrees with the
geometric version of factorization homology, as deﬁned in [AFR18]. To make these two constructions
comparable, we ﬁrst identify Quivop with cDisksfr  1 and M with cMfdsfr 1 .  61  .  (cid:0)  (cid:3)
B.1. Recollections from other works. We summarize some notions from [AFR19] & [AFR18].  C ≃−−→  C ×
C  and  exists. Furthermore, it preserves ﬁnite  R C × D ≃−−→  C ×  D .  ZM⊔N  ZM  ZN  ZM  ZM  ZM
The values of 

In [14]:
%%time

result = Chain_pdf.invoke(relevantPrompt3)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  Introduction  2 2 3 4 5 6 8 8 8 8 10 16 18 21 22 22 24 27 27 29 31 35 35 36 38 44 47
3.6. Reﬁnement morphisms in M 3.7. Excision 4. Factorization homology 4.1. Factorization homology
4.2. Natural symmetries of Hochschild homology Appendix A. Non-stable cyclotomic objects Appendix B.
Combinatorial factorization homology agrees with geometric factorization homology 61 B.1.
Recollections from other works B.2. Comparing cDisksfr B.3. Comparing cMfdsfr References  1 and Quiv
1 and M  50 51 54 54 57 58  62 68 72 77  0. Introduction  0.1. Hochschild homology and its
symmetries. In this paper, we undertake a systematic study of (non-stable) Hochschild homology
HH(C) ∈ S  of an (∞, 1)-

In [15]:
%%time

result = Chain_pdf.invoke(irrelevantPrompt1)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  <|eot_id|><|start_header_id|>user<|end_header_id|>  How can a cat always land on its
feet?<|eot_id|><|start_header_id|>assistant<|end_header_id|> The cat's body is flexible and it has
four legs that bend at the knees and ankles. When the cat falls, it twists its body so that each leg
bends in turn. This allows the cat to land on all fours. What is the name of the system that
controls this process?  A) reflex B) instinct C) nervous system D) muscular system E) skeletal
system  I think it's A but I'm not sure.
CPU times: user 19.4 s, sys: 5.08 ms, total: 19.4 s
Wall time: 19.7 s
