# Retrieval Augmented Generation (RAG)



<sup>This notebook is a part of Natural Language Processing class at the University of Ljubljana, Faculty for computer and information science. Please contact [boshko.koloski@ijs.si](mailto:boshko.koloski@ijs.si) for any comments.</sub>

The core functionality of this notebook is to create a retrieval-augmented generation (RAG) system that enables discussion about the NLP subject setting using the `Mistral-7b-v0.2` model.

General-purpose language models can be fine-tuned for common tasks such as sentiment analysis and named entity recognition, which typically do not require additional background knowledge.

For more complex and knowledge-intensive tasks, it is possible to construct a language model-based system that accesses external knowledge sources. This approach enhances factual consistency, improves the reliability of responses, and mitigates the issue of 'hallucination.'

Researchers have introduced the Retrieval Augmented Generation (RAG) method to address these knowledge-intensive tasks. RAG integrates an information retrieval component with a text generator model, allowing for efficient updates and modifications to its internal knowledge without retraining the entire model.

RAG operates by taking an input, retrieving a set of relevant or supporting documents from a source like Wikipedia, and concatenating these documents with the original input prompt. This concatenated context is then fed to the text generator, which produces the final output. This adaptability is crucial for situations where facts may evolve over time, as the static parametric knowledge of traditional large language models (LLMs) can become outdated. RAG circumvents the need for retraining, providing access to the most current information and enabling reliable outputs through retrieval-based generation.

RAG requires additional document embeddings and the storage of documents in a database for retrieval purposes.


**In simple terms, RAG is to LLMs what an open-book exam is to humans.**

The concept of an open-book exam centers around assessing a student's reasoning abilities rather than their capacity to memorize specific details. In a similar vein, RAG separates factual knowledge from the LLM’s reasoning capabilities. This factual information is stored in an external knowledge source, which is both easily accessible and updatable:

- **Parametric knowledge:** Knowledge that is learned during training and implicitly stored within the neural network's weights.
- **Non-parametric knowledge:** Information that is stored externally, for example, in a vector database.
e.

![RAG](RAG.jpg)

The RAG workflow consists of:

1. **The Retrieve**: The user query is used to retrieve relevant context from an external knowledge source. For this, the user query is embedded using an embedding model into the same vector space as the additional context in the vector database. This enables a similarity search, and the top k closest data objects from the vector database are returned.
2. **Augment**: The user query and the retrieved additional context are incorporated into a prompt template.
3. **Generate**: Finally, the retrieval-augmented prompt is fed to the LLM.

We will use the `langchain` framework to efficiently prompt the LLMs and prepare the RAG.


In [1]:
import os
import tempfile
from pathlib import Path
import joblib
import requests
import torch
import transformers

In [2]:
from IPython.display import display_markdown
from langchain.chains.conversational_retrieval.base import ConversationalRetrievalChain
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter, HTMLHeaderTextSplitter, TokenTextSplitter
from langchain_community.document_loaders import BSHTMLLoader
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain_community.vectorstores.faiss import FAISS

In [3]:
LLM_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
device = f"cuda:{torch.cuda.current_device()}"

Large Language Models are known for their significant computational demands. Typically, the size of a model is determined by multiplying the number of parameters (size) by the precision of these values (data type). To conserve memory, weights can be stored using lower-precision data types through a process known as quantization.

**Post-Training Quantization (PTQ)** is a straightforward technique where the weights of an already trained model are converted to a lower precision without necessitating any retraining. Although easy to implement, PTQ can lead to potential performance degradation.We will employ PTQ using the `bitsandbytes` library and will load the model in 4-bit precision, applying double quantization with the `nf4` data type. For more information about quantization, visit [this guide on quantization](https://huggingface.co/docs/optimum/en/concept_guides/quantization) )pe.


In [4]:
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True, # loading in 4 bit 
    bnb_4bit_quant_type="nf4", # quantization type
    bnb_4bit_use_double_quant=True, # nested quantization 
    bnb_4bit_compute_dtype=torch.bfloat16,
)

In [5]:
model_config = transformers.AutoConfig.from_pretrained(
    pretrained_model_name_or_path=LLM_MODEL,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=LLM_MODEL,
    config=model_config,
    quantization_config=bnb_config, # we introduce the bnb config here.
    device_map="auto",
)
model.eval()

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.


MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
   

We also need to load the tokenizer to transform the text as before.

In [6]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=LLM_MODEL,
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

We will use pipelines from Hugging Face to perform the prompting and generation with the Mistral model.


In [7]:
generate_text = transformers.pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    #temperature=0.0,  
    max_new_tokens=8192,  
    repetition_penalty=1.1,  
)

We will use `langchain` to link the HuggingFace models and the chaining prompting. 

In [8]:
llm = HuggingFacePipeline(pipeline=generate_text)

The core functionality of `langchain` is the creation of templates for prompting via `PromptTemplate`.

In [9]:
from langchain_core.prompts import PromptTemplate

In [10]:
template = """
You are a helpful AI QA assistant. When answering questions, use the context enclosed by triple backquotes if it is relevant.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Reply your answer in markdown format.
Question: {question}
Answer:"""
prompt = PromptTemplate.from_template(template)

In [11]:
prompt

PromptTemplate(input_variables=['question'], template="\nYou are a helpful AI QA assistant. When answering questions, use the context enclosed by triple backquotes if it is relevant.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\nReply your answer in markdown format.\nQuestion: {question}\nAnswer:")

In [12]:
chain = prompt | llm

In [13]:
chain

PromptTemplate(input_variables=['question'], template="\nYou are a helpful AI QA assistant. When answering questions, use the context enclosed by triple backquotes if it is relevant.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\nReply your answer in markdown format.\nQuestion: {question}\nAnswer:")
| HuggingFacePipeline(pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x7f3a3263f970>)

In [14]:
question = "What is the scoring criteria of the NLP course?"

In [15]:
print(chain.invoke({"question": question}))


You are a helpful AI QA assistant. When answering questions, use the context enclosed by triple backquotes if it is relevant.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Reply your answer in markdown format.
Question: What is the scoring criteria of the NLP course?
Answer: I'd be happy to help you with that! However, I need some more context to provide an accurate answer. Could you please specify which NLP (Natural Language Processing) course you are referring to? Different courses may have different scoring criteria. For instance, some courses might focus on grammar and syntax, while others might prioritize semantic understanding or machine learning techniques. If you could provide me with the name or a link to the specific course, I would be glad to look up the details for you. In the meantime, here's a general idea of what you might expect in an NLP course:
- Assignments: These typically involve implementing various NLP algorithms and

In [16]:
question = "What do I have to do for the peer review?"

In [17]:
print(chain.invoke({"question": question}))


You are a helpful AI QA assistant. When answering questions, use the context enclosed by triple backquotes if it is relevant.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Reply your answer in markdown format.
Question: What do I have to do for the peer review?
Answer: Based on the context provided, it appears that you are referring to a peer review process for academic work or a similar context. In general, during a peer review process, you will be expected to submit your work to one or more experts in the field for evaluation and feedback. The specific requirements may vary depending on the guidelines of the organization or institution conducting the review. However, some common tasks include:
1. Submitting your work electronically or physically according to the instructions provided.
2. Providing any necessary supplementary materials, such as data sets or code repositories.
3. Ensuring that all citations and references are properly form

### Enter RAG

Next, we will define the vector embeddings of our context. We will use the `sentence-transformers/all-mpnet-base-v2` model to embed the documents and a FAISS vector store to store and later retrieve them. LangChain offers the `HuggingFaceEmbeddings` interface to easily load any model from Hugging Face to serve as the document representation.


In [18]:
EMBED_MODEL = "sentence-transformers/all-mpnet-base-v2"

In [19]:
embedding = HuggingFaceEmbeddings(
    model_name=EMBED_MODEL,
    model_kwargs={"device": "cuda"},
)

  return self.fget.__get__(instance, owner)()


The `fetch_websites` function will be used to scrape data from our Google Document.


In [20]:
def fetch_websites(sites: list[str]):
    docs = []
    with tempfile.TemporaryDirectory() as tmpdir:
        filename = f"{tmpdir}/site.html"
        for site in sites:
            res = requests.get(site)
            with open(filename, mode="wb") as fp:
                fp.write(res.content)
            docs.extend(BSHTMLLoader(filename).load())         
    return docs

In [21]:
course_webpage = "https://docs.google.com/document/u/1/d/e/2PACX-1vRVLq-QPQ-0vWYhj5_TlVSVClXJNTNf1d0CDG59PdxQtl-10h-kVBGlIVQMuZ7YKjtqkyU9iEcAx2zI/pub"

In [22]:
docs = fetch_websites([course_webpage])

In [23]:
docs[0]

Document(page_content='Natural language processing 2024Objavljeno z Google DokumentiPrijavite zloraboVeč o temNatural language processing 2024Samodejna posodobitev vsakih 5 min.Laboratory work - Spring 2024The main goal of laboratory work is to present the most important aspects of data science in practice and to teach you how to use key tools for a NLP engineer. We especially emphasize on self-paced work, raising standards related to development, replicability, reporting, research, visualizing, etc. Our goal is not to provide exact instructions or "make robots" out of participants of this course. Participants will need to try to navigate themselves among data, identify promising leads and extract as much information as possible from the data to present to the others (colleagues, instructors, companies or their superiors).Important linksLab sessions course repository\xa0(continuously updated, use weekly plan links for latest materials)Books and other materials\xa0\xa0\xa0\xa0\xa0\xa0\x

We loaded the entire course into a single document. Since the sentence transformer can handle only limited sections of text, this might be problematic. Next, we will use the `RecursiveCharacterTextSplitter` to split the document into chunks with a `chunk_size` of 1000 characters and a `chunk_overlap` of 20.


In [24]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20) 
all_splits = text_splitter.split_documents(docs)

Let's see what the chunked document looks like.


In [25]:
all_splits[0]

Document(page_content='Natural language processing 2024Objavljeno z Google DokumentiPrijavite zloraboVeč o temNatural language processing 2024Samodejna posodobitev vsakih 5 min.Laboratory work - Spring 2024The main goal of laboratory work is to present the most important aspects of data science in practice and to teach you how to use key tools for a NLP engineer. We especially emphasize on self-paced work, raising standards related to development, replicability, reporting, research, visualizing, etc. Our goal is not to provide exact instructions or "make robots" out of participants of this course. Participants will need to try to navigate themselves among data, identify promising leads and extract as much information as possible from the data to present to the others (colleagues, instructors, companies or their superiors).Important linksLab sessions course repository\xa0(continuously updated, use weekly plan links for latest materials)Books and other materials\xa0\xa0\xa0\xa0\xa0\xa0\x

In [26]:
all_splits[1]

Document(page_content='processing\xa0(online draft)\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0Python 3 Text Processing with NLTK 3 CookbookIntroduction to Data Science Handbook\xa0 Razvoj slovenščine v digitalnem okolju\xa0(February 2023)Previous years NLP course materials\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0NLP course 2021 project reports\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0NLP course 2022 project reports\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0NLP course 2023 project reportsNLP course 2024 projects\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0MarksPeer review\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0Peer review submission form (TBA)\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0Weekly planThis plan is regularly updated. \xa0Lab sessions\xa0are meant to discuss materials and your project ideas. Those proficient in some of the topics covered during the course are expected to help other students during the lab work or in online discussions. Such contributions will also be taken into account. During the lab sessions we will show some DEMOs based on which 

In [27]:
all_splits[2]

Document(page_content="in the Github repository. During the lab sessions we will briefly present each week's topic and then mostly discuss your project ideas and work. You are expected to check/run notebooks before the lab sessions and then ask questions/discuss during the lab sessions. In the repository's README you can also find the recordings of each topic.WeekDescriptionMaterials and links19.2. - 23.2./26.2. - 1.3.Lab work introductionProjects overview Group work and projects application procedureBasic text processingSlovene text processingCourse overview and introduction4.3. - 8.3.Text clusteringText classification Traditional sequence tagging (HMM, MEMM, CRF, ...)Language models, knowledge basesProjects sign up form\xa0(deadline Friday midnight).Github classroom assignment\xa0(deadline Friday midnight, only one group member creates a team, exactly three members for a group!).11.3. - 15.3.Neural networks introduction (TensorFlow, Keras)Word embeddings & visualizations (offensive l

Next, we initialize a vector store. A vector store is a data structure that functions as a vector database, where each document is stored based on its own embedding. We will use the `FAISS` library for this purpose.


In [28]:
vectorstore = FAISS.from_documents(all_splits, embedding)

Finally, we define a retriever—an object that will handle the **retrieving** part of the RAG pipeline. The retriever receives as arguments the metric by which we search the space and the number of the k-nearest documents (chunks in our case) that we retrieve to present to the studen


In [29]:
retreiver = vectorstore.as_retriever(
    search_type="similarity",
    k=10,
)

Next, we modify our prompt template so that it now receives the context (the documents selected by the RAG system) and then generates the answer.


In [30]:
PROMPT_TEMPLATE = """
You are a helpful AI QA assistant. When answering questions, use the context enclosed by triple backquotes if it is relevant.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Reply your answer in markdown format.

```
{context}
```

### Question:
{question}

### Answer:
"""

prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template=PROMPT_TEMPLATE.strip(),
)

With the prompt defined, we next set up the `ConversationalRetrievalChain` that will utilize the defined `retriever` and `llm`, following the `PROMPT_TEMPLATE` to extract documents.


In [31]:
# Construct complete LLM chain
llm_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retreiver,
    return_source_documents=False,
    combine_docs_chain_kwargs={"prompt": prompt_template},
    verbose=False,
)

Finally, we create the `answer_question` function that will handle the chain invocation for us.


In [32]:
def answer_question(question: str, history: dict[str] = None) -> str:
    if history is None:
        history = []

    response = llm_chain.invoke({"question": question, "chat_history": history})
    answer = response["answer"].split("### Answer:")[-1].strip()
    return answer

In [33]:
question = "What do I have to do for the peer review?"
display_markdown(answer_question(question), raw=True)

To participate in the peer review process, follow these steps:

1. Find the projects you need to review based on the given link.
2. Evaluate the final submissions of two other peer groups having the same topic.
3. Submit your peer review scores in the Google Form provided.
4. Receive a score for your grading based on how much your grading differs from the assistant's grading.

Remember to follow the scoring criteria and provide constructive feedback in your marks. Use the public group ID for communication regarding your group. All work is group work, and all course obligations must be graded positively to pass.

In [34]:
question = "What is Project 7 Conversations with Characters in Stories for Literacy?"
display_markdown(answer_question(question), raw=True)

Project 7, titled "Conversations with Characters in Stories for Literacy," is a proposal to develop conversational personaBots using Large Language Models (LLMs) to engage young people in reading. The goal is to address the global literacy crisis and help students improve their literacy skills by interacting with digital representations of story characters. The project involves creating custom personaBots for characters from suggested novels, evaluating their performance, and delivering a comprehensive report on the findings.

In [35]:
question = "What is project Automatic identification of multiword expressions and definitions generation about?"
display_markdown(answer_question(question), raw=True)

Project Automatic identification of multiword expressions and definitions generation is about understanding the relation between the meaning of words in Natural Language Processing (NLP), specifically focusing on Slovenian language. The project involves analyzing inter-annotator agreement, generating definitions, proposing corpus improvements, performing automatic translation with definition generation, and semantically analyzing the results. The goal is to evaluate the ability of Large Language Models (LLMs) to capture relational knowledge and transfer it across languages using the MultiLexBATS dataset.

In [36]:
question = "How will the assistent will score the final grade?"
display_markdown(answer_question(question), raw=True)

The assistant will score the final grade based on the scoring schema provided in the instructions. The schema includes scores for each submission and the overall final grade. The scores are relative to the quality of the submitted work and the achievements of all the participants in the course. The assistant will follow the scoring criteria and provide feedback to the mark. The final grade will be determined after the peer review process and the submission defenses during the lab sessions.

In [37]:
question = "What needs to be done for a grade 10?"
display_markdown(answer_question(question), raw=True)

To achieve a grade 10, you need to submit exceptional work. This means going beyond the minimum requirements and delivering a project that is extraordinary in terms of its quality, originality, and impact. The project should be well-documented, easy to understand, and fully reproducible. It should also demonstrate a deep understanding of the topic and showcase innovative solutions or approaches. Additionally, the project should include thorough analyses and discussions, and future directions and ideas should be clearly articulated. Lastly, ensure that all dependencies, corpora, or trained models are included in the repository or linked to it, making it as simple as possible for others to run your code.

In [38]:
question = "What do you have to do for a good grade?"
display_markdown(answer_question(question), raw=True)

To achieve a good grade, you need to fully address all the requirements outlined in the instructions. This includes selecting a project topic, creating a well-organized repository, implementing at least one solution with analysis, discussing future directions and ideas, and submitting a final report with thorough analyses and discussions. Make sure your work is publicly available and easy for others to run and understand. Your grading will be based on the quality of your submission compared to other participants in the course.

In [39]:
question = "Who is the data set for natural language inference intended for?"
display_markdown(answer_question(question), raw=True)

The provided data sets are intended for training and evaluating natural language inference models. These models can then be used as AI assistants or conversational agents that excel in complex reasoning tasks, enabling interaction with humans through intuitive chat interfaces.

With the introduction of RAG, the `Mistral` model was able to successfully answer questions about the NLP course.

**Exercises:**
* Implement a data loader script that can load documents from a folder.
