# Retrieval Augmented Generation (RAG)
<sup>This notebook is a part of Natural Language Processing class at the University of Ljubljana, Faculty for computer and information science. Please contact [boshko.koloski@ijs.si](mailto:boshko.koloski@ijs.si) for any comments.</sub>




The core functionality of this notebook is to create a retrieval-augmented generation (RAG) system that enables discussion about the NLP subject setting using the `Mistral-7b-v0.2` model.

General-purpose language models can be fine-tuned for common tasks such as sentiment analysis and named entity recognition, which typically do not require additional background knowledge.

For more complex and knowledge-intensive tasks, it is possible to construct a language model-based system that accesses external knowledge sources. This approach enhances factual consistency, improves the reliability of responses, and mitigates the issue of 'hallucination.'

Researchers have introduced the Retrieval Augmented Generation (RAG) method to address these knowledge-intensive tasks. RAG integrates an information retrieval component with a text generator model, allowing for efficient updates and modifications to its internal knowledge without retraining the entire model.

RAG operates by taking an input, retrieving a set of relevant or supporting documents from a source like Wikipedia, and concatenating these documents with the original input prompt. This concatenated context is then fed to the text generator, which produces the final output. This adaptability is crucial for situations where facts may evolve over time, as the static parametric knowledge of traditional large language models (LLMs) can become outdated. RAG circumvents the need for retraining, providing access to the most current information and enabling reliable outputs through retrieval-based generation.

RAG requires additional document embeddings and the storage of documents in a database for retrieval purposes.


### Setup

Run the cells below to setup and install the required libraries. For our experiment we will need `bitsandbytes`, `accelerate`, `transformers`, `datasets`, `sentence-transformers` and  `faiss-gpu`.

In [1]:
!pip install langchain bitsandbytes accelerate sentence-transformers faiss-cpu langchain_community unstructured python-docx

Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m244.3/244.3 kB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: python-docx
Successfully installed python-docx-1.1.2


**In simple terms, RAG is to LLMs what an open-book exam is to humans.**

The concept of an open-book exam centers around assessing a student's reasoning abilities rather than their capacity to memorize specific details. In a similar vein, RAG separates factual knowledge from the LLM’s reasoning capabilities. This factual information is stored in an external knowledge source, which is both easily accessible and updatable:

- **Parametric knowledge:** Knowledge that is learned during training and implicitly stored within the neural network's weights.
- **Non-parametric knowledge:** Information that is stored externally, for example, in a vector database.
e.

![RAG](RAG.jpg)

The RAG workflow consists of:

1. **The Retrieve**: The user query is used to retrieve relevant context from an external knowledge source. For this, the user query is embedded using an embedding model into the same vector space as the additional context in the vector database. This enables a similarity search, and the top k closest data objects from the vector database are returned.
2. **Augment**: The user query and the retrieved additional context are incorporated into a prompt template.
3. **Generate**: Finally, the retrieval-augmented prompt is fed to the LLM.

We will use the `langchain` framework to efficiently prompt the LLMs and prepare the RAG.


In [2]:
import os
import tempfile
from pathlib import Path
import joblib
import requests
import torch
import transformers

In [3]:
from IPython.display import display_markdown
from langchain.chains.conversational_retrieval.base import ConversationalRetrievalChain
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter, HTMLHeaderTextSplitter, TokenTextSplitter
from langchain_community.document_loaders import BSHTMLLoader
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain_community.vectorstores.faiss import FAISS

In [4]:
LLM_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
device = f"cuda:{torch.cuda.current_device()}"

Large Language Models are known for their significant computational demands. Typically, the size of a model is determined by multiplying the number of parameters (size) by the precision of these values (data type). To conserve memory, weights can be stored using lower-precision data types through a process known as quantization.

**Post-Training Quantization (PTQ)** is a straightforward technique where the weights of an already trained model are converted to a lower precision without necessitating any retraining. Although easy to implement, PTQ can lead to potential performance degradation.We will employ PTQ using the `bitsandbytes` library and will load the model in 4-bit precision, applying double quantization with the `nf4` data type. For more information about quantization, visit [this guide on quantization](https://huggingface.co/docs/optimum/en/concept_guides/quantization) )pe.


In [5]:
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True, # loading in 4 bit
    bnb_4bit_quant_type="nf4", # quantization type
    bnb_4bit_use_double_quant=True, # nested quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

In [6]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: read)

In [7]:
model_config = transformers.AutoConfig.from_pretrained(
    pretrained_model_name_or_path=LLM_MODEL,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=LLM_MODEL,
    config=model_config,
    quantization_config=bnb_config, # we introduce the bnb config here.
    device_map="auto",
)
model.eval()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): Mist

We also need to load the tokenizer to transform the text as before.

In [8]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=LLM_MODEL,
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

We will use pipelines from Hugging Face to perform the prompting and generation with the Mistral model.


In [9]:
generate_text = transformers.pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    #temperature=0.0,
    max_new_tokens=8192,
    repetition_penalty=1.1,
)

Device set to use cuda:0


We will use `langchain` to link the HuggingFace models and the chaining prompting.

In [10]:
llm = HuggingFacePipeline(pipeline=generate_text)

  llm = HuggingFacePipeline(pipeline=generate_text)


The core functionality of `langchain` is the creation of templates for prompting via `PromptTemplate`.

In [11]:
from langchain_core.prompts import PromptTemplate

In [12]:
template = """
You are a helpful AI QA assistant. When answering questions, use the context enclosed by triple backquotes if it is relevant.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Reply your answer in markdown format.
Question: {question}
Answer:"""
prompt = PromptTemplate.from_template(template)

In [13]:
prompt

PromptTemplate(input_variables=['question'], input_types={}, partial_variables={}, template="\nYou are a helpful AI QA assistant. When answering questions, use the context enclosed by triple backquotes if it is relevant.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\nReply your answer in markdown format.\nQuestion: {question}\nAnswer:")

In [14]:
chain = prompt | llm

In [15]:
chain

PromptTemplate(input_variables=['question'], input_types={}, partial_variables={}, template="\nYou are a helpful AI QA assistant. When answering questions, use the context enclosed by triple backquotes if it is relevant.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\nReply your answer in markdown format.\nQuestion: {question}\nAnswer:")
| HuggingFacePipeline(pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x7c2a10c27ad0>)

In [16]:
question = "What is the scoring criteria of the NLP course?"

In [17]:
print(chain.invoke({"question": question}))


You are a helpful AI QA assistant. When answering questions, use the context enclosed by triple backquotes if it is relevant.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Reply your answer in markdown format.
Question: What is the scoring criteria of the NLP course?
Answer: I'd be happy to help you with that! However, without specific context regarding which NLP (Natural Language Processing) course you're referring to, I can only provide some general information about common scoring criteria used in NLP courses.
Here are some common assessment areas for NLP projects and assignments:
- **Accuracy**: This measures how often the model or algorithm produces correct results.
- **Precision**: This measures the proportion of true positives among the total number of positive predictions made by the model.
- **Recall**: This measures the proportion of true positives among all actual positive instances in the data.
- **F1 Score**: This is the harmo

In [18]:
question = "What do I have to do for the peer review?"

In [19]:
print(chain.invoke({"question": question}))


You are a helpful AI QA assistant. When answering questions, use the context enclosed by triple backquotes if it is relevant.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Reply your answer in markdown format.
Question: What do I have to do for the peer review?
Answer: To prepare for a peer review, you should first read and understand the document or code thoroughly. Identify any issues or areas for improvement, and write down your thoughts and suggestions. During the review process, be respectful and constructive in your feedback. Use specific examples and clear language to explain your reasoning. Remember, the goal is to help improve the work and provide valuable insights.


**NOTE**: The model provides responses that are fairly general. We plan to enhance this by integrating contextual details from our course materials via **RAG**.

### Enter RAG

Next, we will define the vector embeddings of our context. We will use the `sentence-transformers/all-mpnet-base-v2` model to embed the documents and a FAISS vector store to store and later retrieve them. LangChain offers the `HuggingFaceEmbeddings` interface to easily load any model from Hugging Face to serve as the document representation.


In [20]:
EMBED_MODEL = "sentence-transformers/all-mpnet-base-v2"

In [21]:
embedding = HuggingFaceEmbeddings(
    model_name=EMBED_MODEL,
    model_kwargs={"device": "cuda"},
)

  embedding = HuggingFaceEmbeddings(


In [32]:
def fetch_websites(sites: list[str]):
    docs = []
    with tempfile.TemporaryDirectory() as tmpdir:
        filename = f"{tmpdir}/site.html"
        for site in sites:
            res = requests.get(site)
            with open(filename, mode="wb") as fp:
                fp.write(res.content)
            docs.extend(BSHTMLLoader(filename, open_encoding = 'utf8').load())
    return docs

In [33]:
course_webpage = "https://docs.google.com/document/d/e/2PACX-1vRwReAm5Vz_aiNyZN33eTz22fqMJhM0H-KtJdXthUw5cean_WYdBkZgJchP_s9th0rtUW0ikZZ_Fh5l/pub"

The `fetch_websites` function will be used to scrape data from our Google Document.


In [34]:
docs = fetch_websites([course_webpage])

In [35]:
docs[0]

Document(metadata={'source': '/tmp/tmpntspgd2l/site.html', 'title': 'Natural language processing 2025'}, page_content='Natural language processing 2025Published using Google DocsReport abuseLearn moreNatural language processing 2025Updated automatically every 5 minutesLaboratory work - Spring 2025The main goal of laboratory work is to present the most important aspects of data science in practice and to teach you how to use key tools for a NLP engineer. We especially emphasize on self-paced work, raising standards related to development, replicability, reporting, research, visualizing, etc. Our goal is not to provide exact instructions or "make robots" out of participants of this course. Participants will need to try to navigate themselves among data, identify promising leads and extract as much information as possible from the data to present to the others (colleagues, instructors, companies or their superiors).Important linksLab sessions course repository\xa0(continuously updated, us

We loaded the entire course into a single document. Since the sentence transformer can handle only limited sections of text, this might be problematic. Next, we will use the `RecursiveCharacterTextSplitter` to split the document into chunks with a `chunk_size` of 1000 characters and a `chunk_overlap` of 20.


In [36]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(docs)

Let's see what the chunked document looks like.


In [37]:
all_splits[0]

Document(metadata={'source': '/tmp/tmpntspgd2l/site.html', 'title': 'Natural language processing 2025'}, page_content='Natural language processing 2025Published using Google DocsReport abuseLearn moreNatural language processing 2025Updated automatically every 5 minutesLaboratory work - Spring 2025The main goal of laboratory work is to present the most important aspects of data science in practice and to teach you how to use key tools for a NLP engineer. We especially emphasize on self-paced work, raising standards related to development, replicability, reporting, research, visualizing, etc. Our goal is not to provide exact instructions or "make robots" out of participants of this course. Participants will need to try to navigate themselves among data, identify promising leads and extract as much information as possible from the data to present to the others (colleagues, instructors, companies or their superiors).Important linksLab sessions course repository\xa0(continuously updated, us

In [38]:
all_splits[1]

Document(metadata={'source': '/tmp/tmpntspgd2l/site.html', 'title': 'Natural language processing 2025'}, page_content='and language processing\xa0(online draft)\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0Python 3 Text Processing with NLTK 3 CookbookIntroduction to Data Science Handbook\xa0 Razvoj slovenščine v digitalnem okolju\xa0(February 2023)Previous years NLP course materials NLP course 2024 project reports\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0MarksPeer review\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0TBA\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0* Check Marks spreadsheet (sheet "Projects to review") to get list of repositories you need to review.\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0Weekly planThis plan is regularly updated. \xa0Lab sessions\xa0are meant to discuss materials and your project ideas. Those proficient in some of the topics covered during the course are expected to help other students during the lab work or in online discussions. Such contributions will also be taken into account. During the lab sessions we will

In [39]:
all_splits[2]

Document(metadata={'source': '/tmp/tmpntspgd2l/site.html', 'title': 'Natural language processing 2025'}, page_content="repository. During the lab sessions we will briefly present each week's topic and then mostly discuss your project ideas and work. You are expected to check/run notebooks before the lab sessions and then ask questions/discuss during the lab sessions. In the repository's README you can also find the recordings of each topic.WeekDescriptionMaterials and links17.2. - 21.2./24.2. - 28.2.Lab work introductionProjects overview Group work and projects application procedureBasic text processingSlovene text processing3.3. - 7.3.Text clusteringText classification\xa0Language models, knowledge basesProjects sign up form\xa0(deadline Friday midnight).Github classroom assignment\xa0(deadline Friday midnight, only one group member creates a team, exactly three members for a group!). Use GROUP ID\xa0from the projects sign up form as a team name.10.3. - 14.3.Neural networks introducti

Next, we initialize a vector store. A vector store is a data structure that functions as a vector database, where each document is stored based on its own embedding. We will use the `FAISS` library for this purpose.


In [40]:
vectorstore = FAISS.from_documents(all_splits, embedding)

Finally, we define a retriever—an object that will handle the **retrieving** part of the RAG pipeline. The retriever receives as arguments the metric by which we search the space and the number of the k-nearest documents (chunks in our case) that we retrieve to present to the studen


In [41]:
retreiver = vectorstore.as_retriever(
    search_type="similarity",
    k=10,
)

Next, we modify our prompt template so that it now receives the context (the documents selected by the RAG system) and then generates the answer.


In [42]:
PROMPT_TEMPLATE = """
You are a helpful AI QA assistant. When answering questions, use the context enclosed by triple backquotes if it is relevant.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Reply your answer in markdown format.

```
{context}
```

### Question:
{question}

### Answer:
"""

prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template=PROMPT_TEMPLATE.strip(),
)

With the prompt defined, we next set up the `ConversationalRetrievalChain` that will utilize the defined `retriever` and `llm`, following the `PROMPT_TEMPLATE` to extract documents.


In [43]:
# Construct complete LLM chain
llm_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retreiver,
    return_source_documents=False,
    combine_docs_chain_kwargs={"prompt": prompt_template},
    verbose=False,
)

Finally, we create the `answer_question` function that will handle the chain invocation for us.


In [44]:
def answer_question(question: str, history: dict[str] = None) -> str:
    if history is None:
        history = []

    response = llm_chain.invoke({"question": question, "chat_history": history})
    answer = response["answer"].split("### Answer:")[-1].strip()
    return answer

In [45]:
question = "What do I have to do for the peer review?"
display_markdown(answer_question(question), raw=True)

To participate in the peer review process, follow these steps:

1. Find the projects you need to review based on the provided link.
2. Review projects of the same topic as your group.
3. Submit your peer review scores in the Google Form.
4. Provide feedback to your marks.

Remember, your grading will also be evaluated during the peer review process, and your score will depend on how much your grading differs from the assistant's grading, following the given scoring criteria. Make sure to include links to all dependencies/corpora or include them in the repository, and ensure that anyone can easily run the code for the reviewed projects.

In [55]:
question = "What is the first project about?"
display_markdown(answer_question(question), raw=True)

The first project is called "Conversational Agent with Retrieval-Augmented Generation". Its goal is to develop a conversational agent that enhances the quality and accuracy of its responses by dynamically retrieving and integrating relevant external documents from the web. Unlike traditional chatbots that rely solely on pre-trained knowledge, this system will perform real-time information retrieval, ensuring up-to-date answers.

In [50]:
question = "What do you have to do for a good grade?"
display_markdown(answer_question(question), raw=True)

To achieve a good grade, you need to pass Submission 1 and 2, and gain 50% total from each of Submission 3 and Peer review. Make sure your work fulfills the instructions and is publicly available on GitHub before the deadlines. Additionally, attend the lab sessions and discussions, and provide clear and constructive feedback during peer review.

In [56]:
question = "What is the project intended for digital linguistics about?"
display_markdown(answer_question(question), raw=True)

The projects for digital linguistics students are about analyzing and comparing translation errors and biases in large language models (LLMs). Students will evaluate how different models handle translations, identifying common errors such as mistranslations, omissions, and cultural misinterpretations. Additionally, the project will explore biases that emerge in translations, including gender, regional, and political biases. By systematically comparing multiple LLMs, students will assess translation quality using both automated metrics and human evaluations. The findings will help improve fairness and accuracy in AI-driven translation systems.

With the introduction of RAG, the `Mistral` model was able to successfully answer questions about the NLP course.

**Exercises:**
* Implement a data loader script that can load documents from a folder.
