![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logo/0x150/1643104191/logo-mse.png)

# AdvNLP Lab (Graded Lab): Experimenting with Retrieval as Part of a RAG System

Total: 44 points

**Objectives:** We build the retrieval part of a RAG system and compare the performance of classic KNN retrieval with that of retrieval followed by cross-encoder reranking. Finally, we write two prompts for generation and test them on an LLM.

**Useful documentation:** Since you'll use LangChain for this assignment, [their documentation](https://python.langchain.com/docs/introduction/) might be helpful.

## Students

Dave Brunner, Andrea Wey

## Setup

First, we need to install the required packages for this assignment.

In [1]:
# !pip install pandas langchain-community langchain-huggingface faiss-cpu --quiet

In [2]:
import pandas as pd
from langchain.retrievers import ContextualCompressionRetriever
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import CSVLoader
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker



We will use the [DRAGONBall Dataset](https://github.com/OpenBMB/RAGEval) as a basis for this assignment and load a subset of its documents. These will be the stored knowledge of the RAG system. To store them in the vector store, we will later create embeddings directly from them, since they already have the size of suitable chunks. Each document consists of a unique ID and the actual content.

In [3]:
documents = pd.read_csv('docs.csv', index_col=0)
documents

id,content
40,Acme Government Solutions is a government indu...
41,Entertainment Enterprises Inc. is an entertain...
42,"Advanced Manufacturing Solutions Inc., establi..."
43,"EcoGuard Solutions, established on April 15, 2..."
44,"Green Fields Agriculture Ltd., established on ..."
...,...
211,Hospitalization Record:\n\nBasic Information:\...
212,**Hospitalization Record**\n\n**Basic Informat...
213,Hospitalization Record\n\nBasic Information:\n...
214,Hospitalization Record\n----------------------...


The main goal of the assignment is to evaluate the retrieval component of the RAG system. For that, we also load a dataset of queries, which we can use to retrieve matching documents. Each query is also assigned a list of document IDs that match the documents loaded before. We can use these to evaluate whether the retrieval found the correct documents.

In [4]:
queries = pd.read_csv('queries.csv', index_col=0)
queries['ground_truth_doc_ids'] = queries['ground_truth_doc_ids'].apply(lambda x: x.split(';'))
queries

query_id,query,ground_truth_doc_ids
2286,When was Sparkling Clean Housekeeping Services...,[64]
2433,How did HealthPro Innovations' strategic partn...,[54]
6266,According to the hospitalization records of Br...,[212]
4499,"According to the judgment of Norwood, Unionvil...",[124]
2448,Based on HealthLife Solutions' 2020 corporate ...,[73]
...,...,...
2186,How did the severe drought in August 2018 lead...,[65]
3251,Compare the large-scale financing activities o...,"[58, 55]"
2268,How did CleanCo Housekeeping Services' investm...,[47]
3311,What were the outcomes of the debt restructuri...,"[56, 53]"
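
Because the IDs are produced by `str.split`, every ground-truth entry is a string rather than an integer, which matters as soon as they are compared with retrieved document IDs. A minimal sketch with hypothetical raw values (the `raw_*` strings and the helper `parse_ids` are illustrative and not taken from `queries.csv`):

```python
# Hypothetical raw cell values, in the semicolon-separated form that the
# split(';') call above assumes.
raw_single = "64"
raw_multi = "58;55"

def parse_ids(cell: str) -> list[str]:
    """Split a semicolon-separated ID cell into a list of string IDs."""
    return cell.split(";")

single_ids = parse_ids(raw_single)
multi_ids = parse_ids(raw_multi)
print(single_ids)
print(multi_ids)

# The entries are strings, so an integer ID never matches them.
print(64 in single_ids)
print("64" in single_ids)
```

When comparing against document metadata, both sides therefore need the same type.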


## 1. Recall@N

**1a) [2 points]** We will evaluate the retrieval by comparing the retrieved documents with the ground truth documents assigned to the query. For that, we will use the Recall@N metric. Please describe in 1-2 sentences how we can interpret this metric in our case.

**Your Answer:**
Recall@N is a metric that measures the proportion of relevant documents (ground truth) that are retrieved within the top N results for a given query. In our case, it indicates how many of the documents that are actually relevant to the query are found among the top N retrieved documents, thus providing insight into the effectiveness of the retrieval system in returning relevant information.
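
Written as a formula, with $R_q$ the set of ground-truth documents for query $q$ and $T_q^{(N)}$ the set of the top $N$ retrieved documents (both symbols are introduced here only for illustration):

$$
\text{Recall@}N(q) = \frac{\left| T_q^{(N)} \cap R_q \right|}{\left| R_q \right|}
$$

For example, a query with two ground-truth documents, of which one appears in the top 3, has Recall@3 $= 1/2$.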

**1b) [4 points]** Implement the Recall@N metric and test it with the following code.

In [5]:
def recall_at_n(retrieved_docs, relevant_doc_ids, n):
    """
    Calculate Recall@N.

    Parameters:
    - retrieved_docs: Sorted list of retrieved documents as LangChain Document objects
    - relevant_doc_ids: List of relevant document IDs
    - n: Number of top documents to consider

    Returns:
    - Recall@N
    """
    # Get the IDs of the top N retrieved documents
    top_n_ids = [doc.metadata['id'] for doc in retrieved_docs[:n]]
    # Calculate the number of relevant documents found in the top N
    relevant_found = sum(1 for doc_id in top_n_ids if doc_id in relevant_doc_ids)
    # Calculate Recall@N
    recall = relevant_found / len(relevant_doc_ids) if relevant_doc_ids else 0
    return recall

In [6]:
### Test
recall_at_n(
    [Document(page_content='', metadata={'id': str(id)}) for id in range(10)],
    ['0', '1', '20'],
    3
)

0.6666666666666666

## 2. Embedding Model

**2a) [3 points]** Each document will be converted to an embedding representing the semantic meaning of the document. In this assignment, we will use model `sentence-transformers/all-MiniLM-L6-v2` from HuggingFace. Please answer the following questions about this model:

**Your Answers:**

Embedding Length: 384

Number of Parameters: 22.7M

Maximum Sequence Length: 256 word pieces

## 3. Vector Store

**3a) [4 points]** Use LangChain to create a FAISS vector store and embed the documents with the above-mentioned embedding model. Load the documents again, but this time with a Loader object from LangChain. Finally, print the number of documents in the vector store.

In [7]:
loader = CSVLoader(
    file_path='docs.csv',
    metadata_columns=['id'],
)
documents = loader.load()

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vector_store = FAISS.from_documents(documents, embedding_model)

# Count the vectors actually stored in the FAISS index
print(f"Number of documents in the vector store: {vector_store.index.ntotal}")

Number of documents in the vector store: 108


**3b) [3 points]** Retrieve the Top-3 documents for this query: "According to the hospitalization records of Bridgewater General Hospital, summarize the present illness of J. Reyes." and print the documents' ID and L2 distance.

In [8]:
query = 'According to the hospitalization records of Bridgewater General Hospital, summarize the present illness of J. Reyes.'

similar_docs = vector_store.similarity_search_with_score(query, k=3)
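for doc, score in similar_docs:
    # Note: doc.id is the UUID assigned by the vector store; the dataset ID
    # from docs.csv is available as doc.metadata['id'].
    print(f"Document ID: {doc.id}, L2 Distance: {score}")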

Document ID: 4cf31824-21a1-406c-b56f-b23be376c9a9, L2 Distance: 0.7302275896072388
Document ID: abe0b0b7-de70-4aec-9459-2b6256f0f17c, L2 Distance: 0.9898043870925903
Document ID: 4cedc1f5-d146-4a29-bf7d-e57b072b4004, L2 Distance: 1.0050287246704102
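
To read these scores: LangChain's FAISS store uses a flat L2 index by default, and FAISS reports squared Euclidean distances for that index. Assuming the embeddings are unit-normalized (the `all-MiniLM-L6-v2` pipeline is understood to end with a normalization step, but treat this as an assumption), the squared distance $d^2$ and the cosine similarity are related by $d^2 = 2 - 2\cos\theta$. The sketch below checks this identity on toy vectors and converts the three scores above; all helper names are made up for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity of two unit-length vectors (plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

def cosine_from_squared_l2(d2):
    """Invert d^2 = 2 - 2*cos, valid for unit-length vectors only."""
    return 1.0 - d2 / 2.0

# Toy check of the identity on two arbitrary vectors.
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])
assert math.isclose(squared_l2(a, b), 2.0 - 2.0 * cosine(a, b))

# Scores printed above, interpreted as squared L2 distances.
for d2 in (0.7302275896072388, 0.9898043870925903, 1.0050287246704102):
    print(f"d^2 = {d2:.4f} -> cosine ~ {cosine_from_squared_l2(d2):.4f}")
```

Under these assumptions the top hit corresponds to a cosine similarity of roughly 0.63, while the other two sit at roughly 0.5.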


**3c) [2 points]** Check and show if a suitable document is found for the query in the Top-3 retrieved documents and show the relevant ones.

In [9]:
relevant_docs = []
for doc, score in similar_docs:
    content = doc.page_content
    # Check if the document contains relevant keywords
    if "Bridgewater General Hospital" in content and "J. Reyes" in content:
        relevant_docs.append((doc.id, score, content))

# Print the relevant documents ID, L2 distance, and content
if relevant_docs:
    for doc_id, score, content in relevant_docs:
        print(f"Relevant Document Found - ID: {doc_id}, L2 Distance: {score}")
        print(f"Content: {content}\n")
else:
    print("No relevant documents found in the top-3 retrieved documents.")

Relevant Document Found - ID: 4cf31824-21a1-406c-b56f-b23be376c9a9, L2 Distance: 0.7302275896072388
Content: content: **Hospitalization Record**

**Basic Information:**
Name: J. Reyes
Gender: Male
Age: 52
Ethnicity: Hispanic
Marital Status: Married
Occupation: Construction Worker
Address: 22, Sunnyvale street, Bridgewater
Admission Time: 7th, September
Record Time: 8th, September
Historian: Self
Hospital Name: Bridgewater General Hospital

**Chief Complaint:**
Persistent joint pain and morning stiffness for 6 months

**Present Illness:**
Onset: The symptoms began insidiously 6 months ago, initially noticed while working at a construction site. Gradual onset with morning stiffness in the fingers and wrists.
Main Symptoms: Morning stiffness, arthritis affecting hands, feet, wrists, ankles, and temporomandibular joints. Pain characterized as dull and persistent, worsens with activity and improves with rest.
Accompanying Symptoms: Joint deformities in the hands, fatigue, intermittent fever

## 4. Vector Store Evaluation

**4a) [4 points]** Now, we will search with each of the queries for the most relevant documents in the vector store, and calculate Recall@N with them and the assigned ground truth document IDs. To aggregate the results over all queries, we will calculate the mean. We will do this four times, using a different value for $N$ each time: $N \in \{ 1, 3, 5, 25\}$.

In [10]:
N_values = [1, 3, 5, 25]

# Running sum of Recall@N over all queries, one entry per N
total_recall = {n: 0 for n in N_values}

for index, row in queries.iterrows():

    retrieved_docs = vector_store.similarity_search(row['query'], k=max(N_values))

    for n in N_values:
        recall = recall_at_n(retrieved_docs, row['ground_truth_doc_ids'], n)
        total_recall[n] += recall

print("\n\n-------------------------------------\nAverage Recall@N for all queries:")
for n in N_values:
    average_recall = total_recall[n] / len(queries)
    print(f'Recall@{n} = {average_recall:.4f}')



-------------------------------------
Average Recall@N for all queries:
Recall@1 = 0.6650
Recall@3 = 0.8150
Recall@5 = 0.8600
Recall@25 = 1.0000


**4b) [2 points]** When looking at the four calculated Recall@N scores, what do you observe and how can you explain this?

**Your Answer:**
I can observe that the Recall@N scores increase with the value of N. <br>
This is expected, as a larger N allows for a broader search space, increasing the chances of finding the relevant documents among the retrieved results.
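
This monotonic behaviour follows directly from the definition: the top-(N+1) list contains the top-N list, so the number of hits can only stay the same or grow. A small self-contained sketch with a toy ranking (the IDs and the helper `recall_at_n_ids` are illustrative and work on plain ID lists rather than LangChain documents):

```python
def recall_at_n_ids(ranked_ids, relevant_ids, n):
    """Recall@N on plain ID lists: share of relevant IDs found in the top n."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:n]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Toy ranking: one relevant document at rank 2, the other at rank 6.
ranked = ["a", "58", "b", "c", "d", "55", "e"]
relevant = ["58", "55"]

scores = [recall_at_n_ids(ranked, relevant, n) for n in (1, 3, 5, 25)]
print(scores)  # [0.0, 0.5, 0.5, 1.0]

# Recall never decreases as N grows.
assert all(x <= y for x, y in zip(scores, scores[1:]))
```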

## 5. Cross Encoder

**5a) [3 points]** We want to use a cross encoder model to rerank the retrieved documents. Describe in 1-2 sentences how a new document order can be determined using a cross encoder.

**Your Answer:**
A cross encoder takes each query-document pair as a single joint input and outputs a relevance score that reflects how well the document matches the query. The retrieved documents are then sorted by this score in descending order, which gives the new ranking.
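
The reranking step itself is just a sort by pairwise score. The sketch below uses a toy word-overlap scorer as a stand-in for the cross encoder (a real cross encoder passes the query and document jointly through a transformer; `toy_pair_score` is purely hypothetical) to show how the order changes:

```python
def toy_pair_score(query: str, document: str) -> float:
    """Stand-in for a cross encoder: fraction of query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words)

def rerank(query, documents, score_fn):
    """Score every (query, document) pair and sort by descending score."""
    scored = [(score_fn(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]

query = "funding raised in february"
# Candidate order as it might come back from the vector store.
candidates = [
    "the company opened a new office in march",
    "in february the company raised new funding",
    "funding plans were announced",
]

reranked = rerank(query, candidates, toy_pair_score)
print(reranked[0])  # in february the company raised new funding
```

The second candidate moves to the top because it covers all query words, while the two others keep their relative order.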


**5b) [4 points]** Now again, we want to calculate Recall@N for all queries and the same $N$ as before. This time, we want to rerank the Top-25 retrieved documents using the cross encoder model `BAAI/bge-reranker-base`. Implement this using LangChain components and report the average Recall for $N \in \{ 1, 3, 5, 25\}$.

In [11]:
def calculate_recall_with_reranking(queries, vector_store, N_values):
    """
    Calculate Recall@N for given queries using a cross-encoder to rerank documents.

    Parameters:
    - queries: DataFrame containing the queries and their ground truth document IDs.
    - vector_store: The FAISS vector store containing the documents.
    - N_values: List of N values for which to calculate Recall@N.

    Returns:
    - Dictionary of average Recall@N for each N value.
    """
    # Initialize the running recall totals
    total_recall = {n: 0 for n in N_values}

    # Initialize the cross-encoder
    cross_encoder = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
    compressor = CrossEncoderReranker(model=cross_encoder, top_n=25)
    # Note: as_retriever() returns only 4 documents by default, so the reranker
    # sees 4 candidates here; search_kwargs={"k": 25} would pass the Top-25.
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor, base_retriever=vector_store.as_retriever()
    )

    for index, row in queries.iterrows():
        query = row['query']
        ground_truth = set(row['ground_truth_doc_ids'])

        reranked_docs = compression_retriever.invoke(query)

        # Accumulate Recall@N for each N value
        for n in N_values:
            recall = recall_at_n(reranked_docs, ground_truth, n)
            total_recall[n] += recall

    # Calculate and print the average Recall@N for all queries
    print("\n\n-------------------------------------\nAverage Recall@N for all queries:")
    average_recalls = {}
    for n in N_values:
        average_recall = total_recall[n] / len(queries)
        average_recalls[n] = average_recall
        print(f'Recall@{n} = {average_recall:.4f}')

    return average_recalls


average_recalls = calculate_recall_with_reranking(queries, vector_store, N_values)



-------------------------------------
Average Recall@N for all queries:
Recall@1 = 0.7300
Recall@3 = 0.8450
Recall@5 = 0.8450
Recall@25 = 0.8450


**5c) [2 points]** What do you observe when you compare the Recall@N scores after reranking with the scores without reranking? Write 1-2 sentences about this and why this might happen.

**Your Answer:**
Recall@1 and Recall@3 are higher after reranking (0.7300 vs. 0.6650 and 0.8450 vs. 0.8150), which indicates that the cross encoder moves relevant documents towards the top of the list when only a few documents are considered.

Recall@5 and Recall@25 are lower after reranking (0.8450 vs. 0.8600 and 1.0000), and the score no longer changes beyond N = 3. A reranker only reorders its candidates, so on the same Top-25 candidates Recall@25 would have to stay at 1.0000; the plateau therefore points to a smaller candidate pool. `vector_store.as_retriever()` returns only 4 documents by default, so relevant documents that the vector store ranked lower never reach the reranker.
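
The effect of the candidate pool size on Recall@N after reranking can be illustrated with a toy example. A reranker only reorders the documents it receives, so a relevant document outside the pool is lost for every N (in LangChain, `as_retriever()` returns 4 documents unless `search_kwargs={"k": ...}` is set). The ranking, the IDs, and the ideal reranker `perfect_rerank` below are hypothetical:

```python
def recall_at_n_ids(ranked_ids, relevant_ids, n):
    """Recall@N on plain ID lists."""
    hits = len(set(ranked_ids[:n]) & set(relevant_ids))
    return hits / len(relevant_ids)

def perfect_rerank(pool, relevant_ids):
    """Hypothetical ideal reranker: moves relevant IDs to the front of the pool."""
    return sorted(pool, key=lambda doc_id: doc_id not in relevant_ids)

# Toy first-stage ranking of 25 documents; the relevant one sits at rank 7.
first_stage = [f"doc{i}" for i in range(1, 26)]
relevant = ["doc7"]

results_by_pool = {}
for pool_size in (4, 25):
    pool = first_stage[:pool_size]  # candidates handed to the reranker
    reranked = perfect_rerank(pool, relevant)
    results_by_pool[pool_size] = {
        n: recall_at_n_ids(reranked, relevant, n) for n in (1, 3, 5, 25)
    }
    print(pool_size, results_by_pool[pool_size])
```

With a pool of 4 the relevant document is never seen, so even an ideal reranker scores 0 for every N; with a pool of 25 it reaches 1 for every N.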

## 6. Generation

**6a) [6 points]** After improving the retrieval part of the RAG system, we want to finally generate an answer for our query. Retrieve the most relevant document for query "How much funding did HealthPro Innovations raise in February 2021?" and print its ID. Then write the instruction message of a prompt to answer this query including all necessary elements before running it using your favourite LLM (ChatGPT GPT-4o, etc.). Please paste the answer from the model and indicate which model you used.

In [12]:
query = "How much funding did HealthPro Innovations raise in February 2021?"

retrieved_docs = vector_store.similarity_search(query, k=1)
print(f'Retrieved Document ID: {retrieved_docs[0].metadata["id"]}')

Retrieved Document ID: 54


In [13]:
prompt = f"""
You are a RAG system that answers questions based on the provided document.
####### Document
{retrieved_docs[0].page_content}
####### Question
{query}
"""

In [14]:
import os
from mistralai import Mistral
from dotenv import load_dotenv

load_dotenv()
api_key = os.environ["MISTRAL_API_KEY"]
model = "mistral-medium-latest"
client = Mistral(api_key=api_key)


def generate_answer(prompt):
    return client.chat.complete(
        model=model,
        messages=[
            {
                "role": "user",
                "content": prompt,
            },
        ]
    )

In [15]:
chat_response = generate_answer(prompt)

### Results

**Your Prompt:**



In [16]:
prompt

"\nYou are a RAG system that answers questions based on the provided document.\n####### Document\ncontent: HealthPro Innovations, established in September 2009, is a publicly traded healthcare company based in New York, specializing in the development and sale of innovative healthcare solutions.\nIn 2021, HealthPro Innovations experienced several significant events that had a profound impact on its financial performance and market position. Firstly, in February 2021, the company conducted a large-scale financing activity, raising $150 million in funds. This financing activity strengthened the company's financial strength and provided support for its expansion and development. One of the sub-events that contributed to this financing activity was the strategic partnership formed with a leading healthcare provider in January 2021. This partnership aimed to expand HealthPro Innovations' market reach and gain access to new customers, ultimately resulting in increased sales and revenue poten

**Generated Answer:**

In [17]:
chat_response.choices[0].message.content

'HealthPro Innovations raised $150 million in funds during its large-scale financing activity in February 2021.'

**Used Model:**

In [18]:
model

'mistral-medium-latest'

**6b) [3 points]** We want to use in-context learning and provide the LLM with one example of a possible answer. Use the same prompt and extend it so that the model follows this example answer: "Yep, they sold a lot in that year. Over 50 million units as I can see — pretty big move, respect!". Use the same model, create a fresh chat and run this new prompt. Highlight the changes in the prompt using **bold style** or <span style="color:red;">color</span>.

In [19]:
prompt = f"""
You are a RAG system that answers questions based on the provided document.
####### Document
{retrieved_docs[0].page_content}
####### Question
{query}
####### Style
Answer in a casual tone, as if you were talking to a friend.
####### In context learning example
How is the sales of the iPhone 6 in 2015?
Yep, they sold a lot in that year. Over 50 million units as I can see — pretty big move, respect!
"""

**Your Prompt:**



In [20]:
prompt

"\nYou are a RAG system that answers questions based on the provided document.\n####### Document\ncontent: HealthPro Innovations, established in September 2009, is a publicly traded healthcare company based in New York, specializing in the development and sale of innovative healthcare solutions.\nIn 2021, HealthPro Innovations experienced several significant events that had a profound impact on its financial performance and market position. Firstly, in February 2021, the company conducted a large-scale financing activity, raising $150 million in funds. This financing activity strengthened the company's financial strength and provided support for its expansion and development. One of the sub-events that contributed to this financing activity was the strategic partnership formed with a leading healthcare provider in January 2021. This partnership aimed to expand HealthPro Innovations' market reach and gain access to new customers, ultimately resulting in increased sales and revenue poten

**Changes in Prompt:**
The last part was added to the prompt; the rest stayed the same. <br>



 ...<br>
 <span style="color:red">
**####### Style <br>
Answer in a casual tone, as if you were talking to a friend. <br>
####### In context learning example <br>
How is the sales of the iPhone 6 in 2015? <br>
Yep, they sold a lot in that year. Over 50 million units as I can see — pretty big move, respect!**
</span>

**Generated Answer:**



In [21]:
chat_response = generate_answer(prompt)
chat_response.choices[0].message.content

'Oh, in February 2021, HealthPro Innovations raised a solid $150 million in funding. Pretty big move, respect! That cash boost definitely helped them level up their game.'

**6c) [2 points]** Please check if the two answers are correct according to the document and how they differ. Does the model follow the example in the second prompt?

**Your Answer:**
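Yes, both answers are correct according to the document: each states that HealthPro Innovations raised \$150 million in February 2021. They differ in tone and length. The first answer is a single neutral sentence, while the second is casual and adds a remark that the funding helped the company grow, which loosely paraphrases the document's statement that the financing supported its expansion and development. The model follows the example in the second prompt: it adopts the casual style and even reuses the phrase "Pretty big move, respect!" verbatim.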
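
A quick programmatic sanity check of the key figure can complement the manual comparison. The sketch below only tests whether the amount stated in the document also appears in both generated answers; the strings are copied from the outputs above and the helper `mentions` is illustrative. It does not verify the rest of either answer.

```python
# Sentence from the retrieved document (copied from the prompt output above).
document_excerpt = (
    "Firstly, in February 2021, the company conducted a large-scale "
    "financing activity, raising $150 million in funds."
)

# The two generated answers (copied from the outputs above).
answer_plain = (
    "HealthPro Innovations raised $150 million in funds during its "
    "large-scale financing activity in February 2021."
)
answer_casual = (
    "Oh, in February 2021, HealthPro Innovations raised a solid $150 million "
    "in funding. Pretty big move, respect! That cash boost definitely helped "
    "them level up their game."
)

def mentions(text: str, fact: str) -> bool:
    """Case-insensitive substring check for a key fact."""
    return fact.lower() in text.lower()

key_fact = "$150 million"
for name, answer in [("plain", answer_plain), ("casual", answer_casual)]:
    grounded = mentions(document_excerpt, key_fact) and mentions(answer, key_fact)
    print(f"{name}: key figure grounded in document = {grounded}")

# The casual answer reuses a phrase from the in-context example verbatim.
print(mentions(answer_casual, "Pretty big move, respect!"))
```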


## End of AdvNLP Lab

Please make sure all cells have been executed, save this completed notebook, and upload it to Moodle.