![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logo/0x150/1643104191/logo-mse.png)

# AdvNLP Lab (Graded Lab): Experimenting with Retrieval as Part of a RAG System

Total: 44 points

**Objectives:** We build the retrieval part of a RAG system and compare the performance of classic KNN retrieval with and without additional cross-encoder reranking. Finally, we write two prompts for generation and test them on an LLM.

**Useful documentation:** Since you'll use LangChain for this assignment, [their documentation](https://python.langchain.com/docs/introduction/) might be helpful.

## Students

Dave Brunner, Andrea Wey

## Setup

First, we need to install the required packages for this assignment.

In [1]:
!pip install pandas langchain-community langchain-huggingface faiss-cpu --quiet

In [2]:
import pandas as pd
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import CSVLoader
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever

We will use the [DRAGONBall Dataset](https://github.com/OpenBMB/RAGEval) as a basis for this assignment and load a subset of its documents. These will be the stored knowledge of the RAG system. To store them in the vector store, we will later embed them directly, since they are already the size of suitable chunks. Each document consists of a unique ID and the actual content.

In [3]:
queries = pd.read_csv('queries.csv', index_col=0)
# Ground-truth IDs are separated by ';' (same parsing as in the queries cell further below)
queries['ground_truth_doc_ids'] = queries['ground_truth_doc_ids'].apply(lambda x: x.split(';'))

In [4]:
documents = pd.read_csv('docs.csv', index_col=0)
documents

| id | content |
|---|---|
| 40 | Acme Government Solutions is a government indu... |
| 41 | Entertainment Enterprises Inc. is an entertain... |
| 42 | Advanced Manufacturing Solutions Inc., establi... |
| 43 | EcoGuard Solutions, established on April 15, 2... |
| 44 | Green Fields Agriculture Ltd., established on ... |
| ... | ... |
| 211 | `Hospitalization Record:\n\nBasic Information:\...` |
| 212 | `**Hospitalization Record**\n\n**Basic Informat...` |
| 213 | `Hospitalization Record\n\nBasic Information:\n...` |
| 214 | `Hospitalization Record\n----------------------...` |


The main goal of the assignment is to evaluate the retrieval component of the RAG system. For that, we also load a dataset of queries, which we can use to retrieve matching documents. Each query is also assigned a list of document IDs that match the documents loaded above. We can use these to evaluate whether the retrieval found the correct documents.

In [5]:
queries = pd.read_csv('queries.csv', index_col=0)
queries['ground_truth_doc_ids'] = queries['ground_truth_doc_ids'].apply(lambda x: x.split(';'))
queries

| query_id | query | ground_truth_doc_ids |
|---|---|---|
| 2286 | When was Sparkling Clean Housekeeping Services... | [64] |
| 2433 | How did HealthPro Innovations' strategic partn... | [54] |
| 6266 | According to the hospitalization records of Br... | [212] |
| 4499 | According to the judgment of Norwood, Unionvil... | [124] |
| 2448 | Based on HealthLife Solutions' 2020 corporate ... | [73] |
| ... | ... | ... |
| 2186 | How did the severe drought in August 2018 lead... | [65] |
| 3251 | Compare the large-scale financing activities o... | [58, 55] |
| 2268 | How did CleanCo Housekeeping Services' investm... | [47] |
| 3311 | What were the outcomes of the debt restructuri... | [56, 53] |


## 1. Recall@N

**1a) [2 points]** We will evaluate the retrieval by comparing the retrieved documents with the ground truth documents assigned to the query. For that, we will use the Recall@N metric. Please describe in 1-2 sentences how we can interpret this metric in our case.

**Your Answer:**

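Recall@N is the fraction of a query's ground-truth documents that appear among the top N retrieved documents. The higher the value, the more of the relevant documents were found within the first N positions; a value of 1.0 means all of them were found.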
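
Written as a formula, this reads as follows. The symbols $G_q$ and $R_q^{(N)}$ are introduced here for illustration only: $G_q$ is the set of ground-truth document IDs of a query $q$, and $R_q^{(N)}$ is the set of IDs of its top $N$ retrieved documents.

$$
\text{Recall@}N(q) = \frac{\lvert G_q \cap R_q^{(N)} \rvert}{\lvert G_q \rvert}
$$

For the test case in 1b, $G_q = \{0, 1, 20\}$ and $R_q^{(3)} = \{0, 1, 2\}$, so $\text{Recall@}3 = 2/3 \approx 0.667$.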

**1b) [4 points]** Implement the Recall@N metric and test it with the following code.

In [6]:
def recall_at_n(retrieved_docs, relevant_doc_ids, n):
    """
    Calculate Recall@N.

    Parameters:
    - retrieved_docs: Sorted list of retrieved documents as LangChain Document objects
    - relevant_doc_ids: List of relevant document IDs
    - n: Number of top documents to consider

    Returns:
    - Recall@N
    """

    # Get the IDs of the top N retrieved documents
    retrieved_doc_ids = [doc.metadata['id'] for doc in retrieved_docs[:n]]

    # Calculate the number of relevant documents found in the top N
    relevant_found = len(set(retrieved_doc_ids) & set(relevant_doc_ids))

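    # Calculate Recall@N (0 if there are no relevant documents);
    # the local name differs from the function name to avoid shadowing it
    recall = relevant_found / len(relevant_doc_ids) if relevant_doc_ids else 0

    return recall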

In [7]:
### Test

recall_at_n(
    [Document(page_content='', metadata={'id': str(id)}) for id in range(10)],
    ['0', '1', '20'],
    3
)

0.6666666666666666

## 2. Embedding Model

**2a) [3 points]** Each document will be converted to an embedding representing the semantic meaning of the document. In this assignment, we will use model `sentence-transformers/all-MiniLM-L6-v2` from HuggingFace. Please answer the following questions about this model:

In [8]:
# Load the embedding model
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
# _client is the underlying SentenceTransformer object (a private attribute)
model = embeddings._client

# What is the embedding length?
embedding_length = model.get_sentence_embedding_dimension()
print(f"Embedding length: {embedding_length}")

# Number of parameters?
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params}")

# Maximum sequence length?
max_seq_length = model.get_max_seq_length()
print(f"Maximum sequence length: {max_seq_length}")

Embedding length: 384
Total parameters: 22713216
Maximum sequence length: 256


**Your Answers:**

Embedding Length: 384

Number of Parameters: 22713216

Maximum Sequence Length: 256

## 3. Vector Store

**3a) [4 points]** Use LangChain to create a FAISS vector store and embed the documents with the above-mentioned embedding model. Load the documents again, but this time with a Loader object from LangChain. Finally, print the number of documents in the vector store.

In [9]:
# load the csv file
loader = CSVLoader(file_path='docs.csv', metadata_columns=['id'])
# create a FAISS vector store
vector_store = FAISS.from_documents(
    documents=loader.load(),
    embedding=embeddings
)

print(f"there are {vector_store.index.ntotal} vectors in the vector store")

there are 108 vectors in the vector store


**3b) [3 points]** Retrieve the Top-3 documents for this query: "According to the hospitalization records of Bridgewater General Hospital, summarize the present illness of J. Reyes." and print each document's ID and L2 distance.

In [10]:
# Retrieve the top 3 documents for this query
query = "According to the hospitalization records of Bridgewater General Hospital, summarize the present illness of J. Reyes"
retrieved_docs = vector_store.similarity_search_with_score(query, k=3)
# Print each document's ID and L2 distance.
# Note: doc.id is the UUID assigned by the vector store; the dataset ID is in doc.metadata['id'].
for doc, score in retrieved_docs:
    print(f"Document ID: {doc.id}, L2 distance: {score}")

Document ID: e1a873b3-e65b-44a0-a135-51bc5c29cdfa, L2 distance: 0.7137991786003113
Document ID: 04361aed-6dee-40df-a5cf-667bf45174f9, L2 distance: 0.98215651512146
Document ID: f1c38ba0-8dc6-4113-80c7-fe5eb3376761, L2 distance: 0.9883989691734314
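
To make the retrieval step above more concrete, here is a minimal pure-Python sketch of nearest-neighbour search by L2 distance. The names `l2_distance`, `top_k`, `toy_docs`, and `toy_query` are hypothetical and the vectors are made-up 2-dimensional stand-ins for the 384-dimensional embeddings. It illustrates the idea only and is not how FAISS is implemented.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, distance) pairs closest to query_vec, nearest first."""
    scored = [(doc_id, l2_distance(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical toy "embeddings" (assumption: not taken from the lab data)
toy_docs = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.6, 0.8],
}
toy_query = [0.8, 0.6]

# Smaller distance means more similar, so "c" is ranked before "a"
for doc_id, dist in top_k(toy_query, toy_docs, k=2):
    print(f"Document ID: {doc_id}, L2 distance: {dist:.4f}")
```

As in the output above, a smaller distance indicates a closer match, so the first result is the most similar document.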


**3c) [2 points]** Check and show if a suitable document is found for the query in the Top-3 retrieved documents and show the relevant ones.

In [11]:
relevant_docs = []
for doc, score in retrieved_docs:
    content = doc.page_content
    # Check if the document contains relevant keywords
    if "Bridgewater General Hospital" in content and "J. Reyes" in content:
        relevant_docs.append((doc.id, score, content))

# Print the relevant documents ID, L2 distance, and content
if relevant_docs:
    for doc_id, score, content in relevant_docs:
        print(f"Relevant Document Found - ID: {doc_id}, L2 Distance: {score}")
        print(f"Content: {content}\n")
else:
    print("No relevant documents found in the top-3 retrieved documents.")



Relevant Document Found - ID: e1a873b3-e65b-44a0-a135-51bc5c29cdfa, L2 Distance: 0.7137991786003113
Content: content: **Hospitalization Record**

**Basic Information:**
Name: J. Reyes
Gender: Male
Age: 52
Ethnicity: Hispanic
Marital Status: Married
Occupation: Construction Worker
Address: 22, Sunnyvale street, Bridgewater
Admission Time: 7th, September
Record Time: 8th, September
Historian: Self
Hospital Name: Bridgewater General Hospital

**Chief Complaint:**
Persistent joint pain and morning stiffness for 6 months

**Present Illness:**
Onset: The symptoms began insidiously 6 months ago, initially noticed while working at a construction site. Gradual onset with morning stiffness in the fingers and wrists.
Main Symptoms: Morning stiffness, arthritis affecting hands, feet, wrists, ankles, and temporomandibular joints. Pain characterized as dull and persistent, worsens with activity and improves with rest.
Accompanying Symptoms: Joint deformities in the hands, fatigue, intermittent fever

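**Your Answer:**

Yes. The top-ranked document (L2 distance ≈ 0.714) is the hospitalization record of J. Reyes at Bridgewater General Hospital and contains the "Present Illness" section the query asks about. It is the only one of the three retrieved documents that mentions both the hospital and the patient.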



## 4. Vector Store Evaluation

**4a) [4 points]** Now, we will search with each of the queries for the most relevant documents in the vector store and calculate Recall@N from the retrieved documents and the assigned ground truth document IDs. To aggregate the results over all queries, we will calculate the mean. We will do this four times, using a different value for $N$ each time: $N \in \{ 1, 3, 5, 25\}$.

In [12]:
N_values = [1, 3, 5, 25]

# Running sum of the per-query Recall@N, one total per N
total_recall = {n: 0 for n in N_values}

for _, row in queries.iterrows():
    # Retrieve once with the largest N; smaller N reuse a prefix of the same ranking
    retrieved_docs = vector_store.similarity_search(row['query'], k=max(N_values))

    for n in N_values:
        total_recall[n] += recall_at_n(retrieved_docs, row['ground_truth_doc_ids'], n)

# Report the mean over all queries
for n in N_values:
    print(f'Recall@{n} = {total_recall[n] / len(queries):.4f}')

Recall@1 = 0.6650
Recall@3 = 0.8150
Recall@5 = 0.8600
Recall@25 = 1.0000


**4b) [2 points]** When looking at the four calculated Recall@N scores, what do you observe and how can you explain this?

**Your Answer:**

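Recall@N increases with N (0.665, 0.815, 0.860, 1.000) because the top-N list for a larger N contains every document of the shorter list plus additional candidates, so a ground-truth document can be gained but never lost. Recall@25 = 1.0 means every ground-truth document is ranked within the top 25. Recall@1 is also limited because queries with two ground-truth documents can reach at most 0.5 with a single retrieved document.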
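
The following small sketch illustrates this behaviour on plain ID lists. The helper `recall_at_n_ids` and the ranking are hypothetical (a simplified stand-in for `recall_at_n`, not the lab data). With two ground-truth documents, Recall@1 cannot exceed 0.5, and the value never decreases as N grows.

```python
def recall_at_n_ids(ranked_ids, relevant_ids, n):
    """Recall@N on plain ID lists (simplified stand-in for recall_at_n)."""
    if not relevant_ids:
        return 0.0
    found = set(ranked_ids[:n]) & set(relevant_ids)
    return len(found) / len(relevant_ids)

# Hypothetical ranking for one query with two ground-truth documents
ranked = ["58", "12", "40", "55", "73"]
relevant = ["58", "55"]

curve = {n: recall_at_n_ids(ranked, relevant, n) for n in (1, 3, 5)}
print(curve)  # {1: 0.5, 3: 0.5, 5: 1.0}
```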


## 5. Cross Encoder

**5a) [3 points]** We want to use a cross encoder model to rerank the retrieved documents. Describe in 1-2 sentences how a new document order can be determined using a cross encoder.
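
As an illustration of the general idea, the sketch below scores every (query, document) pair jointly and sorts the candidates by that score. The function `toy_overlap_score` is a hypothetical stand-in based on word overlap; in this lab the score would instead come from a cross-encoder model such as `BAAI/bge-reranker-base`. The names `rerank` and `candidates` are likewise assumptions made for this example.

```python
def rerank(query, docs, score_fn, top_n):
    """Score each (query, doc) pair and return the top_n docs, best first."""
    scored = [(score_fn(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def toy_overlap_score(query, doc):
    """Hypothetical stand-in for a cross encoder: share of query words found in doc."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

# Made-up candidate texts in their original retrieval order
candidates = [
    "annual report of a housekeeping company",
    "hospitalization record of j. reyes",
    "court judgment from unionville",
]

reranked = rerank("present illness of j. reyes", candidates, toy_overlap_score, top_n=2)
print(reranked)
```

Unlike the embedding model, which encodes query and document separately, the scoring function here sees both texts together, which is the property a cross encoder exploits.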

**Your Answer:**



**5b) [4 points]** Now again, we want to calculate Recall@N for all queries and the same $N$ as before. This time, we want to rerank the Top-25 retrieved documents using the cross encoder model `BAAI/bge-reranker-base`. Implement this using LangChain components and report the average Recall for $N \in \{ 1, 3, 5, 25\}$.

In [17]:
def calculate_with_rerankings(queries, vector_store, N_values):
    # Running sum of the per-query Recall@N, one total per N
    total_recall = {n: 0 for n in N_values}

    cross_encoder = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
    compressor = CrossEncoderReranker(model=cross_encoder, top_n=25)
    # Note: as_retriever() uses the default k=4, so only 4 candidates reach the reranker.
    # Reranking the Top-25 would need as_retriever(search_kwargs={"k": 25}).
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor, base_retriever=vector_store.as_retriever()
    )

    for _, row in queries.iterrows():
        query = row['query']
        ground_truth = set(row['ground_truth_doc_ids'])

        # Retrieve candidates and let the cross encoder reorder them
        reranked_docs = compression_retriever.invoke(query)
        for n in N_values:
            total_recall[n] += recall_at_n(reranked_docs, ground_truth, n)

    # Report the mean over all queries
    average_recalls = {}
    for n in N_values:
        average_recalls[n] = total_recall[n] / len(queries)
        print(f'Recall@{n} with reranking = {average_recalls[n]:.4f}')

    return average_recalls


In [18]:
average_recall = calculate_with_rerankings(queries, vector_store, N_values)

Recall@1 with reranking = 0.7300
Recall@3 with reranking = 0.8450
Recall@5 with reranking = 0.8450
Recall@25 with reranking = 0.8450


**5c) [2 points]** What do you observe when you compare the Recall@N scores after reranking with the scores without reranking? Write 1-2 sentences about this and why this might happen.

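**Your Answer:**

Reranking raises Recall@1 from 0.665 to 0.730 and Recall@3 from 0.815 to 0.845, so the cross encoder moves relevant documents closer to the top. Recall@5 and Recall@25, however, drop to 0.845 (from 0.860 and 1.000): `vector_store.as_retriever()` returns only its default 4 documents, so the reranked list contains at most 4 candidates and recall for larger N is capped by what those 4 candidates contain.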
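
Reranking only reorders the candidates it receives, so recall after reranking cannot exceed the recall of the candidate pool. The sketch below illustrates this with a hypothetical ranking; `recall_at`, `base_ranking`, and `pool_size` are names assumed for this example, and a seeded shuffle stands in for an arbitrary reranker.

```python
import random

def recall_at(ranked_ids, relevant_ids, n):
    """Fraction of relevant IDs found among the first n ranked IDs."""
    return len(set(ranked_ids[:n]) & set(relevant_ids)) / len(relevant_ids)

# Hypothetical base ranking; the relevant documents sit at ranks 2 and 6
base_ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
relevant = ["d2", "d6"]

pool_size = 4  # number of candidates handed to the reranker
pool = base_ranking[:pool_size]

# Any reranker only permutes the pool; a seeded shuffle stands in for it here
reranked = pool[:]
random.Random(0).shuffle(reranked)

print(recall_at(base_ranking, relevant, 8))  # 1.0: both relevant docs are in the top 8
print(recall_at(reranked, relevant, 8))      # 0.5: "d6" never reached the reranker
```

Whatever order the reranker produces, a relevant document outside the pool stays unreachable, which is why the pool size matters as much as the reranking model.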

## 6. Generation

**6a) [6 points]** After improving the retrieval part of the RAG system, we want to finally generate an answer for our query. Retrieve the most relevant document for query "How much funding did HealthPro Innovations raise in February 2021?" and print its ID. Then write the instruction message of a prompt to answer this query including all necessary elements before running it using your favourite LLM (ChatGPT GPT-4o, etc.). Please paste the answer from the model and indicate which model you used.

In [20]:
query = 'How much funding did HealthPro Innovations raise in February 2021?'

retrieved_docs = vector_store.similarity_search(query, k=1)
print(f"Retrieved Document ID: {retrieved_docs[0].metadata['id']}")

Retrieved Document ID: 54


**Your Prompt:**



In [22]:
prompt = f"""
You are a medical expert. Please answer the following question based on the retrieved document:
###### Document
{retrieved_docs[0].page_content}
###### Question
{query}
"""

**Generated Answer:**

HealthPro Innovations raised $150 million in funding in February 2021.

**Used Model:**

ChatGPT

**6b) [3 points]** We want to use in-context learning and provide the LLM with one example of a possible answer. Use the same prompt and extend it so that the answer follows this example: "Yep, they sold a lot in that year. Over 50 million units as I can see — pretty big move, respect!". Use the same model, create a fresh chat and run this new prompt. Highlight the changes in the prompt using **bold style** or <span style="color:red;">color</span>.

**Your Prompt:**

Same as above + <b>Use this sample answer: 
"Yep, they sold a lot in that year. Over 50 million units as I can see — pretty big move, respect!"</b>

**Generated Answer:**

Yep, they raised a lot in that month. A solid $150 million in February 2021 — pretty big move, respect!


**6c) [2 points]** Please check if the two answers are correct according to the document and how they differ. Does the model follow the example in the second prompt?

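**Your Answer:**

Both answers state that HealthPro Innovations raised $150 million in February 2021, which is correct according to the retrieved document. They differ in style: the first is a plain factual sentence, whereas the second adopts the casual tone and structure of the example ("Yep, ... pretty big move, respect!") and adapts its wording from "sold ... in that year" to "raised ... in that month". The model therefore follows the example in the second prompt.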
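
The two prompts from 6a and 6b could also be assembled programmatically, for example as sketched below. The function `build_prompt` and the placeholder strings are hypothetical; the instruction wording is borrowed from the prompts above, and in the notebook the document text would be `retrieved_docs[0].page_content`.

```python
def build_prompt(document, question, example_answer=None):
    """Assemble a RAG prompt; optionally append one example answer (in-context learning)."""
    parts = [
        "Please answer the following question based on the retrieved document:",
        "###### Document",
        document,
        "###### Question",
        question,
    ]
    if example_answer is not None:
        parts.append(f'Use this sample answer: "{example_answer}"')
    return "\n".join(parts)

# Placeholders (assumptions): substitute the retrieved text and the example from 6b
doc_text = "<retrieved document text>"
question = "How much funding did HealthPro Innovations raise in February 2021?"

zero_shot = build_prompt(doc_text, question)
one_shot = build_prompt(doc_text, question, example_answer="<example answer from 6b>")
print(one_shot)
```

Keeping the example answer as an optional argument makes the only difference between the two prompts explicit, which helps when comparing the generated answers.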

## End of AdvNLP Lab

Please make sure all cells have been executed, save this completed notebook, and upload it to Moodle.