# NLP Lab10 - RAG-based Question Answering

**Author: Bartłomiej Jamiołkowski**

This exercise introduces modern approaches to Question Answering using Retrieval-Augmented Generation (RAG) with LLMs and vector databases.

In [1]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_chroma import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from uuid import uuid4

import chromadb
import warnings
warnings.filterwarnings('ignore')

## Task 1

Set up the QA environment:
   * Install OLLAMA and select an appropriate LLM
   * Configure [Qdrant](https://qdrant.tech/) vector database (or vector DB of your choosing)
   * Install necessary Python packages for embedding generation

In [2]:
ollama = Ollama(base_url = 'http://localhost:11434', model = 'gemma2')

I decided to use the Chroma vector database, which is configured in the following tasks.

## Tasks 2 - 3

Find a PDF file of your choosing, for example a publication or a CV.

Write the following procedures needed for the RAG pipeline. Use the [LangChain](https://python.langchain.com/docs/introduction/) library:
 
   * Load PDF file using `PyPDFLoader`.  
   * Split documents into appropriate chunks using `RecursiveCharacterTextSplitter`.
   * Generate and store embeddings in Qdrant database

For this task, I decided to use a fictional CV in PDF format provided by [BeamJobs](https://www.beamjobs.com/resumes/nlp-data-scientist-resume-examples). The CV has only one page.

In [3]:
loader = PyPDFLoader(file_path = './data/nlp-data-scientist-official-resume-example.pdf')
pages = loader.load()

In [4]:
print(pages[0].metadata)

{'source': './data/nlp-data-scientist-official-resume-example.pdf', 'page': 0}


In [5]:
print(pages[0].page_content[:100])

RAHUL MALIK
NLP DATA SCIENTIST
CONTACT
rahulmalik@email.com
(123) 456-7890
Brooklyn, NY
LinkedIn
Git


In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 100, chunk_overlap = 40, length_function = len)

In [7]:
chunks = text_splitter.split_documents(pages)
print(f'Chunks: {chunks}\n\nNumber of chunks: {len(chunks)}')

Chunks: [Document(metadata={'source': './data/nlp-data-scientist-official-resume-example.pdf', 'page': 0}, page_content='RAHUL MALIK\nNLP DATA SCIENTIST\nCONTACT\nrahulmalik@email.com\n(123) 456-7890\nBrooklyn, NY\nLinkedIn'), Document(metadata={'source': './data/nlp-data-scientist-official-resume-example.pdf', 'page': 0}, page_content='(123) 456-7890\nBrooklyn, NY\nLinkedIn\nGithub\nEDUCATION\nPhD\nNatural Language\nProcessing (NLP)'), Document(metadata={'source': './data/nlp-data-scientist-official-resume-example.pdf', 'page': 0}, page_content='PhD\nNatural Language\nProcessing (NLP)\nUniversity of Maryland\nSeptember 2010 - April 2016'), Document(metadata={'source': './data/nlp-data-scientist-official-resume-example.pdf', 'page': 0}, page_content='September 2010 - April 2016\nCollege Park, MD\nB.S.\nStatistics\nPrinceton University'), Document(metadata={'source': './data/nlp-data-scientist-official-resume-example.pdf', 'page': 0}, page_content='B.S.\nStatistics\nPrinceton University
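The interaction between `chunk_size = 100` and `chunk_overlap = 40` can be illustrated with a simplified, standard-library-only sketch. It is not LangChain's implementation: it only packs whole lines greedily and carries trailing lines over as overlap, which is an assumption about how the splitter behaves on newline-separated text like this CV. The function name `pack_lines` is hypothetical.

```python
def pack_lines(text, chunk_size=100, chunk_overlap=40, sep="\n"):
    """Greedily pack whole lines into chunks of at most chunk_size characters.

    Simplified illustration only: a line longer than chunk_size is kept whole,
    whereas a real recursive splitter would fall back to finer separators.
    """
    chunks, current = [], []
    for line in text.split(sep):
        if current and len(sep.join(current + [line])) > chunk_size:
            chunks.append(sep.join(current))
            # Carry trailing lines over as overlap: drop lines from the front
            # until the carried text fits chunk_overlap and leaves room for `line`.
            while current and (
                len(sep.join(current)) > chunk_overlap
                or len(sep.join(current + [line])) > chunk_size
            ):
                current.pop(0)
        current.append(line)
    if current:
        chunks.append(sep.join(current))
    return chunks


# The first lines of the CV, as they appear in the chunk output above.
header = "\n".join([
    "RAHUL MALIK", "NLP DATA SCIENTIST", "CONTACT", "rahulmalik@email.com",
    "(123) 456-7890", "Brooklyn, NY", "LinkedIn", "Github", "EDUCATION", "PhD",
    "Natural Language", "Processing (NLP)", "University of Maryland",
    "September 2010 - April 2016",
])

chunks = pack_lines(header)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

On these lines the sketch yields three chunks whose boundaries match the first three chunks printed above: each chunk stays within 100 characters, and each new chunk starts with the last lines of the previous one that fit into 40 characters.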

The following code generates embeddings using the Ollama model 'gemma2' and stores them in the Chroma vector database, as specified in Task 1.

In [8]:
embeddings = OllamaEmbeddings(model = 'gemma2')

In [9]:
vector_store = Chroma(
    collection_name = 'lab10_rag_collection',
    embedding_function = embeddings,
    persist_directory = './chroma_langchain_db'
)

In [10]:
uuids = [str(uuid4()) for _ in range(len(chunks))]
vector_store.add_documents(documents = chunks, ids = uuids)

['247ab915-6c12-4dc4-9783-df5a128919a6',
 '5871e483-cf8b-4286-a861-a84cd9516885',
 'b34d131e-62e3-4c29-a0fd-3f0ce9f3823a',
 'c6029e14-6210-498f-9dc9-fd493ea4eb6b',
 '433ddd5e-ba4d-4a63-a125-482abdaebef3',
 '6b2bf3da-0ab0-44b7-acd7-49195a742764',
 '3d218436-0f1e-417a-af0a-9601a1168b97',
 '0a2b2507-9450-4061-bcb9-5d9f2dcfb3df',
 'b475b447-3aa3-4ef9-82d2-f2385e4a98c6',
 'd54711e2-7f6e-4568-8da2-afc6c938eaa7',
 '96e0c6dd-757c-4c36-a09e-0421aaf86a21',
 '6140ebfc-aeb6-4aed-953b-45389a85a4b8',
 '9a966c78-511a-44fe-b35b-a0489ab189ff',
 'eeeea5fd-5db1-460e-84df-df5ccfb32657',
 '97ad0478-e33d-4b8f-a705-9e237294247a',
 '84a195a8-7a8b-4b5a-b44e-9975d2eb9049',
 'f81149c3-4c50-4c0e-bb9a-09787f4fce4e',
 '2c96a75c-bded-4e71-b943-6e8e7b70808d',
 '953381cb-f5c5-4f8e-bcc5-be2be2357f70',
 'be9fde12-ba2c-4f54-bf7a-76ad3d0d3916',
 '64eea69c-1357-4641-bc68-0e44d12bc0ed',
 '5d6ae52b-11a9-46aa-81ef-7996474650ca',
 '6b331817-6cb7-4c38-b1ab-4403ac2448dc',
 '59189352-538e-4b89-b07c-5ddabebc902c',
 '43b1fc3e-762d-
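Because the collection is persisted in `./chroma_langchain_db` and the IDs come from `uuid4`, re-running the notebook would likely add the same chunks again under new IDs, so the retriever could return duplicates. One possible remedy, sketched below with the standard library only, is to derive each ID deterministically from the chunk's source, page, and text. The helper name `stable_chunk_id` and the example path are hypothetical.

```python
from uuid import NAMESPACE_URL, uuid5


def stable_chunk_id(source, page, text):
    """Return the same UUID string every time for the same source, page, and text."""
    return str(uuid5(NAMESPACE_URL, f"{source}|{page}|{text}"))


# Hypothetical path and chunk texts, used only to show the behaviour.
id_a = stable_chunk_id("./data/cv.pdf", 0, "RAHUL MALIK\nNLP DATA SCIENTIST")
id_b = stable_chunk_id("./data/cv.pdf", 0, "RAHUL MALIK\nNLP DATA SCIENTIST")
id_c = stable_chunk_id("./data/cv.pdf", 0, "Brooklyn, NY")

print(id_a == id_b)  # same input, same ID
print(id_a == id_c)  # different text, different ID
```

In the notebook this could replace the `uuid4` list, for example `[stable_chunk_id(c.metadata['source'], c.metadata['page'], c.page_content) for c in chunks]`. This assumes the vector store overwrites entries whose ID already exists, which should be checked against the installed `langchain_chroma` version.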

## Task 4

Design and implement the RAG pipeline with `LCEL`. As a reference, use this detailed guide created by the LangChain community - [RAG](https://python.langchain.com/docs/tutorials/rag/). Next steps should involve:
   * Create query embedding generation
   * Implement semantic search in Qdrant
   * Design prompt templates for context integration
   * Build response generation with the LLM

In this task, I use a retriever that generates query embeddings and performs semantic search in Chroma.

In [11]:
retriever = vector_store.as_retriever(search_kwargs = {'k': 7})
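What the retriever does for `k` results can be sketched without any vector database: embed the query, score every stored vector against it, and keep the best `k`. The sketch below uses cosine similarity and hand-made 3-dimensional vectors. Both are assumptions for illustration: real embeddings have far more dimensions, and the metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k):
    """Return the IDs of the k stored vectors most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy vectors standing in for chunk embeddings (hypothetical values).
toy_vectors = {
    "contact": [0.9, 0.1, 0.0],
    "education": [0.1, 0.9, 0.1],
    "skills": [0.0, 0.2, 0.9],
}

nearest = top_k([0.8, 0.2, 0.1], toy_vectors, k=2)
print(nearest)  # ['contact', 'education']
```

This is an exhaustive scan; vector databases typically replace it with an index so that the search stays fast on large collections.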

Creating a basic QA prompt.

In [12]:
prompt = PromptTemplate(
    input_variables = ['context', 'query'],
    template = '''
    <start_of_turn>user
    You are an AI assistant. Use the context below to answer the query. You are not allowed to write additional comments.

    Context: {context}

    Query: {query}
    <end_of_turn>
    <start_of_turn>model
    '''
)

In [13]:
def chunks_to_string(chunks):
    return '\n\n'.join(chunk.page_content for chunk in chunks)

Implementing the RAG pipeline with LCEL.

In [14]:
rag_chain = {'context': retriever | chunks_to_string, 'query': RunnablePassthrough()} | prompt | ollama | StrOutputParser()
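The data flow of this chain can be mimicked in plain Python to show what each `|` does: the dictionary step runs two branches on the same query, the prompt step fills the template, and the model step turns the prompt into text. The sketch below is not how LangChain implements LCEL; the `Step` class, `fan_out`, and the fake retriever and model are hypothetical stand-ins.

```python
class Step:
    """Tiny stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Composition: feed this step's output into the next step.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


def fan_out(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})


# Stand-ins for the retriever, chunks_to_string, RunnablePassthrough, prompt, and LLM.
fake_retriever = Step(lambda query: ["Brooklyn, NY", "NLP DATA SCIENTIST"])
join_chunks = Step(lambda chunks: "\n\n".join(chunks))
passthrough = Step(lambda value: value)
fill_prompt = Step(lambda d: f"Context: {d['context']}\n\nQuery: {d['query']}")
fake_llm = Step(lambda prompt_text: f"[model saw {len(prompt_text)} characters]")

chain = fan_out(context=fake_retriever | join_chunks, query=passthrough) | fill_prompt | fake_llm
print(chain.invoke("Where does Rahul Malik live?"))
```

The query string therefore travels down two paths: one through retrieval into `context`, and one unchanged into `query`.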

I define five queries for evaluation and later comparison.

In [15]:
queries = [
    'What skills does Rahul Malik have?',
    'Where does Rahul Malik live?',
    'What is the current position of Rahul Malik?',
    'What are the former employers of Rahul Malik?',
    'When did Rahul Malik graduate from the university?'
]

Building response generation.

In [16]:
for query in queries:
    answer = rag_chain.invoke(query)
    print(f"Query: {query}\nAnswer: {answer}{'-' * 45}")

Query: What skills does Rahul Malik have?
Answer: Python (NumPy, Pandas, Scikit-learn, Keras, Flask), SQL (MySQL, Postgres), Git, Time Series Forecasting, Productionizing Models, Recommendation Engines, Customer Segmentation, AWS, NLP 
---------------------------------------------
Query: Where does Rahul Malik live?
Answer: New York, NY  
---------------------------------------------
Query: What is the current position of Rahul Malik?
Answer: NLP Data Scientist 
---------------------------------------------
Query: What are the former employers of Rahul Malik?
Answer: Priceline 
---------------------------------------------
Query: When did Rahul Malik graduate from the university?
Answer: April 2016 
---------------------------------------------


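Comparing the performance of RAG with a pure LLM response. In this baseline, the LLM receives the whole CV page as context instead of the retrieved chunks.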

In [17]:
for query in queries:
    answer = ollama.invoke(prompt.format(context = str(pages[0].page_content), query = query), temperature = 0.0)
    print(f"Query: {query}\nAnswer: {answer}{'-' * 45}")

Query: What skills does Rahul Malik have?
Answer: Python (NumPy, Pandas, Scikit-learn, Keras, Flask), SQL (MySQL, Postgres), Git, Time Series Forecasting, Productionizing Models, Recommendation Engines, Customer Segmentation, AWS, NLP 
---------------------------------------------
Query: Where does Rahul Malik live?
Answer: Brooklyn, NY  
---------------------------------------------
Query: What is the current position of Rahul Malik?
Answer: NLP Data Scientist  
---------------------------------------------
Query: What are the former employers of Rahul Malik?
Answer: Amazon, Priceline, Microsoft 
---------------------------------------------
Query: When did Rahul Malik graduate from the university?
Answer: September 2010 - April 2016  
---------------------------------------------


RAG and the LLM prompted with the whole CV (the "pure LLM" run) returned largely similar answers, but some differences suggest that RAG can be more precise. For the fourth query, about the former employers of the fictional Rahul Malik, RAG returned one of the two past employers named in the CV. In contrast, the LLM returned all organization names, including Rahul Malik's current employer. In my opinion, this indicates that the LLM was not as precise, although the RAG answer was incomplete. The second query is an exception: RAG answered "New York, NY", whereas the CV header and the pure LLM run both give "Brooklyn, NY".

A similar situation occurred with the fifth query, which asked when the person graduated from university. RAG gave the precise date, whereas the LLM returned the whole period of study instead of the graduation date.
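To make the comparison easier to check, the two sets of answers printed above can be compared programmatically. The sketch below only flags queries whose answers differ after whitespace and case normalization. It is a crude proxy and an assumption of this example, because it does not judge which answer is correct.

```python
queries = [
    "What skills does Rahul Malik have?",
    "Where does Rahul Malik live?",
    "What is the current position of Rahul Malik?",
    "What are the former employers of Rahul Malik?",
    "When did Rahul Malik graduate from the university?",
]

skills = (
    "Python (NumPy, Pandas, Scikit-learn, Keras, Flask), SQL (MySQL, Postgres), Git, "
    "Time Series Forecasting, Productionizing Models, Recommendation Engines, "
    "Customer Segmentation, AWS, NLP"
)

# Answers copied from the two runs above.
rag_answers = [skills, "New York, NY", "NLP Data Scientist", "Priceline", "April 2016"]
llm_answers = [
    skills,
    "Brooklyn, NY",
    "NLP Data Scientist",
    "Amazon, Priceline, Microsoft",
    "September 2010 - April 2016",
]


def normalize(text):
    """Collapse whitespace and lowercase so trivial differences are ignored."""
    return " ".join(text.split()).lower()


differing = [
    query
    for query, rag, llm in zip(queries, rag_answers, llm_answers)
    if normalize(rag) != normalize(llm)
]

for query in differing:
    print(query)
```

The two runs disagree on the second, fourth, and fifth queries, which are the ones that need manual inspection against the CV.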

These findings, along with an analysis of the theoretical aspects of both methods, allow me to answer the following questions.

## Questions

### Question 1

How does RAG improve the quality and reliability of LLM responses compared to pure LLM generation?

RAG improves the quality and reliability of LLM responses compared to pure LLM generation by combining the generative power of LLMs with information retrieved at query time from external sources, such as databases. As a result, it can handle complex queries by pulling in relevant sections of retrieved text. Moreover, in contrast to a standalone LLM, RAG mitigates hallucinations by grounding the model's responses in actual retrieved documents or data. RAG also enhances the contextual relevance of responses: when the model receives specific documents or pieces of information, it can tailor its response more precisely to the user's query. Finally, RAG can cross-reference multiple sources during generation, which helps keep the facts in the response coherent and consistent.

### Question 2

What are the key factors affecting RAG performance (chunk size, embedding quality, prompt design)?

Chunk size, embedding quality, and prompt design all play a crucial role in RAG performance. Chunk size affects the balance between precision and context. Smaller chunks improve retrieval accuracy but may lose broader context, while larger chunks provide richer context but risk including irrelevant details. The optimal size depends on the content and query type. Embedding quality determines how accurately retrieval captures semantic meaning. High-quality, domain-specific embeddings improve relevance, while poor embeddings lead to irrelevant or misleading retrievals. Finally, a well-structured prompt guides the model in using the retrieved information effectively, which leads to logical, relevant, and coherent responses.

### Question 3

How does the choice of vector database and embedding model impact system performance?

The choice of vector database and embedding model significantly impacts system performance. The vector database determines how efficiently relevant information is retrieved, which directly affects response time. This efficiency comes from features such as efficient indexing and approximate nearest neighbor (ANN) search. Scalability is another critical factor, because a good vector database can handle large datasets while maintaining fast retrieval times.

Similarly, the embedding model plays a crucial role in system performance. An efficient embedding model reduces latency and computational costs, while high-quality embeddings ensure better retrieval accuracy and more relevant results. Together, the vector database and embedding model form the foundation for a performant RAG system, influencing both speed and accuracy.

### Question 4

What are the main challenges in implementing a production-ready RAG system?

Implementing a production-ready Retrieval-Augmented Generation (RAG) system comes with several challenges. One of the main issues is the incompleteness of the knowledge base: the data needed to answer a query may be unavailable or poorly formatted, making it difficult to use effectively. Another common challenge is ensuring that the output is generated in the correct format. JSON is a common target format, but models often produce incomplete or incorrectly structured output. Additionally, parsing complex documents, such as PDFs with embedded charts, tables, or images, requires advanced parsers capable of handling these structures. Finally, managing large-scale data retrieval is demanding, because it requires efficient algorithms or specialized system architectures to ensure both accuracy and performance.
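The output-format problem can be handled defensively by validating the model's text before passing it on. The following is a minimal sketch using only the standard library. The required keys `answer` and `sources` are hypothetical and would depend on the schema the application expects.

```python
import json


def parse_answer(raw, required_keys=("answer", "sources")):
    """Return the parsed dict, or None if `raw` is not a JSON object with the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(key not in data for key in required_keys):
        return None
    return data


complete = parse_answer('{"answer": "April 2016", "sources": [0]}')
truncated = parse_answer('{"answer": "April 2016"')      # output cut off mid-object
missing_key = parse_answer('{"answer": "April 2016"}')   # valid JSON, wrong structure

print(complete)
print(truncated)
print(missing_key)
```

A caller could retry the generation or fall back to a plain-text answer whenever `parse_answer` returns `None`.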

### Question 5

How can the system be improved to handle complex queries requiring multiple document lookups?

The RAG system can be improved in several ways to handle complex queries requiring multiple document lookups. First, I recommend enabling multi-document retrieval and re-ranking, which help prioritize the most relevant information. Another improvement worth considering is query decomposition and expansion, which breaks complex queries into simpler sub-queries and broadens the search scope. Advanced embedding models that use contextual and cross-document embeddings can enable better synthesis of information. Finally, post-processing techniques, such as information fusion and redundancy filtering, can further refine responses by combining insights and removing irrelevant data. Together, these enhancements should allow the system to deliver accurate and relevant answers to complex multi-step queries.
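One way to combine the results of several sub-queries into a single ranking without duplicates is reciprocal rank fusion (RRF). It is shown here as one possible fusion technique, not as part of the pipeline built above. The chunk IDs are hypothetical, and `k = 60` is the constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists: each appearance adds 1 / (k + rank) to an ID's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retrieval results for two sub-queries of one complex question.
rankings = [
    ["chunk_employers", "chunk_skills", "chunk_contact"],  # sub-query 1
    ["chunk_education", "chunk_employers"],                # sub-query 2
]

fused = reciprocal_rank_fusion(rankings)
print(fused)
```

A chunk retrieved by both sub-queries, such as `chunk_employers` here, accumulates score from each list and rises to the top, and every chunk appears only once in the fused result.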