# Evaluating RAG Pipelines for Document-Based Question Answering

The purpose of this notebook is to build and inspect the parts of the **Two-Stage Consecutive RAG** pipeline.  
We use LangChain's `QAGenerateChain` to generate a set of question-answer (QA) pairs from a collection of PDF documents. This set is then used to evaluate the performance of the RAG pipeline.  
The Two-Stage RAG pipeline is also compared with two other RAG pipelines for document-based QA, **Standard RAG** and **Hybrid RAG**. We compare the accuracy of each pipeline to determine how effectively it delivers precise and contextually relevant responses based on the provided documents.

# Load the necessary libraries

In [100]:
%load_ext autoreload
%autoreload 2
%matplotlib inline


# Required imports
import seaborn as sns  # used by the comparison plot at the end of the notebook
import matplotlib.pyplot as plt
import pandas as pd
import os
import glob
from langchain_community.document_loaders import PyPDFLoader  # langchain.document_loaders is a deprecated import path
from langchain.evaluation.qa import QAGenerateChain, QAEvalChain
from langchain_openai import ChatOpenAI
import numpy as np
from pprint import pprint
from hydra import compose, initialize
from omegaconf import DictConfig, OmegaConf
import random
# initialize(config_path="../configs", job_name="notebook_config")


import sys
# sys.path.append('../src')

# Assuming you're in notebooks/ directory
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

# from backend.settings import get_env_secrets  # now works!
from backend.my_lib.pdf_manager import PDFManager
from backend.my_lib.retrievers import Retrievers
from backend.my_lib.qa_chains import QAchains
from backend.my_lib.hybrid_retrieval import Hybrid_Retrieval

config = OmegaConf.load("../configs/config.yaml")
print('\n ======== Configuration parameters:')
print(OmegaConf.to_yaml(config))



The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

settings:
  verbose: false
splitter:
  large_chunk_size: 2000
  small_chunk_size: 400
  paragraph_separator: "\n \n"
llm:
  openai_modelID: gpt-4o-mini
  embed_model_id: sentence-transformers/all-MiniLM-L12-v2
Vectorstore:
  collection_name: large_chunks
  persist_directory: vector_store
Retrieval:
  semantic_CE_model: cross-encoder/stsb-TinyBERT-L-4
  keyword_CE_model: cross-encoder/ms-marco-MiniLM-L-6-v2
  top_k_BM25: 200
  top_k_documents: 5
  top_k_semantic: 50
  top_k_final: 10



# Parameters

In [2]:
skip_qa_generation = True
data_root = "../data/sample_pdfs/" # sample pdfs from the repository
# data_root = "../data/pdfs_selected/" # your uploaded pdfs

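# config is already loaded with OmegaConf.load above; compose() would require hydra.initialize() first
# config = compose(config_name="config")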
modelID = config.llm.openai_modelID
top_k_BM25 = config.Retrieval.top_k_BM25
top_k_semantic = config.Retrieval.top_k_semantic
top_k_final = config.Retrieval.top_k_final

question = "According to the documents, what is Morningstar's view on the Federal Reserve's interest rate decisions for the remainder of 2024 and into 2025?"

# Sample QA test case generation

## Document-level chunks
For QA sample generation, it is sufficient to create a single chunk per PDF file.  
Later, we will perform separate finer-grained chunking for the RAG pipelines.

In [46]:
filenames = [f for f in glob.glob(data_root + '*.pdf') if os.path.isfile(f)]

documents = []
for file in filenames:
    loader = PyPDFLoader(file)
    pages = loader.load()
    # Concatenate all pages into a single document-level chunk
    combined_page_content = "".join(page.page_content for page in pages)
    document = {"page_content": combined_page_content, "metadata": {"source": file}}
    documents.append(document)
    print(file)
print(f'{len(documents)} PDF chunks created with one chunk per PDF\n')

../data/sample_pdfs/Carl Vine_ Japan Holds the ‘Most Fascinating Little Pocket of the Global Equity Market’ _ Morningstar.pdf
../data/sample_pdfs/5 Hot Stocks to Sell Before They Report Earnings _ Morningstar.pdf
../data/sample_pdfs/4 Risky Stocks to Sell and 4 Picks to Buy Instead _ Morningstar.pdf
../data/sample_pdfs/5 Undervalued Stocks to Buy as Their Stories Play Out _ Morningstar.pdf
../data/sample_pdfs/5 Undervalued Stocks to Buy During Q4 _ Morningstar.pdf
../data/sample_pdfs/August 2024 Stock Market Outlook_ Small-Cap and Value Stocks Shine _ Morningstar.pdf
../data/sample_pdfs/5 Stocks to Buy as the Market Rally Broadens _ Morningstar.pdf
7 PDF chunks created with one chunk per PDF



## QA pair generation using LLM

In [47]:
# Generate n_samples QA pairs from randomly chosen PDFs
n_samples = 2

# randint's lower bound is inclusive, so start at 0 to include the first document
index = np.random.randint(0, len(documents), n_samples)
sample_docs = [documents[i] for i in index]   
llm = ChatOpenAI(temperature = 0.0, model=modelID)

example_gen_chain = QAGenerateChain.from_llm(llm)
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in sample_docs],
)
print(f'{len(new_examples)} QA pairs generated from PDFs\n')

for i, example in enumerate(new_examples):
    source = sample_docs[i]['metadata']['source']
    print("\n Source file:\n", source)    
    example['source'] = source
    print('Question:')
    pprint(example['qa_pairs']['query'])
    print('Answer:')
    pprint(example['qa_pairs']['answer'])



2 QA pairs generated from PDFs


 Source file:
 ../data/sample_pdfs/5 Undervalued Stocks to Buy as Their Stories Play Out _ Morningstar.pdf
Question:
("What are the expectations for the Federal Reserve's interest rate cuts "
 'according to David Sekera, and how do these expectations relate to the '
 'upcoming Fed meetings and inflation data?')
Answer:
('David Sekera indicates that the market currently implies a 0% chance of a '
 'rate cut at the upcoming Fed meeting, with only a 10% probability of a cut '
 'during the July meeting. However, the probability increases to 50% for the '
 'September meeting. Sekera expects the Fed to maintain its current stance '
 'during the July meeting, emphasizing the need for additional data to confirm '
 'that inflation is on a downward path. He anticipates that the July meeting '
 'will be uneventful, with no significant commentary on the timing of rate '
 'cuts. The release of CPI and PPI numbers is also crucial, as they could '
 'influence market v
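
Because `np.random.randint` draws with replacement, the same PDF can be picked more than once when `n_samples` is small relative to the collection. The following is a minimal sketch of drawing distinct indices with the standard library instead; the helper name `pick_sample_indices` and the fixed seed are illustrative assumptions, not part of the notebook.

```python
import random


def pick_sample_indices(n_documents, n_samples, seed=None):
    """Return n_samples distinct document indices in [0, n_documents)."""
    if n_samples > n_documents:
        raise ValueError("n_samples cannot exceed the number of documents")
    rng = random.Random(seed)  # local generator, so the global state is untouched
    return rng.sample(range(n_documents), n_samples)


# Example: 7 PDFs, 2 QA pairs; a seed makes the draw reproducible.
index = pick_sample_indices(7, 2, seed=0)
print(index)
```

The returned list can be used in place of `index` in the cell above, since it is only used to select entries of `documents`.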

# Two-Stage RAG

## Building the RAG pipeline

In [50]:
import warnings
import numpy as np

# config.settings.verbose = True

pdf_manager = PDFManager(data_root, config)
pdf_manager.load_pdfs()
pdf_manager.chunk_documents()
pdf_manager.create_vectorstore()

# Create retrievers
retrievers = Retrievers(pdf_manager, config)
retrievers.setup_retrievers()

# create qa chains
qa_chains = QAchains(retrievers, config)

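def two_stage_rag(question):
    """Answer a question with the two-stage pipeline: shorten, retrieve, generate."""
    qa_chains.shorten_question(question)  # reduce the question to keyword phrases
    qa_chains.retrieve_context()          # retrieve and rank the context chunks
    answer = qa_chains.generate_answer()
    return answer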

Total document pages loaded: 134 from ../data/sample_pdfs/
Documents split into 758 small and 215 large chunks.
Collection large_chunks is deleted
Vectorstore large_chunks created successfully with 215 documents.
Retrievers created successfully.
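
The internals of `PDFManager.chunk_documents()` are not shown in this notebook. As a rough illustration of how the same text can yield both large and small chunks, here is a hypothetical sketch that splits on the configured `paragraph_separator` and greedily packs paragraphs up to a size limit. The function names are invented for this example, and treating `large_chunk_size` and `small_chunk_size` as character counts is an assumption.

```python
def split_paragraphs(text, separator="\n \n"):
    """Split text on the paragraph separator and drop empty pieces."""
    return [p.strip() for p in text.split(separator) if p.strip()]


def pack_chunks(paragraphs, chunk_size):
    """Greedily pack paragraphs into chunks of at most chunk_size characters.

    A single paragraph longer than chunk_size is cut into fixed-size slices.
    """
    chunks, current = [], ""
    for para in paragraphs:
        # Cut oversized paragraphs so every piece fits into one chunk.
        pieces = [para[i:i + chunk_size] for i in range(0, len(para), chunk_size)]
        for piece in pieces:
            candidate = f"{current}\n{piece}" if current else piece
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


# Synthetic text standing in for one PDF (not taken from the sample documents).
text = "\n \n".join(f"Paragraph {i}. " + "x" * 300 for i in range(12))
paragraphs = split_paragraphs(text)
large_chunks = pack_chunks(paragraphs, chunk_size=2000)
small_chunks = pack_chunks(paragraphs, chunk_size=400)
print(len(large_chunks), len(small_chunks))
```

With the configured sizes of 2000 and 400, the small chunks are considerably more numerous than the large ones, which matches the direction of the counts reported above (758 small versus 215 large).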


# Inspection of the two-stage RAG with a sample question

In [51]:
question = new_examples[0]['qa_pairs']['query']
pprint(question)

("What are the expectations for the Federal Reserve's interest rate cuts "
 'according to David Sekera, and how do these expectations relate to the '
 'upcoming Fed meetings and inflation data?')


In [None]:
# Question shortening
qa_chains.shorten_question(question)
shortened_question = qa_chains.shortened_question

Federal Reserve, interest rate cuts, David Sekera, expectations, Fed meetings, inflation data.


In [102]:
# Test the retrievers
qa_chains.retrieve_context()
top_score_docs = qa_chains.top_score_docs
i_chunk = random.randint(0, len(top_score_docs)-1)
print('Number of top score chunks retrieved:',len(top_score_docs))
print('\n====== Content of a retrieved top score chunk:\n',top_score_docs[i_chunk].page_content)
print(f'\n====== The chunk metadata:\n {top_score_docs[i_chunk].metadata}')  

The small chunks were retrieved
The DRS was calculated for relevant PDF documents

 ==== Ranked retrieved large chunks ==== 

The large chunks were retrieved
The aggregated scores were calculated for all retrieved chunks
The top score chunks were concatenated
Number of top score chunks retrieved: 20

 And of course, that’s still our base case scenario as far as what we expect the Fed to
start cutting interest rates.
Dziubinski: So then what are you going to be listening for during Fed Chair Powell’s
remarks? And from the Fed statement?
Sekera: I think this is going to be actually a pretty boring, uneventful Fed meeting. I

 {'name': '5 Undervalued Stocks to Buy as Their Stories Play Out _ Morningstar.pdf', 'page': 2, 'score': 1.208620309829712, 'aggregated_score': np.float64(1.0374444365488267)}


In [90]:
qa_chains.generate_answer()
answer = qa_chains.response
pprint(answer)

('According to David Sekera, the Morningstar US economics team is expecting a '
 '25-basis-point cut in the federal-funds rate for each of the next two '
 'meetings. They anticipate that the federal-funds rate will decline to a '
 'range of 3.00% to 3.25% by the end of next year. However, Sekera notes a '
 'caveat: if employment remains strong, there is a possibility that the Fed '
 'may not cut the rate as much as expected.\n'
 '\n'
 'In terms of the upcoming Fed meetings, Sekera indicates that the market '
 'currently implies a 0% chance of a rate cut at the next meeting, with only a '
 '10% chance for the July meeting. The probability increases to 50% for the '
 'September meeting, which aligns with their base case scenario for when the '
 'Fed might begin cutting rates.\n'
 '\n'
 'Regarding inflation data, Sekera suggests that if inflation numbers come out '
 'higher than expected, it could lead to a sharp market selloff and would call '
 "into question the Fed's ability to manage 

# Two-Stage RAG evaluation using generated QA pairs

In [21]:
predictions = list()
examples = list()
for example, idoc in zip(new_examples, index):
    print(f'Document {idoc}')
    question = example['qa_pairs']['query']
    pprint(question)
    answer = two_stage_rag(question)
    prediction = {'query': example['qa_pairs']['query'], 'answer': example['qa_pairs']['answer']}
    prediction["result"] = answer
    predictions.append(prediction)
    examples.append(
            example.get('qa_pairs')
        )

eval_chain = QAEvalChain.from_llm(llm)
graded_outputs_2RAG = eval_chain.evaluate(examples, predictions, prediction_key="result")
print('The result of RAG evaluation for the given example questions: ')

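# The grader returns either 'CORRECT' or 'GRADE: CORRECT', so strip the optional prefix before comparing
accuracy_2RAG = len([result for result in graded_outputs_2RAG if result['results'].replace('GRADE:', '').strip() == 'CORRECT'])/len(graded_outputs_2RAG)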
print(f'\n Accuracy of the RAG pipeline: {accuracy_2RAG} \n')

graded_outputs_2RAG

Document 1
('What are the key economic indicators and earnings reports that David Sekera '
 'is focusing on in the week of July 15, 2024, and what implications do they '
 'have for consumer spending and stock valuations?')
- Key economic indicators  
- Earnings reports  
- David Sekera  
- Week of July 15, 2024  
- Implications for consumer spending  
- Stock valuations  
The shortened question:
 - Key economic indicators  
- Earnings reports  
- David Sekera  
- Week of July 15, 2024  
- Implications for consumer spending  
- Stock valuations  
The small chunks were retrieved
The DRS was calculated for relevant PDF documents

 ==== Ranked retrieved large chunks ==== 

The large chunks were retrieved
The aggregated scores were calculated for all retrieved chunks
The top score chunks were concatenated
Document 4
("What are the key factors influencing Morningstar's stock recommendations for "
 'Q4 2024, particularly regarding the economic outlook, interest rates, and '
 'specific stock p

[{'results': 'GRADE: INCORRECT'}, {'results': 'INCORRECT'}]
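
The grader output is not uniform: some verdicts carry a `GRADE:` prefix and some do not, so an exact comparison with `'CORRECT'` misses `'GRADE: CORRECT'`, while a substring check would wrongly match `'INCORRECT'`. A minimal sketch of a tolerant accuracy computation follows; `is_correct` and `accuracy` are hypothetical helper names, and the input lists are copied from the graded outputs recorded in this notebook.

```python
def is_correct(graded):
    """Return True when a QAEvalChain-style result is graded CORRECT.

    Handles both 'CORRECT' and 'GRADE: CORRECT' by looking at the last token.
    """
    tokens = graded["results"].strip().upper().split()
    return bool(tokens) and tokens[-1] == "CORRECT"


def accuracy(graded_outputs):
    """Fraction of graded outputs that are CORRECT (0.0 for an empty list)."""
    if not graded_outputs:
        return 0.0
    return sum(is_correct(g) for g in graded_outputs) / len(graded_outputs)


# Grader outputs copied from this notebook's recorded results.
two_stage = [{"results": "GRADE: INCORRECT"}, {"results": "INCORRECT"}]
hybrid = [{"results": "GRADE: CORRECT"}, {"results": "INCORRECT"}]
print(accuracy(two_stage), accuracy(hybrid))
```

With only two QA pairs per run, these fractions are far too coarse to rank the pipelines; a larger `n_samples` is needed for a meaningful comparison.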

# Comparison to Hybrid RAG

In [24]:
hybrid_retrieval = Hybrid_Retrieval(pdf_manager, retrievers, config)
hybrid_RAG_QA = QAchains(retrievers, config)
def hybrid_rag(question,top_k_BM25, top_k_semantic, top_k_final, rrf_k = 60, hybrid = True):
    top_score_docs = hybrid_retrieval.hybrid_retriever(question, top_k_BM25, top_k_semantic, top_k_final, rrf_k, hybrid)
    hybrid_RAG_QA.top_score_docs = top_score_docs
    hybrid_RAG_QA.question = question
    answer = hybrid_RAG_QA.generate_answer()
    return answer
# answer_hybrid = hybrid_rag(question, 200, 50, 10)    
# pprint(answer_hybrid)
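
The `rrf_k` argument suggests that `Hybrid_Retrieval` merges the BM25 and semantic rankings with reciprocal rank fusion (RRF). Its implementation is not shown here, so that is an assumption. The sketch below illustrates plain RRF on hypothetical document ids and is not the project's actual code.

```python
def reciprocal_rank_fusion(rankings, rrf_k=60, top_k=10):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (rrf_k + rank)) over the lists that contain
    it, with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first; ties keep first-seen order (sorted is stable).
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


bm25_ranking = ["doc_a", "doc_b", "doc_c"]       # hypothetical keyword results
semantic_ranking = ["doc_c", "doc_a", "doc_d"]   # hypothetical semantic results
fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

A larger `rrf_k` flattens the difference between adjacent ranks, so documents that appear in both lists gain relative to documents ranked first in only one list.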

## QA pair evaluation loop

In [25]:
predictions_hybrid = list()
examples = list()
for example, idoc in zip(new_examples, index):
    question = example['qa_pairs']['query']
    print('Question:')
    pprint(question)
    # Pass the BM25 cutoff first, matching the hybrid_rag signature
    answer = hybrid_rag(question, top_k_BM25, top_k_semantic, top_k_final)
    print('Answer:')
    pprint(answer)
    prediction = {'query': example['qa_pairs']['query'], 'answer': example['qa_pairs']['answer']}
    prediction["result"] = answer
    predictions_hybrid.append(prediction)
    examples.append(
            example.get('qa_pairs')
        )

eval_chain = QAEvalChain.from_llm(llm)
graded_outputs_hybrid = eval_chain.evaluate(examples, predictions_hybrid, prediction_key="result")
print('The result of RAG evaluation for the given example questions: ')

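# The grader returns either 'CORRECT' or 'GRADE: CORRECT', so strip the optional prefix before comparing
accuracy_hybrid = len([result for result in graded_outputs_hybrid if result['results'].replace('GRADE:', '').strip() == 'CORRECT'])/len(graded_outputs_hybrid)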
print(f'\n Accuracy of the RAG pipeline: {accuracy_hybrid} \n')

graded_outputs_hybrid

Question:
('What are the key economic indicators and earnings reports that David Sekera '
 'is focusing on in the week of July 15, 2024, and what implications do they '
 'have for consumer spending and stock valuations?')
=== Hybrid Retrieval with BM25 and semantic search ===
Answer:
('In the week of July 15, 2024, David Sekera is focusing on several key '
 'economic indicators and earnings reports, particularly retail sales and the '
 'beginning of earnings season. \n'
 '\n'
 '1. **Retail Sales**: Sekera is particularly interested in the retail sales '
 'data, as it will provide insights into consumer spending patterns. He notes '
 'that low-income consumers have been under pressure from inflation for over a '
 'year, and there are emerging signs that middle-income households are also '
 'beginning to pull back on spending, especially in discretionary and '
 'indulgent categories. This trend could indicate a broader slowdown in '
 'consumer spending, which is critical for evaluating t

[{'results': 'GRADE: CORRECT'}, {'results': 'INCORRECT'}]

# Comparison to Standard RAG

In [26]:
predictions_rag = list()
examples = list()
for example, idoc in zip(new_examples, index):
    question = example['qa_pairs']['query']
    print('Question:')
    pprint(question)
    # hybrid=False switches to semantic-search-only retrieval
    answer = hybrid_rag(question, top_k_BM25, top_k_semantic, top_k_final, hybrid = False)
    print('Answer:')
    pprint(answer)
    prediction = {'query': example['qa_pairs']['query'], 'answer': example['qa_pairs']['answer']}
    prediction["result"] = answer
    predictions_rag.append(prediction)
    examples.append(
            example.get('qa_pairs')
        )

eval_chain = QAEvalChain.from_llm(llm)
graded_outputs_rag = eval_chain.evaluate(examples, predictions_rag, prediction_key="result")
print('The result of RAG evaluation for the given example questions: ')

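# The grader returns either 'CORRECT' or 'GRADE: CORRECT', so strip the optional prefix before comparing
accuracy_rag = len([result for result in graded_outputs_rag if result['results'].replace('GRADE:', '').strip() == 'CORRECT'])/len(graded_outputs_rag)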
print(f'\n Accuracy of the RAG pipeline: {accuracy_rag} \n')

graded_outputs_rag

Question:
('What are the key economic indicators and earnings reports that David Sekera '
 'is focusing on in the week of July 15, 2024, and what implications do they '
 'have for consumer spending and stock valuations?')
=== Semantic search retrieval only === 
Answer:
('In the week of July 15, 2024, David Sekera is focusing on several key '
 'economic indicators and earnings reports, particularly the Consumer Price '
 'Index (CPI) and Producer Price Index (PPI) numbers. The CPI and core CPI '
 'came in well below expectations, which is a positive sign for inflation '
 'control. Conversely, the PPI was slightly higher than expected, but the '
 'underlying components were lower than anticipated, suggesting that inflation '
 'pressures may not be as severe as previously thought.\n'
 '\n'
 'These indicators have significant implications for consumer spending. Sekera '
 'notes that low-income consumers have been under pressure from inflation for '
 'over a year, and there are signs that mi

[{'results': 'GRADE: INCORRECT'}, {'results': 'GRADE: INCORRECT'}]
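
The three evaluation loops above share the same structure and differ only in the function that answers the question. One possible way to factor this out is sketched below; `build_predictions` and `echo_pipeline` are hypothetical names introduced for illustration, and the demo QA pair is invented rather than generated from the PDFs.

```python
def build_predictions(qa_examples, answer_fn):
    """Run answer_fn on every generated question and pair it with the reference.

    qa_examples: list of dicts shaped like {'qa_pairs': {'query': ..., 'answer': ...}}
    answer_fn:   callable mapping a question string to the pipeline's answer
    """
    examples, predictions = [], []
    for example in qa_examples:
        qa = example["qa_pairs"]
        examples.append(qa)
        predictions.append({
            "query": qa["query"],
            "answer": qa["answer"],            # reference answer from QA generation
            "result": answer_fn(qa["query"]),  # answer produced by the pipeline
        })
    return examples, predictions


# Stand-in pipeline for demonstration only; in the notebook this would be
# two_stage_rag or a wrapper around hybrid_rag.
def echo_pipeline(question):
    return f"(answer to) {question}"


demo = [{"qa_pairs": {"query": "What is RRF?", "answer": "A rank fusion method."}}]
examples, predictions = build_predictions(demo, echo_pipeline)
print(predictions[0]["result"])
```

The returned lists have the shape that the cells above pass to `eval_chain.evaluate(examples, predictions, prediction_key="result")`.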

# Comparison of RAG pipelines

In [27]:
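# Create a dataframe
# NOTE: the accuracy_*_large and accuracy_*_small variables are not defined anywhere in
# this notebook (hence the NameError below); they must be assigned from separate
# large- and small-scale evaluation runs before this cell is executed.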
data = {
    'Method': ['RAG', 'Hybrid RAG', 'Two-Stage RAG'],
    'Large Scale Accuracy (%)': [accuracy_rag_large, accuracy_hybrid_large, accuracy_2RAG_large],
    'Small Scale Accuracy (%)': [accuracy_rag_small, accuracy_hybrid_small, accuracy_2RAG_small]
}
df = pd.DataFrame(data)
df_melted = df.melt(id_vars='Method', var_name='Experiment Scale', value_name='Accuracy (%)')
df_melted['Accuracy (%)'] = (df_melted['Accuracy (%)']*100).round(2)
df_melted.head()


# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create the bar plot
plt.figure(figsize=(10, 6))
bar_plot = sns.barplot(
    x='Method',
    y='Accuracy (%)',
    hue='Experiment Scale',
    data=df_melted,
    # palette='viridis'
)

# Add percentage labels on top of the bars
for p in bar_plot.patches:
    height = p.get_height()
    bar_plot.annotate(f'{height}%',
                      (p.get_x() + p.get_width() / 2., height),
                      ha='center', va='bottom',
                      fontsize=11)

# Set labels and title
plt.ylim(0, 100)  # Adjusted to fit percentage scale
plt.title('Accuracy Comparison of RAG Methods Across Experiment Scales', fontsize=16)
plt.ylabel('Accuracy (%)')
plt.xlabel('Method')
plt.legend(title='Experiment Scale')

# Display the plot
plt.show()

NameError: name 'accuracy_rag_large' is not defined

# Debugging

In [None]:
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# from backend.settings import is_streamlit_running


llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    # max_tokens=None,
    # timeout=None,
    max_retries=2,
    # api_key="...",
    # base_url="...",
    # organization="...",
    # other params...
)
# invoke it with a question
question = "According to the documents, what is Morningstar's view on the Federal Reserve's interest rate decisions for the remainder of 2024 and into 2025?"

shortening_prompt = """
You are an expert financial advisor tasked with shortening the original question. 
Your role is to reformulate the original question into short phrases with essential keywords.
Focus mostly on company names and on consultant or advisor names.
The answer does not need to be a complete sentence.
Do not convert words to abbreviations.

Original Question: "{original_question}"

Reformulated phrases: """            

custom_short_prompt = PromptTemplate.from_template(shortening_prompt)

shortening_chain = (
    {"original_question": RunnablePassthrough()}
    | custom_short_prompt
    | llm
    | StrOutputParser()
)
shortened_question = shortening_chain.invoke(question)
print(shortened_question)
# input_text = "The meaning of life is "
# llm.invoke(input_text)
# print(response)


