# Evaluating RAG Pipelines for Document-Based Question Answering

This notebook purpose is to build and inspect the parts of the **Two-Stage Consecutive RAG** pipeline.  
We use LangChain's `QAGenerateChain` to generate a set of question-answer (QA) pairs from the collection of PDF documents. This set will be used to evaluate the performance of the RAG pipeline.  
The performance of the Two-Stage RAG pipeline will be also compared to two other RAG pipelines—**Standard RAG** and **Hybrid RAG** for document-based QA. We compare the accuracy rates of each pipeline to determine their effectiveness in delivering precise and contextually relevant responses based on the provided documents.

# Load the necessary libraries

In [75]:
%load_ext autoreload
%autoreload 2
%matplotlib inline


# Required imports
# import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import os
import glob
from langchain.document_loaders import PyPDFLoader
from langchain.evaluation.qa import QAGenerateChain, QAEvalChain
from langchain_openai import ChatOpenAI
import numpy as np
from pprint import pprint
from hydra import compose, initialize
from omegaconf import DictConfig, OmegaConf
import random
# initialize(config_path="../configs", job_name="notebook_config")


import sys

# Assuming you're in notebooks/ directory
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

# from backend.settings import get_env_secrets  # now works!
from backend.my_lib.pdf_manager import PDFManager
from backend.my_lib.retrievers import Retrievers
from backend.my_lib.qa_chains import QAchains
from backend.my_lib.hybrid_retrieval import Hybrid_Retrieval
from backend.my_lib.LLMManager import LLMManager

config = OmegaConf.load("../configs/config.yaml")
print('\n ======== Configuration parameters:')
print(OmegaConf.to_yaml(config))



The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

settings:
  verbose: true
splitter:
  large_chunk_size: 2000
  small_chunk_size: 400
  paragraph_separator: "\n \n"
llm:
  openai_modelID: gpt-4o-mini
  embed_model_id: sentence-transformers/all-MiniLM-L12-v2
  local_llama_model: hugging-quants/Llama-3.2-3B-Instruct-Q4_K_M-GGUF
  local_llama_filename: '*q4_k_m.gguf'
Vectorstore:
  collection_name: large_chunks
  persist_directory: vector_store
  clear_existing: false
Retrieval:
  semantic_CE_model: cross-encoder/stsb-TinyBERT-L-4
  keyword_CE_model: cross-encoder/ms-marco-MiniLM-L-6-v2
  top_k_BM25: 200
  top_k_documents: 5
  top_k_semantic: 50
  top_k_final: 10



# Parameters

In [76]:
local_model = True # True: using local model, False: using OpenAI model
extract_sample_QA = False # True: using LLM to extract sample QA pairs from the pdfs
skip_qa_generation = True
data_root = "../data/sample_pdfs/" # sample pdfs from the repository
# data_root = "../data/pdfs_selected/" # your uploaded pdfs

# config = compose(config_name="config")
modelID = config.llm.openai_modelID
top_k_BM25 = config.Retrieval.top_k_BM25
top_k_semantic = config.Retrieval.top_k_semantic
top_k_final = config.Retrieval.top_k_final

question = "What are the expectations for the Federal Reserve's interest rate cuts according to David Sekera, and how do these expectations relate to the upcoming Fed meetings and inflation data?"
print('Default question:\n',question)

Default question:
 What are the expectations for the Federal Reserve's interest rate cuts according to David Sekera, and how do these expectations relate to the upcoming Fed meetings and inflation data?


# Sample QA test case generation

## Document-level chunks
For the QA sample generation it is sufficient to create only one chunk per pdf file.  
Later, we will perform separate finer-grained chunking for the RAG pipelines.

In [77]:
filenames = [f for f in glob.glob(data_root + '*.pdf') if os.path.isfile(f)]

documents = []
for file in filenames:    
    loader = PyPDFLoader(f'{file}')
    pages = loader.load()
    # add all pages together    
    combined_page_content = "".join([page.page_content for page in pages])
    document = {"page_content": combined_page_content, "metadata": {"source": file}}
    documents.append(document)
    print(file)
print(f'{len(documents)} PDF chunks created with one chunk per PDF\n')

../data/sample_pdfs/Carl Vine_ Japan Holds the ‘Most Fascinating Little Pocket of the Global Equity Market’ _ Morningstar.pdf
../data/sample_pdfs/5 Hot Stocks to Sell Before They Report Earnings _ Morningstar.pdf
../data/sample_pdfs/4 Risky Stocks to Sell and 4 Picks to Buy Instead _ Morningstar.pdf
../data/sample_pdfs/5 Undervalued Stocks to Buy as Their Stories Play Out _ Morningstar.pdf
../data/sample_pdfs/5 Undervalued Stocks to Buy During Q4 _ Morningstar.pdf
../data/sample_pdfs/August 2024 Stock Market Outlook_ Small-Cap and Value Stocks Shine _ Morningstar.pdf
../data/sample_pdfs/5 Stocks to Buy as the Market Rally Broadens _ Morningstar.pdf
7 PDF chunks created with one chunk per PDF



## QA pair generation using LLM

In [78]:
# generating n_samples of QA pairs from available PDFs
if extract_sample_QA:
    print("\n\nGenerating QA pairs from PDFs ... \n")
    n_samples = 2

    index = np.random.randint(1, len(documents), n_samples)
    sample_docs = [documents[i] for i in index]   
    llm = ChatOpenAI(temperature = 0.0, model=modelID)

    example_gen_chain = QAGenerateChain.from_llm(llm)
    new_examples = example_gen_chain.apply_and_parse(
        [{"doc": t} for t in sample_docs],
    )
    print(f'{len(new_examples)} QA pairs generated from PDFs\n')

    for i, example in enumerate(new_examples):
        source = sample_docs[i]['metadata']['source']
        print("\n Source file:\n", source)    
        example['source'] = source
        print(f'Question:')
        pprint(f'{example.get('qa_pairs').get('query')}')
        print('Answer:')
        pprint(f'{example.get('qa_pairs').get('answer')}')

# Local model loading

In [79]:
if local_model:
    from llama_cpp import Llama
    # repo_model = "hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF"
    repo_model = "hugging-quants/Llama-3.2-3B-Instruct-Q4_K_M-GGUF"
    filename = "*q4_k_m.gguf"

    # Instantiate the Llama object (this loads the model into RAM)
    llm_cpp = Llama.from_pretrained(
        repo_id=repo_model,
        filename=filename,
        local_dir="models",
        n_ctx=10000,
        verbose=False
    )
    llm_manager = LLMManager(llm_instance = llm_cpp)
    print("\n\nLlama-cpp model loaded")

else:
    llm_manager = LLMManager(llm_instance = None)
    print("\n\nOpenAI model loaded")

llama_context: n_ctx_per_seq (10000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h9



Llama-cpp model loaded


# Two-Stage RAG

## Building the RAG pipeline

In [80]:
import warnings
import numpy as np

# config.settings.verbose = True

pdf_manager = PDFManager(data_root, config)
pdf_manager.load_pdfs()
pdf_manager.chunk_documents()
pdf_manager.create_vectorstore()


2025-06-20 16:33:53,734 INFO [backend.my_lib.pdf_manager] Total document pages loaded: 134 from ../data/sample_pdfs/
2025-06-20 16:33:53,741 INFO [backend.my_lib.pdf_manager] Documents split into 758 small and 215 large chunks.
2025-06-20 16:33:53,743 INFO [sentence_transformers.SentenceTransformer] Use pytorch device_name: mps
2025-06-20 16:33:53,743 INFO [sentence_transformers.SentenceTransformer] Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L12-v2
2025-06-20 16:33:55,413 INFO [backend.my_lib.pdf_manager] Collection large_chunks is deleted
2025-06-20 16:33:56,089 INFO [backend.my_lib.pdf_manager] Vectorstore large_chunks created successfully with 215 documents.


In [81]:
# Create retrievers
retrievers = Retrievers(pdf_manager, config)
retrievers.setup_retrievers()

2025-06-20 16:33:57,408 INFO [sentence_transformers.cross_encoder.CrossEncoder] Use pytorch device: mps
2025-06-20 16:33:59,176 INFO [sentence_transformers.cross_encoder.CrossEncoder] Use pytorch device: mps
2025-06-20 16:33:59,589 INFO [backend.my_lib.retrievers] Retrievers created successfully.


In [82]:
# create qa chains
qa_chains = QAchains(retrievers, config, llm_manager)

def two_stage_rag(question):
    qa_chains.shorten_question(question)
    qa_chains.retrieve_context()
    answer = qa_chains.generate_answer()
    return answer

# Inspection of the two-stage RAG with a sample question

In [83]:
if extract_sample_QA:
    question = new_examples[0]['qa_pairs']['query']
pprint(question)
# Question shortening
qa_chains.shorten_question(question)
shortened_question = qa_chains.shortened_question
print(shortened_question)

("What are the expectations for the Federal Reserve's interest rate cuts "
 'according to David Sekera, and how do these expectations relate to the '
 'upcoming Fed meetings and inflation data?')
'\n        You are an expert financial advisor tasked with shortening the original question. \n        Your role is to reformulate the original question to short phrases with essential keywords.\n        Mostly focus on company names, consultant or advisor names.\n        The answer does not need to be complete sentense.\n        Do not convert words to abbreviations.\n\n        Original Question: "What are the expectations for the Federal Reserve\'s interest rate cuts according to David Sekera, and how do these expectations relate to the upcoming Fed meetings and inflation data?"\n\n        Reformulated phrases: \n        '


2025-06-20 16:34:01,182 INFO [backend.my_lib.qa_chains] The shortened question:
 - David Sekera
         - Federal Reserve
         - Interest rate cuts
         - Fed meetings
         - Inflation data


' - David Sekera\n         - Federal Reserve\n         - Interest rate cuts\n         - Fed meetings\n         - Inflation data'
'- David Sekera\n         - Federal Reserve\n         - Interest rate cuts\n         - Fed meetings\n         - Inflation data'
- David Sekera
         - Federal Reserve
         - Interest rate cuts
         - Fed meetings
         - Inflation data


In [91]:
# Test the retrievers
qa_chains.retrieve_context()
top_score_docs = qa_chains.top_score_docs
i_chunk = random.randint(0, len(top_score_docs)-1)
print('Number of top score chunks retrieved:',len(top_score_docs))
print('\n====== Content of a retrieved top score chunk:\n',top_score_docs[i_chunk].page_content)
print(f'\n====== The chunk metadata:\n {top_score_docs[i_chunk].metadata}')  

Batches: 100%|██████████| 7/7 [00:00<00:00, 34.10it/s]
2025-06-20 16:42:40,336 INFO [backend.my_lib.retrievers] 
200 small chunks retrieved
2025-06-20 16:42:40,336 INFO [backend.my_lib.retrievers] 
 ==== Samples of retrieved small chunks ==== 

2025-06-20 16:42:40,337 INFO [backend.my_lib.retrievers] Name: 5 Hot Stocks to Sell Before They Report Earnings _ Morningstar.pdf, Page: 17, Score: -4.4653472900390625, Content: What’s Happening in 
2025-06-20 16:42:40,337 INFO [backend.my_lib.retrievers] Name: 4 Risky Stocks to Sell and 4 Picks to Buy Instead _ Morningstar.pdf, Page: 2, Score: -9.496253967285156, Content: radars, some new Mor
2025-06-20 16:42:40,337 INFO [backend.my_lib.retrievers] Name: 5 Undervalued Stocks to Buy During Q4 _ Morningstar.pdf, Page: 17, Score: -11.258347511291504, Content: More in Markets
Mark
2025-06-20 16:42:40,338 INFO [backend.my_lib.qa_chains] 200 small chunks were retrieved
2025-06-20 16:42:40,338 INFO [backend.my_lib.retrievers] 
 ==== Sorted DRS documen

Number of top score chunks retrieved: 20

 More in Markets
Markets Brief: Is Your Portfolio an SUV or a Race Car?
Plus: Financials lead, while energy stocks stumble.
Dan Kemp• Oct 21, 2024
What’s Happening in the Markets This Week
The Federal Reserve Beige Book, home sales, and earnings from Coca-Cola and GE are on
the docket.
Frank Lee• Oct 19, 2024
Smart Investor: These Stocks Are Leading the Rally, Bank
ETFs Jump, and Top Large-Value ETFs
We wrap up our coverage of the markets and the week.
Tom Lauricella• Oct 19, 2024
About the Authors
David Sekera, CFA
Strategist
Dave Sekera, CFA, is a strategist, markets and economies, for Morningstar*. He provides comprehensive
valuation analysis of the US stock market based on the intrinsic valuations generated by our equity resear…
Share
View All
View All Authors
More from Author
4 Risky Stocks to Sell
and 4 Picks to Buy
Instead
David Sekera, CFA• Oct 14,
2024
US Stock Market
Outlook Q4 2024: Will
the Great Rotation…
David Sekera, CFA• Oct 14,

: 

In [73]:
qa_chains.generate_answer()
answer = qa_chains.response
pprint(answer)

("According to David Sekera, Morningstar's US economics team expects a "
 '25-basis-point cut in the federal-funds rate for each of the next two '
 'meetings. They are still looking for the federal-funds rate to get down to '
 '3.00% to 3.25% range by the end of next year. However, if employment remains '
 'hot, there is a chance that the Fed would not cut the federal-funds rate all '
 'the way down to their expectations. The team is looking for a combination of '
 'moderating inflation and slowing economic growth to provide the backdrop for '
 'the Fed to begin cutting interest rates at the September meeting. They '
 'expect a 0% chance of a rate cut at the September meeting, but this is '
 'expected to change as the September meeting approaches. The team is still '
 'looking for a 25-basis-point cut in the federal-funds rate for each of the '
 'next two meetings, and they expect the federal-funds rate to get down to '
 '3.00% to 3.25% range by the end of next year.')


# Two-Stage RAG evaluation using generated QA pairs

In [21]:
predictions = list()
examples = list()
for example, idoc in zip(new_examples, index):
    print(f'Document {idoc}')
    question = example['qa_pairs']['query']
    pprint(question)
    answer = two_stage_rag(question)
    prediction = {'query': example['qa_pairs']['query'], 'answer': example['qa_pairs']['answer']}
    prediction["result"] = answer
    predictions.append(prediction)
    examples.append(
            example.get('qa_pairs')
        )

eval_chain = QAEvalChain.from_llm(llm)
graded_outputs_2RAG = eval_chain.evaluate(examples, predictions, prediction_key="result")
print('The result of RAG evaluation for the given example questions: ')

accuracy_2RAG = len([result for result in graded_outputs_2RAG if result['results'] == 'CORRECT'])/len(graded_outputs_2RAG)
print(f'\n Accuracy of the RAG pipeline: {accuracy_2RAG} \n')

graded_outputs_2RAG

Document 1
('What are the key economic indicators and earnings reports that David Sekera '
 'is focusing on in the week of July 15, 2024, and what implications do they '
 'have for consumer spending and stock valuations?')
- Key economic indicators  
- Earnings reports  
- David Sekera  
- Week of July 15, 2024  
- Implications for consumer spending  
- Stock valuations  
The shortened question:
 - Key economic indicators  
- Earnings reports  
- David Sekera  
- Week of July 15, 2024  
- Implications for consumer spending  
- Stock valuations  
The small chunks were retrieved
The DRS was calculated for relevant PDF documents

 ==== Ranked retrieved large chunks ==== 

The large chunks were retrieved
The aggregated scores were calculated for all retrieved chunks
The top score chunks were concatenated
Document 4
("What are the key factors influencing Morningstar's stock recommendations for "
 'Q4 2024, particularly regarding the economic outlook, interest rates, and '
 'specific stock p

[{'results': 'GRADE: INCORRECT'}, {'results': 'INCORRECT'}]

# Comparison to Hybrid RAG

In [24]:
hybrid_retrieval = Hybrid_Retrieval(pdf_manager, retrievers, config)
hybrid_RAG_QA = QAchains(retrievers, config)
def hybrid_rag(question,top_k_BM25, top_k_semantic, top_k_final, rrf_k = 60, hybrid = True):
    top_score_docs = hybrid_retrieval.hybrid_retriever(question, top_k_BM25, top_k_semantic, top_k_final, rrf_k, hybrid)
    hybrid_RAG_QA.top_score_docs = top_score_docs
    hybrid_RAG_QA.question = question
    answer = hybrid_RAG_QA.generate_answer()
    return answer
# answer_hybrid = hybrid_rag(question, 200, 50, 10)    
# pprint(answer_hybrid)

## QA pair evaluation loop

In [25]:
predictions_hybrid = list()
examples = list()
for example, idoc in zip(new_examples, index):
    question = example['qa_pairs']['query']
    print(f'Question:')
    pprint(question)
    answer = hybrid_rag(question, top_k_semantic, top_k_semantic, top_k_final)
    print(f'Answer:')
    pprint(answer)
    prediction = {'query': example['qa_pairs']['query'], 'answer': example['qa_pairs']['answer']}
    prediction["result"] = answer
    predictions_hybrid.append(prediction)
    examples.append(
            example.get('qa_pairs')
        )

eval_chain = QAEvalChain.from_llm(llm)
graded_outputs_hybrid = eval_chain.evaluate(examples, predictions_hybrid, prediction_key="result")
print('The result of RAG evaluation for the given example questions: ')

accuracy_hybrid = len([result for result in graded_outputs_hybrid if result['results'] == 'CORRECT'])/len(graded_outputs_hybrid)
print(f'\n Accuracy of the RAG pipeline: {accuracy_hybrid} \n')

graded_outputs_hybrid

Question:
('What are the key economic indicators and earnings reports that David Sekera '
 'is focusing on in the week of July 15, 2024, and what implications do they '
 'have for consumer spending and stock valuations?')
=== Hybrid Retrieval with BM25 and semantic search ===
Answer:
('In the week of July 15, 2024, David Sekera is focusing on several key '
 'economic indicators and earnings reports, particularly retail sales and the '
 'beginning of earnings season. \n'
 '\n'
 '1. **Retail Sales**: Sekera is particularly interested in the retail sales '
 'data, as it will provide insights into consumer spending patterns. He notes '
 'that low-income consumers have been under pressure from inflation for over a '
 'year, and there are emerging signs that middle-income households are also '
 'beginning to pull back on spending, especially in discretionary and '
 'indulgent categories. This trend could indicate a broader slowdown in '
 'consumer spending, which is critical for evaluating t

[{'results': 'GRADE: CORRECT'}, {'results': 'INCORRECT'}]

# Comparison to Normal RAG

In [26]:
predictions_rag = list()
examples = list()
for example, idoc in zip(new_examples, index):
    question = example['qa_pairs']['query']
    print(f'Question:')
    pprint(question)
    answer = hybrid_rag(question, top_k_semantic, top_k_semantic, top_k_final, hybrid = False)
    print(f'Answer:')
    pprint(answer)
    prediction = {'query': example['qa_pairs']['query'], 'answer': example['qa_pairs']['answer']}
    prediction["result"] = answer
    predictions_rag.append(prediction)
    examples.append(
            example.get('qa_pairs')
        )

eval_chain = QAEvalChain.from_llm(llm)
graded_outputs_rag = eval_chain.evaluate(examples, predictions_rag, prediction_key="result")
print('The result of RAG evaluation for the given example questions: ')

accuracy_rag = len([result for result in graded_outputs_rag if result['results'] == 'CORRECT'])/len(graded_outputs_rag)
print(f'\n Accuracy of the RAG pipeline: {accuracy_rag} \n')

graded_outputs_rag

Question:
('What are the key economic indicators and earnings reports that David Sekera '
 'is focusing on in the week of July 15, 2024, and what implications do they '
 'have for consumer spending and stock valuations?')
=== Semantic search retrieval only === 
Answer:
('In the week of July 15, 2024, David Sekera is focusing on several key '
 'economic indicators and earnings reports, particularly the Consumer Price '
 'Index (CPI) and Producer Price Index (PPI) numbers. The CPI and core CPI '
 'came in well below expectations, which is a positive sign for inflation '
 'control. Conversely, the PPI was slightly higher than expected, but the '
 'underlying components were lower than anticipated, suggesting that inflation '
 'pressures may not be as severe as previously thought.\n'
 '\n'
 'These indicators have significant implications for consumer spending. Sekera '
 'notes that low-income consumers have been under pressure from inflation for '
 'over a year, and there are signs that mi

[{'results': 'GRADE: INCORRECT'}, {'results': 'GRADE: INCORRECT'}]

# Comparison of RAG pipelines

In [27]:
# Create a dataframe
data = {
    'Method': ['RAG', 'Hybrid RAG', 'Two-Stage RAG'],
    'Large Scale Accuracy (%)': [accuracy_rag_large, accuracy_hybrid_large, accuracy_2RAG_large],
    'Small Scale Accuracy (%)': [accuracy_rag_small, accuracy_hybrid_small, accuracy_2RAG_small]
}
df = pd.DataFrame(data)
df_melted = df.melt(id_vars='Method', var_name='Experiment Scale', value_name='Accuracy (%)')
df_melted['Accuracy (%)'] = (df_melted['Accuracy (%)']*100).round(2)
df_melted.head()


# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create the bar plot
plt.figure(figsize=(10, 6))
bar_plot = sns.barplot(
    x='Method',
    y='Accuracy (%)',
    hue='Experiment Scale',
    data=df_melted,
    # palette='viridis'
)

# Add percentage labels on top of the bars
for p in bar_plot.patches:
    height = p.get_height()
    bar_plot.annotate(f'{height}%',
                      (p.get_x() + p.get_width() / 2., height),
                      ha='center', va='bottom',
                      fontsize=11)

# Set labels and title
plt.ylim(0, 100)  # Adjusted to fit percentage scale
plt.title('Accuracy Comparison of RAG Methods Across Experiment Scales', fontsize=16)
plt.ylabel('Accuracy (%)')
plt.xlabel('Method')
plt.legend(title='Experiment Scale')

# Display the plot
plt.show()

NameError: name 'accuracy_rag_large' is not defined

In [1]:
assert False

AssertionError: 

# Debugging

In [49]:

# lama-cpp-python test
# 2. Load the model and generate a response
from llama_cpp import Llama

# repo_model = "hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF"
repo_model = "hugging-quants/Llama-3.2-3B-Instruct-Q4_K_M-GGUF"
filename = "*q4_k_m.gguf"

# Instantiate the Llama object (this loads the model into RAM)
llm_cpp = Llama.from_pretrained(
    repo_id=repo_model,
    filename=filename,
    local_dir="models",
    n_ctx=10000,
    verbose=False
)

# Test the model
shortening_prompt = f"""
You are an expert financial advisor tasked with shortening the original question. 
Your role is to reformulate the original question to short phrases with essential keywords.
Mostly focus on company names, consultant or advisor names.
The answer does not need to be complete sentense.
Do not convert words to abbreviations.

Original Question: "{question}"

Reformulated phrases: 
"""
# sample_question = "Write a short, friendly greeting from a magical bookstore owner."
response = llm_cpp.create_completion(
    prompt=shortening_prompt,
    max_tokens=128,
    temperature=0,
    # top_p=0.95,
    stop=["\n\n"]
)

print(response["choices"][0]["text"].strip())

llama_context: n_ctx_per_seq (10000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h9

- David Sekera
- Federal Reserve
- Interest rate cuts
- Fed meetings
- Inflation data


In [52]:
llm_cpp.reset


AttributeError: 'Llama' object has no attribute 'stop_tokens'

In [53]:
from backend.my_lib.LLMManager import LLMManager

llm_manager = LLMManager(llm_instance = llm_cpp)

system_prompt = """
You are an expert financial advisor tasked with shortening the original question. 
Your role is to reformulate the original question to short phrases with essential keywords.
Mostly focus on company names, consultant or advisor names.
The answer does not need to be complete sentense.
Do not convert words to abbreviations.

Original Question: "{original_question}"

Reformulated phrases: 
"""
invoke_kwargs = {"original_question": question}
response = llm_manager.invoke(system_prompt, invoke_kwargs)
print(response)
# llm_manager.invoke(shortening_prompt)

- Morningstar
- Federal Reserve
- Interest rate decisions
- 2024
- 2025


In [12]:
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate.from_template(template)
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
# callback_manager = CallbackManager()

model_path="models/llama-3.2-3b-instruct-q4_k_m.gguf",
# model_path="models/llama-3.2-1b-instruct-q4_k_m.gguf",

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="models/llama-3.2-3b-instruct-q4_k_m.gguf",
    temperature=0,
    max_tokens=128,
    # top_p=1,
    # callback_manager=callback_manager,
    verbose=False,  # Verbose is required to pass to the callback manager
    stop=["\n\n"]
)

question_test = """
Question: A rap battle between Stephen Colbert and John Oliver
"""
response = llm.invoke(shortening_prompt)
print(response)

llama_context: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
llama_context: n_ctx_per_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (

- Morningstar's view
- Federal Reserve interest rate decisions
- 2024 and 2025 timeframe


In [72]:
from backend.my_lib.qa_chains import QAchains

# llm_manager = LLMManager(llm_instance = None)
llm_manager = LLMManager(llm_instance = llm_cpp)

qa_chains = QAchains(retrievers, config, llm_manager = llm_manager)

qa_chains.shorten_question(question)
shortened_question = qa_chains.shortened_question
print(shortened_question)
qa_chains.retrieve_context()
answer = qa_chains.generate_answer()
print(answer)
# return answer

2025-06-20 01:26:49,318 INFO [backend.my_lib.qa_chains] The shortened question:
 - Morningstar
         - Federal Reserve
         - Interest Rate Decisions
         - 2024
         - 2025


- Morningstar
         - Federal Reserve
         - Interest Rate Decisions
         - 2024
         - 2025


Batches: 100%|██████████| 7/7 [00:00<00:00, 23.00it/s]
2025-06-20 01:26:49,658 INFO [backend.my_lib.qa_chains] The small chunks were retrieved
2025-06-20 01:26:49,658 INFO [backend.my_lib.qa_chains] The DRS was calculated for relevant PDF documents
Batches: 100%|██████████| 2/2 [00:00<00:00,  6.14it/s]
2025-06-20 01:26:50,050 INFO [backend.my_lib.retrievers] 
 ==== Ranked retrieved large chunks ==== 

2025-06-20 01:26:50,051 INFO [backend.my_lib.qa_chains] The large chunks were retrieved
2025-06-20 01:26:50,051 INFO [backend.my_lib.qa_chains] The aggregated scores were calculated for all retrieved chunks
2025-06-20 01:26:50,052 INFO [backend.my_lib.qa_chains] The top score chunks were concatenated


According to the document, Morningstar's US economics team still expects a 25-basis-point cut in the federal-funds rate for each of the next two meetings. However, they also project that long-term interest rates will be on a downward path in the second half of 2024 and in 2025, with the 10-year US Treasury yield averaging 4.25% in 2024 and averaging 4.00% in 2025. Furthermore, they still expect the federal-funds rate to get down to that 3.00% to 3.25% range by the end of next year.


In [73]:
from pprint import pprint
pprint(answer)

("According to the document, Morningstar's US economics team still expects a "
 '25-basis-point cut in the federal-funds rate for each of the next two '
 'meetings. However, they also project that long-term interest rates will be '
 'on a downward path in the second half of 2024 and in 2025, with the 10-year '
 'US Treasury yield averaging 4.25% in 2024 and averaging 4.00% in 2025. '
 'Furthermore, they still expect the federal-funds rate to get down to that '
 '3.00% to 3.25% range by the end of next year.')
