# Personal Financial Agent: Evaluation & Synthetic Data Generation

This notebook demonstrates the full evaluation lifecycle for our Romanian Personal Financial Agent,
directly adapted from **AIE9 Sessions 9-10**.

## Structure
1. **Synthetic Data Generation**: Generate test questions from financial documents using RAGAS
2. **RAG Evaluation (Baseline)**: Evaluate the retrieval pipeline without reranking
3. **RAG Evaluation (Improved)**: Add Cohere reranking and compare scores
4. **Agent Evaluation**: Test tool routing, topic adherence, MiFID II compliance

In [1]:
# Setup & Imports
import nest_asyncio
nest_asyncio.apply()  # Allow nested event loops in Jupyter

import os
import sys
import asyncio
import json
import shutil
import concurrent.futures
import pandas as pd
from IPython.display import display, HTML, Markdown

# Add app to path (works in Docker and locally)
sys.path.insert(0, '/app' if os.path.exists('/app/app') else os.path.abspath('..'))

from app.config import settings
from app.services.rag_service import rag_service
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

print(f'OpenAI API Key: {settings.openai_api_key[:8]}...')
print(f'Qdrant: {settings.qdrant_host}:{settings.qdrant_port}')
print(f'Collection: {settings.qdrant_collection}')


OpenAI API Key: sk-proj-...
Qdrant: qdrant:6333
Collection: financial_docs_ro


## 1. Synthetic Data Generation (SDG)

We use the RAGAS `TestsetGenerator` to create synthetic question-answer pairs from our Romanian
financial documents, following the AIE9 Session 9 pattern.

In this run the generator produced two types of questions, identified by the `synthesizer_name` column of the test set:
- **Single-hop specific**: Questions answerable from a single chunk
- **Multi-hop specific**: Questions that combine information from multiple chunks (their reference contexts are tagged `<1-hop>`, `<2-hop>`, and so on)

In [2]:
# Load documents for SDG
from langchain_community.document_loaders import PyMuPDFLoader

# Target the two key financial-product PDFs explicitly (reproducible)
pdf_files = [
    '/app/documents/brosura_fidelis.pdf',
    '/app/documents/tezaur_ghid_2023.pdf',
]

documents = []
for pdf in pdf_files:
    loader = PyMuPDFLoader(pdf)
    documents.extend(loader.load())

print(f'Loaded {len(documents)} pages from {len(pdf_files)} PDF files')
for doc in documents[:3]:
    print(f'  - {doc.metadata.get("source", "unknown")}: {doc.page_content[:100]}...')


Loaded 13 pages from 2 PDF files
  - /app/documents/brosura_fidelis.pdf: Investești în viitor, atât de ușor.
GHIDUL FIDELIS
PENTRU INVESTITORI
FIDELIS
...
  - /app/documents/brosura_fidelis.pdf: Titlurile de stat sunt instrumente fnanciare utile atât statului, cât și 
populației.
Denumite și ob...
  - /app/documents/brosura_fidelis.pdf: INVESTEȘTI
LA SIGUR
CU FIDELIS.
...


In [3]:
# Generate synthetic test set
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Setup LLM and embeddings for SDG
generator_llm = LangchainLLMWrapper(ChatOpenAI(model='gpt-4o-mini', api_key=settings.openai_api_key))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(
    model=settings.embedding_model,
    api_key=settings.openai_api_key
))

# Create test set generator
generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=generator_embeddings,
)

# Run SDG in a separate thread to avoid Jupyter async deadlocks
def _run_sdg():
    # Give the worker thread its own event loop for RAGAS' internal async calls
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        return generator.generate_with_langchain_docs(
            documents=documents,
            testset_size=10,
        )
    finally:
        # Close the loop even if generation raises
        loop.close()

with concurrent.futures.ThreadPoolExecutor() as pool:
    testset = pool.submit(_run_sdg).result()

test_df = testset.to_pandas()
print(f'Generated {len(test_df)} synthetic test questions')
print(f'Columns: {list(test_df.columns)}')
display(test_df.head(10))


Applying SummaryExtractor:   0%|          | 0/9 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/13 [00:00<?, ?it/s]

Node 0b1de1cb-ffbc-48c4-8c63-6e961a8ea529 does not have a summary. Skipping filtering.
Node 3fbbf1a4-c0a0-4453-a868-083f4c5302ca does not have a summary. Skipping filtering.
Node c9b56e7a-c52c-4b68-8826-73ac87f2637d does not have a summary. Skipping filtering.
Node fb240fef-a728-496b-bdcb-6fac83adff66 does not have a summary. Skipping filtering.


Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/35 [00:00<?, ?it/s]

Applying OverlapScoreBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

Generated 10 synthetic test questions
Columns: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name']


| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Ce este ghidul FIDELIS pentru INVESTITORI? | `[Investești în viitor, atât de ușor.\nGHIDUL F...` | Ghidul FIDELIS pentru INVESTITORI este un mate... | single_hop_specifc_query_synthesizer |
| 1 | What are the key benefits of investing in FIDE... | `[Titlurile de stat sunt instrumente fnanciare ...` | FIDELIS government bonds represent low-risk in... | single_hop_specifc_query_synthesizer |
| 2 | cum pot cumpara titluri de stat FIDELIS? | `[CUM INTRI ÎN POSESIA\nTITLURILOR DE STAT FIDE...` | Titlurile de stat FIDELIS pot fi cumpărate de ... | single_hop_specifc_query_synthesizer |
| 3 | What is the significance of the Sânge de Inves... | `[Prin programul FIDELIS, Ministerul Finanțelor...` | The Sânge de Investitor campaign, part of the ... | single_hop_specifc_query_synthesizer |
| 4 | What are the implications of trading FIDELIS g... | `[ÎNTREBĂRI\n& RĂSPUNSURI\nImplică vreun cost c...` | FIDELIS government bonds are listed on the Bur... | single_hop_specifc_query_synthesizer |
| 5 | What are the characteristics and benefits of i... | `[<1-hop>\n\nTitlurile de stat sunt instrumente...` | Titlurile de stat TEZAUR, as outlined in the G... | multi_hop_specific_query_synthesizer |
| 6 | Cum pot investi cetățenii români în Titlurile ... | `[<1-hop>\n\nCum pot fi cumpărate Titlurile de ...` | Cetățenii români care au împlinit vârsta de 18... | multi_hop_specific_query_synthesizer |
| 7 | Cum pot fi cumpărate Titlurile de stat TEZAUR ... | `[<1-hop>\n\nPas 2\nAlimentare \nCont Subscrier...` | Pentru a cumpăra Titlurile de stat TEZAUR, tre... | multi_hop_specific_query_synthesizer |
| 8 | Cum pot fi cumpărate Titlurile de stat TEZAUR ... | `[<1-hop>\n\nCum pot fi cumpărate Titlurile de ...` | Titlurile de stat TEZAUR pot fi cumpărate prin... | multi_hop_specific_query_synthesizer |
| 9 | Cum pot fi cumpărate titlurile de stat TEZAUR ... | `[<1-hop>\n\nTitlurile de stat sunt instrumente...` | Titlurile de stat TEZAUR pot fi cumpărate prin... | multi_hop_specific_query_synthesizer |


In [4]:
# If SDG fails (e.g., not enough documents), use manually curated test questions
# This is a fallback that ensures the evaluation can always run

MANUAL_TEST_QUESTIONS = [
    {
        'user_input': 'Ce sunt titlurile de stat TEZAUR?',
        'reference': 'Titlurile TEZAUR sunt instrumente financiare emise de Ministerul Finantelor din Romania, destinate exclusiv persoanelor fizice rezidente. Au maturitati de 1, 3 sau 5 ani, dobanda fixa, si sunt 100% garantate de statul roman. Sunt scutite de impozit pe venit.',
    },
    {
        'user_input': 'Care sunt diferentele intre TEZAUR si FIDELIS?',
        'reference': 'TEZAUR nu se tranzactioneaza pe bursa si este scutit de impozit. FIDELIS este listat la BVB, poate fi tranzactionat pe piata secundara, si este impozitat cu 10% din 2023.',
    },
    {
        'user_input': 'Ce avantaje are TEZAUR fata de depozitele bancare?',
        'reference': 'Nu exista risc de pierdere a capitalului investit. Dobanzile sunt mai mari decat la depozitele bancare. Scutire de impozit pe venit. Accesibile de la 1 RON.',
    },
    {
        'user_input': 'Cum se pot achizitiona titlurile FIDELIS?',
        'reference': 'FIDELIS sunt listate la BVB si pot fi cumparate sau vandute pe piata secundara. Dobanda fixa, platita semestrial sub forma de cupon.',
    },
    {
        'user_input': 'Ce maturitati au titlurile de stat romanesti?',
        'reference': 'Titlurile TEZAUR si FIDELIS au maturitati de 1 an, 3 ani sau 5 ani. FIDELIS poate fi denominat in LEI sau EURO.',
    },
]

# Use SDG results if available, otherwise fall back to manual
try:
    if len(test_df) >= 5:
        # Auto-detect column names (RAGAS 0.2.x uses user_input/reference)
        q_col = 'user_input' if 'user_input' in test_df.columns else 'question'
        gt_col = 'reference' if 'reference' in test_df.columns else 'ground_truth'
        eval_questions = test_df[q_col].tolist()
        eval_ground_truths = test_df[gt_col].tolist()
        print(f'Using {len(eval_questions)} SDG-generated questions')
    else:
        raise ValueError('Not enough SDG questions')
except Exception:
    eval_questions = [q['user_input'] for q in MANUAL_TEST_QUESTIONS]
    eval_ground_truths = [q['reference'] for q in MANUAL_TEST_QUESTIONS]
    print(f'Using {len(eval_questions)} manually curated questions')

for i, q in enumerate(eval_questions, 1):
    print(f'{i}. {q}')


Using 10 SDG-generated questions
1. Ce este ghidul FIDELIS pentru INVESTITORI?
2. What are the key benefits of investing in FIDELIS government bonds for individual investors in Romania?
3. cum pot cumpara titluri de stat FIDELIS?
4. What is the significance of the Sânge de Investitor campaign in relation to FIDELIS government bonds?
5. What are the implications of trading FIDELIS government bonds on the Bursa de Valori București?
6. What are the characteristics and benefits of investing in titlurile de stat TEZAUR compared to other state securities?
7. Cum pot investi cetățenii români în Titlurile de stat TEZAUR și care sunt pașii necesari pentru subscriere?
8. Cum pot fi cumpărate Titlurile de stat TEZAUR conform ghidului investitorului din 2023?
9. Cum pot fi cumpărate Titlurile de stat TEZAUR conform ghidului investitorului 2023?
10. Cum pot fi cumpărate titlurile de stat TEZAUR și care sunt avantajele titlurilor de stat FIDELIS?
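
Questions 8 and 9 in the list above differ by a single word, so they effectively test the same retrieval twice. A minimal sketch of a near-duplicate check follows, using only the standard library. The function name `near_duplicates` and the 0.9 threshold are illustrative assumptions, not part of RAGAS or of this project.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(questions, threshold=0.9):
    """Return (i, j, ratio) for question pairs whose character similarity meets the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(questions), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 3)))
    return pairs


# Three of the generated questions, copied from the list above
sample_questions = [
    "Ce este ghidul FIDELIS pentru INVESTITORI?",
    "Cum pot fi cumpărate Titlurile de stat TEZAUR conform ghidului investitorului din 2023?",
    "Cum pot fi cumpărate Titlurile de stat TEZAUR conform ghidului investitorului 2023?",
]
duplicate_pairs = near_duplicates(sample_questions)
print(duplicate_pairs)
```

Flagged pairs could be reviewed by hand before evaluation; whether to drop one of them depends on how much weight a repeated question should carry in the averaged scores.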


---

## 1.5 RAG Pipeline Walkthrough

Before evaluating, let's **demonstrate each retrieval technique** used in our pipeline.
This follows the same pattern as AIE9 Session 11 (Advanced Retrieval with LangChain).

Our pipeline combines **four** retrieval strategies:

| # | Technique | Purpose |
|---|---|---|
| 1 | **ParentDocumentRetriever** | Small-to-big: search child chunks, return parent context |
| 2 | **BM25Retriever** | Sparse keyword matching (exact terms like "TEZAUR", "BVB") |
| 3 | **EnsembleRetriever** | Fuses BM25 (30%) + Vector (70%) via Reciprocal Rank Fusion |
| 4 | **CohereRerank** | Cross-encoder reranking to filter top-N most relevant chunks |


### ParentDocumentRetriever (Small-to-Big)

The Parent Document Retriever is a "small-to-big" strategy that works on a simple principle:

1. We split the full document into large **parent** chunks (2000 chars).
2. Each parent chunk is further split into smaller **child** chunks (400 chars).
3. The child chunks are embedded and stored in **Qdrant** for similarity search.
4. The parent chunks are stored in an **in-memory DocStore**.
5. When we query, we match against child chunks but **return the parent**, giving the LLM more surrounding context.

This is critical for our dense regulatory PDFs where a single sentence's meaning depends on surrounding paragraphs.
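
The five steps above can be sketched in plain Python. This is a simplified illustration, not the project's implementation: it uses fixed-width character windows instead of `RecursiveCharacterTextSplitter`, and keyword overlap instead of embeddings in Qdrant. The helper names (`split_fixed`, `build_index`, `retrieve_parent`) are hypothetical; the chunk sizes mirror the configured 2000/200 and 400/50 values.

```python
def split_fixed(text, size, overlap):
    """Split text into fixed-width windows that overlap by `overlap` characters."""
    step = size - overlap
    stop = max(len(text) - overlap, 1)
    return [text[start:start + size] for start in range(0, stop, step)]


def build_index(text, parent_size=2000, parent_overlap=200, child_size=400, child_overlap=50):
    """Return parent chunks plus (child_text, parent_id) pairs for searching."""
    parents = split_fixed(text, parent_size, parent_overlap)
    children = []
    for parent_id, parent in enumerate(parents):
        for child in split_fixed(parent, child_size, child_overlap):
            children.append((child, parent_id))
    return parents, children


def retrieve_parent(query, parents, children):
    """Match the query against child chunks, then return the parent of the best child."""
    terms = query.lower().split()

    def score(item):
        child_text = item[0].lower()
        # Stand-in for vector similarity: count query terms present in the child
        return sum(term in child_text for term in terms)

    _best_child, parent_id = max(children, key=score)
    return parents[parent_id]


# Synthetic document: one informative sentence surrounded by filler
text = "filler " * 400 + "TEZAUR este scutit de impozit. " + "other " * 400
parents, children = build_index(text)
context = retrieve_parent("TEZAUR impozit", parents, children)
print(len(children), "children,", len(parents), "parents, returned", len(context), "chars")
```

The search touches only 400-character children, but the caller receives the 2000-character parent, which is the trade-off the strategy is built around: precise matching with broader context for generation.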


In [5]:
# Show ParentDocumentRetriever configuration
print('=== ParentDocumentRetriever Configuration ===')
print(f'Parent chunk size:   {settings.rag_parent_chunk_size} chars (returned to LLM)')
print(f'Parent overlap:      {settings.rag_parent_chunk_overlap} chars')
print(f'Child chunk size:    {settings.rag_child_chunk_size} chars (used for embedding search)')
print(f'Child overlap:       {settings.rag_child_chunk_overlap} chars')
print(f'Embedding model:     {settings.embedding_model}')
print(f'Vector DB:           Qdrant @ {settings.qdrant_host}:{settings.qdrant_port}')
print(f'Collection:          {settings.qdrant_collection}')

# Show the splitters from rag_service
print(f'\nParent splitter:     {rag_service.parent_splitter}')
print(f'Child splitter:      {rag_service.child_splitter}')


=== ParentDocumentRetriever Configuration ===
Parent chunk size:   2000 chars (returned to LLM)
Parent overlap:      200 chars
Child chunk size:    400 chars (used for embedding search)
Child overlap:       50 chars
Embedding model:     text-embedding-3-small
Vector DB:           Qdrant @ qdrant:6333
Collection:          financial_docs_ro

Parent splitter:     <langchain_text_splitters.character.RecursiveCharacterTextSplitter object at 0xffff39c4e6f0>
Child splitter:      <langchain_text_splitters.character.RecursiveCharacterTextSplitter object at 0xffff39c4e690>


### BM25 Retriever (Sparse Keyword Matching)

[BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on 
[Bag-of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model), a sparse representation
that compares documents based on shared terms and their frequencies.

**Why BM25 matters for Romanian financial documents:** Embedding models can miss exact
acronyms like "BVB", "ASF", "FIDELIS", or "MiFID II". BM25 catches these exact-match
queries that dense retrieval might overlook.
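
To make the exact-match behaviour concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is an illustration under stated assumptions (whitespace tokenization, `k1=1.5`, `b=0.75`, a smoothed IDF), not the implementation behind `BM25Retriever`. The three sample sentences paraphrase facts from the manual test questions in this notebook.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization; real pipelines may use something richer."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "fidelis este listat la bvb si poate fi tranzactionat pe piata secundara",
    "tezaur nu se tranzactioneaza pe bursa si este scutit de impozit",
    "dobanda fixa platita semestrial sub forma de cupon",
]
scores = bm25_scores("fidelis bvb", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Documents that contain none of the query terms score exactly zero, which is why an acronym such as "BVB" acts as a hard filter here, whereas an embedding model only sees it as one signal among many.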


In [6]:
# Ingest only the two target PDFs for evaluation
eval_docs_dir = '/tmp/eval_docs'
os.makedirs(eval_docs_dir, exist_ok=True)
for pdf in ['brosura_fidelis.pdf', 'tezaur_ghid_2023.pdf']:
    src = f'/app/documents/{pdf}'
    dst = f'{eval_docs_dir}/{pdf}'
    if not os.path.exists(dst):
        shutil.copy2(src, dst)

def _run_ingest():
    loop = asyncio.new_event_loop()
    result = loop.run_until_complete(rag_service.ingest_documents(eval_docs_dir))
    loop.close()
    return result

with concurrent.futures.ThreadPoolExecutor() as pool:
    ingest_result = pool.submit(_run_ingest).result()
print(f'Ingestion: {ingest_result}')

# Copy BM25/docstore pickles to default path so rag_service.query() can find them
for pkl in ['bm25_retriever.pkl', 'docstore.pkl']:
    pkl_src = f'{eval_docs_dir}/{pkl}'
    pkl_dst = f'/app/documents/{pkl}'
    if os.path.exists(pkl_src):
        shutil.copy2(pkl_src, pkl_dst)

# Demonstrate BM25 vs Vector retrieval
test_query = 'Ce este FIDELIS si cum se tranzactioneaza pe BVB?'
print(f'\nQuery: "{test_query}"\n')

# BM25 retrieval
if rag_service.bm25_retriever:
    bm25_docs = rag_service.bm25_retriever.invoke(test_query)[:3]
    print('=== BM25 Results (keyword matching) ===')
    for j, doc in enumerate(bm25_docs, 1):
        source = doc.metadata.get('source', 'unknown').split('/')[-1]
        print(f'  [{j}] {source} (p.{doc.metadata.get("page", "?")}): {doc.page_content[:120]}...')
else:
    print('BM25 not initialized')

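# Retrieval without reranking (labelled "Vector" in the output below).
# Note: this is the same rag_service.query(..., use_reranking=False) call that the
# ensemble demo uses, so the results may already reflect BM25 + vector fusion.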
def _run_vector_query():
    loop = asyncio.new_event_loop()
    result = loop.run_until_complete(rag_service.query(test_query, use_reranking=False))
    loop.close()
    return result

print('\n=== Vector Results (semantic similarity) ===')
with concurrent.futures.ThreadPoolExecutor() as pool:
    vector_docs = pool.submit(_run_vector_query).result()
for j, doc in enumerate(vector_docs[:3], 1):
    source = doc.metadata.get('source', 'unknown').split('/')[-1]
    print(f'  [{j}] {source} (p.{doc.metadata.get("page", "?")}): {doc.page_content[:120]}...')


Ingestion: {'documents_processed': 2, 'status': 'already_ingested'}

Query: "Ce este FIDELIS si cum se tranzactioneaza pe BVB?"

=== BM25 Results (keyword matching) ===
  [1] brosura_fidelis.pdf (p.7): Există vreun risc asociat titlurilor de stat FIDELIS, având în 
vedere tranzacționarea acestora la BVB?
Există anumiți f...
  [2] tezaur_ghid_2023.pdf (p.3): Cum pot fi cumpărate Titlurile de stat TEZAUR
Pas 2
Alimentare 
Cont Subscriere
Se transferă sumele de bani dorite în Co...
  [3] brosura_fidelis.pdf (p.2): Titlurile de stat sunt instrumente fnanciare utile atât statului, cât și 
populației.
Denumite și obligațiuni de stat, e...

=== Vector Results (semantic similarity) ===
  [1] brosura_fidelis.pdf (p.7): Există vreun risc asociat titlurilor de stat FIDELIS, având în 
vedere tranzacționarea acestora la BVB?
Există anumiți f...
  [2] brosura_fidelis.pdf (p.6): ÎNTREBĂRI
& RĂSPUNSURI
Implică vreun cost cumpărarea și deținerea titlurilor FIDELIS?
Cumpărar

### EnsembleRetriever (Hybrid Fusion)

The Ensemble Retriever combines 2 or more retrievers using 
[Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf).

Our configuration: **30% BM25 + 70% Vector**. This weights semantic understanding
higher but still benefits from exact keyword matching.
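
The fusion step can be sketched as weighted Reciprocal Rank Fusion: each retriever contributes `weight / (c + rank)` for every document it returns, and the sums decide the final order. The constant `c = 60` follows the RRF paper; whether the installed `EnsembleRetriever` uses exactly this constant is an assumption. The document IDs and rankings below are made up for illustration.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document IDs into a single ranked list."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["A", "B", "C"]    # hypothetical BM25 order
vector_ranking = ["B", "A", "D"]  # hypothetical vector order
fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.3, 0.7])
print(fused)
```

With these weights, document B (first in the vector list, second in BM25) edges out A, and D (vector-only) outranks C (BM25-only): the 70% weight lets the vector ranking dominate while BM25 still contributes.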


In [7]:
# Show the Ensemble Retriever configuration
print('=== EnsembleRetriever Configuration ===')
print(f'Retrievers:  BM25 + ParentDocumentRetriever (Vector)')
print(f'Weights:     [0.3, 0.7] (30% BM25, 70% Vector)')
print(f'Algorithm:   Reciprocal Rank Fusion')
print(f'Initial k:   {settings.rag_top_k} documents before reranking')

# Run ensemble retrieval
def _run_ensemble():
    loop = asyncio.new_event_loop()
    result = loop.run_until_complete(rag_service.query(test_query, use_reranking=False))
    loop.close()
    return result

print(f'\n=== Ensemble Results for: "{test_query}" ===')
with concurrent.futures.ThreadPoolExecutor() as pool:
    ensemble_docs = pool.submit(_run_ensemble).result()
for j, doc in enumerate(ensemble_docs[:5], 1):
    source = doc.metadata.get('source', 'unknown').split('/')[-1]
    print(f'  [{j}] {source} (p.{doc.metadata.get("page", "?")}): {doc.page_content[:120]}...')


=== EnsembleRetriever Configuration ===
Retrievers:  BM25 + ParentDocumentRetriever (Vector)
Weights:     [0.3, 0.7] (30% BM25, 70% Vector)
Algorithm:   Reciprocal Rank Fusion
Initial k:   10 documents before reranking

=== Ensemble Results for: "Ce este FIDELIS si cum se tranzactioneaza pe BVB?" ===
  [1] brosura_fidelis.pdf (p.7): Există vreun risc asociat titlurilor de stat FIDELIS, având în 
vedere tranzacționarea acestora la BVB?
Există anumiți f...
  [2] brosura_fidelis.pdf (p.6): ÎNTREBĂRI
& RĂSPUNSURI
Implică vreun cost cumpărarea și deținerea titlurilor FIDELIS?
Cumpărarea titlurilor FIDELIS nu i...
  [3] brosura_fidelis.pdf (p.2): Titlurile de stat sunt instrumente fnanciare utile atât statului, cât și 
populației.
Denumite și obligațiuni de stat, e...
  [4] brosura_fidelis.pdf (p.4): CUM INTRI ÎN POSESIA
TITLURILOR DE STAT FIDELIS?
Titlurile de stat FIDELIS pot f cumpărate de persoanele fzice prin 
in...
  [5] brosura_fidelis.pdf (p.3): INVESTEȘTI
LA

### CohereRerank (Contextual Compression)

The final quality gate: Cohere's `rerank-multilingual-v3.0` cross-encoder model
re-scores every candidate document against the query, then keeps the top-N by score.
Unlike cosine similarity, which compares independently computed embeddings,
a cross-encoder reads the **full query and document together**.


In [8]:
# Run with reranking and compare
print(f'=== After Cohere Reranking for: "{test_query}" ===')
print(f'Reranker:     Cohere rerank-multilingual-v3.0')
print(f'Input docs:   {settings.rag_top_k} -> Output docs: {settings.rag_rerank_top_n}')
print()

def _run_reranked():
    loop = asyncio.new_event_loop()
    result = loop.run_until_complete(rag_service.query(test_query, use_reranking=True))
    loop.close()
    return result

with concurrent.futures.ThreadPoolExecutor() as pool:
    reranked_docs = pool.submit(_run_reranked).result()
for j, doc in enumerate(reranked_docs, 1):
    source = doc.metadata.get('source', 'unknown').split('/')[-1]
    relevance = doc.metadata.get('relevance_score', 'N/A')
    print(f'  [{j}] (score: {relevance}) {source} (p.{doc.metadata.get("page", "?")}): {doc.page_content[:120]}...')

# Summary
print(f'\n📊 Pipeline summary:')
print(f'  {len(documents)} PDF pages -> {settings.rag_parent_chunk_size}-char parent chunks -> {settings.rag_child_chunk_size}-char child chunks (embedded)')
print(f'  Query -> BM25 (30%) + Vector (70%) -> top-{settings.rag_top_k} -> Cohere Rerank -> top-{settings.rag_rerank_top_n} -> LLM')


=== After Cohere Reranking for: "Ce este FIDELIS si cum se tranzactioneaza pe BVB?" ===
Reranker:     Cohere rerank-multilingual-v3.0
Input docs:   10 -> Output docs: 5

  [1] (score: 0.8327813) brosura_fidelis.pdf (p.7): Există vreun risc asociat titlurilor de stat FIDELIS, având în 
vedere tranzacționarea acestora la BVB?
Există anumiți f...
  [2] (score: 0.8327813) brosura_fidelis.pdf (p.7): Există vreun risc asociat titlurilor de stat FIDELIS, având în 
vedere tranzacționarea acestora la BVB?
Există anumiți f...
  [3] (score: 0.7261344) brosura_fidelis.pdf (p.2): Titlurile de stat sunt instrumente fnanciare utile atât statului, cât și 
populației.
Denumite și obligațiuni de stat, e...
  [4] (score: 0.7261344) brosura_fidelis.pdf (p.2): Titlurile de stat sunt instrumente fnanciare utile atât statului, cât și 
populației.
Denumite și obligațiuni de stat, e...
  [5] (score: 0.26903743) brosura_fidelis.pdf (p.6): ÎNTREBĂRI
& RĂSPUNSURI
Implică vreun cost cum
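
In the reranked output above, results 1 and 2 share the same page and score, as do results 3 and 4, so the five output slots hold only three distinct chunks. One plausible cause (an assumption, not verified here) is that the same parent chunk reaches the reranker through more than one retrieval path. A minimal sketch of deduplicating before the top-N cut follows; plain dicts stand in for LangChain `Document` objects, and the `text` values are placeholders.

```python
def dedupe_ranked(docs):
    """Keep the first (highest-ranked) occurrence of each (source, page, text) triple."""
    seen = set()
    unique = []
    for doc in docs:
        key = (doc["source"], doc["page"], doc["text"])
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Pages and scores copied from the output above; text fields are placeholders
reranked = [
    {"source": "brosura_fidelis.pdf", "page": 7, "score": 0.8327813, "text": "chunk p7"},
    {"source": "brosura_fidelis.pdf", "page": 7, "score": 0.8327813, "text": "chunk p7"},
    {"source": "brosura_fidelis.pdf", "page": 2, "score": 0.7261344, "text": "chunk p2"},
    {"source": "brosura_fidelis.pdf", "page": 2, "score": 0.7261344, "text": "chunk p2"},
    {"source": "brosura_fidelis.pdf", "page": 6, "score": 0.26903743, "text": "chunk p6"},
]
unique_docs = dedupe_ranked(reranked)
print([(d["page"], d["score"]) for d in unique_docs])
```

Deduplicating before the reranker's top-N selection would free the repeated slots for other candidates, which may matter for recall when N is small.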

## 2. RAG Evaluation: Baseline (No Reranking)

First, we evaluate the RAG pipeline with reranking disabled (`use_reranking=False`), so answers are generated from the retriever's results as returned.
This establishes the baseline scores that the reranked pipeline is compared against.

In [9]:
# Run baseline RAG evaluation (no reranking)
from datasets import Dataset

def evaluate_rag(questions, ground_truths, use_reranking=False):
    """Run RAG pipeline and collect results for RAGAS evaluation."""
    answers = []
    contexts = []
    llm = ChatOpenAI(model='gpt-4o-mini', api_key=settings.openai_api_key)

    async def _run():
        for i, question in enumerate(questions, 1):
            print(f'  [{i}/{len(questions)}] {question[:60]}...')
            docs = await rag_service.query(question, use_reranking=use_reranking)
            context_texts = [doc.page_content for doc in docs]
            context_str = '\n\n'.join(context_texts)
            prompt = f'Based on the following context, answer the question.\n\nContext:\n{context_str}\n\nQuestion: {question}\n\nAnswer:'
            response = await llm.ainvoke(prompt)
            answers.append(response.content)
            contexts.append(context_texts)

    def _thread_target():
        loop = asyncio.new_event_loop()
        loop.run_until_complete(_run())
        loop.close()

    with concurrent.futures.ThreadPoolExecutor() as pool:
        pool.submit(_thread_target).result()

    return answers, contexts

print('Running baseline RAG evaluation (no reranking)...')
baseline_answers, baseline_contexts = evaluate_rag(
    eval_questions, eval_ground_truths, use_reranking=False
)
print(f'Generated {len(baseline_answers)} answers')


Running baseline RAG evaluation (no reranking)...
  [1/10] Ce este ghidul FIDELIS pentru INVESTITORI?...
  [2/10] What are the key benefits of investing in FIDELIS government...
  [3/10] cum pot cumpara titluri de stat FIDELIS?...
  [4/10] What is the significance of the Sânge de Investitor campaign...
  [5/10] What are the implications of trading FIDELIS government bond...
  [6/10] What are the characteristics and benefits of investing in ti...
  [7/10] Cum pot investi cetățenii români în Titlurile de stat TEZAUR...
  [8/10] Cum pot fi cumpărate Titlurile de stat TEZAUR conform ghidul...
  [9/10] Cum pot fi cumpărate Titlurile de stat TEZAUR conform ghidul...
  [10/10] Cum pot fi cumpărate titlurile de stat TEZAUR și care sunt a...
Generated 10 answers


In [10]:
from ragas import evaluate as ragas_evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Create RAGAS dataset
baseline_dataset = Dataset.from_dict({
    'user_input': eval_questions,
    'response': baseline_answers,
    'retrieved_contexts': baseline_contexts,
    'reference': eval_ground_truths,
})

# Run RAGAS evaluation in a thread to avoid async deadlock
print('Running RAGAS metrics on baseline...')
def _run_ragas_baseline():
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    result = ragas_evaluate(
        dataset=baseline_dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    loop.close()
    return result

with concurrent.futures.ThreadPoolExecutor() as pool:
    baseline_result = pool.submit(_run_ragas_baseline).result()

baseline_scores = {k: round(v, 4) for k, v in baseline_result._repr_dict.items()}
print('\n=== Baseline RAG Scores ===')
for metric, score in baseline_scores.items():
    bar = '█' * int(score * 20) + '░' * (20 - int(score * 20))
    print(f'  {metric:<25} {bar} {score:.4f}')


Running RAGAS metrics on baseline...


Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

Task exception was never retrieved
future: <Task finished name='Task-472' coro=<AsyncClient.aclose() done, defined at /usr/local/lib/python3.12/site-packages/httpx/_client.py:2024> exception=RuntimeError('Event loop is closed')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 2031, in aclose
    await self._transport.aclose()
  File "/usr/local/lib/python3.12/site-packages/httpx/_transports/default.py", line 389, in aclose
    await self._pool.aclose()
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 353, in aclose
    await self._close_connections(closing_connections)
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 345, in _close_connections
    await connection.aclose()
  File "/usr/local/lib/p


=== Baseline RAG Scores ===
  faithfulness              ███████████████████░ 0.9699
  answer_relevancy          ██████████████████░░ 0.9315
  context_precision         ███████████░░░░░░░░░ 0.5730
  context_recall            ██████████████████░░ 0.9400


## 3. RAG Evaluation: Improved (With Cohere Reranking)

Now we add **Cohere Rerank** (`rerank-multilingual-v3.0`) to the pipeline.
With the settings printed in the walkthrough (`rag_top_k` = 10, `rag_rerank_top_n` = 5), it retrieves 10 candidates
and keeps the 5 highest-scoring ones, with the aim of improving context precision.

This is the **iteration story** required for certification: we change one component and measure the effect on each metric.

In [11]:
# Run improved RAG evaluation (with Cohere reranking)
print('Running improved RAG evaluation (with Cohere reranking)...')
reranked_answers, reranked_contexts = evaluate_rag(
    eval_questions, eval_ground_truths, use_reranking=True
)

# Create RAGAS dataset
reranked_dataset = Dataset.from_dict({
    'user_input': eval_questions,
    'response': reranked_answers,
    'retrieved_contexts': reranked_contexts,
    'reference': eval_ground_truths,
})

# Run RAGAS evaluation in a thread
print('Running RAGAS metrics on reranked pipeline...')
def _run_ragas_reranked():
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    result = ragas_evaluate(
        dataset=reranked_dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    loop.close()
    return result

with concurrent.futures.ThreadPoolExecutor() as pool:
    reranked_result = pool.submit(_run_ragas_reranked).result()

reranked_scores = {k: round(v, 4) for k, v in reranked_result._repr_dict.items()}
print('\n=== Reranked RAG Scores ===')
for metric, score in reranked_scores.items():
    bar = '█' * int(score * 20) + '░' * (20 - int(score * 20))
    print(f'  {metric:<25} {bar} {score:.4f}')


Running improved RAG evaluation (with Cohere reranking)...
  [1/10] Ce este ghidul FIDELIS pentru INVESTITORI?...
  [2/10] What are the key benefits of investing in FIDELIS government...
  [3/10] cum pot cumpara titluri de stat FIDELIS?...
  [4/10] What is the significance of the Sânge de Investitor campaign...
  [5/10] What are the implications of trading FIDELIS government bond...
  [6/10] What are the characteristics and benefits of investing in ti...
  [7/10] Cum pot investi cetățenii români în Titlurile de stat TEZAUR...
  [8/10] Cum pot fi cumpărate Titlurile de stat TEZAUR conform ghidul...
  [9/10] Cum pot fi cumpărate Titlurile de stat TEZAUR conform ghidul...
  [10/10] Cum pot fi cumpărate titlurile de stat TEZAUR și care sunt a...
Running RAGAS metrics on reranked pipeline...


Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]


=== Reranked RAG Scores ===
  faithfulness              █████████████████░░░ 0.8698
  answer_relevancy          ██████████████████░░ 0.9368
  context_precision         ██████████████████░░ 0.9232
  context_recall            █████████████████░░░ 0.8567


In [12]:
# Side-by-side comparison
print('\n' + '='*70)
print('COMPARISON: Baseline vs Reranked (Cohere rerank-multilingual-v3.0)')
print('='*70)
print(f'{"Metric":<25} {"Baseline":>10} {"Reranked":>10} {"Delta":>10} {"Improved?":>10}')
print('-'*70)

comparison_data = []
for metric in baseline_scores:
    b = baseline_scores.get(metric, 0)
    r = reranked_scores.get(metric, 0)
    delta = r - b
    improved = '✅' if delta > 0 else ('⚠️' if delta == 0 else '❌')
    delta_str = f'+{delta:.4f}' if delta >= 0 else f'{delta:.4f}'
    print(f'{metric:<25} {b:>10.4f} {r:>10.4f} {delta_str:>10} {improved:>10}')
    comparison_data.append({
        'Metric': metric,
        'Baseline': b,
        'Reranked': r,
        'Delta': delta,
        'Improved': improved,
    })

print('\n')
comparison_df = pd.DataFrame(comparison_data)
display(comparison_df.style.format({'Baseline': '{:.4f}', 'Reranked': '{:.4f}', 'Delta': '{:+.4f}'}))



COMPARISON: Baseline vs Reranked (Cohere rerank-multilingual-v3.0)
Metric                      Baseline   Reranked      Delta  Improved?
----------------------------------------------------------------------
faithfulness                  0.9699     0.8698    -0.1001          ❌
answer_relevancy              0.9315     0.9368    +0.0053          ✅
context_precision             0.5730     0.9232    +0.3502          ✅
context_recall                0.9400     0.8567    -0.0833          ❌




| | Metric | Baseline | Reranked | Delta | Improved |
|---|---|---|---|---|---|
| 0 | faithfulness | 0.9699 | 0.8698 | -0.1001 | ❌ |
| 1 | answer_relevancy | 0.9315 | 0.9368 | +0.0053 | ✅ |
| 2 | context_precision | 0.5730 | 0.9232 | +0.3502 | ✅ |
| 3 | context_recall | 0.9400 | 0.8567 | -0.0833 | ❌ |
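
The large `context_precision` gain is easier to interpret with the metric's shape in mind: it averages precision@k over the positions that hold a relevant chunk, so relevant chunks ranked below irrelevant ones are penalized. The sketch below assumes binary relevance flags supplied by hand; RAGAS obtains these judgments from an LLM, and the function name is illustrative.

```python
def context_precision_at_k(relevance):
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k at this relevant position
    return weighted / hits if hits else 0.0


# Same number of relevant chunks, different ordering
print(round(context_precision_at_k([1, 1, 0, 0, 0]), 4))  # relevant chunks first
print(round(context_precision_at_k([0, 0, 0, 1, 1]), 4))  # relevant chunks last
```

Because ordering matters this much, a reranker can raise the score without retrieving anything new. By contrast, cutting the candidate list can remove chunks the reference answer depends on, which is consistent with the drop in `context_recall` in the table above.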


## 4. Agent Evaluation

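We run the full LangGraph Supervisor agent on four scenarios that exercise:
- **Tool Call Accuracy**: Does it route to the right tool?
- **Topic Adherence**: Does the response stay on topic?
- **MiFID II Compliance**: Does it add disclaimers when discussing investments?
- **Language Detection**: Does it respond in the user's language?

The scoring code below measures only two of these directly: keyword coverage of the expected topics (70% of the score) and the presence or absence of a disclaimer (30%). Tool routing and response language are not scored and have to be judged from the response previews.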
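
The weighting used by the evaluation cell in this section can be isolated as a pure function, which makes the arithmetic easy to check by hand. The function name `score_response` and the example response are invented for illustration; the keyword checks and the 0.7/0.3 weights mirror the cell's logic.

```python
def score_response(response, expected_topics, should_have_disclaimer):
    """Combine topic keyword coverage (70%) with disclaimer correctness (30%)."""
    text = response.lower()
    topic_hits = sum(1 for topic in expected_topics if topic.lower() in text)
    topic_score = topic_hits / len(expected_topics)
    # A disclaimer is detected by keyword only, as in the evaluation cell
    has_disclaimer = "MiFID" in response or "recomandare" in text
    disclaimer_ok = has_disclaimer == should_have_disclaimer
    return topic_score * 0.7 + (1.0 if disclaimer_ok else 0.0) * 0.3


# Invented response: hits 2 of 3 expected topics and contains a disclaimer keyword
example = "TEZAUR sunt titluri de stat. Aceasta nu este o recomandare de investitii."
overall = score_response(example, ["TEZAUR", "titluri de stat", "garantat"], True)
print(f"{overall:.2f}")
```

Two of three topics plus a correct disclaimer gives 0.7 × 2/3 + 0.3 ≈ 0.77. Since matching is purely keyword-based, a correct answer that uses a synonym for an expected topic loses points, and any response containing "recomandare" counts as carrying a disclaimer.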

In [13]:
# Agent evaluation
from app.services.agent_service import agent_service

# Initialize the agent (builds LangGraph, connects to Postgres)
# Must use await directly so Postgres pool shares Jupyter's event loop
await agent_service.setup()
print('Agent service initialized')

DEMO_USER_ID = '00000000-0000-0000-0000-000000000001'

AGENT_TEST_SCENARIOS = [
    {
        'category': 'RAG Query',
        'message': 'Ce este TEZAUR?',
        'expected_topics': ['TEZAUR', 'titluri de stat', 'garantat'],
        'should_have_disclaimer': True,
    },
    {
        'category': 'Market Search',
        'message': 'Care este cursul EUR/RON astazi?',
        'expected_topics': ['EUR', 'RON', 'curs'],
        'should_have_disclaimer': False,
    },
    {
        'category': 'Goals Query',
        'message': 'Care sunt obiectivele mele financiare?',
        'expected_topics': ['obiectiv', 'RON'],
        'should_have_disclaimer': False,
    },
    {
        'category': 'Language (EN)',
        'message': 'What are the differences between TEZAUR and FIDELIS?',
        'expected_topics': ['TEZAUR', 'FIDELIS'],
        'should_have_disclaimer': True,
    },
]

agent_results = []
for i, scenario in enumerate(AGENT_TEST_SCENARIOS, 1):
    print(f'\n--- Scenario {i}: {scenario["category"]} ---')
    print(f'Message: {scenario["message"]}')

    response = await agent_service.chat(
        message=scenario['message'],
        user_id=DEMO_USER_ID,
        session_id=f'eval-notebook-{i}',
    )

    topic_hits = sum(1 for t in scenario['expected_topics'] if t.lower() in response.lower())
    topic_score = topic_hits / len(scenario['expected_topics'])
    has_disclaimer = 'MiFID' in response or 'recomandare' in response.lower()
    disclaimer_ok = has_disclaimer == scenario['should_have_disclaimer']
    overall = topic_score * 0.7 + (1.0 if disclaimer_ok else 0.0) * 0.3

    agent_results.append({
        'Category': scenario['category'],
        'Topic Score': f'{topic_score:.0%}',
        'Disclaimer OK': '✅' if disclaimer_ok else '❌',
        'Overall': f'{overall:.2f}',
        'Response Preview': response[:120] + '...',
    })
    print(f'  Score: {overall:.2f} | Topics: {topic_score:.0%} | Disclaimer: {"✅" if disclaimer_ok else "❌"}')
    print(f'  Response: {response[:120]}...')

print('\n\n=== Agent Evaluation Summary ===')
agent_df = pd.DataFrame(agent_results)
display(agent_df)




Agent service initialized

--- Scenario 1: RAG Query ---
Message: Ce este TEZAUR?
  Score: 0.77 | Topics: 67% | Disclaimer: ✅
  Response: TEZAUR este un program de titluri de stat destinat cetățenilor români care au împlinit vârsta de 18 ani. Aceste titluri ...

--- Scenario 2: Market Search ---
Message: Care este cursul EUR/RON astazi?
  Score: 1.00 | Topics: 100% | Disclaimer: ✅
  Response: Astăzi, pe 28 februarie 2026, cursul de schimb pentru 1 EUR este 5.0965 RON, conform Băncii Naționale a României.

Surse...

--- Scenario 3: Goals Query ---
Message: Care sunt obiectivele mele financiare?
  Score: 0.65 | Topics: 50% | Disclaimer: ✅
  Response: Nu aveți obiective financiare definite în prezent. Dacă doriți să setați un nou obiectiv financiar, vă pot ajuta cu plăc...

--- Scenario 4: Language (EN) ---
Message: What are the differences between TEZAUR and FIDELIS?
  Score: 1.00 | Topics: 100% | Disclaimer: ✅
  Response: TEZAUR și FIDELIS sunt ambele programe 

| | Category | Topic Score | Disclaimer OK | Overall | Response Preview |
|---|---|---|---|---|---|
| 0 | RAG Query | 67% | ✅ | 0.77 | TEZAUR este un program de titluri de stat dest... |
| 1 | Market Search | 100% | ✅ | 1.0 | Astăzi, pe 28 februarie 2026, cursul de schimb... |
| 2 | Goals Query | 50% | ✅ | 0.65 | Nu aveți obiective financiare definite în prez... |
| 3 | Language (EN) | 100% | ✅ | 1.0 | TEZAUR și FIDELIS sunt ambele programe de titl... |


In [14]:
# Final summary
print('='*60)
print('EVALUATION COMPLETE')
print('='*60)
print(f'\nRAG Baseline Scores:  {baseline_scores}')
print(f'RAG Reranked Scores:  {reranked_scores}')
print(f'Agent Scenarios:      {len(agent_results)} tested')
print(f'Agent Pass Rate:      {sum(1 for r in agent_results if float(r["Overall"]) >= 0.7)}/{len(agent_results)}')


EVALUATION COMPLETE

RAG Baseline Scores:  {'faithfulness': 0.9699, 'answer_relevancy': 0.9315, 'context_precision': 0.573, 'context_recall': 0.94}
RAG Reranked Scores:  {'faithfulness': 0.8698, 'answer_relevancy': 0.9368, 'context_precision': 0.9232, 'context_recall': 0.8567}
Agent Scenarios:      4 tested
Agent Pass Rate:      3/4
