# 📚 Manual RAGAS Testset Builder: FAISS + OpenAI

This notebook shows an **end-to-end, fully manual** workflow to evaluate a RAG system using **RAGAS 0.3.8**, with:

1. **Document ingestion**
   - Optional: Load a PDF using `PyPDFLoader`
   - Fallback: Use sample in-memory texts
2. **Chunking & Indexing**
   - Split into chunks with `RecursiveCharacterTextSplitter`
   - Create a **FAISS** vector store using **OpenAI embeddings**
3. **RAG Pipeline (FAISS + OpenAI)**
   - Retrieve top-k chunks
   - Pass them as context to the LLM
4. **Manual Testset Builder**
   - Build your own testset with columns:
     - `question`
     - `ground_truth`
     - `contexts` (reference contexts)
   - Auto-run your RAG pipeline to generate `answer` for each row
5. **RAGAS Evaluation**
   - Use RAGAS metrics to score your pipeline:
     - `context_precision`
     - `context_recall`
     - `faithfulness`
     - `answer_relevancy`




### 1. context_precision

Intuition: "Of the context I retrieved, how much was actually useful for answering the question?"

High precision → most retrieved chunks are relevant to the question, and the relevant ones are ranked first.

Low precision → you're pulling in lots of noisy / unrelated text.

Good when you want lean, focused retrieval (few but very relevant chunks).

Think: "Did I retrieve only what I needed?"
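
As a rough illustration of how such a score can be computed, the sketch below averages precision@k over the ranks that hold a relevant chunk, which is how the RAGAS documentation describes context precision. The relevance flags are hand-written here (an assumption); in RAGAS an LLM judges whether each chunk is relevant. The function name `context_precision_at_k` is hypothetical.

```python
def context_precision_at_k(relevance: list[int]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    relevance[i] is 1 if the chunk at rank i + 1 is relevant, else 0.
    The flags are supplied by hand in this sketch (assumption).
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank at a relevant position
    return weighted_sum / hits if hits else 0.0


relevant_first = context_precision_at_k([1, 0])  # relevant chunk ranked first
relevant_last = context_precision_at_k([0, 1])   # relevant chunk ranked second
print(relevant_first, relevant_last)
```

With two retrieved chunks, this formula gives 1.0 when the relevant chunk is ranked first and 0.5 when it is ranked second, so ordering matters as well as relevance.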

### 2. context_recall

Intuition: "Did I retrieve enough of the relevant information that exists in the knowledge base?"

High recall → the retrieved context covers most of the ground truth facts needed.

Low recall → you missed important pieces; the answer might be incomplete or wrong.

Good when you want to check if your retriever is missing key documents.

Think: "Did I retrieve everything I needed?"
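
A minimal sketch of the idea, assuming the ground truth has already been broken into short claims: recall is the fraction of those claims that can be found in the retrieved chunks. The substring check is a crude stand-in (an assumption) for the LLM-based attribution RAGAS performs, and the function name is hypothetical.

```python
def context_recall_sketch(reference_claims: list[str], retrieved_contexts: list[str]) -> float:
    """Fraction of reference claims found in at least one retrieved chunk.

    Substring matching is a simplification; RAGAS uses an LLM to decide
    whether each ground-truth statement is attributable to the context.
    """
    if not reference_claims:
        return 0.0
    joined = " ".join(retrieved_contexts).lower()
    supported = sum(1 for claim in reference_claims if claim.lower() in joined)
    return supported / len(reference_claims)


claims = ["kidney disease", "cardiovascular disorders"]
full_context = [
    "Untreated diabetes can lead to complications such as kidney disease, "
    "neuropathy, and cardiovascular disorders."
]
partial_context = ["Untreated diabetes can lead to kidney disease."]

recall_full = context_recall_sketch(claims, full_context)
recall_partial = context_recall_sketch(claims, partial_context)
print(recall_full, recall_partial)
```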

### 3. faithfulness

Intuition: "Is the model's answer faithful to the retrieved context, or is it hallucinating?"

High faithfulness → every claim in the answer can be traced back to the provided context.

Low faithfulness → the answer introduces facts not supported by the context, or contradicts it.

Evaluates grounding of the answer, not just fluency.

Think: "Is the answer sticking to the documents, or making things up?"
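
The scoring step can be sketched as a simple ratio: supported claims divided by all claims extracted from the answer. The `ClaimVerdict` type and the example verdicts below are hypothetical; in RAGAS, both the claim extraction and the support check are done by an LLM.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # True if the claim can be inferred from the retrieved context


def faithfulness_sketch(verdicts: list[ClaimVerdict]) -> float:
    """Share of answer claims that the context supports."""
    if not verdicts:
        return 0.0
    return sum(v.supported for v in verdicts) / len(verdicts)


# Hand-written verdicts for illustration only (assumption).
verdicts = [
    ClaimVerdict("Reduced salt intake helps manage hypertension.", True),
    ClaimVerdict("Regular exercise helps manage hypertension.", True),
    ClaimVerdict("Herbal tea cures hypertension.", False),
]
faithfulness_value = faithfulness_sketch(verdicts)
print(round(faithfulness_value, 3))
```

One unsupported claim out of three pulls the score down to two thirds, even if the rest of the answer is fully grounded.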

### 4. answer_relevancy

Intuition: "Does the answer actually respond to the user's question?"

High relevancy → answer is on-topic, focused, and directly addresses the question.

Low relevancy → answer is off-topic, generic, or partially ignores the question.

Evaluates the fit between question and answer, regardless of context.

Think: "Is this the answer a user wanted for this question?"
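
The RAGAS documentation describes answer relevancy as the mean cosine similarity between the original question and a few questions that an LLM regenerates from the answer. The sketch below shows only that final averaging step, using toy 2-D vectors in place of real embeddings (an assumption); the function names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy_sketch(question_vec: list[float],
                            generated_question_vecs: list[list[float]]) -> float:
    """Mean similarity between the question and questions regenerated from the answer."""
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0


question_vec = [1.0, 0.0]
on_topic = [[1.0, 0.0], [2.0, 0.0]]    # same direction as the question
off_topic = [[0.0, 1.0], [0.0, 3.0]]   # orthogonal to the question

score_on_topic = answer_relevancy_sketch(question_vec, on_topic)
score_off_topic = answer_relevancy_sketch(question_vec, off_topic)
print(score_on_topic, score_off_topic)
```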

### Summary

context_precision → quality of retrieved context (how clean it is).

context_recall → completeness of retrieved context (did we miss things).

faithfulness → does the answer match the context (no hallucinations).

answer_relevancy → does the answer match the question (useful to the user).

In [1]:
# 🔧 Imports and Jupyter event-loop patch

import os
import json
import nest_asyncio

nest_asyncio.apply()



#### What is `nest_asyncio`?

The cell above runs `import nest_asyncio` followed by `nest_asyncio.apply()`.

`asyncio` is Python's built-in library for asynchronous code (`async`/`await`).

Jupyter notebooks already run an event loop (used by IPython).

When a library (like Ragas, or an async LLM client) tries to start another event loop, you can get errors like:

`RuntimeError: This event loop is already running`

`nest_asyncio` patches the current event loop so that it can be re-entered, i.e. you can run async code from inside an already-running loop.

`nest_asyncio.apply()` = "Allow me to run async code from inside this already-running notebook loop without crashing."
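
The error described above can be reproduced with the standard library alone. The sketch below simulates a notebook cell by running a coroutine (the hypothetical `notebook_cell`) and then trying to drive a second coroutine on the same, already-running loop. It does not use `nest_asyncio`; it only shows the failure that `nest_asyncio.apply()` is meant to avoid.

```python
import asyncio


async def fetch_score() -> float:
    """Stand-in for any async call a library might make (hypothetical)."""
    await asyncio.sleep(0)
    return 0.9


async def notebook_cell() -> str:
    """Simulates code running inside an already-running event loop."""
    coro = fetch_score()
    try:
        # Re-entering the running loop is refused by plain asyncio.
        asyncio.get_running_loop().run_until_complete(coro)
    except RuntimeError as exc:
        return str(exc)
    finally:
        coro.close()  # avoid a "coroutine was never awaited" warning
    return "no error"


error_message = asyncio.run(notebook_cell())
print(error_message)
```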

In [2]:
import os
from datasets import Dataset

from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

from langchain_core.prompts import ChatPromptTemplate


from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)


In [3]:
from dotenv import load_dotenv
load_dotenv()

True

#### 🔑 OpenAI client


In [4]:

embeddings = OpenAIEmbeddings()
llm = ChatOpenAI(model="gpt-4o")

print("LLM and embeddings objects created.")

LLM and embeddings objects created.


### 📄 STEP 1: Load PDF(s) and split into chunks

In [5]:
# 👉 Set this to your PDF path (local file).

PDF_PATH = "medical_health_sample.pdf"  

if os.path.exists(PDF_PATH):
    loader = PyPDFLoader(PDF_PATH)
    pdf_docs = loader.load()
    print(f"Loaded {len(pdf_docs)} pages from", PDF_PATH)
else:
    print("PDF not found. Set PDF_PATH to a real file if you want to load PDFs.")
    pdf_docs = []

# Split the pages into smaller chunks suitable for embeddings/RAG
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "],
)

if pdf_docs:
    docs = text_splitter.split_documents(pdf_docs)
    print(f"Split into {len(docs)} text chunks.")
else:
    # Fallback/sample docs if no PDF is provided
    docs = [
        Document(page_content="Diabetes Mellitus is a chronic metabolic disorder characterized by elevated blood glucose levels. It occurs when the body cannot produce enough insulin or cannot effectively use the insulin it produces. Untreated diabetes can lead to complications such as kidney disease, neuropathy, and cardiovascular disorders. Type 1 diabetes is autoimmune, while Type 2 diabetes is often associated with lifestyle factors."),
        Document(page_content="Hypertension, commonly known as high blood pressure, is a condition in which the force of blood against artery walls is consistently too high. Long-term uncontrolled hypertension increases the risk of stroke, heart attack, and kidney failure. Lifestyle modifications such as reduced salt intake, regular exercise, and stress management are essential."),
        Document(page_content="People with diabetes have a higher risk of developing hypertension. Both disorders together greatly increase the likelihood of cardiovascular complications. Managing blood sugar levels, maintaining a healthy weight, and monitoring blood pressure regularly are crucial for prevention."),
    ]
    print("Using sample in-memory documents (no PDF loaded).")
    print(f"Number of sample chunks: {len(docs)}")

Loaded 1 pages from medical_health_sample.pdf
Split into 2 text chunks.
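
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width sliding window. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on the listed separators); it only illustrates how consecutive chunks share their boundary text. The function name is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.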


### 🧠 STEP 2: Build a FAISS Vector Store (in-memory RAG index)

In [6]:


texts = [d.page_content for d in docs]

faiss_vectorstore = FAISS.from_texts(
    texts=texts,
    embedding=embeddings,
)

faiss_retriever = faiss_vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)

print("FAISS vector store built with", len(texts), "documents.")

FAISS vector store built with 2 documents.
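
Conceptually, a similarity search ranks stored vectors by their distance to the query vector and keeps the closest `k`. The brute-force sketch below assumes Euclidean (L2) distance and toy 2-D vectors in place of OpenAI embeddings; the function name `top_k_l2` is hypothetical and FAISS itself uses optimized index structures.

```python
import math


def top_k_l2(query: list[float], vectors: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Return (index, distance) pairs for the k stored vectors closest to the query."""
    distances = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]


index_vectors = [[0.0, 1.0], [1.0, 0.0]]  # two stored "chunks"
hits = top_k_l2([0.9, 0.1], index_vectors, k=4)
print(hits)
```

In this sketch, asking for `k=4` from an index of two vectors yields two hits. With the two-chunk index built above, the retriever likewise has only two chunks to offer, so every question receives both of them as context.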


### 🔁 STEP 3: Simple RAG function (Retriever + LLM)

In [7]:


rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful RAG assistant. Use ONLY the provided context to answer. If the answer is not in the context, say you don't know."),
        ("human", "Question: {question}\n\nContext:\n{context}\n\nAnswer in 2-4 sentences:"),
    ]
)



rag_chain = rag_prompt | llm

def answer_with_rag(question: str, retriever=faiss_retriever):
    """Retrieve top docs and answer using the LLM."""
    retrieved_docs = retriever.invoke(question)
    context = "\n\n".join(d.page_content for d in retrieved_docs)

    response = rag_chain.invoke({"question": question, "context": context})

    return {
        "question": question,
        "answer": response.content,
        "retrieved_contexts": [doc.page_content for doc in retrieved_docs],
    }

# Quick smoke test (this will call llm)
test_result = answer_with_rag("How are diabetes and hypertension related?")
print(test_result["answer"])

People with diabetes have a higher risk of developing hypertension, and when both conditions co-exist, they significantly increase the risk of cardiovascular complications. Effective management involves controlling blood sugar levels, maintaining a healthy weight, and regularly monitoring blood pressure to prevent these complications.


### 🧱 STEP 4: Build a **manual** RAGAS testset

RAGAS expects your evaluation dataset to have (typically) these columns:

- `question` *(str)*: The user question  
- `ground_truth` *(str)*: The correct / expected answer (from a human or a reference)  
- `contexts` *(list[str] per row)*: The **retrieved** contexts passed to the LLM  
- `answer` *(str)*: The answer your RAG pipeline actually produced  

In the evaluation output these columns appear under the newer names `user_input`, `reference`, `retrieved_contexts`, and `response`.

In this notebook, we'll:

1. Manually define a small `test_data` list of dicts  
2. Turn it into a `datasets.Dataset`  
3. Use our RAG pipeline to fill the `answer` and (optionally) override `contexts`  
4. Run `ragas.evaluate(...)` on that dataset

### 🧾 STEP 4a: Define your manual testset rows here
- Edit/extend the rows in `manual_test_dataset.csv` to match your domain.
- Each row:
  - `question`: what you would ask your RAG system
  - `ground_truth`: the ideal answer (short, factual, grounded in your docs)
  - `contexts`: OPTIONAL reference contexts you believe are relevant (can be empty; RAGAS mainly uses retrieved ones)
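
For reference, a file with this layout could be produced as sketched below. The two rows are hypothetical examples written from the sample documents in Step 1, not the contents of the author's CSV, and the sketch writes to an in-memory buffer rather than to disk.

```python
import csv
import io

# Hypothetical seed rows (assumption): replace with questions from your own domain.
rows = [
    {
        "question": "How are diabetes and hypertension related?",
        "ground_truth": "People with diabetes have a higher risk of developing hypertension.",
        "contexts": "People with diabetes have a higher risk of developing hypertension.",
    },
    {
        "question": "What lifestyle factors help manage hypertension?",
        "ground_truth": "Reduced salt intake, regular exercise, and stress management.",
        "contexts": "Lifestyle modifications such as reduced salt intake, regular exercise, "
                    "and stress management are essential.",
    },
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["question", "ground_truth", "contexts"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Round-trip check: every field comes back as a plain string.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(parsed[1]["ground_truth"])
```

Note that a CSV stores `contexts` as a single string, not a list. That is harmless in this notebook because the manual `contexts` column is replaced by the retrieved chunks before evaluation.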

In [8]:

import pandas as pd

manual_df = pd.read_csv("manual_test_dataset.csv")
manual_test_data = manual_df.to_dict(orient="records")

print(f"Manual seed rows: {len(manual_test_data)}")

Manual seed rows: 4


In [9]:

# Display the manual test data as a DataFrame (easier to inspect/edit)

display(manual_df)

| | question | ground_truth | contexts |
|---|---|---|---|
| 0 | What is the primary characteristic of Diabetes... | It is characterized by elevated blood glucose ... | 1. Diabetes Mellitus\nDiabetes Mellitus is a c... |
| 1 | Name two long-term complications of untreated ... | Kidney disease and cardiovascular disorders ar... | 1. Diabetes Mellitus\nDiabetes Mellitus is a c... |
| 2 | What lifestyle factors help manage hypertension? | Reduced salt intake, regular exercise, and str... | 2. Hypertension\nHypertension, commonly known ... |
| 3 | How are diabetes and hypertension related? | People with diabetes have a higher risk of dev... | 3. Relationship Between Diabetes and Hypertens... |


### STEP 4b: Convert to Dataset and attach RAG answers

In [10]:

# Convert to HF Dataset
eval_dataset = Dataset.from_pandas(manual_df)

def run_rag_on_dataset(ds: Dataset):
    """For each row, call our RAG pipeline and attach `answer` and `contexts` (retrieved)."""
    answers = []
    retrieved_contexts = []

    for row in ds:
        q = row["question"]
        result = answer_with_rag(q)
        answers.append(result["answer"])
        retrieved_contexts.append(result["retrieved_contexts"])

    ds = ds.add_column("answer", answers)

    # If you want to let RAGAS metrics operate on retrieved contexts, name the column `contexts`
    # If your manual `contexts` column is important, you can rename it to something else first.
    if "contexts" in ds.column_names:
        ds = ds.remove_columns(["contexts"])
    ds = ds.add_column("contexts", retrieved_contexts)

    return ds

eval_dataset_with_answers = run_rag_on_dataset(eval_dataset)

print(eval_dataset_with_answers)

Dataset({
    features: ['question', 'ground_truth', 'answer', 'contexts'],
    num_rows: 4
})


### 📊 STEP 5: Run RAGAS evaluation

In [11]:


metrics = [
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
]

result = evaluate(
    eval_dataset_with_answers,
    metrics=metrics,
    llm=llm,
    embeddings=embeddings,
)

result_df = result.to_pandas()
result_df


| | user_input | retrieved_contexts | response | reference | context_precision | context_recall | faithfulness | answer_relevancy |
|---|---|---|---|---|---|---|---|---|
| 0 | What is the primary characteristic of Diabetes... | [Understanding Diabetes and Hypertension\n1. D... | The primary characteristic of Diabetes Mellitu... | It is characterized by elevated blood glucose ... | 1.0 | 1.0 | 1.0 | 0.999999 |
| 1 | Name two long-term complications of untreated ... | [Understanding Diabetes and Hypertension\n1. D... | Two long-term complications of untreated diabe... | Kidney disease and cardiovascular disorders ar... | 1.0 | 1.0 | 1.0 | 0.9876 |
| 2 | What lifestyle factors help manage hypertension? | [essential.\nMedication may be required in mod... | Lifestyle factors that help manage hypertensio... | Reduced salt intake, regular exercise, and str... | 0.5 | 1.0 | 0.8 | 1.0 |
| 3 | How are diabetes and hypertension related? | [Understanding Diabetes and Hypertension\n1. D... | People with diabetes have a higher risk of dev... | People with diabetes have a higher risk of dev... | 0.5 | 1.0 | 1.0 | 0.909439 |
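
Per-row scores are easier to act on once they are summarized. The sketch below copies the four rows of scores shown above into plain dictionaries, averages each metric, and flags rows that fall below a cut-off. The threshold of 0.8 is a hypothetical choice, not a RAGAS default.

```python
from statistics import mean

# Scores copied from the evaluation output above, one dict per test row.
scores = [
    {"context_precision": 1.0, "context_recall": 1.0, "faithfulness": 1.0, "answer_relevancy": 0.999999},
    {"context_precision": 1.0, "context_recall": 1.0, "faithfulness": 1.0, "answer_relevancy": 0.9876},
    {"context_precision": 0.5, "context_recall": 1.0, "faithfulness": 0.8, "answer_relevancy": 1.0},
    {"context_precision": 0.5, "context_recall": 1.0, "faithfulness": 1.0, "answer_relevancy": 0.909439},
]

THRESHOLD = 0.8  # hypothetical cut-off for "needs a closer look"

# Average each metric across the test rows.
metric_means = {metric: mean(row[metric] for row in scores) for metric in scores[0]}

# Map row index -> metrics that fall strictly below the threshold.
flagged = {}
for i, row in enumerate(scores):
    low = [metric for metric, value in row.items() if value < THRESHOLD]
    if low:
        flagged[i] = low

print(metric_means)
print(flagged)
```

On these numbers only `context_precision` is flagged, for rows 2 and 3, which points at retrieval ranking and noise, not at answer generation, as the first thing to inspect.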



![RAGAS scores](raga_score.png)