<a href="https://colab.research.google.com/github/edgarbc/RAG-systems/blob/main/RAGAS_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAGAs Demo: Evaluating Retrieval-Augmented Generation Systems

August 2025.

Edgar Bermudez.

This notebook is a practical demonstration of how to use the **RAGAs** library to evaluate the performance of a Retrieval-Augmented Generation (RAG) system. You can read an extended version of this topic in my Medium post.


**Who is this for?**

This notebook is intended for developers, researchers, and anyone interested in building and evaluating RAG pipelines. It provides a hands-on example of setting up a basic RAG system and using RAGAs metrics to assess its quality.

**How it works:**

1.  **Setup**: Installs necessary libraries like `ragas`, `datasets`, `sentence-transformers`, and `faiss-cpu`.
2.  **Basic RAG Pipeline**: Demonstrates a simple RAG pipeline using a small in-memory knowledge base, a SentenceTransformer for embeddings, and FAISS for retrieval. It includes both a mock generator and the option to use OpenAI's API for text generation.
3.  **RAGAs Evaluation**: Shows how to use the `ragas.evaluate` function to compute metrics that assess different aspects of the RAG pipeline's performance, including:
    *   **Context Precision**: Measures whether the retrieved chunks that are actually relevant are ranked ahead of the irrelevant ones (LLM-judged by default).
    *   **Context Recall**: Measures how much of the information in the ground truth is present in the retrieved context (LLM-judged by default).
    *   **Faithfulness**: Measures the fraction of claims in the generated answer that are supported by the retrieved context (requires an LLM).
    *   **Answer Correctness**: Measures how accurate the generated answer is compared to the ground truth (requires an LLM and embeddings).

By running this notebook, you will learn how to build a basic RAG system and apply RAGAs to quantitatively evaluate its effectiveness.
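To make the two context metrics listed above more concrete, here is a minimal sketch of the arithmetic behind them. It assumes the commonly documented definitions: context precision as the mean of precision@k taken at the ranks where a relevant chunk appears, and context recall as the share of ground-truth statements the context supports. The function names and the 0/1 labels are hypothetical and hand-written for illustration. In the library itself these judgments come from an LLM by default, so this is not the library's implementation.

```python
def context_precision_at_k(relevance):
    """Rank-aware precision for one question.

    relevance: list of 0/1 flags, one per retrieved chunk, in rank order
    (1 = the chunk is relevant to the ground truth).
    """
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0


def context_recall(supported):
    """Fraction of ground-truth statements supported by the retrieved context."""
    return sum(supported) / len(supported) if supported else 0.0


# Hand-labeled toy judgments (hypothetical, not taken from the notebook run)
print(context_precision_at_k([1, 1]))  # both retrieved chunks are relevant
print(context_precision_at_k([0, 1]))  # the only relevant chunk is ranked second
print(context_recall([1, 1, 0]))       # 2 of 3 ground-truth statements supported
```

Under these definitions a relevant chunk ranked second scores lower than the same chunk ranked first, which is why retrieval order matters for context precision even when `k` is small.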

In [None]:
!pip install ragas datasets sentence-transformers faiss-cpu
# Optional for LLM-based generation/scoring:
!pip install openai

Collecting ragas
  Downloading ragas-0.3.2-py3-none-any.whl.metadata (21 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)

In [None]:
import os
import numpy as np
from datasets import Dataset
from sentence_transformers import SentenceTransformer
import faiss

In [None]:
# Toggle this to switch between a mock local generator and an API call
USE_OPENAI = True  # set True if you have OPENAI_API_KEY

# ---------------------------
# 1) Tiny knowledge base
# ---------------------------
documents = [
    "Retrieval-Augmented Generation (RAG) combines information retrieval and text generation.",
    "RAG improves factual accuracy by grounding answers in external documents.",
    "RAG mitigates hallucinations in large language models by citing retrieved sources.",
    "FAISS performs efficient similarity search over dense vector embeddings.",
    "OpenAI provides APIs for large language models."
]

# ---------------------------
# 2) Build a retriever (FAISS)
# ---------------------------
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(documents, convert_to_tensor=False)
dim = len(doc_vecs[0])
index = faiss.IndexFlatL2(dim)
index.add(np.array(doc_vecs, dtype="float32"))

def retrieve(query, k=2):
    qv = embedder.encode([query], convert_to_tensor=False)
    D, I = index.search(np.array(qv, dtype="float32"), k)
    return [documents[i] for i in I[0]]

# ---------------------------
# 3) Generator: mock or OpenAI
# ---------------------------
def mock_generate(question, contexts):
    # A tiny deterministic generator that echoes the question and the top-ranked context
    return f"{question} Answer based on: " + " | ".join(contexts[:1])

if USE_OPENAI:
    import openai
    from google.colab import userdata  # Colab-only helper for reading notebook secrets
    # Get the API key from Colab secrets
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

    def call_llm(prompt):
        # Simple chat completion; swap model as you like
        client = openai.OpenAI(api_key=OPENAI_API_KEY) # Pass the API key here
        resp = client.chat.completions.create( # Use the new chat completions method
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=180,
        )
        return resp.choices[0].message.content.strip()

    def generate(question, contexts):
        ctx = "\n".join(contexts)
        prompt = f"Use the context to answer.\nContext:\n{ctx}\n\nQuestion: {question}\nAnswer:"
        return call_llm(prompt)
else:
    generate = mock_generate

# ---------------------------
# 4) Queries with ground truth
# ---------------------------
qa_pairs = [
    {"question": "What is RAG?",
     "ground_truth": "RAG combines retrieval with generation to ground answers in external data."},
    {"question": "Why is RAG useful?",
     "ground_truth": "It reduces hallucinations and improves factual accuracy via retrieved documents."},
    {"question": "What is FAISS?",
     "ground_truth": "FAISS is a library for efficient similarity search on dense vectors."}
]

pipeline_rows = []
for qa in qa_pairs:
    ctx = retrieve(qa["question"], k=2)
    ans = generate(qa["question"], ctx)
    pipeline_rows.append({
        "question": qa["question"],
        "answer": ans,
        "contexts": ctx,
        "ground_truth": qa["ground_truth"],
    })

dataset = Dataset.from_list(pipeline_rows)
print(dataset[:2])  # peek

{'question': ['What is RAG?', 'Why is RAG useful?'], 'answer': ['RAG, or Retrieval-Augmented Generation, is a method that enhances factual accuracy by integrating information retrieval with text generation. It retrieves relevant external documents to provide a solid foundation for generating informed and accurate responses.', 'RAG is useful because it enhances the factual accuracy of responses by grounding answers in external documents. Additionally, it helps mitigate hallucinations in large language models by providing citations from retrieved sources, ensuring that the information presented is reliable and verifiable.'], 'contexts': [['RAG improves factual accuracy by grounding answers in external documents.', 'Retrieval-Augmented Generation (RAG) combines information retrieval and text generation.'], ['RAG improves factual accuracy by grounding answers in external documents.', 'RAG mitigates hallucinations in large language models by citing retrieved sources.']], 'ground_truth': ['R
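The `retrieve` function in the cell above delegates the nearest-neighbour lookup to `faiss.IndexFlatL2`, which performs an exact, exhaustive search and reports squared Euclidean distances. As a rough illustration of what that search computes, here is a pure-Python sketch on toy 2-D vectors. The helper names `l2_squared` and `brute_force_search` and the toy vectors are hypothetical. Real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and FAISS is far faster than this loop.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query_vec, doc_vecs, k=2):
    """Return (distances, indices) of the k nearest documents, closest first."""
    scored = sorted((l2_squared(query_vec, v), i) for i, v in enumerate(doc_vecs))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-D "embeddings" (hypothetical stand-ins for real sentence embeddings)
toy_docs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = brute_force_search([0.9, 0.1], toy_docs, k=2)
print(indices)    # [1, 0]
print(distances)  # squared distances, smallest first
```

The returned indices play the role of `I[0]` in `retrieve`: they select which entries of `documents` are passed to the generator as context.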

In [None]:
!pip install langchain-openai



In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,          # LLM-based: needs an LLM
    answer_correctness,    # LLM-based
    context_precision,     # LLM-based by default (an LLM judges chunk relevance)
    context_recall,        # LLM-based by default (an LLM checks ground-truth coverage)
)
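import os
from google.colab import userdata  # Colab-only helper for reading notebook secrets

# The OpenAI-backed LLM and embeddings read the key from the environment
openai_key = userdata.get('OPENAI_API_KEY')
os.environ["OPENAI_API_KEY"] = openai_key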

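# Option A: context metrics only, using the library's default models
# Despite the "Non-LLM" label used below, context_precision and context_recall
# are LLM-based. With llm=None, evaluate() falls back to its default OpenAI
# model, so this call still needs OPENAI_API_KEY and network access.
non_llm_results = evaluate(
    dataset,
    metrics=[context_precision, context_recall],
    llm=None,  # None -> fall back to the default LLM
    embeddings=None,  # None -> fall back to the default embeddings if a metric needs them
)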
print("Non-LLM results:", non_llm_results)

# Option B: enable LLM-based scoring (faithfulness, answer_correctness)
# You can pass model configs in newer RAGAS versions via run config / LLM adapters.
# Minimal example with OpenAI:
try:
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    if USE_OPENAI:
        llm = ChatOpenAI(model="gpt-4o-mini")
        emb = OpenAIEmbeddings(model="text-embedding-3-small")
        llm_results = evaluate(
            dataset,
            metrics=[faithfulness, answer_correctness, context_precision, context_recall],
            llm=llm,
            embeddings=emb,
        )
        print("LLM-based results:", llm_results)
except Exception as e:
    print("LLM-based scoring unavailable or misconfigured:", e)

Non-LLM results: {'context_precision': 0.8333, 'context_recall': 1.0000}


LLM-based results: {'faithfulness': 0.7222, 'answer_correctness': 0.7848, 'context_precision': 0.8333, 'context_recall': 1.0000}
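To help interpret the faithfulness and answer correctness numbers printed above, the following sketch reproduces the arithmetic these metrics are usually described with. Faithfulness is the share of answer claims supported by the context. Answer correctness is a weighted blend of a statement-level F1 score and a semantic similarity score. The 0.75/0.25 weights are an assumption modeled on the library's documented defaults, and the claim labels, counts, and similarity value are made up for illustration. In the library, an LLM extracts and judges the claims and an embedding model supplies the similarity.

```python
def faithfulness_score(claim_supported):
    """Share of answer claims that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def factual_f1(tp, fp, fn):
    """F1 over statements: tp = correct, fp = unsupported extras, fn = missing facts."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def answer_correctness_score(tp, fp, fn, semantic_similarity, weights=(0.75, 0.25)):
    """Weighted blend of factual F1 and semantic similarity (weights are assumed)."""
    w_fact, w_sim = weights
    blended = w_fact * factual_f1(tp, fp, fn) + w_sim * semantic_similarity
    return blended / (w_fact + w_sim)


# Hypothetical judgments for a single answer
print(faithfulness_score([True, True, False, True]))                       # 3 of 4 claims supported
print(answer_correctness_score(tp=2, fp=1, fn=1, semantic_similarity=0.9))
```

A faithfulness score below 1.0 therefore means at least one claim in some answer was judged unsupported by its retrieved context, even when context recall is perfect.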
