## 🧠 What is RAGAS?
RAGAS (Retrieval-Augmented Generation Assessment) is an open-source evaluation framework for measuring the quality of RAG (Retrieval-Augmented Generation) systems. It lets developers assess how well a RAG pipeline retrieves relevant information and whether the generated responses are accurate and grounded in that context.

## 📌 Key Features
 * Automatic Evaluation using LLMs (e.g., OpenAI GPT)
 * Multiple Metrics for both retriever and generator components
 * Supports Custom Datasets
 * Integration-friendly with LangChain, LlamaIndex, and other RAG frameworks

## 🔍 Why RAGAS?
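Traditional metrics such as ROUGE and BLEU score n-gram overlap with a reference text, so they penalize valid paraphrases and cannot tell whether an answer is grounded in the retrieved context. That makes them a poor fit for open-ended, generative tasks. RAGAS instead uses large language models as evaluators, approximating human judgment while remaining scalable.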
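To make the limitation of overlap metrics concrete, here is a minimal sketch of a unigram-overlap F1 score. It is a simplified stand-in for ROUGE-1, not the official implementation, and the three sentences are made up for illustration. A correct paraphrase scores low, while a near-copy with one wrong word scores high.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings (simplified ROUGE-1 stand-in)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative sentences (assumptions, not taken from any dataset).
reference = "An LLM evaluates the output of another model"
paraphrase = "A language model grades what a different system produced"
near_copy = "An LLM evaluates the output of another dog"

paraphrase_score = unigram_f1(paraphrase, reference)  # correct meaning, few shared words
near_copy_score = unigram_f1(near_copy, reference)    # wrong meaning, many shared words

print(f"paraphrase: {paraphrase_score:.3f}")
print(f"near copy:  {near_copy_score:.3f}")
```

An LLM-based judge is meant to rank these two candidates the other way around, which is the gap RAGAS tries to close.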

## Prerequisites
### 1. Navigate to the Project Directory
cd llm-sandbox

### 2. Create a Virtual Environment
python3 -m venv llm-env

### 3. Activate the Virtual Environment
source llm-env/bin/activate

### 4. Install Required Libraries
pip install ragas datasets openai pillow ipywidgets notebook ipykernel

### 5. Register the Virtual Environment as a Jupyter Kernel
python -m ipykernel install --user --name=llm-env --display-name "Python (llm-env)"
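Before opening the notebook, it can help to confirm that the packages from step 4 are visible to the interpreter inside `llm-env`. The following is a small sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names as passed to pip in step 4.
REQUIRED = ["ragas", "datasets", "openai", "pillow", "ipywidgets", "notebook", "ipykernel"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any line reports `NOT INSTALLED`, the notebook is probably running on a different kernel than `Python (llm-env)`.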


In [7]:
# Import necessary modules
import os  # For environment variable management
import getpass  # For secure input of API keys

# Import dataset utility from Hugging Face
from datasets import Dataset

# Import main evaluation function from ragas
from ragas import evaluate

# Import evaluation metrics from ragas
from ragas.metrics import (
    faithfulness,
    answer_correctness,
    context_precision,
    context_recall,
)

In [8]:
api_key = getpass.getpass("Please input OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = api_key

Please input OpenAI API key:  ········
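When the notebook is re-run often, prompting every time gets tedious. One possible variation, sketched below, reuses a key that is already exported in the shell and only falls back to the prompt when none is set. The helper `ensure_api_key` is a hypothetical name, not part of ragas or openai.

```python
import getpass
import os


def ensure_api_key(env_var="OPENAI_API_KEY", prompt=getpass.getpass):
    """Return the key from the environment, prompting only if it is not set."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed characters, as in the cell above.
        key = prompt(f"Please input {env_var}: ").strip()
    if not key:
        raise ValueError(f"{env_var} is empty")
    os.environ[env_var] = key
    return key
```

Calling `ensure_api_key()` in a cell would then replace the two lines above. The `prompt` parameter exists only so the function can be exercised without interactive input.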


In [1]:
# Define the evaluation data
data = {
    "question": [
        "What is LLM as a Judge?"
    ],
    "answer": [
        "LLM as a Judge is a concept where a large language model is used to evaluate the output of another model. It combines human-like reasoning with machine efficiency."
    ],
    "contexts": [
        [
            "Evaluating tasks performed by LLMs can be difficult due to their complexity and the diverse criteria involved. Traditional methods like rule-based assessment or similarity metrics (e.g., ROUGE, BLEU) often fall short when applied to the nuanced and varied outputs of LLMs.",
            "For instance, an AI assistant’s answer to a question can be: not grounded in context, repetitive, grammatically incorrect, excessively lengthy, incoherent. The list of criteria goes on. And even if we had a limited list, each of these would be hard to measure.",
            "To overcome this challenge, the concept of \"LLM as a Judge\" employs an LLM to evaluate another's output, combining human-like assessment with machine efficiency."
        ]
    ],
    "ground_truth": [
        "The concept of \"LLM as a Judge\" employs an LLM to evaluate another's output, combining human-like assessment with machine efficiency."
    ]
}

# Convert the dictionary to a Hugging Face Dataset object
dataset = Dataset.from_dict(data)
dataset.to_pandas().head()  # Display the first few rows of the dataset


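Because every column in the dictionary must line up row by row, a quick shape check before building the dataset can catch mistakes early. This is a minimal sketch under the assumption that the four column names used above are the ones required; `validate_eval_data` is a hypothetical helper, not a ragas function.

```python
def validate_eval_data(data):
    """Check column presence, equal lengths, and that contexts is a list of lists of strings."""
    required = ("question", "answer", "contexts", "ground_truth")
    missing = [col for col in required if col not in data]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    lengths = {col: len(data[col]) for col in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns have different lengths: {lengths}")

    # Each row of "contexts" holds the retrieved chunks for one question.
    for i, chunks in enumerate(data["contexts"]):
        if not isinstance(chunks, list) or not all(isinstance(c, str) for c in chunks):
            raise ValueError(f"contexts[{i}] must be a list of strings")

    return lengths["question"]  # number of evaluation rows
```

For the dictionary above, `validate_eval_data(data)` would return `1`, since it holds a single question.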

In [26]:
results = evaluate(dataset, metrics=[
    faithfulness,
    answer_correctness,
    context_precision,
    context_recall,
])

Evaluating: 100%|████████████████████████████████| 4/4 [00:09<00:00,  2.37s/it]


In [27]:
# Print each metric's score for the first (and here, only) sample
print("===== RAGAS Evaluation Results =====")
for metric, score in results.scores[0].items():
    print(f"{metric}: {score:.3f}")

===== RAGAS Evaluation Results =====
faithfulness: 1.000
answer_correctness: 0.838
context_precision: 0.333
context_recall: 1.000
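The `context_precision` of 0.333 looks low next to the other scores, but it is what a rank-weighted precision would give if only the last of the three retrieved chunks were judged relevant. The sketch below follows the definition as I understand it from the RAGAS documentation: average precision@k over the ranks that hold a relevant chunk. The relevance verdicts `[0, 0, 1]` are an assumption, since the per-chunk judgments are not shown in the output above.

```python
def context_precision(relevance):
    """Rank-weighted precision for a list of 0/1 relevance verdicts, in retrieval order."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / k  # precision@k, counted only at relevant ranks
    return score / total_relevant


# Assumed verdicts: only the third chunk supports the ground truth.
print(f"{context_precision([0, 0, 1]):.3f}")
# Same chunks, with the relevant one ranked first.
print(f"{context_precision([1, 0, 0]):.3f}")
```

Under this reading, the score reflects ranking rather than missing information: the relevant chunk was retrieved (hence `context_recall` of 1.000) but placed last.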


In [28]:
results.to_pandas()

| | user_input | retrieved_contexts | response | reference | faithfulness | answer_correctness | context_precision | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | What is LLM as a Judge? | [Evaluating tasks performed by LLMs can be dif... | LLM as a Judge is a concept where a large lang... | The concept of "LLM as a Judge" employs an LLM... | 1.0 | 0.837517 | 0.333333 | 1.0 |
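Once the scores are in tabular form, a common next step is to flag the metrics that fall below some acceptance bar. The sketch below uses the four scores from the row above; the threshold of 0.5 and the helper name `low_scoring_metrics` are illustrative assumptions, not RAGAS defaults.

```python
def low_scoring_metrics(scores, threshold=0.5):
    """Return the names of metrics whose score is below the threshold."""
    return [name for name, value in scores.items() if value < threshold]


# Scores copied from the evaluation output above.
row_scores = {
    "faithfulness": 1.0,
    "answer_correctness": 0.837517,
    "context_precision": 0.333333,
    "context_recall": 1.0,
}

flagged = low_scoring_metrics(row_scores)
print(flagged)
```

Here only `context_precision` is flagged, which points at the retriever's ranking rather than at the generator.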
