# RAGAS Evaluation with Llama Stack - Demo

## Overview

This notebook demonstrates how to use the RAGAS (Retrieval-Augmented Generation Assessment) out-of-tree provider with Llama Stack to evaluate RAG systems.

**Prerequisites:**
- Complete **[`1.dataset_generation.ipynb`](1.dataset_generation.ipynb)** first to generate the evaluation dataset
- Ensure the `rag_evaluation_dataset.jsonl` file exists in the current directory

This notebook will:
1. Load the evaluation dataset generated in the previous notebook
2. Register the dataset with Llama Stack
3. Configure RAGAS evaluation metrics
4. Run evaluation using the RAGAS provider (remote mode via Kubeflow Pipelines)
5. Display and analyze the results

In [None]:
%pip install -r requirements.txt

In [None]:
import os
from datetime import datetime

import pandas as pd
from llama_stack_client import LlamaStackClient
from rich.pretty import pprint

from llama_stack_provider_ragas.constants import PROVIDER_ID_REMOTE

## Llama Stack Client Setup

This step verifies that:
- **Llama Stack is reachable** - The client can successfully connect to the Llama Stack service
- **Required models are available** - Both an inference model (`model_type='llm'`) and an embedding model (`model_type='embedding'`) are registered and accessible

In [None]:
client = LlamaStackClient(base_url="http://lsd-ragas-example-service:8321")
available_models = client.models.list()
assert any(model.model_type == "llm" for model in available_models)
assert any(model.model_type == "embedding" for model in available_models)
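
The bare assertions above fail without saying which model type is missing. As a minimal sketch, a small helper (hypothetical, not part of `llama_stack_client`) can report the missing types by name. It assumes only that each model object exposes a `model_type` attribute, as the cell above does; the stand-in objects below exist purely for illustration.

```python
from types import SimpleNamespace


def missing_model_types(models, required=("llm", "embedding")):
    """Return the required model types that no registered model provides."""
    available = {getattr(m, "model_type", None) for m in models}
    return [t for t in required if t not in available]


# Stand-in objects for illustration; in the notebook, pass client.models.list().
example_models = [SimpleNamespace(identifier="demo-llm", model_type="llm")]
missing = missing_model_types(example_models)
print(missing)  # → ['embedding']
```

In the notebook this could replace the two assertions with a single check such as `assert not missing, f"Missing model types: {missing}"`.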

## Dataset Preparation

Load the RAG evaluation dataset generated by the previous notebook (`1.dataset_generation.ipynb`).

Each line of the JSONL file should be a JSON object with the following fields:
- `question`: The question to evaluate
- `answer`: The generated answer
- `contexts`: List of context strings used for answer generation
- `ground_truth`: The ground truth answer or reference

**Note:** If you haven't run the first notebook yet, please complete it first to generate the `rag_evaluation_dataset.jsonl` file.
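
Checking each record against the structure above before loading can catch a malformed file early. The following is a rough sketch, not part of the provider: `validate_record` and `REQUIRED_FIELDS` are hypothetical names, and the expected types (strings for `question`, `answer`, and `ground_truth`, a list of strings for `contexts`) are assumptions based on the field descriptions.

```python
import json

# Assumed field types, inferred from the field descriptions in this notebook.
REQUIRED_FIELDS = {
    "question": str,
    "answer": str,
    "contexts": list,
    "ground_truth": str,
}


def validate_record(record):
    """Return a list of problems found in one parsed JSONL record."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # Every entry in contexts must itself be a string.
    contexts = record.get("contexts")
    if isinstance(contexts, list) and not all(isinstance(c, str) for c in contexts):
        problems.append("contexts should contain only strings")
    return problems


# A made-up record that lacks ground_truth.
line = (
    '{"question": "What is RAG?", '
    '"answer": "Retrieval-augmented generation.", '
    '"contexts": ["RAG combines retrieval with generation."]}'
)
print(validate_record(json.loads(line)))  # → ['missing field: ground_truth']
```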

In [None]:
import json

jsonl_path = "rag_evaluation_dataset.jsonl"
evaluation_data = []

with open(jsonl_path, "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():  # Skip empty lines
            data = json.loads(line)
            evaluation_data.append({
                "user_input": data["question"],
                "response": data["answer"],
                "retrieved_contexts": data["contexts"],
                "reference": data["ground_truth"],
            })

print(f"Loaded {len(evaluation_data)} evaluation examples from {jsonl_path}")

## Dataset Registration

Register the dataset with Llama Stack's Datasets API by passing the rows inline through a `rows`-type `source`.

In [None]:
# Unregister the dataset if it already exists so this cell can be re-run
dataset_id = "ragas_demo_dataset"
try:
    client.datasets.unregister(dataset_id)
except Exception:
    pass

dataset_response = client.datasets.register(
    dataset_id=dataset_id,
    purpose="eval/question-answer",
    source={"type": "rows", "rows": evaluation_data},
    metadata={
        "provider_id": "localfs",
        "description": "Sample RAG evaluation dataset for Ragas demo",
        "size": len(evaluation_data),
        "format": "ragas",
        "created_at": datetime.now().isoformat(),
    },
)

## Benchmark Registration

A benchmark defines which RAGAS metrics to use for evaluation. Register one or more benchmarks with different providers or metric configurations.

### Available RAGAS Metrics

- **`answer_relevancy`**: Measures how relevant the generated answer is to the question (0-1, higher is better)
- **`context_precision`**: Measures how precise the retrieved contexts are for answering the question
- **`faithfulness`**: Measures how faithful the answer is to the retrieved contexts (checks for hallucinations)
- **`context_recall`**: Measures how much of the ground truth information is covered by the retrieved contexts
- **`answer_correctness`**: Measures how correct the answer is compared to the ground truth (combines semantic similarity and factual accuracy)

**Note:** You can enable multiple metrics by uncommenting them in the `ragas_metrics` list below, which is passed to the benchmark as `scoring_functions`. Additional metrics give a more comprehensive evaluation but may increase computation time.
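
Because metric names are plain strings, a typo only surfaces once the benchmark is registered or run. As a hedged sketch, the selection can be checked locally against the five names listed above. `DOCUMENTED_METRICS` and `check_metrics` are hypothetical names; the provider may accept metrics beyond this list.

```python
# The five metric names documented in this notebook (the provider may support more).
DOCUMENTED_METRICS = (
    "answer_relevancy",
    "context_precision",
    "faithfulness",
    "context_recall",
    "answer_correctness",
)


def check_metrics(selected):
    """Validate metric names and return them de-duplicated, in original order."""
    if not selected:
        raise ValueError("select at least one metric")
    unknown = [m for m in selected if m not in DOCUMENTED_METRICS]
    if unknown:
        raise ValueError(f"unknown metrics: {unknown}")
    # dict.fromkeys keeps the first occurrence of each name and preserves order.
    return list(dict.fromkeys(selected))


print(check_metrics(["answer_relevancy", "faithfulness"]))
# → ['answer_relevancy', 'faithfulness']
```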

In [None]:
# Select which benchmark to use for evaluation
benchmark_id = "ragas_demo_benchmark__remote"

benchmarks_to_register = [
    (benchmark_id, PROVIDER_ID_REMOTE),
]

# Configure which RAGAS metrics to evaluate
# Uncomment additional metrics if needed
ragas_metrics = [
    "answer_relevancy",        # Measures answer relevance to the question
    # "context_precision",     # Measures precision of retrieved contexts
    # "faithfulness",          # Measures faithfulness to contexts
    # "context_recall",        # Measures coverage of ground truth by contexts
    # "answer_correctness",    # TODO: Currently throws error 500 - do not enable
]

# Register benchmarks
for benchmark_id_to_register, provider_id in benchmarks_to_register:
    # Unregister existing benchmark if it exists
    try:
        client.benchmarks.unregister(benchmark_id_to_register)
        print(f"Unregistered existing benchmark: {benchmark_id_to_register}")
    except Exception:
        pass  # Benchmark doesn't exist yet, which is fine
    
    # Register the benchmark
    benchmark_response = client.benchmarks.register(
        benchmark_id=benchmark_id_to_register,
        dataset_id=dataset_id,
        scoring_functions=ragas_metrics,
        provider_id=provider_id,
    )

benchmarks = client.benchmarks.list()
pprint(benchmarks)

## Evaluation Execution

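Run the evaluation with the RAGAS out-of-tree provider in remote mode. The call submits a job and returns a job object whose `job_id` is used to poll for status and retrieve the results.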

In [None]:
remote_job = client.alpha.eval.run_eval(
    benchmark_id=benchmark_id,
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "vllm-inference/qwen3-14b-awq",
            "sampling_params": {"temperature": 0.1, "max_tokens": 100},
        },
        "scoring_params": {},
        # "num_examples": 1,
    },
)
pprint(remote_job)
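
When trying several models or sampling settings, building the `benchmark_config` dictionary in one place keeps its shape consistent. The helper below is a sketch only: `build_benchmark_config` is a hypothetical name, and it mirrors the dictionary used in the cell above, including the optional `num_examples` key that is commented out there.

```python
def build_benchmark_config(model, temperature=0.1, max_tokens=100, num_examples=None):
    """Build a benchmark_config dict shaped like the one used in this notebook."""
    config = {
        "eval_candidate": {
            "type": "model",
            "model": model,
            "sampling_params": {"temperature": temperature, "max_tokens": max_tokens},
        },
        "scoring_params": {},
    }
    # Only include num_examples when a limit is requested.
    if num_examples is not None:
        config["num_examples"] = num_examples
    return config


config = build_benchmark_config("vllm-inference/qwen3-14b-awq", num_examples=1)
print(config["num_examples"])  # → 1
```

The result can be passed as `benchmark_config=config` to `client.alpha.eval.run_eval`.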

## Results Display

In [None]:
import time

# Poll every 10 seconds until the job reaches a terminal state
remote_job_status = client.alpha.eval.jobs.status(
    benchmark_id=benchmark_id, job_id=remote_job.job_id
)
while remote_job_status.status not in ("failed", "completed"):
    print(f"Status: {remote_job_status.status}")
    time.sleep(10)
    remote_job_status = client.alpha.eval.jobs.status(
        benchmark_id=benchmark_id, job_id=remote_job.job_id
    )

# Show the final status before retrieving the results
pprint(remote_job_status)
remote_results = client.alpha.eval.jobs.retrieve(
    benchmark_id=benchmark_id, job_id=remote_job.job_id
)
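
The polling loop above runs until the job reports `failed` or `completed`, so a job that never reaches either state would block the notebook indefinitely. One possible variant, sketched here with a hypothetical `wait_for_job` helper, adds a timeout. The 30-minute default is an arbitrary assumption, and the status source is passed in as a callable so the example runs without a Llama Stack server.

```python
import time


def wait_for_job(get_status, poll_seconds=10.0, timeout_seconds=1800.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal state or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Simulated status sequence for illustration; sleeping is disabled to keep it instant.
statuses = iter(["scheduled", "in_progress", "completed"])
final = wait_for_job(lambda: next(statuses), sleep=lambda _: None)
print(final)  # → completed
```

In the notebook, `get_status` could be `lambda: client.alpha.eval.jobs.status(benchmark_id=benchmark_id, job_id=remote_job.job_id).status`.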

In [None]:
# Print all metrics from remote results in a single table
df = pd.concat(
    {metric: pd.Series([r["score"] for r in result.score_rows])
     for metric, result in remote_results.scores.items()},
    axis=1
)

print(df)

## Summary

🎉 You have successfully completed the RAGAS evaluation demo!

### What You've Accomplished

1. ✅ Loaded the evaluation dataset generated from the SDG Hub RAG Flow
2. ✅ Registered the dataset with Llama Stack
3. ✅ Configured RAGAS evaluation metrics
4. ✅ Ran the evaluation using the RAGAS remote provider via Kubeflow Pipelines
5. ✅ Displayed and analyzed the evaluation results

### Understanding the Results

The RAGAS metrics provide insights into your RAG system's performance:

- **Answer Relevancy**: Measures how relevant the answer is to the question
- **Faithfulness**: Measures how faithful the answer is to the retrieved contexts
- **Context Precision**: Measures how precise the retrieved contexts are
- **Context Recall**: Measures how much of the ground truth is covered by contexts
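
To move from per-row scores to an overall picture, the scores can be aggregated per metric and low-scoring rows flagged for inspection. This is a minimal sketch using only the standard library: `summarize_scores` is a hypothetical helper, the example scores are made up, and the 0.5 threshold is an arbitrary assumption rather than a RAGAS recommendation. Missing or NaN scores are skipped in case a row could not be scored.

```python
from statistics import mean


def _is_valid(score):
    """True for real numbers; False for None and NaN (NaN != NaN)."""
    return score is not None and score == score


def summarize_scores(scores_by_metric, threshold=0.5):
    """Aggregate per-row scores and list row indices scoring below the threshold."""
    summary = {}
    for metric, scores in scores_by_metric.items():
        valid = [s for s in scores if _is_valid(s)]
        summary[metric] = {
            "count": len(valid),
            "mean": mean(valid) if valid else None,
            "min": min(valid) if valid else None,
            "below_threshold": [
                i for i, s in enumerate(scores) if _is_valid(s) and s < threshold
            ],
        }
    return summary


# Made-up scores for illustration only.
example_scores = {"answer_relevancy": [0.9, 0.4, 0.8]}
summary = summarize_scores(example_scores)
print(summary["answer_relevancy"]["below_threshold"])  # → [1]
```

With the results retrieved earlier, the input could be built as `{metric: [r["score"] for r in result.score_rows] for metric, result in remote_results.scores.items()}`.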
