# Import

# RAG Evaluation with RAGAS Metrics

This notebook demonstrates how to evaluate a RAG (Retrieval-Augmented Generation) pipeline using RAGAS metrics.

## Overview

**Purpose**: Systematically evaluate RAG pipeline quality using LangSmith datasets and RAGAS evaluation metrics.

**Key Components**:
1. **LangSmith Dataset Integration**: Load evaluation datasets with reference questions and expected contexts
2. **RAG Pipeline**: Complete retrieval and generation pipeline using Qdrant and OpenAI
3. **RAGAS Metrics**: Evaluate using Faithfulness, Answer Relevancy, and ID-Based Context Precision

## RAGAS Metrics Implemented

- **Faithfulness**: Measures whether the claims in the generated answer are supported by the retrieved context; a low score signals hallucination
- **Answer Relevancy**: Measures how directly the answer addresses the user's question, based on embedding similarity between the original question and questions generated from the answer
- **ID-Based Context Precision**: Compares retrieved product IDs against reference IDs to measure retrieval accuracy

## Requirements

- LangSmith API key for dataset access
- OpenAI API key for LLM and embeddings
- Qdrant running locally on port 6333 with `Amazon-items-collection-00`
- Environment variables loaded from `.env` file
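
A quick pre-flight check can catch missing credentials before any API call fails. The sketch below assumes the conventional variable names `OPENAI_API_KEY` and `LANGSMITH_API_KEY`; the notebook does not name them, so adjust the tuple to match your `.env` file. `missing_env_vars` is a hypothetical helper, not part of any library.

```python
import os

# Assumed variable names; the notebook itself does not list them.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "LANGSMITH_API_KEY")


def missing_env_vars(required=REQUIRED_ENV_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling `missing_env_vars()` after the `.env` file has been loaded should return an empty list when every key is present.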

In [None]:
import os
import openai

from langsmith import Client
from qdrant_client import QdrantClient

from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

from dotenv import load_dotenv
from pathlib import Path

# Find and load .env file from project root (searches parent directories)
current_dir = Path.cwd()
env_path = current_dir / ".env"

# Search parent directories for .env
while not env_path.exists() and env_path.parent != env_path.parent.parent:
    current_dir = current_dir.parent
    env_path = current_dir / ".env"

if env_path.exists():
    load_dotenv(dotenv_path=str(env_path))
else:
    raise FileNotFoundError("Could not find .env file in current or parent directories")

# Download an example reference data point from LangSmith

In [None]:
client = Client()

In [None]:
dataset = client.read_dataset(
    dataset_name="rag-evaluation-dataset"
)

In [None]:
dataset

In [None]:
client.list_examples(dataset_id=dataset.id, limit=10)

In [None]:
list(client.list_examples(dataset_id=dataset.id, limit=10))

In [None]:
list(client.list_examples(dataset_id=dataset.id, limit=10))[0].outputs

In [None]:
list(client.list_examples(dataset_id=dataset.id, limit=10))[0].inputs


In [None]:
# Fetch the first example once and reuse it for both inputs and outputs
first_example = list(client.list_examples(dataset_id=dataset.id, limit=10))[0]
reference_input = first_example.inputs
reference_output = first_example.outputs

# RAG Pipeline

In [None]:
import openai
from qdrant_client import QdrantClient


def get_embedding(text, model="text-embedding-3-small"):
    response = openai.embeddings.create(
        input=text,
        model=model,
    )
    
    return response.data[0].embedding


def retrieve_data(query, qdrant_client, k=5):
    
    query_embedding = get_embedding(query)
    
    results = qdrant_client.query_points(
        collection_name="Amazon-items-collection-00",
        query=query_embedding,
        limit=k,
    )
    
    retrieved_context_ids = []
    retrieved_context = []
    similarity_scores = []
    
    for point in results.points:
        retrieved_context_ids.append(point.payload["parent_asin"])
        retrieved_context.append(point.payload["description"])
        similarity_scores.append(point.score)
    
    return {
        "retrieved_context_ids": retrieved_context_ids,
        "retrieved_context": retrieved_context,
        "similarity_scores": similarity_scores
    }


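def process_context(retrieved_context):
    """Format the retrieved items as a bulleted list for the prompt."""
    context_ids = retrieved_context["retrieved_context_ids"]
    descriptions = retrieved_context["retrieved_context"]
    # retrieve_data() does not return "ratings", so this falls back to None for every item
    ratings = retrieved_context.get("ratings", [None] * len(context_ids))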
    
    formatted_context = []
    for asin, description, rating in zip(context_ids, descriptions, ratings):
        if rating:
            formatted_context.append(f"- ID: {asin}, rating: {rating}, description: {description}")
        else:
            formatted_context.append(f"- ID: {asin}, description: {description}")
    
    return "\n".join(formatted_context)


def build_prompt(preprocessed_context, question):
    prompt = f"""
You are a shopping assistant that can answer questions about the products in stock.

You will be given a question and a list of context.

Instructions:
- You need to answer the question based on the provided context only.
- Never use the word 'context'; refer to it as the available products.

Context:
{preprocessed_context}

Question:
{question}
"""
    
    return prompt


def generate_answer(prompt):
    
    response = openai.chat.completions.create(
        model="gpt-5-nano",
        messages=[{"role": "system", "content": prompt}],
        reasoning_effort="minimal"
    )
    
    return response.choices[0].message.content


def rag_pipeline(question, top_k=5):
    
    qdrant_client = QdrantClient(url="http://localhost:6333")
    
    retrieved_context = retrieve_data(question, qdrant_client, top_k)
    preprocessed_context = process_context(retrieved_context)
    prompt = build_prompt(preprocessed_context, question)
    answer = generate_answer(prompt)
    
    final_result = {
        "answer": answer,
        "question": question,
        "retrieved_context_ids": retrieved_context["retrieved_context_ids"],
        "retrieved_context": retrieved_context["retrieved_context"],
        "similarity_scores": retrieved_context["similarity_scores"]
    }
    
    return final_result

In [None]:
rag_pipeline("Can I get some charger?", top_k=5)

# Implement RAGAS Metrics

RAGAS (RAG Assessment) provides specialized metrics for evaluating RAG systems.

## Setup Notes

**RAGAS API Evolution**: The RAGAS API has changed significantly across releases. The code in this notebook targets the newer API:
- Modern API uses `llm_factory()` and requires explicit OpenAI client instances
- Embeddings require `LangchainEmbeddingsWrapper` for compatibility with `AnswerRelevancy` metric
- `single_turn_ascore()` is the async method for scoring metrics
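
The cells in this notebook call `await` at the top level, which works because Jupyter already runs an event loop. In a plain Python script the same coroutine would need to be driven with `asyncio.run()`. A minimal sketch, using a hypothetical `StubScorer` in place of a real RAGAS metric so that it runs without API keys:

```python
import asyncio


class StubScorer:
    """Hypothetical stand-in for a RAGAS metric; not part of the library."""

    async def single_turn_ascore(self, sample):
        await asyncio.sleep(0)  # yield control, as a real network call would
        return 1.0 if sample.get("response") else 0.0


def score_in_script(sample):
    """Run the async scorer from synchronous code, as a script would."""
    return asyncio.run(StubScorer().single_turn_ascore(sample))
```

Inside a notebook, keep using `await` directly; `asyncio.run()` raises an error when an event loop is already running.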

**Why LangchainEmbeddingsWrapper?**
- The `AnswerRelevancy` metric requires embeddings with `embed_query()` and `embed_documents()` methods
- RAGAS's native `OpenAIEmbeddings` uses `embed_text()` and `embed_texts()` instead
- LangChain's wrapper provides the expected interface
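
The interface mismatch described above is a classic case for an adapter. The sketch below is only an illustration of the idea, not how `LangchainEmbeddingsWrapper` is implemented; `NativeStyleEmbeddings` and `QueryDocumentAdapter` are hypothetical classes, and the toy "embedding" is just two character counts.

```python
class NativeStyleEmbeddings:
    """Hypothetical embedder exposing embed_text / embed_texts."""

    def embed_text(self, text):
        # Toy embedding: [length, vowel count]; real embeddings come from a model.
        return [float(len(text)), float(sum(ch in "aeiou" for ch in text.lower()))]

    def embed_texts(self, texts):
        return [self.embed_text(t) for t in texts]


class QueryDocumentAdapter:
    """Expose embed_query / embed_documents on top of a native-style embedder."""

    def __init__(self, inner):
        self.inner = inner

    def embed_query(self, text):
        return self.inner.embed_text(text)

    def embed_documents(self, texts):
        return self.inner.embed_texts(texts)


adapter = QueryDocumentAdapter(NativeStyleEmbeddings())
```

Any object that offers `embed_query()` and `embed_documents()` satisfies the interface the metric expects, which is why the notebook passes LangChain's `OpenAIEmbeddings` through the wrapper.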

In [None]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ContextPrecision, ContextRecall, Faithfulness, AnswerRelevancy, IDBasedContextPrecision

In [None]:
from openai import OpenAI
from ragas.llms import llm_factory
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import OpenAIEmbeddings

openai_client = OpenAI()

ragas_llm = llm_factory("gpt-4o-mini", client=openai_client)
ragas_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

In [None]:
reference_input

In [None]:
reference_output

In [None]:
result = rag_pipeline(reference_input["question"], top_k=5)

In [None]:
result

In [None]:
async def ragas_faithfulness(run, example):
    
    sample = SingleTurnSample(
        user_input=run["question"],
        response=run["answer"],
        retrieved_contexts=run["retrieved_context"]
    )
    
    scorer = Faithfulness(llm=ragas_llm)
    
    return await scorer.single_turn_ascore(sample)

In [None]:
await ragas_faithfulness(result, "")
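
To make the returned number easier to interpret: faithfulness is the share of claims in the answer that the retrieved context supports. The sketch below reproduces only that final ratio. Claim extraction and verification are done by the LLM inside RAGAS and are not shown; the verdict list and the helper name are hypothetical, and returning NaN for an empty list is a choice made for this sketch.

```python
def faithfulness_from_verdicts(verdicts):
    """Fraction of answer claims judged as supported by the retrieved context."""
    if not verdicts:
        return float("nan")  # no claims extracted, so the ratio is undefined
    return sum(1 for supported in verdicts if supported) / len(verdicts)


# Hypothetical verdicts for three claims: two supported, one not.
example_score = faithfulness_from_verdicts([True, True, False])
```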

In [None]:
async def ragas_response_relevancy(run, example):
    
    sample = SingleTurnSample(
        user_input=run["question"],
        response=run["answer"],
        retrieved_contexts=run["retrieved_context"]
    )
    
    scorer = AnswerRelevancy(llm=ragas_llm, embeddings=ragas_embeddings)
    
    return await scorer.single_turn_ascore(sample)

In [None]:
await ragas_response_relevancy(result, "")
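
Answer relevancy is based on embedding similarity: RAGAS asks the LLM to generate questions from the answer, embeds them, and averages their cosine similarity with the original question. A simplified sketch of that last averaging step, with toy 2-D vectors standing in for real embeddings (the helper names are hypothetical, and the library applies further adjustments that are omitted here):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_question_similarity(question_vec, generated_vecs):
    """Average similarity between the original question and each generated question."""
    sims = [cosine_similarity(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)


# One generated question matches the original exactly, the other is orthogonal.
relevancy = mean_question_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```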

In [None]:
async def ragas_context_precision_id_based(run, example):
    
    sample = SingleTurnSample(
        retrieved_context_ids=run["retrieved_context_ids"],
        reference_context_ids=example["reference_context_ids"]
    )
    
    scorer = IDBasedContextPrecision()
    
    return await scorer.single_turn_ascore(sample)

In [None]:
await ragas_context_precision_id_based(result, reference_output)
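
The ID-based score needs no LLM: as far as I understand the metric, it is the fraction of retrieved IDs that also appear in the reference IDs. The sketch below mirrors that idea in plain Python; the helper name and the product IDs are hypothetical, and edge-case handling in the library may differ.

```python
def id_based_precision(retrieved_ids, reference_ids):
    """Share of distinct retrieved IDs that also appear among the reference IDs."""
    retrieved = {str(i) for i in retrieved_ids}
    reference = {str(i) for i in reference_ids}
    if not retrieved:
        return float("nan")  # nothing retrieved, so precision is undefined
    return len(retrieved & reference) / len(retrieved)


# Five retrieved IDs, two of which are in the reference set.
precision = id_based_precision(["A1", "B2", "C3", "D4", "E5"], ["B2", "E5", "Z9"])
```

With `top_k=5`, a score of 0.4 would mean that two of the five retrieved products were among the expected ones.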