# Using Ragas to evaluate RAG pipelines

In this notebook, we will showcase how to use Opik with Ragas for monitoring and evaluation of RAG (Retrieval-Augmented Generation) pipelines.

There are two main ways to use Opik with Ragas:

1. Using Ragas metrics to score traces
2. Using the Ragas `evaluate` function to score a dataset

## Creating an account on Comet.com

[Comet](https://www.comet.com/site?from=llm&utm_source=opik&utm_medium=colab&utm_content=ragas&utm_campaign=opik) provides a hosted version of the Opik platform. [Create an account](https://www.comet.com/signup?from=llm&utm_source=opik&utm_medium=colab&utm_content=ragas&utm_campaign=opik) and grab your API key.

> You can also run the Opik platform locally; see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm&utm_source=opik&utm_medium=colab&utm_content=ragas&utm_campaign=opik) for more information.

In [None]:
%pip install --quiet --upgrade opik ragas nltk openai

In [None]:
import opik

opik.configure(use_local=False)

## Preparing our environment

First, we will configure the OpenAI API key.

In [None]:
import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

## Integrating Opik with Ragas

### Using Ragas metrics to score traces

Ragas provides a set of metrics that can be used to evaluate the quality of a RAG pipeline, including but not limited to: `answer_relevancy`, `answer_similarity`, `answer_correctness`, `context_precision`, `context_recall`, `context_entity_recall`, `summarization_score`. You can find a full list of metrics in the [Ragas documentation](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/).

These metrics can be computed on the fly and logged to traces or spans in Opik. For this example, we will start by creating a simple RAG pipeline and then scoring it using the `answer_relevancy` metric.
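
To build intuition for what `answer_relevancy` measures, here is a simplified sketch. It assumes the metric works roughly as the Ragas documentation describes: an LLM generates questions from the response, and the score is the mean cosine similarity between the embeddings of those questions and the embedding of the original question. The vectors are made-up toy values, and `toy_answer_relevancy` is a hypothetical helper, not part of Ragas.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_answer_relevancy(question_embedding, generated_question_embeddings):
    # Hypothetical helper: mean similarity between the original question and
    # the questions an LLM would generate from the response
    similarities = [
        cosine_similarity(question_embedding, generated)
        for generated in generated_question_embeddings
    ]
    return sum(similarities) / len(similarities)


# Toy embeddings (made-up values standing in for real embedding vectors)
question = [1.0, 0.0]
generated_questions = [[1.0, 0.0], [0.0, 1.0]]

toy_score = toy_answer_relevancy(question, generated_questions)
print(f"Toy answer relevancy: {toy_score:.2f}")
```

A response that leads back to the original question scores close to 1, while an off-topic response pulls the average down.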

#### Create the Ragas metric

To use a Ragas metric without the `evaluate` function, you need to initialize it with an LLM provider and, for metrics such as `answer_relevancy` that rely on embeddings, an embeddings model. For this example, we will use LangChain's OpenAI integrations through the Ragas LangChain wrappers, and then wrap the metric in Opik's `RagasMetricWrapper` with tracking enabled.

We start by initializing the Ragas metric:

In [None]:
# Import the metric
from ragas.metrics import AnswerRelevancy

# Import some additional dependencies
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from opik.evaluation.metrics import RagasMetricWrapper

# Initialize the Ragas metric
llm = LangchainLLMWrapper(ChatOpenAI())
emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

ragas_answer_relevancy = AnswerRelevancy(llm=llm, embeddings=emb)

# Wrap the Ragas metric with RagasMetricWrapper for Opik integration
answer_relevancy_metric = RagasMetricWrapper(
    ragas_answer_relevancy,
    track=True,  # This enables automatic tracing in Opik
)

Once the metric wrapper is set up, you can use it to score a sample question. The `RagasMetricWrapper` handles all the complexity of async execution and Opik integration automatically.
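
To illustrate the kind of async handling involved, the sketch below wraps an async scoring coroutine in a blocking `score` method using `asyncio.run`. `ToyAsyncMetric` and `ToySyncWrapper` are hypothetical stand-ins, not the Ragas or Opik implementation. The code works as written when run as a plain script; inside a notebook an event loop is already running, which is why the event loop is patched with `nest_asyncio` there.

```python
import asyncio


class ToyAsyncMetric:
    # Hypothetical stand-in for a metric that only exposes an async API
    name = "toy_metric"

    async def ascore(self, user_input, response):
        await asyncio.sleep(0)  # placeholder for an awaited LLM call
        return 1.0 if response else 0.0


class ToySyncWrapper:
    # Hypothetical wrapper exposing a blocking score() method
    def __init__(self, metric):
        self.metric = metric
        self.name = metric.name

    def score(self, **kwargs):
        # Drive the coroutine to completion on a fresh event loop
        value = asyncio.run(self.metric.ascore(**kwargs))
        return {"name": self.name, "value": value}


toy_wrapper = ToySyncWrapper(ToyAsyncMetric())
toy_result = toy_wrapper.score(
    user_input="What is the capital of France?", response="Paris"
)
print(toy_result)
```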

In [None]:
# For Jupyter notebook compatibility
# This is needed for async operations in Jupyter notebooks
import nest_asyncio

nest_asyncio.apply()

In [None]:
import os

os.environ["OPIK_PROJECT_NAME"] = "ragas-integration"

# Score a simple example using the RagasMetricWrapper
score_result = answer_relevancy_metric.score(
    user_input="What is the capital of France?",
    response="Paris",
    retrieved_contexts=["Paris is the capital of France.", "Paris is in France."],
)

print(f"Answer Relevancy score: {score_result.value}")
print(f"Metric name: {score_result.name}")

If you now navigate to Opik, you will see that a new trace has been created in the `ragas-integration` project.

#### Score traces

You can score traces by using the `update_current_trace` function.

The advantage of this approach is that the scoring span is added to the trace, allowing for a more fine-grained analysis of the RAG pipeline. However, the Ragas metric is computed synchronously, so this approach might not be suitable for production use cases.

In [None]:
from opik import track, opik_context


@track
def retrieve_contexts(question):
    # Define the retrieval function, in this case we will hard code the contexts
    return ["Paris is the capital of France.", "Paris is in France."]


@track
def answer_question(question, contexts):
    # Define the answer function, in this case we will hard code the answer
    return "Paris"


@track
def rag_pipeline(question):
    # Define the pipeline
    contexts = retrieve_contexts(question)
    answer = answer_question(question, contexts)

    # Score the pipeline using the RagasMetricWrapper
    score_result = answer_relevancy_metric.score(
        user_input=question, response=answer, retrieved_contexts=contexts
    )

    # Add the score to the current trace
    opik_context.update_current_trace(
        feedback_scores=[{"name": score_result.name, "value": score_result.value}]
    )

    return answer


rag_pipeline("What is the capital of France?")
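
If synchronous scoring adds too much latency, one option is to return the answer first and compute the score in a background thread. The sketch below shows only the general pattern with `concurrent.futures`; `toy_score` and `toy_log_score` are hypothetical stubs standing in for the metric call and for whatever mechanism you use to attach the score to the trace afterwards.

```python
from concurrent.futures import ThreadPoolExecutor

logged_scores = []  # stands in for scores attached to traces


def toy_score(question, answer):
    # Hypothetical stub for a slow, LLM-based metric
    return 1.0 if answer else 0.0


def toy_log_score(trace_id, value):
    # Hypothetical stub for logging a feedback score against a trace
    logged_scores.append({"trace_id": trace_id, "value": value})


def score_and_log(trace_id, question, answer):
    toy_log_score(trace_id, toy_score(question, answer))


def toy_pipeline(executor, trace_id, question):
    answer = "Paris"  # hard-coded answer, as in the pipeline above
    # Schedule scoring without blocking the caller
    executor.submit(score_and_log, trace_id, question, answer)
    return answer


with ThreadPoolExecutor(max_workers=2) as executor:
    answer = toy_pipeline(executor, "trace-1", "What is the capital of France?")
# Leaving the with-block waits for pending scoring jobs to finish

print(answer, logged_scores)
```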

#### Evaluating datasets using the Opik `evaluate` function

You can use Ragas metrics with the Opik `evaluate` function. This will compute the metrics on all the rows of the dataset and return a summary of the results.

The `RagasMetricWrapper` can be used directly with the Opik `evaluate` function - no additional wrapper code is needed!

In [None]:
from datasets import load_dataset
import opik


opik_client = opik.Opik()

# Create a small dataset
fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")

# Reformat the dataset to match the schema expected by the Ragas evaluate function
hf_dataset = fiqa_eval["baseline"].select(range(3))
dataset_items = hf_dataset.map(
    lambda x: {
        "user_input": x["question"],
        "reference": x["ground_truths"][0],
        "retrieved_contexts": x["contexts"],
    }
)
dataset = opik_client.get_or_create_dataset("ragas-demo-dataset")
dataset.insert(dataset_items)


# Create an evaluation task
def evaluation_task(x):
    return {
        "user_input": x["question"],
        "response": x["answer"],
        "retrieved_contexts": x["contexts"],
    }


# Use the RagasMetricWrapper directly - no need for custom wrapper!
opik.evaluation.evaluate(
    dataset,
    evaluation_task,
    scoring_metrics=[answer_relevancy_metric],
    task_threads=1,
)
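
As a rough mental model, the evaluation loop calls the task on each dataset item, combines the item with the task output, and passes the result to each metric's `score` method. The sketch below is a simplified, hypothetical illustration of that flow using plain Python stubs. It is not Opik's implementation, and the exact way fields are mapped to metric arguments is an assumption here.

```python
def toy_evaluate(items, task, metrics):
    # Hypothetical, simplified evaluation loop
    rows = []
    for item in items:
        output = task(item)
        # Assumption: task output keys take precedence over dataset item keys
        scoring_inputs = {**item, **output}
        scores = {metric.name: metric.score(**scoring_inputs) for metric in metrics}
        rows.append(scores)
    return rows


class ToyLengthMetric:
    # Hypothetical metric: 1.0 when the response is non-empty, else 0.0
    name = "non_empty_response"

    def score(self, response, **ignored_kwargs):
        return 1.0 if response else 0.0


def toy_task(item):
    # Rename dataset columns to the argument names the metric expects
    return {"user_input": item["question"], "response": item["answer"]}


toy_items = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is the capital of Spain?", "answer": ""},
]
toy_rows = toy_evaluate(toy_items, toy_task, [ToyLengthMetric()])
print(toy_rows)
```

This is why the evaluation task returns keys named `user_input`, `response`, and `retrieved_contexts`: they need to line up with the arguments the metric's `score` method accepts.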

#### Evaluating datasets using the Ragas `evaluate` function

If you are looking to evaluate a dataset, you can use the Ragas `evaluate` function. With this function, Ragas computes the metrics on every row of the dataset and returns a summary of the results.

You can use the `OpikTracer` callback to log the results of the evaluation to the Opik platform:

In [None]:
from datasets import load_dataset
from opik.integrations.langchain import OpikTracer
from ragas.metrics import context_precision, answer_relevancy, faithfulness
from ragas import evaluate

fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")

# Reformat the dataset to match the schema expected by the Ragas evaluate function
dataset = fiqa_eval["baseline"].select(range(3))

dataset = dataset.map(
    lambda x: {
        "user_input": x["question"],
        "reference": x["ground_truths"][0],
        "retrieved_contexts": x["contexts"],
    }
)

opik_tracer_eval = OpikTracer(tags=["ragas_eval"], metadata={"evaluation_run": True})

result = evaluate(
    dataset,
    metrics=[context_precision, faithfulness, answer_relevancy],
    callbacks=[opik_tracer_eval],
)

print(result)
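
To make the idea of a results summary concrete, the sketch below averages per-row scores into one value per metric. It assumes the summary is a per-metric mean and that a metric which fails on a row yields `NaN`, which is skipped. `summarize` is a hypothetical helper and the scores are made-up values, not real evaluation results.

```python
import math


def summarize(rows):
    # Hypothetical helper: average each metric over rows, skipping NaN values
    summary = {}
    metric_names = rows[0].keys() if rows else []
    for name in metric_names:
        values = [row[name] for row in rows if not math.isnan(row[name])]
        summary[name] = sum(values) / len(values) if values else math.nan
    return summary


# Made-up per-row scores, not real evaluation results
toy_scores = [
    {"faithfulness": 1.0, "answer_relevancy": 0.5},
    {"faithfulness": 0.5, "answer_relevancy": math.nan},
    {"faithfulness": 0.0, "answer_relevancy": 1.0},
]
toy_summary = summarize(toy_scores)
print(toy_summary)
```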