# Step 3: Define metrics to use
In general, we want to evaluate two things: whether the retrieved context is good, and whether the overall answer is good.

We'll track the following metrics. Some are computed with plain Python, and the rest with the `ragas` library.
## Assessing the context
- `context_correctness`: whether the `ground_truth_context` is included in the list of retrieved contexts
- `ground_truth_context_rank`: 0-based rank of the `ground_truth_context` in the list of retrieved contexts, or -1 if it was not retrieved
- `context_rougel_score`: ROUGE-L score between the ground_truth_context and the top retrieved context
- `context_precision` (with `ragas`): how relevant the retrieved contexts are to the question, assessed with an LLM
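
To make the first three metrics concrete, here is a minimal sketch on toy data. The sentences are invented for illustration, and `simple_rouge_l` is a hypothetical, simplified stand-in (whitespace tokens, F1 over the longest common subsequence), not the implementation that the `evaluate` library uses, so its scores may differ slightly from the real metric.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Row-by-row dynamic programme: prev[j] is the LCS length of the
    # tokens of `a` seen so far (excluding the current one) and b[:j].
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def simple_rouge_l(prediction, reference):
    """Simplified ROUGE-L F1 over lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only
ground_truth = "the cat sat on the mat"
retrieved = ["a dog sat on the mat", "the cat sat on the mat", "birds can fly"]

is_retrieved = ground_truth in retrieved                 # context_correctness
rank = retrieved.index(ground_truth)                     # ground_truth_context_rank (0-based)
top_score = simple_rouge_l(retrieved[0], ground_truth)   # analogue of context_rougel_score

print(is_retrieved, rank, round(top_score, 3))
```

Here the ground truth is retrieved but only in second place, so the ROUGE-L score of the top context measures how much the wrong top hit still overlaps with the right passage.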

## Assessing the answer
- `faithfulness` (with `ragas`): whether the claims in the answer are supported by the retrieved context, assessed with an LLM
- `answer_correctness` (with `ragas`): a weighted combination of the following
    - how semantically similar the answer is to the ground truth, based on cosine similarity of embeddings
    - whether the facts in the answer match the ground truth, assessed with an LLM
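
As a rough illustration of how a combined score like `answer_correctness` can be formed, the sketch below takes a weighted average of an LLM-judged score and an embedding similarity. Everything here is an assumption for illustration: the helper names are hypothetical, the LLM-judged part is modelled as an F1 over matched statements, the numbers are made up, and the 0.75/0.25 weighting is assumed rather than taken from this notebook. Check the `ragas` documentation for the version you use.

```python
import math


def f1_from_counts(tp, fp, fn):
    """F1 from counts of statements judged correct (tp), unsupported (fp) and missing (fn)."""
    if tp == 0:
        return 0.0
    return tp / (tp + 0.5 * (fp + fn))


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_correctness(llm_score, similarity, weights=(0.75, 0.25)):
    """Weighted average of the two components (the weights are an assumption)."""
    w_llm, w_sim = weights
    return (w_llm * llm_score + w_sim * similarity) / (w_llm + w_sim)


# Hypothetical inputs: 2 statements correct, 1 unsupported, 1 missing,
# and two tiny made-up "embedding" vectors.
llm_score = f1_from_counts(tp=2, fp=1, fn=1)
similarity = cosine_similarity([1.0, 0.0], [1.0, 1.0])
score = combined_correctness(llm_score, similarity)

print(round(llm_score, 3), round(similarity, 3), round(score, 3))
```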



In [3]:
import evaluate


def context_correctness(ground_truth_context, contexts):
    """whether the `ground_truth_context` is included in the list of retrieved contexts"""
    return ground_truth_context in contexts


def ground_truth_context_rank(ground_truth_context, contexts):
    """0-based rank of the ground_truth_context in the retrieved contexts, -1 if not found"""
    if ground_truth_context is None:
        return -1
    try:
        return contexts.index(ground_truth_context)
    except ValueError:
        # list.index raises ValueError when the item is not in the list
        return -1


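# Load the ROUGE metric once; reloading it on every call is slow
rouge = evaluate.load("rouge")


def context_rougel_score(ground_truth_context, contexts):
    """ROUGE-L score between the ground_truth_context and the top retrieved context"""
    if len(contexts) == 0:
        # Nothing was retrieved, so there is no overlap to measure
        return 0.0
    scores = rouge.compute(predictions=[contexts[0]], references=[ground_truth_context])
    return scores["rougeL"]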
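
Because `ground_truth_context_rank` uses -1 as a "not found" sentinel, averaging the raw values over many questions would mix misses into the mean. One possible way to summarize the per-question results is sketched below; `summarize_ranks` is a hypothetical helper and the ranks are made-up values, not results from this notebook.

```python
def summarize_ranks(ranks):
    """Summarize per-question ranks, where -1 means the ground truth was not retrieved."""
    hits = [r for r in ranks if r >= 0]
    # Share of questions whose ground-truth context was retrieved at all
    hit_rate = len(hits) / len(ranks) if ranks else 0.0
    # Average 0-based rank, computed only over the questions that were hits
    mean_rank = sum(hits) / len(hits) if hits else None
    return {"hit_rate": hit_rate, "mean_rank_when_found": mean_rank}


# Made-up ranks for four questions; the third one was a miss
ranks = [0, 2, -1, 1]
summary = summarize_ranks(ranks)
print(summary)
```

Reporting the hit rate separately from the mean rank keeps the two failure modes apart: not retrieving the passage at all versus retrieving it too far down the list.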


The `ragas` metrics can all be computed in a single call, as follows:

In [2]:
from datasets import Dataset
from ragas import evaluate as ragas_evaluate
from ragas.metrics import context_precision, faithfulness, answer_correctness

from langchain_openai import ChatOpenAI


def evaluate_w_ragas(item, metrics=None):
    # Avoid a mutable default argument; fall back to the three metrics used in this notebook
    if metrics is None:
        metrics = [context_precision, faithfulness, answer_correctness]
    gpt4_llm = ChatOpenAI(model_name="gpt-4-turbo-preview", temperature=0)

    # Convert the single example (a pandas Series) into a one-row Dataset,
    # which is the input format ragas expects
    row_dataset = Dataset.from_pandas(item.to_frame().T)

    # By default, ragas takes a batch of items and aggregates the metrics across it,
    # so we pass items one by one to get per-item results.
    # Passing all metrics in one call is still faster than one call per metric.
    ragas_eval_results = ragas_evaluate(row_dataset, metrics=metrics, llm=gpt4_llm)
    return ragas_eval_results