# Tutorial: Guide to LLM Evals by Aparna Dhinakaran
[Link](https://towardsdatascience.com/llm-evals-setup-and-the-metrics-that-matter-2cc27e8e35f3)

In this tutorial, Aparna shows how to evaluate whether an LLM can classify reference texts as relevant or not to a given query, i.e. she is evaluating RAG relevance. This was not obvious to me when I first read the code, so this notebook might be helpful.

What is RAG relevance?
For example, if I ask an LLM how glaciers are formed and we provide it with a reference text that explains how glaciers are formed, then the LLM should classify the reference text as relevant to the query.

In [None]:
from phoenix.experimental.evals import (
   RAG_RELEVANCY_PROMPT_TEMPLATE,
   RAG_RELEVANCY_PROMPT_RAILS_MAP,
   OpenAIModel,
   download_benchmark_dataset,
   llm_classify,
)
import tiktoken
from sklearn.metrics import precision_recall_fscore_support
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

In [None]:
# Download a "golden dataset" built into Phoenix
benchmark_dataset = download_benchmark_dataset(
   task="binary-relevance-classification", dataset_name="wiki_qa-train"
)

In [None]:
# Let's have a look at the benchmark dataset
benchmark_dataset

Okay, so I assume "query_text" is the query for your LLM, "document_text" contains the reference text, and "relevant" is the classification column.

In [None]:
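# Look at the first example: the query and the reference text it is paired with
first_row = benchmark_dataset.iloc[0]
query_text = first_row["query_text"]
document_text = first_row["document_text"]
print(f"Query: {query_text}\nReference: {document_text}")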

## Prompt to evaluate document reference
The library ships with a prompt template that instructs the LLM to check whether a given reference text is relevant to the query provided.

In [None]:
RAG_RELEVANCY_PROMPT_TEMPLATE
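
To make it clearer what "template" means here, below is a minimal sketch of how such a prompt gets filled in for one row. `TOY_TEMPLATE` and `fill_template` are hypothetical names I made up for illustration, and the template text is a shortened stand-in, not the real Phoenix template. The only thing taken from the notebook is the idea that the template has `{input}` and `{reference}` placeholders.

```python
# Hypothetical, shortened stand-in for a relevance prompt template.
# The real Phoenix template is longer; only the placeholder names match.
TOY_TEMPLATE = (
    "[Question]: {input}\n"
    "[Reference text]: {reference}\n"
    'Answer with a single word, either "relevant" or "unrelated".'
)


def fill_template(template: str, row: dict) -> str:
    """Substitute the row's values into the template's named placeholders."""
    return template.format(**row)


# One toy row, using the same keys as the placeholders
row = {
    "input": "How are glaciers formed?",
    "reference": "Glaciers form when snow accumulates and compresses into ice.",
}

prompt = fill_template(TOY_TEMPLATE, row)
print(prompt)
```

This is presumably why the dataframe columns need to match the placeholder names in the template: each row is turned into one prompt.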

Another very simple concept is the "rails". It is just a dictionary mapping the boolean values True and False to the labels "relevant" and "unrelated".

In [None]:
RAG_RELEVANCY_PROMPT_RAILS_MAP
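
As far as I understand it, the point of rails is to force the free-text LLM output into one of the allowed classes. Here is a rough sketch of that idea. `snap_to_rail` and the `NOT_PARSABLE` sentinel are hypothetical names for this illustration, and the matching logic is my assumption; Phoenix's own parsing may work differently.

```python
NOT_PARSABLE = "NOT_PARSABLE"  # hypothetical sentinel used in this sketch


def snap_to_rail(raw_output: str, rails: list) -> str:
    """Map a raw LLM response onto one of the allowed labels (rails).

    Whitespace, surrounding quotes and trailing periods are removed and the
    comparison is case-insensitive. Anything else counts as not parsable.
    """
    cleaned = raw_output.strip().strip(".\"'").lower()
    for rail in rails:
        if cleaned == rail.lower():
            return rail
    return NOT_PARSABLE


toy_rails = ["relevant", "unrelated"]

print(snap_to_rail(' "Relevant".\n', toy_rails))
print(snap_to_rail("unrelated", toy_rails))
print(snap_to_rail("The text is relevant", toy_rails))
```

The first two responses snap to a rail, while the third one is a full sentence and therefore falls through to the sentinel. This is why the prompt insists on a single-word answer.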

## RAG relevance evaluation in action

In [None]:
# For the sake of speed, we'll just sample 2 examples in a repeatable way
# (increase the sample size for more reliable metrics)
benchmark_dataset = benchmark_dataset.sample(2, random_state=2023)
benchmark_dataset = benchmark_dataset.rename(
   columns={
       "query_text": "input",
       "document_text": "reference",
   },
)
# Match the labels in our dataset to what the eval will generate by reusing the
# rails map, so y_true uses exactly the same label strings as the predictions
y_true = benchmark_dataset["relevant"].map(RAG_RELEVANCY_PROMPT_RAILS_MAP)
y_true

In [None]:
benchmark_dataset

In [None]:
# Any general purpose LLM should work here, but it is best practice to keep the temperature at 0
model = OpenAIModel(
   model="gpt-4",
   temperature=0.0,
)

# Rails will define our output classes
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())


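# Assumption: in Phoenix versions that support provide_explanation,
# llm_classify returns a DataFrame with the predicted classes in a "label" column
y_pred = llm_classify(
    dataframe=benchmark_dataset,
    model=model,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=rails,
    provide_explanation=False,
)["label"].tolist()

# Calculate evaluation metrics (by default, one value per class)
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)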
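
To see what these metrics actually measure, here is a small sketch that computes precision, recall and F1 by hand for one class. The helper `precision_recall_f1` and the toy labels are made up for illustration; they are not results from the benchmark. Note that `precision_recall_fscore_support` without an `average` argument returns one value per class, whereas this sketch only looks at the "relevant" class.

```python
def precision_recall_f1(y_true, y_pred, positive="relevant"):
    """Compute precision, recall and F1 for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Made-up labels: 2 true positives, 1 false positive, 1 false negative
toy_true = ["relevant", "relevant", "unrelated", "unrelated", "relevant"]
toy_pred = ["relevant", "unrelated", "unrelated", "relevant", "relevant"]

toy_precision, toy_recall, toy_f1 = precision_recall_f1(toy_true, toy_pred)
print(f"precision={toy_precision:.2f} recall={toy_recall:.2f} f1={toy_f1:.2f}")
```

Precision answers "of everything the LLM called relevant, how much really was?", and recall answers "of everything that really was relevant, how much did the LLM find?".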

In [None]:
model

In [None]:
str(RAG_RELEVANCY_PROMPT_TEMPLATE)

## Changing the prompt might change our evaluation

In [None]:
Template_new = '''
You are comparing a reference text to a question and trying to determine if the reference text
contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference text]: {reference}
    ************
    [END DATA]
Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can answer the Question. Please focus on whether the very specific
question can be answered by the information in the Reference text.
Your response must be single word, either "relevant" or "unrelated",
and should not contain any text or characters aside from that word.
"unrelated" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question.'''