# Tutorial: Guide to LLM Evals by Aparna Dhinakaran
[Link](https://towardsdatascience.com/llm-evals-setup-and-the-metrics-that-matter-2cc27e8e35f3)

In this tutorial, Aparna shows how to evaluate whether an LLM can classify reference texts as relevant or irrelevant to a given query, i.e. she is evaluating RAG relevance.
For example, if we ask an LLM how glaciers are formed and provide it with a reference text that explains how glaciers are formed, the LLM should classify that reference text as relevant to the query.

In [None]:
from phoenix.experimental.evals import (
   RAG_RELEVANCY_PROMPT_TEMPLATE,
   RAG_RELEVANCY_PROMPT_RAILS_MAP,
   OpenAIModel,
   download_benchmark_dataset,
   llm_classify,
)
import tiktoken
from sklearn.metrics import precision_recall_fscore_support
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

In [None]:
# Download a "golden dataset" built into Phoenix
benchmark_dataset = download_benchmark_dataset(
   task="binary-relevance-classification", dataset_name="wiki_qa-train"
)

In [None]:
# Let's have a look at the benchmark dataset
benchmark_dataset

Okay, so I assume "query_text" is the query for your LLM, "document_text" contains the reference text, and "relevant" is the ground-truth classification column.
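
To make the role of the "relevant" column concrete, here is a minimal sketch of how LLM judgments could be scored against ground-truth labels. The labels below are invented for illustration and are not taken from the benchmark dataset; `precision_recall_f1` is a hypothetical helper written in plain Python.

```python
# Toy illustration only: these labels are made up, not drawn from wiki_qa-train.
def precision_recall_f1(gold, predicted, positive=True):
    """Compute precision, recall and F1 for a single positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [True, True, False, False, True]       # hypothetical "relevant" column
predicted = [True, False, False, True, True]  # hypothetical LLM judgments

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The notebook imports `precision_recall_fscore_support` from scikit-learn, which computes the same quantities per class, so a hand-rolled helper like this is only useful for seeing what the numbers mean.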

In [None]:
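# Inspect the first example; columns 2 and 3 are assumed to hold the query and the reference text
query_text = benchmark_dataset.iloc[0, 2]
document_text = benchmark_dataset.iloc[0, 3]
print(f"Query: {query_text}\nDocument: {document_text}")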

In [None]:
RAG_RELEVANCY_PROMPT_TEMPLATE

In [None]:
RAG_RELEVANCY_PROMPT_RAILS_MAP.values()
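
The rails are the set of labels the classifier is allowed to return. As a rough sketch of the idea (not Phoenix's actual implementation), free-form model output can be mapped onto one of the allowed labels, with anything ambiguous flagged as unparseable. The rail values, the `UNPARSEABLE` sentinel, and the `snap_to_rails` helper below are all assumptions made for illustration.

```python
import re

UNPARSEABLE = "unparseable"  # hypothetical sentinel for output that matches no single rail


def snap_to_rails(raw_output, rails):
    """Return the one rail found in raw_output, or UNPARSEABLE if zero or several match."""
    text = raw_output.lower()
    # Word boundaries keep "relevant" from matching inside "irrelevant".
    found = {
        rail
        for rail in rails
        if re.search(rf"\b{re.escape(rail.lower())}\b", text)
    }
    return found.pop() if len(found) == 1 else UNPARSEABLE


rails = ["relevant", "irrelevant"]  # assumed rail values, for illustration only

print(snap_to_rails("relevant", rails))
print(snap_to_rails(" Irrelevant.", rails))
print(snap_to_rails("I am not sure", rails))
```

Constraining outputs this way means every prediction is either one of the known labels or explicitly marked as unusable, which keeps the later metric computation well defined.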

In [None]:
# Any general-purpose LLM should work here, but it is best practice to keep the temperature at 0 so the judgments are as repeatable as possible
model = OpenAIModel(
   model="gpt-4",
   temperature=0.0,
)