## Debug LLM-based metrics using tracing

During evaluation, each LLM-based metric may make one or more calls to the LLM. The traces of these calls help you understand how a metric arrived at its result and debug any issues.
This notebook demonstrates how to export the LLM traces and analyze them.

## Evaluation
Run a sample evaluation using one of the LLM-based metrics.

In [3]:
from datasets import load_dataset
from ragas import EvaluationDataset
from ragas import evaluate
from ragas.metrics._aspect_critic import AspectCriticWithReference

dataset = load_dataset("explodinggradients/amnesty_qa", "english_v3")


eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])

metric = AspectCriticWithReference(
    name="answer_correctness",
    definition="is the response correct compared to reference",
)

results = evaluate(eval_dataset[:5], metrics=[metric])

Repo card metadata block was not found. Setting CardData to empty.
Evaluating: 100%|██████████| 5/5 [00:02<00:00,  2.12it/s]


## Export LLM traces

In [5]:
results.traces

[{'answer_correctness': 1},
 {'answer_correctness': 0},
 {'answer_correctness': 0},
 {'answer_correctness': 0},
 {'answer_correctness': 0}]

Each of these is a [`MetricTrace`](ragas.callbacks.MetricTrace) object that contains the following fields:
- The input to the prompt
- The output from the model

Both are stored as pydantic objects.
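The nesting described above (one entry per sample, keyed by metric name, then by prompt name, with an input and an output) can be walked with a few loops. The following is a minimal sketch that runs without ragas: `FakeInput`, `FakeOutput`, the `"0_prompt"` key, and the helper `summarize_traces` are hypothetical stand-ins, and the `"input"`/`"output"` key names are assumed from the description in this notebook.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the pydantic input/output objects (not ragas classes).
@dataclass
class FakeInput:
    user_input: str


@dataclass
class FakeOutput:
    verdict: int


# Assumed layout: one dict per sample, keyed by metric name, then by prompt name,
# where each prompt entry holds an "input" and an "output" object.
traces = [
    {"answer_correctness": {"0_prompt": {"input": FakeInput("q1"), "output": FakeOutput(1)}}},
    {"answer_correctness": {"0_prompt": {"input": FakeInput("q2"), "output": FakeOutput(0)}}},
]


def summarize_traces(traces):
    """Return one (sample index, metric, prompt, input type, output type) row per LLM call."""
    rows = []
    for index, sample in enumerate(traces):
        for metric_name, prompts in sample.items():
            for prompt_name, call in prompts.items():
                rows.append(
                    (
                        index,
                        metric_name,
                        prompt_name,
                        type(call["input"]).__name__,
                        type(call["output"]).__name__,
                    )
                )
    return rows


for row in summarize_traces(traces):
    print(row)
```

If the real traces follow the same layout, passing `results.traces` to `summarize_traces` should give a quick overview of which prompts each metric called for each sample.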

To view a single trace, select its index and the metric name, then run the cell below.

In [4]:
results.traces[0]["answer_correctness"]

{'0_single_turn_aspect_critic_prompt_with_reference': {'input': AspectCriticInputWithReference(user_input="`user_input`: What are the global implications of the USA Supreme Court ruling on abortion? Answer using `retrieved context`: - In 2022, the USA Supreme Court handed down a decision ruling that overturned 50 years of jurisprudence recognizing a constitutional right to abortion.\n- This decision has had a massive impact: one in three women and girls of reproductive age now live in states where abortion access is either totally or near-totally inaccessible.\n- The states with the most restrictive abortion laws have the weakest maternal health support, higher maternal death rates, and higher child poverty rates.\n- The USA Supreme Court ruling has also had impacts beyond national borders due to the geopolitical and cultural influence wielded by the USA globally and the aid it funds.\n- SRR organizations and activists across the world have expressed fear about the ruling laying the gr

As you can see, the trace uses the name of the prompt as the key and the input and output as the values. Since this example uses `AspectCriticWithReference`, the input and output are the pydantic objects that the metric uses to parse its input and output. You may convert them to dictionaries if needed. For example,

In [6]:
selected_trace = results.traces[0]["answer_correctness"]
selected_trace["0_single_turn_aspect_critic_prompt_with_reference"][
    "input"
].model_dump()

{'user_input': "`user_input`: What are the global implications of the USA Supreme Court ruling on abortion? Answer using `retrieved context`: - In 2022, the USA Supreme Court handed down a decision ruling that overturned 50 years of jurisprudence recognizing a constitutional right to abortion.\n- This decision has had a massive impact: one in three women and girls of reproductive age now live in states where abortion access is either totally or near-totally inaccessible.\n- The states with the most restrictive abortion laws have the weakest maternal health support, higher maternal death rates, and higher child poverty rates.\n- The USA Supreme Court ruling has also had impacts beyond national borders due to the geopolitical and cultural influence wielded by the USA globally and the aid it funds.\n- SRR organizations and activists across the world have expressed fear about the ruling laying the groundwork for anti-abortion legislative and policy attacks in other countries.\n- Advocates 
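To save a whole trace rather than a single input, the same `model_dump()` conversion can be applied recursively before serializing to JSON. This is a minimal sketch under stated assumptions: `FakeModel` is a hypothetical stand-in that only mimics pydantic's `model_dump()`, the field names and the `"0_prompt"` key are made up for illustration, and `to_plain` is a helper introduced here, not part of ragas.

```python
import json


class FakeModel:
    """Hypothetical stand-in exposing model_dump() like a pydantic v2 model."""

    def __init__(self, **fields):
        self._fields = fields

    def model_dump(self):
        return dict(self._fields)


def to_plain(value):
    """Recursively replace any object that has model_dump() with a plain dict."""
    if hasattr(value, "model_dump"):
        return to_plain(value.model_dump())
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value


# Illustrative trace with made-up field values.
trace = {
    "0_prompt": {
        "input": FakeModel(user_input="q", response="a", reference="r"),
        "output": FakeModel(reason="matches", verdict=1),
    }
}

plain = to_plain(trace)
text = json.dumps(plain, indent=2)
print(text)
```

With real traces, `to_plain(selected_trace)` would be the analogous call, assuming every value in the trace is either a pydantic object or already JSON-serializable.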

And that's it. You have now learned how to export and analyze the LLM calls that ragas makes during evaluation.