# RAGAS Framework
We will use the RAGAS (RAG Assessment) framework to explore the concept of 
LLM-judge-based metrics.
For more information about this framework, see https://docs.ragas.io/en/v0.3.0/getstarted/ (the information in this worksheet is taken from there).

First, we will look at some of the available metrics you can use to evaluate your RAG system.

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "TODO"  # TODO: replace the placeholder with your OpenAI API key

In [None]:
# the first thing we need to do is link the LLM we want to use to the RAGAS framework

from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
import openai
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
openai_client = openai.OpenAI()

## Context Precision

Context Precision is a metric that evaluates the retriever's ability to rank relevant chunks higher than irrelevant ones for a given query in the retrieved context. Specifically, it assesses the degree to which relevant chunks in the retrieved context are placed at the top of the ranking.

It is calculated as the mean of precision@k over the ranks k at which a relevant chunk appears. Precision@k is the ratio of the number of relevant chunks in the top k to the total number of chunks in the top k.


<div>
<img src="context_precision_formula.png" width="700"/>
</div>
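
The calculation can be sketched in plain Python. This is a minimal illustration, not the RAGAS implementation: it assumes the formula above only counts precision@k at ranks holding a relevant chunk, and the 0/1 relevance labels (which RAGAS obtains from the LLM judge) are supplied by hand. The function name `context_precision` is hypothetical.

```python
def context_precision(relevance):
    """Rank-weighted precision over a ranked list of 0/1 relevance labels."""
    hits = 0            # relevant chunks seen so far
    weighted_sum = 0.0  # sum of precision@k at the relevant ranks
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / k  # precision@k, counted only at relevant ranks
    # Normalise by the number of relevant chunks; 0.0 if none were relevant
    return weighted_sum / hits if hits else 0.0


print(context_precision([0, 1]))  # relevant chunk ranked second -> 0.5
print(context_precision([1, 0]))  # relevant chunk ranked first  -> 1.0
```

The same two chunks give a different score depending on their order, which is the ranking sensitivity this metric is meant to capture.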

The reason we care about position in this metric is that items closer to the front of the context window tend to be given the "most importance" by the LLM. For more information on this, see https://arxiv.org/abs/2307.03172

In [None]:
#TODO: run this cell, play around with the inputs to understand how the metric is affected
# add more retrieved contexts (both relevant and irrelevant) to see how the score changes

from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference

context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

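sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    # this metric judges each retrieved context against the response, so the response should not be empty
    response="The Eiffel Tower is located in Paris.",
    retrieved_contexts=[
        "The Statue of Liberty is in New York.",
        "The Eiffel Tower is located in Paris.",
    ],
)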


await context_precision.single_turn_ascore(sample)

## Context Recall

Context Recall measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out.


<div>
<img src="context_recall_formula.png" width="700"/>
</div>

LLM-based recall calculations are very expensive, as they require evaluating whether each piece of the embedded data is relevant to the question at hand.


Within RAGAS, recall is computed in a less obvious way. The source information is passed to the metric calculator (the `reference` argument) and broken up into a number of claims. Recall is then the number of those claims that can be attributed to the retrieved context, divided by the total number of claims in the reference.

<div>
<img src="context_recall_ragas_formula.png" width="700"/>
</div>
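
As a rough sketch of that ratio: the claims and their supported/unsupported labels below are assigned by hand for illustration, whereas RAGAS asks the LLM judge to extract and label them, so real scores may differ. The function name `context_recall` is hypothetical.

```python
def context_recall(claim_supported):
    """Fraction of reference claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported.values()) / len(claim_supported)


# Hand-labelled claims: True means the retrieved context supports the claim
reference_claims = {
    "The barno tower is located in Brussels.": True,
    "The Statue of Liberty is in New York.": False,
    "The barno tower is not far from the Grand Place.": False,
}

print(context_recall(reference_claims))  # one of three claims is supported
```

Note that irrelevant claims in the reference (such as the Statue of Liberty one) lower the score, so the reference should contain only information needed to answer the question.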

In [None]:
#TODO: run this cell, play around with the inputs to understand how the metric is affected

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import LLMContextRecall

sample = SingleTurnSample(
    user_input="where is the barno tower located?",
    response="",
    reference="The barno tower is located in Brussels. the statue of liberty is in new york, barno tower is not very far from the grand place",
    retrieved_contexts=["brussels is home to the barno tower"], 
)

context_recall = LLMContextRecall(llm=evaluator_llm)
await context_recall.single_turn_ascore(sample)

## Response Relevancy
The ResponseRelevancy metric measures how relevant a response is to the user input. Higher scores indicate better alignment with the user input, while lower scores are given if the response is incomplete or includes redundant information.
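
As we understand the RAGAS documentation, the score is obtained by having the LLM generate a few questions from the response and then averaging the cosine similarity between the embedding of each generated question and the embedding of the user input. The sketch below illustrates only that averaging step. The two-dimensional vectors are toy values standing in for real embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def response_relevancy(user_input_vec, generated_question_vecs):
    """Mean cosine similarity between the user input and the generated questions."""
    sims = [cosine_similarity(user_input_vec, q) for q in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0


# Toy embeddings: one generated question matches the user input, one is unrelated
user_input_vec = [1.0, 0.0]
generated_question_vecs = [[1.0, 0.0], [0.0, 1.0]]
print(response_relevancy(user_input_vec, generated_question_vecs))  # 0.5
```

This is why the metric needs both an LLM and an embeddings model, and why a response padded with unrelated information scores lower: the questions generated from it drift away from the original user input.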


In [None]:
#TODO: run this cell, play around with the inputs to understand how the metric is affected
from ragas import SingleTurnSample 
from ragas.metrics import ResponseRelevancy

sample = SingleTurnSample(
    user_input="where is paris",
    response="paris is in france",
    retrieved_contexts=[],
)

scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=embeddings)
await scorer.single_turn_ascore(sample)

## Faithfulness

The Faithfulness metric measures how factually consistent a response is with the retrieved context. It ranges from 0 to 1, with higher scores indicating better consistency.

<div>
<img src="faithfullness_formula.png" width="700"/>
</div>
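
A minimal sketch of the ratio, assuming the formula above is the number of statements in the response that the retrieved context supports divided by the total number of statements in the response. In RAGAS the statements are extracted and checked by the LLM judge. Here the verdicts are hypothetical, hand-assigned values, and the function name `faithfulness_score` is ours.

```python
def faithfulness_score(statement_supported):
    """Share of response statements that the retrieved context supports."""
    if not statement_supported:
        return 0.0
    return sum(statement_supported) / len(statement_supported)


# Hypothetical verdicts for a response split into two statements:
# the first is backed by the retrieved context, the second is not
verdicts = [True, False]
print(faithfulness_score(verdicts))  # 0.5
```

A statement can be true in the real world and still count as unsupported: the metric only checks the response against the retrieved context.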



In [None]:
#TODO: run this cell, play around with the inputs to understand how the metric is affected

from ragas.dataset_schema import SingleTurnSample 
from ragas.metrics import Faithfulness

sample = SingleTurnSample(
    user_input="",
    response="brussels is the capital of belgium",
    retrieved_contexts=[
        "capital cities of europe include paris, berlin, madrid, rome and brussels. brussels is in belgium",
    ],
)
scorer = Faithfulness(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)

In [None]:
# TODO (optional, simple): explore additional assessment metrics available in RAGAS
# see https://docs.ragas.io/en/v0.3.0/concepts/metrics/available_metrics/