# Ragas

Ragas is a library for evaluating Large Language Model (LLM) applications. This notebook applies several of its metrics to sample outputs from a RAG agent and an API agent.

In [None]:
%pip install ragas langchain-openai

In [30]:
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas import SingleTurnSample

In [26]:
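# Assumes AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT are set in the environment.
llm = AzureChatOpenAI(
    deployment_name="gpt-4o",
    api_version="2023-06-01-preview",
)

embeddings = AzureOpenAIEmbeddings(model="ada-002", openai_api_version="2024-06-01")

# Wrap the LangChain objects so Ragas metrics can use them.
evaluator_llm = LangchainLLMWrapper(llm)
evaluator_embeddings = LangchainEmbeddingsWrapper(embeddings)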

In [48]:
# Evaluation samples grouped by agent type; the "sql" and "csv" lists are empty.
samples = {
    "rag": [
        SingleTurnSample(
            user_input="cual es el objetivo del proyecto de Fabian?",
            response="El objetivo del proyecto de Fabián es desarrollar un chatbot especializado que pueda ser entrenado con documentos propietarios de una empresa y que, en base al contenido ingestado, interprete correctamente consultas de usuarios y proporcione respuestas precisas y relevantes.",
            retrieved_contexts=[
                'Plan de proyecto del Trabajo Final\nCarrera de Especializaci´ on en Inteligencia Artificial\nIng. Fabi´ an Alejandro Massotto\n1. Descripci´ on t´ ecnica-conceptual del proyecto a realizar\nEl objetivo de este proyecto es desarrollar un chatbot especializado que pueda ser entrenado\ncon documentos propietarios de una empresa, y que, en base al contenido ingestado, interprete\ncorrectamente consultas de usuarios y proporcione respuestas precisas y relevantes.', 
                'Plan de proyecto del Trabajo Final\nCarrera de Especializaci´ on en Inteligencia Artificial\nIng. Fabi´ an Alejandro Massotto\nActa de constituci´ on del proyecto\nBuenos Aires, 23 de abril de 2024\nPor medio de la presente se acuerda con el Ing. Fabi´ an Alejandro Massotto que su Trabajo\nFinal de la Carrera de Especializaci´ on en Inteligencia Artificial se titular´ a “Desarrollo de un\nchatbot especializado para optimizar la b´ usqueda de informaci´ on en documentos propietarios” y',
                'Plan de proyecto del Trabajo Final\nCarrera de Especializaci´ on en Inteligencia Artificial\nIng. Fabi´ an Alejandro Massotto\nLa propuesta de valor de este proyecto radica en su capacidad para mejorar la eficiencia operativa\nde una empresa. Al facilitar el acceso a la informaci´ on, se reduce el tiempo dedicado a la\nb´ usqueda y se incrementa el tiempo disponible para tareas cr´ ıticas y estrat´ egicas. Adem´ as,\nla capacidad de adecuar el chatbot seg´ un las necesidades y documentos de cada empresa lo'
            ], 
        )
    ],
    "sql": [],
    "csv": [],
    "api": [
        SingleTurnSample(
            user_input="cuantos repositorios tiene el usuario fabimass?",
            response="El usuario fabimass tiene 25 repositorios.",
            retrieved_contexts=[
                """```python
        import requests

        username = 'fabimass'
        url = f'https://api.github.com/users/{username}/repos'
        response = requests.get(url)

        if response.status_code == 200:
            repos = response.json()
            result = len(repos)
        else:
            result = f'Error: {response.status_code}'

        result
        ```""", 
                '25',
            ], 
        )
    ]
}

## [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision)

Context Precision measures how well the relevant chunks in `retrieved_contexts` are ranked: it is the mean of Precision@k taken over the ranks $k$ that hold a relevant chunk. Precision@k is the fraction of the top $k$ chunks that are relevant, so the score is highest when relevant chunks appear before irrelevant ones.

$$
\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left( \text{Precision@k} \times v_k \right)}{\text{Total number of relevant items in the top } K \text{ results}}
$$

$$
\text{Precision@k} = {\text{true positives@k} \over  (\text{true positives@k} + \text{false positives@k})}
$$

Where $K$ is the total number of chunks in `retrieved_contexts` and $v_k \in \{0, 1\}$ is the relevance indicator at rank $k$.
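The two formulas above can be sketched in plain Python. This is only an illustration of the arithmetic, not the Ragas implementation: the function name `context_precision_at_k` and the verdict lists are hypothetical, and returning 0.0 when no chunk is relevant is an assumption made here to avoid dividing by zero.

```python
def context_precision_at_k(relevance):
    """Compute Context Precision@K from a ranked list of 0/1 relevance flags."""
    hits = 0            # relevant chunks seen so far (true positives@k)
    weighted_sum = 0.0  # running sum of Precision@k * v_k
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k
        precision_at_k = hits / k  # TP@k / (TP@k + FP@k); the top k holds k chunks
        weighted_sum += precision_at_k * v_k
    total_relevant = sum(relevance)
    # Assumption: score 0.0 when nothing is relevant, instead of dividing by zero.
    return weighted_sum / total_relevant if total_relevant else 0.0


# Hypothetical verdicts for three retrieved chunks, in rank order.
print(context_precision_at_k([1, 1, 0]))  # relevant chunks ranked first
print(context_precision_at_k([0, 1, 1]))  # same chunks, relevant ones ranked last
```

The first call gives 1.0 and the second roughly 0.583, even though both lists contain two relevant chunks, which shows that the metric rewards placing relevant chunks at the top.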

The following metric uses an LLM to decide whether each retrieved context is relevant.

In [None]:
from ragas.metrics import LLMContextPrecisionWithoutReference

context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

In [38]:
print("RAG agent:")
for sample in samples["rag"]:
    print(await context_precision.single_turn_ascore(sample))

RAG agent:
0.99999999995


In [39]:
print("API agent:")
for sample in samples["api"]:
    print(await context_precision.single_turn_ascore(sample))

API agent:
0.99999999995


## [Response Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_relevance)

The `ResponseRelevancy` metric measures how relevant a response is to the user input. Higher scores indicate better alignment with the user input, while lower scores are given if the response is incomplete or includes redundant information.  

This metric is calculated using the `user_input` and the `response` as follows:  

1. Generate a set of artificial questions (default is 3) based on the response. These questions are designed to reflect the content of the response.  
2. Compute the cosine similarity between the embedding of the user input ($E_o$) and the embedding of each generated question ($E_{g_i}$).  
3. Take the average of these cosine similarity scores to get the **Answer Relevancy**:  

$$
\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \text{cosine similarity}(E_{g_i}, E_o)
$$  

$$
\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\| \|E_o\|}
$$  

Where:  
- $E_{g_i}$: Embedding of the $i^{th}$ generated question.  
- $E_o$: Embedding of the user input.  
- $N$: Number of generated questions (default is 3).  

**Note**: While the score usually falls between 0 and 1, it is not guaranteed due to cosine similarity's mathematical range of -1 to 1.

An answer is considered relevant if it directly and appropriately addresses the original question. This metric focuses on how well the answer matches the intent of the question, without evaluating factual accuracy. It penalizes answers that are incomplete or include unnecessary details.
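The averaging step in the formula above can be sketched as follows. This is a minimal illustration, assuming the embeddings are already available as plain lists of floats; the helper names and the 2-D vectors are hypothetical, and in the real metric the questions come from the LLM and the vectors from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy(question_embeddings, user_input_embedding):
    """Mean cosine similarity between each E_g_i and E_o."""
    sims = [cosine_similarity(e, user_input_embedding) for e in question_embeddings]
    return sum(sims) / len(sims)


# Hypothetical 2-D embeddings: E_o for the user input, E_g for N = 3 generated questions.
E_o = [1.0, 0.0]
E_g = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(answer_relevancy(E_g, E_o))  # similarities 1, 0 and -1 average to 0.0
```

The third vector points in the opposite direction from `E_o`, so its similarity is -1. This is the case the note above refers to: an individual similarity, and therefore the average, can fall below 0.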

In [41]:
from ragas.metrics import ResponseRelevancy

scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)

In [42]:
print("RAG agent:")
for sample in samples["rag"]:
    print(await scorer.single_turn_ascore(sample))

RAG agent:
0.9678979668947822


In [43]:
print("API agent:")
for sample in samples["api"]:
    print(await scorer.single_turn_ascore(sample))

API agent:
0.9826498514154302


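## Topic Adherence

`TopicAdherenceScore` checks whether the assistant stays within the topics listed in `reference_topics` over a multi-turn conversation.

In [81]: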
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage
from ragas.metrics import TopicAdherenceScore

# Two-turn conversation: one question about GitHub and one unrelated question.
sample_input = [
    HumanMessage(content="cuantos repositorios tiene el usuario fabimass?"),
    AIMessage(content="El usuario fabimass tiene 25 repositorios"),
    HumanMessage(content="cual es la capital de Francia?"),
    AIMessage(content="No lo se"),
]

sample = MultiTurnSample(user_input=sample_input, reference_topics=["github"])
scorer = TopicAdherenceScore(llm=evaluator_llm, mode="precision")
await scorer.multi_turn_ascore(sample)

0.9999999999

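## Tool Call Accuracy

`ToolCallAccuracy` compares the tool calls the assistant made during the conversation with the expected calls in `reference_tool_calls`.

In [73]: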
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
from ragas.metrics import ToolCallAccuracy

# Conversation in which the assistant makes two tool calls.
conversation = [
    HumanMessage(content="What's the weather like in New York right now?"),
    AIMessage(content="The current temperature in New York is 75°F and it's partly cloudy.", tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
    HumanMessage(content="Can you translate that to Celsius?"),
    AIMessage(content="Let me convert that to Celsius for you.", tool_calls=[
        ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75})
    ]),
    ToolMessage(content="75°F is approximately 23.9°C."),
    AIMessage(content="75°F is approximately 23.9°C.")
]

sample = MultiTurnSample(
    user_input=conversation,
    reference_tool_calls=[
        # "fabi" does not match the "weather_check" call made in the conversation.
        ToolCall(name="fabi", args={"location": "New York"}),
        ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75})
    ]
)

scorer = ToolCallAccuracy()
await scorer.multi_turn_ascore(sample)

0.0