# Evaluating RAG Pipelines with Ragas
In this notebook, we will explore how one might evaluate the effectiveness of a retrieval-augmented generation (RAG) pipeline using an open source framework called **Ragas**. Ragas seeks to provide feedback on the final outputs of a RAG pipeline across a number of different attributes, including faithfulness, answer relevancy, and more.

Now, for better or worse, this framework uses large language models (LLMs) to perform this evaluation. With a typical tabular data model, we would calculate metrics such as F1 score, ROC AUC, or RMSE. These are mathematically derived, whereas we're doing something taboo here by effectively evaluating one model with another model! Unfortunately, the only other option we have is human evaluation, and that comes with its own upsides and downsides. If you're uncertain about relying on language models to evaluate your product's performance, my encouragement would be to consider a hybrid approach where you supplement the AI evaluation with a human evaluation.

In this notebook, we will cover the following topics:

- **Generating a Testset**: Ideally, you would use your own product's logs to evaluate using Ragas, but you may want to test out different RAG optimizations prior to your product going live. One option for experimenting in this way is to generate a synthetic testset with a language model, and Ragas offers a way to do exactly that.
- **Calculating the Metrics**: In the following section, we will very briefly demonstrate how to actually calculate the Ragas metrics. You'll find that they are very easy to generate from a coding perspective!
- **Explaining the Metrics**: After we have derived the metrics via code, we will do a much deeper dive into what each of these metrics is and how it is scored behind the scenes.

## Notebook Setup
All this work was completed via a Kaggle notebook. As such, you may have to do some minor refactoring of this particular notebook setup if you are working in a different context. Beyond this "Notebook Setup" section, all other code should be the same regardless of your compute environment.

In [None]:
# Setting the Python libraries we will need to install
pip_installs = [
    'langchain',
    'langchain-core',
    'langchain-community',
    'langchain_openai',
    'langchain-text-splitters',
    'ragas'
]

# Performing the Kaggle pip install
from pip_install import perform_pip_install
# perform_pip_install(pip_installs)

In [None]:
# Importing the necessary Python libraries
import os
import pandas as pd
from datasets import Dataset
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_community.document_loaders import DataFrameLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context, conditional
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_relevancy, context_recall, context_entity_recall, answer_similarity, answer_correctness
from ragas.metrics.critique import harmfulness, maliciousness, coherence, correctness, conciseness

from load_api_keys import load_api_keys

In [None]:
# Loading in my personal API keys from Kaggle Secrets
api_keys = load_api_keys()
os.environ['OPENAI_API_KEY'] = api_keys['OPENAI_API_KEY']

In [None]:
# Loading in our knowledge item (KI) dataset as a Pandas DataFrame
df_kis = pd.read_csv('/kaggle/input/synthetic-it-related-knowledge-items/synthetic_knowledge_items.csv')
df_kis = df_kis[['ki_topic', 'ki_text']]

# Creating the document loader around our Pandas DataFrame
ki_doc_loader = DataFrameLoader(data_frame = df_kis, page_content_column = 'ki_text')

# Loading the documents as LangChain documents using the dataframe document loader
ki_docs = ki_doc_loader.load()

## Generating a RAG Testset with Ragas
(Note: In reading the GitHub issues as of June 2024, it does appear that an overhauled version of the Ragas testset generation is coming soon. As a result, you may not have to do the custom bit we do below to simulate the `ground_truth` if you are reading this in the future!)

In a normal use case, you would collect this data as part of your logging process. Specifically, we would look at our logs to determine...

- What was the user's original **question**?
- What **context** was returned from the vectorstore to support the user's question?
- How did the language model provide an **AI answer** based on the context and user's question supplied to it?

Because this is just an experimental context, we're going to synthesize our own RAG testset using the **Ragas** framework. Ragas uses generator and critic LLMs to produce the synthetic dataset. We'll discuss the more specific steps down below, but generally speaking, the **generator LLM** is used to produce content in support of this synthetic dataset while the **critic LLM** ensures that everything generated by the generator LLM is of appropriate quality. (Note: You'll notice I'm using GPT-4o for both the generator and critic LLM. This is because I personally am not concerned about using the same model for both.)

The Ragas synthetic testset framework can generate 4 different kinds of rows, which Ragas refers to as **evolutions**:

- **Simple**: As the name implies, this is a simple, straightforward question generated per the document chunk retrieved from the vectorstore.
- **Multi-Context**: In all other scenarios, we generate a question and AI answer per a single piece of retrieved context. In this scenario, we use an embedding similarity on the originally retrieved context to also fetch one other additional piece of context. We then use these two bits of multiple context (hence, multi-context) to generate the synthetic question and AI answer.
- **Reasoning**: In this scenario, the generator LLM looks to specifically synthesize a question that requires a deeper level of reasoning than what may be typically found in the "simple" scenario.
- **Conditional**: In this scenario, the generator LLM seeks to ensure that the synthesized question has a special condition in it that must be addressed in order to successfully answer the question.

Now that we have a general understanding of what Ragas is trying to do with its test set synthesis, let's walk through the steps of how Ragas synthesizes its dataset.

1. Based on the number of results you want (`test_size`), Ragas will attempt to randomly choose that many document chunks from your provided set. If you request more results than there are chunks available, Ragas will add a weighting score each time a chunk is selected so that the chunk is less likely to be selected the next time. Each synthetic question / AI answer will be generated around this chunk.
    - Note: If you do NOT provide your own chunking strategy for the documents you provide, Ragas will apply a default chunking of 400 character chunks.
2. Before proceeding, the critic LLM checks the randomly selected chunk to see if it's a viable one to be using. (If it is not viable, a new random chunk is selected instead.)
3. Based on the Ragas evolution, the generator LLM will generate a simulated question using the appropriate prompt engineering provided by the Ragas framework.
4. The critic LLM will check the quality of the question generated by the generator LLM in the previous step. (If the question is not of sufficient quality, the generator LLM gives it another shot.)
5. (Optional step) If the Ragas evolution type is "reasoning" or "conditional", the question generated in step 3 is passed back through the generator LLM to compress it in size. (As you can guess, the critic LLM also checks the quality of this compression.)
6. The generator LLM extracts what it feels are the most important pieces of information from the context for answering the question.
7. The generator LLM uses the question generated in step 3 (or optionally step 5) to produce an AI answer, using both the original document chunk as context and the information extracted in step 6.
8. The critic LLM checks the quality of the AI answer generated in step 7. (If the critic LLM determines the answer is not of sufficient quality, the generator LLM is given another chance.)
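The generate-then-critique pattern that recurs in steps 3 through 8 can be sketched in a few lines of plain Python. This is only an illustration of the control flow, not the Ragas implementation: `generate_with_critic`, `toy_generator`, `toy_critic`, and the `max_attempts` parameter are hypothetical names, and the two toy functions stand in for real LLM calls.

```python
def generate_with_critic(generate, critique, max_attempts=3):
    """Call `generate` until `critique` accepts the result or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if critique(candidate):
            return candidate, attempt
    # No candidate passed the critic within the allowed attempts
    return None, max_attempts


# Stand-ins for the generator and critic LLM calls (hypothetical, no API calls)
def toy_generator(attempt):
    drafts = ['Printer?', 'How do I reset a jammed office printer?']
    return drafts[min(attempt, len(drafts)) - 1]


def toy_critic(question):
    # Accept only questions that look reasonably specific
    return len(question.split()) >= 5


question, attempts_used = generate_with_critic(toy_generator, toy_critic)
print(question, attempts_used)
```

Here the first draft is rejected by the toy critic and the second one is accepted, mirroring the "gives it another shot" behavior described above.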

In [None]:
# Setting the generator LLM, critic LLM, and embeddings algorithms
chat_model = ChatOpenAI(model = 'gpt-4o')
embeddings = OpenAIEmbeddings()

# Setting the testset generator with the generator LLM, critic LLM, and embeddings
testset_generator = TestsetGenerator.from_langchain(
    generator_llm = chat_model,
    critic_llm = chat_model,
    embeddings = embeddings
)

In [None]:
# # Generating the testset with our KI documents
# testset = testset_generator.generate_with_langchain_docs(
#     documents = ki_docs,
#     test_size = 10,
#     distributions = {
#         simple: 0.5,
#         reasoning: 0.2,
#         multi_context: 0.2,
#         conditional: 0.1
#     }
# )

# # Transforming the testset to a Pandas DataFrame
# df_testset = testset.to_pandas()

### Simulating Our Own Ground Truth
As of June 2024, the Ragas testset generator is a bit broken. Specifically, it generates rows that include the question, context, and AI answer, but it does **not** include any simulated ground truth. As a result, we're going to have to simulate it ourselves! (Note: It's possible that a future iteration of the Ragas testset generator addresses this appropriately.)

In [None]:
# # Preparing the DataFrame to generate the ground truth
# df_testset.rename(columns = {'ground_truth': 'answer'}, inplace = True)
# df_testset['ground_truth'] = ''

In [None]:
# Creating the ground truth simulation prompt template
GT_SIMULATION_PROMPT = '''You are an expert evaluator for question-answering systems. Your task is to provide the ideal ground truth answer based on the given question and context. Please follow these guidelines:

1. Question: {question}

2. Context: {context}

3. Instructions:
   - Carefully analyze the question and the provided context.
   - Formulate a comprehensive and accurate answer based solely on the information given in the context.
   - Ensure your answer directly addresses the question.
   - Include all relevant information from the context, but do not add any external knowledge.
   - If the context doesn't contain enough information to fully answer the question, state this clearly and provide the best possible partial answer.
   - Use a formal, objective tone.

Remember, your goal is to provide the ideal answer that should be used as the benchmark for evaluating the AI's performance.'''

In [None]:
# Creating the prompt engineering template to generate the simulated ground truth
gt_generation_prompt = ChatPromptTemplate.from_messages(messages = [
    HumanMessagePromptTemplate.from_template(template = GT_SIMULATION_PROMPT)
])

# Instantiating the Llama 3 model via Perplexity
llama_model = ChatOpenAI(api_key = api_keys['PERPLEXITY_API_KEY'],
                         base_url = 'https://api.perplexity.ai',
                         model = 'llama-3-70b-instruct')

# Creating the inference chain to generate the simulated ground truth
gt_generation_chain = gt_generation_prompt | llama_model

In [None]:
def generate_ground_truth_text(row):
    '''
    Generates simulated ground truth text for the provided question and context
    
    Inputs:
        - row (Pandas DataFrame record): A single record from the Pandas DataFrame
        
    Returns:
        - gt_text (str): The ground truth text generated by the AI model per the record
    '''
    
    # Checking to see if the ground truth text has already been generated
    if row['ground_truth'] == '':
        
        # Generating the ground truth text
        gt_text = gt_generation_chain.invoke(
            {
                'question': row['question'],
                'context': row['contexts']
            }
        ).content
        
        return gt_text
    
    else:
        
        # Returning what is already in place if the string is not empty
        return row['ground_truth']

In [None]:
# # Generating the ground truth
# df_testset['ground_truth'] = df_testset.apply(generate_ground_truth_text, axis = 1)
# df_testset.to_csv('/kaggle/working/df_testset.csv', index = False)

### Converting the Pandas DataFrame into a Ragas-Compatible Dataset
Unfortunately, there doesn't seem to be a clean way to convert a typical Pandas DataFrame directly into something compatible with Ragas, so we'll have to write our own bit of code to accomplish this.

In [None]:
# Loading in the testset from CSV
df_testset = pd.read_csv('/kaggle/working/df_testset.csv')

In [None]:
def pandas_to_ragas(df):
    '''
    Converts a Pandas DataFrame into a Ragas-compatible dataset
    
    Inputs:
        - df (Pandas DataFrame): The input DataFrame to be converted
        
    Returns:
        - ragas_testset (Hugging Face Dataset): A Hugging Face dataset compatible with the Ragas framework
    '''
    # Ensure all text columns are strings and handle NaN values
    text_columns = ['question', 'ground_truth', 'answer']
    for col in text_columns:
        df[col] = df[col].fillna('').astype(str)
        
    # Convert 'contexts' to a list of lists
    df['contexts'] = df['contexts'].fillna('').astype(str).apply(lambda x: [x] if x else [])
    
    # Converting the DataFrame to a dictionary
    data_dict = df[['question', 'contexts', 'answer', 'ground_truth']].to_dict('list')
    
    # Loading the dictionary as a Hugging Face dataset
    ragas_testset = Dataset.from_dict(data_dict)
    
    return ragas_testset

In [None]:
# Converting the Pandas DataFrame into a Ragas-compatible Hugging Face dataset
ragas_testset = pandas_to_ragas(df = df_testset)
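One thing to watch for: if the `contexts` column held Python lists before the DataFrame was written to CSV, each cell comes back from `pd.read_csv` as the string form of that list (for example `"['chunk one', 'chunk two']"`), and wrapping it in another list keeps the brackets and quotes inside the context text. The helper below is a minimal sketch of one way to recover the original list. `parse_contexts` is a hypothetical name and is not part of the notebook or of Ragas.

```python
import ast


def parse_contexts(value):
    """Turn a CSV cell back into a list of context strings."""
    if isinstance(value, list):
        return value
    # NaN cells (floats) and empty strings become an empty context list
    if not isinstance(value, str) or not value.strip():
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a Python literal, so treat the cell as one plain-text context
        return [value]
    if isinstance(parsed, (list, tuple)):
        return [str(item) for item in parsed]
    return [value]


print(parse_contexts("['Restart the router.', 'Check the cable.']"))
print(parse_contexts('Restart the router.'))
```

If this fits your data, it could be applied with `df['contexts'].apply(parse_contexts)` in place of the list-wrapping line inside `pandas_to_ragas`.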

## Generating the Ragas Metrics
Now that we have generated the testset, we are ready to calculate the Ragas metrics! You'll see that actually doing this is very easy per the cells below.

There is **one big gotcha** here: I ran into rate limit issues. Specifically, if you try generating all these metrics in one fell swoop (as I have below), the Ragas framework runs the calculations in parallel and asynchronously, meaning a lot of calls to the LLM at once! There are many ways around this; my lazy way around it was to change the model from `gpt-4o` to `gpt-3.5-turbo`, since the latter has much higher rate limits by default.
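Another possible workaround is to score the metrics in small groups with a pause between groups, so that fewer requests are in flight at once. The sketch below shows only the batching logic in plain Python. `batched`, `evaluate_in_batches`, `run_batch`, and `pause_seconds` are hypothetical names, and the demonstration uses metric names and a stand-in function rather than real Ragas calls.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def evaluate_in_batches(metrics, run_batch, batch_size=3, pause_seconds=0.0):
    """Run `run_batch` on each small group of metrics, pausing in between."""
    results = []
    for i, batch in enumerate(batched(metrics, batch_size)):
        if i > 0 and pause_seconds:
            # Give the rate limit window time to recover before the next group
            time.sleep(pause_seconds)
        results.append(run_batch(batch))
    return results


# Demonstration with metric names and a stand-in for the real evaluation call
metric_names = ['faithfulness', 'answer_relevancy', 'context_precision',
                'context_recall', 'answer_correctness']
calls = evaluate_in_batches(metric_names, run_batch=lambda batch: list(batch), batch_size=2)
print(calls)
```

In this notebook, `run_batch` could wrap a call such as `evaluate(dataset = ragas_testset, llm = ..., metrics = batch)`, with the per-batch score tables joined afterwards. That wiring is an assumption on my part and is not shown here.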

In [None]:
# # Generating the Ragas scores
# ragas_scores = evaluate(
#     dataset = ragas_testset,
#     llm = ChatOpenAI(model = 'gpt-3.5-turbo'),
#     metrics = [
#         faithfulness,
#         answer_relevancy,
#         context_precision,
#         context_relevancy,
#         context_recall,
#         context_entity_recall,
#         answer_similarity,
#         answer_correctness,
#         harmfulness,
#         maliciousness,
#         coherence,
#         correctness,
#         conciseness
#     ]
# )
# # Converting the Ragas scores to a Pandas DataFrame
# df_ragas_scores = ragas_scores.to_pandas()
#
# # Saving the Ragas scores to a CSV file
# df_ragas_scores.to_csv('/kaggle/working/ragas_scores.csv', index = False)

In [None]:
# Loading in the Ragas scores from file
df_ragas_scores = pd.read_csv('/kaggle/working/ragas_scores.csv')

## A Deep Dive into Each Ragas Metric
Now that we have generated the Ragas scores in the cells above, we're ready to start making sense of them. At a high level, Ragas supports three different kinds of metrics:

- **Component-Wise Evaluation Metrics**: These metrics are calculated around specific bits of information to home in on a particular understanding. For example, the context recall score is specifically calculated between the retrieved context and ground truth, but it does not use the AI generated answer as part of its calculation.
- **End-to-End Evaluation Metrics**: As the name implies, these specific metrics cover an evaluation of the full end-to-end RAG pipeline.
- **Aspect Critique Metrics**: These are a special kind of metric that looks to raise concerns about a particular subject matter. For example, a few aspect critiques include harmfulness and coherence.

### Component-Wise Evaluation Metrics
Let's delve into the individual component-wise evaluation metrics!

#### Faithfulness
Faithfulness measures **the factual consistency of the AI answer as compared to the context**. It works by first asking the LLM to generate a number of "simpler sentences" based on each individual sentence in the AI answer. Each of these individual "simpler sentences" is compared against the context to determine if the sentence is *faithful* to the context. A faithful sentence is given a score of 1, while an unfaithful sentence is given a score of 0. The final score is then calculated by dividing the number of faithful sentences by the total number of simpler sentences.

<div style="text-align:center;">
    <div style="border: 2px solid black; padding: 10px; display: inline-block">
        $$
        \text{Faithfulness score} = {|\text{Number of claims in the generated answer that can be inferred from given context}| \over |\text{Total number of claims in the generated answer}|}
        $$
    </div>
</div>
<p style="text-align: center; font-style: italic;">LaTeX Representation of Faithfulness Score, from Ragas Documentation</p>

**Total API Calls: 2**
- 1 LLM call to generate the "simpler sentences"
- 1 LLM call to judge the faithfulness of the "simpler sentences" to the context

**Data Required**
- Original question / query
- Retrieved context
- AI generated answer
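To make the arithmetic concrete, here is a minimal sketch of the final scoring step only. It assumes the LLM has already returned one 0/1 verdict per simpler sentence. `faithfulness_score` and the example verdicts are hypothetical and are not taken from the Ragas source.

```python
def faithfulness_score(verdicts):
    """Fraction of simpler sentences judged faithful (1) rather than unfaithful (0)."""
    if not verdicts:
        # No sentences to judge, so the score is undefined
        return float('nan')
    return sum(verdicts) / len(verdicts)


# Four simpler sentences, three of which are supported by the context
verdicts = [1, 1, 0, 1]
print(faithfulness_score(verdicts))  # 0.75
```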

In [None]:
# Displaying the faithfulness score alongside the supporting data
df_ragas_scores[['question', 'contexts', 'answer', 'faithfulness']]

#### Answer Relevancy 
This is perhaps the most complex metric provided by Ragas. What it generally seeks to do is to **determine how relevant an AI answer is to the question**. How it does this is rather complex and is easier to share as a list of steps:

1. The LLM is prompted to generate a new question based on the given AI answer. Immediately after, the LLM is prompted to determine whether or not the AI answer is noncommittal. (A noncommittal AI answer would say something like "I don't know.") A noncommittal AI answer is given a flag of 1; a "committal" (or, non-noncommittal) AI answer is given a flag of 0.
2. The generated question and original question are both embedded using an embedding algorithm.
3. A cosine similarity score is calculated between the embeddings generated above.
4. The cosine similarity score is multiplied by the inverse of the noncommittal flag (that is, 1 minus the flag) to produce the final score. Because the flag is binary, any noncommittal answer automatically gets a final score of 0, whereas any "committal" answer passes its cosine similarity through as the final score.

<div style="text-align:center;">
    <div style="border: 2px solid black; padding: 10px; display: inline-block">
        $$
        \text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} cos(E_{g_i}, E_o)
        $$
        $$
        \text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\|\|E_o\|}
        $$
    </div>
</div>

Where: 

* $E_{g_i}$ is the embedding of the generated question $i$.
* $E_o$ is the embedding of the original question.
* $N$ is the number of generated questions, which is 3 by default.

<p style="text-align: center; font-style: italic;">LaTeX Representation of Answer Relevancy Score, from Ragas Documentation</p>

**Total API Calls: 3**
- 1 LLM call to generate the question based on the answer and noncommittal score
- 1 embedding call to embed the generated question
- 1 embedding call to embed the original question

**Data Required**
- Original question / query
- Retrieved context
- AI generated answer
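The scoring step can be sketched with toy vectors in place of real embeddings. This follows the formula above and assumes that a noncommittal answer zeroes out the score. `cosine_similarity`, `answer_relevancy_score`, and the two-dimensional vectors are hypothetical illustrations rather than Ragas code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_score(generated_embeddings, original_embedding, noncommittal):
    """Mean cosine similarity to the original question, zeroed if noncommittal."""
    sims = [cosine_similarity(e, original_embedding) for e in generated_embeddings]
    mean_similarity = sum(sims) / len(sims)
    return mean_similarity * (0 if noncommittal else 1)


# Toy embeddings: one original question and N = 3 generated questions
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(answer_relevancy_score(generated, original, noncommittal=False))
print(answer_relevancy_score(generated, original, noncommittal=True))
```

The first call averages similarities of 1, 0, and roughly 0.707, while the second returns 0 because the answer was flagged as noncommittal.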

In [None]:
# Displaying the answer relevancy score alongside the supporting data
df_ragas_scores[['answer', 'contexts', 'answer_relevancy']]

#### Context Recall
Context recall is a metric by which the context is compared to the ground truth. It is very much akin to the recall score we are familiar with from general statistics. It works by analyzing each statement in the ground truth answer and determining if that statement can properly be attributed to the context. The final score is derived by taking the total number of ground truth statements that can be attributed to the context divided by the total number of ground truth statements.

<div style="text-align:center;">
    <div style="border: 2px solid black; padding: 10px; display: inline-block">
        $$
            \text{context recall} = {|\text{GT sentences that can be attributed to context}| \over |\text{Number of sentences in GT}|}
        $$
    </div>
</div>
<p style="text-align: center; font-style: italic;">LaTeX Representation of Context Recall Score, from Ragas Documentation</p>

**Total API Calls: 1**
- 1 LLM call to determine if each sentence in the ground truth can be properly attributed to the context

**Data Required**
- Original question / query
- Retrieved context
- After-the-matter ground truth

In [None]:
# Displaying the context recall score alongside the supporting data
df_ragas_scores[['question', 'contexts', 'ground_truth', 'context_recall']]

#### Context Precision
Context precision is very similar to precision as we know it in traditional statistics: **it looks to evaluate how useful the provided context was in deriving the AI answer per the user question.** As such, we do not need the ground truth to generate this particular metric. How this particular metric is calculated is a bit complex, so let's break it down step-by-step:

1. The LLM is prompted using the question, AI answer, and context to determine how helpful each piece of context is. The verdict an LLM can give is a 0 or 1, where 1 indicates that the context is helpful and 0 indicates the context is not helpful.
2. The denominator of the final score is simply the sum of the "1" verdicts derived in the previous step.
3. To generate the numerator for the final score, we iterate through each verdict in ranked order. (Note: This is only relevant in a scenario where we provide multiple pieces of context.) For each position, we calculate the precision at that position and multiply it by the verdict (0 or 1) at that position. The numerator is the sum of these products across all verdicts.
4. The final score is returned by taking the numerator in step 3 and dividing it by the denominator in step 2.

<br>

<div style="text-align:center;">
    <div style="border: 2px solid black; padding: 10px; display: inline-block">
        $$
            \text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left( \text{Precision@k} \times v_k \right)}{\text{Total number of relevant items in the top } K \text{ results}}
        $$
        <br>
        $$
            \text{Precision@k} = {\text{true positives@k} \over  (\text{true positives@k} + \text{false positives@k})}
        $$
    </div>
</div>
Where $K$ is the total number of chunks in `contexts` and $v_k \in \{0, 1\}$ is the relevance indicator at rank $k$.
<br>
<p style="text-align: center; font-style: italic;">LaTeX Representation of Context Precision Score, from Ragas Documentation</p>



**Total API Calls: 1**
- 1 LLM call to determine if each piece of context provided is helpful in deriving the AI answer per the original user's question

**Data Required**
- Original question / query
- Retrieved context
- AI answer
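A small worked example helps show why the ordering of the context chunks matters. The sketch below follows the Context Precision@K formula above, taking one 0/1 verdict per ranked chunk. `context_precision_score` is a hypothetical helper and not the Ragas implementation.

```python
def context_precision_score(verdicts):
    """Average of Precision@k over the ranks k where the chunk was judged helpful."""
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        # No helpful chunks at all
        return 0.0
    numerator = 0.0
    hits = 0
    for k, verdict in enumerate(verdicts, start=1):
        hits += verdict
        precision_at_k = hits / k  # helpful chunks so far divided by chunks seen
        numerator += precision_at_k * verdict
    return numerator / relevant_total


# The same two helpful chunks, ranked differently
print(context_precision_score([1, 0, 1]))  # (1/1 + 2/3) / 2
print(context_precision_score([0, 1, 1]))  # (1/2 + 2/3) / 2
```

Both lists contain two helpful chunks out of three, but the first ordering scores higher because a helpful chunk appears at the top of the ranking.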


In [None]:
# Displaying the context precision score alongside the supporting data
df_ragas_scores[['question', 'contexts', 'answer', 'context_precision']]

#### Context Relevancy

*Note: According to the Ragas source code, this particular metric is set to be deprecated in favor of using the Context Precision score, which we covered above.*

This particular metric is relatively similar to the context precision score, although they are calculated slightly differently. This metric is calculated more simply by **iterating over each sentence in the context and determining if that sentence is helpful and relevant toward answering the question**. A relevant context sentence is scored as 1 whereas an irrelevant context sentence is scored as 0. The final context relevancy score is then derived by dividing the total number of relevant context sentences (written as $|S|$ in the formula below) by the total number of sentences in the context.

<br>

<div style="text-align:center;">
    <div style="border: 2px solid black; padding: 10px; display: inline-block">
        $$
            \text{context relevancy} = {|S| \over |\text{Total number of sentences in retrieved context}|}
        $$
    </div>
</div>
<p style="text-align: center; font-style: italic;">LaTeX Representation of Context Relevancy Score, from Ragas Documentation</p>



**Total API Calls: 1**
- 1 LLM call to determine if each sentence in the context is helpful / relevant in determining the AI answer per the user's original question

**Data Required**
- Original question / query
- Retrieved context
- AI answer


In [None]:
# Displaying the context relevancy score alongside the supporting data
df_ragas_scores[['question', 'contexts', 'answer', 'context_relevancy']]

#### Context Entity Recall
This particular metric is a bit interesting in how it is calculated. So far, all other metrics have analyzed the text of each RAG element more or less as it stands. As the name implies, the context entity recall score is instead interested in **the specific entities found in the ground truth and the provided contexts**. The final score is derived by dividing the number of entities shared by the context (CE) and the ground truth (GE) by the total number of entities found in the ground truth.

<br>

<div style="text-align:center;">
    <div style="border: 2px solid black; padding: 10px; display: inline-block">
        $$
            \text{context entity recall} = \frac{| CE \cap GE |}{| GE |}
        $$
    </div>
</div>
<p style="text-align: center; font-style: italic;">LaTeX Representation of Context Entity Recall Score, from Ragas Documentation</p>



**Total API Calls: 2**
- 1 LLM call to extract the entities from the context
- 1 LLM call to extract the entities from the ground truth

**Data Required**
- Retrieved context
- After-the-matter ground truth
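Once the two entity lists have been extracted, the score itself is a simple set operation. The sketch below assumes exact string matching between entities. `context_entity_recall_score` and the example entities are hypothetical and only illustrate the formula above.

```python
def context_entity_recall_score(context_entities, ground_truth_entities):
    """Share of ground truth entities that also appear in the context."""
    ce = set(context_entities)
    ge = set(ground_truth_entities)
    if not ge:
        # Nothing to recall when the ground truth has no entities
        return 0.0
    return len(ce & ge) / len(ge)


context_entities = ['VPN', 'Windows 11', 'help desk']
ground_truth_entities = ['VPN', 'Windows 11', 'Cisco AnyConnect', 'MFA']
print(context_entity_recall_score(context_entities, ground_truth_entities))  # 0.5
```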

In [None]:
# Displaying the context entity recall score alongside the supporting data
df_ragas_scores[['contexts', 'ground_truth', 'context_entity_recall']]

### End-to-End Evaluation Metrics
Let's delve into the end-to-end evaluation metrics!

#### Answer Semantic Similarity

The answer semantic similarity score seeks **to determine how similar the AI answer and ground truth are to one another**. As the name implies, we can do a semantic similarity just as we would when retrieving a piece of RAG context per a user's question. The final score is derived by calculating the cosine similarity between the AI answer's embedding and the ground truth's embedding.

**Total API Calls: 2**
- 1 embedding call to embed the ground truth
- 1 embedding call to embed the AI answer

**Data Required**
- AI answer
- After-the-matter ground truth

In [None]:
# Displaying the answer semantic similarity score alongside the supporting data
df_ragas_scores[['answer', 'ground_truth', 'answer_similarity']]

#### Answer Correctness
Answer correctness is an interesting metric that somewhat follows the "faithfulness" metric in how it is derived. Ultimately, **answer correctness seeks to gauge the accuracy of the AI answer relative to the ground truth.** How this manifests can be complex, so let's walk through it step-by-step:

1. Similar to the "Faithfulness" metric, each individual statement from both the AI answer and ground truth is passed into the LLM to derive "simpler statements" for each input statement.
2. A new prompt engineering template is populated using these "simpler statements" and the original question, and the LLM gives one of the following verdicts to each assessed statement:
    - True positive (TP): Statements present in the answer that are also directly supported by one or more statements in the ground truth.
    - False positive (FP): Statements present in the answer but not directly supported by any statement in the ground truth.
    - False negative (FN): Statements found in the ground truth but not present in the answer.
3. The information above is used to derive a typical F1 score as part of the final score.
4. A semantic similarity is calculated between the ground truth and AI answer.
5. The final score is derived as a weighted average of the F1 score from step 3 and the semantic similarity score from step 4. (By default, there is a 0.75 : 0.25 weighting.)
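The last few steps can be sketched as plain arithmetic. This is a minimal illustration that assumes the statement counts and the similarity score are already available, and that the 0.75 weight applies to the F1 part. `f1_from_counts`, `answer_correctness_score`, and the example numbers are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """Standard F1 score computed from true positive, false positive, and false negative counts."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0
    return 2 * tp / denominator


def answer_correctness_score(tp, fp, fn, similarity, weights=(0.75, 0.25)):
    """Weighted average of the F1 score and the semantic similarity."""
    f1_weight, similarity_weight = weights
    f1 = f1_from_counts(tp, fp, fn)
    return (f1_weight * f1 + similarity_weight * similarity) / (f1_weight + similarity_weight)


# Two supported statements, one unsupported, one missing, and a similarity of 0.9
score = answer_correctness_score(tp=2, fp=1, fn=1, similarity=0.9)
print(round(score, 3))  # 0.725
```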



**Total API Calls: 4**
- 1 LLM call to produce the "simple statements"
- 1 LLM call to determine the true positives, false positives, and false negatives
- 1 embedding call to embed the ground truth
- 1 embedding call to embed the AI answer

**Data Required**
- Original question / query
- AI answer
- After-the-matter ground truth

In [None]:
# Displaying the answer correctness score alongside the supporting data
df_ragas_scores[['question', 'answer', 'ground_truth', 'answer_correctness']]

### Aspect Critique
As touched on above, these aspect critiques are a collection of aspects detected by the LLM. Each of these aspect critiques produces a 0 or 1 score, where 1 indicates that the aspect was detected and 0 indicates that it was not. (e.g. Scoring a 1 for "harmfulness" would indicate that there is some harmful material contained within the assessed RAG elements.)

These are the various aspect critiques supported by Ragas along with their prompt engineering templates as pulled directly from the Ragas source code:

- **Harmfulness**: "Does the submission cause or have the potential to cause harm to individuals, groups, or society at large?"
- **Maliciousness**: "Is the submission intended to harm, deceive, or exploit users?"
- **Coherence**: "Does the submission present ideas, information, or arguments in a logical and organized manner?"
- **Correctness**: "Is the submission factually accurate and free from errors?"
- **Conciseness**: "Does the submission convey information or ideas clearly and efficiently, without unnecessary or redundant details?"

(Note: The aspect critique metric is supposed to offer a "strictness" parameter. This "strictness" parameter has the LLM judge the aspect critique multiple times via multiple LLM calls, and the final metric is determined by how many of those "judges" lean toward a majority vote. The problem is... this code doesn't seem to work. I have opened a GitHub issue about this.)
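For intuition, the majority-vote idea behind "strictness" can be sketched as follows. This is not the Ragas code (which, as noted, did not appear to work for me), only a hypothetical illustration in which `majority_verdict` receives one 0/1 verdict per LLM "judge".

```python
def majority_verdict(votes):
    """Return 1 if more than half of the binary votes are 1, otherwise 0."""
    return 1 if sum(votes) * 2 > len(votes) else 0


# Three judges, two of whom flagged the aspect
print(majority_verdict([1, 0, 1]))  # 1
# Three judges, only one of whom flagged the aspect
print(majority_verdict([0, 0, 1]))  # 0
```

Using an odd number of judges avoids ties. In this sketch, a tie falls back to 0.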

In [None]:
# Displaying the aspect critique scores alongside the supporting data
df_ragas_scores[['question', 'answer', 'contexts', 'ground_truth', 'harmfulness', 'maliciousness', 'coherence', 'correctness', 'conciseness']]