## Use Ragas to evaluate a RAG pipeline

Ragas is an open-source project for evaluating RAG components. [Paper](https://arxiv.org/abs/2309.15217), [Code](https://github.com/explodinggradients/ragas), [Docs](https://docs.ragas.io/en/stable/getstarted/index.html), [Intro blog](https://medium.com/towards-data-science/rag-evaluation-using-ragas-4645a4c6c477).

<div>
<img src="images/ragas_eval_image.png" width="80%"/>
</div>

**Please note that Ragas can consume a large number of OpenAI API tokens.** <br>

Read through this notebook carefully and pay attention to the number of questions and metrics you want to evaluate.
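
As a rough way to budget before running, the sketch below counts evaluation jobs, assuming one job per (question, metric) pair for each column being compared. This matches the progress bar total of 20 in the run further down (4 questions, 5 metrics). `estimate_eval_jobs` is a hypothetical helper, not part of Ragas, and a single job may issue more than one LLM call depending on the metric, so treat the count as a lower bound on requests.

```python
def estimate_eval_jobs(n_questions: int, n_metrics: int, n_columns: int = 1) -> int:
    """Rough count of metric evaluations: one job per (question, metric) pair,
    repeated for every answer or context column being compared."""
    return n_questions * n_metrics * n_columns


# 4 questions scored on 5 metrics for a single column.
single_column_jobs = estimate_eval_jobs(n_questions=4, n_metrics=5)
print(single_column_jobs)  # 20

# The same questions and metrics across 2 columns.
two_column_jobs = estimate_eval_jobs(n_questions=4, n_metrics=5, n_columns=2)
print(two_column_jobs)  # 40
```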



In [1]:
# IN ORDER TO USE HUGGINGFACE EMBEDDINGS, FIRST DOWNLOAD THE MODEL.

import torch
from langchain_huggingface import HuggingFaceEmbeddings

# Initialize torch settings for device-agnostic code
N_GPU = torch.cuda.device_count()
# Use the default CUDA device when a GPU is available, otherwise fall back to CPU.
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Use an embedding model.
model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': DEVICE}
encode_kwargs = {'normalize_embeddings': True}
embed_model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
EMBEDDING_DIM = embed_model.dict()['client'].get_sentence_embedding_dimension()
print(f"MODEL: {model_name}, EMBEDDING_DIM: {EMBEDDING_DIM}")



MODEL: BAAI/bge-large-en-v1.5, EMBEDDING_DIM: 1024


In [2]:
# !python -m pip install -U ragas datasets
import ragas
print(f"ragas: {ragas.__version__}")

ragas: 0.2.2






In [19]:
import os, sys, pprint
import pandas as pd
import numpy as np
import ragas, datasets

# Libraries to customize ragas critic model.
from ragas.llms import LangchainLLMWrapper
from langchain_community.chat_models import ChatOllama

# Libraries to customize ragas embedding model.
from langchain_huggingface import HuggingFaceEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper

# TODO:  Galileo says RAG metrics should be:
# 1. Completeness of answer to question.  Example if the user
# asks for capitals of France and Germany, but only answers
# Paris, that is incomplete.
# 2. Utilization of context in the answer.
# 3. Relevance of context to the answer. Entailment?
# 4. Groundedness of the context to the question.  
# Low score if context does not contain the answer.

# Import the evaluation metrics.
from ragas.metrics import (
    context_precision, 
    LLMContextRecall, 
    NonLLMContextPrecisionWithReference,  
    FactualCorrectness,
    Faithfulness,
    SemanticSimilarity,
    NoiseSensitivity,
    ResponseRelevancy
    )
from ragas import evaluate

# Get the current working directory.
cwd = os.getcwd()
file_name = 'blog_eval_answers_weaviate.csv'
file_path = os.path.join(cwd, file_name)
# print(f"file_path: {file_path}")

# Read ground truth answers from a CSV file.
eval_df = pd.read_csv(file_path, header=0, skip_blank_lines=True)
display(eval_df.head())

Unnamed: 0,Question,ground_truth_answer,ground_truth_contexts,recursive_context_512_k_2,recursive_context_512_k_2_weaviate,Custom_RAG_answer,Custom_RAG_answer_weaviate,llama3.2_ollama_answer,llama3.2_ollama_answer_weaviate
0,What do the parameters for HNSW mean?,"* M: maximum degree, or number of connections ...","* M: maximum degree, or number of connections ...","the node closest to the target in this layer, ...","the node closest to the target in this layer, ...",The parameters for HNSW (Hierarchical Navigabl...,The parameters for HNSW (Hierarchical Navigabl...,In HNSW (Hierarchical Navigable Small World Gr...,In HNSW (Hierarchical Navigable Small World Gr...
1,What are good default values for HNSW paramete...,"M=16, efConstruction=32, ef=32","M=16, efConstruction=32, ef=32",Select your Milvus distribution first. Index b...,Select your Milvus distribution first. Index b...,"For HNSW with 25K vectors of dimension 1024, g...","For HNSW with 25K vectors of dimension 1024, g...","Based on the Milvus documentation, here are so...","Based on the Milvus documentation, here are so..."
2,What does nlist vs nprobe mean in ivf_flat?,# nlist: controls how the vector data is part...,# nlist: controls how the vector data is part...,FAQ What is the difference between FLAT index ...,"performance. The default value is 0 , where Mi...","In the IVF_FLAT index, ""nlist"" refers to the n...","In IVF_FLAT, ""nlist"" refers to the number of c...","In an IVF-FLAT index, `nlist` refers to the nu...","In IVF_FLAT, ""nlist"" refers to the number of c..."
3,What is the default AUTOINDEX index and vector...,Index type = HNSW and distance metric=IP Inner...,Index type = HNSW and distance metric=IP Inner...,"True, and auto_id is enabled for the primary k...",Set up index for the collection 4.1. Set up t...,The default AUTOINDEX index in Milvus is typic...,The default AUTOINDEX index in Milvus uses the...,The default AUTOINDEX index type uses L2 as it...,"The default AutoIndex configuration for an ""AU..."


In [22]:
# 1. Define function to create a RAGAS dataset.
def assemble_ragas_dataset(input_df):
    """Assemble a RAGAS HuggingFace Dataset from an input pandas df."""

    # Assemble Ragas lists: questions, ground truth answers, retrieved contexts, ground truth contexts.
    question_list, truth_list, context_list, reference_context_list = [], [], [], []

    # Get all the questions.
    question_list = input_df.Question.to_list()

    # Get all the ground truth answers.
    truth_list = input_df.ground_truth_answer.to_list()

    # Get all the ground truth contexts.
    reference_context_list = input_df.ground_truth_contexts.to_list()
    reference_context_list = [[context] for context in reference_context_list]

    # Get all the Milvus Retrieval Contexts as list[list[str]]
    context_list = input_df.recursive_context_512_k_2.to_list()
    context_list = [[context] for context in context_list]

    # Get all the RAG answers.
    rag_answer_list = input_df.Custom_RAG_answer.to_list()

    # Create a HuggingFace Dataset from the ground truth lists.
    ragas_ds = datasets.Dataset.from_dict({"question": question_list,
                            "contexts": context_list,
                            "reference_contexts": reference_context_list,
                            "answer": rag_answer_list,
                            "ground_truth": truth_list
                            })
    return ragas_ds

# 2. Define function to evaluate RAGAS model.
def evaluate_ragas_model(pandas_eval_df, 
                         ragas_eval_metrics, 
                         what_to_evaluate='CONTEXTS',
                         cols_to_evaluate=['recursive_context_512_k_2', 'html_context_512_k_2'],
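                         llm=None):
    """Evaluate the RAGAS model using the input pandas df.

    `llm` should be a Ragas LLM wrapper (e.g. LangchainLLMWrapper), not a model
    name string. If None, ragas.evaluate falls back to its default evaluator LLM.
    """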

    temp_df = pandas_eval_df.copy()
    ragas_results_df_list = []
    scores = []

    # Loop through cols_to_evaluate and evaluate each one.
    for col in cols_to_evaluate:
        print(f"evaluating col: {col}")

        # Replace the Custom_RAG_context with the chunks to evaluate.
        if what_to_evaluate == "CONTEXTS":
            # Keep the Custom_RAG_answer as is.
            # Replace the Custom_RAG_context with the col context.
            temp_df['recursive_context_512_k_2'] = temp_df[col]

        # Replace the Custom_RAG_answer with the LLM answer to evaluate.
        elif what_to_evaluate == "ANSWERS":
            # Keep the Custom_RAG_context as is.
            # Replace the Custom_RAG_answer with the col answer.
            temp_df['Custom_RAG_answer'] = temp_df[col]

        # Assemble the RAGAS dataset.
        ragas_eval_ds = assemble_ragas_dataset(temp_df)

        # Evaluate the RAGAS model.
        ragas_results = ragas.evaluate(
            dataset=ragas_eval_ds, 
            metrics=ragas_eval_metrics,
            llm=llm)

        # Return evaluations as pandas df.
        temp = ragas_results.to_pandas()

        # Print the first row of ragas_results w/column names.
        print(temp.head(1))

        # Calculate an average score for Retrieval Contexts or Generated Answers.
        temp_score = -1.0
        if what_to_evaluate == "CONTEXTS":
            print(f"Evaluate chunking: {col}, ",end="")
            # Calculate context F1 scores.
            # Note: the precision and recall column names depend on which metrics were evaluated.
            temp['context_f1'] = \
                2.0 * temp.context_precision * temp.context_recall \
                / (temp.context_precision + temp.context_recall)
            temp = temp.fillna(0.0)
            # Calculate Retrieval average score.
            avg_retrieval_f1 = np.round(temp.context_f1.mean(),2)
            temp_score = avg_retrieval_f1

        elif what_to_evaluate == "ANSWERS":
            print(f"Evaluate LLM: {col}, ",end="")
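            # Use the FactualCorrectness score (a float between 0 and 1) as the answer score.
            # The results column is named `factual_correctness`; `answer_correctness` only
            # exists when the AnswerCorrectness metric is evaluated.
            # temp['avg_answer_score'] = (temp.answer_relevancy + temp.semantic_similarity + temp.factual_correctness) / 3
            temp['avg_answer_score'] = temp.factual_correctness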
            avg_answer_score = np.round(temp.avg_answer_score.mean(),4)
            temp_score = avg_answer_score
        print(f"avg_score: {temp_score}")

        # Add column what was evaluated.
        temp['evaluated'] = col
        # Append temp to the list of results.
        ragas_results_df_list.append(temp)
        
        # Append dictionary of scores to scores list.
        scores.append({f"{col}": temp_score})

    # Return concatenated results and scores.
    ragas_results_df = pd.concat(ragas_results_df_list, ignore_index=True)
    return ragas_results_df, scores
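
The context F1 used in the function above is the harmonic mean of context precision and context recall. A minimal standalone sketch follows, using plain Python instead of pandas and guarding the zero-denominator case explicitly rather than relying on `fillna`. The `context_f1` helper is illustrative only; the (precision, recall) pairs are the `recursive_context_512_k_2` rows from the results table shown later in this notebook.

```python
from statistics import mean


def context_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (context_precision, context_recall) for the four evaluation questions.
pairs = [(1.0, 1.0), (1.0, 0.0), (1.0, 0.5), (1.0, 1.0)]

f1_scores = [context_f1(p, r) for p, r in pairs]
avg_f1 = round(mean(f1_scores), 2)

print(f1_scores)
print(avg_f1)  # 0.67
```

A row with perfect precision but zero recall contributes an F1 of 0.0, which is why a single missed reference context pulls the average down sharply on a four-question evaluation set.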

### Choose evaluator LLM and evaluator embedding model

In [9]:
# # Change the default llm-as-critic LLM.
# LLM_NAME = "gpt-4o-mini" #OpenAI
# ragas_llm = ragas.llms.llm_factory(model=LLM_NAME)
# print(f"llm: {ragas_llm}")

# Change the default llm-as-critic LLM to local llama3.2 
LLM_NAME = 'llama3.2:1b'
ragas_llm = LangchainLLMWrapper(langchain_llm=ChatOllama(model=LLM_NAME))
print(f"llm: {ragas_llm}")

# Change the default embeddings models to use model on HuggingFace.
EMB_NAME = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': DEVICE}
encode_kwargs = {'normalize_embeddings': True}
lc_embed_model = HuggingFaceEmbeddings(
    model_name=EMB_NAME,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
ragas_emb = LangchainEmbeddingsWrapper(embeddings=lc_embed_model)

llm: LangchainLLMWrapper(run_config=RunConfig(timeout=180, max_retries=10, max_wait=60, max_workers=16, exception_types=(<class 'Exception'>,), log_tenacity=False, seed=42), multiple_completion_supported=False)


In [28]:
##########################################
# Set the evaluation type.
EVALUATE_WHAT = 'ANSWERS' 
# EVALUATE_WHAT = 'CONTEXTS'
##########################################

# Set the columns to evaluate.
if EVALUATE_WHAT == 'CONTEXTS':
    cols_to_evaluate=\
    ['recursive_context_512_k_2', 'recursive_context_512_k_2_weaviate']
elif EVALUATE_WHAT == 'ANSWERS':
    cols_to_evaluate=['Custom_RAG_answer', 'Custom_RAG_answer_weaviate']

# Set the metrics to evaluate.
if EVALUATE_WHAT == 'ANSWERS':
    eval_metrics=[
        NoiseSensitivity(llm=ragas_llm),
        ResponseRelevancy(llm=ragas_llm),
        FactualCorrectness(llm=ragas_llm), 
        Faithfulness(llm=ragas_llm),
        SemanticSimilarity(embeddings=ragas_emb)
        ]
elif EVALUATE_WHAT == 'CONTEXTS':
    eval_metrics=[
        context_precision,
        LLMContextRecall(llm=ragas_llm),
        # NonLLMContextPrecisionWithReference(embeddings=ragas_emb),
        ]

# Execute the evaluation.
print(f"Evaluating {EVALUATE_WHAT} using {eval_df.shape[0]} eval questions:")
ragas_result, scores = evaluate_ragas_model(
    eval_df, 
    eval_metrics, 
    what_to_evaluate=EVALUATE_WHAT,
    cols_to_evaluate=cols_to_evaluate,
    llm=ragas_llm,)


Evaluating ANSWERS using 4 eval questions:
evaluating col: Custom_RAG_answer


Evaluating:   0%|          | 0/20 [00:00<?, ?it/s]

Exception raised in Job[18]: TimeoutError()
Exception raised in Job[1]: TimeoutError()
Exception raised in Job[10]: TimeoutError()
Exception raised in Job[2]: TimeoutError()
Exception raised in Job[11]: TimeoutError()
Exception raised in Job[3]: TimeoutError()
Exception raised in Job[12]: TimeoutError()
Exception raised in Job[13]: TimeoutError()
Exception raised in Job[5]: TimeoutError()
Exception raised in Job[6]: TimeoutError()
Exception raised in Job[15]: TimeoutError()
Exception raised in Job[7]: TimeoutError()
Exception raised in Job[16]: TimeoutError()
Exception raised in Job[8]: TimeoutError()
Exception raised in Job[17]: TimeoutError()
Exception raised in Job[0]: TimeoutError()


                              user_input  \
0  What do the parameters for HNSW mean?   

                                  retrieved_contexts  \
0  [the node closest to the target in this layer,...   

                                  reference_contexts  \
0  [* M: maximum degree, or number of connections...   

                                            response  \
0  The parameters for HNSW (Hierarchical Navigabl...   

                                           reference  \
0  * M: maximum degree, or number of connections ...   

   noise_sensitivity_relevant  answer_relevancy  factual_correctness  \
0                         NaN               NaN                  NaN   

   faithfulness  semantic_similarity  
0           NaN             0.775278  
Evaluate LLM: Custom_RAG_answer, 

AttributeError: 'DataFrame' object has no attribute 'answer_correctness'
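
The `AttributeError` above arises because the results frame contains the columns `noise_sensitivity_relevant`, `answer_relevancy`, `factual_correctness`, `faithfulness`, and `semantic_similarity`, but no `answer_correctness`. One defensive option, sketched below in plain Python, is to average only the requested metric columns that are actually present and hold a real number, skipping the NaN values left behind by the timed-out jobs. `average_available_scores` and the `wanted` list are hypothetical names for illustration; the row values are copied from the printed output.

```python
import math


def average_available_scores(row: dict, metric_names: list) -> float:
    """Mean of the requested metrics that exist in `row` and are not NaN.

    Returns NaN when none of the requested metrics has a usable value.
    """
    values = [
        row[name]
        for name in metric_names
        if name in row
        and isinstance(row[name], (int, float))
        and not math.isnan(row[name])
    ]
    return sum(values) / len(values) if values else float("nan")


# First result row from the run above: every LLM-judged metric timed out.
row = {
    "noise_sensitivity_relevant": float("nan"),
    "answer_relevancy": float("nan"),
    "factual_correctness": float("nan"),
    "faithfulness": float("nan"),
    "semantic_similarity": 0.775278,
}

# `answer_correctness` is absent and `factual_correctness` is NaN, so only
# `semantic_similarity` contributes to the average.
wanted = ["answer_correctness", "factual_correctness", "semantic_similarity"]
score = average_available_scores(row, wanted)
print(score)  # 0.775278
```

Silently skipping missing metrics can hide configuration mistakes, so in practice it may be worth logging which requested names were not found.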

In [13]:
type(ragas_result)
ragas_result

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_precision,context_recall,context_f1,evaluated
0,What do the parameters for HNSW mean?,"[the node closest to the target in this layer,...","[* M: maximum degree, or number of connections...",The parameters for HNSW (Hierarchical Navigabl...,"* M: maximum degree, or number of connections ...",1.0,1.0,1.0,recursive_context_512_k_2
1,What are good default values for HNSW paramete...,[Select your Milvus distribution first. Index ...,"[M=16, efConstruction=32, ef=32]","For HNSW with 25K vectors of dimension 1024, g...","M=16, efConstruction=32, ef=32",1.0,0.0,0.0,recursive_context_512_k_2
2,What does nlist vs nprobe mean in ivf_flat?,[FAQ What is the difference between FLAT index...,[# nlist: controls how the vector data is par...,"In the IVF_FLAT index, ""nlist"" refers to the n...",# nlist: controls how the vector data is part...,1.0,0.5,0.666667,recursive_context_512_k_2
3,What is the default AUTOINDEX index and vector...,"[True, and auto_id is enabled for the primary ...",[Index type = HNSW and distance metric=IP Inne...,The default AUTOINDEX index in Milvus is typic...,Index type = HNSW and distance metric=IP Inner...,1.0,1.0,1.0,recursive_context_512_k_2
4,What do the parameters for HNSW mean?,"[the node closest to the target in this layer,...","[* M: maximum degree, or number of connections...",The parameters for HNSW (Hierarchical Navigabl...,"* M: maximum degree, or number of connections ...",1.0,1.0,1.0,recursive_context_512_k_2_weaviate
5,What are good default values for HNSW paramete...,[Select your Milvus distribution first. Index ...,"[M=16, efConstruction=32, ef=32]","For HNSW with 25K vectors of dimension 1024, g...","M=16, efConstruction=32, ef=32",1.0,0.0,0.0,recursive_context_512_k_2_weaviate
6,What does nlist vs nprobe mean in ivf_flat?,"[performance. The default value is 0 , where M...",[# nlist: controls how the vector data is par...,"In the IVF_FLAT index, ""nlist"" refers to the n...",# nlist: controls how the vector data is part...,1.0,0.666667,0.8,recursive_context_512_k_2_weaviate
7,What is the default AUTOINDEX index and vector...,[Set up index for the collection 4.1. Set up ...,[Index type = HNSW and distance metric=IP Inne...,The default AUTOINDEX index in Milvus is typic...,Index type = HNSW and distance metric=IP Inner...,1.0,1.0,1.0,recursive_context_512_k_2_weaviate


In [14]:
# Calculate and print the percent improvements.
if EVALUATE_WHAT == 'ANSWERS':
    # Sort scores from highest to lowest
    sorted_scores = sorted(scores, key=lambda item: sum(item.values()), reverse=True)
    pprint.pprint(sorted_scores)
    # Calculate the percent improvement of the best LLM over the worst LLM.
    highest_score = list(sorted_scores[0].values())[0]
    lowest_score = list(sorted_scores[-1].values())[0]
    best_llm = list(sorted_scores[0].keys())[0]
    worst_llm = list(sorted_scores[-1].keys())[0]
    percent_better = (highest_score - lowest_score) / lowest_score * 100
    print(f"{best_llm} {np.round(percent_better,0)}% improvement over {worst_llm}.")

elif EVALUATE_WHAT == 'CONTEXTS':
    pprint.pprint(scores)
    percent_better = np.abs(scores[0]['recursive_context_512_k_2'] - scores[1]['recursive_context_512_k_2_weaviate']) \
                     / scores[0]['recursive_context_512_k_2'] * 100
    print(f"Retrieval DB {np.round(percent_better,0)}% improvement.")

# Display the evaluation details.
display(ragas_result)

[{'recursive_context_512_k_2': 0.67},
 {'recursive_context_512_k_2_weaviate': 0.7}]
Retrieval DB 4.0% improvement.


Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_precision,context_recall,context_f1,evaluated
0,What do the parameters for HNSW mean?,"[the node closest to the target in this layer,...","[* M: maximum degree, or number of connections...",The parameters for HNSW (Hierarchical Navigabl...,"* M: maximum degree, or number of connections ...",1.0,1.0,1.0,recursive_context_512_k_2
1,What are good default values for HNSW paramete...,[Select your Milvus distribution first. Index ...,"[M=16, efConstruction=32, ef=32]","For HNSW with 25K vectors of dimension 1024, g...","M=16, efConstruction=32, ef=32",1.0,0.0,0.0,recursive_context_512_k_2
2,What does nlist vs nprobe mean in ivf_flat?,[FAQ What is the difference between FLAT index...,[# nlist: controls how the vector data is par...,"In the IVF_FLAT index, ""nlist"" refers to the n...",# nlist: controls how the vector data is part...,1.0,0.5,0.666667,recursive_context_512_k_2
3,What is the default AUTOINDEX index and vector...,"[True, and auto_id is enabled for the primary ...",[Index type = HNSW and distance metric=IP Inne...,The default AUTOINDEX index in Milvus is typic...,Index type = HNSW and distance metric=IP Inner...,1.0,1.0,1.0,recursive_context_512_k_2
4,What do the parameters for HNSW mean?,"[the node closest to the target in this layer,...","[* M: maximum degree, or number of connections...",The parameters for HNSW (Hierarchical Navigabl...,"* M: maximum degree, or number of connections ...",1.0,1.0,1.0,recursive_context_512_k_2_weaviate
5,What are good default values for HNSW paramete...,[Select your Milvus distribution first. Index ...,"[M=16, efConstruction=32, ef=32]","For HNSW with 25K vectors of dimension 1024, g...","M=16, efConstruction=32, ef=32",1.0,0.0,0.0,recursive_context_512_k_2_weaviate
6,What does nlist vs nprobe mean in ivf_flat?,"[performance. The default value is 0 , where M...",[# nlist: controls how the vector data is par...,"In the IVF_FLAT index, ""nlist"" refers to the n...",# nlist: controls how the vector data is part...,1.0,0.666667,0.8,recursive_context_512_k_2_weaviate
7,What is the default AUTOINDEX index and vector...,[Set up index for the collection 4.1. Set up ...,[Index type = HNSW and distance metric=IP Inne...,The default AUTOINDEX index in Milvus is typic...,Index type = HNSW and distance metric=IP Inner...,1.0,1.0,1.0,recursive_context_512_k_2_weaviate
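
The `CONTEXTS` branch above hardcodes the two column names when computing the percent improvement. A more general sketch is shown below, assuming `scores` keeps the same shape as returned by `evaluate_ragas_model` (a list of single-entry dicts). `percent_improvement` is a hypothetical helper, not part of the notebook or Ragas, and it returns NaN instead of dividing by a zero baseline.

```python
def percent_improvement(scores: list) -> tuple:
    """Return (best_name, worst_name, percent gain of best over worst)."""
    # Flatten [{name: score}, ...] into [(name, score), ...].
    flat = [(name, value) for entry in scores for name, value in entry.items()]
    best_name, best = max(flat, key=lambda kv: kv[1])
    worst_name, worst = min(flat, key=lambda kv: kv[1])
    if worst == 0:
        # A relative improvement over a zero baseline is undefined.
        return best_name, worst_name, float("nan")
    return best_name, worst_name, (best - worst) / worst * 100


# Scores printed by the cell above.
scores = [
    {"recursive_context_512_k_2": 0.67},
    {"recursive_context_512_k_2_weaviate": 0.7},
]

best, worst, pct = percent_improvement(scores)
print(f"{best} {round(pct)}% improvement over {worst}.")
```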


In [None]:
########### CHANGE THE VECTOR DB ###########
# F1-Score weaviate: 0.7 (4% improvement)
# F1-Score milvus: 0.67 
####################################################

########### CHANGE THE EMBEDDING MODEL #############
# F1-Score OpenAI text-embedding-3-small: 0.84  (20% improvement)
# F1-Score HuggingFace BAAI/bge-large-en-v1.5: 0.7
####################################################

########### CHANGE THE CHUNKING STRATEGY ###########
# F1-Score 'html_context_512_k_2': 0.77  (108% improvement)
# F1-Score 'parent_context_1536_k1': 0.68 (84% improvement)
# F1-Score 'recursive_context_512_k_2': 0.64
# F1-Score 'semantic_context_k_2_summary': 0.64
# F1-Score 'semantic_context_k_1': 0.37
####################################################

############## CHANGE THE LLM ######################
# Avg mistralai mixtral_8x7b_instruct score = 0.7031 (6% improvement)
# Avg llama3_70b_anyscale_chat score = 0.6888
# Avg llama3_70b_groq_instruct score = 0.6867
# Avg llama_3_8b_ollama_instruct score = 0.6783
# Avg openai gpt-3.5-turbo score = 0.665
####################################################


In [None]:
# Delete the Milvus collection and doc store.
# del vectorstore, retriever, store

In [12]:
# Props to Sebastian Raschka for this handy watermark.
# !python -m pip install watermark
%load_ext watermark

%watermark -a 'Christy Bergman' -v -p unstructured,lxml,torch,weaviate,langchain,ollama,openai,ragas --conda

Author: Christy Bergman

Python implementation: CPython
Python version       : 3.12.5
IPython version      : 8.26.0

unstructured: 0.15.13
lxml        : 5.3.0
torch       : 2.4.0
weaviate    : 4.8.1
langchain   : 0.3.3
ollama      : 0.3.3
openai      : 1.51.0
ragas       : 0.2.2

conda environment: n/a

