# RAG Evaluation w/ RAGAS

In [1]:
# import modules

import os
from ragas import evaluate
from datasets import Dataset
from ragas.metrics import faithfulness, answer_correctness, context_precision, context_recall
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

In [2]:
# Azure OpenAI settings, read from environment variables

AZURE_OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.environ.get('AZURE_OPENAI_ENDPOINT')
AZURE_OPENAI_VERSION = os.environ.get('AZURE_OPENAI_VERSION')
AZURE_OPENAI_DEPLOYMENT_NAME = os.environ.get('AZURE_OPENAI_DEPLOYMENT_NAME')
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME = os.environ.get('AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME')

In [3]:
# accessibility test: fails fast with a KeyError if OPENAI_API_KEY is not set

os.environ["OPENAI_API_KEY"]

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}

dataset = Dataset.from_dict(data_samples)

score = evaluate(dataset, metrics=[faithfulness, answer_correctness])
score.to_pandas()

| | question | answer | contexts | ground_truth | faithfulness | answer_correctness |
|---|---|---|---|---|---|---|
| 0 | When was the first super bowl? | The first superbowl was held on Jan 15, 1967 | [The First AFL–NFL World Championship Game was... | The first superbowl was held on January 15, 1967 | 1.0 | 0.749093 |
| 1 | Who won the most super bowls? | The most super bowls have been won by The New ... | [The Green Bay Packers...Green Bay, Wisconsin.... | The New England Patriots have won the Super Bo... | 0.0 | 0.981086 |


## Metrics

| Metric | Brief Description | Range | Good | Bad |
|---|---|---|---|---|
| **Context Relevancy** | Measures the precision of retrieved context to ensure it aligns accurately with the information need. | [0, 1] | Close to 1 | Close to 0 |
| **Faithfulness** | Measures hallucinations in the generated answers to assess factual accuracy. | [0, 1] | Close to 1 | Close to 0 |
| **Context Recall** | Measures how much of the relevant context is retrieved to answer the question. | [0, 1] | Close to 1 | Close to 0 |
| **Answer Relevancy** | Assesses how relevant and to-the-point the answers are relative to the posed question. | [0, 1] | Close to 1 | Close to 0 |

In general: 

- Metrics that evaluate the performance of the **Retrieval** are: Context Relevancy, Context Recall

- Metrics that evaluate the performance of the **Generation** are: Faithfulness, Answer Relevancy

❗The harmonic mean of these four metrics gives the **RAGAS score**❗
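
The harmonic-mean aggregation can be sketched in a few lines. This is only an illustration: `ragas_score` is a hypothetical helper written for this notebook, not a function from the RAGAS library, and the example values are made up.

```python
def ragas_score(scores: dict) -> float:
    """Harmonic mean of the metric scores; returns 0.0 if any score is 0."""
    values = list(scores.values())
    if not values:
        raise ValueError("at least one metric score is required")
    if any(v == 0 for v in values):
        # a single zero drives the harmonic mean to zero
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical scores, for illustration only (not results from this notebook)
example_scores = {
    "context_relevancy": 0.9,
    "context_recall": 0.8,
    "faithfulness": 0.7,
    "answer_relevancy": 0.6,
}
print(f"RAGAS score: {ragas_score(example_scores):.4f}")
```

Because the harmonic mean is dominated by the smallest values, the result lands below the arithmetic mean (0.75 for these inputs), so one weak metric pulls the overall score down noticeably.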

## 1. Import RAG w/ Contextual Compression for testing

In [4]:
import sys
sys.path.append("../1-rag-contextual-compression")

from rag_cc import RAGContextualCompression

In [5]:
# logging
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.WARNING)

In [6]:
data_path = "../data/rag-con-comp-data"

# step 1: initialize agent
ragcc = RAGContextualCompression(data_path=data_path)

# step 2: load & preprocess documents
docs = ragcc.load_documents()
doc_chunks = ragcc.preprocess_documents(docs)

# step 3: initialize vector store
db = ragcc.setup_vector_store(doc_chunks)

In [7]:
user_query = "Where does the water present in the egg go after boiling the egg?"

# step 4: retrieve documents
retrieved_docs = ragcc.retrieve_documents(db, user_query)

# step 5: setup compression and redundancy filters to optimize document retrieval
contextual_comp_retriever = ragcc.setup_compression_pipeline_retriever(db)

# step 6: generate the final answer to the user query
answer = ragcc.generate_answer(retriever=contextual_comp_retriever, user_query=user_query)
print(answer)

Yow yow, nice to see you here, curious mind! When an egg is boiled, the water present in the egg doesn't go anywhere but remains within the egg. The heat from boiling causes the proteins in the egg to denature and coagulate, trapping the water within the solidified egg white and yolk.


## 2a. Generate a synthetic test set for RAGAS (Option 1)

In [8]:
# load documents from the directory

from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(data_path)
documents = loader.load()

In [9]:
# setup Generator

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_community.chat_models import ChatOllama

gen_llm = ChatOllama(model="llama3", temperature=0)
critic_llm = AzureChatOpenAI(
    openai_api_version=AZURE_OPENAI_VERSION,
    azure_deployment=AZURE_OPENAI_DEPLOYMENT_NAME,
)
azure_oai_emb_model: AzureOpenAIEmbeddings = AzureOpenAIEmbeddings(
    azure_deployment=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME,
    openai_api_version=AZURE_OPENAI_VERSION,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
)

generator = TestsetGenerator.from_langchain(
    gen_llm,
    critic_llm,
    azure_oai_emb_model
)

In [10]:
# generate testset (5 samples)

synth_testset = generator.generate_with_langchain_docs(documents, test_size=5, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})
synth_testset.to_pandas()

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | Here's a question that can be fully answered f... | [ in liver and adipose (fat storing) tissues. ... | | simple | [{'source': '..\data\rag-con-comp-data\biomole... | True |
| 1 | Here's a question that can be fully answered f... | [ Fehling’s solution and Tollens’ reagent are ... | Glucose is an aldohexose. | simple | [{'source': '..\data\rag-con-comp-data\biomole... | True |
| 2 | Here's a rewritten question that conveys the s... | [ Fehling’s solution and Tollens’ reagent are ... | | reasoning | [{'source': '..\data\rag-con-comp-data\biomole... | True |
| 3 | Here's a rewritten version of the question tha... | [racil (U)\n\nA unit formed by the attachment ... | | multi_context | [{'source': '..\data\rag-con-comp-data\biomole... | True |
| 4 | Here's a question that can be fully answered f... | [ Fehling’s solution and Tollens’ reagent are ... | Glucose is an aldohexose. | simple | [{'source': '..\data\rag-con-comp-data\biomole... | True |


## 2b. Generate a dataset manually (Option 2)

In [16]:
# manually created test set

manual_data = {
    "question": [
        "How does the molecular structure of monosaccharides relate to their classification?",
        "What distinguishes a reducing sugar from a non-reducing sugar?",
        "What role do glycosidic linkages play in the structural formation of polysaccharides?",
        "Why do amino acids exhibit different chemical properties based on their side chains?",
        "In what ways does the primary structure of a protein determine its function?",
        "How does the presence of an aldehyde group influence the properties of glucose?",
        "Explain how DNA and RNA differ in terms of their structural components and functions.",
        "How do enzymes accelerate biochemical reactions through their interaction with substrates?"
    ],
    "contexts": [
        ["Monosaccharides are classified based on their ability to reduce Fehling’s solution and Tollens’ reagent."],
        ["Reducing sugars have free aldehyde or ketone groups that allow them to act as reducing agents."],
        ["Glycosidic linkages are covalent bonds that connect monosaccharides into larger carbohydrate molecules."],
        ["Amino acids have various side chains that determine their chemical properties and interactions."],
        ["The sequence of amino acids in a protein, known as the primary structure, critically determines its 3D conformation and function."],
        ["Glucose contains an aldehyde group, which reacts to form products like gluconic and saccharic acids."],
        ["DNA contains thymine and uses deoxyribose, while RNA uses uracil and ribose, influencing their respective roles in genetics."],
        ["Enzymes lower the activation energy required for biochemical reactions, enhancing reaction rates."]
    ],
    "answer": [
        "The molecular structure of monosaccharides is actually determined by their ability to join dance competitions across the globe. This unusual classification is based on the intricate footwork and rhythm of their hydrogen and oxygen atoms, showcasing a unique blend of chemistry and choreography.",
        "Reducing sugars can donate electrons to other molecules, while non-reducing sugars lack free aldehyde or ketone groups.",
        "Glycosidic linkages determine the structure and digestibility of polysaccharides like starch and cellulose.",
        "The R group in amino acids affects pH, polarity, and reactivity, influencing protein structure and function.",
        "Primary structure determines the stability and regulatory interactions of proteins, affecting their biological roles.",
        "The aldehyde group enables glucose to participate in oxidation-reduction reactions critical for energy production.",
        "Structural differences between DNA and RNA affect their stability and the mechanisms of protein synthesis.",
        "Enzymes interact specifically with substrates, forming a complex that facilitates the conversion to product."
    ],
    "ground_truth": [
        "Monosaccharides are classified into aldoses and ketoses based on their carbonyl group's position.",
        "Reducing sugars can participate in oxidation-reduction reactions due to their free aldehyde or ketone groups.",
        "Glycosidic linkages are essential for creating the complex structure of polysaccharides like starch and cellulose.",
        "Amino acids' side chains determine their chemical behavior and interaction in protein synthesis.",
        "The primary structure of proteins, which is the sequence of amino acids, dictates the protein's overall conformation and function.",
        "The aldehyde group in glucose is crucial for its involvement in energy production and metabolic pathways.",
        "DNA and RNA differ in sugar components and the type of nitrogenous bases they contain, affecting their function in genetic information storage and transfer.",
        "Enzymes lower the activation energy of biochemical reactions, thus speeding up the reactions without being consumed."
    ]
}

from datasets import Dataset

# create Dataset object
man_testset = Dataset.from_dict(manual_data)
print(man_testset)


Dataset({
    features: ['question', 'contexts', 'answer', 'ground_truth'],
    num_rows: 8
})
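
When a test set is written by hand, a quick structural check before calling `Dataset.from_dict` can catch mismatched column lengths early. The sketch below assumes the four-column layout used in this notebook; `check_testset` is a hypothetical helper, not part of RAGAS or `datasets`.

```python
# Hypothetical helper (not part of RAGAS): validates the column layout used in this notebook
REQUIRED_COLUMNS = ("question", "contexts", "answer", "ground_truth")


def check_testset(data: dict) -> int:
    """Return the number of rows, or raise ValueError if the layout is inconsistent."""
    missing = [col for col in REQUIRED_COLUMNS if col not in data]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    lengths = {col: len(data[col]) for col in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    # 'contexts' holds one list of strings per question
    for i, ctx in enumerate(data["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise ValueError(f"contexts[{i}] must be a list of strings")
    return lengths["question"]


# Tiny illustrative sample in the same layout as manual_data
example = {
    "question": ["What is glucose?"],
    "contexts": [["Glucose is an aldohexose."]],
    "answer": ["Glucose is an aldohexose."],
    "ground_truth": ["Glucose is an aldohexose."],
}
print(check_testset(example))
```

Calling `check_testset(manual_data)` before building the `Dataset` would raise a `ValueError` if, for example, one answer were missing from the list.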


## 3. Evaluation of the RAG w/ Contextual Compression

In [23]:
# convert to Dataset object

synth_testset_adj = Dataset.from_pandas(synth_testset.to_pandas())
synth_testset_adj

Dataset({
    features: ['question', 'contexts', 'ground_truth', 'evolution_type', 'metadata', 'episode_done'],
    num_rows: 5
})

In [27]:
# 1) Evaluation of synthetic dataset

synth_score = evaluate(synth_testset_adj, metrics=[context_precision, 
                                   context_recall])

print(f"SCORE FOR SYNTHETIC TEST SET: {synth_score}")
synth_score.to_pandas().head()

SCORE FOR SYNTHETIC TEST SET: {'context_precision': 0.8000, 'context_recall': 0.6000}


| | question | contexts | ground_truth | evolution_type | metadata | episode_done | context_precision | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | Here's a question that can be fully answered f... | [ in liver and adipose (fat storing) tissues. ... | | simple | [{'source': '..\data\rag-con-comp-data\biomole... | True | 1.0 | 1.0 |
| 1 | Here's a question that can be fully answered f... | [ Fehling’s solution and Tollens’ reagent are ... | Glucose is an aldohexose. | simple | [{'source': '..\data\rag-con-comp-data\biomole... | True | 1.0 | 1.0 |
| 2 | Here's a rewritten question that conveys the s... | [ Fehling’s solution and Tollens’ reagent are ... | | reasoning | [{'source': '..\data\rag-con-comp-data\biomole... | True | 1.0 | 0.0 |
| 3 | Here's a rewritten version of the question tha... | [racil (U)\n\nA unit formed by the attachment ... | | multi_context | [{'source': '..\data\rag-con-comp-data\biomole... | True | 0.0 | 0.0 |
| 4 | Here's a question that can be fully answered f... | [ Fehling’s solution and Tollens’ reagent are ... | Glucose is an aldohexose. | simple | [{'source': '..\data\rag-con-comp-data\biomole... | True | 1.0 | 1.0 |
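
The aggregate values printed for the synthetic test set match the arithmetic mean of the per-row scores. A minimal check, with the values copied by hand from the output above (the variable names are only for illustration):

```python
from statistics import mean

# Per-row scores copied from the synthetic test set output above
context_precision_rows = [1.0, 1.0, 1.0, 0.0, 1.0]
context_recall_rows = [1.0, 1.0, 0.0, 0.0, 1.0]

print(f"context_precision: {mean(context_precision_rows):.4f}")  # 0.8000
print(f"context_recall: {mean(context_recall_rows):.4f}")        # 0.6000
```

With only five samples, a single row moves the aggregate by 0.2, so these numbers are a rough indication rather than a stable estimate.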


In [28]:
# 2) Evaluation of manually created dataset

man_score = evaluate(man_testset, metrics=[context_precision, 
                                   context_recall,
                                   faithfulness,
                                   answer_correctness])

print(f"SCORE FOR MANUAL TEST SET: {man_score}")
man_score.to_pandas().head()

SCORE FOR MANUAL TEST SET: {'context_precision': 0.8750, 'context_recall': 0.7500, 'faithfulness': 0.4667, 'answer_correctness': 0.6799}


| | question | contexts | answer | ground_truth | context_precision | context_recall | faithfulness | answer_correctness |
|---|---|---|---|---|---|---|---|---|
| 0 | How does the molecular structure of monosaccha... | [Monosaccharides are classified based on their... | The molecular structure of monosaccharides is ... | Monosaccharides are classified into aldoses an... | 1.0 | 1.0 | 0.0 | 0.471229 |
| 1 | What distinguishes a reducing sugar from a non... | [Reducing sugars have free aldehyde or ketone ... | Reducing sugars can donate electrons to other ... | Reducing sugars can participate in oxidation-r... | 1.0 | 1.0 | 1.0 | 0.603155 |
| 2 | What role do glycosidic linkages play in the s... | [Glycosidic linkages are covalent bonds that c... | Glycosidic linkages determine the structure an... | Glycosidic linkages are essential for creating... | 1.0 | 1.0 | 0.5 | 0.615939 |
| 3 | Why do amino acids exhibit different chemical ... | [Amino acids have various side chains that det... | The R group in amino acids affects pH, polarit... | Amino acids' side chains determine their chemi... | 0.0 | 1.0 | 0.4 | 0.719609 |
| 4 | In what ways does the primary structure of a p... | [The sequence of amino acids in a protein, kno... | Primary structure determines the stability and... | The primary structure of proteins, which is th... | 1.0 | 1.0 | 0.666667 | 0.79257 |


## Metrics

For the evaluation of individual components please visit [RAGAS Metrics](https://docs.ragas.io/en/stable/concepts/metrics/index.html#).

Each component of the RAG pipeline can also be evaluated in isolation, using metrics such as faithfulness, context entity recall, and context precision.