RAGAS framework demo
- Faithfulness Score/Context Precision and recall scores
- Generate Synthetic Data and Evaluate based on the RAGAS framework


In [None]:
!pip install transformers

In [None]:
!pip install faiss-cpu

In [None]:
!pip install ragas
!pip install helper-utils
!pip install pypdf
!pip install chromadb
!pip install langchain
!pip install sentence_transformers
!pip install openai

In [22]:
import os
os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"

**Faithfulness Score**

- This measures the factual consistency of the **generated answer against the given context**. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; higher is better.

- The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate the score, a set of claims is first extracted from the generated answer. Each claim is then cross-checked against the given context to determine whether it can be inferred from it. The faithfulness score is given by:

  $$\text{Faithfulness score} = \frac{|\text{Number of claims in the generated answer that can be inferred from the given context}|}{|\text{Total number of claims in the generated answer}|}$$
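
The claim-counting step described above can be sketched in a few lines. This is only an illustration of the ratio, not the RAGAS implementation: in RAGAS an LLM extracts the claims and judges each one, whereas here the per-claim verdicts are hypothetical values supplied by hand, and `faithfulness_score` is a made-up helper name.

```python
def faithfulness_score(claim_supported):
    """Fraction of answer claims that the context supports.

    claim_supported: list of booleans, one per claim extracted from the answer
    (True if the claim can be inferred from the retrieved context).
    """
    if not claim_supported:
        raise ValueError("the answer must contain at least one claim")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for an answer with three claims, two of them supported.
verdicts = [True, True, False]
score = faithfulness_score(verdicts)
print(round(score, 3))  # 0.667
```

A score of 1.0 means every claim in the answer is backed by the context; a score of 0.0 means none is.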

In [4]:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness



data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}

dataset = Dataset.from_dict(data_samples)

score = evaluate(dataset,metrics=[faithfulness,answer_correctness])
score.to_pandas()

Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

| | question | answer | contexts | ground_truth | faithfulness | answer_correctness |
|---|---|---|---|---|---|---|
| 0 | When was the first super bowl? | The first superbowl was held on Jan 15, 1967 | [The First AFL–NFL World Championship Game was... | The first superbowl was held on January 15, 1967 | 0.0 | 0.749095 |
| 1 | Who won the most super bowls? | The most super bowls have been won by The New ... | [The Green Bay Packers...Green Bay, Wisconsin.... | The New England Patriots have won the Super Bo... | 0.0 | 0.981078 |


**Context Precision/Recall Scores**

In [6]:
from datasets import load_dataset

# loading the V2 dataset
amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
amnesty_qa


DatasetDict({
    eval: Dataset({
        features: ['question', 'ground_truth', 'answer', 'contexts'],
        num_rows: 20
    })
})

In [5]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

In [7]:
from ragas import evaluate

result = evaluate(
    amnesty_qa["eval"],
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

result

Evaluating:   0%|          | 0/80 [00:00<?, ?it/s]

{'context_precision': 0.9333, 'faithfulness': 0.5205, 'answer_relevancy': 0.9776, 'context_recall': 0.8037}
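
The context precision and context recall values reported above can be read with the following sketch in mind. It assumes the commonly documented RAGAS definitions: context precision averages precision@k over the ranks that hold a relevant chunk, and context recall is the share of ground-truth statements that can be attributed to the retrieved context. In RAGAS the relevance and attribution judgments come from an LLM; the 0/1 flags below are hypothetical, and the function names are chosen only for this illustration.

```python
def context_precision(relevance):
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    relevance: list of 0/1 flags for the retrieved chunks, in ranked order.
    """
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k, counted only at relevant ranks
    return weighted / hits if hits else 0.0


def context_recall(attributed):
    """Fraction of ground-truth statements attributable to the retrieved context."""
    return sum(attributed) / len(attributed) if attributed else 0.0


# Hypothetical judgments: chunks 1 and 3 are relevant, chunk 2 is not.
precision = context_precision([1, 0, 1])
# Hypothetical judgments: 4 of 5 ground-truth statements are found in the context.
recall = context_recall([1, 1, 1, 1, 0])
print(round(precision, 6), recall)  # 0.833333 0.8
```

Because precision@k is only counted at relevant ranks, an irrelevant chunk ranked above a relevant one lowers the score, which rewards retrievers that put useful chunks first.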

In [8]:
df = result.to_pandas()
df.head()

| | question | ground_truth | answer | contexts | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | What are the global implications of the USA Su... | The global implications of the USA Supreme Cou... | The global implications of the USA Supreme Cou... | [- In 2022, the USA Supreme Court handed down ... | 1.0 | 1.0 | 0.988014 | 1.0 |
| 1 | Which companies are the main contributors to G... | According to the Carbon Majors database, the m... | According to the Carbon Majors database, the m... | [In recent years, there has been increasing pr... | 1.0 | 0.076923 | 0.962193 | 1.0 |
| 2 | Which private companies in the Americas are th... | The largest private companies in the Americas ... | According to the Carbon Majors database, the l... | [The issue of greenhouse gas emissions has bec... | 0.833333 | 1.0 | 0.991168 | 1.0 |
| 3 | What action did Amnesty International urge its... | Amnesty International urged its supporters to ... | Amnesty International urged its supporters to ... | [In the case of the Ogoni 9, Amnesty Internati... | 1.0 | 0.2 | 0.983242 | 1.0 |
| 4 | What are the recommendations made by Amnesty I... | The recommendations made by Amnesty Internatio... | Amnesty International made several recommendat... | [In recent years, Amnesty International has fo... | 1.0 | 0.047619 | 0.989105 | 1.0 |


**Demonstrate the procedure for generating ground-truth data using GPT-4o.**

Replicating code from https://medium.com/@Stan_DS/efficient-rag-model-assessment-using-ragas-c9153643abb1

**Step 1: Load dataset and modules**

In [10]:
# from helper_utils import word_wrap
import os
from getpass import getpass

import chromadb
import numpy as np
import openai
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from dotenv import load_dotenv, find_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
from openai import OpenAI
from pypdf import PdfReader
from sentence_transformers import CrossEncoder
from tqdm import tqdm
# import umap.umap_ as umap

# Load variables from a local .env file, if one exists
_ = load_dotenv('.env')
# Placeholder: replace the string with a real key (this assignment overrides any value loaded from .env)
os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
from langchain.schema import Document
from sentence_transformers import SentenceTransformer

# Load Tesla 2023 10K report
reader = PdfReader("/content/tsla-20231231-gen.pdf")

**Step 2: Data preprocessing**

In [18]:
# Extract text from each page and store with page numbers
pdf_texts = []
# loop through each page of a pdf document, extracts the text
# strip leading or trailing whitespace
# the extracted text, along with page number, is stored in a list 'pdf_texts'
for page_num, page in enumerate(reader.pages):
    text = page.extract_text().strip()
    if text:
        pdf_texts.append({"page_number": page_num + 1, "content": text})

# Split text by sentences while maintaining page number
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". "],
    # maximum size of each chunk
    chunk_size=1000,
    # number of characters that overlap between consecutive chunks
    chunk_overlap=0
)

# Split each page's content and store in a list with metadata
character_split_texts = []
# iterate over each page's text stored in 'pdf_texts'
# use character_splitter to split the text into chunks, keeping the page number
# of the page each chunk came from
for entry in pdf_texts:
    chunks = character_splitter.split_text(entry["content"])
    for chunk in chunks:
        character_split_texts.append({"page_number": entry["page_number"], "content": chunk})

# Print an example chunk and total number of chunks
print(character_split_texts[10]["content"])
print(f"\nTotal chunks: {len(character_split_texts)}")

such	risks	have	occurred	at	the	time	of	this	filing.	We	do	not	assume	any	obligation	to	update	any	forward-looking	statements.

Total chunks: 528
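
To make the chunking step above more concrete, here is a simplified sketch of the idea behind a recursive character splitter: try the coarsest separator first, merge small pieces up to the size limit, and fall back to finer separators only for pieces that are still too long. This is not LangChain's implementation (which handles separators and overlap differently); `split_text` and the sample text are written only for this illustration.

```python
def split_text(text, separators=("\n\n", "\n", ". "), chunk_size=1000):
    """Simplified recursive splitter: split on the coarsest separator first,
    then fall back to finer ones for pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard-cut into fixed-size windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators.
            chunks.extend(split_text(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph. It has two sentences.\n\nSecond paragraph is here.\nWith a second line."
chunks = split_text(sample, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

With `chunk_size=40` the first paragraph fits in one chunk, while the second paragraph is too long and is split again on the single newline.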


In [19]:
# Check if there are at least 3 chunks to print
if len(character_split_texts) >= 3:
    # Iterate over the slice of the list containing the three chunks starting from index 10
    for i in range(10, 13):  # This will access chunks 11, 12, and 13
        if i < len(character_split_texts):  # Check if the index is within the range of the list
            chunk = character_split_texts[i]
            print(f"Page Number: {chunk['page_number']}")
            print(chunk['content'])
            print("\n------------------------------------------------\n")
        else:
            print("No more chunks available.")
else:
    print("Not enough chunks available to display.")

print(f"Total chunks: {len(character_split_texts)}")


Page Number: 4
such	risks	have	occurred	at	the	time	of	this	filing.	We	do	not	assume	any	obligation	to	update	any	forward-looking	statements.

------------------------------------------------

Page Number: 5
Table	of	Contents
PART	I
ITEM	1.	BUSINESS
Overview
We	design,	develop,	manufacture,	sell	and	lease	high-performance	fully	electric	vehicles	and	energy	generation	and	storage	systems,	and	offer
services	related	to	our	products.	We	generally	sell	our	products	directly	to	customers,	and	continue	to	grow	our	customer-facing	infrastructure	through	a
global	network	of	vehicle	showrooms	and	service	centers,	Mobile	Service,	body	shops,	Supercharger	stations	and	Destination	Chargers	to	accelerate	the
widespread	adoption	of	our	products.	We	emphasize	performance,	attractive	styling	and	the	safety	of	our	users	and	workforce	in	the	design	and
manufacture	of	our	products	and	are	continuing	to	develop	full	self-driving	technology	for	improved	safety.	We	also	strive	to	lower	the	cost	of	ownership

In [None]:
print(character_split_texts[10]["content"])

such	risks	have	occurred	at	the	time	of	this	filing.	We	do	not	assume	any	obligation	to	update	any	forward-looking	statements.


In [20]:
# Tokenize the sentence chunks
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

# Split each character chunk while maintaining metadata
token_split_texts = []
for entry in character_split_texts:
    chunks = token_splitter.split_text(entry["content"])
    for chunk in chunks:
        token_split_texts.append({"page_number": entry["page_number"], "content": chunk})

# Create base_docs structure
base_docs = []
for entry in token_split_texts:
    base_docs.append(Document(page_content=entry["content"], metadata={"page_number": entry["page_number"]}))

from langchain.vectorstores import Chroma
from langchain.embeddings import SentenceTransformerEmbeddings

# Define the embedding function using SentenceTransformer
embedding_function = SentenceTransformerEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

# Use the embedding function with Chroma
vectorstore = Chroma.from_documents(base_docs, embedding_function)

# create a base retriever for later use
base_retriever = vectorstore.as_retriever(search_kwargs={"k" : 2})
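
The retriever above returns the `k` = 2 chunks closest to the query in embedding space. The sketch below shows that idea with plain Python lists, assuming cosine similarity and an exhaustive scan; the metric and index that Chroma actually uses may differ, and the 3-dimensional vectors are toy values, not real model output.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional "embeddings" (hypothetical values).
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.05, 0.0]
nearest = top_k(query, docs, k=2)
print(nearest)  # [0, 1]
```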


In [23]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser
from langchain.prompts import ChatPromptTemplate
#generate questions based on the doc contents
from tqdm import tqdm
import pandas as pd
from datasets import Dataset
from langchain.chat_models import ChatOpenAI


# input consists of the first 10 text chunks from the chunking process (base_docs[:10])
# use GPT-3.5 Turbo (gpt-3.5-turbo-16k) to generate the questions below

question_schema = ResponseSchema(
    name="question",
    description="a question about the context."
)

question_response_schemas = [
    question_schema,
]
question_output_parser = StructuredOutputParser.from_response_schemas(question_response_schemas)
format_instructions = question_output_parser.get_format_instructions()
question_generation_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")

bare_prompt_template = "{content}"
bare_template = ChatPromptTemplate.from_template(template=bare_prompt_template)

qa_template = """\
You are a University Professor creating a test for advanced students. For each context, create a question that is specific to the context. Avoid creating generic or general questions.

question: a question about the context.

Format the output as JSON with the following keys:
question

{format_instructions}

context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=base_docs[0],
    format_instructions=format_instructions
)

question_generation_chain = bare_template | question_generation_llm

response = question_generation_chain.invoke({"content" : messages})
output_dict = question_output_parser.parse(response.content)
for k, v in output_dict.items():
  print(k)
  print(v)

#create qac_triples
qac_triples = []

for text in tqdm(base_docs[:10]):
  messages = prompt_template.format_messages(
      context=text,
      format_instructions=format_instructions
  )
  response = question_generation_chain.invoke({"content" : messages}) # generate a question
  try:
    output_dict = question_output_parser.parse(response.content) # parse the generated question
  except Exception:
    # skip chunks whose response cannot be parsed
    continue
  output_dict["context"] = text
  qac_triples.append(output_dict)

question
What is the trading symbol for Tesla's common stock and on which exchange is it registered?


100%|██████████| 10/10 [00:17<00:00,  1.71s/it]
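
The parsing step in the cell above turns the model's reply into a dictionary and fails when the reply is not valid structured output. A rough sketch of that behaviour using only the standard library follows. It is not LangChain's `StructuredOutputParser`; `parse_json_response` and the sample reply are hypothetical and only illustrate the idea of extracting a JSON object (optionally wrapped in a fenced block) and checking for the expected keys.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def parse_json_response(text, expected_keys):
    """Extract a JSON object from an LLM reply and check that required keys exist."""
    pattern = FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    payload = match.group(1) if match else text.strip()
    data = json.loads(payload)  # raises json.JSONDecodeError on malformed output
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


# Hypothetical model reply wrapped in a fenced json block.
reply = f'{FENCE}json\n{{"question": "What is the trading symbol?"}}\n{FENCE}'
parsed = parse_json_response(reply, ["question"])
print(parsed["question"])
```

Raising on a missing key, instead of returning a partial dictionary, is what lets the surrounding loop skip a bad reply with `try`/`except`.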


In [24]:
# Use the generated questions and text chunks to prompt GPT-4o to generate ground truth answers,
# and save these into a ground truth dataset. GPT-4o acts as a language expert, answering the questions using the provided context with perfect information.
# These answers are treated as ground truths.


#add answer to qac_triples
answer_generation_llm = ChatOpenAI(model="gpt-4o", temperature=0)

answer_schema = ResponseSchema(
    name="answer",
    description="an answer to the question"
)

answer_response_schemas = [
    answer_schema,
]

answer_output_parser = StructuredOutputParser.from_response_schemas(answer_response_schemas)
format_instructions = answer_output_parser.get_format_instructions()

qa_template = """\
You are a University Professor creating a test for advanced students. For each question and context, create an answer.

answer: an answer to the question, based on the context.

Format the output as JSON with the following keys:
answer

{format_instructions}

question: {question}
context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=qac_triples[0]["context"],
    question=qac_triples[0]["question"],
    format_instructions=format_instructions
)

answer_generation_chain = bare_template | answer_generation_llm

response = answer_generation_chain.invoke({"content" : messages})
output_dict = answer_output_parser.parse(response.content)
for k, v in output_dict.items():
  print(k)
  print(v)
for triple in tqdm(qac_triples):
  messages = prompt_template.format_messages(
      context=triple["context"],
      question=triple["question"],
      format_instructions=format_instructions
  )
  response = answer_generation_chain.invoke({"content" : messages})
  try:
    output_dict = answer_output_parser.parse(response.content)
  except Exception as e:
    continue
  triple["answer"] = output_dict["answer"]
#ground truth dataset

ground_truth_qac_set = pd.DataFrame(qac_triples)
ground_truth_qac_set["context"] = ground_truth_qac_set["context"].map(lambda x: str(x.page_content))
ground_truth_qac_set = ground_truth_qac_set.rename(columns={"answer" : "ground_truth"})


eval_dataset = Dataset.from_pandas(ground_truth_qac_set)
eval_dataset.to_csv("/content/groundtruth_eval_dataset.csv")

question
What is the trading symbol for Tesla's common stock?
context
page_content='united states securities and exchange commission washington, d. c. 20549 form 10 - k ( mark one ) x annual report pursuant to section 13 or 15 ( d ) of the securities exchange act of 1934 for the fiscal year ended december 31, 2023 or o transition report pursuant to section 13 or 15 ( d ) of the securities exchange act of 1934 for the transition period from _ _ _ _ _ _ _ _ _ to _ _ _ _ _ _ _ _ _ commission file number : 001 - 34756 tesla, inc. ( exact name of registrant as specified in its charter ) delaware 91 - 2197729 ( state or other jurisdiction of incorporation or organization ) ( i. r. s. employer identification no. ) 1 tesla road austin, texas 78725 ( address of principal executive offices ) ( zip code ) ( 512 ) 516 - 8177 ( registrant ’ s telephone number, including area code ) securities registered pursuant to section 12 ( b ) of the act : title of each class trading symbol ( s ) name of each 

100%|██████████| 10/10 [00:32<00:00,  3.20s/it]


Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

11036
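
Because the answer loop above skips a triple when parsing fails, some records may reach the DataFrame without an `answer`, which would leave missing values in `ground_truth`. One possible guard, sketched here with hypothetical records and a made-up helper name, is to keep only complete triples before building the dataset.

```python
def complete_triples(triples, required=("question", "context", "answer")):
    """Keep only the records in which every required field is present and non-empty."""
    return [t for t in triples if all(t.get(key) for key in required)]


# Hypothetical records: the second one lost its answer to a parse failure.
records = [
    {"question": "Q1", "context": "C1", "answer": "A1"},
    {"question": "Q2", "context": "C2"},
]
kept = complete_triples(records)
print(len(kept))  # 1
```

In the notebook this filter would be applied to `qac_triples` just before `pd.DataFrame(qac_triples)`.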

In [25]:
# import metrics to evaluate
from datasets import Dataset
import pandas as pd
from tqdm import tqdm
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_relevancy,
    answer_correctness,
    answer_similarity
)

In [51]:
# run the RAG pipeline and combine its predictions with the generated ground truth data

from ragas.metrics.critique import harmfulness
from ragas import evaluate

def create_ragas_dataset(rag_pipeline, eval_dataset):
    rag_dataset = []
    for row in tqdm(eval_dataset):
        # answer = rag_pipeline.invoke({"question" : row["question"]})
        answer = rag_pipeline(question=row['question'], context=row.get('context', ''))
        rag_dataset.append(
            {"question" : row["question"],
             "answer" : answer["response"].content,
             "contexts" : [context.page_content for context in answer["context"]],
             # single string per row, matching the 'ground_truth' column used in the earlier examples
             "ground_truth" : row["ground_truth"]
             }
        )
    rag_df = pd.DataFrame(rag_dataset)
    rag_eval_dataset = Dataset.from_pandas(rag_df)
    return rag_eval_dataset
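
The function above expects `rag_pipeline` to return a dictionary whose `"response"` has a `.content` attribute and whose `"context"` is a list of objects with `.page_content`. Before spending API calls, that contract can be checked with a stub. Everything in this sketch is hypothetical: the stub, its canned text, and `build_row`, which mirrors the per-row logic without pandas or `datasets`. The ground-truth column name is also an assumption; use whichever name your ragas version expects.

```python
from types import SimpleNamespace


def stub_rag_pipeline(question, context=""):
    """Hypothetical stand-in that returns the shape create_ragas_dataset reads."""
    doc = SimpleNamespace(page_content="Tesla common stock trades under the symbol TSLA.")
    return {"response": SimpleNamespace(content="TSLA"), "context": [doc]}


def build_row(pipeline, row):
    """Mirror of the per-row logic, without pandas or datasets."""
    answer = pipeline(question=row["question"], context=row.get("context", ""))
    return {
        "question": row["question"],
        "answer": answer["response"].content,
        "contexts": [doc.page_content for doc in answer["context"]],
        "ground_truth": row["ground_truth"],
    }


row = build_row(
    stub_rag_pipeline,
    {"question": "What is Tesla's trading symbol?", "ground_truth": "TSLA"},
)
print(row["contexts"])
```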

In [None]:
from datasets import Dataset
import pandas as pd
from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering

# evaluate the RAG performance
# by comparing the RAG model predictions with the ground truths.

def evaluate_ragas_dataset(ragas_dataset):
    result = evaluate(
        ragas_dataset,
        metrics=[
            context_precision,
            faithfulness,
            answer_relevancy,
            context_recall,
            context_relevancy,
            answer_correctness,
            answer_similarity
        ],
    )
    return result

# Load the evaluation dataset
eval_dataset = Dataset.from_csv("/content/groundtruth_eval_dataset.csv")

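# initialize rag_pipeline
# NOTE: transformers.pipeline() does not accept a `retriever` argument, so
# pipeline("document-question-answering", retriever=base_retriever) cannot run.
# Assumption: a minimal retrieve-then-answer function stands in for it; the prompt
# wording and the gpt-3.5-turbo model choice are illustrative, not from the source.
rag_prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\ncontext: {context}\n\nquestion: {question}"
)
rag_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

def rag_pipeline(question, context=""):
    # `context` is accepted because create_ragas_dataset passes it, but it is ignored:
    # the retriever supplies the context
    docs = base_retriever.invoke(question)
    response = (rag_prompt | rag_llm).invoke(
        {"context": "\n\n".join(doc.page_content for doc in docs), "question": question}
    )
    return {"response": response, "context": docs}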


# Create the RAGAS dataset
ragas_dataset = create_ragas_dataset(rag_pipeline, eval_dataset=eval_dataset)

# Evaluate the RAGAS dataset
evaluation_results = evaluate_ragas_dataset(ragas_dataset)
print(evaluation_results)


In [None]:
# visualize the results dictionary
pd.DataFrame.from_dict(evaluation_results, orient='index')