In this project, we evaluate different indices on the `BlockchainSolanaDataset`.

<b>Evaluating RAG can be costly because GPT-4 is used as the judge. Please keep track of the cost. You can run the evaluation on a smaller subset of the data to reduce it.</b>
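
A rough budget can be sketched before running anything: multiply the number of examples by the number of judge calls per example and an assumed cost per call. The numbers below are hypothetical placeholders (`calls_per_example` and `cost_per_call_usd` are not measured values or actual OpenAI pricing), and the slicing helper shows one possible way to evaluate on fewer examples.

```python
def estimate_judge_cost(n_examples, calls_per_example, cost_per_call_usd):
    """Back-of-the-envelope cost: examples x judge calls x assumed price per call."""
    return n_examples * calls_per_example * cost_per_call_usd


def take_subset(examples, max_examples):
    """Keep only the first `max_examples` items to shrink the evaluation run."""
    return list(examples)[:max_examples]


# Hypothetical inputs for illustration only; substitute your own estimates.
all_examples = [f"example-{i}" for i in range(58)]
subset = take_subset(all_examples, 10)

full_cost = estimate_judge_cost(len(all_examples), calls_per_example=3, cost_per_call_usd=0.01)
subset_cost = estimate_judge_cost(len(subset), calls_per_example=3, cost_per_call_usd=0.01)
print(f"full run: ~${full_cost:.2f}, subset run: ~${subset_cost:.2f}")
```

The estimate scales linearly with the number of examples, so halving the subset roughly halves the judge cost under these assumptions.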

In [1]:
import nest_asyncio
nest_asyncio.apply()

In [2]:
import os

In [3]:
import os
from dotenv import load_dotenv, find_dotenv

# load_dotenv('D:/.env')  # uncomment and point this at your own .env file if needed
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
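
Reading the key with `os.environ[...]` raises a bare `KeyError` when the variable is missing. A small helper can fail with a clearer message instead. This is only a sketch: `require_env` is a hypothetical name, and the demo uses a stand-in mapping so that it runs without a real key.

```python
import os


def require_env(name, env=None):
    """Return the value of environment variable `name`, or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it or add it to your .env file")
    return value


# Stand-in mapping used for the demo; in the notebook, call require_env("OPENAI_API_KEY").
demo_env = {"OPENAI_API_KEY": "sk-placeholder"}
api_key = require_env("OPENAI_API_KEY", env=demo_env)
```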

#### Download the `BlockchainSolanaDataset` evaluation dataset from LlamaDatasets, which is based on the paper [From Bitcoin to Solana](https://arxiv.org/pdf/2207.05240.pdf)

In [5]:
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.packs.rag_evaluator import RagEvaluatorPack
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

Download the required files from the link below and place them where the code that follows expects them: `./data/rag_dataset.json` and the `./data/source_files` folder.

https://github.com/run-llama/llama-datasets/tree/main/llama_datasets/blockchain_solana
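
Before loading, it can help to confirm that the files landed where the loading code expects them. The sketch below assumes the layout used in the following cell (`./data/rag_dataset.json` and a `./data/source_files` folder); `missing_inputs` is a hypothetical helper, not part of LlamaIndex.

```python
from pathlib import Path


def missing_inputs(base_dir="./data"):
    """Return the expected dataset paths that are absent or empty under `base_dir`."""
    base = Path(base_dir)
    problems = []
    dataset_file = base / "rag_dataset.json"
    source_dir = base / "source_files"
    if not dataset_file.is_file():
        problems.append(str(dataset_file))
    # The source folder must exist and contain at least one file.
    if not source_dir.is_dir() or not any(source_dir.iterdir()):
        problems.append(str(source_dir))
    return problems


missing = missing_inputs("./data")
if missing:
    print("Missing or empty:", ", ".join(missing))
else:
    print("All expected inputs found.")
```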

In [6]:
rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()

In [8]:
eval_queries = [example.query for example in rag_dataset.examples]
eval_answers = [example.reference_answer for example in rag_dataset.examples]

In [9]:
documents[0]

Document(id_='7019eced-b956-4704-8f74-555e22625877', embedding=None, metadata={'page_label': '1', 'file_name': 'BlockchainSolana.pdf', 'file_path': '/home/jupyter-prashant/RAG systems using LlamaIndex/Module 4 - Evaluation of RAG systems/M4_L7_Evaluating Different Indices in Rag Pipeline/data/source_files/BlockchainSolana.pdf', 'file_type': 'application/pdf', 'file_size': 594798, 'creation_date': '2024-05-14', 'last_modified_date': '2024-03-31'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='From Bitcoin to Solana – Innovating Blockchain towards \nEnterprise Applications  \nXiangyu Li, Xinyu Wang , Tingli Kong , Junhao Zheng and Min Luo  \nGeorgia Institute of Technology, Atlanta, GA 30332, USA  \nmluo60@gatech.edu  \nAbstract. This survey presents a c

In [10]:
len(eval_queries)

58

In [11]:
eval_queries[0]

'What are the key issues preventing the wide adoption of blockchain technology in enterprise applications, and how has Solana addressed these issues?'

In [12]:
eval_answers[0]

'The key issues preventing the wide adoption of blockchain technology in enterprise applications are scalability and performance. However, recent advances in Solana have demonstrated that it is possible to significantly improve on these issues. Solana has achieved this by innovating on data structure, processes, and algorithms. It has consolidated various time-consuming algorithms and security enforcements, and has differentiated and balanced users and their responsibilities and rights while maintaining the required security and integrity that blockchain systems inherently offer.'

# LLM

In [13]:
from llama_index.llms.openai import OpenAI
gpt35 = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

# Embedding Model

In [1]:
# Requires the separate integration package: pip install llama-index-embeddings-huggingface
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Build Vector Store Index

In [15]:
from llama_index.core import VectorStoreIndex
vector_store_index = VectorStoreIndex.from_documents(documents, embed_model=embed_model, llm=gpt35, show_progress=False)

In [17]:
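# Pass the LLM explicitly; otherwise the query engine falls back to the globally configured default.
vector_store_query_engine = vector_store_index.as_query_engine(llm=gpt35)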
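
Conceptually, a vector store index embeds each chunk and returns the chunks whose embeddings are closest to the query embedding. The toy sketch below illustrates that idea with hand-written 3-dimensional vectors and cosine similarity. It is not how `VectorStoreIndex` is implemented, and the vectors and chunk names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Rank chunk ids by cosine similarity to the query, highest first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Made-up embeddings standing in for real model output.
chunk_vecs = {
    "chunk_scalability": [0.9, 0.1, 0.0],
    "chunk_consensus": [0.2, 0.8, 0.1],
    "chunk_authors": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.3, 0.0]
best = top_k(query_vec, chunk_vecs, k=2)
print(best)
```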

# Build Keyword Table Index

In [18]:
from llama_index.core.indices import SimpleKeywordTableIndex
keyword_table_index = SimpleKeywordTableIndex.from_documents(
    documents,
    embed_model=embed_model, llm=gpt35,
    show_progress=False
)

In [19]:
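# Pass the LLM explicitly; otherwise the query engine falls back to the globally configured default.
keyword_table_query_engine = keyword_table_index.as_query_engine(llm=gpt35)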
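
A keyword table index takes a different route from embedding search: it maps keywords to the chunks that contain them and retrieves chunks that share keywords with the query. The sketch below is a deliberately simplified illustration. The regex tokenizer, the tiny stop-word list, and the sample chunks are all assumptions of this example; the actual extraction and ranking logic in `SimpleKeywordTableIndex` may differ.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption of this sketch).
STOP_WORDS = {"the", "a", "of", "and", "is", "how", "does", "in", "to", "on"}


def extract_keywords(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def build_keyword_table(chunks):
    """Map each keyword to the set of chunk ids that contain it."""
    table = defaultdict(set)
    for chunk_id, text in chunks.items():
        for keyword in extract_keywords(text):
            table[keyword].add(chunk_id)
    return table


def retrieve(query, table, k=2):
    """Return up to k chunk ids ranked by the number of shared keywords."""
    hits = Counter()
    for keyword in extract_keywords(query):
        for chunk_id in table.get(keyword, ()):
            hits[chunk_id] += 1
    return [chunk_id for chunk_id, _ in hits.most_common(k)]


chunks = {
    "c1": "Solana improves scalability and performance",
    "c2": "Bitcoin relies on proof of work consensus",
    "c3": "Enterprise adoption depends on scalability",
}
table = build_keyword_table(chunks)
result = retrieve("How does Solana address scalability?", table)
print(result)
```

Because matching is lexical, a query phrased with synonyms of the indexed keywords would retrieve nothing in this toy version, which is one reason to compare the two index types empirically.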

# Download `RagEvaluatorPack`

In [20]:
from llama_index.packs.rag_evaluator import RagEvaluatorPack

In [21]:
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=rag_dataset,
    query_engine=vector_store_query_engine
)

# Evaluating Vector Store Index

Compute the metrics for the responses generated by the vector store query engine.

In [22]:
gpt4 = OpenAI(model='gpt-4o')

In [23]:
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=rag_dataset,
    query_engine=vector_store_query_engine,
    judge_llm=gpt4
)

vector_benchmark_df = await rag_evaluator_pack.arun(
    batch_size=10,  # batches the number of openai api calls to make
    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
)

Batch processing of predictions: 100%|██████████| 10/10 [00:05<00:00,  1.70it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:08<00:00,  1.24it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:05<00:00,  1.72it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:05<00:00,  1.73it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:22<00:00,  2.21s/it]
Batch processing of predictions: 100%|██████████| 8/8 [00:05<00:00,  1.35it/s]
Batch processing of evaluations:  95%|█████████▌| 29/30.5 [03:57<00:12,  8.19s/it]
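
The prediction progress bars above are consistent with the 58 examples being processed in chunks of `batch_size=10`, with a pause before each chunk. The sketch below shows that general pattern with `asyncio`. It is not the pack's actual implementation, and `fake_predict` is a stand-in for a real query engine call.

```python
import asyncio


def make_batches(items, batch_size):
    """Split `items` into consecutive lists of at most `batch_size` elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


async def fake_predict(query):
    """Stand-in for an asynchronous query engine call."""
    await asyncio.sleep(0)
    return f"answer to {query}"


async def run_in_batches(queries, batch_size, sleep_time_in_seconds):
    results = []
    for batch in make_batches(queries, batch_size):
        # Pause before each batch to stay under API rate limits.
        await asyncio.sleep(sleep_time_in_seconds)
        # Calls within a batch run concurrently.
        results.extend(await asyncio.gather(*(fake_predict(q) for q in batch)))
    return results


queries = [f"q{i}" for i in range(58)]
batch_sizes = [len(batch) for batch in make_batches(queries, 10)]
answers = asyncio.run(run_in_batches(queries, batch_size=10, sleep_time_in_seconds=0))
print(batch_sizes, len(answers))
```

Inside a notebook cell, `await run_in_batches(...)` would replace the `asyncio.run(...)` call. A larger `batch_size` finishes sooner but is more likely to hit rate limits.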


In [24]:
vector_benchmark_df.columns = ['VectorStore Index']

vector_benchmark_df

metrics,VectorStore Index
mean_correctness_score,4.413793
mean_relevancy_score,0.982759
mean_faithfulness_score,0.982759
mean_context_similarity_score,0.937571
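
If relevancy and faithfulness are scored as pass/fail per example (an assumption about how the judge reports them) and all 58 examples were scored, each mean can be read back as a count of passing examples. The arithmetic below is a sketch under those two assumptions.

```python
N_EXAMPLES = 58  # len(eval_queries) reported earlier in the notebook


def implied_passes(mean_score, n_examples=N_EXAMPLES):
    """Convert a mean of 0/1 scores back to the nearest whole number of passes."""
    return round(mean_score * n_examples)


relevancy_passes = implied_passes(0.982759)
faithfulness_passes = implied_passes(0.982759)
print(f"relevancy: {relevancy_passes}/{N_EXAMPLES}, "
      f"faithfulness: {faithfulness_passes}/{N_EXAMPLES}")
```

Reading the means as counts makes it easier to see how many individual examples separate two configurations.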


# Evaluating Keyword Table Index

Compute the same metrics for the responses generated by the keyword table query engine.

In [25]:
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=rag_dataset,
    query_engine=keyword_table_query_engine,
    judge_llm=gpt4
)

keyword_table_benchmark_df = await rag_evaluator_pack.arun(
    batch_size=10,  # batches the number of openai api calls to make
    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
)

Batch processing of predictions: 100%|██████████| 10/10 [00:22<00:00,  2.29s/it]
Batch processing of predictions: 100%|██████████| 10/10 [00:05<00:00,  1.76it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:07<00:00,  1.43it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:05<00:00,  1.98it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:04<00:00,  2.12it/s]
Batch processing of predictions: 100%|██████████| 8/8 [00:05<00:00,  1.37it/s]
Batch processing of evaluations:  95%|█████████▌| 29/30.5 [04:32<00:14,  9.40s/it]


In [26]:
keyword_table_benchmark_df.columns = ['Keyword Table Index']

keyword_table_benchmark_df

metrics,Keyword Table Index
mean_correctness_score,4.362069
mean_relevancy_score,0.913793
mean_faithfulness_score,0.965517
mean_context_similarity_score,0.922394


# Display results

In [27]:
import pandas as pd

results_df = pd.concat([vector_benchmark_df.T, keyword_table_benchmark_df.T])


In [28]:
results_df

metrics,mean_correctness_score,mean_relevancy_score,mean_faithfulness_score,mean_context_similarity_score
VectorStore Index,4.413793,0.982759,0.982759,0.937571
Keyword Table Index,4.362069,0.913793,0.965517,0.922394
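
To make the comparison explicit, the per-metric differences between the two rows can be computed. The sketch below copies the values from the table above into plain dictionaries so that it runs without pandas; with `results_df` at hand, subtracting the two rows would be an equivalent route.

```python
# Values copied from the results table above.
vector_scores = {
    "mean_correctness_score": 4.413793,
    "mean_relevancy_score": 0.982759,
    "mean_faithfulness_score": 0.982759,
    "mean_context_similarity_score": 0.937571,
}
keyword_scores = {
    "mean_correctness_score": 4.362069,
    "mean_relevancy_score": 0.913793,
    "mean_faithfulness_score": 0.965517,
    "mean_context_similarity_score": 0.922394,
}

# Positive delta means the vector store index scored higher on that metric.
deltas = {metric: round(vector_scores[metric] - keyword_scores[metric], 6)
          for metric in vector_scores}
vector_wins = [metric for metric, delta in deltas.items() if delta > 0]

for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.6f}")
```

The correctness score appears to be on a different scale from the other three metrics, so deltas are best compared within a metric and not across metrics. On this run the gaps are small, and a single evaluation pass does not show whether they would hold up on repeated runs.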
