<a href="https://colab.research.google.com/github/aishwaryaprabhat/Advanced-RAG/blob/main/Advanced_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Download and Environment Preparation

In [1]:
!bash download_dataset.sh # get from https://github.com/aishwaryaprabhat/Advanced-RAG/blob/main/download_dataset.sh

Cloning into 'DataRepository'...
remote: Enumerating objects: 47, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 47 (delta 12), reused 21 (delta 7), pack-reused 8
Receiving objects: 100% (47/47), 49.80 MiB | 11.03 MiB/s, done.
Resolving deltas: 100% (12/12), done.
Updating files: 100% (25/25), done.
Archive:  DataRepository/high-performance-rag/Camel Papers Test.zip
  inflating: source_docs/Acute respiratory distress syndrome in an alpaca cria.pdf  
  inflating: source_docs/Alpaca liveweight variations and fiber production in Mediterranean range of Chile.pdf  
Archive:  DataRepository/high-performance-rag/Camel Papers Train.zip
  inflating: source_docs/Antibody response to the epsilon toxin ofClostridium perfringensfollowing vaccination of Lama glamacrias.pdf  
  inflating: source_docs/Comparative pigmentation of sheep, goats, and llamas what colors are possible through selection.pdf  
  inflating: source_d

In [15]:
# %pip install llama-index pypdf sentence_transformers typing_extensions==4.7.1 nest_asyncio -U -q
%pip install --upgrade -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [1]:
import os
# from google.colab import userdata
import nest_asyncio

nest_asyncio.apply()
os.environ['OPENAI_API_KEY'] = 'sk-*'
os.environ['HUGGINGFACE_API_TOKEN'] = 'hf_*'

In [2]:
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import OpenAI
from llama_index import SimpleDirectoryReader

# Initialize an embedding model from Hugging Face using the "BAAI/bge-small-en" model.
embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")

# Create an OpenAI GPT-3.5 model instance with no randomness in responses (temperature=0).
llm = OpenAI(model="gpt-3.5-turbo", temperature=0, api_key=os.environ['OPENAI_API_KEY'])

# Load data from a directory named 'source_docs' using SimpleDirectoryReader.
source_docs = SimpleDirectoryReader('source_docs').load_data()


# Advanced RAG Techniques

## Baseline 'Vanilla' RAG

### Parse source_docs into nodes

In [3]:
from llama_index.node_parser import SimpleNodeParser

# Create a SimpleNodeParser instance with default settings, but with specified chunk overlap and size.
baseline_parser = SimpleNodeParser.from_defaults(
    chunk_overlap=200,
    chunk_size=1024
)

# Use the parser to extract nodes from the documents in 'source_docs'.
baseline_nodes = baseline_parser.get_nodes_from_documents(source_docs)
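
To make the two parser settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the `SimpleNodeParser` implementation (which also tries to respect sentence boundaries); `chunk_tokens` and the toy token list are hypothetical names used only for illustration.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Toy input: 10 tokens, windows of 4 tokens that overlap by 2.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

With `chunk_size=1024` and `chunk_overlap=200`, as in the cell above, each chunk would repeat roughly the last 200 tokens of the previous one, so a statement that straddles a boundary still appears whole in at least one chunk.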

In [4]:
from llama_index import VectorStoreIndex
from llama_index import ServiceContext

# Create a ServiceContext with default settings, including the previously defined language model (llm), embedding model, and node parser.
baseline_context = ServiceContext.from_defaults(llm=llm, embed_model=embedding_model, node_parser=baseline_parser)

# Initialize a VectorStoreIndex with the baseline nodes and the service context.
baseline_index = VectorStoreIndex(baseline_nodes, service_context=baseline_context)

# Persist the baseline index in a directory named "baseline_index".
baseline_index.storage_context.persist(persist_dir="baseline_index")

In [5]:
# Convert the baseline index into a query engine capable of finding the top 3 most similar entries.
baseline_query_engine = baseline_index.as_query_engine(similarity_top_k=3)

# Perform a query with the baseline query engine asking about the influence of camelid genetics on wool quality.
baseline_response = baseline_query_engine.query("How do camelid genetics influence wool quality?")

# Retrieve the response from the query.
baseline_response.response

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


'Camelid genetics can influence wool quality. The inheritance of coat colors in alpacas and llamas, which are types of camelids, has been studied. Additionally, major genes affecting alpaca fiber traits have been analyzed. The expression patterns of keratin intermediate filament and keratin associated protein genes in wool follicles have also been investigated. These studies suggest that genetic factors play a role in determining the quality of wool in camelids.'
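
The `similarity_top_k=3` setting can be pictured as ranking every node embedding by its similarity to the query embedding and keeping the three best. The sketch below assumes cosine similarity and uses hand-made 3-dimensional vectors; the node names and numbers are hypothetical, and the real index works on the embeddings produced by `BAAI/bge-small-en`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=3):
    """Return the ids of the k nodes most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), node_id) for node_id, vec in node_vecs.items()]
    scored.sort(reverse=True)
    return [node_id for _, node_id in scored[:k]]


# Hypothetical toy embeddings, not real model output.
toy_vectors = {
    "fiber_genes": [0.9, 0.1, 0.0],
    "coat_color": [0.7, 0.3, 0.1],
    "cortisol": [0.0, 0.2, 0.9],
    "vaccination": [0.1, 0.9, 0.2],
}
query = [1.0, 0.2, 0.0]
result = top_k(query, toy_vectors, k=3)
print(result)
```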

## Sentence Window Parser

In [6]:
from llama_index.node_parser import SentenceWindowNodeParser

# Initialize a SentenceWindowNodeParser that stores, for each sentence, a window of up to 6 sentences on either side under the given metadata keys.
sentence_parser = SentenceWindowNodeParser.from_defaults(
    window_size=6,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# Parse nodes from the documents in 'source_docs' using the sentence parser.
sentence_nodes = sentence_parser.get_nodes_from_documents(source_docs)

# Create a ServiceContext using the sentence parser along with the previously defined language model and embedding model.
sentence_context = ServiceContext.from_defaults(llm=llm, embed_model=embedding_model, node_parser=sentence_parser)
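
The idea behind the sentence window parser can be sketched as follows: each node holds one sentence, and its metadata carries the surrounding sentences. This is a simplified illustration under stated assumptions, not the library's code; the regex sentence splitter and the dictionary node layout are hypothetical.

```python
import re


def sentence_windows(text, window_size):
    """Build one node per sentence, storing neighbouring sentences in metadata."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sentence,  # what gets embedded and matched
            "metadata": {
                "original_text": sentence,
                "window": " ".join(sentences[lo:hi]),  # context kept for later
            },
        })
    return nodes


nodes = sentence_windows("A one. B two. C three. D four. E five.", window_size=1)
for node in nodes:
    print(node["text"], "->", node["metadata"]["window"])
```

Embedding single sentences keeps the match precise, while the stored window preserves enough context for the LLM to answer from.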

In [7]:
from llama_index import VectorStoreIndex

# Create a VectorStoreIndex with the parsed sentence nodes and the defined service context.
sentence_index = VectorStoreIndex(sentence_nodes, service_context=sentence_context)

# Persist the sentence index in a directory named "sentence_index" for future use.
sentence_index.storage_context.persist(persist_dir="sentence_index")

In [8]:
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

# Convert the sentence index into a query engine, configuring it to find the top 3 most similar entries.
# It also uses a postprocessor that replaces each retrieved node's text with the sentence window stored under its 'window' metadata key.
sentence_query_engine = sentence_index.as_query_engine(
    similarity_top_k=3,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)

# Perform a query using the sentence query engine about the influence of camelid genetics on wool quality.
sentence_response = sentence_query_engine.query("How do camelid genetics influence wool quality?")

# Retrieve the response from the query.
sentence_response.response

'Camelid genetics influence wool quality through various mechanisms. One important aspect is coat color genetics, where llamas and alpacas exhibit a wide range of natural colors and patterns. Llamas, in particular, have greater color variation compared to alpacas. This variation is attributed to the selection process during domestication, where llamas were primarily selected for body size and fiber weight rather than color uniformity or fiber fineness. \n\nAdditionally, the composition and interactions of keratin intermediate filaments (KIFs) and keratin-associated proteins (KAPs) play a crucial role in determining fiber characteristics. Fiber growth in mammals, including camelids, is a cyclical process regulated by genetics, nutrition, and hormones. The proteins that form the fiber are encoded by keratin genes (KRT) and keratin-associated proteins (KRTAP), which are expressed in a highly regulated manner during hair follicle growth. \n\nFurthermore, genetic selection programs have bee
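
The postprocessing step used above can be sketched in a few lines: after retrieval, each node's single-sentence text is swapped for the window stored in its metadata before the text reaches the LLM. The dictionary node layout and the function name `replace_with_window` are hypothetical stand-ins, not the `MetadataReplacementPostProcessor` API.

```python
def replace_with_window(retrieved_nodes, target_metadata_key="window"):
    """Return copies of the nodes whose text is replaced by the stored window."""
    replaced = []
    for node in retrieved_nodes:
        # Fall back to the original text when the metadata key is missing.
        new_text = node["metadata"].get(target_metadata_key, node["text"])
        replaced.append({**node, "text": new_text})
    return replaced


retrieved = [
    {"text": "B two.", "metadata": {"window": "A one. B two. C three."}},
    {"text": "No window here.", "metadata": {}},
]
replaced = replace_with_window(retrieved)
for node in replaced:
    print(node["text"])
```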

## Automerging Retrieval (Using Hierarchical Nodes)

In [9]:
from llama_index.node_parser import HierarchicalNodeParser

# Initialize a HierarchicalNodeParser with default settings.
hierarchical_parser = HierarchicalNodeParser.from_defaults()

# Parse nodes from the documents in 'source_docs' using the hierarchical parser.
hierarchical_nodes = hierarchical_parser.get_nodes_from_documents(source_docs)

# Create a ServiceContext using the hierarchical parser along with the previously defined language model and embedding model.
hierarchical_context = ServiceContext.from_defaults(llm=llm, embed_model=embedding_model, node_parser=hierarchical_parser)
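
A hierarchical parser produces chunks at several sizes and links each smaller chunk to the larger chunk it came from. The sketch below illustrates that structure with toy chunk sizes of 8, 4, and 2 tokens; `build_hierarchy` and the node dictionaries are hypothetical, and the library's own default sizes and node classes differ.

```python
def build_hierarchy(tokens, chunk_sizes):
    """Recursively split tokens into nested chunks, recording each chunk's parent."""
    nodes = []

    def split(span_tokens, level, parent_id):
        size = chunk_sizes[level]
        for start in range(0, len(span_tokens), size):
            piece = span_tokens[start:start + size]
            node_id = len(nodes)  # ids double as positions in the nodes list
            nodes.append({"id": node_id, "parent": parent_id, "level": level, "tokens": piece})
            if level + 1 < len(chunk_sizes):
                split(piece, level + 1, node_id)

    split(tokens, 0, None)
    return nodes


tokens = list(range(16))
nodes = build_hierarchy(tokens, chunk_sizes=[8, 4, 2])
leaves = [n for n in nodes if n["level"] == 2]
print(len(nodes), "nodes,", len(leaves), "leaves")
```

The leaves are the small chunks that match queries precisely; the parent links are what make merging back into larger context possible at retrieval time.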

In [10]:
from llama_index import VectorStoreIndex, StorageContext

# Create a VectorStoreIndex with the parsed hierarchical nodes and the specified service context.
hierarchical_index = VectorStoreIndex(hierarchical_nodes, service_context=hierarchical_context)

# Persist the hierarchical index in a directory named "hierarchical_index" for future use.
hierarchical_index.storage_context.persist(persist_dir="hierarchical_index")

In [11]:
from llama_index.retrievers.auto_merging_retriever import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Initialize an AutoMergingRetriever with the hierarchical index set as the retriever, configured for top 3 similarity matches.
retriever = AutoMergingRetriever(hierarchical_index.as_retriever(similarity_top_k=3), storage_context=hierarchical_index.storage_context, verbose=True)

# Create a RetrieverQueryEngine using the AutoMergingRetriever.
amretriever_query_engine = RetrieverQueryEngine.from_args(retriever)

# Perform a query using the AMRetriever query engine about the influence of camelid genetics on wool quality.
amretriever_response = amretriever_query_engine.query("How do camelid genetics influence wool quality?")

# Retrieve the response from the query.
amretriever_response.response

'Camelid genetics play a significant role in determining wool quality. While there is still much to be understood in this field, recent advancements in genetic understanding have shed light on the genetic mechanisms that regulate economically important fiber traits in South American camelids. Mutations responsible for some monogenic or oligogenic traits have been identified, enabling molecular testing to assist breeding decisions. Additionally, the development of a 76K SNPs array for the alpaca will facilitate the identification of genes affecting more complex traits through genome-wide association studies. These advancements in genomics and the discovery of genetic variants are expected to contribute to the improvement of wool quality in camelids.'
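
The merging step can be sketched as: when enough of a parent's children are retrieved, return the parent instead of the individual children. The version below is a single-pass illustration with a hypothetical ratio threshold of 0.5 and hypothetical node ids; the library's retriever has its own threshold handling and may repeat the merge up the hierarchy.

```python
from collections import defaultdict


def auto_merge(retrieved_ids, parent_of, children_of, ratio_thresh=0.5):
    """Replace retrieved children by their parent when enough siblings were retrieved."""
    hits_per_parent = defaultdict(set)
    for node_id in retrieved_ids:
        parent = parent_of.get(node_id)
        if parent is not None:
            hits_per_parent[parent].add(node_id)

    # A parent is merged in when the retrieved share of its children exceeds the threshold.
    merge_parents = {
        parent
        for parent, hits in hits_per_parent.items()
        if len(hits) / len(children_of[parent]) > ratio_thresh
    }

    result = []
    for node_id in retrieved_ids:
        parent = parent_of.get(node_id)
        target = parent if parent in merge_parents else node_id
        if target not in result:  # keep retrieval order, drop duplicates
            result.append(target)
    return result


children_of = {"P1": ["a", "b", "c"], "P2": ["d", "e", "f", "g"]}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}
merged = auto_merge(["a", "b", "d"], parent_of, children_of)
print(merged)
```

Here two of P1's three children were retrieved, so they collapse into P1, while the single hit under P2 stays as a leaf.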

# Evaluating RAG Performance

### Creating the dataset that will be used for evaluation of RAG methods

In [33]:
import random
from llama_index.evaluation import (DatasetGenerator, QueryResponseDataset)
from llama_index.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator
)

# Set the number of nodes to be used for evaluation.
num_nodes_eval = 30

# Randomly select a sample of nodes from baseline_nodes for evaluation.
sample_eval_nodes = random.sample(baseline_nodes, num_nodes_eval)

# Initialize a dataset generator with the sampled nodes, baseline service context, progress display enabled,
# and generating 3 questions per chunk.
dataset_generator = DatasetGenerator(
    sample_eval_nodes,
    service_context=baseline_context,
    show_progress=True,
    num_questions_per_chunk=3,
)

# Asynchronously generate the evaluation dataset from the nodes.
evaluation_dataset = await dataset_generator.agenerate_dataset_from_nodes()

# Initialize evaluators with the baseline service context.
correctness = CorrectnessEvaluator(service_context=baseline_context)
semanticsimilarity = SemanticSimilarityEvaluator(service_context=baseline_context)
relevancy = RelevancyEvaluator(service_context=baseline_context)
faithfulness = FaithfulnessEvaluator(service_context=baseline_context)

100%|██████████| 30/30 [00:06<00:00,  4.97it/s]
100%|██████████| 3/3 [00:03<00:00,  1.0
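
Because `random.sample` draws a different subset on every run, the evaluation questions (and therefore the scores) can change between runs. One possible way to make the sample repeatable is a seeded generator, sketched below; `sample_nodes` and the seed value 42 are hypothetical choices, not part of the original notebook.

```python
import random


def sample_nodes(nodes, k, seed=42):
    """Draw up to k nodes with a private, seeded generator so reruns match."""
    rng = random.Random(seed)  # does not disturb the global random state
    return rng.sample(nodes, min(k, len(nodes)))


# Stand-in for baseline_nodes: any list works.
toy_nodes = [f"node_{i}" for i in range(100)]
first = sample_nodes(toy_nodes, 30)
second = sample_nodes(toy_nodes, 30)
print(first == second)
```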

In [34]:
import numpy as np
from llama_index.evaluation import BatchEvalRunner

# Define the maximum number of samples to use for evaluation.
max_samples = 10

# Extract evaluation questions from the evaluation dataset.
evaluation_questions = evaluation_dataset.questions

# Compile expected responses from the question-response pairs in the evaluation dataset.
expected_responses = [response for (question, response) in evaluation_dataset.qr_pairs]

# Create a dictionary mapping evaluation types to their respective evaluator objects.
evaluator_dict = {
    "correctness": correctness,
    "faithfulness": faithfulness,
    "relevancy": relevancy,
    "semanticsimilarity": semanticsimilarity,
}

# Initialize a BatchEvalRunner with the evaluator dictionary, specifying 2 workers and progress display.
batch_eval_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)

### Evaluate Baseline RAG

In [35]:
from llama_index.evaluation.eval_utils import get_responses, get_results_df

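# Generate answers to the first max_samples evaluation questions with the baseline query engine.
baseline_responses = get_responses(
    evaluation_questions[:max_samples],
    baseline_index.as_query_engine(similarity_top_k=3),
    show_progress=True,
)

# Score every answer with all four evaluators, using the generated answers as references.
baseline_evaluation_results = await batch_eval_runner.aevaluate_responses(
    queries=evaluation_questions[:max_samples],
    responses=baseline_responses[:max_samples],
    reference=expected_responses[:max_samples],
)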




100%|██████████| 10/10 [00:07<00:00,  1.34it/s]

In [36]:
results_df = get_results_df(
    [baseline_evaluation_results],
    ['Baseline RAG'],
    ["correctness", "relevancy", "faithfulness", "semanticsimilarity"],
)

results_df.rename(columns={'names': 'RAG Method'}, inplace=True)

results_df

| | RAG Method | correctness | relevancy | faithfulness | semanticsimilarity |
|---|---|---|---|---|---|
| 0 | Baseline RAG | 4.05 | 0.7 | 0.7 | 0.971417 |
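
A note on reading these numbers, assuming each column is the mean of the per-question scores returned by that evaluator: correctness is graded on a 1 to 5 scale, while relevancy and faithfulness are pass/fail, so their means are pass rates. The per-question values below are hypothetical and only show how a rate of 0.7 over 10 questions arises.

```python
def mean_score(scores):
    """Average a list of per-question scores."""
    return sum(scores) / len(scores)


# Hypothetical pass/fail outcomes for 10 questions (1 = pass, 0 = fail).
relevancy_passes = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
relevancy_mean = mean_score(relevancy_passes)
print(relevancy_mean)
```

With 10 questions, each pass/fail metric can only move in steps of 0.1, which is worth keeping in mind when comparing methods.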


### Evaluating Sentence Window Retrieval

In [37]:
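# Generate answers to the same questions from the sentence index.
# Note: this engine is built without the MetadataReplacementPostProcessor, so answers are
# generated from the retrieved sentences alone; pass sentence_query_engine instead to
# evaluate retrieval with the surrounding sentence windows.
sentence_responses = get_responses(
    evaluation_questions[:max_samples],
    sentence_index.as_query_engine(similarity_top_k=3),
    show_progress=True,
)

# Score every answer with all four evaluators, using the generated answers as references.
sentence_evaluation_results = await batch_eval_runner.aevaluate_responses(
    queries=evaluation_questions[:max_samples],
    responses=sentence_responses[:max_samples],
    reference=expected_responses[:max_samples],
)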




100%|██████████| 10/10 [00:06<00:00,  1.65it/s]

In [38]:
results_df = get_results_df(
    [sentence_evaluation_results],
    ['Sentence Window Retrieval'],
    ["correctness", "relevancy", "faithfulness", "semanticsimilarity"],
)

results_df.rename(columns={'names': 'RAG Method'}, inplace=True)

results_df

| | RAG Method | correctness | relevancy | faithfulness | semanticsimilarity |
|---|---|---|---|---|---|
| 0 | Sentence Window Retrieval | 3.85 | 0.9 | 1.0 | 0.968124 |


### Evaluating Automerging Retrieval

In [39]:
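# Generate answers to the same questions with the auto-merging query engine.
amr_responses = get_responses(
    evaluation_questions[:max_samples],
    amretriever_query_engine,
    show_progress=True,
)

# Score every answer with all four evaluators, using the generated answers as references.
amr_evaluation_results = await batch_eval_runner.aevaluate_responses(
    queries=evaluation_questions[:max_samples],
    responses=amr_responses[:max_samples],
    reference=expected_responses[:max_samples],
)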




100%|██████████| 10/10 [00:08<00:00,  1.18it/s]

> Merging 1 nodes into parent node.
> Parent node id: 2f1ac003-dd48-497c-b343-0349d5916c7a.
> Parent node text: The effects of different capture methods on cortisol are summarised in Table 2. The response to c...



In [40]:
results_df = get_results_df(
    [amr_evaluation_results],
    ['Automerging Retrieval'],
    ["correctness", "relevancy", "faithfulness", "semanticsimilarity"],
)

results_df.rename(columns={'names': 'RAG Method'}, inplace=True)

results_df

| | RAG Method | correctness | relevancy | faithfulness | semanticsimilarity |
|---|---|---|---|---|---|
| 0 | Automerging Retrieval | 4.2 | 0.9 | 0.8 | 0.982428 |


### Summary of Results

In [41]:
results_df = get_results_df(
    [baseline_evaluation_results, sentence_evaluation_results, amr_evaluation_results],
    ['Baseline RAG', 'Sentence Window Retrieval', 'Automerging Retrieval'],
    ["correctness", "relevancy", "faithfulness", "semanticsimilarity"],
)

results_df.rename(columns={'names': 'RAG Method'}, inplace=True)

results_df

| | RAG Method | correctness | relevancy | faithfulness | semanticsimilarity |
|---|---|---|---|---|---|
| 0 | Baseline RAG | 4.05 | 0.7 | 0.7 | 0.971417 |
| 1 | Sentence Window Retrieval | 3.85 | 0.9 | 1.0 | 0.968124 |
| 2 | Automerging Retrieval | 4.2 | 0.9 | 0.8 | 0.982428 |
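
To read the summary at a glance, the sketch below picks the best-scoring method for each metric from the values reported above, listing every method on a tie. The helper `best_methods` is a hypothetical name; the numbers are copied from the summary table.

```python
results = {
    "Baseline RAG": {"correctness": 4.05, "relevancy": 0.7, "faithfulness": 0.7, "semanticsimilarity": 0.971417},
    "Sentence Window Retrieval": {"correctness": 3.85, "relevancy": 0.9, "faithfulness": 1.0, "semanticsimilarity": 0.968124},
    "Automerging Retrieval": {"correctness": 4.2, "relevancy": 0.9, "faithfulness": 0.8, "semanticsimilarity": 0.982428},
}


def best_methods(results, metric):
    """Return every method that reaches the top score for the given metric."""
    top = max(scores[metric] for scores in results.values())
    return [name for name, scores in results.items() if scores[metric] == top]


for metric in ["correctness", "relevancy", "faithfulness", "semanticsimilarity"]:
    print(metric, "->", best_methods(results, metric))
```

Since only 10 questions were scored per method, small gaps between rows are within what a different random sample of questions could produce.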


# Tracking RAG Evaluation Results on MLflow

In [22]:
%pip install mlflow azureml-mlflow -U -q

Note: you may need to restart the kernel to use updated packages.


In [45]:
from azureml.core import Workspace

import mlflow

# Load the Azure ML workspace from the local config.json and point MLflow at its tracking server.
ws = Workspace.from_config()
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

# Group all runs under a single experiment.
mlflow.set_experiment("advanced_rag")

# Log one run per RAG method, with one metric per evaluator.
for index, row in results_df.iterrows():
    with mlflow.start_run(run_name=f"{row['RAG Method']}"):
        for metric in ["correctness", "relevancy", "faithfulness", "semanticsimilarity"]:
            mlflow.log_metric(metric, row[metric])

2024/01/10 05:52:48 INFO mlflow.tracking.fluent: Experiment with name 'advanced_rag' does not exist. Creating a new experiment.
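
For readers without an Azure ML workspace, a possible variant is to log to MLflow's default local store instead. The sketch below separates building the per-run payloads (plain Python) from the logging call; `build_run_payloads` and `log_to_mlflow` are hypothetical helpers, and the assumption is that leaving the tracking URI unset makes MLflow write to a local directory.

```python
def build_run_payloads(rows, metrics):
    """Turn result rows into (run_name, metric_dict) pairs ready for logging."""
    return [(row["RAG Method"], {m: float(row[m]) for m in metrics}) for row in rows]


def log_to_mlflow(payloads, experiment="advanced_rag"):
    """Log one MLflow run per payload; requires the mlflow package."""
    import mlflow  # imported lazily so the payload helper works without mlflow installed

    mlflow.set_experiment(experiment)
    for run_name, metric_values in payloads:
        with mlflow.start_run(run_name=run_name):
            mlflow.log_metrics(metric_values)  # logs the whole dict in one call


metrics = ["correctness", "relevancy", "faithfulness", "semanticsimilarity"]
rows = [
    {"RAG Method": "Baseline RAG", "correctness": 4.05, "relevancy": 0.7,
     "faithfulness": 0.7, "semanticsimilarity": 0.971417},
]
payloads = build_run_payloads(rows, metrics)
print(payloads)
# log_to_mlflow(payloads)  # uncomment to write the runs
```

With a DataFrame, the rows could be obtained with `results_df.to_dict("records")`.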
