# Retrieval Augmented Generation
This notebook introduces Retrieval Augmented Generation (RAG). In particular, we will learn how to improve a RAG pipeline by comparing different embeddings, LLMs, etc.

The focus here is the methodology for improving your RAG using a north-star metric (e.g. the Ragas score). Your actual use case might differ, and so might the best chunking, prompt, embedding, retrieval strategy, and LLM for it. Use this methodology to pick the best combination for your use case.

For this tutorial, we'll use LlamaIndex, which provides some abstractions over underlying APIs used for building LLM applications. For more about RAG and LlamaIndex's definitions, see [here](https://docs.llamaindex.ai/en/stable/getting_started/concepts.html).

In [1]:
!pip install ragas==0.0.22


In [2]:
import ragas

ragas.__version__

'0.0.22'

In [3]:
import hashlib
import os
from glob import glob

from datasets import DatasetDict

# Create a directory to store the content
content_folder = os.path.join(os.path.abspath(""), ".content/")
documents_folder = os.path.join(os.path.abspath(""), ".content/docs/")
os.makedirs(documents_folder, exist_ok=True)

NUM_DOCUMENTS = 100


# Function to save article content to a file
def save_article_content(text, folder):
    try:
        # Name the file after the MD5 checksum of its text
        checksum = hashlib.md5(text.encode("utf-8")).hexdigest()
        file_path = os.path.join(folder, checksum + ".txt")
        with open(file_path, "w") as file:
            file.write(text)
        return file_path
    except Exception as e:
        print(e)
        return str(e)


dataset = DatasetDict.load_from_disk(f"{content_folder}/rag_sciq_data.hf")
# len() of a DatasetDict counts its splits, not its rows
print(f"Dataset contains {len(dataset)} splits")

# Saving the content of each train set document in a file
saved_files = []
for row in dataset["train"]:
    if row["support"]:
        saved_files.append(save_article_content(row["support"], documents_folder))
    if NUM_DOCUMENTS and len(saved_files) >= NUM_DOCUMENTS:
        break

# We'll load documents that we've already downloaded in the Synthetic Dataset for RAG
data_dir = os.path.join(os.path.abspath(""), ".content/docs")
input_files = glob(os.path.join(data_dir, "*.txt"))
print(f"{len(input_files)} files in folder: {input_files[0]}, ...")

Dataset contains 3 splits
100 files in folder: /Users/rahulparundekar/workspaces/course-openai-api/rag/.content/docs/1e41916248531ab7c35d0c9895b9e097.txt, ...
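
The cell above names each file after the MD5 checksum of its text. As a minimal sketch of what that implies (the helper `checksum_name` and the sample sentences are made up for illustration), identical passages map to the same file name, so a repeated passage would overwrite its earlier copy rather than create a second file:

```python
import hashlib


def checksum_name(text: str) -> str:
    """Return the file name that the saving logic above would derive for `text`."""
    return hashlib.md5(text.encode("utf-8")).hexdigest() + ".txt"


# Hypothetical sample passages, not taken from the dataset
name_a = checksum_name("Wetlands slow down and filter runoff.")
name_b = checksum_name("Wetlands slow down and filter runoff.")
name_c = checksum_name("Most fish have a swim bladder.")

print(name_a == name_b)  # same text, same file name
print(name_a == name_c)  # different text, different file name
```

If the dataset contained duplicate passages, the number of files on disk could be smaller than `len(saved_files)`; in this run the folder holds 100 files.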


In [4]:
from dotenv import load_dotenv

load_dotenv()

True

## A simple RAG with LlamaIndex (using defaults)

In [5]:
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import OpenAI

First, build the index from all the documents we have, using the default chunking and embedding strategies.

In [6]:
service_context = ServiceContext.from_defaults(llm=OpenAI())
# input_files already lists the files; show_progress=True keeps the loading bar
documents = SimpleDirectoryReader(input_files=input_files).load_data(show_progress=True)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

Loading files: 100%|██████████| 100/100 [00:00<00:00, 3183.80file/s]
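
Building the index relies on a chunking step that we did not configure. To make the idea concrete, here is a minimal sketch of fixed-size chunking with overlap. This is not LlamaIndex's implementation: the function `chunk_words`, its word-based splitting, and the sizes used are assumptions chosen only for illustration.

```python
def chunk_words(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Hypothetical 50-word document: w0 w1 ... w49
toy_text = " ".join(f"w{i}" for i in range(50))
toy_chunks = chunk_words(toy_text, chunk_size=20, overlap=5)
for chunk in toy_chunks:
    print(chunk.split()[0], "...", chunk.split()[-1])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice. Chunk size and overlap are among the knobs you could evaluate with the methodology in this notebook.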


In [7]:
query_engine = index.as_query_engine()
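
Conceptually, a query engine over a vector index embeds the question, finds the most similar chunks, and passes them to the LLM as context. The retrieval step can be sketched with plain cosine similarity. The three-dimensional vectors and chunk labels below are invented for illustration; real embeddings have hundreds or thousands of dimensions, and this is not LlamaIndex's code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the labels of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Hypothetical chunk embeddings (made-up numbers)
toy_chunk_vecs = {
    "dead zones": [0.9, 0.1, 0.0],
    "swim bladder": [0.1, 0.9, 0.1],
    "nitrogen cycle": [0.6, 0.2, 0.7],
}
toy_query_vec = [1.0, 0.0, 0.1]

retrieved = top_k(toy_query_vec, toy_chunk_vecs, k=2)
print(retrieved)
```

The retrieved chunks are what later show up as `response.source_nodes`, and they are the contexts that Ragas evaluates.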

In [8]:
from random import randint

example_one = randint(0, len(input_files) - 1)  # randint includes both endpoints
question = dataset["train"][example_one]["question"]
expected_answer = dataset["train"][example_one]["answer"]

response = query_engine.query(question)
print("Question:")
print(question)
print("\nAnswer:")
print(str(response))
print("\nExpected Answer:")
print(expected_answer)

Question:
What secures together immovable joints and prevents them from moving?

Answer:
Dense collagen secures together immovable joints and prevents them from moving.

Expected Answer:
Dense Collagen


In [9]:
example_two = randint(0, len(input_files) - 1)  # randint includes both endpoints
question = dataset["train"][example_two]["question"]
expected_answer = dataset["train"][example_two]["answer"]

response = query_engine.query(question)
print("Question:")
print(question)
print("\nAnswer:")
print(str(response))
print("\nExpected Answer:")
print(expected_answer)

Question:
Which cycle tracks the flow of nitrogen through an ecosystem?

Answer:
The nitrogen cycle tracks the flow of nitrogen through an ecosystem.

Expected Answer:
Nitrogen Cycle


In [10]:
example_three = randint(0, len(input_files) - 1)  # randint includes both endpoints
question = dataset["train"][example_three]["question"]
expected_answer = dataset["train"][example_three]["answer"]

response = query_engine.query(question)
print("Question:")
print(question)
print("\nChunks:")
for node in response.source_nodes:
    print("--------------------------")
    print(str(node.text))
    print("--------------------------")
print("\nAnswer:")
print(str(response))
print("\nExpected Answer:")
print(expected_answer)

Question:
Cutting down on the use of chemical fertilizers and preserving wetlands are ways to prevent what "unlivable" regions in bodies of water?

Chunks:
--------------------------
Cutting down on the use of chemical fertilizers is one way to prevent dead zones in bodies of water. Preserving wetlands is also important. Wetlands are habitats such as swamps, marshes, and bogs where the ground is soggy or covered with water much of the year. Wetlands slow down and filter runoff before it reaches bodies of water. Wetlands also provide breeding grounds for many different species of organisms.
--------------------------
--------------------------
Some animals change their depth by changing their density. Recall that things that are denser than their surroundings sink while those that are less dense than their surroundings float. Most fish have a swim bladder, a special sac that is filled with gases from their blood. When the fish's swim bladder is full, it is less dense than the surroundin

## Evaluation of RAGs using Ragas

For the questions, documents, and answers in our truncated dataset, let's calculate some metrics to help us improve the model.

We'll use the Ragas score. Let's benchmark whatever model LlamaIndex is using by default.

## Ragas
Ragas evaluation expects the following columns:
- question: list[str] - These are the questions your RAG pipeline will be evaluated on.
- answer: list[str] - The answer generated from the RAG pipeline and given to the user.
- contexts: list[list[str]] - The contexts that were passed into the LLM to answer the question.
- ground_truths: list[list[str]] - The ground truth answer to the questions. (only required if you are using context_recall)
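
To make the expected shapes concrete, here is a small sketch that checks a plain dictionary against the column layout listed above before it is turned into a dataset. The helper `validate_ragas_columns` is hypothetical (it is not part of Ragas), and the context string is a placeholder.

```python
def validate_ragas_columns(data: dict) -> int:
    """Check column names, lengths, and nesting; return the number of rows."""
    required = ["question", "answer", "contexts", "ground_truths"]
    missing = [column for column in required if column not in data]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    lengths = {column: len(data[column]) for column in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns differ in length: {lengths}")

    # question and answer hold one string per row
    for column in ("question", "answer"):
        if not all(isinstance(value, str) for value in data[column]):
            raise ValueError(f"{column} must be a list of strings")

    # contexts and ground_truths hold a list of strings per row
    for column in ("contexts", "ground_truths"):
        for row in data[column]:
            if not isinstance(row, list) or not all(isinstance(s, str) for s in row):
                raise ValueError(f"{column} must be a list of lists of strings")

    return lengths["question"]


example_rows = {
    "question": ["Which cycle tracks the flow of nitrogen through an ecosystem?"],
    "answer": ["The nitrogen cycle tracks the flow of nitrogen through an ecosystem."],
    "contexts": [["<text of a retrieved chunk>"]],  # placeholder context
    "ground_truths": [["Nitrogen Cycle"]],
}
print(validate_ragas_columns(example_rows))
```

Note that each ground truth is wrapped in its own list, even when there is only one reference answer per question.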


## Ragas Metrics:
The Ragas score is the harmonic mean of the following 4 metrics, giving a single measure of the performance of your QA system across all the important aspects.
- faithfulness - the factual consistency of the answer with the context, based on the question.
- context_precision - a measure of how relevant the retrieved context is to the question. Conveys the quality of the retrieval pipeline.
- answer_relevancy - a measure of how relevant the answer is to the question.
- context_recall - measures the ability of the retriever to retrieve all the necessary information needed to answer the question.
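
The harmonic mean of $n$ scores $x_1, \dots, x_n$ is $n / \sum_i (1/x_i)$, so a single weak metric pulls the combined score down more than it would in an arithmetic mean. A short sketch with the standard library shows the effect. The first set of numbers is the rounded baseline result reported later in this notebook; the second set is invented to illustrate a pipeline with poor retrieval.

```python
from statistics import harmonic_mean, mean

# Rounded baseline metrics as printed later in this notebook
baseline_metrics = {
    "context_precision": 0.7900,
    "faithfulness": 0.9083,
    "answer_relevancy": 0.9067,
    "context_recall": 0.9225,
}
baseline_combined = harmonic_mean(list(baseline_metrics.values()))
print(f"harmonic mean of baseline metrics: {baseline_combined:.4f}")

# Hypothetical pipeline where one metric is much weaker than the rest
skewed_metrics = [0.20, 0.95, 0.95, 0.95]
print(f"arithmetic mean: {mean(skewed_metrics):.4f}")
print(f"harmonic mean:   {harmonic_mean(skewed_metrics):.4f}")
```

Because the inputs above are rounded to four decimals, the first result only approximately matches the overall score the notebook prints for the baseline.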

In [11]:
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

In [16]:
from statistics import harmonic_mean

import nest_asyncio
from datasets import Dataset
from llama_index.embeddings import OpenAIEmbedding
from ragas import evaluate
from tqdm import tqdm

nest_asyncio.apply()


def run_eval(embed_model=None, llm_model=None, dimensions=None):
    questions = []
    answers = []
    contexts = []
    ground_truths = []

    # Fall back to the library defaults for anything the caller did not specify
    llm = OpenAI(model=llm_model) if llm_model else OpenAI()
    context_kwargs = {"llm": llm}
    if embed_model:
        embed_kwargs = {"model": embed_model}
        if dimensions:
            embed_kwargs["dimensions"] = dimensions
        context_kwargs["embed_model"] = OpenAIEmbedding(**embed_kwargs)
    the_service_context = ServiceContext.from_defaults(**context_kwargs)

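    documents = SimpleDirectoryReader(input_files=input_files).load_data(show_progress=True)
    index = VectorStoreIndex.from_documents(documents, service_context=the_service_context)
    query_engine = index.as_query_engine()

    # Use a distinct loop variable so the vector index above is not shadowed.
    # This assumes the first len(input_files) rows each had a support passage saved.
    for row_number in tqdm(range(len(input_files))):
        row = dataset["train"][row_number]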
        # The Question
        question = row["question"]
        questions.append(question)

        # The Answer
        response = query_engine.query(question)
        answer = str(response)
        answers.append(answer)

        # Contexts
        context = []
        for node in response.source_nodes:
            context.append(str(node.text))
        contexts.append(context)

        # Ground Truth
        actual_answer = row["answer"]
        ground_truths.append([actual_answer])

    eval_dataset = Dataset.from_dict(
        {"question": questions, "contexts": contexts, "answer": answers, "ground_truths": ground_truths}
    )

    result = evaluate(
        eval_dataset,
        metrics=[
            context_precision,
            faithfulness,
            answer_relevancy,
            context_recall,
        ],
    )
    # Combine the four metric scores into a single Ragas score
    ragas_score = harmonic_mean(list(result.values()))

    return result, ragas_score

In [17]:
baseline_ragas, baseline_score = run_eval(embed_model="text-embedding-ada-002", llm_model="gpt-3.5-turbo")
print("Baseline Ragas Scores:", baseline_ragas)
print("Baseline Overall Ragas Scores:", baseline_score)

Loading files: 100%|██████████| 100/100 [00:00<00:00, 1659.75file/s]
100%|██████████| 100/100 [04:10<00:00,  2.50s/it]


evaluating with [context_precision]


100%|██████████| 7/7 [00:14<00:00,  2.04s/it]


evaluating with [faithfulness]


100%|██████████| 7/7 [00:26<00:00,  3.81s/it]


evaluating with [answer_relevancy]


100%|██████████| 7/7 [00:42<00:00,  6.05s/it]


evaluating with [context_recall]


100%|██████████| 7/7 [00:18<00:00,  2.71s/it]


Baseline Ragas Scores: {'context_precision': 0.7900, 'faithfulness': 0.9083, 'answer_relevancy': 0.9067, 'context_recall': 0.9225}
Baseline Overall Ragas Scores: 0.8784084064903125


In [18]:
small_35t_ragas, small_35t_score = run_eval(embed_model="text-embedding-3-small", llm_model="gpt-3.5-turbo")
print("Text Embedding 3 small + GPT 3.5 Turbo Small Ragas Score:", small_35t_ragas)
print("Text Embedding 3 small + GPT 3.5 Turbo Overall Ragas Score:", small_35t_score)

Loading files: 100%|██████████| 100/100 [00:00<00:00, 1062.56file/s]
100%|██████████| 100/100 [03:22<00:00,  2.03s/it]


evaluating with [context_precision]


100%|██████████| 7/7 [00:14<00:00,  2.04s/it]


evaluating with [faithfulness]


100%|██████████| 7/7 [00:26<00:00,  3.85s/it]


evaluating with [answer_relevancy]


100%|██████████| 7/7 [00:37<00:00,  5.42s/it]


evaluating with [context_recall]


100%|██████████| 7/7 [00:20<00:00,  2.93s/it]


Text Embedding 3 small + GPT 3.5 Turbo Small Ragas Score: {'context_precision': 0.8000, 'faithfulness': 0.9117, 'answer_relevancy': 0.9084, 'context_recall': 0.9463}
Text Embedding 3 small + GPT 3.5 Turbo Overall Ragas Score: 0.888005528949841
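
With two runs in hand, it helps to look at per-metric differences rather than only the combined score. A minimal sketch, using the rounded metric values printed above for the baseline and for this run (the variable names are chosen for illustration):

```python
# Rounded metric values copied from the two outputs above
baseline_run = {
    "context_precision": 0.7900,
    "faithfulness": 0.9083,
    "answer_relevancy": 0.9067,
    "context_recall": 0.9225,
}
small_embedding_run = {
    "context_precision": 0.8000,
    "faithfulness": 0.9117,
    "answer_relevancy": 0.9084,
    "context_recall": 0.9463,
}

# Positive values mean the new run scored higher than the baseline
metric_deltas = {
    metric: round(small_embedding_run[metric] - baseline_run[metric], 4)
    for metric in baseline_run
}
largest_change = max(metric_deltas, key=metric_deltas.get)

for metric, delta in metric_deltas.items():
    print(f"{metric:18} {delta:+.4f}")
print("largest change:", largest_change)
```

Here the retrieval-side metric `context_recall` moves the most, which is what we would hope for when only the embedding model changes. Differences this small on 100 questions may not be meaningful, since the metrics are themselves computed with an LLM; repeating a run is a cheap way to gauge the noise.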


In [19]:
large_35t_ragas, large_35t_score = run_eval(embed_model="text-embedding-3-large", llm_model="gpt-3.5-turbo")
print("Text Embedding 3 Large Ragas Score:", large_35t_ragas)
print("Text Embedding 3 Large Overall Ragas Score:", large_35t_score)

Loading files: 100%|██████████| 100/100 [00:00<00:00, 2068.42file/s]
100%|██████████| 100/100 [04:57<00:00,  2.98s/it]


evaluating with [context_precision]


100%|██████████| 7/7 [00:12<00:00,  1.82s/it]


evaluating with [faithfulness]


100%|██████████| 7/7 [00:32<00:00,  4.70s/it]


evaluating with [answer_relevancy]


100%|██████████| 7/7 [00:52<00:00,  7.57s/it]


evaluating with [context_recall]


100%|██████████| 7/7 [00:19<00:00,  2.85s/it]


Text Embedding 3 Large Ragas Score: {'context_precision': 0.8000, 'faithfulness': 0.9117, 'answer_relevancy': 0.9095, 'context_recall': 0.9258}
Text Embedding 3 Large Overall Ragas Score: 0.883677716654227


In [20]:
small_4_ragas, small_4_score = run_eval(embed_model="text-embedding-3-small", llm_model="gpt-4")
print("Text Embedding 3 + GPT 4 Ragas Score:", small_4_ragas)
print("Text Embedding 3 + GPT 4 Overall Ragas Score:", small_4_score)

Loading files: 100%|██████████| 100/100 [00:00<00:00, 2311.17file/s]
100%|██████████| 100/100 [03:52<00:00,  2.32s/it]


evaluating with [context_precision]


100%|██████████| 7/7 [00:13<00:00,  1.89s/it]


evaluating with [faithfulness]


100%|██████████| 7/7 [00:37<00:00,  5.40s/it]


evaluating with [answer_relevancy]


100%|██████████| 7/7 [00:36<00:00,  5.24s/it]


evaluating with [context_recall]


100%|██████████| 7/7 [00:18<00:00,  2.69s/it]


Text Embedding 3 + GPT 4 Ragas Score: {'context_precision': 0.8000, 'faithfulness': 0.9080, 'answer_relevancy': 0.8718, 'context_recall': 0.9450}
Text Embedding 3 + GPT 4 Overall Ragas Score: 0.8778556070635826


In [21]:
small_4t_ragas, small_4t_score = run_eval(embed_model="text-embedding-3-small", llm_model="gpt-4-turbo-preview")
print("Text Embedding 3 + GPT 4 Turbo Small Ragas Score:", small_4t_ragas)
print("Text Embedding 3 + GPT 4 Turbo Ragas Score:", small_4t_score)

Loading files: 100%|██████████| 100/100 [00:00<00:00, 2186.37file/s]
100%|██████████| 100/100 [03:41<00:00,  2.21s/it]


evaluating with [context_precision]


100%|██████████| 7/7 [00:12<00:00,  1.76s/it]


evaluating with [faithfulness]


100%|██████████| 7/7 [00:34<00:00,  4.88s/it]


evaluating with [answer_relevancy]


100%|██████████| 7/7 [00:40<00:00,  5.86s/it]


evaluating with [context_recall]


100%|██████████| 7/7 [00:21<00:00,  3.03s/it]


Text Embedding 3 + GPT 4 Turbo Small Ragas Score: {'context_precision': 0.7800, 'faithfulness': 0.8960, 'answer_relevancy': 0.9107, 'context_recall': 0.9311}
Text Embedding 3 + GPT 4 Turbo Ragas Score: 0.8752526577076438


In [22]:
large_4t_ragas, large_4t_score = run_eval(embed_model="text-embedding-3-large", llm_model="gpt-4-turbo-preview")
print("Text Embedding 3 Large + GPT 4 Turbo Small Ragas Score:", large_4t_ragas)
print("Text Embedding 3 Large + GPT 4 Turbo Ragas Score:", large_4t_score)

Loading files: 100%|██████████| 100/100 [00:00<00:00, 2331.22file/s]
100%|██████████| 100/100 [03:56<00:00,  2.37s/it]


evaluating with [context_precision]


100%|██████████| 7/7 [00:12<00:00,  1.74s/it]


evaluating with [faithfulness]


100%|██████████| 7/7 [00:35<00:00,  5.11s/it]


evaluating with [answer_relevancy]


100%|██████████| 7/7 [00:37<00:00,  5.41s/it]


evaluating with [context_recall]


100%|██████████| 7/7 [00:20<00:00,  2.94s/it]


Text Embedding 3 Large + GPT 4 Turbo Small Ragas Score: {'context_precision': 0.8000, 'faithfulness': 0.9030, 'answer_relevancy': 0.9130, 'context_recall': 0.9071}
Text Embedding 3 Large + GPT 4 Turbo Ragas Score: 0.8781295283806929


In [23]:
small_256_4t_ragas, small_256_4t_score = run_eval(
    embed_model="text-embedding-3-small", dimensions=256, llm_model="gpt-4-turbo-preview"
)
print("Text Embedding 3 Small (256) + GPT 4 Turbo Ragas Score:", small_256_4t_ragas)
print("Text Embedding 3 Small (256) + GPT 4 Turbo Overall Ragas Score:", small_256_4t_score)

Loading files: 100%|██████████| 100/100 [00:00<00:00, 1701.00file/s]
100%|██████████| 100/100 [05:15<00:00,  3.16s/it]


evaluating with [context_precision]


100%|██████████| 7/7 [00:11<00:00,  1.66s/it]


evaluating with [faithfulness]


100%|██████████| 7/7 [00:37<00:00,  5.34s/it]


evaluating with [answer_relevancy]


100%|██████████| 7/7 [00:35<00:00,  5.02s/it]


evaluating with [context_recall]


100%|██████████| 7/7 [00:24<00:00,  3.52s/it]

Text Embedding 3 Small (256) + GPT 4 Turbo Ragas Score: {'context_precision': 0.7850, 'faithfulness': 0.8993, 'answer_relevancy': 0.9123, 'context_recall': 0.9444}
Text Embedding 3 Small (256) + GPT 4 Turbo Overall Ragas Score: 0.8809172569758493
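
The `dimensions=256` argument in the run above asks the embedding model for shorter vectors. One way to picture a shortened embedding is as the full vector truncated to its first components and then re-normalized to unit length. The sketch below shows that operation on a made-up four-dimensional vector; that the API behaves exactly like this is an assumption, and `shorten_embedding` is a hypothetical helper.

```python
import math


def shorten_embedding(vector: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` components and rescale to unit length."""
    truncated = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in truncated]


# Hypothetical unit-length embedding (made-up numbers)
full_embedding = [0.5, 0.5, 0.5, 0.5]
short_embedding = shorten_embedding(full_embedding, dimensions=2)

print(short_embedding)
print(sum(x * x for x in short_embedding))  # squared length after rescaling
```

Shorter vectors cost less to store and compare, so it is worth checking with the Ragas score how much retrieval quality, if any, is given up in exchange.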





In [None]:
import pandas as pd

In [None]:
comparison = {
    "Embedding Model": [
        "text-embedding-ada-002",
        "text-embedding-3-small",
        "text-embedding-3-large",
        "text-embedding-3-small",
        "text-embedding-3-small",
        "text-embedding-3-large",
    ],
    "LLM Model": [
        "gpt-3.5-turbo",
        "gpt-3.5-turbo",
        "gpt-3.5-turbo",
        "gpt-4",
        "gpt-4-turbo-preview",
        "gpt-4-turbo-preview",
    ],
    "Ragas Score": [baseline_score, small_35t_score, large_35t_score, small_4_score, small_4t_score, large_4t_score],
}
df = pd.DataFrame.from_dict(comparison)
df.head(n=10)

| | Embedding Model | LLM Model | Ragas Score |
|---|---|---|---|
| 0 | text-embedding-ada-002 | gpt-3.5-turbo | 0.915011 |
| 1 | text-embedding-3-small | gpt-3.5-turbo | 0.916095 |
| 2 | text-embedding-3-large | gpt-3.5-turbo | 0.915233 |
| 3 | text-embedding-3-small | gpt-4 | 0.922723 |
| 4 | text-embedding-3-small | gpt-4-turbo-preview | 0.92066 |
| 5 | text-embedding-3-large | gpt-4-turbo-preview | 0.915466 |
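
The last step of the methodology is to pick a configuration using the north-star metric. A minimal sketch, using the scores exactly as they appear in the table above and only the standard library (the names `comparison_rows` and `best_config` are chosen for illustration):

```python
# (embedding model, LLM, Ragas score) copied from the table above
comparison_rows = [
    ("text-embedding-ada-002", "gpt-3.5-turbo", 0.915011),
    ("text-embedding-3-small", "gpt-3.5-turbo", 0.916095),
    ("text-embedding-3-large", "gpt-3.5-turbo", 0.915233),
    ("text-embedding-3-small", "gpt-4", 0.922723),
    ("text-embedding-3-small", "gpt-4-turbo-preview", 0.92066),
    ("text-embedding-3-large", "gpt-4-turbo-preview", 0.915466),
]

# The first row is the baseline configuration
baseline_row_score = comparison_rows[0][2]
best_config = max(comparison_rows, key=lambda row: row[2])

# Print configurations from best to worst with their gain over the baseline
for embed_name, llm_name, score in sorted(comparison_rows, key=lambda row: row[2], reverse=True):
    print(f"{embed_name:24} {llm_name:20} {score:.6f} {score - baseline_row_score:+.6f}")

print("best:", best_config[0], "+", best_config[1])
```

The spread between the best and worst configuration in this table is under one point of Ragas score, so cost and latency may reasonably decide between them.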

