# Retrieval Augmented Generation
This notebook introduces Retrieval Augmented Generation (RAG). In particular, we will learn how to improve a RAG pipeline by comparing different embeddings, LLMs, and other components.

The focus here is the methodology for improving your RAG pipeline using a north-star metric (e.g. the Ragas score). Your actual use case might differ, and so might the best chunking, prompt, embedding, retrieval strategy, and LLM for it. Use this methodology to pick the best combination for your use case.
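
As a rough illustration of this methodology, the sketch below scores every combination in a small grid and ranks them by a single number. The names `sweep` and `placeholder_score`, the labels `embedding-a`, `llm-x`, etc., and all the numbers are hypothetical stand-ins; in practice the scoring function would be a real evaluation run over your own data.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]


def sweep(grid: Dict[str, List[str]], score_fn: Callable[[Config], float]) -> List[Tuple[float, Config]]:
    """Score every combination in the grid and return them best-first."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        results.append((score_fn(config), config))
    # Sort on the score only, so ties never try to compare two dicts
    return sorted(results, key=lambda pair: pair[0], reverse=True)


def placeholder_score(config: Config) -> float:
    """Hypothetical stand-in for an evaluation run; the numbers are made up."""
    bonus = {"embedding-a": 0.10, "embedding-b": 0.20, "llm-x": 0.30, "llm-y": 0.50}
    return bonus[config["embedding"]] + bonus[config["llm"]]


grid = {
    "embedding": ["embedding-a", "embedding-b"],
    "llm": ["llm-x", "llm-y"],
}
ranked = sweep(grid, placeholder_score)
best_score, best_config = ranked[0]
print("Best configuration:", best_config)
```

Keeping the metric fixed while only the configuration changes is what makes the scores comparable from one run to the next.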

For this tutorial, we'll use LlamaIndex, which provides some abstractions over underlying APIs used for building LLM applications. For more about RAG and LlamaIndex's definitions, see [here](https://docs.llamaindex.ai/en/stable/getting_started/concepts.html).

In [None]:
!pip install llama-index==0.9.48 datasets tqdm python-dotenv spacy

In [12]:
import hashlib
import os
from glob import glob

from datasets import DatasetDict

# Create a directory to store the content
content_folder = os.path.join(os.path.abspath(""), ".content/")
documents_folder = os.path.join(os.path.abspath(""), ".content/docs/")
os.makedirs(documents_folder, exist_ok=True)

NUM_DOCUMENTS = 100


# Function to save article content to a file
def save_article_content(text, folder):
    try:
        # Name the file after the MD5 checksum of its text, so identical passages map to one file
        checksum = hashlib.md5(text.encode("utf-8")).hexdigest()
        file_path = os.path.join(folder, checksum + ".txt")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text)
        return file_path
    except Exception as e:
        print(e)
        return str(e)


dataset = DatasetDict.load_from_disk(f"{content_folder}/rag_sciq_data.hf")
# len() of a DatasetDict counts its splits (e.g. train/validation/test), not its rows
print(f"Dataset contains {len(dataset)} splits")

# Save the supporting passage of each train-set row to its own file
saved_files = []
for row in dataset["train"]:
    if row["support"]:
        saved_files.append(save_article_content(row["support"], documents_folder))
    if NUM_DOCUMENTS and len(saved_files) >= NUM_DOCUMENTS:
        break

# Load the documents that we've already downloaded in the Synthetic Dataset for RAG
data_dir = os.path.join(os.path.abspath(""), ".content/docs")
input_files = glob(os.path.join(data_dir, "*.txt"))
print(f"{len(input_files)} files in folder: {input_files[0]}, ...")

Dataset contains 3 splits
100 files in folder: /Users/rahulparundekar/workspaces/course-openai-api/rag/.content/docs/1e41916248531ab7c35d0c9895b9e097.txt, ...


In [13]:
from dotenv import load_dotenv

load_dotenv()

True

## A simple RAG with LlamaIndex (using defaults)

In [14]:
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import OpenAI


First, build the index from all the documents we have, using the default chunking and embedding strategies.
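
To make "chunking" concrete, here is a minimal sketch of a sliding-window splitter that works on whitespace-separated words. It is not LlamaIndex's splitter, and the function name `chunk_words` and its parameter values are assumptions chosen for illustration only.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into chunks of `chunk_size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(sample, chunk_size=5, overlap=2):
    print(chunk)
```

The overlap lets a sentence that straddles a chunk boundary appear whole in at least one chunk, at the cost of indexing some words twice.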

In [None]:
service_context = ServiceContext.from_defaults(llm=OpenAI())
# The files are already selected via input_files, so load_data() needs no glob pattern
documents = SimpleDirectoryReader(input_files=input_files).load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

In [None]:
query_engine = index.as_query_engine()
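
Conceptually, a query engine first retrieves the chunks most similar to the question and then hands them to the LLM. The sketch below illustrates only the retrieval step, using word counts as a toy stand-in for a learned embedding. The helper names (`bag_of_words`, `cosine`, `retrieve`) and the sample chunks are hypothetical, and this is not how LlamaIndex implements retrieval internally.

```python
import math
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. Real pipelines use a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], top_k: int = 2) -> List[Tuple[float, str]]:
    """Return the top_k chunks most similar to the query, best first."""
    query_vec = bag_of_words(query)
    scored = [(cosine(query_vec, bag_of_words(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


chunks = [
    "mitochondria produce energy for the cell",
    "plate tectonics moves continents",
    "a membrane surrounds every living cell",
]
for score, chunk in retrieve("what produces energy in the cell", chunks):
    print(f"{score:.2f}  {chunk}")
```

Swapping the embedding model changes which chunks rank highest, which is one reason retrieval quality is measured separately from answer quality later in this notebook.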

In [None]:
from random import randint

# randint includes both endpoints, so subtract 1 to stay within the saved documents
example_one = randint(0, len(input_files) - 1)
question = dataset["train"][example_one]["question"]
expected_answer = dataset["train"][example_one]["answer"]

response = query_engine.query(question)
print("Question:")
print(question)
print("\nAnswer:")
print(str(response))
print("\nExpected Answer:")
print(expected_answer)

In [None]:
# randint includes both endpoints, so subtract 1 to stay within the saved documents
example_two = randint(0, len(input_files) - 1)
question = dataset["train"][example_two]["question"]
expected_answer = dataset["train"][example_two]["answer"]

response = query_engine.query(question)
print("Question:")
print(question)
print("\nAnswer:")
print(str(response))
print("\nExpected Answer:")
print(expected_answer)

In [None]:
# randint includes both endpoints, so subtract 1 to stay within the saved documents
example_three = randint(0, len(input_files) - 1)
question = dataset["train"][example_three]["question"]
expected_answer = dataset["train"][example_three]["answer"]

response = query_engine.query(question)
print("Question:")
print(question)
print("\nChunks:")
for node in response.source_nodes:
    print("--------------------------")
    print(str(node.text))
    print("--------------------------")
print("\nAnswer:")
print(str(response))
print("\nExpected Answer:")
print(expected_answer)

## Evaluation of RAGs using Ragas

For the questions, documents, and answers in our truncated dataset, let's calculate some metrics to help us improve the pipeline.

We'll use the Ragas score. Let's start by benchmarking whatever models LlamaIndex uses by default.

## Ragas
You need the following columns for a Ragas evaluation:
- question: list[str] - These are the questions your RAG pipeline will be evaluated on.
- answer: list[str] - The answer generated from the RAG pipeline and given to the user.
- contexts: list[list[str]] - The contexts that were passed into the LLM to answer the question.
- ground_truths: list[list[str]] - The ground truth answer to the questions. (only required if you are using context_recall)
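
The sketch below shows the expected shape of these columns for a single row and checks that they line up. The helper `build_eval_rows` and the sample question are hypothetical; the resulting dict has the same layout that is later passed to `Dataset.from_dict` in this notebook.

```python
from typing import Dict, List


def build_eval_rows(
    questions: List[str],
    answers: List[str],
    contexts: List[List[str]],
    ground_truths: List[List[str]],
) -> Dict[str, list]:
    """Assemble the four columns and check that they line up row by row."""
    columns = {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truths": ground_truths,
    }
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # contexts and ground_truths hold one list of strings per question
    for name in ("contexts", "ground_truths"):
        if not all(isinstance(item, list) for item in columns[name]):
            raise TypeError(f"Every entry of '{name}' must be a list of strings")
    return columns


rows = build_eval_rows(
    questions=["What gas do plants absorb during photosynthesis?"],
    answers=["Plants absorb carbon dioxide."],
    contexts=[["During photosynthesis, plants take in carbon dioxide and release oxygen."]],
    ground_truths=[["carbon dioxide"]],
)
print(rows)
```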


## Ragas Metrics:
The harmonic mean of these four metrics gives you the Ragas score, a single measure of the performance of your QA system across all the important aspects.
- faithfulness - the factual consistency of the answer with the context, based on the question.
- context_precision - a measure of how relevant the retrieved context is to the question. Conveys the quality of the retrieval pipeline.
- answer_relevancy - a measure of how relevant the answer is to the question.
- context_recall - measures the ability of the retriever to retrieve all the information needed to answer the question.
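
To see why a harmonic mean is a sensible way to combine these metrics, the small example below compares it with the arithmetic mean. The four metric values are placeholders chosen for illustration, not measured results.

```python
from statistics import harmonic_mean, mean

# Placeholder metric values chosen for illustration; they are not measured results
scores = {
    "faithfulness": 0.9,
    "answer_relevancy": 0.9,
    "context_precision": 0.9,
    "context_recall": 0.3,
}

ragas_score = harmonic_mean(list(scores.values()))
plain_average = mean(scores.values())

print(f"harmonic mean:   {ragas_score:.2f}")
print(f"arithmetic mean: {plain_average:.2f}")
```

Because the harmonic mean is pulled toward the smallest value, a pipeline cannot hide one weak aspect (here, recall) behind three strong ones.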

In [None]:
from llama_index import ServiceContext
from llama_index.evaluation import CorrectnessEvaluator
from llama_index.llms import OpenAI
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

In [None]:
from statistics import harmonic_mean, mean

import nest_asyncio
from datasets import Dataset
from llama_index.embeddings import OpenAIEmbedding
from ragas import evaluate
from tqdm import tqdm

nest_asyncio.apply()


def run_eval(embed_model=None, llm_model=None, dimensions=None):
    questions = []
    answers = []
    contexts = []
    ground_truths = []

    correctness_scores = []

    # Build the service context, falling back to the library defaults for anything not specified
    context_kwargs = {"llm": OpenAI(model=llm_model) if llm_model else OpenAI()}
    if embed_model:
        embedding_kwargs = {"dimensions": dimensions} if dimensions else {}
        context_kwargs["embed_model"] = OpenAIEmbedding(model=embed_model, **embedding_kwargs)
    the_service_context = ServiceContext.from_defaults(**context_kwargs)

    # Index the documents and create a query engine for this configuration
    documents = SimpleDirectoryReader(input_files=input_files).load_data()
    index = VectorStoreIndex.from_documents(documents, service_context=the_service_context)
    query_engine = index.as_query_engine()

    # The correctness judge always uses the default LLM, independent of the configuration under test
    service_context = ServiceContext.from_defaults(llm=OpenAI())
    evaluator = CorrectnessEvaluator(service_context=service_context)

    # Use a separate loop variable so the vector index above is not shadowed
    for row_index in tqdm(range(len(input_files))):
        row = dataset["train"][row_index]
        # The Question
        question = row["question"]
        questions.append(question)

        # The Answer
        response = query_engine.query(question)
        answer = str(response)
        answers.append(answer)

        # Contexts
        context = []
        for node in response.source_nodes:
            context.append(str(node.text))
        contexts.append(context)

        # Ground Truth
        actual_answer = row["answer"]
        ground_truths.append([actual_answer])

        # Correctness with llama-index
        correctness = evaluator.evaluate(
            query=question,
            response=answer,
            reference=actual_answer,
        )
        correctness_scores.append(correctness.score)

    eval_dataset = Dataset.from_dict(
        {"question": questions, "contexts": contexts, "answer": answers, "ground_truths": ground_truths}
    )

    result = evaluate(
        eval_dataset,
        metrics=[
            context_precision,
            faithfulness,
            answer_relevancy,
            context_recall,
        ],
    )
    # Overall Ragas score: harmonic mean of the four individual metric scores
    ragas_score = harmonic_mean(list(result.values()))

    return mean(correctness_scores), result, ragas_score

In [None]:
baseline_correctness, baseline_ragas, baseline_score = run_eval(
    embed_model="text-embedding-ada-002", llm_model="gpt-3.5-turbo"
)
print("Baseline Correctness Score:", baseline_correctness)
print("Baseline Ragas Scores:", baseline_ragas)
print("Baseline Overall Ragas Score:", baseline_score)

In [None]:
small_35t_correctness, small_35t_ragas, small_35t_score = run_eval(
    embed_model="text-embedding-3-small", llm_model="gpt-3.5-turbo"
)
print("Text Embedding 3 Small + GPT 3.5 Turbo Correctness Score:", small_35t_correctness)
print("Text Embedding 3 Small + GPT 3.5 Turbo Ragas Scores:", small_35t_ragas)
print("Text Embedding 3 Small + GPT 3.5 Turbo Overall Ragas Score:", small_35t_score)

In [None]:
large_35t_correctness, large_35t_ragas, large_35t_score = run_eval(
    embed_model="text-embedding-3-large", llm_model="gpt-3.5-turbo"
)
print("Text Embedding 3 Large + GPT 3.5 Turbo Correctness Score:", large_35t_correctness)
print("Text Embedding 3 Large + GPT 3.5 Turbo Ragas Scores:", large_35t_ragas)
print("Text Embedding 3 Large + GPT 3.5 Turbo Overall Ragas Score:", large_35t_score)

In [None]:
small_4_correctness, small_4_ragas, small_4_score = run_eval(embed_model="text-embedding-3-small", llm_model="gpt-4")
print("Text Embedding 3 Small + GPT 4 Correctness Score:", small_4_correctness)
print("Text Embedding 3 Small + GPT 4 Ragas Scores:", small_4_ragas)
print("Text Embedding 3 Small + GPT 4 Overall Ragas Score:", small_4_score)

In [None]:
small_4t_correctness, small_4t_ragas, small_4t_score = run_eval(
    embed_model="text-embedding-3-small", llm_model="gpt-4-turbo-preview"
)
print("Text Embedding 3 Small + GPT 4 Turbo Correctness Score:", small_4t_correctness)
print("Text Embedding 3 Small + GPT 4 Turbo Ragas Scores:", small_4t_ragas)
print("Text Embedding 3 Small + GPT 4 Turbo Overall Ragas Score:", small_4t_score)

In [None]:
large_4t_correctness, large_4t_ragas, large_4t_score = run_eval(
    embed_model="text-embedding-3-large", llm_model="gpt-4-turbo-preview"
)
print("Text Embedding 3 Large + GPT 4 Turbo Correctness Score:", large_4t_correctness)
print("Text Embedding 3 Large + GPT 4 Turbo Ragas Scores:", large_4t_ragas)
print("Text Embedding 3 Large + GPT 4 Turbo Overall Ragas Score:", large_4t_score)

In [None]:
large_256_4t_correctness, large_256_4t_ragas, large_256_4t_score = run_eval(
    embed_model="text-embedding-3-large", dimensions=256, llm_model="gpt-4-turbo-preview"
)
print("Text Embedding 3 Large (256) + GPT 4 Turbo Correctness Score:", large_256_4t_correctness)
print("Text Embedding 3 Large (256) + GPT 4 Turbo Ragas Scores:", large_256_4t_ragas)
print("Text Embedding 3 Large (256) + GPT 4 Turbo Overall Ragas Score:", large_256_4t_score)

In [None]:
import pandas as pd

In [None]:
comparison = {
    "Embedding Model": [
        "text-embedding-ada-002",
        "text-embedding-3-small",
        "text-embedding-3-large",
        "text-embedding-3-small",
        "text-embedding-3-small",
        "text-embedding-3-large",
    ],
    "LLM Model": [
        "gpt-3.5-turbo",
        "gpt-3.5-turbo",
        "gpt-3.5-turbo",
        "gpt-4",
        "gpt-4-turbo-preview",
        "gpt-4-turbo-preview",
    ],
    "Correctness": [
        baseline_correctness,
        small_35t_correctness,
        large_35t_correctness,
        small_4_correctness,
        small_4t_correctness,
        large_4t_correctness,
    ],
    "Ragas Score": [baseline_score, small_35t_score, large_35t_score, small_4_score, small_4t_score, large_4t_score],
}
df = pd.DataFrame.from_dict(comparison)
df.head(n=10)