# Evaluate Llama3.1 vs Llama3 for your RAG system with Unstructured, GPT-4o, Ragas, and LangChain

Author: Nina Lopatina from Unstructured

Nina's X handle: [@NinaLopatina](https://x.com/ninalopatina)

Nina's LinkedIn: https://www.linkedin.com/in/ninalopatina

Last updated: 07.24.24

Llama3 quantized RAG content sections borrowed from [@MariaKhalusova](https://x.com/MariaKhalusova)

In this 10-step tutorial, we'll evaluate the 8B-parameter Llama3.1 against Llama3 for your RAG system on your PDF. This notebook parses a PDF, generates a synthetic test dataset, sets up RAG with the model of your choice, then evaluates the RAG system's performance across several key [metrics](https://docs.ragas.io/en/latest/concepts/metrics/index.html). To compare several models, run through this notebook once per model, and it will automatically save your results. I found the following results on an 82-question test set for Llama3 vs. Llama3.1:


_________________________________________

In [None]:
# Do not run this cell again
print(table)

╒════════════════════╤══════════╤════════════╕
│ Metric             │   Llama3 │   Llama3.1 │
╞════════════════════╪══════════╪════════════╡
│ Context Precision  │   1.0000 │     1.0000 │
├────────────────────┼──────────┼────────────┤
│ Faithfulness       │   0.1830 │     0.5621 │
├────────────────────┼──────────┼────────────┤
│ Answer Relevancy   │   0.7094 │     0.7531 │
├────────────────────┼──────────┼────────────┤
│ Context Recall     │   0.9636 │     0.9709 │
├────────────────────┼──────────┼────────────┤
│ Answer Similarity  │   0.7931 │     0.9179 │
├────────────────────┼──────────┼────────────┤
│ Answer Correctness │   0.3452 │     0.5638 │
╘════════════════════╧══════════╧════════════╛


Note that this is a small sample: we tested a small set of questions on a single paper. Still, it is a compelling preliminary result that I would follow with a subjective evaluation and a larger-scale evaluation before deciding to update a model running in production.

In this tutorial, we will first demonstrate how easily your PDFs can be transformed into meaningfully chunked text segments via Unstructured's API. These chunks are the source for both a synthetic test dataset and a RAG system. Leveraging the Ragas evaluation framework, the synthetic test dataset then enables evaluation of your Llama3 or Llama3.1 RAG system's performance.

Evaluating RAG systems comprehensively is challenging because it requires many custom questions and answers per document. Rather than having human labelers pore over long documents, these can be created synthetically with a powerful, cost-effective model like GPT-4o. GPT-4 has long been the standard for challenging tasks like creating synthetic test datasets for RAG, and the recent release of GPT-4o halves the cost and doubles the speed, in addition to improvements across 50 languages!

However, even GPT-4o by default is not good at creating diverse samples as it tends to follow common paths. Ragas expands upon this by employing an evolutionary generation paradigm, where questions with different characteristics such as reasoning, conditioning, multi-context, and more are systematically crafted from the provided set of documents.

We'll use the [Unstructured API](https://unstructured.io/api-key-hosted) for preprocessing PDF files, Ragas for the [test set generation](https://docs.ragas.io/en/latest/getstarted/testset_generation.html) and [evaluation](https://docs.ragas.io/en/latest/getstarted/evaluation.html) framework, and [OpenAI's GPT-4o](https://platform.openai.com/docs/models) for the Q&A data generation. We'll build a simple RAG system with the latest LLM from Meta, Llama 3.1 (specifically the `Llama-3.1-8B-Instruct` version that you can get on Hugging Face), FAISS for vector storage, Hugging Face `transformers` to load the model, and [LangChain](https://www.langchain.com/) for integration.

For this demo, we are downloading an interesting recent paper, [How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior (Wu et al., arXiv, 2024)](https://arxiv.org/pdf/2404.10198), in case you want to read further on RAG evaluation. The paper was published in April 2024, well after the Llama3.1 knowledge cutoff of December 2023, so you don't have to immediately find a novel interesting paper to try this with a later model :).

Note that this notebook draws on prior work from Unstructured: [RAG system setup](https://colab.research.google.com/drive/1BJYYyrPVe0_9EGyXqeNyzmVZDrCRZwsg#scrollTo=rmCLySVHDngw) and [Test data generation](https://t.co/YaEr3ALQ1T).

_________________________________________

1. To get started, install all the libraries and sign up for your [Unstructured API key](https://unstructured.io/api-key-hosted) with a 2-week free trial for up to 1000 pages per day. You can find your API credentials in your dashboard.

In [None]:
!pip install -q unstructured[all-docs] langchain transformers accelerate bitsandbytes sentence-transformers faiss-gpu ragas python-dotenv


In [None]:
#This upgrade allows Llama3.1 to be compared directly to Llama3 in the below framework
!pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.43.2-py3-none-any.whl.metadata (43 kB)
Downloading transformers-4.43.2-py3-none-any.whl (9.4 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.42.4
    Uninstalling transformers-4.42.4:
      Successfully uninstalled transformers-4.42.4
Successfully installed transformers-4.43.2


2. [Mount your Google Drive locally](https://colab.research.google.com/notebooks/io.ipynb) (a pop-up will ask you to connect to your Google Drive) to load your [dotenv file](https://www.youtube.com/watch?app=desktop&v=t_pFWwLtza0&t=7) and to store the synthetic data you create. Then set paths and finish setting up your environment.

  The files to process themselves will be downloaded via URL.

  The secret parameters to set in your .env file are:
  
  UNSTRUCTURED_API_KEY=YOUR_API_KEY

  UNSTRUCTURED_PARTITION_ENDPOINT=API_URL

  OPENAI_API_KEY=YOUR_API_KEY

  HF_TOKEN=YOUR_TOKEN

### Note that in this notebook, you are sharing your Google Drive with the colab notebook itself, not with Unstructured or Hugging Face or OpenAI.

#### If you prefer not to give this notebook access to your Google Drive, you can access your .env and save your synthetic data in another fashion, e.g. by downloading this notebook as a .ipynb and running it locally with local directory access.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import dotenv

dotenv.load_dotenv('/content/drive/MyDrive/.env')

True
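Since every later step depends on these credentials, a quick check that the four keys listed above actually made it into the environment can save a confusing failure further down. This is a minimal sketch; `missing_env_keys` is a hypothetical helper, not part of `dotenv` or any library used here.

```python
import os

# The four keys listed in the .env section above
REQUIRED_KEYS = (
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_PARTITION_ENDPOINT",
    "OPENAI_API_KEY",
    "HF_TOKEN",
)

def missing_env_keys(required=REQUIRED_KEYS, env=None):
    """Return the required keys that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]

missing = missing_env_keys()
if missing:
    print("Missing from environment:", ", ".join(missing))
else:
    print("All required keys are set.")
```

Passing a plain dict as `env` makes the helper easy to try without touching your real environment.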

In [None]:
# Define the path in Google Drive where you want to save the files
drive_path = "/content/drive/MyDrive/outputs"
# Create the output folder if it does not exist yet, so saving the CSV does not fail
os.makedirs(drive_path, exist_ok=True)

# Define the filename for the file we want to save: the test set
test_set_filename = "test_set.csv"

# Define the full path for the test set
test_set_path = os.path.join(drive_path, test_set_filename)

# Decide which model to evaluate:
model_version = 'llama3.1' # or 'llama3' for downstream configs

3. Download, partition, and chunk your file so that the logical structure of the document is preserved for better question generation and RAG results. Note that this requires access to your Google Drive to save these and subsequent data.

  While Ragas and GPT-4o can be used with any document dataset to generate synthetic test data, only Unstructured enables this generation on top of your unstructured data across connectors, file types, and languages.

  If you wish to scale up the documents you are processing, check out our [source connectors](https://docs.unstructured.io/api-reference/ingest/source-connectors/overview) to update the ingestion pipeline below so that it connects to your data.

In [None]:
from unstructured_client import UnstructuredClient

client = UnstructuredClient(
    api_key_auth=os.environ.get("UNSTRUCTURED_API_KEY"),
    server_url=os.environ.get("UNSTRUCTURED_PARTITION_ENDPOINT"),
)

In [None]:
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError
from unstructured.staging.base import dict_to_elements
import requests
import tempfile
import csv
import pickle

# Feel free to swap out any pdf or other unstructured document URL of your choice below!
path_to_pdf = 'https://arxiv.org/pdf/2404.10198'

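# Function to download file from URL
def download_file(url):
    try:
        response = requests.get(url, timeout=60)
        # Raise on HTTP errors (e.g. 404) so an error page is not mistaken for the PDF
        response.raise_for_status()
        print("Download succeeded")
        return response.content
    except requests.RequestException as e:
        print("Download failed:", e)
        return None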

In [None]:
# Download the PDF file
pdf_content = download_file(path_to_pdf)

# Check if download was successful
if pdf_content:
    # Create a temporary file to save the PDF content
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_file:
        tmp_file.write(pdf_content)
        tmp_file_path = tmp_file.name

    # Preprocess with Unstructured
    with open(tmp_file_path, "rb") as f:
        files = shared.Files(
            content=f.read(),
            file_name=tmp_file.name,
        )
        req = shared.PartitionParameters(
            files=files,
            chunking_strategy="by_title",
            max_characters=512,
        )
        try:
            resp = client.general.partition(req)
            elements = dict_to_elements(resp.elements)

        except SDKError as e:
            print(e)

    # Clean up: Remove temporary file
    os.remove(tmp_file_path)

else:
    print("File download failed.")

Download succeeded


4. Create LangChain documents from document chunks and their metadata

In [None]:
from langchain_core.documents import Document

# Generating the raw documents we will use for synthetic test data generation
documents = []
for element in elements:
    metadata = element.metadata.to_dict()
    documents.append(Document(page_content=element.text, metadata=metadata))
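Before generating questions, it can be worth checking that the chunks look the way you expect given the `max_characters=512` setting used above. The sketch below is a hypothetical helper that works on plain strings; in the notebook you could pass `[d.page_content for d in documents]`. The sample strings are made up for illustration.

```python
from statistics import mean

def summarize_chunks(texts, max_characters=512):
    """Summarize chunk lengths (hypothetical helper; `texts` is a list of strings)."""
    lengths = [len(t) for t in texts]
    if not lengths:
        return {"count": 0, "mean": 0.0, "longest": 0, "over_limit": 0}
    return {
        "count": len(lengths),
        "mean": mean(lengths),
        "longest": max(lengths),
        # Number of chunks longer than the configured limit
        "over_limit": sum(length > max_characters for length in lengths),
    }

# Stand-in strings so the sketch runs on its own
sample_texts = ["a" * 100, "b" * 512, "c" * 600]
print(summarize_chunks(sample_texts))
```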

5. Import and combine Ragas + OpenAI's GPT-4o for testset generation. In this section, we will define which model to use to generate the questions and answers ('generator_llm'), as well as to evaluate the quality of the answers ('critic_llm'). We have chosen an even distribution across question types since we do not know which kinds of questions a user would ask about these data.  

In [None]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# generator with openai models
generator_llm = ChatOpenAI(model="gpt-4o") # "gpt-3.5-turbo-16k" is another option
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# Change resulting question type distribution
distributions = {
    simple: 0.33,
    multi_context: 0.33,
    reasoning: 0.34
}
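To get a feel for what this distribution means for a 100-question test set, the weights can be turned into approximate per-type counts. This is a rough sketch under two assumptions: the weights are meant to sum to 1, and the generator splits questions roughly in proportion (the actual counts Ragas produces may differ). String keys stand in for the Ragas evolution objects so the snippet runs on its own.

```python
import math

def expected_counts(distribution, test_size):
    """Approximate number of questions per type (hypothetical helper)."""
    total = sum(distribution.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Weights sum to {total}, expected 1.0")
    return {name: round(weight * test_size) for name, weight in distribution.items()}

# String keys stand in for the `simple`, `multi_context`, `reasoning` objects
example = {"simple": 0.33, "multi_context": 0.33, "reasoning": 0.34}
print(expected_counts(example, 100))
```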

6. Run the generation with your documents, set the number of questions, and use the distribution defined above for the question distribution:

Note that this takes about 5 minutes to run for the 100 questions in this example, so we save the result to Google Drive to minimize repeated API calls.

In [None]:
import pandas as pd

# Check if the test set file exists
if not os.path.exists(test_set_path):
  # use generator.generate_with_llamaindex_docs if you use llama-index as document loader
  testset = generator.generate_with_langchain_docs(documents, 100, distributions)
  test_df = testset.to_pandas()

  # Save the test set to a CSV file
  test_df.to_csv(test_set_path, index=False)
  print("Test set saved successfully.")
else:
  print("Test set file already exists. Loading...")
  test_df = pd.read_csv(test_set_path)


Test set file already exists. Loading...


In [None]:
# Remove nan and specific phrase
clean_test = test_df[
    (test_df['question'].notna()) &
    (test_df['ground_truth'] != 'nan') &
    (test_df['ground_truth'].notna()) &
    (test_df['ground_truth'] != 'The answer to given question is not present in context')
]
print(len(clean_test), 'filtered questions and answers in the test set')

82 filtered questions and answers in the test set


Voila, you now have questions and answers to evaluate your RAG system with! Let's take a look at a few as a sanity check.

In [None]:
import itertools

# Iterate over the specified number of rows
num_rows = 5
for index, row in itertools.islice(clean_test.iterrows(), num_rows):
    print(f"Question: {row['question']}")
    print(f"Ground Truth: {row['ground_truth']}")
    print("-" * 30)  # Adding a separator for better readability

Question: Why is probability calibration considered a promising approach to reduce prior and context bias?
Ground Truth: Probability calibration is considered a promising approach to reduce prior and context bias because it deserves further investigation and serves as a natural baseline for future methods.
------------------------------
Question: How often do advanced LLMs like GPT-4o override their correct prior knowledge when presented with incorrect information in retrieved documents?
Ground Truth: Advanced LLMs like GPT-4o override their correct prior knowledge over 60% of the time when presented with incorrect information in the retrieved documents.
------------------------------
Question: How is GPT-4 used to generate questions based on the text?
Ground Truth: GPT-4 is queried to generate a question based on the text, along with the ground truth answer and the excerpt used to generate the question.
------------------------------
Question: What are the different correction methods

It is outside the scope of this tutorial to thoroughly evaluate these synthetic test data. However, my qualitative assessment was that the questions seemed reasonable and somewhat diverse (with a lot of questions about references), with some obvious errors. Here is a question I generated on a previous run:

------------------------------
Question: Which model, with Mistral-7B, helps assess GPT-4's RAG?

Ground Truth: GPT3.5 (gpt-3.5-turbo-0125)

------------------------------

Ironically, GPT-4o did not correctly identify itself as the source of the questions and ground truth in the referenced paper. So we will take the final results with a grain of salt.

7. With your test data ready to go, we will create our Llama3.1 RAG system to evaluate.

  This is a gated model, which means you first need to go to the [model's page](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), log in, review terms and conditions, and request access to it. To use the model in the notebook, you need to log in with your Hugging Face token (get it in your profile's settings). To run this tutorial in the free Colab GPU, we'll need to quantize the model:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Ingest the documents we made earlier into the FAISS vectorstore. Set up the retriever.


In [None]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings
# Generating the embeddings and retriever for Llama3
llama3_db = FAISS.from_documents(documents, HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5"))
retriever = llama3_db.as_retriever(search_type="similarity", search_kwargs={"k": 4})
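Conceptually, a similarity retriever with `k=4` embeds the question and returns the four stored chunks whose vectors are closest to it. The toy sketch below illustrates that idea in plain Python, assuming Euclidean distance; the metric FAISS actually uses depends on how the index is configured, and real embeddings have hundreds of dimensions rather than two. `top_k_by_distance` and the toy vectors are hypothetical.

```python
import math

def top_k_by_distance(query_vec, doc_vecs, k=4):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [
        (i, math.dist(query_vec, vec))  # Euclidean distance, smaller is closer
        for i, vec in enumerate(doc_vecs)
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# Toy 2-D "embeddings" so the sketch runs without a model
toy_docs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (0.5, 0.5)]
toy_query = (0.4, 0.4)
for index, distance in top_k_by_distance(toy_query, toy_docs, k=2):
    print(index, round(distance, 3))
```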



In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

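if model_version == 'llama3':
  model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
elif model_version == 'llama3.1':
  model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
else:
  # Fail early instead of hitting an undefined model_name later
  raise ValueError(f"Unknown model_version: {model_version!r}; expected 'llama3' or 'llama3.1'")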


bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)


Set up the selected Llama model and a simple RAG chain.

Make sure to follow the prompt format for best results:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>
```
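As an illustration of the format above, here is a small sketch that assembles a single-turn prompt from a system prompt and a user message, ending at the assistant header so the model writes the answer. `build_llama3_prompt` is a hypothetical helper. Depending on tokenizer settings, the begin-of-text token may also be added automatically at tokenization time, so it is worth checking that it is not duplicated.

```python
def build_llama3_prompt(system_prompt, user_msg):
    """Assemble a single-turn prompt in the format shown above (hypothetical helper)."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_msg}<|eot_id|>"
        # Leave the assistant turn open so the model generates the reply
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(build_llama3_prompt("You answer questions about a paper.", "What is RAG?"))
```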

In [None]:
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


In [None]:
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=200,
    eos_token_id=terminators,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """
<|start_header_id|>user<|end_header_id|>
You are an assistant for answering questions about a paper about machine learning you have read.
You are given the extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "I do not know." Don't make up an answer.
Question: {question}
Context: {context}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
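If the pipe syntax is unfamiliar, the chain above can be read as four plain steps: retrieve chunks for the question, join them into one context string, fill the prompt template, and pass the result to the model. The sketch below mimics that flow with toy stand-ins (`fake_retrieve`, `fake_generate`, and `toy_template` are hypothetical) so it runs without a model or vector store; it is not how LangChain implements the chain internally.

```python
def run_rag(question, retrieve, generate, template):
    """Mimic the chain: retrieve -> join chunks -> fill template -> generate."""
    chunks = retrieve(question)       # stands in for `retriever`
    context = "\n\n".join(chunks)     # stands in for `format_docs`
    filled = template.format(question=question, context=context)  # stands in for `prompt`
    return generate(filled)           # stands in for `llm | StrOutputParser()`

# Toy stand-ins so the sketch runs on its own
def fake_retrieve(question):
    return ["Chunk one.", "Chunk two."]

def fake_generate(prompt_text):
    return f"(model output for a prompt of {len(prompt_text)} characters)"

toy_template = "Question: {question}\nContext: {context}"
print(run_rag("What is RAG?", fake_retrieve, fake_generate, toy_template))
```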



Tada! Your RAG is ready to use. Pass a question, the retriever will add relevant context from your document, and the selected Llama model will generate an answer. Here, my document is the paper on RAG faithfulness downloaded in step 3.

In [None]:
question = "What is this paper about RAG models about?"
rag_chain.invoke(question)

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


'This paper appears to be discussing the limitations and potential pitfalls of Retrieval Augmented Generation (RAG) models, specifically in regards to their ability to handle incorrect or misleading information retrieved from external sources. The authors seem to be highlighting the importance of evaluating the robustness and reliability of RAG models, particularly in situations where they may be exposed to noisy or corrupted data. They also mention the need to develop metrics to assess the performance of RAG models in these scenarios. It looks like the paper is exploring the challenges and complexities associated with using RAG models to augment the capabilities of Large Language Models (LLMs).'

8. Let's generate responses from the selected model for our test set, add them to our dataframe, and save them. Note that this took 37 minutes to run for 100 questions!

In [None]:
if model_version == 'llama3':
  answer_col = "llama3_answer"
elif model_version == 'llama3.1':
  answer_col = "llama3.1_answer"

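if answer_col not in clean_test.columns:
  test_questions = clean_test['question'].tolist()
  responses = rag_chain.batch(test_questions)

  # Assign by position: after filtering, clean_test keeps its original
  # (non-contiguous) index, so writing with an enumerate() counter would
  # misalign answers and create extra rows.
  clean_test = clean_test.assign(**{answer_col: responses})
else:
  print(answer_col +' answers already exist')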

llama3.1_answer answers already exist


In [None]:
# Save the updated test set to a CSV file
clean_test.to_csv(test_set_path, index=False)

9. We will evaluate our systems' performance by following Ragas's [evaluation documentation](https://docs.ragas.io/en/latest/getstarted/evaluation.html).   

In [None]:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, faithfulness, answer_relevancy, context_recall, answer_similarity, answer_correctness

def process_and_evaluate(clean_test, answer_col, model_version):
    def prepare_data(df, answer_column):
        eval_data = df.copy()
        eval_data = eval_data.rename(columns={answer_column: 'answer'})
        eval_data = eval_data[['question', 'ground_truth', 'answer', 'contexts']]

        num_nan_answers = eval_data['answer'].isna().sum()
        print(num_nan_answers, 'NaN responses removed')

        # Copy after dropping rows so the assignment below does not write into a view
        eval_data = eval_data.dropna(subset=['answer']).copy()

        # Ragas expects a list of context strings per row
        eval_data['contexts'] = eval_data['contexts'].apply(lambda x: [x])

        return eval_data

    def create_dataset(eval_data):
        dataset_dict = {
            'question': eval_data['question'].tolist(),
            'contexts': eval_data['contexts'].tolist(),
            'answer': eval_data['answer'].tolist(),
            'ground_truth': eval_data['ground_truth'].tolist()
        }
        return Dataset.from_dict(dataset_dict)

    def run_evaluation(dataset):
        return evaluate(
            dataset,
            metrics=[
                context_precision,
                faithfulness,
                answer_relevancy,
                context_recall,
                answer_similarity,
                answer_correctness,
            ],
        )

    # Main execution
    eval_data = prepare_data(clean_test, answer_col)
    dataset = create_dataset(eval_data)
    result = run_evaluation(dataset)

    print(model_version)
    print(result)
    return result

# Usage example:
# result = process_and_evaluate(clean_test, 'llama3_answer', 'Llama3')
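One detail worth knowing about the `contexts` column: when the test set has been saved to CSV and loaded back, each cell is likely a string rather than a list, which is why the function above wraps it as a one-item list. If you would rather recover the individual context strings, one possible approach is sketched below. It assumes the cell holds the Python-style text of a list (for example `"['a', 'b']"`), and `parse_contexts` is a hypothetical helper, not part of Ragas or pandas.

```python
import ast

def parse_contexts(cell):
    """Turn a CSV cell into a list of context strings (hypothetical helper)."""
    if isinstance(cell, list):
        return cell
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a Python literal: treat the whole cell as a single context
        return [str(cell)]
    if isinstance(parsed, (list, tuple)):
        return [str(item) for item in parsed]
    return [str(cell)]

print(parse_contexts("['first chunk', 'second chunk']"))
print(parse_contexts("plain text that is not a list"))
```

Whether splitting the contexts changes your scores is something to test rather than assume, so compare both variants on a few rows first.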

In [None]:
result1 = process_and_evaluate(clean_test, 'llama3_answer', 'Llama3')

11 NaN responses removed


Evaluating:   0%|          | 0/426 [00:00<?, ?it/s]

Llama3


In [None]:
result2 = process_and_evaluate(clean_test, 'llama3.1_answer', 'Llama3.1')

0 NaN responses removed


Evaluating:   0%|          | 0/492 [00:00<?, ?it/s]

Llama3.1


10. Let's take a look at the results we previewed at the start!

In [None]:
from tabulate import tabulate
data = [
    ["Metric", "Llama3", "Llama3.1"],
    ["Context Precision", result1["context_precision"], result2["context_precision"]],
    ["Faithfulness", result1["faithfulness"], result2["faithfulness"]],
    ["Answer Relevancy", result1["answer_relevancy"], result2["answer_relevancy"]],
    ["Context Recall", result1["context_recall"], result2["context_recall"]],
    ["Answer Similarity", result1["answer_similarity"], result2["answer_similarity"]],
    ["Answer Correctness", result1["answer_correctness"], result2["answer_correctness"]]
]

# Create and display the table
table = tabulate(data, headers="firstrow", tablefmt="fancy_grid", floatfmt=".4f")
print(table)

╒════════════════════╤══════════╤════════════╕
│ Metric             │   Llama3 │   Llama3.1 │
╞════════════════════╪══════════╪════════════╡
│ Context Precision  │   1.0000 │     1.0000 │
├────────────────────┼──────────┼────────────┤
│ Faithfulness       │   0.1830 │     0.5621 │
├────────────────────┼──────────┼────────────┤
│ Answer Relevancy   │   0.7094 │     0.7531 │
├────────────────────┼──────────┼────────────┤
│ Context Recall     │   0.9636 │     0.9709 │
├────────────────────┼──────────┼────────────┤
│ Answer Similarity  │   0.7931 │     0.9179 │
├────────────────────┼──────────┼────────────┤
│ Answer Correctness │   0.3452 │     0.5638 │
╘════════════════════╧══════════╧════════════╛
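To see at a glance where the two models differ most, the per-metric gap can be computed from the table. The sketch below copies the values from the table above into plain dictionaries; `metric_deltas` is a hypothetical helper.

```python
# Values copied from the results table above
llama3 = {
    "Context Precision": 1.0000, "Faithfulness": 0.1830, "Answer Relevancy": 0.7094,
    "Context Recall": 0.9636, "Answer Similarity": 0.7931, "Answer Correctness": 0.3452,
}
llama31 = {
    "Context Precision": 1.0000, "Faithfulness": 0.5621, "Answer Relevancy": 0.7531,
    "Context Recall": 0.9709, "Answer Similarity": 0.9179, "Answer Correctness": 0.5638,
}

def metric_deltas(baseline, candidate):
    """Candidate minus baseline for each metric, rounded to 4 decimals."""
    return {name: round(candidate[name] - baseline[name], 4) for name in baseline}

# Print the largest improvements first
deltas = metric_deltas(llama3, llama31)
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20}{delta:+.4f}")
```

The retrieval metrics barely move, which is expected since both runs share the same retriever; the gap is concentrated in the generation metrics.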


Historical Note: I first created this notebook in May 2024, and then re-ran it with minor changes in July 2024.

My original stats for Llama3 in May 2024 were:

'context_precision': 0.9867,

'faithfulness': 0.8297,

'answer_relevancy': 0.8643,

'context_recall': 0.9733

When I re-ran the notebook in July 2024, they were:

'context_precision': 1.0000,

'faithfulness': 0.1684,

'answer_relevancy': 0.6984,

'context_recall': 0.9777

You can see that's a massive difference for some of these metrics, faithfulness in particular. And I think the new metrics better match my brief subjective evaluation.



So, what changed? I can guess that it was either

(1) better Q/A generation by GPT-4o, or better evaluation, since the model has likely been updated in the last 2 months

or

(2) An update to the evaluation flow by Ragas

It was not due to a decrease in performance by Llama3, since the model weights have been static since April 18, 2024.
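One way to make this kind of drift easier to attribute next time is to record the installed library versions alongside each set of results. The sketch below uses the standard library's `importlib.metadata`; the package list is only an example, and `installed_versions` is a hypothetical helper. It cannot capture changes on the hosted model side, only local package versions.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Example package list; adjust to whatever your run depends on
print(installed_versions(["ragas", "langchain", "transformers", "openai"]))
```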

This is not the first time a notebook I wrote worked better months after I created it, without any changes on my end, so sometimes it pays to wait until the AI improves 😀