# Evaluation

One thing you might be wondering is how we can evaluate the RAG process. It's hard, but there are a few techniques we can use. Here we will demonstrate two of them:

- Semantic similarity

- Faithfulness

The core of these methods (and of many methods that evaluate RAG systems) involves feeding the entire paper into an LLM and asking it to generate some questions and answers based on the paper. We can then measure the semantic similarity between the RAG system's response and the generated reference answer. We can also ask a model to judge whether the response it gave can actually be inferred from the retrieved context.

In [1]:
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.node_parser import SentenceSplitter

from pydantic import BaseModel, Field

import fitz

from PIL import Image
import matplotlib.pyplot as plt

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

import dotenv
import os

from openai import OpenAI

from jinja2 import Environment, FileSystemLoader, select_autoescape
from typing import Any
import json

dotenv.load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

We will use the same approach as in the previous notebook, so we have moved a bunch of our code into a `utils.py` file. We have mostly kept things the same, but have a look over it and make sure you understand how it all works.

In [4]:
from utils import chunker, DocumentDB, load_template

loader = PyMuPDFReader()
documents = loader.load(file_path="data/paper.pdf")
text_chunks, doc_idxs = chunker(chunk_size=1024, overlap=128, documents=documents)

# we can do this because we have secretly called the 
doc_db = DocumentDB("paper_db", path="../data-storage-and-ingestion/")

In [5]:
doc_db.query_db("Abstract")

{'ids': [['chunk_30', 'chunk_12']],
 'distances': [[0.7026067185204117, 0.7119312067010161]],
 'metadatas': [[{'doc_idx': 27}, {'doc_idx': 11}]],
 'embeddings': None,
 'documents': [['A Philosophical Introduction to Language Models\nPart I\nQiu, L., Shaw, P., Pasupat, P., Nowak, P., Linzen, T., Sha, F. & Toutanova, K. (2022), Improving\nCompositional Generalization with Latent Structure and Data Augmentation, in ‘Proceedings of the\n2022 Conference of the North American Chapter of the Association for Computational Linguistics:\nHuman Language Technologies’, Association for Computational Linguistics, Seattle, United States,\npp. 4341–4362.\nQuilty-Dunn, J., Porot, N. & Mandelbaum, E. (2022), ‘The Best Game in Town: The Re-Emergence of\nthe Language of Thought Hypothesis Across the Cognitive Sciences’, Behavioral and Brain Sciences\npp. 1–55.\nRaffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. & Liu, P. J. (2020),\n‘Exploring the limits of transfer le
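The result above is nested: each field holds one inner list per query. A minimal sketch for flattening it into one row per retrieved chunk, assuming the dictionary has the `ids`, `distances`, and `documents` keys shown in the output (the helper name `flatten_query_result` is hypothetical and not part of `utils.py`):

```python
def flatten_query_result(result: dict, query_index: int = 0) -> list[dict]:
    """Turn a nested query result into one row per retrieved chunk."""
    ids = result["ids"][query_index]
    distances = result["distances"][query_index]
    documents = result["documents"][query_index]
    rows = [
        {"id": chunk_id, "distance": distance, "text": text}
        for chunk_id, distance, text in zip(ids, distances, documents)
    ]
    # Smaller distance means a closer match, so sort ascending.
    return sorted(rows, key=lambda row: row["distance"])


# Toy result with the same shape as the output above (texts are placeholders).
example_result = {
    "ids": [["chunk_30", "chunk_12"]],
    "distances": [[0.7026, 0.7119]],
    "documents": [["first chunk text", "second chunk text"]],
}
for row in flatten_query_result(example_result):
    print(row["id"], round(row["distance"], 3))
```

Working with flat rows makes it easier to inspect which chunks were retrieved and how close each one was to the query.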

### Generate question-answer pairs
For this, we will use `gpt-4o` because we want high-quality question-answer pairs. Ideally, you would do this with humans: subject matter experts would carefully hand-craft these pairs.

The first stage is to generate 10 Q&A pairs, again using Pydantic to define the output structure. The implementations presented here closely follow the method used by the [RAGAS](https://docs.ragas.io/en/stable/getstarted/index.html#get-started) library.

We implement a Pydantic `BaseModel` class that will hold our lists of questions and answers.

In [6]:
class QAPairs(BaseModel):
    questions: list[str] = Field(..., title="List of questions")
    answers: list[str] = Field(..., title="List of answers")

print(QAPairs.model_json_schema())

{'properties': {'questions': {'items': {'type': 'string'}, 'title': 'List of questions', 'type': 'array'}, 'answers': {'items': {'type': 'string'}, 'title': 'List of answers', 'type': 'array'}}, 'required': ['questions', 'answers'], 'title': 'QAPairs', 'type': 'object'}


Next, we need a prompt that we can use to generate these Q&A pairs. It looks something like this:

---
```
You are a reading comprehension system that is an expert at extracting information from academic papers.
Your task is to carefully read the provided text "CONTEXT" and then generate question and answer pairs.
Your questions should be concise. Your answers should be as detailed as possible, including any mathematical or numerical results from the text.
You should aim to produce approximately one paragraph for your answers (100-200 words).
Your questions should be a mixture of general, high-level concepts, and also highly detailed questions about specific points, including any mathematical or numerical results.
You should respond in JSON format according to the following schema:

{{ schema }}

You should generate {{ number }} question and answer pairs.
```
---

In [7]:
system_prompt_qa = load_template(
    "prompts/qa_generation_system_prompt.jinja",
    {
        "number" : 10,
        "schema" : QAPairs.model_json_schema()
    }
)

In [8]:
print(system_prompt_qa)

You are a reading comprehension system that is an expert at extracting information from academic papers.
Your task is to carefully read the provided text "CONTEXT" and then generate question and answer pairs.
Your questions should be concise. Your answers should be as detailed as possible, including any mathematical or numerical results from the text.
You should aim to produce approximately one paragraph for your answers (100-200 words).
Your questions should be a mixture of general, high-level concepts, and also highly detailed questions about specific points, including any mathematical or numerical results.
You should respond in JSON format according to the following schema:

{'properties': {'questions': {'items': {'type': 'string'}, 'title': 'List of questions', 'type': 'array'}, 'answers': {'items': {'type': 'string'}, 'title': 'List of answers', 'type': 'array'}}, 'required': ['questions', 'answers'], 'title': 'QAPairs', 'type': 'object'}

You should generate 10 question and answe

Next, we need to join the pages of the PDF into a single text string.

In [9]:
pdf_text = " ".join([doc.text for doc in documents])

Finally, we are in a position to generate our question-answer pairs using `gpt-4o`.

In [10]:
client = OpenAI()

user_prompt = (
    f"CONTEXT:\n\n{pdf_text}"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt_qa},
        {"role": "user", "content": user_prompt}
    ],
    temperature=0.1,
    response_format={"type": "json_object"}
)

We then create the `QAPairs` object using the LLM output, and also save it to file.

In [11]:
questions_answers = QAPairs(**json.loads(response.choices[0].message.content))

# save the Q&A to file
with open("data/qa.json", "w") as f:
    json.dump(questions_answers.model_dump(), f, indent=4)
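Generating the pairs costs an API call, so it can help to reload them from disk in later sessions. Note that `QAPairs` checks the types of the two lists but not that they have the same length. A minimal sketch of a loader that adds that check, assuming the file layout written above (the function name `load_qa_pairs` is hypothetical):

```python
import json


def load_qa_pairs(path: str) -> list[tuple[str, str]]:
    """Load saved Q&A lists and return them as (question, answer) tuples."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    questions, answers = data["questions"], data["answers"]
    # The two lists are only meaningful if they line up one-to-one.
    if len(questions) != len(answers):
        raise ValueError(
            f"Got {len(questions)} questions but {len(answers)} answers"
        )
    return list(zip(questions, answers))
```

Under these assumptions, `load_qa_pairs("data/qa.json")` would return the saved pairs in their original order.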

What does an example look like?

In [12]:
print(questions_answers.questions[0])
print('---')
print(questions_answers.answers[0])

What are the main philosophical debates surrounding large language models (LLMs) like GPT-4?
---
The philosophical debates surrounding LLMs like GPT-4 focus on their linguistic and cognitive competence. These debates echo classic discussions about artificial neural networks as cognitive models. Key topics include compositionality, language acquisition, semantic competence, grounding, world models, and cultural knowledge transmission. The success of LLMs challenges long-held assumptions about neural networks, but further empirical investigation is needed to understand their internal mechanisms. Philosophers are particularly interested in whether LLMs can model human cognitive processes better than classical symbolic models, and whether they possess genuine understanding or merely mimic human-like responses.


### Semantic Similarity

We can now compute a cosine similarity score between the RAG system's response and the reference answer. To do that, we first run the RAG pipeline on one of the generated questions.

In [13]:
from utils import rag_query

example_query = questions_answers.questions[0]

response, context = rag_query(
    query=example_query,
    n_context=5,
    doc_db=doc_db,
    return_context=True
)

print(response)

The main philosophical debates surrounding large language models (LLMs) like GPT-4 include:

1. **Cognitive Competence**: There is ongoing disagreement about whether LLMs can be meaningfully ascribed linguistic or cognitive competence, echoing classic debates about the status of artificial neural networks as cognitive models.

2. **Intelligence vs. Behavior**: Philosophers question the link between intelligence and observable behavior, as LLMs can produce human-like responses without necessarily understanding the inputs, leading to discussions about whether they are merely "Blockheads" that regurgitate information.

3. **Data Contamination**: Concerns about "data contamination" arise when LLMs' training sets include the very questions they are assessed on, complicating comparisons between human and LLM performance.

4. **Internal Mechanisms**: There is a need for further empirical investigation to understand the internal mechanisms of LLMs, as their ability to generate novel outputs ra

We now measure the semantic similarity between the predicted response and the reference answer by embedding both texts and comparing the embeddings.

In [15]:
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()

response_embedding = client.embeddings.create(
    input=response,
    model="text-embedding-3-small"
).data[0].embedding

answer_embedding = client.embeddings.create(
    input=questions_answers.answers[0],
    model="text-embedding-3-small"
).data[0].embedding

In [16]:
cosine_similarity([response_embedding], [answer_embedding])

array([[0.85309252]])
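For reference, the score returned by `cosine_similarity` is the dot product of the two embedding vectors divided by the product of their lengths. A minimal pure-Python sketch of the same quantity, using short made-up vectors in place of real embeddings (the function name `cosine` is ours):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, not real embeddings.
print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction → 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal → 0.0
```

A score close to 1 means the two texts point in nearly the same direction in embedding space, so the value above suggests the response and the reference answer are semantically close without being identical.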

In [17]:
print(response)
print('---')
print(questions_answers.answers[0])

The main philosophical debates surrounding large language models (LLMs) like GPT-4 include:

1. **Cognitive Competence**: There is ongoing disagreement about whether LLMs can be meaningfully ascribed linguistic or cognitive competence, echoing classic debates about the status of artificial neural networks as cognitive models.

2. **Intelligence vs. Behavior**: Philosophers question the link between intelligence and observable behavior, as LLMs can produce human-like responses without necessarily understanding the inputs, leading to discussions about whether they are merely "Blockheads" that regurgitate information.

3. **Data Contamination**: Concerns about "data contamination" arise when LLMs' training sets include the very questions they are assessed on, complicating comparisons between human and LLM performance.

4. **Internal Mechanisms**: There is a need for further empirical investigation to understand the internal mechanisms of LLMs, as their ability to generate novel outputs ra

### Faithfulness
This is a little more complicated. First, we get an LLM to extract key statements from the answer. For example:

```python
[
    ['This study was conducted by Mallinson et al.'],
    ['The main focus is to investigate avalanches and criticality in self-organized nanoscale network.'],
    ['They analyzed electrical conductance.'],
    ['They analyzed the behavior of the networks under various stimulus conditions.'],
]
```

We then ask a second LLM to look at each statement and see if that statement can be inferred from the text, assigning a score of 0 for no, and 1 for yes.

To do this, we create three additional Pydantic classes:

In [18]:
class Statements(BaseModel):
    simpler_statements: list[str] = Field(..., description="the simpler statements")


class StatementFaithfulnessAnswer(BaseModel):
    statement: str = Field(..., description="the original statement, word-for-word")
    reason: str = Field(..., description="the reason of the verdict")
    verdict: int = Field(..., description="the verdict(0/1) of the faithfulness.")


class Faithfulness(BaseModel):
    answers: list[StatementFaithfulnessAnswer] = Field(..., description="the faithfulness answers")
    score: float = Field(..., description="the faithfulness score")
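The `score` field summarises the per-statement verdicts. A minimal sketch, assuming the score is the fraction of statements judged as supported by the context (the function name and the example verdicts are illustrative, and returning 0.0 for an empty list is our own convention):

```python
def faithfulness_score(verdicts: list[int]) -> float:
    """Fraction of statements with verdict 1 (supported by the context)."""
    if any(v not in (0, 1) for v in verdicts):
        raise ValueError("Each verdict must be 0 or 1")
    if not verdicts:
        # No statements to judge; return 0.0 by convention.
        return 0.0
    return sum(verdicts) / len(verdicts)


# Three of four statements can be inferred from the context.
print(faithfulness_score([1, 1, 0, 1]))  # → 0.75
```

A score of 1.0 would mean every extracted statement is supported by the retrieved context, while lower values point to statements the model may have made up.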


We also create two more prompts, `statement_instruction` and `faithfulness_instruction`:

---
```
Given a piece of text, analyze the complexity of each sentence and break down each sentence into one or more fully understandable statements while also ensuring no pronouns are used in each statement. Format the outputs in JSON, according to the following schema:

{{ schema }}

Here is a new piece of text:

{{ statement }}
```
---

---
```
Your task is to judge the faithfulness of a statement based on a given context. For the statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context.

You will give the exact statement, the reason, and the verdict.

Format the outputs in JSON, according to the following schema:

{{ schema }}

Here is a statement:

{{ statement }}
```
---

In [19]:
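def get_statements(answer: str) -> Statements:
    """Ask the LLM to break `answer` into simple, pronoun-free statements."""
    prompt = load_template(
        "prompts/faithfulness/statement_instruction.jinja",
        {
            "schema" : Statements.model_json_schema(),
            # the key must match the `{{ statement }}` placeholder in the template
            "statement" : answer
        }
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": answer}
        ],
        temperature=0.0,  # keep the extraction as deterministic as possible
        response_format={"type": "json_object"}
    ).choices[0].message.content

    # validate the JSON output against the Statements schema
    return Statements(**json.loads(response))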

In [21]:
statements = get_statements(response)

In [22]:
from rich.pretty import pprint
print(response)
pprint(statements)

The main philosophical debates surrounding large language models (LLMs) like GPT-4 include:

1. **Cognitive Competence**: There is ongoing disagreement about whether LLMs can be meaningfully ascribed linguistic or cognitive competence, echoing classic debates about the status of artificial neural networks as cognitive models.

2. **Intelligence vs. Behavior**: Philosophers question the link between intelligence and observable behavior, as LLMs can produce human-like responses without necessarily understanding the inputs, leading to discussions about whether they are merely "Blockheads" that regurgitate information.

3. **Data Contamination**: Concerns about "data contamination" arise when LLMs' training sets include the very questions they are assessed on, complicating comparisons between human and LLM performance.

4. **Internal Mechanisms**: There is a need for further empirical investigation to understand the internal mechanisms of LLMs, as their ability to generate novel outputs ra

In [23]:
def get_faithfulness(statements : Statements, context):
    context_joined = " ".join(context)
    faithfulness_answers = []

    for statement in statements.simpler_statements:
        prompt = load_template(
            "prompts/faithfulness/faithfulness_instruction.jinja",
            {
                "schema" : StatementFaithfulnessAnswer.model_json_schema(),
                "statement" : statement,
                "context" : context_joined
            }
        )

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": context_joined}
            ],
            temperature=0.0,
            response_format={"type": "json_object"}
        ).choices[0].message.content

        faithfulness_answers.append(StatementFaithfulnessAnswer(**json.loads(response)))

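    # fraction of statements judged as supported; 0.0 if no statements were extracted
    score = (
        sum(answer.verdict for answer in faithfulness_answers) / len(faithfulness_answers)
        if faithfulness_answers
        else 0.0
    )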

    return Faithfulness(answers=faithfulness_answers, score=score)

In [24]:
results = get_faithfulness(statements, context)

In [25]:
pprint(results)
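So far we have scored a single question. To evaluate the whole set, the same steps can be repeated for every pair and the scores averaged. The sketch below assumes two callables supplied by the caller: `answer_fn(question)`, which returns a response and its retrieved context (for example a thin wrapper around `rag_query`), and `score_fn(response, reference, context)`, which returns a number. Both names are hypothetical, and the stand-ins at the bottom exist only so the example runs without API calls.

```python
from statistics import mean
from typing import Callable


def evaluate_pairs(
    questions: list[str],
    answers: list[str],
    answer_fn: Callable[[str], tuple[str, list[str]]],
    score_fn: Callable[[str, str, list[str]], float],
) -> dict:
    """Score every question-answer pair and report the mean score."""
    if len(questions) != len(answers):
        raise ValueError("questions and answers must have the same length")
    if not questions:
        raise ValueError("at least one question-answer pair is required")

    scores = []
    for question, reference in zip(questions, answers):
        response, context = answer_fn(question)
        scores.append(score_fn(response, reference, context))
    return {"scores": scores, "mean": mean(scores)}


# Stand-ins so the sketch runs offline; replace with real RAG and metric calls.
def fake_answer(question: str) -> tuple[str, list[str]]:
    return question.upper(), ["some retrieved context"]


def exact_match(response: str, reference: str, context: list[str]) -> float:
    return 1.0 if response == reference else 0.0


report = evaluate_pairs(["a", "b"], ["A", "x"], fake_answer, exact_match)
print(report)  # {'scores': [1.0, 0.0], 'mean': 0.5}
```

Passing the metric in as a function means the same loop can report either semantic similarity or faithfulness, depending on which `score_fn` is supplied.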