# Offline (RAG) evaluation

Recap of the RAG flow from module 1:

In [1]:
# module1
def rag(q):
    search_results = search(q)
    prompt = build_prompt(q, search_results)
    answers = llm(prompt)
    return answers

In the second module we covered open-source LLMs that can be used in place of ChatGPT. We also covered vector search and how to evaluate it.

Evaluating retrieval

* Hit-rate
* MRR (mean reciprocal rank)
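
As a reminder of what these two retrieval metrics measure, here is a minimal sketch. It assumes a relevance matrix with one row per ground-truth question and one boolean per returned result (True when the result is the expected document). The names `hit_rate` and `mrr` and the toy data are illustrative, not taken from any library.

```python
def hit_rate(relevance):
    """Fraction of queries whose expected document appears anywhere in the results."""
    hits = sum(1 for row in relevance if any(row))
    return hits / len(relevance)


def mrr(relevance):
    """Mean reciprocal rank: average of 1/rank of the first relevant result (0 if none)."""
    total = 0.0
    for row in relevance:
        for rank, is_relevant in enumerate(row, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance)


# Toy data (assumed): three queries, three results each
example = [
    [True, False, False],   # expected document at rank 1
    [False, False, True],   # expected document at rank 3
    [False, False, False],  # expected document not retrieved
]

print(hit_rate(example))
print(mrr(example))
```

Hit-rate only asks whether the expected document was retrieved at all, while MRR also rewards placing it near the top.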

Next: how to evaluate **the whole RAG** system.

There are two types of evaluation:

* Offline evaluation: We build a system and test it using some metric *before* deploying it
    * Cosine similarity
    * LLM as a judge

    ground truth dataset

    answer (original) -> question (using llm) -> answer (llm)

    cosine_similarity(answer_original, answer_llm)
    llm_as_a_judge(answer_original, answer_llm)
    llm_as_a_judge(question, answer_llm)

* Online evaluation: We evaluate the system once it has been deployed.

    * A/B tests, experiments
    * User feedback

* Monitoring: Observing the overall health of the system (this can include online evaluation) and the quality of its answers.

In this case we can reuse our ground truth data.
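
For the LLM-as-a-judge variants, the judge needs a prompt that shows it what to compare. Below is a minimal sketch of the original-answer versus generated-answer case. The prompt wording, the three relevance labels, and the `build_judge_prompt` name are assumptions made for illustration; the resulting string would be passed to whichever LLM client is in use.

```python
# Hypothetical judge prompt; the wording and labels are illustrative assumptions
JUDGE_TEMPLATE = """
You are an expert evaluator for a RAG system.
Compare the generated answer to the original answer and classify its relevance
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Original answer: {answer_orig}
Generated question: {question}
Generated answer: {answer_llm}

Reply with JSON only: {{"Relevance": "...", "Explanation": "..."}}
""".strip()


def build_judge_prompt(record):
    """Fill the template from a dict with answer_orig, question and answer_llm keys."""
    return JUDGE_TEMPLATE.format(**record)


record = {
    "answer_orig": "Everything is recorded, so you won't miss anything.",
    "question": "Are sessions recorded if I miss one?",
    "answer_llm": "Yes, all sessions are recorded.",
}

prompt = build_judge_prompt(record)
print(prompt)
```

Asking for a fixed set of labels in JSON makes the judge's output easy to parse and count afterwards.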

### Load our documents


In [2]:
import requests 

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-vector-search/eval/documents-with-ids.json?raw=1'
docs_response = requests.get(docs_url)
documents = docs_response.json()

### Load ground truth

In [3]:
import pandas as pd

base_url = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main"
relative_url = '03-vector-search/eval/ground-truth-data.csv'
ground_truth_url = f'{base_url}/{relative_url}?raw=1'

df_ground_truth = pd.read_csv(ground_truth_url)
# filtering just one course
df_ground_truth = df_ground_truth[df_ground_truth.course == "machine-learning-zoomcamp"]
ground_truth = df_ground_truth.to_dict(orient="records")

In [4]:
ground_truth[10]

{'question': 'Are sessions recorded if I miss one?',
 'course': 'machine-learning-zoomcamp',
 'document': '5170565b'}

In [5]:
doc_idx = {
    d['id']: d for d in documents
}
doc_idx[ground_truth[10]['document']]['text']

'Everything is recorded, so you won’t miss anything. You will be able to ask your questions for office hours in advance and we will cover them during the live stream. Also, you can always ask questions in Slack.'

## Index data

In [6]:
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-MiniLM-L6-cos-v1'
model = SentenceTransformer(model_name)



In [9]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch("http://localhost:9200")

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
            "question_text_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
        }
    }
}

index_name = "course-questions"
es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [10]:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    question = doc['question']
    text = doc['text']
    doc["question_text_vector"] = model.encode(question + ' ' + text)

    es_client.index(index=index_name, document=doc)

100%|██████████| 948/948 [01:46<00:00,  8.89it/s]


In [11]:
def elastic_search_knn(field, vector, course):

    knn = {
        "field": field,
        "query_vector": vector,
        "k": 5,
        "num_candidates": 10000,
        "filter": {
            "term": {
                "course": course
            }
        }
    }

    search_query = {
        "knn": knn,
        "_source": ["text", "section", "question", "course", "id"]
    }

    
    es_results = es_client.search(
        index=index_name,
        body=search_query
    )

    return [hit['_source'] for hit in es_results['hits']['hits']]

def question_text_vector_knn(q):
    
    question = q['question']
    course = q['course']

    v_q = model.encode(question)  

    return elastic_search_knn("question_text_vector", v_q, course)

In [12]:
question_text_vector_knn(
    {
        "question": "Are sessions recorded if I miss one?",
        "course":"machine-learning-zoomcamp"
    }
)

[{'question': 'What if I miss a session?',
  'course': 'machine-learning-zoomcamp',
  'section': 'General course-related questions',
  'text': 'Everything is recorded, so you won’t miss anything. You will be able to ask your questions for office hours in advance and we will cover them during the live stream. Also, you can always ask questions in Slack.',
  'id': '5170565b'},
 {'question': 'Is it going to be live? When?',
  'course': 'machine-learning-zoomcamp',
  'section': 'General course-related questions',
  'text': 'The course videos are pre-recorded, you can start watching the course right now.\nWe will also occasionally have office hours - live sessions where we will answer your questions. The office hours sessions are recorded too.\nYou can see the office hours as well as the pre-recorded course videos in the course playlist on YouTube.',
  'id': '39fda9f0'},
 {'question': 'The same accuracy on epochs',
  'course': 'machine-learning-zoomcamp',
  'section': '8. Neural Networks 

# The RAG Flow

In [29]:
def build_prompt(query, search_results):
    '''
    Builds a prompt for a course teaching assistant based on the given query and search results.
    The prompt includes the question, context from the FAQ database, and instructions for answering.    
    Returns a list of formatted prompt elements.
    '''

    context = ""

    for doc in search_results:
        context += f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']} \n \n"

    prompt_template = [
    """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

""",
f"""
QUESTION: {query}
""",
f"""
CONTEXT: {context}
"""
]

    return [element.strip() for element in prompt_template]

    

In [30]:
import os
from dotenv import load_dotenv
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai import generative_models

# Load environment variables, then initialize Vertex AI and the generative model.
load_dotenv()

PROJECT_ID = os.getenv('PROJECT_ID')
REGION = os.getenv('REGION')

vertexai.init(
    project = PROJECT_ID,
    location = REGION
)

vertex_llm = GenerativeModel("gemini-1.5-pro")

generation_config = {
    "max_output_tokens": 8192,
    "temperature": 0,
    "top_p": 0.95,
}

safety_config = {
    generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
}

def llm(prompt):
    responses = vertex_llm.generate_content(
        prompt,
        generation_config=generation_config,
        safety_settings=safety_config,
        stream=True,
    )

    return "".join(response.text for response in responses)

In [31]:
def rag(query:dict) -> str:
    search_results = question_text_vector_knn(query)
    prompt = build_prompt(query["question"], search_results)
    return llm(prompt)

In [32]:
ground_truth[10]

{'question': 'Are sessions recorded if I miss one?',
 'course': 'machine-learning-zoomcamp',
 'document': '5170565b'}

In [23]:
rag(ground_truth[10])

'Yes, all sessions are recorded.'

In [32]:
doc_idx[ground_truth[10]['document']]['text']

'Everything is recorded, so you won’t miss anything. You will be able to ask your questions for office hours in advance and we will cover them during the live stream. Also, you can always ask questions in Slack.'

## Cosine similarity metric

In [24]:
answer_orig = doc_idx[ground_truth[10]['document']]['text']

answer_llm = rag(ground_truth[10])

v_orig = model.encode(answer_orig)
v_llm = model.encode(answer_llm)

v_orig.dot(v_llm)

0.48217267
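
A note on the dot product used here: it equals cosine similarity only when both vectors have unit length. The `-cos-` in the model name suggests the embeddings are meant for cosine similarity, but it is worth verifying that they are normalized rather than relying on it. Below is a sketch of the explicit formula in plain Python, so it works on lists as well as NumPy arrays. `cosine_similarity` is an illustrative helper written for this note, not a library function.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Parallel vectors of different lengths: the raw dot product is 50,
# but the cosine similarity is 1
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))

# Orthogonal vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

In this notebook the call would be `cosine_similarity(v_orig, v_llm)`; if the embeddings are already unit length, it should match `v_orig.dot(v_llm)` up to floating-point error.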

In [33]:
answers = {}

In [40]:
import time

for i, rec in tqdm(enumerate(ground_truth)):
    if i in answers:
        continue    
    
    doc_id = rec['document']
    answer_orig = doc_idx[doc_id]['text']
    answer_llm = rag(rec)

    answers[i] = {
        "answer_llm": answer_llm,
        "answer_orig": answer_orig,
        "document": doc_id,
        "question": rec["question"],
        "course": rec["course"]
    }
    time.sleep(10)

1830it [3:00:24,  5.92s/it]
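
The `if i in answers: continue` check lets the loop resume after an interruption, but only while `answers` is still in memory. For a run of about three hours, saving the dict to disk now and then guards against a kernel restart as well. A hedged sketch follows; the file name and the `save_answers` / `load_answers` helpers are hypothetical.

```python
import json
import os
import tempfile


def save_answers(answers, path):
    """Write the answers dict to a JSON file."""
    with open(path, "w") as f:
        json.dump(answers, f)


def load_answers(path):
    """Read answers back, or return an empty dict if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        raw = json.load(f)
    # JSON object keys are always strings; restore the integer indices
    return {int(k): v for k, v in raw.items()}


# Round-trip demo in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "answers-checkpoint.json")
save_answers({0: {"answer_llm": "Yes, all sessions are recorded."}}, path)
restored = load_answers(path)
print(restored)
```

In the loop, `answers = load_answers(path)` would go before the `for`, and `save_answers(answers, path)` could be called every few iterations.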


In [49]:
results_gemini = [None] * len(ground_truth)

In [50]:
len(ground_truth)

1830

In [59]:
#for i, val in answers.items():
#    results_gemini[i] = val.copy()
#    results_gemini[i].update(ground_truth[i])

In [48]:
answers

{0: {'answer_llm': 'This question is not answerable from the given context. This FAQ section explains how to access course materials, how long the course is, if you can join late, and where to find the course Slack channel. It does not mention where to sign up for the course. \n',
  'answer_orig': 'Machine Learning Zoomcamp FAQ\nThe purpose of this document is to capture frequently asked technical questions.\nWe did this for our data engineering course and it worked quite well. Check this document for inspiration on how to structure your questions and answers:\nData Engineering Zoomcamp FAQ\nIn the course GitHub repository there’s a link. Here it is: https://airtable.com/shryxwLd0COOEaqXo\nwork',
  'document': '0227b872',
  'question': 'Where can I sign up for the course?',
  'course': 'machine-learning-zoomcamp'},
 1: {'answer_llm': 'You can find the link to sign up in the course GitHub repository: https://airtable.com/shryxwLd0COOEaqXo \n',
  'answer_orig': 'Machine Learning Zoomca

In [51]:
for i, val in answers.items():
   results_gemini[i] = val.copy()

In [52]:
import pandas as pd

df_gemini = pd.DataFrame(results_gemini)

In [53]:
df_gemini.head()

| | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | This question is not answerable from the given... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can find the link to sign up in the course... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | This document only contains information about ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | This FAQ section does not contain the answer t... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | This FAQ section does not contain information ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


In [54]:
df_gemini.sample(n=5).to_dict(orient="records")

[{'answer_llm': 'This FAQ database does not contain information about the key concepts covered in Week 9 of the Serverless Deep Learning course. Therefore, I cannot answer your question. \n',
  'answer_orig': 'TODO',
  'document': '3ee083ab',
  'question': 'What are the key concepts covered in Week 9 of Serverless Deep Learning?',
  'course': 'machine-learning-zoomcamp'},
 {'answer_llm': 'The models for week 9 are on Github: https://github.com/DataTalksClub/machine-learning-zoomcamp/releases \n',
  'answer_orig': 'The week 9 uses a link to github to fetch the models.\nThe original link was moved to here:\nhttps://github.com/DataTalksClub/machine-learning-zoomcamp/releases',
  'document': 'f826cba4',
  'question': 'What website hosts the models for week 9?',
  'course': 'machine-learning-zoomcamp'},
 {'answer_llm': 'A few individually important features might be highly correlated with others, and dropping some might be fine. \n',
  'answer_orig': "I’m just looking back at the lessons 

In [55]:
df_gemini.to_csv("data/results-gemini15-pro.csv")