# LLM-as-a-judge



In [1]:
prompt1_template = """
You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer compared to the original answer provided.
Based on the relevance and similarity of the generated answer to the original answer, you will classify
it as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Original Answer: {answer_orig}
Generated Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the original
answer and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()

In [2]:
prompt2_template = """
You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()

For efficiency, we sample a subset of the records to keep evaluation costs down.

In [3]:
import pandas as pd

base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
relative_url = '04-monitoring/data/results-gpt4o-mini-cosine.csv'
url = f'{base_url}/{relative_url}?raw=1'

df_gpt4o_mini = pd.read_csv(url)


In [4]:
df_gpt4o_mini.head()

Unnamed: 0,answer_llm,answer_orig,document,question,course,cosine
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp,0.388594
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp,0.29836
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp,0.583048
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp,-0.019481
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp,0.358467


In [5]:
df_sample = df_gpt4o_mini.sample(n=150, random_state=1)

In [13]:
samples = df_sample.to_dict(orient="records")

In [14]:
record = samples[0]

In [16]:
record

{'answer_llm': 'The syntax for using `precision_recall_fscore_support` in Python is as follows:\n\n```python\nfrom sklearn.metrics import precision_recall_fscore_support\nprecision, recall, fscore, support = precision_recall_fscore_support(y_val, y_val_pred, zero_division=0)\n```',
 'answer_orig': 'Scikit-learn offers another way: precision_recall_fscore_support\nExample:\nfrom sklearn.metrics import precision_recall_fscore_support\nprecision, recall, fscore, support = precision_recall_fscore_support(y_val, y_val_pred, zero_division=0)\n(Gopakumar Gopinathan)',
 'document': '403bbdd8',
 'question': 'What is the syntax for using precision_recall_fscore_support in Python?',
 'course': 'machine-learning-zoomcamp',
 'cosine': 0.9010756}

In [19]:
prompt = prompt1_template.format(**record)

In [20]:
print(prompt)

You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer compared to the original answer provided.
Based on the relevance and similarity of the generated answer to the original answer, you will classify
it as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Original Answer: Scikit-learn offers another way: precision_recall_fscore_support
Example:
from sklearn.metrics import precision_recall_fscore_support
precision, recall, fscore, support = precision_recall_fscore_support(y_val, y_val_pred, zero_division=0)
(Gopakumar Gopinathan)
Generated Question: What is the syntax for using precision_recall_fscore_support in Python?
Generated Answer: The syntax for using `precision_recall_fscore_support` in Python is as follows:

```python
from sklearn.metrics import precision_recall_fscore_support
precision, recall, fscore, support = precision_recall_fscore_support(y_val, y_val_

In [21]:
from openai import OpenAI

client = OpenAI()

def llm(prompt, model='gpt-4o'):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable

In [None]:
# answer = llm(prompt, model = "gpt-4o-mini")

In [1]:
import os
from dotenv import load_dotenv

import google.generativeai as genai

load_dotenv()

GOOGLE_API_KEY = os.environ['GOOGLE_API_KEY']

genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-pro')


safety_settings={
    'HARM_CATEGORY_HARASSMENT': 'BLOCK_NONE',
    'HARM_CATEGORY_HATE_SPEECH': 'BLOCK_NONE',
    'HARM_CATEGORY_SEXUALLY_EXPLICIT': 'BLOCK_NONE',
    'HARM_CATEGORY_DANGEROUS_CONTENT': 'BLOCK_NONE'
}
generation_config=genai.types.GenerationConfig(
    candidate_count=1, 
    temperature=0
)

def llm(prompt):
    responses = model.generate_content(
        contents = prompt,
        safety_settings = safety_settings,
        generation_config = generation_config
    )

    return "".join(response.text for response in responses)



In [12]:
responses = model.generate_content(
    contents = "why is the sky blue?",
    safety_settings = safety_settings,
    generation_config = generation_config
)


In [17]:
responses.usage_metadata

prompt_token_count: 7
candidates_token_count: 116
total_token_count: 123

In [38]:
answer = llm(prompt)

In [39]:
import json

json.loads(answer)

{'Relevance': 'RELEVANT',
 'Explanation': 'The generated answer provides the exact syntax for using `precision_recall_fscore_support` in Python, which is the same as the original answer. It also includes an example of how to use the function, which is not present in the original answer but is still relevant to the question.'}

In [40]:
from tqdm.auto import tqdm

evaluations = []

for record in tqdm(samples):
    prompt = prompt1_template.format(**record)
    evaluation = llm(prompt)
    evaluations.append(evaluation)

100%|██████████| 150/150 [05:01<00:00,  2.01s/it]
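
The loop above makes 150 sequential API calls, so one transient failure would abort the whole run. One possible safeguard is a retry wrapper with exponential backoff. The sketch below is an assumption, not part of the notebook: `with_retries` and `flaky_llm` are hypothetical names, and catching every `Exception` is a deliberate simplification.

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Wrap fn so that it is retried with exponential backoff on any exception."""
    def wrapped(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the last error
                sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    return wrapped

# Stand-in for the real llm() call: fails twice, then succeeds.
calls = {"n": 0}

def flaky_llm(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return '{"Relevance": "RELEVANT", "Explanation": "stub"}'

# The sleep function is swapped out so this demo does not actually wait.
safe_llm = with_retries(flaky_llm, sleep=lambda seconds: None)
result = safe_llm("some prompt")
print(calls["n"], result)
```

In the evaluation loop, `safe_llm = with_retries(llm)` could then stand in for `llm`.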


In [41]:
json_evaluations = []

for i, evaluation in enumerate(evaluations):
    json_evaluation = json.loads(evaluation)
    json_evaluations.append(json_evaluation)
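
`json.loads` raises on the first response that is not valid JSON, for example when the model wraps its output in a Markdown code fence despite the instruction not to. A more forgiving parser might look like the sketch below. `parse_evaluation` is a hypothetical helper, and the sample strings are invented.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker

def parse_evaluation(text):
    """Parse one judge response; return None when it is not valid JSON."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        cleaned = "\n".join(lines)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None

# Invented responses: clean JSON, fenced JSON, and a refusal.
raw_outputs = [
    '{"Relevance": "RELEVANT", "Explanation": "ok"}',
    FENCE + 'json\n{"Relevance": "PARTLY_RELEVANT", "Explanation": "partial"}\n' + FENCE,
    "Sorry, I cannot evaluate this.",
]

parsed = [parse_evaluation(raw) for raw in raw_outputs]
failed = [i for i, item in enumerate(parsed) if item is None]
print(failed)
```

Keeping the indices of failed records makes it possible to re-run only those prompts rather than the whole sample.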

In [42]:
df_evaluations = pd.DataFrame(json_evaluations)

In [43]:
df_evaluations.Relevance.value_counts()

Relevance
RELEVANT           110
PARTLY_RELEVANT     27
NON_RELEVANT        13
Name: count, dtype: int64
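
The counts above assume the judge used only the three labels offered in the prompt. A quick schema check can catch responses that drift from them. This is a sketch under that assumption, with the hypothetical helper `find_invalid` and invented example data:

```python
ALLOWED_LABELS = {"NON_RELEVANT", "PARTLY_RELEVANT", "RELEVANT"}

def find_invalid(evaluations):
    """Return (index, reason) pairs for evaluations that break the expected schema."""
    problems = []
    for i, evaluation in enumerate(evaluations):
        if not isinstance(evaluation, dict):
            problems.append((i, "not a JSON object"))
        elif evaluation.get("Relevance") not in ALLOWED_LABELS:
            problems.append((i, f"unexpected label: {evaluation.get('Relevance')!r}"))
        elif "Explanation" not in evaluation:
            problems.append((i, "missing Explanation"))
    return problems

# Invented examples: the second uses a label the prompt never offered,
# the third lacks the Explanation field.
example_evaluations = [
    {"Relevance": "RELEVANT", "Explanation": "matches"},
    {"Relevance": "PARTIALLY_RELEVANT", "Explanation": "close"},
    {"Relevance": "NON_RELEVANT"},
]

problems = find_invalid(example_evaluations)
print(problems)
```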

Using the downloaded evaluations instead of running the loop above:

In [26]:
base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
relative_url = '04-monitoring/data/evaluations-aqa.csv'
url = f'{base_url}/{relative_url}?raw=1'

df_evaluations = pd.read_csv(url)


In [27]:
df_evaluations.head()

Unnamed: 0,Relevance,Explanation
0,RELEVANT,The generated answer directly addresses the ge...
1,RELEVANT,The generated answer directly addresses the qu...
2,RELEVANT,The generated answer directly addresses the qu...
3,RELEVANT,The generated answer accurately addresses the ...
4,NON_RELEVANT,The generated answer discusses a pip version e...


In [30]:
df_evaluations.Relevance.value_counts()

Relevance
RELEVANT           124
PARTLY_RELEVANT     16
NON_RELEVANT        10
Name: count, dtype: int64

In [29]:
df_evaluations.Relevance.describe()

count          150
unique           3
top       RELEVANT
freq           124
Name: Relevance, dtype: object

For the second prompt, which judges the generated answer against the question alone (Q -> A):

In [None]:
evaluations_2 = []

for record in tqdm(samples):
    prompt = prompt2_template.format(**record)
    evaluation = llm(prompt)
    evaluations_2.append(evaluation)
    
json_evaluations_2 = []

for i, str_eval in enumerate(evaluations_2):
    json_eval = json.loads(str_eval)
    json_evaluations_2.append(json_eval)
    
df_evaluations_2 = pd.DataFrame(json_evaluations_2)

In [31]:
base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
relative_url = '04-monitoring/data/evaluations-qa.csv'
url = f'{base_url}/{relative_url}?raw=1'

df_evaluations_2 = pd.read_csv(url)

In [32]:
df_evaluations_2.head()

Unnamed: 0,Relevance,Explanation
0,RELEVANT,The generated answer directly provides the cor...
1,RELEVANT,The generated answer directly addresses the qu...
2,RELEVANT,The generated answer correctly specifies the c...
3,RELEVANT,The generated answer directly addresses the qu...
4,PARTLY_RELEVANT,The generated answer addresses a potential cau...


In [33]:
df_evaluations_2.Relevance.value_counts()

Relevance
RELEVANT           129
PARTLY_RELEVANT     18
NON_RELEVANT         3
Name: count, dtype: int64
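
To compare the two prompts, it can help to turn the counts into shares. The sketch below copies the counts from the two downloaded evaluation files shown above; `shares` is a hypothetical helper.

```python
# Label counts copied from the value_counts() outputs of the downloaded files
# (150 records each).
prompt1_counts = {"RELEVANT": 124, "PARTLY_RELEVANT": 16, "NON_RELEVANT": 10}
prompt2_counts = {"RELEVANT": 129, "PARTLY_RELEVANT": 18, "NON_RELEVANT": 3}

def shares(counts):
    """Convert absolute label counts into fractions of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

prompt1_shares = shares(prompt1_counts)
prompt2_shares = shares(prompt2_counts)

for label in prompt1_counts:
    print(f"{label:<16} prompt 1: {prompt1_shares[label]:.1%}   prompt 2: {prompt2_shares[label]:.1%}")
```

On a DataFrame, `df_evaluations.Relevance.value_counts(normalize=True)` should give the same fractions directly.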