<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Build Fast with AI](https://img.shields.io/badge/BuildFastWithAI-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://www.buildfastwithai.com/genai-course)
[![EduChain GitHub](https://img.shields.io/github/stars/satvik314/educhain?style=for-the-badge&logo=github&color=gold)](https://github.com/satvik314/educhain)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1L8oSkTnhwUPnws4rwSDFrHrwAtUP376r?usp=sharing)

## Master Generative AI in 8 Weeks

**What You'll Learn:**

- Build with Latest LLMs
- Create Custom AI Apps
- Learn from Industry Experts
- Join Innovation Community

Transform your AI ideas into reality through hands-on projects and expert mentorship.

[Start Your Journey](https://www.buildfastwithai.com/genai-course)

*Empowering the Next Generation of AI Innovators*

# 🏛️ **Judges Library**  

## 📌 Overview  
`judges` is a lightweight Python library designed for **evaluating LLM-generated responses**. It provides a set of **LLM-as-a-Judge** evaluators, allowing users to assess outputs based on correctness, clarity, bias, and more.  

This library can be used **off-the-shelf** or as inspiration to build **custom evaluation pipelines** for Large Language Models (LLMs).  

## ✨ Key Features of Judges Library

- ✅ **Classifier Judges** – Boolean evaluations (True/False, Good/Bad).  
- 📊 **Grader Judges** – Numerical or Likert scale scoring.  
- ⚖️ **Jury System** – Combine multiple judges for diverse evaluation.  
- 🤖 **AutoJudge** – Create custom LLM judges from labeled datasets.  
- 🔥 **Multi-Model Support** – Works with OpenAI and LiteLLM providers.  
- 🛠️ **Easy Integration** – Simple `.judge()` API for quick evaluations.  

### **Installing Dependencies** 📦

In [None]:
%pip install "judges[auto]" instructor

### 🚀 **Set Up the API Key**

In [None]:
import os

from google.colab import userdata

# Read the key from Colab's secret store and expose it to the OpenAI client.
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
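
Outside Colab, `google.colab.userdata` is not available. The following is a minimal sketch of a fallback, assuming the key is either already set in the environment or can be typed in interactively. `resolve_api_key` is a hypothetical helper, not part of `judges` or the OpenAI SDK.

```python
import os
from getpass import getpass
from typing import Callable, Mapping, Optional


def resolve_api_key(
    name: str = "OPENAI_API_KEY",
    env: Optional[Mapping[str, str]] = None,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the key from the environment, prompting only if it is missing."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        # getpass hides the typed value so it does not land in the cell output.
        key = prompt(f"Enter {name}: ").strip()
    if not key:
        raise ValueError(f"{name} is required to call the OpenAI API")
    return key


# Typical use (commented out so the cell does not block on a prompt):
# os.environ["OPENAI_API_KEY"] = resolve_api_key()
```

Passing `env` and `prompt` as parameters keeps the helper easy to exercise without touching the real environment.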

### 📝 **Defining the Story and the Question**

In [None]:
from openai import OpenAI

client = OpenAI()

question = "What is the name of the rabbit in the following story? Respond with 'I don't know' if you don't know."

story = """
Fig was a small, scruffy dog with a big personality. He lived in a quiet little town where everyone knew his name. Fig loved adventures, and every day he would roam the neighborhood, wagging his tail and sniffing out new things to explore.

One day, Fig discovered a mysterious trail of footprints leading into the woods. Curiosity got the best of him, and he followed them deep into the trees. As he trotted along, he heard rustling in the bushes and suddenly, out popped a rabbit! The rabbit looked at Fig with wide eyes and darted off.

But instead of chasing it, Fig barked in excitement, as if saying, “Nice to meet you!” The rabbit stopped, surprised, and came back. They sat together for a moment, sharing the calm of the woods.

From that day on, Fig had a new friend. Every afternoon, the two of them would meet in the same spot, enjoying the quiet companionship of an unlikely friendship. Fig's adventurous heart had found a little peace in the simple joy of being with his new friend.
"""

### 🤖 **Generating Model Output for a Given Story Question**

In [None]:
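# Note: `input` shadows Python's built-in input(); the name is kept because later cells reference it.
input = f'{story}\n\nQuestion: {question}'

# Reference answer: the story never names the rabbit.
expected = "I don't know"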

output = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[
        {
            'role': 'user',
            'content': input,
        },
    ],
).choices[0].message.content

### 🤖 **Using a Judges Classifier LLM as an Evaluator Model**

In [None]:
from judges.classifiers.correctness import PollMultihopCorrectness


correctness = PollMultihopCorrectness(model='gpt-4o-mini')

judgment = correctness.judge(
    input=input,
    output=output,
    expected=expected,
)
print(judgment.reasoning)
print(judgment.score)

The Provided Answer matches the Reference Answer exactly, both indicating a lack of knowledge about the rabbit's name. Therefore, the answer is correct.
True


### ⚖️ **Using a Jury for Averaging and Diversification**

In [None]:
from judges import Jury
from judges.classifiers.correctness import PollMultihopCorrectness, RAFTCorrectness

poll = PollMultihopCorrectness(model='gpt-4o')
raft = RAFTCorrectness(model='gpt-4o-mini')

jury = Jury(judges=[poll, raft], voting_method="average")

verdict = jury.vote(
    input=input,
    output=output,
    expected=expected,
)
print(verdict.score)

0.5
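
A score of 0.5 is consistent with one of the two judges voting True and the other False. The sketch below illustrates only that arithmetic. `average_votes` is a hypothetical helper, and the library's internal aggregation may differ.

```python
from statistics import mean
from typing import Sequence


def average_votes(scores: Sequence[bool]) -> float:
    """Average boolean verdicts, counting True as 1.0 and False as 0.0."""
    if not scores:
        raise ValueError("at least one score is required")
    return mean(float(s) for s in scores)


print(average_votes([True, False]))  # 0.5
```

With more judges the average takes finer-grained values, for example 2 of 3 agreeing gives roughly 0.67.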


### 📊 **Creating a Labeled Dataset for AutoJudge**

In [None]:
from judges.classifiers.auto import AutoJudge

dataset = [
    {
        "input": "Can I ride a dragon in Scotland?",
        "output": "Yes, dragons are commonly seen in the highlands and can be ridden with proper training.",
        "label": 0,
        "feedback": "Dragons are mythical creatures; the information is fictional.",
    },
    {
        "input": "Can you recommend a good hotel in Tokyo?",
        "output": "Certainly! Hotel Sunroute Plaza Shinjuku is highly rated for its location and amenities. It offers comfortable rooms and excellent service.",
        "label": 1,
        "feedback": "Offers a specific and helpful recommendation.",
    },
    {
        "input": "Can I drink tap water in London?",
        "output": "Yes, tap water in London is safe to drink and meets high quality standards.",
        "label": 1,
        "feedback": "Gives clear and reassuring information.",
    },
    {
        "input": "What's the boiling point of water on the moon?",
        "output": "The boiling point of water on the moon is 100°C, the same as on Earth.",
        "label": 0,
        "feedback": "Boiling point varies with pressure; the moon's vacuum affects it.",
    }
]
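
Each record above carries four fields: `input`, `output`, a binary `label`, and free-text `feedback`. Before passing a larger dataset to AutoJudge, a quick structural check can catch malformed rows. This is a minimal sketch: `validate_records` is a hypothetical helper, and the required fields are inferred from the example above rather than from the library's schema.

```python
from typing import Any, Dict, List

# Field names inferred from the example dataset (an assumption, not a library schema).
REQUIRED_FIELDS = ("input", "output", "label", "feedback")


def validate_records(records: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means the data looks well-formed."""
    problems = []
    for i, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems.append(f"record {i}: missing {', '.join(missing)}")
            continue
        if record["label"] not in (0, 1):
            problems.append(f"record {i}: label must be 0 or 1, got {record['label']!r}")
        for field in ("input", "output", "feedback"):
            if not str(record[field]).strip():
                problems.append(f"record {i}: {field} is empty")
    return problems


sample = [
    {"input": "Can I drink tap water in London?", "output": "Yes.", "label": 1, "feedback": "Clear."},
    {"input": "Can I ride a dragon?", "output": "Yes.", "label": 2, "feedback": "Fictional."},
]
print(validate_records(sample))  # ['record 1: label must be 0 or 1, got 2']
```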


### 🤖 **Initializing AutoJudge for Custom Evaluation**

In [None]:
task = "Evaluate responses for accuracy, clarity, and helpfulness."

autojudge = AutoJudge.from_dataset(
    dataset=dataset,
    task=task,
    model="gpt-4-turbo-2024-04-09",
)

### 🏙️ **Evaluating LLM Response Using AutoJudge**

In [None]:
input_ = "What are the top attractions in New York City?"
output = "Some top attractions in NYC include the Statue of Liberty and Central Park."

judgment = autojudge.judge(input=input_, output=output)

print(judgment.reasoning)
print(judgment.score)

The output meets all evaluation criteria outlined in the grading note. First, the accuracy of content is upheld as the information regarding attractions in New York City, specifically the Statue of Liberty and Central Park, are based on real, verifiable facts with no fictional elements. Second, contextual accuracy is met because the query is about popular attractions and the answer is straightforward and directly related to what is typically sought by tourists and residents alike. Third, the clarity and understandability criterion is satisfied as the output is concise, clearly stated, and uses language that is easily understandable by a general audience; it avoids jargon and technical terms. Lastly, the helpfulness and relevance criterion is also met. The response directly addresses the user's query about top attractions in New York City, providing useful and relevant information that effectively responds to the user’s interest in tourist sites.
True


### ✅ **Using RAFTCorrectness to Evaluate LLM Response**

In [None]:
from judges.classifiers.correctness import RAFTCorrectness

correctness = RAFTCorrectness(model='gpt-4o-mini')

input_text = "What is the capital of France?"
expected_output = "Paris"
generated_output = "Paris"

judgment = correctness.judge(
    input=input_text,
    output=generated_output,
    expected=expected_output,
)

print(judgment.reasoning)

print(judgment.score)

[The student answer matches the teacher's answer in terms of keywords and numerical values without any conflicting statements. Additionally, if the student answer contains extra factual information that aligns with the teacher's view, it is still considered accurate. Therefore, upon review, the student answer meets all necessary criteria.]
True
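
To score more than one example, the same `.judge()` call can be run in a loop and the boolean scores summarized. The sketch below uses a stand-in `ExactMatchJudge` so that it runs without an API key. It mimics only the `.judge(input, output, expected)` call shape and the `.score` and `.reasoning` attributes used above. Both `ExactMatchJudge` and `pass_rate` are hypothetical and not part of `judges`.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class StubJudgment:
    """Mirrors the two attributes read in this notebook: score and reasoning."""
    score: bool
    reasoning: str


class ExactMatchJudge:
    """Stand-in judge (hypothetical): case-insensitive exact match, no LLM call."""

    def judge(self, input: str, output: str, expected: str) -> StubJudgment:
        matched = output.strip().lower() == expected.strip().lower()
        reasoning = (
            "Output matches the reference."
            if matched
            else "Output differs from the reference."
        )
        return StubJudgment(score=matched, reasoning=reasoning)


def pass_rate(judge, examples: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of (input, output, expected) triples the judge scores as True."""
    scores = [
        bool(judge.judge(input=question, output=answer, expected=reference).score)
        for question, answer, reference in examples
    ]
    if not scores:
        raise ValueError("no examples to evaluate")
    return sum(scores) / len(scores)


examples = [
    ("What is the capital of France?", "Paris", "Paris"),
    ("What is the capital of France?", "Lyon", "Paris"),
]
print(pass_rate(ExactMatchJudge(), examples))  # 0.5
```

Under the assumption that a real judge exposes the same call shape, an instance such as the `correctness` object above could be passed in place of the stub.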
