<br>
<h2 style = "font-size:60px; font-family:Garamond ; font-weight : normal; background-color: #f6f5f5 ; color : #fe346e; text-align: center; border-radius: 100px 100px;">Evaluation Metric (Approx)</h2>
<br>

- LLM-as-a-judge
- English Confidence Score
- Sequence Similarity Matcher

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spacing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">LLM-as-a-judge</h1></span>

#### This is an attempt to approximate the evaluation metric using three arbitrarily chosen open-source LLMs as judges.

![](https://cdn.prod.website-files.com/64bd90bdba579d6cce245aec/66d3fdd1f10fc3992b6c9d75_66d3fd227a958b870c333da4_llm-judge-metric.png)

In [1]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "False"

import numpy as np
import pandas as pd

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import warnings
warnings.filterwarnings('ignore')

In [2]:
judge_model_list = [
    "/kaggle/input/llama-3.2/transformers/3b-instruct/1",
    "/kaggle/input/qwen2.5/transformers/3b-instruct/1",
    "/kaggle/input/gemma-2/transformers/gemma-2-2b-it/2/",    
]

models = [AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', torch_dtype=torch.bfloat16) for model_name in judge_model_list]
tokenizers = [AutoTokenizer.from_pretrained(model_name) for model_name in judge_model_list]

In [3]:
def llm_judge(prompt, response, criteria, model, tokenizer):
    """
    Score a single response from 0 to 9 using a Hugging Face Transformers model.

    Args:
        prompt (str): The initial task or question given to the respondents.
        response (str): Response to evaluate.
        criteria (str): Evaluation criteria to judge the response.
        model: Huggingface model to use.
        tokenizer: Huggingface tokenizer to use.

    Returns:
        score: Score for the response.
    """
    # Build the evaluation prompt
    evaluation_prompt = f"""
You are an expert judge scoring responses to the following prompt:

Prompt: {prompt}

Evaluation Criteria: {criteria}

Provide a score between 0 and 9 (inclusive) for the response. Do not provide any explanation.

Here is the response to evaluate:
"""
    evaluation_prompt += f"\nResponse: {response}\nScore:"

    # Tokenize the input
    inputs = tokenizer(evaluation_prompt, return_tensors="pt")
    # Move the inputs to the device the model was loaded on instead of hard-coding 'cuda'
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    # Generate output
    outputs = model.generate(
        **inputs,
        max_new_tokens=5,
        num_return_sequences=1,
        do_sample=False  # Greedy decoding; temperature has no effect when sampling is off
    )

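    # Decode only the newly generated tokens, so that text inside the essay
    # (for example a literal "Response:" or "Score:") cannot confuse the parser
    prompt_length = inputs["input_ids"].shape[1]
    evaluation_output = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

    try:
        score = float(evaluation_output.split()[0])  # The first token should be the numeric score
    except (IndexError, ValueError):
        score = 0  # Fall back to 0 when no numeric score can be parsed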

    return score


judging_criteria = "Clarity, relevance to the topic, and strength of the argument."
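Small instruct models do not always answer with a bare digit; they may reply with something like `8/9` or `Score: 7`. Below is a minimal sketch of a more forgiving parser. The helper name `parse_score`, the clamping to the 0-9 range, and the default of `0.0` are my own assumptions for illustration and are not part of the competition metric.

```python
import re


def parse_score(generated_text, min_score=0.0, max_score=9.0, default=0.0):
    """Extract the first number from the judge's reply and clamp it to the valid range.

    Hypothetical helper: returns `default` when the reply contains no number.
    """
    match = re.search(r"\d+(?:\.\d+)?", generated_text)
    if match is None:
        return default
    value = float(match.group())
    # Clamp so that an out-of-range reply such as "12" cannot exceed the scale
    return min(max(value, min_score), max_score)


for reply in [" 7", "8/9", "Score: 7.5", "no score given", "12"]:
    print(repr(reply), "->", parse_score(reply))
```

Clamping is a design choice: it keeps a misbehaving judge from pushing the average above `MAX_Q`, at the cost of hiding the fact that the judge ignored the instructions.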

In [4]:
submission_df = pd.read_csv("/kaggle/input/gemma-2-naive-submission/submission.csv")
submission_df.head()

| Unnamed: 0 | id | topic | essay |
|---|---|---|---|
| 0 | 1097671 | Compare and contrast the importance of self-re... | \n\n## The Balancing Act: Self-Reliance vs. Ad... |
| 1 | 1726150 | Evaluate the effectiveness of management consu... | \n\nManagement consulting firms play a crucial... |
| 2 | 3211968 | Discuss the role of self-reliance in achieving... | \n\n## The Architect Within: Self-Reliance as ... |


In [5]:
avg_qs = []
avg_variances = []

for i, row in submission_df.iterrows():
    task_prompt = f"Write an essay on the topic {row['topic']}"
    results = [llm_judge(task_prompt, row['essay'], judging_criteria, model, tokenizer) for (model, tokenizer) in zip(models, tokenizers)]
    avg_qs.append(np.mean(results))
    avg_variances.append(np.var(results))



In [6]:
submission_df['avg_q'] = avg_qs
submission_df['avg_variance'] = avg_variances
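To make the two per-essay aggregates concrete, here is a small worked sketch using only the standard library. `np.var` uses `ddof=0` by default, i.e. the population variance, which is what `statistics.pvariance` computes. The function name `aggregate_judge_scores` and the example scores are made up for illustration.

```python
from statistics import fmean, pvariance


def aggregate_judge_scores(scores):
    """Return (mean, population variance) of one essay's scores across judges."""
    return fmean(scores), pvariance(scores)


# Judges that disagree strongly: high variance
print(aggregate_judge_scores([3.0, 6.0, 9.0]))

# Judges that fully agree: zero variance
print(aggregate_judge_scores([5.0, 5.0, 5.0]))
```

Because the final score in this notebook multiplies by the average variance, an essay on which all judges agree contributes nothing through that factor, no matter how high its mean score is.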

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spacing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">English Language Confidence</h1></span>

In [7]:
# Install the lingua-language-detector package
!pip install lingua-language-detector

# Import necessary modules
from lingua import Language, LanguageDetectorBuilder



In [8]:
# Build the language detector
detector = LanguageDetectorBuilder.from_all_languages().build()

In [9]:
english_confidence = []

for i, row in submission_df.iterrows():
    # Compute language confidence values
    results = detector.compute_language_confidence_values(row['essay'])
    confidence = next((result.value for result in results if result.language == Language.ENGLISH), 0.0)
    english_confidence.append(confidence)

In [10]:
submission_df['avg_e'] = english_confidence

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spacing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Sequence Similarity Score</h1></span>

#### Based on this [comment](https://www.kaggle.com/competitions/llms-you-cant-please-them-all/discussion/549809#3062886), I assume that sequence similarity refers to the similarity between different essays in the test data. Feel free to correct me if this isn't the case.

In [11]:
import difflib
from itertools import combinations

In [12]:
essays = submission_df['essay'].values
similarities = [
    difflib.SequenceMatcher(a=essay1, b=essay2).ratio()
    for essay1, essay2 in combinations(essays, 2)
]
avg_s = sum(similarities) / len(similarities)

In [13]:
MIN_S = 0.2
avg_s_clipped = max(avg_s, MIN_S)
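The pairwise comparison and the clipping step can be checked on toy inputs. The sketch below wraps both in one function; the name `average_pairwise_similarity` and the guard for fewer than two texts are my own additions, and the `0.2` floor is the same assumed `MIN_S` as above.

```python
import difflib
from itertools import combinations


def average_pairwise_similarity(texts, min_s=0.2):
    """Return (raw average, clipped average) of pairwise SequenceMatcher ratios."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        # With a single essay there is nothing to compare and the average is undefined
        raise ValueError("Need at least two texts to compare.")
    ratios = [difflib.SequenceMatcher(a=a, b=b).ratio() for a, b in pairs]
    avg = sum(ratios) / len(ratios)
    return avg, max(avg, min_s)


print(average_pairwise_similarity(["abc", "abc", "abc"]))  # identical texts
print(average_pairwise_similarity(["aaaa", "bbbb"]))       # nothing in common
```

One caveat: `SequenceMatcher` enables its `autojunk` heuristic by default, which treats very frequent items as junk once a sequence has 200 or more items. On character-level comparisons of full essays this can lower the ratio noticeably. I don't know whether the actual metric leaves it on, so treat the numbers here as approximate.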

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spacing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Final Score</h1></span>

![](https://imgur.com/VRlKKgq.png)

In [14]:
MAX_Q = 9
final_score = (submission_df['avg_variance'].mean() / (MAX_Q - submission_df['avg_q'].mean())) * (submission_df['avg_e'].mean() / avg_s_clipped)

In [15]:
final_score

9.10958904109589
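To see how the four aggregates interact, here is the same combination written as a standalone function with a small worked example. It mirrors the cell above, `(avg_variance / (MAX_Q - avg_q)) * (avg_e / max(avg_s, MIN_S))`, and is therefore only my approximation of the metric. The function name `approx_final_score`, the explicit error for `avg_q >= MAX_Q`, and the input numbers are assumptions for illustration.

```python
def approx_final_score(avg_q, avg_variance, avg_e, avg_s, max_q=9.0, min_s=0.2):
    """Combine the aggregates the same way as the notebook cell above."""
    quality_gap = max_q - avg_q
    if quality_gap <= 0:
        # The notebook formula divides by (MAX_Q - avg_q), which is undefined at the maximum
        raise ValueError("avg_q must be strictly below max_q.")
    clipped_s = max(avg_s, min_s)  # Similarity below the floor gives no extra benefit
    return (avg_variance / quality_gap) * (avg_e / clipped_s)


# Made-up aggregates: mean quality 5, variance 2, fully English, very dissimilar essays
print(approx_final_score(avg_q=5.0, avg_variance=2.0, avg_e=1.0, avg_s=0.1))  # 2.5
```

In this form the score grows with judge disagreement and English confidence, grows as the mean quality approaches 9, and stops improving once the average similarity drops below the floor.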
