# LLM-as-judge and TLM on ELI5 dataset

## Setup

Replicating this notebook requires two credentials: an `OPENAI_API_KEY` and a `CLEANLAB_TLM_API_KEY`.

You can get an OPENAI_API_KEY by signing up for OpenAI at https://platform.openai.com/signup.

You can try TLM for free at https://cleanlab.ai/tlm/


In [None]:
%pip install datasets cleanlab-tlm openai --quiet

In [None]:
# Set your API key
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<API key>"  # Get your API key from: https://tlm.cleanlab.ai/

In [None]:
import os
REQUIRED_CREDS = [
    "OPENAI_API_KEY",  # https://platform.openai.com/
    ]


# Try/except in case you're using Google Colab secrets for credentials
try:
  from google.colab import userdata
  for cred in REQUIRED_CREDS:
    os.environ[cred] = userdata.get(cred)
except ImportError:
  pass

# Also support dotenv, if preferred
try:
  from dotenv import load_dotenv
  load_dotenv()
except ImportError:
  pass

for cred in REQUIRED_CREDS:
    assert cred in os.environ, f"{cred} not found in environment variables. Please set it before proceeding."

In [None]:
from datasets import load_dataset

ds = load_dataset("explodinggradients/ELI5")
data = ds["train"].to_pandas()

README.md:   0%|          | 0.00/480 [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/48.6k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/74.9k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/56 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/100 [00:00<?, ? examples/s]

In [None]:
data.head()

Unnamed: 0,user_input,reference,response,target
0,What is the significance of evolutionary devel...,"Evolutionary developmental biology, often refe...","Evolutionary developmental biology, or 'evo-de...",1
1,What is the Theory of Sediment Transport and h...,The Theory of Sediment Transport is a fundamen...,The Theory of Sediment Transport is like a big...,0
2,What is the Theory of Isostasy and how does it...,The Theory of Isostasy is a concept in geology...,The Theory of Isostasy is like saying the Eart...,0
3,What are the key concepts in the Theory of Dig...,The Theory of Digital Computation encompasses ...,The Theory of Digital Computation is like a bi...,1
4,What is the Germ Theory of Disease and how did...,The Germ Theory of Disease is a scientific the...,The Germ Theory of Disease is like saying tiny...,1


In [None]:
import os
from cleanlab_tlm import TLM
tlm = TLM()

Since `tlm.get_trustworthiness_score` takes a `prompt` and a `response`, we'll combine the context (the `reference` column) and the question (the `user_input` column) into a single prompt:

In [None]:
data["context_and_question_as_prompt"] = data.apply(
    lambda row: f"Answer the QUESTION using information **only** from CONTEXT, but answer using only ELI5 language.\n\nCONTEXT:\n{row['reference']}\nQUESTION:\n{row['user_input']}",
    axis=1
)

In [None]:
data.head()

## LLM-as-judge



In [None]:
CONFIDENCE_PROMPT = """Evaluate how confident you are that the given Answer is an accurate response to the Question, given the Answer was supposed to be written in ELI5 language while remaining factually accurate.
Please assign a Score using the following 5-point scale:
1: You are not confident that the Answer addresses the Question at all, the Answer may be entirely off-topic or irrelevant to the Question.
2: You have low confidence that the Answer addresses the Question, there are doubts and uncertainties about the accuracy of the Answer.
3: You have moderate confidence that the Answer addresses the Question, the Answer seems reasonably accurate and on-topic, but with room for improvement.
4: You have high confidence that the Answer addresses the Question, the Answer provides accurate information that addresses most of the Question.
5: You are extremely confident that the Answer addresses the Question, the Answer is highly accurate, relevant, and effectively addresses the Question in its entirety.
The output should strictly use the following template: Explanation: [provide a brief reasoning you used to derive the rating Score] and then write 'Score: <rating>' on the last line.
"""

Through trial and error, we also found the following score extraction regexes to be useful:

In [None]:
import re

regex_list = [
    re.compile(
        r".*(?:^|\n)\s*score:?\s*\(?(?P<answer>one|two|three|four|five|[12345])\)?",
        flags=re.DOTALL | re.IGNORECASE
    ),
    re.compile(
        r".*\(?(?P<answer>one|two|three|four|five|[12345])\)?",
        flags=re.DOTALL | re.IGNORECASE
    )
]
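The notebook does not show how these patterns are applied, so the following is only a minimal sketch of one way they could be used: try each pattern in order and normalize the first hit to the 0-1 range. The names `extract_score_with_regexes` and `WORD_TO_INT` are hypothetical, and `regex_list` is repeated so the cell runs on its own.

```python
import math
import re

# Hypothetical helper mapping spelled-out ratings to integers
WORD_TO_INT = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

# Same patterns as above, repeated so this cell is self-contained
regex_list = [
    re.compile(
        r".*(?:^|\n)\s*score:?\s*\(?(?P<answer>one|two|three|four|five|[12345])\)?",
        flags=re.DOTALL | re.IGNORECASE
    ),
    re.compile(
        r".*\(?(?P<answer>one|two|three|four|five|[12345])\)?",
        flags=re.DOTALL | re.IGNORECASE
    )
]

def extract_score_with_regexes(text, patterns=regex_list):
    """Try each pattern in order; return a 0-1 score, or NaN if none match."""
    for pattern in patterns:
        match = pattern.match(text)
        if match:
            answer = match.group("answer").lower()
            # Spelled-out ratings come from the lookup, digits from int()
            score = WORD_TO_INT.get(answer) or int(answer)
            return (score - 1) / 4  # Normalize to 0-1 range
    return math.nan

strict_hit = extract_score_with_regexes("Explanation: mostly accurate.\nScore: 4")
fallback_hit = extract_score_with_regexes("I would give it a three.")
no_hit = extract_score_with_regexes("n/a")
```

Because both patterns start with a greedy `.*`, each one picks the last rating it can find in the text. The second pattern is a loose fallback that accepts any digit from 1 to 5, so it can misread text such as "3 out of 5".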

In [None]:
import numpy as np
from textwrap import dedent

def construct_confidence_prompt_final(context, question, response):
    template = f"""Context:\n{context}\nQuestion:\n{question}\nAnswer:\n{response}\n\n{CONFIDENCE_PROMPT}"""
    return dedent(template)

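def parse_llm_as_judge_score(text: str):
    """Parse the integer after the last 'Score:' and normalize it to 0-1."""
    try:
        score = int(text.split("Score:")[-1].strip())
        return (score - 1) / 4  # Normalize to 0-1 range
    except (ValueError, AttributeError, TypeError):
        # Missing, non-string, or malformed output is recorded as NaN
        return np.nan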

print(construct_confidence_prompt_final(data['reference'][0], data['user_input'][0], data['response'][0]))

Context:
Evolutionary developmental biology, often referred to as 'evo-devo,' is significant because it provides insights into how developmental processes influence evolutionary changes. By studying the genetic and developmental mechanisms that lead to the formation of different structures in organisms, evo-devo helps explain how complex traits evolve and diversify. It bridges the gap between microevolutionary processes, such as genetic mutations, and macroevolutionary patterns, like the emergence of new species. This field has revealed that small changes in developmental genes can lead to significant morphological differences, offering a deeper understanding of the evolutionary history and relationships among species.
Question:
What is the significance of evolutionary developmental biology in understanding the evolution of species?
Answer:
Evolutionary developmental biology, or 'evo-devo,' is like a detective story about how living things change over time. It helps us understand how t

Next, we define functions that score the ELI5 dataset with both TLM and LLM-as-judge, plus a helper that saves the scores to JSON:

In [None]:
import pandas as pd
import json

from datetime import datetime

def score_tlm(data):
    tlm_scores = tlm.get_trustworthiness_score(data["context_and_question_as_prompt"].tolist(), data["response"].tolist())
    data['tlm_score'] = [score['trustworthiness_score'] for score in tlm_scores]
    print(f"Finished computing TLM Scores: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    return data

def score_llm_as_judge(data):
    gpt4o_mini = TLM("base", options={"model": "gpt-4o-mini"})
    data["llm_as_judge_raw"] = gpt4o_mini.prompt([construct_confidence_prompt_final(row['reference'], row["user_input"], row["response"]) for _, row in data.iterrows()])
    data["llm_as_judge_score"] = data["llm_as_judge_raw"].apply(lambda res: parse_llm_as_judge_score(res['response']))
    print(f"Finished computing LLM-as-judge Scores: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    return data

def score_dataset_with_metrics(in_data: pd.DataFrame):
    # We copy the input data to avoid mutating the original dataframe
    data = in_data.copy()
    print(f"Starting to Score Dataset: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

    data = score_tlm(data)
    data = score_llm_as_judge(data)

    return data

def save_to_json(data_df, file_name, column_name, dataset_name):
    # Load existing results so scores for other datasets are preserved
    if os.path.exists(file_name):
        with open(file_name, 'r') as file:
            results = json.load(file)
    else:
        results = {}

    # Update the results JSON with new scores
    scores = data_df[column_name].tolist()
    results[dataset_name] = [{'score': score} for score in scores]

    with open(file_name, 'w') as file:
        json.dump(results, file, indent=4)

In [None]:
result = score_dataset_with_metrics(data)

# Save results to model-specific JSON files
file_name = 'TLM.json'
column_name = 'tlm_score'
dataset_name = 'ELI5'
save_to_json(result, file_name, column_name, dataset_name)

file_name = 'LLM-as-judge.json'
column_name = 'llm_as_judge_score'
dataset_name = 'ELI5'
save_to_json(result, file_name, column_name, dataset_name)

result.to_csv("tlm_eli5.csv")
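To read the saved scores back, a small loader along these lines may help. It is a sketch that assumes the file layout written by `save_to_json` (a dataset name mapped to a list of `{'score': ...}` entries). `load_scores` is a hypothetical helper, and the demo file contains made-up scores rather than real results.

```python
import json
import math
import os
import tempfile

def load_scores(file_name, dataset_name):
    """Return the list of scores stored for dataset_name in a results JSON file."""
    with open(file_name, "r") as file:
        results = json.load(file)
    return [entry["score"] for entry in results[dataset_name]]

# Round trip on a temporary file with made-up scores (not real results),
# written in the same layout that save_to_json produces
demo_path = os.path.join(tempfile.mkdtemp(), "demo_scores.json")
with open(demo_path, "w") as file:
    json.dump({"ELI5": [{"score": 0.75}, {"score": float("nan")}]}, file, indent=4)

demo_scores = load_scores(demo_path, "ELI5")

# Unparseable LLM-as-judge outputs are stored as NaN, so drop them before aggregating
valid_scores = [score for score in demo_scores if not math.isnan(score)]
```

Python's `json` module writes `NaN` values as the bare token `NaN` by default. Python reads that token back without complaint, but stricter JSON parsers in other tools may reject it.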