# LLM Evals with humans, AI judges, and GPT token probabilities

This notebook demonstrates how to use Cleanlab's [CROWDLAB](https://cleanlab.ai/blog/multiannotator/) method for reliable LLM Evaluation with multiple judges (a mix of human and LLM evaluators).

Here we consider the MT-Bench dataset, which contains: many user requests, two possible responses for each request from different LLM models, and annotations regarding which of the two responses is considered better. Each example has a varying number of judge annotations provided by authors of the original paper and other "experts" (graduate students). We use CROWDLAB to: produce high-quality final consensus annotations (to enable accurate LLM Evals) as well as measure the quality of the annotators. CROWDLAB relies on probabilistic predictions from any ML model -- here we use logprobs from GPT-4 applied in the LLM-as-judge framework.

You can use the same technique for any LLM Evals involving multiple human/AI judges, to help your team better evaluate models. Read more in our [blog](https://cleanlab.ai/blog/team-llm-evals/).


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cleanlab/examples/blob/master/llm_evals_w_crowdlab/llm_evals_w_crowdlab.ipynb)     

# Step 1: Data Cleaning and Exploration

We'll install requirements, load the data, and explore it.

In [None]:
!pip install datasets pandas openai numpy cleanlab



In [3]:
from datasets import load_dataset

dataset = load_dataset("lmsys/mt_bench_human_judgments")

In [4]:
import numpy as np
import pandas as pd

from cleanlab.multiannotator import get_label_quality_multiannotator, get_majority_vote_label

In [5]:
dataset

DatasetDict({
    gpt4_pair: Dataset({
        features: ['question_id', 'model_a', 'model_b', 'winner', 'judge', 'conversation_a', 'conversation_b', 'turn'],
        num_rows: 2400
    })
    human: Dataset({
        features: ['question_id', 'model_a', 'model_b', 'winner', 'judge', 'conversation_a', 'conversation_b', 'turn'],
        num_rows: 3355
    })
})

In [6]:
dataset["human"].to_json("human_judgments.json")
dataset["gpt4_pair"].to_json("gpt4_pair_judgments.json")


11356420

We'll be using the "human" dataset, in which human judges chose which of two LLM responses better answered each question. This task has the additional (realistic) wrinkle that different LLM pairings have different numbers of graders. Evaluators are named `author_{i}` and `expert_{i}`, referring to the paper's authors and to graduate students, respectively.

In [7]:
# Group by 'question_id', 'model_a', 'model_b', and 'turn'
grouped = pd.DataFrame(dataset["human"]).groupby(['question_id', 'model_a', 'model_b', 'turn'])

# Aggregate to find unique judges in each group
unique_judges_distribution = grouped['judge'].unique()

# Print the distribution of unique judges
print(unique_judges_distribution)

question_id  model_a          model_b          turn
81           alpaca-13b       gpt-3.5-turbo    1       [author_2, expert_17]
                                               2       [author_2, expert_17]
                              vicuna-13b-v1.2  1                  [expert_0]
                                               2                  [expert_0]
             claude-v1        alpaca-13b       1                 [expert_37]
                                                               ...          
160          vicuna-13b-v1.2  claude-v1        2       [expert_0, expert_36]
                              gpt-3.5-turbo    1       [author_4, expert_43]
                                               2       [author_4, expert_43]
                              llama-13b        1                  [expert_0]
                                               2                  [expert_0]
Name: judge, Length: 2396, dtype: object


Now that we have our data, we'll do a bit of cleaning.


First, we need to integer-ize some of these labels. In particular, the algorithm expects the selected label classes (in this example, `model_a` and `model_b`) to be integers, so let's do that.

Also, this dataset is divided up by "turns", but for simplicity's sake, we're going to use the first turn per pairing.



In [8]:
human_df = pd.DataFrame(dataset["human"])

In [9]:
human_df.shape

(3355, 8)

In [10]:
human_df.head()

Unnamed: 0,question_id,model_a,model_b,winner,judge,conversation_a,conversation_b,turn
0,81,alpaca-13b,gpt-3.5-turbo,model_b,author_2,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,1
1,81,alpaca-13b,gpt-3.5-turbo,model_b,author_2,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,2
2,81,alpaca-13b,gpt-3.5-turbo,model_b,expert_17,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,1
3,81,alpaca-13b,gpt-3.5-turbo,model_b,expert_17,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,2
4,81,alpaca-13b,vicuna-13b-v1.2,model_b,expert_0,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,1


What is the distribution of judges?

In [79]:
human_df.judge.value_counts()

judge
expert_24    205
author_4     204
author_0     180
expert_0     144
author_3     106
            ... 
expert_18     10
expert_54     10
expert_30      6
author_1       6
expert_52      4
Name: count, Length: 65, dtype: int64

In [80]:
human_df.judge.str.contains("expert").value_counts()

judge
True     2668
False     687
Name: count, dtype: int64

In [11]:
# Integer-ize winner labels; any value not in the mapping becomes NaN
mapping_dict = dict(model_a=0, model_b=1)
human_df['winner_binary'] = human_df['winner'].apply(lambda s: mapping_dict.get(s))
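
To see what this kind of mapping does to labels outside `mapping_dict`, here is a minimal sketch on a toy frame. The names `toy_df` and `toy_mapping` and the rows are made up for illustration, and the `tie` label is only an assumption about what other values might appear. Any value not in the mapping becomes `NaN`, which the rest of this notebook would then treat as a missing annotation.

```python
import pandas as pd

# Hypothetical toy data; "tie" is assumed here purely for illustration.
toy_df = pd.DataFrame({"winner": ["model_a", "model_b", "tie"]})
toy_mapping = {"model_a": 0, "model_b": 1}

# Series.map looks each value up in the dict; unmapped values become NaN.
toy_df["winner_binary"] = toy_df["winner"].map(toy_mapping)
print(toy_df)
```

Because of the `NaN`, the resulting column is stored as floats, which is consistent with the `1.0` and `0.0` values that appear in later outputs.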

We also transform the dataset from long to wide, using `pivot_table`:

In [12]:
human_df_per_judge = human_df.pivot_table(
    index=['question_id', 'model_a', 'model_b'],
    columns='judge',
    values=['winner_binary'],
    aggfunc='first'  # Keeps the first vote per judge and pairing, dropping turns beyond 1
)
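
As a sanity check on the long-to-wide step, the sketch below applies a similar `pivot_table` call to a small made-up frame (`toy_long` and its values are hypothetical). It illustrates that `aggfunc='first'` keeps the first non-null vote per pairing and judge (the turn-1 vote when rows are ordered by turn), and that judges who did not rate a pairing show up as `NaN`.

```python
import pandas as pd

# Hypothetical long-format votes: one row per (pairing, judge, turn).
toy_long = pd.DataFrame({
    "question_id": [1, 1, 1, 2],
    "model_a": ["m1", "m1", "m1", "m1"],
    "model_b": ["m2", "m2", "m2", "m3"],
    "judge": ["author_0", "author_0", "expert_0", "expert_0"],
    "turn": [1, 2, 1, 1],
    "winner_binary": [0, 1, 1, 0],
})

# One row per pairing, one column per judge.
# Passing `values` as a string (not a list) gives flat column names.
toy_wide = toy_long.pivot_table(
    index=["question_id", "model_a", "model_b"],
    columns="judge",
    values="winner_binary",
    aggfunc="first",
)
print(toy_wide)
```

Here `author_0` voted differently on turns 1 and 2 of the first pairing, and only the turn-1 vote survives the pivot.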


In [13]:
human_df_per_judge.head()

question_id,model_a,model_b,author_0,author_1,author_2,author_3,author_4,author_5,author_6,expert_0,expert_1,expert_10,...,expert_52,expert_53,expert_54,expert_55,expert_56,expert_57,expert_6,expert_7,expert_8,expert_9
81,alpaca-13b,gpt-3.5-turbo,,,1.0,,,,,,,,...,,,,,,,,,,
81,alpaca-13b,vicuna-13b-v1.2,,,,,,,,1.0,,,...,,,,,,,,,,
81,claude-v1,alpaca-13b,,,,,,,,,,,...,,,,,,,,,,
81,claude-v1,llama-13b,,,,,,,,,,,...,,,,,,,,,,
81,gpt-3.5-turbo,alpaca-13b,,,,,,,,,,,...,,,,,,,,,,


What is the distribution of how many judges we have per-conversation?

In [14]:
human_df_per_judge.count(axis=1).value_counts()

1    759
2    269
3     54
4      6
5      3
Name: count, dtype: int64

This is perfectly messy! The majority of conversations have only a single judge but several have 2, 3, 4 and even 5 judges!

# Step 2: Creating a Quantitative ML Judge

The CROWDLAB algorithm requires two inputs:
1. Judgements from Human or AI evaluators. We have that.
2. A quantitative model score. We'll use GPT-4 to construct that now!
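
To make the two inputs concrete, here is a rough sketch of their expected shapes using made-up numbers (the arrays `toy_labels` and `toy_pred_probs` are hypothetical). The assumption, based on how the inputs are used later in this notebook, is that annotations form an examples-by-judges table with `NaN` for missing judgments, and model scores form an examples-by-classes array of probabilities.

```python
import numpy as np

# 3 examples x 2 judges; NaN means that judge did not rate the example.
toy_labels = np.array([
    [1.0, np.nan],
    [0.0, 0.0],
    [np.nan, 1.0],
])

# 3 examples x 2 classes: P(A wins), P(B wins) from the model judge.
toy_pred_probs = np.array([
    [0.06, 0.94],
    [0.90, 0.10],
    [0.40, 0.60],
])

# Both inputs must describe the same examples, in the same row order.
assert toy_labels.shape[0] == toy_pred_probs.shape[0]
# Each row of model scores should be a probability distribution.
assert np.allclose(toy_pred_probs.sum(axis=1), 1.0)
# Count how many human judgments each example has (assumed to be at least one).
annotations_per_example = (~np.isnan(toy_labels)).sum(axis=1)
print(annotations_per_example)
```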


Using the first row in the dataset, we'll build up a prompt and a function to recover model scores from GPT-4's token log probabilities.

In [15]:
human_df['conversation_a'].iloc[0]

[{'content': 'Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.',
  'role': 'user'},
 {'content': 'I recently had the pleasure of visiting Hawaii and it quickly became one of my favorite places. From the stunning beaches to the lush mountains, this place has it all. The people are incredibly friendly and the culture is alive and well. One of the highlights of my trip was visiting the Polynesian Cultural Center. Here, I was able to learn about the culture of the native Hawaiian people and try my hand at traditional crafts and activities. I also had a chance to explore some of the natural wonders of the island, including the breathtaking Hanauma Bay and the majestic Waimea Canyon. Whether you’re looking for a relaxing beach vacation or an adventure filled with culture and nature, Hawaii is the perfect destination.',
  'role': 'assistant'},
 {'content': 'Rewrite your previous response. Start every sentence with 

In [16]:
example_convo = human_df['conversation_b'].iloc[0]
example_convo

[{'content': 'Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.',
  'role': 'user'},
 {'content': "Aloha! I recently had the pleasure of embarking on a trip to the beautiful island of Hawaii, and let me tell you, the cultural experiences and must-see attractions did not disappoint.\n\nFirst on my list was a visit to the Polynesian Cultural Center. This interactive experience immerses you in the cultures of the Pacific Islands, from the intricate dances of the Maori people of New Zealand to the fire knife dancing of Samoa. The center also features a canoe pageant, where different island nations showcase their unique styles of canoeing. It was truly a feast for the senses and a fascinating insight into the diverse cultures of the Pacific.\n\nNext up was a trip to the North Shore, which boasts some of the best surf spots in the world. I watched in awe as surfers of all levels tackled the massive waves, and even 

We'll want to label the two conversations being compared.

My intuition (untested, though it would be worth testing) is that consistently calling the assistants "Assistant A" and "Assistant B" will help the ML judge differentiate the two conversations, so we'll restructure the conversation a bit:

In [17]:
def conversation_to_text(conversation_obj_list, assistant_label):
    result_txt = ""
    for conv_obj in conversation_obj_list:
        result_txt += f"{conv_obj['role'].upper()} {assistant_label.upper() if conv_obj['role'] == 'assistant' else ''}: {conv_obj['content']} \n"
    return result_txt

In [18]:
print(conversation_to_text(example_convo, 'b'))

USER : Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions. 
ASSISTANT B: Aloha! I recently had the pleasure of embarking on a trip to the beautiful island of Hawaii, and let me tell you, the cultural experiences and must-see attractions did not disappoint.

First on my list was a visit to the Polynesian Cultural Center. This interactive experience immerses you in the cultures of the Pacific Islands, from the intricate dances of the Maori people of New Zealand to the fire knife dancing of Samoa. The center also features a canoe pageant, where different island nations showcase their unique styles of canoeing. It was truly a feast for the senses and a fascinating insight into the diverse cultures of the Pacific.

Next up was a trip to the North Shore, which boasts some of the best surf spots in the world. I watched in awe as surfers of all levels tackled the massive waves, and even had the chance to take a lesson 

That looks good, so we'll produce our entire prompt function now:

In [19]:
def produce_prompt_for_llm_evaluation(conversation_a, conversation_b):
    prompt_preamble = f"""
    You are a logical and accurate converation reading and grading AI system.
    You will be shown two conversations between USER and ASSISTANT.
    Read each conversation carefully and decide which one better complies with the USER's instructions
    Please output ONLY "A" if the ASSISTANT in conversation A better complies with the USER's demands, and output only "B" if the ASSISTANT
    in conversation B better complies with the USER's demands

    <Conversation A>
    {conversation_to_text(conversation_a, "a")}
    </Conversation A>

    That was conversation A, here is conversation B:

    <Conversation B>
    {conversation_to_text(conversation_b, "b")}
    </Conversation B>

    Please respond with "A" if Assistant A was better and "B" if Assistant B was better. ONLY RETURN "A" OR "B"
    """
    return prompt_preamble


In [20]:
example_prompt = produce_prompt_for_llm_evaluation(human_df['conversation_a'].iloc[0], human_df['conversation_b'].iloc[0])

In [21]:
print(example_prompt)


    You are a logical and accurate converation reading and grading AI system.
    You will be shown two conversations between USER and ASSISTANT.
    Read each conversation carefully and decide which one better complies with the USER's instructions
    Please output ONLY "A" if the ASSISTANT in conversation A better complies with the USER's demands, and output only "B" if the ASSISTANT
    in conversation B better complies with the USER's demands

    <Conversation A>
    USER : Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions. 
ASSISTANT A: I recently had the pleasure of visiting Hawaii and it quickly became one of my favorite places. From the stunning beaches to the lush mountains, this place has it all. The people are incredibly friendly and the culture is alive and well. One of the highlights of my trip was visiting the Polynesian Cultural Center. Here, I was able to learn about the culture of the native

# Step 3: Using the ML System to judge conversations!

Now that we have prompts, we need to call the LLM and obtain our probability estimates.

In [23]:
# Ensure OPENAI_API_KEY is defined in your environment variables, or pass it as the api_key argument to the OpenAI() call:
from openai import OpenAI

openai_client = OpenAI()

In [24]:
def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4",
    max_tokens=10,
    temperature=0,
    stop=None,
    seed=123,
    tools=None,
    logprobs=None,  # If True, return the log probability of each output token in the message content.
    top_logprobs=None,  # Number of most likely tokens to return at each position.
):
    """Call the chat completions endpoint and return the full completion object."""
    params = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": stop,
        "seed": seed,
        "logprobs": logprobs,
        "top_logprobs": top_logprobs,
    }
    # Only send `tools` when some were provided.
    if tools:
        params["tools"] = tools

    completion = openai_client.chat.completions.create(**params)
    return completion

In [46]:
def get_prob_scores(prompt: str, model_name: str, top_logprobs: int, tools=None, client=openai_client, **kwargs):
    completion = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        tools=tools,
        logprobs=True,
        top_logprobs=top_logprobs,
        stop=None,
        **kwargs)
    prob_dict = {tlp.token: np.round(np.exp(tlp.logprob),2) for tlp in
                 completion.choices[0].logprobs.content[0].top_logprobs
                }
    return prob_dict
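
The conversion from log probabilities to probabilities inside `get_prob_scores` is just exponentiation. The sketch below isolates that step on hard-coded numbers and also renormalizes over the two labels of interest, which is an extra step not present in the function above. The helper name `logprobs_to_label_probs` and the example values are hypothetical.

```python
import math

def logprobs_to_label_probs(top_logprobs, labels=("A", "B")):
    """Exponentiate token logprobs and renormalize over the given labels."""
    # Keep only tokens that are one of the labels; missing labels get mass 0.
    raw = {label: 0.0 for label in labels}
    for token, logprob in top_logprobs.items():
        if token in raw:
            raw[token] = math.exp(logprob)
    total = sum(raw.values())
    if total == 0:
        # No label token was returned: fall back to a uniform distribution.
        return {label: 1.0 / len(labels) for label in labels}
    return {label: p / total for label, p in raw.items()}

# Made-up logprobs where a non-label token ("The") takes some of the mass.
example = {"A": math.log(0.6), "B": math.log(0.2), "The": math.log(0.2)}
label_probs = logprobs_to_label_probs(example)
print(label_probs)
```

Renormalizing this way discards probability mass assigned to tokens other than the labels, so it is a design choice rather than a requirement.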


In [40]:
prompt="Please return ONLY one of the following two letters A or B:"

res = get_prob_scores(client=openai_client,
                      prompt=prompt,
                      model_name="gpt-4o",
                      top_logprobs=2)

In [41]:
res

{'A': 0.73, 'B': 0.27}

In [58]:
human_df.head()

Unnamed: 0,question_id,model_a,model_b,winner,judge,conversation_a,conversation_b,turn,winner_binary
0,81,alpaca-13b,gpt-3.5-turbo,model_b,author_2,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,1,1.0
1,81,alpaca-13b,gpt-3.5-turbo,model_b,author_2,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,2,1.0
2,81,alpaca-13b,gpt-3.5-turbo,model_b,expert_17,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,1,1.0
3,81,alpaca-13b,gpt-3.5-turbo,model_b,expert_17,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,2,1.0
4,81,alpaca-13b,vicuna-13b-v1.2,model_b,expert_0,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,1,1.0


In [61]:
# Copy so that adding columns later does not trigger a SettingWithCopyWarning
for_llm_df = human_df.drop_duplicates(subset=['question_id', 'model_a', 'model_b']).copy()

In [62]:
for_llm_df.head()

Unnamed: 0,question_id,model_a,model_b,winner,judge,conversation_a,conversation_b,turn,winner_binary
0,81,alpaca-13b,gpt-3.5-turbo,model_b,author_2,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,1,1.0
4,81,alpaca-13b,vicuna-13b-v1.2,model_b,expert_0,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,1,1.0
6,81,claude-v1,alpaca-13b,model_a,expert_37,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,1,0.0
8,81,claude-v1,llama-13b,model_a,expert_27,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,1,0.0
10,81,gpt-3.5-turbo,alpaca-13b,model_a,expert_24,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,1,0.0


Let's build the prompt for every row, then test the scoring function on a single row!

In [51]:
for_llm_df['conversation_prompt_text'] = for_llm_df.apply(
    lambda s: produce_prompt_for_llm_evaluation(s['conversation_a'], s['conversation_b']),
    axis=1)

In [52]:
for_llm_df.iloc[0]['conversation_prompt_text']

'\n    You are a logical and accurate converation reading and grading AI system.\n    You will be shown two conversations between USER and ASSISTANT.\n    Read each conversation carefully and decide which one better complies with the USER\'s instructions\n    Please output ONLY "A" if the ASSISTANT in conversation A better complies with the USER\'s demands, and output only "B" if the ASSISTANT\n    in conversation B better complies with the USER\'s demands\n\n    <Conversation A>\n    USER : Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions. \nASSISTANT A: I recently had the pleasure of visiting Hawaii and it quickly became one of my favorite places. From the stunning beaches to the lush mountains, this place has it all. The people are incredibly friendly and the culture is alive and well. One of the highlights of my trip was visiting the Polynesian Cultural Center. Here, I was able to learn about the culture 

In [53]:
get_prob_scores(client=openai_client, prompt=for_llm_df.iloc[1]['conversation_prompt_text'], model_name="gpt-4o", top_logprobs=2, max_tokens=3)

{'A': 0.99, 'B': 0.01}

## Scoring our conversations

The next line of code will run our data against GPT-4o and incur charges, so only run it once:

In [54]:
for_llm_df['score_results'] = for_llm_df['conversation_prompt_text'].apply(lambda s: get_prob_scores(prompt=s, client=openai_client, model_name="gpt-4o", max_tokens=10, top_logprobs=2))

In [55]:
for_llm_df.to_pickle("mt-bench-scored.pkl")
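
Because each scoring call is billed, one option is to cache results so that a re-run skips prompts that were already scored. The sketch below shows the idea with an in-memory dict and a stand-in scorer. The names `score_with_cache` and `fake_scorer` and the cache layout are all hypothetical; in practice the scorer would be something like `get_prob_scores`.

```python
import hashlib

def score_with_cache(prompt, scorer, cache):
    """Return cached scores for a prompt, calling scorer only on a cache miss."""
    # Hash the prompt so long conversations make compact dictionary keys.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = scorer(prompt)
    return cache[key]

calls = []

def fake_scorer(prompt):
    # Stand-in for a paid API call; records each invocation.
    calls.append(prompt)
    return {"A": 0.5, "B": 0.5}

cache = {}
first = score_with_cache("compare these conversations", fake_scorer, cache)
second = score_with_cache("compare these conversations", fake_scorer, cache)
print(len(calls))
```

Since the cache is a plain dict of string keys and score dicts, it could be saved between sessions with `json.dump` and reloaded with `json.load`.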

In [57]:
for_llm_df.head()

Unnamed: 0,question_id,model_a,model_b,conversation_a,conversation_b,conversation_prompt_text,score_results
0,81,alpaca-13b,gpt-3.5-turbo,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,\n You are a logical and accurate converati...,"{'B': 0.94, 'A': 0.06}"
4,81,alpaca-13b,vicuna-13b-v1.2,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,\n You are a logical and accurate converati...,"{'A': 1.0, 'B': 0.0}"
6,81,claude-v1,alpaca-13b,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,\n You are a logical and accurate converati...,"{'A': 1.0, 'B': 0.0}"
8,81,claude-v1,llama-13b,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,\n You are a logical and accurate converati...,"{'A': 1.0, 'B': 0.0}"
10,81,gpt-3.5-turbo,alpaca-13b,[{'content': 'Compose an engaging travel blog ...,[{'content': 'Compose an engaging travel blog ...,\n You are a logical and accurate converati...,"{'A': 1.0, 'B': 0.0}"


# Step 4: Consolidate fetched scores with judges and run CROWDLAB

In [66]:
score_results_only = for_llm_df.set_index(['question_id', 'model_a', 'model_b'])[['score_results']]

In [67]:
# Keep only pairings that have human labels, in the same row order as human_df_per_judge
score_results_only = score_results_only.reindex(human_df_per_judge.index)

In [69]:
score_results_only['A'] = score_results_only['score_results'].apply(lambda d: d.get('A',0))
score_results_only['B'] = score_results_only['score_results'].apply(lambda d: d.get('B',0))
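
Since the stored scores are rounded and a label may be missing from the top tokens, the `A` and `B` columns are not guaranteed to sum to exactly 1. Whether this matters for CROWDLAB is not checked here; as a precaution, a hedged sketch of row normalization follows (the function `normalize_rows` and the sample array are hypothetical).

```python
import numpy as np

def normalize_rows(scores):
    """Rescale each row to sum to 1; all-zero rows become uniform."""
    scores = np.asarray(scores, dtype=float)
    row_sums = scores.sum(axis=1, keepdims=True)
    uniform = np.full_like(scores, 1.0 / scores.shape[1])
    # Divide only where the row sum is positive to avoid division by zero.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    return np.where(row_sums > 0, scores / safe_sums, uniform)

# Made-up scores: a clean row, a row summing to 0.99, and an all-zero row.
raw_scores = np.array([[0.94, 0.06], [0.99, 0.0], [0.0, 0.0]])
pred_probs = normalize_rows(raw_scores)
print(pred_probs.sum(axis=1))
```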

In [70]:
results = get_label_quality_multiannotator(human_df_per_judge, score_results_only[['A', 'B']].to_numpy(), verbose=False)

In [85]:
consensus_results = results["label_quality"]
consensus_results["consensus_label"] = consensus_results["consensus_label"].apply(lambda i: {0:"A",1:"B"}.get(i))

In [88]:
consensus_results.head()

question_id,model_a,model_b,consensus_label,consensus_quality_score,annotator_agreement,num_annotations
81,alpaca-13b,gpt-3.5-turbo,B,0.911493,1.0,2
81,alpaca-13b,vicuna-13b-v1.2,A,0.615063,0.0,1
81,claude-v1,alpaca-13b,A,0.956241,1.0,1
81,claude-v1,llama-13b,A,0.954403,1.0,1
81,gpt-3.5-turbo,alpaca-13b,A,0.95362,1.0,2
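
One way to use this table is to look for conversations where the consensus disagrees with the human judges, such as the second row above, where `annotator_agreement` is 0. A minimal sketch on a hand-copied subset of the rows shown above follows; the frame `toy_consensus` and the 0.7 quality threshold are illustrative assumptions.

```python
import pandas as pd

# Values copied from the first three rows of the table above.
toy_consensus = pd.DataFrame({
    "consensus_label": ["B", "A", "A"],
    "consensus_quality_score": [0.911493, 0.615063, 0.956241],
    "annotator_agreement": [1.0, 0.0, 1.0],
    "num_annotations": [2, 1, 1],
})

# Flag rows where human votes conflict with consensus or confidence is low.
# The 0.7 cutoff is arbitrary and only for illustration.
needs_review = toy_consensus[
    (toy_consensus["annotator_agreement"] < 1.0)
    | (toy_consensus["consensus_quality_score"] < 0.7)
]
print(needs_review)
```

Rows flagged this way are natural candidates for a second human opinion.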


In [89]:
results["annotator_stats"]["worst_class"] = results["annotator_stats"]["worst_class"].apply(lambda i: {0:"A",1:"B"}.get(i))
results["annotator_stats"].head(10)

,judge,annotator_quality,agreement_with_consensus,worst_class,num_examples_labeled
winner_binary,author_1,0.46763,0.333333,B,3
winner_binary,expert_5,0.564449,0.5625,B,16
winner_binary,expert_35,0.621383,0.625,A,16
winner_binary,expert_48,0.665313,0.611111,A,18
winner_binary,expert_20,0.67555,0.7,B,20
winner_binary,expert_47,0.683031,0.7,A,20
winner_binary,expert_40,0.699576,0.666667,A,21
winner_binary,expert_53,0.711944,0.764706,A,17
winner_binary,expert_28,0.730157,0.772727,B,22
winner_binary,expert_45,0.765455,0.823529,B,17


Do authors perform better than "experts"?

In [73]:
results["annotator_stats"].sort_values("annotator_quality", ascending=False)

,judge,annotator_quality,agreement_with_consensus,worst_class,num_examples_labeled
winner_binary,expert_30,0.967435,1.000000,0,3
winner_binary,expert_52,0.955679,1.000000,0,1
winner_binary,expert_3,0.952204,1.000000,0,17
winner_binary,expert_39,0.946821,0.947368,0,19
winner_binary,expert_4,0.943187,0.952381,1,21
winner_binary,...,...,...,...,...
winner_binary,expert_48,0.665313,0.611111,0,18
winner_binary,expert_35,0.621383,0.625000,0,16
winner_binary,expert_5,0.564449,0.562500,1,16
winner_binary,author_1,0.467630,0.333333,1,3


In [77]:
results["annotator_stats"].groupby(results["annotator_stats"].index.get_level_values(1).str.contains('expert'))['annotator_quality'].agg(['count', 'mean', 'median'])


is_expert,count,mean,median
False,7,0.804958,0.851882
True,57,0.835945,0.857924


Experts and authors seem to have roughly similar annotator quality! That's a neat observation, especially since we don't have ground truth labels.

Learn more about proper Evals that combine human and LLM judges in our [blog](https://cleanlab.ai/blog/team-llm-evals/).