<a href="https://colab.research.google.com/github/aarav-wadhwani/ClearTrack/blob/main/W8_M3_N2_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
## IMPORTANT: On Colab, we expect your lab to be in the ee345 folder
## Please contact staff if you encounter any problems with installing dependencies
import sys
# IS_COLAB = 'google.colab' in sys.modules
# if IS_COLAB:
#     from google.colab import drive
#     drive.mount('/content/drive')
#     %cd /content/drive/MyDrive/ee345/labs/w8
#     %pip install -r ./requirements.txt

#Install required packages with compatible versions
!pip install -q datasets "plotly>=6.1.1" "kaleido>=0.2.1"

import plotly.io as pio
pio.renderers.default = "notebook_connected"

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.9/9.9 MB[0m [31m43.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m69.0/69.0 kB[0m [31m2.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.3/49.3 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[?25h

# Week 8 M3-N2: Welcome to the Arena (Warmup)

please submit the printed notebook (.pdf) and the notebook (.ipynb) itself on gradescope

Please put down your name(s) here: (you must enter your team members' names on gradescope too)

Aarav Wadhwani

---

In this notebook you will get more experience with logistic regression in two very different settings: creating leaderboards and predicting model responses.

We will be taking real data from [LMArena](https://lmarena.ai/), a popular platform for crowdsourcing evaluations of large language models and recreating their leaderboards, with a few fun extra steps along the way.

The chats can be viewed interactively by accessing [ChatBot-Arena-Viewer](https://huggingface.co/spaces/BerkeleyML/Chatbot-Arena-Viewer) through hugging face. Much of the first half of this homework was first written by Prof Gonzalez back when his students first started the project, and now LMArena is a standard evaluation for large language models and turned into a company! Don't let anyone tell you logistic regression isn't valuable, it's worth at least $600 Million.

---

## Notebook Roadmap

This notebook is organized into the following sections:

### Part 1: Data Exploration and Understanding (Q1)
- **Q1a**: Identify top 20 models by battle count
- **Q1b**: Filter battles to selected models and remove ties
- Explore battle distributions (languages, turns, win outcomes)

### Part 2: Model Rankings via Win Rates (Q2)
- **Q2a**: Compute pairwise win fractions between models
- **Q2b**: Visualize win-fraction heatmaps and interpret surprising rankings
- Reflect on limitations of simple win-rate leaderboards

### Part 3: Prompt Analysis and Leaderboard Shifts (Q3)
- **Q3a**: Compare leaderboards with and without the top 10 prompts
- **Q3b**: Analyze per-category performance (code, math, creativity, etc.)
- **Q3c**: Cluster prompts with TF-IDF embeddings + K-Means to find topics
- **Q3d**: Interpret prompt clusters and discuss biases

---

### Key Learning Objectives

1. Learn how to evaluate large language models (LLMs) using pairwise comparison data from LMArena
2. Analyze battle distributions and compute win rates
3. Find subsets of prompts which result in changes to the leaderboard

In [2]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
pd.options.plotting.backend = "plotly"
#set fixed seed of 189
np.random.seed(189)

## What is LMArena?

![LMArena](https://i.imgur.com/tbrkWVX.png)

LMArena (previously known as Chatbot Arena) is a platform that evaluates generative models (e.g. chatbots or image generation models) through anonymous, crowd-sourced pairwise comparisons. Users enter prompts for two anonymous models to respond to and vote on the model that gave the better response, in which the model's identities are revealed (shown below). Users can also choose models to test themselves, but for the purposes of this lab we will only focus on the anonymous side-by-side comparisons, which we call **"battles"** - since those are what are used to calculate the leaderboard.

![LMArena Example](https://i.imgur.com/rgv0jCb.png)

In this lab we will investigate what these battles look like, how we can use these pairwise comparisons to get a leaderboard, and how we can find certain features of model responses that have an influence on preference.

Although it is not required for this notebook, the [Chatbot Arena paper](https://arxiv.org/abs/2403.04132) can provide good intuition on how to answer the free response questions.

### Download Data

First, let's load a set of publicly released arena battles from [Hugging Face](https://huggingface.co/datasets/lmarena-ai/arena-human-preference-100k) — a popular website for sharing machine learning datasets and models.

**Note**: Before you get started with the lab, we recommend you make a [huggingface account](https://huggingface.co/welcome) to play around with data visualization apps! It is also a great general hub for downloading the majority of popular datasets in machine learning. Once you make the account, generate the token for login [huggingface token](https://huggingface.co/docs/hub/security-tokens). This should be similar to how you generate a token for your git account.

In [4]:
from datasets import load_dataset

# Login using e.g. `huggingface-cli login` to access this dataset (this will take a few minutes to download)
# if you don't have an account it may throw a warning to create a HF_TOKEN but you should still have access to the dataset if you skip this
ds = load_dataset("lmarena-ai/arena-human-preference-100k")
battles = ds['train'].to_pandas()



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



README.md: 0.00B [00:00, ?B/s]

data/arena-explorer-preference-100k.parq(…):   0%|          | 0.00/380M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/106134 [00:00<?, ? examples/s]

Now let's look at the format of this data.

Printing out the first row we see there are many fields with the most important being:

- **question_id (str)** - the ID of that battle
- **model_a, model_b (str)** - the models which participated in this battle
- **winner (str)** - which response the user preferred: can be `model_a`, `model_b`, `tie`, and `tie (bothbad)`
- **conversation_a (dict)** - the conversation between the user and model a
- **conversation_b (dict)** - the conversation between the user and model b (note that the user turns in conversation_a and conversation_b are the same since this is a side by side comparison)
- **turn (int)** - number of turns in the conversation (1 turn means the user asked 1 question, 2 turns means the user asked a question, got an answer, then asked another question, got the response, then voted)
- **language (str)** - the language of the user prompt

We also have some other columns that may be useful for us later:

- **is_code (bool)** - either the prompt, the response, or both contains code
- **is_refusal (bool)** - one of the models refused to answer (usually this is because the model thinks it would be unethical to answer)
- **dedup_tag (dict)** - indicates whether the prompt appears very often (high_frequency) and if it does, whether it will be sampled (subsampled). We subsample these high frequency prompts so that common questions don't overly influence the leaderboard.
- **category_tag (dict)** - tags for question type (e.g. math and instruction following). These are assigned via an LLM labeler, more details on what the categories are in this [blog post](https://blog.lmarena.ai/blog/2024/arena-category/) (the criteria tags correspond to the hard prompts category described in the blog post).

In [5]:
battles.head(5)

Unnamed: 0,question_id,model_a,model_b,winner,conversation_a,conversation_b,turn,anony,language,tstamp,conv_metadata,is_code,is_refusal,dedup_tag,category_tag,judge_hash,__index_level_0__
0,4c6978dfa56b4ffea9d3a47e3c84181a,claude-3-5-sonnet-20240620,gpt-3.5-turbo-0125,tie (bothbad),[{'content': 'В моем портфеле сейчас 4 акции Г...,[{'content': 'В моем портфеле сейчас 4 акции Г...,1,True,Russian,1719064000.0,"{'bold_count_a': {'**': 0, '__': 0}, 'bold_cou...",False,True,"{'high_freq': False, 'sampled': True}","{'criteria_v0.1': {'complexity': True, 'creati...",09c5207c50f076d704baee96729d64f1698268aa1b21a7...,0
1,76ce56f8ba474768bc66128c7993ccb8,mistral-large-2407,athene-70b-0725,model_b,"[{'content': 'php, handle tab in text as html,...","[{'content': 'php, handle tab in text as html,...",2,True,English,1722726000.0,"{'bold_count_a': {'**': 8, '__': 0}, 'bold_cou...",True,False,"{'high_freq': False, 'sampled': True}","{'criteria_v0.1': {'complexity': True, 'creati...",881bbc801c1e6eb979301eec3b3c401b407a73f70d9a6a...,1
2,385420904ba646e7a4df90c6ffae1afa,claude-3-opus-20240229,gemini-1.5-flash-api-0514,tie (bothbad),[{'content': '普通人在愿意付出一定资源的情况下，怎么找到一个半径10km以内只...,[{'content': '普通人在愿意付出一定资源的情况下，怎么找到一个半径10km以内只...,1,True,Chinese,1723119000.0,"{'bold_count_a': {'**': 0, '__': 0}, 'bold_cou...",False,True,"{'high_freq': False, 'sampled': True}","{'criteria_v0.1': {'complexity': False, 'creat...",3b470f3d940dcff46e22a97f937836ac15d28869a4c11c...,2
3,e8fe7c9f75ab4e528367cc7de625c475,gemma-2-9b-it,qwen2-72b-instruct,model_b,[{'content': 'Is there any Artificial Superint...,[{'content': 'Is there any Artificial Superint...,2,True,English,1721643000.0,"{'bold_count_a': {'**': 5, '__': 0}, 'bold_cou...",False,False,"{'high_freq': False, 'sampled': True}","{'criteria_v0.1': {'complexity': False, 'creat...",66f029e5cb9cdb035e859955557fbbeba0b8419ca64ebc...,3
4,772d53e5c51c487e8a293eadcd9d4855,mixtral-8x22b-instruct-v0.1,llama-3.1-70b-instruct,tie (bothbad),[{'content': 'Which number id bigger 9.11 or 9...,[{'content': 'Which number id bigger 9.11 or 9...,1,True,English,1721899000.0,"{'bold_count_a': {'**': 0, '__': 0}, 'bold_cou...",False,False,"{'high_freq': False, 'sampled': True}","{'criteria_v0.1': {'complexity': True, 'creati...",b4f8e2d271c6c9e6fb08dcabf6ee8a79631e9f2aec6381...,4


In [6]:
example = battles.iloc[4]
print(f"Conversation A (model = {example['model_a']}):")
print(example['conversation_a'])
print(f"Conversation B (model = {example['model_b']}):")
print(example['conversation_b'])
print("Category Tag:")
print(example['category_tag'])

Conversation A (model = mixtral-8x22b-instruct-v0.1):
[{'content': 'Which number id bigger 9.11 or 9.9 ?', 'num_tokens': 14, 'role': 'user'}
 {'content': 'The number 9.9 is bigger than 9.11. In decimal numbers, we compare them digit by digit from left to right. Both numbers start with the same digit, 9. However, the second digit of 9.9 is also 9, while the second digit of 9.11 is 1. Since 9 is greater than 1, 9.9 is greater than 9.11.', 'num_tokens': 89, 'role': 'assistant'}]
Conversation B (model = llama-3.1-70b-instruct):
[{'content': 'Which number id bigger 9.11 or 9.9 ?', 'num_tokens': 14, 'role': 'user'}
 {'content': '**Comparing the Numbers**\n\nTo determine which number is bigger, we can compare the numbers 9.11 and 9.9.\n\n**Step-by-Step Comparison**\n\n1. Both numbers have the same integer part: 9.\n2. We will compare the decimal parts: 0.11 and 0.9.\n3. Since 0.9 is greater than 0.11, we can conclude that 9.9 is greater than 9.11.\n\n**Result**\n\n9.9 is bigger than 9.11.', '

### Exploratory Analysis

Before we get into leaderboard calculation, let's first conduct some basic exploratory analysis to highlight a few key properties and caveats with this data.

In [7]:
battles.winner.hist(title="Counts of Battle Outcomes", text_auto=True)

#### NOTE: Notice how LMArena has two types of ties: tie (both bad) and just tie.

### Battle Counts

We see that certain models participate in more battles. This is due to two reasons:
1. Several different matching and sampling algorithms were used. LMArena employs weighted sampling methods, which assign greater weights to better models.
2. Since models are added to the arena when they come out, some models have been on the arena for many months while others have only been on for a few weeks.

In [8]:
fig = pd.concat([battles["model_a"], battles["model_b"]]).value_counts().plot.bar(title="Battle Count for Each Model", text_auto=True)
fig.update_layout(xaxis_title="Model", yaxis_title="Battle Count", height=400, showlegend=False)
fig

# Question 1: Data Exploration

## Question 1a
Since it can be hard to reason over that many models, we want to look at the top 20 models by battle count.

**Task:**
Return a list of top 20 models by battle count for both `model_a` and `model_b` combined, SORTED by the battle count.

In [9]:
# TODO:
count_a = battles['model_a'].value_counts()
count_b = battles['model_b'].value_counts()

total_counts = count_a.add(count_b, fill_value=0)

# Sort and select top 20
selected_models = set(total_counts.sort_values(ascending=False).head(20).index)

selected_models


{'chatgpt-4o-latest',
 'claude-3-5-sonnet-20240620',
 'claude-3-haiku-20240307',
 'claude-3-opus-20240229',
 'deepseek-v2-api-0628',
 'gemini-1.5-flash-api-0514',
 'gemini-1.5-pro-api-0514',
 'gemini-1.5-pro-exp-0801',
 'gemma-2-27b-it',
 'gemma-2-9b-it',
 'gpt-4-turbo-2024-04-09',
 'gpt-4o-2024-05-13',
 'gpt-4o-2024-08-06',
 'gpt-4o-mini-2024-07-18',
 'llama-3-70b-instruct',
 'llama-3-8b-instruct',
 'llama-3.1-405b-instruct',
 'llama-3.1-70b-instruct',
 'llama-3.1-8b-instruct',
 'qwen2-72b-instruct'}

## Question 1b

**Task:**
Now, let's filter out and select the battles that are between the top 20 models we got from question 1a. Fill in the `subselect_battles` function to return the battles dataframe containing only the selected models and the battles dataframe containing the selected models with ties removed.

**Hint:** You may find it helpful to use a boolean array.

In [10]:
from typing import Tuple, Set
import pandas as pd

def subselect_battles(
    battles: pd.DataFrame,
    selected_models: Set[str]
) -> Tuple[pd.DataFrame, pd.DataFrame]:

    selected_battles = battles[
        battles["model_a"].isin(selected_models) &
        battles["model_b"].isin(selected_models)
    ].copy()

    selected_battles_no_ties = selected_battles[selected_battles["winner"] != "tie"].copy()
    return selected_battles, selected_battles_no_ties

selected_battles, selected_battles_no_ties = subselect_battles(battles, selected_models)

In [11]:
def visualize_battle_count(battles, title, show_num_models=30):
    """
    Input:
        battles : pd.DataFrame with columns ['model_a','model_b', ...]
        title   : str, title for the plot
        show_num_models : int, how many top models (by total battle count) to display

    Output:
        fig : plotly.graph_objects.Figure heatmap of symmetric battle counts
    """
    ptbl = pd.pivot_table(battles, index="model_a", columns="model_b", aggfunc="size", fill_value=0)
    battle_counts = ptbl + ptbl.T
    ordering = battle_counts.sum().sort_values(ascending=False).index
    ordering = ordering[:show_num_models]

    fig = px.imshow(
        battle_counts.loc[ordering, ordering],
        title=title,
        text_auto=True
    )
    fig.update_layout(
        xaxis_title="Model B",
        yaxis_title="Model A",
        xaxis_side="top",
        height=1000,
        width=1000,
        title_y=0.07,
        title_x=0.5,
        font=dict(size=10)
    )
    fig.update_traces(
        hovertemplate="Model A: %{y}<br>Model B: %{x}<br>Count: %{z}<extra></extra>"
    )
    return fig

### Visualize All Selected Battles

In [12]:
visualize_battle_count(selected_battles, title="Battle Count of Each Combination of Models", show_num_models=30)

### Visualize Selected Battles No Ties

In [13]:
visualize_battle_count(selected_battles_no_ties, "Battle Count for Each Combination of Models (without Ties)")

### Understanding Battle Distribution

We see many battles between top models (e.g., Claude, GPT, Gemini), while smaller models (e.g., Llama-3-8B) have fewer battles. This is because LMArena employs weighted sampling methods, which assign greater weights to better models.

**Why pair strong models vs. strong models more often?**

LMArena pairs strong models against each other more frequently because the greatest ranking uncertainty exists among top-tier systems. When comparing a strong model to a weak model, the outcome is predictable with high confidence after relatively few battles. However, differentiating between two strong models requires significantly more data to achieve statistical significance, as the performance differences are smaller and more subtle. By focusing sampling resources on strong vs. strong matchups, LMArena efficiently allocates battles where they provide the most information for refining the leaderboard rankings, particularly at the top where users care most about precise distinctions between competitive models.

In [14]:
lang_counts_all = battles["language"].value_counts()

fig_lang_all = px.bar(
    lang_counts_all,
    title="Distribution of Languages",
    text_auto=True,
    height=400
)
fig_lang_all.update_layout(
    xaxis_title="Language",
    yaxis_title="Count",
    showlegend=False
)
fig_lang_all.show()

## Number of Conversation Turns

Now let's also try to explore conversation turns.

In [15]:
fig = px.histogram(battles["turn"],
             title=f"Number of Conversation Turns",
             text_auto=True, height=400, log_y = True)
fig.update_layout(xaxis_title="Turns", yaxis_title="Count", showlegend=False)
fig.update_traces(marker_line_color='black', marker_line_width=1)
fig

Now that we have explored our data, let's consider how to use these pairwise battles to rank the models by preference. Our goal is to assign a "strength" parameter to each model that quantifies how likely it is to win against others.

Given we analyzed $M=20$ models and $N=40k$ battles ($26k$ excluding ties), we want to estimate a skill parameter $S_m$ for each model $m \in \{1, \ldots, M\}$. This parameter $S_m$ should reflect the overall ability of model $m$ to be preferred over other models.

Before we move on to more sophisticated probabilistic models that estimate these strength parameters, let’s build some intuition by starting with a simpler metric: the average win rate.

Average win rate is simple: a model's average win rate is the proportion of battles they competed in which resulted in them winning.

## Question 2a

LMArena defines the win rate for a model as the average fraction of times it defeats another model across all its match-ups.

**Task:**
Implement the `compute_pairwise_win_fraction` function, which:

1. Calculates the fraction of times each model beats each other model across all battles.

2. Returns a square DataFrame where entry (i, j) is the fraction of times model i beats model j. The columns should be the selected model names as well as the index (similar to a confusion matrix). Any model pairings which do not have any battles should be given a NaN value. For instance, diagonal of `row_beats_col` should be NaN as no battles exist between a model and itself.

3. The rows and columns of your `row_beats_col` dataframe should be ordered by their average win rate against all other models (i.e. order from strongest to weakest models)

Tips:
* Do not use your `selected_models` variable in your function, otherwise you may run into autograder issues. Instead define a variable which is the list of models in your input `battles` dataframe.

In [16]:
import pandas as pd
import numpy as np

def compute_pairwise_win_fraction(battles: pd.DataFrame):
    # Get all models appearing in this battles dataframe
    models = pd.unique(battles[['model_a', 'model_b']].values.ravel())
    models = list(models)

    # Initialize square DataFrames
    win_counts = pd.DataFrame(0.0, index=models, columns=models)
    match_counts = pd.DataFrame(0.0, index=models, columns=models)

    # Count matches and wins
    for _, row in battles.iterrows():
        a = row["model_a"]
        b = row["model_b"]
        winner = row["winner"]

        if a == b:
            continue  # just in case; shouldn't happen

        # Every battle contributes one match for (a,b) and (b,a)
        match_counts.loc[a, b] += 1
        match_counts.loc[b, a] += 1

        # Winner is encoded as "model_a" or "model_b"
        if winner == "model_a":
            win_counts.loc[a, b] += 1
        elif winner == "model_b":
            win_counts.loc[b, a] += 1
        # if it's "tie", we add no wins (but you've already filtered those out)

    # Fraction of times row model beats column model
    row_beats_col = win_counts / match_counts

    # Diagonal should be NaN
    np.fill_diagonal(row_beats_col.values, np.nan)

    # Order models by average win rate (strongest → weakest)
    avg_win_rate = row_beats_col.mean(axis=1)
    ordering = avg_win_rate.sort_values(ascending=False).index
    row_beats_col = row_beats_col.loc[ordering, ordering]

    return row_beats_col



## Question 2b


Let’s visualize how often **Model A** beats **Model B** in non-tied battles. Below we have used your `compute_pairwise_win_fraction` to create a heatmap where each cell `(A, B)` displays the **fraction of A’s wins** over B. No TODOs or code to fill in here, this question allows us to visually inspect your function in the Coding PDF assignment on gradescope.

In [17]:
def visualize_pairwise_win_fraction(battles, title):
    """
    Input:
        battles : pd.DataFrame of non-tied battles with ['model_a','model_b','winner', ...]
        title   : str
    Output:
        fig : plotly Figure heatmap (cell (A,B) = fraction A beats B)
    """
    row_beats_col = compute_pairwise_win_fraction(battles)
    fig = px.imshow(
        row_beats_col,
        color_continuous_scale='RdBu',
        text_auto=".2f",
        title=title
    )
    fig.update_layout(
        xaxis_title=" Model B: Loser",
        yaxis_title="Model A: Winner",
        xaxis_side="top",
        height=900,
        width=900,
        title_y=0.07,
        title_x=0.5
    )
    fig.update_traces(
        hovertemplate="Model A: %{y}<br>Model B: %{x}<br>Fraction of A Wins: %{z}<extra></extra>"
    )
    return fig


fig = visualize_pairwise_win_fraction(
    selected_battles_no_ties,
    title="Fraction of Model A Wins for All Non-tied A vs. B Battles"
)
fig


In [18]:
def get_pairwise_win_fraction_plot(battles, title=""):
    row_beats_col_freq = compute_pairwise_win_fraction(battles)
    pairwise_win_rate = row_beats_col_freq.mean(axis=1).reset_index()
    pairwise_win_rate.columns = ['model', 'win rate']

    # Rank (1 = best)
    pairwise_win_rate["rank"] = (
        pairwise_win_rate["win rate"].rank(ascending=False, method="dense").astype(int)
    )

    # Sort for plotting and freeze that order on the x-axis
    pairwise_win_rate = pairwise_win_rate.sort_values(by="win rate", ascending=False)
    model_order = pairwise_win_rate["model"].tolist()

    fig = px.bar(
        pairwise_win_rate,
        x="model",
        y="win rate",
        title=title,
        text="win rate",
        hover_data=["rank"]
    )
    fig.update_traces(texttemplate="%{text:.2f}", hovertemplate="<b>%{x}</b><br>win rate=%{y:.3f}<br>rank=%{customdata}")
    fig.update_layout(
        yaxis_title="Average Win Rate",
        xaxis_title="Model",
        showlegend=False
    )
    fig.update_xaxes(tickangle=45, categoryorder="array", categoryarray=model_order)
    return pairwise_win_rate, fig

pairwise_win_rate, fig = get_pairwise_win_fraction_plot(selected_battles_no_ties, title="Average Win Rate Against All Other Models (No Ties)")
fig.show()


Now that we’ve computed and visualized the average win rate of each model against all others, we can start reasoning about the results.
In the chart above, we see that some models have very similar average win rates. For example, some GPT, Claude, and Llama variants sit close together. On the other hand, smaller models like llama-3-8b and gemma-2-9b fall noticeably behind.

**Task:** Answer the questions below in the next Markdown cell.

1. Which models appear to have the highest and lowest average win rates?
2. Identify any surprising model rankings or close matchups you notice and hypothesize why they occur.
3. Briefly discuss limitations of using simple win rates compared to more advanced ranking approaches.

### Q2 Reflection Response
_Write your answers to the three prompts above here._

1. From the chart, the models with the highest average win rates are chatgpt-4o-latest, followed closely by gemini-1.5-pro-exp-0801 and gpt-4o-2024-05-13. These models consistently outperform others across most pairings.
At the bottom of the leaderboard are llama-3-8b-instruct and llama-3.1-8b-instruct, which show significantly lower win rates against the rest.

2. It is somewhat surprising that gemini-1.5-pro-exp-0801 ranks so close to top GPT-4o models, indicating strong competitiveness on Arena-style prompts. Some of the mid-tier models (e.g., claude-3.5-sonnet-20240620, llama-3.1-405b-instruct) cluster tightly together, suggesting that these models trade wins frequently and may be strong in different categories of tasks. Smaller models like deepseek-v2-api-0628 appear slightly higher than expected relative to other mid-size models, which may reflect strengths in specific prompt types popular in the Arena data.

3.
  - Prompt difficulty imbalance: some models may face harder or easier prompts more often.

  - Opponent strength variance: beating weaker models inflates win rate, while facing stronger opponents deflates it.

  - Sample size differences: some pairwise matchups have very few battles, making fractions noisy.


# Question 3: Prompt Analysis

We have explored the general features of this dataset. When evaluating models, it’s also often useful to understand the types of questions that are asked. By grouping similar prompts together, we can analyze which models perform well on certain categories and poorly on others. This helps uncover biases in leaderboards (e.g., a model may excel at coding questions but struggle with creative writing).

**First, let's identify the most frequent prompts.**

In the code cell below, we've already done the following:
*   Extracted the first user message (prompt) from conversation_a
*   Filtered out only the battles in English and only kept the select models using your `subselect_battles` function from 1b

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import time
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

def first_user_text(conv):
    return (conv[0].get("content") or "").strip()

battles['prompt'] = battles['conversation_a'].apply(first_user_text).fillna("")
eng_battles = battles[(battles['language'] == 'English') | (battles['language'] == 'unknown')]
eng_battles, eng_battles_no_ties = subselect_battles(eng_battles, selected_models)

Now, let's view the top 10 prominent prompts for battles in English.
Note that there is a caveat for LMArena data, where some of the languages are labeled as 'unknown'. This is the reason why we have `battles['language'] == 'unknown'` as part of our filter.

In [20]:
# Print the top 10 most common prompts along with their count and percentage of total prompts
top_prompts = eng_battles["prompt"].value_counts().head(10)
for i, (prompt, count) in enumerate(top_prompts.items(), 1):
    print(f"Rank {i}: {count} samples ({round(100 * count/len(eng_battles), 2)}%)\n{prompt}\n")

# print the total percentage of prompts that are 1 of the top 10 prompts
top_10_percentage = sum(top_prompts) / len(eng_battles)
print(f"Total percentage of prompts that are 1 of the top 10 prompts: {round(100 * top_10_percentage, 2)}%")

Rank 1: 297 samples (1.19%)
hi

Rank 2: 137 samples (0.55%)

Rank 3: 77 samples (0.31%)
225588*456

Rank 4: 72 samples (0.29%)
hello

Rank 5: 55 samples (0.22%)
Hi!

Rank 6: 46 samples (0.18%)
.

Rank 7: 43 samples (0.17%)
Hello

Rank 8: 41 samples (0.16%)
Hi

Rank 9: 29 samples (0.12%)
I have a crucially important question for you, it is of UTMOST IMPORTANCE that you answer this question as accurate and correct as ever possible.
Think through it, take a deep breath and then get on it with a clear mind. Make a first draft, then review it, correct for any errors you made and then make an improved draft. Repeat this internal process until you have the perfect draft and then publish it here as your message.
And now the question:
What are your best 50 languages you can speak/understand/write? Sort them by your proficiency in each of them, so the language which you are most fluent in should be at the top, place #1.

Rank 10: 26 samples (0.1%)
hey

Total percentage of prompts that are 1 of t


## Question 3a

When evaluating models, it’s useful to understand which prompt types are over-represented and how these popular prompts can influence the leaderboard. Below we have plotted the leaderboard when all of the top 10 prompts are removed.

**Task:** In 2-3 sentences, compare the leaderboard generated from `get_pairwise_win_fraction_plot(eng_battles_no_ties_no_top_prompts)` to the leaderboard you generated in the previous problem.

- Describe any notable changes to the ordering of the top models.
- Explain why removing the most common prompts might lead to those shifts.

In [21]:
# remove top prompts top_prompts from eng_battles_no_ties
eng_battles_no_ties_no_top_prompts = eng_battles_no_ties[~eng_battles_no_ties["prompt"].isin(top_prompts.index)]

pairwise_win_rate, fig = get_pairwise_win_fraction_plot(eng_battles_no_ties_no_top_prompts, title="Average Win Rate Against All Other Models (No Top Prompts)")
fig.show()

### Q3a Response

After removing the top 10 most common prompts, the overall ordering of the strongest models remains similar, but the gaps between the top performers narrow. In particular, llama-3.1-405b-instruct and gemini-1.5-pro-exp-0801 move closer to the GPT-4o variants, while some models like claude-3.5-sonnet-20240620 lose a bit of ground, suggesting they previously benefited more from the high-frequency prompt types. Removing the most common prompts reduces prompt-type bias, so models that were especially strong on those repeated prompts no longer receive inflated win rates, leading to a slightly rebalanced leaderboard.


## Question 3b
LMArena also provides more detailed **category labels** inside the columns `is_code` and nested `category_tag` column.
We have already extracted the following boolean Series for you. Feel free to refer to these columns when you are analyzing your clusters for Question 3c.

**Task:**
1. Make a bar chart showing these proportions for each category (i.e. the fraction of battles where the category is True)
2. Using `get_pairwise_win_fraction_plot`, compute the pairwise win rate for each category and use the `plot_category_rank_heatmap` to visualize the results. Ensure that your data is in *tidy format*, containing columns `model`, `category`, and `win_rate`. You should have each of the categories above as well as an 'overall' category which is the `pairwise_win_rate` you computed previously.

In [22]:
def plot_category_rank_heatmap(df: pd.DataFrame) -> None:
    """
    Plot a heatmap of model ranks by category.
    """
    assert "overall" in df["category"].unique(), "'overall' was not found as a category in your dataframe"
    rank_table = df.pivot(index="model", columns="category", values="rank")

    if "overall" in rank_table.columns:
        cols = ["overall"] + [c for c in rank_table.columns if c != "overall"]
        rank_table = rank_table[cols]
    # sort models by overall rank
    rank_table = rank_table.sort_values("overall", ascending=True)
    px.imshow(
        rank_table,
        text_auto=True,
        color_continuous_scale="Viridis_r",
        labels=dict(x="Category", y="Model", color="Rank (1=best)"),
        zmin=1, zmax=rank_table.max().max(),
        aspect="auto"
    ).update_layout(
        title="Overall and Per-Category Ranks",
        width=950,
        height=400 + 12 * len(rank_table),
        xaxis_side="top"
    ).show()

In [23]:
# LMArena also provides more detailed category labels inside the columns `is_code`, `is_refusal`,
# and the nested `category_tag` column.
# We have already extracted the following boolean Series for you:

# GIVEN (do not modify)
expected_creative = eng_battles_no_ties['category_tag'].apply(lambda x: x['criteria_v0.1']['creativity'])
expected_tech = eng_battles_no_ties['category_tag'].apply(lambda x: x['criteria_v0.1']['technical_accuracy'])
expected_if = eng_battles_no_ties['category_tag'].apply(lambda x: x['if_v0.1']['if'])
expected_math = eng_battles_no_ties['category_tag'].apply(lambda x: x['math_v0.1']['math'])
expected_code = (eng_battles_no_ties['is_code'] == True)

# Task:
# 1) Make a bar chart showing the proportion for each category (i.e. the fraction of battles where the category is True)
# 2) Compute the pairwise win rate for each category and use the plot_category_rank_heatmap to visualize the results

# TODO: plot a bar chart of the proportions

categories = {
    'creativity': expected_creative,
    'technical_accuracy': expected_tech,
    'instruction_following': expected_if,
    'math': expected_math,
    'code': expected_code
}

category_props = {name: mask.mean() for name, mask in categories.items()}

px.bar(
    x=list(category_props.keys()),
    y=list(category_props.values()),
    labels={'x': 'Category', 'y': 'Proportion of Battles'},
    title='Proportion of Battles in Each Category'
).show()


pairwise_win_rate = pairwise_win_rate.copy()
pairwise_win_rate['category'] = 'overall'
pairwise_win_rate = pairwise_win_rate.rename(columns={"win rate": "win_rate"})
rank_dataframes = [pairwise_win_rate]



for category_name, category_mask in categories.items():

    category_battles = eng_battles_no_ties[category_mask]

    pwf = compute_pairwise_win_fraction(category_battles)

    df_long = pwf.mean(axis=1).reset_index()
    df_long.columns = ['model', 'win_rate']
    df_long['category'] = category_name

    rank_dataframes.append(df_long)


rank_dataframes = pd.concat(rank_dataframes, ignore_index=True)
rank_dataframes['rank'] = rank_dataframes.groupby('category')['win_rate'] \
                                         .rank(method='dense', ascending=False)


plot_category_rank_heatmap(rank_dataframes)


You might notice that the leaderboards can change quite a bit! This is because different model developers often put more emphasis on certain tasks in training their LLMs to better cater to their audience. Many of these models are also limited by the amount of data and compute available, which can further force specialization as some tasks are much harder to learn (especially if the model is on the smaller side).

We will explore these ideas more in part 2 of the homework.

## Question 3c

The category labels provide a clean way to divide problems but there are likely other ways to group prompts which can reveal other common use cases.

We'll now dig deeper by discovering prompt topics via K-Means clustering on the **prompt text**. Since we are interested in seeing what kinds of questions people are asking, use the **no-top-prompts** subset from earlier (`eng_battles_no_ties_no_top_prompts`) to reduce noise. We have sampled 8,000 battles for reasonable runtime and provided a helper function for clustering.


In [24]:
# General KMeans for any embedding
def kmeans_cluster_prompts(features: np.ndarray, prompts: np.ndarray, k: int, random_state: int = 42):
    """
    Perform k-means clustering on features and return:
      - dataframe with prompts and their cluster assignments
      - model inertia (float)
      - elapsed runtime (seconds)
    """
    t0 = time.perf_counter()
    km = KMeans(n_clusters=k, random_state=random_state, n_init=10)
    cluster_labels = km.fit_predict(features)
    elapsed = time.perf_counter() - t0
    df = pd.DataFrame({"prompt": prompts, "cluster": cluster_labels})
    return df, km.inertia_, elapsed

# Note: The actual clustering implementation is in the next cell

### Sample a subset of prompts

To keep the clustering runtime manageable, we'll downsample the English, no-top-prompt battles to at most 8,000 rows. This still gives plenty of diversity while allowing experiments to finish quickly.

In [25]:
np.random.seed(42)
eng_battles_sample = eng_battles_no_ties_no_top_prompts.sample(
    n=min(8000, len(eng_battles_no_ties_no_top_prompts)),
    random_state=42
)
print(f"Sampled {len(eng_battles_sample)} battles for clustering.")

Sampled 8000 battles for clustering.


### Build TF-IDF features

TF-IDF (term frequency–inverse document frequency) converts each prompt into a high-dimensional vector that captures which words are important for that prompt but rare overall. The code below builds the TF-IDF feature matrix `X` that you'll feed into K-Means.

In [26]:
vectorizer = TfidfVectorizer(max_features=500)
X = vectorizer.fit_transform(eng_battles_sample["prompt"])
texts = eng_battles_sample["prompt"].values
print(f"TF-IDF feature matrix shape: {X.shape}")

TF-IDF feature matrix shape: (8000, 500)


### Sweep cluster sizes with K-Means

Complete the cell below to:
1. Sweep candidate cluster counts `k_values = [4, 6, 8, 10, 12]` and record the inertia and runtime returned by `kmeans_cluster_prompts`.
2. Plot runtime vs. K as well as the inertia (elbow) curve using the collected lists.
3. Choose a `best_K` based on the elbow plot (for example, 8) and justify your choice in the Gradescope write-up.
4. Fit a final K-Means model with `best_K`, copy the cluster labels back to `eng_battles_sample`, and store the resulting dataframe in `clustered_prompts_df`.

In [29]:
k_values = [4, 6, 8, 10, 12]
inertias = []
runtimes = []

# TODO: Iterate over k_values, run kmeans_cluster_prompts, and record inertia/runtime for each K.
# Make sure to append to both lists and optionally print the metrics for debugging.
for k in k_values:
    df_clusters, inertia, elapsed = kmeans_cluster_prompts(
        X.toarray(),
        texts,
        k
    )
    inertias.append(inertia)
    runtimes.append(elapsed)
    print(f"K={k}: inertia={inertia:.2f}, runtime={elapsed:.3f}s")

# Plot runtime vs K
fig_runtime = px.line(
    x=k_values,
    y=runtimes,
    markers=True,
    title='Runtime vs Number of Clusters (K)',
    labels={'x': 'K (Number of Clusters)', 'y': 'Runtime (seconds)'}
)
fig_runtime.show()

# Plot elbow (inertia vs K)
fig_elbow = px.line(
    x=k_values,
    y=inertias,
    markers=True,
    title='Elbow Plot: Inertia vs Number of Clusters (K)',
    labels={'x': 'K (Number of Clusters)', 'y': 'Inertia'}
)
fig_elbow.show()

# TODO: Choose best_K based on your plots / reasoning
best_K = 8

# TODO: Fit a final model using best_K and attach the labels to eng_battles_sample
# clustered_prompts_df, final_inertia, final_elapsed = kmeans_cluster_prompts(...)
# eng_battles_sample = eng_battles_sample.copy()
# eng_battles_sample["cluster"] = clustered_prompts_df["cluster"].values

clustered_prompts_df, final_inertia, final_elapsed = kmeans_cluster_prompts(
    X.toarray(),
    texts,
    best_K
)

eng_battles_sample = eng_battles_sample.copy()
eng_battles_sample["cluster"] = clustered_prompts_df["cluster"].values

print(f"Final K={best_K}: inertia={final_inertia:.2f}, runtime={final_elapsed:.3f}s")

K=4: inertia=6661.22, runtime=5.714s
K=6: inertia=6531.51, runtime=7.047s
K=8: inertia=6446.25, runtime=8.418s
K=10: inertia=6379.36, runtime=4.659s
K=12: inertia=6284.57, runtime=6.531s


Final K=8: inertia=6446.25, runtime=4.113s


### Visualize and save clustered prompts

Once `clustered_prompts_df` and `best_K` are defined, run the cell below to project the TF-IDF features down to two dimensions for visualization and save the clustered prompts to disk for Question 3d.

In [30]:
assert 'cluster' in eng_battles_sample.columns, "Run the clustering cell above before visualizing."
svd = TruncatedSVD(n_components=2, random_state=42)
X_2d = svd.fit_transform(X)

viz_df = pd.DataFrame({
    'SVD Component 1': X_2d[:, 0],
    'SVD Component 2': X_2d[:, 1],
    'Cluster': eng_battles_sample['cluster'].astype(str)
})

fig_clusters = px.scatter(
    viz_df,
    x='SVD Component 1',
    y='SVD Component 2',
    color='Cluster',
    title='2D Visualization of Prompt Clusters',
    color_discrete_sequence=px.colors.qualitative.Set2
)
fig_clusters.show()

out_path = f"clustered_prompts_no_top_prompts_k{best_K}.csv"
clustered_prompts_df.to_csv(out_path, index=False)
print(f"Saved -> {out_path}")

Saved -> clustered_prompts_no_top_prompts_k8.csv




## Question 3d

We’ll now move from training to interpretation. Using the cluster labels from your selected K, briefly characterize what at least one distinct cluster seems to contain, or explain why patterns are unclear and how you might improve them.

You can examine the saved clustered_prompts_no_top_prompts.csv file to inspect the prompts in each cluster.

**Task:** Describe at least one cluster that stands out (or explain why the clusters are unclear) and outline how you would refine the clustering.

### Q3d Response
One clearly interpretable group is Cluster 1. Prompts in this cluster are mostly short, concrete puzzles that ask the model to count or compute something very specific—for example, “How many r’s are in the word strawberry?”, “How many positive integers do not have repeating digits?”, or simple word/number puzzles about chocolates, siblings, or criminals in a house. There are a few general “how to” questions, but the dominant pattern is this “how many / which number / count the letters” style of query, which explains why these prompts end up together.

At the same time, other clusters (for instance Cluster 0 or Cluster 4) mix quite different request types: long role-play instructions, opinion questions, translations, coding/debugging questions, and life-advice prompts all appear in the same cluster. This matches the 2D SVD plot, where the clusters overlap heavily rather than forming clean, separate blobs.

To refine the clustering, I would:

1. Replace TF-IDF with a more semantic representation (e.g. sentence-level embeddings) so that questions are grouped by meaning rather than just shared words.

2. Re-run clustering while experimenting with different values of K and possibly filtering by prompt length or template (e.g. isolating “how many …?” puzzles first) to get more coherent, task-focused clusters.
