<a href="https://colab.research.google.com/github/Untouchables007/llm-bench/blob/main/%5BVISION%5D_Chatbot_Arena_Bradley_Terry_model_Calculation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook, we present data analysis on Chatbot Arena data collected from https://arena.lmsys.org.

We explain different Elo calculation methods (online Elo and MLE Elo, also known as the Bradley-Terry model) for model ranking.

To view the latest leaderboard, see https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard.
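
For context on the "online Elo" method mentioned above, here is a minimal sketch of the standard online Elo update, replayed over a list of battles. The function names, the toy battle list, and the step size `k=4` are illustrative assumptions, not taken from this notebook; only the `400`/`10` scale and base match the defaults used later in `compute_bt`.

```python
def expected_score(r_a, r_b, scale=400, base=10):
    """Probability that A beats B under the Elo (logistic) model."""
    return 1 / (1 + base ** ((r_b - r_a) / scale))


def online_elo(battles, k=4, init_rating=1000):
    """Replay battles in order; each battle is (model_a, model_b, winner)."""
    ratings = {}
    for model_a, model_b, winner in battles:
        r_a = ratings.get(model_a, init_rating)
        r_b = ratings.get(model_b, init_rating)
        e_a = expected_score(r_a, r_b)
        # Score for A: 1 for a win, 0 for a loss, 0.5 for any kind of tie.
        if winner == "model_a":
            s_a = 1.0
        elif winner == "model_b":
            s_a = 0.0
        else:
            s_a = 0.5
        # Both ratings move by the same amount in opposite directions.
        ratings[model_a] = r_a + k * (s_a - e_a)
        ratings[model_b] = r_b + k * ((1 - s_a) - (1 - e_a))
    return ratings


# Hypothetical toy data, in chronological order.
toy = [("m1", "m2", "model_a"), ("m2", "m1", "tie"), ("m1", "m2", "model_b")]
ratings = online_elo(toy)
first_only = online_elo(toy[:1])
```

Because each update depends on the ratings at that moment, online Elo is sensitive to the order of the battles, whereas the Bradley-Terry fit used later in this notebook treats all battles at once.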


In [None]:
from collections import defaultdict
import json, math
import numpy as np
import pandas as pd
import plotly.express as px
from tqdm import tqdm
import requests
pd.options.display.float_format = '{:.2f}'.format

# Obtaining and Cleaning the Tournament Data
We host the tournament results as a JSON file on Google Cloud Storage and download it with `requests`. The data contains all the battles and voting results collected for ranking models.

In [None]:
# we use the latest data
url = "https://storage.googleapis.com/arena_external_data/public/vision_clean_battle_20240625_public.json"
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed

with open('local_file_name.json', 'wb') as file:
    file.write(response.content)

# load the JSON data from the local file
with open('local_file_name.json', 'r') as file:
    battles = pd.read_json(file).sort_values(ascending=True, by=["tstamp"])

In [None]:
battles

Unnamed: 0,model_a,model_b,winner,judge,turn,anony,language,tstamp,num_tokens_info,is_code,is_refusal,dedup_tag
0,claude-3-opus-20240229,gemini-1.5-pro-api-0514,tie,arena_user_kCPjnKv4zvwsh5iqkFvtvX,1,True,English,1718003027.69,"{'user_tokens': 1, 'context_a_tokens': 1, 'con...",False,False,"{'high_freq': False, 'sampled': True}"
1,claude-3-sonnet-20240229,claude-3-opus-20240229,model_b,arena_user_o4HEFeLmSpTmPrt7pm4uJe,1,True,Russian,1718003593.83,"{'user_tokens': 8, 'context_a_tokens': 8, 'con...",False,False,"{'high_freq': False, 'sampled': True}"
2,gpt-4o-2024-05-13,gemini-1.5-pro-api-0514,model_b,arena_user_DjHwDQxDZcWre7XftM2RLy,1,True,English,1718003599.78,"{'user_tokens': 5, 'context_a_tokens': 5, 'con...",False,False,"{'high_freq': False, 'sampled': True}"
3,claude-3-haiku-20240307,gpt-4o-2024-05-13,model_b,arena_user_GAsaNLPKyTHdjuo2zrXWSw,1,True,Chinese,1718003966.04,"{'user_tokens': 18, 'context_a_tokens': 18, 'c...",False,False,"{'high_freq': False, 'sampled': True}"
4,claude-3-haiku-20240307,gemini-1.5-flash-api-0514,tie (bothbad),arena_user_GAsaNLPKyTHdjuo2zrXWSw,1,True,Chinese,1718004161.74,"{'user_tokens': 95, 'context_a_tokens': 95, 'c...",False,False,"{'high_freq': False, 'sampled': True}"
...,...,...,...,...,...,...,...,...,...,...,...,...
17424,gemini-1.5-pro-api-0514,claude-3-sonnet-20240229,tie (bothbad),arena_user_5yoWgProWfoZaA5BWViQRm,1,True,English,1719356939.48,"{'user_tokens': 5, 'context_a_tokens': 5, 'con...",False,False,"{'high_freq': False, 'sampled': True}"
17425,claude-3-5-sonnet-20240620,claude-3-sonnet-20240229,tie,arena_user_NwY4uVKKFkD8QTedueRGmn,1,True,English,1719357014.65,"{'user_tokens': 24, 'context_a_tokens': 24, 'c...",False,False,"{'high_freq': False, 'sampled': True}"
17426,gemini-1.5-flash-api-0514,claude-3-sonnet-20240229,tie (bothbad),arena_user_5yoWgProWfoZaA5BWViQRm,1,True,English,1719357017.45,"{'user_tokens': 5, 'context_a_tokens': 5, 'con...",False,False,"{'high_freq': False, 'sampled': True}"
17427,claude-3-haiku-20240307,llava-v1.6-34b,model_a,arena_user_3uXsQJPfETJ7Db9R457z4n,2,True,English,1719357120.99,"{'user_tokens': 15, 'context_a_tokens': 81, 'c...",False,False,"{'high_freq': False, 'sampled': True}"


In [None]:
# we use anony battles only for leaderboard
battles = battles[battles["anony"] == True]

# we de-duplicate the top 0.1% redundant prompts
# see https://lmsys.org/blog/2024-05-17-category-hard/#note-enhancing-quality-through-de-duplication
print("Before dedup: ", len(battles))
battles = battles[battles["dedup_tag"].apply(lambda x: x.get("sampled", False))]
print("After dedup: ", len(battles))

Before dedup:  17429
After dedup:  17429


# Exploratory Analysis

Before computing the Elo ratings, we first conduct some basic exploratory analysis to highlight a few key properties and caveats of this data.

## Statistics

We allowed the user to declare a tie between the pairs of models.  To collect additional data, later in the tournament we also allowed the user to declare a tie in which both models were bad.  A significant portion of the outcomes were ties.

In [None]:
fig = px.bar(battles["winner"].value_counts(),
             title="Counts of Battle Outcomes", text_auto=True, height=400)
fig.update_layout(xaxis_title="Battle Outcome", yaxis_title="Count",
                  showlegend=False)
fig

In [None]:
battles_no_ties = battles[~battles["winner"].str.contains("tie")]

## Non-uniform Model Frequency

The model frequency is not uniform for the following reasons:
- Several different matching and sampling algorithms were used. We employed uniform sampling as well as weighted sampling methods, which assign greater weights to better models.
- Some new models were added later.


In [None]:
fig = px.bar(pd.concat([battles["model_a"], battles["model_b"]]).value_counts(),
             title="Battle Count for Each Model", text_auto=True)
fig.update_layout(xaxis_title="model", yaxis_title="Battle Count", height=400,
                  showlegend=False)
fig

We examine the number of pairings for each combination of models.

In [None]:
def visualize_battle_count(battles, title, show_num_models=30):
    ptbl = pd.pivot_table(battles, index="model_a", columns="model_b", aggfunc="size",
                          fill_value=0)
    battle_counts = ptbl + ptbl.T
    ordering = battle_counts.sum().sort_values(ascending=False).index
    ordering = ordering[:show_num_models]
    fig = px.imshow(battle_counts.loc[ordering, ordering],
                    title=title, text_auto=True)
    fig.update_layout(xaxis_title="Model B",
                      yaxis_title="Model A",
                      xaxis_side="top", height=800, width=800,
                      title_y=0.07, title_x=0.5,
                      font=dict(size=10))
    fig.update_traces(hovertemplate=
                      "Model A: %{y}<br>Model B: %{x}<br>Count: %{z}<extra></extra>")
    return fig

fig = visualize_battle_count(battles, title="Battle Count of Each Combination of Models", show_num_models=30)
fig

### Battles Excluding Ties

In [None]:
visualize_battle_count(battles_no_ties, "Battle Count for Each Combination of Models (without Ties)")

### Counting Ties

In [None]:
visualize_battle_count(battles[battles['winner'].str.contains("tie")], "Tie Count for Each Combination of Models")

## Inferred Language

We also inferred the language of each conversation using the `polyglot` package. This is just an estimate but will help guide future analysis.  The vast majority of conversations were in English.

In [None]:
lang_count = battles["language"].value_counts()
# errors="ignore" avoids a KeyError when no conversation is labeled "unknown"
lang_count = lang_count.drop(index="unknown", errors="ignore")

In [None]:
topk = 15
fig = px.bar(lang_count.head(topk),
             title=f"Battle Counts for the Top {topk} Languages",
             text_auto=True, height=400, log_y=True)
fig.update_layout(xaxis_title="Language", yaxis_title="Count", showlegend=False)
fig

## Number of Conversation Turns

We also noticed that most conversations have only one turn.

In [None]:
fig = px.histogram(battles["turn"],
             title="Number of Conversation Turns",
             text_auto=True, height=400, log_y=True)
fig.update_layout(xaxis_title="Turns", yaxis_title="Count", showlegend=False)
fig

## Pairwise Win Fractions

Finally, we compute the pairwise win fractions. Because each model can appear as either Model A or Model B and win in both positions, we sum its wins in both configurations and divide by the total number of battles between the two models.
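
As a small worked example of this calculation, the sketch below counts wins regardless of position and divides by the number of pairings. The toy battles and the helper `win_fraction` are hypothetical and use only the standard library; the notebook's own implementation uses pivot tables.

```python
from collections import Counter

# Hypothetical non-tied battles: (model_a, model_b, winner)
toy_battles = [
    ("x", "y", "model_a"),  # x wins as Model A
    ("y", "x", "model_b"),  # x wins as Model B
    ("x", "y", "model_b"),  # y wins as Model B
    ("y", "x", "model_a"),  # y wins as Model A
    ("x", "y", "model_a"),  # x wins as Model A
]

wins = Counter()      # wins[(winner, loser)], independent of position
pairings = Counter()  # pairings[{a, b}], independent of position

for model_a, model_b, winner in toy_battles:
    if winner == "model_a":
        victor, loser = model_a, model_b
    else:
        victor, loser = model_b, model_a
    wins[(victor, loser)] += 1
    pairings[frozenset((model_a, model_b))] += 1


def win_fraction(row, col):
    """Fraction of row-vs-col battles that row won, in either position."""
    return wins[(row, col)] / pairings[frozenset((row, col))]


print(win_fraction("x", "y"))  # 0.6
print(win_fraction("y", "x"))  # 0.4
```

Here `x` wins three of the five battles (two as Model A, one as Model B), so its win fraction against `y` is 0.6.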

In [None]:
def compute_pairwise_win_fraction(battles, max_num_models=30):
    # Times each model wins as Model A
    a_win_ptbl = pd.pivot_table(
        battles[battles['winner'] == "model_a"],
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Table counting times each model wins as Model B
    b_win_ptbl = pd.pivot_table(
        battles[battles['winner'] == "model_b"],
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Table counting number of A-B pairs
    num_battles_ptbl = pd.pivot_table(battles,
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Computing the proportion of wins for each model as A and as B
    # against all other models
    row_beats_col_freq = (
        (a_win_ptbl + b_win_ptbl.T) /
        (num_battles_ptbl + num_battles_ptbl.T)
    )

    # Arrange ordering according to proportion of wins
    prop_wins = row_beats_col_freq.mean(axis=1).sort_values(ascending=False)
    prop_wins = prop_wins[:max_num_models]
    model_names = list(prop_wins.keys())
    row_beats_col = row_beats_col_freq.loc[model_names, model_names]
    return row_beats_col

def visualize_pairwise_win_fraction(battles, title, max_num_models=30):
    row_beats_col = compute_pairwise_win_fraction(battles, max_num_models)
    fig = px.imshow(row_beats_col, color_continuous_scale='RdBu',
                    text_auto=".2f", title=title)
    fig.update_layout(xaxis_title="Model B: Loser",
                  yaxis_title="Model A: Winner",
                  xaxis_side="top", height=900, width=900,
                  title_y=0.07, title_x=0.5)
    fig.update_traces(hovertemplate=
                  "Model A: %{y}<br>Model B: %{x}<br>Fraction of A Wins: %{z}<extra></extra>")

    return fig

In [None]:
fig = visualize_pairwise_win_fraction(battles_no_ties,
      title = "Fraction of Model A Wins for All Non-tied A vs. B Battles")
fig

## Preliminary Ranking

Using just the average win rate against all other models we can already compute an estimated leaderboard.
However, this method may not be as scalable as the Elo rating system that we will use later because this method requires data from all model combinations.
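
To illustrate why missing pairings matter, here is a toy sketch with made-up win fractions in which model `D` has only ever faced the weakest model. Averaging over observed opponents only (which is what pandas' `mean` does when it skips `NaN` entries) puts `D` at the top even though it never played `A` or `B`. All names and numbers are hypothetical.

```python
# Hypothetical pairwise win fractions: win_frac[row][col] = P(row beats col).
win_frac = {
    "A": {"B": 0.60, "C": 0.90},
    "B": {"A": 0.40, "C": 0.80},
    "C": {"A": 0.10, "B": 0.20, "D": 0.05},
    "D": {"C": 0.95},  # D has only ever faced the weakest model
}


def average_win_rate(table):
    """Mean win fraction over observed opponents; missing pairs are ignored."""
    return {model: sum(row.values()) / len(row) for model, row in table.items()}


avg = average_win_rate(win_frac)
ranking = sorted(avg, key=avg.get, reverse=True)
print(ranking)  # ['D', 'A', 'B', 'C']
```

A rating model such as Bradley-Terry accounts for opponent strength, so it is less sensitive to this kind of uneven coverage.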

In [None]:
row_beats_col_freq = compute_pairwise_win_fraction(battles_no_ties)
fig = px.bar(row_beats_col_freq.mean(axis=1).sort_values(ascending=False),
             title="Average Win Rate Against All Other Models (Assuming Uniform Sampling and No Ties)",
             text_auto=".2f")
fig.update_layout(yaxis_title="Average Win Rate", xaxis_title="Model",
                  showlegend=False)
fig

# Bradley-Terry Model

In [None]:
def pretty_print_model_ratings(ratings):
    df = pd.DataFrame([
        [n, ratings[n]] for n in ratings.keys()
    ], columns=["Model", "BT rating"]).sort_values("BT rating", ascending=False).reset_index(drop=True)
    # df["Elo rating"] = (df["Elo rating"] + 0.5).astype(int)
    df.index = df.index + 1
    return df


### Maximum Likelihood Estimation with [Bradley-Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
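
The following is a reading of what `compute_bt` below fits, written out as equations; it is a sketch derived from the code, not a statement from the authors. Each model $i$ gets a coefficient $\beta_i$, reported as a rating $R_i$:

$$
R_i = \mathrm{SCALE}\cdot\beta_i + \mathrm{INIT\_RATING},
\qquad
P(i \text{ beats } j) = \frac{1}{1 + \mathrm{BASE}^{\,\beta_j - \beta_i}} = \frac{1}{1 + \mathrm{BASE}^{\,(R_j - R_i)/\mathrm{SCALE}}}
$$

With the logistic function $\sigma(x) = 1/(1 + e^{-x})$, this is $P(i \text{ beats } j) = \sigma\big(\ln(\mathrm{BASE})\,(\beta_i - \beta_j)\big)$, which is why the design matrix holds $+\ln(\mathrm{BASE})$ for one model and $-\ln(\mathrm{BASE})$ for the other. The fit maximizes a weighted log-likelihood, where the weight $w_{ij}$ is proportional to twice the number of wins of $i$ over $j$ plus the number of ties between them, so a tie counts as half a win for each side:

$$
\hat\beta = \arg\max_{\beta}\; \sum_{i \neq j} w_{ij}\,\log \sigma\big(\ln(\mathrm{BASE})\,(\beta_i - \beta_j)\big) \;-\; \tfrac{1}{2}\lVert\beta\rVert^2
$$

The last term comes from the `penalty="l2", C=1` setting in the code, so the estimate is an L2-regularized variant of the maximum likelihood estimate.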

In [None]:
def compute_bt(
    df, SCALE=400, BASE=10, INIT_RATING=1000, sample_weight=None, offset=0,
):
    from sklearn.linear_model import LogisticRegression
    ptbl_a_win = pd.pivot_table(
        df[df["winner"] == "model_a"],
        index="model_a",
        columns="model_b",
        aggfunc="size",
        fill_value=0,
    )
    # if no tie, create a zero matrix
    if sum(df["winner"].isin(["tie", "tie (bothbad)"])) == 0:
        ptbl_tie = pd.DataFrame(0, index=ptbl_a_win.index, columns=ptbl_a_win.columns)
    else:
        ptbl_tie = pd.pivot_table(
            df[df["winner"].isin(["tie", "tie (bothbad)"])],
            index="model_a",
            columns="model_b",
            aggfunc="size",
            fill_value=0,
        )
        ptbl_tie = ptbl_tie + ptbl_tie.T

    ptbl_b_win = pd.pivot_table(
        df[df["winner"] == "model_b"],
        index="model_a",
        columns="model_b",
        aggfunc="size",
        fill_value=0,
    )
    ptbl_win = ptbl_a_win * 2 + ptbl_b_win.T * 2 + ptbl_tie

    models = pd.Series(np.arange(len(ptbl_win.index)), index=ptbl_win.index)

    p = len(models)
    X = np.zeros([p * (p - 1) * 2, p])
    Y = np.zeros(p * (p - 1) * 2)

    cur_row = 0
    sample_weights = []
    for m_a in ptbl_win.index:
        for m_b in ptbl_win.columns:
            if m_a == m_b:
                continue
            # if nan skip
            if math.isnan(ptbl_win.loc[m_a, m_b]) or math.isnan(ptbl_win.loc[m_b, m_a]):
                continue
            X[cur_row, models[m_a]] = +math.log(BASE)
            X[cur_row, models[m_b]] = -math.log(BASE)
            Y[cur_row] = 1.0
            sample_weights.append(ptbl_win.loc[m_a, m_b])

            X[cur_row + 1, models[m_a]] = math.log(BASE)
            X[cur_row + 1, models[m_b]] = -math.log(BASE)
            Y[cur_row + 1] = 0.0
            sample_weights.append(ptbl_win.loc[m_b, m_a])
            cur_row += 2
    X = X[:cur_row]
    Y = Y[:cur_row]

    sample_weights = np.array(sample_weights)

    lr = LogisticRegression(fit_intercept=False, penalty="l2", C=1, tol=1e-6)
    lr.fit(X, Y, sample_weight=sample_weights)
    beta = (lr.coef_.squeeze()*SCALE+INIT_RATING) + offset

    return pd.Series(beta, index=models.index).sort_values(ascending=False)

bt_ratings = compute_bt(battles)
pretty_print_model_ratings(bt_ratings)

Unnamed: 0,Model,BT rating
1,gpt-4o-2024-05-13,1114.79
2,claude-3-5-sonnet-20240620,1098.16
3,gemini-1.5-pro-api-0514,1059.51
4,gpt-4-turbo-2024-04-09,1056.23
5,claude-3-opus-20240229,972.54
6,gemini-1.5-flash-api-0514,967.7
7,claude-3-sonnet-20240229,938.81
8,llava-v1.6-34b,903.26
9,claude-3-haiku-20240307,888.99


### Compute Bootstrap Confidence Intervals for MLE Elo Scores

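We can also use the bootstrap to estimate confidence intervals for the ratings: we resample the battles with replacement, refit the model on each resample, and take quantiles of the resulting ratings.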
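
As a minimal, self-contained sketch of the percentile bootstrap, the example below applies it to a simple win rate rather than to the full Bradley-Terry fit. The helper `bootstrap_interval`, the toy outcomes, and the choice of 1000 rounds are illustrative assumptions.

```python
import random


def bootstrap_interval(outcomes, statistic, num_rounds=1000, seed=42):
    """Percentile bootstrap: resample with replacement, recompute, take quantiles."""
    rng = random.Random(seed)
    n = len(outcomes)
    estimates = sorted(
        statistic(rng.choices(outcomes, k=n)) for _ in range(num_rounds)
    )
    lower = estimates[int(0.025 * num_rounds)]
    upper = estimates[int(0.975 * num_rounds) - 1]
    return lower, upper


def win_rate(xs):
    return sum(xs) / len(xs)


# Hypothetical data: 1 = the model won, 0 = it lost (70 wins in 100 battles).
outcomes = [1] * 70 + [0] * 30

point = win_rate(outcomes)
lower, upper = bootstrap_interval(outcomes, win_rate)
print(f"win rate {point:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

The notebook's version follows the same pattern, with the battles DataFrame as the data and the rating fit as the statistic.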


In [None]:
def get_bootstrap_result(battles, func_compute_elo, num_round, offset):
    rows = []
    for i in tqdm(range(num_round), desc="bootstrap"):
        rows.append(func_compute_elo(battles.sample(frac=1.0, replace=True), offset=offset))
    df = pd.DataFrame(rows)
    return df[df.median().sort_values(ascending=False).index]

In [None]:
bt_ratings = compute_bt(battles)
anchor_score = bt_ratings[bt_ratings.index == "claude-3-haiku-20240307"].values[0]-1000

In [None]:
BOOTSTRAP_ROUNDS = 100

np.random.seed(42)
bootstrap_elo_lu = get_bootstrap_result(battles, compute_bt, BOOTSTRAP_ROUNDS, offset=-anchor_score)

bootstrap: 100%|██████████| 100/100 [00:07<00:00, 12.60it/s]


In [None]:
def visualize_bootstrap_scores(df, title):
    bars = pd.DataFrame(dict(
        lower = df.quantile(.025),
        rating = df.quantile(.5),
        upper = df.quantile(.975))).reset_index(names="model").sort_values("rating", ascending=False)
    bars['error_y'] = bars['upper'] - bars["rating"]
    bars['error_y_minus'] = bars['rating'] - bars["lower"]
    bars['rating_rounded'] = np.round(bars['rating'])
    fig = px.scatter(bars, x="model", y="rating", error_y="error_y",
                     error_y_minus="error_y_minus", text="rating_rounded",
                     title=title)
    fig.update_traces(textposition='middle right')
    fig.update_layout(xaxis_title="Model", yaxis_title="Rating",
                      height=600)
    return fig

fig = visualize_bootstrap_scores(bootstrap_elo_lu, "Bootstrap of BT Rating Estimates")
fig

### Predict Win Rates
Utilizing Elo ratings allows us to predict win probabilities. By comparing the predicted win rates with the actual win rates, we can gain insight into the accuracy and quality of the Elo rating system.
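
As a quick worked example of the formula, the sketch below computes the predicted win probability for a 400-point gap and for the top and bottom models in the BT rating table above (1114.79 vs. 888.99). The helper name `win_probability` is illustrative; the scale and base match the defaults used in this notebook.

```python
def win_probability(rating_a, rating_b, scale=400, base=10):
    """Predicted probability that A beats B, given their ratings."""
    return 1 / (1 + base ** ((rating_b - rating_a) / scale))


# A 400-point gap corresponds to 10:1 odds, i.e. 10/11.
print(round(win_probability(1400, 1000), 3))       # 0.909

# gpt-4o-2024-05-13 vs. claude-3-haiku-20240307, using the ratings above.
print(round(win_probability(1114.79, 888.99), 3))  # 0.786
```

The two probabilities for a pair always sum to one, so the predicted win-rate matrix is fully determined by the rating differences.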






In [None]:
def predict_win_rate(bt_ratings, SCALE=400, BASE=10, INIT_RATING=1000):
    names = sorted(list(bt_ratings.keys()))
    wins = defaultdict(lambda: defaultdict(lambda: 0))
    for a in names:
        for b in names:
            ea = 1 / (1 + BASE ** ((bt_ratings[b] - bt_ratings[a])/SCALE))
            wins[a][b] = ea
            wins[b][a] = 1 - ea

    data = {
        a: [wins[a][b] if a != b else np.nan for b in names]
        for a in names
    }

    df = pd.DataFrame(data, index=names)
    df.index.name = "model_a"
    df.columns.name = "model_b"
    return df.T

In [None]:
win_rate = predict_win_rate(dict(bootstrap_elo_lu.quantile(0.5)))
ordered_models = win_rate.mean(axis=1).sort_values(ascending=False).index
ordered_models = ordered_models[:30]
fig = px.imshow(win_rate.loc[ordered_models, ordered_models],
                color_continuous_scale='RdBu', text_auto=".2f",
                title="Predicted Win Rate Using Elo Ratings for Model A in an A vs. B Battle")
fig.update_layout(xaxis_title="Model B",
                  yaxis_title="Model A",
                  xaxis_side="top", height=900, width=900,
                  title_y=0.07, title_x=0.5)
fig.update_traces(hovertemplate=
                  "Model A: %{y}<br>Model B: %{x}<br>Win Rate: %{z}<extra></extra>")
fig

# Language-specific Leaderboards
We present two language-specific leaderboards, by isolating the chat data into two subsets based on the language: (1) English-only and (2) Non-English.

## English-only

In [None]:
english_only_battles = battles[battles["language"] == "English"]
bt_ratings = compute_bt(english_only_battles)
pd.DataFrame(bt_ratings)

model_a,0
gpt-4o-2024-05-13,1112.34
claude-3-5-sonnet-20240620,1091.37
gpt-4-turbo-2024-04-09,1060.43
gemini-1.5-pro-api-0514,1058.55
gemini-1.5-flash-api-0514,974.3
claude-3-opus-20240229,959.57
claude-3-sonnet-20240229,934.31
llava-v1.6-34b,926.89
claude-3-haiku-20240307,882.23


## Non-English

In [None]:
non_english_battles = battles[battles["language"] != "English"]
bt_ratings = compute_bt(non_english_battles)
pd.DataFrame(bt_ratings)

model_a,0
gpt-4o-2024-05-13,1118.7
claude-3-5-sonnet-20240620,1111.63
gemini-1.5-pro-api-0514,1060.63
gpt-4-turbo-2024-04-09,1049.91
claude-3-opus-20240229,993.82
gemini-1.5-flash-api-0514,956.52
claude-3-sonnet-20240229,946.27
claude-3-haiku-20240307,899.91
llava-v1.6-34b,862.61


# Links



Some good resources to learn more about Elo rating systems:
- Elo rating system https://en.wikipedia.org/wiki/Elo_rating_system
- Bradley-Terry model https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model
- An introduction video https://www.youtube.com/watch?v=AsYfbmp0To0
- A FiveThirtyEight article https://fivethirtyeight.com/methodology/how-our-nfl-predictions-work/
