# Introduction

In this notebook, we present data analysis on Chatbot Arena data collected from https://arena.lmsys.org between April 24, 2023 and April 9, 2024.

We explain different Elo calculation methods (online Elo and MLE Elo, also known as the Bradley-Terry model) for model ranking.

To view the latest leaderboard, see https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard.


In [98]:
from collections import defaultdict
import json, math
import numpy as np
import pandas as pd
import plotly.express as px
from tqdm import tqdm
import requests
pd.options.display.float_format = '{:.2f}'.format

# Obtaining and Cleaning the Tournament Data
We are hosting the initial tournament results as a JSON file. The cell below can download it with `requests` (that step is commented out here) and then loads it from a local copy, `results.json`. The data contains all the battles and voting results collected for ranking models.

In [99]:
# we use the latest data on 20240422 (we had two leaderboard updates on 4/22)
# url = "https://storage.googleapis.com/arena_external_data/public/clean_battle_20240422-2.json"
# response = requests.get(url)

# with open('results.json', 'wb') as file:
#     file.write(response.content)

# load the JSON data from the local file
with open('results.json', 'r') as file:
    battles = pd.read_json(file)

In [100]:
battles

Unnamed: 0,model_a,exid,result_a,info_a,model_b,result_b,info_b,winner
0,deepseek-base-1.3b,sample_0input.json,False,temp0.2,deepseek-base-1.3b,False,temp0.2,tie
1,deepseek-base-1.3b,sample_0input.json,False,temp0.2,deepseek-instruct-6.7b,True,temp0.2,model_b
2,deepseek-base-1.3b,sample_0input.json,False,temp0.2,deepseek-instruct-1.3b,False,temp0.2,tie
3,deepseek-base-1.3b,sample_0input.json,False,temp0.2,codellama-python-13b,True,temp0.2,model_b
4,deepseek-base-1.3b,sample_0input.json,False,temp0.2,deepseek-base-6.7b,True,temp0.2,model_b
...,...,...,...,...,...,...,...,...
1959995,gpt-4-turbo-2024-04-09+cot,sample_799output.json,False,temp0.2,codellama-python-7b,False,temp0.2,tie
1959996,gpt-4-turbo-2024-04-09+cot,sample_799output.json,False,temp0.2,claude-3-opus-20240229,False,temp0.2,tie
1959997,gpt-4-turbo-2024-04-09+cot,sample_799output.json,False,temp0.2,claude-3-opus-20240229+cot,False,temp0.2,tie
1959998,gpt-4-turbo-2024-04-09+cot,sample_799output.json,False,temp0.2,gpt-4-turbo-2024-04-09,False,temp0.2,tie


In [101]:
# battles = battles[battles["anony"] == True]
# print(len(battles))

# Exploratory Analysis

Before computing the Elo ratings, we first conduct some basic exploratory analysis to highlight a few key properties and caveats of this data.

## Statistics

We allowed the user to declare a tie between the pair of models. To collect additional data, later in the tournament we also allowed the user to declare a tie in which both models were bad. A significant portion of the outcomes were ties.
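
As a quick way to quantify that share, the outcome counts can be normalized. The sketch below runs on a small made-up frame, `toy`, that only mimics the `winner` column of `battles`; its values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data using the same "winner" labels as `battles`.
toy = pd.DataFrame(
    {"winner": ["model_a", "tie", "model_b", "tie", "tie (bothbad)", "model_a"]}
)

# Share of each outcome; normalize=True returns fractions instead of counts.
outcome_share = toy["winner"].value_counts(normalize=True)

# Any label containing "tie" counts as a tied outcome.
tie_share = toy["winner"].str.contains("tie").mean()

print(outcome_share)
print(f"tie share: {tie_share:.2f}")
```

Matching on the substring "tie" treats both kinds of tie as one group, which is the same criterion the notebook uses when it builds `battles_no_ties`.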

In [102]:
fig = px.bar(battles["winner"].value_counts(),
             title="Counts of Battle Outcomes", text_auto=True, height=400)
fig.update_layout(xaxis_title="Battle Outcome", yaxis_title="Count",
                  showlegend=False)
fig

In [103]:
battles_no_ties = battles[~battles["winner"].str.contains("tie")]

## Non-uniform Model Frequency

The model frequency is not uniform for the following reasons:
- Several different matching and sampling algorithms were used. We employed uniform sampling as well as weighted sampling methods, which assign greater weights to better models.
- Some new models were added later.


In [104]:
fig = px.bar(pd.concat([battles["model_a"], battles["model_b"]]).value_counts(),
             title="Battle Count for Each Model", text_auto=True)
fig.update_layout(xaxis_title="model", yaxis_title="Battle Count", height=400,
                  showlegend=False)
fig

We examine the number of pairings for each combination of models.

In [105]:
def visualize_battle_count(battles, title, show_num_models=30):
    ptbl = pd.pivot_table(battles, index="model_a", columns="model_b", aggfunc="size",
                          fill_value=0)
    battle_counts = ptbl + ptbl.T
    ordering = battle_counts.sum().sort_values(ascending=False).index
    ordering = ordering[:show_num_models]
    fig = px.imshow(battle_counts.loc[ordering, ordering],
                    title=title, text_auto=True)
    fig.update_layout(xaxis_title="Model B",
                      yaxis_title="Model A",
                      xaxis_side="top", height=800, width=800,
                      title_y=0.07, title_x=0.5,
                      font=dict(size=10))
    fig.update_traces(hovertemplate=
                      "Model A: %{y}<br>Model B: %{x}<br>Count: %{z}<extra></extra>")
    return fig

fig = visualize_battle_count(battles, title="Battle Count of Each Combination of Models", show_num_models=30)
fig

### Battles Excluding Ties

In [106]:
visualize_battle_count(battles_no_ties, "Battle Count for Each Combination of Models (without Ties)")

### Counting Ties

In [107]:
visualize_battle_count(battles[battles['winner'].str.contains("tie")], "Tie Count for Each Combination of Models")

## Inferred Language

We also inferred the language of each conversation using the `polyglot` package. This is only an estimate, but it will help guide future analysis. The vast majority of conversations were in English.

topk = 15
fig = px.bar(battles["language"].value_counts().head(topk),
             title=f"Battle Counts for the Top {topk} Languages",
             text_auto=True, height=400)
fig.update_layout(xaxis_title="Language", yaxis_title="Count", showlegend=False)
fig

## Number of Conversation Turns

We also noticed that most conversations have only one turn.
fig = px.histogram(battles["turn"],
             title=f"Number of Conversation Turns",
             text_auto=True, height=400, log_y=True)
fig.update_layout(xaxis_title="Turns", yaxis_title="Count", showlegend=False)
fig

## Pairwise Win Fractions

Finally, we compute the pairwise win fractions. Because each model can play as either Model A or Model B and win in both positions, we add up its wins in both configurations and divide by the total number of pairings between the two models.
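
To make the bookkeeping concrete, here is a minimal sketch on a made-up four-battle frame. The model names `x` and `y`, the helper `count_pairs`, and the use of `pd.crosstab` with `reindex` (to keep the toy tables square) are assumptions for illustration; the notebook's own implementation uses `pd.pivot_table`.

```python
import pandas as pd

# Hypothetical toy battles; model names and outcomes are made up.
toy = pd.DataFrame({
    "model_a": ["x", "x", "y", "y"],
    "model_b": ["y", "y", "x", "x"],
    "winner":  ["model_a", "model_b", "model_b", "model_b"],
})
models = ["x", "y"]

def count_pairs(df):
    # Count (model_a, model_b) pairs and force a square table over all models.
    tbl = pd.crosstab(df["model_a"], df["model_b"])
    return tbl.reindex(index=models, columns=models, fill_value=0)

a_wins = count_pairs(toy[toy["winner"] == "model_a"])  # wins from seat A
b_wins = count_pairs(toy[toy["winner"] == "model_b"])  # wins from seat B
n_pairs = count_pairs(toy)                             # all A-B pairings

# Wins of the row model over the column model, from either seat.
wins = a_wins + b_wins.T
win_fraction = wins / (n_pairs + n_pairs.T)
print(win_fraction)
```

In this toy data `x` beats `y` in three of the four battles, once as Model A and twice as Model B, so the `x` row shows 0.75 against `y`. The diagonal is NaN because a model never plays itself here.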

In [108]:
def compute_pairwise_win_fraction(battles, max_num_models=30):
    # Times each model wins as Model A
    a_win_ptbl = pd.pivot_table(
        battles[battles['winner'] == "model_a"],
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)
    # Table counting times each model wins as Model B
    b_win_ptbl = pd.pivot_table(
        battles[battles['winner'] == "model_b"],
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Table counting number of A-B pairs
    num_battles_ptbl = pd.pivot_table(battles,
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Computing the proportion of wins for each model as A and as B
    # against all other models
    row_beats_col_freq = (
        (a_win_ptbl + b_win_ptbl.T) /
        (num_battles_ptbl + num_battles_ptbl.T)
    )
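    wins = a_win_ptbl + b_win_ptbl.T
    # McNemar-style statistic for each pair: squared difference in
    # head-to-head wins divided by the number of decisive (non-tied) battles
    mcnemar = (wins - wins.T)**2 / (wins + wins.T)
    display(mcnemar)

    # Arrange ordering according to proportion of wins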
    prop_wins = row_beats_col_freq.mean(axis=1).sort_values(ascending=False)
    prop_wins = prop_wins[:max_num_models]
    model_names = list(prop_wins.keys())
    row_beats_col = row_beats_col_freq.loc[model_names, model_names]
    return row_beats_col, mcnemar

def visualize_pairwise_win_fraction(battles, title, max_num_models=30):
    row_beats_col, mcnemar = compute_pairwise_win_fraction(battles, max_num_models)
    fig = px.imshow(row_beats_col, color_continuous_scale='RdBu',
                    text_auto=".2f", title=title)
    fig.update_layout(xaxis_title="Model B: Loser",
                  yaxis_title="Model A: Winner",
                  xaxis_side="top", height=900, width=900,
                  title_y=0.07, title_x=0.5)
    fig.update_traces(customdata=mcnemar, hovertemplate=
                  "Model A: %{y}<br>Model B: %{x}<br>Fraction of A Wins: %{z}<extra> McNemar val: %{customdata} </extra>")

    return fig

In [141]:
fig_actualwin = visualize_pairwise_win_fraction(battles_no_ties,
      title = "Fraction of Model A Wins for All Non-tied A vs. B Battles")
fig_actualwin
# battles_no_ties

# compute_pairwise_win_fraction(battles_no_ties, max_num_models=80)


model_b,claude-3-opus-20240229,claude-3-opus-20240229+cot,codellama-13b,codellama-13b+cot,codellama-34b,codellama-34b+cot,codellama-7b,codellama-7b+cot,codellama-python-13b,codellama-python-34b,...,mistral-7b,mixtral-8x7b,phi-1,phi-1.5,phi-2,phind,starcoderbase-16b,starcoderbase-7b,wizard-13b,wizard-34b
model_a,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
claude-3-opus-20240229,,240.29,461.11,429.14,386.36,298.81,663.32,647.63,602.81,481.18,...,731.87,553.09,1353.61,973.97,787.74,435.67,789.59,847.48,621.0,478.76
claude-3-opus-20240229+cot,240.29,,912.84,889.11,814.15,737.54,1128.55,1115.36,1067.08,917.28,...,1189.58,1017.8,1808.19,1454.75,1271.74,858.25,1257.63,1295.66,1079.97,927.91
codellama-13b,461.11,912.84,,0.01,21.29,28.83,68.69,59.25,17.86,1.51,...,86.23,6.03,578.94,295.32,135.06,4.28,140.59,191.61,19.09,0.79
codellama-13b+cot,429.14,889.11,0.01,,15.37,33.05,45.45,69.69,11.63,1.28,...,63.36,4.04,529.82,250.16,102.25,3.55,100.02,134.1,12.79,0.77
codellama-34b,386.36,814.15,21.29,15.37,,2.41,133.41,124.7,70.69,13.89,...,161.65,45.92,685.17,366.37,209.64,7.98,219.86,279.62,71.94,13.56
codellama-34b+cot,298.81,737.54,28.83,33.05,2.41,,136.22,168.93,76.47,19.14,...,163.93,49.98,693.16,382.96,219.84,12.97,209.21,264.73,76.1,20.46
codellama-7b,663.32,1128.55,68.69,45.45,133.41,136.22,,0.72,17.17,74.6,...,2.53,30.43,375.81,131.2,19.12,84.02,17.91,42.47,12.9,65.99
codellama-7b+cot,647.63,1115.36,59.25,69.69,124.7,168.93,0.72,,18.59,68.91,...,0.23,27.67,286.06,91.39,8.64,79.67,7.37,19.14,14.56,62.13
codellama-python-13b,602.81,1067.08,17.86,11.63,70.69,76.47,17.17,18.59,,25.47,...,27.68,2.14,451.71,188.46,60.42,31.62,61.15,100.49,0.24,21.12
codellama-python-34b,481.18,917.28,1.51,1.28,13.89,19.14,74.6,68.91,25.47,,...,98.38,12.52,622.18,284.06,138.24,0.78,141.04,183.04,27.77,0.15


## Preliminary Ranking

Using just the average win rate against all other models we can already compute an estimated leaderboard.
However, this method may not be as scalable as the Elo rating system that we will use later because this method requires data from all model combinations.

In [110]:
# compute_pairwise_win_fraction returns (win fractions, McNemar table);
# only the win fractions are needed here
row_beats_col_freq, _ = compute_pairwise_win_fraction(battles_no_ties)
fig = px.bar(row_beats_col_freq.mean(axis=1).sort_values(ascending=False),
             title="Average Win Rate Against All Other Models (Assuming Uniform Sampling and No Ties)",
             text_auto=".2f")
fig.update_layout(yaxis_title="Average Win Rate", xaxis_title="Model",
                  showlegend=False)
fig

# Elo Ratings

The [Elo rating system](https://en.wikipedia.org/wiki/Elo_rating_system) is a method for calculating the relative skill levels of players, which has been widely adopted in chess and other competitive games. The difference in the ratings between two players serves as a predictor of the outcome of a match. The Elo rating system works well for our case because we have multiple models and we run pairwise battles between them.
In this section, we present different methods for calculating Elo ratings.

### Compute Ratings
We first use the online linear update algorithm to compute Elo ratings.
We choose a small K-factor of 4 to make the Elo ratings more stable and less biased towards recent games.
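
As a worked example of a single update step, the sketch below applies the rule to two equally rated models. The helper names `expected_score` and `elo_update` are hypothetical and exist only for this illustration; the defaults mirror the K-factor, scale, and base described above.

```python
def expected_score(r_a, r_b, scale=400, base=10):
    # Probability that A beats B under the Elo model.
    return 1 / (1 + base ** ((r_b - r_a) / scale))

def elo_update(r_a, r_b, score_a, k=4):
    # score_a is 1 for an A win, 0 for a B win, and 0.5 for a tie.
    e_a = expected_score(r_a, r_b)
    e_b = expected_score(r_b, r_a)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - e_b)

# Two equally rated models: A wins, so A gains K/2 = 2 points and B loses 2.
new_a, new_b = elo_update(1000, 1000, score_a=1)
print(new_a, new_b)  # 1002.0 998.0
```

With K = 4, a single game can move a rating by at most 4 points, which is why a small K-factor damps the influence of any individual battle.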

In [None]:
def compute_online_elo(battles, K=4, SCALE=400, BASE=10, INIT_RATING=1000):
    rating = defaultdict(lambda: INIT_RATING)

    for rd, model_a, model_b, winner in battles[['model_a', 'model_b', 'winner']].itertuples():
        ra = rating[model_a]
        rb = rating[model_b]
        ea = 1 / (1 + BASE ** ((rb - ra) / SCALE))
        eb = 1 / (1 + BASE ** ((ra - rb) / SCALE))
        if winner == "model_a":
            sa = 1
        elif winner == "model_b":
            sa = 0
        elif winner == "tie" or winner == "tie (bothbad)":
            sa = 0.5
        else:
            raise Exception(f"unexpected vote {winner}")
        rating[model_a] += K * (sa - ea)
        rating[model_b] += K * (1 - sa - eb)

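    # calibrate llama-13b to 800
    # NOTE: `rating` is a defaultdict, so if "llama-13b" never appears in
    # `battles`, this lookup creates it at INIT_RATING and it is not shifted below
    delta = (800-rating["llama-13b"])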
    for model in battles["model_a"].unique():
        rating[model] += delta

    return rating

In [None]:
def preety_print_model_ratings(ratings):
    df = pd.DataFrame([
        [n, ratings[n]] for n in ratings.keys()
    ], columns=["Model", "Elo rating"]).sort_values("Elo rating", ascending=False).reset_index(drop=True)
    # df["Elo rating"] = (df["Elo rating"] + 0.5).astype(int)
    df.index = df.index + 1
    return df

online_elo_ratings = compute_online_elo(battles)
preety_print_model_ratings(online_elo_ratings)

Unnamed: 0,Model,Elo rating
1,gpt-4-0613+cot,1020.29
2,llama-13b,1000.0
3,gpt-4-turbo-2024-04-09+cot,904.71
4,gpt-4-turbo-2024-04-09,900.99
5,claude-3-opus-20240229+cot,898.7
6,claude-3-opus-20240229,896.46
7,gpt-4-0613,844.4
8,wizard-13b,837.82
9,gpt-3.5-turbo-0613,837.82
10,deepseek-base-6.7b,836.78


However, even with a small K-factor, we still found this online update algorithm to be unstable.

To demonstrate this, we recompute the Elo ratings with the game order reversed and observe significant differences, because the online update weights recent games more heavily.
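
The order dependence can be reproduced on a three-game toy sequence. This is a sketch under stated assumptions: the function `online_elo`, the model names, and the games are made up, and K = 32 is chosen deliberately large to exaggerate the effect (the notebook itself uses K = 4).

```python
from collections import defaultdict

def online_elo(games, k=32, scale=400, base=10, init=1000):
    # games is a list of (model_a, model_b, score_a), score_a in {1, 0.5, 0}.
    rating = defaultdict(lambda: init)
    for a, b, score_a in games:
        e_a = 1 / (1 + base ** ((rating[b] - rating[a]) / scale))
        delta = k * (score_a - e_a)
        rating[a] += delta  # A gains what B loses, so the total is conserved
        rating[b] -= delta
    return dict(rating)

# Hypothetical sequence: "a" wins twice, then loses once.
games = [("a", "b", 1), ("a", "b", 1), ("a", "b", 0)]

forward = online_elo(games)
backward = online_elo(games[::-1])
print(forward["a"], backward["a"])
```

Both runs see exactly the same set of results, yet `a` ends with a different rating in each, because the size of every update depends on the ratings at the moment the game is processed.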

In [70]:
def preety_print_two_ratings(ratings_1, ratings_2, column_names):
    df = pd.DataFrame([
        [n, ratings_1[n], ratings_2[n]] for n in ratings_1.keys()
    ], columns=["Model", column_names[0], column_names[1]]).sort_values(column_names[0], ascending=False).reset_index(drop=True)
    df[column_names[0]] = (df[column_names[0]] + 0.5).astype(int)
    df[column_names[1]] = (df[column_names[1]] + 0.5).astype(int)
    df.index = df.index + 1
    return df

online_elo_ratings_reverse = compute_online_elo(battles.iloc[::-1])
preety_print_two_ratings(online_elo_ratings,
                         online_elo_ratings_reverse,
                         column_names=["Elo rating", "Elo rating with reverse order"])

Unnamed: 0,Model,Elo rating,Elo rating with reverse order
1,gpt-4-0613+cot,1020,935
2,llama-13b,1000,1000
3,gpt-4-turbo-2024-04-09+cot,905,978
4,gpt-4-turbo-2024-04-09,901,956
5,claude-3-opus-20240229+cot,899,910
6,claude-3-opus-20240229,896,890
7,gpt-4-0613,844,918
8,wizard-13b,838,817
9,gpt-3.5-turbo-0613,838,822
10,deepseek-base-6.7b,837,920



### Maximum Likelihood Estimation for Elo Ratings (aka [Bradley-Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model))

In the context of LLM evaluation, models can be assumed to be static. In this case, we can fit the ratings directly by maximum likelihood estimation (aka the Bradley-Terry model), which produces significantly more stable ratings. Here we provide an implementation with logistic regression.
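
For reference, the model being fitted can be sketched as follows. This is the standard Bradley-Terry formulation written with the parameter names of `compute_mle_elo` (`SCALE`, `BASE`, `INIT_RATING`); the symbols $R_i$, $w_{ij}$, and $\beta_i$ are introduced here for illustration only.

$$
P(i \text{ beats } j) = \frac{1}{1 + \mathrm{BASE}^{(R_j - R_i)/\mathrm{SCALE}}},
\qquad
\hat{R} = \arg\max_{R} \sum_{i \neq j} w_{ij} \, \log P(i \text{ beats } j)
$$

Here $w_{ij}$ is the weighted number of wins of model $i$ over model $j$. Substituting $R_i = \mathrm{SCALE} \cdot \beta_i + \mathrm{INIT\_RATING}$ gives $P(i \text{ beats } j) = \sigma\big(\log(\mathrm{BASE})\,(\beta_i - \beta_j)\big)$ with $\sigma$ the logistic function, so the fit reduces to a logistic regression without intercept whose feature vector holds $+\log(\mathrm{BASE})$ for model $i$ and $-\log(\mathrm{BASE})$ for model $j$. Because only rating differences enter the likelihood, the ratings are determined up to a common additive shift.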

In [137]:
def compute_mle_elo(
    df, SCALE=400, BASE=10, INIT_RATING=1000
):
    from sklearn.linear_model import LogisticRegression
    ptbl_a_win = pd.pivot_table(
        df[df["winner"] == "model_a"],
        index="model_a",
        columns="model_b",
        aggfunc="size",
        fill_value=0,
    )
    # if no tie, create a zero matrix
    if sum(df["winner"].isin(["tie", "tie (bothbad)"])) == 0:
        ptbl_tie = pd.DataFrame(0, index=ptbl_a_win.index, columns=ptbl_a_win.columns)
    else:
        ptbl_tie = pd.pivot_table(
            df[df["winner"].isin(["tie", "tie (bothbad)"])],
            index="model_a",
            columns="model_b",
            aggfunc="size",
            fill_value=0,
        )
        ptbl_tie = ptbl_tie + ptbl_tie.T

    ptbl_b_win = pd.pivot_table(
        df[df["winner"] == "model_b"],
        index="model_a",
        columns="model_b",
        aggfunc="size",
        fill_value=0,
    )
    # display(ptbl_win)
    ptbl_win = ptbl_a_win * 2 + ptbl_b_win.T * 2 + ptbl_tie
    # display(ptbl_tie)

    models = pd.Series(np.arange(len(ptbl_win.index)), index=ptbl_win.index)

    p = len(models)
    X = np.zeros([p * (p - 1) * 2, p])
    Y = np.zeros(p * (p - 1) * 2)

    cur_row = 0
    sample_weights = []
    for m_a in ptbl_win.index:
        for m_b in ptbl_win.columns:
            if m_a == m_b:
                continue
            # if nan skip
            if math.isnan(ptbl_win.loc[m_a, m_b]) or math.isnan(ptbl_win.loc[m_b, m_a]):
                continue
            X[cur_row, models[m_a]] = +math.log(BASE)
            X[cur_row, models[m_b]] = -math.log(BASE)
            Y[cur_row] = 1.0
            sample_weights.append(ptbl_win.loc[m_a, m_b])

            X[cur_row + 1, models[m_a]] = math.log(BASE)
            X[cur_row + 1, models[m_b]] = -math.log(BASE)
            Y[cur_row + 1] = 0.0
            sample_weights.append(ptbl_win.loc[m_b, m_a])
            cur_row += 2
    X = X[:cur_row]
    Y = Y[:cur_row]
    # fig = px.imshow(X)
    # display(fig)
    # print(Y, Y.shape)

    lr = LogisticRegression(fit_intercept=False, penalty=None, tol=1e-6)
    lr.fit(X, Y, sample_weight=sample_weights)
    elo_scores = SCALE * lr.coef_[0] + INIT_RATING
    if "gpt-3.5" in models.index:
        elo_scores += 1000 - elo_scores[models["gpt-3.5"]]
    return pd.Series(elo_scores, index=models.index).sort_values(ascending=False)

In [138]:
elo_mle_ratings = compute_mle_elo(battles_no_ties)
preety_print_model_ratings(elo_mle_ratings)

Unnamed: 0,Model,Elo rating
1,gpt-4-turbo-2024-04-09+cot,1385.84
2,claude-3-opus-20240229+cot,1376.73
3,gpt-4-0613+cot,1367.64
4,gpt-4-0613,1292.68
5,gpt-4-turbo-2024-04-09,1284.17
6,claude-3-opus-20240229,1230.21
7,gpt-3.5-turbo-0613+cot,1084.78
8,gpt-3.5-turbo-0613,1056.52
9,deepseek-instruct-33b,1051.87
10,codetulu-2-34b,1046.44


### Compute Bootstrap Confidence Intervals for MLE Elo Scores

We can further use bootstrap to estimate the confidence intervals as well.
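
The idea can be sketched on a much simpler statistic than an Elo rating. In the example below, the frame `toy`, the helper `bootstrap_stat`, and the win/loss values are all hypothetical; the only thing it shares with the notebook is resampling the rows with replacement and reading percentiles off the resulting distribution.

```python
import pandas as pd

# Hypothetical toy outcomes: 1 means the model of interest won the battle.
toy = pd.DataFrame({"win": [1] * 60 + [0] * 40})

def bootstrap_stat(df, stat, num_round, seed=0):
    # Resample all rows with replacement and recompute the statistic each round.
    values = []
    for i in range(num_round):
        resampled = df.sample(frac=1.0, replace=True, random_state=seed + i)
        values.append(stat(resampled))
    return pd.Series(values)

win_rates = bootstrap_stat(toy, lambda d: d["win"].mean(), num_round=200)

# Percentile interval: the middle 95% of the bootstrap distribution.
lower, median, upper = win_rates.quantile([0.025, 0.5, 0.975])
print(f"win rate {median:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

A percentile interval like this is only as good as the number of rounds behind it, so a handful of rounds gives a rough indication at best.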


In [66]:
def get_bootstrap_result(battles, func_compute_elo, num_round):
    rows = []
    for i in tqdm(range(num_round), desc="bootstrap"):
        rows.append(func_compute_elo(battles.sample(frac=1.0, replace=True)))
    df = pd.DataFrame(rows)
    return df[df.median().sort_values(ascending=False).index]


In [67]:
groups = battles.groupby(by='exid')
samples = groups.sample(frac=0.1, replace=True)
print(groups)
print(samples.describe())


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe08c2d40a0>
                     model_a                exid result_a   info_a  \
count                 697600              697600   697600   697600   
unique                    35                1600        2        2   
top     codellama-python-34b  sample_0input.json    False  temp0.2   
freq                   21467                 436   390822   369996   

           model_b result_b   info_b  winner  
count       697600   697600   697600  697600  
unique          35        2        2       3  
top     wizard-13b    False  temp0.2     tie  
freq         21454   390513   370315  498883  


In [116]:
BOOTSTRAP_ROUNDS = 3

np.random.seed(42)
bootstrap_elo_lu = get_bootstrap_result(battles_no_ties, compute_mle_elo, BOOTSTRAP_ROUNDS)

bootstrap: 100%|██████████| 3/3 [00:05<00:00,  1.92s/it]


In [117]:
def visualize_bootstrap_scores(df, title):
    bars = pd.DataFrame(dict(
        lower = df.quantile(.025),
        rating = df.quantile(.5),
        upper = df.quantile(.975))).reset_index(names="model").sort_values("rating", ascending=False)
    bars['error_y'] = bars['upper'] - bars["rating"]
    bars['error_y_minus'] = bars['rating'] - bars["lower"]
    bars['rating_rounded'] = np.round(bars['rating'], 2)
    fig = px.scatter(bars, x="model", y="rating", error_y="error_y",
                     error_y_minus="error_y_minus", text="rating_rounded",
                     title=title)
    fig.update_layout(xaxis_title="Model", yaxis_title="Rating",
                      height=600)
    return fig

fig = visualize_bootstrap_scores(bootstrap_elo_lu, "Bootstrap of MLE Elo Rating Estimates CRUXEval")
fig.write_html('main_intervals.html')
fig

Previously, we applied bootstrapping to the online Elo to obtain more stable ratings.

In [None]:
np.random.seed(42)
bootstrap_online_elo = get_bootstrap_result(battles, compute_online_elo, BOOTSTRAP_ROUNDS)

bootstrap: 100%|██████████| 1000/1000 [44:41<00:00,  2.68s/it]
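The resampling idea behind a bootstrap of this kind can be sketched on toy data. This is not the notebook's `get_bootstrap_result` (which is defined earlier); the names `toy_win_rate` and `bootstrap_statistic` and the toy battles are hypothetical, and a simple win fraction stands in for the Elo computation.

```python
import numpy as np
import pandas as pd

def toy_win_rate(battles):
    """Hypothetical statistic: fraction of its battles that each model won."""
    wins = pd.concat([
        battles.loc[battles["winner"] == "model_a", "model_a"],
        battles.loc[battles["winner"] == "model_b", "model_b"],
    ]).value_counts()
    games = pd.concat([battles["model_a"], battles["model_b"]]).value_counts()
    return (wins.reindex(games.index, fill_value=0) / games).to_dict()

def bootstrap_statistic(battles, func, num_round, seed=0):
    """Resample the battles with replacement and recompute func each round."""
    rng = np.random.default_rng(seed)
    rows = []
    for _ in range(num_round):
        round_seed = int(rng.integers(0, 2**31 - 1))
        resampled = battles.sample(frac=1.0, replace=True, random_state=round_seed)
        rows.append(func(resampled))
    # One row per bootstrap round, one column per model.
    return pd.DataFrame(rows)

# Toy battles (assumed data, for illustration only).
toy = pd.DataFrame({
    "model_a": ["x", "x", "y", "x", "y", "z"],
    "model_b": ["y", "z", "z", "y", "z", "x"],
    "winner": ["model_a", "model_a", "model_b", "tie", "model_a", "model_b"],
})
boot = bootstrap_statistic(toy, toy_win_rate, num_round=200)
print(boot.quantile([0.025, 0.5, 0.975]))
```

Each row of `boot` is one resampled estimate, so column quantiles give a median and a percentile interval per model, which is the same shape of result that `visualize_bootstrap_scores` consumes.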


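The bootstrap medians from the two methods rank the top models identically. In the table below, the online Elo medians sit about 200 points below the MLE Elo medians, a near-constant offset that does not affect the predicted win rates, because those depend only on rating differences.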

In [None]:
preety_print_two_ratings(bootstrap_elo_lu.quantile(.5),
                         bootstrap_online_elo.quantile(.5),
                         column_names=["Bootstrap Median of MLE Elo", "Bootstrap Median of Online Elo"])

Unnamed: 0,Model,Bootstrap Median of MLE Elo,Bootstrap Median of Online Elo
1,gpt-4-turbo-2024-04-09+cot_temp0.2,1138,938
2,claude-3-opus-20240229+cot_temp0.2,1137,937
3,gpt-4-0613+cot_temp0.2,1122,921
4,gpt-4-0613_temp0.2,1086,886
5,gpt-4-turbo-2024-04-09_temp0.2,1079,879
6,claude-3-opus-20240229_temp0.2,1073,873
7,gpt-3.5-turbo-0613+cot_temp0.2,1042,842
8,deepseek-instruct-33b_temp0.2,1017,817
9,gpt-3.5-turbo-0613_temp0.2,1016,816
10,deepseek-base-33b_temp0.2,1015,815
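As a quick check on how the two columns relate, the sketch below copies the ten rows of the table above into two Series and compares them. The variable names are illustrative; only the numbers come from the table.

```python
import pandas as pd

models = [
    "gpt-4-turbo-2024-04-09+cot_temp0.2",
    "claude-3-opus-20240229+cot_temp0.2",
    "gpt-4-0613+cot_temp0.2",
    "gpt-4-0613_temp0.2",
    "gpt-4-turbo-2024-04-09_temp0.2",
    "claude-3-opus-20240229_temp0.2",
    "gpt-3.5-turbo-0613+cot_temp0.2",
    "deepseek-instruct-33b_temp0.2",
    "gpt-3.5-turbo-0613_temp0.2",
    "deepseek-base-33b_temp0.2",
]
mle = pd.Series([1138, 1137, 1122, 1086, 1079, 1073, 1042, 1017, 1016, 1015], index=models)
online = pd.Series([938, 937, 921, 886, 879, 873, 842, 817, 816, 815], index=models)

# Per-model gap between the two medians.
offset = mle - online

# Do both methods put the models in the same order?
same_order = (list(mle.sort_values(ascending=False).index)
              == list(online.sort_values(ascending=False).index))

print(offset.min(), offset.max(), same_order)  # 200 201 True
```

For these ten rows the gap stays between 200 and 201 points and the ordering is identical, so the two methods differ here mainly by a shift of the rating scale.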


However, the confidence intervals of the online Elo are significantly wider than those of the MLE Elo.

In [None]:
fig = visualize_bootstrap_scores(bootstrap_online_elo, "Bootstrap of Online Elo Rating Estimates")
fig
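To compare interval widths numerically rather than by eye, one option is to subtract the 2.5% quantile from the 97.5% quantile for each model. The sketch below uses synthetic bootstrap frames as stand-ins; `interval_width`, `tight`, and `wide` are hypothetical names, and the same helper could be applied to `bootstrap_elo_lu` and `bootstrap_online_elo`.

```python
import numpy as np
import pandas as pd

def interval_width(bootstrap_df, lower=0.025, upper=0.975):
    """Width of the percentile interval for each model (column)."""
    return bootstrap_df.quantile(upper) - bootstrap_df.quantile(lower)

# Synthetic stand-ins for two bootstrap results (assumed data).
rng = np.random.default_rng(0)
tight = pd.DataFrame(rng.normal(1000, 5, size=(500, 2)), columns=["m1", "m2"])
wide = pd.DataFrame(rng.normal(1000, 20, size=(500, 2)), columns=["m1", "m2"])

widths = pd.DataFrame({
    "tight": interval_width(tight),
    "wide": interval_width(wide),
})
print(widths)
```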

### Predict Win Rates
Elo ratings let us predict the win probability for any pair of models. Comparing the predicted win rates with the actual win rates indicates how well the Elo rating system fits the data.
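For a single pair, the prediction is the standard Elo expected score with scale 400 and base 10. The helper name `expected_score` below is illustrative and the ratings are made-up round numbers.

```python
def expected_score(rating_a, rating_b, scale=400, base=10):
    """Predicted probability that A beats B under the Elo model."""
    return 1 / (1 + base ** ((rating_b - rating_a) / scale))

print(round(expected_score(1000, 1000), 3))  # 0.5
print(round(expected_score(1200, 1000), 3))  # 0.76
print(round(expected_score(1400, 1000), 3))  # 0.909
```

Only the rating difference matters: a 200-point lead corresponds to roughly a 76% predicted win rate, and the two directions of a pairing always sum to 1.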






In [None]:
def predict_win_rate(elo_ratings, SCALE=400, BASE=10, INIT_RATING=1000):
    names = sorted(list(elo_ratings.keys()))
    wins = defaultdict(lambda: defaultdict(lambda: 0))
    for a in names:
        for b in names:
            ea = 1 / (1 + BASE ** ((elo_ratings[b] - elo_ratings[a]) / SCALE))
            wins[a][b] = ea
            wins[b][a] = 1 - ea

    data = {
        a: [wins[a][b] if a != b else np.nan for b in names]
        for a in names
    }

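    df = pd.DataFrame(data, index=names)
    # Columns are keyed by model A and rows by model B, so name them that way;
    # after the transpose, rows are model_a and columns are model_b.
    df.index.name = "model_b"
    df.columns.name = "model_a"
    return df.T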

In [142]:
win_rate = predict_win_rate(dict(bootstrap_elo_lu.quantile(0.5)))
ordered_models = win_rate.mean(axis=1).sort_values(ascending=False).index
ordered_models = ordered_models[:30]
fig = px.imshow(win_rate.loc[ordered_models, ordered_models],
                color_continuous_scale='RdBu', text_auto=".2f",
                title="Predicted Win Rate Using Elo Ratings for Model A in an A vs. B Battle")
fig.update_layout(xaxis_title="Model B",
                  yaxis_title="Model A",
                  xaxis_side="top", height=900, width=900,
                  title_y=0.07, title_x=0.5)
fig.update_traces(hovertemplate=
                  "Model A: %{y}<br>Model B: %{x}<br>Win Rate: %{z}<extra></extra>")
display(fig)
display(fig_actualwin)


### Compute Bootstrap Confidence Intervals Assuming Uniform Sampling

We also study how the ratings change when we sample an equal number of battles for each model pair.

In [143]:
def sample_battle_even(battles, n_per_battle):
    groups = battles.groupby(["model_a", "model_b"], as_index=False)
    resampled = (groups
                 .apply(lambda grp: grp.sample(n_per_battle, replace=True))
                 .reset_index(drop=True))
    return resampled
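The effect of even sampling is easiest to see on an imbalanced toy dataset. The sketch below uses pandas' `DataFrameGroupBy.sample` as an alternative to the `apply`-based helper above; the toy battles and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy battles (assumed data): pair (x, y) has 6 battles, pair (x, z) only 2.
toy = pd.DataFrame({
    "model_a": ["x"] * 6 + ["x"] * 2,
    "model_b": ["y"] * 6 + ["z"] * 2,
    "winner": ["model_a", "model_b", "tie", "model_a",
               "model_a", "model_b", "model_b", "tie"],
})

# Draw the same number of battles from every (model_a, model_b) pair.
even = (toy.groupby(["model_a", "model_b"])
           .sample(n=4, replace=True, random_state=0))

counts = even.groupby(["model_a", "model_b"]).size()
print(counts)
```

Because sampling is done with replacement, a pair with fewer battles than the requested count is upsampled with repeated rows, while a pair with more battles is downsampled.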

In [144]:
num_samples = 800
battles_even = sample_battle_even(battles, num_samples)
pd.pivot_table(battles_even, index="model_a", columns="model_b", aggfunc="size", fill_value=0)

model_a \ model_b,claude-3-opus-20240229,claude-3-opus-20240229+cot,codellama-13b,codellama-13b+cot,codellama-34b,codellama-34b+cot,codellama-7b,codellama-7b+cot,codellama-python-13b,codellama-python-34b,...,mistral-7b,mixtral-8x7b,phi-1,phi-1.5,phi-2,phind,starcoderbase-16b,starcoderbase-7b,wizard-13b,wizard-34b
claude-3-opus-20240229,800,800,800,800,800,800,800,800,800,800,...,800,800,800,800,800,800,800,800,800,800
claude-3-opus-20240229+cot,800,800,800,800,800,800,800,800,800,800,...,800,800,800,800,800,800,800,800,800,800
codellama-13b,800,800,800,800,800,800,800,800,800,800,...,800,800,800,800,800,800,800,800,800,800
codellama-13b+cot,800,800,800,800,800,800,800,800,800,800,...,800,800,800,800,800,800,800,800,800,800
codellama-34b,800,800,800,800,800,800,800,800,800,800,...,800,800,800,800,800,800,800,800,800,800
codellama-34b+cot,800,800,800,800,800,800,800,800,800,800,...,800,800,800,800,800,800,800,800,800,800
codellama-7b,800,800,800,800,800,800,800,800,800,800,...,800,800,800,800,800,800,800,800,800,800
codellama-7b+cot,800,800,800,800,800,800,800,800,800,800,...,800,800,800,800,800,800,800,800,800,800
codellama-python-13b,800,800,800,800,800,800,800,800,800,800,...,800,800,800,800,800,800,800,800,800,800
codellama-python-34b,800,800,800,800,800,800,800,800,800,800,...,800,800,800,800,800,800,800,800,800,800


In [145]:
# Sampling Battles Evenly
def get_bootstrap_even_sample(battles, n_per_battle, func_compute_elo, num_round=BOOTSTRAP_ROUNDS):
    rows = []
    for n in tqdm(range(num_round), desc="sampling battles evenly"):
        resampled = sample_battle_even(battles, n_per_battle)
        rows.append(func_compute_elo(resampled))
    df = pd.DataFrame(rows)
    return df[df.median().sort_values(ascending=False).index]

In [146]:
print("number of samples per battle pair:", num_samples)
bootstrap_even_lu = get_bootstrap_even_sample(battles, num_samples, compute_mle_elo, num_round=100)

number of samples per battle pair: 800


sampling battles evenly: 100%|██████████| 100/100 [10:47<00:00,  6.47s/it]


In [147]:
fig = visualize_bootstrap_scores(bootstrap_even_lu, "Bootstrap of MLE Elo Estimates - Even sample")
fig

# Links



Some good resources to learn more about Elo rating systems:
- Elo rating system https://en.wikipedia.org/wiki/Elo_rating_system
- Bradley-Terry model https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model
- An introduction video https://www.youtube.com/watch?v=AsYfbmp0To0
- A FiveThirtyEight article https://fivethirtyeight.com/methodology/how-our-nfl-predictions-work/
