# LLMFAO: Large Language Model Feedback Analysis and Optimization

This is a minimalistic [large language model](https://en.wikipedia.org/wiki/Large_language_model) (LLM) leaderboard built from human and machine feedback on pairwise comparisons of model responses. It covers a carefully selected set of 13 prompts and 59 different models.

> When you see the outputs of two different systems for the specific query, you can determine the better one using a smaller instruction. I decided to give pairwise comparisons a try on the data kindly provided by the [llmonitor.com](https://llmonitor.com/) team.
>
> I asked carefully chosen crowd annotators to evaluate every pair to determine the winner. If both models performed similarly well or poorly, it’s a tie. Five different annotators evaluated every pair according to the [instruction](https://github.com/dustalov/llmfao/blob/master/crowd-instruction.md); there were 124 annotators in total. I also asked GPT-3.5 Turbo Instruct and GPT-4 to do the same using a shorter [evaluation prompt](https://github.com/dustalov/llmfao/blob/master/gpt-instruction.txt), but I subjectively found human performance to be superior.

A more detailed description of this study is available at <https://evalovernite.substack.com/p/llmfao-human-ranking>.

The datasets and code are available on GitHub at <https://github.com/dustalov/llmfao> under open-source licenses.

In [None]:
import json
import numpy as np
import pandas as pd
import plotly.express as px
from plotly.graph_objects import Figure
from gradio_client import Client

In [None]:
client = Client('https://dustalov-pair2rank.hf.space/')

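def pair2rank(path: str, client: Client = client) -> pd.DataFrame:
    """Rank the items in a CSV file of pairwise comparisons with the Bradley-Terry model."""
    # The first returned value is the path to a JSON file that holds the rankings table
    rankings, _ = client.predict(path, 'Bradley-Terry (1952)', False, False, 0)

    with open(rankings, 'rb') as f:
        rankings_json = json.load(f)

    # Rebuild the table from its JSON form and index it by item (model) name
    df = pd.DataFrame(data=rankings_json['data'], columns=rankings_json['headers'])
    df.set_index('item', inplace=True)

    return df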
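
For readers who want to see what a Bradley-Terry ranking involves without calling the remote service, here is a minimal local sketch. It is not the code behind `pair2rank`; it assumes the classic iterative (minorization-maximization) update, counts a tie as half a win for each side, and uses a hypothetical `bradley_terry` helper with a made-up 3×3 win matrix.

```python
import numpy as np


def bradley_terry(wins: np.ndarray, n_iter: int = 1000, tol: float = 1e-8) -> np.ndarray:
    """Estimate Bradley-Terry scores from a square matrix of win counts.

    wins[i, j] is how many times item i beat item j. Counting a tie as
    half a win for each side is an assumption of this sketch. Every item
    needs at least one comparison, and ideally at least one win.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]

    comparisons = wins + wins.T    # number of comparisons for each pair
    total_wins = wins.sum(axis=1)  # total number of wins for each item
    p = np.full(n, 1.0 / n)        # start from uniform scores

    for _ in range(n_iter):
        # Update rule: p_i <- W_i / sum_j [ n_ij / (p_i + p_j) ]
        denom = (comparisons / (p[:, np.newaxis] + p[np.newaxis, :])).sum(axis=1)
        p_new = total_wins / denom
        p_new /= p_new.sum()       # scores are defined only up to scale

        converged = np.abs(p_new - p).max() < tol
        p = p_new
        if converged:
            break

    return p


# Hypothetical data: A beat B 3 times and lost once, A beat C 4 times,
# B beat C 3 times and lost once.
wins = np.array([[0, 3, 4],
                 [1, 0, 3],
                 [0, 1, 0]])

scores = bradley_terry(wins)
print(scores)
```

The resulting scores sum to one and preserve the order A, B, C. Only their ratios matter, which is what the pairwise plot further down relies on.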

In [None]:
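def pairwise(df: pd.DataFrame, n: int = 7) -> Figure:
    """Plot pairwise win probabilities for the first n and last n items of the ranking."""
    scores = df['score'].to_numpy()

    # Bradley-Terry probability that the row item beats the column item: s_i / (s_i + s_j)
    df_pairwise = pd.DataFrame(data=scores[:, np.newaxis] / (scores + scores[:, np.newaxis]),
                               index=df.index, columns=df.index)

    # Keep the first n and last n rows; drop repeats if the table has fewer than 2n rows
    df = pd.concat((df.head(n), df.tail(n)))
    df = df[~df.index.duplicated(keep='last')]

    # Restrict both axes of the probability matrix to the selected items
    df_pairwise = df_pairwise.reindex(index=df.index, columns=df.index, copy=False)

    fig = px.imshow(df_pairwise, color_continuous_scale='RdBu', text_auto='.2f')
    fig.update_layout(xaxis_title='Loser', yaxis_title='Winner', xaxis_side='top')
    fig.update_traces(hovertemplate='Winner: %{y}<br>Loser: %{x}<br>Fraction of Wins: %{z}<extra></extra>')

    return fig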
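
The broadcasting expression inside `pairwise` is easier to follow on a tiny example. The sketch below uses three hypothetical items with made-up scores and builds the same matrix, where the cell in row *i* and column *j* is `s_i / (s_i + s_j)`.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for three items; any positive numbers would do
demo_scores = pd.Series({'A': 0.6, 'B': 0.3, 'C': 0.1}, name='score')
s = demo_scores.to_numpy()

# s[:, np.newaxis] is a column vector, s is a row vector, so the
# division broadcasts to a 3x3 matrix with entries s_i / (s_i + s_j)
prob = s[:, np.newaxis] / (s + s[:, np.newaxis])

df_prob = pd.DataFrame(prob, index=demo_scores.index, columns=demo_scores.index)
print(df_prob.round(2))
```

Each diagonal cell is 0.5 because an item is compared with itself, and every pair of mirrored cells sums to one. This is why the heatmap is symmetric around the neutral midpoint of the colour scale.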

## Human Judgements

In [None]:
df_crowd = pair2rank('crowd-comparisons.csv')
df_crowd

In [None]:
pairwise(df_crowd)

## Evaluation with GPT-4

In [None]:
df_gpt4 = pair2rank('gpt4-crowd-comparisons.csv')
df_gpt4

In [None]:
pairwise(df_gpt4)

## Evaluation with GPT-3.5 Turbo Instruct

In [None]:
df_gpt3 = pair2rank('gpt3-crowd-comparisons.csv')
df_gpt3

In [None]:
pairwise(df_gpt3)

## Correlations

In [None]:
df_ranks = pd.concat((df_crowd['rank'], df_gpt4['rank'], df_gpt3['rank']), axis=1)
df_ranks.columns = ['Humans', 'GPT-4', 'GPT-3.5']
df_ranks.corr(method='spearman')
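
As a reading aid for the table above, here is a small sketch of what Spearman's coefficient measures. The ranks are made up for five hypothetical items and are not results from this study. For ranks without ties, the coefficient reduces to `1 - 6 * sum(d^2) / (n * (n^2 - 1))`, where `d` is the per-item rank difference.

```python
import pandas as pd

# Hypothetical ranks assigned to five items by two judges
df_demo = pd.DataFrame({'Humans': [1, 2, 3, 4, 5],
                        'GPT-4':  [1, 3, 2, 4, 5]},
                       index=list('ABCDE'))

# Same call as above, on the toy data
rho = df_demo.corr(method='spearman')

# Closed form for tie-free ranks, computed by hand for comparison
n = len(df_demo)
d2 = ((df_demo['Humans'] - df_demo['GPT-4']) ** 2).sum()
rho_manual = 1 - 6 * d2 / (n * (n ** 2 - 1))

print(rho.loc['Humans', 'GPT-4'], rho_manual)
```

Swapping two neighbouring items in a five-item ranking already lowers the coefficient from 1.0 to 0.9, so small differences between the columns of the real table can reflect only a few reordered models.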