This is a minimalistic large language model (LLM) leaderboard built from human and machine feedback on pairwise comparisons of model responses to a carefully selected set of prompts.
- Overview: https://evalovernite.substack.com/p/llmfao-human-ranking
- Leaderboard: https://dustalov.github.io/llmfao/
- Dataset: https://huggingface.co/datasets/dustalov/llmfao
The pairwise comparisons are transformed into scores using the Evalica library.
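For illustration, a handful of pairwise outcomes can be turned into scores roughly as follows; this is a minimal sketch with made-up model names and results, and it assumes Evalica's pairwise API with a `Winner` enum, so consult the library's documentation for the exact signatures:

```python
import evalica
from evalica import Winner

# Toy head-to-head outcomes between hypothetical models.
xs = ["model-a", "model-a", "model-b"]
ys = ["model-b", "model-c", "model-c"]
winners = [Winner.X, Winner.X, Winner.Draw]

# Bradley–Terry strength scores; higher means stronger.
result = evalica.bradley_terry(xs, ys, winners)
print(result.scores.sort_values(ascending=False))
```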
The original Crowdsourced LLM Benchmark dataset in the files `prompts.jsonl`, `models.jsonl`, and `results.jsonl` was kindly provided by the team at llmonitor.com under the CC BY 4.0 license.
All files with the substring `crowd` in their names contain only a smaller subset of non-coding and non-redundant prompts: `k8s`, `qft`, `vendor-extract`, `munchhausen`, `basic-sentiment`, `cot-apple`, `taiwan`, `sally`, `holidays`, `bagel`, `planets-json`, `blue`, `product-description` (13 out of 19 prompts).
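The following sketch shows one way to pick out this subset from `prompts.jsonl` with pandas, assuming these identifiers appear in the `slug` field from the schema below:

```python
import pandas as pd

# Slugs of the 13 non-coding, non-redundant prompts used in the crowd files.
CROWD_SLUGS = {
    "k8s", "qft", "vendor-extract", "munchhausen", "basic-sentiment",
    "cot-apple", "taiwan", "sally", "holidays", "bagel",
    "planets-json", "blue", "product-description",
}

prompts = pd.read_json("prompts.jsonl", lines=True)
crowd_prompts = prompts[prompts["slug"].isin(CROWD_SLUGS)]
print(len(crowd_prompts))  # expected to be 13
```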
All other data files were released under the same CC BY 4.0 license by Dmitry Ustalov:
- `pairs.jsonl`, `pairs-crowd.jsonl`: sampled pairs
- `pairs-crowd-training.jsonl`: pairs for training the annotators
- `crowd-instruction.md`, `gpt-instruction.txt`: instructions for annotators and models
- `gpt3.jsonl`, `gpt4.jsonl`: GPT-3 and GPT-4 API requests
- `gpt*-responses.jsonl`: GPT-3 and GPT-4 API responses
- `*-comparisons.csv`: pairwise comparisons
- `*-crowd-comparisons.csv`: pairwise comparisons without code-related pairs
```mermaid
erDiagram
    models {
        int id
        string api_id
        string name
        string api
        string type
        string org
    }
    prompts {
        int id
        string text
        string type
        string slug
        string stop
        string note
    }
    results {
        int id
        int model
        string result
        float duration
        float rate
        int prompt
        string api_id
        string name
        string api
        string type
        string org
        string slug
    }
    pairs {
        int id
        int prompt
        string text
        string type
        string slug
        string note
        int model_x
        string result_x
        int model_y
        string result_y
    }
    comparisons {
        int id
        int prompt
        int model_x
        int model_y
        string left
        string right
        string winner
    }
    models ||--|{ results : has
    models ||--|{ pairs : has
    models ||--|{ comparisons : has
    prompts ||--|{ results : has
    prompts ||--|{ pairs : has
    prompts ||--|{ comparisons : has
```
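As a quick sanity check of the schema, the sketch below loads `models.jsonl` and one of the comparison files with pandas and resolves the `model_x`/`model_y` foreign keys to model names; it assumes the column names match the diagram above and that the files sit in the working directory:

```python
import pandas as pd

models = pd.read_json("models.jsonl", lines=True)
comparisons = pd.read_csv("gpt4-comparisons.csv")

# Map model ids to human-readable names.
id_to_name = models.set_index("id")["name"]
comparisons["model_x_name"] = comparisons["model_x"].map(id_to_name)
comparisons["model_y_name"] = comparisons["model_y"].map(id_to_name)

print(comparisons[["prompt", "model_x_name", "model_y_name", "winner"]].head())
```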
All the code is released under the GPLv3+ license; if you use only the data and not the code, the GPLv3+ terms do not apply to you.
To regenerate the pair files and the GPT request files, run:

```shell
pipenv run llmfao pairs          # generates pairs.jsonl and pairs-crowd.jsonl
pipenv run llmfao gpt3-requests  # generates gpt3.jsonl (makes no API requests)
pipenv run llmfao gpt4-requests  # generates gpt4.jsonl (makes no API requests)
```
The generated `gpt3.jsonl` and `gpt4.jsonl` files can be used to make requests to the OpenAI API via `api_request_parallel_processor.py` (not included here):
```shell
# GPT-3.5 Turbo Instruct
python3 api_request_parallel_processor.py \
    --requests_filepath 'gpt3.jsonl' \
    --save_filepath 'gpt3-responses.jsonl' \
    --request_url 'https://api.openai.com/v1/completions' \
    --max_requests_per_minute 3000 \
    --max_tokens_per_minute 80000

# GPT-4
python3 api_request_parallel_processor.py \
    --requests_filepath 'gpt4.jsonl' \
    --save_filepath 'gpt4-responses.jsonl' \
    --request_url 'https://api.openai.com/v1/chat/completions' \
    --max_requests_per_minute 150 \
    --max_tokens_per_minute 9000
```
After obtaining the responses from the GPT models, they can be transformed into comparisons:
```shell
pipenv run llmfao gpt3-comparisons  # generates gpt3-comparisons.csv (makes no API requests)
pipenv run llmfao gpt4-comparisons  # generates gpt4-comparisons.csv (makes no API requests)

pipenv run llmfao gpt3-comparisons \
    --pairs pairs-crowd.jsonl \
    --output gpt3-crowd-comparisons.csv  # generates gpt3-crowd-comparisons.csv (makes no API requests)

pipenv run llmfao gpt4-comparisons \
    --pairs pairs-crowd.jsonl \
    --output gpt4-crowd-comparisons.csv  # generates gpt4-crowd-comparisons.csv (makes no API requests)
```