Large Language Model Feedback Analysis and Optimization (LLMFAO)

This is a minimalistic large language model (LLM) leaderboard that is based on human and machine feedback on pairwise responses of the models based on a carefully-selected set of prompts and different models.

Overview: https://evalovernite.substack.com/p/llmfao-human-ranking
Leaderboard: https://dustalov.github.io/llmfao/
Dataset: https://huggingface.co/datasets/dustalov/llmfao

The pairwise comparisons are transformed into scores using the Evalica library.

Data

The original Crowdsourced LLM Benchmark dataset in files prompts.jsonl, models.jsonl, and results.jsonl was kindly provided by the team at llmonitor.com under a CC BY 4.0 license.

All files with substring crowd in the name have only prompts from a smaller subset non-coding and non-redundant prompts: k8s, qft, vendor-extract, munchhausen, basic-sentiment, cot-apple, taiwan, sally, holidays, bagel, planets-json, blue, product-description (13 out of 19 prompts).

All other data files were released under the same CC BY 4.0 license by Dmitry Ustalov:

pairs.jsonl, pairs-crowd.jsonl: sampled pairs
pairs-crowd-training.jsonl: pairs for training the annotators
crowd-instruction.md, gpt-instruction.txt: instructions for annotators and models
gpt3.jsonl, gpt4.jsonl: GPT-3 and GPT-4 API requests
gpt*-responses.jsonl: GPT-3 and GPT-4 API responses
*-comparisons.csv: pairwise comparisons
*-crowd-comparisons.csv: pairwise comparisons without code-related pairs

erDiagram
    models {
        int id
        string api_id
        string name
        string api
        string type
        string org
    }
    prompts {
        int id
        string text
        string type
        string slug
        string stop
        string note
    }
    results {
        int id
        int model
        string result
        float duration
        float rate
        int prompt
        string api_id
        string name
        string api
        string type
        string org
        string slug
    }
    pairs {
        int id
        int prompt
        string text
        string type
        string slug
        string note
        int model_x
        string result_x
        int model_y
        string result_y
    }
    comparisons {
        int id
        int prompt
        int model_x
        int model_y
        string left
        string right
        string winner
    }
    models ||--|{ results : has
    models ||--|{ pairs : has
    models ||--|{ comparisons : has
    prompts ||--|{ results : has
    prompts ||--|{ pairs : has
    prompts ||--|{ comparisons : has

Code

All the code is released under the GPLv3+ license, but if you use only the data and not the code, it does not apply to you.

pipenv run llmfao pairs  # generates pairs.jsonl and pairs-crowd.jsonl
pipenv run llmfao gpt3-requests  # generates gpt3.jsonl (makes no API requests)
pipenv run llmfao gpt4-requests  # generates gpt4.jsonl (makes no API requests)

The generated gpt3.jsonl and gpt4.jsonl files can be used to make requests to the OpenAI API via api_request_parallel_processor.py (not included here).

# GPT-3.5 Turbo Instruct
python3 api_request_parallel_processor.py \
    --requests_filepath 'gpt3.jsonl' \
    --save_filepath 'gpt3-responses.jsonl' \
    --request_url 'https://api.openai.com/v1/completions' \
    --max_requests_per_minute 3000 \
    --max_tokens_per_minute 80000

# GPT-4
python3 api_request_parallel_processor.py \
    --requests_filepath 'gpt4.jsonl' \
    --save_filepath 'gpt4-responses.jsonl' \
    --request_url 'https://api.openai.com/v1/chat/completions' \
    --max_requests_per_minute 150 \
    --max_tokens_per_minute 9000

After obtaining the responses from GPT models, it is possible to transform them into comparisons.

pipenv run llmfao gpt3-comparisons  # generates gpt3-comparisons.csv (makes no API requests)
pipenv run llmfao gpt4-comparisons  # generates gpt4-comparisons.csv (makes no API requests)
pipenv run llmfao gpt3-comparisons \
    --pairs pairs-crowd.jsonl \
    --output gpt3-crowd-comparisons.csv  # generates gpt3-crowd-comparisons.csv (makes no API requests)
pipenv run llmfao gpt4-comparisons \
    --pairs pairs-crowd.jsonl \
    --output gpt4-crowd-comparisons.csv  # generates gpt4-crowd-comparisons.csv (makes no API requests)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

Repository files navigation

Large Language Model Feedback Analysis and Optimization (LLMFAO)

Data

Code

About

Licenses found

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 95 Commits
.github		.github
.gitignore		.gitignore
LICENSE.CC-BY		LICENSE.CC-BY
LICENSE.GPL		LICENSE.GPL
Pipfile		Pipfile
Pipfile.lock		Pipfile.lock
README.md		README.md
crowd-comparisons.csv		crowd-comparisons.csv
crowd-instruction.md		crowd-instruction.md
gpt-instruction.txt		gpt-instruction.txt
gpt3-comparisons.csv		gpt3-comparisons.csv
gpt3-crowd-comparisons.csv		gpt3-crowd-comparisons.csv
gpt3-responses.jsonl		gpt3-responses.jsonl
gpt3.jsonl		gpt3.jsonl
gpt4-comparisons.csv		gpt4-comparisons.csv
gpt4-crowd-comparisons.csv		gpt4-crowd-comparisons.csv
gpt4-responses.jsonl		gpt4-responses.jsonl
gpt4.jsonl		gpt4.jsonl
leaderboard.ipynb		leaderboard.ipynb
llmfao.py		llmfao.py
models.jsonl		models.jsonl
mypy.ini		mypy.ini
pairs-crowd-training.jsonl		pairs-crowd-training.jsonl
pairs-crowd.jsonl		pairs-crowd.jsonl
pairs.jsonl		pairs.jsonl
prompts.jsonl		prompts.jsonl
results.jsonl		results.jsonl
ruff.toml		ruff.toml

License

Licenses found

dustalov/llmfao

Folders and files

Latest commit

History

Repository files navigation

Large Language Model Feedback Analysis and Optimization (LLMFAO)

Data

Code

About

Topics

Resources

License

Licenses found

Stars

Watchers

Forks

Contributors 2

Languages