
Original FastChat

| Demo | Chatbot Arena | Discord | Twitter |

FastChat is an open platform for training, serving, and evaluating large language model based chatbots. The core features include:

  • The weights, training code, and evaluation code for state-of-the-art models (e.g., Vicuna).
  • A distributed multi-model serving system with web UI and OpenAI-compatible RESTful APIs.

Purpose of this fork

This fork reproduces the data behind https://medium.com/@geronimo7/judging-the-judges-668e80f4a1f2

Figure 1: Agreement of the judges GPT-4 and Llama 2 with human evaluation

# 1: download human judgments
# https://colab.research.google.com/drive/1ctgygDRJhVGUJTQy8-bRZCl1WNcT8De6?usp=sharing
cd FastChat/fastchat/llm_judge/data/mt_bench/model_judgment/

python3 -c '
from datasets import load_dataset
dataset = load_dataset("lmsys/mt_bench_human_judgments")
dataset["human"].to_json("human_judgments.json")
'

# 2: download GPT-4 judgments
# this is an 'old' file; the most recent one does not contain judgments for vicuna-13b-v1.2, which are needed for the human-agreement analysis
wget https://huggingface.co/spaces/lmsys/mt-bench/resolve/89039fbb04009aa54b2ce88a294402fe9d756913/data/mt_bench/model_judgment/gpt-4_pair.jsonl

# 3: transform the GPT-4 and Llama 2 judgments into the agreement-ready format
# (the output format of gen_judgment.py differs slightly from the input format expected by compute_agreement.py)

## GPT-4: transform, and additionally filter the GPT-4 judgment file to the models used in this study
cd ../../../
python3 check_judgements.py gpt-4_pair.jsonl --prepare_for_agreement=True --filter_models "gpt-4 vicuna-13b-v1.2 claude-v1 alpaca-13b llama-13b gpt-3.5-turbo"

## Llama 2 (upstage_Llama-2-70b-instruct-v2):
python3 check_judgements.py llama2_pair.jsonl --prepare_for_agreement=True 

# 4: Show agreement

## GPT-4 / human
python3 compute_agreement.py --judges gpt-4 human --votefiles data/mt_bench/model_judgment/human_judgments.json data/mt_bench/model_judgment/gpt-4_pair_agreement-ready.json

## Llama 2 / human
python3 compute_agreement.py --judges upstage_Llama-2-70b-instruct-v2 human --votefiles data/mt_bench/model_judgment/human_judgments.json data/mt_bench/model_judgment/llama2_pair_agreement-ready.json

## human/human
python3 compute_agreement.py --judges human human --votefiles data/mt_bench/model_judgment/human_judgments.json
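
For orientation, "agreement" here is simply the fraction of shared pairwise comparisons on which two judges pick the same side (for the human/human baseline, the same idea is applied between pairs of human annotators within one file). A minimal Python sketch of that idea, assuming JSON-lines vote files and illustrative field names (question_id, model_a, model_b, winner); this is not the compute_agreement.py implementation:

# sketch only; field names are assumptions, NOT the compute_agreement.py logic
import json

def load_votes(path):
    votes = {}
    with open(path) as f:
        for line in f:
            r = json.loads(line)
            key = (r["question_id"], r["model_a"], r["model_b"])
            votes[key] = r["winner"]              # e.g. "model_a", "model_b", "tie"
    return votes

def agreement(path_a, path_b):
    a, b = load_votes(path_a), load_votes(path_b)
    shared = set(a) & set(b)                      # comparisons both judges voted on
    return sum(a[k] == b[k] for k in shared) / len(shared) if shared else float("nan")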

Figure 1, blue panel on the right: Correlation between GPT-4 and Llama 2, all-against-all

cd FastChat/fastchat/llm_judge/

# Show results for GPT-4
python3 show_result.py --mode pairwise-all --input-file data/mt_bench/model_judgment/gpt-4_pair.jsonl --model-list gpt-4 vicuna-13b-v1.2 claude-v1 alpaca-13b llama-13b gpt-3.5-turbo

# Show results for Llama 2 (upstage_Llama-2-70b-instruct-v2):
python3 show_result.py --mode pairwise-all --input-file data/mt_bench/model_judgment/llama2_pair.jsonl
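
show_result.py --mode pairwise-all aggregates the pairwise verdicts into a per-model win rate over all comparisons. Roughly, under the same illustrative field names as in the sketch above (not the actual show_result.py logic):

# rough sketch of all-against-all win rates; NOT show_result.py
import json
from collections import Counter

wins, games = Counter(), Counter()
with open("data/mt_bench/model_judgment/gpt-4_pair.jsonl") as f:
    for line in f:
        r = json.loads(line)                      # field names are illustrative
        a, b, w = r["model_a"], r["model_b"], r["winner"]
        games[a] += 1
        games[b] += 1
        if w == "model_a":
            wins[a] += 1
        elif w == "model_b":
            wins[b] += 1                          # ties and errors count as games, not wins

for m in sorted(games, key=lambda m: wins[m] / games[m], reverse=True):
    print(f"{m:25s}  win rate {wins[m] / games[m]:.3f}")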

Figure 2: Judgment quality

cd FastChat/fastchat/llm_judge/

# GPT-4
python3 check_judgements.py gpt-4_pair.jsonl

# Llama 2 (upstage_Llama-2-70b-instruct-v2):
python3 check_judgements.py llama2_pair.jsonl
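
Running check_judgements.py without extra flags reports how usable the raw judgments are. A rough illustration of such a check, assuming the standard MT-Bench pairwise prompt that asks the judge to emit a [[A]], [[B]], or [[C]] verdict (the judgment field name is an assumption, and this is not the actual script):

# illustrative tally of verdict labels in a raw judgment file; NOT check_judgements.py
import json
import re
from collections import Counter

counts = Counter()
with open("data/mt_bench/model_judgment/llama2_pair.jsonl") as f:
    for line in f:
        r = json.loads(line)
        text = r.get("judgment", "")              # field name is an assumption
        m = re.search(r"\[\[(A|B|C)\]\]", text)
        counts[m.group(1) if m else "unparseable"] += 1

print(counts)                                     # counts of A/B/C verdicts and unparseable outputs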

Figure 3: Win rate against GPT-3.5 for Guanaco 7B checkpoints

cd FastChat/fastchat/llm_judge

# GPT-4
python3 show_result.py --mode pairwise-baseline --input-file data/mt_bench/model_judgment/guanaco_gpt-4_pair.jsonl

# Llama 2 (upstage_Llama-2-70b-instruct-v2):
python3 show_result.py --exclude-ties-and-errors --mode pairwise-baseline --judge-model upstage_Llama-2-70b-instruct-v2 --input-file data/mt_bench/model_judgment/guanaco_llama2_pair.jsonl

Run judgments

  • Note: everything below is only needed if you want to rerun the judgments with another judge or with other models' answers; the judgment data used to produce the figures is already included in this fork.

Generate model answers

All-against-all analysis (Figures 1 and 2)

# Download model answers
cd FastChat/fastchat/llm_judge
python3 download_mt_bench_pregenerated.py  # downloads everything except the vicuna-13b-v1.2 data
wget https://huggingface.co/spaces/lmsys/mt-bench/raw/eb60c015e9c4dad0cbdb01c905067ec8b0973fd7/data/mt_bench/model_answer/vicuna-13b-v1.2.jsonl -O data/mt_bench/model_answer/vicuna-13b-v1.2.jsonl  # this file was deleted upstream, so fetch it from an old commit
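
A quick sanity check that an answer file exists for each of the six models used in this study (model names taken from the commands above):

# verify that an answer file exists for each of the six models
from pathlib import Path

models = ["gpt-4", "vicuna-13b-v1.2", "claude-v1", "alpaca-13b", "llama-13b", "gpt-3.5-turbo"]
answer_dir = Path("data/mt_bench/model_answer")
for m in models:
    path = answer_dir / (m + ".jsonl")
    print(m, "ok" if path.exists() else "MISSING")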

Guanaco analysis (Figure 3)

dir="models"
modelprefixes="llama-2-7b-guanaco-2023-08-04"

IFS=$'\n'
for model in $(cd $dir;ls -1 | grep ${modelprefixes} | sed "s/ /\\\ /g")
do
    echo ${model}
    python gen_model_answer.py --model-path ${dir}/${model} --model-id ${model}
done

Set up oobabooga/text-generation-webui to serve the judge via API

# clone oobabooga/text-generation-webui, commit used: 08c622df2e26811440f6b3311dff3553ba20dc86
git clone https://github.com/oobabooga/text-generation-webui

# create new instruction template: Orca-Hashes
echo 'user: "### User:"
bot: "### Assistant:"
turn_template: "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"
context: "### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n\n"' > Orca-Hashes.yaml

# copy Orca-Hashes.yaml to text-generation-webui/instruction-templates or characters/instruction-following, depending on your version of text-generation-webui

# Start oobabooga/text-generation-webui, load the judge model upstage_Llama-2-70b-instruct-v2 
python3 server.py --model Upstage-Llama-2-70B-instruct-v2-GPTQ --public-api --api --no_inject_fused_attention 

Generate judgments

Create reference answers for the Llama 2 judge

# Create reference answers for upstage_Llama-2-70b-instruct-v2 (same as for GPT-4)
cd FastChat/fastchat/llm_judge/data/mt_bench/reference_answer
cp gpt-4.jsonl upstage_Llama-2-70b-instruct-v2.jsonl

Set API host

  • Edit tgw_request_template.json and insert the URL where the text-generation-webui API is running:
{
    "URI": "https://fiction-genesis-highlight-cpu.trycloudflare.com/api/v1/chat",
    ...
}
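
Once the URI is set, a quick connectivity check can save a long debugging session. The sketch below simply posts the template to its own URI and prints the HTTP status; gen_judgment.py --use-api presumably fills in the real prompt fields in a similar way, but that is an assumption, not the fork's actual request code:

# quick connectivity check against the text-generation-webui API; a sketch, not the fork's request code
import json
import requests

with open("tgw_request_template.json") as f:
    template = json.load(f)

resp = requests.post(template["URI"], json=template, timeout=60)
print(resp.status_code, resp.text[:200])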

Six-model all-against-all analysis (Figures 1 and 2)

cd FastChat/fastchat/llm_judge

python3 gen_judgment.py --mode pairwise-all --model-list gpt-4 vicuna-13b-v1.2 claude-v1 llama-13b gpt-3.5-turbo alpaca-13b --parallel 1 --use-api True --judge-model upstage_Llama-2-70b-instruct-v2

Guanaco checkpoints against GPT-3.5 analysis (Figure 3)

cd FastChat/fastchat/llm_judge

python3 gen_judgment.py --mode pairwise-baseline --parallel 1 --use-api True --judge-model upstage_Llama-2-70b-instruct-v2 --model-list alpaca-13b gpt-4 llama-2-7b-guanaco-2023-08-04_misunderstood-lion-checkpoint-500 llama-2-7b-guanaco-2023-08-04_misunderstood-lion-checkpoint-1000 llama-2-7b-guanaco-2023-08-04_misunderstood-lion-checkpoint-1500 llama-2-7b-guanaco-2023-08-04_misunderstood-lion-checkpoint-2000 llama-2-7b-guanaco-2023-08-04_misunderstood-lion-checkpoint-2500 llama-2-7b-guanaco-2023-08-04_misunderstood-lion-checkpoint-3000 llama-2-7b-guanaco-2023-08-04_misunderstood-lion-checkpoint-3500
