using eval part of FastChat to evaluate the current mess of open-source LLMs

g588928812/FastChat_eval


Turning FastChat upside down to benchmark open-source LLMs

Figure: (a, b) Winning percentage of an all-against-all competition. 70 questions were asked to each model; answers were judged by GPT-3.5 via the API. Bars show the mean and standard deviation of 3 replicates per model. Control: GPT-3.5 answers "shifted", i.e. answers unrelated to the question asked. Bottom: human preference as Elo ratings, assessed in the LMSYS Chatbot Arena.

Full write-up: https://medium.com/@geronimo7/open-source-chatbots-in-the-wild-9a44d7a41a48

# run a pairwise review: GPT-3.5 judges the answers of two models
export APIKEY="YOURKEY"
python3 get_review.py -r "gpt-3.5-pairwise" -a table/answer/answer_vicuna-13b.jsonl table/answer/answer_pythia-12b-sft-v8-7k-steps.jsonl -o table/review_pairwise/review_vicuna-13b_pythia-12b-sft-v8-7k-steps -dr 3 -k "$APIKEY"
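An all-against-all competition means running this command once per ordered pair of models. A minimal sketch of that driver loop follows; the model names in `MODELS` are examples only, and the commands are echoed rather than executed so the loop can be inspected before spending API credits.

```shell
# Sketch: enumerate every pairwise matchup for the all-against-all run.
# MODELS is a hypothetical example list; replace with your answer files.
MODELS="vicuna-13b pythia-12b-sft-v8-7k-steps"
for a in $MODELS; do
  for b in $MODELS; do
    [ "$a" = "$b" ] && continue   # a model does not play against itself
    echo python3 get_review.py -r "gpt-3.5-pairwise" \
      -a "table/answer/answer_${a}.jsonl" "table/answer/answer_${b}.jsonl" \
      -o "table/review_pairwise/review_${a}_${b}" -dr 3 -k "$APIKEY"
  done
done
```

Drop the `echo` once the generated commands look right.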

# calc winning percentage
python3 reviews_winrate_pairwise.py -r table/reviews_pairwise

# generate html
python3 generate_webpage_data_from_table.py -r table/reviews_pairwise
