An automated pipeline for evaluating LLMs for role-playing.
pip install -r requirements.txt
First, set the environment variable OPENAI_API_KEY for the judge model, and point the pipeline to the path of the RPBench dataset.
export OPENAI_API_KEY=<API_KEY>
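Because the judge model depends on this variable, it can help to fail early when it is missing. The following is a minimal sketch, not part of this repository; the helper name require_env is hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    Hypothetical helper for illustration; the pipeline scripts may handle this differently.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running the pipeline."
        )
    return value
```

Calling require_env("OPENAI_API_KEY") at the top of a wrapper script would stop the run before any evaluation starts if the key has not been exported.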
Then, add a config for the model you want to evaluate by editing config/api_config.yaml. Currently, the OpenAI API (and compatible APIs) and the Anthropic API are supported.
Finally, run the pipeline.
python run_character_eval.py --model_1 <CONFIG_NAME> # Evaluate the model on the character subset
python run_scene_eval.py --model_1 <CONFIG_NAME> # Evaluate the model on the scene subset
Generate the leaderboard.
python generate_leaderboard.py
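The three commands above can also be chained from one script. This is a sketch under assumptions: it is run from the repository root, the scripts accept exactly the arguments shown above, and the function names build_commands and run_pipeline are hypothetical.

```python
import subprocess
import sys
from typing import List


def build_commands(config_name: str) -> List[List[str]]:
    """Build the argument lists for both evaluations and the leaderboard step."""
    py = sys.executable  # reuse the interpreter that runs this script
    return [
        [py, "run_character_eval.py", "--model_1", config_name],  # character subset
        [py, "run_scene_eval.py", "--model_1", config_name],      # scene subset
        [py, "generate_leaderboard.py"],                          # leaderboard files
    ]


def run_pipeline(config_name: str) -> None:
    """Run each step in order; check=True stops at the first failing command."""
    for cmd in build_commands(config_name):
        subprocess.run(cmd, check=True)
```

Calling run_pipeline("<CONFIG_NAME>") with a name defined in config/api_config.yaml would run the steps in sequence and raise subprocess.CalledProcessError if any of them exits with a non-zero status.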
After running all commands above, you can add your model to the leaderboard by creating a pull request with the updated leaderboard files, leaderboard.csv and leaderboard_for_display.csv, plus the .jsonl files in /results/character and /results/scene. The leaderboard will be updated automatically when the PR is merged.
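Before opening the pull request, it may be worth checking that the result files are well-formed. The sketch below assumes only that each non-empty line of a .jsonl file is a standalone JSON value; the function name find_invalid_jsonl_lines is hypothetical, and the schema of the result records is not checked.

```python
import json
from pathlib import Path
from typing import List, Tuple


def find_invalid_jsonl_lines(results_dir: str) -> List[Tuple[str, int]]:
    """Return (file path, line number) for every non-empty line that is not valid JSON."""
    problems = []
    for path in sorted(Path(results_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line_number, line in enumerate(handle, start=1):
                if not line.strip():
                    continue  # ignore blank lines
                try:
                    json.loads(line)
                except json.JSONDecodeError:
                    problems.append((str(path), line_number))
    return problems
```

Running it once on the character results directory and once on the scene results directory should return an empty list in both cases if every line parses.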
This benchmark is heavily inspired by ArenaHard and AlpacaEval. Some of the code is adapted from these repositories.