
RPBench-Auto

Leaderboard | Blog

An automated pipeline for evaluating LLMs for role-playing.


Installation

pip install -r requirements.txt

Usage

First, set the environment variable OPENAI_API_KEY for the judge model, along with the environment variable pointing to the path of the RPBench dataset.

export OPENAI_API_KEY=<API_KEY>

Then, add a config entry for the model you want to evaluate. Currently, the OpenAI API (and compatible APIs) and the Anthropic API are supported. Edit config/api_config.yaml to add the model config.
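
The exact schema of config/api_config.yaml is defined by the repository; the entry below is only a rough sketch, and every field name in it is an assumption rather than the confirmed format.

my-model:                                # <CONFIG_NAME> passed on the command line (hypothetical name)
  api_type: openai                       # assumed field: "openai" or "anthropic"
  model_name: gpt-4o-mini                # assumed field: model identifier sent to the API
  api_base: https://api.openai.com/v1    # assumed field: endpoint, useful for OpenAI-compatible APIs
  api_key_env: OPENAI_API_KEY            # assumed field: environment variable holding the API key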

Finally, run the pipeline.

python run_character_eval.py --model_1 <CONFIG_NAME>  # Evaluate the model on the character subset
python run_scene_eval.py --model_1 <CONFIG_NAME>  # Evaluate the model on the scene subset
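
For example, with a config entry named my-model (a hypothetical name) added to config/api_config.yaml:

python run_character_eval.py --model_1 my-model
python run_scene_eval.py --model_1 my-model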

Generate the leaderboard.

python generate_leaderboard.py

How to contribute

After running all commands above, you can add your model to the leaderboard by creating a pull request with the updated leaderboard files, leaderboard.csv and leaderboard_for_display.csv, plus the .jsonl files in /results/character and /results/scene. The leaderboard will be updated automatically when the PR is merged.
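
For reference, such a pull request would typically contain files laid out roughly as follows; the per-model .jsonl file names are an assumption, based on the pipeline writing one result file per evaluated config.

leaderboard.csv
leaderboard_for_display.csv
results/
  character/
    <CONFIG_NAME>.jsonl
  scene/
    <CONFIG_NAME>.jsonl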

Acknowledgements

This benchmark is heavily inspired by ArenaHard and AlpacaEval. Some code implementations are borrowed from these repositories.
