clembench: A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents

UPDATE (16.02.24): We released v0.3 of the benchmark code. The main branch will continue as v1.0-beta, which has changes that affect the game code. Follow this guide to update your game.

The cLLM (chat-optimized Large Language Model, "clem") framework tests such models' ability to engage in games – rule-constituted activities played using language. The framework is a systematic way of probing the situated language understanding of language-using agents.

This repository contains the code for setting up the framework and implements a number of games that are further discussed in:

Chalamalasetti, K., Götze, J., Hakimov, S., Madureira, B., Sadler, P., & Schlangen, D. (2023). clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents (arXiv:2305.13455). arXiv. https://doi.org/10.48550/arXiv.2305.13455

Evaluation Results

Evaluation results are available on the main project website, under leaderboard.

Game details

A Simple Word Game: taboo
A Word-Guessing Game Based on Clues: wordle
Drawing Instruction Giving and Following: image
An ASCII Picture Reference Game: reference
Scorekeeping: private and shared

Using the benchmark

This repository is tested on Python 3.8+.
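
As a minimal sketch (not part of the repository), a script could check the interpreter version before running the benchmark. The names `MIN_PYTHON` and `python_is_supported` are hypothetical and chosen only for illustration; the 3.8 threshold is taken from the statement above.

```python
import sys

# Assumption: 3.8 is the lowest version the repository is tested on.
MIN_PYTHON = (3, 8)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if not python_is_supported():
        found = ".".join(str(part) for part in sys.version_info[:3])
        needed = ".".join(str(part) for part in MIN_PYTHON)
        sys.exit(f"Python {needed}+ is required, found {found}")
```

Comparing only the major and minor components keeps the check independent of patch releases.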

We welcome you to contribute to or extend the benchmark with your own games and models. Please simply open a pull request. You can find more information on how to use the benchmark in the links below.
