
Entity-Deduction Arena (EDA)

This software project accompanies the research paper, Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games.

Motivation

  • There is a demand to assess the capability of LLMs to ask clarifying questions in order to effectively resolve ambiguities when confronted with vague queries.
  • This capability demands a sophisticated understanding of context, state tracking, deductive reasoning, and strategic planning across multiple conversational exchanges.

Highlights

  • The Entity-Deduction Arena (EDA) is a surrogate problem that gauges an LLM's aptitude to deduce an entity by posing a series of queries to the judge.
  • Through systematic evaluations, we analyze diverse LLMs and uncover noteworthy disparities in their performance on this particular task.
| Model              | #Turns (↓) | Success (↑) | #Yes    | Score (↑) |
|--------------------|------------|-------------|---------|-----------|
| GPT-4-0613         | 16.9±0.2   | 0.49±0.06   | 6.0±0.2 | 0.40±0.05 |
| GPT-3.5-turbo-0613 | 18.4±0.3   | 0.25±0.04   | 7.1±0.4 | 0.21±0.04 |
| Claude-2           | 17.6±0.3   | 0.29±0.05   | 4.5±0.3 | 0.25±0.04 |
| Claude-1           | 18.7±0.1   | 0.15±0.02   | 4.3±0.2 | 0.13±0.02 |
| Vicuna 13B (v1.3)  | 18.7±0.2   | 0.20±0.03   | 5.2±0.3 | 0.17±0.02 |
| Vicuna 7B (v1.3)   | 19.1±0.4   | 0.11±0.06   | 5.7±0.6 | 0.10±0.05 |

Install dependencies

pip install -r requirements.txt

Specify your OpenAI credential (API key)

export OPENAI_API_KEY="sk-XXXX"

Run the game from the command line

Example usage:

# GPT-3.5 plays against GPT-3.5 on Things.
python GPT_Q20.py --input data/things/list_of_things_eval.txt -g gpt-3.5-turbo
# GPT-4 plays against GPT-3.5 on Celebs.
python GPT_Q20_celebrity.py --input data/celebrities/list_of_people_eval.txt -g gpt-4
# Vicuna 7B plays against GPT-3.5 on Things with 5 repetitions.
python GPT_Q20.py --input data/things/list_of_things_eval.txt -g lmsys/vicuna-7b-v1.3 --openai-api False -s hf --num-sessions 5

Running one of the commands above creates an output folder named ./data/<model_name>. This folder contains the game-play sessions for all tested entities.
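For instance, after a run with gpt-3.5-turbo as the guesser, you might inspect the output like this (the folder names below are illustrative assumptions; exact names depend on the guesser model, --suffix, and --num-sessions):

# List the per-model output folders (names shown are assumptions for illustration).
ls ./data/
# e.g. gpt-3.5-turbo_rep1  gpt-3.5-turbo_rep2  ...
# Each folder holds one game-play session per evaluated entity.
ls ./data/gpt-3.5-turbo_rep1/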

Command-line Arguments:

Required:

  • --input: Specifies the input file listing the entities to evaluate (e.g. data/things/list_of_things_eval.txt).

  • -g, --guesser_model: Specify the model for guessing. Examples: gpt-4, gpt-3.5, gpt-3.5-turbo, claude-1, claude-2, lmsys/vicuna-7b-v1.3, lmsys/vicuna-13b-v1.3, meta-llama/Llama-2-13b-chat-hf ...

Please note that when using models from the Hugging Face Hub or from a local path, such as lmsys/vicuna-7b-v1.3 or /mnt/ckpts/checkpoint_400 (a local path), you need to set --openai-api to False in order to load the model correctly. Alternatively, you can set up an OpenAI-compatible API server from the checkpoint following the instructions in the FastChat repository; in that case, --openai-api should remain True.
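As a rough sketch of the second option, following FastChat's documented serving workflow (the model path, host, and port are placeholders; check the FastChat repository for the current commands and for how to point this project's OpenAI client at the local server):

# Start the FastChat controller, a model worker, and an OpenAI-compatible API server.
# Commands follow FastChat's documented workflow; verify against the FastChat repo.
python -m fastchat.serve.controller
python -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.3
python -m fastchat.serve.openai_api_server --host localhost --port 8000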

Optional:

  • --answerer_model: Specify the model for answering questions (the judge). Defaults to gpt-3.5-turbo.

  • --suffix: Optional suffix to modify output or data.

  • --user: User name for user mode

  • --turns: Set the maximum number of turns for the game.

  • --temp: Set the temperature for model sampling.

  • --num-sessions: Set the number of repeated game sessions (repetitions) per run.

  • --openai-api: Whether or not to use an OpenAI API server for the guesser model. Defaults to True.
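As an illustration of combining these options (the values below are arbitrary choices for demonstration, not recommended settings):

# Hypothetical run: GPT-4 guesser, GPT-3.5-turbo judge, 20-turn cap,
# sampling temperature 0.8, and 3 repeated sessions.
python GPT_Q20.py --input data/things/list_of_things_eval.txt -g gpt-4 --answerer_model gpt-3.5-turbo --turns 20 --temp 0.8 --num-sessions 3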

Batch evaluation

Specify the model as <your_model>. If opting for a model path, set --openai-api False.

for rep in 1 2 3 4 5; do
    python script/eval_result.py --dir ./data/<your_model>_rep${rep}
done

Then run the following command to compare and summarize all the results across models in the entire folder:

python script/breakdown_stats.py --dir ./data

Setting up a demo server

Coming soon.

Citation

Please consider citing our work if it is helpful to your research.

@article{zhang2023entity,
  title={The Entity-Deduction Arena: A playground for probing the conversational reasoning and planning capabilities of LLMs},
  author={Zhang, Yizhe and Lu, Jiarui and Jaitly, Navdeep},
  journal={arXiv preprint arXiv:2310.01468},
  year={2023}
}

Poster

Poster for the entity-deduction arena
