This code provides an unofficial PyTorch implementation of the paper "Let Models Speak Ciphers: Multiagent Debate through Embeddings".
@InProceedings{phamCIPHER2024,
author={Chau Pham and Boyi Liu and Yingxiang Yang and Zhengyu Chen and Tianyi Liu and Jianbo Yuan and Bryan A. Plummer and Zhaoran Wang and Hongxia Yang},
title={Let Models Speak Ciphers: Multiagent Debate through Embeddings},
booktitle={International Conference on Learning Representations (ICLR)},
year={2024}}
The code was tested with Python 3.10 and PyTorch 2.0.
To install required packages, run the following command:
conda create -n cipher python=3.10 -y
conda activate cipher
pip install -r requirements.txt
All datasets are included in this repo, stored in the data folder.
- MMLU dataset was downloaded from https://github.com/hendrycks/test
- GSM8k dataset was downloaded from https://github.com/openai/grade-school-math
- Arithmetic dataset was generated by running run_generate_dataset() in datasets/arithmetic.py
Models were downloaded from Huggingface model hub. For example, LLaMA2 models can be found here.
First, modify the configs in config.yml:
- root_dir: path to the root directory of the project
- XYZ_model_path: path to XYZ checkpoints, e.g., "pretrained_models/llama2_hf/{model_name}".
Note: keep {model_name} in the path; it will be replaced by the actual model name (e.g., "Llama-2-7b-hf") when running the code.
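The {model_name} placeholder behaves like a Python format field. Below is a minimal sketch of the substitution, assuming the code fills the template with str.format. The key name llama2_model_path and the helper resolve_model_path are hypothetical and not taken from the repo.

```python
# Hypothetical config values mirroring config.yml; the key names are illustrative.
config = {
    "root_dir": "/path/to/project",
    "llama2_model_path": "pretrained_models/llama2_hf/{model_name}",
}


def resolve_model_path(template: str, model_name: str) -> str:
    """Fill the {model_name} placeholder in a checkpoint path template."""
    if "{model_name}" not in template:
        # Without the placeholder, every model would resolve to the same path.
        raise ValueError("path template must keep the {model_name} placeholder")
    return template.format(model_name=model_name)


print(resolve_model_path(config["llama2_model_path"], "Llama-2-7b-hf"))
# prints: pretrained_models/llama2_hf/Llama-2-7b-hf
```

Keeping the placeholder lets a single config entry serve every model size in the same family.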
Assuming there is a pretrained model Llama-2-7b-hf in the pretrained_models/llama2_hf folder and its path is set in config.yml, you can run the following command on 1 GPU (such as an NVIDIA A40/A6000) to check that the code works:
## Quickly test the code on 1 GSM8K question with 2 Llama-2-7b models.
sh scripts/quick_test.sh
The run will output a json file in the output folder, which can be evaluated with analyze_results.py. For example:
python analyze_results.py -m output/log_2024-Jan-19/id_2Llama-2-7b-hf_Llama-2-7b-hf_gsm8k_test_gsm8k_full_seed65533_vec_nsols1_r3_temp0.2-0.8_nques1_2024-Jan-19--23-24-360.85.json
More details on evaluating the results can be found in the Evaluation section.
The main entry point is run_debate.py.
After running, it writes out a json file containing the models' responses. We evaluate the result using analyze_results.py (see the Evaluation section below).
For a description of the parameter settings, run:
python run_debate.py --help
Here are some key arguments and their descriptions for clarity:
- -p (--num_points): Specifies the number of runs (i.e., number of debates) to be executed for this job.
- -b (--batch_size): Sets the batch size for inference.
- -d (--dataset): Specifies the dataset to be used for the job, such as "gsm8k", "mmlu", or "arithmetic".
- --debaters: A list of debaters to participate in the debate. Debaters can be any of the following: falcon-40b-instruct, llama_7B, llama_65B, Llama-2-70b-hf, Llama-2-70b-chat-hf, Llama-2-70b-chat-hf_expert, Llama-2-70b-hf_dummy_expert, etc.
- --initial_prompt_paths: Specifies the path(s) to the initial prompt file(s). If only one file is provided, it will be used for all debaters. If multiple files are provided, the number of files must match the number of debaters.
- --debate_prompt_paths: Specifies the path(s) to the debate prompt file(s). Similar to initial prompts, if one file is provided, it will be used for all debaters, and the number of files must match the number of debaters if multiple files are used.
- --max_new_tokens: Sets the maximum number of new tokens when generating responses.
- --n_rounds: Specifies the number of rounds for the debate. The default is 3 rounds, which includes 1 initial round and 2 debate rounds.
- --data_path: Specifies the path to the dataset, such as "data/gsm/gsm8k_split_test200ques.jsonl".
- --n_questions: Controls the number of questions to be sampled from the dataset. For example, in GSM8K, you can sample 200 questions from the dataset by setting this parameter to 200.
- --point_path: Specifies the path for pre-defined points that will be used for random search or Bayesian optimization. We will search on these points first. One example can be found at probe_points/arithmetic_2debaters_v3.txt.
- --temperatures: Specifies the temperatures for the agents in each round of the debate, as a flat list ordered by agent and then by round. For instance, in a 2-debater debate with 3 rounds, you should provide temperatures as follows: [agent0_round0, agent0_round1, agent0_round2, agent1_round0, agent1_round1, agent1_round2]. Note that this parameter is used when -p is set to 1; if -p > 1, it is disregarded.
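The rules for --initial_prompt_paths, --debate_prompt_paths, and --temperatures can be illustrated with a small sketch. It only mirrors the behavior described above; the helper names expand_prompt_paths and split_temperatures are hypothetical, and the actual parsing in run_debate.py may differ.

```python
from typing import List


def expand_prompt_paths(paths: List[str], n_debaters: int) -> List[str]:
    """Reuse a single prompt file for all debaters, else require one per debater."""
    if len(paths) == 1:
        return paths * n_debaters
    if len(paths) != n_debaters:
        raise ValueError(f"expected 1 or {n_debaters} prompt files, got {len(paths)}")
    return list(paths)


def split_temperatures(flat: List[float], n_debaters: int, n_rounds: int) -> List[List[float]]:
    """Turn the flat, agent-major list into one row of per-round values per agent."""
    if len(flat) != n_debaters * n_rounds:
        raise ValueError(f"expected {n_debaters * n_rounds} temperatures, got {len(flat)}")
    return [flat[a * n_rounds:(a + 1) * n_rounds] for a in range(n_debaters)]


# 2 debaters, 3 rounds: the first three values belong to agent 0, the last three to agent 1.
print(split_temperatures([0.2, 0.4, 0.6, 0.8, 1.0, 1.2], n_debaters=2, n_rounds=3))
# prints: [[0.2, 0.4, 0.6], [0.8, 1.0, 1.2]]
print(expand_prompt_paths(["prompts_v2/gsm8k/init_question_3shot_v3.txt"], n_debaters=2))
```

With --n_rounds 1 and two debaters, the flat list has just two entries, which matches values such as 0.8,0.8.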
These commands serve as examples for conducting the experiments described in the paper. The hyperparameter settings are detailed in Appendix D of the paper.
python run_debate.py -p 5 -d gsm8k --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --point_path probe_points/gsm8k_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 2
python run_debate.py -p 5 -d gsm8k --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_vector_language_v1.txt -v --temperature_max 2.0 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --point_path probe_points/gsm8k_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 2
python run_debate.py -p 5 -d gsm8k --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 400 --n_rounds 1 --n_sols_each_ques 5 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --n_ray_actors 2 --temperatures 0.8,0.8
python run_debate.py -p 3 -d gsm8k --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 400 --n_rounds 1 --n_sols_each_ques 1 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --n_ray_actors 2 --temperatures 0.15,0.15
python run_debate.py -p 5 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 20 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --point_path probe_points/arithmetic_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 1
python run_debate.py -p 5 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 20 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --point_path probe_points/arithmetic_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 1
python run_debate.py -p 5 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 20 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 120 --n_rounds 1 --n_sols_each_ques 5 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --n_ray_actors 1 --temperatures 0.6,0.6
python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 20 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 120 --n_rounds 1 --n_sols_each_ques 1 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --n_ray_actors 1 --temperatures 0.35,0.35
python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_professional_psychology_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_professional_psychology_2debaters_v1.txt --max_new_tokens 400 --n_rounds 3 --temperature_max 2.0 --n_questions 200 --data_path data/mmlu/test/professional_psychology_split_testset_200questions.csv --point_path probe_points/psychology_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 2
python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_professional_psychology_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_professional_psychology_2debaters_vector_language_v1.txt -v --max_new_tokens 400 --n_rounds 3 --temperature_max 2.0 --n_questions 200 --data_path data/mmlu/test/professional_psychology_split_testset_200questions.csv --point_path probe_points/psychology_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 2
python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_professional_psychology_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_professional_psychology_2debaters_v1.txt --max_new_tokens 400 --n_rounds 1 --n_sols_each_ques 5 --temperature_max 2.0 --n_questions 200 --data_path data/mmlu/test/professional_psychology_split_testset_200questions.csv --n_gpus_per_actor 4 --n_ray_actors 2 --temperatures 0.3,0.3
python run_debate.py -p 3 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_professional_psychology_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_professional_psychology_2debaters_v1.txt --max_new_tokens 400 --n_rounds 1 --n_sols_each_ques 1 --temperature_max 2.0 --n_questions 200 --data_path data/mmlu/test/professional_psychology_split_testset_200questions.csv --n_gpus_per_actor 4 --n_ray_actors 2 --temperatures 0.33,0.33
python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_high_school_mathematics_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_high_school_mathematics_2debaters_v1.txt --max_new_tokens 400 --n_rounds 3 --data_path data/mmlu/test/high_school_mathematics_test.csv --point_path probe_points/math_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 2
python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_high_school_mathematics_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_high_school_mathematics_2debaters_vector_language_v1.txt -v --max_new_tokens 400 --n_rounds 3 --data_path data/mmlu/test/high_school_mathematics_test.csv --point_path probe_points/math_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 2
For Majority Voting and Single Answer, you can use commands similar to those for MMLU Psychology.
python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_fomal_logic_v2.txt --debate_prompt_paths prompts_v2/mmlu/debate_fomal_logic_2debaters_v2.txt --temperature_max 2.0 --max_new_tokens 400 --n_rounds 3 --data_path data/mmlu/test/formal_logic_test.csv --n_questions 126 --point_path probe_points/logic_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 2
python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_fomal_logic_v2.txt --debate_prompt_paths prompts_v2/mmlu/debate_fomal_logic_2debaters_vector_language_2.txt -v --temperature_max 2.0 --max_new_tokens 400 --n_rounds 3 --data_path data/mmlu/test/formal_logic_test.csv --n_questions 126 --point_path probe_points/logic_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 2
For LLaMA1-65B, you can use similar commands with --debaters llama_65B,llama_65B; refer to scripts/LLaMA1_scripts.md for more details.
To run this, first store the embedding tables (embed_tokens) of LLaMA2 and LLaMA1 in the emb_weights folder as llama-2-70b-hf.pt and llama_65b.pt, respectively. Afterward, you can proceed by executing the command below, using the Arithmetic dataset as an example:
python run_debate.py -p 4 -d arithmetic --debaters Llama-2-70b-hf,llama_65B -b 8 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_ray_actors 1
Examples:
python run_debate.py -p 6 -d gsm8k --debaters WizardMath-70B-V1.0,WizardMath-70B-V1.0 -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_wizardmath.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_wizardmath_vector_language.txt -v --temperature_max 0.85 --max_new_tokens 500 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --n_ray_actors 1 --seed 232495
python run_debate.py -p 4 -d gsm8k --debaters falcon-40b-instruct,falcon-40b-instruct -b 6 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_v1.txt --temperature_max 0.85 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 400 --n_ray_actors 1
Entropy
python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 16 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 1.5 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --partial_cipher_when_confident --partial_entropy_over_max --partial_thres 1.869
Max
python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 5 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --partial_cipher_when_not_confident --partial_thres 0.8
Entropy reversed
python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 16 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 1.5 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --partial_cipher_when_not_confident --partial_entropy_over_max --partial_thres 1.869
Max reversed
python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 5 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --partial_cipher_when_confident --partial_thres 0.8
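The Entropy and Max variants above appear to differ in how confidence in a next-token distribution is measured before it is compared against --partial_thres. Below is a rough sketch of the two measures, assuming that "Entropy" means the Shannon entropy of the softmax distribution and "Max" means its largest probability. The function names, the thresholds in the example, and the way the threshold is applied are illustrative assumptions, not taken from the repo.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; lower means a more peaked (confident) distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_prob(probs: List[float]) -> float:
    """Largest probability; higher means a more confident distribution."""
    return max(probs)


def is_confident(probs: List[float], thres: float, use_entropy: bool) -> bool:
    """Assumed rule: confident if entropy is below thres, or max prob is at least thres."""
    if use_entropy:
        return entropy(probs) < thres
    return max_prob(probs) >= thres


peaked = softmax([8.0, 1.0, 0.5, 0.1])
flat = softmax([1.0, 1.0, 1.0, 1.0])
print(is_confident(peaked, thres=0.8, use_entropy=False))  # max-probability criterion
print(is_confident(flat, thres=1.0, use_entropy=True))     # entropy criterion
```

The two measures point in opposite directions, which is why the comparison is flipped between the entropy and max branches.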
Example with Natural Language Debate:
python run_debate.py -p 4 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 1.5 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_ray_actors 1 --positional_bias
1. Dummy expert (swap the ground truth (gt) with another gt in the same batch to confuse the other debater)
python run_debate.py -p 3 -d gsm8k --debaters Llama-2-70b-hf,Llama-2-70b-hf_dummy_expert -b 12 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_vector_language_v1.txt -v --temperature_max 0.85 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 400 --n_ray_actors 1
python run_debate.py -p 4 -d gsm8k --debaters Llama-2-7b-hf,Llama-2-7b-hf_expert -b 12 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_vector_language_v1.txt -v --temperature_max 0.7 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 400 --n_ray_actors 1
python run_debate.py -p 4 -d gsm8k --debaters Llama-2-7b-hf,Llama-2-7b-hf -b 12 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_vector_language_v1.txt -v --temperature_max 0.7 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 400 --n_ray_actors 1
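The dummy expert in item 1 above receives a ground truth that belongs to a different question in the same batch. One way to sketch such a swap is a rotation by one position. This is an assumption for illustration: the repo may shuffle differently, and swap_ground_truths is a hypothetical helper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def swap_ground_truths(batch_gts: Sequence[T]) -> List[T]:
    """Give every question the ground truth of the next question in the batch."""
    if len(batch_gts) < 2:
        raise ValueError("need at least 2 questions in a batch to swap ground truths")
    # Rotate left by one so no position keeps its own ground truth.
    return list(batch_gts[1:]) + [batch_gts[0]]


print(swap_ground_truths(["18", "3", "70000", "540"]))
# prints: ['3', '70000', '540', '18']
```

A rotation guarantees that no question keeps its own entry, although two questions with identical answers would still leave the dummy expert accidentally correct.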
After running the inference, you will get a json output, which can be evaluated using analyze_results.py.
Detailed description of the parameters:
- -m: debaters, or path to json output
- -d: dataset, e.g., gsm, mmlu, arithmetic
- -t: type of debate, e.g., human, vector, or both (default)
- -s: subdataset (used when -d is set to mmlu), e.g., psychology, formal_logic, math
Usage
- To evaluate a json output (example 1 below):
python analyze_results.py -m <relative_path_to_the_json_output>
- To evaluate many json files at once in the output folder (examples 2 and 3):
python analyze_results.py -m <debaters> -d <dataset> -t <type> -s <subdataset>
Examples:
- Example 1:
python analyze_results.py -m output/log_2023-Sep-26/Llama-2-70b-hf_Llama-2-70b-hf_mmlu_high_school_mathematics_test_seed75036.json
- Example 2:
python analyze_results.py -m 2Llama-2-70b-hf -d gsm -t human
- Example 3:
python analyze_results.py -m 2Llama-2-70b-hf -d mmlu -s math -t vector
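To evaluate each json file in a single log folder one at a time, the single-file form of the command can be scripted. The sketch below only builds the command lines using the -m option described above. The helper build_eval_commands is hypothetical, and the folder name is taken from Example 1.

```python
import sys
from pathlib import Path
from typing import List


def build_eval_commands(log_dir: str) -> List[List[str]]:
    """Return one `analyze_results.py -m <json>` command per json file in log_dir."""
    json_files = sorted(Path(log_dir).glob("*.json"))
    return [[sys.executable, "analyze_results.py", "-m", str(p)] for p in json_files]


# Print the commands; pass each list to subprocess.run(cmd, check=True)
# from the repo root to actually run the evaluation.
for cmd in build_eval_commands("output/log_2023-Sep-26"):
    print(" ".join(cmd))
```

Building the argument list explicitly avoids shell quoting problems with long output file names.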