Let Models Speak Ciphers: Multiagent Debate through Embeddings

This repository provides an unofficial PyTorch implementation of the paper "Let Models Speak Ciphers: Multiagent Debate through Embeddings".

@InProceedings{phamCIPHER2024,
    author={Chau Pham and Boyi Liu and Yingxiang Yang and Zhengyu Chen and Tianyi Liu and Jianbo Yuan and Bryan A. Plummer and Zhaoran Wang and Hongxia Yang},
    title={Let Models Speak Ciphers: Multiagent Debate through Embeddings},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2024}}

Installation

The code was tested with Python 3.10 and PyTorch 2.0.

To install required packages, run the following command:

conda create -n cipher python=3.10 -y
conda activate cipher
pip install -r requirements.txt

Datasets

All datasets are included in this repo, stored in the data folder.

The MMLU dataset was downloaded from https://github.com/hendrycks/test
The GSM8K dataset was downloaded from https://github.com/openai/grade-school-math
The Arithmetic dataset was generated by running run_generate_dataset() in datasets/arithmetic.py

Models

Models were downloaded from the Hugging Face model hub, where the LLaMA2 models, for example, are available.

Usage

First, you need to modify the configs in config.yml.

root_dir: path to the root directory of the project
XYZ_model_path: path to XYZ checkpoints, e.g., "pretrained_models/llama2_hf/{model_name}".

Note: keep {model_name} in the path. It will be replaced by the actual model name (e.g., "Llama-2-7b-hf") when running the code.
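
The placeholder substitution described above can be pictured with a short Python sketch. The function name resolve_model_path is hypothetical and not taken from the repository; how the project performs the substitution internally may differ.

```python
# Hypothetical template, mirroring the example value of XYZ_model_path above.
model_path_template = "pretrained_models/llama2_hf/{model_name}"


def resolve_model_path(template: str, model_name: str) -> str:
    """Fill the {model_name} placeholder; fail early if it was removed."""
    if "{model_name}" not in template:
        raise ValueError("template must keep the {model_name} placeholder")
    return template.format(model_name=model_name)


print(resolve_model_path(model_path_template, "Llama-2-7b-hf"))
# prints: pretrained_models/llama2_hf/Llama-2-7b-hf
```

A template with the placeholder removed would always point at the same directory regardless of which debater is requested, which is why the sketch rejects it.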

Quick test

Assuming a pretrained Llama-2-7b-hf model is in the pretrained_models/llama2_hf folder and its path is set in config.yml, you can run the following command on 1 GPU (such as an NVIDIA A40/A6000) to check that the code works:

## Quickly test the code on 1 GSM8K question with 2 Llama-2-7b.
sh scripts/quick_test.sh

The run writes a JSON file to the output folder, which can be evaluated with analyze_results.py. For example:

python analyze_results.py -m output/log_2024-Jan-19/id_2Llama-2-7b-hf_Llama-2-7b-hf_gsm8k_test_gsm8k_full_seed65533_vec_nsols1_r3_temp0.2-0.8_nques1_2024-Jan-19--23-24-360.85.json

More details on evaluating the results can be found in the Evaluation section.

Run Inference

The main entry point is run_debate.py. After running, it writes a JSON file containing the responses of the models. The results are evaluated with analyze_results.py (see the Evaluation section below).

For description of parameter settings, you can run:

python run_debate.py --help

Here are some key arguments and their descriptions for clarity:

-p (--num_points): Specifies the number of runs (i.e., number of debates) to be executed for this job.

-b (--batch_size): Sets the batch size for inference.

-d (--dataset): Specifies the dataset to be used for the job, such as "gsm8k", "mmlu", or "arithmetic".

--debaters: A list of debaters to participate in the debate. Debaters can be any of the following: falcon-40b-instruct, llama_7B, llama_65B, Llama-2-70b-hf, Llama-2-70b-chat-hf, Llama-2-70b-chat-hf_expert, Llama-2-70b-hf_dummy_expert, etc.

--initial_prompt_paths: Specifies the path(s) to the initial prompt file(s). If only one file is provided, it will be used for all debaters. If multiple files are provided, the number of files must match the number of debaters.

--debate_prompt_paths: Specifies the path(s) to the debate prompt file(s). Similar to initial prompts, if one file is provided, it will be used for all debaters, and the number of files must match the number of debaters if multiple files are used.

--max_new_tokens: Sets the maximum number of new tokens when generating responses.

--n_rounds: Specifies the number of rounds for the debate. The default is 3 rounds, which includes 1 initial round and 2 debate rounds.

--data_path: Specifies the path to the dataset, such as "data/gsm/gsm8k_split_test200ques.jsonl".

--n_questions: Controls the number of questions to be sampled from the dataset. For example, in GSM8K, you can sample 200 questions from the dataset by setting this parameter to 200.

--point_path: Specifies the path for pre-defined points that will be used for random search or Bayesian optimization. We will search on these points first. One example can be found at probe_points/arithmetic_2debaters_v3.txt

--temperatures: Specifies the temperatures for the agents participating in each round of the debate. It is a list of temperatures, one per agent per round. For instance, in a 2-debater debate with 3 rounds, you should provide temperatures as follows: [agent0_round0, agent0_round1, agent0_round2, agent1_round0, agent1_round1, agent1_round2]. Note that this parameter is used when -p is set to 1. If -p > 1, this parameter is disregarded.
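
Two of the rules above (sharing a single prompt file across debaters, and the agent-major ordering of --temperatures) can be illustrated with a short Python sketch. The helper names broadcast_prompt_paths and split_temperatures are hypothetical, and the comma-separated input format is assumed from the example commands in this README; the actual parsing in run_debate.py may differ.

```python
from typing import List


def broadcast_prompt_paths(paths: List[str], n_debaters: int) -> List[str]:
    """One path is shared by all debaters; otherwise the counts must match."""
    if len(paths) == 1:
        return paths * n_debaters
    if len(paths) != n_debaters:
        raise ValueError(
            f"expected 1 or {n_debaters} prompt files, got {len(paths)}"
        )
    return list(paths)


def split_temperatures(
    temperatures: str, n_debaters: int, n_rounds: int
) -> List[List[float]]:
    """Split a comma-separated, agent-major list into one row per agent."""
    values = [float(t) for t in temperatures.split(",")]
    if len(values) != n_debaters * n_rounds:
        raise ValueError(
            f"expected {n_debaters * n_rounds} temperatures, got {len(values)}"
        )
    # Row a holds agent a's temperatures for rounds 0 .. n_rounds - 1.
    return [values[a * n_rounds:(a + 1) * n_rounds] for a in range(n_debaters)]


# 2 debaters, 3 rounds: agent 0's three rounds come first, then agent 1's.
per_agent = split_temperatures("0.2,0.4,0.6,0.3,0.5,0.8", n_debaters=2, n_rounds=3)
print(per_agent[1][2])  # agent 1, round 2 -> prints 0.8
```

With a single round, as in the Majority Voting and Single Answer commands, the same layout reduces to one temperature per debater.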

Experiments in the paper

These commands serve as examples for conducting the experiments described in the paper. The hyperparameter settings are detailed in Appendix D of the paper.

I. Experiments in Table 1 for LLaMA2-70B

1. GSM8K

1.1. NLD

python run_debate.py -p 5 -d gsm8k  --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --point_path probe_points/gsm8k_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 2

1.2. CIPHER

python run_debate.py -p 5 -d gsm8k --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_vector_language_v1.txt -v --temperature_max 2.0 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --point_path probe_points/gsm8k_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 2

1.3. Majority Voting

python run_debate.py -p 5 -d gsm8k  --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 400 --n_rounds 1 --n_sols_each_ques 5 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --n_ray_actors 2 --temperatures 0.8,0.8
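
The command above samples several solutions per question (--n_sols_each_ques 5) in a single round. How this repository aggregates them is not shown in this README; the generic idea of majority voting can be sketched in Python as follows. The function name majority_vote and the sample answers are hypothetical.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


# Five made-up final answers extracted from five sampled solutions.
sampled = ["18", "20", "18", "18", "7"]
print(majority_vote(sampled))  # prints 18
```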

1.4. Single Answer

python run_debate.py -p 3 -d gsm8k  --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 400 --n_rounds 1 --n_sols_each_ques 1 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --n_ray_actors 2 --temperatures 0.15,0.15

2. Arithmetic

2.1. NLD

python run_debate.py -p 5 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 20 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --point_path probe_points/arithmetic_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 1

2.2. CIPHER

python run_debate.py -p 5 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 20 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --point_path probe_points/arithmetic_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 1

2.3. Majority Voting

python run_debate.py -p 5 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 20 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 120 --n_rounds 1 --n_sols_each_ques 5 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --n_ray_actors 1 --temperatures 0.6,0.6

2.4. Single Answer

python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 20 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 120 --n_rounds 1 --n_sols_each_ques 1 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --n_ray_actors 1 --temperatures 0.35,0.35

3. MMLU Psychology

3.1. NLD

python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_professional_psychology_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_professional_psychology_2debaters_v1.txt --max_new_tokens 400 --n_rounds 3 --temperature_max 2.0 --n_questions 200 --data_path data/mmlu/test/professional_psychology_split_testset_200questions.csv --point_path probe_points/psychology_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 2

3.2. CIPHER

python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_professional_psychology_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_professional_psychology_2debaters_vector_language_v1.txt -v --max_new_tokens 400 --n_rounds 3 --temperature_max 2.0 --n_questions 200 --data_path data/mmlu/test/professional_psychology_split_testset_200questions.csv --point_path probe_points/psychology_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 2

3.3. Majority Voting

python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_professional_psychology_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_professional_psychology_2debaters_v1.txt --max_new_tokens 400 --n_rounds 1 --n_sols_each_ques 5 --temperature_max 2.0 --n_questions 200 --data_path data/mmlu/test/professional_psychology_split_testset_200questions.csv --n_gpus_per_actor 4 --n_ray_actors 2 --temperatures 0.3,0.3

3.4. Single Answer

python run_debate.py -p 3 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_professional_psychology_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_professional_psychology_2debaters_v1.txt --max_new_tokens 400 --n_rounds 1 --n_sols_each_ques 1 --temperature_max 2.0 --n_questions 200 --data_path data/mmlu/test/professional_psychology_split_testset_200questions.csv --n_gpus_per_actor 4 --n_ray_actors 2 --temperatures 0.33,0.33

4. MMLU Math

4.1. NLD

python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_high_school_mathematics_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_high_school_mathematics_2debaters_v1.txt --max_new_tokens 400 --n_rounds 3 --data_path data/mmlu/test/high_school_mathematics_test.csv --point_path probe_points/math_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 2

4.2. CIPHER

python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_high_school_mathematics_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_high_school_mathematics_2debaters_vector_language_v1.txt -v --max_new_tokens 400 --n_rounds 3  --data_path data/mmlu/test/high_school_mathematics_test.csv --point_path probe_points/math_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 2

For Majority Voting and Single Answer, you can use similar commands as in MMLU Psychology.

5. MMLU Formal Logic

5.1. NLD

python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_fomal_logic_v2.txt --debate_prompt_paths prompts_v2/mmlu/debate_fomal_logic_2debaters_v2.txt --temperature_max 2.0 --max_new_tokens 400 --n_rounds 3 --data_path data/mmlu/test/formal_logic_test.csv --n_questions 126 --point_path probe_points/logic_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 2

5.2. CIPHER

python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_fomal_logic_v2.txt --debate_prompt_paths prompts_v2/mmlu/debate_fomal_logic_2debaters_vector_language_2.txt -v --temperature_max 2.0 --max_new_tokens 400 --n_rounds 3 --data_path data/mmlu/test/formal_logic_test.csv --n_questions 126 --point_path probe_points/logic_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 2

II. Table 1 for LLaMA1-65B

For LLaMA1-65B, you can use similar commands with --debaters llama_65B,llama_65B. Refer to scripts/LLaMA1_scripts.md for more details.

III. LLaMA1 vs LLaMA2 (Table 2)

To run this, first store the embedding tables (embed_tokens) of LLaMA1 and LLaMA2 in the emb_weights folder as llama_65b.pt and llama-2-70b-hf.pt, respectively. Afterward, you can proceed by executing the command below, using the Arithmetic dataset as an example:

python run_debate.py -p 4 -d arithmetic --debaters Llama-2-70b-hf,llama_65B -b 8 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_ray_actors 1

IV. GSM8K on different models (Figure 3)

Examples:

GSM8K WizardMath

python run_debate.py -p 6 -d gsm8k --debaters WizardMath-70B-V1.0,WizardMath-70B-V1.0 -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_wizardmath.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_wizardmath_vector_language.txt -v --temperature_max 0.85 --max_new_tokens 500 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --n_ray_actors 1 --seed 232495

GSM8K Falcon

python run_debate.py -p 4 -d gsm8k --debaters falcon-40b-instruct,falcon-40b-instruct -b 6 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_v1.txt --temperature_max 0.85 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 400 --n_ray_actors 1

V. Partial CIPHER (Figure 6)

Entropy

python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 16 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 1.5 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --partial_cipher_when_confident --partial_entropy_over_max --partial_thres 1.869

Max

python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 5 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --partial_cipher_when_not_confident --partial_thres 0.8

Entropy reversed

python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 16 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 1.5 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --partial_cipher_when_not_confident --partial_entropy_over_max --partial_thres 1.869

Max reversed

python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 5 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --partial_cipher_when_confident --partial_thres 0.8

VI. Positional Bias (Figure 7)

Example with Natural Language Debate:

python run_debate.py -p 4 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 1.5 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_ray_actors 1 --positional_bias

VII. Upper and Lower Bounds (Figure 8)

1. Dummy expert (swap each ground truth (gt) with another gt in the same batch to confuse the other debater)

python run_debate.py -p 3 -d gsm8k --debaters Llama-2-70b-hf,Llama-2-70b-hf_dummy_expert -b 12 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_vector_language_v1.txt -v --temperature_max 0.85 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 400 --n_ray_actors 1
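
The swap described in the heading of this item can be pictured with a short Python sketch. Rotating the batch by one position is an assumption made here for illustration; the repository may pick the replacement answer differently, and the name swap_ground_truths is hypothetical.

```python
from typing import List


def swap_ground_truths(ground_truths: List[str]) -> List[str]:
    """Rotate answers by one so each question gets another question's answer."""
    if len(ground_truths) < 2:
        raise ValueError("need at least two questions in the batch to swap")
    return ground_truths[1:] + ground_truths[:1]


# Made-up ground-truth answers for a batch of three questions.
batch_gt = ["18", "3", "70000"]
print(swap_ground_truths(batch_gt))  # prints ['3', '70000', '18']
```

If two questions in a batch happen to share the same answer, a rotated answer can still be correct, so a sketch like this only guarantees a wrong answer when the batch answers are distinct.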

2. Expert (always return ground truth answers)

python run_debate.py -p 4 -d gsm8k --debaters Llama-2-7b-hf,Llama-2-7b-hf_expert -b 12 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_vector_language_v1.txt -v --temperature_max 0.7 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 400 --n_ray_actors 1

3. Dummy high temperature (random responses due to extremely high temperature)

python run_debate.py -p 4 -d gsm8k --debaters Llama-2-7b-hf,Llama-2-7b-hf -b 12 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_vector_language_v1.txt -v --temperature_max 0.7 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 400 --n_ray_actors 1

Evaluation

After running inference, you will get a JSON output file, which can be evaluated with analyze_results.py.

Detailed description of the parameters:

-m: debaters, or path to json output

-d: dataset, e.g., gsm, mmlu, arithmetic

-t: type of debate, e.g., human, vector, or both (default)

-s: subdataset (used when -d set to mmlu), e.g., psychology, formal_logic, math

Usage

To evaluate a single JSON output (example 1 below): python analyze_results.py -m <relative_path_to_the_json_output>
To evaluate many JSON files at once in the output folder (examples 2 and 3): python analyze_results.py -m <debaters> -d <dataset> -t <type> -s <subdataset>

Examples:

Example 1:

python analyze_results.py -m output/log_2023-Sep-26/Llama-2-70b-hf_Llama-2-70b-hf_mmlu_high_school_mathematics_test_seed75036.json

Example 2:

python analyze_results.py -m 2Llama-2-70b-hf -d gsm -t human

Example 3:

python analyze_results.py -m 2Llama-2-70b-hf -d mmlu -s math -t vector
