chaudatascience/cipher_multiagent_debate

Let Models Speak Ciphers: Multiagent Debate through Embeddings

This repository provides an unofficial PyTorch implementation of the paper Let Models Speak Ciphers: Multiagent Debate through Embeddings.

@InProceedings{phamCIPHER2024,
    author={Chau Pham and Boyi Liu and Yingxiang Yang and Zhengyu Chen and Tianyi Liu and Jianbo Yuan and Bryan A. Plummer and Zhaoran Wang and Hongxia Yang},
    title={Let Models Speak Ciphers: Multiagent Debate through Embeddings},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2024}}

Installation

The code was tested with Python 3.10 and PyTorch 2.0.

To install the required packages, run the following commands:

conda create -n cipher python=3.10 -y
conda activate cipher
pip install -r requirements.txt

Datasets

All datasets are included in this repo, stored in the data folder.

Models

Models were downloaded from the Hugging Face model hub, which hosts, for example, the LLaMA2 models.

Usage

First, you need to modify the configs in config.yml.

  • root_dir: path to the root directory of the project
  • XYZ_model_path: path to XYZ checkpoints, e.g., "pretrained_models/llama2_hf/{model_name}".

Note: keep {model_name} in the path. It is replaced by the actual model name (e.g., "Llama-2-7b-hf") when the code runs.
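The placeholder substitution described above can be pictured with a minimal Python sketch. The helper name resolve_model_path is hypothetical, and the repository's own code may perform the substitution differently; the sketch only illustrates why the literal {model_name} must stay in the configured path.

```python
def resolve_model_path(template: str, model_name: str) -> str:
    """Fill the {model_name} placeholder in a path template (hypothetical helper)."""
    if "{model_name}" not in template:
        # Without the placeholder, every model would resolve to the same path.
        raise ValueError("template must keep the {model_name} placeholder")
    return template.replace("{model_name}", model_name)


# Example template, mirroring the XYZ_model_path entry described above.
template = "pretrained_models/llama2_hf/{model_name}"
print(resolve_model_path(template, "Llama-2-7b-hf"))
# prints: pretrained_models/llama2_hf/Llama-2-7b-hf
```

If the placeholder were removed from config.yml, a helper like this would have no way to tell different checkpoints apart, which is why the sketch rejects such templates.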

Quick test

Assuming a pretrained Llama-2-7b-hf model is stored in the pretrained_models/llama2_hf folder and its path is set in config.yml, you can run the following command on 1 GPU (such as an NVIDIA A40/A6000) to check that the code works:

## Quickly test the code on 1 GSM8K question with 2 Llama-2-7b debaters.
sh scripts/quick_test.sh 

The run writes a JSON file to the output folder, which can be evaluated with analyze_results.py. For example:

python analyze_results.py -m output/log_2024-Jan-19/id_2Llama-2-7b-hf_Llama-2-7b-hf_gsm8k_test_gsm8k_full_seed65533_vec_nsols1_r3_temp0.2-0.8_nques1_2024-Jan-19--23-24-360.85.json

More details on evaluating the results can be found in the Evaluation section.

Run Inference

The main entry point is run_debate.py. When it finishes, it writes a JSON file containing the models' responses. The result is then evaluated with analyze_results.py (see the Evaluation section below).

For a description of the parameter settings, run:

python run_debate.py --help

Here are some key arguments and their descriptions for clarity:

-p (--num_points): Specifies the number of runs (i.e., number of debates) to be executed for this job.

-b (--batch_size): Sets the batch size for inference.

-d (--dataset): Specifies the dataset to be used for the job, such as "gsm8k," "mmlu," or "arithmetic".

--debaters: A list of debaters to participate in the debate. Debaters can be any of the following: falcon-40b-instruct, llama_7B, llama_65B, Llama-2-70b-hf, Llama-2-70b-chat-hf, Llama-2-70b-chat-hf_expert, Llama-2-70b-hf_dummy_expert, etc.

--initial_prompt_paths: Specifies the path(s) to the initial prompt file(s). If only one file is provided, it will be used for all debaters. If multiple files are provided, the number of files must match the number of debaters.

--debate_prompt_paths: Specifies the path(s) to the debate prompt file(s). Similar to initial prompts, if one file is provided, it will be used for all debaters, and the number of files must match the number of debaters if multiple files are used.

--max_new_tokens: Sets the maximum number of new tokens when generating responses.

--n_rounds: Specifies the number of rounds for the debate. The default is 3 rounds, which includes 1 initial round and 2 debate rounds.

--data_path: Specifies the path to the dataset, such as "data/gsm/gsm8k_split_test200ques.jsonl".

--n_questions: Controls the number of questions to be sampled from the dataset. For example, in GSM8K, you can sample 200 questions from the dataset by setting this parameter to 200.

--point_path: Specifies the path for pre-defined points that will be used for random search or Bayesian optimization. We will search on these points first. One example can be found at probe_points/arithmetic_2debaters_v3.txt

--temperatures: Specifies the temperature of each agent in each round of the debate, as a flat list ordered by agent and then by round. For instance, in a 2-debater debate with 3 rounds, provide the temperatures as follows: [agent0_round0, agent0_round1, agent0_round2, agent1_round0, agent1_round1, agent1_round2]. Note that this parameter is used only when -p is set to 1; if -p > 1, it is disregarded.
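As a rough illustration of two of the conventions above, the following Python sketch expands prompt paths to one per debater and splits a flat temperature list into per-agent rounds. Both function names are hypothetical and are not taken from run_debate.py; the sketch only mirrors the rules stated in the argument descriptions.

```python
from typing import List


def broadcast_prompt_paths(paths: List[str], n_debaters: int) -> List[str]:
    """A single path is shared by all debaters; otherwise counts must match."""
    if len(paths) == 1:
        return paths * n_debaters
    if len(paths) != n_debaters:
        raise ValueError(
            f"expected 1 or {n_debaters} prompt paths, got {len(paths)}"
        )
    return list(paths)


def split_temperatures(
    flat: List[float], n_debaters: int, n_rounds: int
) -> List[List[float]]:
    """Turn [a0_r0, a0_r1, ..., a1_r0, ...] into one list of rounds per agent."""
    if len(flat) != n_debaters * n_rounds:
        raise ValueError(
            f"expected {n_debaters * n_rounds} temperatures, got {len(flat)}"
        )
    return [flat[i * n_rounds:(i + 1) * n_rounds] for i in range(n_debaters)]


# 2 debaters, 3 rounds (1 initial round + 2 debate rounds).
per_agent = split_temperatures([0.2, 0.4, 0.6, 0.8, 1.0, 1.2], 2, 3)
print(per_agent)  # prints: [[0.2, 0.4, 0.6], [0.8, 1.0, 1.2]]
```

With this layout, per_agent[i][r] is the temperature of agent i in round r, so a list of the wrong length is caught before any model is loaded.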


Experiments in the paper

These commands serve as examples for conducting the experiments described in the paper. The hyperparameter settings are detailed in Appendix D of the paper.

I. Experiments in Table 1 for LLaMA2-70B

1. GSM8K

1.1. NLD

python run_debate.py -p 5 -d gsm8k  --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --point_path probe_points/gsm8k_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 2

1.2. CIPHER

python run_debate.py -p 5 -d gsm8k --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_vector_language_v1.txt -v --temperature_max 2.0 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --point_path probe_points/gsm8k_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 2

1.3. Majority Voting

python run_debate.py -p 5 -d gsm8k  --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 400 --n_rounds 1 --n_sols_each_ques 5 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --n_ray_actors 2 --temperatures 0.8,0.8
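The command above samples several solutions per question (--n_sols_each_ques 5) in a single round. As a minimal sketch of majority voting in its usual sense, the most frequent final answer among the samples is kept. The helper below is hypothetical and is not the repository's implementation; in particular, how answers are extracted from responses and how ties are broken are assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(answers: Sequence[Hashable]) -> Hashable:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(answers).most_common(1)[0][0]


# Five sampled final answers for one question (made-up values).
print(majority_vote(["18", "20", "18", "18", "7"]))  # prints: 18
```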

1.4. Single Answer

python run_debate.py -p 3 -d gsm8k  --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 400 --n_rounds 1 --n_sols_each_ques 1 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --n_ray_actors 2 --temperatures 0.15,0.15

2. Arithmetic

2.1. NLD

python run_debate.py -p 5 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 20 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --point_path probe_points/arithmetic_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 1

2.2. CIPHER

python run_debate.py -p 5 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 20 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --point_path probe_points/arithmetic_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 1

2.3. Majority Voting

python run_debate.py -p 5 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 20 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 120 --n_rounds 1 --n_sols_each_ques 5 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --n_ray_actors 1 --temperatures 0.6,0.6

2.4. Single Answer

python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 20 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 120 --n_rounds 1 --n_sols_each_ques 1 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --n_ray_actors 1 --temperatures 0.35,0.35

3. MMLU Psychology

3.1. NLD

python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_professional_psychology_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_professional_psychology_2debaters_v1.txt --max_new_tokens 400 --n_rounds 3 --temperature_max 2.0 --n_questions 200 --data_path data/mmlu/test/professional_psychology_split_testset_200questions.csv --point_path probe_points/psychology_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 2

3.2. CIPHER

python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_professional_psychology_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_professional_psychology_2debaters_vector_language_v1.txt -v --max_new_tokens 400 --n_rounds 3 --temperature_max 2.0 --n_questions 200 --data_path data/mmlu/test/professional_psychology_split_testset_200questions.csv --point_path probe_points/psychology_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 2

3.3. Majority Voting

python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_professional_psychology_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_professional_psychology_2debaters_v1.txt --max_new_tokens 400 --n_rounds 1 --n_sols_each_ques 5 --temperature_max 2.0 --n_questions 200 --data_path data/mmlu/test/professional_psychology_split_testset_200questions.csv --n_gpus_per_actor 4 --n_ray_actors 2 --temperatures 0.3,0.3

3.4. Single Answer

python run_debate.py -p 3 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_professional_psychology_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_professional_psychology_2debaters_v1.txt --max_new_tokens 400 --n_rounds 1 --n_sols_each_ques 1 --temperature_max 2.0 --n_questions 200 --data_path data/mmlu/test/professional_psychology_split_testset_200questions.csv --n_gpus_per_actor 4 --n_ray_actors 2 --temperatures 0.33,0.33

4. MMLU Math

4.1. NLD

python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_high_school_mathematics_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_high_school_mathematics_2debaters_v1.txt --max_new_tokens 400 --n_rounds 3 --data_path data/mmlu/test/high_school_mathematics_test.csv --point_path probe_points/math_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 2

4.2. CIPHER

python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_high_school_mathematics_v1.txt --debate_prompt_paths prompts_v2/mmlu/debate_high_school_mathematics_2debaters_vector_language_v1.txt -v --max_new_tokens 400 --n_rounds 3  --data_path data/mmlu/test/high_school_mathematics_test.csv --point_path probe_points/math_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 2

For Majority Voting and Single Answer, you can use similar commands as in MMLU Psychology.

5. MMLU Formal Logic

5.1. NLD

python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_fomal_logic_v2.txt --debate_prompt_paths prompts_v2/mmlu/debate_fomal_logic_2debaters_v2.txt --temperature_max 2.0 --max_new_tokens 400 --n_rounds 3 --data_path data/mmlu/test/formal_logic_test.csv --n_questions 126 --point_path probe_points/logic_NLD.txt --n_gpus_per_actor 4 --n_ray_actors 2

5.2. CIPHER

python run_debate.py -p 5 -d mmlu --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 12 --initial_prompt_paths prompts_v2/mmlu/init_fomal_logic_v2.txt --debate_prompt_paths prompts_v2/mmlu/debate_fomal_logic_2debaters_vector_language_2.txt -v --temperature_max 2.0 --max_new_tokens 400 --n_rounds 3 --data_path data/mmlu/test/formal_logic_test.csv --n_questions 126 --point_path probe_points/logic_cipher.txt --n_gpus_per_actor 4 --n_ray_actors 2

II. Table 1 for LLaMA1-65B

For LLaMA1-65B, you can use similar commands with --debaters llama_65B,llama_65B. Refer to scripts/LLaMA1_scripts.md for more details.

III. LLaMA1 vs LLaMA2 (Table 2)

To run this, you first need to store the embedding tables (embed_tokens) of LLaMA2 and LLaMA1 in the emb_weights folder as llama-2-70b-hf.pt and llama_65b.pt, respectively. Afterward, you can run the command below, which uses the Arithmetic dataset as an example:

python run_debate.py -p 4 -d arithmetic --debaters Llama-2-70b-hf,llama_65B -b 8 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_ray_actors 1

IV. GSM8K on different models (Figure 3)

Examples:

GSM8K WizardMath

python run_debate.py -p 6 -d gsm8k --debaters WizardMath-70B-V1.0,WizardMath-70B-V1.0 -b 8 --initial_prompt_paths prompts_v2/gsm8k/init_wizardmath.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_wizardmath_vector_language.txt -v --temperature_max 0.85 --max_new_tokens 500 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 200 --n_ray_actors 1 --seed 232495 

GSM8K Falcon

python run_debate.py -p 4 -d gsm8k --debaters falcon-40b-instruct,falcon-40b-instruct -b 6 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_v1.txt --temperature_max 0.85 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 400 --n_ray_actors 1 

V. Partial CIPHER (Figure 6)

Entropy

python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 16 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 1.5 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --partial_cipher_when_confident --partial_entropy_over_max --partial_thres 1.869

Max

python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 5 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --partial_cipher_when_not_confident --partial_thres 0.8

Entropy reversed

python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 16 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 1.5 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --partial_cipher_when_not_confident --partial_entropy_over_max --partial_thres 1.869

Max reversed

python run_debate.py -p 3 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 5 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_vector_language_v1.txt -v --temperature_max 2.0 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_gpus_per_actor 4 --partial_cipher_when_confident --partial_thres 0.8

VI. Positional Bias (Figure 7)

Example with Natural Language Debate:

python run_debate.py -p 4 -d arithmetic --debaters Llama-2-70b-hf,Llama-2-70b-hf -b 8 --initial_prompt_paths prompts_v2/arithmetic/init_prompt.txt --debate_prompt_paths prompts_v2/arithmetic/debate_2debaters_v1.txt --temperature_max 1.5 --max_new_tokens 120 --n_rounds 3 --data_path data/arithmetic/arithmetic_test200ques.jsonl --n_questions 200 --n_ray_actors 1 --positional_bias

VII. Upper and Lower Bounds (Figure 8)

1. Dummy expert (swaps each ground truth (gt) with another gt in the same batch to confuse the other debater)

python run_debate.py -p 3 -d gsm8k --debaters Llama-2-70b-hf,Llama-2-70b-hf_dummy_expert -b 12 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_vector_language_v1.txt -v --temperature_max 0.85 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 400 --n_ray_actors 1 
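One simple way to picture the dummy expert's answer swap is to rotate the ground-truth answers within a batch by one position, so that each question is paired with a different question's answer. This is only a sketch under that assumption: the function name is hypothetical, and the repository may choose the swapped answers differently (for example, at random).

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def swap_ground_truths(batch_answers: Sequence[T]) -> List[T]:
    """Give each question the ground truth of the next question in the batch."""
    if len(batch_answers) < 2:
        raise ValueError("need at least two questions to swap answers")
    # Rotate left by one: position i receives the answer from position i + 1.
    return list(batch_answers[1:]) + [batch_answers[0]]


# Ground-truth answers for a batch of three questions (made-up values).
print(swap_ground_truths(["18", "3", "70"]))  # prints: ['3', '70', '18']
```

Note that a rotation only guarantees a wrong answer when neighbouring questions in the batch have different ground truths.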

2. Expert (always return ground truth answers)

python run_debate.py -p 4 -d gsm8k --debaters Llama-2-7b-hf,Llama-2-7b-hf_expert -b 12 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_vector_language_v1.txt -v --temperature_max 0.7 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 400 --n_ray_actors 1 

3. Dummy high temperature (random responses due to extremely high temperature)

python run_debate.py -p 4 -d gsm8k --debaters Llama-2-7b-hf,Llama-2-7b-hf -b 12 --initial_prompt_paths prompts_v2/gsm8k/init_question_3shot_v3.txt --debate_prompt_paths prompts_v2/gsm8k/debate_2debaters_vector_language_v1.txt -v --temperature_max 0.7 --max_new_tokens 400 --n_rounds 3 --data_path data/gsm/gsm8k_split_test200ques.jsonl --n_questions 400 --n_ray_actors 1 

Evaluation

After running inference, you will get a JSON output file, which can be evaluated with analyze_results.py.

Detailed description of the parameters:

-m: debaters, or path to json output

-d: dataset, e.g., gsm, mmlu, arithmetic

-t: type of debate, e.g., human, vector, or both (default)

-s: subdataset (used when -d set to mmlu), e.g., psychology, formal_logic, math

Usage

  • To evaluate a single JSON output (Example 1 below): python analyze_results.py -m <relative_path_to_the_json_output>
  • To evaluate many JSON files in the output folder at once (Examples 2 and 3): python analyze_results.py -m <debaters> -d <dataset> -t <type> -s <subdataset>

Examples:

  • Example 1:
python analyze_results.py -m output/log_2023-Sep-26/Llama-2-70b-hf_Llama-2-70b-hf_mmlu_high_school_mathematics_test_seed75036.json
  • Example 2:
python analyze_results.py -m 2Llama-2-70b-hf -d gsm -t human
  • Example 3:
python analyze_results.py -m 2Llama-2-70b-hf -d mmlu -s math -t vector
