# Chain-of-thought prompting experiments on BLOOM-7B
---

This notebook presents experiments with two decoding strategies for chain-of-thought prompting:
- an original chain-of-thought method (CoT) -- [**greedy decoding**](https://arxiv.org/abs/2201.11903) method,
- a kind of ensemble chain-of-thought method (ensemble CoT) -- [**self-consistency**](https://arxiv.org/abs/2203.11171) method.

🎯 The purpose of this mini research project is to compare these strategies on the [distributed version](https://huggingface.co/bigscience/bloom-7b1-petals) of the [BLOOM-7B model](https://huggingface.co/bigscience/bloom-7b1). The distributed version makes it possible to run large language models at home using the [Petals](https://petals.ml/) swarm.

🥇 Evaluation of the two strategies is performed on the arithmetic reasoning benchmark [GSM-8K](https://github.com/openai/grade-school-math) containing grade-school-math problems.

To use this notebook in Colab:
1. Follow this link: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/aebogdanova/BLOOM-CoT-prompting-experiments/blob/master/CoT-prompting-experiments.ipynb)
2. Go to Runtime and change runtime type by selecting the GPU accelerator.

## Preparation
*Installing and importing necessary packages and downloading the model*

In [1]:
%pip install -q petals

We will use pre-defined scripts and will evaluate the model's performance on the pre-generated answers.

In [2]:
!git clone https://github.com/aebogdanova/BLOOM-CoT-prompting-experiments.git
!mv BLOOM-CoT-prompting-experiments/scripts/ .

Cloning into 'BLOOM-CoT-prompting-experiments'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 27 (delta 1), reused 27 (delta 1), pack-reused 0
Unpacking objects: 100% (27/27), 525.25 KiB | 7.84 MiB/s, done.


In [3]:
import os
import random
import numpy as np
import json
import torch
from transformers import BloomTokenizerFast 
from petals import DistributedBloomForCausalLM

In [4]:
SEED = 42

random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

In [5]:
MODEL_NAME = "bigscience/bloom-7b1-petals"

tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)
model = model.cuda()


In [7]:
# uncomment and run the cell to test model's inference

# inputs = tokenizer('A cat in French is "', return_tensors="pt")["input_ids"].cuda()
# outputs = model.generate(inputs, max_new_tokens=3)
# print(tokenizer.decode(outputs[0]))

## Data

*Downloading and preparing the dataset*

For evaluation, the first 500 examples of the GSM-8K test subset are used.

In [6]:
!wget -q https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl

In [7]:
TEST_SIZE = 500

with open("test.jsonl", "r") as gsm_file:
  gsm_lines = gsm_file.readlines()[:TEST_SIZE]

We use the same prompts as in the [paper](https://arxiv.org/abs/2201.11903) on the original chain-of-thought method. Each prompt is built from a set of 8 manually written exemplars.

In [8]:
with open("BLOOM-CoT-prompting-experiments/prompt-exemplars.txt", "r") as exemplars_file:
  exemplars = exemplars_file.read()

In [12]:
# uncomment and run the cell to see exemplars
# print(exemplars)

Now let's prepare inputs for inference and save targets for evaluation.

In [9]:
input_list = []
target_list = []

for line in gsm_lines:
  fields = json.loads(line)
  question = fields["question"].strip()
  full_answer = fields["answer"].split("\n####")
  answer = full_answer[1].strip()
  input_list.append(exemplars + "\n\nQ: " + question + "\nA: ")
  target_list.append(answer)

assert len(input_list) == len(target_list)
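
To make the parsing step above concrete, here is a small sketch of what the split on `"\n####"` yields. The record below is made up for illustration; it only assumes the GSM-8K layout in which the reasoning is followed by a final line of the form `#### <number>`.

```python
import json

# Illustrative record in the GSM-8K JSONL layout (made up for this example,
# not taken from the dataset).
line = json.dumps({
    "question": "Tom has 3 apples and buys 4 more. How many apples does he have? ",
    "answer": "Tom has 3 + 4 = <<3+4=7>>7 apples.\n#### 7",
})

fields = json.loads(line)
# Everything before "\n####" is the worked reasoning, everything after it
# is the final numeric target.
reasoning, final = fields["answer"].split("\n####")

print(repr(reasoning))      # 'Tom has 3 + 4 = <<3+4=7>>7 apples.'
print(repr(final.strip()))  # '7'
```

Only the part after `####` is kept as the target, so evaluation compares final numbers rather than full reasoning chains.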

Here is an example of a single input for the model:

In [10]:
print(random.choice(input_list))

Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
A: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.

Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 

## Experiments

*Running experiments with the two decoding strategies in chain-of-thought prompting*

### Preparation

Before running the inference we should define some parameters of generation: 
- length of generated output, 
- stop tokens if needed,
- generation parameters for the self-consistency method: temperature, top-k, top-p.

An appropriate output length can be estimated from the 90th percentile of the token lengths of the reference answers in the GSM-8K test subset.

In [11]:
full_answers = [json.loads(line)["answer"] for line in gsm_lines]
full_answers_lengths = [len(tokenizer.encode(answer)) for answer in full_answers]

print("90-percentile value:", np.percentile(full_answers_lengths, 90))

90-percentile value: 145.0


Several test generations with the BLOOM-7b1 model showed that it often keeps generating after giving the full answer and goes on to produce another ```"A:"``` token, so we set this token as the stop token for the ```generate()``` function below.

In [12]:
stop_token = tokenizer.encode('A:', return_tensors='pt')[0].cuda()
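
How `run_inference` applies the stop token is defined in the repository scripts and is not shown in this notebook. As a rough illustration only, assuming the goal is to discard everything from the first occurrence of the stop sequence onward, a hypothetical helper `truncate_at_stop` (not part of the repository or of Petals) might look like this. The token ids are toy values, not real BLOOM ids.

```python
def truncate_at_stop(generated_ids, stop_ids):
    """Return generated_ids cut just before the first occurrence of stop_ids.

    Hypothetical helper for illustration; works on plain lists of ints.
    """
    n = len(stop_ids)
    if n == 0:
        return list(generated_ids)
    # Slide a window of length n over the generated ids.
    for start in range(len(generated_ids) - n + 1):
        if list(generated_ids[start:start + n]) == list(stop_ids):
            return list(generated_ids[:start])
    # Stop sequence never appeared: keep the whole output.
    return list(generated_ids)


# Toy ids: pretend [7, 8] encodes "A:" (illustrative values only).
answer_ids = [11, 12, 13, 5, 6, 7, 8, 21, 22]
print(truncate_at_stop(answer_ids, [7, 8]))  # [11, 12, 13, 5, 6]
```

Matching on a sequence rather than a single id matters because a string such as `"A:"` may be encoded as more than one token.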

For the sampling parameters we follow the authors of the [paper](https://arxiv.org/abs/2203.11171) on the self-consistency method, who use ```temperature=0.5``` and ```top-k=40``` for relatively small models.
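
To make these two parameters concrete, here is a minimal sketch of temperature-scaled top-k sampling over a list of logits. `sample_top_k` is a hypothetical helper written for this illustration; the actual sampling happens inside the model's `generate()` call and may differ in detail.

```python
import math
import random


def sample_top_k(logits, temperature=0.5, top_k=40, rng=random):
    """Sample one token index using temperature scaling and top-k filtering."""
    # Keep only the indices of the top_k highest logits.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Dividing by a temperature below 1 sharpens the distribution.
    scaled = [logits[i] / temperature for i in top]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    # Draw one index in proportion to the softmax weights.
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.1, -1.0]
rng = random.Random(42)
# With top_k=2 only indices 0 and 1 can ever be drawn.
print([sample_top_k(toy_logits, top_k=2, rng=rng) for _ in range(5)])
# With top_k=1 sampling reduces to greedy decoding.
print(sample_top_k(toy_logits, top_k=1, rng=rng))  # 0
```

Setting `top_k=1` recovers greedy decoding, which is the difference between the plain CoT run and the self-consistency runs in this notebook.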

Finally, we create directories to save the generated answers. To avoid losing them when the Google Colab session disconnects, it is recommended to mount your Google Drive and save the files there.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive/')

In [13]:
# to create directories in Google Colab temporary storage:
OUTPUT_DIR = "/content/generated-answers"

# to create directories in Google Drive: 
# uncomment and run the cell above and uncomment the line below
# OUTPUT_DIR = "/content/drive/MyDrive/generated-answers"

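# exist_ok=True lets the cell be re-run without raising FileExistsError
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/cot", exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/sc", exist_ok=True)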

Experiments are performed using pre-defined scripts.

In [14]:
from scripts.experiment import run_inference

### Experiment with CoT-prompting

In [16]:
run_inference(input_list=input_list, 
              output_file_path=f"{OUTPUT_DIR}/cot/gsm.json", 
              tokenizer=tokenizer, 
              model=model, 
              stop_token=stop_token, 
              do_sample=False)

### Experiment with self-consistency prompting

Unfortunately, due to the very limited resources of Google Colab GPUs, it was only possible to sample 12 outputs for each prompt.

In [18]:
N_SAMPLES = 12

for i in range(N_SAMPLES):
  run_inference(input_list=input_list, 
                output_file_path=f"{OUTPUT_DIR}/sc/{i+1}-gsm.json", 
                tokenizer=tokenizer, 
                model=model, 
                stop_token=stop_token, 
                do_sample=True)

## Evaluation

*Evaluating results of different decoding strategies*

Evaluation is performed using pre-defined scripts.

In [20]:
from scripts.eval import extract_number, aggregate_answer, evaluate_acc

Let's look at the accuracy of the chain-of-thought method and the self-consistency method. We use the answers previously generated by the model and stored in the repository. If you want to evaluate your own answers, do not run the cell below.

In [21]:
OUTPUT_DIR = "BLOOM-CoT-prompting-experiments/generated-answers"

Accuracy for chain-of-thought prompting:

In [22]:
with open(f"{OUTPUT_DIR}/cot/gsm.json", "r") as cot_results_file:
  cot_results_lines = cot_results_file.readlines()

cot_answers = []
for line in cot_results_lines:
  cot_pred = json.loads(line)["generated answer"].strip()
  cot_answers.append(extract_number(cot_pred))

evaluate_acc(cot_answers, target_list)

Total examples: 500	Correct examples: 21	Accuracy score: 0.042
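
The `extract_number` and `evaluate_acc` helpers come from the repository scripts and are not shown in this notebook. As a rough illustration of the kind of parsing and scoring involved, the sketch below assumes that a generated answer ends with a sentence like "The answer is 39." (as in the exemplars) and takes the last number in the text. `extract_final_number` and `accuracy` are hypothetical stand-ins, not the repository's implementation.

```python
import re


def extract_final_number(text):
    """Return the last number found in text as a string, or None if there is none."""
    # Digits with optional thousands separators and an optional decimal part.
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    # Drop commas so that "1,250" compares equal to "1250".
    return matches[-1].replace(",", "")


def accuracy(predictions, targets):
    """Fraction of predictions that exactly match their targets."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


generated = "In total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39."
print(extract_final_number(generated))             # 39
print(extract_final_number("I am not sure."))      # None
print(accuracy(["39", None, "5"], ["39", "6", "5"]))
```

Exact string matching is strict: a prediction such as `"39.0"` would not match a target of `"39"`, so a real evaluation script may need further normalization.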


Accuracy for self-consistency prompting:

NOTE: following the results of the [paper](https://arxiv.org/abs/2203.11171), we use an unweighted sum (majority vote) as the aggregation strategy for the self-consistency samples.
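
The majority vote mentioned above can be sketched as follows. The repository's `aggregate_answer` is not shown in this notebook, so `majority_vote` below is a hypothetical stand-in. It assumes that samples from which no number could be extracted are represented as `None` and are ignored.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent non-None answer, or None if all are missing."""
    # Assumption: a failed extraction is represented by None and gets no vote.
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    # most_common(1) returns a one-element list of (answer, count) pairs.
    return Counter(valid).most_common(1)[0][0]


# Toy set of sampled answers for a single question.
samples = ["18", "20", "18", None, "18", "7"]
print(majority_vote(samples))       # 18
print(majority_vote([None, None]))  # None
```

Each sample counts equally, which is what "unweighted" means here; no sample probabilities are used.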

In [31]:
files = [file for file in os.listdir(f"{OUTPUT_DIR}/sc/") if file.endswith(".json")]

sc_answers_per_file = []
for file in files:
  preds = []
  with open(f"{OUTPUT_DIR}/sc/"+file, "r") as sc_results_file:
    sc_results_lines = sc_results_file.readlines()
  for line in sc_results_lines:
    sc_pred = json.loads(line)["generated answer"].strip()
    preds.append(extract_number(sc_pred))
  sc_answers_per_file.append(preds)

sc_answers = []
for i in range(0, TEST_SIZE):
  answers_set = [answers[i] for answers in sc_answers_per_file]
  sc_answers.append(aggregate_answer(answers_set))

evaluate_acc(sc_answers, target_list)

Total examples: 500	Correct examples: 25	Accuracy score: 0.05
