# COMP3361 Part 1: Building a Transformer Encoder

Note: You should finish your code solution of Part 1 & 2 with A2p12.tgz. For Q2 & Q3, you should include your writeup in this notebook.

## Q2:

From the produced attention masks, we can observe that the heatmap plots are the brightest when the character of the row is the same as the character of the column. For example, the character `d` looks for its previous occurrences in the input character stream, thus giving higher attention scores for the `d`s occurring before the current `d` character. This suggests that the model is doing its job to capture the number of letters of the same type preceding that letter, so that the model is more likely to output 2 (for occurring more than twice) if the output probability distribution has two or more large values (as in the case of the spacebar), and 1 if the output distribution has one large value (as in the case `a` in the 5th plotted example).

## Q3:

Not all attention masks can fit the expected pattern.
In the less clear attention masks, the attention scores are all (close to) zero except for one entry that is close to one, as suggested by the bright dot in each row. The bright dots for each row are concentrated in a few specific columns, where those columns represent the characters that are frequently occurring in the character stream, such as the character e in the 4th plotted example. Based on the observations above, the less clear attention masks is probably overfitting to the noise in the character stream data.

# COMP3361 Part 3: Generation with Large Language Model

## Load model and tokenizer

In this section, we will use [CodeLlama-7B](https://huggingface.co/codellama/CodeLlama-7b-hf) as the language model.

In [1]:
!pip install transformers datasets evaluate accelerate bitsandbytes

Collecting datasets
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m536.7/536.7 kB[0m [31m5.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m12.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m280.0/280.0 kB[0m [31m28.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m105.0/105.0 MB[0m [31m7.8 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0

In [2]:
from abc import ABC, abstractmethod
from typing import List, Dict, Any
import os
import json
import locale
import math
import psutil
import evaluate
from datasets import load_dataset
from tqdm import tqdm
import re
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

locale.getpreferredencoding = lambda: "UTF-8"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ['LC_ALL'] = 'en_US.UTF-8'

In [3]:
def empty_gpu_pt_cache() -> None:
    n_gpu = torch.cuda.device_count()
    for gpu_id in range(n_gpu):
        torch.cuda.set_device(gpu_id)
        torch.cuda.empty_cache()

In [4]:
# use this in colab
def get_device_name() -> str:
    if not torch.cuda.is_available():
        return 'cpu'
    return 'cuda'

In [5]:
empty_gpu_pt_cache()

In [6]:
class LLM(object):
    def __init__(self, model_name="codellama/CodeLlama-7b-hf"):
        bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
        self.model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', quantization_config=bnb_config)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def generate(self, prompts: List[str], **kwargs) -> List[str]:
        empty_gpu_pt_cache()
        device = get_device_name()
        model_inputs = self.tokenizer(prompts, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            generated_ids = self.model.generate(**model_inputs, **kwargs)
        return self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

In [7]:
llm = LLM()

llm.generate(["A list of colors: red, blue", "Portugal is"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/637 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/749 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


['A list of colors: red, blue, green, yellow, orange, purple, brown',
 'Portugal is a small town in the state of New York, United']

In [8]:
class Evaluator(ABC):
    def __init__(self, llm):
        self.llm = llm

    @abstractmethod
    def load_data(self):
        pass

    @abstractmethod
    def build_prompts(self):
        pass

    @abstractmethod
    def postprocess_output(self, output: str) -> str:
        pass

    def generate_completions(self, prompts: List[str], batch_size=4, **kwargs) -> List[str]:
        completions = []
        n_prompts = len(prompts)
        n_iter = math.ceil(n_prompts / batch_size)
        for i in tqdm(range(n_iter)):
            batch_prompts = prompts[batch_size*i : min(n_prompts, batch_size*(i+1))]
            batch_completions = self.llm.generate(batch_prompts, **kwargs)
            completions.extend(batch_completions)
        return completions

    def evaluate(self, evalset_file, batch_size=4, save_dir="outputs", max_new_tokens=128, **kwargs):
        dataset = self.load_data(evalset_file)
        prompts = self.build_prompts(dataset)
        outputs = self.generate_completions(prompts, batch_size=batch_size, max_new_tokens=max_new_tokens, **kwargs)

        predictions = []
        for i, (example, prompt, output) in tqdm(enumerate(zip(dataset, prompts, outputs))):
            prediction = {
                "task_id": example.get("task_id", f"task_{i}"),
                "prompt": prompt,
                "completion": self.postprocess_output(output)
            }
            predictions.append(prediction)

        # Save predictions to file
        os.makedirs(save_dir, exist_ok=True)
        prediction_save_path = os.path.join(save_dir, f"{type(self).__name__}_predictions.jsonl")
        with open(prediction_save_path, "w") as fout:
            for pred in predictions:
                fout.write(json.dumps(pred) + "\n")

        # Calculate metrics and print results
        results = self.calculate_metrics(predictions, dataset)
        print(f"Results for {type(self).__name__}: {results}")

    @abstractmethod
    def calculate_metrics(self):
        pass

## Zero-shot Code Generation

In [9]:
!mkdir -p human_eval
!wget -O human_eval/__init__.py https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/__init__.py
!wget -O human_eval/data.py human_eval https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/data.py
!wget -O human_eval/evaluation.py https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/evaluation.py
!wget -O human_eval/execution.py human_eval https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/execution.py

!mkdir -p data/humaneval
!wget -O data/humaneval/HumanEval.jsonl.gz https://github.com/openai/human-eval/raw/master/data/HumanEval.jsonl.gz

--2024-02-24 04:56:45--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/__init__.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/plain]
Saving to: ‘human_eval/__init__.py’

human_eval/__init__     [ <=>                ]       0  --.-KB/s    in 0s      

2024-02-24 04:56:45 (0.00 B/s) - ‘human_eval/__init__.py’ saved [0/0]

--2024-02-24 04:56:45--  http://human_eval/
Resolving human_eval (human_eval)... failed: Name or service not known.
wget: unable to resolve host address ‘human_eval’
--2024-02-24 04:56:45--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/data.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.

In [10]:
from human_eval.data import read_problems
from human_eval.evaluation import evaluate_functional_correctness

class HumanEvalEvaluator(Evaluator):
    def load_data(self, evalset_file="data/humaneval/HumanEval.jsonl.gz") -> List[Dict[str, Any]]:
        """
        Load the humaneval dataset
        :param evalset_file: path to the humaneval dataset file
        :return: list of examples
        """
        return list(read_problems(evalset_file).values())

    def build_prompts(self, dataset) -> List[str]:
        """
        Build zero-shot prompts from the humaneval dataset.
        """
        prompts = [example["prompt"] for example in dataset]
        return prompts

    def postprocess_output(self, output: str) -> str:
        stop_sequences = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]
        matches = re.finditer('|'.join(stop_sequences), output)
        for _ in range(2):
            second_match = next(matches, None)
        try:
            idx = second_match.start()
            return output[:idx].strip('\n')
        except AttributeError: # some output codes are too long, truncated at max_new_tokens
            return output

    def calculate_metrics(self, predictions, dataset):
        pass_at_k_results = evaluate_functional_correctness(
            sample_file=os.path.join("outputs", f"{type(self).__name__}_predictions.jsonl"),
            k=[1],
            problems={example["task_id"]: example for example in dataset},
            n_workers=64
        )
        return pass_at_k_results

In [11]:
empty_gpu_pt_cache()

In [12]:
human_eval_evaluator = HumanEvalEvaluator(llm)
human_eval_evaluator.evaluate(evalset_file="data/humaneval/HumanEval.jsonl.gz", batch_size=8)

  0%|          | 0/21 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  5%|▍         | 1/21 [01:30<30:03, 90.17s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 10%|▉         | 2/21 [03:02<29:00, 91.61s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 14%|█▍        | 3/21 [04:35<27:41, 92.29s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 19%|█▉        | 4/21 [06:06<25:54, 91.47s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 24%|██▍       | 5/21 [07:44<25:06, 94.14s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 29%|██▊       | 6/21 [09:18<23:30, 94.01s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 33%|███▎      | 7/21 [10:51<21:48, 93.49s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 38%|███▊      | 8/21 [12:23<20:09, 93.05s/it]Setting `pad_token_id` to `eos_token_id`:2 for ope

Reading samples...


164it [00:27,  6.07it/s]


Running test suites...


100%|██████████| 164/164 [00:01<00:00, 119.51it/s]


Writing results to outputs/HumanEvalEvaluator_predictions.jsonl_results.jsonl...


100%|██████████| 164/164 [00:00<00:00, 21136.49it/s]

Results for HumanEvalEvaluator: {'pass@1': 0.2926829268292683}





Obtained execution accuracy = 29.27%, which is close enough to 30%.

In [13]:
del human_eval_evaluator

## Few-shot Math Reasoning

In [14]:
GSM_EXAMPLARS = [
    {
        "question": "There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?",
        "cot_answer": "There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. So the answer is 6.",
        "pot_answer": "def solution():\n    \"\"\"There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\"\"\"\n    trees_initial = 15\n    trees_after = 21\n    trees_added = trees_after - trees_initial\n    result = trees_added\n    return result",
        "short_answer": "6"
    },
    {
        "question": "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?",
        "cot_answer": "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. So the answer is 5.",
        "pot_answer": "def solution():\n    \"\"\"If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\"\"\"\n    cars_initial = 3\n    cars_arrived = 2\n    total_cars = cars_initial + cars_arrived\n    result = total_cars\n    return result",
        "short_answer": "5"
    },
    {
        "question": "Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?",
        "cot_answer": "Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. So the answer is 39.",
        "pot_answer": "def solution():\n    \"\"\"Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\"\"\"\n    leah_chocolates = 32\n    sister_chocolates = 42\n    total_chocolates = leah_chocolates + sister_chocolates\n    chocolates_eaten = 35\n    chocolates_left = total_chocolates - chocolates_eaten\n    result = chocolates_left\n    return result",
        "short_answer": "39"
    },
    {
        "question": "Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?",
        "cot_answer": "Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. So the answer is 8.",
        "pot_answer": "def solution():\n    \"\"\"Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\"\"\"\n    jason_lollipops_initial = 20\n    jason_lollipops_after = 12\n    denny_lollipops = jason_lollipops_initial - jason_lollipops_after\n    result = denny_lollipops\n    return result",
        "short_answer": "8"
    },
    {
        "question": "Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?",
        "cot_answer": "Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. So the answer is 9.",
        "pot_answer": "def solution():\n    \"\"\"Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\"\"\"\n    toys_initial = 5\n    mom_toys = 2\n    dad_toys = 2\n    total_received = mom_toys + dad_toys\n    total_toys = toys_initial + total_received\n    result = total_toys\n    return result",
        "short_answer": "9"
    },
    {
        "question": "There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?",
        "cot_answer": "There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. So the answer is 29.",
        "pot_answer": "def solution():\n    \"\"\"Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\"\"\"\n    toys_initial = 5\n    mom_toys = 2\n    dad_toys = 2\n    total_received = mom_toys + dad_toys\n    total_toys = toys_initial + total_received\n    result = total_toys\n    return result",
        "short_answer": "29"
    },
    {
        "question": "Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?",
        "cot_answer": "Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. So the answer is 33.",
        "pot_answer": "def solution():\n    \"\"\"Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?\"\"\"\n    golf_balls_initial = 58\n    golf_balls_lost_tuesday = 23\n    golf_balls_lost_wednesday = 2\n    golf_balls_left = golf_balls_initial - golf_balls_lost_tuesday - golf_balls_lost_wednesday\n    result = golf_balls_left\n    return result",
        "short_answer": "33"
    },
    {
        "question": "Olivia has $23. She bought five bagels for $3 each. How much money does she have left?",
        "cot_answer": "Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. So the answer is 8.",
        "pot_answer": "def solution():\n    \"\"\"Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\"\"\"\n    money_initial = 23\n    bagels = 5\n    bagel_cost = 3\n    money_spent = bagels * bagel_cost\n    money_left = money_initial - money_spent\n    result = money_left\n    return result",
        "short_answer": "8"
    }
]

In [15]:
class GSM8KEvaluator(Evaluator):
    def load_data(self, evalset_file="gsm8k") -> List[Dict[str, Any]]:
        """
        Load the GSM8K dataset https://huggingface.co/datasets/gsm8k with Huggingface datasets library
        Load the first 100 examples from the test split in main subset.
        """
        output_dataset = []
        dataset = load_dataset(evalset_file, 'main', split='test[:100]')
        questions = dataset['question']
        answers = [str(re.search(r'#### (\d+)', answer).group(1)) for answer in dataset['answer']]
        example = {}
        return [{'task_id': f'GSM8K/{i}',
                 'question': questions[i],
                 'answer': answers[i]} for i in range(len(dataset))]

    def build_prompts(self, dataset, n_shot=8, demos=GSM_EXAMPLARS):
        """
        Build few-shot prompts from the GSM8K dataset. Use
        :param dataset: list of examples
        :param n_shot: number of examples to use for few-shot learning
        :param demos: list of demonstrator examples
        :return: list of prompts
        """
        prompts = []
        for example in dataset:
            prompt = 'Answer the following questions. '
            for examplar in demos[:n_shot]:
                prompt += f"\nQuestion: {examplar['question']} \nAnswer: {examplar['short_answer']}"
            question = example['question']
            prompt += f'\nQuestion: {question} \nAnswer: '
            prompts.append(prompt)
        return prompts

    def postprocess_output(self, output: str) -> str:
        """
        Postprocess the output from the language model.
        """
        return str(re.findall(r'Answer: (\d+,?\d*)', output)[8]).replace(',', '')

    def calculate_metrics(self, predictions, dataset):
        """
        Calculate metrics for the GSM8K dataset
        """
        accuracy = 0
        assert len(predictions) == len(dataset), \
            f'found mismatch in prediction length: {len(predictions)} with no. of rows in dataset: {dataset.num_rows}'
        for i, (prediction, example) in tqdm(enumerate(zip(predictions, dataset))):
            predicted = prediction['completion']
            ground_truth = example['answer']
            if predicted == ground_truth:
                accuracy += 1 / len(predictions)
        return accuracy

In [16]:
empty_gpu_pt_cache()

In [17]:
gsm8k_evaluator = GSM8KEvaluator(llm)
gsm8k_evaluator.evaluate(evalset_file="gsm8k", batch_size=8)

Downloading readme:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

  0%|          | 0/13 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  8%|▊         | 1/13 [01:53<22:44, 113.68s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 15%|█▌        | 2/13 [03:45<20:40, 112.74s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 23%|██▎       | 3/13 [05:35<18:33, 111.39s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 31%|███       | 4/13 [07:25<16:37, 110.86s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 38%|███▊      | 5/13 [09:15<14:44, 110.59s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 46%|████▌     | 6/13 [11:09<13:02, 111.79s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 54%|█████▍    | 7/13 [13:03<11:14, 112.34s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 62%|██████▏   | 8/13 [14:54<09:19, 111.97s/it]Setting `pad_token_id` to `eos_token_id`:2

Results for GSM8KEvaluator: 0.060000000000000005





In [18]:
del gsm8k_evaluator

## Few-shot Chain-of Thought Math Reasoning

In [19]:
class GSM8KCoTEvaluator(GSM8KEvaluator):
    def build_prompts(self, dataset, n_shot=8, demos=GSM_EXAMPLARS):
        """
        Build few-shot prompts from the GSM8K dataset. Use
        :param dataset: list of examples
        :param n_shot: number of examples to use for few-shot learning
        :param demos: list of demonstrator examples
        :return: list of prompts
        """
        prompts = []
        for example in dataset:
            prompt = 'Answer the following questions. Respond with "So the answer is ##" for the last sentence. '
            for examplar in demos[:n_shot]:
                prompt += f"\nQuestion: {examplar['question']} \nAnswer: {examplar['cot_answer']}"
            question = example['question']
            prompt += f'\nQuestion: {question} \nAnswer: '
            prompts.append(prompt)
        return prompts

    def postprocess_output(self, output: str) -> str:
        """
        Postprocess the output from the language model.
        """
        try:
            raw_number = str(re.findall(r'So the answer is[\d\+\-\*\/\= ]* [\$\-]?(\d+[\.,]?\d*)', output)[8]).replace(',', '')
            processed_number = re.sub(r'\.$', '', raw_number)
            return processed_number
        except IndexError: # not following the usual format, will get wrong answer as no response detected
            return ''

In [20]:
empty_gpu_pt_cache()

In [21]:
gsm8k_cot_evaluator = GSM8KCoTEvaluator(llm)
gsm8k_cot_evaluator.evaluate(evalset_file="gsm8k", batch_size=8)

  0%|          | 0/13 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  8%|▊         | 1/13 [02:06<25:14, 126.20s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 15%|█▌        | 2/13 [04:16<23:32, 128.38s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 23%|██▎       | 3/13 [06:24<21:22, 128.26s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 31%|███       | 4/13 [08:32<19:15, 128.37s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 38%|███▊      | 5/13 [10:41<17:06, 128.34s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 46%|████▌     | 6/13 [12:52<15:06, 129.56s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 54%|█████▍    | 7/13 [15:04<13:00, 130.14s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 62%|██████▏   | 8/13 [17:13<10:49, 129.82s/it]Setting `pad_token_id` to `eos_token_id`:2

Results for GSM8KCoTEvaluator: 0.08





In [22]:
del gsm8k_cot_evaluator

## Few-shot Program-of Thought Math Reasoning

In [23]:
!pip install timeout-decorator Pebble
!wget -O python_executor.py https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/python_executor.py

Collecting timeout-decorator
  Downloading timeout-decorator-0.5.0.tar.gz (4.8 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting Pebble
  Downloading Pebble-5.0.6-py3-none-any.whl (30 kB)
Building wheels for collected packages: timeout-decorator
  Building wheel for timeout-decorator (setup.py) ... [?25l[?25hdone
  Created wheel for timeout-decorator: filename=timeout_decorator-0.5.0-py3-none-any.whl size=5004 sha256=9615c0bedff959db8ed9883910ff98cf87b0a626c7221d950cc9a92e98e6ff86
  Stored in directory: /root/.cache/pip/wheels/68/2f/bc/76f1192d474666d41ae6f09813fccbd00fe3f07e8261c4cff5
Successfully built timeout-decorator
Installing collected packages: timeout-decorator, Pebble
Successfully installed Pebble-5.0.6 timeout-decorator-0.5.0
--2024-02-25 01:32:01--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/python_executor.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199

In [24]:
from python_executor import PythonExecutor
executor = PythonExecutor(get_answer_expr='solution()')

codes = [
    "def solution():\n    return 1 + 1",
    "def solution():\n    return 2 * 2",
]

predictions = []
runtime_errors = []
for pred, err in executor.batch_apply(codes):
    predictions.append(str(pred))
    runtime_errors.append(str(err['exec_info']).strip())

In [25]:
predictions

['2', '4']

In [26]:
class GSM8KPoTEvaluator(Evaluator):
    def load_data(self, evalset_file="gsm8k") -> List[Dict[str, Any]]:
        """
        Load the GSM8K dataset https://huggingface.co/datasets/gsm8k with Huggingface datasets library
        Load the first 100 examples from the test split in main subset.
        """
        output_dataset = []
        dataset = load_dataset(evalset_file, 'main', split='test[:100]')
        questions = dataset['question']
        answers = [str(re.search(r'#### (\d+)', answer).group(1)) for answer in dataset['answer']]
        example = {}
        return [{'task_id': f'GSM8K/{i}',
                 'question': questions[i],
                 'answer': answers[i]} for i in range(len(dataset))]

    def build_prompts(self, dataset, n_shot=8, demos=GSM_EXAMPLARS):
        """
        Build few-shot prompts from the GSM8K dataset. Use
        :param dataset: list of examples
        :param n_shot: number of examples to use for few-shot learning
        :param demos: list of demonstrator examples
        :return: list of prompts
        """
        prompts = []
        for example in dataset:
            prompt = ''
            for examplar in demos[:n_shot]:
                prompt += f"\nQ: {examplar['question']} \n# solution in Python: \n{examplar['pot_answer']}"
            question = example['question']
            prompt += f'\nQuestion: {question} \n# solution in Python: \n'
            prompts.append(prompt)
        return prompts

    def postprocess_output(self, output: str) -> str:
        """
        Postprocess the output from the language model.
        """
        start_matches = re.finditer(r'def solution\(\):', output)
        for _ in range(9):
            start_match = next(start_matches)
        start_idx = start_match.start()
        end_matches = re.finditer(r'return result', output)
        try:
            for _ in range(9):
                end_match = next(end_matches)
            end_idx = end_match.end()
        except StopIteration: # code too long exceeding max_new_tokens
            end_idx = len(output) + 1
        code = output[start_idx:end_idx]
        result, err = executor.apply(code)
        if not re.search('\d', str(result)):
            return result
        return str(int(float(result)))

    def calculate_metrics(self, predictions, dataset):
        """
        Calculate metrics for the GSM8K dataset
        """
        accuracy = 0
        assert len(predictions) == len(dataset), \
            f'found mismatch in prediction length: {len(predictions)} with no. of rows in dataset: {dataset.num_rows}'
        for i, (prediction, example) in tqdm(enumerate(zip(predictions, dataset))):
            predicted = prediction['completion']
            ground_truth = example['answer']
            if predicted == ground_truth:
                accuracy += 1 / len(predictions)
        return accuracy

In [27]:
empty_gpu_pt_cache()

In [28]:
gsm8k_pot_evaluator = GSM8KPoTEvaluator(llm)
gsm8k_pot_evaluator.evaluate(evalset_file="gsm8k", max_new_tokens=256, batch_size=4)

  0%|          | 0/25 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  4%|▍         | 1/25 [03:33<1:25:26, 213.60s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  8%|▊         | 2/25 [07:09<1:22:19, 214.76s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 12%|█▏        | 3/25 [10:44<1:18:46, 214.85s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 16%|█▌        | 4/25 [14:18<1:15:09, 214.72s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 20%|██        | 5/25 [17:51<1:11:23, 214.18s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 24%|██▍       | 6/25 [21:25<1:07:42, 213.83s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 28%|██▊       | 7/25 [24:58<1:04:05, 213.61s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 32%|███▏      | 8/25 [28:31<1:00:32, 213.67s/it]Setting `pad_token_id` to 

100it [00:00, 400602.10it/s]

Results for GSM8KPoTEvaluator: 0.22000000000000006





In [29]:
del gsm8k_pot_evaluator

|                    | GSM8K |
|--------------------|-------|
| Direct Prompting   | 0.06      |
| Chain-of-Thought   | 0.08      |
| Program-of-Thought | 0.22      |