# COMP3361 Part 3: Generation with Large Language Model

## Load model and tokenizer

In this section, we will use [Qwen1.5-1.8B](https://huggingface.co/Qwen/Qwen1.5-1.8B) as the language model.

In [1]:
!pip install transformers==4.37.2 datasets evaluate accelerate bitsandbytes

Collecting transformers==4.37.2
  Downloading transformers-4.37.2-py3-none-any.whl.metadata (129 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m129.4/129.4 kB[0m [31m1.4 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.0-py3-none-manylinux_2_24_x86_64.whl.metadata (1.8 kB)
Downloading transformers-4.37.2-py3-none-any.whl (8.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.4/8.4 MB[0m [31m30.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hDownloading evaluate-0.4.1-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading bitsandbytes-0.43.0-py3-none-manylinux_2_24_x86_64.whl (102.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m102.2/102.2 MB[0m [31m15.1 MB/s[0m eta [36m0:0

In [2]:
from abc import ABC, abstractmethod
from typing import List, Dict, Any
import os
import json
import evaluate
from datasets import load_dataset
from tqdm import tqdm
import re
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

os.environ["TOKENIZERS_PARALLELISM"] = "false"

2024-03-23 00:53:15.991070: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-23 00:53:15.991168: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-23 00:53:16.152462: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [3]:
class LLM(object):
    def __init__(self, model_name="Qwen/Qwen1.5-1.8B"):
        self.model_name = model_name
        self.model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B",torch_dtype=torch.float16)
        self.model.to("cuda")

    def generate(self, prompts: List[str], **kwargs) -> List[str]:
        max_new_tokens = kwargs.get('max_new_tokens', 128)
        tokenizer = AutoTokenizer.from_pretrained(self.model_name, padding_side="left")
        tokenizer.pad_token = tokenizer.eos_token
        
        model_inputs = tokenizer(prompts,
                                 return_tensors="pt",
                                 padding=True).to("cuda")
        generated_ids = self.model.generate(**model_inputs, max_new_tokens=max_new_tokens)
        output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
        return output

In [12]:
llm = LLM()

llm.generate(["A list of colors: red, blue", "Portugal is"])

config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.67G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/138 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.16k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


['A list of colors: red, blue, green, yellow, purple, orange, pink, black, white, gray, brown, blue-green, green-blue, yellow-green, purple-blue, orange-red, red-orange, yellow-red, purple-red, orange-yellow, pink-red, black-yellow, white-yellow, gray-yellow, brown-yellow, blue-green-yellow, green-blue-yellow, yellow-green-yellow, purple-blue-yellow, orange-red-yellow, red-orange-yellow, yellow-red-yellow, purple-red-yellow, orange-yellow-yellow, pink-red-yellow, black-yellow-yellow, white-yellow-yellow, gray-yellow-yellow, brown-yellow-yellow, blue-green-yellow-orange, green-blue-yellow-orange, yellow',
 'Portugal is a country that is rich in history and culture. It is a country that is full of surprises and it is a country that is full of beauty. It is a country that is full of history and it is a country that is full of culture. It is a country that is full of surprises and it is a country that is full of beauty. It is a country that is full of history and it is a country that is fu

In [4]:
import math

class Evaluator(ABC):
    def __init__(self, llm):
        self.llm = llm

    @abstractmethod
    def load_data(self):
        pass

    @abstractmethod
    def build_prompts(self):
        pass

    @abstractmethod
    def postprocess_output(self, output: str) -> str:
        pass

    def generate_completions(self, prompts: List[str], batch_size=4, **kwargs) -> List[str]:
        max_new_tokens = kwargs.get('max_new_tokens', 128)
        response = []
        for i in range(math.ceil(len(prompts) / batch_size)):
            if ((i+1) * batch_size) < len(prompts):
                end_pos = (i+1)*batch_size
            else:
                end_pos = len(prompts)
            output = self.llm.generate(prompts[i*batch_size:end_pos], max_new_tokens=max_new_tokens)
            response.extend(output)
        return response

    def evaluate(self, evalset_file, batch_size=4, save_dir="outputs", max_new_tokens=128, **kwargs):
        dataset = self.load_data(evalset_file)
        prompts = self.build_prompts(dataset)
        outputs = self.generate_completions(prompts, batch_size=batch_size, max_new_tokens=max_new_tokens, **kwargs)

        predictions = []
        for i, (example, prompt, output) in enumerate(zip(dataset, prompts, outputs)):
            prediction = {
                "task_id": example.get("task_id", f"task_{i}"),
                "prompt": prompt,
                "completion": self.postprocess_output(output[len(prompt):])
            }
            predictions.append(prediction)

        # Save predictions to file
        os.makedirs(save_dir, exist_ok=True)
        prediction_save_path = os.path.join(save_dir, f"{type(self).__name__}_predictions.jsonl")
        with open(prediction_save_path, "w") as fout:
            for pred in predictions:
                fout.write(json.dumps(pred) + "\n")

        # Calculate metrics and print results
        results = self.calculate_metrics(predictions, dataset)
        print(f"Results for {type(self).__name__}: {results}")

    @abstractmethod
    def calculate_metrics(self):
        pass

## Zero-shot Code Generation

In [5]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [7]:
!mkdir -p human_eval
!wget -O human_eval/__init__.py https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/__init__.py
!wget -O human_eval/data.py human_eval https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/data.py
!wget -O human_eval/evaluation.py https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/evaluation.py
!wget -O human_eval/execution.py human_eval https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/execution.py

!mkdir -p data/humaneval
!wget -O data/humaneval/HumanEval.jsonl.gz https://github.com/openai/human-eval/raw/master/data/HumanEval.jsonl.gz

--2024-03-22 21:58:45--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/__init__.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/plain]
Saving to: 'human_eval/__init__.py'

human_eval/__init__     [ <=>                ]       0  --.-KB/s    in 0s      

2024-03-22 21:58:45 (0.00 B/s) - 'human_eval/__init__.py' saved [0/0]

--2024-03-22 21:58:47--  http://human_eval/
Resolving human_eval (human_eval)... failed: Name or service not known.
wget: unable to resolve host address 'human_eval'
--2024-03-22 21:58:47--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/data.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.

In [116]:
from human_eval.data import read_problems
from human_eval.evaluation import evaluate_functional_correctness

class HumanEvalEvaluator(Evaluator):
    def load_data(self, evalset_file="data/humaneval/HumanEval.jsonl.gz") -> List[Dict[str, Any]]:
        """
        Load the humaneval dataset
        :param evalset_file: path to the humaneval dataset file
        :return: list of examples
        """
        return list(read_problems(evalset_file).values())

    def build_prompts(self, dataset) -> List[str]:
        """
        Build zero-shot prompts from the humaneval dataset.
        """
        prompts = [example["prompt"] for example in dataset]
        return prompts

    def postprocess_output(self, output: str) -> str:
        stop_sequences=["\nclass", "\ndef", "\n#", "\nif", "\nprint"]
        for stop_word in stop_sequences:
            if stop_word in output:
                output = output.split(stop_word)[0]
                break
        return output

    def calculate_metrics(self, predictions, dataset):
        pass_at_k_results = evaluate_functional_correctness(
            sample_file=os.path.join("outputs", f"{type(self).__name__}_predictions.jsonl"),
            k=[1],
            problems={example["task_id"]: example for example in dataset},
            n_workers=64
        )
        return pass_at_k_results


In [9]:
human_eval_evaluator = HumanEvalEvaluator(llm)
human_eval_evaluator.evaluate(evalset_file = "data/humaneval/HumanEval.jsonl.gz", batch_size=4)

<class 'list'>
Length of dataset is 164
Length of prompts is 164


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Special tokens have been added in the vocabul

Length of outputs is 164
task_id: HumanEval/0
  Prompt: from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

  Original response: from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    return all(close > threshold for close in numbers)


def close(numbers: List[float]) -> float:
    """ Return the closest number to the given threshold.
    >>> close([1.0, 2.0, 3.0])
    2.0
    >>> close([1.0, 2.8, 3.0, 

164it [00:00, 6792.93it/s]


Running test suites...


100%|██████████| 164/164 [00:29<00:00,  5.48it/s]


Writing results to outputs/HumanEvalEvaluator_predictions.jsonl_results.jsonl...


100%|██████████| 164/164 [00:00<00:00, 25741.56it/s]

Results for HumanEvalEvaluator: {'pass@1': 0.2073170731707317}





## Few-shot Math Reasoning

In [6]:
GSM_EXAMPLARS = [
    {
        "question": "There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?",
        "cot_answer": "There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. So the answer is 6.",
        "pot_answer": "def solution():\n    \"\"\"There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\"\"\"\n    trees_initial = 15\n    trees_after = 21\n    trees_added = trees_after - trees_initial\n    result = trees_added\n    return result",
        "short_answer": "6"
    },
    {
        "question": "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?",
        "cot_answer": "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. So the answer is 5.",
        "pot_answer": "def solution():\n    \"\"\"If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\"\"\"\n    cars_initial = 3\n    cars_arrived = 2\n    total_cars = cars_initial + cars_arrived\n    result = total_cars\n    return result",
        "short_answer": "5"
    },
    {
        "question": "Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?",
        "cot_answer": "Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. So the answer is 39.",
        "pot_answer": "def solution():\n    \"\"\"Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\"\"\"\n    leah_chocolates = 32\n    sister_chocolates = 42\n    total_chocolates = leah_chocolates + sister_chocolates\n    chocolates_eaten = 35\n    chocolates_left = total_chocolates - chocolates_eaten\n    result = chocolates_left\n    return result",
        "short_answer": "39"
    },
    {
        "question": "Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?",
        "cot_answer": "Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. So the answer is 8.",
        "pot_answer": "def solution():\n    \"\"\"Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\"\"\"\n    jason_lollipops_initial = 20\n    jason_lollipops_after = 12\n    denny_lollipops = jason_lollipops_initial - jason_lollipops_after\n    result = denny_lollipops\n    return result",
        "short_answer": "8"
    },
    {
        "question": "Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?",
        "cot_answer": "Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. So the answer is 9.",
        "pot_answer": "def solution():\n    \"\"\"Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\"\"\"\n    toys_initial = 5\n    mom_toys = 2\n    dad_toys = 2\n    total_received = mom_toys + dad_toys\n    total_toys = toys_initial + total_received\n    result = total_toys\n    return result",
        "short_answer": "9"
    },
    {
        "question": "There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?",
        "cot_answer": "There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. So the answer is 29.",
        "pot_answer": "def solution():\n    \"\"\"Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\"\"\"\n    toys_initial = 5\n    mom_toys = 2\n    dad_toys = 2\n    total_received = mom_toys + dad_toys\n    total_toys = toys_initial + total_received\n    result = total_toys\n    return result",
        "short_answer": "29"
    },
    {
        "question": "Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?",
        "cot_answer": "Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. So the answer is 33.",
        "pot_answer": "def solution():\n    \"\"\"Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?\"\"\"\n    golf_balls_initial = 58\n    golf_balls_lost_tuesday = 23\n    golf_balls_lost_wednesday = 2\n    golf_balls_left = golf_balls_initial - golf_balls_lost_tuesday - golf_balls_lost_wednesday\n    result = golf_balls_left\n    return result",
        "short_answer": "33"
    },
    {
        "question": "Olivia has $23. She bought five bagels for $3 each. How much money does she have left?",
        "cot_answer": "Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. So the answer is 8.",
        "pot_answer": "def solution():\n    \"\"\"Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\"\"\"\n    money_initial = 23\n    bagels = 5\n    bagel_cost = 3\n    money_spent = bagels * bagel_cost\n    money_left = money_initial - money_spent\n    result = money_left\n    return result",
        "short_answer": "8"
    }
]

In [7]:
class GSM8KEvaluator(Evaluator):
    def load_data(self, evalset_file="gsm8k") -> List[Dict[str, Any]]:
        """
        Load the GSM8K dataset https://huggingface.co/datasets/gsm8k with Huggingface datasets library
        Load the first 100 examples from the test split in main subset.
        """
        size = 100
        ds = load_dataset(evalset_file, "main", split="test")[:size]
        questions = ds['question']
        answers = ds['answer']
        self.dataset = [{"question": questions[i],
                        "answer": answers[i].split("####")[1].strip()} for i in range(size)]
        for pair in self.dataset:
            pair["answer"] = re.sub(r"(\d),(\d)", r"\1\2", pair["answer"])
        return self.dataset

    def build_prompts(self, dataset, n_shot=8, demos=GSM_EXAMPLARS):
        """
        Build few-shot prompts from the GSM8K dataset. Use 
        :param dataset: list of examples
        :param n_shot: number of examples to use for few-shot learning
        :param demos: list of demonstrator examples
        :return: list of prompts
        """
        final_prompt = "Answer the following questions."
        for i in range(n_shot):
            question = demos[i]["question"]
            answer = demos[i]["short_answer"]
            prompt = f"\nQuestion: {{{question}}} \nAnswer: {{{answer}}}"
            final_prompt = final_prompt + prompt
        prompts = [f'{final_prompt}\nQuestion: {example["question"]} \n Answer: ' for example in dataset]
        return prompts

    def postprocess_output(self, output: str) -> str:
        """
        Postprocess the output from the language model.
        """
        output = output.split("Question")[0]
        output = output.replace("{","").replace("}","").replace(",","").strip()
        return output
        
    def calculate_metrics(self, predictions, dataset):
        """
        Calculate metrics for the GSM8K dataset
        """
        score = 0
        for i in range(len(predictions)):
            if predictions[i]["completion"] == dataset[i]["answer"]:
                score += 1
            else:
                continue
        score /= len(predictions)
        return score

In [62]:
gsm8k_evaluator = GSM8KEvaluator(llm)
gsm8k_evaluator.evaluate(evalset_file = "gsm8k")

<class 'list'>
Length of dataset is 100
["Answer the following questions.\nQuestion: {There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?} \nAnswer: {6}\nQuestion: {If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?} \nAnswer: {5}\nQuestion: {Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?} \nAnswer: {39}\nQuestion: {Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?} \nAnswer: {8}\nQuestion: {Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?} \nAnswer: {9}\nQuestion: {There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the serve

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


KeyboardInterrupt: 

## Few-shot Chain-of Thought Math Reasoning

In [119]:
class GSM8KCoTEvaluator(GSM8KEvaluator):
    def postprocess_output(self, output: str) -> str:
        """
        Postprocess the output from the language model.
        """
        print(f"postprocess_output")
        print(f"  original output {output}")
        if "Question" in output:
            output = output.split("Question")[0]
        if "he answer is" in output:
            output = output.split("he answer is")[1]
        output = output.replace("{","").replace("}","").replace(",","").replace("$","").replace("=","").strip()
        print(f"  transformed output {output}")
        if output[-1]==".":
            output = output[:-1]

        output = re.findall(r'\d+', output)[0]
        return output

    def build_prompts(self, dataset, n_shot=8, demos=GSM_EXAMPLARS):
        """
        Build few-shot prompts from the GSM8K dataset. Use
        :param dataset: list of examples
        :param n_shot: number of examples to use for few-shot learning
        :param demos: list of demonstrator examples
        :return: list of prompts
        """
        final_prompt = """"Answer the following questions."""
        for i in range(n_shot):
            question = demos[i]["question"]
            cot_answer = demos[i]["cot_answer"]
            prompt = f"\nQuestion: {{{question}}} \nAnswer: {{{cot_answer}}}"
            final_prompt = final_prompt + prompt

        prompts = [f'{final_prompt}\nQuestion: {example["question"]} \n Answer: ' for example in dataset]
        return prompts

In [80]:
gsm8k_cot_evaluator = GSM8KCoTEvaluator(llm)
gsm8k_cot_evaluator.evaluate(evalset_file = "gsm8k")

<class 'list'>
Length of dataset is 10
['"Answer the following questions.\nQuestion: {There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?} \nAnswer: {There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. So the answer is 6.}\nQuestion: {If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?} \nAnswer: {There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. So the answer is 5.}\nQuestion: {Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?} \nAnswer: {Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. So the answer is 39.}\nQuestion: {Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipop

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


KeyboardInterrupt: 

## Few-shot Program-of Thought Math Reasoning

In [1]:
!pip install timeout-decorator Pebble
!wget -O python_executor.py https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/python_executor.py

--2024-03-24 02:46:59--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/python_executor.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6811 (6.7K) [text/plain]
Saving to: ‘python_executor.py’


2024-03-24 02:47:00 (4.23 MB/s) - ‘python_executor.py’ saved [6811/6811]



In [15]:
from python_executor import PythonExecutor
executor = PythonExecutor(get_answer_expr='solution()')

codes = [
    "def solution():\n    return 1 + 1",
    "def solution():\n    return 2 * 2",
    "def solution():\n    \"\"\"If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\"\"\"\n    cars_initial = 3\n    cars_arrived = 2\n    total_cars = cars_initial + cars_arrived\n    result = total_cars\n    return result",
    '''def solution():
     """Toula went to the bakery and bought various types of pastries. She bought 3 dozen donuts which cost $68 per dozen, 2 dozen mini cupcakes which cost $80 per dozen, and 6 dozen mini cheesecakes for $55 per dozen. How much was the total cost?"""
     donuts = 3 * 12
     mini_cupcakes = 2 * 12
     mini_cheesecakes = 6 * 12
     total_cost = donuts + mini_cupcakes + mini_cheesecakes
     result = total_cost
     return result''',
     '''def solution():
     """Melanie is a door-to-door saleswoman. She sold a third of her vacuum cleaners at the green house, 2 more to the red house, and half of what was left at the orange house. If Melanie has 5 vacuum cleaners left, how many did she start with?"""
     vacuum_cleaners_initial = 5
     green_house = 3
     red_house = 2
     orange_house = 1
     vacuum_cleaners_sold = green_house + red_house + orange_house
     vacuum_cleaners_left = vacuum_cleaners_initial - vacuum_cleaners_sold
     result = vacuum_cleaners_left
     return result''',
     '''def solution():
     """Billy sells DVDs. He has 8 customers on Tuesday. His first 3 customers buy one DVD each. His next 2 customers buy 2 DVDs each. His last 3 customers don't buy any DVDs. How many DVDs did Billy sell on Tuesday?"""
     customers_tuesday = 8
     customers_first_three = 3
     customers_next_two = 2
     customers_last_three = 3
     total_dvd_sold = customers_tuesday * customers_first_three + customers_next_two * customers_second_three + customers_last_three * customers_last_three
     result = total_dvd_sold
     return result'''
]

predictions = []
runtime_errors = []
for pred, err in executor.batch_apply([codes[5]]):
    predictions.append(str(pred))
    runtime_errors.append(str(err['exec_info']).strip())

In [16]:
predictions

['']

In [17]:
class GSM8KPoTEvaluator(GSM8KEvaluator):
    def build_prompts(self, dataset, n_shot=8, demos=GSM_EXAMPLARS):
        """
        Build few-shot prompts from the GSM8K dataset. Use
        :param dataset: list of examples
        :param n_shot: number of examples to use for few-shot learning
        :param demos: list of demonstrator examples
        :return: list of prompts
        """
        final_prompt = """You are a logical and expereinced coder.
        You must generate code to solve mathematical problems. Answer the following questions."""
        for i in range(n_shot):
            question = demos[i]["question"]
            pot_answer = demos[i]["pot_answer"]
            prompt = f"\nQuestion: {{{question}}} \n# solution in Python: {{{pot_answer}}}"
            final_prompt = final_prompt + prompt

        prompts = [f'{final_prompt}\nQuestion: {{{example["question"]}}} \n# solution in Python: \n' for example in dataset]
        return prompts


    def postprocess_output(self, output: str) -> str:
        """
        Postprocess the output from the language model.
        """
        print(f"   Original code: {output}")
        if "}" in output:
            output = output.split("}")[0]
        print(f"   Full code: {output}")
        predictions = []
        for pred, err in executor.batch_apply([output]):
            predictions.append(str(pred))
            runtime_errors.append(str(err['exec_info']).strip())
        print(f"   Postprocessed prediction: {predictions[0]}")
        return predictions[0]
        
    def calculate_metrics(self, predictions, dataset):
        """
        Calculate metrics for the GSM8K dataset
        """
        for i in range(len(predictions)):
            predictions[i]['completion'] = str(predictions[i]['completion'])
            if len(re.findall(r'\d+', predictions[i]['completion'])) !=0:
                predictions[i]['completion'] = re.findall(r'\d+', predictions[i]['completion'])[0]
        score = 0
        for i in range(len(predictions)):
            if predictions[i]['completion'] == dataset[i]["answer"]:
                score += 1
            else:
                continue
        score /= len(predictions)
        for i in range(len(predictions)):
            print(f"Prediction: {predictions[i]['completion']} Answer: {dataset[i]['answer']}")
        return score

In [18]:
gsm8k_pot_evaluator = GSM8KPoTEvaluator(llm)
gsm8k_pot_evaluator.evaluate(evalset_file = "gsm8k", max_new_tokens = 256)

<class 'list'>
Length of dataset is 100
['You are a logical and expereinced coder.\n        You must generate code to solve mathematical problems. Answer the following questions.\nQuestion: {There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?} \nAnswer: {def solution():\n    """There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?"""\n    trees_initial = 15\n    trees_after = 21\n    trees_added = trees_after - trees_initial\n    result = trees_added\n    return result}\nQuestion: {If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?} \nAnswer: {def solution():\n    """If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?"""\n    cars_initial = 

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Special tokens have been added in the vocabul

Length of outputs is 100
   Original code: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?"""
    eggs_per_day = 16
    eggs_per_morning = 3
    eggs_per_day = eggs_per_day + eggs_per_morning
    muffins_per_day = 4
    muffins_per_morning = muffins_per_day / 4
    eggs_per_morning =
   Reduced code: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?"""
    eggs_per_day = 16
    eggs_per_morning = 3
    eggs_per_day = eggs_per_day + eggs_per_morning
    muffins_per_day = 4
    muffins_per_morning = muffins_per_day / 4
    eggs_per_

|                    | GSM8K |
|--------------------|-------|
| Direct Prompting   | 0.06      |
| Chain-of-Thought   | 0.25      |
| Program-of-Thought | 0.22      |