<a href="https://colab.research.google.com/github/bythyag/chain-of-thought/blob/main/arthmetic-reasoning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### necessary evil

In [1]:
%%capture
# installations - when running locally, install any additional libraries your environment needs
!pip install --upgrade datasets fsspec huggingface_hub

In [2]:
# import modules

import re
import os
import json
import time
import torch
import pandas as pd
from tqdm import tqdm
from google import genai
#from openai import OpenAI
from google.genai import types
from datasets import load_dataset
from transformers import pipeline
from google.colab import userdata
from transformers import AutoModelForCausalLM, AutoTokenizer

### load keys

In [3]:
# load API keys from Colab secrets (google.colab.userdata only works inside Colab)
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API')
gemini_api = userdata.get('GEMINI_API')

### load dataset

In [3]:
%%capture
# load datasets

gsm8k = load_dataset("openai/gsm8k", "main")
svamp = load_dataset("ChilleD/SVAMP")
# asdiv ships custom loading code; trust_remote_code=True skips the interactive confirmation prompt
# (inspect the repository first: https://hf.co/datasets/EleutherAI/asdiv)
asdiv = load_dataset("EleutherAI/asdiv", trust_remote_code=True)
aqua = load_dataset("deepmind/aqua_rat")
mawps = load_dataset("MU-NLPC/Calc-mawps")

### explore dataset

In [None]:
# gsm8k
print("---- gsm8k dataset sample ----\n")
print("Question:\n" + gsm8k['train']['question'][0] + "\n")
print("Answer:\n" + gsm8k['train']['answer'][0])

In [None]:
# svamp
print("---- svamp dataset sample ----\n")
print("Question:\n" + svamp['train']['question_concat'][0] + "\n")
print(f"Answer:\n{svamp['train']['Answer'][0]}")  # f-string works whether the answer is stored as text or as a number

In [None]:
# asdiv
print("---- asdiv dataset sample ----\n")
print("Question:\n" + asdiv['validation']['body'][0] + " " + asdiv['validation']['question'][0] + "\n")
print("Answer:\n" + asdiv['validation']['answer'][0])

In [None]:
# aqua
print("---- aqua dataset sample ----\n")
print("Question:\n" + aqua['train']['question'][10] + "\n")
print(f"Answer Choices:\n{aqua['train']['options'][10]}" + "\n") # each 'options' entry is a list of answer-choice strings
print("Answer:\n" + aqua['train']['correct'][10])

In [None]:
# mawps
print("---- mawps dataset sample ----\n")
print("Question:\n" + mawps['train']['question'][0] + "\n")
print(f"Answer:\n{mawps['train']['result'][0]}")

### prompt template

In [4]:
PROMPT_TEMPLATE_1 = """
Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there
will be 21 trees. How many trees did the grove workers plant today?
A: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have
been 21 - 15 = 6. The answer is 6.

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they
had 74 - 35 = 39. The answer is 39.

Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did
Jason give to Denny?
A: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8.
The answer is 8.

Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he
have now?
A: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9.
The answer is 9.

Q: There were nine computers in the server room. Five more computers were installed each day, from monday
to thursday. How many computers are now in the server room?
A: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20
computers were added. 9 + 20 is 29. The answer is 29.

Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf
balls did he have at the end of wednesday?
A: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he
had 35 - 2 = 33 golf balls. The answer is 33.

Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23
- 15 is 8. The answer is 8.

Q: {question}
A:
"""

PROMPT_TEMPLATE_2 = """
Q: John found that the average of 15 numbers is 40. If 10 is added to each number then the mean of the numbers is?
Answer Choices: (a) 50 (b) 45 (c) 65 (d) 78 (e) 64
A: If 10 is added to each number, then the mean of the numbers also increases by 10. So the new mean would be 50. The answer is (a).

Q: If a / b = 3/4 and 8a + 5b = 22,then find the value of a.
Answer Choices: (a) 1/2 (b) 3/2 (c) 5/2 (d) 4/2 (e) 7/2
A: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22. This simplifies to 8a + 20a / 3 = 22, which means 44a / 3 = 22. So a is equal to 3/2. The answer is (b).

Q: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance?
Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km
A: The distance that the person traveled would have been 20 km/hr * 2.5 hrs = 50 km. The answer is (e).

Q: How many keystrokes are needed to type the numbers from 1 to 500?
Answer Choices: (a) 1156 (b) 1392 (c) 1480 (d) 1562 (e) 1788
A: There are 9 one-digit numbers from 1 to 9. There are 90 two-digit numbers from 10 to 99. There are 401
three-digit numbers from 100 to 500. 9 + 90(2) + 401(3) = 1392. The answer is (b).

Q: {question}
"""

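The templates above are filled with `str.format`, which replaces the single `{question}` placeholder. The sketch below illustrates that step with a shortened one-exemplar template. `DEMO_TEMPLATE`, `build_prompt`, and `format_multiple_choice` are hypothetical names introduced only for this example, and the joined option layout is an assumption modeled on the commented-out `combined_qas` code further down.

```python
# One-exemplar stand-in for PROMPT_TEMPLATE_1 (shortened for illustration only).
DEMO_TEMPLATE = """
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

Q: {question}
A:
"""


def build_prompt(template: str, question: str) -> str:
    """Insert the question into the template and trim the outer blank lines."""
    # Only the template is parsed for placeholders, so braces inside the
    # question text are passed through unchanged.
    return template.format(question=question).strip()


def format_multiple_choice(question: str, options: list) -> str:
    """Join a question and its list of answer choices into one string."""
    return f"{question}\nAnswer Choices: {' '.join(options)}"


prompt = build_prompt(
    DEMO_TEMPLATE,
    "Tom has 4 apples and buys 3 more. How many apples does he have?",
)
print(prompt)
```

Because the prompt ends with `A:`, the model is nudged to continue in the exemplar style and finish with "The answer is ...", which makes the final answer easier to locate afterwards.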
### load models

In [None]:
# open-ai

from openai import OpenAI  # the import in the modules cell is commented out, so import it here

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

gpt_models = ["gpt-4.1-nano-2025-04-14", "gpt-4o-mini"]

def run_gpt(prompt_type, dataset_name, model_name, questions, original_response, sample_size, prompt_template):
  chat_response = []

  for i in tqdm(range(sample_size), desc="processing questions"):
    question = questions[i]
    original_answer = original_response[i]

    # few-shot: wrap the question in the exemplar template; base: send the bare question
    if prompt_type == "few-shot":
      prompt = prompt_template.format(question=question)
      instructions = "You are a helpful assistant. Follow the few-shot example format provided by the user."
    else:
      prompt = question
      instructions = "You are a helpful assistant."

    response = openai_client.responses.create(
        model=model_name,
        instructions=instructions,
        input=prompt,
        temperature=0.0,
    )
    model_answer = response.output_text
    chat_response.append(
        {
            'question': question,
            'original_answer': original_answer,
            'answer_text': model_answer
        }
    )

  # build the file name once, after the loop, so it is defined even when sample_size is 0
  output_filename = f"{dataset_name}-{prompt_type}-{model_name.replace('.', '-')}.json"
  with open(output_filename, "w", encoding="utf-8") as f:
      json.dump(chat_response, f, ensure_ascii=False, indent=4)

# the dataset label, the questions, and the reference answers must all come from the same dataset (gsm8k here)
run_gpt("base", "gsm8k", "gpt-4o-mini", gsm8k['train']['question'], gsm8k['train']['answer'], 20, PROMPT_TEMPLATE_1)

In [None]:
# gemini

# use a dedicated name so this client does not overwrite the one used by run_gpt
gemini_client = genai.Client(api_key=gemini_api)

models = ["gemini-2.0-flash", "gemini-1.5-flash"]

def run_gemini(prompt_type, dataset_name, model_name, questions, original_response, sample_size, prompt_template):
  chat_response = []

  for i in tqdm(range(sample_size), desc="processing questions"):
    question = questions[i]
    original_answer = original_response[i]

    # few-shot: wrap the question in the exemplar template; base: send the bare question
    if prompt_type == "few-shot":
      prompt = prompt_template.format(question=question)
      instructions = "You are a helpful assistant. Follow the few-shot example format provided by the user."
    else:
      prompt = question
      instructions = "You are a helpful assistant."

    response = gemini_client.models.generate_content(
        model=model_name,
        contents=prompt,
        config=types.GenerateContentConfig(
            system_instruction=instructions,
            temperature=0
            )
        )
    model_answer = response.text

    chat_response.append(
        {
            'question': question,
            'original_answer': original_answer,
            'answer_text': model_answer
        }
    )
    time.sleep(10)  # pause between requests (presumably to stay within the API rate limit)

  output_filename = f"{dataset_name}-{prompt_type}-{model_name.replace('.', '-')}.json"
  with open(output_filename, "w", encoding="utf-8") as f:
      json.dump(chat_response, f, ensure_ascii=False, indent=4)

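# asdiv: join the problem body and the question sentence into one prompt
questions = [
    asdiv['validation']['body'][i] + " " + asdiv['validation']['question'][i]
    for i in range(20)
]
answers = [asdiv['validation']['answer'][i] for i in range(20)]

# alternative input: aqua questions with their answer choices (pair with PROMPT_TEMPLATE_2)
# combined_qas = []

# for i in range(20):
#     question = aqua['train']['question'][i]
#     options = aqua['train']['options'][i]
#     formatted_options = ' '.join(options)
#     combined_qas.append(f"Question:\n{question}\n\nAnswer Choices:\n{formatted_options}")

# asdiv is free-form arithmetic, so it uses PROMPT_TEMPLATE_1; PROMPT_TEMPLATE_2 holds the multiple-choice exemplars
run_gemini("few-shot", "asdiv", "gemini-2.0-flash", questions, answers, 20, PROMPT_TEMPLATE_1)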

In [6]:
from tqdm import tqdm
import json
import time
from transformers import pipeline
import torch
import torch._dynamo
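torch._dynamo.config.cache_size_limit = 64  # raise the recompilation cache limit above the default of 8
torch._dynamo.disable()  # note: called without a function, this only returns a decorator/context manager and does not switch Dynamo off globally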

pipe = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",
    device="cuda",
)

def run_gemma(prompt_type, dataset_name, model_name, questions, original_answers, sample_size, prompt_template):
    results = []

    for i in tqdm(range(sample_size), desc="processing questions"):
        question = questions[i]
        original_answer = original_answers[i]

        if prompt_type == "few-shot":
            user_prompt = prompt_template.format(question=question)
            system_prompt = "You are a helpful assistant. Follow the few-shot example in the message provided by the user."
        else:
            user_prompt = question
            system_prompt = "You are a helpful assistant."

        messages = [
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
            {"role": "user", "content": [{"type": "text", "text": user_prompt}]}
        ]

        output = pipe(text_inputs=messages, max_new_tokens=3000)
        generated_answer = output[0]['generated_text'][-1].get('content', '')

        results.append({
            "question": question,
            "original_answer": original_answer,
            "generated_answer": generated_answer
        })

    output_filename = f"{dataset_name}-{prompt_type}-{model_name.replace('.', '-')}.json"
    with open(output_filename, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=4)

# combined_qas = []

# for i in range(20):
#     question = aqua['train']['question'][i]
#     options = aqua['train']['options'][i]
#     formatted_options = ' '.join(options)
#     combined_qas.append(f"Question:\n{question}\n\nAnswer Choices:\n{formatted_options}")

# run_gemma("few-shot", "aqua", "gemma-3-1b-it", combined_qas, aqua['train']['correct'], 20, PROMPT_TEMPLATE_2)

# run_gemma("base", "aqua", "gemma-3-1b-it", combined_qas, aqua['train']['correct'], 20, PROMPT_TEMPLATE_2)

# questions = [
#     asdiv['validation']['body'][i] + " " + asdiv['validation']['question'][i]
#     for i in range(20)
# ]
# answers = [asdiv['validation']['answer'][i] for i in range(20)]

# run_gemma("base", "asdiv", "gemma-3-1b-it", questions, answers, 20, PROMPT_TEMPLATE_1)
# run_gemma("few-shot", "asdiv", "gemma-3-1b-it", questions, answers, 20, PROMPT_TEMPLATE_1)

# run_gemma("base", "gsm8k", "gemma-3-1b-it", gsm8k['train']['question'], gsm8k['train']['answer'], 20, PROMPT_TEMPLATE_1)

# run_gemma("base", "mawps", "gemma-3-1b-it", mawps['train']['question'], mawps['train']['result'], 20, PROMPT_TEMPLATE_1)
# run_gemma("few-shot", "mawps", "gemma-3-1b-it", mawps['train']['question'], mawps['train']['result'], 20, PROMPT_TEMPLATE_1)

# run_gemma("base", "svamp", "gemma-3-1b-it", svamp['train']['question_concat'], svamp['train']['Answer'], 20, PROMPT_TEMPLATE_1)
# run_gemma("few-shot", "svamp", "gemma-3-1b-it", svamp['train']['question_concat'],svamp['train']['Answer'], 20, PROMPT_TEMPLATE_1)

Device set to use cuda
processing questions:   0%|          | 0/20 [00:00<?, ?it/s]W0528 08:35:08.502000 268 torch/_inductor/utils.py:1137] [0/0] Not enough SMs to use max_autotune_gemm mode
processing questions:  50%|█████     | 10/20 [04:15<03:20, 20.09s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
processing questions: 100%|██████████| 20/20 [07:46<00:00, 23.33s/it]


In [None]:
# qwen3-4b

model_name = "Qwen/Qwen3-4B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
qwen3_4b = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# write a function for prompt input + structured output of response in a json file
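The function described in the comment above is not written yet. A minimal sketch of what it could look like follows, mirroring `run_gemma`. `run_qwen`, `model_label`, and the `max_new_tokens` default are hypothetical choices for this example, not the notebook's own code; the model and tokenizer are passed in as arguments so the function does not depend on global names.

```python
import json


def run_qwen(model, tokenizer, prompt_type, dataset_name, model_label,
             questions, original_answers, sample_size, prompt_template,
             max_new_tokens=1024):
    """Generate one answer per question and save the records as JSON."""
    results = []

    for i in range(sample_size):
        question = questions[i]

        # few-shot: wrap the question in the exemplar template; base: bare question
        if prompt_type == "few-shot":
            user_prompt = prompt_template.format(question=question)
        else:
            user_prompt = question

        # Render the chat messages into the model's prompt format.
        messages = [{"role": "user", "content": user_prompt}]
        text = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = tokenizer([text], return_tensors="pt").to(model.device)

        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Drop the prompt tokens so only the newly generated text is decoded.
        new_tokens = output_ids[0][len(inputs.input_ids[0]):]
        answer_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

        results.append({
            "question": question,
            "original_answer": original_answers[i],
            "answer_text": answer_text,
        })

    # Same file naming scheme as the other run_* functions.
    output_filename = f"{dataset_name}-{prompt_type}-{model_label.replace('.', '-')}.json"
    with open(output_filename, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=4)

    return results
```

With the objects loaded in this cell, a call might look like `run_qwen(qwen3_4b, tokenizer, "few-shot", "gsm8k", "Qwen3-4B", gsm8k['train']['question'], gsm8k['train']['answer'], 20, PROMPT_TEMPLATE_1)`. The sketch stores the output under `answer_text`, as `run_gpt` and `run_gemini` do, whereas `run_gemma` uses `generated_answer`.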

### generate answers and save on google drive / locally

### evaluate generated responses
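
One possible way to score the saved JSON files is sketched below, assuming numeric answers. It takes the number after the last "The answer is" (the format the few-shot exemplars end with), falls back to the last number in the text, and compares it with the reference answer. `extract_predicted_number`, `extract_gold_number`, and `accuracy` are hypothetical helpers, and the demo records are made-up illustrations, not experiment results. Multiple-choice aqua answers and non-numeric asdiv answers (ratios, for example) would need separate handling.

```python
import re

# An optional sign, digits with optional thousands separators, optional decimals.
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def extract_predicted_number(text):
    """Number after the last 'The answer is', else the last number in the text."""
    matches = re.findall(rf"[Tt]he answer is\s*\$?\s*({NUMBER})", text)
    if not matches:
        matches = re.findall(NUMBER, text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def extract_gold_number(answer):
    """First number of the reference answer (after '####' for gsm8k-style text)."""
    text = str(answer)
    if "####" in text:
        text = text.split("####")[-1]
    match = re.search(NUMBER, text)
    return float(match.group().replace(",", "")) if match else None


def accuracy(records, tol=1e-6):
    """Fraction of records whose predicted number matches the reference."""
    correct = 0
    for record in records:
        # run_gpt / run_gemini save 'answer_text'; run_gemma saves 'generated_answer'.
        generated = record.get("answer_text", record.get("generated_answer", ""))
        pred = extract_predicted_number(generated)
        gold = extract_gold_number(record["original_answer"])
        if pred is not None and gold is not None and abs(pred - gold) <= tol:
            correct += 1
    return correct / len(records) if records else 0.0


# Made-up records for illustration only.
demo_records = [
    {"original_answer": "39", "answer_text": "74 - 35 = 39. The answer is 39."},
    {"original_answer": "23 - 15 = 8\n#### 8", "generated_answer": "So she has $8 left."},
    {"original_answer": "9 (apples)", "answer_text": "The answer is 12."},
]
score = accuracy(demo_records)
print(f"accuracy: {score:.2f}")
```

To score a real run, load one of the saved files with `json.load` and pass the resulting list to `accuracy`.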

### graph plot