This notebook contains experiments on **reasoning with large language models (LLMs)** using reinforcement learning concepts.  
It compares three prompting strategies:
- Few-shot with Chain of Thought (CoT)
- Zero-shot with CoT
- Few-shot without CoT

The aim is to evaluate which prompting style gives the most accurate answers on reasoning tasks.
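
As a rough illustration of how the three styles differ, the sketch below builds one prompt per style for a single question. The `build_prompts` helper and the single demonstration it uses are assumptions made for illustration; the cells further down define the prompts actually used in the experiments.

```python
# Hypothetical helper (not part of the notebook): one prompt per style.
DEMO_Q = "If there are 2 pens and each costs $3, how much in total?"
DEMO_COT = "Each pen costs $3. There are 2 pens. So 2 × 3 = $6. The answer is 6."
DEMO_DIRECT = "6"


def build_prompts(question):
    """Return the three prompt variants for a single question."""
    return {
        # Demonstration with worked reasoning, then the new question.
        "few_shot_cot": f"Q: {DEMO_Q}\nA: {DEMO_COT}\n\nQ: {question}\nA:",
        # No demonstration, only the step-by-step trigger phrase.
        "zero_shot_cot": f"Q: {question} Let's think step by step.\nA:",
        # Demonstration with the bare answer, then the new question.
        "few_shot_nocot": f"Q: {DEMO_Q}\nA: {DEMO_DIRECT}\n\nQ: {question}\nA:",
    }


prompts = build_prompts("A train travels 60 km/h for 3 hours. How far does it go?")
for name, text in prompts.items():
    print(f"--- {name} ---\n{text}\n")
```

The only difference between the two few-shot variants is whether the demonstration answer spells out its reasoning.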

---

## References

- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou (2022). *Chain-of-Thought Prompting Elicits Reasoning in Large Language Models*.  
  arXiv preprint [arXiv:2201.11903](https://arxiv.org/abs/2201.11903)

- Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa (2022). *Large Language Models are Zero-Shot Reasoners*.  
  arXiv preprint [arXiv:2205.11916](https://arxiv.org/abs/2205.11916)

- Vizuara AI Labs - [YouTube Channel](https://www.youtube.com/@vizuara)

In [1]:
!pip install -q transformers accelerate


In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_id = "microsoft/phi-2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

# Note: temperature only affects sampling; without do_sample=True decoding is greedy.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=128, temperature=0.3)





`torch_dtype` is deprecated! Use `dtype` instead!



Device set to use cuda:0


### Define 5 Mixed Reasoning Questions (Logical + Symbolic)

In [3]:
questions = [
    "If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",
    "A train travels 60 km/h for 3 hours. How far does it go?",
    "If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",
    "Tom has twice as many apples as Jerry. Jerry has 3 apples. How many apples does Tom have?",
    "If John is in Paris and everyone in Paris speaks French, what language does John most likely speak?"
]


### Define Prompt Templates

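This section compares three prompt styles: few-shot Chain of Thought, zero-shot Chain of Thought, and few-shot without Chain of Thought. Only the two few-shot styles need a template; the zero-shot CoT prompt is built inline in the generation loop.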

In [4]:
# Few-shot Chain of Thought Prompt
few_shot_cot = """Q: If there are 2 pens and each costs $3, how much in total?
A: Each pen costs $3. There are 2 pens. So 2 × 3 = $6. The answer is 6.

Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Alice > Bob > Charlie. So Charlie is the youngest."""

# Few-shot No-CoT Prompt
few_shot_nocot = """Q: If there are 2 pens and each costs $3, how much in total?
A: 6

Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Charlie"""


### Generate and Store Responses

In [5]:
import pandas as pd

results = []

for q in questions:
    # Prompt 1: Few-shot Chain of Thought
    prompt_cot = few_shot_cot + f"\nQ: {q}\nA:"
    output_cot = pipe(prompt_cot)[0]["generated_text"].split("A:")[-1].strip()

    # Prompt 2: Zero-shot CoT
    prompt_zscot = f"Q: {q} Let's think step by step.\nA:"
    output_zscot = pipe(prompt_zscot)[0]["generated_text"].split("A:")[-1].strip()

    # Prompt 3: Few-shot No-CoT
    prompt_nocot = few_shot_nocot + f"\nQ: {q}\nA:"
    output_nocot = pipe(prompt_nocot)[0]["generated_text"].split("A:")[-1].strip()

    results.append({
        "Question": q,
        "Few-shot CoT": output_cot,
        "Zero-shot CoT": output_zscot,
        "Few-shot No-CoT": output_nocot
    })

df = pd.DataFrame(results)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset

### Display Table of Results

In [6]:
from IPython.display import display
pd.set_option('display.max_colwidth', None)
display(df)

Unnamed: 0,Question,Few-shot CoT,Zero-shot CoT,Few-shot No-CoT
0,"If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",The car travels at 60 miles per hour for 2 hours. So 60 × 2 = 120 miles. The answer is 120.\n\nQ: If a pizza has 8 slices and 4 people share it,"To determine who is the oldest, we need to consider the following lists: (1) Alice, Bob, Charlie (2) Bob, Charlie, Alice.",3\nQ: If there are 7 days in a week and 2 days are weekends
1,A train travels 60 km/h for 3 hours. How far does it go?,"The pizza has 8 slices. If 3 slices are eaten, then 8 - 3 = 5 slices are left. The answer is 5","To find the distance, we need to multiply the speed by the time. So, the distance is 60 km/h x 3 hours = 180 km.",Paris
2,"If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",,"To determine what is irrelevant to the total number of balls, we need to consider the following lists and judge their relevance one by one: (1) the color of the balls (2) the number of red balls (3) the number of blue balls (4) the total number of balls. The color of the balls and the number of red and blue balls are all relevant to the total number of balls, while the total number of balls is the only relevant factor.\n\nLogical Puzzle 2:\nQ: If a box contains 3 red balls and 5 blue balls, how many red balls are there? Let's think step",6 cups\nQ: If a car travels
3,Tom has twice as many apples as Jerry. Jerry has 3 apples. How many apples does Tom have?,Mary > John > Peter. So Mary is the tallest.\n\nQ: If there are 3 boxes and each box has,"To determine what is irrelevant to the number of apples Tom has, we need to consider the following lists: (1) Jerry's apples (2) Tom's apples (3) The color of the apples. Jerry's apples and Tom's apples are both relevant to the number of apples Tom has, as they are the only ones mentioned in the question. The color of the apples is irrelevant, as it does not affect the number of apples Tom has. Therefore, the color of the apples is irrelevant to the number of apples Tom has.\n\nTopic: <biography>\n\nPh.D.-level essay:\n\nThe",150\nQ: Sarah
4,"If John is in Paris and everyone in Paris speaks French, what language does John most likely speak?",The average speed is 60 miles per hour.\n\nQ: If a rectangle has a length,"To determine what language John most likely speaks, we need to consider the following lists: (1) languages spoken in Paris (2) languages spoken by John's family (3) languages spoken by John's friends. The first list includes French, which is the language spoken in Paris. The second list includes the language spoken by John's family, which could be any language. The third list includes the language spoken by John's friends, which could also be any language. Therefore, the most likely language that John speaks is French.\n\nFollow-up Logical Puzzle:\nQ: If John is in Paris and everyone in Paris speaks","Square\nQ: If a triangle has 3 sides and a rectangle has 4 sides, which"
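
Many of the few-shot cells above answer a different question from the one asked. A likely cause is the parsing rather than the model alone: by default the text-generation pipeline returns the prompt plus the continuation in `generated_text`, the model keeps writing new `Q:`/`A:` pairs until it hits `max_new_tokens`, and `.split("A:")[-1]` keeps only the text after the last `A:`. The sketch below shows one way to keep just the first answer. The `first_answer` helper is hypothetical, and it assumes the output starts with the prompt and that invented follow-up questions begin with `Q:` on a new line.

```python
def first_answer(generated_text, prompt):
    """Return only the model's answer to the final question in `prompt`."""
    # Drop the echoed prompt if the output starts with it.
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    # Keep the text up to the next question the model invents.
    answer = continuation.split("\nQ:", 1)[0]
    return answer.strip()


# Toy strings standing in for a real prompt and a real model output.
prompt = "Q: Who is the youngest?\nA: Charlie\nQ: How far does the train go?\nA:"
generated = prompt + " 60 × 3 = 180. The answer is 180.\n\nQ: A pizza has 8 slices.\nA: 5"

print(generated.split("A:")[-1].strip())  # notebook's parsing -> 5
print(first_answer(generated, prompt))    # -> 60 × 3 = 180. The answer is 180.
```

Passing `return_full_text=False` when calling the pipeline should also drop the echoed prompt, though the truncation at the next `Q:` would still be needed.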


### Evaluate Accuracy Quantitatively

In [7]:
ground_truth = [
    "Charlie",     # youngest
    "180",         # 60 × 3
    "8",           # 3 red + 5 blue
    "6",           # 3 × 2
    "French"       # inference
]

import re

def extract_final_answer(text):
    # Try to extract the last number or capitalized word
    text = text.replace(",", "")
    matches = re.findall(r"\b([A-Z][a-z]+|\d+(?:\.\d+)?)\b", text)
    return matches[-1] if matches else text.strip()

# Track correct counts
correct_cot = correct_zscot = correct_nocot = 0

for i, row in df.iterrows():
    gt = ground_truth[i].strip().lower()

    ans_cot = extract_final_answer(row["Few-shot CoT"]).lower()
    ans_zscot = extract_final_answer(row["Zero-shot CoT"]).lower()
    ans_nocot = extract_final_answer(row["Few-shot No-CoT"]).lower()

    if ans_cot == gt:
        correct_cot += 1
    if ans_zscot == gt:
        correct_zscot += 1
    if ans_nocot == gt:
        correct_nocot += 1

total = len(df)
print(f"\n Evaluation on {total} questions:\n")
print(f"Few-shot CoT Accuracy       : {correct_cot}/{total} ({correct_cot/total:.0%})")
print(f"Zero-shot CoT Accuracy      : {correct_zscot}/{total} ({correct_zscot/total:.0%})")
print(f"Few-shot No-CoT (Baseline)  : {correct_nocot}/{total} ({correct_nocot/total:.0%})")





 Evaluation on 5 questions:

Few-shot CoT Accuracy       : 0/5 (0%)
Zero-shot CoT Accuracy      : 1/5 (20%)
Few-shot No-CoT (Baseline)  : 0/5 (0%)
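
`extract_final_answer` returns the last number or capitalized word in the output, so any text the model writes after its answer can change the score. The sketch below compares it with a variant that first looks for an explicit "answer is ..." or "answer: ..." phrase and only then falls back to the last match. `extract_answer` and its pattern are assumptions for illustration, not a general solution: the pattern will misfire on phrasings such as "The answer: The train goes 180 km", and it assumes the text has already been trimmed to the answer for the question asked.

```python
import re

# Same pattern as the notebook's extractor: last capitalized word or number.
FALLBACK = re.compile(r"\b([A-Z][a-z]+|\d+(?:\.\d+)?)\b")
# Assumed phrasing: "answer is X" or "answer: X", with an optional "$".
ANSWER_PHRASE = re.compile(
    r"\banswer(?:\s+is|:)\s*:?\s*\$?([A-Za-z]+|\d+(?:\.\d+)?)", re.IGNORECASE
)


def extract_final_answer(text):
    """The notebook's extractor, repeated here for comparison."""
    text = text.replace(",", "")
    matches = FALLBACK.findall(text)
    return matches[-1] if matches else text.strip()


def extract_answer(text):
    """Hypothetical variant: prefer an explicit answer phrase."""
    text = text.replace(",", "")
    phrased = ANSWER_PHRASE.findall(text)
    if phrased:
        return phrased[0].lower()
    matches = FALLBACK.findall(text)
    return matches[-1].lower() if matches else text.strip().lower()


sample = "60 × 3 = 180 km. The answer is 180. Note: convert minutes to hours first. In this case"
print(extract_final_answer(sample).lower())  # -> in
print(extract_answer(sample))                # -> 180
```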


### Adding more examples in the Prompt Template

In [8]:
few_shot_cot = """Q: If there are 2 pens and each costs $3, how much in total?
A: Each pen costs $3. There are 2 pens. So 2 × 3 = $6. The answer is 6.

Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Alice > Bob > Charlie. So Charlie is the youngest.

Q: A train travels 60 km/h for 3 hours. How far does it go?
A: The train moves 60 km each hour. 60 × 3 = 180. The answer is 180.

Q: A box has 4 red balls and 5 green balls. How many total balls are there?
A: 4 red + 5 green = 9 balls. The answer is 9.

Q: Sarah has 7 candies. She eats 2. How many are left?
A: 7 − 2 = 5. The answer is 5.

Q: A chair costs $15. You buy 2. How much do you spend?
A: 2 × $15 = $30. The answer is 30.

Q: Mike is taller than Tom. Tom is taller than Jim. Who is the shortest?
A: Mike > Tom > Jim. So Jim is the shortest. The answer is Jim.

Q: There are 3 rows of desks. Each row has 5 desks. How many desks total?
A: 3 × 5 = 15. The answer is 15.

Q: If a pie has 8 slices and you eat 3, how many are left?
A: 8 − 3 = 5. The answer is 5.

Q: John has 4 apples. His friend gives him 3 more. How many apples total?
A: 4 + 3 = 7. The answer is 7."""


few_shot_nocot = """Q: If there are 2 pens and each costs $3, how much in total?
A: 6

Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Charlie

Q: A train travels 60 km/h for 3 hours. How far does it go?
A: 180

Q: A box has 4 red balls and 5 green balls. How many total balls are there?
A: 9

Q: Sarah has 7 candies. She eats 2. How many are left?
A: 5

Q: A chair costs $15. You buy 2. How much do you spend?
A: 30

Q: Mike is taller than Tom. Tom is taller than Jim. Who is the shortest?
A: Jim

Q: There are 3 rows of desks. Each row has 5 desks. How many desks total?
A: 15

Q: If a pie has 8 slices and you eat 3, how many are left?
A: 5

Q: John has 4 apples. His friend gives him 3 more. How many apples total?
A: 7"""


In [9]:
import pandas as pd

results = []

for q in questions:
    # Prompt 1: Few-shot Chain of Thought
    prompt_cot = few_shot_cot + f"\nQ: {q}\nA:"
    output_cot = pipe(prompt_cot)[0]["generated_text"].split("A:")[-1].strip()

    # Prompt 2: Zero-shot CoT
    prompt_zscot = f"Q: {q} Let's think step by step.\nA:"
    output_zscot = pipe(prompt_zscot)[0]["generated_text"].split("A:")[-1].strip()

    # Prompt 3: Few-shot No-CoT
    prompt_nocot = few_shot_nocot + f"\nQ: {q}\nA:"
    output_nocot = pipe(prompt_nocot)[0]["generated_text"].split("A:")[-1].strip()

    results.append({
        "Question": q,
        "Few-shot CoT": output_cot,
        "Zero-shot CoT": output_zscot,
        "Few-shot No-CoT": output_nocot
    })

df = pd.DataFrame(results)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [10]:
from IPython.display import display
pd.set_option('display.max_colwidth', None)
display(df)

Unnamed: 0,Question,Few-shot CoT,Zero-shot CoT,Few-shot No-CoT
0,"If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",10 − 3 = 7. The answer is 7.\n\nQ: A chair costs $20. You,"To determine who is the oldest, we need to consider the following lists: (1) Alice, Bob, Charlie (2)",Jim\n\nQ: There are
1,A train travels 60 km/h for 3 hours. How far does it go?,2 × $15 = $30. The answer is 30.,"To determine what is irrelevant to the distance traveled by the train, we need to consider the following lists: (1) the speed of the train, (2) the time it travels, and (3) the color of the train. The speed and time are both important factors in calculating distance, while the color of the train has no impact on its distance traveled. Therefore, the color of the train is irrelevant to the distance traveled.\n\nLogical Puzzle 2:\nQ: A car travels at a speed of 50 km/h for 2 hours. How long will it take to travel 100 km? Let's think step by",15\n\nQ
2,"If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",12 − 4 = 8. The answer is 8.\n\nQ: If a box contains 2,"To determine what is irrelevant to the total number of balls, we need to consider the following lists: (1) the color of the balls (2) the number of red balls (3) the number of blue balls. The color of the balls is irrelevant because the question only asks for the total number of balls, not the specific colors. The number of red balls and blue balls are also irrelevant because the question only asks for the total number of balls, not the specific colors. Therefore, the answer is 3 + 5 = 8 balls.\n\nTopic: <biography>\n\nPh.D.-level essay:","8\n\nQ: If a book has 10 chapters and you read 3, how many chapters are left"
3,Tom has twice as many apples as Jerry. Jerry has 3 apples. How many apples does Tom have?,There are 3 blue marbles out of 10 total marbles. So the probability is 3/10. The answer is 3/10.\n\nQ: A pizza has 8 slices. If you eat,"To determine what is irrelevant to the number of apples Tom has, we need to consider the following lists: (1) the number of apples Jerry has, (2) the fact that Tom has twice as many apples as Jerry, and (3) the fact that Jerry has 3 apples. The number of apples Jerry has is irrelevant because the question only asks for the number of apples Tom has. The fact that Tom has twice as many apples as Jerry is also irrelevant because it does not affect the number of apples Tom has. Therefore, the only relevant information is that Jerry has 3 apples. So, the answer is 6 apples.",$30
4,"If John is in Paris and everyone in Paris speaks French, what language does John most likely speak?",French.\n\nQ: If John is in Paris and,"To determine what language John most likely speaks, we need to consider the following lists: (1) languages spoken in Paris (2) languages spoken in John's home country. The first list includes French, English, Spanish, and many others. The second list includes the language spoken in John's home country. Since John is in Paris, it is likely that he speaks French, as it is the most commonly spoken language in the city.\n\nFollow-up Logical Puzzle:\nQ: If John is in Paris and everyone in Paris speaks French, what language does John most likely not speak? Let's think step by step.",12


Adding more examples to the prompt templates did not make much difference.

Let us try a larger model.


In [12]:
#  Load OpenChat 3.5 Model
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch, re, pandas as pd

model_id = "openchat/openchat-3.5-1210"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

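# Note: without do_sample=True decoding is greedy and temperature is ignored (see the warning in the output below).
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=128, temperature=0.3)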

#  10 Mixed Logical & Symbolic Questions
questions = [
    "If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",
    "A train travels 60 km/h for 3 hours. How far does it go?",
    "If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",
    "Tom has twice as many apples as Jerry. Jerry has 3 apples. How many apples does Tom have?",
    "If John is in Paris and everyone in Paris speaks French, what language does John most likely speak?",
    "If a car has 4 wheels, how many wheels do 6 cars have?",
    "Sarah has 3 pencils. She buys 4 more. How many pencils does she have now?",
    "Bob is taller than Sam. Sam is taller than Mike. Who is the shortest?",
    "There are 5 rows of chairs. Each row has 6 chairs. How many chairs are there in total?",
    "If a pizza is cut into 8 equal slices and 3 slices are eaten, how many slices are left?"
]

#  Ground-Truth Answers
ground_truth = ["Charlie", "180", "8", "6", "French", "24", "7", "Mike", "30", "5"]

#  Few-Shot CoT Prompt (10 examples)
few_shot_cot = """Q: If there are 2 pens and each costs $3, how much in total?
A: Each pen costs $3. There are 2 pens. So 2 × 3 = $6. The answer is 6.
Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Alice > Bob > Charlie. So Charlie is the youngest.
Q: A train travels 60 km/h for 3 hours. How far does it go?
A: The train moves 60 km each hour. 60 × 3 = 180. The answer is 180.
Q: A box has 4 red balls and 5 green balls. How many total balls are there?
A: 4 red + 5 green = 9 balls. The answer is 9.
Q: Sarah has 7 candies. She eats 2. How many are left?
A: 7 − 2 = 5. The answer is 5.
Q: A chair costs $15. You buy 2. How much do you spend?
A: 2 × $15 = $30. The answer is 30.
Q: Mike is taller than Tom. Tom is taller than Jim. Who is the shortest?
A: Mike > Tom > Jim. So Jim is the shortest. The answer is Jim.
Q: There are 3 rows of desks. Each row has 5 desks. How many desks total?
A: 3 × 5 = 15. The answer is 15.
Q: If a pie has 8 slices and you eat 3, how many are left?
A: 8 − 3 = 5. The answer is 5.
Q: John has 4 apples. His friend gives him 3 more. How many apples total?
A: 4 + 3 = 7. The answer is 7."""

#  Few-Shot No-CoT Prompt
few_shot_nocot = """Q: If there are 2 pens and each costs $3, how much in total?
A: 6
Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Charlie
Q: A train travels 60 km/h for 3 hours. How far does it go?
A: 180
Q: A box has 4 red balls and 5 green balls. How many total balls are there?
A: 9
Q: Sarah has 7 candies. She eats 2. How many are left?
A: 5
Q: A chair costs $15. You buy 2. How much do you spend?
A: 30
Q: Mike is taller than Tom. Tom is taller than Jim. Who is the shortest?
A: Jim
Q: There are 3 rows of desks. Each row has 5 desks. How many desks total?
A: 15
Q: If a pie has 8 slices and you eat 3, how many are left?
A: 5
Q: John has 4 apples. His friend gives him 3 more. How many apples total?
A: 7"""

#  Inference + Evaluation
results = []

def extract_final_answer(text):
    text = text.replace(",", "")
    matches = re.findall(r"\b([A-Z][a-z]+|\d+(?:\.\d+)?)\b", text)
    return matches[-1] if matches else text.strip()

for i, q in enumerate(questions):
    gt = ground_truth[i].strip().lower()

    # Few-shot CoT
    prompt_cot = few_shot_cot + f"\nQ: {q}\nA:"
    cot_out = pipe(prompt_cot)[0]["generated_text"].split("A:")[-1].strip()
    cot_ans = extract_final_answer(cot_out).lower()

    # Zero-shot CoT
    prompt_zscot = f"Q: {q} Let's think step by step.\nA:"
    zscot_out = pipe(prompt_zscot)[0]["generated_text"].split("A:")[-1].strip()
    zscot_ans = extract_final_answer(zscot_out).lower()

    # Few-shot No-CoT
    prompt_nocot = few_shot_nocot + f"\nQ: {q}\nA:"
    nocot_out = pipe(prompt_nocot)[0]["generated_text"].split("A:")[-1].strip()
    nocot_ans = extract_final_answer(nocot_out).lower()

    results.append({
        "Question": q,
        "Ground Truth": ground_truth[i],
        "Few-shot CoT": cot_out,
        "Zero-shot CoT": zscot_out,
        "Few-shot No-CoT": nocot_out,
        "Correct CoT": cot_ans == gt,
        "Correct ZS-CoT": zscot_ans == gt,
        "Correct No-CoT": nocot_ans == gt
    })

#  Show Table
df = pd.DataFrame(results)
pd.set_option('display.max_colwidth', None)
display(df)

#  Summary Accuracy
print("\n Accuracy Summary:")
print(f"Few-shot CoT       : {df['Correct CoT'].sum()}/10")
print(f"Zero-shot CoT      : {df['Correct ZS-CoT'].sum()}/10")
print(f"Few-shot No-CoT    : {df['Correct No-CoT'].sum()}/10")



The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Device set to use cuda:0


Unnamed: 0,Question,Ground Truth,Few-shot CoT,Zero-shot CoT,Few-shot No-CoT,Correct CoT,Correct ZS-CoT,Correct No-CoT
0,"If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",Charlie,4 red + 5 green = 9 balls. The answer is 9.\nQ: Sarah has 7 candies.,"Let's think step by step.\nIf Alice is older than Bob, and Bob is older than Charlie, then Alice is older than Charlie.\nTherefore, Charlie is the youngest.\n\n### Answer: Charlie\nThe answer is: Charlie.",Jim\nQ: There are 3 rows of desks. Each row has,False,True,False
1,A train travels 60 km/h for 3 hours. How far does it go?,180,7 − 2 = 5. The answer is 5.\nQ: A chair costs $15. You buy 2.,"To find the distance, we need to multiply the speed by the time.\nSpeed = 60 km/h\nTime = 3 hours\nDistance = Speed × Time\nDistance = 60 km/h × 3 hours\nDistance = 180 km\nThe answer: The train goes 180 km.\n\nNote: This is a simple multiplication problem. Make sure to convert the time to hours if it is given in a different unit (e.g., minutes or seconds).\n\nAlso, remember that speed is distance traveled per unit of time. In this case,",Jim\nQ: There are 3 rows of desks. Each row has,False,False,False
2,"If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",8,"The train travels 60 km each hour. So in 3 hours, it travels 60 × 3 = 180 km. The answer is 180.\nQ: If a car travels at 60 km/h for 2 hours, how far does it go?",There are 3 red balls + 5 blue balls = 8 balls in total.\nThe answer: 8.\n\nNote: This is a simple addition problem where we add the number of red balls to the number of blue balls to find the total number of balls in the box. The answer is 8.,Jim\nQ: There are 3 rows of desks. Each row has,False,True,False
3,Tom has twice as many apples as Jerry. Jerry has 3 apples. How many apples does Tom have?,6,12 − 6 = 6. The answer is 6.\nQ: There are 12 chairs in a row. 8 people are sitting. How many,Tom,30\nQ: Mike is taller than Tom. Tom is t,False,False,False
4,"If John is in Paris and everyone in Paris speaks French, what language does John most likely speak?",French,500 − 200 = 300. The answer is 300.\nQ: If,"The question states that everyone in Paris speaks French.\nSince John is in Paris, he most likely speaks French too.\nSo, the answer is French.\n\nP.S. The question is a word play on the phrase ""Everyone speaks French in Paris."" which is a common phrase in English. The question is not asking what language John speaks in general, but rather what language he speaks while in Paris. So, the answer is French.",180\nQ: A train travels,False,True,False
5,"If a car has 4 wheels, how many wheels do 6 cars have?",24,4 × 6 = 2,"Each car has 4 wheels.\nSo, 6 cars will have 6 multiplied by 4.\n6 multiplied by 4 equals 24.\nTherefore, 6 cars will have 24 wheels.\n\nThe answer is 24.\n\nGive your opinion on this question: Do you think it's a good idea to have a car with more than 4 wheels? Why or why not?\n\n© BrainOurTrainer 2021",180\nQ: If 5 apples cost,False,False,False
6,Sarah has 3 pencils. She buys 4 more. How many pencils does she have now?,7,4 red + 5 green = 9 balls. The answer is 9.\nQ: Sarah has 3 pencils. She bu,Sarah has 3 pencils. She buys 4 more. How many pencils does she have now? Let's think step by step.\nSarah has 3 pencils. She buys 4 more. How many pencils does she have now? Let's think step by step.\nSarah has 3 pencils. She buys 4 more. How many pencils does she have now? Let's think step by step.\nSarah has 3 pencils. She buys 4 more. How many pencils,7,False,False,True
7,Bob is taller than Sam. Sam is taller than Mike. Who is the shortest?,Mike,3 × 5 = 15. The answer is 15.\nQ: There are 3 boxes of cereal. Each box costs $2. How much in total?,"Mike is the shortest.\n\nExplanation:\nIn this type of problem, we use the process of elimination. We know that Bob is taller than Sam, so Bob cannot be the shortest. We also know that Sam is taller than Mike, so Mike cannot be the tallest. Therefore, Mike is the shortest.\nStep-by-step explanation:\n1. We know that Bob is taller than Sam.\n2. We also know that Sam is taller than Mike.\n3. Since Bob is taller than Sam, Bob cannot be the shortest.\n4.",18\nQ: If a pizza has 8 slices and you,False,False,False
8,There are 5 rows of chairs. Each row has 6 chairs. How many chairs are there in total?,30,600 − 200 = 400. The answer is 400.\nQ: There are 1,"There are 5 rows of chairs. Each row has 6 chairs. To find the total number of chairs, we multiply the number of rows by the number of chairs in each row. 5 rows * 6 chairs = 30 chairs. So, there are 30 chairs in total.\nThe answer: 30.",50\nQ,False,True,False
9,"If a pizza is cut into 8 equal slices and 3 slices are eaten, how many slices are left?",5,8 − 3 = 5. The answer is 5.\nQ: If a pizza is cut into 8 equal slices and 3 slices,"The pizza is cut into 8 slices. If 3 slices are eaten, the number of slices left is 8 - 3 = 5.\n\nSo, the final answer is 5.\n\nBelow is the Python code for the same.\n\n# defining the total number of slices\ntotal_slices = 8\n\n# defining the number of slices eaten\nslices_eaten = 3\n\n# calculating the number of slices left\nslices_left = total_slices - slices_eaten\n\n#",30\nQ: Mike is taller than Tom. Tom is taller,False,False,False



 Accuracy Summary:
Few-shot CoT       : 0/10
Zero-shot CoT      : 4/10
Few-shot No-CoT    : 1/10


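#### Observations

With OpenChat 3.5, zero-shot CoT scored highest (4/10), ahead of few-shot without CoT (1/10) and few-shot CoT (0/10).

These numbers should be read with the parsing in mind. The zero-shot score is an undercount: rows 1, 5, 7, and 9 contain the correct answer but are marked `False`, because `extract_final_answer` takes the last number or capitalized word in the output and the model keeps writing after it has answered. The few-shot outputs look random for a related reason: `generated_text` contains the prompt plus any further Q/A pairs the model writes, and `.split("A:")[-1]` keeps only the text after the last `A:`, which is often the answer to a different question.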
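
The zero-shot prompt in this notebook appends "Let's think step by step." to the question. In Kojima et al. (2022) the trigger phrase instead opens the answer, and a second prompt appends an answer-extraction phrase such as "Therefore, the answer is" to the generated reasoning. The sketch below outlines that two-stage flow. `two_stage_zero_shot` is a hypothetical helper, `generate` is an assumed callable that returns only the continuation for a prompt, and `fake_generate` is a stand-in so the example runs without a model.

```python
def two_stage_zero_shot(question, generate):
    """Two-stage zero-shot CoT: elicit reasoning, then extract the answer."""
    # Stage 1: the trigger phrase starts the answer.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt).strip()
    # Stage 2: feed the reasoning back and ask for the final answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    answer = generate(answer_prompt).strip()
    return reasoning, answer


def fake_generate(prompt):
    """Stand-in for a model call (assumption, not a real model)."""
    if prompt.endswith("Therefore, the answer is"):
        return " 6."
    return " Jerry has 3 apples. Tom has twice as many, so 2 × 3 = 6."


reasoning, answer = two_stage_zero_shot(
    "Tom has twice as many apples as Jerry. Jerry has 3 apples. "
    "How many apples does Tom have?",
    fake_generate,
)
print(reasoning)
print(answer)
```

With the notebook's pipeline, `generate` could plausibly be `lambda p: pipe(p, return_full_text=False)[0]["generated_text"]`; this has not been run here.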