# LLM reasoning pop quiz

Do open-source LLMs have the reasoning prowess of their closed-source siblings?

<a target="_blank" href="https://colab.research.google.com/github/daniel-furman/LLM-reasoning-pop-quiz/blob/main/notebooks/falcon-40b-instruct.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Table of Contents

1. Setup
2. Read in the yaml config for the run
3. Load the model
4. Run the model

## Setup

In [1]:
# detailed information on the GPU

!nvidia-smi

Sat Nov 11 18:10:08 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0    44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!git clone https://github.com/daniel-furman/LLM-reasoning-pop-quiz.git

Cloning into 'LLM-reasoning-pop-quiz'...
remote: Enumerating objects: 617, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 617 (delta 0), reused 1 (delta 0), pack-reused 613
Receiving objects: 100% (617/617), 291.89 KiB | 2.26 MiB/s, done.
Resolving deltas: 100% (347/347), done.


In [3]:
# install necessary libraries
import os

os.chdir("/content/LLM-reasoning-pop-quiz")
!pip install -q -U -r requirements.txt
os.chdir("..")

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Preparing metadata (setup.py) ... done

In [4]:
#!pip list

In [5]:
# import libraries

import transformers
import torch
import time
import yaml

# import helpers

from boilers import llm_boiler

In [6]:
# set the seed

transformers.set_seed(42)

In [7]:
# build a per-GPU memory cap: free memory minus 2 GB of headroom

free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
max_memory = f"{free_in_GB-2}GB"

n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}
max_memory

{0: '37GB'}
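
The cell above leaves 2 GB of headroom and builds a per-device map of the kind that `from_pretrained` accepts as `max_memory` when a model is loaded with `device_map="auto"`. Whether `llm_boiler` actually consumes this map is not shown in this notebook. Below is a minimal sketch of the same arithmetic as a pure function. The name `build_max_memory` and the `headroom_gb` parameter are hypothetical, and unlike the cell above it takes one free-memory reading per GPU instead of reusing the first device's value.

```python
def build_max_memory(free_bytes_per_gpu, headroom_gb=2):
    """Map each GPU index to a memory cap string such as '37GB'.

    free_bytes_per_gpu: free memory in bytes, one entry per GPU
    (for example torch.cuda.mem_get_info(i)[0] for each device i).
    """
    caps = {}
    for gpu_index, free_bytes in enumerate(free_bytes_per_gpu):
        free_gb = int(free_bytes / 1024**3)  # truncate to whole units of 1024**3 bytes
        caps[gpu_index] = f"{max(free_gb - headroom_gb, 0)}GB"
    return caps


# A single GPU reporting roughly 39.4 * 1024**3 free bytes yields a 37GB cap.
print(build_max_memory([int(39.4 * 1024**3)]))
```

Clamping at zero keeps the cap from going negative on a nearly full device.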

## Read in the yaml config for the run

In [8]:
with open("/content/LLM-reasoning-pop-quiz/configs/pop_quiz.yml", "r") as file:
    pop_quiz = yaml.safe_load(file)
pop_quiz

{'prompts': {'zero_shot': ["A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there?\nLet's respond succinctly and work this out in a step by step way to be sure we have the right answer.",
   "Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this make logical sense?\nLet's respond succinctly and work this out in a step by step way to be sure we have the right answer.",
   "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?\nLet's respond succinctly and work this out in a step by step way to be sure we have the right answer.",
   "Roger has 3 children. Each of his kids invited 4 of their friends to come to the birthday party. All of the friends came to the party. How many children are at the party?\nLet's respond succinctly and work this out in a step by step way to be sure we

## Load the model

In [9]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the '

In [10]:
# load model
# this cell takes a long time; to avoid the wait, deploy the LLM behind an API inference endpoint instead

model_id = "meta-llama/Llama-2-70b-chat-hf"

model = llm_boiler(model_id)

Load function recognized for meta-llama/Llama-2-70b-chat-hf: llama_loader
Run function recognized for meta-llama/Llama-2-70b-chat-hf: llama



In [11]:
print(model.name, "\n")
print(model.tokenizer, "\n")
print(model.model, "\n")

llama 

LlamaTokenizerFast(name_or_path='meta-llama/Llama-2-70b-chat-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
} 

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 8192)
    (layers): ModuleList(
      (0-79): 80 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=8192, out_features=8192, bias=False)
          (k_proj): Linear4bit(in_features=8192, out_feat

## Run the model

* For text generation options, refer to the [`TextGenerationPipeline` documentation](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextGenerationPipeline).
* The prompts below are borrowed from the OpenAI Cookbook's [Techniques to improve reliability](https://github.com/openai/openai-cookbook/blob/main/techniques_to_improve_reliability.md).
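
The generation calls in this section pass `temperature=0.01` together with `do_sample=True`. Dividing the logits by such a small temperature concentrates nearly all probability on the top token, so sampling behaves almost like greedy decoding. The sketch below illustrates this with made-up logits. It is not the library's implementation, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities after dividing logits by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # hypothetical next-token scores
print(softmax_with_temperature(logits, 1.0))   # mass spread across tokens
print(softmax_with_temperature(logits, 0.01))  # almost all mass on the top token
```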

### Example 1: Zero-shot reasoning conditioned on good performance
* From https://arxiv.org/abs/2205.11916
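
Each zero-shot prompt in the config is a plain question followed by a fixed trigger sentence, and the debug output shows the helper wrapping that text in a Llama 2 chat layout before generation. The sketch below reproduces both steps under assumptions. The function names are hypothetical, the layout is inferred from the printed prompts instead of the helper's source, and the system prompt is left as a parameter.

```python
TRIGGER = (
    "Let's respond succinctly and work this out in a step by step way "
    "to be sure we have the right answer."
)


def make_zero_shot_prompt(question):
    """Append the reasoning trigger sentence to a plain question."""
    return f"{question}\n{TRIGGER}"


def wrap_llama2_chat(user_message, system_prompt):
    """Wrap a message in the [INST] / <<SYS>> layout seen in the debug output."""
    return f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"


question = (
    "The cafeteria had 23 apples. If they used 20 to make lunch "
    "and bought 6 more, how many apples do they have?"
)
print(wrap_llama2_chat(make_zero_shot_prompt(question), "You are a helpful AI assistant."))
```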

In [12]:
# run zero shot questions

for itr, prompt in enumerate(pop_quiz["prompts"]["zero_shot"]):
    print(f"Question 1.{itr+1}")
    start_time = time.time()
    generated_text = model.run(
        prompt=prompt,
        eos_token_ids=model.tokenizer.eos_token_id,
        max_new_tokens=1024,
        temperature=0.01,
        do_sample=True,
        top_p=0.92,
        top_k=50,
        repetition_penalty=1.1,
        num_return_sequences=1,
        debug=True,
    )
    print("--- %s seconds ---" % (time.time() - start_time))
    print("\n")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Question 1.1
*** Prompt
<s>[INST] <<SYS>>
Your are a helpful AI assistant.
<</SYS>>

A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there?
Let's respond succinctly and work this out in a step by step way to be sure we have the right answer. [/INST]
*** Generated
Sure, I'd be happy to help you with that! Here's a step-by-step guide to finding the number of blue golf balls:

1. First, let's find out how many golf balls there are in total:
16 balls / 2 (half of the balls are golf balls) = 8 golf balls
2. Now, let's find out how many blue golf balls there are:
8 golf balls / 2 (half of the golf balls are blue) = 4 blue golf balls

So, there are 4 blue golf balls.
--- 31.395618438720703 seconds ---


Question 1.2
*** Prompt
<s>[INST] <<SYS>>
Your are a helpful AI assistant.
<</SYS>>

Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does th

### Example 2: Least to most prompting
* From https://arxiv.org/abs/2205.10625
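
Least-to-most prompting answers an easier sub-question first, then appends that answer to the prompt for the next sub-question. The sketch below shows the chaining pattern with a stand-in model so it runs without a GPU. `least_to_most` and `fake_model` are hypothetical names, and the second sub-question is made up for illustration.

```python
def least_to_most(sub_questions, ask):
    """Answer sub-questions in order, feeding each answer into the next prompt.

    ask: any callable mapping a prompt string to a response string.
    Returns the list of (prompt, response) pairs.
    """
    transcript = []
    context = ""
    for question in sub_questions:
        prompt = f"{context}{question}"
        response = ask(prompt)
        transcript.append((prompt, response))
        # Later sub-questions see every earlier question and answer.
        context = f"{prompt}\n{response}\n"
    return transcript


def fake_model(prompt):
    """Stand-in for a real LLM call; reports how many lines it was given."""
    return f"(answer after reading {len(prompt.splitlines())} line(s))"


steps = least_to_most(
    ["How long does each trip take?", "How many times can she slide in 15 minutes?"],
    fake_model,
)
for prompt, response in steps:
    print(prompt, "->", response)
```

Passing the model call in as `ask` keeps the chaining logic separate from any particular model wrapper.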


In [13]:
# run least to most questions

for itr, prompts in enumerate(pop_quiz["prompts"]["least_to_most"]):
    print(f"Question 2.{itr+1}")
    # Start with sub question #1
    sub_question_1 = prompts[0]

    start_time = time.time()
    res_1 = model.run(
        prompt=sub_question_1,
        eos_token_ids=model.tokenizer.eos_token_id,
        max_new_tokens=1024,
        temperature=0.01,
        do_sample=True,
        top_p=0.92,
        top_k=50,
        repetition_penalty=1.1,
        num_return_sequences=1,
        debug=True,
    )
    print("--- %s seconds ---" % (time.time() - start_time))
    print("\n")

    # Now do sub question #2 by appending answer to sub question #1
    sub_question_2 = f"{sub_question_1}\n{res_1}\n{prompts[1]}"

    start_time = time.time()
    res_2 = model.run(
        prompt=sub_question_2,
        eos_token_ids=model.tokenizer.eos_token_id,
        max_new_tokens=1000,
        temperature=0.01,
        do_sample=True,
        top_p=0.92,
        top_k=50,
        repetition_penalty=1.1,
        num_return_sequences=1,
        debug=True,
    )
    print("--- %s seconds ---" % (time.time() - start_time))
    print("\n")

Question 2.1
*** Prompt
<s>[INST] <<SYS>>
Your are a helpful AI assistant.
<</SYS>>

It takes Amy 4 minutes to climb to the top of a slide. It takes her 1 minute to slide down. How long does each trip take?
Let's respond succinctly and work this out in a step by step way to be sure we have the right answer. [/INST]
*** Generated
Sure, I'd be happy to help you with that!

To find the total time it takes for Amy to make one trip up and down the slide, we need to add the time it takes her to climb up and the time it takes her to slide down.

Climbing up takes 4 minutes, so that's 4 minutes of time.
Slide down takes 1 minute, so that's 1 minute of time.

Adding these two times together gives us a total of 5 minutes.

Therefore, each trip up and down the slide takes Amy 5 minutes.
--- 27.829980850219727 seconds ---


*** Prompt
<s>[INST] <<SYS>>
Your are a helpful AI assistant.
<</SYS>>

It takes Amy 4 minutes to climb to the top of a slide. It takes her 1 minute to slide down. How long doe

### Example 3: Tab-CoT

* See https://arxiv.org/abs/2305.17812
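
Tab-CoT asks the model to lay out its reasoning as a table of steps, which makes the final result easy to extract programmatically. The sketch below parses such a response with plain string handling. `parse_markdown_table` is a hypothetical helper, and the sample response is copied from the Question 3.1 output in this notebook (with the closing sentence shortened).

```python
def parse_markdown_table(text):
    """Parse pipe-delimited table rows into dicts keyed by lowercased header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip prose around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(cell and set(cell) <= set(":-") for cell in cells):
            continue  # skip the |---|---| separator row
        rows.append(cells)
    if not rows:
        return []
    header = [name.lower() for name in rows[0]]
    return [dict(zip(header, row)) for row in rows[1:]]


response = """| Step | Subquestion | Procedure | Result |
| --- | --- | --- | --- |
| 1 | How many rows of red tulips can Jackson plant? | Divide 36 by 6 | 6 rows |
| 2 | How many rows of blue tulips can Jackson plant? | Divide 24 by 8 | 3 rows |
| 3 | Total number of rows | Add 6 and 3 | 9 rows |

Therefore, Jackson will plant 9 rows of flowers."""

table = parse_markdown_table(response)
print(table[-1]["result"])
```

Reading the last row's `result` cell assumes the model places the final answer in the last step, which holds for this sample but is not guaranteed in general.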

In [14]:
# run tab-cot questions

for itr, prompts in enumerate(pop_quiz["prompts"]["tab_cot"]):
    print(f"Question 3.{itr+1}")
    # Only the first prompt of each Tab-CoT entry is used
    sub_question_1 = prompts[0]

    start_time = time.time()
    res_1 = model.run(
        prompt=sub_question_1,
        eos_token_ids=model.tokenizer.eos_token_id,
        max_new_tokens=1024,
        temperature=0.01,
        do_sample=True,
        top_p=0.92,
        top_k=50,
        repetition_penalty=1.1,
        num_return_sequences=1,
        debug=True,
    )
    print("--- %s seconds ---" % (time.time() - start_time))
    print("\n")

Question 3.1
*** Prompt
<s>[INST] <<SYS>>
Your are a helpful AI assistant.
<</SYS>>

Jackson is planting tulips. He can fit 6 red tulips in a row and 8 blue tulips in a row. If Jackson buys 36 red tulips and 24 blue tulips, how many rows of flowers will he plant?
Respond as succinctly as possible. Format the response as a completion of this table.
|step|subquestion|procedure|result|
|:---|:----------|:--------|:-----:| [/INST]
*** Generated
| Step | Subquestion | Procedure | Result |
| --- | --- | --- | --- |
| 1 | How many rows of red tulips can Jackson plant? | Divide 36 by 6 | 6 rows |
| 2 | How many rows of blue tulips can Jackson plant? | Divide 24 by 8 | 3 rows |
| 3 | Total number of rows | Add 6 and 3 | 9 rows |

Therefore, Jackson will plant 9 rows of flowers - 6 rows of red tulips and 3 rows of blue tulips.
--- 28.572017192840576 seconds ---


Question 3.2
*** Prompt
<s>[INST] <<SYS>>
Your are a helpful AI assistant.
<</SYS>>

The bakers at the Beverly Hills Bakery baked 200 