# LLM reasoning pop quiz

Do open-source LLMs have the reasoning prowess of their closed-source siblings?

* Google Colab notebook information
  * GPU: A100-SXM 40 GB
  * Disk: 166.8 GB
* Details on falcon-40b-instruct
  * Documentation: [Hugging Face model card](https://huggingface.co/tiiuae/falcon-40b-instruct)
  * Latency: See the runtimes reported with the examples below
  * Memory footprint: Roughly 25 GB of GPU RAM used with 4-bit quantization
  * License: [Apache 2.0](https://huggingface.co/tiiuae/falcon-40b-instruct#license)
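
The 25 GB figure is roughly in line with a back-of-the-envelope estimate. The sketch below assumes a nominal 40 billion parameters (read off the model name, not an exact count) and counts weights only, ignoring quantization constants, activations, and the KV cache.

```python
# Rough weight-memory estimate.
# NOMINAL_PARAMS is an assumption based on the model name, not an exact count.
NOMINAL_PARAMS = 40e9


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NOMINAL_PARAMS, bits):.0f} GB")
```

At 4 bits the weights alone come to about 20 GB. The few extra GB observed in practice are plausibly layers kept in higher precision plus runtime buffers.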


<a target="_blank" href="https://colab.research.google.com/github/daniel-furman/LLM-reasoning-pop-quiz/blob/main/notebooks/falcon-40b-instruct_core_pop_quiz.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Setup

In [1]:
# detailed information on the GPU

!nvidia-smi

Mon Jun  5 22:04:51 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0    45W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!git clone https://github.com/daniel-furman/LLM-reasoning-pop-quiz.git

Cloning into 'LLM-reasoning-pop-quiz'...
remote: Enumerating objects: 171, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 171 (delta 0), reused 3 (delta 0), pack-reused 165
Receiving objects: 100% (171/171), 72.12 KiB | 18.03 MiB/s, done.
Resolving deltas: 100% (71/71), done.


In [3]:
!ls

LLM-reasoning-pop-quiz	sample_data


In [4]:
# install necessary libraries
import os

os.chdir("/content/LLM-reasoning-pop-quiz")
!pip install -q -U -r requirements.txt
os.chdir("../..")

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Preparing metadata (setup.py) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.2/42.2 kB 5.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92.2/92.2 MB 19.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 86.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB

In [None]:
# import libraries

import transformers
import torch
import time
import yaml

# import helpers

from drf_llm_boilers import llm_boiler

In [6]:
# set the seed

transformers.set_seed(4129408)

In [7]:
# compute the per-GPU memory budget: free memory minus 2 GB of headroom

free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
max_memory = f"{free_in_GB-2}GB"

n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}
max_memory

{0: '37GB'}
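
The cell above applies the free memory of the current device to every GPU. As a hedged alternative, the sketch below computes the budget per device from a list of free-byte readings. `build_max_memory` is a hypothetical helper, not part of the repository; in the notebook, the readings could come from `torch.cuda.mem_get_info(i)[0]` for each device index `i`. Whether `llm_boiler` consumes such a mapping is not shown here.

```python
GIB = 1024**3


def build_max_memory(free_bytes_per_gpu, headroom_gb=2):
    """Map GPU index -> memory budget string, leaving some headroom per device."""
    return {
        i: f"{int(free / GIB) - headroom_gb}GB"
        for i, free in enumerate(free_bytes_per_gpu)
    }


# Example with a made-up free-memory reading of 39.4 GiB on a single GPU.
print(build_max_memory([int(39.4 * GIB)]))
```

A dictionary of this shape is the format accepted by the `max_memory` argument of `from_pretrained` when `device_map="auto"` is used.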

## Read in the yaml config for the run

In [8]:
with open(
    "/content/LLM-reasoning-pop-quiz/configs/pop_quiz_v0_qa_style.yml", "r"
) as file:
    pop_quiz = yaml.safe_load(file)
pop_quiz

{'prompts': {'zero_shot': ["Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question.",
   "Q: Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this make logical sense? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."],
  'cot_few_shot': ["Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does have now? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question. Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answe
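
Because the later cells index straight into this dictionary, a quick shape check can catch a malformed config early. The sketch below is a hypothetical validator, not part of the repository. It assumes that `zero_shot` and `cot_few_shot` hold lists of prompt strings and that each `least_to_most` entry is a list of at least two sub-question strings, which is how the cells in this notebook use them.

```python
def check_pop_quiz(config):
    """Raise ValueError if the config does not have the assumed shape."""
    prompts = config.get("prompts")
    if not isinstance(prompts, dict):
        raise ValueError("missing top-level 'prompts' mapping")
    for key in ("zero_shot", "cot_few_shot"):
        entries = prompts.get(key, [])
        if not isinstance(entries, list) or not all(isinstance(p, str) for p in entries):
            raise ValueError(f"'{key}' must be a list of strings")
    for chain in prompts.get("least_to_most", []):
        if not isinstance(chain, (list, tuple)) or len(chain) < 2:
            raise ValueError("each 'least_to_most' entry needs at least two sub-questions")
        if not all(isinstance(p, str) for p in chain):
            raise ValueError("sub-questions must be strings")
    return True


# Toy config with placeholder prompts, for illustration only.
toy_config = {
    "prompts": {
        "zero_shot": ["Q: ... A: ..."],
        "cot_few_shot": ["Q: ... A: ... Q: ... A: ..."],
        "least_to_most": [["Q: first sub-question A:", "Q: second sub-question A:"]],
    }
}
print(check_pop_quiz(toy_config))
```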

## Load the model

In [9]:
# load falcon-40b-instruct
# see source: https://huggingface.co/tiiuae/falcon-40b-instruct#how-to-get-started-with-the-model

# this cell takes a long time to run; to avoid the wait, deploy the LLM as an API inference endpoint

model_id = "tiiuae/falcon-40b-instruct"

model = llm_boiler(model_id, lora=False)

Load function recognized for tiiuae/falcon-40b-instruct: falcon_loader
Run function recognized for tiiuae/falcon-40b-instruct: falcon
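
The two log lines above suggest that the helper picks a load function and a run function based on the model ID. The sketch below illustrates one way such a dispatch could work. It is an assumption for illustration only, not the actual `drf_llm_boilers` implementation; `REGISTRY` and `recognize` are hypothetical names.

```python
# Hypothetical registry: substring of the model ID -> (loader name, runner name).
REGISTRY = {
    "falcon": ("falcon_loader", "falcon"),
}


def recognize(model_id: str):
    """Return the (loader, runner) names registered for a model ID."""
    for key, names in REGISTRY.items():
        if key in model_id.lower():
            return names
    raise KeyError(f"no load/run functions registered for {model_id}")


loader, runner = recognize("tiiuae/falcon-40b-instruct")
print(f"Load function recognized: {loader}")
print(f"Run function recognized: {runner}")
```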


A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-40b-instruct:
- configuration_RW.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-40b-instruct:
- modelling_RW.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


In [10]:
print(model.name, "\n")
print(model.tokenizer, "\n")
print(model.model, "\n")

falcon 

PreTrainedTokenizerFast(name_or_path='tiiuae/falcon-40b-instruct', vocab_size=65024, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|endoftext|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['>>TITLE<<', '>>ABSTRACT<<', '>>INTRODUCTION<<', '>>SUMMARY<<', '>>COMMENT<<', '>>ANSWER<<', '>>QUESTION<<', '>>DOMAIN<<', '>>PREFIX<<', '>>SUFFIX<<', '>>MIDDLE<<']}, clean_up_tokenization_spaces=True) 

RWForCausalLM(
  (transformer): RWModel(
    (word_embeddings): Embedding(65024, 8192)
    (h): ModuleList(
      (0-59): 60 x DecoderLayer(
        (ln_attn): LayerNorm((8192,), eps=1e-05, elementwise_affine=True)
        (ln_mlp): LayerNorm((8192,), eps=1e-05, elementwise_affine=True)
        (self_attention): Attention(
          (maybe_rotary): RotaryEmbedding()
          (query_key_value): Linear4bit(in_features=8192, out_features=9216, bias=False)
          (dense): Linear4bit(in_features=8192, out_fe
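
The `out_features=9216` on `query_key_value` exceeds the hidden size because queries, keys, and values are fused into a single projection. A quick arithmetic check, assuming 128 query heads and 8 key/value heads (values from the published Falcon-40B configuration, not shown in this notebook):

```python
hidden_size = 8192    # from the printed model
n_query_heads = 128   # assumption: published config value
n_kv_heads = 8        # assumption: published config value

# Each head has the same width; keys and values each use n_kv_heads heads.
head_dim = hidden_size // n_query_heads
qkv_out_features = (n_query_heads + 2 * n_kv_heads) * head_dim
print(head_dim, qkv_out_features)
```

Under these assumptions the result matches the printed layer width, which suggests that many query heads share each key/value head.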

## Run the model

* For text generation options, refer to the [`TextGenerationPipeline` documentation](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextGenerationPipeline)
* The prompts below are borrowed from the OpenAI Cookbook guide, [Techniques to improve reliability](https://github.com/openai/openai-cookbook/blob/main/techniques_to_improve_reliability.md)
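
The cells in this section call `model.run` with `temperature=0.01` and `do_sample=True`. The sketch below, using made-up logits for three candidate tokens, shows why such a low temperature makes sampling behave almost like greedy decoding: dividing the logits by a small temperature pushes nearly all probability onto the top token. This is the standard temperature-scaled softmax, written in plain Python for illustration; it is not the library's internal code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.01):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

With the top token holding almost all of the probability mass, the `top_p` and `top_k` filters have little left to remove.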

### Example 1: Zero-shot reasoning conditioned on good performance
* From https://arxiv.org/abs/2205.11916
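
Every zero-shot prompt in the config follows the same template: the question, then a fixed instruction that nudges the model to reason step by step. A minimal sketch of that template follows; `make_zero_shot_prompt` is a hypothetical helper, and the instruction text is copied from the config shown earlier.

```python
INSTRUCTION = (
    "Let's work this out in a step by step way to be sure we have the right answer. "
    "Respond as succinctly as possible to answer the question."
)


def make_zero_shot_prompt(question: str) -> str:
    """Wrap a question in the Q/A template used by the zero-shot prompts."""
    return f"Q: {question} A: {INSTRUCTION}"


print(
    make_zero_shot_prompt(
        "A juggler has 16 balls. Half of the balls are golf balls and half of "
        "the golf balls are blue. How many blue golf balls are there?"
    )
)
```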

In [12]:
# run zero shot questions

for itr, prompt in enumerate(pop_quiz["prompts"]["zero_shot"]):
    print(f"Question 1.{itr+1}")
    print(f'Prompt: "{prompt}"\n')
    start_time = time.time()
    generated_text = model.run(
        prompt=prompt,
        eos_token_ids=model.tokenizer.eos_token_id,
        max_new_tokens=256,
        temperature=0.01,
        do_sample=True,
        top_p=0.92,
        top_k=50,
        num_return_sequences=1,
    )
    print("--- %s seconds ---" % (time.time() - start_time))
    print("\n")
    print(f'Text generations: "{generated_text}"\n\n')

Question 1.1
Prompt: "Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."

Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question.
There are 8 golf balls in total. Half of the golf balls are blue, so there are 4 blue golf balls.<|endoftext|>
--- 24.779115915298462 seconds ---


Text generations: "
There are 8 golf balls in total. Half of the golf balls are blue, so there are 4 blue golf balls."


Question 1.2
Prompt: "Q: Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this
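
In the Question 1.1 output, the raw generation echoes the prompt and ends with `<|endoftext|>`, while the text printed under "Text generations" has both removed. How the helper does this is not shown; the sketch below is one plausible way, with `extract_completion` as a hypothetical name.

```python
EOS_TOKEN = "<|endoftext|>"  # eos_token shown in the tokenizer printout


def extract_completion(raw_output: str, prompt: str, eos_token: str = EOS_TOKEN) -> str:
    """Drop the echoed prompt and the trailing EOS token from a raw generation."""
    text = raw_output
    if text.startswith(prompt):
        text = text[len(prompt):]
    if text.endswith(eos_token):
        text = text[: -len(eos_token)]
    return text


toy_prompt = "Q: How many blue golf balls are there? A:"  # shortened for illustration
toy_raw = toy_prompt + "\nThere are 4 blue golf balls." + EOS_TOKEN
print(repr(extract_completion(toy_raw, toy_prompt)))
```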

### Example 2: Chain-of-thought reasoning with few-shot examples
* From https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html
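
A few-shot chain-of-thought prompt prepends one or more worked examples to the new question, so the model can imitate the reasoning style. The sketch below assembles such a prompt in the same Q/A layout as the config; `make_few_shot_prompt` is a hypothetical helper, and the exemplar text is copied verbatim from the config.

```python
INSTRUCTION = (
    "Let's work this out in a step by step way to be sure we have the right answer. "
    "Respond as succinctly as possible to answer the question."
)


def make_few_shot_prompt(exemplars, question):
    """Join worked (question, reasoning) pairs, then append the new question."""
    parts = [f"Q: {q} A: {INSTRUCTION} {worked}" for q, worked in exemplars]
    parts.append(f"Q: {question} A: {INSTRUCTION}")
    return " ".join(parts)


exemplars = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
        "5 + 6 = 11. The answer is 11.",
    )
]
prompt = make_few_shot_prompt(
    exemplars,
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
    "how many apples do they have?",
)
print(prompt)
```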


In [13]:
# run cot few-shot questions

for itr, prompt in enumerate(pop_quiz["prompts"]["cot_few_shot"]):
    print(f"Question 2.{itr+1}")
    print(f'Prompt: "{prompt}"\n')
    start_time = time.time()
    generated_text = model.run(
        prompt=prompt,
        eos_token_ids=model.tokenizer.eos_token_id,
        max_new_tokens=256,
        temperature=0.01,
        do_sample=True,
        top_p=0.92,
        top_k=50,
        num_return_sequences=1,
    )
    print("--- %s seconds ---" % (time.time() - start_time))
    print("\n")
    print(f'Text generations: "{generated_text}"\n\n')

Question 2.1
Prompt: "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does have now? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question. Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11. Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does have now? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question. Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis ball

### Example 3: Least-to-most prompting
* From https://arxiv.org/abs/2205.10625
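
Least-to-most prompting splits a problem into ordered sub-questions and feeds each answer back into the next prompt. The sketch below generalizes that chaining to any number of sub-questions. `least_to_most` and `fake_generate` are hypothetical names; the fake generator stands in for a real model call so the control flow can be run without a GPU.

```python
def least_to_most(sub_questions, generate):
    """Answer sub-questions in order, feeding each answer into the next prompt."""
    context = ""
    answers = []
    for sub_question in sub_questions:
        prompt = f"{context} {sub_question}".strip()
        answer = generate(prompt)
        answers.append(answer)
        context = f"{prompt} {answer}"  # carry the full history forward
    return answers


def fake_generate(prompt):
    # Stand-in for a model call: reports how many questions the prompt contains.
    return f"[answer {prompt.count('Q:')}]"


print(least_to_most(["Q: first? A:", "Q: second? A:"], fake_generate))
```

Each later prompt contains every earlier sub-question and answer, so prompt length grows with the number of steps.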


In [18]:
# run least to most questions

for itr, prompts in enumerate(pop_quiz["prompts"]["least_to_most"]):
    print(f"Question 3.{itr+1}")
    # Start with sub question #1
    sub_question_1 = prompts[0]
    print(f'Prompt: "{sub_question_1}"\n')

    start_time = time.time()
    res_1 = model.run(
        prompt=sub_question_1,
        eos_token_ids=model.tokenizer.eos_token_id,
        max_new_tokens=256,
        temperature=0.01,
        do_sample=True,
        top_p=0.92,
        top_k=50,
        num_return_sequences=1,
    )
    print("--- %s seconds ---" % (time.time() - start_time))
    print("\n")
    print(f'Text generation: "{res_1}"\n')

    # Now do sub question #2 by appending answer to sub question #1
    sub_question_2 = f"{sub_question_1} {res_1} {prompts[1]}"
    print(f'Prompt: "{sub_question_2}"\n')

    start_time = time.time()
    res_2 = model.run(
        prompt=sub_question_2,
        eos_token_ids=model.tokenizer.eos_token_id,
        max_new_tokens=256,
        temperature=0.01,
        do_sample=True,
        top_p=0.92,
        top_k=50,
        num_return_sequences=1,
    )
    print("--- %s seconds ---" % (time.time() - start_time))
    print("\n")
    print(f'Text generation: "{res_2}"\n')

Question 3.1
Prompt: "Q: It takes Amy 4 minutes to climb to the top of a slide. It takes her 1 minute to slide down. How long does each trip take? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."

Q: It takes Amy 4 minutes to climb to the top of a slide. It takes her 1 minute to slide down. How long does each trip take? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question.
Amy climbs up the slide in 4 minutes, so it takes her 4 minutes to complete one trip. She slides down the slide in 1 minute, so it takes her 1 minute to complete one trip. Therefore, each trip takes 5 minutes.<|endoftext|>
--- 40.167579650878906 seconds ---


Text generation: "
Amy climbs up the slide in 4 minutes, so it takes her 4 minutes to complete one trip. She slides down the slide in 1 minute, so it takes her 1 minute to complete one t