# LLM reasoning pop quiz

Do open-source LLMs have the reasoning prowess of their closed-source siblings?

* Google Colab notebook information
  * GPU: A100-SXM 40 GB
  * Disk: 166.8 GB
* Details on falcon-40b-instruct
  * Documentation: [Hugging Face model card](https://huggingface.co/tiiuae/falcon-40b-instruct)
  * Runtime latency: Insert total time through the *n* questions in the pop quiz
  * Memory footprint: Roughly 25 GB of GPU RAM used with 4-bit quantization
  * License: [Apache 2.0](https://huggingface.co/tiiuae/falcon-40b-instruct#license)


<a target="_blank" href="https://colab.research.google.com/github/daniel-furman/LLM-reasoning-pop-quiz/blob/main/notebooks/falcon_core_pop_quiz.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
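
As a rough sanity check on the memory footprint quoted above, the weight storage can be estimated from the parameter count and the bit width. This is a back-of-the-envelope sketch: the figure of 40 billion parameters is an assumption taken from the model name, and `weight_footprint_gb` is a hypothetical helper, not part of any library.

```python
def weight_footprint_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: roughly 40e9 parameters, as the model name suggests.
four_bit = weight_footprint_gb(40e9, 4)
half_precision = weight_footprint_gb(40e9, 16)
print(f"4-bit weights:  ~{four_bit:.0f} GB")
print(f"16-bit weights: ~{half_precision:.0f} GB")
```

The gap between the 4-bit estimate and the roughly 25 GB observed is plausibly taken up by layers kept in higher precision, quantization metadata, activations, and the CUDA context, though the notebook does not break this down.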

## Setup

In [1]:
# detailed information on the GPU

!nvidia-smi

zsh:1: command not found: nvidia-smi


In [2]:
#!git clone https://github.com/daniel-furman/LLM-reasoning-pop-quiz.git

Cloning into 'LLM-reasoning-pop-quiz'...
remote: Enumerating objects: 83, done.
remote: Counting objects: 100% (83/83), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 83 (delta 28), reused 70 (delta 25), pack-reused 0
Unpacking objects: 100% (83/83), 26.13 KiB | 1.24 MiB/s, done.


In [2]:
!ls

falcon_core_pop_quiz.ipynb


In [3]:
# install necessary libraries
import os

os.chdir('/content/LLM-reasoning-pop-quiz')
!pip install -q -U -r requirements.txt
os.chdir('../..')

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Preparing metadata (setup.py) ... done
  Building wheel for drf-llm-boilers (setup.py) ... done


In [6]:
# import libraries

import transformers
import torch
import time
import yaml

# import helpers

from drf_llm_boilers import llm_boiler

In [5]:
# set the seed

transformers.set_seed(4129408)

In [6]:
# compute the per-GPU memory budget (free memory minus 2 GB of headroom)

free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
max_memory = f"{free_in_GB-2}GB"

n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}
max_memory

{0: '37GB'}
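
The same budget calculation can be wrapped in a small function so that the headroom is explicit. This is a minimal sketch: `build_max_memory` is a hypothetical helper, and it takes the free byte counts as plain numbers so it can be tried without a GPU. In the notebook the free bytes come from `torch.cuda.mem_get_info()`. Unlike the cell above, which applies one figure to every GPU, this variant budgets each GPU separately.

```python
from typing import Dict, Sequence


def build_max_memory(
    free_bytes_per_gpu: Sequence[float], headroom_gb: int = 2
) -> Dict[int, str]:
    """Map each GPU index to a budget string such as '37GB'."""
    budget = {}
    for gpu_index, free_bytes in enumerate(free_bytes_per_gpu):
        free_gb = int(free_bytes / 1024**3)  # whole GiB, rounded down
        budget[gpu_index] = f"{max(free_gb - headroom_gb, 0)}GB"
    return budget


# Assumed figure: about 39.5 GiB free on a single 40 GB card.
print(build_max_memory([39.5 * 1024**3]))
```

A dictionary of this shape is what `from_pretrained` in `transformers` accepts through its `max_memory` argument when `device_map="auto"` is used. Whether `llm_boiler` forwards it is not shown in this notebook.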

## Read in the yaml config for the run

In [15]:
# forthcoming

with open('../configs/pop_quiz_v0.yml', 'r') as file:
    pop_quiz = yaml.safe_load(file)
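
Because the config file is still forthcoming, it may help to check the parsed result before using it. The sketch below assumes only the three prompt keys that a later cell looks up under `pop_quiz['prompts']`; `missing_prompt_keys` and the placeholder values are hypothetical, and the real file's contents are not shown here.

```python
from typing import Any, Dict, List

# Keys the notebook reads from pop_quiz['prompts'] in a later cell.
REQUIRED_PROMPT_KEYS = ("zero_shot", "cot_few_shot", "least_to_most")


def missing_prompt_keys(config: Dict[str, Any]) -> List[str]:
    """Return the required prompt keys that are absent from the config."""
    prompts = config.get("prompts") or {}
    return [key for key in REQUIRED_PROMPT_KEYS if key not in prompts]


# Hypothetical stand-in for the dictionary returned by yaml.safe_load.
example_config = {
    "prompts": {"zero_shot": "placeholder", "cot_few_shot": "placeholder"}
}
print(missing_prompt_keys(example_config))
```

An empty list means every prompt family the notebook expects is present.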

## Load the model

In [8]:
# load falcon-40b-instruct
# see source: https://huggingface.co/tiiuae/falcon-40b-instruct#how-to-get-started-with-the-model

# this cell takes a long time to run; to avoid the wait, deploy the LLM as an API inference endpoint

model_id = "tiiuae/falcon-40b-instruct"

model = llm_boiler(model_id)

Load function recognized for tiiuae/falcon-40b-instruct: falcon_loader
Run function recognized for tiiuae/falcon-40b-instruct: falcon

Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]

In [9]:
print(model.name, '\n')
print(model.tokenizer, '\n')
print(model.model, '\n')

falcon 

PreTrainedTokenizerFast(name_or_path='tiiuae/falcon-40b-instruct', vocab_size=65024, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|endoftext|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['>>TITLE<<', '>>ABSTRACT<<', '>>INTRODUCTION<<', '>>SUMMARY<<', '>>COMMENT<<', '>>ANSWER<<', '>>QUESTION<<', '>>DOMAIN<<', '>>PREFIX<<', '>>SUFFIX<<', '>>MIDDLE<<']}, clean_up_tokenization_spaces=True) 

RWForCausalLM(
  (transformer): RWModel(
    (word_embeddings): Embedding(65024, 8192)
    (h): ModuleList(
      (0-59): 60 x DecoderLayer(
        (ln_attn): LayerNorm((8192,), eps=1e-05, elementwise_affine=True)
        (ln_mlp): LayerNorm((8192,), eps=1e-05, elementwise_affine=True)
        (self_attention): Attention(
          (maybe_rotary): RotaryEmbedding()
          (query_key_value): Linear4bit(in_features=8192, out_features=9216, bias=False)
          (dense): Linear4bit(in_features=8192, out_fe

## Run the model

* For text generation options, refer to [https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextGenerationPipeline](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextGenerationPipeline)
* The prompts below are borrowed from [https://github.com/openai/openai-cookbook/blob/main/techniques_to_improve_reliability.md](https://github.com/openai/openai-cookbook/blob/main/techniques_to_improve_reliability.md)

### Example 1: Zero-shot reasoning conditioned on good performance
* From https://arxiv.org/abs/2205.11916
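
The zero-shot prompts in this section all share one shape: the question, then a fixed instruction that nudges the model to reason step by step. A minimal sketch of that template follows; `build_zero_shot_prompt` is a hypothetical helper, and the suffix is copied from the prompts used in this notebook.

```python
# Instruction suffix copied from the prompts used in this notebook.
ZERO_SHOT_SUFFIX = (
    "Let's work this out in a step by step way to be sure we have the right answer. "
    "Respond as succinctly as possible to answer the question."
)


def build_zero_shot_prompt(question: str) -> str:
    """Wrap a bare question in the Q/A template with the reasoning suffix."""
    return f"Q: {question.strip()} A: {ZERO_SHOT_SUFFIX}"


prompt = build_zero_shot_prompt(
    "A juggler has 16 balls. Half of the balls are golf balls and half of the "
    "golf balls are blue. How many blue golf balls are there?"
)
print(prompt)
```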

In [None]:
pop_quiz['prompts']['zero_shot']
pop_quiz['prompts']['cot_few_shot']
pop_quiz['prompts']['least_to_most']

In [11]:
# Example 1.1

prompt = "Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."
print(f'Prompt: "{prompt}"\n')

start_time = time.time()
generated_text = model.run(
    prompt,
    eos_token_ids=model.tokenizer.eos_token_id,
    max_new_tokens=256, 
    temperature=0.01,
    do_sample=True,
    top_p=0.92,
    top_k=50,
)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
print(f'Text generations: "{generated_text}"\n')

Prompt: "Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."

Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question.
There are 8 golf balls in total. Half of the golf balls are blue, so there are 4 blue golf balls.<|endoftext|>
--- 23.680741548538208 seconds ---


Text generations: "
There are 8 golf balls in total. Half of the golf balls are blue, so there are 4 blue golf balls."
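
Every example repeats the same timing boilerplate, so it could be factored into a wrapper that also makes it easy to total the latency across the quiz. This is a sketch under assumptions: `timed_generate` is a hypothetical helper, and `fake_run` stands in for `model.run` so the block runs without a GPU.

```python
import time
from typing import Callable, Tuple


def timed_generate(
    run: Callable[..., str], prompt: str, **gen_kwargs
) -> Tuple[str, float]:
    """Call run(prompt, **gen_kwargs) and return (generated_text, elapsed_seconds)."""
    start = time.perf_counter()
    text = run(prompt, **gen_kwargs)
    return text, time.perf_counter() - start


def fake_run(prompt: str, **gen_kwargs) -> str:
    # Stand-in for model.run; returns a canned answer.
    return "There are 4 blue golf balls."


latencies = []
for question in ["first prompt", "second prompt"]:
    text, seconds = timed_generate(fake_run, question, max_new_tokens=256)
    latencies.append(seconds)

print(f"total: {sum(latencies):.4f} seconds over {len(latencies)} prompts")
```

With the real model, `fake_run` would be replaced by `model.run` and the keyword arguments by the generation settings used in the cells here.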



In [12]:
# Example 1.2

prompt = "Q: Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this make logical sense? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."
print(f'Prompt: "{prompt}"\n')

start_time = time.time()
generated_text = model.run(
    prompt,
    eos_token_ids=model.tokenizer.eos_token_id,
    max_new_tokens=256, 
    temperature=0.01,
    do_sample=True,
    top_p=0.92,
    top_k=50,
)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
print(f'Text generations: "{generated_text}"\n')

Prompt: "Q: Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this make logical sense? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."

Q: Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this make logical sense? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question.
No, it does not make logical sense for Daniel to go in for a haircut on Sunday. His barber works on Mondays, Wednesdays, and Fridays, so Daniel should have waited until one of those days to get a haircut.<|endoftext|>
--- 34.40725827217102 seconds ---


Text generations: "
No, it does not make logical sense for Daniel to go in for a haircut on Sunday. His barber works on Mondays, Wednes

### Example 2: Chain-of-thought reasoning with few-shot examples
* From https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html
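
A few-shot chain-of-thought prompt places one or more worked examples ahead of the real question, so the model can imitate the reasoning pattern. The sketch below assembles such a prompt in the format this notebook uses; `build_few_shot_prompt` is a hypothetical helper, and the exemplar is lightly adapted from Example 2.1.

```python
from typing import Sequence, Tuple

# Instruction suffix copied from the prompts used in this notebook.
COT_SUFFIX = (
    "Let's work this out in a step by step way to be sure we have the right answer. "
    "Respond as succinctly as possible to answer the question."
)


def build_few_shot_prompt(
    exemplars: Sequence[Tuple[str, str]], question: str
) -> str:
    """Join (question, worked_answer) exemplars, then append the open question."""
    parts = [f"Q: {q} A: {COT_SUFFIX} {worked}" for q, worked in exemplars]
    parts.append(f"Q: {question} A: {COT_SUFFIX}")
    return " ".join(parts)


exemplars = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis "
        "balls. 5 + 6 = 11. The answer is 11.",
    )
]
prompt = build_few_shot_prompt(
    exemplars,
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought "
    "6 more, how many apples do they have?",
)
print(prompt)
```

The prompt ends on the open question's instruction suffix, so the model's continuation is the answer to that question only.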


In [13]:
# Example 2.1

prompt = "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does have now? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question. Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11. Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."
print(f'Prompt: "{prompt}"\n')

start_time = time.time()
generated_text = model.run(
    prompt,
    eos_token_ids=model.tokenizer.eos_token_id,
    max_new_tokens=256, 
    temperature=0.01,
    do_sample=True,
    top_p=0.92,
    top_k=50,
)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
print(f'Text generations: "{generated_text}"\n')

Prompt: "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does have now? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question. Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11. Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does have now? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question. Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11

In [14]:
# Example 2.2

prompt = "Q: Roger has 3 children. Each of his kids invited 4 of their friends to come to the birthday party. All of the friends came to the party. Q: How many children are at the party? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question. Roger has 3 children, each of whom came to the party. Each of them have 4 friends coming over. 3 * 4 = 12. So 12 of their friends came to the party. 12 + 3 = 15. So, there are 15 children at the party in total. The answer is 15. Q: Ben has 4 children. 50% of his kids are in college and no longer live at home. Q: How many of Ben's children still live at home? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."
print(f'Prompt: "{prompt}"\n')

start_time = time.time()
generated_text = model.run(
    prompt,
    eos_token_ids=model.tokenizer.eos_token_id,
    max_new_tokens=256, 
    temperature=0.01,
    do_sample=True,
    top_p=0.92,
    top_k=50,
)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
print(f'Text generations: "{generated_text}"\n')

Prompt: "Q: Roger has 3 children. Each of his kids invited 4 of their friends to come to the birthday party. All of the friends came to the party. Q: How many children are at the party? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question. Roger has 3 children, each of whom came to the party. Each of them have 4 friends coming over. 3 * 4 = 12. So 12 of their friends came to the party. 12 + 3 = 15. So, there are 15 children at the party in total. The answer is 15. Q: Ben has 4 children. 50% of his kids are in college and no longer live at home. Q: How many of Ben's children still live at home? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."

Q: Roger has 3 children. Each of his kids invited 4 of their friends to come to the birthday party. All of the friends came to the party. Q: How many children are at the p

### Example 3: Least-to-most prompting
* From https://arxiv.org/abs/2205.10625
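
Least-to-most prompting breaks a problem into sub-questions and feeds each answer back into the prompt for the next one. The loop below is a minimal sketch of that chaining, not the paper's full method: `least_to_most` is a hypothetical helper, and `fake_run` stands in for `model.run` with canned answers so the block runs without a GPU.

```python
from typing import Callable, List, Sequence, Tuple

# Instruction suffix copied from the prompts used in this notebook.
SUFFIX = (
    "Let's work this out in a step by step way to be sure we have the right answer. "
    "Respond as succinctly as possible to answer the question."
)


def least_to_most(
    run: Callable[[str], str], steps: Sequence[Tuple[str, str]]
) -> List[str]:
    """Run (new_context, sub_question) steps in order, carrying answers forward."""
    answers = []
    transcript = ""
    for new_context, sub_question in steps:
        prompt = f"{transcript} {new_context} Q: {sub_question} A: {SUFFIX}".strip()
        answer = run(prompt).strip()
        answers.append(answer)
        # The next prompt sees everything so far, including this answer.
        transcript = f"{prompt} {answer}"
    return answers


seen_prompts = []


def fake_run(prompt: str) -> str:
    # Stand-in for model.run: canned answers keyed on the latest sub-question.
    seen_prompts.append(prompt)
    if "How many times" in prompt:
        return "15 / 5 = 3, so she can slide 3 times."
    return "Each trip takes 5 minutes."


answers = least_to_most(
    fake_run,
    [
        (
            "It takes Amy 4 minutes to climb to the top of a slide. "
            "It takes her 1 minute to slide down.",
            "How long does each trip take?",
        ),
        ("The slide closes in 15 minutes.", "How many times can she slide before it closes?"),
    ],
)
print(answers)
```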


In [16]:
# Example 3.1

# Start with sub question #1
sub_question_1 = "It takes Amy 4 minutes to climb to the top of a slide. It takes her 1 minute to slide down. Q: How long does each trip take? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question." 
print(f'Prompt: "{sub_question_1}"\n')

start_time = time.time()
res_1 = model.run(
    sub_question_1,
    eos_token_ids=model.tokenizer.eos_token_id,
    max_new_tokens=256, 
    temperature=0.01,
    do_sample=True,
    top_p=0.92,
    top_k=50,
)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
print(f'Text generation: "{res_1}"\n')

# Now do sub question #2 by appending answer to sub question #1
sub_question_2 = f"{sub_question_1} {res_1} The slide closes in 15 minutes. Q: How many times can she slide before it closes? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."
print(f'Prompt: "{sub_question_2}"\n')

start_time = time.time()
res_2 = model.run(
    sub_question_2,
    eos_token_ids=model.tokenizer.eos_token_id,
    max_new_tokens=256, 
    temperature=0.01,
    do_sample=True,
    top_p=0.92,
    top_k=50,
)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
print(f'Text generation: "{res_2}"\n')

Prompt: "It takes Amy 4 minutes to climb to the top of a slide. It takes her 1 minute to slide down. Q: How long does each trip take? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."

It takes Amy 4 minutes to climb to the top of a slide. It takes her 1 minute to slide down. Q: How long does each trip take? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question.
Amy climbs up the slide for 4 minutes, then slides down for 1 minute. Therefore, each trip takes 5 minutes.<|endoftext|>
--- 21.209710121154785 seconds ---


Text generation: "
Amy climbs up the slide for 4 minutes, then slides down for 1 minute. Therefore, each trip takes 5 minutes."

Prompt: "It takes Amy 4 minutes to climb to the top of a slide. It takes her 1 minute to slide down. Q: How long does each trip take? A: Let's work this out in a step by st

In [17]:
# Example 3.2

# Start with sub question #1
sub_question_1 = "It takes Ben 10 minutes to drive to the store. It then takes him 4 minutes to find parking before he can start shopping. Q: How long before he can start shopping? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question." 
print(f'Prompt: "{sub_question_1}"\n')

start_time = time.time()
res_1 = model.run(
    sub_question_1,
    eos_token_ids=model.tokenizer.eos_token_id,
    max_new_tokens=256, 
    temperature=0.01,
    do_sample=True,
    top_p=0.92,
    top_k=50,
)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
print(f'Text generation: "{res_1}"\n')

# Now do sub question #2 by appending answer to sub question #1
sub_question_2 = f"{sub_question_1} {res_1} The store closes in an hour. Q: Can he make to the store before it closes? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."
print(f'Prompt: "{sub_question_2}"\n')

start_time = time.time()
res_2 = model.run(
    sub_question_2,
    eos_token_ids=model.tokenizer.eos_token_id,
    max_new_tokens=256, 
    temperature=0.01,
    do_sample=True,
    top_p=0.92,
    top_k=50,
)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
print(f'Text generation: "{res_2}"\n')

Prompt: "It takes Ben 10 minutes to drive to the store. It then takes him 4 minutes to find parking before he can start shopping. Q: How long before he can start shopping? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."

It takes Ben 10 minutes to drive to the store. It then takes him 4 minutes to find parking before he can start shopping. Q: How long before he can start shopping? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question.
Ben takes 10 minutes to drive to the store. He then takes 4 minutes to find parking. Therefore, it takes Ben 14 minutes to get to the store and find parking. Once he finds parking, he can start shopping.<|endoftext|>
--- 36.469444036483765 seconds ---


Text generation: "
Ben takes 10 minutes to drive to the store. He then takes 4 minutes to find parking. Therefore, it takes Ben 1