## Boilerplate for loading and running falcon-40b-instruct

* Google Colab notebook information
  * GPU: A100-SXM 40 GB
  * Disk: 166.8 GB
* Details on falcon-40b-instruct
  * Documentation: [Hugging Face model card](https://huggingface.co/tiiuae/falcon-40b-instruct)
  * Runtime latency: See the timed text generations below for estimates
  * Memory footprint: Roughly 25 GB of GPU RAM used with 4-bit quantization
  * License: [Apache 2.0](https://huggingface.co/tiiuae/falcon-40b-instruct#license)
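
As a rough sanity check on the memory footprint above, the size of the weights alone can be estimated from the parameter count and the bits stored per parameter. This is a back-of-the-envelope sketch: the 40e9 parameter count is an assumption taken from the model name, not an exact figure, and the helper `estimate_weight_memory_gib` is hypothetical. The observed usage is higher than the weight-only estimate, likely because of layers kept in higher precision, activations, and CUDA overhead.

```python
def estimate_weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1024**3


# Assumption: ~40e9 parameters, taken from the model name rather than an exact count.
N_PARAMS = 40e9

for label, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {estimate_weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```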

## Setup

In [None]:
# detailed information on the GPU

!nvidia-smi

In [None]:
# install necessary libraries

%pip install -q -U git+https://github.com/huggingface/accelerate.git
%pip install -q -U git+https://github.com/huggingface/transformers.git 
%pip install -q -U torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118
%pip install -q -U einops==0.6.1
%pip install -q -U bitsandbytes==0.39.0

In [None]:
# import libraries

import transformers
import torch
import time

In [None]:
# compute a per-GPU memory budget: free memory on the current device minus 2 GB of headroom

free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
budget_per_gpu = f"{free_in_GB-2}GB"

n_gpus = torch.cuda.device_count()
max_memory = {i: budget_per_gpu for i in range(n_gpus)}
max_memory
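
The cell above applies the free memory of the current device to every GPU. A minimal sketch of a per-device variant follows; `build_max_memory` and its `headroom_gb` parameter are hypothetical names, and the free-byte values are passed in as plain integers so the helper runs without a GPU.

```python
from typing import Dict, List


def build_max_memory(free_bytes_per_gpu: List[int], headroom_gb: int = 2) -> Dict[int, str]:
    """Map each GPU index to a budget string such as '38GB'."""
    budget = {}
    for gpu_index, free_bytes in enumerate(free_bytes_per_gpu):
        free_gb = int(free_bytes / 1024**3)
        # Never report a negative budget on a nearly full device.
        budget[gpu_index] = f"{max(free_gb - headroom_gb, 0)}GB"
    return budget


# Illustrative input: one GPU with 40 GiB free.
print(build_max_memory([40 * 1024**3]))
```

On a real machine the free bytes could come from `torch.cuda.mem_get_info(i)[0]` for each device `i`. Note that the loading cell in this notebook pins the whole model to GPU 0 with `device_map={"":0}`, so a budget like this only matters if the model is loaded with `device_map="auto"` and a `max_memory` argument instead.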

In [None]:
# set the seed

transformers.set_seed(4129408)

## Load the model

In [None]:
# load falcon-40b-instruct
# see source: https://huggingface.co/tiiuae/falcon-40b-instruct#how-to-get-started-with-the-model

# this cell takes a long time to run; to avoid the wait, deploy the LLM as an API inference endpoint instead
model_id = "tiiuae/falcon-40b-instruct"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"":0},
    trust_remote_code=True,
)


## Run the model

* For text generation options, refer to [https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextGenerationPipeline](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextGenerationPipeline)
* The prompts below are borrowed from [https://github.com/openai/openai-cookbook/blob/main/techniques_to_improve_reliability.md](https://github.com/openai/openai-cookbook/blob/main/techniques_to_improve_reliability.md)

In [None]:
# custom text generation function
# requires "model" and "tokenizer" global vars initialized above

from typing import List, Union

def generate_text_pipeline(
    prompt: str,
    max_new_tokens: int = 128,
    do_sample: bool = True,
    eos_token_ids: Union[int, List[int]] = tokenizer.eos_token_id
) -> str:

    """
    Generate text that continues the prompt
    Args:
        prompt (str): Prompt for text generation
        max_new_tokens (int, optional): Max new tokens after the prompt to generate. Defaults to 128.
        do_sample (bool, optional): Whether or not to use sampling. Defaults to True.
        eos_token_ids (int or List[int], optional): Token id(s) that end generation. Defaults to the tokenizer's EOS token id.
    Returns:
        str: Generated text, without the prompt
    """

    batch = tokenizer(
        prompt,
        padding=True,
        truncation=True,
        return_tensors='pt'
    )
    batch = batch.to('cuda')

    with torch.cuda.amp.autocast():
        output_tokens = model.generate(
            inputs=batch.input_ids, 
            max_new_tokens=max_new_tokens,
            do_sample=do_sample,
            eos_token_id=eos_token_ids,
            pad_token_id=tokenizer.eos_token_id,
            bos_token_id=tokenizer.eos_token_id,
        )

    generated_text = tokenizer.decode(
        output_tokens[0][len(batch.input_ids[0]):], 
        skip_special_tokens=True
    )
    # drop a single leading space; startswith is safe when nothing was generated
    if generated_text.startswith(" "):
        generated_text = generated_text[1:]

    return generated_text
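
The example cells below all repeat the same start/stop timing pattern around the generation call. A small wrapper could factor that out. This is only a sketch: `timed_call` is a hypothetical helper, and `fake_generate` stands in for `generate_text_pipeline` so the block runs without a GPU or a loaded model.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in for generate_text_pipeline so the sketch runs without a model.
def fake_generate(prompt: str) -> str:
    return prompt.upper()


res, seconds = timed_call(fake_generate, "hello")
print(f"--- {seconds:.2f} seconds ---")
print(f'Text generation: "{res}"')
```

With the model loaded, the same call would look like `res, seconds = timed_call(generate_text_pipeline, prompt)`.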



### Example 1: Zero-shot reasoning conditioned on good performance
* From https://arxiv.org/abs/2205.11916

In [None]:
# Example 1.1

prompt = "Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."
f'Prompt: "{prompt}"'

In [None]:
# For reference:
    # GPT-4 response: Half of the 16 balls are golf balls, which is 8 balls. Half of these golf balls are blue, so there are 4 blue golf balls. ✅ 
    # GPT-3.5 response: There are 4 blue golf balls. ✅

start_time = time.time()
res = generate_text_pipeline(prompt)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
f'Text generation: "{res}"'

In [None]:
# Example 1.2

prompt = "Q: Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this make logical sense? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."
f'Prompt: "{prompt}"'

In [None]:
# For reference:
    # GPT-4 response: No, it doesn't make logical sense because Daniel's barber does not work on Sundays. ✅
    # GPT-3.5 response: No, it does not make logical sense for Daniel to go in for a haircut on Sunday because his barber works on Mondays, Wednesdays, and Fridays. ✅

start_time = time.time()
res = generate_text_pipeline(prompt)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
f'Text generation: "{res}"'

### Example 2: Chain-of-thought reasoning with few-shot examples
* From https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html


In [None]:
# Example 2.1

prompt = "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question. Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11. Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."
f'Prompt: "{prompt}"'

In [None]:
# For reference:
    # GPT-4 response: After using 20 apples for lunch, the cafeteria has 3 apples left. With the purchase of 6 more apples, they now have 9 apples. ✅
    # GPT-3.5 response: The cafeteria now has 9 apples. ✅

start_time = time.time()
res = generate_text_pipeline(
    prompt,
    # also stop at the first token of " Q: " so the model does not start a new question
    eos_token_ids=[tokenizer.eos_token_id, tokenizer.encode(" Q: ")[0]]
)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
f'Text generation: "{res}"'
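
The cell above ends generation early by treating the first token of `" Q: "` as an extra end-of-sequence token. Which token that is depends on the tokenizer, so a complementary option is to trim the decoded text at a stop string after generation. The sketch below assumes a hypothetical helper `truncate_at_stop` and uses a made-up output string for illustration, not an actual model response.

```python
from typing import Sequence


def truncate_at_stop(text: str, stop_sequences: Sequence[str] = ("Q:",)) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].strip()


# Made-up output for illustration only.
raw = "The answer is 9. Q: What happens next?"
print(truncate_at_stop(raw))
```

Trimming after decoding does not save generation time, since the extra tokens are still produced; it only cleans up the returned text.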

In [None]:
# Example 2.2

prompt = "Q: Roger has 3 children. Each of his kids invited 4 of their friends to come to the birthday party. All of the friends came to the party. Q: How many children are at the party? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question. Roger has 3 children, each of whom came to the party. Each of them have 4 friends coming over. 3 * 4 = 12. So 12 of their friends came to the party. 12 + 3 = 15. So, there are 15 children at the party in total. The answer is 15. Q: Ben has 4 children. 50% of his kids are in college and no longer live at home. Q: How many of Ben's children still live at home? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."
f'Prompt: "{prompt}"'

In [None]:
# For reference:
    # GPT-4 response: Ben has 4 children and 50% of them are in college and no longer live at home. Therefore, the other 50%, which is 2 children, still live at home. ✅
    # GPT-3.5 response: Two of Ben's children still live at home. ✅

start_time = time.time()
res = generate_text_pipeline(prompt)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
f'Text generation: "{res}"'

### Example 3: Least-to-most prompting
* From https://arxiv.org/abs/2205.10625


In [None]:
# Example 3.1
# Start with sub question #1

sub_question_1 = "It takes Amy 4 minutes to climb to the top of a slide. It takes her 1 minute to slide down. Q: How long does each trip take? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question." 
f'Prompt: "{sub_question_1}"'

In [None]:
# For reference:
    # GPT-4 response: Sure, here are the steps: Amy takes 4 minutes to climb to the top of the slide. It takes her 1 minute to slide down. Therefore, each total trip (climbing up and sliding down) takes her 4 minutes (climbing) + 1 minute (sliding down) = 5 minutes. So, each trip takes Amy 5 minutes. ✅
    # GPT-3.5 response: Sure! Let's break it down step by step. It takes Amy 4 minutes to climb to the top of the slide. It takes her 1 minute to slide down. So, each trip consists of climbing up and sliding down. The total time for each trip is 4 minutes (climbing up) + 1 minute (sliding down) = 5 minutes. Therefore, each trip takes 5 minutes. ✅

start_time = time.time()
res_1 = generate_text_pipeline(sub_question_1)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
f'Text generation: "{res_1}"'

In [None]:
# Now do sub question #2 by appending answer to sub question #1

sub_question_2 = f"{sub_question_1} {res_1} The slide closes in 15 minutes. Q: How many times can she slide before it closes? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."
f'Prompt: "{sub_question_2}"'

In [None]:
# For reference:
    # GPT-4 response: Sure, here are the steps: We've already established that each complete trip (climbing up and sliding down) takes Amy 5 minutes. The slide is open for 15 minutes. Therefore, the number of complete trips Amy can make is the total time the slide is open divided by the time each trip takes. That is, 15 minutes / 5 minutes per trip = 3 trips. So, Amy can slide down 3 times before the slide closes. ✅
    # GPT-3.5 response: Sure! Let's break it down step by step. The slide closes in 15 minutes. Each trip takes 5 minutes (4 minutes to climb up + 1 minute to slide down). To determine how many trips Amy can take before the slide closes, we divide the total time available (15 minutes) by the time for each trip (5 minutes). 15 minutes / 5 minutes = 3 trips. Therefore, Amy can slide down the slide 3 times before it closes. ✅

start_time = time.time()
res_2 = generate_text_pipeline(sub_question_2)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
f'Text generation: "{res_2}"'
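
The two cells above chain sub-questions by hand: each new prompt is the previous prompt, the model's answer, and the next sub-question. That pattern can be sketched as a loop. The helper `least_to_most`, the `SUFFIX` constant, and the `stub_generate` stand-in are hypothetical names introduced here; with the model loaded, `generate_text_pipeline` would be passed as `generate`.

```python
from typing import Callable, List

# Instruction suffix used by the prompts in this notebook.
SUFFIX = (
    "A: Let's work this out in a step by step way to be sure we have the right answer. "
    "Respond as succinctly as possible to answer the question."
)


def least_to_most(sub_questions: List[str], generate: Callable[[str], str]) -> List[str]:
    """Answer sub-questions in order, feeding each answer into the next prompt."""
    answers = []
    context = ""
    for sub_question in sub_questions:
        prompt = f"{context} {sub_question} {SUFFIX}".strip()
        answer = generate(prompt)
        answers.append(answer)
        # The next prompt starts from everything seen so far.
        context = f"{prompt} {answer}"
    return answers


# Stand-in generator so the sketch runs without a model.
def stub_generate(prompt: str) -> str:
    return f"(answer to a prompt of {len(prompt)} characters)"


print(least_to_most(["Q: How long does each trip take?"], stub_generate))
```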

In [None]:
# Example 3.2
# Start with sub question #1

sub_question_1 = "It takes Ben 10 minutes to drive to the store. It then takes him 4 minutes to find parking before he can start shopping. Q: How long before he can start shopping? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question." 
f'Prompt: "{sub_question_1}"'

In [None]:
# For reference:
    # GPT-4 response: It takes Ben a total of 14 minutes (10 minutes driving + 4 minutes parking) to start shopping. ✅
    # GPT-3.5 response: It takes Ben 14 minutes before he can start shopping. ✅

start_time = time.time()
res_1 = generate_text_pipeline(sub_question_1)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
f'Text generation: "{res_1}"'

In [None]:
# Now do sub question #2 by appending answer to sub question #1

sub_question_2 = f"{sub_question_1} {res_1} The store closes in an hour. Q: Can he make it to the store before it closes? A: Let's work this out in a step by step way to be sure we have the right answer. Respond as succinctly as possible to answer the question."
f'Prompt: "{sub_question_2}"'

In [None]:
# For reference:
    # GPT-4 response: 
    # GPT-3.5 response: 

start_time = time.time()
res_2 = generate_text_pipeline(sub_question_2)
print("--- %s seconds ---" % (time.time() - start_time))
print("\n")
f'Text generation: "{res_2}"'