# A deeper look into generation with transformers

In the last exercise, we worked with open source LLMs to generate text. Today, we will learn how to do so more efficiently.

Among other things, we will be working with GPUs that can help us speed up inference.

In [None]:
! pip install transformers torch tqdm accelerate --upgrade --quiet

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import matplotlib
import numpy as np
import random
import datasets
import os
import tqdm.auto as tqdm
from transformers import set_seed
os.environ["TOKENIZERS_PARALLELISM"] = "false"


seed = 1122

# Selecting the font size here will affect all the figures in this notebook
# Alternatively, you can set the font size for axis labels of each figure separately
font = {'size': 16}
matplotlib.rc('font', **font)

# Exercise 1: Setting up the model

Let us set up our model which we will use for the rest of the exercise.

## Exercise 1a: Setting up the virtual machine with a GPU

Your first task is to form groups of 3. The number of GPUs we have is capped at 10 so we cannot have more than 10 teams working simultaneously.

Next, connect to the GPU server by following the instructions below. Only one person from each team should connect.

1. Go to https://jupyterhub.uni-muenster.de/.
2. Log in using your RUB credentials.
3. Launch a new virtual machine using the following config:
 * vCPU: 1-8
 * Memory: 8 - 16
 * GPU: NVIDIA A40, 12 GB RAM
4. Use the token provided to you in the class.
5. Wait until the Jupyter Lab environment opens.
6. Upload this notebook to Jupyter Lab: go to the top left corner, click "File", and then "Open from path".


## Exercise 1b: Watching the GPU Usage [Only if you are on a machine with a GPU]

1. Open a new terminal by following File -> New -> Terminal.
2. Click on the newly opened tab within Jupyter Lab. This is not a new browser tab, but a new tab within the Jupyter Lab environment.
3. In the console, execute the command `watch -n1 nvidia-smi`. The command shows you the GPU usage, refreshed every second. Right now, you should be using 0 MB of GPU RAM. The utilization should also be 0%.
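
If you would rather read the same two numbers from Python than from a terminal, the sketch below may help. It assumes `nvidia-smi` is on the `PATH` and uses its `--query-gpu` option; the helper names `parse_gpu_stats` and `gpu_stats` are made up for this illustration, and the function simply returns `None` on a machine without a usable GPU driver.

```python
import shutil
import subprocess
from typing import List, Optional, Tuple

# Ask nvidia-smi for used memory (MiB) and utilization (%) as bare CSV values
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> List[Tuple[int, int]]:
    """Turn lines such as '0, 0' into (memory_used_mib, utilization_percent) pairs."""
    stats = []
    for line in csv_text.strip().splitlines():
        memory, utilization = (field.strip() for field in line.split(","))
        stats.append((int(memory), int(utilization)))
    return stats


def gpu_stats() -> Optional[List[Tuple[int, int]]]:
    """Return one pair per GPU, or None if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            QUERY, capture_output=True, text=True, check=True, timeout=10
        )
        return parse_gpu_stats(result.stdout)
    except (subprocess.SubprocessError, OSError, ValueError):
        return None


print(gpu_stats())
```

Calling `gpu_stats()` before and after loading a model onto the GPU should show the memory figure going up.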

## Exercise 1c: Downloading the model

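Now let us start working with an LLM. Execute the cell below to download the model and load it. Note that `from_pretrained`, as called here, places the weights on the CPU; to run on the GPU, both the model and its inputs have to be moved there, e.g., with `.to("cuda")`.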

In [None]:
model_name = "Qwen/Qwen2.5-0.5B" # Very small model that we used in the last class. Can be used without a GPU.
# model_name = "ministral/Ministral-3b-instruct" # Use if you have access to a GPU
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Exercise 1d: Generating multiple tokens with the model

You already know the code below from the previous exercise. Run it again to make sure everything still works :)

In [None]:
def generate(prompt: str, gen_len: int, temp: float = 1):
    tokenized_input = tokenizer(prompt, return_tensors="pt")
    input_ids = tokenized_input["input_ids"]
    for _ in range(gen_len):
        with torch.no_grad():
            # We should pass the attention mask, but for a causal LLM with a single, unpadded input we can skip it
            output = model(input_ids).logits
        output = output.squeeze(dim=0)
        next_token_scores = output[-1]  # Scores for the token that follows the last position
        # A lower temperature sharpens the distribution (temp must be > 0)
        softmax_probs = torch.softmax(next_token_scores.reshape(1, -1) / temp, dim=-1)
        next_token_id = torch.multinomial(softmax_probs.flatten(), 1)
        input_ids = torch.cat((input_ids, next_token_id.reshape(1, -1)), dim=-1)
    return tokenizer.decode(input_ids.numpy().flatten())


prompt = "Berlin is a city in"
generate(prompt, 30, 0.01)
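
To see what dividing the scores by `temp` does before the softmax, here is a small standard-library sketch. The three scores are made up for illustration and `softmax_with_temperature` is a hypothetical helper, not part of `transformers` or `torch`.

```python
import math
from typing import List


def softmax_with_temperature(scores: List[float], temp: float) -> List[float]:
    """Softmax over scores / temp, computed in a numerically stable way."""
    scaled = [s / temp for s in scores]
    largest = max(scaled)  # Subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_scores = [2.0, 1.0, 0.0]  # Made-up next-token scores for a 3-token vocabulary
for temp in (2.0, 1.0, 0.01):
    probs = softmax_with_temperature(toy_scores, temp)
    print(f"temp={temp}: {[round(p, 3) for p in probs]}")
```

At `temp=0.01`, almost all of the probability mass ends up on the highest-scoring token, which is why `generate(prompt, 30, 0.01)` behaves nearly like greedy decoding.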

# Exercise 2: Measuring and optimizing model performance

## Exercise 2a: Measuring model throughput [20 mins]

Before you can optimize model performance, you need to be able to measure it.

Rewrite the `generate` function. It should now take two new parameters as input:
1. `num_generations: int = 10` denotes how many outputs you want to generate for each prompt.
2. `progress: bool = True`. If this parameter is set to `True`, it should print a nested progress bar. The first bar updates every time the model generates a complete output. The second bar updates every time a new token within the output is generated. You can use [nested tqdm bars](https://github.com/tqdm/tqdm/blob/0ed5d7f18fa3153834cbac0aa57e8092b217cc16/README.rst#nested-progress-bars), which are two nested `tqdm.trange` loops.

In [None]:
# Your code here
from typing import List

def generate(
    prompt: str,
    gen_len: int,
    temp: float = 1,
    num_generations: int = 10,
    progress: bool = True,
) -> List[str]:

    def _generate_once(generation_number):
        tokenized_input = tokenizer(prompt, return_tensors="pt")
        input_ids = tokenized_input["input_ids"]
        bar_inner = tqdm.trange(gen_len, disable=not progress)
        for i in bar_inner:
            bar_inner.set_description(f"Tokens for generation # {generation_number}")
            with torch.no_grad():
                # We should pass the attention mask, but for a causal LLM with a single, unpadded input we can skip it
                output = model(input_ids).logits
            output = output.squeeze(dim=0)
            next_token_scores = output[-1]  # Scores for the token that follows the last position
            softmax_probs = torch.softmax(next_token_scores.reshape(1, -1) / temp, dim=-1)
            next_token_id = torch.multinomial(softmax_probs.flatten(), 1)
            input_ids = torch.cat((input_ids, next_token_id.reshape(1, -1)), dim=-1)
        return tokenizer.decode(input_ids.numpy().flatten())

    outputs = []
    bar_outer = tqdm.trange(num_generations, disable=not progress)
    bar_outer.set_description("Generation #")
    for i in bar_outer:
        outputs.append(_generate_once(i))
    return outputs


prompt = "Berlin is a city in"
generations = generate(prompt, 30, temp=0.5, num_generations=10, progress=True)
print("\n")
for i, gen in enumerate(generations):
    print(f"== Generation # {i}==")
    print(gen)
    print("\n")
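
The progress bars show the iteration rate while the code runs, but it is often handy to have throughput as a single number. The sketch below is one possible way to get it, assuming every output contains exactly `tokens_per_output` new tokens. `measure_throughput` is a hypothetical helper, and `fake_generate` is a stand-in that only sleeps so the example runs without a model.

```python
import time
from typing import Callable, List


def measure_throughput(generate_fn: Callable[[], List[str]], tokens_per_output: int) -> float:
    """Run generate_fn once and return generated tokens per second."""
    start = time.perf_counter()
    outputs = generate_fn()
    elapsed = time.perf_counter() - start
    return len(outputs) * tokens_per_output / elapsed


def fake_generate() -> List[str]:
    """Stand-in for a real generation call: waits briefly, returns 10 dummy outputs."""
    time.sleep(0.05)
    return ["placeholder output"] * 10


tokens_per_second = measure_throughput(fake_generate, tokens_per_output=30)
print(f"{tokens_per_second:.1f} tokens/s")
```

In the notebook, you could pass `lambda: generate(prompt, 30, temp=0.5, num_generations=10, progress=False)` instead of `fake_generate`.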

## Exercise 2b: Speeding things up with batching [30 mins]

Deep models can benefit from batching. Convert your code into a batched version that generates `num_generations` outputs at once instead of a single output. Specifically, instead of drawing a single token from the multinomial, you now draw `num_generations` tokens at each step, one for each output.

Take 10 prompts from the BOLD dataset (you used it in the previous exercise). For each prompt, generate outputs with `num_generations=10` and `gen_len=30` with _batching_ and _without batching_. Report the time taken by each approach.
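
The batched sampling step can be pictured as one independent draw per row of a table of probabilities. The sketch below mimics that in plain Python with a made-up distribution; `sample_one_per_row` is a hypothetical helper. In PyTorch, `torch.multinomial` applied to a 2-D tensor of shape `(batch, vocabulary)` performs the same per-row draw.

```python
import random
from typing import List


def sample_one_per_row(prob_rows: List[List[float]], rng: random.Random) -> List[int]:
    """Draw one token id from each row's distribution, independently of the other rows."""
    return [rng.choices(range(len(row)), weights=row, k=1)[0] for row in prob_rows]


num_generations = 4
next_token_probs = [0.7, 0.2, 0.1]  # Made-up distribution over a 3-token vocabulary
# All rows start from the same prompt, so they share the same distribution at the first step
batch = [next_token_probs] * num_generations

rng = random.Random(0)
sampled = sample_one_per_row(batch, rng)
print(sampled)  # One token id per generation
```

Because the draws are independent, the rows diverge after the first step even though they all start from the same prompt.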

In [None]:
# Loading the BOLD data
n_prompts = 10
bold = datasets.load_dataset("AlexaAI/bold")
random.seed(11)
prompts = []
for prompt_list in bold["train"]["prompts"]:
    prompts.extend(prompt_list)
random.shuffle(prompts)
prompts = prompts[:n_prompts]
for prompt in prompts:
    print(prompt)

In [None]:
# Your code here
from typing import List

def generate_batch(
    prompt: str,
    gen_len: int,
    temp: float = 1,
    num_generations: int = 10,
    progress: bool = True,
) -> List[str]:

    tokenized_input = tokenizer([prompt] * num_generations, return_tensors="pt")
    input_ids = tokenized_input["input_ids"]
    bar = tqdm.trange(gen_len, disable=not progress)
    for i in bar:
        with torch.no_grad():
            # No attention mask is needed here: every row holds the same prompt, so there is no padding
            output = model(input_ids).logits
        # Shape of the logits is (batch, sequence position, vocabulary size)
        # Select the scores for the next token, i.e., those at the last position of each row
        next_token_scores = output[:, -1, :]
        softmax_probs = torch.softmax(next_token_scores / temp, dim=-1)
        next_token_ids = torch.multinomial(softmax_probs, 1)
        input_ids = torch.cat((input_ids, next_token_ids), dim=-1)
        
    return tokenizer.batch_decode(input_ids.numpy())


prompt = "Berlin is a city in"
generations = generate_batch(prompt, 30, temp=0.5, num_generations=10, progress=True)
print("\n")
for i, gen in enumerate(generations):
    print(f"== Generation # {i}==")
    print(gen)
    print("\n")

# Exercise 3: Using `model.generate` [25 mins]

Instead of using your own generate function, now use the built-in `model.generate` function with the same values for `num_generations` and `gen_len` (they are named slightly differently for `model.generate`). You can use the [`GenerationConfig`](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig) object to control the generation. For examples, see [here](https://huggingface.co/docs/transformers/en/llm_tutorial).

1. Compare the time `model.generate` takes to produce the outputs with the time taken by your own `generate` function from the previous exercise. Are there any differences? Why?
2. By default, `model.generate` uses a KV cache. Pass `use_cache=False` to disable KV caching. Do you notice any difference in performance?
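
One way to build intuition for the KV cache is to count how many token positions are fed through the model over a whole generation. The sketch below is a rough counting model only: the prompt length of 6 is an assumed value, `tokens_processed` is a hypothetical helper, and the count ignores the fact that attention over cached keys and values still grows with the sequence length.

```python
def tokens_processed(prompt_len: int, gen_len: int, use_cache: bool) -> int:
    """Count the token positions passed through the model while generating gen_len tokens."""
    total = 0
    for step in range(gen_len):
        if use_cache and step > 0:
            total += 1  # Only the newest token; earlier keys and values are reused
        else:
            total += prompt_len + step  # The full sequence generated so far
    return total


prompt_len, gen_len = 6, 30  # Assumed prompt length; gen_len as in the exercise
print(tokens_processed(prompt_len, gen_len, use_cache=False))  # 615
print(tokens_processed(prompt_len, gen_len, use_cache=True))   # 35
```

The hand-written `generate` functions from the earlier exercises pass the full `input_ids` to the model at every step, which corresponds to the `use_cache=False` count.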

In [None]:
from transformers import GenerationConfig
import time

num_generations = 10
generation_config = GenerationConfig(
    max_new_tokens=30,  # Corresponds to gen_len
    do_sample=True,
    temperature=0.5,
    num_return_sequences=num_generations,
)
tokenized_input = tokenizer([prompt], return_tensors="pt")
input_ids = tokenized_input["input_ids"]

In [None]:
start_time = time.time()
# Set the seed if you want reproducible sampling behavior.
# You can set it before the previous generation calls as well.
# Try generating repeatedly with the same non-zero temperature, with and without setting the seed.
set_seed(seed)
gen_ids = model.generate(input_ids, generation_config=generation_config)
print(f"Elapsed time: {time.time() - start_time: 0.2f}")

outputs = tokenizer.batch_decode(gen_ids)
print(outputs)

In [None]:
start_time = time.time()
set_seed(seed)
gen_ids = model.generate(input_ids, generation_config=generation_config, use_cache=False)
print(f"Elapsed time: {time.time() - start_time: 0.2f}")

outputs = tokenizer.batch_decode(gen_ids)
print(outputs)

# Exercise 4: Reproducibility [15 mins]

An essential part of your LLM generation pipeline is reproducibility. You want to save all the parameters used in generation so that you can get the same measurements again.

Save all the parameters, e.g., seeds and generation length, to a file. Load these parameters again and perform the inference. Check if you get identical outputs before and after the save/load operation.
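
The save/load round trip can be tried out without a model first. The sketch below stores all parameters in a single JSON file and checks that a seeded stand-in sampler gives identical results before and after loading. The file name `run_params.json`, the `vocab_size` entry, and `sample_tokens` are assumptions made for this illustration; the stand-in does not use the temperature.

```python
import json
import os
import random
import tempfile
from typing import List


def sample_tokens(params: dict) -> List[int]:
    """Stand-in for generation: seeded sampling that depends only on the saved parameters."""
    rng = random.Random(params["seed"])
    return [rng.randrange(params["vocab_size"]) for _ in range(params["gen_len"])]


params = {"seed": 1122, "gen_len": 30, "temperature": 0.5, "vocab_size": 100}
before = sample_tokens(params)

# Write the parameters to disk and read them back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "run_params.json")
    with open(path, "w") as f:
        json.dump(params, f, indent=2)
    with open(path) as f:
        loaded = json.load(f)

after = sample_tokens(loaded)
assert before == after  # Same parameters and seed, same samples
print("Outputs identical:", before == after)
```

Keeping the seed in the same file as the other parameters means that one file is enough to describe the whole run.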

In [None]:
save_dir = "."  # The directory where we want to save the files
config_file = "gen_config.json"  # Name of the saved generation config
seed_file = os.path.join(save_dir, "seed.txt")  # File where we store our random seed

# Save the generation parameters and the seed
generation_config.save_pretrained(save_dir, config_file_name=config_file)
with open(seed_file, "w") as f:
    f.write(str(seed) + "\n")

# Load them again, overwriting the in-memory values
generation_config = GenerationConfig.from_pretrained(save_dir, config_file_name=config_file)
with open(seed_file) as f:
    seed = int(f.read().strip())


In [None]:
set_seed(seed)
outputs_loaded_config = tokenizer.batch_decode(
    model.generate(input_ids, generation_config=generation_config)
    )

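# Check that the outputs are the same with the loaded config and seed.
# `outputs` holds the result of the most recent generation cell above.
assert len(outputs) == len(outputs_loaded_config)
assert all(g1 == g2 for g1, g2 in zip(outputs, outputs_loaded_config))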