# Generate text using the transformer model 



## Lab Description:

This lab explores different decoding strategies for generating text with the Transformer. Participants will experiment with greedy search, beam search, and sampling techniques to understand their impact on text fluency and diversity. 

## Lab Objectives:

- Understand the fundamentals of text generation using the GPT-2 XL model.
  
- Explore and compare different decoding strategies, including greedy search, beam search, and sampling, to analyze their impact on text quality.
  
- Evaluate the trade-offs between fluency, coherence, and diversity in generated text across different decoding methods.

## Loading the Model and Tokenizer:

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Set device to GPU if available, otherwise use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the GPT-2 XL tokenizer
model_name = 'gpt2-xl'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load GPT-2 XL model and move to the chosen device
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)



# GREEDY SEARCH DECODING

Greedy search decoding is a text-generation method in which the model selects the single most probable next token at each step. It is computationally cheap, because each step needs only one forward pass and only one candidate sequence is kept. However, because it always commits to the locally most probable token without considering alternative paths, it can miss sequences whose overall probability is higher.


We define an input text "Sky is".

The tokenizer() gives us the corresponding token ids.

<div style="text-align: center;">
    <img src="tokenizer.png" alt="Description" width="600" height="400">
</div>

In [3]:
# Define the input text
input_txt = "Sky is"

# Tokenize the input
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)

# Print the input IDs
print(input_ids)

# Decode the first input ID to get the corresponding token
print(tokenizer.decode(input_ids[0][0]))




tensor([[22308,   318]], device='cuda:0')
Sky


Each token ID maps to an entry in the model's vocabulary, which was fixed when the model was pretrained. We can see that the ID 22308 represents "Sky" in the model's vocabulary. Now that we have all the input IDs, we can use the model to predict the next token using greedy search decoding.

In [None]:
output = model(input_ids=input_ids)

The output's logits attribute is a tensor that contains the logits for each position in the input, with shape (batch size, sequence length, vocabulary size).




---



An example to illustrate the output tensor and logits.

Suppose we have a model with a 3-token vocabulary. (This means the model was trained only on data containing these 3 words. In reality, models are trained on very large datasets and have vocabularies of tens of thousands of tokens.)

```
Token IDs:  [0,       1,         2 ]
Tokens:     [Machine, Learning,  Is]
```

And suppose our input text is


```
input_txt = "Machine Learning"
```

The output tensor would look something like this



```
torch.tensor([
    [  
        [1.0, 2.0, -1.0],  # Logits for position 0 ('Machine')
        [1.5, 2.5,  0.0]   # Logits for position 1 ('Learning')
    ]
])
```

The logits for position 0 give the 'score' of each token in the vocabulary for being the next token after position 0. Since our vocabulary has only 3 tokens, there are only 3 logits per position. Token ID 0 ('Machine') has a score of 1.0, meaning that 'Machine' scores 1.0 as the token that follows the word at position 0 (which is also 'Machine'). Token ID 1 ('Learning') has a score of 2.0, and token ID 2 ('Is') has a score of -1.0.

We are interested in generating the next token, given the whole input. So we only need the logits for the last position, position 1, which give the score of every candidate for the next token in the sequence.

Once we have the logits for position 1, we need to compute the probabilities that each token has to be the next token in the sequence.

We can use the softmax function to convert logits into probabilities that sum to 1.
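
As a minimal sketch of this step, the plain-Python snippet below applies softmax to the position-1 logits from the toy example and then picks the most probable token. The helper name softmax and the variable names are illustrative only and are not part of any library.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps the exponentials numerically stable
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["Machine", "Learning", "Is"]   # toy vocabulary from the example
position_1_logits = [1.5, 2.5, 0.0]      # logits for position 1 ('Learning')

probs = softmax(position_1_logits)
for token, p in zip(tokens, probs):
    print(f"{token}: {p:.3f}")

# Greedy choice: the token with the highest probability
best = max(range(len(probs)), key=lambda i: probs[i])
print("Next token:", tokens[best])
```

Here 'Learning' receives the highest probability (about 0.69), so a greedy decoder would append it to the input.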



---






In [None]:
# Get the logits of the last token from the model's output
next_logits = output.logits[0, -1, :]

# Apply softmax to convert logits to probabilities
next_prob = torch.softmax(next_logits, dim=-1)

# Print logits and probabilities
print(next_logits)
print(next_prob)


next_logits now contains the score for each token in the model's vocabulary, which indicates how strongly the model favors that token as the next one in the sequence.

Softmax gives us the corresponding probabilities from the logits.

Since we are applying greedy search decoding, we only need the most probable next token. One simple way to find it is to sort the token IDs by probability in descending order; the first ID in the sorted list is then the token to append to the input.

In [None]:
sorted_ids = torch.argsort(next_prob, descending=True)
print(sorted_ids)

Now we have the token IDs ordered from most to least probable, left to right. This means that the token with ID 257 is the most probable next token in the sequence.

In [None]:
# Decode the most probable token ID to see the corresponding token
print(tokenizer.decode(sorted_ids[0]))

The model predicts that the next token in the sequence would be 'a'.


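We can now generate some more tokens to analyze the performance of greedy search decoding. At each step, the model is run on the sequence generated so far and the most probable next token is appended.

In [None]:
# Define the number of steps/tokens to generate
num_steps = 8

# Greedy decoding: at each step, run the model on the sequence so far
# and append the single most probable next token
generated_ids = input_ids.clone()
with torch.no_grad():
    for _ in range(num_steps):
        logits = model(input_ids=generated_ids).logits[0, -1, :]
        next_id = torch.argmax(logits).reshape(1, 1)
        generated_ids = torch.cat([generated_ids, next_id], dim=-1)

# Print the generated text
print(tokenizer.decode(generated_ids[0]))

The output illustrates the main weakness of greedy search: because it commits to the most probable token at every step and never revisits an earlier choice, it often produces repetitive or generic text.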

This was an illustration to explain what happens under the hood while trying to implement greedy search decoding.

You could alternatively implement it like this:

In [None]:
# Start again from the original prompt, tokenize it, and move the tensor to the correct device
input_txt = "Sky is"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)

# Generate text based on the input, limiting the number of new tokens. Remember that num_steps is the number of tokens that we want to generate.
output = model.generate(input_ids, max_new_tokens=num_steps, do_sample=False)

# Decode and print the generated output text
print(tokenizer.decode(output[0]))


# BEAM SEARCH DECODING

One major drawback of greedy search decoding is that it can miss a high-probability token hidden behind a lower-probability one. Beam search reduces this risk by keeping track of several candidate sequences (beams) at each step. It then chooses the sequence with the highest overall probability.

<div style="text-align: center;">
    <img src="beam.png" alt="Description" width="750" height="400">
</div>


In this example, greedy search would select "The", "nice", "road", because at each step it simply picks the most probable next word. Beam search also keeps track of other candidate sequences. Specifically, it also tracks "The", "bus", "turns", which has a combined probability of 0.3 x 0.8 = 0.24, while the greedy sequence has a combined probability of only 0.5 x 0.4 = 0.2. Beam search selects the sequence with the highest combined probability.
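
The comparison above can be sketched on a toy probability table. Only the values 0.5, 0.4, 0.3 and 0.8 come from the example; every other token and probability in NEXT is a made-up assumption added so that each distribution sums to 1, and the function names greedy and beam_search are illustrative, not library calls.

```python
# Hypothetical next-token probabilities, keyed by the sequence so far.
# Only 0.5 (nice), 0.4 (road), 0.3 (bus) and 0.8 (turns) come from the example.
NEXT = {
    ("The",): {"nice": 0.5, "bus": 0.3, "car": 0.2},
    ("The", "nice"): {"road": 0.4, "house": 0.3, "man": 0.3},
    ("The", "bus"): {"turns": 0.8, "is": 0.1, "stops": 0.1},
    ("The", "car"): {"is": 0.5, "drives": 0.3, "turns": 0.2},
}

def greedy(prefix, steps):
    """Pick the single most probable next token at every step."""
    seq, prob = tuple(prefix), 1.0
    for _ in range(steps):
        token, p = max(NEXT[seq].items(), key=lambda kv: kv[1])
        seq, prob = seq + (token,), prob * p
    return seq, prob

def beam_search(prefix, steps, num_beams):
    """Keep the num_beams most probable sequences at every step."""
    beams = [(tuple(prefix), 1.0)]
    for _ in range(steps):
        candidates = [
            (seq + (token,), prob * p)
            for seq, prob in beams
            for token, p in NEXT[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

print(greedy(("The",), steps=2))
for seq, prob in beam_search(("The",), steps=2, num_beams=2):
    print(" ".join(seq), round(prob, 2))
```

With two beams, the search keeps both "The nice" and "The bus" after the first step, so it can still find "The bus turns" (0.24), which greedy search discards in favor of "The nice road" (0.2).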


Let us now generate the top 5 hypotheses (beams), which are the 5 most probable sequences found by the search. We do this by setting num_beams, the beam width, which tells the model how many candidate sequences (or paths in the example above) to keep at each step.

In [None]:
beam_outputs = model.generate(
    input_ids,               # Tokenized input IDs for generating text
    max_length=50,           # Max length of the generated sequence (50 tokens)
    num_beams=5,             # Use beam search with 5 beams
    no_repeat_ngram_size=2,  # Avoid repeating bigrams in the generated text
    num_return_sequences=5,  # Return 5 different generated sequences
    early_stopping=True      # Stop early if all beams have completed generating
)

# Print the generated outputs
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))


# SAMPLING


In sampling, the next token is drawn at random from the model's predicted probability distribution.

Take the sentence "The bus "

Let the next token probabilities be:


<div style="text-align: center;">
    <img src="table1.png" alt="Description" width="720" height="480">
</div>


The draw is weighted by the probabilities: tokens with higher probabilities are more likely to be chosen, but tokens with lower probabilities can be chosen too.

So the first sample might be "The bus is", the second sample might be "The bus turns", and so on.
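
A minimal sketch of this weighted draw, using only the Python standard library. The tokens and probabilities in next_token_probs are made-up values for illustration and are not read from the table above.

```python
import random
from collections import Counter

# Hypothetical next-token probabilities for the prefix "The bus"
next_token_probs = {"is": 0.5, "turns": 0.3, "stops": 0.15, "sings": 0.05}

rng = random.Random(0)  # fixed seed so the draws are reproducible
tokens = list(next_token_probs)
weights = list(next_token_probs.values())

# Draw a few single samples, each weighted by the probabilities
for _ in range(3):
    print("The bus", rng.choices(tokens, weights=weights, k=1)[0])

# Over many draws, the empirical frequencies approach the probabilities
num_draws = 10_000
counts = Counter(rng.choices(tokens, weights=weights, k=num_draws))
for token in tokens:
    print(f"{token}: {counts[token] / num_draws:.3f}")
```

Unlikely tokens such as "sings" still appear from time to time, which is why unrestricted sampling can drift into incoherent text.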

In the transformers library, we can enable sampling by setting do_sample=True. We also set top_k=0 to deactivate top-k sampling (discussed later).

In [None]:


# set seed to reproduce results. Setting seed ensures that the same result is obtained every time the code is run with the same seed.
torch.manual_seed(0)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids,
    do_sample=True,        # Activate sampling
    max_length=50,         # Limit generated text to 50 tokens
    top_k=0                # Deactivate top_k sampling (consider all words in the vocabulary)
)

#Print the output
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

A closer look at the generated output shows that the text is largely incoherent. Let us now see how this issue is addressed and how the quality of sampled text can be improved.

# Top-K Sampling

Top-K sampling is a method for controlling the randomness of sampling and improving the quality of the output. The idea is to restrict the probability distribution to the K most probable tokens and redistribute the probability mass among them. This reduces the chance of the model choosing highly unlikely tokens.


For the sentence "The bus ", the next token probabilities might look something like:

<div style="text-align: center;">
    <img src="table2.png" alt="Description" width="720" height="480">
</div>


When we use top-k sampling, we simply limit the number of tokens from which the next token is selected. Suppose we set top_k = 3; the model would then choose from:


<div style="text-align: center;">
    <img src="table3.png" alt="Description" width="720" height="480">
</div>

This improves the quality of the response and reduces the chance of unlikely words being chosen.
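
The filtering step can be sketched in a few lines of plain Python. The function top_k_filter and the probabilities below are hypothetical illustrations, not the transformers implementation and not the values from the tables above.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}

# Hypothetical next-token probabilities for the prefix "The bus"
next_token_probs = {"is": 0.4, "turns": 0.25, "stops": 0.2, "sings": 0.1, "melts": 0.05}

filtered = top_k_filter(next_token_probs, k=3)
for token, p in filtered.items():
    print(f"{token}: {p:.3f}")
```

The next token is then sampled from the filtered distribution, so the two least likely tokens in this sketch can never be chosen.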



In [None]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.manual_seed(0)

# set top_k to 50
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

The generated text is noticeably more meaningful. We set top_k = 50, so the model samples only from the 50 most probable tokens at each step. This strikes a balance between diversity and quality.

However, a fixed K is a poor fit for many distributions.

When the probability distribution is flat, that is, when many words have similar probabilities of being the next token, top-k sampling still keeps only the top K tokens and discards other tokens that would fit just as well.

Similarly, when the distribution is sharp, that is, when the probability mass is concentrated on a few words, top-k sampling still keeps K tokens and so admits many irrelevant words.

Top-p (nucleus) sampling addresses both cases.

# Top-P (Nucleus) Sampling

In top-p sampling, the model first produces a probability distribution over all tokens in the vocabulary. The tokens are then ranked by probability in descending order, and the smallest set of tokens whose cumulative probability is greater than or equal to the threshold p is selected. The next token is then sampled from this subset.

Suppose the distribution is really sharp (meaning that only a few tokens are dominant). Then only a few tokens are needed to reach the cumulative probability threshold, which keeps unlikely tokens out. When the distribution is flat, more tokens are needed to reach the threshold, so more candidates remain available.

In this way the model is neither overly restrictive nor overly generous.
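
The selection rule described above can be sketched as follows. This is a simplified illustration rather than the transformers implementation; top_p_filter and the two example distributions are hypothetical.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top-ranked tokens whose cumulative
    probability reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return {token: prob / cumulative for token, prob in kept}

# Hypothetical distributions: one sharp, one flat
sharp = {"is": 0.9, "turns": 0.05, "stops": 0.03, "sings": 0.02}
flat = {"is": 0.25, "turns": 0.25, "stops": 0.25, "sings": 0.25}

print(len(top_p_filter(sharp, 0.85)))  # 1 token is enough to reach 0.85
print(len(top_p_filter(flat, 0.85)))   # all 4 tokens are needed
```

The size of the candidate set adapts to the shape of the distribution, which a fixed top_k cannot do.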

We can set 0 < top_p < 1 to activate top_p sampling.

In [None]:
torch.manual_seed(0)  # Set seed for reproducibility

# Generate text with top-p sampling (sampling from the smallest set of tokens covering 85% of the probability mass)
sample_output = model.generate(
    input_ids,
    do_sample=True,       # Activate sampling
    max_length=100,       # Maximum length of 100 tokens
    top_p=0.85,           # Use top-p sampling with a 0.85 threshold
    top_k=0               # Disable top-k sampling
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))  # Decode and remove special tokens

