# 5 - Text generation

One of the most advertised features of transformer-based language models is their ability to generate text that is almost indistinguishable from human-written text. A famous example is OpenAI's GPT-2, which when given the prompt:

```
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
```

was able to generate a compelling news article about talking unicorns:

```
The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved. Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow. Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez. Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them—they were so close they could touch their horns. While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English …
```

What makes this example so remarkable is that it was generated without any explicit supervision! By simply learning to predict the next word in the text of millions of web pages, GPT-2 and its more powerful descendants like GPT-3 are able to acquire a broad set of skills and pattern recognition abilities that can be activated with different kinds of input prompts. 

The following image shows how, during pretraining, language models are sometimes exposed to sequences of tasks where they need to predict the following tokens based on the context alone. Some of these tasks include arithmetic, translation, and fixing misspellings.

<img src="images/tasks_examples.png" title="" alt="" width="700" data-align="center">

The ability of transformers to generate realistic text has led to a diverse range of applications, including auto-completion features like [Write With Transformer](https://transformer.huggingface.co/), text-based games such as [AI Dungeon](https://play.aidungeon.io/), and conversational agents like [Google's Meena](https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html).

In this chapter, we'll use GPT-2 to illustrate how text generation works for language models and explore how different decoding strategies impact the generated texts.

## 5.1 - The challenge with generating coherent text

Up to this point, we have focused on tackling NLP tasks via a combination of pretraining and supervised fine-tuning. As we have seen, for task-specific heads like sequence or token classification, generating predictions is fairly straightforward; the model produces some logits and we take the maximum value to get the predicted class. By contrast, converting the model's probabilistic output to text requires a **decoding method**, which introduces a few challenges that are unique to text generation:

* The decoding is done **iteratively** and thus involves significantly more compute than simply passing inputs once through the forward pass of a model.
* The **quality** and **diversity** of the generated text depend on the choice of decoding method and associated hyperparameters.

To understand how this decoding process works, let's start by examining how GPT-2 is pretrained and subsequently applied to generate text.

Like other autoregressive or causal language models, GPT-2 is pretrained to estimate the probability $P(\mathbf{y}|\mathbf{x})$ of a sequence of tokens $\mathbf{y} = y_{1}, \dots, y_{N}$ occurring in the text, given some initial prompt or context sequence $\mathbf{x} = x_{1}, \dots, x_{k}$. Since it is impractical to acquire enough training data to estimate $P(\mathbf{y}|\mathbf{x})$ directly, it is common to use the chain rule of probability to factorize it as a product of conditional probabilities:

$$
P(y_{1}, \dots, y_{N}| \mathbf{x}) = \prod_{t=1}^{N} P(y_{t}|y_{1}, \dots, y_{t-1}, \mathbf{x})
$$

It is from these conditional probabilities that we pick up the intuition that autoregressive language modeling amounts to predicting each word given the preceding words in a sentence; this is exactly what the probability on the right-hand side of the preceding equation describes. <span style="color:blue">Notice that this pretraining objective is quite different from BERT's, which utilizes both <b>past</b> and <b>future</b> contexts to predict a *masked* token.</span>
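
To make the factorization concrete, here is a minimal sketch with a hand-written toy model. The table `COND_PROBS`, the helper `sequence_prob`, and every number in it are invented for illustration and are not GPT-2 outputs. The toy model also conditions only on the previous token, whereas GPT-2 conditions on the whole prefix.

```python
# Toy "language model": P(next token | previous token).
# All probabilities are invented for illustration.
COND_PROBS = {
    "<prompt>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sleeps": 0.8, "runs": 0.2},
    "dog": {"sleeps": 0.1, "runs": 0.9},
}


def sequence_prob(tokens, context="<prompt>"):
    """Multiply the conditional probability of each token given its predecessor."""
    prob = 1.0
    prev = context
    for token in tokens:
        prob *= COND_PROBS[prev][token]
        prev = token
    return prob


p = sequence_prob(["the", "dog", "runs"])
print(f"P(the dog runs | prompt) = {p:.3f}")
```

The result is simply 0.6 × 0.5 × 0.9, one factor per generated token.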

<img src="images/text_generation_example.png" title="" alt="" width="500" data-align="center">

As shown by the figure, we start with a prompt like "Transformers are the" and use the model to predict the next token. Once we have determined the next token, we append it to the prompt and then use the new input sequence to generate another token. We do this until we have reached a special end-of-sequence token or a predefined maximum length.

----

**Note:** Since the output sequence is *conditioned* on the choice of input prompt, this type of text generation is often called <span style="color:blue">conditional text generation</span>.

----

At the heart of this process lies a decoding method that determines which token is selected at each timestep. Since the language model head produces a <span style="color:blue">logit</span> $z_{t,i}$ per token in the vocabulary at each step, we can get the probability distribution over the next possible token $w_{i}$ by taking the softmax:

$$
P(y_{t} = w_{i} | y_{1}, \dots, y_{t-1}, \mathbf{x}) = \text{softmax}(z_{t,i})
$$

The goal of most decoding methods is to search for the most likely overall sequence by picking a $\hat{\mathbf{y}}$ such that:

$$
\hat{\mathbf{y}} = \underset{\mathbf{y}}{\operatorname{argmax}} \ P(\mathbf{y} | \mathbf{x})
$$

<span style="color:blue">Finding</span> $\hat{\mathbf{y}}$ <span style="color:blue">directly would involve <b>evaluating every possible sequence</b> with the language model. Since there does not exist an algorithm that can do this in a reasonable amount of time, <b>we rely on approximations instead</b></span>.
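
A quick back-of-the-envelope calculation suggests why exhaustive search is out of reach. The sketch below counts the candidate sequences for a vocabulary of 50,257 tokens (the size of GPT-2's vocabulary) at a few illustrative lengths. The helper name `num_candidate_sequences` and the chosen lengths are ours.

```python
def num_candidate_sequences(vocab_size, length):
    """Number of distinct token sequences of exactly `length` tokens."""
    return vocab_size ** length


VOCAB_SIZE = 50_257  # size of GPT-2's vocabulary

for length in (1, 5, 20):
    count = num_candidate_sequences(VOCAB_SIZE, length)
    # Python integers have arbitrary precision, so the digit count is exact
    print(f"length {length:>2}: a number with {len(str(count))} digits")
```

Even a 20-token continuation has a candidate count with more than 90 digits, so any practical decoding method can only explore a tiny part of this space.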

## 5.2 - Decoding methods

[**Great article from Hugging Face on decoding methods**](https://huggingface.co/blog/how-to-generate)

### 5.2.1 - Greedy search decoding

<img src="images/greedy_search_tree.png" title="" alt="" width="400" data-align="center">

The simplest decoding method to get discrete tokens from a model's continuous output is to greedily select the token with the highest probability at each timestep:

$$
\hat{y}_{t} = \underset{y_{t}}{\operatorname{argmax}} \ P(y_{t} | y_{1}, \dots, y_{t-1}, \mathbf{x})
$$

To see how greedy search works, let's start by loading the GPT-2 model with a language modeling head:

In [1]:
# https://huggingface.co/transformers/v2.2.0/pretrained_models.html

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2" # 117M parameters
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)


Now, let's generate some text! Although 🤗 Transformers provides a `generate()` function for autoregressive models like GPT-2, we'll implement this decoding method ourselves to see what goes on under the hood. To warm up, we'll use "Transformers are the" as the input prompt and run the decoding for eight timesteps. At each timestep, we pick out the model's logits for the last token in the prompt and wrap them with a softmax to get a probability distribution. We then pick the next token with the highest probability, add it to the input sequence, and run the process again.

In [None]:
import pandas as pd

input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        
        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        
        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
            
        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)

pd.DataFrame(iterations)

With this simple method we were able to generate the sentence `Transformers are the most popular toy line in the world`. Interestingly, this indicates that GPT-2 has internalized some knowledge about the Transformers media franchise, which was created by two toy companies (Hasbro and Takara Tomy).

Unlike other tasks such as sequence classification where a single forward pass suffices to generate predictions, with text generation we need to decode the output tokens one at a time.

While implementing greedy search was not too hard, it would be better to use the built-in `generate()` function from 🤗 Transformers to explore more sophisticated decoding methods. To reproduce our example, let's make sure sampling is switched off (it’s off by default, unless the specific configuration of the model you are loading the checkpoint from states otherwise) and specify the `max_new_tokens` for the number of newly generated tokens:

In [None]:
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)
print(tokenizer.decode(output[0]))

Now, let's try something a bit more interesting: can we reproduce the unicorn story from OpenAI? As we did previously, we'll encode the prompt with the tokenizer, and we'll specify a larger value for `max_length` to generate a longer sequence of text:

In [None]:
max_length = 128

input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""

input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length, do_sample=False)

print(tokenizer.decode(output_greedy[0]))

From the resulting text we can see that one of the main drawbacks with greedy search decoding is that it tends to produce repetitive output sequences, which is certainly undesirable in a news article. This is a common problem with greedy search algorithms, which can fail to give you the optimal solution; in the context of decoding, <span style="color:blue">they can miss word sequences whose overall probability is higher <b>just because words happen to be preceded by low probability ones.</b></span>
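
The way greedy search can miss a more probable sequence is easy to reproduce on a toy example. In the sketch below, the table `NEXT_TOKEN_PROBS` and its numbers are invented for illustration (they are not GPT-2 outputs), and the table only covers two decoding steps.

```python
# Toy next-token distributions keyed by the previous token (invented numbers).
NEXT_TOKEN_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def greedy_decode(start, n_steps):
    """Pick the single most probable token at every step."""
    tokens, prob = [start], 1.0
    for _ in range(n_steps):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        best = max(dist, key=dist.get)
        prob *= dist[best]
        tokens.append(best)
    return tokens, prob


def exhaustive_decode(start):
    """Score every two-token continuation and return the most probable one."""
    best_tokens, best_prob = None, 0.0
    for first, p_first in NEXT_TOKEN_PROBS[start].items():
        for second, p_second in NEXT_TOKEN_PROBS[first].items():
            if p_first * p_second > best_prob:
                best_tokens, best_prob = [start, first, second], p_first * p_second
    return best_tokens, best_prob


greedy_tokens, greedy_prob = greedy_decode("The", n_steps=2)
best_tokens, best_prob = exhaustive_decode("The")
print("greedy:    ", " ".join(greedy_tokens), f"(p = {greedy_prob:.2f})")
print("exhaustive:", " ".join(best_tokens), f"(p = {best_prob:.2f})")
```

Greedy search commits to "nice" because it is the most likely first token, and so it never sees the high-probability continuation "dog has", which sits behind a slightly less likely first choice.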

----

**Note:** Although greedy search decoding is rarely used for text generation tasks that require diversity (see this interesting video on that matter), <span style="color:blue">it can be useful for producing short sequences like arithmetic where a deterministic and factually correct output is preferred.</span>

----

### 5.2.2 - Beam search decoding

Instead of decoding the token with the highest probability at each step, beam search keeps track of the top-$b$ most probable next tokens, where $b$ is referred to as the number of *beams* or partial hypotheses. The next set of beams is chosen by considering all possible next-token extensions of the existing set and selecting the $b$ most likely extensions. The process is repeated until we reach the maximum length or an EOS token, and the most likely sequence is selected by ranking the $b$ beams according to their log probabilities. We score sequences with log probabilities rather than raw probabilities because multiplying many small numbers (i.e., probabilities) quickly produces values that floating-point arithmetic cannot represent precisely, which leads to numerical instability. Summing the log probabilities instead is much less likely to run into this problem.

<img src="images/beam_search_tree.png" title="" alt="" width="400" data-align="center">

For example, suppose we have a sequence of $t = 1024$ tokens and generously assume that the probability for each token is 0.5. The overall probability for this sequence is an extremely small number:

```python
0.5 ** 1024
5.562684646268003e-309
```

Calculating the log probability for the same example gives a number that is much easier to work with:

```python
import numpy as np
sum([np.log(0.5)] * 1024)
-709.7827128933695
```
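
Putting the two ideas together (keeping $b$ partial hypotheses and scoring them by summed log probabilities), a toy beam search can be sketched as follows. The table `NEXT_TOKEN_PROBS` and its numbers are invented for illustration, the table only supports two decoding steps, and the sketch leaves out EOS handling and length limits that a real implementation needs.

```python
import math

# Toy next-token distributions keyed by the previous token (invented numbers).
NEXT_TOKEN_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def beam_search(start, n_steps, num_beams):
    """Keep the `num_beams` best partial sequences, scored by summed log probabilities."""
    beams = [([start], 0.0)]  # (tokens, cumulative log probability)
    for _ in range(n_steps):
        candidates = []
        for tokens, score in beams:
            # Extend every beam with every possible next token
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the highest-scoring extensions
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


for tokens, score in beam_search("The", n_steps=2, num_beams=2):
    print(" ".join(tokens), f"| log-prob = {score:.3f} | prob = {math.exp(score):.2f}")
```

With `num_beams=1` this reduces to greedy search and returns "The nice woman". With two beams the hypothesis starting with "dog" survives the first step, so the more probable sequence "The dog has" is found.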

Let's calculate and compare the log probabilities of the texts generated by greedy and beam search to see if beam search can improve the overall probability. Since 🤗 Transformers models return the unnormalized logits for the next token given the input tokens, we first need to normalize the logits to create a probability distribution over the whole vocabulary for each token in the sequence. We then need to select only the token probabilities that were present in the sequence. The following function implements these steps:

In [None]:
import torch.nn.functional as F

def log_probs_from_logits(logits, labels):
    logp = F.log_softmax(logits, dim=-1)
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label

This gives us the log probability for a single token, so to get the total log probability of a sequence we just need to sum the log probabilities for each token:

In [None]:
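def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        # Shift by one: the logits at position i score the token at position i + 1
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:])
        # After the shift, the first generated token (position input_len)
        # sits at index input_len - 1, so start summing from there
        start = max(input_len - 1, 0)
        seq_log_prob = torch.sum(log_probs[:, start:])
    return seq_log_prob.cpu().numpy()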

Note that we ignore the log probabilities of the input sequence because they are not generated by the model. We can also see that it is important to align the logits and the labels; since the model predicts the next token, we do not get a logit for the first label, and we don't need the last logit because we don't have a ground truth token for it.
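
The alignment between predictions and labels can be checked on plain Python lists. This is a minimal sketch of the shift only: the helper `shifted_logprob` and the toy distributions are invented for illustration, and prompt handling is left out.

```python
import math


def shifted_logprob(step_probs, token_ids):
    """Total log probability of token_ids[1:], given one predicted distribution per position.

    step_probs[i] is the (toy) distribution over the token at position i + 1,
    computed after reading token_ids[: i + 1].
    """
    predictions = step_probs[:-1]  # the last prediction has no ground-truth token
    labels = token_ids[1:]         # the first token has no prediction
    return sum(math.log(dist[label]) for dist, label in zip(predictions, labels))


# Three tokens over a vocabulary of size 3 (invented numbers, each row sums to 1)
token_ids = [2, 0, 1]
step_probs = [
    [0.7, 0.2, 0.1],  # after token 2: scores position 1 (label 0 -> 0.7)
    [0.1, 0.6, 0.3],  # after tokens 2, 0: scores position 2 (label 1 -> 0.6)
    [0.3, 0.3, 0.4],  # after all tokens: scores a position we never observed
]
total = shifted_logprob(step_probs, token_ids)
print(f"log-prob: {total:.4f}")
```

Only two of the three predicted distributions contribute, which matches dropping the last logit and the first label in the tensor version.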

Let's use these functions to first calculate the sequence log probability of the greedy decoder on the OpenAI prompt:

In [None]:
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))
print(tokenizer.decode(output_greedy[0]))
print(f"\nlog-prob: {logp:.2f}")

Now let’s compare this to a sequence that is generated with beam search. To activate beam search with the `generate()` function we just need to specify the number of beams with the `num_beams` parameter. The more beams we choose, the better the result potentially gets; however, the generation process becomes much slower since we generate parallel sequences for each beam:

In [None]:
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5, do_sample=False)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")

We can see that we get a better log probability (higher is better) with beam search than we did with simple greedy decoding. However, we can see that beam search also suffers from repetitive text. One way to address this is to impose an $n$-gram penalty with the `no_repeat_ngram_size` parameter, which tracks which $n$-grams have been seen and sets the next token probability to zero if it would produce a previously seen $n$-gram:

In [None]:
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5, do_sample=False, no_repeat_ngram_size=2)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")

This isn't too bad! We have managed to stop the repetitions, and we can see that despite producing a lower score, the text remains coherent. Beam search with $n$-gram penalty is a good way to find a trade-off between focusing on high-probability tokens (with beam search) and reducing repetitions (with $n$-gram penalty), and it’s commonly used in applications such as summarization or machine translation where factual correctness is important.
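
The bookkeeping behind an $n$-gram penalty can be sketched in a few lines of plain Python. The helpers `banned_next_tokens` and `apply_ngram_penalty` are hypothetical names written for this illustration and are not the code that `generate()` runs; the sketch also assumes at least one token stays allowed.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram we are about to complete
    prefix = tuple(tokens[len(tokens) - ngram_size + 1:])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


def apply_ngram_penalty(probs, tokens, ngram_size):
    """Set the probability of banned tokens to zero and renormalize the rest."""
    banned = banned_next_tokens(tokens, ngram_size)
    kept = {tok: (0.0 if tok in banned else p) for tok, p in probs.items()}
    total = sum(kept.values())  # assumes at least one token is still allowed
    return {tok: p / total for tok, p in kept.items()}


history = ["the", "cat", "sat", "on", "the"]
next_probs = {"cat": 0.6, "dog": 0.3, "mat": 0.1}  # invented numbers
banned = banned_next_tokens(history, ngram_size=2)
filtered = apply_ngram_penalty(next_probs, history, ngram_size=2)
print(banned, filtered)
```

Because the bigram "the cat" has already occurred, "cat" is ruled out after the second "the", and the remaining probability mass is shared between "dog" and "mat".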

**When factual correctness is less important than the diversity of generated output, for instance in open-domain chitchat or story generation, another alternative to reduce repetitions while improving diversity is to use sampling**. Let’s round out our exploration of text generation by examining a few of the most common sampling methods.

### 5.2.3 - Sampling

The simplest sampling method is to randomly sample from the probability distribution of the model's outputs over the full vocabulary at each timestep:

$$
P(y_{t} = w_{i} | y_{<t}, \mathbf{x}) = \text{softmax}(z_{t,i}) = \frac{\text{exp}(z_{t,i})}{\sum^{|V|}_{j=1} \text{exp}(z_{t, j})}
$$

where $|V|$ denotes the cardinality of the vocabulary and where both $z_{t,i}$ and $z_{t,j}$ denote logits. We can easily control the diversity of the output by adding a temperature parameter $T$ that rescales (i.e., divides) the logits before taking the softmax:

$$
P(y_{t} = w_{i} | y_{<t}, \mathbf{x}) = \frac{\text{exp}(z_{t,i} \ / \ T)}{\sum^{|V|}_{j=1} \text{exp}(z_{t,j} \ / \ T)}
$$

By tuning $T$ we can control the shape of the probability distribution. When $T \ll 1$, the distribution becomes peaked around the origin and the rare tokens are suppressed. On the other hand, when $T \gg 1$, the distribution flattens out and each token becomes equally likely. This effect can be seen in the following picture:

<img src="images/sampling_example.png" title="" alt="" width="400" data-align="center">
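
The effect of $T$ can also be checked numerically. The sketch below applies the temperature-scaled softmax from the equation above to four invented logits and then draws one token from the result. The helper name `softmax_with_temperature`, the logits, and the seed are assumptions made for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - max_scaled) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 0.0]  # invented logits for a four-token vocabulary
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(f"T = {temperature}: " + ", ".join(f"{p:.3f}" for p in probs))

# Sampling the next token: draw one index according to the rescaled distribution
rng = random.Random(0)
probs = softmax_with_temperature(logits, temperature=2.0)
next_token_id = rng.choices(range(len(logits)), weights=probs, k=1)[0]
print("sampled token id:", next_token_id)
```

Lowering the temperature moves probability mass toward the highest logit, while raising it pushes the distribution toward uniform.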

To see how we can use temperature to influence the generated text, let's sample with $T=2$ by setting the `temperature` parameter in the `generate()` function (we'll explain the meaning of the `top_k` parameter in the next section):

In [None]:
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True, temperature=2.0, top_k=0)
print(tokenizer.decode(output_temp[0]))

We can clearly see that a high temperature has produced mostly gibberish; by accentuating the probability of rare tokens, we have caused the model to create strange grammar and quite a few made-up words! Let's see what happens if we cool down the temperature:

In [None]:
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True, temperature=0.5, top_k=0)
print(tokenizer.decode(output_temp[0]))

This is significantly more coherent. The main lesson we can draw from temperature is that it allows us to control the quality of the samples, but there's always a trade-off between coherence (low temperature) and diversity (high temperature) that one has to tune to the use case at hand.

Another way to adjust the trade-off between coherence and diversity is to truncate the distribution of the vocabulary. This allows us to adjust the diversity freely with the temperature but in a more limited range that excludes words that would be too strange in the context (i.e., low-probability words). There are two main ways to do this: top-$k$ and nucleus (top-$p$) sampling.

### Top-k and top-p sampling

Top-$k$ and nucleus (top-$p$) sampling are two popular alternatives or extensions to using temperature. In both cases, the basic idea is to restrict the number of possible tokens we can sample from at each timestep. To see how this works, let's first visualize the cumulative probability distribution of the model's outputs at $T=1$, as seen in the following picture:

<img src="images/probability_distribution.png" title="" alt="" width="450" data-align="center">

From the upper plot we can see that the probability of picking the token with the highest probability (the isolated bar at $10^{-1}$) is 1 in 10. In the lower plot, we have ordered the tokens by descending probability and calculated the cumulative sum of the first 10000 tokens (in total, there are 50257 tokens in GPT-2's vocabulary). The curved line indicates the probability of picking any of the preceding tokens. For example, there is roughly a 96% chance of picking any of the 1000 tokens with the highest probability. 

The plot also shows that there is roughly a 1 in 100 chance of picking a token that is not even in the top 2000. Although these numbers might appear small at first sight, they become important because we sample once per token when generating text. So, even if there is a small probability, if we sample hundreds of times there is a significant chance of picking an unlikely token at some point, which in some cases can badly influence the quality of the generated text. For this reason, we generally want to avoid these very unlikely tokens. This is where top-$k$ and top-$p$ sampling come into play.
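
How quickly a small per-step probability adds up can be checked with a line of arithmetic. The sketch below treats the roughly 1-in-100 tail probability as constant and independent at every step, which is a simplifying assumption because the real distribution changes with the context. The helper name `prob_at_least_one_tail_token` is ours.

```python
def prob_at_least_one_tail_token(tail_prob, n_samples):
    """Chance that at least one of `n_samples` independent draws lands in the tail."""
    return 1.0 - (1.0 - tail_prob) ** n_samples


for n_samples in (1, 10, 100, 500):
    p = prob_at_least_one_tail_token(0.01, n_samples)
    print(f"{n_samples:>3} sampled tokens -> {100 * p:.1f}% chance of at least one tail token")
```

Under these assumptions, a text of 100 sampled tokens already has a better-than-even chance of containing at least one very unlikely token.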

#### Top-$k$ sampling

The idea behind top-$k$ sampling is to avoid the low probability choices by only sampling from the $k$ tokens with the highest probability. This puts a fixed cut on the long tail of the distribution and ensures that we only sample from likely choices. Returning to the previous figure, top-$k$ sampling is equivalent to drawing a vertical line on the cumulative probability plot and sampling only from the tokens to its left. Again, the `generate()` function provides an easy method to achieve this with the `top_k` argument:



In [None]:
output_topk = model.generate(input_ids, max_length=max_length, do_sample=True, top_k=50)
print(tokenizer.decode(output_topk[0]))

But how do we choose $k$? The value of $k$ is chosen manually and is the same for each choice in the sequence, independent of the actual output distribution. We can find a good value for $k$ by looking at some text quality metrics.
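
The truncation step itself is small enough to sketch in plain Python. The helper `top_k_filter` and the toy distribution are invented for illustration and operate on a dictionary of probabilities rather than on model logits.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}


# Invented next-token distribution with a small tail of unlikely tokens
probs = {"the": 0.5, "a": 0.3, "unicorn": 0.15, "zxqv": 0.04, "qwrt": 0.01}
filtered = top_k_filter(probs, k=3)
print(filtered)
```

The cutoff stays at three tokens no matter how the probability mass is distributed, which is the fixed behavior described above.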

#### Top-$p$ sampling

An alternative is to use a dynamic cutoff (i.e., top-$p$ sampling). With nucleus or top-$p$ sampling, instead of choosing a fixed cutoff value, we set a condition for when to cut off: the point at which a certain probability mass in the selection is reached. Let's say we set the value to `0.90`. We then order all tokens in descending order by probability and add one token after another from the top of the list until the sum of the probabilities of the selected tokens reaches 0.90. Returning to the previous figure, the value of $p$ defines a horizontal line on the cumulative sum of probabilities plot, and we sample only from tokens below the line. The `generate()` function exposes this through the `top_p` argument:

In [None]:
output_topp = model.generate(input_ids, max_length=max_length, do_sample=True, top_p=0.90)
print(tokenizer.decode(output_topp[0]))
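
To see why this cutoff is called dynamic, the selection rule can be sketched on two toy distributions. The helper `top_p_filter` and the numbers are invented for illustration and operate on a dictionary of probabilities rather than on model logits.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose total probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept tokens form a valid distribution
    return {token: prob / cumulative for token, prob in kept}


peaked = {"the": 0.92, "a": 0.05, "unicorn": 0.02, "zxqv": 0.01}
flat = {"the": 0.3, "a": 0.3, "unicorn": 0.25, "zxqv": 0.15}
peaked_kept = top_p_filter(peaked, p=0.90)
flat_kept = top_p_filter(flat, p=0.90)
print(len(peaked_kept), "token(s) kept for the peaked distribution")
print(len(flat_kept), "token(s) kept for the flat distribution")
```

With the same value of $p$, the peaked distribution keeps a single token while the flat one keeps all four, so the number of candidates adapts to how confident the model is at each step.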

----

<span style="color:blue">You can even combine the two sampling approaches to get the best of both worlds.</span> Setting `top_k=50` and `top_p=0.9` corresponds to the rule of choosing tokens with a probability mass of 90%, from a pool of at most 50 tokens.

<span style="color:blue">We can also apply beam search when we use sampling</span>. Instead of selecting the next batch of candidate tokens greedily, we can sample them and build up the beams in the same way.

----

### 5.2.4 - Which decoding method is best?

Unfortunately, there is no universally "best" decoding method. Which approach is best will depend on the nature of the task you are generating text for. As a rule of thumb:

* If you want your model to perform a precise task like arithmetic or providing an answer to a specific question, then you should lower the temperature or use deterministic methods like greedy search or beam search to favor the most likely answer.

* If you want the model to generate longer text and even be a bit creative, then you should switch to sampling methods and increase the temperature or use a mix of top-$k$ and nucleus sampling.

## Conclusion

Generating text requires at least one forward pass per generated token, and even more if we use beam search. This makes text generation computationally demanding, and one needs the right infrastructure to run a text generation model at scale. In addition, finding the best decoding strategy for our use case requires some experimentation and a subjective evaluation of the generated texts.

In practice, however, we don't want to make these decisions based on gut feeling alone! Like with other NLP tasks, we should choose a model performance metric that reflects the problem we want to solve. Unsurprisingly, there is a wide range of choices, and we will encounter the most common ones in the next chapter, where we have a look at how to train and evaluate a model for text summarization.