# Chapter 5: Text Generation


## Greedy Search Decoding

The exercise in the book uses [GPT-2 XL](https://huggingface.co/gpt2-xl).

> GPT-2 XL is <span style="background-color: #9AFEFF">the 1.5B parameter version of GPT-2</span>, a transformer-based language model created and released by OpenAI. The model is a pretrained model on English language using a causal language modeling (CLM) objective.

You might want to read the `transformers` v4.28.1 documentation on:

* [`transformers.AutoTokenizer`](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoTokenizer) ... generic tokenizer class that will be instantiated as one of the tokenizer classes of the library when created with the [`AutoTokenizer.from_pretrained()`](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained) class method.
* [`transformers.AutoModelForCausalLM`](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoModelForCausalLM) ... generic model class that will be instantiated as one of the model classes of the library (with a causal language modeling head) when created with the [`from_pretrained()`](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoModelForCausalLM.from_pretrained) class method or the [`from_config()`](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoModelForCausalLM.from_config) class method.

In [1]:
import pandas as pd
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'gpt2-xl'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

#### NOTE!

* Using `huggingface-cli`, [`scan`](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#scan-your-cache) your cache and confirm that the GPT-2 XL model we just downloaded takes up ~6.4G of space.
* Looking up the files at [Files and versions tab for `gpt2-xl` on HF](https://huggingface.co/gpt2-xl/tree/main), see how `pytorch_model.bin` is 6.43G in size!
* You may want to [`clear`](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#clean-your-cache) your HF cache-system as needed with `huggingface-cli`.


----

### Naive implementation of greedy decoding method

> _The simplest decoding method to get discrete tokens from a model's continuous output is to greedily select the token with the highest probability at each timestep:_
> 
> $ \hat{y}_{t} =\underset{y_t}{\operatorname{argmax}}{P}\left( y_{t} | y_{<t} , x \right) $


We implement the decoding method of this autoregressive model so that we can learn how things are done under the hood.
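
As a warm-up, the argmax rule can be sketched on a toy next-token table before involving the real model. Everything in this sketch is hypothetical: the `TOY_MODEL` dictionary and its probabilities are made up for illustration and stand in for the softmax output of GPT-2 XL.

```python
# Hypothetical next-token distributions, keyed by the previous token only.
# The numbers are invented; a real model conditions on the whole prefix.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "end": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"end": 0.9, "the": 0.1},
    "dog": {"end": 0.8, "the": 0.2},
}


def greedy_decode(start, max_steps):
    """Repeatedly append the single most probable next token."""
    tokens = [start]
    for _ in range(max_steps):
        dist = TOY_MODEL.get(tokens[-1])
        if dist is None:  # no continuation known for this token
            break
        next_token = max(dist, key=dist.get)  # argmax over the distribution
        tokens.append(next_token)
    return tokens


print(greedy_decode("<s>", max_steps=3))
```

At every step only the top-ranked token survives, which is exactly what the model-based loop further down does with `sorted_ids[0]`.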

Please see:
* `torch.softmax` ... Alias for [`torch.nn.functional.softmax`](https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html#torch.nn.functional.softmax); applies the softmax function to all slices along `dim`, rescaling them so that the elements lie in the range $[0, 1]$ and sum to $1$.
* [`torch.argsort`](https://pytorch.org/docs/1.11/generated/torch.argsort.html?highlight=argsort#torch.argsort) ... returns the indices that sort a tensor along a given dimension in ascending order by value (the code below passes `descending=True` to get the most probable tokens first).
* `torch.cat` ... concatenates the given sequence of tensors in the given dimension. All tensors must either have the same shape (except in the concatenating dimension) or be empty. It can be seen as an inverse operation for [`torch.split`](https://pytorch.org/docs/stable/generated/torch.split.html#torch.split) and [`torch.chunk`](https://pytorch.org/docs/stable/generated/torch.chunk.html#torch.chunk).

Here is our naive implementation of greedy decoding:

In [3]:
input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(
            input_ids=input_ids
        )
        
        # select logits of the 1st batch and the last token,
        # and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        
        # store tokens with the highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            # don't forget to move off GPU and back to CPU!
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100*token_prob:.2F}%)"
            )
            iteration[f"Choice for {choice_idx+1}"] = token_choice
        
        # cat the predicted next token on to the input,
        # and prepare to do it again!
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)

pd.DataFrame(iterations)

| | Input | Choice for 1 | Choice for 2 | Choice for 3 | Choice for 4 | Choice for 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most (8.53%) | only (4.96%) | best (4.65%) | Transformers (4.37%) | ultimate (2.16%) |
| 1 | Transformers are the most | popular (16.78%) | powerful (5.37%) | common (4.96%) | famous (3.72%) | successful (3.20%) |
| 2 | Transformers are the most popular | toy (10.63%) | toys (7.23%) | Transformers (6.60%) | of (5.46%) | and (3.76%) |
| 3 | Transformers are the most popular toy | line (34.38%) | in (18.20%) | of (11.71%) | brand (6.10%) | line (2.69%) |
| 4 | Transformers are the most popular toy line | in (46.28%) | of (15.09%) | , (4.94%) | on (4.40%) | ever (2.72%) |
| 5 | Transformers are the most popular toy line in | the (65.99%) | history (12.42%) | America (6.91%) | Japan (2.44%) | North (1.40%) |
| 6 | Transformers are the most popular toy line in the | world (69.26%) | United (4.55%) | history (4.29%) | US (4.23%) | U (2.30%) |
| 7 | Transformers are the most popular toy line in ... | , (39.73%) | . (30.64%) | and (9.87%) | with (2.32%) | today (1.74%) |


See the HF documentation on [Generation (Text Generation)](https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/text_generation).

... and here's the equivalent using `generate`...

#### IMPORTANT NOTE

In the book, no attention mask is passed in to `generate` along with `input_ids`, resulting in a warning that looks like this:
> <span style="background-color: #ffe0d9">_The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation._</span>

Setting aside the part about `pad_token_id` and `eos_token_id`, sgugger's answer to the question [Do automatically generated attention masks ignore padding?](https://discuss.huggingface.co/t/do-automatically-generated-attention-masks-ignore-padding/15479/2) states:
> _Yes, you need to pass the attention mask returned by the tokenizer. Most models don’t know the padding token ID, so they can’t generate an attention mask that ignores it._

So let's do just that.

In [4]:
# tokenize once and reuse both the token ids and the attention mask
encoded = tokenizer(input_txt, return_tensors="pt").to(device)
input_ids = encoded["input_ids"]
attn_mask = encoded["attention_mask"]

output = model.generate(
    input_ids=input_ids, 
    attention_mask=attn_mask,
    max_new_tokens=n_steps, 
    do_sample=False
)

print(tokenizer.decode(output[0]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Transformers are the most popular toy line in the world,


In [5]:
max_length = 128

input_txt = (
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. "
    "Even more surprising to the researchers was the fact that the unicorns spoke perfect English."
)

# tokenize once and reuse both the token ids and the attention mask
encoded = tokenizer(input_txt, return_tensors="pt").to(device)
input_ids = encoded["input_ids"]
attn_mask = encoded["attention_mask"]

output_greedy = model.generate(
    input_ids=input_ids, 
    attention_mask=attn_mask,
    max_length=max_length, 
    do_sample=False
)

print(tokenizer.decode(output_greedy[0]))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The researchers, from the University of California, Davis, and the University of Colorado, Boulder, were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.

The researchers were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.

The researchers were conducting a study on the Andean


----

## Beam Search Decoding

In [6]:
import numpy as np

sum([np.log(0.5)] * 1024)

-709.7827128933695
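
The cell above sums log probabilities instead of multiplying raw probabilities. A short sketch of why: the product of many small probabilities quickly leaves the range that floating point can represent well, while the equivalent sum of logs stays a perfectly ordinary number. The values `p` and `n` mirror the cell above; the longer exponent `1100` is an extra illustration.

```python
import math
import sys

p = 0.5
n = 1024

product = p ** n             # multiplying the probabilities directly
log_sum = n * math.log(p)    # the same quantity in log space

print(product)                         # tiny, already a subnormal float
print(product < sys.float_info.min)    # below the smallest normal float
print(p ** 1100)                       # a slightly longer sequence underflows to 0.0
print(log_sum)                         # still easy to represent and compare
```

Because the logarithm is monotonic, ranking sequences by summed log probability gives the same ordering as ranking them by the product of probabilities.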

The function below does the following:

* Calculates the log softmax across the last dimension of `logits`, turning the 50,257 raw vocabulary scores at each position into log probabilities
* Uses [`torch.gather`](https://pytorch.org/docs/stable/generated/torch.gather.html), passing in `labels` (the _indices_ of the tokens that actually appear in the sequence), to pick out the log probability of each of those tokens, and returns all of those log probabilities in a tensor
   * see this [answer on Stack Overflow](https://stackoverflow.com/questions/50999977/what-does-the-gather-function-do-in-pytorch-in-layman-terms/51032153#51032153) for a clearer explanation of what `torch.gather` does
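
To make those two steps concrete, here is a plain-Python sketch of what a log softmax followed by a gather computes for a single sequence. The helper names and the toy logits are hypothetical; the real function further down works on batched PyTorch tensors.

```python
import math


def log_softmax(row):
    """Log of the softmax of one list of logits (numerically stable)."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def gather_label_logps(logits, labels):
    """For each position, pick the log probability of that position's label."""
    return [log_softmax(row)[label] for row, label in zip(logits, labels)]


# Toy example: 3 positions, vocabulary of 4 tokens (values are made up).
logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 0.0, 0.0],
    [-1.0, 3.0, 0.2, 0.2],
]
labels = [0, 2, 1]  # index of the token observed at each position

logps = gather_label_logps(logits, labels)
print([round(lp, 3) for lp in logps])
```

The second row has four equal logits, so whichever label is gathered there has probability 1/4, which makes it an easy sanity check.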

In [7]:
import torch.nn.functional as F

def log_probs_from_logits(logits, labels):
    logp = F.log_softmax(logits, dim=-1)
    logp_label = torch.gather(logp, -1, labels.unsqueeze(-1)).squeeze(-1)
    return logp_label

The function below returns the _total log probability for the generated sequence._

* The _generated sequence_ starts where the input sequence ends, so
   * the very first label does not have a logit as the model predicts the _following_ token
   * the very last logit is unneeded, since we do not have a corresponding label (ground truth)
* Thus we need to align the logits and labels when calculating the log probabilities
   * count only the logits up through but not including the very last one
   * align those logits with the labels starting from the 2nd one
* We are not interested in the log probabilities of the _input sequence_, so we slice them out before summing up the log probabilities
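
The shift described in these bullets can be sketched with plain lists. The token strings and the `input_len` value are a hypothetical toy example; the strings in `predicted_at` merely stand in for the per-position logits.

```python
# Toy sequence: 3 prompt tokens followed by 2 generated tokens.
tokens = ["Transformers", " are", " the", " most", " popular"]
input_len = 3

# The model emits one prediction per position: the prediction at
# position i is a distribution over the token at position i + 1.
predicted_at = [f"prediction after {tok!r}" for tok in tokens]

# Drop the last prediction (nothing to compare it with) and the first
# token (nothing predicted it), then pair them up.
aligned = list(zip(predicted_at[:-1], tokens[1:]))

for index, (prediction, label) in enumerate(aligned):
    print(index, prediction, "is scored against", repr(label))
```

One detail worth noticing: after the shift, the pair that scores the first generated token sits at index `input_len - 1`. Slicing from `input_len`, as `sequence_logprob` does, therefore appears to leave that first generated token out of the sum.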

In [8]:
def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :],
            labels[:, 1:]
        )
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()

In [9]:
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))

print(tokenizer.decode(output_greedy[0]))
print(f"\nlog-prob: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The researchers, from the University of California, Davis, and the University of Colorado, Boulder, were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.

The researchers were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.

The researchers were conducting a study on the Andean

log-prob: -68.74


----
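
The next cells switch from greedy search to beam search via `num_beams`. As a rough illustration of the idea (not of how `generate` is implemented), the sketch below keeps the `num_beams` highest-scoring partial sequences at each step, scored by summed log probability. The `TOY_MODEL` table and its probabilities are hypothetical, and the sketch ignores end-of-sequence handling and length penalties.

```python
import math

# Hypothetical next-token distributions, keyed by the previous token only.
TOY_MODEL = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.4, "y": 0.35, "z": 0.25},
    "B": {"x": 0.9, "y": 0.1},
}


def beam_search(start, n_steps, num_beams):
    """Return the surviving (tokens, log_prob) beams, best first."""
    beams = [([start], 0.0)]
    for _ in range(n_steps):
        candidates = []
        for tokens, score in beams:
            dist = TOY_MODEL.get(tokens[-1], {})
            if not dist:
                candidates.append((tokens, score))  # nothing to extend
                continue
            for token, prob in dist.items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # keep only the num_beams most probable partial sequences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


greedy_tokens, greedy_score = beam_search("<s>", n_steps=2, num_beams=1)[0]
beam_tokens, beam_score = beam_search("<s>", n_steps=2, num_beams=2)[0]

print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

With a single beam the search is greedy and commits to `A` because it looks best at the first step. With two beams the `B` branch stays alive long enough for its strong continuation to win overall.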

In [10]:
output_beam = model.generate(
    input_ids=input_ids,
    attention_mask=attn_mask,
    max_length=max_length,
    num_beams=5,
    do_sample=False
)

logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))

print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The discovery of the unicorns was made by a team of scientists from the University of California, Santa Cruz, and the National Geographic Society.

According to the researchers, the unicorns were found in a remote valley in the Andes Mountains. The valley is known as the "Valley of the Unicorns" because of the number of unicorns that have been found there.

The valley

log-prob: -72.30


In [11]:
output_beam_no_repeat = model.generate(
    input_ids=input_ids,
    attention_mask=attn_mask,
    max_length=max_length,
    num_beams=5,
    do_sample=False,
    no_repeat_ngram_size=2
)

logp = sequence_logprob(model, output_beam_no_repeat, input_len=len(input_ids[0]))

print(tokenizer.decode(output_beam_no_repeat[0]))
print(f"\nlog-prob: {logp:.2f}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The discovery was made by a team of scientists from the University of California, Santa Cruz, and the National Geographic Society. The team, led by Dr. David Hone, discovered the unicorn herd while conducting a study on the ecology and evolution of mountain goats. According to a press release, the team found the herd in an area that had never been explored before. They were able to track the animals using

log-prob: -106.29
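
Setting `no_repeat_ngram_size=2` asks `generate` not to produce any 2-gram twice. A minimal sketch of that idea: given the tokens so far, work out which next tokens would complete an n-gram that has already appeared. The function name `banned_next_tokens` is hypothetical, and this is an illustration rather than the library's implementation.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would repeat an already-seen n-gram."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (n - 1) tokens are the start of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


history = ["The", "researchers", "were", "studying", "The"]
print(banned_next_tokens(history, ngram_size=2))
```

During decoding, the scores of the banned tokens would be pushed to negative infinity before the next token is chosen. That forces the model away from its most likely continuations, which is consistent with the lower log probability reported above.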


----

## Sampling Methods
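
The cells in this section sample from the model with different `temperature` values. As a rough sketch of what that parameter does, the logits are divided by the temperature $T$ before the softmax: $T > 1$ flattens the distribution and $T < 1$ sharpens it. The toy logits below are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 1.5, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a high temperature the unlikely tokens get a much larger share of the probability mass, and at a low temperature the top token dominates. The two sampled outputs below behave accordingly.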

In [12]:
output_temp = model.generate(
    input_ids=input_ids,
    attention_mask=attn_mask,
    max_length=max_length,
    do_sample=True,
    temperature=2.0,
    top_k=0
)

logp = sequence_logprob(model, output_temp, input_len=len(input_ids[0]))

print(tokenizer.decode(output_temp[0]))
print(f"\nlog-prob: {logp:.2f}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. AREOVER SAFrialger reunitions mediumWe Schnee feel eight will maintaining himself REALisingrage content彼大こ Texture ED bits graded VERBLUES Substance atop developing digital HD Victoria Christmas FormsVision After Sounds fitness imaginableRand coloredifribute providerwisebest §1977 Brola Looksowersam and silk UDread Languages PKEW irregular almost birthday numbers uhunc Kung Thing Gib the Patent got mixed orcs Parliamentary LEGO Appeal

log-prob: -868.70


In [13]:
output_temp_2 = model.generate(
    input_ids=input_ids,
    attention_mask=attn_mask,
    max_length=max_length,
    do_sample=True,
    temperature=0.5,
    top_k=0
)

logp = sequence_logprob(model, output_temp_2, input_len=len(input_ids[0]))

print(tokenizer.decode(output_temp_2[0]))
print(f"\nlog-prob: {logp:.2f}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The discovery was made by the scientists, who were studying the natural environment of the Andes Mountains. The researchers headed to the valley in order to study the rare species of plants that grow in the area.

The scientists have been studying the area for the past three years and they have been able to identify the unicorns through their unique DNA.

The researchers were able to identify the unicorns

log-prob: -124.38


----

## Top-k and Nucleus Sampling
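
Top-k sampling restricts sampling to the `k` most probable tokens. The sketch below shows the filtering step on a toy distribution. The function name `top_k_filter` and the probabilities are hypothetical, and the library's own implementation works on logits rather than on probabilities.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


probs = [0.5, 0.25, 0.15, 0.07, 0.03]  # made-up next-token distribution
filtered = top_k_filter(probs, k=2)
print([round(p, 3) for p in filtered])

# Sampling then draws only from the surviving tokens.
rng = random.Random(0)
sampled_index = rng.choices(range(len(filtered)), weights=filtered)[0]
print(sampled_index)
```

The cutoff is fixed: `k` tokens survive whether the distribution is sharply peaked or nearly flat.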

In [14]:
output_topk = model.generate(
    input_ids=input_ids,
    attention_mask=attn_mask,
    max_length=max_length,
    do_sample=True,
    top_k=50
)

logp = sequence_logprob(model, output_topk, input_len=len(input_ids[0]))

print(tokenizer.decode(output_topk[0]))
print(f"\nlog-prob: {logp:.2f}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The discovery was made by Dr. Cristian Pacheco of the University of São Paulo, South America, through a detailed study of their genome which revealed the presence of unique genes, which could explain the unique behavior of the mythical creatures. Also, he observed that the horn is no longer of an animal's self but the result of an evolutionary process, which was once passed on from one generation to

log-prob: -178.45
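
Nucleus (top-p) sampling, used in the next cell, replaces the fixed cutoff with a dynamic one: it keeps the smallest set of most probable tokens whose cumulative probability reaches `top_p`. A minimal sketch under the same caveats as before: `top_p_filter` and both toy distributions are hypothetical, and this is not the library's implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set()
    cumulative = 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


peaked = [0.85, 0.08, 0.04, 0.02, 0.01]  # model is confident
flat = [0.28, 0.24, 0.20, 0.16, 0.12]    # model is unsure

for name, dist in (("peaked", peaked), ("flat", flat)):
    filtered = top_p_filter(dist, top_p=0.90)
    n_kept = sum(1 for p in filtered if p > 0)
    print(name, n_kept, [round(p, 3) for p in filtered])
```

With the same `top_p`, the confident distribution keeps only a couple of candidates while the uncertain one keeps all of them.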


In [15]:
output_nucleus = model.generate(
    input_ids=input_ids,
    attention_mask=attn_mask,
    max_length=max_length,
    do_sample=True,
    top_p=0.90
)

logp = sequence_logprob(model, output_nucleus, input_len=len(input_ids[0]))

print(tokenizer.decode(output_nucleus[0]))
print(f"\nlog-prob: {logp:.2f}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The unicorns that live in this isolated area of Bolivia were captured and studied by researchers at the University of Texas, from the U.S. and Colombia. They noticed the animals seemed to be living in a village, with a community of humans living around it.

The herd consisted of four adults, a mother and five cubs. Scientists found the mother unicorn was pregnant when she was captured.

log-prob: -161.21


In [16]:
output_topk_topp = model.generate(
    input_ids=input_ids,
    attention_mask=attn_mask,
    max_length=max_length,
    do_sample=True,
    top_k=50,
    top_p=0.90
)

logp = sequence_logprob(model, output_topk_topp, input_len=len(input_ids[0]))

print(tokenizer.decode(output_topk_topp[0]))
print(f"\nlog-prob: {logp:.2f}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

To make the finding even more unbelievable, the unicorns were able to be spotted by a group of tourists and the scientists were able to communicate with them in the form of telepathic communication, after they spent a day listening to the voices of the unicorns.

In the report, Dr. Carlos Paz of the Universidad de Chile, revealed the findings of the research in the journal, Science

log-prob: -156.88
