# Chapter 5: Text Generation


## Greedy Search Decoding

The exercise in the book uses [GPT-2 XL](https://huggingface.co/gpt2-xl).

> GPT-2 XL is <span style="background-color: #9AFEFF">the 1.5B parameter version of GPT-2</span>, a transformer-based language model created and released by OpenAI. The model is a pretrained model on English language using a causal language modeling (CLM) objective.

You might want to read the `transformers` v4.28.1 documentation on:

* [`transformers.AutoTokenizer`](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoTokenizer) ... generic tokenizer class that will be instantiated as one of the tokenizer classes of the library when created with the [`AutoTokenizer.from_pretrained()`](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained) class method.
* [`transformers.AutoModelForCausalLM`](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoModelForCausalLM) ... generic model class that will be instantiated as one of the model classes of the library (with a causal language modeling head) when created with the [`from_pretrained()`](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoModelForCausalLM.from_pretrained) class method or the [`from_config()`](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoModelForCausalLM.from_config) class method.

In [1]:
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'gpt2-xl'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

#### NOTE!

* Using `huggingface-cli`, [`scan`](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#scan-your-cache) your cache and confirm that the GPT-2 XL model we just downloaded takes up ~6.4G of space.
* In the [Files and versions tab for `gpt2-xl` on HF](https://huggingface.co/gpt2-xl/tree/main), note that `pytorch_model.bin` is 6.43G in size!
* You may want to [clean](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#clean-your-cache) your HF cache as needed with `huggingface-cli`.
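
As a rough sanity check on that download size, here is a minimal back-of-the-envelope sketch. It assumes the weights are stored as 32-bit floats (4 bytes each), which is not stated above, and it uses the rounded "1.5B parameter" figure from the model card quote. The helper name `checkpoint_size_gb` is made up for this note.

```python
def checkpoint_size_gb(n_params, bytes_per_param=4):
    """Approximate size in GB (10**9 bytes) of n_params stored at a fixed width."""
    return n_params * bytes_per_param / 1e9

n_params = 1.5e9  # rounded parameter count quoted in the model card
estimate = checkpoint_size_gb(n_params)  # assumes float32 storage
print(f"~{estimate:.1f} GB")
```

The estimate lands in the same ballpark as the ~6.4G reported by the cache scan. The gap is expected, since the parameter count is rounded and the file holds more than the raw weights.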

### Naive implementation of greedy decoding method

> The simplest decoding method to get discrete tokens from a model's continuous output is to greedily select the token with the highest probability at each timestep:
> 
> $\hat{y}_{t} = \underset{y_t}{\operatorname{argmax}}\, P\left( y_{t} \mid y_{<t}, x \right)$


We implement the decoding method of this autoregressive model so that we can learn how things are done under the hood.
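
Before touching the real model, a single greedy step can be sketched in plain Python. The four-word vocabulary and the logits below are made up for illustration. The point is only that softmax turns scores into probabilities and the argmax picks the next token.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Return (index, probability) of the most likely next token."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

# Hypothetical vocabulary and logits, for illustration only.
vocab = ["most", "only", "best", "ultimate"]
logits = [2.0, 1.5, 1.4, 0.6]

idx, prob = greedy_pick(logits)
print(vocab[idx], f"{prob:.2f}")
```

Since softmax preserves the ordering of the logits, taking the argmax of the logits directly would select the same token. The probabilities are only needed if we want to report them.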

Please see:
* [`torch.argsort`](https://pytorch.org/docs/1.11/generated/torch.argsort.html?highlight=argsort#torch.argsort) ... returns the indices that sort a tensor along a given dimension in ascending order by value; pass `descending=True`, as the code below does, to get the highest values first.
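
To make the index-sorting idea concrete, here is a plain-Python stand-in for a 1-D argsort. It is only a sketch of the semantics, not how PyTorch implements it, and the probabilities are made up.

```python
def argsort(values, descending=False):
    """Plain-Python stand-in for argsort on a 1-D list: indices that sort values."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=descending)

probs = [0.10, 0.65, 0.05, 0.20]  # made-up next-token probabilities

ascending = argsort(probs)                # smallest first (the default order)
ranked = argsort(probs, descending=True)  # most likely token first

top_id = ranked[0]  # the token greedy decoding would append
top_3 = ranked[:3]  # the kind of slice used to list several choices per step
print(ascending, ranked, top_id, top_3)
```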

Here is a naive implementation of greedy decoding:

In [3]:
import pandas as pd

input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        
        # select logits of the 1st batch and the last token,
        # and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        
        # store tokens with the highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            # don't forget to move off GPU and back to CPU!
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100*token_prob:.2F}%)"
            )
            iteration[f"Choice for {choice_idx+1}"] = token_choice
        
        # append the most likely token id, reshaped to (1, 1), to the input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)

pd.DataFrame(iterations)

| | Input | Choice for 1 | Choice for 2 | Choice for 3 | Choice for 4 | Choice for 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most (8.53%) | only (4.96%) | best (4.65%) | Transformers (4.37%) | ultimate (2.16%) |
| 1 | Transformers are the most | popular (16.78%) | powerful (5.37%) | common (4.96%) | famous (3.72%) | successful (3.20%) |
| 2 | Transformers are the most popular | toy (10.63%) | toys (7.23%) | Transformers (6.60%) | of (5.46%) | and (3.76%) |
| 3 | Transformers are the most popular toy | line (34.38%) | in (18.20%) | of (11.71%) | brand (6.10%) | line (2.69%) |
| 4 | Transformers are the most popular toy line | in (46.28%) | of (15.09%) | , (4.94%) | on (4.40%) | ever (2.72%) |
| 5 | Transformers are the most popular toy line in | the (65.99%) | history (12.42%) | America (6.91%) | Japan (2.44%) | North (1.40%) |
| 6 | Transformers are the most popular toy line in the | world (69.26%) | United (4.55%) | history (4.29%) | US (4.23%) | U (2.30%) |
| 7 | Transformers are the most popular toy line in ... | , (39.73%) | . (30.64%) | and (9.87%) | with (2.32%) | today (1.74%) |
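
Each row above reports a conditional probability for one step. By the chain rule, the probability of the whole eight-token continuation is the product of the top-choice probabilities. The sketch below simply multiplies the rounded percentages read off the first choice column, so the result is approximate.

```python
import math

# Top-choice probabilities (in %) copied from the first choice column above.
greedy_probs_pct = [8.53, 16.78, 10.63, 34.38, 46.28, 65.99, 69.26, 39.73]

step_probs = [p / 100 for p in greedy_probs_pct]
seq_prob = math.prod(step_probs)                    # product of per-step probabilities
seq_logprob = sum(math.log(p) for p in step_probs)  # same quantity in log space

print(f"P(sequence) = {seq_prob:.2e}, log P = {seq_logprob:.2f}")
```

Even though every step picks the single most likely token, the probability of the full sequence is tiny, which is why sequence scores are usually compared in log space.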


See the HF documentation on [Generation (Text Generation)](https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/text_generation).

... and here's the equivalent using `generate`...

In [5]:
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)

output = model.generate(
    input_ids, 
    max_new_tokens=n_steps, 
    do_sample=False
)

print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Transformers are the most popular toy line in the world,
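
The call above limits generation with `max_new_tokens`, which counts only newly generated tokens. `generate` also accepts `max_length`, which counts the prompt tokens plus the generated ones. The sketch below mirrors that bookkeeping in plain Python. The helper `new_token_budget` and the prompt length are hypothetical and are not part of `transformers`.

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """How many tokens may still be generated under either kind of limit."""
    if max_new_tokens is not None:
        return max_new_tokens  # the prompt length does not matter
    return max(max_length - prompt_len, 0)  # the prompt eats into the budget

prompt_len = 4  # hypothetical token count for a short prompt

print(new_token_budget(prompt_len, max_new_tokens=8))
print(new_token_budget(prompt_len, max_length=128))
```

With a long prompt, `max_length` can leave far fewer new tokens than the number suggests, so it is worth checking the tokenized prompt length before choosing a limit.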


In [24]:
max_length = 128

input_txt = (
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. "
    "Even more surprising to the researchers was the fact that the unicorns spoke perfect English.¥n¥n"
)

input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)


output_greedy = model.generate(
    input_ids,
    max_length=max_length, 
    do_sample=False
)

print(tokenizer.decode(output_greedy[0]))


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.¥n¥n

The researchers, from the University of California, Davis, and the University of Colorado, Boulder, were conducting a study on the Andean cloud forest, which is home to the largest concentration of cloud forest in the world.¥n¥n

The researchers were conducting a study on the Andean cloud forest, which is home to the largest concentration of cloud forest
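
The greedy output above starts repeating itself: the last paragraph reuses a long clause from the one before it. A minimal way to spot this is to count repeated word n-grams. The helper `repeated_ngrams` is a made-up name for this note, and the text is copied from the output above with the `¥n¥n` markers left out.

```python
from collections import Counter

def repeated_ngrams(text, n=5):
    """Return the word n-grams that occur more than once in text."""
    words = text.split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    return {gram: count for gram, count in counts.items() if count > 1}

generated = (
    "The researchers, from the University of California, Davis, and the "
    "University of Colorado, Boulder, were conducting a study on the Andean "
    "cloud forest, which is home to the largest concentration of cloud forest "
    "in the world. The researchers were conducting a study on the Andean "
    "cloud forest, which is home to the largest concentration of cloud forest"
)

for gram, count in repeated_ngrams(generated, n=8).items():
    print(count, gram)
```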


----

## Beam Search Decoding

