# Natural Language Processing with Transformers

## Chapter 5 - Text Generation

### Greedy search decoding example

Use GPT-2 (very similar to Raschka LLM tutorial so not many notes needed here):

**load with "Language Modeling Head" - this is the CausalLM class from HF**

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)


### Example using HF generate()

Here I just use the built-in `generate()` method; afterwards we implement greedy decoding from scratch. Note that `do_sample=True` below means the output is sampled, so it changes from run to run.

Taken from [https://huggingface.co/docs/transformers/model_doc/gpt2](https://huggingface.co/docs/transformers/model_doc/gpt2)

In [3]:
#prompt = "def hello_world():"
prompt = "Transformers are the" # This is the example prompt used in book
model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)

tokenizer.batch_decode(generated_ids)[0]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"Transformers are the future of the Marvel Universe and the biggest threat to the human race – except of course when Unicron destroys the earth.\n\nThis is how he got his name and everything. But, that's a story for another day.\n\nWhat we want to know is what you make of him. Because if you've got a burning question, then maybe you can put it to rest.\n\nHere we go.\n\nHis name is Unicron.\n\nHis backstory was mentioned in the"

To warm up, we’ll take the same iterative approach shown in Figure 5-3 (which shows the sequence being extended one token at a time): we’ll use “Transformers are the” as the input
prompt and run the decoding for eight timesteps. At each timestep, we pick out the
model’s logits for the last token in the prompt and wrap them with a softmax to get a
probability distribution. We then pick the next token with the highest probability, add
it to the input sequence, and run the process again. The following code does the job,
and also stores the five most probable tokens at each timestep so we can visualize the
alternatives:

In [7]:
import pandas as pd

input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
print("ORIGINAL INPUT IDS ----->", input_ids)

iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        # [0] because input_ids has shape (1, seq_len), e.g. tensor([[41762, 364, 389, 262]])
        iteration["input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        
        # logits has shape (batch, seq_len, vocab_size): select the first (and only)
        # sequence in the batch at the last token position, then apply softmax
        next_token_logits = output.logits[0, -1, : ]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        
        # argsort returns the indices (token ids) that sort the probabilities;
        # descending=True puts the most probable token first
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        
        # store the tokens with the highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy() # .numpy() needs the tensor on the CPU
            token_choice = f"{tokenizer.decode(token_id)} - prob: {100 * token_prob:.2f}%"
            iteration[f"Choice {choice_idx + 1}"] = token_choice
        
        # store the info for this iteration
        iterations.append(iteration)
        
        # append the most probable token id to the input ids;
        # sorted_ids[None, 0, None] has shape (1, 1) so it can be concatenated
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        
df = pd.DataFrame(iterations)

display(df)
        

ORIGINAL INPUT IDS -----> tensor([[41762,   364,   389,   262]], device='cuda:0')


| | input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most - prob: 8.53% | only - prob: 4.96% | best - prob: 4.65% | Transformers - prob: 4.37% | ultimate - prob: 2.16% |
| 1 | Transformers are the most | popular - prob: 16.78% | powerful - prob: 5.37% | common - prob: 4.96% | famous - prob: 3.72% | successful - prob: 3.20% |
| 2 | Transformers are the most popular | toy - prob: 10.63% | toys - prob: 7.23% | Transformers - prob: 6.60% | of - prob: 5.46% | and - prob: 3.76% |
| 3 | Transformers are the most popular toy | line - prob: 34.38% | in - prob: 18.20% | of - prob: 11.71% | brand - prob: 6.10% | line - prob: 2.69% |
| 4 | Transformers are the most popular toy line | in - prob: 46.28% | of - prob: 15.09% | , - prob: 4.94% | on - prob: 4.40% | ever - prob: 2.72% |
| 5 | Transformers are the most popular toy line in | the - prob: 65.99% | history - prob: 12.42% | America - prob: 6.91% | Japan - prob: 2.44% | North - prob: 1.40% |
| 6 | Transformers are the most popular toy line in the | world - prob: 69.26% | United - prob: 4.55% | history - prob: 4.29% | US - prob: 4.23% | U - prob: 2.30% |
| 7 | Transformers are the most popular toy line in ... | , - prob: 39.73% | . - prob: 30.64% | and - prob: 9.87% | with - prob: 2.32% | today - prob: 1.74% |


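**Now do the same with the built-in `generate()`:**

Make sure sampling is switched off with `do_sample=False`. With sampling off (and the default `num_beams=1`), `generate()` does greedy decoding: it picks the highest-probability token at every step, so it should reproduce the loop above.

**The book doesn't spell out what `do_sample` does, so I looked it up in the docs:**

If set to True, this parameter enables decoding strategies such as multinomial sampling, beam-search multinomial sampling, Top-K sampling and Top-p sampling. All these strategies select the next token from the probability distribution over the entire vocabulary with various strategy-specific adjustments.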

In [8]:
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)
print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Transformers are the most popular toy line in the world,


In [9]:
max_length = 128

input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length, do_sample=False)
print(tokenizer.decode(output_greedy[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, Davis, and the University of Colorado, Boulder, were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.


The researchers were surprised to find that the unicorns were able to communicate with each other, and even with humans.


The researchers were surprised to find that the unicorns were able


Note in the above how greedy search leads to repetitive output (a common problem with greedy decoding). Greedy search can also miss high-probability word SEQUENCES: a sequence whose overall probability is high may start with a lower-probability WORD, so greedy never picks that word **and hence never reaches the rest of the sequence**.


## Beam search

- keep track of the `b` most probable next tokens, where `b` is the number of beams (partial hypotheses)
- at each step, extend every beam with the possible next tokens and keep the `b` most likely extensions
- continue until an EOS token or the `max_length` limit is reached
- **the most likely SEQUENCE is chosen by ranking the `b` BEAMS OVERALL according to LOG PROB**
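
The steps above can be sketched on a toy problem. This is only a minimal illustration, not how HF implements it: `TOY_MODEL` and `beam_search` are made-up names, the "model" is a hand-written lookup table of next-token probabilities, and there is no EOS handling (it just runs for a fixed number of steps). With one beam it behaves like greedy search; with two beams it finds a sequence with higher overall probability that greedy misses.

```python
import math

# Hypothetical toy "model": maps a prefix (tuple of tokens) to next-token probabilities.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, num_beams, n_steps):
    """Return the best (sequence, log_prob) after n_steps, keeping num_beams hypotheses."""
    beams = [((), 0.0)]  # each beam is (sequence so far, cumulative log probability)
    for _ in range(n_steps):
        candidates = []
        for seq, score in beams:
            # extend every beam with every possible next token
            for token, prob in model[seq].items():
                candidates.append((seq + (token,), score + math.log(prob)))
        # keep only the num_beams most likely extensions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]

greedy_seq, greedy_lp = beam_search(TOY_MODEL, num_beams=1, n_steps=2)
beam_seq, beam_lp = beam_search(TOY_MODEL, num_beams=2, n_steps=2)
print(greedy_seq, round(math.exp(greedy_lp), 2))  # ('A', 'x') 0.33
print(beam_seq, round(math.exp(beam_lp), 2))      # ('B', 'x') 0.36
```

Greedy commits to "A" because it is the most probable first token, but the best two-token sequence starts with the less probable "B".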

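Use **log prob** for numerical stability: a sequence probability is a product of many conditional probabilities, each below 1, which quickly underflows to zero in floating point. Summing the logs of the probabilities gives the same ranking without the underflow.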
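
A quick check of the underflow problem, using made-up numbers (100 tokens that each have probability 0.00001):

```python
import math

probs = [1e-5] * 100  # made-up per-token probabilities for a 100-token sequence

# Multiplying the raw probabilities underflows: 1e-500 is not representable as a float.
product = 1.0
for p in probs:
    product *= p

# Summing log probabilities stays in a comfortable numeric range.
log_prob = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_prob)  # roughly -1151.3
```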


---

HF `generate()`: set `num_beams` to use beam search.

`no_repeat_ngram_size` - an option to avoid repetition: if the next token would complete an n-gram that has already appeared, that token's probability is set to 0.
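
To make the n-gram rule concrete, here is a small sketch of the idea. It is not the HF implementation; `banned_next_tokens` is a hypothetical helper that works on a plain list of token ids and returns the tokens whose probability would be zeroed.

```python
def banned_next_tokens(token_ids, n):
    """Return the token ids that would complete an n-gram already present in token_ids."""
    if len(token_ids) < n:
        return set()  # no complete n-gram has been generated yet
    # the last n-1 tokens are the start of the n-gram the next token would complete
    prefix = tuple(token_ids[len(token_ids) - (n - 1):])
    banned = set()
    for i in range(len(token_ids) - n + 1):
        ngram = tuple(token_ids[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

# The sequence ends in (1, 2) and the 3-gram (1, 2, 3) was already seen, so 3 is banned.
print(banned_next_tokens([1, 2, 3, 1, 2], n=3))  # {3}
```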


---

The chapter continues with:

- Sampling methods / temperature (`T` rescales the logits before taking the softmax: low `T` sharpens the distribution, high `T` flattens it)
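
A minimal sketch of temperature scaling on made-up logits (`softmax_with_temperature` is an illustrative name, not a library function):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up logits for a 3-token vocabulary
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

The top token gets more of the probability mass at `T=0.5` and less at `T=2.0`.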

Another way to adjust output is to **truncate the distribution of the vocabulary:**

(avoid very unlikely tokens)

- top-k
- nucleus (also called top-p)

Top-k avoids low-probability tokens by sampling only from the k tokens with the highest probability. k has to be chosen manually (e.g. by trying different values of k and comparing them with text quality metrics).

Top-k is a **fixed cutoff**, same for every choice in the sequence.
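
A small sketch of top-k on a made-up distribution, assuming the kept probabilities are renormalised before sampling (`top_k_filter` is an illustrative helper, not the HF code):

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalise their probabilities."""
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top_ids)
    return {i: probs[i] / total for i in top_ids}

probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution over a 4-token vocabulary
filtered = top_k_filter(probs, k=2)

# sample the next token id from the truncated distribution
next_token = random.choices(list(filtered), weights=list(filtered.values()))[0]
print(filtered, next_token)
```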

The alternative is a **dynamic cutoff**: instead of fixing the number of tokens, set a condition that decides where to cut off (this is nucleus/top-p).

The condition is a probability mass. Say you impose 95%: order all tokens by descending probability, then add one token after another from the top of the list until the probabilities of the selected tokens sum to 95%.

quote from book:

Returning to Figure 5-6, the value for p defines a horizontal
line on the cumulative sum of probabilities plot, and we sample only from tokens
below the line. Depending on the output distribution, this could be just one (very
likely) token or a hundred (more equally likely) tokens.

**HF implementation: `top_p` in `generate()`**
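
A sketch of the nucleus rule on made-up distributions, to show that the number of kept tokens depends on the shape of the distribution (`top_p_filter` is an illustrative helper, not the HF code, and it assumes the kept probabilities are renormalised):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable token ids whose probabilities sum to at least p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

peaked = [0.97, 0.01, 0.01, 0.01]  # one very likely token
flat = [0.25, 0.25, 0.25, 0.25]    # four equally likely tokens

print(len(top_p_filter(peaked, p=0.95)))  # 1
print(len(top_p_filter(flat, p=0.95)))    # 4
```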

---

You can even combine the two
sampling approaches to get the best of both worlds. Setting top_k=50 and top_p=0.9
corresponds to the rule of choosing tokens with a probability mass of 90%, from a
pool of at most 50 tokens.

We can also apply beam search when we use sampling. Instead of
selecting the next batch of candidate tokens greedily, we can sample
them and build up the beams in the same way.