# Text generation using HF `transformers`

**Question**: Is recurrent (token-by-token) generation an absolute must for language models? Could one combine sequence-to-sequence models with language models to generate text? For instance, reference 2 below says that greedy decoding is well suited to machine translation tasks, which is what sequence-to-sequence models were originally designed for.

References: 

1. [How to Generate Text - HF Blog](https://github.com/huggingface/blog/blob/main/how-to-generate.md)
2. [Generation with LLMs - HF Documentation](https://huggingface.co/docs/transformers/llm_tutorial) 
3. [Contrastive search - HF Blog](https://huggingface.co/blog/introducing-csearch)

Unlike models for classification and regression tasks, generative models like LLMs are not trained to predict a single label. They instead learn to predict a sequence of tokens, one at a time, each conditioned on the tokens that came before. This is why they are often called *autoregressive* models. At each step an LLM outputs a probability distribution over every token in the vocabulary. There are several ways to turn these distributions into text:

- Greedy decoding (deterministic): pick the token with the highest probability at each step. This is the fastest decoding method, but it often leads to poor results, particularly repetitive text.
- Beam search (deterministic): keep track of the $k$ most likely sequences at each step and return the one with the highest overall probability.
- Sampling (random): draw the next token at random from the predicted distribution. Temperature controls how sharp that distribution is, and Top-K and Top-p sampling allow more fine-grained control over which tokens can be drawn.
- Contrastive search (new): a strategy proposed in the [paper](https://arxiv.org/abs/2202.06417). As described in reference 3, it is deterministic: among the `top_k` most likely tokens it picks the one that best balances model confidence against a degeneration penalty (similarity to the preceding context), weighted by `penalty_alpha` in the `model.generate` method.
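
All of these strategies decode from the same autoregressive factorization. As a reminder, it can be written as follows; the notation is an assumption borrowed from reference 1, with $W_0$ the prompt and $T$ the number of generated tokens:

$$P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0)$$

The strategies differ only in how they choose each $w_t$ from the per-step distribution $P(w_t \mid w_{1:t-1}, W_0)$.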

![beamsearch](beamsearch_hf.png)

Greedy search picks "nice" at the first step and then "woman". Beam search with $k=2$ instead returns "dog has", since its combined probability, $0.4 \times 0.9 = 0.36$, is greater than the $0.5 \times 0.4 = 0.2$ of "nice woman".
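
The comparison above can be reproduced with a few lines of plain Python. This is only a sketch on a hand-written probability table, not how `transformers` implements beam search (which works on log-probabilities and batches of tensors). The four probabilities quoted in the text are taken from the figure; the remaining entries and the names `NEXT_WORD_PROBS` and `beam_search` are illustrative assumptions.

```python
# Toy next-word table keyed by the sequence generated so far.
# nice=0.5, woman=0.4, dog=0.4 and has=0.9 are the values quoted in the text;
# every other entry is an assumed filler so that each row sums to 1.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def beam_search(prefix, num_beams, n_steps):
    """Keep the `num_beams` most probable sequences after every step."""
    beams = [(tuple(prefix), 1.0)]
    for _ in range(n_steps):
        candidates = []
        for seq, prob in beams:
            # Extend every surviving sequence by every possible next word.
            for word, word_prob in NEXT_WORD_PROBS[seq].items():
                candidates.append((seq + (word,), prob * word_prob))
        # Rank all extensions by sequence probability and prune.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


# Greedy search is the special case num_beams=1.
greedy_seq, greedy_prob = beam_search(["The"], num_beams=1, n_steps=2)[0]
beam_seq, beam_prob = beam_search(["The"], num_beams=2, n_steps=2)[0]

print(" ".join(greedy_seq), round(greedy_prob, 2))
print(" ".join(beam_seq), round(beam_prob, 2))
```

With one beam the search commits to "nice" and ends on "The nice woman"; with two beams "dog" survives the first step, which lets "The dog has" win at the second.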

**TODO**: Compare speeds, and preferably understand how to implement each of these methods without using the high level `transformers` API. 

In [6]:
import torch
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer

In [3]:
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to(torch_device)

In [None]:
# encode context the generation is conditioned on
model_inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt').to(torch_device)

In [8]:
input_ids = model_inputs.input_ids
iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)
        
pd.DataFrame(iterations)

| Step | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | I enjoy walking with my cute dog | , (21.84%) | . (15.40%) | and (13.85%) | ," (3.34%) | in (2.58%) |
| 1 | I enjoy walking with my cute dog, | but (13.05%) | and (12.59%) | so (3.49%) | I (1.98%) | which (1.86%) |
| 2 | I enjoy walking with my cute dog, but | I (28.31%) | it (6.11%) | when (5.01%) | we (3.69%) | she (3.64%) |
| 3 | I enjoy walking with my cute dog, but I | 'm (13.68%) | don (9.59%) | also (7.86%) | can (5.14%) | have (5.09%) |
| 4 | I enjoy walking with my cute dog, but I'm | not (26.38%) | also (10.40%) | afraid (5.47%) | a (4.92%) | still (2.30%) |
| 5 | I enjoy walking with my cute dog, but I'm not | sure (20.08%) | a (11.75%) | really (5.33%) | going (4.38%) | very (2.43%) |
| 6 | I enjoy walking with my cute dog, but I'm not ... | if (27.81%) | how (16.50%) | I (16.18%) | what (7.97%) | why (5.55%) |
| 7 | I enjoy walking with my cute dog, but I'm not ... | I (32.41%) | it (13.13%) | she (12.08%) | he (9.16%) | that (5.08%) |


## Greedy search

In [4]:
# generate 40 new tokens
greedy_output = model.generate(**model_inputs, max_new_tokens=40)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))



Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure


## Beam search

In [10]:
# activate beam search and early_stopping
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure


### Fix repetition: `no_repeat_ngram_size` 

Blocking repeated n-grams can be beneficial in some cases, but not in general, since some n-grams are repeated for a good reason. For instance, "New York" or "the bank" could legitimately occur several times in a chunk of text, yet with `no_repeat_ngram_size=2` each of them can appear only once.
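
To make the mechanism concrete, here is a minimal sketch of the idea behind n-gram blocking: before each step, find every token that would complete an n-gram already present in the text and forbid it. The function name `banned_next_tokens` and the word-level tokens are assumptions for illustration; the library works on token ids and applies the ban by masking the corresponding scores.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens are the prefix of the n-gram being formed.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        # If an earlier n-gram starts with the same prefix, its last token is banned.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# Hypothetical word-level history that ends just after the second "New".
history = ["I", "love", "New", "York", ".", "I", "visit", "New"]
banned = banned_next_tokens(history, ngram_size=2)
print(banned)
```

Here "York" is banned because the 2-gram "New York" has already been used, which is exactly the situation where blocking hurts.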

In [11]:
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    no_repeat_ngram_size=2, # Fixes the repetition of 2-grams
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to


### Generating multiple sequences using beam search

In [12]:
# set num_return_sequences > 1 (it must be <= num_beams)
beam_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True
)

# now we have 5 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to
1: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to
2: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's a good idea to
3: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time to take a
4: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's a good idea

In [13]:
# set num_return_sequences > 1, this time without the n-gram penalty
beam_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    # no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True
)

# now we have 5 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure
1: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again.

I'm
2: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I don't know
3: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I don't think
4: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I don't have


**Limitations of beam search**

- Beam search can work well when the length of the desired output is more or less predictable, for instance in `machine translation` and `summarization` (References: [Murray 2018](https://arxiv.org/abs/1808.10006), [Yang 2018](https://arxiv.org/abs/1808.09582)). Open-ended or variable-length tasks such as story generation are not well suited to beam search.

- Repetitive generation: n-gram or other penalties can help reduce repetition, but they might hurt in open-ended generation tasks.

- [Holtzman et al. 2019](https://arxiv.org/abs/1904.09751) show that human language does not look like beam search output, that is, like a sequence of the highest conditional probabilities. Humans tend to mix low-probability and high-probability words.

## Temperature and sampling

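Sampling comes in many forms. In the most vanilla form, the next token is drawn from the full predicted distribution, with no restriction to the top $k$ words or to a cumulative probability mass. This naive sampling can produce incoherent text, because low-probability tokens still get picked from time to time. Temperature rescales the distribution before sampling: values below 1 make it sharper, and in the limit sampling reduces to greedy search.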

$$\text{Temperature} \rightarrow 0 \implies \text{Greedy search}$$
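
The limit can be checked numerically. The sketch below divides the logits by the temperature before the softmax, which is the usual definition; the function name `softmax_with_temperature` and the three logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a three-token vocabulary.
logits = [2.0, 1.0, 0.0]
top_probs = {}
for temperature in (1.0, 0.6, 0.05):
    probs = softmax_with_temperature(logits, temperature)
    top_probs[temperature] = probs[0]
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature shrinks, almost all of the probability mass moves to the highest-scoring token, so sampling picks the same token that greedy search would.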

In [14]:
# set seed to reproduce results. Feel free to change the seed though to get different results
from transformers import set_seed
set_seed(0)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog. Not because it makes sense. I appreciate hearing her say, "You know, this together thing ain't helping but make a change for another angel. So go Bill for a ride. I'll


In [16]:
set_seed(0)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=0,
    temperature=0.6,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog. I don't think I can live without him, but I love him. I'm glad I'm home and I'm not going to have to be around him anymore."

O'Malley


## Top-K sampling

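Top-K sampling keeps only the $k$ most likely tokens and redistributes the probability mass among them. A fixed $k$ does not adapt to the shape of the distribution: when a few words carry almost all of the probability, the remaining slots are filled with very unlikely words, and when many words have comparable probability, reasonable candidates beyond the first $k$ are cut off.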
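
A minimal sketch of the filtering step, on two hand-made distributions (both are assumptions chosen to show a flat and a peaked case; `top_k_filter` is a hypothetical helper, not a library function):

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}


# Hypothetical next-token distributions.
flat_probs = {"walk": 0.22, "run": 0.20, "play": 0.20, "hike": 0.19, "swim": 0.19}
peaked_probs = {"dog": 0.95, "cat": 0.03, "car": 0.01, "sky": 0.01}

flat_kept = top_k_filter(flat_probs, k=3)
peaked_kept = top_k_filter(peaked_probs, k=3)

print(sorted(flat_kept))    # "hike" and "swim" are cut despite being plausible
print(sorted(peaked_kept))  # "car" is kept despite being very unlikely
```

The same `k` is too small for the flat distribution and too large for the peaked one, which motivates a threshold on probability mass instead of on the number of tokens.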

In [17]:
# set top_k to 50
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, the only problem is, he's so annoying when I tell him that they love him so much. When he does, I try to encourage him by saying a few words of encouragement. (This


## Top-p sampling

Rather than fixing the number of words/tokens to keep as top-k does, top-p specifies a cumulative probability threshold $p$. The top-p sampling algorithm samples from the smallest possible set of tokens whose cumulative probability exceeds $p$, so the size of that set grows and shrinks with the shape of the distribution.
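
The following sketch shows how the size of the kept set adapts. The two distributions and the helper `top_p_filter` are assumptions for illustration; the loop stops as soon as the cumulative probability reaches or exceeds `p`.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical next-token distributions.
flat_probs = {"walk": 0.22, "run": 0.20, "play": 0.20, "hike": 0.19, "swim": 0.19}
peaked_probs = {"dog": 0.95, "cat": 0.03, "car": 0.01, "sky": 0.01}

flat_kept = top_p_filter(flat_probs, p=0.92)
peaked_kept = top_p_filter(peaked_probs, p=0.92)

print(len(flat_kept), len(peaked_kept))
```

With `p=0.92` the flat distribution keeps all five candidates, while the peaked one keeps only "dog".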

In [18]:
# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(0)

# set top_p to 0.92 and deactivate top_k sampling
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.92,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog. Not because it makes sense. I appreciate hearing her say, "You know, this will just make me happy." I will check that I am able to speak correctly when I am standing next to


## Combination: Top-K + Top-p

In [19]:
# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(0)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog. Not because it makes sense. I appreciate hearing her say, "You know, this will just make me happy." I will always buy my puppy. As for her hair. I think it's
1: I enjoy walking with my cute dog and watching him play. I enjoy reading him a book about the history of China and how that made him think I was a genius, and I enjoy visiting my family for Thanksgiving.

But that
2: I enjoy walking with my cute dog. I love her because I've never been a dog, but I don't always have to be comfortable walking with her. She is my most prized possession and I will tell you how I feel when


## Contrastive search

In [22]:
# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(0)

# contrastive search: set penalty_alpha to 0.6 and top_k to 4 (no sampling)
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    # do_sample=True,
    # top_p=0.92,
    penalty_alpha=0.6, 
    top_k=4
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I don't like to be alone.

I'm going to be a little more adventurous with my dog, but I'm not going to be afraid to go out and play with him
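
The roles of `penalty_alpha` and `top_k` can be illustrated with a small sketch of the selection rule described in reference 3: each of the `top_k` candidates is scored as $(1 - \alpha)$ times its model probability minus $\alpha$ times its maximum cosine similarity to the hidden states of the preceding tokens. The two-dimensional "hidden states", the candidate probabilities, and the helper names below are made-up assumptions; a real model supplies these values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_pick(candidates, context_states, alpha):
    """Pick the candidate maximizing (1 - alpha) * prob - alpha * degeneration penalty.

    `candidates` holds (token, probability, hidden_state) for the top-k tokens.
    """
    best_token, best_score = None, -math.inf
    for token, prob, state in candidates:
        # Degeneration penalty: similarity to the closest previous token state.
        penalty = max(cosine(state, previous) for previous in context_states)
        score = (1 - alpha) * prob - alpha * penalty
        if score > best_score:
            best_token, best_score = token, score
    return best_token


# Hypothetical hidden states of the tokens generated so far.
context_states = [[1.0, 0.0], [0.8, 0.6]]
# Hypothetical top-2 candidates: "dog" is more likely but repeats the context.
candidates = [("dog", 0.6, [1.0, 0.0]), ("park", 0.4, [0.0, 1.0])]

greedy_choice = contrastive_pick(candidates, context_states, alpha=0.0)
contrastive_choice = contrastive_pick(candidates, context_states, alpha=0.6)
print(greedy_choice, contrastive_choice)
```

With `alpha=0.0` the rule reduces to greedy search and picks "dog"; with `alpha=0.6` the penalty for resembling the context outweighs the probability gap and "park" is chosen.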


## Conclusions

- Greedy search is fast, but often leads to repetitive text.
- Beam search works well when the output length is predictable (translation, summarization), but not for open-ended generation tasks.
- Sampling is great for open-ended generation tasks, but can lead to incoherent text.
- Top-K and Top-p sampling can help generate human-sounding text, but they can also lead to repetitions and incoherent text. See [Welleck 2019](https://arxiv.org/pdf/1908.04319.pdf) and [Welleck 2020](https://arxiv.org/abs/2002.02492) for a deeper dive.

In [21]:
# Example prompt
prompt = "The future of AI is"

# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(torch_device)

# Decoding strategies
# Greedy
greedy_output = model.generate(input_ids, max_length=50, num_return_sequences=1, do_sample=False)

# Beam Search
beam_output = model.generate(input_ids, max_length=50, num_return_sequences=1, num_beams=5)

# Top-K Sampling
top_k_output = model.generate(input_ids, max_length=50, do_sample=True, top_k=50)

# Top-p (Nucleus) Sampling
top_p_output = model.generate(input_ids, max_length=50, do_sample=True, top_p=0.92)

# Temperature Sampling
temperature_output = model.generate(input_ids, max_length=50, do_sample=True, temperature=0.7)

# Print the outputs
print("Greedy:", tokenizer.decode(greedy_output[0], skip_special_tokens=True))
print("Beam Search:", tokenizer.decode(beam_output[0], skip_special_tokens=True))
print("Top-K Sampling:", tokenizer.decode(top_k_output[0], skip_special_tokens=True))
print("Top-p Sampling:", tokenizer.decode(top_p_output[0], skip_special_tokens=True))
print("Temperature Sampling:", tokenizer.decode(temperature_output[0], skip_special_tokens=True))

Greedy: The future of AI is uncertain. The future of AI is uncertain.

The future of AI is uncertain. The future of AI is uncertain.

The future of AI is uncertain. The future of AI is uncertain.

The future
Beam Search: The future of AI is in the hands of the next generation of scientists and engineers.

The future of AI is in the hands of the next generation of scientists and engineers.

The future of AI is in the hands of the next generation
Top-K Sampling: The future of AI is uncertain; some may find it a valuable tool to do their own research. Others prefer to use one computer program at a time, with more efficient systems built on a single branch of their research (R&D), in parallel
Top-p Sampling: The future of AI is not necessarily a mystery for humans, but the consequences are not entirely unknown. A group of researchers at the University of Bristol recently published an update of the code that would allow humans to train some intelligent AI machines to run on the
Temperature 