# ✨ Basic decoding methods

In text generation, decoding strategies determine how a model chooses the next token in a sequence. Different strategies can lead to outputs that vary in coherence, creativity, and diversity. Understanding these methods helps you control how models generate text and tailor their behavior to different tasks.  

In this notebook, you’ll explore and compare several generation strategies to see how these choices shape the final output. 🚀  

**🙏 Acknowledgment:** This notebook draws inspiration, explanations, and graphics from the excellent Hugging Face blog post [*“How to generate text”*](https://huggingface.co/blog/how-to-generate) by [Patrick von Platen](https://huggingface.co/patrickvonplaten). Huge thanks to the author and the Hugging Face team for their amazing open resources 💙🤗

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to(torch_device)


## Greedy search

Greedy search is a decoding strategy used in autoregressive text generation.  
At each time step, the model selects the token with the **highest conditional probability** given the tokens generated so far.  

In formula form:

\begin{align}
w_t = \arg\max_{w} P(w \mid w_{1:t-1})
\end{align}

where $w_{1:t-1}$ are the previously generated tokens.


### Example
<center><img src="./assets/greedy_search.png"/></center>

Starting from the initial word **"The"**,  
the algorithm greedily chooses the next word with the highest probability at each step:

| Step | Current sequence | Next word (highest probability) | Resulting sequence |
|------|------------------|----------------------------------|--------------------|
| 1 | "The" | "nice" | "The nice" |
| 2 | "The nice" | "woman" | "The nice woman" |

So the final generated sequence is:

`("The", "nice", "woman")`

If the probabilities were:

- P("nice" | "The") = 0.5  
- P("woman" | "The nice") = 0.4  

then the overall sequence probability is:

`0.5 × 0.4 = 0.2`

This example shows how greedy search **selects the most likely token at each step**.
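
The walk-through above can be reproduced in a few lines of plain Python. This is a minimal sketch rather than how `transformers` implements greedy search: `NEXT_TOKEN_PROBS` and `greedy_search` are hypothetical names, and the lookup table simply stands in for the model, using the toy probabilities from the examples in this notebook.

```python
# Toy next-token table standing in for a language model (hypothetical helper).
# Keys are the sequence so far; values map each candidate word to
# P(word | sequence), using the toy numbers from this notebook's examples.
NEXT_TOKEN_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
}


def greedy_search(prefix, steps):
    """Extend `prefix` by picking the most probable next word at each step."""
    sequence = tuple(prefix)
    probability = 1.0
    for _ in range(steps):
        candidates = NEXT_TOKEN_PROBS.get(sequence)
        if candidates is None:  # no known continuation: stop early
            break
        # Local decision: only the current step's probabilities are compared.
        word, p = max(candidates.items(), key=lambda item: item[1])
        sequence += (word,)
        probability *= p
    return sequence, probability


greedy_sequence, greedy_probability = greedy_search(("The",), steps=2)
print(greedy_sequence, round(greedy_probability, 2))
```

Under these assumptions the sketch ends on `("The", "nice", "woman")` with probability 0.2, matching the table above.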

`**TODO (Discussion):**` What are your thoughts on Greedy Search? What limitations might it have?

### General comments about Greedy Search
- **Simple and fast** — requires minimal computation.
- Works well for tasks needing consistent, reliable completions.

### Limitations
- **Shortsightedness:** Only optimizes locally, not globally.  
- **Repetition or collapse:** Can generate repetitive or looping text.  
- **Not globally optimal:** The final sequence might not have the highest overall probability.

### Code
Let's have a look at how greedy search can be used in `transformers`. Apart from `max_new_tokens`, all generation parameters keep their default values, since greedy search is already the default decoding method.

In [2]:
model_inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt').to(torch_device)

greedy_output = model.generate(**model_inputs, max_new_tokens=40)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure


`**TODO (Discussion)**:` Do you see something wrong with the generation?

The model keeps generating the same sentence.

## Beam Search
Beam search is a decoding strategy that keeps track of several possible sequences (called **beams**) at each generation step.  
Instead of picking only the single most likely next word (as in greedy search), beam search explores multiple alternatives in parallel and selects the **overall most probable sequence** at the end.

### How It Works
1. Start with the initial token(s) — for example, `"The"`.
2. Generate a list of the top `k` most likely next words (where `k` is the **beam width**).
3. For each candidate, keep expanding one token at a time.
4. At each step, keep only the `k` sequences with the highest overall probability.
5. When generation stops (e.g., after an end token or max length), the sequence with the **highest total probability** is selected.


### Example
<center><img src="./assets/beam_search.png"/></center>

Starting from the initial word **"The"**,  
the model predicts the next tokens and expands possible continuations.  
We’ll use a **beam width of 2**.

| Step | Candidate sequences (probability) | Selection | Top beams kept |
|------|---------------------|----------------|----------------|
| 1 | "The nice" (0.5), "The dog" (0.4), "The car" (0.1) | Keep 2 beams → "The nice", "The dog" | "The nice" (0.5), "The dog" (0.4) |
| 2 | From "The nice":<br>• "The nice woman" (0.5 × 0.4 = 0.20)<br>• "The nice house" (0.5 × 0.3 = 0.15)<br>• "The nice guy" (0.5 × 0.3 = 0.15)<br>From "The dog":<br>• "The dog has" (0.4 × 0.9 = 0.36)<br>• "The dog runs" (0.4 × 0.05 = 0.02)<br>• "The dog and" (0.4 × 0.05 = 0.02) | Keep 2 best beams → "The dog has" (0.36), "The nice woman" (0.20) | "The dog has" (0.36), "The nice woman" (0.20) |

The final output is:

`("The dog has")`  
with an overall probability of `0.36`.

This example shows how beam search keeps **multiple candidates** at each step and chooses the sequence with the **highest overall probability**, instead of the locally best option at each token.
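
The same example can be traced with a small pure-Python sketch. It is an illustration under stated assumptions, not the `transformers` implementation: `NEXT_TOKEN_PROBS` and `beam_search` are hypothetical names, the table holds the toy probabilities from the example, and raw probabilities are multiplied directly (practical implementations usually sum log-probabilities instead to avoid numerical underflow).

```python
# Toy next-token table standing in for a language model (hypothetical helper),
# filled with the probabilities from the beam search example above.
NEXT_TOKEN_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
}


def beam_search(prefix, steps, num_beams):
    """Keep the `num_beams` most probable sequences after every step."""
    beams = [(tuple(prefix), 1.0)]
    for _ in range(steps):
        candidates = []
        for sequence, probability in beams:
            # Expand every surviving beam by every possible next word.
            for word, p in NEXT_TOKEN_PROBS.get(sequence, {}).items():
                candidates.append((sequence + (word,), probability * p))
        if not candidates:  # nothing left to expand
            break
        # Rank by overall sequence probability and prune to the beam width.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


final_beams = beam_search(("The",), steps=2, num_beams=2)
best_sequence, best_probability = final_beams[0]
print(best_sequence, round(best_probability, 2))
```

With `num_beams=2` the sketch ends on `("The", "dog", "has")`, while `num_beams=1` reduces to greedy search and returns `("The", "nice", "woman")`.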

### Code
Let's see how we implement this in `transformers`.

In [3]:
# activate beam search and early_stopping
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,            # Activate beam search with 5 beams
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure


`**TODO (Discussion):**` What are your thoughts on Beam Search? Does it overcome the limitations of Greedy Search and if so does it have any limitations of its own? If you're not sure yet, have a look at the next code block which might be helpful.

### Why Use It?
- Balances **exploration** and **efficiency**: it’s less myopic than greedy search but still tractable.  
- Often finds sequences with a **higher overall probability** than greedy search, although it is still not guaranteed to find the global optimum.  
- Works well for structured tasks like translation, summarization, and captioning.


### Limitations
- **Computationally heavier** than greedy search (more sequences tracked).  
- May still favor high-probability but **less diverse** outputs.  
- The result can depend strongly on the **beam width**: too small → misses good sequences; too large → slower generation and higher memory use, with no guarantee of better text.
- The repetition problem still persists.

## Avoiding Repetition with N-gram Penalties

While beam search often produces more fluent and coherent results than greedy search, it can still suffer from **repetitions** — where the same phrases or word sequences are generated multiple times.  
This happens because beam search tends to favor high-probability continuations, which can lead the model to repeat familiar patterns rather than explore new ones.

A common remedy for this issue is to apply **n-gram penalties**.
An **n-gram** is simply a sequence of *n* words (for example, a 2-gram or “bigram” could be `"the cat"`, `"in the"`, etc.).  
The **n-gram penalty** ensures that no identical n-gram appears more than once in the generated text.

In practice, this is done by **setting the probability of any next word that would recreate an existing n-gram to zero**.  
This discourages the model from repeating the same word sequences, leading to more diverse and natural outputs.
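
To make the mechanism concrete, here is a minimal sketch of the idea on word lists. The function names `banned_next_tokens` and `apply_ngram_penalty` and the toy probabilities are assumptions made for illustration; the library applies the same idea to token IDs and scores, and its internals may differ.

```python
def banned_next_tokens(tokens, n):
    """Return the words that would complete an n-gram already in `tokens`."""
    if n < 1 or len(tokens) < n:
        return set()
    # The last n-1 words are the start of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


def apply_ngram_penalty(probs, tokens, n):
    """Set banned words to probability zero and renormalize the rest."""
    banned = banned_next_tokens(tokens, n)
    kept = {word: p for word, p in probs.items() if word not in banned}
    total = sum(kept.values())
    return {word: p / total for word, p in kept.items()} if total > 0 else {}


generated = ["the", "cat", "sat", "on", "the"]
next_word_probs = {"cat": 0.6, "mat": 0.3, "dog": 0.1}  # made-up values

banned = banned_next_tokens(generated, n=2)
filtered = apply_ngram_penalty(next_word_probs, generated, n=2)
print(banned, filtered)
```

Here `"cat"` is banned because `"the cat"` already occurred, so the remaining candidates share the full probability mass even though `"cat"` was originally the most likely word.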

`**TODO:**` Let’s try this out! Create a generation in the same way as before (can be greedy or beam search), but update it by using the setting: `no_repeat_ngram_size = 2`.


In [4]:
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    no_repeat_ngram_size=2, # Activate n-gram penalization with n=2
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to


`**TODO (Discussion):**` Can you think of a case or application where this penalization approach could be proven problematic?

N-gram penalties are effective in reducing repetition, but they need to be used carefully.  
By blocking the model from repeating any n-gram, we may **accidentally remove legitimate repetitions** that are important for meaning or fluency.

For example, when generating a text about **New York**, applying a 2-gram penalty would prevent the model from using the phrase `"New York"` more than once.  
As a result, the output might sound unnatural or forced, since the model would have to find awkward paraphrases or avoid mentioning the main subject again.

`**TODO (Discussion):**` In the following figure, [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751) plot the probability a model assigns to human text against the probability of the text that beam search produces. Based on the plot, do you think beam search is a viable solution or not?

<center><img src="./assets/word_probabilities.png"/></center>

High quality human language does not follow a distribution of high probability next words. In other words, as humans, we want generated text to surprise us and not to be boring/predictable. Hence, beam search cannot be regarded as a viable generation strategy if generating high quality human language is the goal.

## Random Sampling

Random sampling is a **stochastic decoding strategy**. Instead of always picking the most likely next word (as in greedy or beam search), the model **samples** a word from the probability distribution predicted for the next token.

This means that every token has a chance to be selected, **proportional to its probability**.  
Words with higher probabilities are more likely to be chosen, but lower-probability words can still appear occasionally.

Formally, we can write this as:

\begin{align}
w_t \sim P(w \mid w_{1:t-1})
\end{align}

This means that the next word `w_t` is drawn **at random** according to its probability given all previous words `w₁` to `wₜ₋₁`.


### Example
<center><img src="./assets/sampling_search.png"/></center>

Starting from **"The"**, the model predicts the following next-token probabilities:

| Candidate | Probability |
|------------|--------------|
| "nice" | 0.5 |
| "dog" | 0.4 |
| "car" | 0.1 |

Instead of deterministically choosing `"nice"`, random sampling **draws** one token based on these probabilities.  
So depending on chance:
- `"nice"` might be selected 50% of the time,  
- `"dog"` 40% of the time,  
- `"car"` 10% of the time.

This makes random sampling **non-deterministic**: two runs with the same input may yield different outputs.
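
A quick way to check these proportions is to draw many samples and count them. The sketch below uses only Python's standard library; `sample_next_word` is a hypothetical helper and the seed value is arbitrary.

```python
import random
from collections import Counter


def sample_next_word(probs, rng):
    """Draw one word with probability proportional to its weight."""
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}
rng = random.Random(42)  # fixed seed so the experiment is repeatable

num_draws = 10_000
draws = Counter(sample_next_word(probs, rng) for _ in range(num_draws))
frequencies = {word: draws[word] / num_draws for word in probs}
print(frequencies)
```

The empirical frequencies should land close to 0.5, 0.4, and 0.1, and changing the seed changes the individual draws but not the long-run proportions.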

`**TODO:**` To activate random sampling in `transformers` you must set `do_sample=True` and deactivate Top-K sampling by setting `top_k=0` (more on this later). Do this and create a generation using the same inputs as before.

In [5]:
set_seed(42)

sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,         # Activate random sampling
    top_k=0,                # Deactivate Top-k sampling
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog but what I love about being a dog cat person is being a pet being with people who can treat you. I feel happy to be such a pet person and get to meet so many people. I


## Sampling with Temperature

When we sample from the model’s probability distribution, the results can sometimes sound **incoherent or random**. 
The text may drift off-topic or feel unnatural.  
This is a common issue with pure random sampling.

To make the generated text more **focused** and **coherent**, we can adjust the *temperature* of the distribution.  
Temperature controls how sharply or smoothly the model selects tokens from its probability distribution.

Formally, we can express this as sampling from a **temperature-scaled distribution**:

\begin{align}
w_t \sim P(w \mid w_{1:t-1})
\end{align}

but with the probabilities modified as follows:

\begin{align}
P'(w) = \mathrm{softmax}\left( \frac{\mathrm{logits}}{\mathrm{temperature}} \right)
\end{align}



### How Temperature Works

- **Low temperature (< 1)** → the distribution becomes *sharper*.  
  High-probability words become even more likely, while low-probability ones are suppressed.  
  → The model behaves more deterministically and produces more coherent, focused text.

- **High temperature (> 1)** → the distribution becomes *flatter*.  
  Low-probability words get a higher chance of being selected.  
  → The model becomes more creative, but also more prone to incoherence.

At `temperature = 1`, sampling behaves as usual — no scaling is applied.


### Example

In the previous example, the model predicted the following next-token probabilities after `"The"`:

| Candidate | Probability (T=1.0) |
|------------|----------------------|
| "nice" | 0.5 |
| "dog" | 0.4 |
| "car" | 0.1 |

When we **lower the temperature to 0.6**, the distribution becomes **sharper**:  
the most likely word, `"nice"`, gains probability mass, `"dog"` stays close to its original probability, and the unlikely `"car"` is chosen far less often.



So, by “cooling down” the sampling, we make the text **more coherent and natural**, at the cost of reducing diversity.
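
The effect can be checked numerically with a short sketch. It assumes the logits are simply the logarithms of the probabilities in the table (true up to an additive constant, which softmax ignores); `apply_temperature` is a hypothetical helper, not a `transformers` function.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a distribution as softmax(log(p) / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {word: math.log(p) / temperature for word, p in probs.items()}
    max_logit = max(scaled.values())  # subtract the max for numerical stability
    exps = {word: math.exp(v - max_logit) for word, v in scaled.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}
cooled = apply_temperature(probs, 0.6)   # sharper distribution
heated = apply_temperature(probs, 1.5)   # flatter distribution

print({word: round(p, 3) for word, p in cooled.items()})
print({word: round(p, 3) for word, p in heated.items()})
```

At `temperature=0.6` the probabilities come out at roughly 0.57, 0.39, and 0.04: most of the mass lost by `"car"` moves to `"nice"`. At `temperature=1.5` the opposite happens and `"car"` becomes more likely than before.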

`**TODO:**` Create a generation using sampling with temperature by setting the `temperature` argument to `0.6`.

In [6]:
set_seed(42)

sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=0,
    temperature=0.6,      # Activating temperature scaling with temperature=0.6
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog but I also love the fact that my cat is not a dog. She is a good, loving dog. I do not like to be held back by other dogs but I think that I have to


## Top-k Sampling

Pure random sampling considers the *entire* vocabulary when picking the next token.
As a result, even low-probability, nonsensical words can occasionally be selected.  
This can lead to incoherent or off-topic text.

**Top-k sampling** (introduced by Fan et al., 2018) solves this problem by **restricting** the model’s choices.  
Instead of sampling from all possible words, it only samples from the **k most probable** next tokens.

Formally, we still sample:

\begin{align}
w_t \sim P(w \mid w_{1:t-1})
\end{align}

but after truncating the distribution to its **top k** candidates. All other words are assigned a probability of zero and the remaining ones are renormalized to sum to 1.

### How It Works

1. The model predicts a probability distribution for the next token.  
2. We keep only the **k most likely words** (e.g., the top 50).  
3. The probabilities of these top-k words are **renormalized**.  
4. A random token is then sampled from this restricted subset.

This ensures that only **plausible words** are considered, while still maintaining some **randomness** and **diversity**.


### Example
<center><img width=1000 src="./assets/top_k_sampling.png"/></center>

Suppose the model predicts the following probabilities after `"The"`:

| Candidate | Probability |
|------------|--------------|
| "nice" | 0.35 |
| "dog" | 0.25 |
| "car" | 0.20 |
| "quantum" | 0.12 |
| "wibble" | 0.08 |

If we set **k = 3**, we keep only the top-3 candidates:

`"nice"`, `"dog"`, `"car"`

The remaining words (like `"quantum"` and `"wibble"`) are ignored completely.  
Then, the model samples randomly **only** among the top 3 options.

This reduces the chance of producing incoherent or irrelevant tokens  
while still allowing variation between `"nice"`, `"dog"`, and `"car"`.
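
The truncate-and-renormalize step can be sketched as follows. `top_k_filter` is a hypothetical helper that works on a word-to-probability dictionary; it illustrates the idea and is not the library's implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize them to sum to 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}


probs = {"nice": 0.35, "dog": 0.25, "car": 0.20, "quantum": 0.12, "wibble": 0.08}
filtered = top_k_filter(probs, k=3)
print({word: round(p, 4) for word, p in filtered.items()})
```

The three kept words originally cover 0.80 of the probability mass, so after renormalization each is divided by 0.80: `"nice"` rises from 0.35 to about 0.44, for example. A token is then sampled from `filtered` exactly as in pure random sampling.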

`**TODO:**` Create a generation using Top-k sampling by setting the `top_k` argument to `50`.

In [7]:
set_seed(42)

sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50            # Activate Top-k sampling with k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog but what I love about being a dog is I see a beautiful pet being cared for – I love having the opportunity to see her every day so I feel very privileged to have been able to help this


`**TODO (Discussion):**` What are your takeaways on Top-K sampling? Can you see any limitations?

### Takeaways
Top-k sampling improves random sampling by limiting the model’s choices to the most probable tokens: 
- Helps prevent incoherent or nonsensical generations.
- Keeps a good balance between diversity and fluency.
- Works best when combined with temperature scaling for finer control.

In short:<br>
→ Pure sampling = too random. <br>
→ Top-k sampling = controlled creativity.

### Limitations

While Top-k sampling improves coherence compared to pure sampling, it has one key limitation:  
it uses a **fixed number (k)** of candidate words at every step, regardless of how the probability distribution looks.

This means:
- For **sharp distributions**, Top-k might still include **unlikely or irrelevant words**, which can lead to gibberish.  
- For **flat distributions**, it might **exclude reasonable options**, limiting diversity and creativity.

Because it doesn’t adapt dynamically to the shape of the distribution `P(w | w₁:ₜ₋₁)`,  
Top-k sampling can sometimes be **too restrictive** or **too random** depending on context.
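
A small sketch makes both failure modes visible. The two distributions below are made up for illustration, and `top_k_coverage` is a hypothetical helper that reports which words a fixed `k` keeps and how much probability mass they cover.

```python
def top_k_coverage(probs, k):
    """Return the k most probable words and the probability mass they cover."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    return [word for word, _ in kept], sum(p for _, p in kept)


# Made-up distributions, for illustration only.
sharp = {"dog": 0.90, "cat": 0.06, "car": 0.02, "wibble": 0.01, "quantum": 0.01}
flat = {"walk": 0.22, "run": 0.21, "play": 0.20, "sit": 0.19, "sleep": 0.18}

sharp_words, sharp_mass = top_k_coverage(sharp, k=3)
flat_words, flat_mass = top_k_coverage(flat, k=3)
print(sharp_words, round(sharp_mass, 2))
print(flat_words, round(flat_mass, 2))
```

With `k=3`, the sharp distribution still keeps `"car"` even though it has only a 0.02 probability, while the flat distribution drops `"sit"` and `"sleep"`, which together hold 0.37 of the mass and are almost as likely as the words that were kept.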


## Top-p (Nucleus) Sampling

**Top-p sampling** (also known as **nucleus sampling**) was introduced as an improvement over Top-k sampling.

Instead of keeping a **fixed number (k)** of the most probable words at each step, Top-p sampling dynamically chooses the **smallest possible set of words** whose **cumulative** probability reaches or exceeds a threshold *p* (for example, 0.9).  

Formally, we still sample:

\begin{align}
w_t \sim P(w \mid w_{1:t-1})
\end{align}

but we truncate the distribution to the smallest set of top-ranked words $V^{(p)}$ such that:
\begin{align}
\sum_{w \in V^{(p)}} P(w \mid w_{1:t-1}) \geq p
\end{align}


and renormalize their probabilities before sampling.

### How It Works

1. The model predicts a probability distribution for the next token.  
2. Sort tokens by probability (highest → lowest).  
3. Keep the **smallest subset** of tokens whose cumulative probability ≥ *p*.  
4. Sample randomly from this subset.

This allows the model to **adapt dynamically**:  
- If the distribution is *sharp*, only a few tokens are considered.  
- If the distribution is *flat*, more tokens are included.  

The result is a flexible trade-off between **coherence** and **diversity**.


### Example
<center><img width=1000 src="./assets/top_p_sampling.png"/></center>

Assume the model predicts these next-token probabilities after `"The"`:

| Candidate | Probability | Cumulative |
|------------|--------------|-------------|
| "nice" | 0.5 | 0.5 |
| "dog" | 0.3 | 0.8 |
| "car" | 0.1 | 0.9 |
| "quantum" | 0.05 | 0.95 |
| "wibble" | 0.05 | 1.00 |

If we set **p = 0.9**, we include tokens until the cumulative probability reaches 0.9:  
→ `"nice"`, `"dog"`, and `"car"` are kept; the rest are discarded.  

Then the model samples **only** from those top-p words, ensuring that the sample space adapts to the probability distribution’s shape.
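
The selection rule can be sketched in plain Python. `top_p_filter` is a hypothetical helper, and the small tolerance `eps` is an assumption added so that floating-point rounding in the running sum does not pull in an extra word when the cumulative probability lands exactly on *p*.

```python
def top_p_filter(probs, p, eps=1e-9):
    """Keep the smallest top-ranked set with cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p - eps:  # threshold reached: stop adding words
            break
    # Renormalize the kept words so they sum to 1 again.
    return {word: prob / cumulative for word, prob in kept}


probs = {"nice": 0.5, "dog": 0.3, "car": 0.1, "quantum": 0.05, "wibble": 0.05}
nucleus = top_p_filter(probs, p=0.9)
print({word: round(q, 3) for word, q in nucleus.items()})

# A sharper, made-up distribution: the nucleus shrinks to a single word.
sharp = {"dog": 0.95, "cat": 0.03, "car": 0.02}
print(top_p_filter(sharp, p=0.9))
```

For the table above the nucleus contains three words, whereas the sharper distribution keeps only `"dog"`. The size of the candidate set follows the shape of the distribution instead of being fixed in advance.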

In [8]:

set_seed(42)

sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.92,          # Activate nucleus sampling with p=0.92
    top_k=0              # Deactivate Top-k sampling
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog but what I love about being a dog cat person is being a pet being with people who can treat you. I feel happy to be such a pet person and get to meet so many people. I
