# 🧠 Text Generation with Hugging Face Transformers

In this notebook, we'll take a **first look at text generation** using Hugging Face's 🤗 `transformers` library.

Modern language models, such as GPT-style models, can **generate coherent text continuations** given a prompt.  
However, the way they generate text can vary dramatically depending on the *generation strategy* used, such as:

- **Greedy decoding** – always picks the most likely next token.  
- **Sampling** – randomly samples from possible next tokens.  
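- **Top-k / Top-p sampling** – restricts sampling to the *k* most likely tokens (top-k) or to the smallest set of tokens whose cumulative probability reaches *p* (top-p), trading diversity for focus.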
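
To make the strategies listed above concrete, here is a minimal pure-Python sketch that applies each of them to a single next-token distribution. The tokens and logit values are made up for illustration, and the helper names (`greedy`, `sample`, `top_k_filter`, `top_p_filter`) are hypothetical; they are not part of `transformers`.

```python
import math
import random

# Hypothetical next-token logits for the prompt "Once upon a" (made-up values).
logits = {" time": 4.0, " hill": 2.5, " day": 2.0, " whim": 0.5, " zebra": -1.0}

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def greedy(probs):
    """Greedy decoding: always return the most probable token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Plain sampling: draw one token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalize."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept.append(tok)
        cumulative += probs[tok]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}

probs = softmax(logits)
rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
print("greedy:", greedy(probs))
print("sampled:", [sample(probs, rng) for _ in range(5)])
print("top-k (k=2) pool:", list(top_k_filter(probs, 2)))
print("top-p (p=0.9) pool:", list(top_p_filter(probs, 0.9)))
```

Greedy decoding returns the same token every time, while sampling can return any token in the pool. Top-k and top-p only shrink that pool before the draw happens.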

We’ll use the [`transformers.GenerationConfig`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig) class to easily control and compare these settings.

The goal of this notebook is **not to complete exercises**, but to **experiment** and **develop intuition** about how generation works.  
You’re encouraged to:
- Play around with different configuration parameters,  
- Observe how they affect the model’s output, and  
- Examine the output of the model at different stages.

In the **next notebook**, we’ll take a **deeper dive** into the different generation strategies themselves, exploring how methods like greedy search, beam search, and sampling work under the hood and how they impact the final output.

By the end of this notebook, you’ll have a solid foundation for experimenting with text generation and preparing to analyze decoding strategies in more depth.

## 🧩 Loading a Pretrained Model

We’ll first load a pretrained model and tokenizer that can perform text generation.

We'll use **DistilGPT-2**, a lightweight version of GPT-2, which runs easily on most laptops and Colab environments.

We'll use `AutoTokenizer` and `AutoModelForCausalLM`. These automatically select the right tokenizer/model classes for causal (left-to-right) language modeling.


In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

## ✍️ Basic Text Generation

Let’s start by generating text *without* changing any generation settings.

We'll use `model.generate()` with default parameters to see how the model continues a prompt like:

> "Once upon a time"

The default configuration typically uses *greedy decoding*, which means the model always chooses the most probable next token.
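
As a rough illustration of why greedy decoding tends to loop, the sketch below runs a greedy loop over a tiny hand-written bigram table. The table and its probabilities are invented for this example, and unlike a real language model it conditions only on the last token, not on the whole context.

```python
# Hypothetical bigram "model": for each token, the probabilities of the next token.
# All values are made up for illustration.
next_token_probs = {
    "Once": {"upon": 0.9, "the": 0.1},
    "upon": {"a": 0.8, "the": 0.2},
    "a": {"time": 0.6, "war": 0.4},
    "time": {"the": 0.7, "a": 0.3},
    "the": {"war": 0.5, "time": 0.3, "a": 0.2},
    "war": {"the": 0.6, "a": 0.4},
}

def greedy_generate(prompt, max_new_tokens):
    """Repeatedly append the single most probable next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        candidates = next_token_probs[tokens[-1]]
        tokens.append(max(candidates, key=candidates.get))
    return tokens

print(" ".join(greedy_generate(["Once", "upon", "a"], max_new_tokens=8)))
```

Because the most probable choice is deterministic, the loop falls into a cycle as soon as it revisits a state, much like the repeated sentence in the model output further down.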


In [2]:
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time of war, the United States was the only country in the world to have a military presence. The United States was the only country in the world to have a military presence. The United States was the only country in the world to have a military presence


## ⚙️ GenerationConfig
- `GenerationConfig` is a Hugging Face class that defines **how a model generates text**. It acts as a central configuration for text generation parameters.  
- It controls aspects of decoding such as **maximum length**, **sampling strategy**, **temperature**, **top-k**, and **top-p** values.  
- Instead of passing these parameters directly to `model.generate()`, you can store them neatly inside a `GenerationConfig` object for cleaner, reusable code.

For more details on `GenerationConfig`, see the [documentation](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig).

📘 **Tip:** You can inspect a model’s default generation configuration at any time:

In [3]:
# `model` was already loaded in the first cell, so we can inspect it directly.
print(model.generation_config)

GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256
}



### 🕒 Length & Stopping Criteria

These arguments control **how long the model continues generating text** and **what conditions cause it to stop**.  
They help ensure the output is the right length and prevent the model from generating endlessly.

- **`max_new_tokens`** — Sets the *maximum number of new tokens* the model is allowed to generate after the input prompt.  
  This defines the upper bound of the output length.

- **`min_new_tokens`** — Specifies the *minimum number of new tokens* the model must produce before it can stop.  
  This prevents the generation from ending too early.

- **`max_time`** — Limits the total generation time (in seconds).  
  Useful for ensuring that generation doesn’t run indefinitely for complex prompts or large models.

- **`stop_strings`** — Defines one or more text sequences that immediately stop generation when produced.  
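  Commonly used to signal the end of structured outputs (e.g., “END”, “User:”, or special markers in dialogues). Because matching happens on decoded text, `generate()` needs the tokenizer passed through its `tokenizer` argument.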

Together, these parameters give you fine-grained control over the **duration and stopping behavior** of text generation, allowing you to tailor output length and prevent runaway completions.
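
The interplay of these limits can be sketched with a toy loop that consumes tokens from a fixed list instead of a real model. Everything here is an assumption made for illustration: `generate_with_limits` is a hypothetical helper, the token stream is invented, and skipping an early EOS only approximates how the library suppresses it. `max_time` is left out to keep the sketch deterministic.

```python
def generate_with_limits(token_stream, max_new_tokens, min_new_tokens=0,
                         stop_strings=(), eos_token="<eos>"):
    """Consume tokens from a stand-in 'model' and apply simple stopping rules."""
    generated = []
    for token in token_stream:
        if token == eos_token:
            if len(generated) < min_new_tokens:
                continue  # too early to stop: skip EOS and keep generating
            break  # EOS ends generation once the minimum is met
        generated.append(token)
        text = "".join(generated)
        if any(text.endswith(s) for s in stop_strings):
            break  # a stop string was just completed
        if len(generated) >= max_new_tokens:
            break  # hard upper bound reached
    return "".join(generated)

# Hypothetical continuation the stand-in model would produce, token by token.
stream = [" in", " a", "<eos>", " land", " far", " away", ".", " THE", " END", " of", " it"]

print(repr(generate_with_limits(stream, max_new_tokens=40)))
print(repr(generate_with_limits(stream, max_new_tokens=40, min_new_tokens=3,
                                stop_strings=["THE END"])))
print(repr(generate_with_limits(stream, max_new_tokens=4, min_new_tokens=3)))
```

In the first call the EOS token ends generation early. In the second, `min_new_tokens` pushes past it until the stop string appears. In the third, `max_new_tokens` cuts the output off first.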


In [4]:
# Control how long the model generates text and when it stops
length_config = GenerationConfig(
    max_new_tokens=40,          # limit new tokens generated
    min_new_tokens=10,          # ensure at least 10 new tokens
    max_time=5.0,               # stop after 5 seconds if still running
    stop_strings=["THE END"]    # stop when model outputs this string
)

output = model.generate(**inputs, generation_config=length_config, tokenizer=tokenizer)
print(tokenizer.decode(output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time of war, the United States was the only country in the world to have a military presence. The United States was the only country in the world to have a military presence. The United States was the


### 🎲 Sampling & Logits Control

These parameters control **how the model handles repetition, forbidden words, and special tokens** during generation.  
They allow you to guide the model toward more natural, diverse, and constrained outputs.

- **`repetition_penalty`** — Penalizes the probability of tokens that have already appeared in the generated text.  
  A value of 1.0 applies no penalty; values above 1.0 discourage the model from repeating words or phrases too often, leading to more varied sentences.

- **`no_repeat_ngram_size`** — Prevents the model from repeating any *n-gram* (sequence of *n* tokens) that has already been generated.  
  For example, setting this to 2 ensures that no two-token sequence appears twice, reducing redundancy. Note that tokens are not always whole words.

- **`bad_words_ids`** — Defines specific tokens or sequences that the model is not allowed to generate.  
  This is useful for filtering out unwanted terms, sensitive content, or domain-irrelevant vocabulary.

- **`forced_bos_token_id`** — Forces a specific token to be the *first generated token*, right after the decoder start token.  
  Mainly relevant for encoder-decoder models, for example multilingual models where the first token selects the target language.

- **`forced_eos_token_id`** — Forces a specific *end-of-sequence (EOS)* token as the last generated token when the maximum length is reached.  
  Ensures that generations cut off by the length limit still end with an explicit EOS token.

Together, these settings provide fine control over **output quality and structure**, helping to reduce repetition, enforce constraints, and maintain coherent beginnings and endings.
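
To build intuition for the two repetition controls, here is a small pure-Python sketch of the underlying idea, working on plain lists instead of tensors. The function names are hypothetical and the logic reflects our understanding of how such logits adjustments typically work; it is not the library's actual implementation.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Make already-generated tokens less likely.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so both move toward 'less likely' when penalty > 1.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted

def banned_by_no_repeat_ngram(generated_ids, n):
    """Return the token ids that would complete an n-gram seen before."""
    if n <= 0 or len(generated_ids) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated_ids[len(generated_ids) - (n - 1):])
    banned = set()
    for start in range(len(generated_ids) - n + 1):
        ngram = tuple(generated_ids[start:start + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

# Made-up logits over a 4-token vocabulary; tokens 0 and 1 were already generated.
print(apply_repetition_penalty([2.0, -1.0, 0.5, 3.0], generated_ids=[0, 1, 0], penalty=2.0))

# History 5, 7, 5: the bigram (5, 7) already exists, so 7 may not follow the final 5.
print(banned_by_no_repeat_ngram([5, 7, 5], n=2))
```

The penalty only lowers scores and leaves the final choice to the decoding strategy, whereas the n-gram rule removes candidates outright.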

In [5]:
sampling_config = GenerationConfig(
    repetition_penalty=1.2,      # values > 1.0 discourage tokens that already appeared
    no_repeat_ngram_size=2,      # never repeat any 2-token sequence
    # Forbid "boring" with and without a leading space; GPT-2's BPE tokenizes the two differently.
    bad_words_ids=[tokenizer.encode(w, add_special_tokens=False) for w in ("boring", " boring")],
    forced_bos_token_id=tokenizer.bos_token_id,  # token to force as the first generated token
    forced_eos_token_id=tokenizer.eos_token_id   # token to force as the last one at the length limit
)

output = model.generate(**inputs, generation_config=sampling_config)

print(tokenizer.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time of war, the United States was in danger. The Soviet Union had been on its way to


### 📤 Output & Return Options

These arguments control **what information is returned** from the generation process and **how it is formatted**.  
They are useful for analyzing the model’s internal behavior or generating multiple variations of a prompt.

- **`num_return_sequences`** — Specifies how many independent generations the model should produce for the same input.  
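  This allows you to compare multiple possible completions and choose the best one. Values above 1 need sampling (`do_sample=True`) or beam search; plain greedy decoding supports only a single sequence.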

- **`do_sample`** — Enables random sampling rather than deterministic decoding.  
  When active, the model can generate more diverse outputs by introducing controlled randomness.

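- **`output_scores`** — Returns the token-level prediction scores at each generation step, i.e. the logits *after* logits processors (such as repetition penalties) have been applied.  
  These are not probabilities yet; apply a softmax to analyze how confident the model was about each generated token.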

- **`output_attentions`** — Returns the attention weights from the model’s layers.  
  Useful for understanding which parts of the input the model focused on while generating each token.

- **`output_hidden_states`** — Returns the hidden state representations for each layer.  
  This provides insight into how the model’s internal representations evolve over time.

- **`output_logits`** — Returns the raw, unnormalized prediction values (*logits*) for each token before the softmax step.  
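  Unlike `output_scores`, these values are taken before any logits processors are applied, which makes them helpful for debugging or inspecting the model’s raw output distribution.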

- **`return_dict_in_generate`** — When set to `True`, returns a dictionary-like object containing all outputs  
  (generated sequences, scores, attentions, hidden states, logits, etc.) instead of just the token IDs.  
  This makes it easier to access and analyze different components of the generation process.

Together, these options make it possible to not only **generate text**, but also to **inspect, compare, and interpret** how the model arrived at its output.
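
Since `scores` and `logits` are unnormalized, a common next step is to convert them to probabilities. The sketch below does this in plain Python on invented numbers for a 4-token vocabulary; in a real run each step would instead hold one tensor of shape `(batch_size, vocab_size)`. The names `step_scores` and `chosen_ids` are placeholders for this example.

```python
import math

def log_softmax(values):
    """Numerically stable log-probabilities from unnormalized scores."""
    peak = max(values)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in values))
    return [v - log_total for v in values]

# Hypothetical per-step scores over a 4-token vocabulary (made-up values),
# standing in for what `scores` or `logits` would hold at each generation step.
step_scores = [
    [2.0, 0.5, 0.1, -1.0],
    [0.2, 0.1, 3.0, 0.0],
]
chosen_ids = [0, 2]  # the token id that was generated at each step

for step, (scores, token_id) in enumerate(zip(step_scores, chosen_ids)):
    prob = math.exp(log_softmax(scores)[token_id])
    print(f"step {step}: token {token_id} had probability {prob:.3f}")

# Summing per-step log-probabilities gives the log-probability of the whole continuation.
sequence_log_prob = sum(log_softmax(s)[t] for s, t in zip(step_scores, chosen_ids))
print(f"sequence log-probability: {sequence_log_prob:.3f}")
```

Working in log space avoids multiplying many small probabilities together, which matters once sequences get longer than a few tokens.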


In [6]:
# Define a generation configuration
output_config = GenerationConfig(
    num_return_sequences=3,         # generate 3 independent completions
    do_sample=True,                 # enable random sampling (covered in the next notebook)
    output_scores=True,             # return token-level scores
    output_attentions=True,         # (optional) set True to inspect attention
    output_hidden_states=True,      # (optional) set True to inspect hidden states
    output_logits=True,             # return raw logits for analysis
    return_dict_in_generate=True    # return a dictionary instead of just IDs
)

# Generate text using the custom configuration
generation = model.generate(**inputs, generation_config=output_config)

for i, seq in enumerate(generation.sequences):
    print(f"--- Output {i+1} ---")
    print(tokenizer.decode(seq, skip_special_tokens=True))
    print()

# Access additional returned information
print("Available output keys:", generation.keys())

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


--- Output 1 ---
Once upon a time of darkness a powerful and powerful force would destroy the planet and cause a tidal wave that would obliterate

--- Output 2 ---
Once upon a time of war, the world became divided into two parts, which were divided into the nations of the seven

--- Output 3 ---
Once upon a time of war the United States would cease its military activities in Iraq, and would maintain its occupation until the

Available output keys: odict_keys(['sequences', 'scores', 'logits', 'attentions', 'hidden_states', 'past_key_values'])


**TODO:** Examine all output keys other than `sequences`. These correspond to concepts we've already covered in lectures and exercises. This is your chance to explore them in a practical way: try printing their values, types, and shapes (or anything else) to build a deeper understanding and intuition.
