In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import random
import datasets
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"


# Selecting the font size here will affect all the figures in this notebook
# Alternatively, you can set the font size for axis labels of each figure separately
font = {'size': 16}
matplotlib.rc('font', **font)

In [2]:
# This is a code snippet from exercise 1
# Set the seed for reproducibility
def set_seed(seed):
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)


set_seed(1)

## Homework 1: Recommended Reading: Understanding Attention

To build strong foundations in how modern Language Models work and to understand the **attention mechanism**, read:

[The Transformer Family - Lilian Weng](https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/)

Answer the following questions:
* What is the role of attention in a Transformer?
* How does self-attention work?
* How does attention differ from earlier sequence-modeling mechanisms (e.g., recurrence)?

## Homework 2: Sampling Strategies for Language Generation

When generating text with a language model, always picking the most likely next token (greedy decoding) tends to produce dull, repetitive output. Sampling from the model's predicted distribution instead introduces variation and creativity.

Two of the most common sampling strategies are:

**Top-k Sampling**
Instead of choosing from all possible tokens, we limit the model to the k most likely next tokens. This narrows the focus to a small, high-confidence pool while still allowing for variation.
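
As a rough illustration of the idea, here is a minimal pure-Python sketch on a toy distribution. The token probabilities are made up, and `top_k_filter` is a hypothetical helper name, not part of any library. The sketch assumes the kept probabilities are renormalized so that they sum to 1 before sampling.

```python
# Toy next-token distribution (made-up numbers, for illustration only)
probs = {"the": 0.40, "a": 0.25, "my": 0.15, "this": 0.10, "one": 0.06, "zebra": 0.04}


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    # Sort tokens from most to least probable and keep the first k
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


filtered = top_k_filter(probs, k=3)
print(filtered)  # only "the", "a", and "my" remain
```

Note that `k` is fixed regardless of how peaked or flat the distribution is.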

**Top-p (Nucleus) Sampling**
Rather than picking a fixed number k, top-p sampling looks at the smallest set of tokens whose cumulative probability exceeds p (e.g., 90%). It adapts to the uncertainty of the model: if the model is confident, fewer tokens are considered; if it's unsure, more are included.
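
The adaptive behavior can be seen in a small pure-Python sketch. The two toy distributions are made up, and `top_p_filter` is a hypothetical helper name. The sketch assumes the set is closed as soon as the cumulative probability reaches or exceeds `p`; whether the boundary is inclusive is a convention that differs between implementations.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of most-probable tokens whose cumulative probability reaches p."""
    kept = []
    cumulative = 0.0
    # Walk through tokens from most to least probable
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus is complete
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Made-up distributions: one peaked (confident), one flat (unsure)
confident = {"Paris": 0.92, "Lyon": 0.04, "Nice": 0.03, "Rome": 0.01}
unsure = {"red": 0.30, "blue": 0.28, "green": 0.22, "black": 0.12, "pink": 0.08}

print(len(top_p_filter(confident, p=0.9)))  # 1 token is enough
print(len(top_p_filter(unsure, p=0.9)))     # 4 tokens are needed
```

With the same `p`, the confident distribution keeps a single token while the flat one keeps four.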

These techniques help balance randomness and relevance in generation.

### (2a)
📘 **Reading Assignment:**
To understand the different decoding methods for language generation, read this short article:
👉 [Huggingface Blog](https://huggingface.co/blog/how-to-generate)




### (2b)
In this task, you’ll gain hands-on experience with text generation using pre-trained LLMs and explore how different sampling strategies affect the model's output.
You will implement a custom text generation function using Hugging Face’s transformers library, and evaluate the impact of temperature, top-k, and top-p sampling.

#### 1. Model and Tokenizer Setup
Load a pre-trained **causal language model** and its **tokenizer** using Hugging Face’s `AutoModelForCausalLM` and `AutoTokenizer`.
We can reuse the code from Exercise 1:

In [3]:
model_name = "Qwen/Qwen2.5-0.5B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


#### 2. Implement a custom text generation function
Your function should:
* Accept a text prompt and parameters:
    - `gen_len`: number of tokens to generate
    - `temperature`: value the logits are divided by before the softmax (controls randomness)
    - `top_k`: limits sampling to the top-k most probable tokens
    - `top_p`: applies nucleus sampling, keeping the smallest set of most probable tokens whose cumulative probability exceeds `top_p`

* Apply:
    - Temperature scaling  
    - top-k filtering  
    - top-p (nucleus) filtering (with cumulative probability logic)

Use PyTorch or Hugging Face utilities where possible, but ensure you understand and explain the logic behind the sampling.
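
To build intuition for the temperature step, here is a minimal pure-Python sketch on three made-up logits. `softmax_with_temperature` is a hypothetical helper name, and the sketch assumes the common convention of dividing the logits by the temperature before the softmax. It is a toy illustration, not the PyTorch solution the task asks for.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        # Division by zero is undefined; a temperature of 0 is usually handled as greedy decoding
        raise ValueError("temperature must be positive; treat 0 as greedy (argmax) decoding")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
for t in (0.5, 1.0, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution.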

*Hint:* You can reuse some of the code from exercise 1 below.

In [4]:
# This is a code snippet from exercise 1
def generate(prompt: str, gen_len: int) -> str:
    """Greedy decoding: append the highest-scoring token gen_len times."""
    tokenized_input = tokenizer(prompt, return_tensors="pt")
    input_ids = tokenized_input["input_ids"]
    for _ in range(gen_len):
        with torch.no_grad():
            # With a single, unpadded input we can omit the attention mask for a causal LLM
            output = model(input_ids).logits
        output = output.squeeze(dim=0)  # shape: (seq_len, vocab_size)
        next_token_scores = output[-1]  # scores for the token following the last position
        next_token_id = next_token_scores.argmax(dim=-1)
        input_ids = torch.cat((input_ids, next_token_id.reshape(1, 1)), dim=-1)
    return tokenizer.decode(input_ids[0])

prompt = "I love Bochum because"
generate(prompt, 30)

'I love Bochum because it is a beautiful city with a lot of history. The city is located in the middle of the Rhine and is surrounded by beautiful forests. The'

In [5]:
# TODO Your code here


#### 3. Experiment
For a fixed prompt, generate text under the following conditions:
- Temperature values: `0.0`, `0.5`, `1.0`, `1.5` (handle `0.0` as greedy decoding, since the logits cannot be divided by zero)
- Top-k values: `5`, `10`, `15`
- Top-p values: `0.8`, `0.9`, `0.95`

Print the outputs and compare the results.
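
One way to organize the runs is sketched below, assuming you want every combination of the three parameters; varying one parameter at a time is an equally valid reading of the task. `generate_with_sampling` is a hypothetical placeholder that only returns a label, so replace it with your own function from step 2.

```python
from itertools import product

temperatures = [0.0, 0.5, 1.0, 1.5]
top_ks = [5, 10, 15]
top_ps = [0.8, 0.9, 0.95]


def generate_with_sampling(prompt, gen_len, temperature, top_k, top_p):
    """Hypothetical placeholder: swap in your own generation function from step 2."""
    return f"<text for T={temperature}, k={top_k}, p={top_p}>"


prompt = "I love Bochum because"
results = []
# product() yields every (temperature, top_k, top_p) combination
for temperature, top_k, top_p in product(temperatures, top_ks, top_ps):
    text = generate_with_sampling(prompt, 30, temperature, top_k, top_p)
    results.append(
        {"temperature": temperature, "top_k": top_k, "top_p": top_p, "text": text}
    )

print(len(results))  # 4 * 3 * 3 = 36 runs
for row in results[:3]:
    print(row)
```

The list of dictionaries can be passed to `pd.DataFrame` for a side-by-side comparison of the outputs.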

In [6]:
# TODO Your code here

#### 4. Reflect & Analyze
Observe how the generated texts differ under various settings.  
For each variation, discuss:
* How did the output change in diversity, coherence, and creativity?
* Which combination(s) produced the most interesting or usable results?


In [7]:
# TODO Your answer here