# Chapter 4: Shakespeare NLP Exercises: From Classical NLP to Neural Language Models

**This exercise has 20 smaller tasks, for a total of 18 points, and you will have two weeks to complete it instead of one. Don't forget to submit your solutions to GitHub!**

Welcome to this comprehensive hands-on workshop on Natural Language Processing! You will journey from classical text processing techniques to building and comparing neural language models using Shakespeare's complete works.

## What You Will Learn

- **Stage 0**: Corpus processing, tokenization, and word embeddings
- **Stage 1**: Character-level RNN language models
- **Stage 2**: Word-level RNN language models with theatrical chat interfaces
- **Stage 3**: LSTM language models and architecture comparison

## Instructions for Students

Throughout this notebook, you will find code sections marked with:
```python
# START STUDENT CODE
...
# END STUDENT CODE
```

Any and all required tasks can and should be completed by writing code into these sections.

XXXXX lösungen rauscutten, hinweise aus non-solution notebook wieder einpflegen

---


## Setup and Dependencies

First, let's install the required packages and check our environment.


In [12]:
# Install required packages (uncomment if running in Colab)
# !pip install torch numpy requests

import torch
import numpy as np
import os
import re
import requests
from collections import Counter

# Check device availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"PyTorch version: {torch.__version__}")
print(f"Using device: {device}")
if device.type == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Create data directories
os.makedirs("data/works", exist_ok=True)
print("Data directories created.")


PyTorch version: 2.9.0+cpu
Using device: cpu
Data directories created.


---

# Stage 0: Corpus & Classical NLP Foundations

In this stage, you will work with the complete works of Shakespeare to learn fundamental text processing techniques:
- Downloading and segmenting a large text corpus
- Tokenization and normalization
- Working with pretrained word embeddings (GloVe)

---

## Stage 0.1 – Shakespeare Corpus Download & Segmentation

This exercise introduces you to working with a **real, unstructured text corpus**. You will download the complete works of William Shakespeare from Project Gutenberg and convert the raw file into a collection of **separate, clean text files**, one per work. To make sure everyone can actually start working on the exercises and gets stuck right at the start, we provide code that starts you off and does the following things for you:

1. **Download the Shakespeare corpus** from Project Gutenberg
2. **Analyze the structure** - find the Table of Contents and title markers
3. **Segment the corpus** into separate work files
4. **Verify your segmentation** by printing statistics


In [13]:
RAW_FILE = 'data/pg100.txt'
WORKS_DIR = 'data/works'

def download_corpus():
    """Download the Shakespeare corpus from Project Gutenberg if not present."""

    if not os.path.exists(RAW_FILE):
        print(f"Downloading corpus...")
        url = "https://www.gutenberg.org/cache/epub/100/pg100.txt"
        try:
            response = requests.get(url)
            response.raise_for_status()
            with open(RAW_FILE, 'wb') as f:
                f.write(response.content)
            print("Download complete.")
        except Exception as e:
            print(f"Error downloading file: {e}")
            return False
    else:
        print(f"Found {RAW_FILE}, skipping download.")
    return True


def segment_corpus():
    """Segment the corpus into individual works."""
    # Read the file
    with open(RAW_FILE, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    # Find Table of Contents
    toc_start_idx = -1
    for i, line in enumerate(lines):
        if "Contents" in line and len(line.strip()) < 20:
            if i < 200:  # TOC should be in first 200 lines
                toc_start_idx = i
                break

    if toc_start_idx == -1:
        print("Could not find Table of Contents.")
        return []

    print(f"Found Table of Contents around line {toc_start_idx + 1}")

    # Extract titles from TOC
    potential_titles = []
    first_candidate = None
    toc_end_idx = -1

    for i in range(toc_start_idx + 1, len(lines)):
        stripped = lines[i].strip()
        if not stripped:
            continue

        if first_candidate is None:
            first_candidate = stripped
            potential_titles.append(stripped)
            continue

        # Check if this line matches the first candidate (start of first work)
        if stripped == first_candidate:
            toc_end_idx = i
            break

        potential_titles.append(stripped)

        if i > 3000:  # Safety break
            print("Warning: TOC parsing went too far.")
            break

    if toc_end_idx == -1:
        print("Could not determine end of TOC.")
        return []

    works_titles = potential_titles
    print(f"Identified {len(works_titles)} works from TOC.")

    # Find start positions of each work
    work_starts = {}
    current_search_idx = toc_end_idx
    work_starts[works_titles[0]] = current_search_idx

    for k in range(1, len(works_titles)):
        title = works_titles[k]
        found = False
        for j in range(current_search_idx + 1, len(lines)):
            if lines[j].strip() == title:
                work_starts[title] = j
                current_search_idx = j
                found = True
                break
        if not found:
            print(f"Warning: Could not find start of '{title}'")

    # Write to files
    os.makedirs(WORKS_DIR, exist_ok=True)
    sorted_works = sorted(work_starts.items(), key=lambda x: x[1])

    extracted_works = []
    for i in range(len(sorted_works)):
        title, start_line = sorted_works[i]

        if i < len(sorted_works) - 1:
            end_line = sorted_works[i + 1][1]
        else:
            end_line = len(lines)

        content_lines = lines[start_line:end_line]
        text_content = "".join(content_lines)

        # Clean title for filename
        safe_title = re.sub(r'[^\w\s-]', '', title).strip().lower()
        safe_title = re.sub(r'[\s-]+', '_', safe_title)
        filename = f"{safe_title}.txt"

        out_path = os.path.join(WORKS_DIR, filename)
        with open(out_path, 'w', encoding='utf-8') as f:
            f.write(text_content)

        extracted_works.append((title, filename, len(text_content), len(content_lines)))

    return extracted_works
    # Task 0.2: END STUDENT CODE

# Run the corpus download and segmentation
if download_corpus():
    works = segment_corpus()

    if works:
        print(f"\n{'='*70}")
        print("SEGMENTATION SUMMARY")
        print(f"{'='*70}")
        print(f"{'Work Title':<50} {'Chars':>10} {'Lines':>8}")
        print("-" * 70)
        for title, filename, chars, lines_count in works:
            print(f"{title[:48]:<50} {chars:>10} {lines_count:>8}")
        print(f"\nTotal works extracted: {len(works)}")


Found data/pg100.txt, skipping download.
Found Table of Contents around line 34
Identified 44 works from TOC.

SEGMENTATION SUMMARY
Work Title                                              Chars    Lines
----------------------------------------------------------------------
THE SONNETS                                             98332     2777
ALL’S WELL THAT ENDS WELL                              134624     4964
THE TRAGEDY OF ANTONY AND CLEOPATRA                    152400     6641
AS YOU LIKE IT                                         127042     4438
THE COMEDY OF ERRORS                                    88333     3201
THE TRAGEDY OF CORIOLANUS                              165954     6445
CYMBELINE                                              161238     5885
THE TRAGEDY OF HAMLET, PRINCE OF DENMARK               177938     6698
THE FIRST PART OF KING HENRY THE FOURTH                141709     4807
THE SECOND PART OF KING HENRY THE FOURTH               153547     5196
THE LIFE OF KING

---

## Exercise 0.2 – Basic Tokenization and Normalization

### Description

In this exercise, you will build foundational text-processing utilities. You will design a simple **tokenizer** and apply basic **normalization** steps to Shakespeare's works. This mirrors the early stages of many NLP pipelines.

### Learning Objectives

After completing this exercise, you should be able to:
- Implement a minimal **tokenizer** for plain-text data
- Apply common **normalization** steps such as lowercasing and punctuation handling
- Inspect token distributions to understand corpus characteristics
- Compute statistics for single works and the entire corpus

### Tasks

1. **Implement a basic tokenizer (3x0.5 points)** that:
   - Splits text into word-like units
   - Treats whitespace as a separator
   - Separates punctuation into its own tokens
   
2. **Add normalization steps (1 point)**:
   - Lowercase all tokens
   - Normalize curly/smart quotes to straight quotes
   
3. **Inspect tokenized output** and compute statistics


In [14]:
# Exercise 0.2: Basic Tokenization and Normalization


def tokenize(text: str) -> list[str]:
    """
    Tokenizes the input text into a list of strings.

    Design Decisions:
    1. Lowercase: Applied to reduce vocabulary size.
    2. Punctuation: Separated from words into their own tokens.
    3. Contractions: Kept together using regex (e.g., "don't" stays as one token).

    Args:
        text: Input text string

    Returns:
        List of token strings
    """
    # Task 0.3: START STUDENT CODE

    # HINT:
    # 1. Lowercase the text
    text = text.lower()
    # 2. Normalize smart quotes ('', "") to straight quotes (' ', "")
    text = text.replace("“", '"').replace("”", '"').replace("‘", "'").replace("’", "'")
    # 3. Replace multiple whitespace chars (newlines, tabs) with single space
    text = re.sub(r'\s+', ' ', text)
    # 4. Use re.findall() with a regex pattern to extract tokens:
    #    - Words with optional contractions: \w+(?:'\w+)?
    #    - Single punctuation: [^\w\s]

    tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

    return tokens

    # Task 0.3: END STUDENT CODE

# Test the tokenizer on a sample work
sample_work = 'the_tragedy_of_romeo_and_juliet.txt'
sample_path = os.path.join(WORKS_DIR, sample_work)

if os.path.exists(sample_path):
    with open(sample_path, 'r', encoding='utf-8') as f:
        text = f.read()

    tokens = tokenize(text)
    vocab = Counter(tokens)

    print(f"Analyzing: {sample_work}")
    print(f"\nTotal tokens: {len(tokens)}")
    print(f"Unique tokens (vocabulary size): {len(vocab)}")

    print("\n--- First 50 tokens ---")
    print(tokens[:50])

    print("\n--- Top 10 most frequent tokens ---")
    for token, count in vocab.most_common(10):
        print(f"  '{token}': {count}")

    print("\n--- Sample rare tokens ---")
    for token, count in vocab.most_common()[-5:]:
        print(f"  '{token}': {count}")
else:
    print(f"Sample work not found at {sample_path}. Run Exercise 0.1 first.")


Analyzing: the_tragedy_of_romeo_and_juliet.txt

Total tokens: 33150
Unique tokens (vocabulary size): 3783

--- First 50 tokens ---
['the', 'tragedy', 'of', 'romeo', 'and', 'juliet', 'contents', 'the', 'prologue', '.', 'act', 'i', 'scene', 'i', '.', 'a', 'public', 'place', '.', 'scene', 'ii', '.', 'a', 'street', '.', 'scene', 'iii', '.', 'room', 'in', "capulet's", 'house', '.', 'scene', 'iv', '.', 'a', 'street', '.', 'scene', 'v', '.', 'a', 'hall', 'in', "capulet's", 'house', '.', 'act', 'ii']

--- Top 10 most frequent tokens ---
  ',': 2704
  '.': 2600
  'and': 736
  'the': 688
  'i': 583
  'to': 541
  'a': 488
  'of': 395
  '?': 369
  'my': 356

--- Sample rare tokens ---
  'figure': 1
  'sacrifices': 1
  'glooming': 1
  'pardon'd': 1
  'punished': 1


In [15]:
# Exercise 0.2 (continued): Corpus-wide Statistics

print("="*70)
print("CORPUS-WIDE STATISTICS")
print("="*70)

# Collect statistics from all works
work_files = [f for f in os.listdir(WORKS_DIR) if f.endswith('.txt')]
print(f"\nFound {len(work_files)} works in the corpus.\n")

corpus_tokens = []
corpus_vocab = Counter()
work_stats = []

for work_file in sorted(work_files):
    work_path = os.path.join(WORKS_DIR, work_file)
    with open(work_path, 'r', encoding='utf-8') as f:
        work_text = f.read()

    work_tokens = tokenize(work_text)
    work_vocab = Counter(work_tokens)

    corpus_tokens.extend(work_tokens)
    corpus_vocab.update(work_vocab)

    work_stats.append({
        'name': work_file,
        'total_tokens': len(work_tokens),
        'unique_tokens': len(work_vocab)
    })

# Print per-work summary
print(f"{'Work':<50} {'Total Tokens':>12} {'Unique Tokens':>14}")
print("-" * 76)
for stat in sorted(work_stats, key=lambda x: x['total_tokens'], reverse=True)[:10]:
    print(f"{stat['name']:<50} {stat['total_tokens']:>12} {stat['unique_tokens']:>14}")
print("... (showing top 10 by token count)")

# Print corpus-wide statistics
print("\n" + "="*70)
print("AGGREGATED STATISTICS")
print("="*70)
print(f"\nTotal tokens across all works: {len(corpus_tokens):,}")
print(f"Unique tokens (vocabulary size): {len(corpus_vocab):,}")
print(f"Average tokens per work: {len(corpus_tokens) / len(work_files):,.0f}")

print("\n--- Top 20 Most Common Tokens in Corpus ---")
for token, count in corpus_vocab.most_common(20):
    print(f"  {token:>15}: {count:>8} occurrences")

print("\n--- Token Coverage Statistics ---")
total_tokens = len(corpus_tokens)
for n in [10, 50, 100, 500, 1000]:
    top_n_count = sum(count for _, count in corpus_vocab.most_common(n))
    coverage = (top_n_count / total_tokens) * 100
    print(f"  Top {n:>4} tokens cover {coverage:>5.1f}% of all tokens")


CORPUS-WIDE STATISTICS

Found 44 works in the corpus.

Work                                               Total Tokens  Unique Tokens
----------------------------------------------------------------------------
the_tragedy_of_hamlet_prince_of_denmark.txt               40789           4828
king_richard_the_third.txt                                39454           4125
the_tragedy_of_coriolanus.txt                             37220           4134
cymbeline.txt                                             36677           4335
the_tragedy_of_othello_the_moor_of_venice.txt             35978           3879
the_tragedy_of_king_lear.txt                              35786           4271
troilus_and_cressida.txt                                  35646           4323
the_second_part_of_king_henry_the_fourth.txt              34849           4176
the_tragedy_of_antony_and_cleopatra.txt                   34531           4027
the_life_of_king_henry_the_fifth.txt                      34270           4628

---

## Exercise 0.3 – Working With Pretrained Word Embeddings

### Description

In this exercise, you will download a widely used pretrained word embedding model (GloVe) and explore semantic relationships between words by computing cosine similarities manually.

### Learning Objectives

After completing this exercise, you should be able to:
- Load a pretrained embedding model into Python
- Implement cosine similarity manually using tensor operations
- Compute and analyze similarities between selected word pairs
- Observe how semantic and syntactic relationships are reflected in vector space
- Perform analogy operations (e.g., king - man + woman ≈ queen)

### Tasks

1. Download GloVe embeddings (100-dimensional version)
2. Load embeddings into a dictionary
3. **Implement cosine similarity** manually **(0.5 points)**
4. Explore semantic relationships** between word pairs - this requires **building a find_nearest function** that finds the nearest neighbor from a set of input vectors and a target vector. **(1 point)**


In [16]:
# Exercise 0.3: Working With Pretrained Word Embeddings

GLOVE_PATH = 'data/glove.6B.100d.txt'


def load_glove(path: str) -> dict:
    """
    Load GloVe embeddings from a file.

    Args:
        path: Path to the GloVe file

    Returns:
        Dictionary mapping words to torch tensors
    """
    print(f"Loading GloVe from {path}...")
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            try:
                vector = np.array(values[1:], dtype='float32')
                if len(vector) == 100:  # Ensure correct dimension
                    embeddings[word] = torch.tensor(vector)
            except ValueError:
                continue
    print(f"Loaded {len(embeddings)} words.")
    return embeddings


def cosine_similarity(vec1: torch.Tensor, vec2: torch.Tensor) -> float:
    """
    Compute cosine similarity between two vectors manually.

    Formula: cos(θ) = (A · B) / (|A| * |B|)

    Args:
        vec1: First vector
        vec2: Second vector

    Returns:
        Cosine similarity as a float
    """
    # Task 1.1: START STUDENT CODE

    # HINT: Implement the cosine similarity formula: cos(θ) = (A · B) / (|A| * |B|)
    # 1. Compute dot product: torch.dot(vec1, vec2)
    dot_product = torch.dot(vec1, vec2)
    # 2. Compute norms: torch.norm()
    norm_vec1 = torch.norm(vec1)
    norm_vec2 = torch.norm(vec2)
    # 3. Handle zero norms (avoid division by zero)
    if norm_vec1.item() == 0 or norm_vec2.item() == 0:
        return 0.0
    # 4. Return the result as a Python float using .item()
    cosine_sim = dot_product / (norm_vec1 * norm_vec2)

    return cosine_sim.item()

    # Task 1.1: END STUDENT CODE


def find_nearest(embeddings: dict, target_vec: torch.Tensor, n: int = 5,
                 exclude_words: list = None) -> list:
    """
    Find the n nearest neighbors to a target vector by computing
    the dot product of each word embedding with the target vector.

    Args:
        embeddings: Dictionary of word embeddings
        target_vec: Target vector to find neighbors for
        n: Number of neighbors to return
        exclude_words: Words to exclude from results

    Returns:
        List of (word, similarity) tuples
    """
    # Task 1.2: START STUDENT CODE
    if exclude_words is None:
        exclude_words = []
    # HINT:
    # 1. Convert embeddings dict to lists: words and vocab_matrix (stacked tensors)
    words = list(embeddings.keys())
    vocab_matrix = torch.stack([embeddings[w] for w in words])
    # 2. Compute norms for all vocabulary vectors and the target vector
    vocab_norms = torch.norm(vocab_matrix, dim=1)
    target_norm = torch.norm(target_vec)
    # 3. Compute dot products between vocab_matrix and target_vec using matmul
    dot_products = torch.matmul(vocab_matrix, target_vec)
    # 4. Compute cosine similarities using the formula
    cosine_sims = dot_products / (vocab_norms * target_norm + 1e-8)
    # 5. Use torch.topk() to find top k highest scores
    topk_vals, topk_idx = torch.topk(cosine_sims, n)
    # 6. Filter out excluded_words and return top n results as (word, score) tuples
    nearest = [(words[i], topk_vals[j].item()) for j, i in enumerate(topk_idx) if words[i] not in exclude_words]
    return nearest

    # Task 1.2: END STUDENT CODE

In [17]:
# Download GLOVE - you only need to do this once.

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip -d data/

# If this fails, you can download manually from:
# https://nlp.stanford.edu/data/glove.6B.zip
# and extract the zip file into the /data directory.

--2025-12-16 09:51:55--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2025-12-16 09:51:55--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2025-12-16 09:51:55--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

In [18]:
# Load embeddings and explore relationships

if os.path.exists(GLOVE_PATH):
    embeddings = load_glove(GLOVE_PATH)

    # Explore semantic relationships
    pairs = [
        ("king", "monarch"), ("love", "affection"),  # Synonyms
        ("war", "peace"), ("love", "hate"),  # Antonyms
        ("doctor", "nurse"), ("poison", "dagger"),  # Semantic fields
        ("romeo", "juliet"), ("tragedy", "comedy")  # Shakespeare-related
    ]

    print("\n--- Cosine Similarity Between Word Pairs ---")
    for w1, w2 in pairs:
        if w1 in embeddings and w2 in embeddings:
            sim = cosine_similarity(embeddings[w1], embeddings[w2])
            print(f"  {w1:12} - {w2:12}: {sim:.4f}")
        else:
            missing = [w for w in [w1, w2] if w not in embeddings]
            print(f"  Missing: {missing}")

    # Analogies
    print("\n--- Word Analogies ---")

    def solve_analogy(pos1, neg1, pos2):
        """Solve: pos1 - neg1 + pos2 = ?"""
        print(f"\n  {pos1} - {neg1} + {pos2} = ?")
        if all(w in embeddings for w in [pos1, neg1, pos2]):
            vec = embeddings[pos1] - embeddings[neg1] + embeddings[pos2]
            neighbors = find_nearest(embeddings, vec, n=5, exclude_words=[pos1, neg1, pos2])
            for word, score in neighbors:
                print(f"    {word}: {score:.4f}")
        else:
            print("    Word(s) not in vocabulary.")

    solve_analogy("king", "man", "woman")  # Expected: queen
    solve_analogy("paris", "france", "italy")  # Expected: rome
    solve_analogy("father", "man", "woman")  # Expected: mother
else:
    print(f"GloVe file not found at {GLOVE_PATH}")
    print("Please download from: https://nlp.stanford.edu/data/glove.6B.zip")
    print("Extract glove.6B.100d.txt to the data/ directory.")


Loading GloVe from data/glove.6B.100d.txt...
Loaded 400000 words.

--- Cosine Similarity Between Word Pairs ---
  king         - monarch     : 0.6978
  love         - affection   : 0.6255
  war          - peace       : 0.6155
  love         - hate        : 0.5704
  doctor       - nurse       : 0.7522
  poison       - dagger      : 0.3359
  romeo        - juliet      : 0.6607
  tragedy      - comedy      : 0.3790

--- Word Analogies ---

  king - man + woman = ?
    queen: 0.7834
    monarch: 0.6934
    throne: 0.6833
    daughter: 0.6809

  paris - france + italy = ?
    rome: 0.8084
    milan: 0.7317
    naples: 0.7090
    venice: 0.7010

  father - man + woman = ?
    mother: 0.9137
    daughter: 0.8749
    wife: 0.8636
    husband: 0.8385


---

# Stage 1: Character-Level Language Modeling

In this stage, you will build your first **neural language model** over Shakespeare's text using a **character-level recurrent neural network (RNN)** in PyTorch. You will:

- Construct a character-level dataset from one work
- Implement and train an RNN-based language model (on GPU if available)
- Generate text using greedy decoding and temperature sampling

---

## Exercise 1.1 – Character Vocabulary and Sequential Dataset

### Description

You will construct a **character-level representation** of a Shakespeare work and prepare a dataset for **next-character prediction**.

### Learning Objectives

After completing this exercise, you should be able to:
- Build a **character vocabulary** from raw text
- Map characters to integer indices and back
- Prepare sliding-window input–target pairs for sequence modeling
- Wrap the data in a PyTorch `Dataset` and `DataLoader`

### Task

Implement a PyTorch Dataset class that implements the usual functions; \_\_init\_\_, \_\_len\_\_, and \_\_getitem\_\_. The init function needs to take the text corpus as a string and a sequence length as an integer. The getitem function takes an index integer as usual and returns a number of characters (sequence length) starting at that index in the text corpus - this is the data - and a target that has the same length, but which is offset by one to the right. For example, if the corpus is the text `Hello, I am a dog.`, then the output at sequence length 5 and index 0 would be `Hello` (data) and `ello,` (target). **(3 x 0.5 points)**


In [21]:
# Exercise 1.1: Character Vocabulary and Sequential Dataset

from torch.utils.data import Dataset, DataLoader
import os

class CharDataset(Dataset):
    """
    A PyTorch Dataset for character-level language modeling.

    Creates input-target pairs using a sliding window over the text.
    Input: characters from position t to t + seq_len - 1
    Target: characters from position t + 1 to t + seq_len
    """

    def __init__(self, text: str, seq_len: int):
        """
        Initialize the dataset.

        Args:
            text: The full text as a string
            seq_len: Length of each sequence (number of characters)
        """
        # Task 1.3: START STUDENT CODE

        # HINT:
        # 1. Store text and seq_len
        self.text = text
        self.seq_len = seq_len
        # 2. Build character set: unique chars from text, sorted for consistency
        self.chars = sorted(list(set(text)))
        self.vocab_size = len(self.chars)
        # 3. Create bidirectional mappings: char_to_idx and idx_to_char
        self.char_to_idx = {ch: idx for idx, ch in enumerate(self.chars)}
        self.idx_to_char = {idx: ch for idx, ch in enumerate(self.chars)}
        # 4. Encode entire text as tensor of indices using char_to_idx
        self.data = torch.tensor(
            [self.char_to_idx[ch] for ch in text],
            dtype=torch.long
        )

        # Task 1.3: END STUDENT CODE

    def __len__(self):
        """Number of samples we can create from the text."""
        # Task 1.4: START STUDENT CODE

        # HINT: Each sample needs seq_len + 1 characters (seq_len for input, last one for target)
        # So max samples = len(self.data) - seq_len
        return len(self.data) - self.seq_len

        # Task 1.4: END STUDENT CODE

    def __getitem__(self, idx):
        """
        Get a single sample.

        Returns:
            input_seq: tensor of shape [seq_len]
            target_seq: tensor of shape [seq_len]
        """
        # Task 1.5: START STUDENT CODE

        # HINT: Implement sliding window for next-character prediction
        # 1. Extract seq_len + 1 characters starting at idx
        chunk = self.data[idx : idx + self.seq_len + 1]
        # 2. Split into input (first seq_len) and target (last seq_len, shifted by 1)
        input_seq = chunk[:-1]
        target_seq = chunk[1:]
        # 3. Return both as tensors
        return input_seq, target_seq

        # Task 1.5: END STUDENT CODE

In [22]:
# Test the CharDataset
print("--- Exercise 1.1: Character Dataset ---")

# Load Romeo and Juliet (or another work)
work_path = os.path.join(WORKS_DIR, 'the_tragedy_of_romeo_and_juliet.txt')
if not os.path.exists(work_path):
    print("Work file not found. Please run Exercise 0.1 first.")
else:
    with open(work_path, 'r', encoding='utf-8') as f:
        char_text = f.read()

    print(f"Loaded text length: {len(char_text)} characters")

    # Create dataset
    # seq_len=100 captures roughly 1-2 lines of verse - good context for learning structure
    CHAR_SEQ_LEN = 100
    CHAR_BATCH_SIZE = 64

    char_dataset = CharDataset(char_text, CHAR_SEQ_LEN)
    print(f"Vocabulary Size: {char_dataset.vocab_size}")
    print(f"Characters: {repr(''.join(char_dataset.char_to_idx.keys()))}")
    print(f"Number of samples: {len(char_dataset)}")

    # Create DataLoader
    char_dataloader = DataLoader(char_dataset, batch_size=CHAR_BATCH_SIZE, shuffle=True)

    # Verify with one batch
    inputs, targets = next(iter(char_dataloader))
    print(f"\nBatch Input Shape: {inputs.shape}")
    print(f"Batch Target Shape: {targets.shape}")

    # Show a sample
    sample_input = "".join([char_dataset.idx_to_char[i.item()] for i in inputs[0]])
    sample_target = "".join([char_dataset.idx_to_char[i.item()] for i in targets[0]])
    print(f"\nSample Input (first 50 chars): {repr(sample_input[:50])}")
    print(f"Sample Target (first 50 chars): {repr(sample_target[:50])}")

    print(f"\n--- Design Notes ---")
    print(f"seq_len={CHAR_SEQ_LEN}: Captures sufficient context (approx 1-2 lines of verse)")
    print(f"batch_size={CHAR_BATCH_SIZE}: Good balance between efficiency and memory usage")


--- Exercise 1.1: Character Dataset ---
Loaded text length: 142446 characters
Vocabulary Size: 70
Characters: '\n !&,-.:;?ABCDEFGHIJKLMNOPQRSTUVWYZ[]_abcdefghijklmnopqrstuvwxyzæ—‘’“”'
Number of samples: 142346

Batch Input Shape: torch.Size([64, 100])
Batch Target Shape: torch.Size([64, 100])

Sample Input (first 50 chars): ' And therefore, if you should deal double with\nher'
Sample Target (first 50 chars): 'And therefore, if you should deal double with\nher,'

--- Design Notes ---
seq_len=100: Captures sufficient context (approx 1-2 lines of verse)
batch_size=64: Good balance between efficiency and memory usage


---

## Exercise 1.2 – Character-Level RNN Language Model in PyTorch

### Description

You will implement and train a **character-level RNN-based language model** using the dataset from Exercise 1.1. The model will learn to predict the next character given the previous characters.

### Learning Objectives

After completing this exercise, you should be able to:
- Define a simple **recurrent neural network** for language modeling
- Use an **embedding layer**, an `nn.RNN`, and a linear output layer
- Train a neural language model with **cross-entropy loss**
- Run training on **GPU** where available

### Model Architecture

```
Input (char indices) → Embedding → RNN → Linear → Output (vocab logits)
```

### Tasks

1. Set up device selection (GPU if available)
2. Implement the `CharRNNLM` model class **(1 point)**
3. Train the model with cross-entropy loss and Adam optimizer


In [None]:
# Exercise 1.2: Character-Level RNN Language Model

import torch.nn as nn
import torch.optim as optim


class CharRNNLM(nn.Module):
    """
    Character-level RNN Language Model.

    Architecture:
    - Embedding layer: maps character indices to dense vectors
    - RNN layer: processes sequences and maintains hidden state
    - Linear layer: maps hidden states to vocabulary logits
    """

    def __init__(self, vocab_size: int, emb_dim: int, hidden_size: int):
        """
        Initialize the model.

        Args:
            vocab_size: Number of unique characters
            emb_dim: Dimension of character embeddings
            hidden_size: Number of hidden units in RNN
        """
        # Task 1.6: START STUDENT CODE

        # HINT: Build a simple RNN-based language model
        # 1. Call super().__init__()
        super().__init__()
        # 2. Create embedding layer: nn.Embedding(vocab_size, emb_dim)
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim)
        # 3. Create RNN layer: nn.RNN(emb_dim, hidden_size, batch_first=True)
        self.rnn = nn.RNN(input_size=emb_dim, hidden_size=hidden_size, batch_first=True)
        # 4. Create linear output layer: nn.Linear(hidden_size, vocab_size)
        self.fc = nn.Linear(hidden_size, vocab_size)

        # Task 1.6: END STUDENT CODE

    def forward(self, x, hidden=None):
        """
        Forward pass.

        Args:
            x: Input tensor of shape (batch, seq_len)
            hidden: Optional initial hidden state

        Returns:
            logits: Output logits of shape (batch, seq_len, vocab_size)
            hidden: Final hidden state
        """
        # Task 1.7: START STUDENT CODE

        # HINT: Forward pass through the model:
        # 1. Embed the input indices
        embeds = self.embedding(x)
        # 2. Pass through RNN to get hidden states
        rnn_out, hidden = self.rnn(embeds, hidden)
        # 3. Pass hidden states through linear layer to get logits
        logits = self.fc(rnn_out)
        # 4. Return logits and final hidden state
        return logits, hidden

        # Task 1.7: END STUDENT CODE

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


In [None]:
# Exercise 1.2 (continued): Training the Character RNN

print("--- Exercise 1.2: Training Character RNN ---")

# Model hyperparameters
CHAR_EMB_DIM = 64
CHAR_HIDDEN_SIZE = 256
CHAR_EPOCHS = 5
CHAR_LR = 0.002

# Create model
char_model = CharRNNLM(
    vocab_size=char_dataset.vocab_size,
    emb_dim=CHAR_EMB_DIM,
    hidden_size=CHAR_HIDDEN_SIZE
).to(device)

print(f"Model created with {sum(p.numel() for p in char_model.parameters()):,} parameters")

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(char_model.parameters(), lr=CHAR_LR)



print("\n--- Training Notes ---")
print("The loss should decrease over epochs, indicating the model is learning.")
print("If loss doesn't decrease, try: lower learning rate, more epochs, or larger model.")

# Training loop
char_model.train()
for epoch in range(CHAR_EPOCHS):
    total_loss = 0
    for i, (inputs, targets) in enumerate(char_dataloader):
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()

        # Forward pass (hidden state starts as None/zeros)
        logits, _ = char_model(inputs)

        # Reshape for loss: (batch * seq_len, vocab_size) vs (batch * seq_len)
        loss = criterion(logits.view(-1, char_dataset.vocab_size), targets.view(-1))

        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(char_model.parameters(), 5)  # Gradient clipping for stability
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(char_dataloader)
    print(f"Epoch {epoch+1}/{CHAR_EPOCHS}, Loss: {avg_loss:.4f}")

# Save model checkpoint
char_model_path = "char_rnn_model.pth"
torch.save({
    'model_state_dict': char_model.state_dict(),
    'vocab_size': char_dataset.vocab_size,
    'emb_dim': CHAR_EMB_DIM,
    'hidden_size': CHAR_HIDDEN_SIZE,
    'char_to_idx': char_dataset.char_to_idx,
    'idx_to_char': char_dataset.idx_to_char,
}, char_model_path)
print(f"\nModel saved to {char_model_path}")


---

## Exercise 1.3 – Text Generation and Temperature Sampling

### Description

You will implement text generation functions for your character-level language model and experiment with different **sampling strategies**, including **greedy decoding** and **temperature-scaled sampling**.

### Learning Objectives

After completing this exercise, you should be able to:
- Generate text autoregressively from a trained character-level language model
- Implement **greedy decoding** and observe its limitations
- Implement **temperature-based sampling** from a categorical distribution
- Qualitatively compare generated outputs at different temperatures

### Temperature Explained

- **Temperature = 0 (or very low)**: Greedy - always picks most likely character. Repetitive but "safe".
- **Temperature = 1.0**: Standard sampling from the learned distribution.
- **Temperature > 1.0**: More random/creative, but may produce nonsense.
- **Temperature < 1.0**: More focused/conservative, less variety.

### Tasks

1. Implement a `generate_greedy` function - This function should generate a string of predefined length, given a model and a start_text string. **(1 point)**
2. Implement a `generate_with_temperature` function - This function should generate a string of predefined length, given a model and a start_text string, but the generation of each new character is up to chance. The probabilities that a possible new character is selected is based on the temperature-scaled logits (our normal outputs times the temperature factor). **(1 point)**
3. Compare outputs at different temperatures. **(0.5 points)**


In [None]:
# Exercise 1.3: Text Generation and Temperature Sampling

import torch.nn.functional as F

def generate_greedy(model, start_text: str, length: int, char_to_idx: dict,
                    idx_to_char: dict, device='cpu') -> str:
    """
    Generate text using greedy decoding (always pick most likely next character).

    Args:
        model: Trained CharRNNLM model
        start_text: Initial prompt text
        length: Number of characters to generate
        char_to_idx: Character to index mapping
        idx_to_char: Index to character mapping
        device: Device to run on

    Returns:
        Generated text string (including prompt)
    """
    # Task 1.8: START STUDENT CODE

    # HINT: Implement greedy decoding for text generation
    # 1. Set model to eval mode and move to device
    # 2. Encode the start_text as character indices (handle unknown with fallback)
    # 3. Process prompt through model to initialize hidden state
    # 4. Loop for 'length' iterations:
    #    - Pick character with highest probability (torch.argmax)
    #    - Append to generated_text
    #    - Feed single character to model for next step
    # 5. Return generated text
    pass

    # Task 1.8: END STUDENT CODE


def generate_with_temperature(model, start_text: str, length: int, temperature: float,
                              char_to_idx: dict, idx_to_char: dict, device='cpu') -> str:
    """
    Generate text using temperature-scaled sampling.

    Args:
        model: Trained CharRNNLM model
        start_text: Initial prompt text
        length: Number of characters to generate
        temperature: Sampling temperature (higher = more random)
        char_to_idx: Character to index mapping
        idx_to_char: Index to character mapping
        device: Device to run on

    Returns:
        Generated text string (including prompt)
    """
    # Task 1.9: START STUDENT CODE

    # HINT: Implement temperature-scaled sampling for more creative text
    # 1. Set model to eval mode and move to device
    # 2. Encode start_text and process through model
    # 3. Loop for 'length' iterations:
    #    - If temperature very small (~0): use greedy (torch.argmax)
    #    - Otherwise: scale logits by temperature, softmax to probabilities, sample
    #    - Use torch.multinomial() to sample from probability distribution
    #    - Append character and feed to model
    # 4. Return generated text
    pass

    # Task 1.9: END STUDENT CODE

# Test generation
print("--- Exercise 1.3: Text Generation ---")

prompts = ["ROMEO.", "The "]

for prompt in prompts:
    print(f"\n{'='*60}")
    print(f"Prompt: '{prompt}'")
    print('='*60)

    print("\n--- Greedy Decoding ---")
    greedy_text = generate_greedy(
        char_model, prompt, 200,
        char_dataset.char_to_idx, char_dataset.idx_to_char, device
    )
    print(greedy_text)

    for temp in [0.5, 1.0, 2.0]:
        print(f"\n--- Temperature {temp} ---")
        temp_text = generate_with_temperature(
            char_model, prompt, 200, temp,
            char_dataset.char_to_idx, char_dataset.idx_to_char, device
        )
        print(temp_text)


---

# Stage 2: Word-Level Language Modeling & Theatrical Chat Interface

In this stage, you will move from **character-level** to **word-level** language modeling. You will:

- Build a word-level vocabulary and dataset from Shakespeare's works
- Implement and train a **word-level RNN language model**
- Construct a simple **theatrical chat interface** that simulates dialog between characters
- Create a **turn-aware** model with special end-of-turn tokens

---

## Exercise 2.1 – Word-Level Vocabulary and Sequential Dataset

### Description

You will construct a **word-level representation** of Shakespeare's text and prepare a dataset for **next-word prediction**.

### Learning Objectives

After completing this exercise, you should be able to:
- Extend a tokenizer for word-level modeling
- Build a **word vocabulary** with frequency cutoffs
- Handle out-of-vocabulary words with `<UNK>` token
- Prepare sliding-window input-target pairs for next-word prediction

### Task

Instead of a character-based dataset, we now transition to a word-based dataset. The methods are the same as before, except we now split the text into words and limit ourselves to a vocabulary of a fixed size `vocab_size`, which should consist of the most common tokens in the corpus (hint: use a Counter). The vocabulary should also contain an `unk_token = "<UNK>"`, which is our stand-in for any future words we do not know (either rare tokens from this corpus or other texts).

1. Build the vocabulary. **(0.5 points)**
2. Build `<UNK>` handling. **(0.5 points)**
3. Build the rest of the dataset. **(0.5 points)**


In [None]:
# Exercise 2.1: Word-Level Vocabulary and Sequential Dataset

# You can reuse the tokenize function from Exercise 0.2 (already defined above)

class WordDataset(Dataset):
    """
    A PyTorch Dataset for word-level language modeling.

    Creates input-target pairs using a sliding window over tokenized text.
    Handles vocabulary building and out-of-vocabulary words.
    """

    def __init__(self, text: str, seq_len: int, vocab_size: int = 30000):
        """
        Initialize the dataset.

        Args:
            text: The full text as a string
            seq_len: Length of each sequence (number of words)
            vocab_size: Maximum vocabulary size (most frequent words)
        """
        # Task 2.1: START STUDENT CODE

        # HINT:
        # 1. Store seq_len and tokenize the text
        # 2. Build vocabulary: count token frequencies, keep most common up to vocab_size
        # 3. Create bidirectional mappings with an <UNK> token for OOV words
        # 4. Encode all tokens to indices
        pass

        # Task 2.1: END STUDENT CODE

    def __len__(self):
        # Task 2.2: START STUDENT CODE
        """Number of samples available."""

        # HINT: Similar to CharDataset, return the number of valid sequences
        pass

        # Task 2.2: END STUDENT CODE

    def __getitem__(self, idx):
        # Task 2.3: START STUDENT CODE
        """Get a single sample (input, target pair)."""

        # HINT: Implement sliding window for next-word prediction (same as character level)
        pass

        # Task 2.3: END STUDENT CODE


In [None]:
# Load the full corpus for word-level modeling
print("--- Exercise 2.1: Word-Level Dataset ---")
print("\nLoading full Shakespeare corpus...")

all_text = []
work_files = sorted([f for f in os.listdir(WORKS_DIR) if f.endswith('.txt')])
# use a selective works if training takes too long
work_files = [f for f in work_files if "romeo" in f]

print(f"Found {len(work_files)} works")

for filename in work_files:
    with open(os.path.join(WORKS_DIR, filename), 'r', encoding='utf-8') as f:
        all_text.append(f.read())

full_corpus = "\n".join(all_text)
print(f"Total corpus length: {len(full_corpus):,} characters")

# Create dataset
WORD_SEQ_LEN = 100  # 100 words of context
WORD_BATCH_SIZE = 64
WORD_VOCAB_SIZE = 30000

word_dataset = WordDataset(full_corpus, WORD_SEQ_LEN, vocab_size=WORD_VOCAB_SIZE)
word_dataloader = DataLoader(word_dataset, batch_size=WORD_BATCH_SIZE, shuffle=True)

# Verify
inputs, targets = next(iter(word_dataloader))
print(f"\nBatch shapes: Input {inputs.shape}, Target {targets.shape}")

# Decode sample
sample_input = ' '.join([word_dataset.idx_to_word[i.item()] for i in inputs[0][:20]])
print(f"\nSample input (first 20 words): {sample_input}...")

print(f"\n--- Design Notes ---")
print(f"seq_len={WORD_SEQ_LEN}: Longer context window for word-level modeling")
print(f"vocab_size={WORD_VOCAB_SIZE}: Accommodates Shakespeare's vocabulary")
print(f"<UNK> handling: Rare words mapped to single UNK token")


---

## Exercise 2.2 – Word-Level RNN Language Model in PyTorch

### Description

You will implement and train a **word-level RNN-based language model** using the dataset from Exercise 2.1. The model will learn to predict the next word given a sequence of preceding words.

### Learning Objectives

After completing this exercise, you should be able to:
- Define a word-level recurrent neural language model
- Use larger embedding dimensions appropriate for word-level modeling
- Train with cross-entropy loss on next-word prediction

### Tasks

1. Define the `WordRNNLM` model class (similar to CharRNNLM but for words). **(1 point)**
2. Train for multiple epochs on the full corpus and save model checkpoints.


In [None]:
# Exercise 2.2: Word-Level RNN Language Model


class WordRNNLM(nn.Module):
    """
    Word-level RNN Language Model.

    Similar architecture to CharRNNLM but with:
    - Larger embedding dimensions (words need more representation capacity)
    - Multiple RNN layers for better modeling
    """

    def __init__(self, vocab_size: int, emb_dim: int, hidden_size: int, num_layers: int = 3):
        """
        Initialize the model.

        Args:
            vocab_size: Number of words in vocabulary
            emb_dim: Dimension of word embeddings
            hidden_size: Number of hidden units in RNN
            num_layers: Number of stacked RNN layers
        """
        # Task 2.4: START STUDENT CODE

        # HINT: Build a word-level RNN model (similar to CharRNNLM but with num_layers param)
        pass

        # Task 2.4: END STUDENT CODE

    def forward(self, x, hidden=None):
        """
        Forward pass.

        Args:
            x: Input tensor of shape (batch, seq_len)
            hidden: Optional initial hidden state

        Returns:
            logits: Output logits of shape (batch, seq_len, vocab_size)
            hidden: Final hidden state
        """
        # Task 2.5: START STUDENT CODE

        # HINT: Forward pass through RNN (same as CharRNNLM)
        pass

        # Task 2.5: END STUDENT CODE

print("--- Exercise 2.2: Word-Level RNN Model ---")


In [None]:
# Exercise 2.2 (continued): Training Word RNN

# Model hyperparameters - larger than char model
WORD_EMB_DIM = 300
WORD_HIDDEN_SIZE = 512
WORD_NUM_LAYERS = 3
WORD_EPOCHS = 3  # Fewer epochs due to larger corpus
WORD_LR = 0.001

# Create model
word_model = WordRNNLM(
    vocab_size=word_dataset.vocab_size,
    emb_dim=WORD_EMB_DIM,
    hidden_size=WORD_HIDDEN_SIZE,
    num_layers=WORD_NUM_LAYERS
).to(device)

print(f"Model created with {sum(p.numel() for p in word_model.parameters()):,} parameters")

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(word_model.parameters(), lr=WORD_LR)

# Training loop
total_batches = len(word_dataloader)
word_model.train()

for epoch in range(WORD_EPOCHS):
    total_loss = 0
    for i, (inputs, targets) in enumerate(word_dataloader):
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        logits, _ = word_model(inputs)
        loss = criterion(logits.view(-1, word_dataset.vocab_size), targets.view(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(word_model.parameters(), 5)
        optimizer.step()
        total_loss += loss.item()

        # Progress logging
        if i % 200 == 0:
            pct = (i / total_batches) * 100
            print(f"Epoch {epoch+1}/{WORD_EPOCHS} | Batch {i}/{total_batches} ({pct:.1f}%) | Loss: {loss.item():.4f}")

    avg_loss = total_loss / len(word_dataloader)
    print(f"Epoch {epoch+1}/{WORD_EPOCHS} Complete | Avg Loss: {avg_loss:.4f}")

# Save model
word_model_path = "word_rnn_model.pth"
torch.save({
    'model_state_dict': word_model.state_dict(),
    'vocab_size': word_dataset.vocab_size,
    'emb_dim': WORD_EMB_DIM,
    'hidden_size': WORD_HIDDEN_SIZE,
    'num_layers': WORD_NUM_LAYERS,
    'word_to_idx': word_dataset.word_to_idx,
    'idx_to_word': word_dataset.idx_to_word,
}, word_model_path)
print(f"\nModel saved to {word_model_path}")


---

## Exercise 2.3 – Theatrical Chat Interface with a Word-Level RNN

### Description

In this exercise, you will use your trained **word-level RNN language model** to build a simple **theatrical chat interface**. The model will be prompted with a speaker name and dialog, then continue the text in Shakespearean style.

### Learning Objectives

After completing this exercise, you should be able to:
- Use a word-level language model for **prompt-based generation**
- Design a simple **chat-style interface** around a language model
- Control text generation via temperature sampling at the word level

### Tasks

1. Implement a word-level generation function that detokenizes the generated text after generation. **(1 point)**
2. Test with different theatrical prompts (ROMEO., JULIET., etc.) and different temperatures. **(0.5 points)**


In [None]:
# Exercise 2.3: Theatrical Chat Interface

def detokenize(tokens: list) -> str:
    """
    Convert a list of tokens back into readable text.
    Handles punctuation attachment (no space before punctuation).

    Args:
        tokens: List of token strings

    Returns:
        Reconstructed text string
    """
    text = ""
    for t in tokens:
        # Attach punctuation without leading space
        if t in [".", ",", "?", "!", ":", ";", "'"] or t.startswith("'"):
            if text:
                text = text.rstrip() + t + " "
            else:
                text += t + " "
        else:
            text += t + " "
    return text.strip()


def generate_words(model, start_text: str, max_tokens: int, temperature: float,
                   word_to_idx: dict, idx_to_word: dict, device='cpu') -> str:
    """
    Generate text at the word level with temperature sampling.

    Args:
        model: Trained WordRNNLM model
        start_text: Initial prompt text
        max_tokens: Maximum number of words to generate
        temperature: Sampling temperature
        word_to_idx: Word to index mapping
        idx_to_word: Index to word mapping
        device: Device to run on

    Returns:
        Generated text string (including prompt)
    """
    # Task 2.6: START STUDENT CODE

    # HINT: Word-level generation is similar to character-level but:
    # 1. Tokenize the prompt using the tokenize() function
    # 2. Map tokens to indices (handle UNK)
    # 3. Generate words with temperature sampling (like Task 1.9)
    # 4. Detokenize the result (reconstruct readable text with proper spacing)
    pass

    # Task 2.6: END STUDENT CODE


In [None]:
# Test theatrical generation
print("--- Exercise 2.3: Theatrical Chat Interface ---")

examples = [
    ("ROMEO", "My heart is heavy with unspoken words."),
    ("JULIET", "Good even, my lord. Why art thou troubled?"),
    ("HAMLET", "To be, or not to be, that is the question.")
]

for speaker, line in examples:
    prompt = f"{speaker}.\n{line}\n"

    print(f"\n{'='*60}")
    print(f"{speaker}: {line}")
    print('='*60)

    for temp in [0.5, 0.8, 1.2]:
        response = generate_words(
            word_model, prompt, 50, temp,
            word_dataset.word_to_idx, word_dataset.idx_to_word, device
        )
        print(f"\n[Temperature {temp}]")
        print(response)

---

## Exercise 2.4 – Turn-Based Modeling with Special Tokens

### Description

To build a realistic theatrical chat interface, the model needs to understand when a speaker's turn ends. In this exercise, you will create a specialized dataset that inserts a special **End-of-Turn** token (`<EOS>`) before every speaker change.

### Learning Objectives

- Preprocess text to explicitly model dialog structure (turns)
- Use special tokens (`<EOS>`) to control generation length
- Train a language model that learns to stop generating at appropriate times

### Key Insight

By training with `<EOS>` markers before speaker changes, the model learns:
1. When to stop generating (predict `<EOS>`)
2. The natural rhythm of theatrical dialogue

### Tasks

1. Implement a `insert_turn_markers()` to add `<EOS>` before speaker names. **(1 point)**
2. Create a `TurnDataset` with the modified corpus and train a turn-aware model. **(1 point)**


In [None]:
# Exercise 2.4: Turn-Based Modeling with Special Tokens

def insert_turn_markers(text: str) -> str:
    """
    Insert <EOS> markers before speaker names in the text.

    Speaker detection heuristic:
    - Lines that are predominantly uppercase
    - Short (< 30 characters)
    - Not empty

    Args:
        text: Original text

    Returns:
        Text with <EOS> markers inserted before speaker names
    """
    # Task 2.7: START STUDENT CODE

    # HINT: Detect speaker names and insert <EOS> markers
    # 1. Split text into lines
    # 2. For each line, detect if it's a speaker name:
    #    - Non-empty, short (< 30 chars), mostly uppercase
    # 3. If speaker: insert "<EOS>" before the line
    # 4. Rejoin lines
    pass

    # Task 2.7: END STUDENT CODE


class TurnDataset(Dataset):
    """
    Dataset with turn markers for dialogue modeling.

    Similar to WordDataset but:
    - Preprocesses text with <EOS> markers
    - Ensures <EOS> token is in vocabulary
    """

    def __init__(self, text: str, seq_len: int, vocab_size: int = 30000):
        # Task 2.8: START STUDENT CODE

        # HINT: Similar to WordDataset but with turn markers:
        # 1. Store seq_len
        # 2. Insert <EOS> markers before speakers (use insert_turn_markers)
        # 3. Replace <EOS> with "eos_marker" so tokenizer handles it properly
        # 4. Tokenize and build vocabulary (reserve space for UNK and eos_marker)
        # 5. Ensure both UNK and EOS tokens are in vocabulary
        # 6. Encode all tokens
        pass

        # Task 2.8: END STUDENT CODE

    def __len__(self):
        # Task 2.9: START STUDENT CODE

        # HINT: Same as WordDataset
        pass

        # Task 2.9: END STUDENT CODE

    def __getitem__(self, idx):
        # Task 2.10: START STUDENT CODE

        # HINT: Same as WordDataset
        pass

        # Task 2.10: END STUDENT CODE

# Create turn-aware dataset
print("--- Exercise 2.4: Turn-Based Dataset ---")

turn_dataset = TurnDataset(full_corpus, WORD_SEQ_LEN, vocab_size=WORD_VOCAB_SIZE)
turn_dataloader = DataLoader(turn_dataset, batch_size=WORD_BATCH_SIZE, shuffle=True)

print(f"\nTurn Dataset Stats:")
print(f"  Vocabulary size: {turn_dataset.vocab_size}")
print(f"  EOS token index: {turn_dataset.word_to_idx[turn_dataset.eos_token]}")
print(f"  Number of samples: {len(turn_dataset)}")


In [None]:
# Exercise 2.4 (continued): Training Turn-Aware Model

print("--- Training Turn-Aware RNN ---")

# Create turn-aware model (same architecture, different vocab/data)
turn_model = WordRNNLM(
    vocab_size=turn_dataset.vocab_size,
    emb_dim=WORD_EMB_DIM,
    hidden_size=WORD_HIDDEN_SIZE,
    num_layers=WORD_NUM_LAYERS
).to(device)

print(f"Turn model created with {sum(p.numel() for p in turn_model.parameters()):,} parameters")

# Training setup
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(turn_model.parameters(), lr=WORD_LR)

# Training loop
total_batches = len(turn_dataloader)
turn_model.train()

TURN_EPOCHS = 3

for epoch in range(TURN_EPOCHS):
    total_loss = 0
    for i, (inputs, targets) in enumerate(turn_dataloader):
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        logits, _ = turn_model(inputs)
        loss = criterion(logits.view(-1, turn_dataset.vocab_size), targets.view(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(turn_model.parameters(), 5)
        optimizer.step()
        total_loss += loss.item()

        if i % 200 == 0:
            pct = (i / total_batches) * 100
            print(f"Epoch {epoch+1}/{TURN_EPOCHS} | Batch {i}/{total_batches} ({pct:.1f}%) | Loss: {loss.item():.4f}")

    avg_loss = total_loss / len(turn_dataloader)
    print(f"Epoch {epoch+1}/{TURN_EPOCHS} Complete | Avg Loss: {avg_loss:.4f}")

# Save turn-aware model
turn_model_path = "turn_rnn_model.pth"
torch.save({
    'model_state_dict': turn_model.state_dict(),
    'vocab_size': turn_dataset.vocab_size,
    'emb_dim': WORD_EMB_DIM,
    'hidden_size': WORD_HIDDEN_SIZE,
    'num_layers': WORD_NUM_LAYERS,
    'word_to_idx': turn_dataset.word_to_idx,
    'idx_to_word': turn_dataset.idx_to_word,
    'eos_token': turn_dataset.eos_token,
}, turn_model_path)
print(f"\nTurn-aware model saved to {turn_model_path}")


---

## Stage 2.5 – Theatrical Chat Interface (Model Comparison)

### Description

You can now use your trained models to build a chat interface that supports **both** the standard Word-RNN (from Exercise 2.2) and the Turn-Aware RNN (from Exercise 2.4).

### Key Differences

| Model | Generation Behavior |
|-------|---------------------|
| Standard Word-RNN | Generates exactly `max_tokens` words |
| Turn-Aware RNN | Stops when `<EOS>` is generated |

### Tasks

No tasks, just execute the code below and play around with it, to get a feeling for the performance of the two variants.

**The `generate_with_eos` function will be useful in the next tasks - you can and should use it, and should familiarize yourself with what it does.**


In [None]:
# Exercise 2.5: Theatrical Chat Interface (Model Comparison)

def generate_with_eos(model, start_text: str, max_tokens: int, temperature: float,
                      word_to_idx: dict, idx_to_word: dict, eos_token: str = None,
                      device='cpu') -> str:
    """
    Generate text with optional EOS stopping.

    Args:
        model: Trained language model
        start_text: Initial prompt
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature
        word_to_idx: Word to index mapping
        idx_to_word: Index to word mapping
        eos_token: If provided, stop when this token is generated
        device: Device to run on

    Returns:
        Generated text (excluding prompt tokens)
    """
    model.eval()
    model.to(device)
    hidden = None

    # Preprocess prompt - handle <eos> placeholder
    text_to_process = start_text.lower()
    if eos_token and "<eos>" in text_to_process:
        text_to_process = text_to_process.replace("<eos>", eos_token)

    tokens = tokenize(text_to_process)
    unk_idx = word_to_idx.get("<UNK>", 0)

    input_indices = [word_to_idx.get(t, unk_idx) for t in tokens]
    input_seq = torch.tensor(input_indices, dtype=torch.long).unsqueeze(0).to(device)

    generated_tokens = []

    # Get EOS index if applicable
    eos_idx = word_to_idx.get(eos_token, -1) if eos_token else -1

    with torch.no_grad():
        logits, hidden = model(input_seq, hidden)
        last_logits = logits[:, -1, :]

        for _ in range(max_tokens):
            if temperature <= 0:
                idx = torch.argmax(last_logits, dim=-1).item()
            else:
                probs = F.softmax(last_logits / temperature, dim=-1)
                idx = torch.multinomial(probs, 1).item()

            # Check for EOS
            if eos_token and idx == eos_idx:
                break

            word = idx_to_word.get(idx, "<UNK>")
            generated_tokens.append(word)

            input_seq = torch.tensor([[idx]], dtype=torch.long).to(device)
            logits, hidden = model(input_seq, hidden)
            last_logits = logits[:, -1, :]

    return detokenize(generated_tokens)

# Compare models
print("--- Exercise 2.5: Model Comparison ---")

examples = [
    ("ROMEO", "My heart is heavy with unspoken words."),
    ("JULIET", "Good even, my lord. Why art thou troubled?"),
]

for speaker, line in examples:
    print(f"\n{'='*70}")
    print(f"PROMPT: {speaker}: {line}")
    print('='*70)

    # Standard model (fixed length)
    standard_prompt = f"{speaker}.\n{line}\n"
    standard_response = generate_with_eos(
        word_model, standard_prompt, 50, 0.8,
        word_dataset.word_to_idx, word_dataset.idx_to_word,
        eos_token=None, device=device
    )
    print(f"\n[Standard Word-RNN (50 tokens)]")
    print(standard_response[:200] + "..." if len(standard_response) > 200 else standard_response)

    # Turn-aware model (EOS stopping)
    turn_prompt = f"<eos>\n{speaker}.\n{line}\n<eos>\n"
    turn_response = generate_with_eos(
        turn_model, turn_prompt, 200, 0.8,
        turn_dataset.word_to_idx, turn_dataset.idx_to_word,
        eos_token=turn_dataset.eos_token, device=device
    )
    print(f"\n[Turn-Aware RNN (EOS stopping)]")
    print(turn_response)


---

# Stage 3: Word-Level LSTM & RNN–LSTM Comparison

In this final stage, you will extend your word-level language model by replacing the RNN with an **LSTM**. You will:

- Implement and train a **word-level LSTM language model**
- Plug the LSTM into your existing **theatrical chat interface**
- Qualitatively compare the behavior of **RNN** vs **LSTM** models

## Why LSTM?

Long Short-Term Memory (LSTM) networks address the **vanishing gradient problem** in standard RNNs:

| Feature | RNN | LSTM |
|---------|-----|------|
| Memory | Short-term only | Long and short-term |
| Gradient flow | Degrades over long sequences | Gates preserve gradients |
| Training | Faster per step | More stable |
| Parameters | Fewer | ~4x more (3 gates + cell) |

---

## Exercise 3.1 – Word-Level LSTM Language Model in PyTorch

### Description

You will modify your word-level language model by replacing the RNN layer with an LSTM.

### Learning Objectives

- Implement a **word-level LSTM-based language model**
- Handle LSTM's dual hidden state `(h, c)`
- Train on the same turn-aware dataset for fair comparison

### Tasks

1. Define `WordLSTMLM` class with `nn.LSTM`. It has to fulfill the same conditions as normally, except this time, the model should return both the prediction and the LSTM's hidden state. **(1 point)**
2. Train on the same dataset as the RNN and compare training behavior in terms of loss, stability, etc. **(1 point)**


In [None]:
# Exercise 3.1: Word-Level LSTM Language Model

class WordLSTMLM(nn.Module):
    """
    Word-level LSTM Language Model.

    Key difference from RNN:
    - Uses nn.LSTM instead of nn.RNN
    - Hidden state is a tuple (h_n, c_n) where:
      - h_n: hidden state (same as RNN)
      - c_n: cell state (LSTM's long-term memory)
    """

    def __init__(self, vocab_size: int, emb_dim: int, hidden_size: int, num_layers: int = 3):
        """
        Initialize the LSTM model.

        Args:
            vocab_size: Number of words in vocabulary
            emb_dim: Dimension of word embeddings
            hidden_size: Number of hidden units in LSTM
            num_layers: Number of stacked LSTM layers
        """
        # Task 3.1: START STUDENT CODE

        # HINT: Build a word-level LSTM model (same as WordRNNLM but use nn.LSTM)
        pass

        # Task 3.1: END STUDENT CODE

    def forward(self, x, hidden=None):
        """
        Forward pass.

        Args:
            x: Input tensor of shape (batch, seq_len)
            hidden: Optional initial hidden state tuple (h_0, c_0)

        Returns:
            logits: Output logits of shape (batch, seq_len, vocab_size)
            hidden: Final hidden state tuple (h_n, c_n)
        """
        # Task 3.2: START STUDENT CODE

        # HINT: Forward pass through LSTM (same as RNN forward pass)
        pass

        # Task 3.2: END STUDENT CODE


print("--- Exercise 3.1: Word-Level LSTM Model ---")


In [None]:
# Exercise 3.1 (continued): Training LSTM Model

# Create LSTM model (same hyperparameters as RNN for fair comparison)
lstm_model = WordLSTMLM(
    vocab_size=turn_dataset.vocab_size,
    emb_dim=WORD_EMB_DIM,
    hidden_size=WORD_HIDDEN_SIZE,
    num_layers=WORD_NUM_LAYERS
).to(device)

print(f"LSTM model created with {sum(p.numel() for p in lstm_model.parameters()):,} parameters")
print(f"(Compare to RNN: ~{sum(p.numel() for p in turn_model.parameters()):,} parameters)")

# Training setup
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(lstm_model.parameters(), lr=WORD_LR)

# Training loop
total_batches = len(turn_dataloader)
lstm_model.train()

LSTM_EPOCHS = 3

for epoch in range(LSTM_EPOCHS):
    total_loss = 0
    for i, (inputs, targets) in enumerate(turn_dataloader):
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        logits, _ = lstm_model(inputs)
        loss = criterion(logits.view(-1, turn_dataset.vocab_size), targets.view(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(lstm_model.parameters(), 5)
        optimizer.step()
        total_loss += loss.item()

        if i % 200 == 0:
            pct = (i / total_batches) * 100
            print(f"Epoch {epoch+1}/{LSTM_EPOCHS} | Batch {i}/{total_batches} ({pct:.1f}%) | Loss: {loss.item():.4f}")

    avg_loss = total_loss / len(turn_dataloader)
    print(f"Epoch {epoch+1}/{LSTM_EPOCHS} Complete | Avg Loss: {avg_loss:.4f}")

# Save LSTM model
lstm_model_path = "word_lstm_model.pth"
torch.save({
    'model_state_dict': lstm_model.state_dict(),
    'vocab_size': turn_dataset.vocab_size,
    'emb_dim': WORD_EMB_DIM,
    'hidden_size': WORD_HIDDEN_SIZE,
    'num_layers': WORD_NUM_LAYERS,
    'word_to_idx': turn_dataset.word_to_idx,
    'idx_to_word': turn_dataset.idx_to_word,
    'eos_token': turn_dataset.eos_token,
}, lstm_model_path)
print(f"\nLSTM model saved to {lstm_model_path}")


---

## Exercise 3.2 – LSTM Chat Interface and RNN–LSTM Comparison

### Description

You will now plug your **word-level LSTM language model** into the theatrical chat interface and compare its behavior to the **word-level RNN** using identical prompts and settings.

### Learning Objectives

- Use a word-level LSTM for **prompt-based generation**
- Compare outputs of RNN vs LSTM models
- Reflect on advantages and limitations of both architectures

### Tasks

Plug the LSTM model into the chat interface you made and compare to the RNN chat generation, using the following comparison criteria **(1 point)**:

1. **Coherence**: Does the text make grammatical sense?
2. **Dialog structure**: Does it follow speaker patterns?
3. **Repetition**: Does one model repeat itself more?
4. **Creativity**: Which produces more varied outputs?

In [None]:
# Exercise 3.2: RNN vs LSTM Comparison

print("--- Exercise 3.2: RNN vs LSTM Comparison ---")

# Test prompts
comparison_prompts = [
    ("ROMEO", "My heart is heavy with unspoken words.", "JULIET"),
    ("JULIET", "Good even, my lord. Why art thou troubled?", "ROMEO"),
    ("HAMLET", "To be, or not to be, I ask again.", "HORATIO"),
]

for user_speaker, user_line, model_speaker in comparison_prompts:
    print(f"\n{'='*70}")
    print(f"{user_speaker}: {user_line}")
    print(f"(Response from: {model_speaker})")
    print('='*70)

    # Construct prompts
    prompt = f"<eos>\n{user_speaker}.\n{user_line}\n<eos>\n{model_speaker}.\n"

    # RNN response
    # Task 3.3: START STUDENT CODE

    # HINT: Generate rnn_response using the turn_model (RNN) with the prompt
    pass

    # Task 3.3: END STUDENT CODE
    print(f"\n[RNN] {model_speaker}:")
    print(rnn_response)

    # LSTM response
    # Task 3.4: START STUDENT CODE

    # HINT: Generate lstm_response using the lstm_model (LSTM) with the same prompt
    pass

    # Task 3.4: END STUDENT CODE
    print(f"\n[LSTM] {model_speaker}:")
    print(lstm_response)


---

# Stage 4: Fine-Tuning Modern Language Models with Hugging Face 🤗

In this final stage, you will learn to use the **Hugging Face ecosystem** to fine-tune a modern pretrained language model for Shakespearean dialogue generation.

## 📚 Introduction to Hugging Face

[Hugging Face](https://huggingface.co/) is the leading platform for machine learning, offering:

- **🤗 Transformers**: A library with 200,000+ pretrained models
- **📦 Datasets**: Easy-to-use datasets for ML
- **🏋️ Trainer**: High-level API for training models
- **🌐 Hub**: Share and discover models

### Essential Resources

| Resource | Link | Description |
|----------|------|-------------|
| HF Course | [huggingface.co/learn](https://huggingface.co/learn/nlp-course) | Free NLP course |
| Transformers Docs | [huggingface.co/docs/transformers](https://huggingface.co/docs/transformers) | Official documentation |
| Model Hub | [huggingface.co/models](https://huggingface.co/models) | Browse all models |

## Model Choice: Qwen2.5-0.5B

We use **Qwen2.5-0.5B** (released October 2024) because:
- ✅ Modern architecture (late 2024 release)
- ✅ Small but capable (494M parameters)
- ✅ Fast to fine-tune on consumer GPUs
- ✅ Strong base capabilities for text generation

**Alternative**: If Qwen is unavailable, you can use `HuggingFaceTB/SmolLM2-360M` (135-360M params, Nov 2024).

---


## Exercise 4.1 – Introduction to Hugging Face Transformers

### Learning Objectives

- Understand the core Hugging Face abstractions: **Tokenizer**, **Model**, **Trainer**
- Load pretrained models from the Hugging Face Hub
- Explore tokenization and understand how text becomes numbers

### Key Concepts

```
┌─────────────────────────────────────────────────────────────┐
│                    Hugging Face Pipeline                    │
├─────────────────────────────────────────────────────────────┤
│  Text → [Tokenizer] → Token IDs → [Model] → Logits → Text  │
└─────────────────────────────────────────────────────────────┘
```

- **Tokenizer**: Converts text ↔ token IDs (integers)
- **Model**: Neural network that processes token IDs
- **AutoClass**: Automatically selects the right class for any model

### 💡 Hint

The `Auto` classes (AutoTokenizer, AutoModel) are magic! They detect the model type and load the correct implementation automatically.


In [None]:
# Exercise 4.1: Introduction to Hugging Face

# Step 1: Install required packages
# Uncomment for Colab/fresh environment:
# !pip install transformers datasets accelerate -q

# Step 2: Import the core libraries
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = "mps"
print("✅ Hugging Face Transformers imported successfully!")
print(f"   Device: {device}")

# ============================================================
# UNDERSTANDING HUGGING FACE: Loading a Pretrained Model
# ============================================================

# Model identifier on Hugging Face Hub
# Browse models at: https://huggingface.co/models?sort=trending
MODEL_NAME = "Qwen/Qwen2.5-0.5B"  # 494M params, released Oct 2024

# Alternative smaller models (uncomment if needed):
# MODEL_NAME = "HuggingFaceTB/SmolLM2-360M"  # 360M params, Nov 2024
# MODEL_NAME = "HuggingFaceTB/SmolLM2-135M"  # 135M params, Nov 2024

print(f"\n📥 Loading model: {MODEL_NAME}")
print("   (First run downloads the model, subsequent runs use cache)\n")

In [None]:
# Exercise 4.1 (continued): Understanding Tokenizers

# ============================================================
# TOKENIZER: Converting Text ↔ Numbers
# ============================================================

# Load the tokenizer - this downloads vocabulary and tokenization rules
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Some models don't have a pad token - we set it to the EOS token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("✅ Tokenizer loaded!")
print(f"   Vocabulary size: {len(tokenizer):,} tokens")
print(f"   Model max length: {tokenizer.model_max_length}")

# ============================================================
# EXPLORE: How tokenization works
# ============================================================

sample_text = "Romeo, Romeo! Wherefore art thou Romeo?"

# Tokenize the text
tokens = tokenizer.tokenize(sample_text)
token_ids = tokenizer.encode(sample_text)

print(f"\n📝 Sample text: '{sample_text}'")
print(f"\n🔤 Tokens ({len(tokens)}): {tokens}")
print(f"\n🔢 Token IDs: {token_ids}")

# Decode back to text
decoded = tokenizer.decode(token_ids)
print(f"\n🔙 Decoded: '{decoded}'")

In [None]:
# Exercise 4.1 (continued): Loading the Model

# ============================================================
# MODEL: The Neural Network
# ============================================================

print("📥 Loading model (this may take a minute)...\n")

# AutoModelForCausalLM = Auto Model for Causal Language Modeling
# "Causal" means the model predicts the next token (like GPT)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)
model = model.to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("✅ Model loaded!")
print(f"   Total parameters: {total_params:,} ({total_params/1e6:.1f}M)")
print(f"   Trainable parameters: {trainable_params:,}")
print(f"   Device: {next(model.parameters()).device}")

# ============================================================
# TEST: Generate some text!
# ============================================================

print("\n" + "="*50)
print("🎭 Quick test - generating text...")

test_prompt = "To be, or not to be"
inputs = tokenizer(test_prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=30,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.pad_token_id,
    )

generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"\nPrompt: '{test_prompt}'")
print(f"Generated: '{generated}'")

---

## Exercise 4.2 – Preparing Data and Fine-Tuning

### Learning Objectives

- Create a dataset compatible with Hugging Face `Trainer`
- Add custom special tokens to mark dialogue structure
- Configure and run fine-tuning with the `Trainer` API

### Key Concepts

**Special Tokens**: We add custom tokens to help the model understand dialogue:
```
<|speaker|>ROMEO<|endname|>My text here<|endturn|>
```

**Dataset**: Hugging Face uses Arrow format for efficient data loading.

**Trainer**: High-level API that handles:
- Training loop
- Gradient accumulation
- Mixed precision (FP16)
- Checkpointing
- Evaluation

### 📖 Further Reading

- [Fine-tuning tutorial](https://huggingface.co/docs/transformers/training)
- [Trainer documentation](https://huggingface.co/docs/transformers/main_classes/trainer)


In [None]:
# Exercise 4.2: Preparing the Shakespeare Dataset

from datasets import Dataset
import re

# ============================================================
# SPECIAL TOKENS: Marking dialogue structure
# ============================================================

SPEAKER_TOKEN = "<|speaker|>"
ENDNAME_TOKEN = "<|endname|>"
ENDTURN_TOKEN = "<|endturn|>"

# Add special tokens to the tokenizer
special_tokens = {
    "additional_special_tokens": [SPEAKER_TOKEN, ENDNAME_TOKEN, ENDTURN_TOKEN]
}
num_added = tokenizer.add_special_tokens(special_tokens)

# IMPORTANT: Resize model embeddings to match new vocabulary
model.resize_token_embeddings(len(tokenizer))

print(f"✅ Added {num_added} special tokens")
print(f"   New vocabulary size: {len(tokenizer):,}")
print(f"   Format: {SPEAKER_TOKEN}NAME{ENDNAME_TOKEN}dialogue{ENDTURN_TOKEN}")

In [None]:
# Exercise 4.2 (continued): Extract dialogues from Shakespeare

def extract_dialogues(text):
    """Extract (speaker, dialogue) pairs from Shakespeare text."""
    dialogues = []
    lines = text.split('\n')
    current_speaker, current_lines = None, []

    for line in lines:
        stripped = line.strip()
        # Detect speaker: short, uppercase lines
        if stripped and len(stripped) < 40:
            alpha = re.sub(r'[^A-Za-z]', '', stripped)
            if alpha and alpha.isupper() and len(alpha) > 1:
                if current_speaker and current_lines:
                    text = ' '.join(current_lines).strip()
                    if len(text) > 20:
                        dialogues.append((current_speaker, text))
                current_speaker = stripped.rstrip('.').strip()
                current_lines = []
                continue
        if current_speaker and stripped:
            current_lines.append(stripped)

    return dialogues

def format_for_training(dialogues, context_turns=2):
    """Format dialogues with special tokens."""
    examples = []
    for i in range(len(dialogues)):
        start = max(0, i - context_turns)
        parts = []
        for speaker, text in dialogues[start:i+1]:
            parts.append(f"{SPEAKER_TOKEN}{speaker}{ENDNAME_TOKEN}{text}{ENDTURN_TOKEN}")
        full_text = "".join(parts)
        if len(full_text) > 50:
            examples.append(full_text)
    return examples

# Load Shakespeare works
print("📚 Loading Shakespeare corpus...")

all_dialogues = []
work_files = sorted([f for f in os.listdir(WORKS_DIR) if f.endswith('.txt')])
selected = [f for f in work_files if any(x in f for x in ['romeo', 'hamlet', 'macbeth'])][:3]

if not selected:
    selected = work_files[:3]

for fname in selected:
    with open(os.path.join(WORKS_DIR, fname), 'r') as f:
        dialogues = extract_dialogues(f.read())
        all_dialogues.extend(dialogues)
        print(f"   {fname}: {len(dialogues)} turns")

# Format and create dataset
training_texts = format_for_training(all_dialogues, context_turns=2)
print(f"\n✅ Created {len(training_texts)} training examples")

# Show example
print(f"\n📝 Sample formatted dialogue:")
print(training_texts[5][:200] + "...")

In [None]:
# Exercise 4.2 (continued): Fine-tuning with Trainer

from transformers import (
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)

# ============================================================
# CREATE DATASET
# ============================================================

# Split into train/val
split_idx = int(len(training_texts) * 0.9)
train_texts = training_texts[:split_idx]
val_texts = training_texts[split_idx:]

# Tokenize all examples
MAX_LENGTH = 256

def tokenize_texts(texts):
    return tokenizer(
        texts,
        truncation=True,
        max_length=MAX_LENGTH,
        padding="max_length",
    )

train_encodings = tokenize_texts(train_texts)
val_encodings = tokenize_texts(val_texts)

# Create HuggingFace Datasets
train_dataset = Dataset.from_dict({
    "input_ids": train_encodings["input_ids"],
    "attention_mask": train_encodings["attention_mask"],
})
val_dataset = Dataset.from_dict({
    "input_ids": val_encodings["input_ids"],
    "attention_mask": val_encodings["attention_mask"],
})

print(f"✅ Datasets created")
print(f"   Train: {len(train_dataset)} | Val: {len(val_dataset)}")

In [None]:
# Exercise 4.2 (continued): Configure and run Trainer

# ============================================================
# TRAINING CONFIGURATION
# ============================================================

training_args = TrainingArguments(
    output_dir="./shakespeare_model",

    # Training parameters
    num_train_epochs=2,                    # Number of passes through the data
    per_device_train_batch_size=2,         # Samples per GPU per step
    gradient_accumulation_steps=8,         # Effective batch = 2 * 8 = 16

    # Learning rate
    learning_rate=2e-5,                    # Fine-tuning uses smaller LR
    warmup_steps=50,                       # Gradual LR warmup

    # Logging
    logging_steps=25,
    eval_strategy="steps",
    eval_steps=100,

    # Saving
    save_strategy="steps",
    save_steps=200,
    save_total_limit=2,

    # Performance
    fp16=torch.cuda.is_available(),        # Mixed precision on GPU

    # Misc
    load_best_model_at_end=True,
    report_to="none",
)

# Data collator: handles padding and creates labels for CLM
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Causal LM, not Masked LM
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

print("✅ Trainer configured!")
print(f"   Epochs: {training_args.num_train_epochs}")
print(f"   Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"   Learning rate: {training_args.learning_rate}")

In [None]:
# Exercise 4.2 (continued): Run training

print("🏋️ Starting fine-tuning...")
print("   (This may take 5-15 minutes on GPU, longer on CPU)\n")

# Train!
train_result = trainer.train()

print("\n" + "="*50)
print("✅ Training complete!")
print(f"   Final loss: {train_result.training_loss:.4f}")

# Save the model
SAVE_PATH = "./shakespeare_finetuned"
trainer.save_model(SAVE_PATH)
tokenizer.save_pretrained(SAVE_PATH)
print(f"   Model saved to: {SAVE_PATH}")

---

## Exercise 4.3 – Building the Chat Interface

### Learning Objectives

- Generate text with a fine-tuned model using `model.generate()`
- Understand generation parameters: `temperature`, `top_p`, `repetition_penalty`
- Build an interactive dialogue system

### Generation Parameters Explained

| Parameter | Effect | Typical Values |
|-----------|--------|----------------|
| `temperature` | Randomness (higher = more creative) | 0.7 - 1.0 |
| `top_p` | Nucleus sampling (cumulative probability) | 0.9 - 0.95 |
| `repetition_penalty` | Penalize repeated tokens | 1.1 - 1.3 |
| `max_new_tokens` | Maximum tokens to generate | 50 - 200 |

### 📖 Further Reading

- [Text Generation Strategies](https://huggingface.co/docs/transformers/generation_strategies)


In [None]:
# Exercise 4.3: Text Generation Function

def generate_response(
    prompt: str,
    responding_speaker: str,
    max_new_tokens: int = 100,
    temperature: float = 0.8,
    top_p: float = 0.9,
) -> str:
    """
    Generate a response from the specified speaker.

    💡 Key insight: We format the prompt with special tokens,
       then let the model continue from the speaker's name.
    """
    model.eval()

    # Build the full prompt
    full_prompt = f"{prompt}{SPEAKER_TOKEN}{responding_speaker}{ENDNAME_TOKEN}"

    # Tokenize
    inputs = tokenizer(full_prompt, return_tensors="pt").to(device)

    # Get end token ID for stopping
    endturn_id = tokenizer.convert_tokens_to_ids(ENDTURN_TOKEN)

    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.2,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=[tokenizer.eos_token_id, endturn_id],
        )

    # Decode and extract response
    full_output = tokenizer.decode(outputs[0], skip_special_tokens=False)

    # Find the response part
    marker = f"{SPEAKER_TOKEN}{responding_speaker}{ENDNAME_TOKEN}"
    if marker in full_output:
        response = full_output.split(marker)[-1]
        response = response.replace(ENDTURN_TOKEN, "").replace(tokenizer.eos_token, "")
    else:
        response = full_output[len(full_prompt):]

    return response.strip()

# Test generation
print("🎭 Testing generation...\n")

test_prompt = f"{SPEAKER_TOKEN}ROMEO{ENDNAME_TOKEN}What light through yonder window breaks?{ENDTURN_TOKEN}"
response = generate_response(test_prompt, "JULIET")

print(f"ROMEO: What light through yonder window breaks?")
print(f"\nJULIET: {response}")

In [None]:
# Exercise 4.3 (continued): Interactive Chat Interface

class ShakespeareChat:
    """Simple chat interface for Shakespearean dialogue."""

    def __init__(self, user_name="ROMEO", ai_name="JULIET"):
        self.user_name = user_name.upper()
        self.ai_name = ai_name.upper()
        self.history = ""

    def chat(self, user_message: str, temperature: float = 0.8) -> str:
        """Send a message and get a response."""
        # Add user message to history
        self.history += f"{SPEAKER_TOKEN}{self.user_name}{ENDNAME_TOKEN}{user_message}{ENDTURN_TOKEN}"

        # Generate response
        response = generate_response(self.history, self.ai_name, temperature=temperature)

        # Add AI response to history
        self.history += f"{SPEAKER_TOKEN}{self.ai_name}{ENDNAME_TOKEN}{response}{ENDTURN_TOKEN}"

        # Keep history manageable (last 4 turns)
        turns = self.history.split(ENDTURN_TOKEN)
        if len(turns) > 8:
            self.history = ENDTURN_TOKEN.join(turns[-8:]) + ENDTURN_TOKEN

        return response

    def reset(self, user_name=None, ai_name=None):
        """Reset conversation with optional new speakers."""
        if user_name:
            self.user_name = user_name.upper()
        if ai_name:
            self.ai_name = ai_name.upper()
        self.history = ""

# Demo conversation
print("\n" + "="*60)
print("🎭 SHAKESPEAREAN CHAT DEMO")
print("="*60)

chat = ShakespeareChat("ROMEO", "JULIET")

demo_lines = [
    "But soft! What light through yonder window breaks?",
    "My love for thee knows no bounds.",
]

for line in demo_lines:
    print(f"\nROMEO: {line}")
    response = chat.chat(line)
    print(f"JULIET: {response}")
    print("-" * 40)

In [None]:
# Exercise 4.3 (continued): Run Interactive Mode

def run_interactive_chat():
    """Run an interactive chat session."""
    print("\n" + "="*60)
    print("🎭 SHAKESPEAREAN CHAT")
    print("="*60)
    print("\nCharacters: ROMEO, JULIET, HAMLET, OPHELIA, MACBETH, etc.")

    user = input("\nYou are (default ROMEO): ").strip() or "ROMEO"
    ai = input("AI is (default JULIET): ").strip() or "JULIET"

    chat = ShakespeareChat(user, ai)
    print(f"\n🎭 {user} speaks with {ai}. Type 'quit' to exit.\n")

    while True:
        try:
            msg = input(f"{chat.user_name}: ").strip()
            if msg.lower() in ['quit', 'exit', 'q']:
                print("\n🎭 Exeunt omnes!")
                break
            if msg:
                response = chat.chat(msg)
                print(f"{chat.ai_name}: {response}\n")
        except KeyboardInterrupt:
            print("\n\n🎭 Exeunt omnes!")
            break

# Uncomment to run interactive chat:
# run_interactive_chat()

---

# 🎭 Congratulations!

You have completed all stages of this NLP workshop!

## What You Learned

| Stage | Topics |
|-------|--------|
| **0** | Corpus processing, tokenization, GloVe embeddings |
| **1** | Character-level RNN, PyTorch fundamentals |
| **2** | Word-level RNN, vocabulary handling, turn-based modeling |
| **3** | LSTM architecture, RNN vs LSTM comparison |
| **4** | Hugging Face ecosystem, fine-tuning transformers |

## Key Takeaways

1. **Evolution of NLP**: RNN → LSTM → Transformer (each solving limitations of the previous)
2. **Transfer Learning**: Fine-tuning pretrained models is more efficient than training from scratch
3. **Hugging Face**: Industry-standard platform for NLP/ML
4. **Special Tokens**: Help models understand structure (dialogue, turns, speakers)

## 📚 Continue Learning

- [Hugging Face NLP Course](https://huggingface.co/learn/nlp-course)
- [Fine-tuning LLMs Guide](https://huggingface.co/docs/transformers/training)
- [Text Generation Strategies](https://huggingface.co/docs/transformers/generation_strategies)

---

*"All the world's a stage, and all the men and women merely players."* 🎭
