# Lab 1 + Lab 2 Lesson 2: BPE, Zipf slope, and n-gram basics

**What we use**
- `datasets` (Hugging Face) for loading and splitting datasets.
- `transformers` (Hugging Face) for BPE tokenization via `AutoTokenizer`.
- `numpy` for log-log slope fitting and random sampling utilities.
- `Counter` for frequency counts.

**Goals**
- Apply BPE tokenization and inspect subword behavior.
- Fit a Zipf slope on log-log axes and interpret it.
- Start Lab 2: create train/dev/test splits and handle unknown tokens.
- Build n-gram counts, add smoothing, and compute perplexity.
- Generate short samples with top-k sampling.

**Structure**
1) Load a dataset subset.
2) BPE tokenization practice.
3) Zipf slope fitting.
4) Lab 2 intro: split + UNK handling.
5) N-gram counts + add-alpha smoothing.
6) Perplexity + simple grid search.
7) Top-k sampling.


In [1]:
import random
import numpy as np
from collections import Counter
from datasets import load_dataset
from transformers import AutoTokenizer

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

**Why the fixed seed matters here**
- Together with the explicit `seed=SEED` we pass to `train_test_split`, it makes the train/dev/test split reproducible.
- It stabilizes `np.random.choice` in top-k sampling so examples are repeatable.
- It does not change the dataset content itself, only the randomized operations.
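
A quick way to see the effect of seeding is to re-seed and draw twice. This is a minimal sketch; the helper `draw` is a hypothetical name introduced only for this illustration.

```python
import random
import numpy as np

def draw(seed):
    # Re-seed both global generators, then draw from each of them
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.choice(10, size=5).tolist()

first = draw(42)
second = draw(42)
print(first == second)  # same seed, same draws
```

Without the re-seeding step, the second call would continue from the generator's current state and usually return different values.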


## Step 1: Load a dataset subset
Choose a dataset and a text field. Keep a small subset for speed.


In [2]:
# Dataset choice
# Examples: name = 'ag_news' (text), 'imdb' (text), 'yelp_polarity' (text)
name = 'ag_news'
config = None
text_field = 'text'

if config:
    ds = load_dataset(name, config, split='train')
else:
    ds = load_dataset(name, split='train')

subset = ds.select(range(2000))
print(subset.features)
print(subset[0])


{'text': Value('string'), 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'])}
{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2}


In [3]:
for i in range(3):
    print('---')
    print(subset[i][text_field])


---
Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
---
Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
---
Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.


In [4]:
texts = [ex[text_field] for ex in subset if ex[text_field].strip()]
print('sample texts:', len(texts))


sample texts: 2000


## Step 2: BPE tokenization
We use a pretrained BPE tokenizer to see how words are split into subwords.


**About `AutoTokenizer` and BPE**
- `AutoTokenizer` loads a pretrained tokenizer from the Hugging Face Hub.
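- For GPT-2, this is a byte-level Byte-Pair Encoding (BPE) tokenizer that splits words into subwords. Spaces are part of the tokens: a leading `Ġ` marks a token that was preceded by a space.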
- `tokenizer(text)` returns `input_ids` (integers), and `convert_ids_to_tokens` shows subword pieces.
- We set `add_special_tokens=False` to keep the raw segmentation visible.


**What is BPE?**
Byte-Pair Encoding (BPE) builds a subword vocabulary by repeatedly merging
the most frequent *adjacent* symbol pairs. Symbols start as characters and
become longer subwords after merges.

BPE is an iterative process: after each merge, pairs are re-counted and the next most frequent pair is merged.

Example (simplified):
$$t\ h\ e\ r\ e \xrightarrow{(t,h)} th\ e\ r\ e \xrightarrow{(th,e)} the\ r\ e$$
Here `(t,h)` is chosen because it is the most frequent adjacent pair in the corpus at that step;
then `(th,e)` becomes frequent and is merged next. That is why the tokens become `the | r | e`.
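
The merge loop described above can be sketched on a toy corpus. This is a simplified character-level illustration, not GPT-2's byte-level implementation; the corpus counts and the helper names `most_frequent_pair` and `merge_pair` are assumptions made for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (hypothetical counts): word -> frequency; words start as character tuples
toy_corpus = {"the": 5, "there": 2, "then": 2, "red": 1, "that": 1}
words = {tuple(w): f for w, f in toy_corpus.items()}

merges = []
for _ in range(2):
    pair = most_frequent_pair(words)  # pairs are re-counted after every merge
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

On this corpus the first merge is `(t, h)` and the second is `(th, e)`, so `there` ends up segmented as `the | r | e`, matching the example above.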

**Can we compute Zipf on BPE tokens?** Yes. If you count BPE tokens instead of words,
you can make a Zipf plot for subword units. The curve shape can change because the vocabulary
now includes short subword pieces.

**Where BPE is used**
BPE tokenization is the standard input step for many pretrained transformer models.
In later labs, we will feed BPE token IDs into models and compare tokenization effects.


In [58]:
tokenizer = AutoTokenizer.from_pretrained('gpt2')


In [None]:
texts

In [None]:
tokenizer.vocab

In [8]:
sample = texts[:3]

for s in sample:
    enc = tokenizer(s, add_special_tokens=False)
    ids = enc["input_ids"]
    tokens = tokenizer.convert_ids_to_tokens(ids)
    dec = tokenizer.decode(ids)
    print("text", s[:100])
    print("BPE tokens", tokens[:30])
    print("dec", dec)
    print("dummy join", ' '.join(tokens)[:100])
    print()

text Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\b
BPE tokens ['Wall', 'ĠSt', '.', 'ĠBears', 'ĠClaw', 'ĠBack', 'ĠInto', 'Ġthe', 'ĠBlack', 'Ġ(', 'Reuters', ')', 'ĠReuters', 'Ġ-', 'ĠShort', '-', 'sell', 'ers', ',', 'ĠWall', 'ĠStreet', "'s", 'Ġdwindling', '\\', 'band', 'Ġof', 'Ġultra', '-', 'cy', 'n']
dec Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
dummy join Wall ĠSt . ĠBears ĠClaw ĠBack ĠInto Ġthe ĠBlack Ġ( Reuters ) ĠReuters Ġ- ĠShort - sell ers , ĠWall Ġ

text Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,
BPE tokens ['Car', 'ly', 'le', 'ĠLooks', 'ĠTow', 'ard', 'ĠCommercial', 'ĠAerospace', 'Ġ(', 'Reuters', ')', 'ĠReuters', 'Ġ-', 'ĠPrivate', 'Ġinvestment', 'Ġfirm', 'ĠCarly', 'le', 'ĠGroup', ',', '\\', 'which', 'Ġhas', 'Ġa', 'Ġreputation

## Step 3: Zipf slope fitting
We fit a line to log(rank) vs log(freq) for ranks 10-100 to estimate the slope.


**About Zipf slope fitting**
- We compute token frequencies from a simple whitespace tokenizer.
- On a log-log plot, Zipf-like data forms an approximate straight line.
- We fit only a middle rank range (start/end) to avoid very frequent function words
  at the head and sparse, noisy counts in the tail.
- The slope is negative; a less steep slope often suggests higher lexical diversity.


**How `np.polyfit` works here**
- `np.polyfit(x, y, 1)` fits a line `y = m*x + b` by least squares.
- It returns `[m, b]` where `m` is the slope and `b` is the intercept.
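- On log-log data, Zipf's law $\text{freq} \propto \text{rank}^{-s}$ becomes a straight line, so the fitted slope `m` estimates $-s$, the negated Zipf exponent.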
- We fit only a middle rank range to reduce head/tail distortion.
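
As a sanity check of the fitting procedure, here is a minimal sketch on synthetic data. The frequencies `1000 / rank` are an assumption chosen so that the true slope is known in advance; real corpus counts will be noisier.

```python
import numpy as np

# Synthetic Zipf-like frequencies (assumed for illustration): freq = 1000 / rank
ranks = np.arange(1, 501)
freqs = 1000.0 / ranks

# Fit only a middle rank range, here ranks 10..100 inclusive
start, end = 10, 100
mask = (ranks >= start) & (ranks <= end)
log_ranks = np.log10(ranks[mask])
log_freqs = np.log10(freqs[mask])

# polyfit returns the coefficients highest degree first: [slope, intercept]
slope, intercept = np.polyfit(log_ranks, log_freqs, 1)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Because the data follow `freq = 10^3 * rank^(-1)` exactly, the fit should recover a slope close to -1 and an intercept close to 3.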


In [9]:
def tokenize_whitespace(text):
    return text.lower().split()

def get_token_counts(texts):
    counts = Counter()
    for t in texts:
        counts.update(tokenize_whitespace(t))
    return counts

texts = [ex[text_field] for ex in subset if ex[text_field].strip()]
counts = get_token_counts(texts)
freqs = sorted(counts.values(), reverse=True)
ranks = np.arange(1, len(freqs) + 1)


In [10]:
# TODO: choose a rank range (e.g., 10-100)
# TODO: compute log_ranks and log_freqs
# Hint: np.log10 or np.log
# TODO: fit a line with np.polyfit(log_ranks, log_freqs, 1)
# TODO: print the slope and intercept

# Write your code below


**Fit line formula (log-log space)**
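We fit a line in log space and then map it back to frequency space. The formulas below assume base-10 logs (`np.log10`); with natural logs (`np.log`), replace $10^{b}$ by $e^{b}$.
$$y = m x + b$$
$$x = \log_{10}(\text{rank}), \quad y = \log_{10}(\text{freq})$$
$$\text{freq} = 10^{b} \cdot \text{rank}^{m}$$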
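
The choice of log base changes the intercept but not the slope. The sketch below checks this on hypothetical power-law frequencies (`500 * rank^(-0.8)` is an assumed example, not data from the lab) and maps both fits back to frequency space.

```python
import numpy as np

ranks = np.arange(10, 101, dtype=float)
freqs = 500.0 * ranks ** -0.8  # hypothetical frequencies following a power law

m10, b10 = np.polyfit(np.log10(ranks), np.log10(freqs), 1)  # base-10 logs
me, be = np.polyfit(np.log(ranks), np.log(freqs), 1)        # natural logs

# Map each fitted line back to frequency space with the matching base
fit_from_log10 = 10 ** b10 * ranks ** m10
fit_from_ln = np.exp(be) * ranks ** me

print(m10, me)    # the two slopes should agree
print(b10, be)    # the intercepts differ by a factor of ln(10)
```

Mixing bases, for example fitting with `np.log` and then using `10 ** b`, would give a line that is shifted away from the data.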


In [None]:
# TODO: plot log-log Zipf and the fitted line
# Hint: import matplotlib.pyplot as plt, then use plt.loglog for the data
# Hint: use the fitted slope/intercept to build the line

# Write your code below


## Step 4: Lab 2 intro - dataset splits and UNK handling
We create train/dev/test splits and replace rare tokens with `<UNK>`.


**About splits and `<UNK>`**
- We create train/dev/test splits so we can tune on dev and evaluate on test.
- The *train* split is used to build the vocabulary and estimate n-gram counts.
- Tokens not in the training vocab are replaced with `<UNK>` to handle unseen words.

**Why tune on the dev set?**
- Hyperparameters (like `alpha` for smoothing or `k` for sampling) are chosen to
  perform well on dev.
- We avoid tuning on test to prevent optimistic, biased evaluation.
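
To see what proportions a two-stage split produces, here is a minimal pure-Python sketch. It is a stand-in for illustration only, not how `datasets` implements `train_test_split`; the function `two_stage_split` and the `*_toy` names are hypothetical.

```python
import random

def two_stage_split(items, test_size=0.2, dev_size=0.2, seed=42):
    """Shuffle once, hold out a test set, then hold out dev from the remainder."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_dev = round(len(rest) * dev_size)
    dev, train = rest[:n_dev], rest[n_dev:]
    return train, dev, test

train_toy, dev_toy, test_toy = two_stage_split(range(1000))
print(len(train_toy), len(dev_toy), len(test_toy))
```

Holding out 20% for test and then 20% of the remainder for dev leaves 64% / 16% / 20% for train / dev / test.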


In [11]:
# TODO: split into train/dev/test
# Hint: ds.train_test_split(test_size=...)
# Hint: split the train portion again to make dev

# Write your code below


split = ds.train_test_split(test_size=0.2, seed=SEED)

train_full = split["train"]
test = split["test"]

train_split = train_full.train_test_split(test_size=0.2, seed=SEED)
train = train_split["train"]
dev = train_split["test"]


print("train", len(train))
print("test", len(test))
print("dev", len(dev))

train 76800
test 24000
dev 19200


In [12]:
def tokenize_whitespace(text):
    return text.lower().split()

def get_token_counts(texts):
    counts = Counter()
    for t in texts:
        counts.update(tokenize_whitespace(t))
    return counts

def extract_texts(subset):
    return [ex[text_field] for ex in subset if ex[text_field].strip()]

In [35]:
tokenizer("asdas")['input_ids']

[292, 67, 292]

In [38]:
tokenizer.unk_token_id

50256

In [13]:
d = {"a": 1, "b": 12}
for key in d:
    print(type(key), type(d), type("a"))
    print(key, d[key])
    print(d["a"])

<class 'str'> <class 'dict'> <class 'str'>
a 1
1
<class 'str'> <class 'dict'> <class 'str'>
b 12
1


In [14]:
d.keys(), d.values(), d.items()

(dict_keys(['a', 'b']),
 dict_values([1, 12]),
 dict_items([('a', 1), ('b', 12)]))

In [15]:
d = {"a": 1, "b": 12}
for key, value in d.items():
    print(key, value)


{key: value for key, value in d.items() if value < 10}

a 1
b 12


{'a': 1}

In [16]:
train

Dataset({
    features: ['text', 'label'],
    num_rows: 76800
})

In [17]:
??extract_texts

Signature: extract_texts(subset)
Docstring: <no docstring>
Source:   
def extract_texts(subset):
    return [ex[text_field] for ex in subset if ex[text_field].strip()]
File:      /tmp/ipykernel_10156/1411551957.py
Type:      function

In [None]:
def build_vocab(texts, min_freq=2):
    # Count GPT-2 token ids over all texts and keep the ids seen at least min_freq times.
    # Returns a Counter {token_id: count}.
    counts = Counter()
    for text in texts:
        counts.update(tokenizer(text)['input_ids'])
    return Counter({token_id: count for token_id, count in counts.items() if count >= min_freq})

# With type annotations
def replace_unk(tokens: list[int], vocab: Counter) -> list[int]:
    # Replace token ids that are not in vocab with the tokenizer's unk id.
    # Note: for GPT-2 this id (50256) is shared with the bos/eos token <|endoftext|>.
    filtered = []
    for token in tokens:
        if token not in vocab:
            filtered.append(tokenizer.unk_token_id)
        else:
            filtered.append(token)
    return filtered



# TODO: use train texts to build vocab
# TODO: apply replace_unk to a few sample sentences

# Write your code below


In [19]:
vocab = build_vocab(extract_texts(train))

In [22]:
tokenizer.unk_token_id

50256

In [None]:
test_sentence = "ASj-fA2SDq hello world"
test_sentence

'ASj-fA2SDq hello world'

In [46]:
before = tokenizer(test_sentence)["input_ids"]
after = replace_unk(before, vocab)

In [47]:
before[:10], after[:10]

([1921, 73, 12, 69, 32, 17, 10305, 80, 23748, 995],
 [1921, 73, 12, 69, 32, 17, 10305, 80, 23748, 995])

In [79]:
vocab.most_common(10)

[(13, 126136),
 (262, 114031),
 (11, 102443),
 (284, 75474),
 (257, 63124),
 (286, 62716),
 (287, 59281),
 (26, 55133),
 (12, 51757),
 (290, 43971)]

In [89]:
[key for key, value in vocab.most_common(10)]

[13, 262, 11, 284, 257, 286, 287, 26, 12, 290]

In [90]:
tokenizer.convert_ids_to_tokens([token_id for token_id, count in vocab.most_common(10)])

['.', 'Ġthe', ',', 'Ġto', 'Ġa', 'Ġof', 'Ġin', ';', '-', 'Ġand']

**Finding a real `<UNK>` replacement**
We want an example sentence where at least one token is *not* in the vocab.
If this is rare, increase `min_freq` to force more words to become `<UNK>`.
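
The effect of `min_freq` is easiest to see on a toy example. The sketch below uses whitespace tokens and a literal `<UNK>` string for readability, whereas the cells above work on GPT-2 token ids; `build_toy_vocab` and `replace_rare` are hypothetical helpers for this illustration.

```python
from collections import Counter

UNK = "<UNK>"  # placeholder string; the cells above use tokenizer.unk_token_id instead

def build_toy_vocab(sentences, min_freq):
    # Keep only tokens that occur at least min_freq times
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return {tok for tok, c in counts.items() if c >= min_freq}

def replace_rare(tokens, vocab):
    return [tok if tok in vocab else UNK for tok in tokens]

toy_train = ["the cat sat", "the dog sat", "the cat ran"]
sentence = "the dog ran".split()

for min_freq in (1, 2, 3):
    vocab_toy = build_toy_vocab(toy_train, min_freq)
    print(min_freq, replace_rare(sentence, vocab_toy))
```

With `min_freq=1` nothing is replaced; raising it to 2 already turns `dog` and `ran` (each seen once) into `<UNK>`.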


In [None]:
# TODO: find a sentence where replace_unk inserts the unk id (tokenizer.unk_token_id)
# Hint: build a temporary vocab with higher min_freq (e.g., 5 or 10)
# Hint: scan the train texts and check if tokenizer.unk_token_id appears in the replaced ids
# Hint: print a window around the first replaced id (index-5 : index+5)
# TODO: print before/after for one example

# Write your code below


**Homework (Lesson 2)**
- Use the `wikitext` dataset (`wikitext-2-raw-v1`).
- Load the GPT-2 tokenizer with `AutoTokenizer`.
- Build a vocabulary with `build_vocab` **over GPT-2 token ids**.
- Get the top 1000 tokens by frequency.
- Hint: use `Counter.most_common(1000)` instead of manual sorting.
- Tokenize the first 5 texts with GPT-2, keep only tokens in the top 1000,
  then reconstruct with `tokenizer.decode(...)`.
- Compare the original vs reconstructed texts.


In [None]:
# TODO: load wikitext-2-raw-v1 train split
# TODO: load GPT-2 tokenizer with AutoTokenizer
# TODO: build vocab with build_vocab over GPT-2 tokens
# Hint: build_vocab as defined above already tokenizes with GPT-2, so pass it the raw texts
# TODO: get top 1000 tokens by frequency
# TODO: tokenize first 5 texts, keep only tokens from top-1000, then reconstruct using tokenizer.decode
# TODO: print original vs reconstructed texts

# Write your code below


## Step 5: N-gram counts and add-alpha smoothing
We build unigram and bigram counts from the training set, then apply add-alpha smoothing.


**About n-gram counts and smoothing**
- An *n-gram* is a contiguous sequence of `n` tokens (unigram n=1, bigram n=2).
- We add `<BOS>` and `<EOS>` to mark sentence boundaries for n-gram counting.
- Bigram counts from the *train* split define a conditional model: $P(w_i \mid w_{i-1})$.
  This is not gradient training; it is counting-based estimation.
- Add-alpha smoothing avoids zero probabilities by adding a small constant $\alpha$ to every count, where $V$ is the vocabulary size:
$$P_{\alpha}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + \alpha}{C(w_{i-1}) + \alpha V}$$
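
The formula can be checked on a toy corpus. This is a minimal sketch with string tokens and placeholder `<BOS>`/`<EOS>` symbols (the lab itself uses GPT-2 ids); here $C(w_{i-1})$ is counted as the number of times a token appears as a bigram context, an assumption that makes the smoothed probabilities sum to one.

```python
from collections import Counter

BOS, EOS = "<BOS>", "<EOS>"  # placeholder boundary symbols for this toy example

toy_sentences = [["a", "b"], ["a", "c"], ["a", "b"]]
sequences = [[BOS] + s + [EOS] for s in toy_sentences]

context_counts = Counter()  # C(w_{i-1}): how often a token appears as a bigram context
bigram_counts = Counter()   # C(w_{i-1}, w_i)
for seq in sequences:
    for prev, tok in zip(seq[:-1], seq[1:]):
        context_counts[prev] += 1
        bigram_counts[(prev, tok)] += 1

vocab_toy = sorted({tok for seq in sequences for tok in seq})
V = len(vocab_toy)

def bigram_prob(prev, tok, alpha):
    # Add-alpha smoothing: unseen bigrams get a small non-zero probability
    return (bigram_counts[(prev, tok)] + alpha) / (context_counts[prev] + alpha * V)

print(bigram_prob("a", "b", alpha=1.0))  # seen bigram
print(bigram_prob("a", "a", alpha=1.0))  # unseen bigram, still > 0
print(sum(bigram_prob("a", t, alpha=1.0) for t in vocab_toy))  # should be 1
```

With $V = 5$ and $\alpha = 1$, the seen bigram `(a, b)` gets $(2 + 1) / (3 + 5) = 0.375$, while each unseen continuation of `a` gets $1 / 8$.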


In [56]:
help(tokenizer)

Help on GPT2TokenizerFast in module transformers.models.gpt2.tokenization_gpt2_fast object:

class GPT2TokenizerFast(transformers.tokenization_utils_fast.PreTrainedTokenizerFast)
 |  GPT2TokenizerFast(vocab_file=None, merges_file=None, tokenizer_file=None, unk_token='<|endoftext|>', bos_token='<|endoftext|>', eos_token='<|endoftext|>', add_prefix_space=False, **kwargs)
 |
 |  Construct a "fast" GPT-2 tokenizer (backed by HuggingFace's *tokenizers* library). Based on byte-level
 |  Byte-Pair-Encoding.
 |
 |  This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
 |  be encoded differently whether it is at the beginning of the sentence (without space) or not:
 |
 |  ```python
 |  >>> from transformers import GPT2TokenizerFast
 |
 |  >>> tokenizer = GPT2TokenizerFast.from_pretrained("openai-community/gpt2")
 |  >>> tokenizer("Hello world")["input_ids"]
 |  [15496, 995]
 |
 |  >>> tokenizer(" Hello world")["input_ids"]
 |  [18435,

In [60]:
tokenizer.unk_token_id

50256

In [None]:
tokenizer.eos_token_id

50256

In [50]:
tokenizer.bos_token_id

50256

In [62]:
tokenizer("asdasd")

{'input_ids': [292, 67, 292, 67], 'attention_mask': [1, 1, 1, 1]}

In [63]:
??replace_unk

Signature: replace_unk(tokens: list[int], vocab: collections.Counter)
Docstring: <no docstring>
Source:   
def replace_unk(tokens: list[int], vocab: Counter):
    # TODO: replace tokens not in vocab with '<UNK>'
    filtered = []
    for token in tokens:
        if token not in vocab:
            filtered.append(tokenizer.

In [None]:
def add_bos_eos(sentence_tokens: list[int]):
    return [tokenizer.bos_token_id] + sentence_tokens + [tokenizer.eos_token_id]

def tokenize(sentence: str) -> list[int]:
    sentence_tokens = tokenizer(sentence)["input_ids"]
    sentence_tokens = replace_unk(sentence_tokens, vocab)
    sentence_tokens = add_bos_eos(sentence_tokens)

    return sentence_tokens

In [74]:
tokenizer.convert_ids_to_tokens(tokenize("Hello world."))

['<|endoftext|>', 'Hello', 'Ġworld', '.', '<|endoftext|>']

In [76]:
tokenize("Hello world."), tokenizer.decode(tokenize("Hello world."))

([50256, 15496, 995, 13, 50256], '<|endoftext|>Hello world.<|endoftext|>')

In [None]:
# List of str
train_text = extract_texts(train)

# List of sequences (i.e. list[int]) of tokens: list[list[int]]
train_tokens = []

for sentence in train_text:
    tokenized_sentence = tokenize(sentence) # list[int]
    train_tokens.append(tokenized_sentence)

In [None]:
train_tokens = [
    tokenize(sentence)
    for sentence in train_text
]

In [None]:
unigram_counts = Counter()
bigram_counts = Counter()



In [None]:






# TODO: build train_tokens and dev_tokens using replace_unk
# Hint: train_text = extract_texts(train); dev_text = extract_texts(dev)
# Hint: train_tokens = [tokenize(t) for t in train_text]  # tokenize() applies replace_unk
# Hint: tokenize() already adds the BOS/EOS ids for n-gram counts

# TODO: compute unigram_counts and bigram_counts with Counter
# Hint: bigrams can be pairs from zip(seq[:-1], seq[1:])

# TODO: define bigram_prob(prev, tok, alpha) with add-alpha smoothing

# Write your code below


## Step 6: Perplexity and simple grid search
We evaluate on the dev split and tune the smoothing strength by trying a few alphas.


**About perplexity**
- Perplexity is the exponential of the average negative log probability.
- The minus sign turns log-likelihoods (which are negative) into a positive surprisal value.
- The `-1/N` factor averages per token so results are comparable across lengths.
- With the natural log, the formula is:
$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^N \log p(w_i \mid w_{i-1})\right)$$
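
A minimal sketch of the formula, assuming the model is passed in as a function `prob_fn(prev, tok)` (a hypothetical interface chosen for this example). A uniform model over 4 tokens is used as a sanity check, since its perplexity should equal 4.

```python
import math

def perplexity(sequences, prob_fn):
    """Exponential of the average negative log probability over all bigrams."""
    total_log_prob, n = 0.0, 0
    for seq in sequences:
        for prev, tok in zip(seq[:-1], seq[1:]):
            total_log_prob += math.log(prob_fn(prev, tok))
            n += 1
    return math.exp(-total_log_prob / n)

def uniform(prev, tok):
    # Hypothetical model: every next token has probability 1/4
    return 1 / 4

ppl = perplexity([["<BOS>", "a", "b", "<EOS>"]], uniform)
print(ppl)
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long sequences.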


In [None]:
# TODO: implement perplexity for a list of token sequences
# Hint: use log probabilities and average over total bigrams

# Write your code below


**Why grid search?**
- We do not know the best alpha in advance.
- A small grid search tries a few candidate values and picks the one
  with the lowest dev perplexity.
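
The search itself is a short loop. The sketch below runs it on toy train/dev sequences with an inline add-alpha bigram model; the data, the alpha grid, and the name `dev_perplexity` are assumptions for illustration.

```python
import math
from collections import Counter

# Toy data (hypothetical): dev contains bigrams that never occur in train
train_seqs = [["<BOS>", "a", "b", "<EOS>"], ["<BOS>", "a", "c", "<EOS>"]] * 5
dev_seqs = [["<BOS>", "a", "b", "<EOS>"], ["<BOS>", "c", "a", "<EOS>"]]

context, bigrams = Counter(), Counter()
for seq in train_seqs:
    for prev, tok in zip(seq[:-1], seq[1:]):
        context[prev] += 1
        bigrams[(prev, tok)] += 1
V = len({tok for seq in train_seqs for tok in seq})

def dev_perplexity(alpha):
    log_prob, n = 0.0, 0
    for seq in dev_seqs:
        for prev, tok in zip(seq[:-1], seq[1:]):
            p = (bigrams[(prev, tok)] + alpha) / (context[prev] + alpha * V)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

# Evaluate every candidate on dev and keep the one with the lowest perplexity
results = {alpha: dev_perplexity(alpha) for alpha in [0.01, 0.1, 0.5, 1.0]}
best_alpha = min(results, key=results.get)
for alpha, ppl in results.items():
    print(alpha, round(ppl, 3))
print("best alpha:", best_alpha)
```

The counts come only from the train sequences; dev is used solely to compare alphas, and test stays untouched until the final evaluation.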


In [None]:
# TODO: grid search over a few alpha values (e.g., [0.1, 0.5, 1.0])
# TODO: print dev perplexity for each and pick the best

# Write your code below


## Step 7: Top-k sampling
We generate short samples using a bigram model and top-k sampling.


**About top-k sampling**
- We keep only the top-k most likely next tokens.
- Sampling from this trimmed distribution balances variety and coherence.
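
A minimal sketch of one top-k sampling step over an explicit probability list. The token list and probabilities are made up for illustration, and the sketch uses a local generator from `np.random.default_rng` rather than the global `np.random.choice` mentioned elsewhere in this notebook.

```python
import numpy as np

def sample_top_k(tokens, probs, k, rng):
    """Keep the k most probable tokens, renormalize, and sample one of them."""
    probs = np.asarray(probs, dtype=float)
    top_idx = np.argsort(probs)[::-1][:k]               # indices of the k largest probabilities
    top_probs = probs[top_idx] / probs[top_idx].sum()   # renormalize so they sum to 1
    choice = rng.choice(top_idx, p=top_probs)
    return tokens[choice]

rng = np.random.default_rng(42)
tokens = ["the", "a", "cat", "dog", "zebra"]
probs = [0.4, 0.3, 0.15, 0.1, 0.05]

samples = [sample_top_k(tokens, probs, k=2, rng=rng) for _ in range(20)]
print(samples)
```

With `k=2` only `the` and `a` can ever be sampled; `k=1` reduces to greedy decoding, and `k` equal to the vocabulary size samples from the full distribution.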


In [None]:
# TODO: implement sample_next(prev, alpha, k) using bigram_prob
# Hint: compute probabilities for all tokens, take top-k, sample with np.random.choice

# TODO: implement generate(max_len, alpha, k) starting from <BOS>
# TODO: print 2-3 generated samples

# Write your code below
