# Lab: Tokenization and N-gram Language Models 

## Why Are We Doing This?

In lectures, you learned that:
- Tokenization determines how text is represented.
- Language models assign probabilities to token sequences.
- Perplexity depends on vocabulary choice and sparsity.

In this lab, you will connect these ideas *empirically*.



## Quick Guide: Hugging Face Tokenizers

Load a tokenizer:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
```

Tokenize text:
```python
tokens = tokenizer.tokenize(text)
```

We will compare:
- GPT-2 → Byte Pair Encoding (BPE)
- BERT → WordPiece


In [1]:
from transformers import AutoTokenizer
from datasets import load_dataset
from collections import Counter, defaultdict
import numpy as np
import math

## 1️⃣ Load Tokenizers

Hint:
```python
AutoTokenizer.from_pretrained('gpt2')
AutoTokenizer.from_pretrained('bert-base-uncased')
```

In [2]:
# TODO: Load GPT-2 tokenizer
gpt2 = AutoTokenizer.from_pretrained('gpt2')

# TODO: Load BERT tokenizer
bert = AutoTokenizer.from_pretrained('bert-base-uncased')



## 2️⃣ Tokenize Example Sentences

Goal: Observe how BPE and WordPiece split words differently.

For each sentence:
- Tokenize using GPT-2
- Tokenize using BERT
- Print tokens and their counts


In [3]:
sentences = [
    "Transformers are powerful models.",
    "Unbelievable tokenization differences!",
    "supercalifragilisticexpialidocious"
]

# TODO: Tokenize and print results
for sentence in sentences:
    print(f'{sentence} \nGPT-2: {gpt2.tokenize(sentence)} \nBERT: {bert.tokenize(sentence)} \n')

Transformers are powerful models. 
GPT-2: ['Transform', 'ers', 'Ġare', 'Ġpowerful', 'Ġmodels', '.'] 
BERT: ['transformers', 'are', 'powerful', 'models', '.'] 

Unbelievable tokenization differences! 
GPT-2: ['Un', 'bel', 'iev', 'able', 'Ġtoken', 'ization', 'Ġdifferences', '!'] 
BERT: ['unbelievable', 'token', '##ization', 'differences', '!'] 

supercalifragilisticexpialidocious 
GPT-2: ['super', 'cal', 'if', 'rag', 'il', 'ist', 'ice', 'xp', 'ial', 'id', 'ocious'] 
BERT: ['super', '##cal', '##if', '##rag', '##ilis', '##tic', '##ex', '##pia', '##lid', '##oc', '##ious'] 
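
The markers in the output above carry the word-boundary information. GPT-2's byte-level BPE writes `Ġ` at the start of a token that was preceded by a space, while BERT's WordPiece writes `##` at the start of a token that continues the previous one. As a rough sketch, the token lists printed above can be glued back together as follows. The helper names `merge_bpe` and `merge_wordpiece` are made up for this illustration and are not part of `transformers`, and the BPE helper only handles the space marker, not the other remapped bytes.

```python
def merge_bpe(tokens):
    """Rejoin GPT-2 style tokens; 'Ġ' stands for a leading space."""
    return "".join(tokens).replace("Ġ", " ")


def merge_wordpiece(tokens):
    """Rejoin WordPiece tokens; '##' marks a continuation of the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # append the piece without its '##' prefix
        else:
            words.append(tok)
    return words


# Token lists copied from the output above.
gpt2_tokens = ['Un', 'bel', 'iev', 'able', 'Ġtoken', 'ization', 'Ġdifferences', '!']
bert_tokens = ['unbelievable', 'token', '##ization', 'differences', '!']

print(merge_bpe(gpt2_tokens))
print(merge_wordpiece(bert_tokens))
```

The BPE tokens reconstruct the original string including case and spacing, whereas the uncased WordPiece tokens only recover lowercased words with punctuation split off.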



## 3️⃣ Load Dataset

We will train a simple bigram language model.

Hint:
```python
dataset = load_dataset('ag_news', split='train[:200]')
```

Use the first 150 examples as the training set and the last 50 as the test set.
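
One way to read this instruction is to slice a single 200-example sample instead of mixing the official train and test splits. A minimal sketch follows, using a stand-in list in place of the real dataset so that it runs without a download; `split_first_last` is a hypothetical helper, not a `datasets` function.

```python
def split_first_last(texts, n_train=150):
    """Return (first n_train items, remaining items)."""
    return texts[:n_train], texts[n_train:]


# Stand-in for the 200 text fields of the loaded dataset.
sample = [f"article {i}" for i in range(200)]
train_texts, test_texts = split_first_last(sample)
print(len(train_texts), len(test_texts))
```

With the slice syntax already used in the hint, the same split could presumably be requested directly as `split='train[:150]'` and `split='train[150:200]'`.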


In [5]:
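# TODO: Load dataset and split train/test
# Note: this takes the first 50 examples of the official test split,
# not the last 50 of the 200-example training slice suggested in the hint.
dataset_train = load_dataset('ag_news', split='train[:150]')
dataset_test = load_dataset('ag_news', split='test[:50]')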

training = dataset_train['text']
test = dataset_test['text']

print(training[0])
print(test[0])

Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
Fears for T N pension after talks Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.


## 4️⃣ Tokenize Dataset

Write a function that:

- Takes a list of raw text strings
- Returns a list of token lists (one list of tokens per sentence)

After writing the function, you MUST apply it to the training and test sets using the GPT-2 and BERT tokenizers.  


In [23]:
def tokenize_texts(texts, tokenizer):
    tokenized = []
    
    # Hint: loop over texts and use tokenizer.tokenize(text)
    for text in texts:
        tokenized.append(tokenizer.tokenize(text))

    return tokenized


training_gpt2 = tokenize_texts(training, gpt2)
test_gpt2 = tokenize_texts(test, gpt2)

training_bert = tokenize_texts(training, bert)
test_bert = tokenize_texts(test, bert)

## 5️⃣ Vocabulary & Sparsity Analysis
- Compute vocabulary size
- Compute average sentence length
- Compute singleton rate (tokens appearing once)

Think: What does singleton rate tell us about sparsity?

In [28]:
def get_vocab(tokenized_texts):
    vocab = set()

    # TODO: update vocabulary with tokens
    for text in tokenized_texts:
        for token in text:
            vocab.add(token)

    return vocab

In [29]:
def avg_sentence_length(tokenized_texts):
    """Mean number of tokens per text, rounded to one decimal place."""
    lens = [len(text) for text in tokenized_texts]
    return round(np.sum(lens) / len(lens), 1)

avg_sentence_length(training_gpt2)

51.4

In [55]:
def singleton_rate(tokenized_texts):
    # Hint:
    # 1. Flatten tokens
    # 2. Count using Counter
    # 3. Count how many appear once
    # Note: this returns the number of singleton token types (a count);
    # divide by the vocabulary size to express it as a rate.
    all_tokens = []
    for tokens in tokenized_texts:
        all_tokens += tokens

    counts = Counter(all_tokens)
    singletons = [token for token, count in counts.items() if count == 1]

    return len(singletons)
        
singleton_rate(training_gpt2)

2088
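
The value above is a count of singleton token types, not a rate. One common way to turn it into a rate is to divide by the vocabulary size (the number of distinct tokens). The lab does not fix a denominator, so that choice is an assumption here. A sketch on toy data, with a hypothetical helper `singleton_fraction`:

```python
from collections import Counter


def singleton_fraction(tokenized_texts):
    """Fraction of distinct tokens that occur exactly once."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    if not counts:
        return 0.0  # no tokens at all, so no singletons
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Toy data: "cat" and "dog" occur once, "the" and "sat" occur twice.
toy = [["the", "cat", "sat"], ["the", "dog", "sat"]]
print(singleton_fraction(toy))  # 2 singletons out of 4 distinct tokens
```

A high fraction means many tokens were seen only once in training, so most bigrams that involve them have very little evidence behind their estimated probabilities.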

## 6️⃣ Build Bigram Language Model

Bigram probability:
$P(w_i \mid w_{i-1}) = \frac{\mathrm{Count}(w_{i-1}, w_i)}{\mathrm{Count}(w_{i-1})}$

You must:
- Count unigrams
- Count bigrams

You must build TWO separate bigram models:

1. One using GPT-2 tokens
2. One using BERT tokens


- Use ONLY the TRAINING set to compute:
  - Unigram counts
  - Bigram counts

- Use the TEST set ONLY to compute perplexity.
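
As a small worked example of the bigram formula above, the sketch below estimates probabilities from a three-sentence toy corpus. The corpus and the names `toy_unigrams`, `toy_bigrams`, and `bigram_prob` are invented for this illustration.

```python
from collections import Counter, defaultdict

# Toy training corpus, invented for this illustration.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]

toy_unigrams = Counter()
toy_bigrams = defaultdict(Counter)
for tokens in corpus:
    toy_unigrams.update(tokens)
    for prev, cur in zip(tokens, tokens[1:]):
        toy_bigrams[prev][cur] += 1


def bigram_prob(prev, cur):
    """Maximum-likelihood estimate Count(prev, cur) / Count(prev)."""
    if toy_unigrams[prev] == 0:
        return 0.0  # context never seen in training
    return toy_bigrams[prev][cur] / toy_unigrams[prev]


print(bigram_prob("the", "cat"))  # "the" occurs 3 times, "the cat" twice
print(bigram_prob("cat", "dog"))  # unseen bigram, so the estimate is 0
```

The unseen bigram gets probability 0, whose logarithm is undefined. This is why sparsity matters when the model is later evaluated with perplexity.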


In [56]:
def build_bigram_counts(tokenized_texts):
    unigram_counts = Counter()
    bigram_counts = defaultdict(Counter)
    
    # TODO: implement counting
    for tokens in tokenized_texts:
        for i in range(len(tokens)):
            unigram_counts[tokens[i]] += 1

            if i > 0:
                bigram_counts[tokens[i-1]][tokens[i]] += 1
    return unigram_counts, bigram_counts


In [57]:
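def compute_perplexity(tokenized_texts, unigram_counts, bigram_counts):
    log_prob = 0
    N = 0
    
    # TODO: compute log probabilities
    for tokens in tokenized_texts:
        for i in range(1, len(tokens)):
            prev_token = tokens[i-1]
            token = tokens[i]

            if unigram_counts[prev_token] > 0:
                prob = bigram_counts[prev_token][token] / unigram_counts[prev_token]
            else:
                prob = 0

            # Caveat: bigrams unseen in training (prob == 0) are skipped and
            # not counted in N, so the result covers seen bigrams only and
            # understates the true perplexity.
            if prob > 0:
                log_prob += math.log(prob)
                N += 1

    # Guard against an empty or fully unseen test set.
    if N == 0:
        return float('inf')
    return math.exp(-log_prob / N)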

In [58]:
gpt2_uni, gpt2_bi = build_bigram_counts(training_gpt2)
bert_uni, bert_bi = build_bigram_counts(training_bert)

print('GPT-2 Perplexity:', compute_perplexity(test_gpt2, gpt2_uni, gpt2_bi))
print('BERT Perplexity:', compute_perplexity(test_bert, bert_uni, bert_bi))

GPT-2 Perplexity: 8.368013047450145
BERT Perplexity: 9.944979625785685
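
The numbers above are computed only over test bigrams that were also seen in training, because `compute_perplexity` skips zero-probability bigrams. Perplexity is usually defined over all $N$ scored tokens, $\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_i \log P(w_i \mid w_{i-1})\right)$, which requires every probability to be nonzero. One standard remedy is add-one (Laplace) smoothing, $P(w_i \mid w_{i-1}) = \frac{\mathrm{Count}(w_{i-1}, w_i) + 1}{\mathrm{Count}(w_{i-1}) + V}$. The sketch below is one possible variant, not the lab's required method. The function name `smoothed_perplexity` and the parameter `vocab_size` are assumptions, and test tokens that never occur in training are a separate out-of-vocabulary problem that it does not address.

```python
import math
from collections import Counter, defaultdict


def smoothed_perplexity(tokenized_texts, unigram_counts, bigram_counts, vocab_size):
    """Bigram perplexity with add-one smoothing; every test bigram is scored."""
    log_prob = 0.0
    n = 0
    for tokens in tokenized_texts:
        for prev, cur in zip(tokens, tokens[1:]):
            # .get avoids inserting new keys into a defaultdict while scoring
            seen = bigram_counts.get(prev, {}).get(cur, 0)
            prob = (seen + 1) / (unigram_counts.get(prev, 0) + vocab_size)
            log_prob += math.log(prob)
            n += 1
    if n == 0:
        raise ValueError("no bigrams to score")
    return math.exp(-log_prob / n)


# Toy counts: the training data is the sentence ["a", "b"] seen twice.
toy_uni = Counter({"a": 2, "b": 2})
toy_bi = defaultdict(Counter)
toy_bi["a"]["b"] = 2

seen_ppl = smoothed_perplexity([["a", "b"]], toy_uni, toy_bi, vocab_size=2)
unseen_ppl = smoothed_perplexity([["b", "a"]], toy_uni, toy_bi, vocab_size=2)
print(seen_ppl, unseen_ppl)
```

In the toy example the unseen bigram is penalized with a higher perplexity instead of being dropped. On the real data, `vocab_size` could be taken as `len(get_vocab(...))` over the training set, so the two tokenizers would be smoothed over different vocabularies.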
