# Lab: Tokenization and N-gram Language Models 

## Why Are We Doing This?

In lectures, you learned that:
- Tokenization determines how text is represented.
- Language models assign probabilities to token sequences.
- Perplexity depends on vocabulary choice and sparsity.

In this lab, you will connect these ideas *empirically*.



## Quick Guide: Hugging Face Tokenizers

Load a tokenizer:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
```

Tokenize text:
```python
text = "Tokenization matters."  # any input string
tokens = tokenizer.tokenize(text)  # returns a list of subword strings
```

We will compare:
- GPT-2 → Byte Pair Encoding (BPE)
- BERT → WordPiece


In [3]:
from transformers import AutoTokenizer
from datasets import load_dataset
from collections import Counter, defaultdict
import numpy as np
import math

## 1️⃣ Load Tokenizers

Hint:
```python
AutoTokenizer.from_pretrained('gpt2')
AutoTokenizer.from_pretrained('bert-base-uncased')
```

In [10]:
# TODO: Load GPT-2 tokenizer
gpt2 = AutoTokenizer.from_pretrained('gpt2')

# TODO: Load BERT tokenizer
bert = AutoTokenizer.from_pretrained('bert-base-uncased')

## 2️⃣ Tokenize Example Sentences

Goal: Observe how BPE and WordPiece split words differently.

For each sentence:
- Tokenize using GPT-2
- Tokenize using BERT
- Print tokens and their counts


In [21]:
sentences = [
    "Transformers are powerful models.",
    "Unbelievable tokenization differences!",
    "supercalifragilisticexpialidocious"
]

# TODO: Tokenize and print results
for sentence in sentences:
    print(f'{sentence} \nGPT-2: {gpt2.tokenize(sentence)} \nBERT: {bert.tokenize(sentence)} \n')

Transformers are powerful models. 
GPT-2: ['Transform', 'ers', 'Ġare', 'Ġpowerful', 'Ġmodels', '.'] 
BERT: ['transformers', 'are', 'powerful', 'models', '.'] 

Unbelievable tokenization differences! 
GPT-2: ['Un', 'bel', 'iev', 'able', 'Ġtoken', 'ization', 'Ġdifferences', '!'] 
BERT: ['unbelievable', 'token', '##ization', 'differences', '!'] 

supercalifragilisticexpialidocious 
GPT-2: ['super', 'cal', 'if', 'rag', 'il', 'ist', 'ice', 'xp', 'ial', 'id', 'ocious'] 
BERT: ['super', '##cal', '##if', '##rag', '##ilis', '##tic', '##ex', '##pia', '##lid', '##oc', '##ious'] 
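
The two tokenizers mark word boundaries in opposite ways: GPT-2 prefixes a token with `Ġ` when it starts with a space, while BERT prefixes a piece with `##` when it continues the previous piece. The sketch below undoes both conventions on token lists copied from the output above. The helper names are hypothetical, and the GPT-2 helper is only meant for plain ASCII text like these examples.

```python
def gpt2_tokens_to_text(tokens):
    # 'Ġ' stands for a leading space in GPT-2's byte-level BPE tokens
    return "".join(tokens).replace("Ġ", " ")

def bert_tokens_to_words(tokens):
    # '##' marks a WordPiece that continues the previous piece
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

gpt2_tokens = ['Un', 'bel', 'iev', 'able', 'Ġtoken', 'ization', 'Ġdifferences', '!']
bert_tokens = ['unbelievable', 'token', '##ization', 'differences', '!']

print(gpt2_tokens_to_text(gpt2_tokens))   # Unbelievable tokenization differences!
print(bert_tokens_to_words(bert_tokens))  # ['unbelievable', 'tokenization', 'differences', '!']
```

Note that the BERT result is lowercased because `bert-base-uncased` normalizes case before splitting, so the original casing cannot be recovered from its tokens.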



## 3️⃣ Load Dataset

We will train a simple bigram language model.

Hint:
```python
dataset = load_dataset('ag_news', split='train[:200]')
```

Use the first 150 examples as the training set and the last 50 as the test set.
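
The 150/50 split amounts to plain slicing. A minimal sketch on a stand-in list (the `texts` list is hypothetical placeholder data, not AG News):

```python
# Stand-in for the 200 loaded texts; real code would use dataset['text']
texts = [f"document {i}" for i in range(200)]

train_texts = texts[:150]  # first 150 examples
test_texts = texts[150:]   # last 50 examples

print(len(train_texts), len(test_texts))  # 150 50
```

With the `datasets` library, the same split can also be requested directly through slice strings such as `'train[:150]'` and `'train[150:200]'`.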


In [48]:
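# TODO: Load dataset and split train/test
# Note: the test set here is the first 50 examples of the official AG News test
# split, not the last 50 of train[:200] as the hint above suggests.
dataset_train = load_dataset('ag_news', split='train[:150]')
dataset_test = load_dataset('ag_news', split='test[:50]')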

training = dataset_train['text']
test = dataset_test['text']

print(training[1])
print(test[1])

Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket.


## 4️⃣ Tokenize Dataset

Write a function that:

- Takes a list of raw text strings
- Returns a list of token lists (one list of tokens per sentence)

After writing the function, you MUST apply it to the training and test sets using the GPT-2 and BERT tokenizers.  


In [52]:
def tokenize_texts(texts, tokenizer):
    tokenized = []
    
    # Hint: loop over texts and use tokenizer.tokenize(text)
    for text in texts:
        tokenized.append(tokenizer.tokenize(text))

    return tokenized


training_gpt2 = tokenize_texts(training, gpt2)
test_gpt2 = tokenize_texts(test, gpt2)

training_bert = tokenize_texts(training, bert)
test_bert = tokenize_texts(test, bert)

## 5️⃣ Vocabulary & Sparsity Analysis
- Compute vocabulary size
- Compute average sentence length
- Compute singleton rate (tokens appearing once)

Think: What does singleton rate tell us about sparsity?

In [None]:
def get_vocab(tokenized_texts):
    vocab = set()
    # TODO: update vocabulary with tokens
    for tokens in tokenized_texts:
        vocab.update(tokens)  # adds every distinct token in this text
    return vocab


In [None]:
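def singleton_rate(tokenized_texts):
    # Hint:
    # 1. Flatten tokens
    # 2. Count using Counter
    # 3. Count how many appear once
    counts = Counter(tok for tokens in tokenized_texts for tok in tokens)
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    # Assumed definition: share of vocabulary types seen exactly once
    return singletons / len(counts)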
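
The average sentence length from the list above has no function skeleton of its own, so here is a minimal sketch. The name `average_length` is hypothetical, and the token lists are copied from the example output in section 2.

```python
def average_length(tokenized_texts):
    # Mean number of tokens per text; 0.0 for an empty input
    if not tokenized_texts:
        return 0.0
    return sum(len(tokens) for tokens in tokenized_texts) / len(tokenized_texts)

# Token lists copied from the example output in section 2
gpt2_examples = [
    ['Transform', 'ers', 'Ġare', 'Ġpowerful', 'Ġmodels', '.'],
    ['Un', 'bel', 'iev', 'able', 'Ġtoken', 'ization', 'Ġdifferences', '!'],
]
bert_examples = [
    ['transformers', 'are', 'powerful', 'models', '.'],
    ['unbelievable', 'token', '##ization', 'differences', '!'],
]

print(average_length(gpt2_examples))  # 7.0
print(average_length(bert_examples))  # 5.0
```

On these two sentences GPT-2 produces longer sequences than BERT, which is the kind of difference to look for on the full training set.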


## 6️⃣ Build Bigram Language Model

Bigram probability:

$$P(w_i \mid w_{i-1}) = \frac{\mathrm{Count}(w_{i-1}, w_i)}{\mathrm{Count}(w_{i-1})}$$
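
As a quick check of the formula, here is a worked example on a tiny hand-made corpus. The corpus and the `bigram_prob` helper are illustrative assumptions, not part of the lab data.

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]

# Count single tokens and adjacent token pairs
unigrams = Counter(tok for sent in corpus for tok in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def bigram_prob(prev, cur):
    # Maximum-likelihood estimate: Count(prev, cur) / Count(prev)
    return bigrams[(prev, cur)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # 2 of the 3 'the' tokens are followed by 'cat'
print(bigram_prob("cat", "sat"))  # 0.5
print(bigram_prob("cat", "dog"))  # 0.0, because this pair never occurs
```

The last line shows the sparsity problem: any pair absent from training gets probability zero under this estimate.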

You must:
- Count unigrams
- Count bigrams

You must build TWO separate bigram models:

1. One using GPT-2 tokens
2. One using BERT tokens


- Use ONLY the TRAINING set to compute:
  - Unigram counts
  - Bigram counts

- Use the TEST set ONLY to compute perplexity.
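
For reference, the perplexity computed on the test set can be written as follows, where $N$ is the number of scored bigram transitions. This matches the `math.exp(-log_prob / N)` expression in the function skeleton, assuming natural logarithms.

$$\mathrm{PP} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_{i-1})\right)$$

A lower value means the model is less surprised by the test tokens. A single test bigram with an unsmoothed probability of 0 makes the sum undefined, which is why some form of smoothing is usually needed.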


In [None]:
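def build_bigram_counts(tokenized_texts):
    unigram_counts = Counter()
    bigram_counts = defaultdict(Counter)

    # TODO: implement counting
    for tokens in tokenized_texts:
        unigram_counts.update(tokens)
        # Pair each token with its successor: (w_{i-1}, w_i)
        for prev, cur in zip(tokens, tokens[1:]):
            bigram_counts[prev][cur] += 1

    return unigram_counts, bigram_counts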


In [None]:
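def compute_perplexity(tokenized_texts, unigram_counts, bigram_counts):
    log_prob = 0.0
    N = 0
    # Assumption: add-one (Laplace) smoothing, which the lab does not specify.
    # Without smoothing, an unseen test bigram has probability 0 and log() fails.
    V = len(unigram_counts) + 1  # +1 reserves a slot for unseen tokens
    for tokens in tokenized_texts:
        for prev, cur in zip(tokens, tokens[1:]):
            count = bigram_counts.get(prev, {}).get(cur, 0)
            log_prob += math.log((count + 1) / (unigram_counts[prev] + V))
            N += 1
    if N == 0:
        raise ValueError("no bigrams to score")
    return math.exp(-log_prob / N)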
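
A useful sanity check for any perplexity implementation is that a model assigning probability 1/k to every token has perplexity k. The sketch below uses a hypothetical helper, `perplexity_from_probs`, that takes the per-token probabilities directly instead of count tables.

```python
import math

def perplexity_from_probs(probs):
    # exp of the average negative log-probability
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

uniform = perplexity_from_probs([0.25] * 8)   # uniform over 4 choices, close to 4
mixed = perplexity_from_probs([0.5, 0.125])   # geometric mean of 2 and 8, close to 4

print(uniform, mixed)
```

Perplexity is the inverse geometric mean of the token probabilities, so one very unlikely token can raise it sharply.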
