[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/1.words/HW1_Tokenization.ipynb)

# Homework 1: Tokenization

In this homework, you'll compare the tokenization outputs of different classes of tokenizers. This homework is also an opportunity for you to check in on your Python proficiency: for all of the operations below (downloading a file, reading it in, counting objects), you should either be comfortable implementing them already or know how to find out how to do so yourself. If you find yourself struggling with them, we encourage you to take this class at a later date, with a bit more Python experience under your belt.

We've added some space for you to write the code for each section, but feel free to create more code cells if you'd like.

## Part 1

Tokenize the following document with each of the tokenizers listed below. Feel free to use the linked documentation (and AI assistance) for this low-level operation, but remember that you have to be able to explain what your code is doing. For each tokenizer, we want to see a list of tokens for this document (not numeric token IDs, but legible words) -- e.g., \["London", ".", ...\]

* NLTK `word_tokenize` (https://www.nltk.org/book/ch03.html)
* spaCy `tokenize` (https://spacy.io/usage/spacy-101#annotations-token)
* Tiktoken BPE tokenization (https://github.com/openai/tiktoken) -- cl100k_base (GPT-3.5, GPT-4).



In [1]:
document = "London. Michaelmas term lately over, and the Lord Chancellor sitting in Lincoln’s Inn Hall. Implacable November weather. As much mud in the streets as if the waters had but newly retired from the face of the earth, and it would not be wonderful to meet a Megalosaurus, forty feet long or so, waddling like an elephantine lizard up Holborn Hill. Smoke lowering down from chimney-pots, making a soft black drizzle, with flakes of soot in it as big as full-grown snowflakes—gone into mourning, one might imagine, for the death of the sun. Dogs, undistinguishable in mire. Horses, scarcely better; splashed to their very blinkers. Foot passengers, jostling one another’s umbrellas in a general infection of ill temper, and losing their foot-hold at street-corners, where tens of thousands of other foot passengers have been slipping and sliding since the day broke (if this day ever broke), adding new deposits to the crust upon crust of mud, sticking at those points tenaciously to the pavement, and accumulating at compound interest."

In [15]:
# Your code here:
import io
import json
import os

import nltk
import requests
import spacy
import tiktoken
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt models.
nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [8]:
nltk_tokens = word_tokenize(document)

In [9]:
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner", "lemmatizer", "attribute_ruler"])
spacy_doc = nlp(document)
spacy_tokens = [t.text for t in spacy_doc]

In [11]:
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode(document)

# Decode each ID on its own so we get legible string pieces rather than numbers.
token_pieces = [enc.decode_single_token_bytes(tid).decode("utf-8", errors="replace")
                for tid in token_ids]

print(token_pieces[:50])

['London', '.', ' Michael', 'mas', ' term', ' lately', ' over', ',', ' and', ' the', ' Lord', ' Chancellor', ' sitting', ' in', ' Lincoln', '’s', ' Inn', ' Hall', '.', ' Impl', 'ac', 'able', ' November', ' weather', '.', ' As', ' much', ' mud', ' in', ' the', ' streets', ' as', ' if', ' the', ' waters', ' had', ' but', ' newly', ' retired', ' from', ' the', ' face', ' of', ' the', ' earth', ',', ' and', ' it', ' would', ' not']
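
One property visible in the pieces above is that the BPE output is lossless: the leading spaces live inside the tokens, so concatenating the pieces gives back the exact input, whereas the NLTK tokens drop the original spacing. A small check of this, using only tokens copied from the outputs in this notebook (no tokenizer library is called, and the variable names are illustrative):

```python
# First sentence-and-a-bit of the document, copied from the input cell.
original = "London. Michaelmas term lately over, and the Lord Chancellor sitting in Lincoln’s Inn Hall."

# The first 19 tiktoken pieces, copied from the printed output.
bpe_pieces = ["London", ".", " Michael", "mas", " term", " lately", " over", ",",
              " and", " the", " Lord", " Chancellor", " sitting", " in",
              " Lincoln", "’s", " Inn", " Hall", "."]

# The same stretch of text as NLTK tokenized it, copied from the NLTK output.
nltk_pieces = ["London", ".", "Michaelmas", "term", "lately", "over", ",", "and",
               "the", "Lord", "Chancellor", "sitting", "in", "Lincoln", "’", "s",
               "Inn", "Hall", "."]

bpe_roundtrip = "".join(bpe_pieces)    # no separator needed: spaces are in the pieces
nltk_joined = " ".join(nltk_pieces)    # best effort: the original spacing is gone

print(bpe_roundtrip == original)  # True
print(nltk_joined == original)    # False
```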


In [12]:
import json

def show(name, tokens):
    print(f"\n{name} ({len(tokens)} tokens):")
    print(json.dumps(tokens, ensure_ascii=False, indent=2))

show("NLTK word_tokenize", nltk_tokens)
show("spaCy tokenize", spacy_tokens)
show("tiktoken cl100k_base (string pieces)", token_pieces)


NLTK word_tokenize (199 tokens):
[
  "London",
  ".",
  "Michaelmas",
  "term",
  "lately",
  "over",
  ",",
  "and",
  "the",
  "Lord",
  "Chancellor",
  "sitting",
  "in",
  "Lincoln",
  "’",
  "s",
  "Inn",
  "Hall",
  ".",
  "Implacable",
  "November",
  "weather",
  ".",
  "As",
  "much",
  "mud",
  "in",
  "the",
  "streets",
  "as",
  "if",
  "the",
  "waters",
  "had",
  "but",
  "newly",
  "retired",
  "from",
  "the",
  "face",
  "of",
  "the",
  "earth",
  ",",
  "and",
  "it",
  "would",
  "not",
  "be",
  "wonderful",
  "to",
  "meet",
  "a",
  "Megalosaurus",
  ",",
  "forty",
  "feet",
  "long",
  "or",
  "so",
  ",",
  "waddling",
  "like",
  "an",
  "elephantine",
  "lizard",
  "up",
  "Holborn",
  "Hill",
  ".",
  "Smoke",
  "lowering",
  "down",
  "from",
  "chimney-pots",
  ",",
  "making",
  "a",
  "soft",
  "black",
  "drizzle",
  ",",
  "with",
  "flakes",
  "of",
  "soot",
  "in",
  "it",
  "as",
  "big",
  "as",
  "full-grown",
  "snowflakes—gone",
  "into",
 

## Part 2

Examine the different tokenizations for the passage above -- i.e., actually read through them and see how they differ. In a paragraph or two, characterize the salient differences in tokenization between a.) NLTK and spaCy and b.) NLTK and BPE. Reference real examples in the text. By the end of this homework, you should be able to discuss the practical differences between tokenization methods.

**Response**:

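NLTK vs spaCy: both are rule-based word tokenizers and agree on most of the passage; the differences are in how they treat punctuation inside a word. In the NLTK output above, hyphenated compounds stay whole (`chimney-pots`, `full-grown`), the dash is not split off (`snowflakes—gone` is a single token), and the curly-apostrophe possessive in "Lincoln’s" is broken into three tokens (`Lincoln`, `’`, `s`). spaCy's default English rules go the other way: they split compounds at the hyphen and separate the dash, and they keep the possessive clitic together as `’s`. So spaCy's treatment of clitics is closer to what a syntactic pipeline wants, while NLTK keeps hyphenated words intact, which helps when they should be matched as single vocabulary items.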


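NLTK vs BPE: the differences are in granularity. NLTK treats the word as the smallest unit, while BPE works on subword pieces to balance vocabulary size against coverage. Common words come out as single pieces with the preceding space attached (`' term'`, `' Lord'`, `' Chancellor'`), but rarer words are split into fragments: "Michaelmas" becomes `' Michael'` + `'mas'` and "Implacable" becomes `' Impl'` + `'ac'` + `'able'`, where NLTK keeps each as one token. This makes BPE well suited to large language models (like GPT), since any string can be encoded with a fixed vocabulary, but it produces tokens that don’t always align with human intuitions about “words.”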
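
To make the subword idea concrete, here is a toy sketch of the BPE merge loop in pure Python. It is not how tiktoken is implemented (tiktoken applies a pretrained merge table over bytes); the corpus, the tie-breaking rule, and the names `learn_bpe`, `merge_pair`, and `segment` are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy, character-level)."""
    vocab = Counter(tuple(word) for word in words)  # word as symbols -> frequency
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties go to the lexicographically larger pair
        # (an arbitrary but deterministic choice made for this sketch).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split `word` into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Made-up corpus: four word forms with different frequencies.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=3)
print(merges)
print(segment("lowest", merges))
```

The word "lowest" never appears in the toy corpus, yet it can still be segmented into pieces learned from other words, which is the coverage property the response above refers to.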

## Part 3

Download the full text of *Pride and Prejudice* (https://raw.githubusercontent.com/dbamman/anlp25/main/data/1342_pride_and_prejudice.txt) and tokenize it using each of the methods above. How many word types (in the formal sense we discussed in class) does each tokenization method have for that complete file?

In [13]:
# Your code here:
try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    nltk.download("punkt")
try:
    nltk.data.find("tokenizers/punkt_tab")
except LookupError:
    nltk.download("punkt_tab")

URL = "https://raw.githubusercontent.com/dbamman/anlp25/main/data/1342_pride_and_prejudice.txt"
LOCAL_FNAME = "1342_pride_and_prejudice.txt"

In [16]:
def load_text():
    """Return the novel's text, downloading it once and caching it locally."""
    if os.path.exists(LOCAL_FNAME):
        with open(LOCAL_FNAME, "r", encoding="utf-8") as f:
            return f.read()
    resp = requests.get(URL, timeout=60)
    resp.raise_for_status()
    text = resp.text
    with open(LOCAL_FNAME, "w", encoding="utf-8") as f:
        f.write(text)
    return text

text = load_text()

In [17]:
nltk_tokens = word_tokenize(text)
nltk_types = set(nltk_tokens)

In [18]:
nlp = spacy.load(
    "en_core_web_sm",
    disable=["tagger","parser","ner","lemmatizer","attribute_ruler"]
)
spacy_tokens = [t.text for t in nlp.make_doc(text)]
spacy_types = set(spacy_tokens)

In [19]:
enc = tiktoken.get_encoding("cl100k_base")
bpe_ids = enc.encode(text)

bpe_tokens = [
    enc.decode_single_token_bytes(tid).decode("utf-8", errors="replace")
    for tid in bpe_ids
]
bpe_types = set(bpe_tokens)

In [20]:
def show_count(name, tokens, typeset):
    print(f"{name}:")
    print(f"  tokens: {len(tokens):,}")
    print(f"  UNIQUE WORD TYPES: {len(typeset):,}\n")

show_count("NLTK word_tokenize", nltk_tokens, nltk_types)
show_count("spaCy tokenize", spacy_tokens, spacy_types)
show_count("tiktoken cl100k_base (BPE pieces)", bpe_tokens, bpe_types)

NLTK word_tokenize:
  tokens: 142,522
  UNIQUE WORD TYPES: 7,475

spaCy tokenize:
  tokens: 155,437
  UNIQUE WORD TYPES: 6,780

tiktoken cl100k_base (BPE pieces):
  tokens: 161,075
  UNIQUE WORD TYPES: 8,364
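
One caveat on the BPE figure, offered as a possibility rather than a confirmed problem: the cell above turns each piece into a string with `errors="replace"` before building the set. A byte-level piece that is not valid UTF-8 on its own (for example, a fragment of a multi-byte character) decodes to the replacement character U+FFFD, so two different pieces can collapse into one string. The sketch below shows the effect with two made-up byte pieces, not actual cl100k_base tokens; whether any such pieces occur in *Pride and Prejudice* is not checked here.

```python
# Two distinct, hypothetical byte-level pieces: each is a lone UTF-8
# continuation byte, so neither is valid UTF-8 by itself.
piece_a = b"\x94"
piece_b = b"\x99"

# Decoding with errors="replace" maps both to the same replacement character.
types_as_text = {p.decode("utf-8", errors="replace") for p in (piece_a, piece_b)}

# Keeping the raw bytes preserves the distinction.
types_as_bytes = {piece_a, piece_b}

print(len(types_as_text), len(types_as_bytes))
```

If this matters for the count, taking types over the integer IDs instead (`len(set(bpe_ids))`) sidesteps decoding entirely.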



## Part 4

Which text has the greater type-token ratio, *Pride and Prejudice* (https://raw.githubusercontent.com/dbamman/anlp25/main/data/1342_pride_and_prejudice.txt) or *Emma* (https://raw.githubusercontent.com/dbamman/anlp25/main/data/158_emma.txt)?  Calculate the TTR for both texts using the NLTK tokenizer, but only use the first 1,000 tokens from each text when calculating its TTR.
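
For reference, the type-token ratio is the number of distinct tokens (types) divided by the total number of tokens. It tends to fall as a text gets longer, because frequent words keep repeating, which is why the comparison is made over equal-length windows. A toy sketch, with a made-up sentence and whitespace splitting standing in for a real tokenizer (the function name `type_token_ratio` is illustrative):

```python
def type_token_ratio(tokens, k=None):
    """TTR over the first k tokens (or over all tokens when k is None)."""
    window = tokens if k is None else tokens[:k]
    if not window:
        raise ValueError("need at least one token")
    return len(set(window)) / len(window)


# Made-up example text; str.split() is a stand-in for a real tokenizer.
toy_tokens = "the cat sat on the mat and the dog sat on the log".split()

ttr_4 = type_token_ratio(toy_tokens, 4)    # no repeats yet
ttr_8 = type_token_ratio(toy_tokens, 8)    # "the" has now repeated
ttr_all = type_token_ratio(toy_tokens)     # 8 types over 13 tokens

print(ttr_4, ttr_8, round(ttr_all, 3))
```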

In [22]:
# Your code here:
from nltk.tokenize import TreebankWordTokenizer

pp_url = "https://raw.githubusercontent.com/dbamman/anlp25/main/data/1342_pride_and_prejudice.txt"
emma_url = "https://raw.githubusercontent.com/dbamman/anlp25/main/data/158_emma.txt"

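# Download both novels (with a timeout so a stalled request doesn't hang the notebook).
pp_text = requests.get(pp_url, timeout=60).text
emma_text = requests.get(emma_url, timeout=60).text

# Treebank rules applied directly, without the sentence-splitting step that
# word_tokenize runs first, so tokens may differ slightly from Parts 1 and 3.
tok = TreebankWordTokenizer()

def ttr_first_k(text, k=1000):
    """Type-token ratio over the first k tokens of text."""
    toks = tok.tokenize(text)[:k]
    return len(set(toks)) / len(toks)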

pp_ttr = ttr_first_k(pp_text, 1000)
emma_ttr = ttr_first_k(emma_text, 1000)

answer = "Emma" if emma_ttr > pp_ttr else "Pride and Prejudice"

print("The TTR for 'Pride and Prejudice' is", pp_ttr)
print("The TTR for 'Emma' is", emma_ttr)
print(f"{answer} has the higher TTR.")

The TTR for 'Pride and Prejudice' is 0.404
The TTR for 'Emma' is 0.425
Emma has the higher TTR.
