# Task 1.3: Tokenization Function

This notebook implements a `tokenize()` function using TransformerLens and tests it on 3 different inputs.

In [1]:
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

`torch_dtype` is deprecated! Use `dtype` instead!


Loaded pretrained model gpt2-small into HookedTransformer


In [3]:
def tokenize(text: str) -> list[dict]:
    """Returns list of {index, token_str, token_id} for each token."""
    tokens = model.to_tokens(text)          # tensor of token IDs, shape (1, n_tokens)
    str_tokens = model.to_str_tokens(text)  # list of strings
    return [
        {"index": i, "token_str": s, "token_id": int(tokens[0, i])}
        for i, s in enumerate(str_tokens)
    ]
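
Every output below starts with `<|endoftext|>` because `to_tokens` and `to_str_tokens` prepend a BOS token by default. As a minimal sketch (not part of the original task), the variant below exposes that choice through TransformerLens's `prepend_bos` argument. The name `tokenize_with` and its explicit `model` parameter are assumptions made for illustration.

```python
def tokenize_with(model, text: str, prepend_bos: bool = True) -> list[dict]:
    """Hypothetical variant of tokenize() that takes the model and BOS choice explicitly."""
    # to_tokens returns shape (1, n_tokens); [0] selects the single batch row.
    token_ids = model.to_tokens(text, prepend_bos=prepend_bos)[0]
    str_tokens = model.to_str_tokens(text, prepend_bos=prepend_bos)
    return [
        {"index": i, "token_str": s, "token_id": int(t)}
        for i, (s, t) in enumerate(zip(str_tokens, token_ids))
    ]
```

Calling `tokenize_with(model, "Hello world", prepend_bos=False)` should return only the `Hello` and ` world` entries, with indices starting at 0.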

## Test 1: Short sentence

In [4]:
result1 = tokenize("Hello world")
for token in result1:
    print(token)

{'index': 0, 'token_str': '<|endoftext|>', 'token_id': 50256}
{'index': 1, 'token_str': 'Hello', 'token_id': 15496}
{'index': 2, 'token_str': ' world', 'token_id': 995}


## Test 2: Longer sentence

In [6]:
result2 = tokenize("The quick brown fox jumps over the lazy dog near the tower.")
for token in result2:
    print(token)

{'index': 0, 'token_str': '<|endoftext|>', 'token_id': 50256}
{'index': 1, 'token_str': 'The', 'token_id': 464}
{'index': 2, 'token_str': ' quick', 'token_id': 2068}
{'index': 3, 'token_str': ' brown', 'token_id': 7586}
{'index': 4, 'token_str': ' fox', 'token_id': 21831}
{'index': 5, 'token_str': ' jumps', 'token_id': 18045}
{'index': 6, 'token_str': ' over', 'token_id': 625}
{'index': 7, 'token_str': ' the', 'token_id': 262}
{'index': 8, 'token_str': ' lazy', 'token_id': 16931}
{'index': 9, 'token_str': ' dog', 'token_id': 3290}
{'index': 10, 'token_str': ' near', 'token_id': 1474}
{'index': 11, 'token_str': ' the', 'token_id': 262}
{'index': 12, 'token_str': ' tower', 'token_id': 10580}
{'index': 13, 'token_str': '.', 'token_id': 13}


## Test 3: Sentence with unusual words

In [7]:
result3 = tokenize("The transformer's hippocampus-like architecture is unparalleledly fascinating!")
for token in result3:
    print(token)

{'index': 0, 'token_str': '<|endoftext|>', 'token_id': 50256}
{'index': 1, 'token_str': 'The', 'token_id': 464}
{'index': 2, 'token_str': ' transformer', 'token_id': 47385}
{'index': 3, 'token_str': "'s", 'token_id': 338}
{'index': 4, 'token_str': ' hippocampus', 'token_id': 38587}
{'index': 5, 'token_str': '-', 'token_id': 12}
{'index': 6, 'token_str': 'like', 'token_id': 2339}
{'index': 7, 'token_str': ' architecture', 'token_id': 10959}
{'index': 8, 'token_str': ' is', 'token_id': 318}
{'index': 9, 'token_str': ' unparalleled', 'token_id': 39235}
{'index': 10, 'token_str': 'ly', 'token_id': 306}
{'index': 11, 'token_str': ' fascinating', 'token_id': 13899}
{'index': 12, 'token_str': '!', 'token_id': 0}


## Observations

Things to note as you run the cells above:

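- **BOS token**: TransformerLens prepends GPT-2's special `<|endoftext|>` token (ID 50256) at index 0 of every input by default; this comes from its `prepend_bos` setting, not from the GPT-2 tokenizer itself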
- **Subword tokenization**: long or unusual words get split into pieces (e.g. `unparalleledly` → ` unparalleled` + `ly`, and `hippocampus-like` → ` hippocampus` + `-` + `like`)
- **Spaces are part of tokens**: a token that begins a new word carries its leading space (stored as `Ġ` in GPT-2's vocabulary), so `" world"` (ID 995) is a single token and is distinct from `"world"`
- **Punctuation**: punctuation is usually its own token (`.`, `!`, and `-` in the outputs above), and the possessive `'s` is also a single token
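
Because the spaces live inside the tokens, concatenating the `token_str` values (minus the BOS token) should reproduce the input text. The sketch below checks this on the Test 1 output, copied in by hand so that it runs without loading the model. The helper name `detokenize` is an assumption for illustration, not a TransformerLens function.

```python
BOS = "<|endoftext|>"

def detokenize(tokens: list[dict]) -> str:
    """Join token strings back into text, skipping a leading BOS token."""
    pieces = [t["token_str"] for t in tokens]
    if pieces and pieces[0] == BOS:
        pieces = pieces[1:]
    # No separator: each token already carries its own leading space.
    return "".join(pieces)

# Copied from the Test 1 output above.
result1 = [
    {"index": 0, "token_str": "<|endoftext|>", "token_id": 50256},
    {"index": 1, "token_str": "Hello", "token_id": 15496},
    {"index": 2, "token_str": " world", "token_id": 995},
]

print(repr(detokenize(result1)))  # 'Hello world'
```

Joining with `" "` instead of `""` would double the space before `world`, which is a quick way to see that the space belongs to the token.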