# Task 1.3: Tokenization Function + Notebook

This notebook implements a `tokenize()` function and tests it with 3 different inputs.

## Load the selected model

In [1]:
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

`torch_dtype` is deprecated! Use `dtype` instead!


Loaded pretrained model gpt2-small into HookedTransformer


## The Actual Function

This function uses TransformerLens to tokenize input text with GPT-2 small. It returns a list of dictionaries, one per token, each with three keys:
- `index`: the token position
- `token_str`: the token string (not guaranteed to be a full word)
- `token_id`: the integer token ID used by the model

Note that `model.to_tokens(text)` returns a tensor of shape `(batch, n_tokens)`. We then extract each token ID from the tensor and convert it to an integer.

In [2]:
def tokenize(text: str) -> list[dict]:
    """
    Return a list of {index, token_str, token_id} dicts, one per token.
    """
    # Token IDs as a tensor of shape (batch, n_tokens)
    tokens = model.to_tokens(text)
    # The same tokens as a list of strings, aligned with the IDs above
    str_tokens = model.to_str_tokens(text)

    return [
        {"index": i, "token_str": s, "token_id": int(tokens[0, i])}
        for i, s in enumerate(str_tokens)
    ]
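
To make the indexing step concrete without loading the model, here is a minimal sketch that replaces the tensor with a plain nested list. The values are copied from the Test 1 output in this notebook; `build_records` is a hypothetical helper name, not part of TransformerLens, and the length check is an extra safeguard that the original function does not have.

```python
# Stand-in for model.to_tokens("I can do it!"): a nested list with the
# same (batch, n_tokens) layout as the real tensor, here (1, 6).
batch_token_ids = [[50256, 40, 460, 466, 340, 0]]

# Stand-in for model.to_str_tokens("I can do it!").
str_tokens = ["<|endoftext|>", "I", " can", " do", " it", "!"]


def build_records(str_tokens, batch_token_ids):
    """Hypothetical helper: pair each token string with its ID from batch row 0."""
    row = batch_token_ids[0]  # select the only sequence in the batch
    if len(row) != len(str_tokens):
        raise ValueError("token strings and token IDs are misaligned")
    return [
        {"index": i, "token_str": s, "token_id": int(row[i])}
        for i, s in enumerate(str_tokens)
    ]


records = build_records(str_tokens, batch_token_ids)
for record in records:
    print(record)
```

With a real tensor the lookup is `tokens[0, i]` rather than `row[i]`, but the pairing logic is the same.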

## Test 1: Short Sentence

In [3]:
test_result1 = tokenize("I can do it!")
# Alternative input: test_result1 = tokenize("Transformer is fun!")

for token in test_result1:
    print(token)

{'index': 0, 'token_str': '<|endoftext|>', 'token_id': 50256}
{'index': 1, 'token_str': 'I', 'token_id': 40}
{'index': 2, 'token_str': ' can', 'token_id': 460}
{'index': 3, 'token_str': ' do', 'token_id': 466}
{'index': 4, 'token_str': ' it', 'token_id': 340}
{'index': 5, 'token_str': '!', 'token_id': 0}


## Test 2: Longer Sentence

In [4]:
test_result2 = tokenize("Gabrielle, Mario, Polly, and Steven are building 3 AI interpretability interfaces.")

for token in test_result2:
    print(token)

{'index': 0, 'token_str': '<|endoftext|>', 'token_id': 50256}
{'index': 1, 'token_str': 'Gab', 'token_id': 46079}
{'index': 2, 'token_str': 'riel', 'token_id': 11719}
{'index': 3, 'token_str': 'le', 'token_id': 293}
{'index': 4, 'token_str': ',', 'token_id': 11}
{'index': 5, 'token_str': ' Mario', 'token_id': 10682}
{'index': 6, 'token_str': ',', 'token_id': 11}
{'index': 7, 'token_str': ' Polly', 'token_id': 36898}
{'index': 8, 'token_str': ',', 'token_id': 11}
{'index': 9, 'token_str': ' and', 'token_id': 290}
{'index': 10, 'token_str': ' Steven', 'token_id': 8239}
{'index': 11, 'token_str': ' are', 'token_id': 389}
{'index': 12, 'token_str': ' building', 'token_id': 2615}
{'index': 13, 'token_str': ' 3', 'token_id': 513}
{'index': 14, 'token_str': ' AI', 'token_id': 9552}
{'index': 15, 'token_str': ' interpret', 'token_id': 6179}
{'index': 16, 'token_str': 'ability', 'token_id': 1799}
{'index': 17, 'token_str': ' interfaces', 'token_id': 20314}
{'index': 18, 'token_str': '.', 'token

## Test 3: Sentence with Unusual Words

In [5]:
test_result3 = tokenize("Pneumonoultramicroscopicsilicovolcanoconiosis is an unbelievably long word to test with GPT-2 small!")

for token in test_result3:
    print(token)

{'index': 0, 'token_str': '<|endoftext|>', 'token_id': 50256}
{'index': 1, 'token_str': 'P', 'token_id': 47}
{'index': 2, 'token_str': 'neum', 'token_id': 25668}
{'index': 3, 'token_str': 'on', 'token_id': 261}
{'index': 4, 'token_str': 'oult', 'token_id': 25955}
{'index': 5, 'token_str': 'ram', 'token_id': 859}
{'index': 6, 'token_str': 'icro', 'token_id': 2500}
{'index': 7, 'token_str': 'sc', 'token_id': 1416}
{'index': 8, 'token_str': 'op', 'token_id': 404}
{'index': 9, 'token_str': 'ics', 'token_id': 873}
{'index': 10, 'token_str': 'ilic', 'token_id': 41896}
{'index': 11, 'token_str': 'ov', 'token_id': 709}
{'index': 12, 'token_str': 'ol', 'token_id': 349}
{'index': 13, 'token_str': 'can', 'token_id': 5171}
{'index': 14, 'token_str': 'ocon', 'token_id': 36221}
{'index': 15, 'token_str': 'iosis', 'token_id': 42960}
{'index': 16, 'token_str': ' is', 'token_id': 318}
{'index': 17, 'token_str': ' an', 'token_id': 281}
{'index': 18, 'token_str': ' unbelievably', 'token_id': 48943}
{'ind

## Observations

- TransformerLens prepends a `<|endoftext|>` (beginning-of-sequence) token at index 0 by default. This can be disabled by passing `prepend_bos=False` to `model.to_tokens()` and `model.to_str_tokens()`, though leaving it on keeps token indices aligned with the positions of the activations examined later.
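- Tokens are not always full words. For instance, "Gabrielle" in Test 2 was split into `'Gab'`, `'riel'`, and `'le'`; "Transformer" (from the commented-out alternative input in Test 1) was split into "Trans" and "former"; and the rare, very long word in Test 3 was split into many smaller pieces.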
- A leading space is part of many tokens (e.g. `' is'` or `' word'`).
- Punctuation and symbols, such as `'-'` and `'!'`, as well as numbers (e.g. `' 3'`) were often treated as separate tokens.
- The same token string (e.g. `' is'`) always maps to the same token ID, but `' is'` and `'is'` are ***different*** tokens with different IDs. In my tests, for example, `' can'` has `'token_id': 460` while `'can'` has `'token_id': 5171`.
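
The observations about leading spaces and token IDs can be checked on the printed results alone. The sketch below uses `(token_str, token_id)` pairs copied from the Test 1 and Test 3 outputs above; `detokenize` and `build_vocab` are hypothetical helper names introduced for illustration, and the model is not needed to run it.

```python
BOS = "<|endoftext|>"

# (token_str, token_id) pairs copied from the Test 1 output.
test1 = [(BOS, 50256), ("I", 40), (" can", 460), (" do", 466), (" it", 340), ("!", 0)]
# A few pairs copied from the Test 3 output.
test3_excerpt = [("can", 5171), (" is", 318), (" an", 281)]


def detokenize(pairs):
    """Hypothetical helper: join token strings, skipping the prepended BOS token."""
    return "".join(s for s, _ in pairs if s != BOS)


def build_vocab(pairs):
    """Hypothetical helper: map each token string to its ID and flag conflicts."""
    vocab = {}
    for s, token_id in pairs:
        # setdefault stores the first ID seen; a different later ID is an error.
        if vocab.setdefault(s, token_id) != token_id:
            raise ValueError(f"{s!r} maps to more than one token ID")
    return vocab


reconstructed = detokenize(test1)
vocab = build_vocab(test1 + test3_excerpt)

print(repr(reconstructed))
print(vocab[" can"], vocab["can"])
```

Because the space travels with the token that follows it, joining the strings in order gives back the Test 1 input, and `' can'` and `'can'` end up as separate dictionary keys with different IDs.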