# Tokenization

Task: Convert text to numbers; interpret subword tokenization.

There are several ways to convert text to numbers. This assignment works with one popular approach: assigning numbers to parts of words (subwords).

## Setup

We'll be using the HuggingFace Transformers library, which provides a (mostly) consistent interface to many different language models. We'll focus on the OpenAI GPT-2 model, famous for OpenAI's assertion that it was "too dangerous" to release in full.

[Documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for the model and tokenizer.

The `transformers` library is pre-installed on many systems, but in case you need to install it, you can run the following cell.

In [1]:
# Run this cell only if the transformers library is not already installed
!pip install -q transformers

In [2]:
import torch
from torch import tensor

### Download and load the model

This cell downloads the model and tokenizer, and loads them into memory.

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
# We'll use this smaller version of GPT-2
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
# Alternative to add_prefix_space is to use `is_split_into_words=True`
# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)

In [5]:
token_to_id_dict = tokenizer.get_vocab()
print(f"The tokenizer has {len(token_to_id_dict)} strings in its vocabulary.")
print(f"The model has {model.num_parameters():,d} parameters.")

The tokenizer has 50257 strings in its vocabulary.
The model has 81,912,576 parameters.


In [6]:
# warning: this assumes that there are no gaps in the token ids, which happens to be true for this tokenizer.
id_to_token = [token for token, id in sorted(token_to_id_dict.items(), key=lambda x: x[1])]
print(f"The first 10 tokens are: {id_to_token[:10]}")
print(f"The last 10 tokens are: {id_to_token[-10:]}")

The first 10 tokens are: ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*']
The last 10 tokens are: ['Ġ(/', 'âĢ¦."', 'Compar', 'Ġamplification', 'ominated', 'Ġregress', 'ĠCollider', 'Ġinformants', 'Ġgazed', '<|endoftext|>']
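
The warning in the cell above matters because a sorted list only lines up with the ids when they run from 0 upward without gaps. A minimal sketch of the difference, using a made-up three-entry vocabulary (`toy_vocab` and its ids are hypothetical, not taken from the real tokenizer):

```python
# Hypothetical vocabulary with a gap: no token has id 2.
toy_vocab = {"!": 0, '"': 1, "#": 3}

# Inverting the mapping gives a lookup that still works when ids have gaps.
id_to_token_map = {token_id: token for token, token_id in toy_vocab.items()}

# The sorted-list approach stores each token at its rank, not at its id.
id_to_token_list = [t for t, _ in sorted(toy_vocab.items(), key=lambda kv: kv[1])]

print(id_to_token_map[3])   # looked up by its actual id
print(id_to_token_list[2])  # same token, but found at index 2 instead of 3
```

For the tokenizer in this notebook the two approaches agree, since its ids have no gaps.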


## Task

Consider the following phrase:

In [7]:
phrase = "I visited Muskegon"
# Another one to try later. This was a famous early example of the GPT-2 model:
# phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"

### Getting familiar with tokens

1: Use `tokenizer.tokenize` to convert the phrase into a list of tokens. (What do you think the `Ġ` means?)

In [8]:
tokens = tokenizer.tokenize(phrase)
tokens

['ĠI', 'Ġvisited', 'ĠMus', 'ke', 'gon']

2: Use `tokenizer.convert_tokens_to_string` to convert the tokens back into a string.


In [11]:
string = tokenizer.convert_tokens_to_string(tokens)
print(string)

 I visited Muskegon


3: Use `tokenizer.encode` to convert the original phrase into token ids. (*Note: this is equivalent to `tokenize` followed by `convert_tokens_to_ids`*.) Call the result `input_ids`.


In [12]:
input_ids = tokenizer.encode(phrase, add_special_tokens=True)
input_ids

[314, 8672, 2629, 365, 14520]

4: Turn `input_ids` back into a readable string. Try this two ways: (1) using `convert_ids_to_tokens` and (2) using `tokenizer.decode`.

In [24]:
# using convert_ids_to_tokens
tokens = tokenizer.convert_ids_to_tokens(input_ids)
string1 = ''.join(tokens).replace('Ġ', ' ').strip()
print(string1)

I visited Muskegon


In [17]:
# using tokenizer.decode
string2 = tokenizer.decode(input_ids)
print(string2)

 I visited Muskegon


### Applying what you learned

5: Use `model.generate(tensor([input_ids]))` to generate a completion of this phrase. (Note that we needed to add `[]`s to give a "batch" dimension to the input.) Call the result `output_ids`.


In [26]:
input_ids = tokenizer.encode(phrase)  # encode the original phrase
input_ids = torch.tensor([input_ids])  # add a batch dimension
output_ids = model.generate(input_ids)

6: Convert your `output_ids` into a readable form. (Note: it has an extra "batch" dimension, so you'll need to use `output_ids[0]`.)

In [29]:
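print(output_ids)
# To get readable text, decode the first (and only) sequence in the batch:
# print(tokenizer.decode(output_ids[0]))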

tensor([[  314,  8672,  2629,   365, 14520,    11,   290,   314,   373,  6655,
           284,  1064,   326,   262,  1748,   550,   407,   587,  1498,   284]])


Note: `generate` uses greedy decoding by default, but it's highly customizable. We'll play more with it in later exercises. For now, if you want more interesting results, try:

- Turn on `do_sample=True`. Run it a few times to see what it gives.
- Set `top_k=5`. Or 50.
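
To see why these options change the output, it can help to look at a single decoding step in isolation. The sketch below uses made-up scores for a six-token vocabulary; `greedy_pick` and `top_k_pick` are hypothetical helper names for illustration, not part of `transformers`, and `generate` has further options that this sketch ignores.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(scores):
    """Greedy decoding: always take the highest-scoring token id."""
    return max(range(len(scores)), key=lambda i: scores[i])

def top_k_pick(scores, k, rng):
    """Top-k sampling: keep the k best ids, renormalize, then sample one."""
    top_ids = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    probs = softmax([scores[i] for i in top_ids])
    return rng.choices(top_ids, weights=probs, k=1)[0]

# Hypothetical next-token scores (made-up numbers, not model output).
toy_scores = [2.0, 0.5, 1.8, -1.0, 1.7, 0.0]

rng = random.Random(0)  # fixed seed so reruns are repeatable
greedy_choice = greedy_pick(toy_scores)
sampled = [top_k_pick(toy_scores, k=3, rng=rng) for _ in range(20)]

print("greedy:", greedy_choice)
print("top-k samples:", sampled)
```

Greedy decoding returns the same id every time, while top-k sampling varies among the three best-scoring ids. That is why sampled completions differ from run to run.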

7: What is the largest possible token id for this tokenizer? What token does it correspond to?

In [32]:
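vocab_size = tokenizer.vocab_size
print("Vocabulary size:", vocab_size)
# Ids start at 0, so the largest id is one less than the vocabulary size
max_token_id = vocab_size - 1
print("Largest possible token id:", max_token_id)
print("Token corresponding to the largest possible id:", tokenizer.convert_ids_to_tokens(max_token_id))

Vocabulary size: 50257
Largest possible token id: 50256
Token corresponding to the largest possible id: <|endoftext|>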


## Analysis

Q1: Write a brief explanation of what a tokenizer does. Note that we worked with two parts of a tokenizer in this exercise (one that deals only with strings, and another that deals with numbers); make sure your explanation addresses both parts.

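A tokenizer breaks text into smaller units called tokens, which can be whole words, subwords, or single characters, and maps each token to a number. It has two parts. The string part (`tokenize` and `convert_tokens_to_string`) splits a raw string into token strings from the vocabulary and joins them back together. The numeric part (`convert_tokens_to_ids` and `convert_ids_to_tokens`) maps each token string to its integer id and back. `encode` and `decode` combine both parts, going directly between text and token ids. The ids are what the model actually takes as input and produces as output, which is why tokenization underlies tasks such as language modeling, sentiment analysis, and machine translation.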
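
The two parts can be sketched with a toy tokenizer whose vocabulary holds only the five tokens and ids seen earlier in this notebook. This is an illustration under stated assumptions: the greedy longest-match splitting below is a stand-in chosen for brevity, not the algorithm the real tokenizer uses, and all `toy_*` names are hypothetical.

```python
# Toy vocabulary: only the five tokens and ids that appear in this notebook.
TOY_VOCAB = {"ĠI": 314, "Ġvisited": 8672, "ĠMus": 2629, "ke": 365, "gon": 14520}
TOY_ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}

def toy_tokenize(text):
    """String part: split text into vocabulary strings (greedy longest match)."""
    marked = (" " + text).replace(" ", "Ġ")  # prefix a space, then mark spaces
    tokens, pos = [], 0
    while pos < len(marked):
        for end in range(len(marked), pos, -1):  # try the longest piece first
            if marked[pos:end] in TOY_VOCAB:
                tokens.append(marked[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no toy token matches at position {pos}")
    return tokens

def toy_tokens_to_ids(tokens):
    """Numeric part: look up the integer id of each token string."""
    return [TOY_VOCAB[token] for token in tokens]

def toy_decode(ids):
    """Reverse both parts: ids -> token strings -> text."""
    return "".join(TOY_ID_TO_TOKEN[i] for i in ids).replace("Ġ", " ")

toy_tokens = toy_tokenize("I visited Muskegon")
toy_ids = toy_tokens_to_ids(toy_tokens)
print(toy_tokens)
print(toy_ids)
print(repr(toy_decode(toy_ids)))
```

On this one phrase the toy version reproduces the tokens and ids shown earlier, including the leading space in the decoded string.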

Q2: What do you think the `Ġ` means? (Hint: it replaces a single well-known character.)

The `Ġ` character stands in for a space. In this tokenizer's vocabulary, a space is stored as part of the token that follows it, so `Ġvisited` represents " visited" with its leading space. A token that starts with `Ġ` therefore begins a new word, while a token without it (such as `ke` or `gon` above) continues the previous word. Because the spaces are kept inside the tokens, joining the tokens and replacing `Ġ` with a space recovers the original text.
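
Where `Ġ` comes from can be sketched with a byte-to-character table. The table below is modeled on the one GPT-2's byte-level tokenizer is generally described as using, so treat the exact construction as an assumption; `byte_to_char_table` and `show_bytes` are hypothetical names.

```python
def byte_to_char_table():
    """Map each of the 256 byte values to a distinct printable character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    # Every other byte (space, newline, control bytes, ...) is shifted to a
    # code point at 256 or above, in increasing byte order.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table

TABLE = byte_to_char_table()

def show_bytes(text):
    """Render the UTF-8 bytes of text using the table above."""
    return "".join(TABLE[b] for b in text.encode("utf-8"))

print(show_bytes(" visited"))  # the leading space (byte 32) is rendered as Ġ
print(show_bytes("…"))         # multi-byte characters become several symbols
```

Under this table the space byte lands on `Ġ`, and non-ASCII text turns into runs of accented characters like the ones near the end of the vocabulary listing.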

Q3: Suppose you add some personal flair to your writing by doubling some letters. Explain what the tokenizer we have loaded up in this notebook will do with your embellished writing.

If I double some letters, the embellished words will probably no longer match the vocabulary entries for the ordinary spellings. The tokenizer will not fail or emit an unknown token; instead, it will break each unusual word into several smaller subword pieces, much as it split "Muskegon" into `ĠMus`, `ke`, and `gon`. This happens because the vocabulary was built from frequently occurring strings, so common spellings tend to get a single token while rare spellings are assembled from shorter pieces. Each piece has its own token id, so the same sentence yields a longer sequence of ids, which the model then uses to predict the next token.
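
The effect can be sketched with a toy splitter. The vocabulary here is invented for illustration (a few multi-character pieces plus single letters as a fallback), and greedy longest-match is a simplification of what the real tokenizer does, so the exact pieces it produces for a doubled-letter word will differ.

```python
import string

# Hypothetical vocabulary: invented entries, not read from the real tokenizer.
TOY_PIECES = {"Ġvisited", "Ġvis", "it", "ed", "Ġ"} | set(string.ascii_lowercase)

def toy_split(word):
    """Greedy longest-match split of one space-prefixed lowercase word."""
    marked = "Ġ" + word
    pieces, pos = [], 0
    while pos < len(marked):
        for end in range(len(marked), pos, -1):  # try the longest piece first
            if marked[pos:end] in TOY_PIECES:
                pieces.append(marked[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no toy piece matches at position {pos}")
    return pieces

plain = toy_split("visited")
doubled = toy_split("visiteed")
print(plain, len(plain))
print(doubled, len(doubled))
```

The ordinary spelling maps to one piece, while the doubled spelling falls back to several shorter ones. Running `tokenizer.tokenize` on both spellings is the way to see what the real tokenizer does.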