# Tokenization

Tokenization is a fundamental step in natural language processing (NLP) that lets language models work with written text. It breaks the input text into individual units called tokens, which are the units a neural network actually processes. The previous lesson introduced tokens as the way to define the input for large language models (LLMs).

## Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE) is an iterative algorithm that extracts the most frequent words or subwords in a corpus. It starts from the individual characters, counts how often each pair of adjacent symbols occurs, and merges the most frequent pair into a new symbol. It is a greedy process: at each step it picks the currently most frequent pair rather than searching all possible combinations, and it repeats until the vocabulary reaches the desired size. The result is a set of words and subwords that covers the dataset with a small number of tokens.
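The merge loop can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by any particular library: the word frequencies are made up, and the function names (`count_pairs`, `merge_pair`, `learn_bpe`) are hypothetical.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} dict."""
    # Start with every word split into single characters.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Greedy step: take the most frequent pair (ties broken alphabetically).
        best = max(pairs, key=lambda p: (pairs[p], p))
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words


# Made-up word frequencies, for illustration only.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, words = learn_bpe(word_freqs, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(words)
```

After three merges, the frequent word "hug" has become a single symbol, while the rarer "pug" is still split into "p" and "ug". Real tokenizers run many thousands of such merges.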

The next step is to create the vocabulary for our model: a dictionary of the most frequent tokens extracted from the dataset by BPE (or another technique of your choosing). A dictionary (the dict type in Python) is a data structure that stores key-value pairs. In our case, the key of each entry is an index that starts from 0, and the value is a token.

Because neural networks only accept numerical inputs, we use the vocabulary as a lookup table that maps tokens to their corresponding IDs. The vocabulary must be saved so that the model's output can later be decoded from IDs back to words. This saved vocabulary is known as a pre-trained vocabulary, and it is an essential component that accompanies published pre-trained models. Without it, the model's output (the IDs) would be impossible to interpret. For smaller models like BERT, the vocabulary holds about 30K tokens, while larger models like GPT-3 use about 50K tokens.
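The lookup-table idea can be sketched with two plain dictionaries. The token list here is hypothetical, the `<unk>` fallback for unknown tokens is an assumption of this sketch (byte-level BPE tokenizers such as GPT-2's do not need one), and JSON is only one possible storage format.

```python
import json

# Hypothetical tokens, for illustration only.
tokens = ["the", "token", "izer", "text", "<unk>"]

# Pair each token with an index that starts from 0.
id_to_token = dict(enumerate(tokens))
# Reverse lookup table used when encoding: token -> index.
token_to_id = {token: idx for idx, token in id_to_token.items()}


def encode(pieces):
    """Map tokens to IDs, falling back to the <unk> ID for unknown tokens."""
    unk_id = token_to_id["<unk>"]
    return [token_to_id.get(piece, unk_id) for piece in pieces]


def decode(ids):
    """Map IDs back to tokens."""
    return [id_to_token[i] for i in ids]


ids = encode(["the", "token", "izer"])
print(ids)          # [0, 1, 2]
print(decode(ids))  # ['the', 'token', 'izer']

# The vocabulary has to survive a save/load round trip to stay usable.
saved = json.dumps(token_to_id)
restored = json.loads(saved)
print(restored == token_to_id)  # True
```

If the saved mapping were lost, the list `[0, 1, 2]` would carry no meaning, which is why the vocabulary ships together with the model.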

```python
from transformers import AutoTokenizer

# Download and load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```

```python
print(tokenizer.vocab)
```

The output is a dictionary in which each entry pairs a token with its ID. For example, the word optional is represented by the number 11902. Certain tokens are preceded by a special character, Ġ, which represents a space. The next code sample uses the tokenizer object to convert a sentence into tokens and IDs.

```python
token_ids = tokenizer.encode("This is a sample text to test the tokenizer.")

print("Tokens:   ", tokenizer.convert_ids_to_tokens(token_ids))
print("Token IDs:", token_ids)
```

```
Tokens:    ['This', 'Ġis', 'Ġa', 'Ġsample', 'Ġtext', 'Ġto', 'Ġtest', 'Ġthe', 'Ġtoken', 'izer', '.']
Token IDs: [1212, 318, 257, 6291, 2420, 284, 1332, 262, 11241, 7509, 13]
```


The .encode() method converts any given text into a numerical representation: a list of integers. To inspect the process, we can use the .convert_ids_to_tokens() method, which shows the extracted tokens. For example, the word "tokenizer" has been split into the two tokens "token" and "izer".
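To see how the Ġ marker carries the spacing, the tokens from the sample output can be joined back into the sentence by hand. This is a simplified sketch that handles only the space marker; in practice the tokenizer's own `.decode()` method should be used, since it also handles other characters.

```python
# Tokens copied from the sample output above.
tokens = ['This', 'Ġis', 'Ġa', 'Ġsample', 'Ġtext', 'Ġto', 'Ġtest',
          'Ġthe', 'Ġtoken', 'izer', '.']

# Concatenate the tokens and turn each Ġ marker back into a space.
text = "".join(tokens).replace("Ġ", " ")
print(text)  # This is a sample text to test the tokenizer.
```

Because "izer" and "." carry no Ġ marker, they attach directly to the preceding token, which restores "tokenizer." without extra spaces.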

## Tokenizer Shortcomings

Several issues with the present tokenization methods are worth mentioning.

- Uppercase/Lowercase Words: The tokenizer treats the same word differently depending on its case. For example, the word “hello” results in the single token ID 31373, while the word “HELLO” is represented by three tokens, [13909, 3069, 46], which translate to [“HE”, “LL”, “O”].
- Dealing with Numbers: You might have heard that transformers are not naturally proficient at mathematical tasks. One reason is that the tokenizer represents numbers inconsistently. For instance, the number 200 might be represented as one token, while the number 201 might be split into two tokens such as [20, 1].
- Trailing whitespace: The tokenizer folds the space before a word into the word's token. For example, the word “last” after a space can be represented by the single token “ last” instead of the two tokens [" ", "last"]. As a result, whether or not your prompt ends with a whitespace changes the probability of the next predicted word. The sample output above shows this: some tokens begin with the special character Ġ, which represents whitespace, while others do not.
- Model-specific: Even though most language models use the BPE method for tokenization, each one still trains a new tokenizer of its own. GPT-4, LLaMA, OpenAssistant, and similar models all come with separate tokenizers.
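The case and whitespace issues follow directly from how merge rules are applied. The sketch below uses a hypothetical, hand-written list of merges (not GPT-2's actual rules) and a hypothetical helper, `apply_merges`, to show why a lowercase word can collapse into one token while its uppercase form stays fragmented.

```python
def apply_merges(word, merges):
    """Split `word` into characters, then apply merge rules in learned order."""
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge rules, as if learned from lowercase text only.
merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o"), ("Ġ", "hello")]

print(apply_merges("hello", merges))   # ['hello']
print(apply_merges("HELLO", merges))   # ['H', 'E', 'L', 'L', 'O']
print(apply_merges("Ġhello", merges))  # ['Ġhello']
```

No rule mentions uppercase letters, so "HELLO" never merges, and the word with a leading space (written here as Ġ) ends up as a different token from the bare word.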