# Tokenization in Transformers

Before a transformer model such as BERT, DistilBERT, or GPT can process text, the raw text must be converted into a format the model understands. This process is called **tokenization**, and it is the first step in turning human-readable text into numerical data for the model.

Tokenization splits text into smaller units called tokens, which can be words, subwords, or characters.

Example:

- Sentence: `"Transformers are amazing!"`

- Tokens: `["Transformers", "are", "amazing", "!"]`

These tokens are then mapped to token IDs, which are integers the model uses internally:

Token IDs (illustrative values; the actual IDs depend on the model's vocabulary): `[1012, 2024, 7894, 999]`
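The two steps above can be sketched in plain Python. This is a toy illustration, not the tokenizer of any real model: the regex split rule, the helper names `simple_tokenize` and `build_vocab`, and the IDs (assigned in order of first appearance) are all assumptions made for the example.

```python
import re

def simple_tokenize(text):
    """Split text into word and punctuation tokens (toy rule, not a real model tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

sentence = "Transformers are amazing!"
tokens = simple_tokenize(sentence)
vocab = build_vocab(tokens)
token_ids = [vocab[token] for token in tokens]

print(tokens)     # ['Transformers', 'are', 'amazing', '!']
print(token_ids)  # [0, 1, 2, 3]
```

A pretrained model ships with its own fixed vocabulary, so its IDs will differ from the ones produced here.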

## Why Not Just Words?

Using only whole words as tokens is inefficient: the vocabulary would need millions of entries, and any word missing from it could only be represented by a generic "unknown" placeholder, which discards its meaning. The solution is **subword tokenization**.
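To make the unknown-word problem concrete, here is a small sketch of a word-level lookup. The vocabulary, the `[UNK]` fallback entry, and the function name `encode_words` are hypothetical and chosen only for illustration.

```python
# Hypothetical word-level vocabulary; real vocabularies are far larger.
word_vocab = {"[UNK]": 0, "the": 1, "model": 2, "is": 3, "clear": 4}

def encode_words(words, vocab):
    """Map each word to its ID, falling back to [UNK] for out-of-vocabulary words."""
    return [vocab.get(word, vocab["[UNK]"]) for word in words]

known = encode_words(["the", "model", "is", "clear"], word_vocab)
unseen = encode_words(["the", "model", "is", "unclear"], word_vocab)

print(known)   # [1, 2, 3, 4]
print(unseen)  # [1, 2, 3, 0]
```

Every out-of-vocabulary word collapses to the same ID, so in this sketch "unclear" becomes indistinguishable from any other unseen word.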

## Subword Tokenization

Subword tokenization breaks words into smaller meaningful pieces.

Example:

- Word: `"unclear"`

- Tokens: `["un", "##clear"]`

`"un"` is a common prefix indicating negation. `"##clear"` is the remainder of the word; the `##` prefix, the convention used by BERT's WordPiece tokenizer, marks it as a continuation of the previous token. Subword tokenization lets the model handle rare or unseen words and morphological variations while keeping the vocabulary small.
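One common way to apply a WordPiece-style vocabulary is greedy longest-match-first splitting. The sketch below assumes a tiny hand-written vocabulary (`toy_vocab`) and a hypothetical helper `wordpiece_tokenize`; it illustrates the idea and is not the implementation shipped with any library.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen so that "unclear" must be split.
toy_vocab = {"un", "clear", "##clear", "##ly", "##ing", "read"}

print(wordpiece_tokenize("unclear", toy_vocab))  # ['un', '##clear']
print(wordpiece_tokenize("clearly", toy_vocab))  # ['clear', '##ly']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

Whether a given word is split, and where, depends entirely on the vocabulary: a word that appears in the vocabulary as a whole is returned as a single token.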

## Token IDs

After tokenization, each token is converted to a unique integer from the model's vocabulary.

Example (using the illustrative IDs from the sentence above):

- `"are"` → 2024

- `"Transformers"` → 1012

Token IDs are what the model actually processes.

## Word Embeddings

Once token IDs are obtained, each one is mapped to a vector called an embedding. Embeddings capture semantic meaning and are high-dimensional (e.g., 768 dimensions in BERT-base).
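Conceptually, the embedding step is a table lookup: the token ID selects one row of a matrix. The sketch below uses a small table of random numbers as a stand-in for learned weights; the sizes, the `embedding_table` name, and the `embed` helper are assumptions for illustration only.

```python
import random

random.seed(0)  # fixed seed so the random table is reproducible

vocab_size = 10    # toy value, far smaller than a real vocabulary
embedding_dim = 4  # toy value, far smaller than 768

# One row of floats per token ID. Real models learn these values during training.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids, table):
    """Look up the embedding row for each token ID."""
    return [table[token_id] for token_id in token_ids]

token_ids = [3, 7, 3]
vectors = embed(token_ids, embedding_table)

print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True
```

The same token ID always selects the same row, which is why the first and third vectors are identical here.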

Example:

- Token ID: 2024 (`"are"`)

- Embedding: `[0.12, -0.34, 0.56, ..., 0.09]`

## Vocabulary Size

Each pretrained transformer has a fixed vocabulary. BERT and DistilBERT use about 30,000 tokens, while Multilingual BERT (mBERT) uses around 110,000. The vocabulary holds the most common words and word pieces, such as prefixes and suffixes, and any word outside it is split into subwords that are in the vocabulary.

## Python Example


```python
from transformers import BertTokenizer, BertModel
import torch

# Load the pretrained tokenizer and the matching model weights
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
```

```python
sentence = "I love mathematics!"
```

```python
# Split the sentence into tokens, then look up each token's ID in the vocabulary
tokens = tokenizer.tokenize(sentence)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:", tokens)
print("Token IDs:", token_ids)
```

Output:

```
Tokens: ['i', 'love', 'mathematics', '!']
Token IDs: [1045, 2293, 5597, 999]
```
