# Tokenization

- Tokenization is the step of breaking a string down into the atomic units used by the model.
- There are several tokenization strategies.
- The optimal splitting of words into subunits is usually learned from the corpus.

## Character Tokenization

- Simplest tokenization scheme, feeding each character individually to the model

In [1]:
text = 'Tokenizing text is a core taks of NLP'
tokenized_text = list(text)
print(tokenized_text)

['T', 'o', 'k', 'e', 'n', 'i', 'z', 'i', 'n', 'g', ' ', 't', 'e', 'x', 't', ' ', 'i', 's', ' ', 'a', ' ', 'c', 'o', 'r', 'e', ' ', 't', 'a', 'k', 's', ' ', 'o', 'f', ' ', 'N', 'L', 'P']


Models expect each character to be converted to an integer; this process is called *numericalization*.

- One simple way to do this is to encode each unique token with a unique integer.

In [2]:
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
print(token2idx)

{' ': 0, 'L': 1, 'N': 2, 'P': 3, 'T': 4, 'a': 5, 'c': 6, 'e': 7, 'f': 8, 'g': 9, 'i': 10, 'k': 11, 'n': 12, 'o': 13, 'r': 14, 's': 15, 't': 16, 'x': 17, 'z': 18}


In [3]:
# Transform tokenized_text according to the mapping above

input_ids = [token2idx[token] for token in tokenized_text]
print(input_ids)

[4, 13, 11, 7, 12, 10, 18, 10, 12, 9, 0, 16, 7, 17, 16, 0, 10, 15, 0, 5, 0, 6, 13, 14, 7, 0, 16, 5, 11, 15, 0, 13, 8, 0, 2, 1, 3]
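
Numericalization is reversible as long as every token has its own integer. The sketch below rebuilds the mapping from the cells above and inverts it to decode the ids back into the original string. The name `idx2token` is an assumption made for illustration; it does not appear in the notebook.

```python
# Rebuild the character-level mapping used above.
text = 'Tokenizing text is a core taks of NLP'
tokenized_text = list(text)
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
input_ids = [token2idx[token] for token in tokenized_text]

# idx2token is a hypothetical helper name: the inverse of token2idx.
idx2token = {idx: ch for ch, idx in token2idx.items()}
decoded_text = ''.join(idx2token[idx] for idx in input_ids)

print(decoded_text == text)  # True when the round trip is lossless
```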


The last step is to convert `input_ids` to a 2D tensor of one-hot vectors.

- **One-hot** vectors are frequently used in ML to encode categorical data, which can be either ordinal or nominal.

In [5]:
import pandas as pd

# Example
categorical_df = pd.DataFrame({'Name': ['Bumblebee', 'Optimus Prime', 'Megatron'], 'Label ID': [0, 1, 2]})
categorical_df

|   | Name | Label ID |
|---|---|---|
| 0 | Bumblebee | 0 |
| 1 | Optimus Prime | 1 |
| 2 | Megatron | 2 |


The problem with this approach is that it creates a fictitious ordering between the names, and models can learn this kind of relationship.

- Hence, we create a new column for each category and assign 1 where the category is true and 0 otherwise.

In [6]:
# In pandas, we can use get_dummies()
pd.get_dummies(categorical_df['Name'])

|   | Bumblebee | Megatron | Optimus Prime |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 |


In [10]:
# In PyTorch
import torch
import torch.nn.functional as F

input_ids = torch.tensor(input_ids) # converts to a tensor
one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2idx))
one_hot_encodings.shape

torch.Size([37, 19])

In [11]:
# verifying the idea
print(f'Token: {tokenized_text[0]}')
print(f'Tensor index: {input_ids[0]}')
print(f'One-hot: {one_hot_encodings[0]}')

Token: T
Tensor index: 4
One-hot: tensor([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
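
To make the shape `[37, 19]` less mysterious, here is a minimal pure-Python sketch of the same idea: one row per token, one column per vocabulary entry, and a single 1 at the position given by the token id. The function `one_hot` below is a hypothetical stand-in written for illustration; it is not PyTorch's implementation.

```python
def one_hot(ids, num_classes):
    """Return one row per id, with a 1 at the id's position and 0 elsewhere."""
    rows = []
    for idx in ids:
        if not 0 <= idx < num_classes:
            raise ValueError(f'id {idx} is outside [0, {num_classes})')
        row = [0] * num_classes
        row[idx] = 1
        rows.append(row)
    return rows

sample_ids = [4, 13, 11]  # first three ids from above: 'T', 'o', 'k'
encodings = one_hot(sample_ids, num_classes=19)

print(len(encodings), len(encodings[0]))  # rows, columns
print(encodings[0])
```

Because each row contains exactly one 1, the position of that 1 recovers the original id, so no information is lost by the encoding.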


Character-level tokenization ignores any structure in the text and treats the whole string as a stream of characters.
- This helps to deal with misspellings and rare words.
- Drawback: it requires significant compute, as linguistic structures (such as words) need to be learned from the data.
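
The trade-off can be illustrated with the example sentence. In this rough sketch, whitespace splitting stands in for a word-level tokenizer, and `new_word` is an arbitrary word chosen for illustration. A character vocabulary can still represent a word it has never seen, while a word vocabulary cannot; in exchange, the character sequence is several times longer.

```python
text = 'Tokenizing text is a core taks of NLP'

# Vocabularies built from the example sentence only.
char_vocab = set(text)
word_vocab = set(text.split())

# A correctly spelled word that never appears in the sentence above.
new_word = 'task'

char_covered = all(ch in char_vocab for ch in new_word)
word_covered = new_word in word_vocab

print(f'character vocabulary covers {new_word!r}: {char_covered}')
print(f'word vocabulary covers {new_word!r}: {word_covered}')

# The price: the model has to process many more tokens per sentence.
num_chars = len(list(text))
num_words = len(text.split())
print(num_chars, 'characters vs', num_words, 'words')
```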

Instead, some structure of the text should be preserved during the tokenization step.
- Hence, word tokenization is used most of the time.

## Word Tokenization

- Split the text into words and map each word to an integer.

In [12]:
# using whitespace to tokenize the text
tokenized_text = text.split()
print(tokenized_text)

['Tokenizing', 'text', 'is', 'a', 'core', 'taks', 'of', 'NLP']
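
The cell above only performs the split. Mapping each word to an integer can follow the same pattern as in the character section. The name `word2idx` is an assumption made for illustration and mirrors `token2idx`.

```python
text = 'Tokenizing text is a core taks of NLP'
tokenized_text = text.split()

# word2idx is a hypothetical name; it mirrors token2idx from the character section.
# sorted() places capitalized words before lowercase ones.
word2idx = {word: idx for idx, word in enumerate(sorted(set(tokenized_text)))}
input_ids = [word2idx[word] for word in tokenized_text]

print(word2idx)
print(input_ids)
```

The sequence shrinks from 37 ids to 8, but the vocabulary now needs one entry per distinct word rather than one per distinct character.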


Some word tokenizers have extra rules for punctuation.
- One can also apply stemming or lemmatization, which normalizes words to their stem at the expense of losing some information in the text.
- Example: `greater` and `great` both become `great`.
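
Both ideas can be sketched with the standard library alone. The functions `tokenize_with_punctuation` and `naive_stem` below are hypothetical helpers written for this illustration; the suffix list is an arbitrary assumption, and real stemmers and lemmatizers apply far richer rules.

```python
import re

def tokenize_with_punctuation(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def naive_stem(word, suffixes=('er', 'est', 'ing', 's')):
    """Toy suffix stripper, for illustration only."""
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

tokens = tokenize_with_punctuation('A greater task, a great task.')
stems = [naive_stem(token.lower()) for token in tokens]

print(tokens)
print(stems)
```

After stemming, `greater` and `great` share one vocabulary entry, which is the information loss mentioned above: the model can no longer tell the two forms apart.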

## Subword Tokenization