# Tokenization

![](../figs/deep_nlp/tokenization/entelecheia_puzzle_pieces.png)

## Overview

NLP systems have three main components that help machines understand natural language:

- **Tokenization**: Splitting a string into a list of tokens.
- **Embedding**: Mapping tokens to vectors.
- **Model**: A neural network that takes token vectors as input and outputs predictions.

Tokenization is the first step in the NLP pipeline. 

- Tokenization is the process of splitting a string into a list of tokens. 
- For example, the sentence "I like to eat apples" can be tokenized into the list of tokens `["I", "like", "to", "eat", "apples"]`. 
- The tokens can be words, characters, or subwords.

> In deep learning, tokenization is the process of converting a sequence of characters into a sequence of tokens, then converting each token into a numerical vector to be used as input to a neural network.

## What is Tokenization?

- Tokenization is the process of representing a text in smaller units called tokens.
- In the simplest case, we can map every word in the text to a numerical index.
- For example, the sentence "I like to eat apples" can be tokenized into the list of tokens:
  > `["I", "like", "to", "eat", "apples"]`. 
- Then, each token can be mapped to a unique index, such as:
  > `{"I": 0, "like": 1, "to": 2, "eat": 3, "apples": 4}`.
- There are more linguistic features to consider when tokenizing a text, such as punctuation, capitalization, and so on.
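
The token-to-index mapping above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the helper name `build_vocab` is made up for this example, and indices are assigned in order of first appearance.

```python
def build_vocab(tokens):
    """Assign each distinct token an index in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

sentence = "I like to eat apples"
tokens = sentence.split()  # split on white space, the simplest case
vocab = build_vocab(tokens)
ids = [vocab[token] for token in tokens]

print(tokens)  # ['I', 'like', 'to', 'eat', 'apples']
print(vocab)   # {'I': 0, 'like': 1, 'to': 2, 'eat': 3, 'apples': 4}
print(ids)     # [0, 1, 2, 3, 4]
```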

## Why do we need tokenization?

- "How can we make a machine read a sentence?"
- Machines don't know any language, nor do they understand sounds or phonetics.
- They need to be taught from scratch.
- The first step is to break down the sentence into smaller units that the machine can process.
- Tokenization determines how the input is represented to the model.
- This decision has a huge impact on the performance of the model.

## Tokenization Methods

- Word-level tokenization: Split a sentence into words.
- Character-level tokenization: Split a sentence into characters.
- Subword-level tokenization: Split a sentence into subwords.

## Word (White Space) Tokenization

- The simplest tokenization method is to split a sentence into words.
- This is also called white space tokenization.
- The sentence "I like to eat apples" can be tokenized into the list of tokens:
  > `["I", "like", "to", "eat", "apples"]`.
- This method is very fast and easy to implement.
- However, it has some limitations.
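
As a minimal sketch, white space tokenization is a single call to Python's built-in `str.split`. The sample text is made up for illustration; note how punctuation stays attached to the neighboring word and how capitalization produces distinct tokens.

```python
text = "I like to eat apples. Apples are great!"

# str.split() with no arguments splits on any run of white space.
tokens = text.split()

print(tokens)
# ['I', 'like', 'to', 'eat', 'apples.', 'Apples', 'are', 'great!']
```

Here `apples.` and `Apples` are two different tokens, and neither matches a plain `apples`.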

### Problems with Word tokenizer

- Out-of-vocabulary (OOV) words: 
  - Words that are not in the vocabulary cannot be represented.
  - The model will not recognize variants of words that were not in the training set.
  - For example, even though the words `pine` and `apple` exist in the training set, the model will not recognize the word `pineapple`.
- Punctuation and abbreviations: 
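  - Splitting on white space alone leaves punctuation attached to words, so `apples` and `apples.` become different tokens.
  - Rule-based word tokenizers add special cases for punctuation and contractions, which can produce unintuitive splits. For example, a Penn Treebank-style tokenizer splits `don't` into `["do", "n't"]`.
- Slang and informal language: 
  - Such rules handle slang and informal language poorly.
  - For example, a Penn Treebank-style tokenizer splits `gonna` into `["gon", "na"]`.
  - A tokenizer that splits on punctuation turns `tl;dr` into `["tl", ";", "dr"]`.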
- What if a language does not use spaces to separate words?
  - Chinese and Japanese do not put spaces between words, and in Korean a space-delimited unit often contains several morphemes.
  - White space tokenization does not work well for these languages.
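
The out-of-vocabulary problem can be illustrated with a small sketch. The training sentence and the `<unk>` placeholder name are assumptions made for this example; real systems choose their own unknown-token convention.

```python
UNK = "<unk>"  # hypothetical placeholder for any word not seen in training

# Build a word-level vocabulary from a toy training sentence.
train_tokens = "I like pine trees and apple pie".split()
vocab = {UNK: 0}
for token in train_tokens:
    vocab.setdefault(token, len(vocab))

def encode(text):
    """Map each white-space token to its index, or to the UNK index."""
    return [vocab.get(token, vocab[UNK]) for token in text.split()]

print(encode("I like apple pie"))  # [1, 2, 6, 7]
print(encode("I like pineapple"))  # [1, 2, 0]
```

Although `pine` and `apple` are both in the vocabulary, `pineapple` collapses to the unknown index `0`, so the model receives no information about it.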

## Character Tokenization

- To solve the problems of word tokenization, we can split a sentence into characters.
- The sentence "I like to eat apples" can be tokenized into the list of tokens:
  > `["I", " ", "l", "i", "k", "e", " ", "t", "o", " ", "e", "a", "t", " ", "a", "p", "p", "l", "e", "s"]`.
- However, this method has its own problems.
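
A minimal sketch of character tokenization, compared with white space tokenization on the same sentence. It only shows the trade-off in sequence length and vocabulary size; nothing here is specific to any library.

```python
sentence = "I like to eat apples"

word_tokens = sentence.split()
char_tokens = list(sentence)  # every character, including spaces, is a token

print(len(word_tokens))       # 5
print(len(char_tokens))       # 20
print(len(set(char_tokens)))  # 11 distinct characters
```

The vocabulary is tiny and no character is ever out of vocabulary, but the sequence is four times longer than the word-level one.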

### Problems with Character tokenizer

- The number of tokens per text is very large.
  - This requires more computation and memory.
- It limits the application of the model.
  - Only certain types of models can be used.
  - It is inefficient for certain types of applications, such as named entity recognition (NER).
- It is difficult to learn the relationship between the tokens.
  - For example, the individual tokens `["a", "p", "p", "l", "e"]` carry no meaning on their own; the model has to learn that together they form the word `apple`.
  - Likewise, the model has to learn that the tokens `["a", "p", "p", "l", "e"]` are related to the tokens `["a", "p", "p", "l", "e", "s"]`.
- The model could generate incorrectly spelled words.

## Subword Tokenization

- With character-level tokenization, we risk losing the semantic features of the words. 
- With word-level tokenization, we have out-of-vocabulary (OOV) words or very large vocabulary sizes.
- To solve the problems of word tokenization and character tokenization, an algorithm should be able to:
  - Retain the semantic features of the words.
  - Tokenize any words without the need for a huge vocabulary.
- Subword tokenization is a method that can solve these problems.
- For example, the sentence "I like to eat pineapples" can be tokenized into the list of tokens:
  > `["I", "like", "to", "eat", "pine", "##app", "##les"]`.
- The model learns a limited vocabulary of subwords that can be combined to represent any word.
- This solves the problem of OOV words.
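
One way to apply a subword vocabulary is greedy longest-match-first splitting, with `##` marking pieces that continue a word, in the style of the example above. The sketch below assumes a hand-picked toy vocabulary chosen to reproduce that example; the function name `subword_tokenize` and the `[UNK]` fallback are illustrative choices, and real vocabularies are learned from data.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into known subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk]  # no known piece fits at this position
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, not learned from any corpus.
vocab = {"I", "like", "to", "eat", "pine", "##app", "##les", "apple", "##s"}

tokens = []
for word in "I like to eat pineapples".split():
    tokens.extend(subword_tokenize(word, vocab))

print(tokens)
# ['I', 'like', 'to', 'eat', 'pine', '##app', '##les']
print(subword_tokenize("apples", vocab))
# ['apple', '##s']
```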

### How to decide which subwords to use?

- Several algorithms and toolkits can be used to decide which subwords to use:
  - Byte Pair Encoding (BPE)
  - Unigram Language Model
  - Subword Sampling
  - Byte-level BPE
  - WordPiece
  - SentencePiece

## Byte Pair Encoding (BPE)

Sennrich et al. (2016) proposed a method called Byte Pair Encoding (BPE) to learn subword units. 
{cite}`sennrich-etal-2016-neural`

- The Byte Pair Encoding algorithm was originally developed for data compression.
- Applied to tokenization, it splits words into sequences of characters and iteratively merges the most frequent pair of adjacent symbols.

### Token Learning from Dataset

- Count the frequency of each word in the corpus.
- For each word, append a special stop token `</w>` at the end of the word.
- Then, split the word into characters.
- Initially, the tokens of the word are all of its characters plus the additional `</w>` token.
- For example, the tokens for the word `low` are [`l`, `o`, `w`, `</w>`], in that order.
- After counting all the words in the dataset, we get a vocabulary of tokenized words with their corresponding counts:

```
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
```
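
A minimal sketch of how such a vocabulary could be built. The toy corpus is an assumption, constructed only so that its word counts match the example above.

```python
from collections import Counter

# Hypothetical toy corpus whose word counts match the example above.
corpus = "low " * 5 + "lower " * 2 + "newest " * 6 + "widest " * 3

# Count each word, then spell it out as space-separated characters plus </w>.
word_counts = Counter(corpus.split())
vocab = {" ".join(word) + " </w>": count for word, count in word_counts.items()}

print(vocab)
# {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
```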

- In each iteration, count the frequency of each pair of consecutive tokens, find the most frequent pair, and merge its two tokens into one new token.

- For the above example, in the first iteration, the pair `e` and `s` occurs 6 + 3 = 9 times, which is the highest count (tied with the pairs `s`, `t` and `t`, `</w>`; here the tie is broken in favor of `e`, `s`), so the two are merged into a new token `es`.
- Note that the token `s` also disappears in this particular example, because every `s` was preceded by `e`.

```
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}
```
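
The counting and merging in this first iteration can be sketched as follows. The helper names `count_pairs` and `merge_pair` are made up for this example, and ties are broken by taking the first pair encountered, which is one possible convention rather than part of the algorithm's definition.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_vocab[" ".join(out)] = freq
    return merged_vocab

vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

pairs = count_pairs(vocab)
print(pairs[("e", "s")])  # 9

# max() returns the first pair that reaches the top count (assumed tie-break).
best = max(pairs, key=pairs.get)
print(best)  # ('e', 's')

print(merge_pair(best, vocab))
# {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}
```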

- In the second iteration, the pair `es` and `t` occurs 6 + 3 = 9 times, which is again the highest count (tied with the pair `t`, `</w>`).
- Merge these into a new token `est`.
- Note that the tokens `es` and `t` also disappear.

```
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}
```

- In the third iteration, the pair `est` and `</w>` is the most frequent, and so on.
- Repeat until we have the desired number of tokens or reach the maximum number of iterations.

The stop token `</w>` is also important.

- Without `</w>`, a token `st` could appear at the start of a word, as in `st ar`, or at the end of a word, as in `wide st`.
- These two uses are very different in meaning, but the token `st` is the same.
- With `</w>`, a token `st</w>` tells the model immediately that it comes from the end of a word, as in `wide st</w>`, and not from `st ar</w>`.

To summarize, the algorithm is as follows:

1. Extract the words from the given dataset along with their counts.
2. Define the target vocabulary size.
3. Split the words into character sequences.
4. Add all the unique characters in the character sequences to the vocabulary.
5. Select the most frequent symbol pair, merge it, and add the merged symbol to the vocabulary.
6. Repeat step 5 until the vocabulary size is reached.
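
The six steps can be put together in a short training loop. This is a sketch under stated assumptions, not a reference implementation: the function name `learn_bpe` is made up, the vocabulary is counted as the initial characters plus one new symbol per merge (symbols that no longer occur are not removed), and ties are broken by taking the first pair encountered.

```python
from collections import Counter

def learn_bpe(word_counts, vocab_size):
    """Learn BPE merges until the symbol vocabulary reaches `vocab_size`."""
    # Steps 1 and 3: each word becomes a tuple of characters plus </w>.
    words = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    # Step 4: the initial vocabulary is every unique symbol.
    symbols = {s for word in words for s in word}
    merges = []
    # Steps 5 and 6: merge the most frequent pair until the size is reached.
    while len(symbols) < vocab_size:
        pairs = Counter()
        for word, count in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every word is a single symbol; nothing left to merge
        best = max(pairs, key=pairs.get)  # first pair with the top count
        merged = best[0] + best[1]
        new_words = {}
        for word, count in words.items():
            out, i = [], 0
            while i < len(word):
                if word[i:i + 2] == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] = count
        words = new_words
        symbols.add(merged)
        merges.append(best)
    return merges, words

# Word counts from the example above; 11 initial symbols, so 14 allows 3 merges.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = learn_bpe(word_counts, vocab_size=14)

print(merges)
# [('e', 's'), ('es', 't'), ('est', '</w>')]
print([" ".join(word) for word in words])
# ['l o w </w>', 'l o w e r </w>', 'n e w est</w>', 'w i d est</w>']
```

The learned merge list is what a tokenizer would later replay, in order, to segment new words.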

## References

- [NLP Tokenization](https://medium.com/nerd-for-tech/nlp-tokenization-2fdec7536d17)