# Unigram Tokenization

Unlike BPE (which is a frequency-based model that grows its vocabulary by merging pairs), Unigram is a probability-based model that works in the opposite direction: we start with a large vocabulary and gradually shrink it until it reaches the desired size. How do we get the base vocabulary? There are multiple ways to do so. Here are a couple of them:

- Apply BPE on the initial corpus
- Consider the most common substrings in pre-tokenized words
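The second option can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the helper name `seed_vocabulary`, the toy word list, and the `max_piece_length` and `top_k` values are all made up here, and real trainers usually score candidates more carefully than by raw count.

```python
from collections import Counter

def seed_vocabulary(words, max_piece_length=4, top_k=10):
    """Count every substring of up to `max_piece_length` characters and keep
    all single characters plus the `top_k` most frequent longer substrings."""
    counts = Counter()
    for word in words:
        for start in range(len(word)):
            longest_end = min(start + max_piece_length, len(word))
            for end in range(start + 1, longest_end + 1):
                counts[word[start:end]] += 1

    # Single characters are always kept so that every word stays tokenizable
    chars = {piece for piece in counts if len(piece) == 1}
    longer = [piece for piece, _ in counts.most_common() if len(piece) > 1][:top_k]
    return chars | set(longer), counts

# Hypothetical pre-tokenized words, used only for this illustration
words = "hug pug pun bun hugs".split()
vocab, counts = seed_vocabulary(words)

print(sorted(vocab))
print(counts.most_common(3))
```

The resulting set is deliberately larger than the final vocabulary; training then prunes it down.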

We compute a loss function over the corpus at each iteration of training. For every token, we measure how much the loss would increase if that token were removed; the tokens that cause the smallest increase are considered 'least important' and are removed. To keep training computationally affordable, we don't remove just a single token per iteration, but a fraction *'p'* of the tokens (*p* being a hyperparameter: the share of tokens with the lowest increase in loss that is dropped at each step).
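To make the pruning step concrete, here is a minimal sketch on a toy corpus. It assumes the loss is the negative log-likelihood of each word's best segmentation, and it skips the probability re-estimation that a real trainer performs between pruning steps. The function names, the toy counts, and the value of `p` are all invented for illustration.

```python
import math

def best_segmentation(word, logp):
    """Viterbi search: return (tokens, log-probability) of the likeliest split."""
    # best[i] = (score of the best split of word[:i], start index of its last token)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start][0] + logp[piece] > best[end][0]:
                best[end] = (best[start][0] + logp[piece], start)
    tokens, end = [], len(word)
    while end > 0:  # walk the back-pointers from the end of the word
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], best[len(word)][0]

def corpus_loss(word_freqs, probs):
    """Negative log-likelihood of the corpus under the current vocabulary."""
    logp = {tok: math.log(p) for tok, p in probs.items()}
    return -sum(freq * best_segmentation(word, logp)[1]
                for word, freq in word_freqs.items())

def prune(word_freqs, probs, p=0.2):
    """Drop the fraction `p` of multi-character tokens whose removal hurts least."""
    base = corpus_loss(word_freqs, probs)
    increases = {}
    for tok in probs:
        if len(tok) == 1:  # never drop single characters
            continue
        reduced = {t: q for t, q in probs.items() if t != tok}
        increases[tok] = corpus_loss(word_freqs, reduced) - base
    n_drop = max(1, int(len(increases) * p))
    drop = set(sorted(increases, key=increases.get)[:n_drop])
    return {t: q for t, q in probs.items() if t not in drop}, increases

# Toy word frequencies and substring counts (hypothetical data)
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
counts = {"h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17,
          "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5}
total = sum(counts.values())
probs = {tok: c / total for tok, c in counts.items()}

pruned, increases = prune(word_freqs, probs)
print(best_segmentation("hugs", {t: math.log(q) for t, q in probs.items()}))
print(sorted(increases.items(), key=lambda kv: kv[1]))
```

In this toy setup, removing `"pu"` costs nothing because `"pug"` and `"pun"` have equally likely alternative splits, whereas removing `"hug"` raises the loss, so `"hug"` survives the pruning step.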

`Why the name Unigram?` The name comes from the unigram language model, which treats each token as independent of the tokens that come before it.
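Written out, this independence assumption means the probability of a tokenized sequence is simply the product of the probabilities of its tokens. The notation is introduced here for illustration: $x_1, \dots, x_n$ are the tokens of one segmentation and $p(x_i)$ is the probability assigned to token $x_i$.

$$
P(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i)
$$

A bigram model, by contrast, would condition each factor on the previous token, i.e. $p(x_i \mid x_{i-1})$.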

![](./../assets/tokenization/wordpiece.jpg)

We will be implementing Unigram tokenization using HuggingFace's `tokenizers` library, as was done for the WordPiece tokenizer. Let's get started by installing the library (if it is not installed already).

`pip3 install tokenizers`

`NOTE:` Unigram tokenization can also be implemented using Google's SentencePiece library, as was done for Byte-Pair Encoding. You may refer to `bpe.ipynb` for a similar implementation.

In [1]:
# Add the parent (i.e. root) directory to the import path so that '.py' files located there can be imported
import sys
import os

root_dir = os.path.dirname(os.getcwd())
if root_dir not in sys.path:
    sys.path.append(root_dir)

In [2]:
import config
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Whitespace

### Prepare the Tokenizer

In [3]:
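# Token emitted for any piece of text the vocabulary cannot represent
unk_token = "<UNK>"
# Special tokens reserved in the vocabulary (the unknown token is one of them)
spl_tokens = ["<UNK>", "<SEP>", "<MASK>", "<CLS>"]

# Start from an empty Unigram model; its vocabulary is learned during training
tokenizer = Tokenizer(Unigram())
trainer = UnigramTrainer(unk_token=unk_token, special_tokens=spl_tokens)

# Split raw text into words and punctuation before the model sees it
tokenizer.pre_tokenizer = Whitespace()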
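The trainer above runs with the library's default settings. As a sketch of how the shrinking process described earlier could be tuned, the variant below spells out a few `UnigramTrainer` arguments. The values are illustrative and are not the ones used in this notebook, and the name `tuned_trainer` is made up here. As far as we understand the library, `shrinking_factor` is the share of tokens *kept* at each pruning step, so it plays the role of 1 - *p*; check the library documentation before relying on this.

```python
from tokenizers.trainers import UnigramTrainer

# Illustrative values only (assumed, not the settings used in this notebook)
tuned_trainer = UnigramTrainer(
    vocab_size=8000,        # target size of the final vocabulary
    shrinking_factor=0.75,  # share of tokens kept at each pruning step
    n_sub_iterations=2,     # EM iterations run before each pruning step
    max_piece_length=16,    # longest token (in characters) to consider
    unk_token="<UNK>",
    special_tokens=["<UNK>", "<SEP>", "<MASK>", "<CLS>"],
)
```

Passing `tuned_trainer` instead of `trainer` to `tokenizer.train(...)` would apply these settings.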

### Train the Tokenizer

In [4]:
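# Corpus file(s) to train on, and the path where the trained tokenizer is stored
data_files = ['./../{}'.format(config.DATA_PATH)]
model_path = './../{}/unigram.json'.format(config.MODEL_PATH)

# Learn the vocabulary, then serialize the whole tokenizer to a single JSON file
tokenizer.train(data_files, trainer)
tokenizer.save(model_path)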





### Tokenize Input String

In [5]:
tokenizer = Tokenizer.from_file(model_path)

In [6]:
text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks.🙂😍"

In [7]:
output = tokenizer.encode(text)

print(output.ids)
print(output.tokens)

[207, 457, 21, 1854, 19, 5, 1580, 0, 324, 6, 449, 449, 6, 90, 1226, 713, 12, 42, 372, 18, 202, 6, 1738, 5, 6, 0]
['G', 'ood', 'm', 'uff', 'in', 's', 'cost', '$', '3', '.', '8', '8', '.', 'P', 'lease', 'bu', 'y', 'me', 'two', 'of', 'them', '.', 'Thank', 's', '.', '🙂😍']
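In the output above, two tokens share the id `0`. Assuming id `0` belongs to `<UNK>` (it is listed first among the special tokens, which is consistent with what is printed), these are the pieces the learned vocabulary could not represent. A small sketch that picks them out, working directly on the lists printed above:

```python
# Output copied from the cell above
ids = [207, 457, 21, 1854, 19, 5, 1580, 0, 324, 6, 449, 449, 6, 90, 1226,
       713, 12, 42, 372, 18, 202, 6, 1738, 5, 6, 0]
tokens = ['G', 'ood', 'm', 'uff', 'in', 's', 'cost', '$', '3', '.', '8', '8',
          '.', 'P', 'lease', 'bu', 'y', 'me', 'two', 'of', 'them', '.',
          'Thank', 's', '.', '🙂😍']

# Assumption: "<UNK>" received id 0 because it is the first special token
UNK_ID = 0

# Collect the surface strings that were mapped to the unknown token
unknown = [tok for tok_id, tok in zip(ids, tokens) if tok_id == UNK_ID]
coverage = 1 - len(unknown) / len(tokens)

print(unknown)
print(f"{coverage:.1%} of the tokens are in the vocabulary")
```

Here the unknown pieces are `'$'` and the emoji pair, which suggests that these characters did not make it into the trained vocabulary.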
